Scheduling EMR ETL and sql-runner

#1

It seems that tools like Factotum can help with scheduling the EMR ETL runner and sql-runner.

However, the problem is that the EMR ETL runner is asynchronous: it fires, and the EMR cluster starts up and does its thing.

How can I schedule the sql-runner to run after the EMR cluster completes its work?

#2

Hi @trung,

I have been facing the same problem in recent weeks. As my implementation is AWS-based, I use AWS Step Functions (specifically a state machine) for scheduling. In my case both jobs (EMR ETL runner and SQL runner) run in containers (one classic, one Fargate). In both cases the main process returns an exit code reflecting its actual state (0 if everything is OK). Based on this exit code, I either run the following steps or send error notifications to myself and other interested parties.
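A minimal sketch of what such a state machine could look like, written as a Python dict in Amazon States Language form. This is not the actual setup described above: every ARN, state name, and task definition name is a placeholder, the choice of which job runs on EC2 and which on Fargate is arbitrary, and a Fargate task would also need a `NetworkConfiguration` block, omitted here for brevity.

```python
import json

# Placeholder identifiers (assumptions for illustration only).
CLUSTER_ARN = "arn:aws:ecs:REGION:ACCOUNT_ID:cluster/etl-cluster"
EMR_ETL_TASK_DEF = "emr-etl-runner-task"
SQL_RUNNER_TASK_DEF = "sql-runner-task"
ALERT_TOPIC_ARN = "arn:aws:sns:REGION:ACCOUNT_ID:etl-alerts"


def ecs_task_state(task_definition, launch_type, next_state=None):
    """Build a Task state that runs an ECS task and waits for it to stop.

    The ".sync" suffix asks Step Functions to wait for the task to finish.
    Any error (including a failed task) is routed to NotifyFailure.
    """
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "Cluster": CLUSTER_ARN,
            "TaskDefinition": task_definition,
            "LaunchType": launch_type,
        },
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Run the EMR ETL job, then the SQL runner; notify on failure.",
    "StartAt": "RunEmrEtl",
    "States": {
        # The SQL runner only starts once the ETL task has succeeded.
        "RunEmrEtl": ecs_task_state(EMR_ETL_TASK_DEF, "EC2", "RunSqlRunner"),
        "RunSqlRunner": ecs_task_state(SQL_RUNNER_TASK_DEF, "FARGATE"),
        # Publish an alert to SNS, then mark the execution as failed.
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": ALERT_TOPIC_ARN,
                "Message": "ETL pipeline step failed",
            },
            "Next": "Failed",
        },
        "Failed": {"Type": "Fail"},
    },
}

if __name__ == "__main__":
    # Print the JSON that would be supplied as the state machine definition.
    print(json.dumps(definition, indent=2))
```

The key point is the ordering: the second task is only reachable through the `Next` of the first, so it cannot start until the first container has exited successfully.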

Additionally, I have a tiny Lambda function that helps me fire some SQL on a weekly and monthly basis.

If you need more details, feel free to drop me a line.

Cheers,
GE

#3

How can I schedule the sql-runner to run after the EMR cluster completes its work?

You could use a single DAG for both and have the SQL-runner job as the second step which depends on the first.
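To illustrate the idea (this is not Factotum's own format, just a rough Python sketch of a two-step DAG), the dependency can be expressed with the standard-library `graphlib` module. The two wrapper scripts are hypothetical, and the sketch assumes the first command blocks until the EMR job has finished and exits non-zero on failure.

```python
import shlex
import subprocess
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical wrapper scripts; substitute the real invocations.
TASKS = {
    "emr-etl-runner": "./run_emr_etl.sh",
    "sql-runner": "./run_sql_runner.sh",
}

# Each task maps to the set of tasks that must succeed before it runs.
DEPENDS_ON = {
    "emr-etl-runner": set(),
    "sql-runner": {"emr-etl-runner"},
}


def run_command(command):
    """Run a command, wait for it to finish, and return its exit code."""
    return subprocess.run(shlex.split(command)).returncode


def run_dag(tasks, depends_on, runner=run_command):
    """Run tasks in dependency order and stop at the first failure.

    Returns the names of the tasks that completed successfully. Stopping
    at the first non-zero exit code is a conservative choice: it also
    skips later tasks that did not depend on the failed one.
    """
    completed = []
    for name in TopologicalSorter(depends_on).static_order():
        if runner(tasks[name]) != 0:
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    done = run_dag(TASKS, DEPENDS_ON)
    print("Completed:", ", ".join(done) or "nothing")
```

If the first step returns before the cluster has finished, the dependency alone is not enough; the first step would need to wait for (or poll) the cluster before exiting.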