Cron Job for emr-etl and snowflake data

I’m not really familiar with cron jobs, but I’d like to schedule the following:

  • run the snowplow-emr-etl-runner
  • run the Snowflake loader (today I run it by hand as command-line data flow tasks)
  • run a series of SQL scripts against Snowflake.

I’m wondering if I should create a shell script that gets started by the cron job. The shell script would enforce the order of the steps. Any thoughts? Any scripts that you may have would be greatly appreciated.
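A minimal sketch of such a wrapper, assuming each step is a command that blocks until its work is done and returns a non-zero exit code on failure. The function name run_in_sequence is hypothetical and not part of any Snowplow tool.

```shell
# Hypothetical helper: run each argument as a command line, in order,
# and stop at the first step that exits with a non-zero status.
run_in_sequence() {
  for step in "$@"; do
    echo "starting: $step"
    # sh -c returns only after the command has exited
    if ! sh -c "$step"; then
      echo "failed: $step" >&2
      return 1
    fi
  done
  echo "all steps finished"
}
```

A cron-started script could then call run_in_sequence with three quoted command lines: the EmrEtlRunner command, the loader command, and whatever command runs the SQL scripts. Because a failing step stops the sequence, later steps never run against incomplete data.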

@sonnypolaris, that is exactly how we manage the pipelines for our clients at the moment. To schedule and organize the steps to be executed, we also use our in-house open-source tools: Factotum (wrapping EmrEtlRunner), Dataflow Runner (wrapping the Snowflake transformer and loader and/or EmrEtlRunner), and SQL Runner (to run data models on data in Redshift, Snowflake, or BigQuery).

@ihor

How do you kick off Dataflow Runner after EmrEtlRunner completes? Is EmrEtlRunner synchronous with the EMR cluster?

My current cron job looks something like this: /home/ec2-user/snowplow/snowplow-emr-etl-runner run -c /home/ec2-user/snowplow/config.yml -r /home/ec2-user/snowplow/iglu_resolver.json

If I just append && dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json, is that going to work, or is it going to launch the second EMR cluster too soon, before the first has finished?
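As a rough illustration of what && itself guarantees: the shell starts the right-hand command only after the left-hand process has exited with status 0. The functions below are hypothetical stand-ins for the two real commands. Whether the real chain waits for the EMR job depends on EmrEtlRunner blocking until that job is finished, which this sketch assumes and does not prove.

```shell
# Stand-in for the first command: takes a moment, then succeeds.
first_stage() { sleep 1; echo "first stage done"; }

# Stand-in for the second command.
second_stage() { echo "second stage started"; }

# Stand-in for a first command that fails.
failing_stage() { echo "first stage failed" >&2; return 1; }

# && runs second_stage only after first_stage has exited with status 0.
chain_ok() { first_stage && second_stage; }

# Here second_stage is skipped because failing_stage returns non-zero.
chain_fail() { failing_stage && second_stage; }
```

Calling chain_ok prints the two messages in order, about a second apart, while chain_fail never reaches second_stage.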

@davehowell, the EMR cluster is terminated by EmrEtlRunner (if the cluster is used in transient mode). Dataflow Runner should spin up a new cluster. I do not expect any conflicts between the clusters even if Dataflow Runner requests a new EMR cluster while the one used by EmrEtlRunner has not fully terminated yet.

Thanks for your reply, that gives me more confidence. I wasn’t worried about conflicts so much as making sure it waits until all the enriched files have been written before the Snowflake transformer and Snowflake loader begin the next stage.