Scheduling EmrEtlRunner and StorageLoader


#1

The documentation is not very clear on the preferred method for scheduling daily execution of emr-etl-runner and storage-loader. Should we be using cron and snowplow-runner-and-loader.sh? There are also blog posts about using cron/make, and now a new data pipeline runner called Factotum (which unfortunately I can’t use at the moment because my ETL executables live on an Ubuntu EC2 instance since I had trouble on Linux).

Can someone recommend the most straightforward process currently? I simply need to run emr-etl-runner followed by storage-loader, although in the near future I hope to add SQL Runner to the flow as well.


#2

Hey Travis,

You are right - those are the three options that we have discussed in various places across the blog and wiki.

If you want the most straightforward process today for just EmrEtlRunner and StorageLoader, I would just go with snowplow-runner-and-loader.sh.
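For anyone who wants to see the idea behind that approach, here is a rough sketch in Python of "run EmrEtlRunner, then StorageLoader, and stop if the first one fails". This is not the contents of snowplow-runner-and-loader.sh; the function name run_pipeline, the STEPS list and all the /opt/snowplow paths are hypothetical, and the arguments should be checked against your installed versions.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, argv in steps:
        try:
            result = subprocess.run(argv)
        except OSError as exc:
            # The executable is missing or cannot be started.
            print(f"{name} could not be started: {exc}", file=sys.stderr)
            break
        if result.returncode != 0:
            print(f"{name} failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed.append(name)
    return completed


# Hypothetical install locations and config paths; adjust to your setup.
STEPS = [
    ("emr-etl-runner",
     ["/opt/snowplow/snowplow-emr-etl-runner",
      "--config", "/opt/snowplow/config.yml"]),
    ("storage-loader",
     ["/opt/snowplow/snowplow-storage-loader",
      "--config", "/opt/snowplow/config.yml"]),
]
```

A script run daily from cron could call run_pipeline(STEPS) and exit non-zero when fewer steps completed than were listed, so that a failed EmrEtlRunner run never triggers StorageLoader. Adding SQL Runner later would just mean appending a third entry to STEPS.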

You can always upgrade to Factotum in the future - from v0.2.0 there should be some neat features in Factotum which should make it easier to work with EmrEtlRunner and SQL Runner. Factotum runs fine on Ubuntu.

In any case, I would skip the Make-based approach - Factotum is a better fit for running a jobflow.


#3

Ok awesome, thanks for the quick clarification!