Dataflow Runner released


#1

We are pleased to announce the release of Dataflow Runner, a new open-source system for the creation and running of AWS EMR jobflow clusters and steps.

This release signals the first step in our journey to deconstruct EmrEtlRunner into two separate applications, Dataflow Runner and snowplowctl, per our RFC on Discourse.


#2

This is really cool stuff, Josh.

Does this mean that with the correct playbooks/API calls you could in theory have one persistent EMR cluster responsible for multiple runs? It might look something like:

  1. Bootstrap EMR cluster
  2. Run complete enrichment process
  3. Go into idle mode (optionally remove task/core nodes)
  4. Run step 2 again for the next batch

#3

@mike that’s correct! At the moment there would be no way to shut down nodes between runs - although EMR auto-scaling rules could provide the answer there…

The only caveat would be that your playbook would need to handle any cleanup required to get the cluster into a clean state ready for another enrichment process.
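
To make that concrete, here is a rough Python sketch of the command sequence such a setup might issue: bring the cluster up once, alternate an enrichment playbook with a cleanup playbook, and tear the cluster down at the end. The `up`, `run` and `down` subcommands and their `--emr-config`, `--emr-playbook` and `--emr-cluster` flags are written from memory of the Dataflow Runner README, so please check them against your version. The file names, the cluster ID and the helper functions are all hypothetical, and in practice the cluster ID would come from the output of `up` rather than being known in advance.

```python
from typing import List

BINARY = "dataflow-runner"  # assumed to be on the PATH


def up_command(cluster_config: str) -> List[str]:
    """Command that launches a new EMR cluster from a cluster config file."""
    return [BINARY, "up", "--emr-config", cluster_config]


def run_command(playbook: str, cluster_id: str) -> List[str]:
    """Command that submits a playbook's steps to an already running cluster."""
    return [BINARY, "run", "--emr-playbook", playbook, "--emr-cluster", cluster_id]


def down_command(cluster_config: str, cluster_id: str) -> List[str]:
    """Command that terminates the cluster."""
    return [BINARY, "down", "--emr-config", cluster_config, "--emr-cluster", cluster_id]


def persistent_cluster_plan(
    cluster_config: str,
    cluster_id: str,
    enrich_playbook: str,
    cleanup_playbook: str,
    runs: int,
) -> List[List[str]]:
    """Build the ordered commands for several runs on one persistent cluster.

    Each run is an enrichment playbook followed by a cleanup playbook that
    returns the cluster to a clean state before the next run.
    """
    plan = [up_command(cluster_config)]
    for _ in range(runs):
        plan.append(run_command(enrich_playbook, cluster_id))
        plan.append(run_command(cleanup_playbook, cluster_id))
    plan.append(down_command(cluster_config, cluster_id))
    return plan


if __name__ == "__main__":
    # Hypothetical file names and cluster ID, for illustration only.
    for cmd in persistent_cluster_plan(
        "cluster.json", "j-EXAMPLE", "enrich.json", "cleanup.json", runs=2
    ):
        print(" ".join(cmd))
```

The sketch only builds and prints the commands. Actually executing them (for example with `subprocess.run`) and waiting between runs is left to whatever scheduler you already use.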