Splitting EmrEtlRunner into snowplowctl and Dataflow Runner



As part of our drive to make the Snowplow community more collaborative and widen our network of open source contributors, we will be regularly posting our proposals for new features, apps and libraries under the Request for Comments section of our forum. You are welcome to post your own proposals too!

This Request for Comments is to split our current EmrEtlRunner into two applications:

  1. snowplowctl, a command-line tool for working with Snowplow, and
  2. Dataflow Runner, a new open-source application (placeholder repo is snowplow/dataflow-runner).

Background on EmrEtlRunner

Currently the Snowplow batch pipeline is orchestrated by EmrEtlRunner, a JRuby application which:

  1. Has a deep understanding of the dataflow (aka DAG of Hadoop job steps) required to run the Snowplow batch pipeline on Elastic MapReduce (EMR)
  2. Contains logic to execute that dataflow using a combination of Sluice to perform S3 file move operations, and Elasticity to bring up a transient EMR cluster and add jobflow steps to that cluster

Problems with EmrEtlRunner

The fundamental problem with EmrEtlRunner is that it complects business logic (“how do I run Snowplow batch?”) with general-purpose EMR execution (“how do I spin up an EMR cluster? how do I recover from a bootstrap failure?”). This is problematic for a few reasons:

  • It makes it very difficult to test EmrEtlRunner thoroughly, because the side-effecting EMR execution and the business logic are mixed together
  • The jobflow DAG is only constructed at runtime by code, which makes it very difficult to share with other orchestration tools such as Factotum
  • EmrEtlRunner ties the Snowplow batch pipeline needlessly closely to EMR, when the Snowplow batch components could be as easily run on YARN, Google Cloud Dataproc or Azure HDInsight
  • It makes it impossible for non-Snowplow projects to use our (in theory general-purpose) EMR execution code

Given these limitations, the plan is to split EmrEtlRunner into two applications: snowplowctl and Dataflow Runner. We’ll look at Dataflow Runner first.

Dataflow Runner

Overview

Dataflow Runner will be a general-purpose executor for data processing jobs (such as Hadoop or Spark) on dataflow fabrics, starting with AWS Elastic MapReduce and eventually supporting YARN, Google Dataproc and Azure HDInsight.

A crucial point: while we will port across from EmrEtlRunner everything we have learnt about running Snowplow jobs reliably on EMR, we will not port across any Snowplow-specific business logic.

Instead, Dataflow Runner will be somewhat similar to our existing SQL Runner app. It will take two main arguments:

  1. A playbook containing all of the jobflow steps we want to run on the cluster
  2. A specification for the new cluster to create, or alternatively an identifier for an existing cluster on which to run our playbook

The command-line interface is still TBC, but we are thinking of something like:

$ dataflow-runner up cluster-spec.avro
{ "cluster-id": "123" }
$ dataflow-runner run jobflow-playbook.avro --cluster-id 123
$ dataflow-runner halt 123

Or where the cluster is brought up for a single job:

$ dataflow-runner run jobflow-playbook.avro --cluster-spec cluster-spec.avro
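To make the cluster-spec idea more tangible, here is a purely illustrative sketch of what such a file might contain, shown as JSON for readability. The schema URI, field names and values are all hypothetical; the real format is still to be designed:

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-0-0",
  "data": {
    "name": "snowplow-batch",
    "region": "us-east-1",
    "ec2": {
      "keyName": "snowplow-key",
      "instances": {
        "master": { "type": "m1.medium" },
        "core":   { "type": "m1.medium", "count": 3 }
      }
    }
  }
}
```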

Implementation

We will likely use self-describing Avro for both the cluster configuration and the playbook specifying the jobflow, for a couple of reasons:

  1. A self-describing format will make it easy to validate Dataflow Runner configuration files
  2. Using Avro will allow us to auto-generate structs/classes for the Dataflow Runner to use to interact with these configuration files

We are considering implementing Dataflow Runner in Golang, for several reasons:

  1. Low memory requirements versus a JVM app will make it easier to run many instances of Dataflow Runner on a Mesos cluster at the same time
  2. Golang has a growing community involved in infrastructure work, which should help drive community contributions
  3. Golang SDKs for interfacing with EMR and Dataproc are already available

Alternatives to Golang for Dataflow Runner would be Scala, Java or Kotlin.

Testing

Testing will be an important part of Dataflow Runner. Because Dataflow Runner will not contain any business logic, it should be relatively straightforward to test Dataflow Runner using mocks or even a test AWS account.

snowplowctl

Overview

Dataflow Runner will be a powerful general-purpose EMR jobflow runner, but writing its playbooks by hand would be intimidating for new Snowplow users. EmrEtlRunner contains a lot of Snowplow-specific batch pipeline logic, and we will need a new home for this logic.

The idea is to introduce a new command-line tool called snowplowctl, modelled after the AWS CLI (aws), to provide various commands to help with setting up a Snowplow pipeline.

snowplowctl would have a command to generate a valid Dataflow Runner playbook, based on certain provided arguments and/or a config. This could look something like this:

$ snowplowctl generate dataflow --config config.yml
Dataflow Runner playbook config.avro written

The exact interface and support for existing EmrEtlRunner commands (such as --skip and --start) needs further thought.

Implementation

There is an open question whether we should implement snowplowctl as a “clean room” app, or whether we should chisel it out of the current EmrEtlRunner.

Once the EMR orchestration code has been moved out to Dataflow Runner, there will not be a lot left in EmrEtlRunner - potentially favoring the clean room implementation.

If we were to do a clean room implementation, we wouldn’t use JRuby - instead we would likely use Scala, Java, Golang or Kotlin. Specifically, we would need a language which has first-class support for:

  • Avro, for writing Dataflow Runner playbooks
  • Iglu, for validating self-describing JSONs and Avros
  • AWS, for e.g. uploading configuration files to DynamoDB

Testing

Testing snowplowctl would be relatively straightforward: because all of the complex I/O has been moved out to Dataflow Runner, a lot of snowplowctl will consist of pure functions, and can be tested as such.

Other commands

This would just be the start for snowplowctl - we would add further commands to make it easier to work with Snowplow.

For example, we could add a command to lint our various configuration files:

$ snowplowctl lint --enrichments ./enrichments-dir

It would be nice to have a command for uploading enrichments to DynamoDB:

$ snowplowctl store dynamodb --enrichments ./enrichments-dir

Open questions

1. What happens to config.yml?

Currently EmrEtlRunner (and StorageLoader) are driven by an extremely comprehensive config.yml YAML configuration file, see the sample file.

Of the current file:

  • The cluster configuration would be superseded by the Dataflow Runner’s cluster configuration file
  • We would like to move storage target configurations out of this file (into self-describing JSONs, similar to the enrichments)
  • Perhaps all of the S3 bucket paths could be simplified, by snowplowctl making more assumptions about the layout of Snowplow’s files in S3

Given the above, potentially we can remove config.yml completely, or at least pare it down significantly.

2. Better name for Dataflow Runner?

Dataflow Runner might be a little confusing, especially given Google Cloud Dataflow, which this app would (at least initially) not support.

Other suggestions for names welcome.

3. How do we sequence these changes?

There are at least 2 blockers before we can accomplish this split:

  • We need to remove all of the Sluice code from EmrEtlRunner (ideally replacing it with equivalent functionality powered by S3DistCp). Otherwise there will be no way of representing this functionality in Dataflow Runner
  • We are planning to move StorageLoader from a standalone orchestration app to an app which runs on the Hadoop cluster as part of the EMR jobflow. It makes sense to make this change first

Although there are blockers before we can complete this migration, we can get started with the standalone Dataflow Runner earlier; it makes sense for this new app to achieve some maturity before we move Snowplow workloads over to it.

REQUEST FOR COMMENTS

As always, please do share comments, feedback, criticisms, alternative ideas in the thread below.

