Does Dataflow Runner replace EmrEtlRunner?


#1

I am reading up on Dataflow Runner but can’t work out whether it completely replaces EmrEtlRunner or is complementary to it. If it replaces it, what is the workflow for then using StorageLoader to get the data into Postgres or Redshift? Do you then need the Dataflow Runner Iglu schema plus the old config.yml?

Thanks


#3

As stated in the RFC, the goal in the long run is to have two components:

  • Dataflow Runner, which will actually spin up a cluster and run the pipeline
  • snowplow ctl, which will be in charge of generating, from “the old config.yml”, configuration files ready to be fed to Dataflow Runner

#4

So Dataflow Runner and snowplow ctl together will replace both EmrEtlRunner and StorageLoader?
Sorry to be dumb here.


#5

Indeed, Storage Loader will be turned into an application that will be part of the EMR jobflow (the one run by Dataflow Runner).

Sorry if this wasn’t clear. Anyway, this is still some way off; everything will be specified in due time :thumbsup:.


#6

Hi @BenFradet, for enriching and loading events to Redshift, is Dataflow Runner now the recommended approach? Or is using EmrEtlRunner + Storage Loader still the way to do it?


#7

Hi @bryce,

EmrEtlRunner is still the way to go, but we deprecated StorageLoader in the latest R90 release. @BenFradet’s upcoming R91 release will include a new generate command, which should ease the transition, but I believe EmrEtlRunner will remain the default approach for some time even after that.


#8

Thanks @anton! That’s kind of what I thought after finding and reviewing the recent release notes.