Converting from EmrEtlRunner to DataflowRunner example?


#1

I'm not sure where to grab the step names or jar locations from my currently running config. I am pulling from an S3 bucket populated by the S3 Loader, which gzips the output of the Scala Stream Enrich.


#2

@dbuscaglia, to convert from EmrEtlRunner (running in Stream Enrich mode, as per your earlier topics) to DataflowRunner, your playbook could use the following jars:

  1. s3-dist-cp.jar from AWS (see their wiki). You can use this utility to move files between buckets (we use it in the latest versions of EmrEtlRunner)
  2. s3://snowplow-hosted-assets/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.0.jar to shred your data
  3. s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar to load your data to Redshift

Note: you might need to adjust the jar location according to the region your buckets are in. For example, if you are in us-east-1, the RDB Loader jar would be s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar. You might also want to keep an eye on the latest versions of the above apps. A skeleton playbook showing where these jars plug in follows below.
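For reference, a DataflowRunner playbook is a self-describing JSON document. A minimal sketch with a single s3DistCp step might look like the following; the bucket names are placeholders, s3-dist-cp.jar is referenced at its usual on-cluster path, and you should check the schema version against the DataflowRunner release you are running:

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "us-east-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "Stage enriched files",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "s3://my-bucket/enriched/stream/",
          "--dest", "s3://my-bucket/enriched/good/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
          "--deleteOnSuccess"
        ]
      }
    ],
    "tags": []
  }
}
```

The unescaped quotes inside {{nowWithFormat ...}} are intentional: DataflowRunner renders the template before parsing the JSON, so the raw file is a template rather than strict JSON.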

Your DataflowRunner playbook would contain the following steps, replicating the EmrEtlRunner steps from the batch pipeline dataflow diagram (assuming you run DataflowRunner against a persistent EMR cluster); a sketch of the full steps array follows the list.

  1. Stage files from the enriched:stream bucket to the enriched:good bucket (with the s3DistCp utility)
  2. Shred the files (placing the shredded files into the shredded:good bucket)
  3. Load the shredded data into Redshift
  4. Archive enriched:good to the enriched:archive bucket (with the s3DistCp utility)
  5. Archive shredded:good to the shredded:archive bucket (with the s3DistCp utility)
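Filling in the steps array from the skeleton above, those five steps might be sketched as follows. Treat this as an outline rather than a tested config: the bucket paths are placeholders, the run= folders are generated with DataflowRunner's nowWithFormat helper, and the shredder's Spark main class and the shredder/loader flags should be double-checked against the README of the 0.13.0/0.14.0 releases you actually deploy:

```json
"steps": [
  {
    "type": "CUSTOM_JAR",
    "name": "Stage enriched files",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    "arguments": [
      "--src", "s3://my-bucket/enriched/stream/",
      "--dest", "s3://my-bucket/enriched/good/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--deleteOnSuccess"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Shred enriched events",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "command-runner.jar",
    "arguments": [
      "spark-submit",
      "--class", "com.snowplowanalytics.snowplow.storage.spark.ShredJob",
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "s3://snowplow-hosted-assets/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.0.jar",
      "--iglu-config", "{{base64File "resolver.json"}}",
      "--input-folder", "s3://my-bucket/enriched/good/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--output-folder", "s3://my-bucket/shredded/good/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--bad-folder", "s3://my-bucket/shredded/bad/run={{nowWithFormat "2006-01-02-15-04-05"}}/"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Load data into Redshift",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar",
    "arguments": [
      "--config", "{{base64File "config.yml"}}",
      "--resolver", "{{base64File "resolver.json"}}",
      "--target", "{{base64File "redshift.json"}}",
      "--logkey", "s3://my-bucket/log/rdb-loader/{{nowWithFormat "2006-01-02-15-04-05"}}"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Archive enriched",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    "arguments": [
      "--src", "s3://my-bucket/enriched/good/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--dest", "s3://my-bucket/enriched/archive/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--deleteOnSuccess"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Archive shredded",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    "arguments": [
      "--src", "s3://my-bucket/shredded/good/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--dest", "s3://my-bucket/shredded/archive/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
      "--deleteOnSuccess"
    ]
  }
]
```

One caveat: each {{nowWithFormat ...}} call is rendered independently, so to guarantee all steps agree on the same run folder you may prefer to pass a single run id in via an environment variable (e.g. with the systemEnv helper, if your DataflowRunner version provides it). The steps run sequentially on the cluster, and with CANCEL_AND_WAIT a failing step cancels the remaining ones while the persistent cluster stays up.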

#3

Thank you very much @ihor