Not sure where to grab the step names or jar locations from the current running config. I am pulling from an S3 bucket that S3 Loader populates with gzipped output from the Scala Stream Enrich app.
@dbuscaglia, to convert from EmrEtlRunner (running in Stream Enrich mode as per your earlier topics) to DataflowRunner, your playbook could be utilizing the following jars:
- `s3-dist-cp.jar` from AWS (see their wiki). You can use this utility to move files between buckets (we use it in the latest versions of EmrEtlRunner)
- `s3://snowplow-hosted-assets/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.0.jar` to shred your data
- `s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar` to load your data into Redshift
Note: you might need to adjust the jar location according to the region the buckets are in. For example, if us-east-1 is used, the RDB Loader jar would be `s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar`. You might also want to keep an eye on the latest versions of the above apps.
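For reference, a DataflowRunner playbook is a self-describing JSON file: the `data.steps` array holds the EMR steps, and each step points at one of the jars above. A minimal skeleton could look like the following; the schema version, field names, and the `"env"` credentials (which tell DataflowRunner to read AWS keys from the environment) follow the DataflowRunner docs as I remember them, so double-check them against your installed version:

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "us-east-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "Example step",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar",
        "arguments": []
      }
    ],
    "tags": []
  }
}
```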
Your DataflowRunner playbook would then contain the following steps, replicating the EmrEtlRunner steps from the batch pipeline dataflow diagram (assuming `run` mode against a persistent EMR cluster); a sketch of these steps as playbook entries follows the list:
- Stage files from the enriched:good bucket (with the S3DistCp utility)
- Shred files (to place shredded files into the shredded:good bucket)
- Load data to Redshift
- Archive enriched files to the enriched:archive bucket (with the S3DistCp utility)
- Archive shredded files to the shredded:archive bucket (with the S3DistCp utility)
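To make that concrete, here is a sketch of the `steps` array that would slot into the skeleton above. All bucket paths under `s3://my-pipeline/...`, the hardcoded `run=...` folder (in practice you would template it), and the config file locations on the cluster are placeholder assumptions; the application flags follow what EmrEtlRunner generated for RDB Shredder 0.13.0 and RDB Loader 0.14.0 at the time, so verify them against the docs for the versions you actually run. The `{{base64File ...}}` calls are DataflowRunner template functions, resolved before the JSON is parsed:

```json
"steps": [
  {
    "type": "CUSTOM_JAR",
    "name": "Stage enriched files",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    "arguments": [
      "--src", "s3://my-pipeline/enriched/stream/",
      "--dest", "s3://my-pipeline/enriched/good/run=2018-01-01-00-00-00/",
      "--deleteOnSuccess"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Shred enriched events",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "command-runner.jar",
    "arguments": [
      "spark-submit",
      "--class", "com.snowplowanalytics.snowplow.storage.spark.ShredJob",
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.0.jar",
      "--iglu-config", "{{base64File "/snowplow/configs/resolver.json"}}",
      "--input-folder", "s3://my-pipeline/enriched/good/run=2018-01-01-00-00-00/",
      "--output-folder", "s3://my-pipeline/shredded/good/run=2018-01-01-00-00-00/",
      "--bad-folder", "s3://my-pipeline/shredded/bad/run=2018-01-01-00-00-00/"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Load Redshift",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar",
    "arguments": [
      "--config", "{{base64File "/snowplow/configs/config.yml"}}",
      "--resolver", "{{base64File "/snowplow/configs/resolver.json"}}",
      "--target", "{{base64File "/snowplow/configs/redshift.json"}}",
      "--logkey", "s3://my-pipeline/log/rdb-loader/2018-01-01-00-00-00"
    ]
  },
  {
    "type": "CUSTOM_JAR",
    "name": "Archive enriched files",
    "actionOnFailure": "CANCEL_AND_WAIT",
    "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    "arguments": [
      "--src", "s3://my-pipeline/enriched/good/run=2018-01-01-00-00-00/",
      "--dest", "s3://my-pipeline/enriched/archive/run=2018-01-01-00-00-00/",
      "--deleteOnSuccess"
    ]
  }
]
```

The final step (archiving shredded:good to shredded:archive) has the same shape as the enriched archive step with the shredded paths swapped in, so it is omitted here for brevity. Since this assumes a persistent cluster, you would bring the cluster up once with DataflowRunner's `up` command and then submit this playbook with `run` against the returned cluster ID, rather than using the transient `run-transient` mode.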