Dataflow Runner setup

Ben_Harker · February 2, 2022, 10:50pm

Alright, so we now have the Quick start setup running on AWS and are getting results in the RDS database. I have a couple of questions about setup for Redshift:

1. My understanding is that this is typically accomplished through the RDB shredder/loader, which involves setting up the dataflow runner. I have viewed the documentation here: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/dataflow-runner/ but I was wondering if there are any more step-by-step guides for getting it up and running. I'm not sure where the best place to install the runner is, or how it really runs (I'm not familiar with EMR clusters/playbooks).
2. I wanted to gut-check the differences between the postgres loader and the rdb loader. From what I've been able to tell, the rdb loader splits up the data and stores it differently than the postgres loader, which I imagine helps with storage size. Does that sound right?
3. If my understanding above is correct, is there a simpler way to push event data to Redshift, similar to the postgres loader, without breaking it up? Are there issues with that approach?

Thanks

Ben_Harker · February 9, 2022, 12:42am

Update: I have installed the dataflow runner on an EC2 instance in an effort to get this running. When I attempt to run it I get an error: `At least one of Availability Zone and Subnet id is required`. I have the subnetId set in the cluster.json file that I'm passing in for config, using the syntax from the cluster.json.sample in the dataflow runner repo. That setting looks like one of the first things referenced when the application runs, so I assume the config isn't loading correctly. Is .json the right filetype? Is `./dataflow-runner up --emr-config config/cluster.json` the right way to reference that file?

Ben_Harker · February 9, 2022, 5:33pm

Update 2: This seems to have been an authentication issue. I updated the credentials from env to default and that seems to have resolved things.
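In case it helps anyone hitting the same error, here is a rough Python sketch for checking what the config file actually contains before running `dataflow-runner up`. It is not part of Dataflow Runner. The key paths (`data.credentials.accessKeyId`, `data.ec2.location.vpc.subnetId`) are my assumption based on the layout of cluster.json.sample, the helper name `summarize_cluster_config` is made up, and the sample values below are placeholders.

```python
import json

# Placeholder config, assuming the same layout as cluster.json.sample.
# Other fields from the sample are omitted here.
SAMPLE_CLUSTER_JSON = """
{
  "data": {
    "credentials": {"accessKeyId": "default", "secretAccessKey": "default"},
    "ec2": {"location": {"vpc": {"subnetId": "subnet-123456"}}}
  }
}
"""


def summarize_cluster_config(config: dict) -> dict:
    """Pull out the credential settings and subnet id (hypothetical helper)."""
    data = config.get("data", {})
    credentials = data.get("credentials", {})
    vpc = data.get("ec2", {}).get("location", {}).get("vpc", {})
    return {
        "access_key_mode": credentials.get("accessKeyId"),
        "secret_key_mode": credentials.get("secretAccessKey"),
        "subnet_id": vpc.get("subnetId"),
    }


summary = summarize_cluster_config(json.loads(SAMPLE_CLUSTER_JSON))
for key, value in summary.items():
    # A value of None means the key was not found at the assumed path.
    print(f"{key}: {value}")
```

To check a real file, replace the sample string with the contents of config/cluster.json, e.g. `json.load(open("config/cluster.json"))`. If `subnet_id` prints as expected, the file itself is probably fine and the credentials setting is the next thing to look at, which is what turned out to be the problem in my case.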

EddieM · February 11, 2022, 11:10am

Hi @Ben_Harker
Thanks for the update - glad to hear that you appear to have sorted things out.
Also, thanks for pointing out how you did this - super helpful to other people.
Cheers,
