Dataflow Runner setup

Alright, so we now have the Quick start setup running on AWS and are getting results in the RDS database. I have a couple of questions about setting things up for Redshift:

  1. My understanding is that this is typically accomplished through the RDB shredder/loader, which involves setting up the Dataflow Runner. I have read the documentation on this, but I was wondering whether there are any more step-by-step guides for getting it up and running, as I’m not sure where the best place to install the runner is or how it actually runs (I'm not familiar with EMR clusters or playbooks).
  2. I wanted to gut-check the differences between the Postgres loader and the RDB loader. From what I’ve been able to tell, the RDB loader splits up the data and stores it differently than the Postgres loader, which I imagine helps with storage size. Does that sound right?
  3. If my understanding above is correct, is there a simpler way to push event data to Redshift the way the Postgres loader does, without breaking it up? Are there issues with that approach?


Update: I have installed the Dataflow Runner on an EC2 instance in an effort to get this running. When I attempt to run it, I get an error: At least one of Availability Zone and Subnet id is required. I have subnetId set in the cluster.json file that I’m passing in as config, using the syntax from the cluster.json.sample in the dataflow runner repo, so I assume the config isn’t loading correctly, since that looks like one of the first things referenced when the application runs. Is .json the right file type? Is ./dataflow-runner up --emr-config config/cluster.json the right way to reference that file?

Update 2: This seems to have been an authentication issue. I updated the credentials from env to default and that seems to have resolved things.

Hi @Ben_Harker
Thanks for the update - glad to hear that you appear to have sorted things out.
Also, thanks for pointing out how you did this - super helpful to other people.