Schedule dataflow-runner (shredder) every 30 min and use ENV for shredder/loader

I have a few questions/concerns.

  1. How can I schedule dataflow-runner (shredder) every 30 min? (If that is not possible, how/when will dataflow-runner execute the shredder?)

  2. For Shredder/Loader we want to read values from ENV, e.g.:
    Shredder: logUri, src and dest
    Loader: schema, DB host URL, etc.

  3. Can Shredder run without an AWS secret key, provided the EC2 instance has all the necessary permissions?

  4. Can Loader load shredded data to Redshift without a username/password, using only an AWS role ARN?

It’s up to you. You could for instance use cron jobs or Nomad.
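
If neither cron nor Nomad is available, a plain loop can stand in for a cron entry like `*/30 * * * *`. The sketch below is only an illustration under that assumption: `next_run` and `run_forever` are hypothetical helper names, and the command passed in is whatever you normally use to launch the shredder.

```python
import datetime as dt
import subprocess
import time


def next_run(now: dt.datetime, interval_minutes: int = 30) -> dt.datetime:
    """Return the next wall-clock slot (e.g. :00 or :30) strictly after `now`.

    Assumes interval_minutes divides 60 evenly.
    """
    slot_start = now.replace(
        minute=now.minute - now.minute % interval_minutes,
        second=0,
        microsecond=0,
    )
    return slot_start + dt.timedelta(minutes=interval_minutes)


def run_forever(command: list, interval_minutes: int = 30) -> None:
    """Sleep until each slot, then run `command` and wait for it to finish."""
    while True:
        now = dt.datetime.now()
        wait_seconds = (next_run(now, interval_minutes) - now).total_seconds()
        time.sleep(max(wait_seconds, 0))
        # subprocess.run blocks until the command exits, so runs never overlap.
        subprocess.run(command, check=False)


# Hypothetical usage (runs until interrupted):
# run_forever(["./run-shredder.sh"])
```

Because the loop waits for each run to finish, a run that takes longer than 30 minutes simply delays the next one to the following slot. With cron you would have to guard against overlapping runs yourself.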

In cluster.json and playbook.json we can’t use environment variables apart from the AWS credentials, so if you don’t want to store the values directly in the files, you would need to retrieve them and update the config dynamically just before using it.
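
One way to do that dynamic update is to keep a template of the file with placeholders and render it right before the run. This is a minimal sketch, not something Dataflow Runner provides: `render_config` is a hypothetical helper, and the `${SHRED_LOG_URI}` placeholder name is an assumption chosen for illustration.

```python
import json
import os
from string import Template


def render_config(template_text: str, env=None) -> str:
    """Fill ${NAME} placeholders from the environment and validate the JSON.

    A literal dollar sign in the template must be written as $$.
    """
    values = os.environ if env is None else env
    # substitute() raises KeyError if a placeholder has no matching variable.
    rendered = Template(template_text).substitute(values)
    # Fail early if the substituted values produced malformed JSON.
    json.loads(rendered)
    return rendered


# Hypothetical usage: render a template into the file dataflow-runner reads.
# with open("cluster.json.tmpl") as src, open("cluster.json", "w") as dst:
#     dst.write(render_config(src.read()))
```

Only the template with placeholders then needs to live in the repository. The rendered file is created on the machine that runs dataflow-runner.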

In the configuration files for the loader and shredder, you can use environment variables directly in the HOCON, like this: "host": ${REDSHIFT_HOST}

Yes, what matters is that the EMR cluster and the IAM role used for the shredder have sufficient permissions. If the shredder is run with Dataflow Runner, then Dataflow Runner needs to know the secret key.

No, the Loader can’t; at the moment only authentication with username/password is supported.

Thanks @BenB, that solves most of my doubts. One last thing:

I’m running the shredder with Dataflow Runner, but I can’t put the AWS secret key in the config because the code lives in a public repo, and I can’t pass/replace it from ENV dynamically either. Do you see a better way?

I’m not sure I understand. When you run the shredder with dataflow-runner (with either a transient or a persistent EMR cluster), you have to specify the region and credentials:

    "region": "eu-central-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    },

The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables then need to be set when the dataflow-runner command is run. That’s the only way Dataflow Runner can know where to create/use the EMR cluster for the shredder (which region and which AWS account).

You have to set these two environment variables at some point; you can’t run the shredder without them.

It doesn’t matter where they are stored, but you need to retrieve them and set them on the machine where you run dataflow-runner, just before you run it.
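
As a rough sketch of that last step, a small wrapper can fetch the two values and hand them only to the dataflow-runner process, so nothing is written to the repository. The names here are assumptions: `fetch_credentials` is a hypothetical callable that returns the key pair from wherever you store it, and the command in the usage comment should be replaced with your own invocation.

```python
import os
import subprocess


def build_env(access_key_id: str, secret_access_key: str, base_env=None) -> dict:
    """Return a copy of the environment with the two AWS variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ACCESS_KEY_ID"] = access_key_id
    env["AWS_SECRET_ACCESS_KEY"] = secret_access_key
    return env


def run_dataflow_runner(fetch_credentials, command: list):
    """Fetch the credentials, then run `command` with them in its environment."""
    access_key_id, secret_access_key = fetch_credentials()
    env = build_env(access_key_id, secret_access_key)
    # check=True raises CalledProcessError if the command exits non-zero.
    return subprocess.run(command, env=env, check=True)


# Hypothetical usage; adjust the command to match how you call dataflow-runner:
# run_dataflow_runner(
#     my_secret_store_lookup,
#     ["./dataflow-runner", "run-transient",
#      "--emr-config", "cluster.json", "--emr-playbook", "playbook.json"],
# )
```

With "accessKeyId": "env" and "secretAccessKey": "env" in the config, the keys exist only in the environment of that one process.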
