How to configure and run EmrEtlRunner on Amazon?


#1

The documentation does not have many details on how to do this; I would like a step-by-step guide. I'm just starting with Snowplow and I've had no success running EmrEtlRunner. Could someone help me?


#2

@feliciosan, that could be quite a big topic. Your starting point is here: https://github.com/snowplow/snowplow/wiki/Setting-up-EmrEtlRunner. The best approach is to follow the guidance there and ask questions along the way if you get stuck anywhere.

Also, the following diagram should help in understanding the workflow of the data processed by EmrEtlRunner: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.


#3

I’ve got the config.yml.sample, and I also downloaded the snowplow-emr-etl-runner file, but I don’t know how to set this up and run it on AWS. I already have an EC2 instance running on AWS. I got stuck at step 4 (CONFIGURATION) of this link: https://github.com/snowplow/snowplow/wiki/Setting-up-EmrEtlRunner


#4

@feliciosan, step 4 on that page is called “Self-hosting Spark Enrich”, not “CONFIGURE”. That step is optional and can be skipped.

If you need help preparing the config.yml, you can refer to Common configuration.

Additionally, you will need

Once the configuration files are complete, you can run EmrEtlRunner as per https://github.com/snowplow/snowplow/wiki/2-Using-EmrEtlRunner.
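
For reference, here is a rough Python sketch of that last step: it checks that the files are in place and prints the command it would run. It assumes a release where the runner accepts a `run` subcommand with `--config` and `--resolver` options (older releases may differ, so check the usage page linked above), and the file names, including `iglu_resolver.json`, are placeholders rather than required names.

```python
import shlex
from pathlib import Path


def build_command(runner, config, resolver):
    """Return the argv list for an EmrEtlRunner invocation.

    Assumption: the `run` subcommand and the --config / --resolver options
    match your EmrEtlRunner release; verify against the usage guide.
    """
    return [str(runner), "run", "--config", str(config), "--resolver", str(resolver)]


def missing_files(*paths):
    """Return the paths that do not exist, so problems surface before launch."""
    return [str(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Hypothetical locations; adjust them to wherever you unpacked the files.
    runner = "./snowplow-emr-etl-runner"
    config = "config.yml"
    resolver = "iglu_resolver.json"

    missing = missing_files(runner, config, resolver)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        # Printed rather than executed, so you can review it first.
        print("Would run:", shlex.join(build_command(runner, config, resolver)))
```

Once the printed command looks right, you can paste it into a shell on the EC2 instance, or pass the list from `build_command` to `subprocess.run`.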