I currently have access to an S3 bucket containing raw data collected since the beginning of 2017 (with tracker version 2.6.2). The data has been collected but never processed. I want to focus on enriching the data (no shredding yet) to see what its quality is. Because I don’t have much experience with enrichment, I was wondering what the best steps to take are (based on https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner):
- Installing EmrEtlRunner. Does it matter which version from http://dl.bintray.com/snowplow/snowplow-generic/ I use?
- Setting up the YAML file. I can use the sample file (https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample) as a starting point, but does the content of this file depend on the EmrEtlRunner version?
Any other best practices or tips are welcome.