Best (enrichment) steps to take with old implementation


#1

Hello all,

I currently have access to an S3 bucket containing raw data collected since the beginning of 2017 (with tracker version 2.6.2). The data has been collected but never processed. I want to focus on enriching the data (no shredding yet) to see what its quality is. Since I don’t have much experience with the enrichment part, I was wondering what the best steps to take are (based on https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner):

  1. Installing EmrEtlRunner. Does it matter which version from http://dl.bintray.com/snowplow/snowplow-generic/ I use?
  2. Setting up the YAML file. I can use the sample file (https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample) as input, but does the information in this file depend on the EmrEtlRunner version?

Any other best practices / tips are welcome :slight_smile:

Greetings,
Bart


#2

Your intentions are a bit vague here, but:

  1. I doubt the EmrEtlRunner version needs to depend on the tracker version.
  2. Yes, the EMR AMI version and the software versions in the config depend directly on the EmrEtlRunner version you are running. You can find a config example for every version tag on GitHub at 3-enrich/emr-etl-runner/config/config.yml.sample
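
To make point 2 concrete, here is a minimal Python sketch that builds the GitHub URL of the sample config for a given tag or branch. The helper name `sample_config_url` is hypothetical, and the URL layout is simply the one from the `master` link quoted in the first post; check that the tag you pass actually exists in the repository.

```python
from urllib.parse import quote

# Base parts taken from the sample-config link quoted earlier in the thread.
REPO_BLOB = "https://github.com/snowplow/snowplow/blob"
SAMPLE_PATH = "3-enrich/emr-etl-runner/config/config.yml.sample"


def sample_config_url(tag: str) -> str:
    """Return the URL of config.yml.sample for a given git tag or branch.

    `tag` is whatever release tag matches your EmrEtlRunner version
    (hypothetical helper; the tag is not validated against GitHub).
    """
    return f"{REPO_BLOB}/{quote(tag, safe='')}/{SAMPLE_PATH}"


if __name__ == "__main__":
    # "master" reproduces the link from the original question.
    print(sample_config_url("master"))
```

Swapping `"master"` for the release tag of the EmrEtlRunner binary you downloaded should give you the sample config that matches that version.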

#3

The enrichment process is independent of the tracking, so you shouldn’t have any issues running EmrEtlRunner on your existing data. It’s worthwhile using the latest version of Spark Enrich, and probably running on a small subset of the data first, such as a single day, rather than the entire period.
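
As a rough illustration of picking a single day, here is a small Python sketch that filters a list of S3 object keys down to one date. It assumes the raw log keys contain the date as `YYYY-MM-DD` somewhere in the name; that naming, the example keys, and the helper `keys_for_day` are all assumptions, so check how your collector actually names its files.

```python
from datetime import date
from typing import Iterable, List


def keys_for_day(keys: Iterable[str], day: date) -> List[str]:
    """Keep only the keys whose name contains the given day as YYYY-MM-DD.

    Assumption: the collector writes the date into each object key.
    """
    stamp = day.isoformat()  # e.g. "2017-03-01"
    return [key for key in keys if stamp in key]


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    example_keys = [
        "raw/2017-03-01-part-0001.gz",
        "raw/2017-03-02-part-0001.gz",
        "raw/2017-03-01-part-0002.gz",
    ]
    for key in keys_for_day(example_keys, date(2017, 3, 1)):
        print(key)
```

The selected keys could then be copied into a separate test input location, so the first run only touches that one day.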

In the EmrEtlRunner command you can set up the option to skip shredding (and other steps) so that you can just archive the data on S3.
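
A minimal Python sketch of how such a command line could be assembled. It assumes the `--config`, `--resolver` and `--skip` options and the `shred` step name as I understand them from the EmrEtlRunner wiki, plus the `run` subcommand used by newer releases (older ones omit it). The file names are placeholders, and the step names available to skip differ between versions, so verify against the usage page for the version you installed.

```python
import shlex
from typing import List, Sequence


def build_command(config: str, resolver: str, skip: Sequence[str] = ()) -> List[str]:
    """Build an EmrEtlRunner argument list (sketch, not validated against any release).

    Assumption: newer releases take a `run` subcommand; drop it for older ones.
    """
    cmd = ["./snowplow-emr-etl-runner", "run",
           "--config", config,
           "--resolver", resolver]
    if skip:
        # Steps to skip are passed as one comma-separated value.
        cmd += ["--skip", ",".join(skip)]
    return cmd


if __name__ == "__main__":
    argv = build_command("config.yml", "resolver.json", skip=["shred"])
    # Print a shell-safe version of the command for copy/paste.
    print(" ".join(shlex.quote(part) for part in argv))
```

The printed line can be reviewed before running it by hand, or the list can be passed to `subprocess.run` directly.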


#4

I managed to get the ETL job up and running for a day of data without shredding. Thanks @kazgurs1 & @mike for your help!