Second job for importing bad rows


#1

Could someone explain this to me in more detail?
https://groups.google.com/forum/#!searchin/snowplow-user/Tobias$20/snowplow-user/qVqjNTDkuS4/uN4Tv3X6IQAJ


#2

Hi Tobias,

Sorry about the cutoff in my original answer! It was probably some sort of copy and paste error.

The idea is: first, run EmrEtlRunner with the --skip elasticsearch option. This skips the Elasticsearch step entirely, leaving your bad rows in S3.
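That first pass can be sketched as a single EmrEtlRunner invocation. The config file path here is a placeholder, not something from the original thread:

```shell
# First pass: run the full pipeline but skip the Elasticsearch step,
# so the bad rows are written to the bad-rows buckets in S3 and left there.
# (config/config.yml is an assumed path -- substitute your own config file.)
./snowplow-emr-etl-runner --config config/config.yml --skip elasticsearch
```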

Then note the path(s) of the bad rows bucket(s) you want to load into Elasticsearch, and edit your configuration file to use those buckets as sources for the Elasticsearch step:

sources: ["s3://out/enriched/bad/run=2015-01-01-00-00-00", "s3://out/shred/bad/run=2015-01-01-00-00-00"]
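In YAML list form, the same fragment looks like the sketch below. The exact nesting varies between EmrEtlRunner config versions, and the bucket names are just the examples from above:

```yaml
# Sketch of the sources entry for the Elasticsearch step.
# Field layout depends on your EmrEtlRunner version; bucket
# names are example run folders, not real paths.
sources:
  - "s3://out/enriched/bad/run=2015-01-01-00-00-00"
  - "s3://out/shred/bad/run=2015-01-01-00-00-00"
```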

Then run EmrEtlRunner again, skipping every step except the Elasticsearch step by passing --skip staging,s3distcp,enrich,shred,archive_raw.
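The second pass can then be sketched like this, reusing the same assumed config path; the skip list is exactly the one quoted above:

```shell
# Second pass: skip everything except the Elasticsearch step, so only
# the bad rows listed under `sources` are loaded into Elasticsearch.
./snowplow-emr-etl-runner --config config/config.yml \
  --skip staging,s3distcp,enrich,shred,archive_raw
```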

Splitting the job in two like this prevents an Elasticsearch timeout from causing the whole job to be reported as failed.

Hope that helps,
Fred