Second job for importing bad rows


#1

Could someone explain this to me in more detail?
https://groups.google.com/forum/#!searchin/snowplow-user/Tobias$20/snowplow-user/qVqjNTDkuS4/uN4Tv3X6IQAJ


#2

Hi Tobias,

Sorry about the cutoff in my original answer! It was probably some sort of copy and paste error.

The idea is: first, run EmrEtlRunner with the --skip elasticsearch option. This skips the Elasticsearch step entirely, leaving your bad rows in S3.
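That first pass can be sketched as a single EmrEtlRunner invocation. The config file path here is a placeholder, not something from the original thread:

```shell
# First pass: run the full pipeline but skip the Elasticsearch step,
# so the bad rows are written to the bad-rows buckets in S3 and left there.
# (config/config.yml is an assumed path -- substitute your own config file.)
./snowplow-emr-etl-runner --config config/config.yml --skip elasticsearch
```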

Then note the path(s) of the bad rows bucket(s) you want to load into Elasticsearch, and edit your configuration file to use those buckets as sources for the Elasticsearch step:

sources: ["s3://out/enriched/bad/run=2015-01-01-00-00-00", "s3://out/shred/bad/run=2015-01-01-00-00-00"]
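In YAML list form, the same fragment looks like the sketch below. The exact nesting varies between EmrEtlRunner config versions, and the bucket names are just the examples from above:

```yaml
# Sketch of the sources entry for the Elasticsearch step.
# Field layout depends on your EmrEtlRunner version; bucket
# names are example run folders, not real paths.
sources:
  - "s3://out/enriched/bad/run=2015-01-01-00-00-00"
  - "s3://out/shred/bad/run=2015-01-01-00-00-00"
```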

Then run EmrEtlRunner again, skipping every step except the Elasticsearch step by passing --skip staging,s3distcp,enrich,shred,archive_raw.
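The second pass can then be sketched like this, reusing the same assumed config path; the skip list is exactly the one quoted above:

```shell
# Second pass: skip everything except the Elasticsearch step, so only
# the bad rows listed under `sources` are loaded into Elasticsearch.
./snowplow-emr-etl-runner --config config/config.yml \
  --skip staging,s3distcp,enrich,shred,archive_raw
```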

Splitting the job in two like this prevents an Elasticsearch timeout from causing the whole job to be reported as failed.

Hope that helps,
Fred