"Elasticity Scalding Step: Shred Enriched Events" failures


#1

Hi

I’m using snowplow r77 and have started having failures on the Shred Enriched Event step (both staging and production environment) in the last 5 days.

Here is the output I get :
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-2K7LCNCVIN965 failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com
/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL Staging: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2016-04-11 08:09:40 +0000 - ]

    1. Elasticity S3DistCp Step: Raw S3 -> HDFS: COMPLETED ~ 00:01:08 [2016-04-11 08:09:40 +0000 - 2016-04-11 08:10:48 +0000]
    1. Elasticity Scalding Step: Enrich Raw Events: COMPLETED ~ 00:02:20 [2016-04-11 08:10:55 +0000 - 2016-04-11 08:13:15 +0000]
    1. Elasticity S3DistCp Step: Enriched HDFS -> S3: COMPLETED ~ 00:00:40 [2016-04-11 08:13:15 +0000 - 2016-04-11 08:13:55 +0000]
    1. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: COMPLETED ~ 00:00:40 [2016-04-11 08:13:55 +0000 - 2016-04-11 08:14:36 +0000]
    1. Elasticity Scalding Step: Shred Enriched Events: FAILED ~ 00:00:06 [2016-04-11 08:14:36 +0000 - 2016-04-11 08:14:42 +0000]
    1. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):

The step fails after 5-6 seconds and there are no logs available at all in EMR which makes it hard to debug.

I’m sure that there are some events to shred (in other words, I don’t get this error : https://github.com/snowplow/snowplow/wiki/Troubleshooting#shred-fail).

I was using spot instances, and disabled it (following the recommendation here : https://groups.google.com/forum/#!topic/snowplow-user/rFw6E4Ysafs) but still have the same problem.

Any idea of what could be wrong ?
Or what I should look for.

Thank you
Alexis


#2

Hi Alexis,

Thanks for raising this. We have seen this behavior across our users and our own jobs too - intermittent failures on Hadoop Shred after between 4 to 9 seconds (often 6 seconds).

Re-running the job almost always fixes it - very occasionally we have to restart it a couple of times.

Occasionally the failure is correlated with bootstrap failures bringing the cluster up (which EmrEtlRunner automatically recovers from).

We have a support ticket open with AWS to find out what is causing this. If it’s something in Hadoop Shred we’ll obviously fix it.

Will keep you posted!


#3

Hey!

I’m encountering the same issue. Any news/update on the situation?

I’ve tried to re-run the process with --process-shred xxxxx but still the same issue.
Will try again later today.

Let me know if you found something!

Thank you!
Tim


#4

Hi @Timmycarbone - the solution to this is in this thread: EMR jobflow failing on Hadoop Enrich step after a few seconds


#5

Awesome thank you!

Although re-running it another time fixed it, I will apply the solution in the linked thread.

Thanks again!
Tim