Suggested best practices for recovering from EmrEtlRunner failures?


#1

We run EmrEtlRunner under cronic (via the snowplow-runner-and-loader.sh script). So far as I can tell, the only place we deviate from the setup described in https://github.com/snowplow/snowplow/wiki/3-Scheduling-EmrEtlRunner is that we run hourly instead of daily. Most of the time it runs perfectly. But sometimes it fails, and I’m writing because that has happened maybe 10 times in the past week. When it fails, data is left in either the processing or enriched/good buckets, and all subsequent ETL runs just abort. Right now, when it fails, someone has to manually run the right command to restart everything. In the past week we’ve really only seen two kinds of failure: Iglu Central returning a 500 when we try to fetch a schema, and yesterday’s AWS troubles, which caused some files to fail to move between locations.

I feel like we ought to be able to recover from, e.g., the Iglu Central failures without human intervention, but the cron job obviously can’t do that on its own. Does anyone have any suggestions for handling ‘common’ failures like these automatically?


#2

Hi @charlietanksley,

What version of Snowplow are you running? Release R79 addresses the handling of Iglu Central failures. In older versions, the client aggressively cached failed schema lookups, which could cause a whole set of events to fail if the first lookup of a schema failed unexpectedly. Starting from R79, the client will retry the lookup 3 times before caching a schema as missing.
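
To make the caching behaviour concrete, here is a minimal Python sketch of "retry before caching as missing". It is not the Iglu client's actual code (the real client is written in Scala); the class name `SchemaCache`, the `fetch` callable and the reading of "3 times" as three attempts in total are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


class SchemaCache:
    """Hypothetical cache that retries a failed lookup before recording a miss."""

    def __init__(self, fetch: Callable[[str], dict], max_attempts: int = 3):
        self._fetch = fetch                # assumed to raise OSError on HTTP/network errors
        self._max_attempts = max_attempts  # assumption: 3 attempts in total
        self._cache: Dict[str, Optional[dict]] = {}

    def lookup(self, key: str) -> Optional[dict]:
        # Serve cached results, including cached misses (None).
        if key in self._cache:
            return self._cache[key]

        schema: Optional[dict] = None
        for _ in range(self._max_attempts):
            try:
                schema = self._fetch(key)
                break                      # success: stop retrying
            except OSError:
                continue                   # transient failure: try again

        # Only after every attempt has failed is the schema cached as missing.
        self._cache[key] = schema
        return schema
```

With a cache that records a miss after a single failed attempt, one unlucky 500 response would mark the schema as missing for the rest of the run; with the retry loop, only a sustained failure does.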

As for rerunning the failed job, I guess you can’t really avoid manual intervention completely. However, you could simplify the process. Please take a look at this blog post, which showcases the Snowplow approach to this problem. See if you can accommodate it (or something similar) in your environment. In essence, each possible failure scenario has a corresponding makefile. Once the root cause is determined, you (re-)run the failed job with the appropriate makefile.
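
As a rough illustration of the "one makefile per scenario" idea, the sketch below maps a diagnosed scenario to the `make` invocation that would resume the pipeline. The scenario names and makefile paths are hypothetical placeholders, not files that ship with Snowplow; only `make -f <file>` is standard `make` usage.

```python
from typing import Dict, List

# Hypothetical scenario names and makefile paths; replace with your own.
RECOVERY_MAKEFILES: Dict[str, str] = {
    "staging_failed": "recovery/staging-failed.mk",
    "enrich_failed": "recovery/enrich-failed.mk",
    "archive_failed": "recovery/archive-failed.mk",
}


def recovery_command(scenario: str) -> List[str]:
    """Return the make command for a diagnosed failure scenario."""
    try:
        makefile = RECOVERY_MAKEFILES[scenario]
    except KeyError:
        # Unknown scenarios should go to a human rather than be guessed at.
        raise ValueError(f"no recovery makefile for scenario {scenario!r}")
    return ["make", "-f", makefile]
```

The function only builds the command; whoever has determined the root cause can inspect it and then run it, for example with `subprocess.run(recovery_command("enrich_failed"), check=True)`.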

All the possible failure points are well described in the wiki page Batch Pipeline Steps.

Hopefully this helps.

–Ihor


#3

Thanks @ihor,

I wanted to add, Charlie: it’s super important to us at Snowplow that Iglu Central has many 9s of availability - the proper functioning of Snowplow really depends on this. To this end we have deployed Iglu Central on a highly scalable architecture with a minimum of moving parts: it’s just an Amazon CloudFront-backed static website.

If you are seeing regular connectivity issues to Iglu Central, please provide us with as many details as you can and we will investigate.


#4

@ihor that is super helpful, thank you! We are on R77, so thanks for pointing out that the upgrade will help! I think with the tools you’ve given me here we can solve our problem. :smiley:


#5

Hey @alex,

Looking back over the logs I found two cases where the schema lookup in Iglu Central seems to have been the problem. I’ll paste the logs where I see the actual error message, but I can get you other logs from the relevant runs if that is helpful (and you can tell me what would be useful).

Here are the contents of logs/<bucket>/steps/<step>/stderr.gz for an EMR cluster that was created at 2016-07-16 13:24 (UTC-4):

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at com.twitter.scalding.Job$.apply(Job.scala:47)
	at com.twitter.scalding.Tool.getJob(Tool.scala:48)
	at com.twitter.scalding.Tool.run(Tool.scala:68)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner$.main(JobRunner.scala:33)
	at com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner.main(JobRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: com.snowplowanalytics.snowplow.enrich.common.FatalEtlError: NonEmptyList(error: NonEmptyList(error: Could not find schema with key iglu:com.snowplowanalytics.snowplow/currency_conversion_config/jsonschema/1-0-0 in any repository, tried:
    level: "error"
    repositories: ["Iglu Client Embedded [embedded]","Iglu Central [HTTP]"]
, error: Unexpected exception fetching iglu:com.snowplowanalytics.snowplow/currency_conversion_config/jsonschema/1-0-0 in HTTP Iglu repository Iglu Central: java.io.IOException: Server returned HTTP response code: 500 for URL: http://iglucentral.com/schemas/com.snowplowanalytics.snowplow/currency_conversion_config/jsonschema/1-0-0
    level: "error"
)
    level: "error"
)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$2.apply(EtlJob.scala:140)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$2.apply(EtlJob.scala:140)
	at scalaz.Validation$class.fold(Validation.scala:64)
	at scalaz.Failure.fold(Validation.scala:330)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob.<init>(EtlJob.scala:139)
	... 16 more

And here are the contents of the same file for an EMR cluster created at 2016-07-13 13:23 (UTC-4):

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at com.twitter.scalding.Job$.apply(Job.scala:47)
	at com.twitter.scalding.Tool.getJob(Tool.scala:48)
	at com.twitter.scalding.Tool.run(Tool.scala:68)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner$.main(JobRunner.scala:33)
	at com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner.main(JobRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: com.snowplowanalytics.snowplow.enrich.common.FatalEtlError: NonEmptyList(error: NonEmptyList(error: Could not find schema with key iglu:com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0 in any repository, tried:
    level: "error"
    repositories: ["Iglu Client Embedded [embedded]","Iglu Central [HTTP]"]
, error: Unexpected exception fetching iglu:com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0 in HTTP Iglu repository Iglu Central: java.io.IOException: Server returned HTTP response code: 500 for URL: http://iglucentral.com/schemas/com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0
    level: "error"
)
    level: "error"
)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$2.apply(EtlJob.scala:140)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$2.apply(EtlJob.scala:140)
	at scalaz.Validation$class.fold(Validation.scala:64)
	at scalaz.Failure.fold(Validation.scala:330)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob.<init>(EtlJob.scala:139)
	... 16 more

Let me know if there is any more information I can provide that would be useful!
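
Both traces above contain the same error line, so a script could recognise this failure mode from the step's stderr. The following is a minimal Python sketch built only on the message format shown in these two logs; the function names are hypothetical, and treating any 5xx status as transient is an assumption rather than documented Snowplow behaviour.

```python
import re
from typing import Optional

# Pattern derived from the "Unexpected exception fetching ..." lines in the logs above.
IGLU_HTTP_ERROR = re.compile(
    r"Unexpected exception fetching (?P<key>iglu:\S+) in HTTP Iglu repository "
    r"(?P<repo>.+?): java\.io\.IOException: Server returned HTTP response code: "
    r"(?P<status>\d{3}) for URL"
)


def find_iglu_http_failure(stderr_text: str) -> Optional[dict]:
    """Return details of the first Iglu HTTP failure in the log, or None."""
    match = IGLU_HTTP_ERROR.search(stderr_text)
    if match is None:
        return None
    return {
        "schema_key": match.group("key"),
        "repository": match.group("repo"),
        "status": int(match.group("status")),
    }


def looks_transient(failure: Optional[dict]) -> bool:
    """Assumption: a 5xx response is a server-side blip that may be worth retrying."""
    return failure is not None and 500 <= failure["status"] <= 599
```

Since the log is stored as stderr.gz, the text could be read with `gzip.open(path, "rt")` before being passed to `find_iglu_http_failure`.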


#6

Thanks @charlietanksley - it’s interesting that the job is failing on pretty much the first HTTP connection it needs to make (to validate the enrichment configurations). I wonder if the network connectivity issue is on the EMR side or on the Iglu Central side…

Do others have similar experiences to share?