Elasticity Spark Step: Shred Enriched Events: consistent failure without clear reason


#1

Hi there,

I’ve been trying to run the EmrEtlRunner for a few days now, and it consistently fails at the “Elasticity Spark Step: Shred Enriched Events” step.

The step is shredding millions of events, and it fails every time.

All runs after the first failure have been with --skip staging,enrich.

Every stderr log file ends as below:

17/10/25 01:04:38 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:39 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:40 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:41 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:42 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:43 INFO Client: Application report for application_1508886109246_0002 (state: FINISHED)
17/10/25 01:04:43 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 172.31.45.129
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1508886685765
	 final status: FAILED
	 tracking URL: http://ip-172-31-42-20.us-west-2.compute.internal:20888/proxy/application_1508886109246_0002/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1508886109246_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/10/25 01:04:43 INFO ShutdownHookManager: Shutdown hook called
17/10/25 01:04:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-75e08e3c-9ff2-42c0-b30d-bad9df2abaf8
Command exiting with ret '1'

Any idea why?

This is running on a cluster of 3x m3.xlarge CORE machines and 0 TASK nodes; it runs for a little shy of 2 hours and then fails.

version: snowplow-rdb-shredder-0.12.0


#3

@cmartins you need to look at the Spark logs to know why it failed.

Those logs are located in the bucket you specified in your EmrEtlRunner’s config.yml (aws -> s3 -> buckets -> log).

They should be located in s3://{{your bucket above}}/j-{{cluster id}}/containers/{{application id}}/{{container id}}

Note that the application ids are numbered according to the order in which the steps occurred, e.g. if your shred job was the 7th step, the application id will end in 0007.

Note also that the container id ending in 1 will be your driver; the rest are executors.
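To make the path template above concrete, here is a small shell sketch that assembles the S3 prefix for a given application. All the values below are hypothetical placeholders; substitute the log bucket from your own config.yml, your cluster id from the EMR console, and the application id from your stderr log:

```shell
#!/bin/sh
# Hypothetical example values -- replace with your own.
LOG_BUCKET="s3://my-snowplow-logs"        # aws -> s3 -> buckets -> log in config.yml
CLUSTER_ID="j-ABC123DEF456"               # EMR cluster id from the console
APP_ID="application_1508886109246_0002"   # application id from the stderr log

# Prefix under which the YARN container logs live.
PREFIX="${LOG_BUCKET}/${CLUSTER_ID}/containers/${APP_ID}/"
echo "${PREFIX}"

# Then list the containers and pull them down for inspection, e.g.:
#   aws s3 ls "${PREFIX}"
#   aws s3 cp --recursive "${PREFIX}" ./yarn-logs/
```

The driver container (id ending in 1) is usually the first place to look for the actual exception.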

However, at first sight this looks like a cluster dimensioning issue; please have a look at this discussion to fine-tune your cluster.


#4

Thank you - that did the trick - that spreadsheet is great. We should publish it more broadly.