Elasticity Spark Step: Shred Enriched Events: consistent failure without clear reason


#1

Hi there,

I’ve been trying to run the EmrEtlRunner for a few days now, and it consistently fails at the “Elasticity Spark Step: Shred Enriched Events” step.

The step is shredding millions of events, and it fails every time.

All runs after the first failure have been with --skip staging,enrich.

Every stderr log file ends as below:

17/10/25 01:04:38 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:39 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:40 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:41 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:42 INFO Client: Application report for application_1508886109246_0002 (state: RUNNING)
17/10/25 01:04:43 INFO Client: Application report for application_1508886109246_0002 (state: FINISHED)
17/10/25 01:04:43 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 172.31.45.129
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1508886685765
	 final status: FAILED
	 tracking URL: http://ip-172-31-42-20.us-west-2.compute.internal:20888/proxy/application_1508886109246_0002/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1508886109246_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/10/25 01:04:43 INFO ShutdownHookManager: Shutdown hook called
17/10/25 01:04:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-75e08e3c-9ff2-42c0-b30d-bad9df2abaf8
Command exiting with ret '1'

Any idea why?

This is running on a cluster of 3x m3.xlarge CORE machines and 0 TASK nodes; it runs for a little shy of 2 hours and then fails.

version: snowplow-rdb-shredder-0.12.0


#3

@cmartins you need to look at the Spark logs to know why it failed.

Those logs are located in the bucket you specified in your EmrEtlRunner’s config.yml (aws -> s3 -> buckets -> log).

They should be located in s3://{{your bucket above}}/j-{{cluster id}}/containers/{{application id}}/{{container id}}

Note that the application ids are numbered according to the order in which the steps occurred, e.g. if your shred job was the 7th step, the application id will end in 0007.

Note also that the container id ending in 1 will be your driver; the rest are executors.
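To make the path template above concrete, here is a small shell sketch that assembles the S3 prefix for a given application. All the values below are hypothetical placeholders; substitute the log bucket from your own config.yml, your cluster id from the EMR console, and the application id from your stderr log:

```shell
#!/bin/sh
# Hypothetical example values -- replace with your own.
LOG_BUCKET="s3://my-snowplow-logs"        # aws -> s3 -> buckets -> log in config.yml
CLUSTER_ID="j-ABC123DEF456"               # EMR cluster id from the console
APP_ID="application_1508886109246_0002"   # application id from the stderr log

# Prefix under which the YARN container logs live.
PREFIX="${LOG_BUCKET}/${CLUSTER_ID}/containers/${APP_ID}/"
echo "${PREFIX}"

# Then list the containers and pull them down for inspection, e.g.:
#   aws s3 ls "${PREFIX}"
#   aws s3 cp --recursive "${PREFIX}" ./yarn-logs/
```

The driver container (id ending in 1) is usually the first place to look for the actual exception.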

However, at first sight this looks like a cluster dimensioning issue; please have a look at this discussion to fine-tune your cluster.


#4

Thank you - that did the trick - that spreadsheet is great. We should publish it more broadly.