Shred step just started failing (R97)

Hi -

We are running R97 Knossos – we haven’t upgraded in over a year because we never had a problem.

However, last week a couple of our nightly ETL jobs from a CloudFront collector failed at the Shred step. Rerunning with ‘-f shred’, the job completed OK.
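
For completeness, the recovery rerun was just our usual EmrEtlRunner invocation resumed from shred – roughly this, with placeholder paths for our real config, resolver and targets files:

# resume the nightly job from the failed Shred step (paths are placeholders)
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json \
  --targets targets/ \
  -f shred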

But after the same error last night, we have had no success with three recovery attempts.

Maybe I’m not looking at the right log, but stderr for the failed step is not super informative:

19/03/19 11:07:33 INFO Client: 
	 client token: N/A
	 diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted.
	 ApplicationMaster host: 10.0.0.96
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1552992503630
	 final status: FAILED
	 tracking URL: http://ip-10-0-0-82.ec2.internal:20888/proxy/application_1552992078044_0002/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1552992078044_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/03/19 11:07:33 INFO ShutdownHookManager: Shutdown hook called
19/03/19 11:07:33 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-570e85f9-ed28-48c3-9ba6-e116cb96b606
Command exiting with ret '1'

At this point I would appreciate any advice at all. Thanks in advance!

Wade Leftwich
Ithaca, NY

Responding to my own post.

I disabled cross-batch natural deduplication by removing the DynamoDB config from my targets directory, and the job ran to completion.
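
Concretely, all I did was move the DynamoDB duplicate-storage JSON out of the targets directory that EmrEtlRunner reads – the filename and destination below are placeholders for whatever yours are called:

# take the cross-batch dedup target out of the pipeline run (names are placeholders)
mv targets/duplicates_dynamodb.json targets_disabled/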

I don’t actually know whether this really made a difference, because the problem had been intermittent, and there were no errors logged in DynamoDB.

But anyway, at least I got yesterday’s data into Redshift.