Shred failure with R89/Spark

Hi guys,

Nice work on the Spark release! Our pipeline ran successfully a few times, but as I was experimenting with instance types, the Shred step failed 2 hours into the job. This is probably memory-related, but I wasn’t expecting this with 4x c4.4xlarge instances (each with 30GB of memory).

Here’s the stderr file from one of the containers:

Thanks!
Bernardo

Thanks for raising @bernardosrulzon - looks like the smoking gun is:

ERROR YarnClusterScheduler: Lost executor 57 on ip-10-0-51-185.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Can you share the memory utilization graph for the job duration from the EMR console?
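For reference, the knob the error points at is a plain Spark-on-YARN setting. Here's a minimal standalone sketch of where it lives (not how the Snowplow shredder is actually launched, and the memory values are placeholders; in practice these are usually passed at submit time via --conf):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values for illustration only; the real job gets its memory
// settings from the EMR step configuration, not from application code.
val spark = SparkSession.builder()
  .appName("memory-overhead-example")
  .config("spark.executor.memory", "4g")
  // Off-heap cushion YARN grants on top of the executor heap; the error above
  // is YARN killing a container that went past heap + this overhead.
  .config("spark.yarn.executor.memoryOverhead", "1024") // in MB
  .getOrCreate()
```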

Sure! Memory is fully allocated throughout the Shred job.

Update: Running EmrEtlRunner with --process-shred, the Shred step fails about 10 minutes in, with the same error in the logs. Trying to run with 4x r3.2xlarge now.


Hey @bernardosrulzon,

spark.yarn.executor.memoryOverhead is supposed to be 10% of the executor memory, which in your case should be a bit less than ~3 GB. The 5.5 GB is a bit surprising to me.

To minimize this overhead, you can distribute the work across more instances, even if they are smaller: the bigger the memory pool, the bigger the overhead.
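A rough back-of-the-envelope for how the default scales (Spark 2.x on YARN uses max(384 MB, 10% of spark.executor.memory); the executor sizes below are made-up examples, not what your cluster is actually configured with):

```scala
// Default overhead Spark asks YARN for on top of the executor heap.
def defaultOverheadMb(executorMemoryMb: Int): Int =
  math.max(384, (executorMemoryMb * 0.10).toInt)

// One big executor per node vs. a smaller one:
defaultOverheadMb(27 * 1024) // ≈ 2764 MB of overhead for a ~27 GB executor
defaultOverheadMb(5 * 1024)  // ≈ 512 MB of overhead for a ~5 GB executor
```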