Not enough space to cache xxx - shredding failing

We are trying to figure out why our workflow has recently started failing on the shredding step every time. Does anyone know if this could be related? We are only processing about 150 files of roughly 42 MB each, so a fairly small batch. One other thing to note: we enabled experiment and variation flags in Optimizely on our website about 4-5 days ago, so the event payloads got much larger. Here's the related discussion about the Optimizely payload increase (Optimizely flags - payload way too big).

A couple of things we have tried so far:

  1. Ran shred only. Didn't work; same error.
  2. Deleted the enriched files and ran with staging skipped. Same error.
  3. Changed the number of cores. Same error.
  4. Deleted ALL files/directories under good/bad in the shredded and enriched folders on S3, since I also saw some odd errors about not being able to access a shredded bad folder (see the example command just below).
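For reference, the cleanup in step 4 amounts to something like the following (the bucket name is taken from the error further down; the good/ prefix and exact layout are assumptions, so adjust to your own shredded and enriched paths):

# Illustrative sketch only: clear out the shredded bad/good folders so the next run starts clean.
# Bucket name comes from the FileAlreadyExistsException below; prefixes may differ in your setup.
aws s3 rm s3://ga-snowplow-production/snowplow-shredded/bad/ --recursive
aws s3 rm s3://ga-snowplow-production/snowplow-shredded/good/ --recursive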

Next to try:

  1. Move the files out of processing into a temp folder on S3 and run the next batch of logs instead, to see whether something in the current batch is corrupt or throwing things off, then attempt a clean run again (a sketch of the move is below).
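Concretely, the move would look something like this (the processing and temp prefixes are placeholders for wherever your raw processing location lives):

# Sketch only: park the current batch of raw files in a temp prefix so a fresh batch can run clean.
# Adjust the bucket and prefixes to match your own raw/processing locations.
aws s3 mv s3://ga-snowplow-production/processing/ s3://ga-snowplow-production/temp/ --recursive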

We are running 2 large core instances (also tried 3). Here is the relevant jobflow section of our config, followed by the executor stderr we're seeing:

# Adjust your Hadoop cluster below
jobflow:
  job_name: Snowplow ETL # Give your job a name
  master_instance_type: m4.large
  core_instance_count: 2
  core_instance_type: r3.xlarge
  task_instance_count: 0 # Increase to use spot instances
  task_instance_type: m1.medium
  task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
additional_info:        # Optional JSON string for selecting additional features

==> application_1498927638961_0005/container_1498927638961_0005_01_000006/stderr <==
17/07/01 17:16:58 INFO Executor: Finished task 46.0 in stage 0.0 (TID 55). 1532 bytes result sent to driver
17/07/01 17:16:58 INFO CoarseGrainedExecutorBackend: Got assigned task 73
17/07/01 17:16:58 INFO Executor: Running task 60.0 in stage 0.0 (TID 73)
17/07/01 17:16:58 INFO HadoopRDD: Input split: hdfs://ip-172-30-7-154.ec2.internal:8020/local/snowplow/enriched-events/part-00060-6fb9c652-06c1-470d-91a5-e7f18e303455.csv:0+122602491
17/07/01 17:17:08 INFO MemoryStore: Will not store rdd_2_60
17/07/01 17:17:08 WARN MemoryStore: Not enough space to cache rdd_2_60 in memory! (computed 36.4 MB so far)
17/07/01 17:17:08 INFO MemoryStore: Memory use = 2.6 GB (blocks) + 234.9 MB (scratch space shared across 2 tasks(s)) = 2.8 GB. Storage limit = 2.8 GB.
17/07/01 17:17:08 WARN BlockManager: Block rdd_2_60 could not be removed as it was not found on disk or in memory
17/07/01 17:17:08 WARN BlockManager: Putting block rdd_2_60 failed
17/07/01 17:17:12 INFO MemoryStore: Will not store rdd_2_52
17/07/01 17:17:12 WARN MemoryStore: Not enough space to cache rdd_2_52 in memory! (computed 202.6 MB so far)
17/07/01 17:17:12 INFO MemoryStore: Memory use = 2.6 GB (blocks) + 202.0 MB (scratch space shared across 1 tasks(s)) = 2.8 GB. Storage limit = 2.8 GB.
17/07/01 17:17:12 WARN BlockManager: Block rdd_2_52 could not be removed as it was not found on disk or in memory
17/07/01 17:17:12 WARN BlockManager: Putting block rdd_2_52 failed

Here's the latest stderr from EMR, even after I cleaned out all the bad directories. There's no reason it should be complaining about the bad directory already existing:

17/07/01 18:40:20 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:21 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:22 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:23 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:24 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:25 INFO Client: Application report for application_1498927638961_0005 (state: ACCEPTED)
17/07/01 18:40:25 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1498928829867
final status: UNDEFINED
tracking URL: http://ip-172-30-7-154.ec2.internal:20888/proxy/application_1498927638961_0005/
user: hadoop
17/07/01 18:40:26 INFO Client: Application report for application_1498927638961_0005 (state: ACCEPTED)
17/07/01 18:40:27 INFO Client: Application report for application_1498927638961_0005 (state: ACCEPTED)
17/07/01 18:40:28 INFO Client: Application report for application_1498927638961_0005 (state: ACCEPTED)
17/07/01 18:40:29 INFO Client: Application report for application_1498927638961_0005 (state: ACCEPTED)
17/07/01 18:40:30 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:30 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.30.3.67
ApplicationMaster RPC port: 0
queue: default
start time: 1498928829867
final status: UNDEFINED
tracking URL: http://ip-172-30-7-154.ec2.internal:20888/proxy/application_1498927638961_0005/
user: hadoop
17/07/01 18:40:31 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:32 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:33 INFO Client: Application report for application_1498927638961_0005 (state: RUNNING)
17/07/01 18:40:34 INFO Client: Application report for application_1498927638961_0005 (state: FINISHED)
17/07/01 18:40:34 INFO Client:
client token: N/A
diagnostics: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3://ga-snowplow-production/snowplow-shredded/bad/run=2017-07-01-16-43-13 already exists
ApplicationMaster host: 172.30.3.67
ApplicationMaster RPC port: 0
queue: default
start time: 1498928829867
final status: FAILED
tracking URL: http://ip-172-30-7-154.ec2.internal:20888/proxy/application_1498927638961_0005/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1498927638961_0005 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/07/01 18:40:34 INFO ShutdownHookManager: Shutdown hook called
17/07/01 18:40:34 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-f1e2a082-ef93-4844-ba7e-2e4875f8ae30
Command exiting with ret '1'

So, good news and bad news: I moved the raw files we were trying to shred into a temp folder on S3 (i.e. from processing to temp) and then ran the newer files sitting in "in". That worked. But how do we know which file in the previous batch was causing the shredding problem?
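The only way I can think of is to bisect: move the held-back files from temp back into processing in small batches, re-running the shred after each batch until the failing one is isolated. Roughly (prefixes are placeholders again):

# Sketch: move the first 25 held-back raw files back into processing, then re-run the job.
# Halve the batch size on whichever batch fails to narrow down to the offending file(s).
aws s3 ls s3://ga-snowplow-production/temp/ | awk 'NF==4 {print $4}' | head -n 25 | while read key; do
  aws s3 mv "s3://ga-snowplow-production/temp/$key" "s3://ga-snowplow-production/processing/$key"
done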

OK, final update: moving the files into the processing queue slowly, in smaller batches, did the trick, and we're finally done with all the remaining files. Still curious what actually went wrong, other than perhaps the batch simply being too big to process in one go.

Hey @mjensen, you might want to have a look at R89 Spark job underutilizing cluster where we’ve been brainstorming why the Spark cluster is not fully used.
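Two things stand out in the stderr you posted. First, the fatal error is the FileAlreadyExistsException on the shredded bad run folder: Hadoop refuses to write into an output path that already exists, so if an earlier attempt in the same run already created that run=2017-07-01-16-43-13 folder, every retry will hit this until the folder is cleared. Second, the MemoryStore warnings show a storage limit of only 2.8 GB per executor, a small fraction of the ~30 GB available on an r3.xlarge, which suggests the defaults are leaving most of the cluster's memory unused. One direction being explored in that thread is setting the Spark knobs explicitly via EMR configuration classifications. As a rough sketch only (illustrative numbers sized for 2x r3.xlarge, not a recommendation, and assuming your EmrEtlRunner release supports passing an emr: configuration: block through to the cluster):

configuration:
  spark-defaults:
    spark.dynamicAllocation.enabled: "false"
    spark.executor.instances: "3"              # illustrative sizing for 2x r3.xlarge core nodes
    spark.executor.cores: "2"
    spark.executor.memory: "9G"                # plus memoryOverhead, this fits within YARN's per-node limit
    spark.yarn.executor.memoryOverhead: "1024"
    spark.driver.memory: "9G"
    spark.default.parallelism: "12"            # roughly 2x the total executor cores

The memory warnings by themselves are not fatal (Spark just recomputes the partitions it cannot cache), but sizing the executors properly should make the shred step a lot less fragile.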