How do I debug EMR failure: TERMINATED_WITH_ERRORS [VALIDATION_ERROR]

A few days ago my pipeline suddenly broke down, with the following output:

D, [2017-02-08T00:00:19.333000 #16545] DEBUG – : Waiting a minute to allow S3 to settle (eventual consistency)
D, [2017-02-08T00:01:19.337000 #16545] DEBUG – : Initializing EMR jobflow
D, [2017-02-08T00:01:20.782000 #16545] DEBUG – : EMR jobflow j-1OM63IVDKT8P2 started, waiting for jobflow to complete…
F, [2017-02-08T00:17:22.217000 #16545] FATAL – :
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-1OM63IVDKT8P2 failed, check Amazon EMR console and Hadoop logs for details (help: Troubleshooting jobs on Elastic MapReduce · snowplow/snowplow Wiki · GitHub). Data files not archived.
au Snowplow ETL: TERMINATED_WITH_ERRORS [VALIDATION_ERROR] ~ elapsed time n/a [ - 2017-02-08 00:16:55 +0000]

    1. Elasticity S3DistCp Step: Shredded HDFS → S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS → S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity S3DistCp Step: Enriched HDFS → S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity Scalding Step: Enrich Raw Events: CANCELLED ~ elapsed time n/a [ - ]):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:475:in run' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in send_to’
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in call_with' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in block in redefine_method’
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:68:in run' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in send_to’
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in call_with' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in block in redefine_method’
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in <main>' org/jruby/RubyKernel.java:973:in load’
      uri:classloader:/META-INF/main.rb:1:in <main>' org/jruby/RubyKernel.java:955:in require’
      uri:classloader:/META-INF/main.rb:1:in (root)' uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in
      Error running EmrEtlRunner, exiting with return code 1. StorageLoader not run
      D, [2017-02-08T06:00:11.778000 #17104] DEBUG – : Staging raw logs…
      F, [2017-02-08T06:00:13.640000 #17104] FATAL – :
      Snowplow::EmrEtlRunner::DirectoryNotEmptyError (Should not stage files for enrichment, processing bucket s3n://au-snowplow-analytics-etl/processing/ is not empty):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/s3_tasks.rb:124:in stage_logs_for_emr' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in send_to’
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in call_with' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in block in redefine_method’
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:51:in run' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in send_to’
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in call_with' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in block in redefine_method’
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in <main>' org/jruby/RubyKernel.java:973:in load’
      uri:classloader:/META-INF/main.rb:1:in <main>' org/jruby/RubyKernel.java:955:in require’
      uri:classloader:/META-INF/main.rb:1:in (root)' uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in

I don’t understand what that VALIDATION_ERROR means, and opening the EMR web console on AWS, What I see is in events is this failure, and all steps seems to have been canceled:

Feb 8 01:16 AM Amazon EMR Cluster j-1OM63IVDKT8P2 (Au Snowplow ETL) has terminated with errors at 2017-02-08 00:16 UTC with a reason of VALIDATION_ERROR. j-1OM63IVDKT8P2 Cluster Cluster State Change CRITICAL February 8, 2017 at 01:16:58 AM (UTC+1)

I can’t find anything in the log in the job log folder on S3 (but I might be mistaken, I think it’s pretty hard to understand what the different log files are exactly). Inside the job folder only the node folder was created with log files and sub directories were created, before the job was terminated. Also this pages doesn’t give much help - I think it’s quite outdated unfortunately Troubleshooting jobs on Elastic MapReduce · snowplow/snowplow Wiki · GitHub

Yea, it’s a little bit overwhelming if you just see a stack trace with so much information. At some point you get a little bit used to it and know where to look. I quoted the important part on the bottom. It says, that your processing directory is not empty, so you first have to process the files in that directory or copy the files back into your “au-snowplow-analytics-etl”.

I would prefer option one, so how do you process files that are already in your processing/ directory?
You add the option “–skip staging” when you start the emr-etl-runner. Here are all the parameters listed you can use

P.S. it also said StorageLoader not run, so it could be, that you have finished files in our enriched: good and shredded: out/ folders, in this case you first have to import these files, you can see here how you do that Home · snowplow/snowplow Wiki · GitHub

Great thanks, did this --skip staging thing - and seems to work - any idea why it failed, and if I can do anything to prevent it?

I can only guess, do you have the assets: s3://snowplow-hosted-assets in your yaml file? But I can’t remember to have had that error before. Other than that, the EMR job can fail sometimes, for example if the network between the nodes is bad. It shouldn’t happen that often, maybe every 3-4 months and then you just restart the EMR job, load the data with the storage loader and everything works fine again