EMRETL 'main' throwing Spark Exception


#1

I’ve got an issue that doesn’t match the troubleshooting guide as far as I can tell. One of the job steps fails with this at the error output:

D, [2017-08-18T18:49:52.533000 #12] DEBUG -- : Initializing EMR jobflow                                                
D, [2017-08-18T18:50:19.737000 #12] DEBUG -- : EMR jobflow j-17 started, waiting for jobflow to complete... 
I, [2017-08-18T18:50:19.742000 #12]  INFO -- : SnowplowTracker::Emitter initialized with endpoint http://dokvqiql64pip.cloudfront.net:80/i                                                                                                     
I, [2017-08-18T18:50:21.620000 #12]  INFO -- : Attempting to send 1 request                                            
I, [2017-08-18T18:50:21.625000 #12]  INFO -- : Sending GET request to http://dokvqiql64pip.cloudfront.net:80/i...      
I, [2017-08-18T18:50:21.884000 #12]  INFO -- : GET request to http://dokpip.cloudfront.net:80/i finished with status code 200                                                                                                           
I, [2017-08-18T19:04:36.213000 #12]  INFO -- : Attempting to send 1 request                                            
I, [2017-08-18T19:04:36.217000 #12]  INFO -- : Sending GET request to http://dokvqiql64pip.cloudfront.net:80/i...      
I, [2017-08-18T19:04:36.306000 #12]  INFO -- : GET request to http://dokpip.cloudfront.net:80/i finished with status code 200                                                                                                           
F, [2017-08-18T19:04:38.275000 #12] FATAL -- :             

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-17JR3DGT92QRJ failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.                                                 
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2017-08-18 18:58:37 +0000 - ]                             
 - 1. Elasticity S3DistCp Step: Raw s3://traqqy-collector-logs/ -> Raw Staging S3: COMPLETED ~ 00:02:07 [2017-08-18 18:58:40 +0000 - 2017-08-18 19:00:47 +0000]                                                                                
 - 2. Elasticity S3DistCp Step: Raw S3 -> Raw HDFS: COMPLETED ~ 00:01:40 [2017-08-18 19:00:49 +0000 - 2017-08-18 19:02:29 +0000]                                                                                                               
 - 3. Elasticity Spark Step: Enrich Raw Events: FAILED ~ 00:01:36 [2017-08-18 19:02:31 +0000 - 2017-08-18 19:04:08 +0000]                                                                                                                      
 - 4. Elasticity S3DistCp Step: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]                 
 - 5. Elasticity S3DistCp Step: Enriched S3 -> Enriched Archive S3: CANCELLED ~ elapsed time n/a [ - ]                 
 - 6. Elasticity Custom Jar Step: Load PostgreSQL enriched events storage Storage Target: CANCELLED ~ elapsed time n/a [ - ]                                                                                                                   
 - 7. Elasticity S3DistCp Step: Raw Staging S3 -> Raw Archive S3: CANCELLED ~ elapsed time n/a [ - ]                   
 - 8. Elasticity S3DistCp Step: Shredded HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]                       
 - 9. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]                                
 - 10. Elasticity Spark Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]                                
 - 11. Elasticity Custom Jar Step: Empty Raw HDFS: CANCELLED ~ elapsed time n/a [ - ]                                  
 - 12. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]                      
 - 13. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):                             
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:572:in `run'                                
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'                           
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'                                
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'           
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:92:in `run'                                  
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'                           
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'                                
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'           
    uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'                                         
    org/jruby/RubyKernel.java:979:in `load'                
    uri:classloader:/META-INF/main.rb:1:in `<main>'        
    org/jruby/RubyKernel.java:961:in `require'             
    uri:classloader:/META-INF/main.rb:1:in `(root)'        
    uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'  

On the Amazon EMR console stderr end like this:

17/08/18 19:04:04 INFO Client: Application report for application_15 (state: RUNNING)
17/08/18 19:04:05 INFO Client: Application report for application_15 (state: FINISHED)
17/08/18 19:04:05 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 192.168.14.14
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 15
	 final status: FAILED
	 tracking URL: http://ip-west-1.compute.internal:20888/proxy/application_15blah
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_15blah finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/08/18 19:04:05 INFO ShutdownHookManager: Shutdown hook called
17/08/18 19:04:05 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-bablah
Command exiting with ret '1'

additional info: it’s a clouddfront collector saving into s3 buckets setup before the lascaux release, but I think I’ve got all the permission up to date and correct, and emretl is being ran in a container on my local laptop


#2

What’s in the EMR logs?


#3

I’m not certain which you mean by the ‘emr logs’. The output from this binary was the first lump of stuff I pasted above. The second lump was the stderr from the S3 logs bucket on the step that failed.


#4

You can find the other logs @joshuacox by reading this guide:


#5

That guide is the very first link I posted in the OP, the second lump of text, and the picture I posted above is from the cluster details which seems to be an interface for me to look at steps.

Perhaps, the referenced guide needs to be updated? As the pictures do not look like what I am seeing, and there is no ‘job flow’ browser, perhaps they renamed that to ‘events’? Or I might be looking in the wrong place, when I click on EMR in amazon I get a link like:

https://eu-west-1.console.aws.amazon.com/elasticmapreduce/home?region=eu-west-1#cluster-list:

which looks like the attached screenshot

there is a ‘jobs’ next to steps in this same interface, but I’ve only seen it populated while the tasks are running.


#6

In your s3://traqqy-collector-logs/emrlogs/ folder there should be files according to the following structure:

j-{cluster id}/containers/application_{application id}/container_{container id}

That’s what we mean with the EMR logs for Spark.

You have to dig through those logs to find out what went wrong with the job per se.


#7

And again that IS the second lump of text posted above, it is the stderr.gz there is no stdout available from this step


#8

@BenFradet Ok I was mistaken, I am now where you led me inside of j-{cluster id}/containers/application_{application id}/container_{container id} I’m not certain which container to associate with the error, so I just immediately looked at the last one:

2017-08-19T00:14:33.688+0000: [GC (Allocation Failure) 2017-08-19T00:14:33.688+0000: [ParNew: 16384K->2047K(18432K), 0.0126193 secs] 16384K->2170K(59392K), 0.0127729 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:34.025+0000: [GC (Allocation Failure) 2017-08-19T00:14:34.025+0000: [ParNew: 18431K->1607K(18432K), 0.0132618 secs] 18554K->2793K(59392K), 0.0133710 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:34.370+0000: [GC (Allocation Failure) 2017-08-19T00:14:34.370+0000: [ParNew: 17991K->1364K(18432K), 0.0029973 secs] 19177K->2549K(59392K), 0.0031033 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:35.039+0000: [GC (Allocation Failure) 2017-08-19T00:14:35.039+0000: [ParNew: 17748K->2047K(18432K), 0.0196674 secs] 18933K->3880K(59392K), 0.0197669 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2017-08-19T00:14:35.490+0000: [GC (Allocation Failure) 2017-08-19T00:14:35.490+0000: [ParNew: 18431K->1132K(18432K), 0.0081721 secs] 20264K->3641K(59392K), 0.0082810 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:36.045+0000: [GC (Allocation Failure) 2017-08-19T00:14:36.045+0000: [ParNew: 17516K->1426K(18432K), 0.0097347 secs] 20025K->4447K(59392K), 0.0098442 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:36.674+0000: [GC (Allocation Failure) 2017-08-19T00:14:36.674+0000: [ParNew: 17810K->1179K(18432K), 0.0067293 secs] 20831K->4576K(59392K), 0.0068329 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:37.006+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.006+0000: [ParNew: 17563K->1080K(18432K), 0.0073587 secs] 20960K->4839K(59392K), 0.0074710 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:37.361+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.361+0000: [ParNew: 17464K->1034K(18432K), 0.0048126 secs] 21223K->4794K(59392K), 0.0049113 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:37.583+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.583+0000: [ParNew: 17418K->1513K(18432K), 0.0060308 secs] 21178K->5273K(59392K), 0.0061394 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:37.827+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.827+0000: [ParNew: 17897K->1253K(18432K), 0.0086279 secs] 21657K->5530K(59392K), 0.0087411 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:38.128+0000: [GC (Allocation Failure) 2017-08-19T00:14:38.128+0000: [ParNew: 17637K->1388K(18432K), 0.0060732 secs] 21914K->5827K(59392K), 0.0061776 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:38.414+0000: [GC (Allocation Failure) 2017-08-19T00:14:38.414+0000: [ParNew: 17772K->1220K(18432K), 0.0078387 secs] 22211K->5993K(59392K), 0.0079494 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.155+0000: [GC (Allocation Failure) 2017-08-19T00:14:39.155+0000: [ParNew: 17604K->1061K(18432K), 0.0112325 secs] 38761K->22541K(59392K), 0.0113440 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:39.174+0000: [GC (CMS Initial Mark) [1 CMS-initial-mark: 21480K(40960K)] 22948K(59392K), 0.0017382 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.178+0000: [CMS-concurrent-mark-start]
2017-08-19T00:14:39.227+0000: [CMS-concurrent-mark: 0.049/0.049 secs] [Times: user=0.04 sys=0.01, real=0.05 secs] 
2017-08-19T00:14:39.231+0000: [CMS-concurrent-preclean-start]
2017-08-19T00:14:39.232+0000: [CMS-concurrent-preclean: 0.001/0.001 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.234+0000: [GC (CMS Final Remark) [YG occupancy: 3258 K (18432 K)]2017-08-19T00:14:39.234+0000: [Rescan (parallel) , 0.0022410 secs]2017-08-19T00:14:39.237+0000: [weak refs processing, 0.0000606 secs]2017-08-19T00:14:39.237+0000: [class unloading, 0.0032887 secs]2017-08-19T00:14:39.240+0000: [scrub symbol table, 0.0038652 secs]2017-08-19T00:14:39.244+0000: [scrub string table, 0.0006022 secs][1 CMS-remark: 21480K(40960K)] 24738K(59392K), 0.0104549 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:39.250+0000: [CMS-concurrent-sweep-start]
2017-08-19T00:14:39.254+0000: [CMS-concurrent-sweep: 0.003/0.003 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.254+0000: [CMS-concurrent-reset-start]
2017-08-19T00:14:39.301+0000: [CMS-concurrent-reset: 0.047/0.047 secs] [Times: user=0.02 sys=0.02, real=0.04 secs] 
2017-08-19T00:14:39.526+0000: [GC (Allocation Failure) 2017-08-19T00:14:39.526+0000: [ParNew: 17445K->1203K(18432K), 0.0061159 secs] 54724K->38821K(59392K), 0.0062276 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2017-08-19T00:14:39.546+0000: [GC (CMS Initial Mark) [1 CMS-initial-mark: 37617K(40960K)] 40031K(59392K), 0.0015368 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.548+0000: [CMS-concurrent-mark-start]
2017-08-19T00:14:39.587+0000: [CMS-concurrent-mark: 0.039/0.039 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 
2017-08-19T00:14:39.587+0000: [CMS-concurrent-preclean-start]
2017-08-19T00:14:39.588+0000: [CMS-concurrent-preclean: 0.001/0.001 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.598+0000: [GC (CMS Final Remark) [YG occupancy: 3889 K (18432 K)]2017-08-19T00:14:39.598+0000: [Rescan (parallel) , 0.0020199 secs]2017-08-19T00:14:39.600+0000: [weak refs processing, 0.0000535 secs]2017-08-19T00:14:39.600+0000: [class unloading, 0.0033384 secs]2017-08-19T00:14:39.604+0000: [scrub symbol table, 0.0122510 secs]2017-08-19T00:14:39.616+0000: [scrub string table, 0.0006083 secs][1 CMS-remark: 37617K(40960K)] 41507K(59392K), 0.0184844 secs] [Times: user=0.01 sys=0.01, real=0.02 secs] 
2017-08-19T00:14:39.622+0000: [CMS-concurrent-sweep-start]
2017-08-19T00:14:39.625+0000: [CMS-concurrent-sweep: 0.003/0.003 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.626+0000: [CMS-concurrent-reset-start]
2017-08-19T00:14:39.627+0000: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.884+0000: [GC (Allocation Failure) 2017-08-19T00:14:39.884+0000: [ParNew: 17587K->730K(18432K), 0.0043763 secs] 55152K->38532K(81044K), 0.0044842 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:40.106+0000: [GC (Allocation Failure) 2017-08-19T00:14:40.106+0000: [ParNew: 17108K->1337K(18432K), 0.0040187 secs] 54910K->39139K(81044K), 0.0041295 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:40.233+0000: [GC (Allocation Failure) 2017-08-19T00:14:40.233+0000: [ParNew: 17721K->1279K(18432K), 0.0061208 secs] 55523K->39286K(81044K), 0.0062307 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:40.566+0000: [GC (Allocation Failure) 2017-08-19T00:14:40.566+0000: [ParNew: 17663K->1330K(18432K), 0.0101995 secs] 55670K->39446K(81044K), 0.0103114 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:42.577+0000: [GC (CMS Initial Mark) [1 CMS-initial-mark: 54499K(62612K)] 60903K(81044K), 0.0039391 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:42.585+0000: [CMS-concurrent-mark-start]
2017-08-19T00:14:42.626+0000: [CMS-concurrent-mark: 0.042/0.042 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 
2017-08-19T00:14:42.630+0000: [CMS-concurrent-preclean-start]
2017-08-19T00:14:42.632+0000: [CMS-concurrent-preclean: 0.001/0.001 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:42.632+0000: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2017-08-19T00:14:47.730+0000: [CMS-concurrent-abortable-preclean: 0.192/5.098 secs] [Times: user=0.53 sys=0.01, real=5.09 secs] 
2017-08-19T00:14:47.730+0000: [GC (CMS Final Remark) [YG occupancy: 6403 K (18432 K)]2017-08-19T00:14:47.730+0000: [Rescan (parallel) , 0.0038720 secs]2017-08-19T00:14:47.734+0000: [weak refs processing, 0.0000652 secs]2017-08-19T00:14:47.734+0000: [class unloading, 0.0038823 secs]2017-08-19T00:14:47.738+0000: [scrub symbol table, 0.0048416 secs]2017-08-19T00:14:47.743+0000: [scrub string table, 0.0006288 secs][1 CMS-remark: 54499K(62612K)] 60903K(81044K), 0.0135369 secs] [Times: user=0.01 sys=0.00, real=0.02 secs] 
2017-08-19T00:14:47.744+0000: [CMS-concurrent-sweep-start]
2017-08-19T00:14:47.747+0000: [CMS-concurrent-sweep: 0.003/0.003 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:47.748+0000: [CMS-concurrent-reset-start]
2017-08-19T00:14:47.749+0000: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
Heap
 par new generation   total 18432K, used 11704K [0x00000000d0000000, 0x00000000d1400000, 0x00000000d5330000)
  eden space 16384K,  63% used [0x00000000d0000000, 0x00000000d0a21540, 0x00000000d1000000)
  from space 2048K,  64% used [0x00000000d1200000, 0x00000000d134cb90, 0x00000000d1400000)
  to   space 2048K,   0% used [0x00000000d1000000, 0x00000000d1000000, 0x00000000d1200000)
 concurrent mark-sweep generation total 90584K, used 54348K [0x00000000d5330000, 0x00000000daba6000, 0x0000000100000000)
 Metaspace       used 26048K, capacity 26316K, committed 26620K, reserved 1073152K
  class space    used 3452K, capacity 3522K, committed 3580K, reserved 1048576K

it appears it is having garbage collection allocation issues? I see on this SO topic that may not be the issue. Perhaps the abort is more important CMS: abort preclean due to time


#9

You should be able to see an exception somewhere in these logs and that will be the root cause of the failure.


#10

@joshuacox were you able to resolve this? I have a very similar problem with r91 version.


#11

you know I am further along, but I still have issues, and am still hunting around these log files looking for what is clogging the pipes. I have been swamped with other things, but I’d gladly pitch in to solve both of our problems. In particular, here’s my current error hunting strategy. I’ve been having issues with the s3fuse implementations so I tried goofys so after grabbing my logs bucket and copying it locally I grep around for errors like so:

$ find . -iname '*.gz'|xargs -n1 -I{} zgrep -i error {} 
      '-XX:OnOutOfMemoryError=kill -9 %p' \ 
17/08/19 00:15:05 ERROR ApplicationMaster: User class threw exception: java.io.IOException: Not a file: hdfs://ip-172-31-1-33.eu-west-1.compute.internal:8020/local/snowplow/raw-events/emrlogs/j-17JR3DGT92QRJ

and the 'Allocation failures as well:

$ find . -iname '*.gz'|xargs -n1 -I{} zgrep -i fail {}
2017-08-19 00:10:27,378 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
2017-08-19 00:12:36,287 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
2017-08-19T00:14:33.688+0000: [GC (Allocation Failure) 2017-08-19T00:14:33.688+0000: [ParNew: 16384K->2047K(18432K), 0.0126193 secs] 16384K->2170K(59392K), 0.0127729 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:34.025+0000: [GC (Allocation Failure) 2017-08-19T00:14:34.025+0000: [ParNew: 18431K->1607K(18432K), 0.0132618 secs] 18554K->2793K(59392K), 0.0133710 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:34.370+0000: [GC (Allocation Failure) 2017-08-19T00:14:34.370+0000: [ParNew: 17991K->1364K(18432K), 0.0029973 secs] 19177K->2549K(59392K), 0.0031033 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:35.039+0000: [GC (Allocation Failure) 2017-08-19T00:14:35.039+0000: [ParNew: 17748K->2047K(18432K), 0.0196674 secs] 18933K->3880K(59392K), 0.0197669 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2017-08-19T00:14:35.490+0000: [GC (Allocation Failure) 2017-08-19T00:14:35.490+0000: [ParNew: 18431K->1132K(18432K), 0.0081721 secs] 20264K->3641K(59392K), 0.0082810 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:36.045+0000: [GC (Allocation Failure) 2017-08-19T00:14:36.045+0000: [ParNew: 17516K->1426K(18432K), 0.0097347 secs] 20025K->4447K(59392K), 0.0098442 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:36.674+0000: [GC (Allocation Failure) 2017-08-19T00:14:36.674+0000: [ParNew: 17810K->1179K(18432K), 0.0067293 secs] 20831K->4576K(59392K), 0.0068329 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:37.006+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.006+0000: [ParNew: 17563K->1080K(18432K), 0.0073587 secs] 20960K->4839K(59392K), 0.0074710 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:37.361+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.361+0000: [ParNew: 17464K->1034K(18432K), 0.0048126 secs] 21223K->4794K(59392K), 0.0049113 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:37.583+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.583+0000: [ParNew: 17418K->1513K(18432K), 0.0060308 secs] 21178K->5273K(59392K), 0.0061394 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:37.827+0000: [GC (Allocation Failure) 2017-08-19T00:14:37.827+0000: [ParNew: 17897K->1253K(18432K), 0.0086279 secs] 21657K->5530K(59392K), 0.0087411 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:38.128+0000: [GC (Allocation Failure) 2017-08-19T00:14:38.128+0000: [ParNew: 17637K->1388K(18432K), 0.0060732 secs] 21914K->5827K(59392K), 0.0061776 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:38.414+0000: [GC (Allocation Failure) 2017-08-19T00:14:38.414+0000: [ParNew: 17772K->1220K(18432K), 0.0078387 secs] 22211K->5993K(59392K), 0.0079494 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:39.155+0000: [GC (Allocation Failure) 2017-08-19T00:14:39.155+0000: [ParNew: 17604K->1061K(18432K), 0.0112325 secs] 38761K->22541K(59392K), 0.0113440 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2017-08-19T00:14:39.526+0000: [GC (Allocation Failure) 2017-08-19T00:14:39.526+0000: [ParNew: 17445K->1203K(18432K), 0.0061159 secs] 54724K->38821K(59392K), 0.0062276 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2017-08-19T00:14:39.884+0000: [GC (Allocation Failure) 2017-08-19T00:14:39.884+0000: [ParNew: 17587K->730K(18432K), 0.0043763 secs] 55152K->38532K(81044K), 0.0044842 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:40.106+0000: [GC (Allocation Failure) 2017-08-19T00:14:40.106+0000: [ParNew: 17108K->1337K(18432K), 0.0040187 secs] 54910K->39139K(81044K), 0.0041295 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:40.233+0000: [GC (Allocation Failure) 2017-08-19T00:14:40.233+0000: [ParNew: 17721K->1279K(18432K), 0.0061208 secs] 55523K->39286K(81044K), 0.0062307 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2017-08-19T00:14:40.566+0000: [GC (Allocation Failure) 2017-08-19T00:14:40.566+0000: [ParNew: 17663K->1330K(18432K), 0.0101995 secs] 55670K->39446K(81044K), 0.0103114 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
17/08/19 00:15:05 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Not a file: hdfs://ip-172-31-1-33.eu-west-1.compute.internal:8020/local/snowplow/raw-events/emrlogs/j-17JR3DGT92QRJ)
17/08/19 00:15:06 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.io.IOException: Not a file: hdfs://ip-172-31-1-33.eu-west-1.compute.internal:8020/local/snowplow/raw-events/emrlogs/j-17JR3DGT92QRJ)
$ pwd                 
/mnt/unreal/j-R2K0M7H4OS5T/containers

Out of Memory and failed allocation:
It may be too much at once because I’ve let it go for about a month and a half now with out being able to complete this EMR run. But it’s only my blog, and I can’t imagine I get that much traffic (google tells me I’ve had about 500 visitors in the past 90 days, which I would think would be lightweight for snowplow).

There is the not a file error:
Not a file: hdfs://ip-172-31-1-33.eu-west-1.compute.internal:8020/local/snowplow/raw-events/emrlogs/j-17JR3DGT92QRJ
but notice I’m in job j-R2K0M7H4OS5T not j-17JR3D… pwd /mnt/unreal/j-R2K0M7H4OS5T/containers


#12

Okey. So digging through my logs i end up with almost identical error:

17/09/05 20:15:47 ERROR ApplicationMaster: User class threw exception: java.io.IOException: Not a file: hdfs://ip-172-31-7-212.ec2.internal:8020/local/snowplow/raw-events/j-18HIRIGZU5W2U/node
java.io.IOException: Not a file: hdfs://ip-172-31-7-212.ec2.internal:8020/local/snowplow/raw-events/j-18HIRIGZU5W2U/node

I see also that the EMR cluster is started with:

stderr:17/09/05 20:15:27 INFO SparkContext: Running Spark version 2.1.0

…but under 3-enrich/emr-etl-runner/config/config.yml i have spark_enrich: 1.9.0.

Looking at the assets at s3://snowplow-hosted-assets, i see that there is no jar file for Spark 2.1.0:

2017-05-15 14:02 72917240 s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.9.0-rc1.jar
2017-06-09 14:31 72917236 s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.9.0-rc2.jar
2017-06-12 11:15 72917225 s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.9.0.jar

I am not sure if this is related. Any ideas/suggestions?

Here is the gist with the stderr from emr cluster’s failing step: https://gist.github.com/izo3/069a35afd2f9055d695c66de6ddb8dfb


#13

Hi @hnganchev,

No this is unrelated. First line referring to Spark 2.1.0, which is Apache Spark, distributed computing system. Spark Enrich is our own job, built on top of Apache Spark.

@hnganchev, @joshuacox could you please upload your gzipped logs somewhere (you can grep it first to remove sensetive information). Because I’m sure there’s exception somewhere that we’re missing.


#14

I feel a bit hesitant to post all my logs publicly even sanitized (there are 110 files from the EMR jobs). Do you have an email @snowplow that i can use to send you this directly?


#15

I found the problem and my EMR job just finished successfully. The problem is caused by having identical bucket names for:

  • EMR logs
  • collector bucket (your collector bucket log location)

In 3-enrich/emr-etl-runner/config.yml you have this:

 buckets:
  assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
  jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
  log: s3://snowplow-clientxxx-app-logs
  raw:
    in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
      - s3://snowplow-clientxxx-app-logs         # e.g. s3://my-old-collector-bucket
    processing: s3://snowplow-clientxxx-app-archive/processing
    archive: s3://snowplow-clientxxx-app-archive/raw     # e.g. s3://my-archive-bucket/raw

If “log: s3://snowplow-clientxxx-app-logs” is identical to “- s3://snowplow-clientxxx-app-logs”, when EMR runs it will create a job reference inside. When the enricher starts, it will start consuming from this bucket and will try to find the logs under the job folder and it will fail. Another problem is that the folder will get copied to s3://snowplow-clientxxx-app-archive/processing folder and if the job is re-run eventually it will fail again. To resolve this, you need to do this:

  1. ensure the buckets are different e.g.:
  assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
  jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
  log: s3://snowplow-clientxxx-emr-logs
  raw:
    in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
      - s3://snowplow-clientxxx-app-logs         # e.g. s3://my-old-collector-bucket
    processing: s3://snowplow-clientxxx-app-archive/processing
    archive: s3://snowplow-clientxxx-app-archive/raw     # e.g. s3://my-archive-bucket/raw
  1. Empty the contents of (make a backup if you want):
    s3://snowplow-clientxxx-app-archive/processing

  2. Ensure s3://snowplow-clientxxx-app-logs/ does not have any job folders, only log files inside.

Cheers!


#16

Thanks for sharing the resolution!