Elasticsearch load failed: "java.io.IOException: Not a file: s3://..."


#1

Hi guys,

Has anyone seen the following error thrown on EMR when loading bad rows into Elasticsearch?

Exception in thread "main" cascading.flow.FlowException: unhandled exception
	at cascading.flow.BaseFlow.complete(BaseFlow.java:918)
	at com.twitter.scalding.Job.run(Job.scala:265)
	at com.twitter.scalding.Tool.start$1(Tool.scala:104)
	at com.twitter.scalding.Tool.run(Tool.scala:120)
	at com.twitter.scalding.Tool.run(Tool.scala:68)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at com.snowplowanalytics.snowplow.storage.hadoop.JobRunner$.main(JobRunner.scala:35)
	at com.snowplowanalytics.snowplow.storage.hadoop.JobRunner.main(JobRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Not a file: s3://mint-sp-out/enriched/bad/run=2014-11-29-03-09-19
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
	at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
	at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:134)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:328)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
	at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:107)
	at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
	at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

The odd part is the root cause: “Caused by: java.io.IOException: Not a file: s3://mint-sp-out/enriched/bad/run=2014-11-29-03-09-19”. That path is a run folder, not a file.

I get the feeling it’s a permissions error…


#2

Googling the error a little further, I turned up this SO answer, which claims:

Unfortunately Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories. According to this document (“Are you trying to recursively traverse input directories?”) Looks like EMR does not support recursive directory at the moment. We are sorry about the inconvenience.

via http://stackoverflow.com/a/25719039/458627

Because of this, we have to specify the individual run folders as an array, as documented in the Cuban Macaw release post:

The “sources” field is an array of buckets from which to load bad rows. If you leave this field blank, then the bad rows buckets created by the current run of the EmrEtlRunner will be loaded. Alternatively you can explicitly specify an array of bad row buckets to load.

via http://snowplowanalytics.com/blog/2015/12/04/snowplow-r73-cuban-macaw-released/
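In case it helps anyone else, here is a rough Python sketch of how the array of run folders could be built from a list of S3 object keys. It is only an illustration: the helper name leaf_folders, the part-file names, and the second run ID are made up, and it assumes you already have the key listing from somewhere (for example, an S3 listing of the bucket).

```python
from posixpath import dirname


def leaf_folders(bucket, keys):
    """Return the s3:// URI of every folder that directly contains a file.

    Hypothetical helper: Hadoop on EMR reportedly does not recurse into
    sub-directories, so each folder holding files has to be listed itself.
    """
    folders = set()
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        parent = dirname(key)
        folders.add(f"s3://{bucket}/{parent}" if parent else f"s3://{bucket}")
    return sorted(folders)


# Invented keys for illustration; only the first run ID comes from the error above.
keys = [
    "enriched/bad/run=2014-11-29-03-09-19/part-00000",
    "enriched/bad/run=2014-11-29-03-09-19/part-00001",
    "enriched/bad/run=2014-12-01-03-10-02/part-00000",
]

sources = leaf_folders("mint-sp-out", keys)
for uri in sources:
    print(uri)
```

The resulting list has one entry per run folder, which is the shape the “sources” field expects according to the post quoted above.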

I’m waiting to re-run the job and will report back.