EmrEtlRunner fails during raw staging S3 step

abrenaut · October 24, 2018, 10:00pm

Hello,

I have Snowplow batch running on AWS (scala-collector > s3-loader > EmrEtlRunner).

It was running fine for the past few weeks but lately I’ve been getting a lot of failures during the raw staging S3 step.

The step fails with the following trace in stderr:

    Error: java.lang.RuntimeException: Reducer task failed to copy 2275 files: s3://snowplow/raw/in/2018-10-24-49589377919602874491714939496115412362808439243580375074-49589377919602874491714939496115412362808439243580375074.lzo.index etc
  at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:67)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

To recover, I have to manually move the files from the raw/processing folder back to raw/in and re-run the job, hoping that it won’t fail again.

If I look at the container logs, I can see the following error:

2018-10-23 18:32:19,725 ERROR [s3distcp-simpler-executor-worker-1] com.amazon.elasticmapreduce.s3distcp.CopyFilesRunnable: Error downloading input files. Not marking as committed

java.io.FileNotFoundException: No such file or directory 's3://snowplow/raw/in/2018-10-23-49588889455877140086970628804200750496158524777810624562-49588889455877140086970628809616738168032063824091676722.lzo.index'
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1194)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
  at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.openInputStream(CopyFilesReducer.java:293)
  at com.amazon.elasticmapreduce.s3distcp.CopyFilesRunnable.mergeAndCopyFiles(CopyFilesRunnable.java:102)
  at com.amazon.elasticmapreduce.s3distcp.CopyFilesRunnable.run(CopyFilesRunnable.java:35)
  at com.amazon.elasticmapreduce.s3distcp.SimpleExecutor$Worker.run(SimpleExecutor.java:49)
  at java.lang.Thread.run(Thread.java:748)

However, the file 2018-10-23-49588889455877140086970628804200750496158524777810624562-49588889455877140086970628809616738168032063824091676722.lzo.index actually exists.

Any idea whether something is wrong with EmrEtlRunner or whether it’s an issue with S3DistCp? And how could this be solved?

ami_version: 5.9.0
rdb_loader: 0.14.0
rdb_shredder: 0.13.1
spark_enrich: 1.16.0
S3 bucket encryption turned on

Thank you!
Arthur

ihor · October 24, 2018, 10:30pm

This is an issue with AWS S3, not with the EmrEtlRunner config.

Note that the files are moved (copied over) with S3DistCp, a native AWS utility. A “No such file or directory” error for a file that does exist could be a result of the infamous eventual consistency behaviour inherent to the S3 service.

Your logs show “copy 2275 files”. You might wish to run your batch job more often to reduce the number of files.

Also, why would you move the files to the processing bucket manually? Let EmrEtlRunner do that for you. This kind of failure normally rectifies itself. If some of the files were nonetheless moved before the failure (the processing bucket is not empty), just resume the pipeline with the --skip staging option.

abrenaut · October 25, 2018, 3:01pm

Thanks ihor,

The reason I move the files out of the processing bucket manually is that it’s the recommended way to deal with this error:

If the job died during the move-to-processing step, either:

1. Rerun EmrEtlRunner with the command-line option of --skip staging, or:
2. Move any files from the Processing Bucket back to the In Bucket and rerun EmrEtlRunner without any --skip option*

* We recommend option 2 if only a handful of files were transferred to your Processing Bucket before the S3 error.
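
The choice between the two options quoted above can be sketched as a small helper. This is only an illustration: the function name, the `Recovery` labels and the `HANDFUL` threshold are hypothetical (the quoted guidance does not say how many files count as "a handful"), and the list of keys is assumed to come from a listing of the raw/processing location.

```python
from enum import Enum


class Recovery(Enum):
    NOTHING_TO_DO = "processing is empty: rerun normally"
    MOVE_BACK = "move files back to raw/in, rerun without --skip"
    SKIP_STAGING = "rerun with --skip staging"


# Hypothetical cut-off for "a handful of files"; the guidance gives no number.
HANDFUL = 10


def choose_recovery(processing_keys, handful=HANDFUL):
    """Pick a recovery path from the object keys found under raw/processing."""
    # Ignore "directory" placeholder keys that end with a slash.
    files = [k for k in processing_keys if not k.endswith("/")]
    if not files:
        return Recovery.NOTHING_TO_DO
    if len(files) <= handful:
        return Recovery.MOVE_BACK
    return Recovery.SKIP_STAGING
```

For example, `choose_recovery(["raw/processing/a.lzo", "raw/processing/a.lzo.index"])` picks `Recovery.MOVE_BACK`, while a listing with hundreds of files picks `Recovery.SKIP_STAGING`.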

I was also worried that some lzo files may have been copied successfully but not the corresponding lzo.index (not sure if that would mean we’d be missing some data?).
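
One way to check for that mismatch is to compare the .lzo and .lzo.index keys in a listing of the folder. The sketch below is a minimal illustration that works on a plain list of key names; the function name is hypothetical, and it does not answer whether a missing index implies missing data, it only makes the mismatch visible.

```python
def find_unpaired(keys):
    """Return (.lzo files lacking an index, index files lacking a .lzo)."""
    names = set(keys)
    data = {k for k in names if k.endswith(".lzo")}
    indexes = {k for k in names if k.endswith(".lzo.index")}
    # A data file is paired when "<name>.lzo.index" is also present.
    missing_index = sorted(d for d in data if d + ".index" not in indexes)
    # An index is orphaned when its "<name>.lzo" counterpart is absent.
    orphan_index = sorted(i for i in indexes if i[: -len(".index")] not in data)
    return missing_index, orphan_index
```

Running it over the keys in raw/processing after a failed staging step would list any .lzo file that arrived without its index, and any index that arrived without its .lzo file.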

I will try changing the S3 Loader buffer config so that it creates files less frequently, to see if that helps.
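
A rough back-of-the-envelope estimate shows why flushing less often reduces the number of files each batch run has to stage. The sketch assumes that flushes are driven by a time limit, that every flush writes one .lzo file plus one .lzo.index file, and that the number of concurrent writers is known; all parameter names are hypothetical and do not correspond to actual S3 Loader settings.

```python
import math


def files_per_run(run_interval_s, flush_interval_s, writers=1, files_per_flush=2):
    """Estimate how many files accumulate in raw/in between two batch runs."""
    # Number of time-driven flushes that fit into one batch interval.
    flushes = math.ceil(run_interval_s / flush_interval_s)
    # Each flush is assumed to produce a .lzo file and its .lzo.index.
    return flushes * writers * files_per_flush


print(files_per_run(86400, 60))   # daily run, flush every minute: 2880
print(files_per_run(86400, 600))  # daily run, flush every 10 minutes: 288
```

Under these assumptions, either a longer flush interval or a shorter interval between batch runs brings the file count down.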

ihor · October 25, 2018, 5:19pm

@abrenaut, I also think you might need to adjust the S3 Loader settings rather than the EmrEtlRunner config file. Let us know how you get on.

abrenaut · November 8, 2018, 5:55pm

I stopped getting the error once I changed the buffer config on the S3 Loader.

Thanks for the help @ihor
