To investigate my issue, I pulled the cluster details and found that every cluster failed to execute the step function, and all of them failed with the same error:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: B45741D03; S3 Extended Request ID: ZK7GWdk03GRanA0EGKSQBYV48PxSkWQgepd2ke795DPLxliAiaYPwF7kIj1q+=), S3 Extended Request ID: ZK7GWdk03GRanA0EGKSQBYV48PxSkWQgepd2ke795v97zIYs=
Looking further into the same step function error logs, I also found the following error:
Error: java.lang.RuntimeException: Reducer task failed to copy 488 files: s3://wogaa-snowplow-production-sentiments-kinesis/2020-01-15-4960066067674028270618312387666095132115106-496006606767402827061831029376723819246242.gz etc
Digging further, I found that the s3-dist-cp job launched 31 reduce tasks, of which 13 succeeded and the rest failed. This is because the s3-dist-cp command launches as many reducers as possible to speed up the copy. That is usually effective in getting the copy done as soon as possible. However, when the EMR cluster is big, the job can quickly reach the request rate limit imposed by S3, which is described in the AWS documentation. When you copy data from HDFS to S3, the corresponding rate limit is 3,500 PUT requests per second per prefix.
Can I reduce the request rate by limiting the number of reducers writing to S3 with the property -Dmapreduce.job.reduces=X? Fewer reducers might slow the job down, but it could help the job complete successfully without any S3 throttling errors.
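To make the idea concrete, here is a rough sketch of how I would assemble the s3-dist-cp arguments with a capped reducer count. The helper name, the bucket and path names, and the value 10 are all placeholders I made up for illustration; I have not confirmed that this resolves the 503 errors.

```python
def build_s3distcp_args(src, dest, max_reducers):
    """Build an s3-dist-cp argument list with a capped number of reducers.

    Hypothetical helper for illustration only; src, dest and max_reducers
    are supplied by the caller.
    """
    if max_reducers < 1:
        raise ValueError("max_reducers must be at least 1")
    return [
        "s3-dist-cp",
        # Generic Hadoop -D option goes before the tool's own options.
        f"-Dmapreduce.job.reduces={max_reducers}",
        f"--src={src}",
        f"--dest={dest}",
    ]


# Placeholder locations and reducer count, not my real values.
args = build_s3distcp_args(
    "s3://example-source-bucket/input/",
    "hdfs:///example/output/",
    10,
)
print(" ".join(args))
```

The resulting list could be passed as the step arguments or joined into a single command line on the master node. The right reducer count would presumably have to be found by trial, starting low and increasing until the throttling reappears.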
Or is there a better solution for this? This has been a very painful issue for me for the past two weeks.