We are using Snowplow version 112 with Stream Enrich, and for the last 2 weeks we have been having trouble with S3DistCp.
- Sometimes it fails while copying shredded data from HDFS to S3, or while archiving the data.
- Most of the time it archives the data but still reports a failure to the EMR job.
- In another scenario, while copying data from HDFS to S3 using S3DistCp, the job fails at the reduce step, retries 3 or 4 times, and creates multiple versions of the data in S3.
- On another day, EMR failed at the loader step because it was unable to locate one of the JSONPath files, but it worked on retry.
Is anyone else encountering similar issues with S3DistCp? Any solutions or recommendations would be highly appreciated, as this issue is impacting our production environment. Thanks in advance!