How to engage EMRFS consistency when running snowplow-emr-etl-runner


#1

When we switched to a larger node type, we got an error from the last step in shredding, "Elasticity S3DistCp Step: Shredded HDFS -> S3":
Error: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 5A2F87935C17C792), S3 Extended Request ID: 6YcZaPRh5xyaWrQUz9KDpRyKhiGt59QcWVIXNvsOxk1oNRegZX6CgEN1974w1c0eIN35YgzTe/I=

According to AWS, this is caused by a lot of data being pushed to S3 too aggressively. The suggested mitigations are either to set "--targetSize=SIZE" to a larger size or to engage EMRFS consistency (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-configure-consistent-view.html).

Can we modify the config.yml to implement the above suggestions, given that we are using snowplow-emr-etl-runner? What is a good way to do it?

Thanks,
Richard


#2

I agree that one way to go about this is to modify --targetSize, combining it with --groupBy.
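As a rough sketch of what that combination looks like on the S3DistCp side: the paths, the regex and the size below are made-up placeholders, not what EmrEtlRunner actually generates, and the helper function is purely illustrative. --groupBy concatenates files whose names share the same regex capture group, and --targetSize is an integer number of MiB that S3DistCp tries to match for the merged files.

```python
def s3distcp_args(src, dest, group_by, target_size_mib):
    """Build an s3-dist-cp argument list that merges small files.

    --groupBy concatenates files whose names share the regex's capture group;
    --targetSize is the approximate size (integer MiB) of each merged file.
    """
    if not isinstance(target_size_mib, int) or target_size_mib <= 0:
        raise ValueError("target_size_mib must be a positive integer (MiB)")
    return [
        "s3-dist-cp",
        f"--src={src}",
        f"--dest={dest}",
        f"--groupBy={group_by}",
        f"--targetSize={target_size_mib}",
    ]


# Hypothetical values, for illustration only.
args = s3distcp_args(
    src="hdfs:///local/snowplow/shredded-events/",
    dest="s3://my-bucket/shredded/good/",
    group_by=r".*(part-).*",
    target_size_mib=128,
)
print(" ".join(args))
```

Fewer, larger output files mean fewer PUT requests against the same S3 prefix, which is what the 503 Slow Down response is throttling.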

However, another way to go about it would be to fix this upstream. If you’re using the scala-stream-collector, you can produce bigger files in S3 with the s3-loader by giving it a bigger buffer.

Those bigger files would then ripple through your pipeline, after enrich and after shred. You would end up with bigger files being moved to S3 and wouldn’t hit “Slow Down”.

This is particularly interesting because the parallelism of both the enrich and the shred jobs is dictated by the number of files. With bigger files you can better utilize your cluster.
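To make the file-count point concrete, here is some back-of-the-envelope arithmetic. The batch volume and file sizes are invented numbers, and it assumes each flushed buffer becomes one file of roughly that size, which is a simplification.

```python
import math


def estimated_file_count(total_bytes, bytes_per_file):
    """Roughly how many files a batch is split into at a given file size."""
    if bytes_per_file <= 0:
        raise ValueError("bytes_per_file must be positive")
    return math.ceil(total_bytes / bytes_per_file)


MIB = 1024 * 1024
batch = 10 * 1024 * MIB  # hypothetical: 10 GiB of collector output per run

for file_mib in (1, 16, 128):
    files = estimated_file_count(batch, file_mib * MIB)
    print(f"{file_mib:>4} MiB per file -> about {files} files")
```

With these made-up numbers, going from 1 MiB to 128 MiB files takes the count from 10240 down to 80, so far fewer objects have to be written to S3 at the end of the run.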

Finally, another way to go about it would be to run EmrEtlRunner more frequently, depending on how often you’re running it now.