I agree that one way to go about this is to modify `--targetSize`, combining it with
However, another way to go about it would be to fix this upstream: if you're using the Scala Stream Collector, you can produce bigger files in S3 by giving the S3 Loader a bigger buffer.
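As a sketch, the relevant knobs are in the `buffer` block of the S3 Loader's HOCON config; the loader flushes to S3 once any one of the limits is hit, so raising them yields fewer, bigger files. The values below are purely illustrative, tune them to your throughput:

```hocon
# Illustrative buffer settings for the Snowplow S3 Loader.
# A flush happens when ANY of these limits is reached,
# so larger limits mean fewer, bigger files on S3.
buffer {
  byteLimit   = 67108864   # flush after ~64 MB buffered
  recordLimit = 500000     # or after 500k records
  timeLimit   = 600000     # or after 10 minutes (in ms)
}
```

Note that `timeLimit` caps how stale your data can get, so there is a latency/file-size trade-off here.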
Those bigger files would then ripple through your pipeline after enrich and after shred, so you would end up moving bigger files to S3 and wouldn't hit the "SlowDown" errors.
This is particularly interesting because the parallelism of both the enrich and the shred jobs is dictated by the number of input files: with fewer, bigger files you can better utilize your cluster.
Finally, another option would be to run EmrEtlRunner more frequently, depending, obviously, on how often you're running it now.
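For example, if you schedule it with cron, a more frequent run might look like the entry below. The paths are hypothetical placeholders; point them at your actual EmrEtlRunner binary and config:

```
# Illustrative crontab entry: run EmrEtlRunner every 3 hours.
# Paths below are hypothetical; adjust to your deployment.
0 */3 * * * /opt/snowplow/snowplow-emr-etl-runner run -c /opt/snowplow/config.yml -r /opt/snowplow/resolver.json >> /var/log/snowplow/emr-etl-runner.log 2>&1
```

Just make sure consecutive runs don't overlap, since EmrEtlRunner will refuse to start (or can conflict) if a previous run is still processing the staging buckets.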