EMR ETL performance


Hi @alex

My EMR ETL batch job is trying to process ~150K files on S3, and Step 2 is taking way too long: it took 20 hours to complete using the configuration below. I came across your small file post.

Do you think that is the issue? Also, where do I insert the S3DistCp consolidation task? I'm just looking for a specific pointer. Thanks for your help.

Current EMR Config:

master_instance_type: r3.xlarge
core_instance_count: 2
core_instance_type: r3.xlarge
task_instance_count: 3 # Increase to use spot instances
task_instance_type: r3.xlarge
task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures

hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
hadoop_elasticsearch: 0.1.0


What did the Step 2 logs say?


@ChocoPowwwa Hi,

Nothing stood out in the logs. I think it is the Hadoop small file issue mentioned in the post from @alex that I linked in my original post.

The only errors were:
log4j:ERROR Failed to rename [/mnt/var/log/hadoop/steps/s-1HUUYLJ0H2U66/syslog] to [/mnt/var/log/hadoop/steps/s-1HUUYLJ0H2U66/syslog.2017-01-11-01].


How many events/how many files are going into the EMR job?


@mike Hi,

47K pairs of LZO/index files.



That’s a pretty significant number of files. Is that for a large date range, or is data being sunk to S3 on a very regular basis?

I imagine the time taken just to copy 47K files from S3 to HDFS would be considerable in and of itself. I wonder if it’s worth merging some LZO files together to create larger files rather than attempting to process 47K all at once. Thoughts @alex?


Based on this slide deck from the EMR Deep Dive talk (slide 25): bigger files = better performance.
Make sure to compress (we use LZO), and don't forget to increase your timeout/number of records to accommodate the larger file size.

47K pairs -> what file size / time / number of records are you at for syncing with S3?

You also want to avoid the “small file problem”, which can have a negative effect not only on the S3 copy but on EMR processing as well.

(BDT305) Amazon EMR Deep Dive and Best Practices from Amazon Web Services


@13scoobie @mike Thanks.

The files are from about two weeks of activity on a very low-volume site. I am assuming each LZO is one event (about 8K - 900K each compressed), and 47K events is not that crazy (I would imagine, even as a daily volume).

Question - where do I add a step in the EMR job to run S3DistCp and compact the files into a few large ones, as described by @alex in his post here?



@13scoobie and @mike are right - that’s a ton of files!

To fix the problem going forward, adjust the buffer configuration for your S3 sink. To resolve the historical problem, you can do the following:

  1. Remove the .lzo.index files
  2. Run compaction using /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar
  3. (Optional) Re-index the files using s3://snowplow-hosted-assets/third-party/twitter/hadoop-lzo-0.4.20.jar
  4. Kick off the regular Snowplow job from the EMR phase
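In case it helps, here is a sketch of what steps 1-3 could look like as commands run from the EMR master node. The bucket names, prefixes, and the --groupBy regex are placeholders for illustration; check the exact flags against the s3-dist-cp documentation for your EMR release:

```
# 1. Remove the .lzo.index files (bucket/prefix are placeholders)
aws s3 rm s3://my-raw-bucket/in/ --recursive \
  --exclude "*" --include "*.lzo.index"

# 2. Compact the small .lzo files into ~128 MB files
#    (--groupBy needs a capture group matching files to merge together)
hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar \
  --src s3://my-raw-bucket/in/ \
  --dest s3://my-raw-bucket/compacted/ \
  --groupBy '.*/(2017-01-[0-9]+)/.*' \
  --targetSize 128 \
  --outputCodec lzo

# 3. (Optional) Re-index the compacted files so they remain splittable
hadoop jar hadoop-lzo-0.4.20.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  s3://my-raw-bucket/compacted/
```

After that, point the regular Snowplow job's raw input at the compacted location and kick it off from the EMR phase as in step 4.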


Don’t the files get aggregated into 128 MB chunks in the S3DistCp step? The large number of files wouldn’t explain why the EMR process took 20 hours to run, right?
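For a rough sense of scale: if each of the 47K LZO files is on the order of 100 KB compressed (an assumed figure, within the 8K-900K range mentioned above), compacting into 128 MB targets collapses the job from tens of thousands of map tasks to a few dozen. A quick back-of-envelope sketch:

```python
# Back-of-envelope: effect of compacting small files before EMR processing.
avg_file_bytes = 100 * 1024          # assumed ~100 KB per compressed LZO file
n_files = 47_000
total_bytes = avg_file_bytes * n_files

target_bytes = 128 * 1024 * 1024     # 128 MB S3DistCp target size

# Roughly one map task per file before compaction...
tasks_before = n_files
# ...versus roughly one per 128 MB chunk after (ceiling division).
tasks_after = -(-total_bytes // target_bytes)

print(tasks_before, tasks_after)    # 47000 vs 36
```

That said, you're right that the S3DistCp staging step should already aggregate the raw files, so the 20-hour runtime may have another contributing cause worth checking in the step logs.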


Sorry @alex, I am unable to find that setting to adjust the buffer on Snowplow. And it seems Kinesis (Streams) doesn’t allow for that; I know Kinesis Firehose does. Am I totally off track here?



@sachinsingh10 You’ll want to look at the buffer settings in your configuration file here (under buffer).
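For reference, the buffer section of the Snowplow S3 sink configuration looks something like the fragment below. The field names come from the S3 Loader reference config; the numbers shown are placeholders to illustrate the idea (flush whenever any one limit is hit), not tuned recommendations:

```
# Flush a file to S3 when ANY one of these limits is reached.
# Raising them produces fewer, larger files on S3.
buffer {
  byte_limit = 67108864    # flush after ~64 MB of events
  record_limit = 200000    # or after this many records
  time_limit = 600000      # or after this many milliseconds
}
```

Larger limits mean fewer, bigger LZO files per run, which is exactly what avoids the small-file problem going forward.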