My ETL-EMR batch job is trying to process 150K files on S3, and Step 2 is taking way too long: it took 20 hours to complete using the configuration below. I came across your post on small files.
Do you think that is the issue? Also, where do I insert the S3DistCp consolidation task? I'm just looking for a specific pointer. Thanks for your help.
Current EMR Config:
task_instance_count: 3 # Increase to use spot instances
task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process