We had a period where the ETL wasn’t run and are now trying to play catch-up. I’m rerunning about 842 files, averaging 5 MB per raw file of events, so very small data files. I increased the number of core instances to 8 r3.xlarge hoping that would help get it done. It’s been running for about 15 hours now and utilization is very low across the 8 machines. I know the process prefers smaller batches, so if that’s the case I’ll have to cancel and write a script that only gives the ETL, say, 50 files at a time before kicking it off (rough sketch below). Any recommendations or help would be highly appreciated. Also, any recommendations on what the core instances should be, based on Amazon’s list of EC2 instance types here: https://aws.amazon.com/ec2/pricing/on-demand/ ? For example, should I try m4.xlarge instead of r3?
Normally we run this hourly, so the number of files is more like 20-30 at most per hourly run, and I can get that done with just one core instance.
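For the batching idea, something like the following is what I have in mind. This is a minimal sketch, assuming the backlog of raw files sits in a local staging directory and that moving files into the ETL’s input directory before invoking the runner is how batches are handed over; the paths, batch size, and runner command are hypothetical placeholders, not my actual setup.

    #!/usr/bin/env python3
    """Feed the ETL smaller batches of raw files, 50 at a time."""
    import shutil
    import subprocess
    from pathlib import Path

    STAGING_DIR = Path("/data/raw-staging")   # hypothetical: backlog of ~842 raw files
    IN_DIR = Path("/data/snowplow/in")        # hypothetical: directory the ETL picks up from
    RUNNER_CMD = ["snowplow-emr-etl-runner", "--config", "config.yml"]  # hypothetical invocation
    BATCH_SIZE = 50

    def run_in_batches() -> None:
        files = sorted(STAGING_DIR.glob("*"))
        for start in range(0, len(files), BATCH_SIZE):
            # Move the next batch of files into the ETL input directory
            for f in files[start:start + BATCH_SIZE]:
                shutil.move(str(f), IN_DIR / f.name)
            # Kick off the ETL and wait for it to finish before handing over the next batch
            subprocess.run(RUNNER_CMD, check=True)

    if __name__ == "__main__":
        run_in_batches()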
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m4.large
      core_instance_count: 8
      core_instance_type: r3.xlarge
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
      bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
      additional_info: # Optional JSON string for selecting additional features