I inherited a Snowplow pipeline from 2017, which includes EMR ETL Runner with the following config (only the relevant part copied):
```yaml
emr:
  jobflow:
    master_instance_type: m4.large
    core_instance_count: 8
    core_instance_type: r4.xlarge
    core_instance_bid: # In USD. Adjust bid, or leave blank for on-demand core instances
    core_instance_ebs: # Optional. Attach an EBS volume to each core instance
      volume_size: 40 # Gigabytes
      volume_type: "gp2"
      # volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
      ebs_optimized: false # Optional. Will default to true
    task_instance_count: 0 # Increase to use spot instances
    task_instance_type: m1.medium
    task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "28672"
        yarn.scheduler.maximum-allocation-mb: "28672"
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "23"
        spark.yarn.executor.memoryOverhead: "2048"
        spark.executor.memory: "8G"
        spark.executor.cores: "1"
        spark.yarn.driver.memoryOverhead: "2048"
        spark.driver.memory: "8G"
        spark.driver.cores: "1"
        spark.default.parallelism: "46"
```
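For what it's worth, doing the container math on those numbers (a rough sanity check, assuming plain YARN memory accounting of executor memory plus overhead per container; the per-node figure is `yarn.nodemanager.resource.memory-mb` above):

```python
# Rough capacity check for the config above, assuming standard
# YARN container accounting (executor memory + overhead per container).
node_mem_mb = 28672                         # yarn.nodemanager.resource.memory-mb
executor_mb = 8 * 1024 + 2048               # spark.executor.memory + memoryOverhead
executors_per_node = node_mem_mb // executor_mb
cluster_executors = executors_per_node * 8  # 8 core r4.xlarge instances

print(executors_per_node)  # 2
print(cluster_executors)   # 16
```

If I'm reading that right, YARN can only fit ~16 such containers across the 8 core nodes, so the requested 23 executors would never all run at once, which makes me suspect the config was never tuned carefully in the first place.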
As you can see, it uses a lot of resources, at least compared to the sample config I saw on GitHub. But, for context, we receive several million visits per day.
Now I've just updated the entire pipeline: installed Stream Enrich Kinesis, upgraded EMR ETL Runner to 1.0.4, switched it to Stream Enrich mode, updated the config for EMR 6.1, upgraded RDB Shredder/Loader to 0.18.1 and the Redshift JSON schema to 4.0.0, etc. And I'm running EMR ETL Runner every 2 hours.
So, since I removed the enrichment step from EMR ETL Runner, and RDB Shredder/Loader has received a lot of performance improvements over the years, I'm wondering if we still need all the resources listed in that config.
From the enricher Kinesis stream monitoring, I see we receive ~12,000 records/min. So that’s ~1,440,000 records to shred and load every 2 hours. With the current config it takes 17 min.
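For reference, the back-of-envelope arithmetic behind those numbers (the 12,000 records/min comes from Kinesis monitoring and the 17 min runtime from the EMR console; the effective rate is just my own derived figure):

```python
records_per_min = 12_000        # from Kinesis stream monitoring
window_min = 2 * 60             # EMR ETL Runner runs every 2 hours
records_per_run = records_per_min * window_min
runtime_min = 17                # observed shred + load time per run
effective_rate = records_per_run / runtime_min

print(records_per_run)          # 1_440_000 records per run
print(round(effective_rate))    # ~84_706 records/min while the job runs
```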
Anyone running EMR ETL Runner in a similar context? Any recommendations?
If not, what would be the way to start optimizing? Decreasing the number of instances? Downgrading the instance type? Which metrics should I look at in the EMR job monitoring?
Thanks in advance.