EmrEtlRunner sizing

trung · June 21, 2019, 8:40am

Hi, we are ingesting 2M events and the ETL is taking about 4 hours now. We are looking at optimising this by changing the instance types and amounts.

This is our config:

  master_instance_type: m4.large
  core_instance_count: 2
  core_instance_type: c4.large
  core_instance_ebs:
    volume_size: 100
    volume_type: "gp2"
    volume_iops: 400
    ebs_optimized: false
  task_instance_count: 0
  task_instance_type: m4.large
  task_instance_bid: 0.015

Are there any best practices available?

Also what are the allowed types as i tried to use c5.large and it had a big failure. Is there a list somewhere for each type?

ihor · June 21, 2019, 10:19pm

@trung, are you using Spark enrichment or Hadoop? We have switched to Spark starting from R89 release. The Spark framework differs from Hadoop in that it does most of the processing in memory while Hadoop relies on writing temp data to HDFS. Thus, Spark is more productive as it eliminates “costly” writing to the disk.

In other words, for Spark configuration, you would want to use instances like r4 that comes with a higher volume of memory. This, however, is not enough for high volume data processing. Spark requires tuning to ensure the resources are used to their maximum (not idle unnecessarily).

It’s hard to know for sure what the best configuration would be just knowing the volume of events. No event is the same; lots depends on the volume and complexity of self-describing events. We have a rough correlation between the size of enriched files and the EMR cluster/Spark configuration. If you provide what the typical payload size of compresed/uncompressed files is we can provide a better-adjusted configuration.

For now, if we simply replace your 2x c4.large for 1x r4.xlarge the Spark configuration will be as below

  jobflow:
    master_instance_type: "m4.large"
    core_instance_count: 1
    core_instance_type: "r4.xlarge"
    core_instance_ebs:
      volume_size: 40
      volume_type: "gp2"
      ebs_optimized: true
    . . .
  configuration:
    yarn-site:
      yarn.nodemanager.vmem-check-enabled: "false"
      yarn.nodemanager.resource.memory-mb: "27648"
      yarn.scheduler.maximum-allocation-mb: "27648"
    spark:
      maximizeResourceAllocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "2"
      spark.yarn.executor.memoryOverhead: "1024"
      spark.executor.memory: 8G
      spark.executor.cores: "1"
      spark.yarn.driver.memoryOverhead: "1024"
      spark.driver.memory: 8G
      spark.driver.cores: "1"
      spark.default.parallelism: "8"

You can read more about Spark configuration in this post: Learnings from using the new Spark EMR Jobs. In particular, the following spreadsheet could be used to find the optimal configuration: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/.

trung · June 24, 2019, 2:09am

@ihor, we are using emr-etl so i guess thats hadoop? However im keen on switching to spark, are there migration guides? Thanks!

egor · June 24, 2019, 5:05am

Hi @trung,

we are using emr-etl so i guess thats hadoop?

This depends on emr-etl-runner version. As Ihor mentioned earlier - Spark has replaced Hadoop starting from R89.

What release are you running?

However im keen on switching to spark, are there migration guides?

The upgrade guide can be found here: Upgrade Guide · snowplow/snowplow Wiki · GitHub.

trung · June 24, 2019, 11:53am

@egor using r113_filitosa.

ihor · June 24, 2019, 3:19pm

Then you are probably already using Spark Enrich (from version 1.9.0) and thus can try the config I provided earlier. You can also let us know the typical size of enriched files per ETL process and we can adjust the Spark configuration accordingly. Though, I already gave you all the info you need earlier.

Topic		Replies	Views
Optimizing and reducing shredding/loading costs For engineers	4	880	January 20, 2021
Processing a big file in EMR or split it up? AWS batch pipeline (Legacy)	2	2798	March 17, 2018
Core_Instance_Count not increasing Enrichment	5	1701	June 18, 2020
Recommended ec2 instances for EMR ETL Runner in 2018 AWS batch pipeline (Legacy)	2	1856	September 4, 2018
Learnings from using the new Spark EMR Jobs AWS batch pipeline (Legacy)	8	13125	August 23, 2017

EmrEtlRunner sizing

Related Topics