EmrEtlRunner sizing

Hi, we are ingesting 2M events and the ETL now takes about 4 hours. We are looking to optimise this by changing the instance types and counts.

This is our config:

  master_instance_type: m4.large
  core_instance_count: 2
  core_instance_type: c4.large
  core_instance_ebs:
    volume_size: 100
    volume_type: "gp2"
    volume_iops: 400
    ebs_optimized: false
  task_instance_count: 0
  task_instance_type: m4.large
  task_instance_bid: 0.015

Are there any best practices available?

Also, what are the allowed instance types? I tried to use c5.large and the job failed badly. Is there a list of allowed types somewhere?

@trung, are you using Spark enrichment or Hadoop? We switched to Spark starting with the R89 release. Spark differs from Hadoop in that it does most of its processing in memory, while Hadoop relies on writing temporary data to HDFS. Spark is therefore faster, as it avoids those “costly” writes to disk.

In other words, for a Spark configuration you would want memory-optimised instances such as r4, which come with more memory. This alone is not enough for high-volume data processing, though. Spark also needs tuning to ensure the cluster's resources are fully used rather than sitting idle.

It’s hard to know for sure what the best configuration would be from the event volume alone. No two events are the same; a lot depends on the volume and complexity of self-describing events. We have a rough correlation between the size of the enriched files and the EMR cluster/Spark configuration. If you tell us the typical payload size of your compressed/uncompressed files, we can suggest a better-adjusted configuration.

For now, if we simply replace your 2x c4.large with 1x r4.xlarge, the Spark configuration will be as below:

  jobflow:
    master_instance_type: "m4.large"
    core_instance_count: 1
    core_instance_type: "r4.xlarge"
    core_instance_ebs:
      volume_size: 40
      volume_type: "gp2"
      ebs_optimized: true
    . . .
  configuration:
    yarn-site:
      yarn.nodemanager.vmem-check-enabled: "false"
      yarn.nodemanager.resource.memory-mb: "27648"
      yarn.scheduler.maximum-allocation-mb: "27648"
    spark:
      maximizeResourceAllocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "2"
      spark.yarn.executor.memoryOverhead: "1024"
      spark.executor.memory: 8G
      spark.executor.cores: "1"
      spark.yarn.driver.memoryOverhead: "1024"
      spark.driver.memory: 8G
      spark.driver.cores: "1"
      spark.default.parallelism: "8"

You can read more about Spark configuration in this post: Learnings from using the new Spark EMR Jobs. In particular, the following spreadsheet could be used to find the optimal configuration: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/.
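
A rough sanity check on the numbers above, as a sketch rather than an official sizing tool: the YARN limit of 27648 MB is exactly what two executors plus one driver need when each has 8G of memory and 1024 MB of overhead. The Python below redoes that arithmetic. The function names are hypothetical, and it assumes that the driver runs inside the cluster (so its container counts against YARN memory) and that bare overhead values are in megabytes.

```python
def parse_memory_mb(value: str) -> int:
    """Convert a memory string such as '8G', '512M' or '1024' to megabytes."""
    value = value.strip().upper()
    if value.endswith("G"):
        return int(value[:-1]) * 1024
    if value.endswith("M"):
        return int(value[:-1])
    return int(value)  # assumption: a bare number is already in MB


def yarn_memory_needed_mb(conf: dict) -> int:
    """Memory requested from YARN by all executors plus the driver.

    Assumption: the driver runs in a YARN container (cluster deploy mode).
    """
    executors = int(conf["spark.executor.instances"])
    per_executor = (parse_memory_mb(conf["spark.executor.memory"])
                    + parse_memory_mb(conf["spark.yarn.executor.memoryOverhead"]))
    driver = (parse_memory_mb(conf["spark.driver.memory"])
              + parse_memory_mb(conf["spark.yarn.driver.memoryOverhead"]))
    return executors * per_executor + driver


# Values copied from the spark-defaults and yarn-site sections above.
spark_defaults = {
    "spark.executor.instances": "2",
    "spark.executor.memory": "8G",
    "spark.yarn.executor.memoryOverhead": "1024",
    "spark.driver.memory": "8G",
    "spark.yarn.driver.memoryOverhead": "1024",
}
yarn_limit_mb = 27648  # yarn.nodemanager.resource.memory-mb

needed_mb = yarn_memory_needed_mb(spark_defaults)
fits = needed_mb <= yarn_limit_mb
print(f"needed: {needed_mb} MB, limit: {yarn_limit_mb} MB, fits: {fits}")
```

If you change the executor count or memory, rerunning the check shows quickly whether the request would still fit within the node's YARN limit.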

@ihor, we are using emr-etl, so I guess that's Hadoop? However, I'm keen on switching to Spark. Are there migration guides? Thanks!

Hi @trung,

> we are using emr-etl, so I guess that's Hadoop?

This depends on your EmrEtlRunner version. As Ihor mentioned earlier, Spark replaced Hadoop starting from R89.

What release are you running?

> However, I'm keen on switching to Spark. Are there migration guides?

The upgrade guide can be found here: https://github.com/snowplow/snowplow/wiki/Upgrade-Guide.

@egor, we are using r113_filitosa.

Then you are probably already using Spark Enrich (from version 1.9.0), so you can try the config I provided earlier. If you let us know the typical size of the enriched files per ETL run, we can adjust the Spark configuration accordingly. That said, my earlier post should already contain everything you need to get started.