Core_Instance_Count not increasing

Hi Team,

I am setting up the latest Snowplow ETL batch processing and running the batch process on a large data size of around ~20 GB.

It is failing at the enrich step. Also, one thing I noticed is that even though I am provisioning for 6 core nodes, it is still not provisioning that many.

Below are the details of my config.yml file:

    emr:
      ami_version: 5.9.0
      region: us-east-1
      jobflow_role: EMR_EC2_DefaultRole
      service_role: EMR_DefaultRole
      placement:
      ec2_subnet_id: XXX
      ec2_key_name: XX
      security_configuration:
      bootstrap:
      software:
        hbase:
        lingual:
      # Adjust your Hadoop cluster below
      jobflow:
        job_name: Snowplow ETL QA
        master_instance_type: r4.8xlarge
        core_instance_count: 6
        core_instance_type: r4.8xlarge
        core_instance_bid:
        core_instance_ebs:     # Optional. Attach an EBS volume to each core instance.
        #  volume_size: 100    # Gigabytes
        #  volume_type: "gp2"
        #  volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        task_instance_count: 0
        task_instance_type: m3.4xlarge
      bootstrap_failure_tries: 2
      configuration:
        yarn-site:
          yarn.resourcemanager.am.max-attempts: "1"
        spark:
          maximizeResourceAllocation: "false"
      additional_info:
    collectors:
      format: thrift
    enrich:
      versions:
        spark_enrich: 1.18.0
      continue_on_unexpected_error: true
      output_compression: NONE
    storage:
      versions:
        rdb_loader: 0.14.0
        rdb_shredder: 0.13.1
        hadoop_elasticsearch: 0.1.0

Not sure why it is failing for the large data set (the small data set works fine). Also not sure why it is not provisioning more core instances even though the count is configured as 6.

Please help with the correct configuration in order to process a large data set of 30-40 GB.

Hi @sp_user,

It appears you are using EMR ETL Runner R117, which has an issue with core_instances and EBS volumes (see: https://github.com/snowplow/snowplow/issues/4285). The issue has been fixed in the latest version.

As for the correct configuration for large data sets: you will need to specify additional configuration settings to utilize as many resources as possible. I would recommend reading this thread to get a sense of how this can be done. You may consider using one of the configurations provided in the thread (e.g. 1x m4.xlarge & 5x r4.8xlarge).
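For instance, the jobflow section for the 1x m4.xlarge & 5x r4.8xlarge option could look roughly like the sketch below (the EBS volume size is only an assumption here and should be tuned to your payload):

    jobflow:
      master_instance_type: m4.xlarge
      core_instance_count: 5
      core_instance_type: r4.8xlarge
      core_instance_ebs:
        volume_size: 320        # assumption: size this to your uncompressed data volume
        volume_type: gp2
        ebs_optimized: true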

Overall, it's better to run the job more often and process less data. It is a more robust and cost-efficient model.

Hope this helps.

Thanks, Egor.

Will upgrade the version and keep you posted on the same.

Hi Egor,

I have upgraded to R119. The core instance count issue got resolved.

Just another query related to configuration:
– What instance type and count would be useful to process 40 GB within a 2-hour duration?
– Is any Spark/YARN configuration needed? If so, please help me out with that.

Thanks in advance.

Hi Egor,

I tried to execute the batch (20 GB load) with the above configuration, but it is stuck at the Spark enrich step.

Below are the log details:

    20/06/17 05:01:51 INFO Client: Application report for application_1592369903184_0002 (state: RUNNING)
    20/06/17 05:01:51 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 10.1.4.246
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1592370106702
         final status: UNDEFINED
         tracking URL: http://ip-10-1-4-239.ec2.internal:20888/proxy/application_1592369903184_0002/
         user: hadoop
    20/06/17 05:01:52 INFO Client: Application report for application_1592369903184_0002 (state: RUNNING)
    20/06/17 05:01:53 INFO Client: Application report for application_1592369903184_0002 (state: RUNNING)
    20/06/17 05:01:54 INFO Client: Application report for application_1592369903184_0002 (state: RUNNING)
    20/06/17 05:01:55 INFO Client: Application report for application_1592369903184_0002 (state: RUNNING)

Can you please suggest what I should do? This is urgent.

Thanks in advance.

@sp_user, I assume your enriched data is gzipped, which means that when uncompressed it could take up 400 GB, which is too much to process with your EMR cluster within 2 hours. It's best to split the payload into a few batches.
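For what it's worth, the rough arithmetic behind that estimate (the ~20:1 gzip ratio is only an assumption for raw event data; your actual ratio may differ):

    # Back-of-the-envelope only; the 20:1 gzip ratio is an assumption, not a measurement.
    compressed_gb = 20
    assumed_gzip_ratio = 20
    uncompressed_gb = compressed_gb * assumed_gzip_ratio
    print(uncompressed_gb)  # ~400 GB to enrich in a single run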

With the amount of data you collect, I would advise running your EMR job more frequently to reduce the volume of data to process in one go. We typically do not go above the following configuration (the largest EMR cluster we have ever used):

    jobflow:
      master_instance_type: m4.xlarge
      core_instance_count: 10
      core_instance_type: r4.8xlarge
      core_instance_ebs:
        volume_size: 320
        volume_type: gp2
        ebs_optimized: true
    configuration:
      yarn-site:
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "245760"
        yarn.scheduler.maximum-allocation-mb: "245760"
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "99"
        spark.yarn.executor.memoryOverhead: "4096"
        spark.executor.memory: "20G"
        spark.executor.cores: "3"
        spark.yarn.driver.memoryOverhead: "4096"
        spark.driver.memory: "20G"
        spark.driver.cores: "3"
        spark.default.parallelism: "1188"

The above configuration is aimed at a payload of ~10 GB of compressed (.gz) data.
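In case it helps with sizing your own cluster, here is roughly how those spark-defaults line up with a 10x r4.8xlarge core fleet (a back-of-the-envelope sketch, not an official formula):

    # Rough sketch of how the spark-defaults above map onto 10x r4.8xlarge core nodes.
    # Not an official formula; adjust if your instance types or counts differ.
    nodes = 10                     # core_instance_count
    node_mem_mb = 245760           # yarn.nodemanager.resource.memory-mb per node

    executor_mem_mb = 20 * 1024    # spark.executor.memory = 20G
    overhead_mb = 4096             # spark.yarn.executor.memoryOverhead
    container_mb = executor_mem_mb + overhead_mb        # 24 GiB per YARN container

    containers_per_node = node_mem_mb // container_mb   # 10 containers per node
    executors = nodes * containers_per_node - 1         # 99, one container is left for the driver
    cores_per_executor = 3                              # spark.executor.cores
    parallelism = executors * cores_per_executor * 4    # 1188, roughly 4 tasks per executor core

    print(executors, parallelism)  # 99 1188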