"[shred] spark: Shred Enriched Events" Step Fails

The “Spark: Shred Enriched Events” step continues to fail, and I’m having trouble finding the cause. The log files are below. Can someone help me find the root issue?

We are using the Clojure collector and attempting to write to Redshift.

Shred Enrich Event - Step Details

Status :FAILED
Reason :
Log File :s3://snowplow-emretlrunner-out/log/j-1Z6IDPVQ9FWMQ/steps/s-2UH0K79D2I0ZD/stderr.gz
Details :Exception in thread "main" org.apache.spark.SparkException: Application application_1579119791561_0002 finished with failed status
JAR location :command-runner.jar
Main class :None
Arguments :spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ew0KICAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLmlnbHUvcmVzb2x2ZXItY29uZmlnL2pzb25zY2hlbWEvMS0wLTEiLA0KICAiZGF0YSI6IHsNCiAgICAiY2FjaGVTaXplIjogNTAwLA0KICAgICJyZXBvc2l0b3JpZXMiOiBbDQogICAgICB7DQogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsDQogICAgICAgICJwcmlvcml0eSI6IDAsDQogICAgICAgICJ2ZW5kb3JQcmVmaXhlcyI6IFsgImNvbS5zbm93cGxvd2FuYWx5dGljcyIgXSwNCiAgICAgICAgImNvbm5lY3Rpb24iOiB7DQogICAgICAgICAgImh0dHAiOiB7DQogICAgICAgICAgICAidXJpIjogImh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20iDQogICAgICAgICAgfQ0KICAgICAgICB9DQogICAgICB9LA0KICAgICAgew0KICAgICAgICAibmFtZSI6ICJJZ2x1IENlbnRyYWwgLSBHQ1AgTWlycm9yIiwNCiAgICAgICAgInByaW9yaXR5IjogMSwNCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLA0KICAgICAgICAiY29ubmVjdGlvbiI6IHsNCiAgICAgICAgICAiaHR0cCI6IHsNCiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSINCiAgICAgICAgICB9DQogICAgICAgIH0NCiAgICAgIH0NCiAgICBdDQogIH0NCn0NCg== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3://snowplow-emretlrunner-out/shredded/bad/run=2020-01-15-20-18-25/
Action on failure:Terminate cluster

Controller file

2020-01-15T20:28:21.420Z INFO Ensure step 2 jar file command-runner.jar
2020-01-15T20:28:21.421Z INFO StepRunner: Created Runner for step 2
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ew0KICAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLmlnbHUvcmVzb2x2ZXItY29uZmlnL2pzb25zY2hlbWEvMS0wLTEiLA0KICAiZGF0YSI6IHsNCiAgICAiY2FjaGVTaXplIjogNTAwLA0KICAgICJyZXBvc2l0b3JpZXMiOiBbDQogICAgICB7DQogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsDQogICAgICAgICJwcmlvcml0eSI6IDAsDQogICAgICAgICJ2ZW5kb3JQcmVmaXhlcyI6IFsgImNvbS5zbm93cGxvd2FuYWx5dGljcyIgXSwNCiAgICAgICAgImNvbm5lY3Rpb24iOiB7DQogICAgICAgICAgImh0dHAiOiB7DQogICAgICAgICAgICAidXJpIjogImh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20iDQogICAgICAgICAgfQ0KICAgICAgICB9DQogICAgICB9LA0KICAgICAgew0KICAgICAgICAibmFtZSI6ICJJZ2x1IENlbnRyYWwgLSBHQ1AgTWlycm9yIiwNCiAgICAgICAgInByaW9yaXR5IjogMSwNCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLA0KICAgICAgICAiY29ubmVjdGlvbiI6IHsNCiAgICAgICAgICAiaHR0cCI6IHsNCiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSINCiAgICAgICAgICB9DQogICAgICAgIH0NCiAgICAgIH0NCiAgICBdDQogIH0NCn0NCg== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3://snowplow-emretlrunner-out/shredded/bad/run=2020-01-15-20-18-25/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-east-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2UH0K79D2I0ZD/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-100-0-24-154
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2UH0K79D2I0ZD
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2UH0K79D2I0ZD/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2UH0K79D2I0ZD/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2UH0K79D2I0ZD
INFO ProcessRunner started child process 9618 :
hadoop    9618  4100  0 20:28 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ew0KICAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLmlnbHUvcmVzb2x2ZXItY29uZmlnL2pzb25zY2hlbWEvMS0wLTEiLA0KICAiZGF0YSI6IHsNCiAgICAiY2FjaGVTaXplIjogNTAwLA0KICAgICJyZXBvc2l0b3JpZXMiOiBbDQogICAgICB7DQogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsDQogICAgICAgICJwcmlvcml0eSI6IDAsDQogICAgICAgICJ2ZW5kb3JQcmVmaXhlcyI6IFsgImNvbS5zbm93cGxvd2FuYWx5dGljcyIgXSwNCiAgICAgICAgImNvbm5lY3Rpb24iOiB7DQogICAgICAgICAgImh0dHAiOiB7DQogICAgICAgICAgICAidXJpIjogImh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20iDQogICAgICAgICAgfQ0KICAgICAgICB9DQogICAgICB9LA0KICAgICAgew0KICAgICAgICAibmFtZSI6ICJJZ2x1IENlbnRyYWwgLSBHQ1AgTWlycm9yIiwNCiAgICAgICAgInByaW9yaXR5IjogMSwNCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLA0KICAgICAgICAiY29ubmVjdGlvbiI6IHsNCiAgICAgICAgICAiaHR0cCI6IHsNCiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSINCiAgICAgICAgICB9DQogICAgICAgIH0NCiAgICAgIH0NCiAgICBdDQogIH0NCn0NCg== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3://snowplow-emretlrunner-out/shredded/bad/run=2020-01-15-20-18-25/
2020-01-15T20:28:23.475Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2UH0K79D2I0ZD Child Pid: 9618
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 394 seconds
2020-01-15T20:34:55.613Z INFO Step created jobs: 
2020-01-15T20:34:55.613Z WARN Step failed with exitCode 1 and took 394 seconds

Stderr file

Warning: Skip remote jar s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar.
20/01/15 20:28:24 INFO RMProxy: Connecting to ResourceManager at ip-100-0-24-154.ec2.internal/100.0.24.154:8032
20/01/15 20:28:24 INFO Client: Requesting a new application from cluster with 1 NodeManagers
20/01/15 20:28:24 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5632 MB per container)
20/01/15 20:28:24 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
20/01/15 20:28:24 INFO Client: Setting up container launch context for our AM
20/01/15 20:28:24 INFO Client: Setting up the launch environment for our AM container
20/01/15 20:28:24 INFO Client: Preparing resources for our AM container
20/01/15 20:28:25 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/01/15 20:28:27 INFO Client: Uploading resource file:/mnt/tmp/spark-c7351112-d3ef-4f42-abcb-c86e8cabaf6e/__spark_libs__2861957843382159812.zip -> hdfs://ip-100-0-24-154.ec2.internal:8020/user/hadoop/.sparkStaging/application_1579119791561_0002/__spark_libs__2861957843382159812.zip
20/01/15 20:28:28 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
20/01/15 20:28:29 INFO Client: Uploading resource s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar -> hdfs://ip-100-0-24-154.ec2.internal:8020/user/hadoop/.sparkStaging/application_1579119791561_0002/snowplow-rdb-shredder-0.13.1.jar
20/01/15 20:28:29 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar' for reading
20/01/15 20:28:30 INFO Client: Uploading resource file:/mnt/tmp/spark-c7351112-d3ef-4f42-abcb-c86e8cabaf6e/__spark_conf__3351045635256236100.zip -> hdfs://ip-100-0-24-154.ec2.internal:8020/user/hadoop/.sparkStaging/application_1579119791561_0002/__spark_conf__.zip
20/01/15 20:28:30 INFO SecurityManager: Changing view acls to: hadoop
20/01/15 20:28:30 INFO SecurityManager: Changing modify acls to: hadoop
20/01/15 20:28:30 INFO SecurityManager: Changing view acls groups to: 
20/01/15 20:28:30 INFO SecurityManager: Changing modify acls groups to: 
20/01/15 20:28:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/01/15 20:28:30 INFO Client: Submitting application application_1579119791561_0002 to ResourceManager
20/01/15 20:28:31 INFO YarnClientImpl: Submitted application application_1579119791561_0002
20/01/15 20:28:32 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:32 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1579120110995
	 final status: UNDEFINED
	 tracking URL: http://ip-100-0-24-154.ec2.internal:20888/proxy/application_1579119791561_0002/
	 user: hadoop
20/01/15 20:28:33 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:34 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:35 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:36 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:37 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
20/01/15 20:28:37 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 100.0.18.190
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1579120110995
	 final status: UNDEFINED
	 tracking URL: http://ip-100-0-24-154.ec2.internal:20888/proxy/application_1579119791561_0002/
	 user: hadoop
20/01/15 20:28:38 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
20/01/15 20:28:39 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)

  ..... omitted several redundant lines of the above form to keep the post readable

20/01/15 20:29:12 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
20/01/15 20:34:54 INFO Client: Application report for application_1579119791561_0002 (state: FINISHED)
20/01/15 20:34:54 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 100.0.18.190
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1579120110995
	 final status: FAILED
	 tracking URL: http://ip-100-0-24-154.ec2.internal:20888/proxy/application_1579119791561_0002/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1579119791561_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/01/15 20:34:54 INFO ShutdownHookManager: Shutdown hook called
20/01/15 20:34:54 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c7351112-d3ef-4f42-abcb-c86e8cabaf6e
Command exiting with ret '1'

config.yml

aws:
  access_key_id: 
  secret_access_key: 
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE
      jsonpath_assets: 
      log: s3://xxxxxxxxx-emretlrunner-out/log
      encrypted: false 
      raw:
        in:                  
          - s3://elasticbeanstalk-us-east-1-xxxxxxxx/resources/environments/logs/publish/e-xxxxxxx  
          - s3://elasticbeanstalk-us-east-1-xxxxxxxx/resources/environments/logs/publish/e-yyyyyyy 
        processing: s3://xxxxxxxx-emretlrunner-out/processing
        archive: s3://xxxxxxxxx-emretlrunner-out/archive    
      enriched:
        good: s3://xxxxxxxxx-emretlrunner-out/enriched/good      
        bad: s3://xxxxxxxxx-emretlrunner-out/enriched/bad
        errors:  
        archive: s3://xxxxxxxx-emretlrunner-out/enriched/archive
      shredded:
        good: s3://xxxxxxxxx-emretlrunner-out/shredded/good
        bad: s3://xxxxxxxxx-emretlrunner-out/shredded/bad
        errors:  
        archive: s3://xxxxxxxx-emretlrunner-out/shredded/archive
    consolidate_shredded_output: false 
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: 
    ec2_subnet_id: subnet-0fxxxxxxxxxxxxxxxxxxxxxx
    ec2_key_name: xxxxx-EmrEtlRunner-Snowplow
    security_configuration: 
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: c4.xlarge
      core_instance_count: 3
      core_instance_type: c4.xlarge
      core_instance_bid: 0.09
      core_instance_ebs:  # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production


Hello @datawise,

We experienced an outage due to Maven Central changing its security policy: https://blog.sonatype.com/central-repository-moving-to-https. We missed this very important update and are currently hot-fixing the scripts that communicate with Maven Central.

We fixed the AMI5 script around 6 PM UTC, except in one region (ap-southeast-2). We have now also fixed all the AMI4 scripts, and we suspect that your particular setup is significantly outdated. The EmrEtlRunner component is responsible for launching the bootstrap scripts, and we believe yours runs the AMI4 script. Can you make sure you’re using the latest version of EmrEtlRunner?

Updating EmrEtlRunner should only be necessary if the next run also fails; we updated the AMI4 scripts 10 minutes ago.

Thanks @anton! I’ll start working on this. Should have more information for you tomorrow.

@anton - It turns out the issue I reported was related to the EMR instance size, not the change in Maven’s security policy. This wasn’t obvious from the log files. I guessed at increasing capacity and luckily it worked.

To avoid future confusion, are there any resources that provide rough guidelines on sizing your EMR instances for EmrEtlRunner?

@datawise, when running your job in Spark, it’s important to have your Spark configuration set properly; it is not just the instance type that matters. There is a rough correlation between the volume of enriched data and the required EMR cluster size and Spark configuration.

The c4 EC2 instance type might not be the best choice for a Spark job, as Spark does all of its processing in memory. In other words, you might be better off with the memory-optimized r4 types instead.

You should be able to find lots of examples in this forum about how to configure Spark. For example, assuming you are using one r4.4xlarge core node in your EMR cluster, your Spark configuration would look something like this:

configuration:
  yarn-site:
    yarn.nodemanager.vmem-check-enabled: "false"
    yarn.nodemanager.resource.memory-mb: "117760"
    yarn.scheduler.maximum-allocation-mb: "117760"
  spark:
    maximizeResourceAllocation: "false"
  spark-defaults:
    spark.dynamicAllocation.enabled: "false"
    spark.executor.instances: "4"
    spark.yarn.executor.memoryOverhead: "3072"
    spark.executor.memory: "20G"
    spark.executor.cores: "3"
    spark.yarn.driver.memoryOverhead: "3072"
    spark.driver.memory: "20G"
    spark.driver.cores: "3"
    spark.default.parallelism: "48"

This allows for optimal resource utilization of the EMR cluster.
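As a rough sanity check, the numbers above can be verified arithmetically: in cluster deploy mode the driver (Application Master) also occupies a YARN container on a core node, so the four executors plus the driver, each with their memory overhead, must together fit inside `yarn.nodemanager.resource.memory-mb`. A minimal sketch (a hypothetical helper script, not part of the Snowplow tooling) assuming the r4.4xlarge settings above:

```python
# Verify that the requested executors and driver fit within YARN's memory
# budget on a single r4.4xlarge core node (122 GiB RAM, 16 vCPUs).

def mb(gigabytes):
    """Convert a Spark-style 'G' setting to megabytes (1G = 1024M)."""
    return gigabytes * 1024

yarn_node_memory_mb = 117760          # yarn.nodemanager.resource.memory-mb

executor_instances = 4                # spark.executor.instances
executor_memory_mb = mb(20)           # spark.executor.memory = 20G
executor_overhead_mb = 3072           # spark.yarn.executor.memoryOverhead

driver_memory_mb = mb(20)             # spark.driver.memory = 20G
driver_overhead_mb = 3072             # spark.yarn.driver.memoryOverhead

# Each YARN container holds heap + off-heap overhead.
executors_total = executor_instances * (executor_memory_mb + executor_overhead_mb)
driver_total = driver_memory_mb + driver_overhead_mb
used = executors_total + driver_total

assert used <= yarn_node_memory_mb, "memory over-committed"
print(f"{used} MB of {yarn_node_memory_mb} MB used")  # prints "117760 MB of 117760 MB used"
```

The budget works out exactly: 4 × (20480 + 3072) + (20480 + 3072) = 117760 MB, which is why `maximizeResourceAllocation` is disabled and the sizes are set by hand.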


Thanks again @ihor

Hey @datawise, can you please post the EMR configuration that finally worked for you? Thanks.

No problem @Tejas_Behra. To solve my EMR problem, I used two r4.2xlarge instances.

aws:
  access_key_id: xxxxxxxxxxxxxxx
  secret_access_key: xxxxxxxxxxxxxxx
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE
      jsonpath_assets: 
      log: s3://xxxxxxxx/log
      encrypted: false 
      raw:
        in:                  
          - s3://xxxxxxxxxx/resources/environments/logs/publish/e-bma4hpmtqn  
        processing: s3://xxxxxxxxxx/processing
        archive: s3://xxxxxxxxxx/archive    
      enriched:
        good: s3://xxxxxxxxxx/enriched/good      
        bad: s3://xxxxxxxxxx/enriched/bad
        errors:  
        archive: s3://xxxxxxxxxx/enriched/archive
      shredded:
        good: s3://xxxxxxxxxx/shredded/good
        bad: s3://xxxxxxxxxx/shredded/bad
        errors:  
        archive: s3://xxxxxxxxxx/shredded/archive
    consolidate_shredded_output: false 
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: 
    ec2_subnet_id: xxxxxxxxxxxxxxxx
    ec2_key_name: RF-EmrEtlRunner-Snowplow
    security_configuration: 
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: r4.2xlarge
      core_instance_count: 2
      core_instance_type: r4.2xlarge
      core_instance_bid: 0.17
      core_instance_ebs:  # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production


Thanks @datawise for answering my question. I had also asked you the same question here: https://discourse.snowplowanalytics.com/t/shred-spark-shred-enriched-events-failures/3551
I’ll try with this configuration.