Snowplow::EmrEtlRunner::EmrExecutionError

#1

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-61C0SLMW1HTF failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATED_WITH_ERRORS [VALIDATION_ERROR] ~ elapsed time n/a [ - 2019-04-24 10:12:40 +0000]

    1. [enrich] s3-dist-cp: Raw S3 -> Raw HDFS: CANCELLED ~ elapsed time n/a [ - ]
    1. [staging] s3-dist-cp: Raw s3://wogaa-snowplow-kinesis/ -> Raw Staging S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity Setup Hadoop Debugging: CANCELLED ~ elapsed time n/a [ - ]
    1. [archive_shredded] s3-dist-cp: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [archive_enriched] s3-dist-cp: Enriched S3 -> Enriched Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [archive_raw] s3-dist-cp: Raw Staging S3 -> Raw Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [shred] s3-dist-cp: Shredded HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [shred] s3-dist-cp: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [shred] spark: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
    1. [cleanup] Empty Raw HDFS: CANCELLED ~ elapsed time n/a [ - ]
    1. [enrich] spark: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [enrich] spark: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. [enrich] spark: Enrich Raw Events: CANCELLED ~ elapsed time n/a [ - ]):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:783:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:138:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
      org/jruby/RubyKernel.java:994:in `load'
      uri:classloader:/META-INF/main.rb:1:in `<main>'
      org/jruby/RubyKernel.java:970:in `require'
      uri:classloader:/META-INF/main.rb:1:in `(root)'
      uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in
aws:
  access_key_id: 
  secret_access_key: 
  s3:
    region: ap-southeast-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets:  # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://my_s3/log
      encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
      raw:
        in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://my_s3-kinesis # e.g. s3://my-old-collector-bucket
        processing: s3://my_s3/processing
        archive: s3://my_s3/archive/raw # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://my_s3/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://my_s3/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://my_s3/archive/enriched # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://my_s3/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://my_s3/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://my_s3/archive/shredded # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
    consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
  emr:
    ami_version: 5.9.0
    region: ap-southeast-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: my ec2 running subnet # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: emr-etl-runner
    security_configuration: # Specify your EMR security configuration if needed. Leave blank otherwise
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m4.large
      core_instance_count: 2
      core_instance_type: m4.large
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m4.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: "thrift"
enrich:
  versions:
    spark_enrich: 1.17.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    protocol: http
    port: 80
    app_id: snowplow # e.g. snowplow
    collector: d3rkrsqld9gmqf.cloudfront.net # e.g. d3rkrsqld9gmqf.cloudfront.net
#2

@buddhi_weragoda, a VALIDATION_ERROR normally indicates that either the instance type you requested for the EMR cluster is not available in your availability zone (or even your region), or you have reached the limit on the number of EC2 instances running concurrently.

The ap-southeast-... region is known to lag behind in instance type availability. Check that m4.large is available there. You can use the approach described here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-INSTANCE_TYPE_NOT_SUPPORTED-error.html
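If you would rather check availability from a script than from the console, here is a minimal sketch. It assumes boto3 is installed and AWS credentials are configured; `fetch_offerings` and `zones_offering` are helper names made up for this example, and the sample records in the usage note are invented, not real availability data.

```python
def zones_offering(instance_type, offerings):
    """Return the sorted availability zones that offer instance_type.

    `offerings` is a list of dicts shaped like the InstanceTypeOfferings
    entries returned by EC2 DescribeInstanceTypeOfferings.
    """
    return sorted(
        o["Location"] for o in offerings if o["InstanceType"] == instance_type
    )


def fetch_offerings(region, instance_type):
    """Ask EC2 which availability zones in `region` offer `instance_type`.

    Assumption: boto3 is installed and credentials are set up. The import
    is local so the pure helper above works without boto3.
    """
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    # Only the first page is read; filtering on a single instance type
    # normally returns a handful of records.
    resp = ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return resp["InstanceTypeOfferings"]


# Usage (needs AWS access):
#   offerings = fetch_offerings("ap-southeast-1", "m4.large")
#   print(zones_offering("m4.large", offerings))
```

An empty list would suggest the type is not offered in that region. A non-empty list does not rule out the EC2 instance limit, so that still has to be checked separately.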

#3

Hi Ihor,

I already checked the instance types. m4.large is supported in this region :(.

To double-check, I changed the config to an unsupported instance type and ran the job again. The error is then different. Please correct me if I'm wrong.

Changes

      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m3.large
      core_instance_count: 2
      core_instance_type: m3.large
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m3.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features

OUTPUT

#4

Hi Ihor,

I found the issue: it was a typo in my EC2 key name. Thanks a lot for taking the time to look into this.

Have a nice day
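
For anyone who lands here with the same VALIDATION_ERROR: a rough sketch of a pre-flight check for the ec2_key_name value, in case it saves someone a failed run. It assumes boto3 is installed and credentials are configured; `check_key_name` and `list_key_pair_names` are hypothetical helpers written for this post, not part of EmrEtlRunner.

```python
import difflib


def check_key_name(configured, existing):
    """Return (ok, suggestions) for the configured ec2_key_name.

    `existing` is the list of key pair names present in the region.
    Suggestions are close matches, which helps spot a typo.
    """
    if configured in existing:
        return True, []
    return False, difflib.get_close_matches(configured, existing, n=3, cutoff=0.6)


def list_key_pair_names(region):
    """List EC2 key pair names in `region`.

    Assumption: boto3 is installed and credentials are set up. The import
    is local so check_key_name can be used without boto3.
    """
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    return [kp["KeyName"] for kp in ec2.describe_key_pairs()["KeyPairs"]]


# Usage (needs AWS access); the misspelled name is an invented example:
#   ok, hints = check_key_name("emr-etl-runer", list_key_pair_names("ap-southeast-1"))
#   if not ok:
#       print("ec2_key_name not found; did you mean:", hints)
```

The comparison is an exact string match, so a stray space or different capitalisation in the config also shows up as a mismatch.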