EmrEtlRunner error - Elasticity Scalding Step: Enrich Raw Events: FAILED


#1

Hi,

I’m getting the following error when running EmrEtlRunner with the command below. I’m working with a very small volume of logs while I get my cluster up and running, so I can’t imagine this is due to capacity issues on the EMR servers.

sudo -u snowplow ./snowplow-emr-etl-runner --debug --skip staging,archive_raw --config emr-etl-runner.conf --resolver resolver.json --targets targets/ --enrichments enrichments/
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-RYRQ9UEQE0O4 failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2017-06-01 15:50:06 UTC - ]
 - 1. Elasticity Setup Hadoop Debugging: COMPLETED ~ 00:00:06 [2017-06-01 15:50:08 UTC - 2017-06-01 15:50:14 UTC]
 - 2. Elasticity Scalding Step: Enrich Raw Events: FAILED ~ 00:00:14 [2017-06-01 15:50:18 UTC - 2017-06-01 15:50:32 UTC]
 - 3. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 6. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:500:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:74:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `<main>'
    org/jruby/RubyKernel.java:973:in `load'
    uri:classloader:/META-INF/main.rb:1:in `<main>'
    org/jruby/RubyKernel.java:955:in `require'
    uri:classloader:/META-INF/main.rb:1:in `(root)'
    uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

Here’s the stderr log from the failed step:

Exception in thread "main" com.twitter.scalding.InvalidSourceException: [com.twitter.scalding.MultipleTextLineFilesWrappedArray(s3n://gm-snowplow-data-eu-west-1/processing/)] Data is missing from one or more paths in: List(s3n://gm-snowplow-data-eu-west-1/processing/)
	at com.twitter.scalding.FileSource.validateTaps(FileSource.scala:186)
	at com.twitter.scalding.FlowState$$anonfun$validateSources$1.apply(FlowState.scala:59)
	at com.twitter.scalding.FlowState$$anonfun$validateSources$1.apply(FlowState.scala:54)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
...<snip>

Here’s my EmrEtlRunner config:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: xxxxxxxx
  secret_access_key: xxxxxxxx
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets
      log: s3n://snowplow-data-eu-west-1/logs/
      raw:
        in:
          - s3n://elasticbeanstalk-eu-west-1-xxxxxxxxxxx/resources/environments/logs/publish/e-iufsscreji/
        processing: s3n://snowplow-data-eu-west-1/processing/
        archive: s3://snowplow-data-eu-west-1/archive/raw/
      enriched:
        good: s3://snowplow-data-eu-west-1/enriched/good
        bad: s3://snowplow-data-eu-west-1/enriched/bad
        errors: s3://snowplow-data-eu-west-1/enriched/errors
        archive: s3://snowplow-data-eu-west-1/enriched/archive
      shredded:
        good: s3://snowplow-data-eu-west-1/shredded/good
        bad: s3://snowplow-data-eu-west-1/shredded/bad
        errors: s3://snowplow-data-eu-west-1/shredded/errors
  emr:
    ami_version: 4.5.0      # Don't change this
    region: eu-west-1       # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-955dXXXX # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: XXXXXXX
    bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: c4.large
      core_instance_count: 2
      core_instance_type: c4.large
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: true # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: c4.large
      task_instance_bid:  # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.11.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: true # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: ADD HERE # e.g. snowplow
    collector: XXXXXX.eu-west-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net

Any advice would be very very helpful!

Thank you,

Graham


#2

Hi @Graham-M,

You are running EmrEtlRunner with the --skip staging option. That implies the “raw” event files are already in the processing bucket. However, the error suggests the bucket is empty.

On the other hand, the error points to the bucket s3n://gm-snowplow-data-eu-west-1/processing/ while the configuration file points to s3n://snowplow-data-eu-west-1/processing/ (no gm- prefix). Are you sure that’s the configuration you are using?
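
To check the first point, it may be worth listing the processing location directly before re-running. Below is a minimal sketch using the AWS CLI. The helper names to_s3_uri and count_objects are made up for illustration and are not part of EmrEtlRunner, and I'm assuming the CLI is installed with credentials that can read the bucket. The s3n:// scheme from the config is rewritten because the AWS CLI expects s3:// URIs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of EmrEtlRunner.

# Rewrite an s3n:// (or s3a://) URI from the config into the s3:// form
# that the AWS CLI understands. Plain s3:// URIs pass through unchanged.
to_s3_uri() {
  printf '%s\n' "$1" | sed -E 's#^s3[na]?://#s3://#'
}

# Print how many objects sit under the given bucket path.
# Assumes the AWS CLI is installed and has read access to the bucket.
count_objects() {
  aws s3 ls "$(to_s3_uri "$1")" --recursive | wc -l
}
```

For example, count_objects s3n://snowplow-data-eu-west-1/processing/ (with your real bucket name) prints the number of objects found. A result of 0 would mean there is nothing for the enrich step to read.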


#3

Hi @ihor,

Thank you for your reply, and my apologies for the delay in getting back to you.

The gm- prefix is the one I’ve actually been using. I tried to obfuscate the bucket names, but I appear to have done it very badly here.

I’ll test again without the --skip staging option and report back to this thread.

One quick question - can you tell me what the --targets option is for? I can’t seem to find any documentation for it.

Thanks,

Graham


#4

@Graham-M,

The targets option was introduced in Snowplow release R88: https://snowplowanalytics.com/blog/2017/04/27/snowplow-r88-angkor-wat-released/. It moves the targets section of the older config.yml file out into dedicated JSON configuration files.

Therefore, you do need to ensure you are using matching EmrEtlRunner/StorageLoader versions, as the configuration file format differs between releases.