EmrEtlRunner skip issues configuration


For some reason, each time a row is malformed the process grinds to a halt, leaving files in the “processing” folder.
This, in turn, causes all subsequent runs to fail with a folder-not-empty error.

Lately this issue is causing us a lot of work, as we need to move all the files from “processing” back to the “in” bucket and restart the ETL.
How can we configure the ETL to skip malformed rows, and send an email reporting how many rows were skipped?
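(For reference: EmrEtlRunner does have a setting aimed at this, continue_on_unexpected_error, paired with an errors bucket where skipped rows are written — though, as far as the standard config goes, there is no built-in email report. A minimal sketch of the relevant config.yml fragments, with a placeholder bucket path:)

```yaml
# Sketch only - key paths follow the standard EmrEtlRunner config.yml layout;
# the errors bucket path below is a placeholder.
aws:
  s3:
    buckets:
      enriched:
        errors: s3://my-out-bucket/enriched/errors # rows skipped on unexpected errors land here
enrich:
  continue_on_unexpected_error: true # skip rows that throw unexpected errors instead of failing the job
```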


Hi Asaf - can you please share with us the specific error message(s) you’re getting?



Hi Yali,

We were looking at the wrong “/bad” folder. We are actually not sure why the process stopped:

After that, on the next run we receive:
Snowplow::EmrEtlRunner::DirectoryNotEmptyError (Should not stage files for enrichment, processing bucket s3://theculturetrip-snow-plow-processing/events/processing/ is not empty)
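(For reference, the manual recovery described earlier — moving stranded files from “processing” back to “in” and re-running — can be sketched with the AWS CLI. The destination bucket below is a placeholder; the processing path is the one from the error message:)

```
# Sketch only: substitute your own "in" bucket for the placeholder.
aws s3 mv s3://theculturetrip-snow-plow-processing/events/processing/ \
  s3://my-in-bucket/ --recursive
# then re-run EmrEtlRunner
```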



Hi Asaf,

When the EMR process runs, any data that cannot be successfully processed will be written to the ‘bad’ bucket. It is normal to have several of these lines generated with each run. We have guides to using these bad rows to debug upstream issues here, here and here.

However, this process is non-blocking. The reason your EMR job is failing will not be related to the bad rows.

When the EMR job fails you should get an error message back from EmrEtlRunner. In addition, it should be possible to look in the AWS EMR console and see the error message there.

If you can share those error messages with us, we’ll be in a better position to help you diagnose the root cause of the failures.

All the best,



Hi Yali,

You can see in the screenshot I shared that no logs were created.
Should I be looking somewhere else?



Sorry - missed the screenshot!
Can you share your EmrEtlRunner config please, with the sensitive bits (AWS creds etc.) removed?


aws:
  access_key_id: ----------------------
  secret_access_key: ----------------------
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://theculturetrip-snow-plow-logs
      raw:
        in:                  # Multiple in buckets are permitted
          - ----------------------          # e.g. s3://my-in-bucket
         # - s3://elasticbeanstalk-us-east-1-754733210272/resources/environments/logs/publish/e-yyubswy4pr/i-0daea088
        processing: ----------------------
        archive: ----------------------    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: ----------------------       # e.g. s3://my-out-bucket/enriched/good
        bad: ----------------------        # e.g. s3://my-out-bucket/enriched/bad
        errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: ----------------------    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: ----------------------       # e.g. s3://my-out-bucket/shredded/good
        bad: ----------------------        # e.g. s3://my-out-bucket/shredded/bad
        errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: ----------------------    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 3.7.0      # Don't change this
    region: ----------------------        # Always set this
    jobflow_role: ---------------------- # Created using $ aws emr create-default-roles
    service_role: ---------------------- # Created using $ aws emr create-default-roles
    placement: ---------------------- # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: TheCultureTrip-Snow-Plow
    bootstrap:            # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase:                # To launch on cluster, provide version, "0.92.0", keep quotes
      lingual:              # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.5.1 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.7.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: GZIP # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: ----------------------
      type: redshift
      host: ---------------------- # The endpoint as shown in the Redshift console
      database: snow_plow # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: ---------------------- # One of disable (default), require, verify-ca or verify-full
      table: ----------------------
      username: ----------------------
      password: ----------------------
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
      es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow
    collector: ---------------------- # e.g. d3rkrsqld9gmqf.cloudfront.net


Hi Asaf,

You’re running an old version of Snowplow which uses an old EMR AMI (version 3.7.0).

We’ve found that newer AMIs are significantly less likely to fail on job steps, so my first recommendation would be for you to upgrade. The latest version of the batch enrichment uses version 4.5.0 of the EMR AMI:

  ami_version: 4.5.0 
  hadoop_enrich: 1.7.0        
  hadoop_shred: 0.9.0         
  hadoop_elasticsearch: 0.1.0 

My second recommendation would be not to use 2xm1.medium core instances: better to run the job on a single larger core instance (e.g. an m1.large or m3.xlarge). Using an m1.medium instance for the master node is fine - this node does very little at all.
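Applied to the config shared above, that cluster-sizing suggestion would look roughly like this (a sketch — the instance types are illustrative, pick sizes to match your data volume):

```yaml
# Sketch of the suggested cluster sizing (values illustrative)
jobflow:
  master_instance_type: m1.medium # fine for the master - it does very little
  core_instance_count: 1          # a single larger core node
  core_instance_type: m1.large    # instead of 2x m1.medium (m3.xlarge is another option)
```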

You can consult our upgrade guide here: https://github.com/snowplow/snowplow/wiki/Upgrade-Guide. It’s r79 you want to get on (r80 and r81 are updates to the real-time pipeline). Follow the steps to upgrade from your current version.




Hi @yali,

After updating the configuration I get the following error — do you know what is missing?

ArgumentError (AWS EMR API Error (ValidationException): The supplied ami version is invalid.):
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/elasticity-6.0.5/lib/elasticity/aws_session.rb:33:in `submit'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/elasticity-6.0.5/lib/elasticity/emr.rb:302:in `run_job_flow'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/elasticity-6.0.5/lib/elasticity/job_flow.rb:141:in `run'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:400:in `run'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:68:in `run'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/home/ec2-user/deployCloudFrontEMR/exe/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby2415309698897338396extract/jruby-stdlib-!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

finished deployEMR


Hello @Asaf

It looks to me like you haven’t replaced the old EmrEtlRunner with the newer (r77) one. You can find a link (and other instructions) in the Upgrade Guide @Yali provided.


Hi @yali & @anton,

This issue was resolved, thank you for the help.