Debugging Storage Loader Failure


#1

Hi All,

OK, this is the first time this has occurred; I'm just looking for pointers on how to debug it.

I ran 160K thrift logs through EMR_ETL and all the steps completed without issue, but when running the Storage Loader I receive:


Unexpected error: Cannot find atomic-events directory in shredded/good
/home/ubuntu/emretlrunner/snowplow-storage-loader!/storage-loader/lib/snowplow-storage-loader/redshift_loader.rb:70:in `load_events_and_shredded_types'
file:/home/ubuntu/emretlrunner/snowplow-storage-loader!/storage-loader/bin/snowplow-storage-loader:54:in `(root)'
org/jruby/RubyArray.java:1613:in `each'
file:/home/ubuntu/emretlrunner/snowplow-storage-loader!/storage-loader/bin/snowplow-storage-loader:51:in `(root)'
org/jruby/RubyKernel.java:1091:in `load'
file:/home/ubuntu/emretlrunner/snowplow-storage-loader!/META-INF/main.rb:1:in `(root)'
org/jruby/RubyKernel.java:1072:in `require'
file:/home/ubuntu/emretlrunner/snowplow-storage-loader!/META-INF/main.rb:1:in `(root)'
/tmp/jruby5278389271672921041extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

I checked, and the message is accurate: there are no good records for this particular run, only bad ones. Could you please suggest where I should go from here?

Regards
SS


#2

Your EMR cluster will save its logs wherever you specified in the runner config, which should give you some view into what's going on.
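
If you sync that log folder to a local directory, something like the sketch below can narrow down which files are worth opening. It is only a rough aid: the function names and the search pattern are my own assumptions, not part of Snowplow or EMR, and you will probably want to tune the pattern to what your job actually logs.

```python
import gzip
import os
import re

# Hypothetical search pattern; adjust it to the messages your job emits.
ERROR_PATTERN = re.compile(r"Exception|ERROR|FATAL")


def open_log(path):
    """Open a log file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def find_error_lines(log_root, pattern=ERROR_PATTERN, max_per_file=5):
    """Walk log_root and return {file path: [first few matching lines]}."""
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(log_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            matches = []
            try:
                with open_log(path) as handle:
                    for line in handle:
                        if pattern.search(line):
                            matches.append(line.rstrip("\n"))
                            if len(matches) >= max_per_file:
                                break
            except (OSError, EOFError):
                # Unreadable or truncated file: skip it rather than abort the scan.
                continue
            if matches:
                hits[path] = matches
    return hits
```

Calling `find_error_lines("path/to/synced/logs")` and printing the resulting dictionary gives a short list of files and their first matching lines, which is usually easier than opening every file by hand.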

Are you using any custom event schemas? If so, a likely cause is a problem with your event format that is causing your events to fail validation.
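
To see why events are failing, it can help to tally the error messages in the bad rows. The sketch below assumes you have downloaded and decompressed the bad-row files, that each line is a JSON object, and that it carries an `errors` list whose items are either plain strings or objects with a `message` key. That layout is my assumption about the bad-row format, so check it against one of your own files first; `tally_errors` and `error_messages` are hypothetical helper names.

```python
import json
from collections import Counter


def error_messages(bad_row):
    """Extract error messages from one parsed bad row (assumed layout)."""
    messages = []
    for err in bad_row.get("errors", []):
        if isinstance(err, dict):
            # Assumed shape: {"level": "...", "message": "..."}
            messages.append(str(err.get("message", err)))
        else:
            # Assumed shape: a plain string
            messages.append(str(err))
    return messages


def tally_errors(lines):
    """Count how often each error message occurs across bad-row lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except ValueError:
            counts["<unparseable bad row>"] += 1
            continue
        if not isinstance(row, dict):
            counts["<unparseable bad row>"] += 1
            continue
        for message in error_messages(row):
            counts[message] += 1
    return counts
```

Passing an open file to `tally_errors` and printing `counts.most_common(10)` shows the most frequent failure reasons. If one schema-related message dominates, that points at the event format rather than at the loader.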


#3

I have this same error. If I understand @acgray correctly, emr-etl-runner is not doing something it should, and storage-loader is barfing because of it. The EMR process exits cleanly, according to AWS.

My architecture is a straightforward batch system using the Elastic Beanstalk collector, EMRETLRunner, StorageLoader and Redshift.

According to other threads, I understand I should be looking at the EMR logs, but I’m not exactly sure where to look.

$ find j-1O5xxxxxxxx/ -type f | wc -l
421

Error’s here:

Loading Snowplow events and shredded types into AWS Redshift enriched events storage (Redshift cluster)...
Unexpected error: Cannot find atomic-events directory in shredded/good
uri:classloader:/storage-loader/lib/snowplow-storage-loader/redshift_loader.rb:77:in `load_events_and_shredded_types'
uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
uri:classloader:/storage-loader/bin/snowplow-storage-loader:54:in `block in (root)'
uri:classloader:/storage-loader/bin/snowplow-storage-loader:51:in `<main>'
org/jruby/RubyKernel.java:977:in `load'
uri:classloader:/META-INF/main.rb:1:in `<main>'
org/jruby/RubyKernel.java:959:in `require'
uri:classloader:/META-INF/main.rb:1:in `(root)'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

My conf looks like this:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: xxxxx
  secret_access_key: xxxxx
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets
      log: s3n://snowplow-stage-eu-west-1/logs/
      raw:
        in:
          - s3n://elasticbeanstalk-eu-west-1-XXXXXXXXXXXX/resources/environments/logs/publish/e-rhdpxxxxx/
        processing: s3n://snowplow-stage-eu-west-1/processing/
        archive: s3://snowplow-stage-eu-west-1/archive/raw/
      enriched:
        good: s3://snowplow-stage-eu-west-1/enriched/good
        bad: s3://snowplow-stage-eu-west-1/enriched/bad
        errors: s3://snowplow-stage-eu-west-1/enriched/errors
        archive: s3://snowplow-stage-eu-west-1/enriched/archive
      shredded:
        good: s3://snowplow-stage-eu-west-1/shredded/good
        bad: s3://snowplow-stage-eu-west-1/shredded/bad
        errors: s3://snowplow-stage-eu-west-1/shredded/errors
        archive: s3://snowplow-stage-eu-west-1/shredded/archive
  emr:
    ami_version: 4.5.0      # Don't change this
    region: eu-west-1       # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-xxxxx # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: eu-west-1-key
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: c4.large
      core_instance_count: 2
      core_instance_type: c4.large
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: true # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: c4.large
      task_instance_bid:  # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: snowplow-emr-etl-runner-stage ETL # Give your job a name
  versions:
    hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.11.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: GZIP # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow-emr-etl-runner-stage # e.g. snowplow
    collector: snowplow-stage.eu-west-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net