NameError: uninitialized constant Snowplow::EmrEtlRunner::S3::AWS

When trying to start EmrEtlRunner (R113), I am getting the JRuby errors below. Is there a dependency I have missed? Any suggestions?

```
[ec2-user@XX emretlrunner]$ ./snowplow-emr-etl-runner run --config config.yml --resolver iglu_resolver.json
uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
D, [2019-03-25T17:10:12.248617 #22419] DEBUG -- : Initializing EMR jobflow
NameError: uninitialized constant Snowplow::EmrEtlRunner::S3::AWS
const_missing at org/jruby/RubyModule.java:3526
list_objects at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/s3.rb:106
block in empty_impl at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/s3.rb:85
loop at org/jruby/RubyKernel.java:1418
empty_impl at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/s3.rb:84
empty? at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/s3.rb:36
initialize at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:179
send_to at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43
call_with at uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76
block in redefine_method at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138
run at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:135
send_to at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43
call_with at uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76
block in redefine_method at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138
at uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41
load at org/jruby/RubyKernel.java:994
at uri:classloader:/META-INF/main.rb:1
require at org/jruby/RubyKernel.java:970
(root) at uri:classloader:/META-INF/main.rb:1
at uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1
ERROR: org.jruby.embed.EvalFailedException: (NameError) uninitialized constant Snowplow::EmrEtlRunner::S3::AWS
```

@jason, it sounds like something is missing from your configuration file. Could you share it with sensitive data removed (paste it between a pair of ``` to preserve indentation)?

Sure, here you go. Thank you for taking a look.

```yaml
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: HIDDEN
  secret_access_key:  HIDDEN
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://mys3snowplow-logs/emr
      encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://mys3snowplow         # e.g. s3://my-old-collector-bucket
        processing: s3://mys3snowplow-processing/processing
        archive: s3://mys3snowplow-archive/raw
      enriched:
        good: s3://mys3snowplow-processed/enriched/good
        bad: s3://mys3snowplow-processed/enriched/bad
        errors: s3://mys3snowplow-processed/enriched/errors 
        archive: s3://mys3snowplow-processed/enriched/archive
      shredded:
        good: s3://mys3snowplow-processed/shredded/good
        bad:  s3://mys3snowplow-processed/shredded/bad
        errors:  s3://mys3snowplow-processed/shredded/errors
        archive:  s3://mys3snowplow-processed/shredded/archive
    consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-aaaaaa # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow
    security_configuration: # Specify your EMR security configuration if needed. Leave blank otherwise
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.17.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    protocol: https
    port: 443
    app_id: snowplow # e.g. snowplow
    collector: sp.mydomain.com # e.g. d3rkrsqld9gmqf.cloudfront.net
```

@jason, the config looks OK. Could you check that your bucket names are not misspelt and that the buckets actually exist? If they do, try replacing `s3` with `s3a` in the bucket paths of the raw, enriched and shredded sections (e.g. `processing: s3a://mys3snowplow-processing/processing`).
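To rule out the bucket and permission side independently of EmrEtlRunner, you could run a quick check along the lines of the sketch below. It is only a sketch, not part of EmrEtlRunner: it assumes the aws-sdk-s3 gem is installed, that the access key and secret from your config.yml are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (so the SDK picks them up), and it uses the bucket names from the config you posted. It attempts the same kind of object listing that the `empty?` check in your stack trace is doing.

```ruby
# Quick bucket sanity check -- not part of EmrEtlRunner, just a diagnostic sketch.
# Assumes the aws-sdk-s3 gem is installed (gem install aws-sdk-s3) and that the
# access key and secret from config.yml are exported as AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY, so the SDK can pick them up automatically.
require 'aws-sdk-s3'

# Bucket names taken from the config posted above
buckets = %w[
  mys3snowplow
  mys3snowplow-logs
  mys3snowplow-processing
  mys3snowplow-archive
  mys3snowplow-processed
]

s3 = Aws::S3::Client.new(region: 'us-east-1')

buckets.each do |bucket|
  begin
    # head_bucket confirms the bucket exists and is reachable with these credentials
    s3.head_bucket(bucket: bucket)
    # a one-key listing, similar to what the empty? check in the stack trace is attempting
    s3.list_objects_v2(bucket: bucket, max_keys: 1)
    puts "OK        #{bucket}"
  rescue Aws::S3::Errors::NotFound
    puts "MISSING   #{bucket}"
  rescue Aws::S3::Errors::Forbidden, Aws::S3::Errors::AccessDenied
    puts "FORBIDDEN #{bucket}"
  rescue Aws::S3::Errors::ServiceError => e
    puts "ERROR     #{bucket}: #{e.message}"
  end
end
```

If any bucket comes back MISSING, the name is wrong (or the bucket lives in another account or region); if it comes back FORBIDDEN, the IAM policy attached to that access key is the place to look.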

Additional comments:

  • I would create a separate bucket for the archive
  • You are using very old EC2 instance types (m1.medium), which may no longer be available. Either way, I would use at least m4.large, depending on the volume of data you need to process

Thank you. It turned out that I hadn’t granted the S3 permissions properly in IAM. I fixed that (and also created the separate archive bucket as you suggested), and the error is gone.

I also updated the EC2 types, as I had just used the boilerplate from the Common Configuration page and hadn’t thought too hard about whether those instance types were still available.

I’m now running into an EMR Access Denied error, but I see existing threads about that, so I’ll work through those first and follow up from there.