My config.yml causes a contract violation in EmrEtlRunner

Hi there,

I haven’t been able to get the EmrEtlRunner process going. I have scoured through other similar issues and can’t find a solution.

**ERROR:**
$ ./snowplow-emr-etl-runner -d --config config/config.yml --resolver resolver.json
F, [2016-06-10T02:54:57.229000 #26357] FATAL -- :

ContractError (Contract violation for return value:
    Expected: {:aws=>{:access_key_id=>String, :secret_access_key=>String, :s3=>{:region=>String, :buckets=>{:assets=>String, :jsonpath_assets=>#<Contracts::Maybe:0x122a10c5 @vals=[String, nil]>, :log=>String, :raw=>{:in=>#<Contracts::ArrayOf:0x7b2fed4 @contract=String>, :processing=>String, :archive=>String}, :enriched=>{:good=>String, :bad=>String, :errors=>#<Contracts::Maybe:0x7878143e @vals=[String, nil]>, :archive=>#<Contracts::Maybe:0x1379303c @vals=[String, nil]>}, :shredded=>{:good=>String, :bad=>String, :errors=>#<Contracts::Maybe:0x794dbd20 @vals=[String, nil]>, :archive=>#<Contracts::Maybe:0x5ebbbe17 @vals=[String, nil]>}}}, :emr=>{:ami_version=>String, :region=>String, :jobflow_role=>String, :service_role=>String, :placement=>#<Contracts::Maybe:0x77bd0897 @vals=[String, nil]>, :ec2_subnet_id=>#<Contracts::Maybe:0x1b7f2eeb @vals=[String, nil]>, :ec2_key_name=>String, :bootstrap=>#<Contracts::Maybe:0x3ae15467 @vals=[#<Contracts::ArrayOf:0x1def1f1e @contract=String>, nil]>, :software=>{:hbase=>#<Contracts::Maybe:0x1f6c5464 @vals=[String, nil]>, :lingual=>#<Contracts::Maybe:0x118815a @vals=[String, nil]>}, :jobflow=>{:master_instance_type=>String, :core_instance_count=>Contracts::Num, :core_instance_type=>String, :task_instance_count=>Contracts::Num, :task_instance_type=>String, :task_instance_bid=>#<Contracts::Maybe:0x6fd43c45 @vals=[Contracts::Num, nil]>}, :additional_info=>#<Contracts::Maybe:0x56c10f5e @vals=[String, nil]>, :bootstrap_failure_tries=>Contracts::Num}}, :collectors=>{:format=>String}, :enrich=>{:job_name=>String, :versions=>{:hadoop_enrich=>String, :hadoop_shred=>String}, :continue_on_unexpected_error=>Contracts::Bool, :output_compression=>#<Proc:0x3a9b7e55@/home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:23 (lambda)>}, :storage=>{:download=>{:folder=>#<Contracts::Maybe:0x2d381822 @vals=[String, nil]>}, :targets=>#<Contracts::ArrayOf:0x3db8775f @contract={:name=>String, :type=>String, :host=>String, :database=>String, :port=>Contracts::Num, :ssl_mode=>#<Contracts::Maybe:0x4bc348d3 @vals=[String, nil]>, :table=>String, :username=>#<Contracts::Maybe:0x2bb0931e @vals=[String, nil]>, :password=>#<Contracts::Maybe:0x3627cf6 @vals=[String, nil]>, :es_nodes_wan_only=>#<Contracts::Maybe:0x561d596c @vals=[Contracts::Bool, nil]>, :maxerror=>#<Contracts::Maybe:0x437f7292 @vals=[Contracts::Num, nil]>, :comprows=>#<Contracts::Maybe:0x59dcd5ec @vals=[Contracts::Num, nil]>}>}, :monitoring=>{:tags=>#<Contracts::HashOf:0x19a51da1 @value=String, @key=Symbol>, :logging=>{:level=>String}, :snowplow=>#<Contracts::Maybe:0x273c422e @vals=[{:method=>String, :collector=>String, :app_id=>String}, nil]>}},
    Actual: {:aws=>{:access_key_id=>"AKIAIUFF5DVIXMMDFVFA", :secret_access_key=>"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", :s3=>{:region=>"ap-southeast-2", :buckets=>{:assets=>"s3://snowplow-hosted-assets", :jsonpath_assets=>nil, :log=>"s3://ETLEMR/ETLEMR_logs", :raw=>{:in=>"s3://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3", :processing=>"s3://ETLEMR/ETLEMR_in_processing", :archive=>"s3://ETLEMR/ETLEMR_in_archive"}, :enriched=>{:good=>"s3://ETLEMR/ETLEMR_enriched_good", :bad=>"s3://ETLEMR/ETLEMR_enriched_bad", :errors=>"s3://ETLEMR/ETLEMR_enriched_errors", :archive=>"s3://ETLEMR/ETLEMR_enriched_archive"}, :shredded=>{:good=>"s3://ETLEMR/ETLEMR_shredded_good", :bad=>"s3://ETLEMR/ETLEMR_shredded_bad", :errors=>"s3://ETLEMR/ETLEMR_shredded_errors", :archive=>"s3://ETLEMR/ETLEMR_shredded_archive"}}}, :emr=>{:ami_version=>"4.5.0", :region=>"ap-southeast-2", :jobflow_role=>"EMR_EC2_DefaultRole", :service_role=>"EMR_DefaultRole", :placement=>nil, :ec2_subnet_id=>nil, :ec2_key_name=>"firstec2.ppk", :bootstrap=>[], :software=>{:hbase=>nil, :lingual=>nil}, :jobflow=>{:master_instance_type=>"m1.medium", :core_instance_count=>2, :core_instance_type=>"m1.medium", :task_instance_count=>0, :task_instance_type=>"m1.medium", :task_instance_bid=>0.015}, :bootstrap_failure_tries=>3, :additional_info=>nil}}, :collectors=>{:format=>"clj-tomcat"}, :enrich=>{:job_name=>"Snowplow ETL", :versions=>{:hadoop_enrich=>"1.7.0", :hadoop_shred=>"0.9.0", :hadoop_elasticsearch=>"0.1.0"}, :continue_on_unexpected_error=>false, :output_compression=>"NONE"}, :storage=>{:download=>{:folder=>nil}, :targets=>[{:name=>"My Redshift database", :type=>"redshift", :host=>"ADD HERE", :database=>"ADD HERE", :port=>5439, :table=>"atomic.events", :username=>"ADD HERE", :password=>"ADD HERE", :maxerror=>1, :comprows=>200000, :ssl_mode=>"disable"}]}, :monitoring=>{:tags=>{}, :logging=>{:level=>"DEBUG"}, :snowplow=>{:method=>"get", :app_id=>"redplanet", :collector=>"ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com"}}}
    Value guarded in: Snowplow::EmrEtlRunner::Cli::load_config
    With Contract: Maybe, String => Hash
    At: /home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:134 ):
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:69:in `Contract'
    org/jruby/RubyProc.java:271:in `call'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:147:in `failure_callback'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:164:in `common_method_added'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:37:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby8204250454434082145extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

Here is my config.yml:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXXXXXXXXXXXXXXXXX
  secret_access_key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  s3:
    region: ap-southeast-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://ETLEMR/ETLEMR_logs
      raw:
        in: s3://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3  # Multiple in buckets are permitted
              # e.g. s3://my-in-bucket
        processing: s3://ETLEMR/ETLEMR_in_processing
        archive: s3://ETLEMR/ETLEMR_in_archive    # e.g. s3://my-archive-bucket/in
      enriched:
        good: s3://ETLEMR/ETLEMR_enriched_good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://ETLEMR/ETLEMR_enriched_bad         # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://ETLEMR/ETLEMR_enriched_errors      # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://ETLEMR/ETLEMR_enriched_archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://ETLEMR/ETLEMR_shredded_good     # e.g. s3://my-out-bucket/shredded/good
        bad: s3://ETLEMR/ETLEMR_shredded_bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://ETLEMR/ETLEMR_shredded_errors      # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://ETLEMR/ETLEMR_shredded_archive     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.5.0      # Don't change this
    region: ap-southeast-2  # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: firstec2.ppk
    bootstrap: []           #Set this to specify custom boostrap actions. Leave empty otherwise.
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host: ADD HERE # The endpoint as shown in the Redshift console
      database: ADD HERE # Name of database
      port: 5439 # Default Redshift port
      table: atomic.events
      username: ADD HERE
      password: ADD HERE
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
      ssl_mode: disable
    #- name: "My PostgreSQL database"
    #  type: postgres
    #  host: ADD HERE # Hostname of database server
    #  database: ADD HERE # Name of database
    #  port: 5432 # Default Postgres port
    #  table: atomic.events
    #  username: ADD HERE
    #  password: ADD HERE
    #  maxerror: # Not required for Postgres
    #  comprows: # Not required for Postgres
    #  ssl_mode: disable
    #- name: "myelasticsearchtarget" # Name for the target - used to label the corresponding jobflow step
    #  type: elasticsearch # Marks the database type as Elasticsearch
    #  host: "ec2-43-1-854-22.compute-1.amazonaws.com" # Elasticsearch host
    #  database: index1 # The Elasticsearch index
    #  port: 9200 # Port used to connect to Elasticsearch
    # table: type1 # The Elasticsearch type
    #  es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
    #  username: # Unnecessary for Elasticsearch
    #  password: # Unnecessary for Elasticsearch
    #  sources: # Leave blank or specify: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
    #  maxerror: # Not required for Elasticsearch
    #  comprows: # Not required for Elasticsearch
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: redplanet # e.g. snowplow
    collector: ec2-52-XX-XXX-XX.ap-southeast-2.compute.amazonaws.com # e.g. d3rkrsqld9gmqf.cloudfront.net

Your help would be greatly appreciated.

Hi @sachinsingh10,

You left out both placement and ec2_subnet_id. At least one of them has to have a value. Which one depends on whether your job will be running in VPC.
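For illustration, the two alternatives look like this (the availability zone and subnet ID below are placeholders, not values from your account):

# Not running in a VPC (EC2-classic): set placement, leave ec2_subnet_id blank
placement: ap-southeast-2a
ec2_subnet_id:

# Running in a VPC: leave placement blank, set ec2_subnet_id
placement:
ec2_subnet_id: subnet-xxxxxxxx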

Please, refer to this wiki page to get help on expected values.

On a side note, could you please (if you are still experiencing the problem) include your code between triple backticks to retain the indentation when posting here? Indentation is significant in YAML files, and it’s hard to track down misconfigured parameters without seeing it.

@ihor thanks for the quick reply.

I did enter the subnet ID but I still get a contract violation.

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: 
  secret_access_key: 
  s3:
    region: ap-southeast-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://ETLEMR/ETLEMR_logs
      raw:
        in: s3n://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3  # Multiple in buckets are permitted
              # e.g. s3://my-in-bucket
        processing: s3://ETLEMR/ETLEMR_in_processing
        archive: s3://ETLEMR/ETLEMR_in_archive    # e.g. s3://my-archive-bucket/in
      enriched:
        good: s3://ETLEMR/ETLEMR_enriched_good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://ETLEMR/ETLEMR_enriched_bad         # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://ETLEMR/ETLEMR_enriched_errors      # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://ETLEMR/ETLEMR_enriched_archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://ETLEMR/ETLEMR_shredded_good     # e.g. s3://my-out-bucket/shredded/good
        bad: s3://ETLEMR/ETLEMR_shredded_bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://ETLEMR/ETLEMR_shredded_errors      # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://ETLEMR/ETLEMR_shredded_archive     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.5.0      # Don't change this
    region: ap-southeast-2  # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-6980dc2f  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: firstec2
    bootstrap: []           #Set this to specify custom boostrap actions. Leave empty otherwise.
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host: ADD HERE # The endpoint as shown in the Redshift console
      database: ADD HERE # Name of database
      port: 5439 # Default Redshift port
      table: atomic.events
      username: ADD HERE
      password: ADD HERE
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
      ssl_mode: disable
    #- name: "My PostgreSQL database"
    #  type: postgres
    #  host: ADD HERE # Hostname of database server
    #  database: ADD HERE # Name of database
    #  port: 5432 # Default Postgres port
    #  table: atomic.events
    #  username: ADD HERE
    #  password: ADD HERE
    #  maxerror: # Not required for Postgres
    #  comprows: # Not required for Postgres
    #  ssl_mode: disable
    #- name: "myelasticsearchtarget" # Name for the target - used to label the corresponding jobflow step
    #  type: elasticsearch # Marks the database type as Elasticsearch
    #  host: "ec2-43-1-854-22.compute-1.amazonaws.com" # Elasticsearch host
    #  database: index1 # The Elasticsearch index
    #  port: 9200 # Port used to connect to Elasticsearch
    # table: type1 # The Elasticsearch type
    #  es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
    #  username: # Unnecessary for Elasticsearch
    #  password: # Unnecessary for Elasticsearch
    #  sources: # Leave blank or specify: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
    #  maxerror: # Not required for Elasticsearch
    #  comprows: # Not required for Elasticsearch
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: redplanet # e.g. snowplow
    collector: ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com # e.g. d3rkrsqld9gmqf.cloudfront.net

New errors:

       F, [2016-06-11T05:41:45.884000 #29655] FATAL -- :

ContractError (Contract violation for return value:
    Expected: {:aws=>{:access_key_id=>String, :secret_access_key=>String, :s3=>{:region=>String, :buckets=>{:assets=>String, :jsonpath_assets=>#<Contracts::Maybe:0x122a10c5 @vals=[String, nil]>, :log=>String, :raw=>{:in=>#<Contracts::ArrayOf:0x7b2fed4 @contract=String>, :processing=>String, :archive=>String}, :enriched=>{:good=>String, :bad=>String, :errors=>#<Contracts::Maybe:0x7878143e @vals=[String, nil]>, :archive=>#<Contracts::Maybe:0x1379303c @vals=[String, nil]>}, :shredded=>{:good=>String, :bad=>String, :errors=>#<Contracts::Maybe:0x794dbd20 @vals=[String, nil]>, :archive=>#<Contracts::Maybe:0x5ebbbe17 @vals=[String, nil]>}}}, :emr=>{:ami_version=>String, :region=>String, :jobflow_role=>String, :service_role=>String, :placement=>#<Contracts::Maybe:0x77bd0897 @vals=[String, nil]>, :ec2_subnet_id=>#<Contracts::Maybe:0x1b7f2eeb @vals=[String, nil]>, :ec2_key_name=>String, :bootstrap=>#<Contracts::Maybe:0x3ae15467 @vals=[#<Contracts::ArrayOf:0x1def1f1e @contract=String>, nil]>, :software=>{:hbase=>#<Contracts::Maybe:0x1f6c5464 @vals=[String, nil]>, :lingual=>#<Contracts::Maybe:0x118815a @vals=[String, nil]>}, :jobflow=>{:master_instance_type=>String, :core_instance_count=>Contracts::Num, :core_instance_type=>String, :task_instance_count=>Contracts::Num, :task_instance_type=>String, :task_instance_bid=>#<Contracts::Maybe:0x6fd43c45 @vals=[Contracts::Num, nil]>}, :additional_info=>#<Contracts::Maybe:0x56c10f5e @vals=[String, nil]>, :bootstrap_failure_tries=>Contracts::Num}}, :collectors=>{:format=>String}, :enrich=>{:job_name=>String, :versions=>{:hadoop_enrich=>String, :hadoop_shred=>String}, :continue_on_unexpected_error=>Contracts::Bool, :output_compression=>#<Proc:0x3a9b7e55@/home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:23 (lambda)>}, :storage=>{:download=>{:folder=>#<Contracts::Maybe:0x2d381822 @vals=[String, nil]>}, :targets=>#<Contracts::ArrayOf:0x3db8775f @contract={:name=>String, :type=>String, :host=>String, :database=>String, :port=>Contracts::Num, :ssl_mode=>#<Contracts::Maybe:0x4bc348d3 @vals=[String, nil]>, :table=>String, :username=>#<Contracts::Maybe:0x2bb0931e @vals=[String, nil]>, :password=>#<Contracts::Maybe:0x3627cf6 @vals=[String, nil]>, :es_nodes_wan_only=>#<Contracts::Maybe:0x561d596c @vals=[Contracts::Bool, nil]>, :maxerror=>#<Contracts::Maybe:0x437f7292 @vals=[Contracts::Num, nil]>, :comprows=>#<Contracts::Maybe:0x59dcd5ec @vals=[Contracts::Num, nil]>}>}, :monitoring=>{:tags=>#<Contracts::HashOf:0x19a51da1 @value=String, @key=Symbol>, :logging=>{:level=>String}, :snowplow=>#<Contracts::Maybe:0x273c422e @vals=[{:method=>String, :collector=>String, :app_id=>String}, nil]>}},
    Actual: {:aws=>{:access_key_id=>"AKIAIUFF5DVIXMMDFVFA", :secret_access_key=>"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", :s3=>{:region=>"ap-southeast-2", :buckets=>{:assets=>"s3://snowplow-hosted-assets", :jsonpath_assets=>nil, :log=>"s3n://ETLEMR/ETLEMR_logs", :raw=>{:in=>"s3n://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3", :processing=>"s3://ETLEMR/ETLEMR_in_processing", :archive=>"s3://ETLEMR/ETLEMR_in_archive"}, :enriched=>{:good=>"s3://ETLEMR/ETLEMR_enriched_good", :bad=>"s3://ETLEMR/ETLEMR_enriched_bad", :errors=>"s3://ETLEMR/ETLEMR_enriched_errors", :archive=>"s3://ETLEMR/ETLEMR_enriched_archive"}, :shredded=>{:good=>"s3://ETLEMR/ETLEMR_shredded_good", :bad=>"s3://ETLEMR/ETLEMR_shredded_bad", :errors=>"s3://ETLEMR/ETLEMR_shredded_errors", :archive=>"s3://ETLEMR/ETLEMR_shredded_archive"}}}, :emr=>{:ami_version=>"4.5.0", :region=>"ap-southeast-2", :jobflow_role=>"EMR_EC2_DefaultRole", :service_role=>"EMR_DefaultRole", :placement=>nil, :ec2_subnet_id=>"subnet-6980dc2f", :ec2_key_name=>"firstec2", :bootstrap=>[], :software=>{:hbase=>nil, :lingual=>nil}, :jobflow=>{:master_instance_type=>"m1.medium", :core_instance_count=>2, :core_instance_type=>"m1.medium", :task_instance_count=>0, :task_instance_type=>"m1.medium", :task_instance_bid=>0.015}, :bootstrap_failure_tries=>3, :additional_info=>nil}}, :collectors=>{:format=>"clj-tomcat"}, :enrich=>{:job_name=>"Snowplow ETL", :versions=>{:hadoop_enrich=>"1.7.0", :hadoop_shred=>"0.9.0", :hadoop_elasticsearch=>"0.1.0"}, :continue_on_unexpected_error=>false, :output_compression=>"NONE"}, :storage=>{:download=>{:folder=>nil}, :targets=>[{:name=>"My Redshift database", :type=>"redshift", :host=>"ADD HERE", :database=>"ADD HERE", :port=>5439, :table=>"atomic.events", :username=>"ADD HERE", :password=>"ADD HERE", :maxerror=>1, :comprows=>200000, :ssl_mode=>"disable"}]}, :monitoring=>{:tags=>{}, :logging=>{:level=>"DEBUG"}, :snowplow=>{:method=>"get", :app_id=>"redplanet", :collector=>"ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com"}}}
    Value guarded in: Snowplow::EmrEtlRunner::Cli::load_config
    With Contract: Maybe, String => Hash
    At: /home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:134 ):
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:69:in `Contract'
    org/jruby/RubyProc.java:271:in `call'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:147:in `failure_callback'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:164:in `common_method_added'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:37:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby3259888274547429362extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

Thanks for your assistance.

Regards
SS10

Hi @sachinsingh10,

The raw:in parameter is expected to be an array of string values, as there could be a multitude of data sources. Could you change the current

raw:
     in: s3://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3

to

raw:
     in:
          - s3://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3

and try again.

Note the “-” in front of the bucket name, which denotes an array element (one element in the array in your case).

–Ihor

@ihor
Thanks for your help. I am still struggling with EMR.

Now the furthest I have got is this:


$ ./snowplow-emr-etl-runner -d --config config/config.yml --resolver resolver.json
D, [2016-06-14T07:08:37.950000 #11020] DEBUG -- : Staging raw logs...
F, [2016-06-14T07:08:44.052000 #11020] FATAL -- :

Snowplow::EmrEtlRunner::DirectoryNotEmptyError (Should not stage files for enrichment, processing bucket s3://sachinetlemr/inprocessing/ is not empty):
    /home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/s3_tasks.rb:122:in `stage_logs_for_emr'
    /home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:51:in `run'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby5328142271144796247extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

It copies one log file into the "in processing" folder.

But after changing to
master_instance_type: m3.xlarge
core_instance_count: 2
core_instance_type: m3.xlarge
task_instance_count: 0 # Increase to use spot instances
task_instance_type: m3.xlarge

The result is -

$ ./snowplow-emr-etl-runner -d --config config/config.yml --resolver resolver.json
D, [2016-06-14T07:18:03.653000 #11116] DEBUG -- : Staging raw logs...
moving files from s3://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3/ to s3://sachinetlemr/inprocessing/

Then it just comes back to the $ prompt.

Now I have researched this issue, and the only thing that looks off is that my logs bucket is in ap-southeast-2 while the other buckets are in the US. Does that matter?

Any help would be appreciated.

PS: On the EMR console I have 2 instances of the job terminated with errors.

Regards
SS10

@sachinsingh10,

The above error indicates that the processing bucket is not empty.

When you launch EmrEtlRunner, it first checks that the following buckets are empty:

  • raw:processing
  • enriched:good
  • shredded:good

A non-empty bucket is an indication that the previous run failed, as a successful EMR job would have moved the files from raw:processing to raw:archive. The good buckets, on the other hand, are taken care of by StorageLoader after the data has been loaded into the relevant storage.

Please, refer to the “Batch Pipeline Steps” diagram to understand the flow.

In your scenario, you would have to either delete the files in the processing bucket (if they are not required) or rerun the job with the --skip staging option. Again, refer to the link provided above.
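For example, a rerun that skips staging and processes whatever is already in raw:processing would look like this (same config and resolver paths as in your earlier command):

./snowplow-emr-etl-runner --skip staging -d --config config/config.yml --resolver resolver.json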

On a side note, if you see that your logs contain only the staging step (the only record), it could indicate that the events file moved from the collector is empty, i.e. no events have been captured by the collector.

Hopefully this helps.

–Ihor

@ihor

Thanks for your help. You were correct: the files were not capturing data because the pages were served over HTTPS.

I am stuck here now and can’t seem to figure out how to debug the [VALIDATION_ERROR].

$ ./snowplow-emr-etl-runner --skip staging -d --config config/config.yml --resolver resolver.json
D, [2016-06-15T06:19:05.979000 #14063] DEBUG -- : Initializing EMR jobflow
D, [2016-06-15T06:19:17.179000 #14063] DEBUG -- : EMR jobflow j-33MI2GZ11XARH started, waiting for jobflow to complete...
I, [2016-06-15T06:19:17.193000 #14063] INFO -- : SnowplowTracker::Emitter initialized with endpoint http://ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com:80/i
I, [2016-06-15T06:19:20.441000 #14063] INFO -- : Attempting to send 1 request
I, [2016-06-15T06:19:20.443000 #14063] INFO -- : Sending GET request to http://ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com:80/i
I, [2016-06-15T06:19:20.471000 #14063] INFO -- : GET request to http://ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com:80/i finished with status code 200
I, [2016-06-15T06:21:30.332000 #14063] INFO -- : Attempting to send 1 request
I, [2016-06-15T06:21:30.334000 #14063] INFO -- : Sending GET request to http://ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com:80/i
I, [2016-06-15T06:21:30.365000 #14063] INFO -- : GET request to http://ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com:80/i finished with status code 200
F, [2016-06-15T06:21:33.466000 #14063] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-33MI2GZ11XARH failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.

Snowplow ETL: TERMINATED_WITH_ERRORS [VALIDATION_ERROR]

~ elapsed time n/a [ - 2016-06-15 06:19:41 UTC]

    1. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity Scalding Step: Enrich Raw Events: CANCELLED ~ elapsed time n/a [ - ]
    1. Elasticity Setup Hadoop Debugging: CANCELLED ~ elapsed time n/a [ - ]):
      /home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:471:in `run'
      /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
      /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
      /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
      /home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:68:in `run'
      /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
      /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
      /home/ec2-user/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
      file:/home/ec2-user/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `(root)'
      org/jruby/RubyKernel.java:1091:in `load'
      file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
      org/jruby/RubyKernel.java:1072:in `require'
      file:/home/ec2-user/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
      /tmp/jruby2837279731794889989extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

Latest YAML

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXXXXXXXXX
  secret_access_key: XXXXXXXXX
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://sachinetlemr/logs/
      raw:
        in:
          - s3://elasticbeanstalk-ap-southeast-2-098002129817/resources/environments/logs/publish/e-m3cepm4223/i-7232cba3  # Multiple in buckets are permitted # e.g. s3://my-in-bucket
        processing: s3://sachinetlemr/inprocessing/
        archive: s3://sachinetlemr/inarchive/    # e.g. s3://my-archive-bucket/in
      enriched:
        good: s3://sachinetlemr/enrichedgood/       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://sachinetlemr/enrichedbad/         # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://sachinetlemr/enrichederrors/      # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://sachinetlemr/enrichedarchive/    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://sachinetlemr/shreddedgood/     # e.g. s3://my-out-bucket/shredded/good
        bad: s3://sachinetlemr/shreddedbad/        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://sachinetlemr/shreddederrors/      # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://sachinetlemr/shreddedarchive/     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.5.0      # Don't change this
    region: us-east-1  # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-6980dc2f  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: firstec2
    bootstrap: []           #Set this to specify custom boostrap actions. Leave empty otherwise.
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host: my-snowlow-test.coywuakogqe6.ap-southeast-2.redshift.amazonaws.com:5439 # The endpoint as shown in the Redshift console
      database: snowplow # Name of database
      port: 5439 # Default Redshift port
      table: atomic.events
      username: XXXX
      password: XXXX
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
      ssl_mode: disable
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: redplanet # e.g. snowplow
    collector: ec2-52-63-114-56.ap-southeast-2.compute.amazonaws.com # e.g. d3rkrsqld9gmqf.cloudfront.net

Regards
SS10

@ihor
Please ignore the above comment. The EMR job is running! I will keep you posted.

@ihor
Big thanks for your help.

I am making progress, albeit slowly :).

Now it seems that the EMR process is failing at the "Elasticity S3DistCp Step: Enriched HDFS -> S3" step. I have the log of the error here:

2016-06-20 04:28:44,873 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: --src hdfs:///local/snowplow/enriched-events/ --dest s3://sachinetlemr/enrichedgood/run=2016-06-20-04-21-56/ --srcPattern .*part-.* --s3Endpoint s3.amazonaws.com
2016-06-20 04:28:45,206 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --src hdfs:///local/snowplow/enriched-events/ --dest s3://sachinetlemr/enrichedgood/run=2016-06-20-04-21-56/ --srcPattern .*part-.* --s3Endpoint s3.amazonaws.com
2016-06-20 04:28:47,145 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2016-06-20 04:28:47,384 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1466396745373
2016-06-20 04:28:47,384 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-3HEI77KHN3XQL:i-0addb688:RunJar:06616 period:60 /mnt/var/em/raw/i-0addb688_20160620_RunJar_06616_raw.bin
2016-06-20 04:28:49,488 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Using output path 'hdfs:/tmp/200eb0ac-9edf-499b-8491-312f2b7781ce/output'
2016-06-20 04:28:49,523 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Created 0 files to copy 0 files
2016-06-20 04:28:49,539 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Reducer number: 7
2016-06-20 04:28:49,629 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-34-82.ap-southeast-2.compute.internal/172.31.34.82:8032
2016-06-20 04:28:50,084 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1466396736142_0004
2016-06-20 04:28:50,087 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Try to recursively delete hdfs:/tmp/200eb0ac-9edf-499b-8491-312f2b7781ce/tempspace


2016-06-20T04:28:42.351Z INFO Ensure step 3 jar file /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar
2016-06-20T04:28:42.351Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src hdfs:///local/snowplow/enriched-events/ --dest s3://sachinetlemr/enrichedgood/run=2016-06-20-04-21-56/ --srcPattern .*part-.* --s3Endpoint s3.amazonaws.com
INFO Environment:
TERM=linux
CONSOLETYPE=serial
SHLVL=5
JAVA_HOME=/etc/alternatives/jre
HADOOP_IDENT_STRING=hadoop
LANGSH_SOURCED=1
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
HADOOP_ROOT_LOGGER=INFO,DRFA
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
UPSTART_JOB=rc
MAIL=/var/spool/mail/hadoop
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
PWD=/
HOSTNAME=ip-172-31-34-82
LESS_TERMCAP_se=[0m
LOGNAME=hadoop
UPSTART_INSTANCE=
AWS_PATH=/opt/aws
LESS_TERMCAP_mb=[01;31m
_=/etc/alternatives/jre/bin/java
LESS_TERMCAP_me=[0m
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
LESS_TERMCAP_md=[01;38;5;208m
runlevel=3
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_EVENTS=runlevel
HISTSIZE=1000
previous=N
HADOOP_LOGFILE=syslog
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
EC2_HOME=/opt/aws/apitools/ec2
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-M0RH64H9KH12
LESS_TERMCAP_ue=[0m
AWS_ELB_HOME=/opt/aws/apitools/elb
RUNLEVEL=3
USER=hadoop
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-M0RH64H9KH12/tmp
PREVLEVEL=N
HOME=/home/hadoop
HISTCONTROL=ignoredups
LESSOPEN=||/usr/bin/lesspipe.sh %s
AWS_DEFAULT_REGION=ap-southeast-2
LANG=en_US.UTF-8
LESS_TERMCAP_us=[04;38;5;111m
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-M0RH64H9KH12/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-M0RH64H9KH12/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-M0RH64H9KH12
INFO ProcessRunner started child process 6616 :
hadoop 6616 2394 0 04:28 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src hdfs:///local/snowplow/enriched-events/ --dest s3://sachinetlemr/enrichedgood/run=2016-06-20-04-21-56/ --srcPattern .*part-.* --s3Endpoint s3.amazonaws.com
2016-06-20T04:28:46.368Z INFO HadoopJarStepRunner.Runner: startRun() called for s-M0RH64H9KH12 Child Pid: 6616
INFO Synchronously wait child process to complete : hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3…
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3…
INFO total process run time: 8 seconds
2016-06-20T04:28:52.499Z INFO Step created jobs:
2016-06-20T04:28:52.499Z WARN Step failed with exitCode 1 and took 8 seconds


Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:927)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:720)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-34-82.ap-southeast-2.compute.internal:8020/tmp/200eb0ac-9edf-499b-8491-312f2b7781ce/files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:317)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:901)
… 10 more


Any help would be appreciated, @ihor.

Hi @sachinsingh10,

It appears your EMR cluster might have crashed. You can try rerunning the job with the --skip staging option, but first make sure you have emptied (deleted the contents of) the enriched:good bucket. Refer to the diagram I mentioned earlier: the job failed at step #4.
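A minimal sketch of that clean-up, assuming you have the AWS CLI available and using the enriched:good path from your config:

# delete the partial output left by the failed run
aws s3 rm s3://sachinetlemr/enrichedgood/ --recursive

# then rerun, skipping staging since the raw files are already in raw:processing
./snowplow-emr-etl-runner --skip staging -d --config config/config.yml --resolver resolver.json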

By the way, your earlier error message "Snowplow ETL: TERMINATED_WITH_ERRORS [VALIDATION_ERROR]" would indicate that the limit of available instances has been reached for your AWS account. As a reminder, the number of instances used to bring up the EMR cluster is determined by the values you specify under the jobflow section of config.yml (in your case, 1 master plus 2 core instances and no task instances).

Regards,
Ihor

Hi,

Sorry, but I am seeing the same error and haven’t been able to resolve it. Please see below. Could you help me?

root@ip-10-31-70-157:~# ./snowplow-emr-etl-runner --skip staging,archive_raw --config config/config.yml
F, [2016-07-07T20:46:05.166000 #3145] FATAL -- : 

ContractError (Contract violation for return value:
    Expected: {:aws=>{:access_key_id=>String, :secret_access_key=>String, :s3=>{:region=>String, :buckets=>{:assets=>String, :jsonpath_assets=>#<Contracts::Maybe:0x1ef93e01 @vals=[String, nil]>, :log=>String, :raw=>{:in=>#<Contracts::ArrayOf:0x4fafd27e @contract=String>, :processing=>String, :archive=>String}, :enriched=>{:good=>String, :bad=>String, :errors=>#<Contracts::Maybe:0x6cd66f6a @vals=[String, nil]>, :archive=>#<Contracts::Maybe:0xd512c1 @vals=[String, nil]>}, :shredded=>{:good=>String, :bad=>String, :errors=>#<Contracts::Maybe:0x578b2dec @vals=[String, nil]>, :archive=>#<Contracts::Maybe:0x66863941 @vals=[String, nil]>}}}, :emr=>{:ami_version=>String, :region=>String, :jobflow_role=>String, :service_role=>String, :placement=>#<Contracts::Maybe:0x39f4a7c4 @vals=[String, nil]>, :ec2_subnet_id=>#<Contracts::Maybe:0x111fe921 @vals=[String, nil]>, :ec2_key_name=>String, :bootstrap=>#<Contracts::Maybe:0x1ff542a3 @vals=[#<Contracts::ArrayOf:0x38848217 @contract=String>, nil]>, :software=>{:hbase=>#<Contracts::Maybe:0x48ee3c2d @vals=[String, nil]>, :lingual=>#<Contracts::Maybe:0x54387873 @vals=[String, nil]>}, :jobflow=>{:master_instance_type=>String, :core_instance_count=>Contracts::Num, :core_instance_type=>String, :task_instance_count=>Contracts::Num, :task_instance_type=>String, :task_instance_bid=>#<Contracts::Maybe:0x3a80c534 @vals=[Contracts::Num, nil]>}, :additional_info=>#<Contracts::Maybe:0xfd5689d @vals=[String, nil]>, :bootstrap_failure_tries=>Contracts::Num}}, :collectors=>{:format=>String}, :enrich=>{:job_name=>String, :versions=>{:hadoop_enrich=>String, :hadoop_shred=>String}, :continue_on_unexpected_error=>Contracts::Bool, :output_compression=>#<Proc:0x2430cf17@/root/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:23 (lambda)>}, :storage=>{:download=>{:folder=>#<Contracts::Maybe:0x218f2f51 @vals=[String, nil]>}, :targets=>#<Contracts::ArrayOf:0x1d9af731 @contract={:name=>String, :type=>String, :host=>String, :database=>String, :port=>Contracts::Num, :ssl_mode=>#<Contracts::Maybe:0x189633f2 @vals=[String, nil]>, :table=>String, :username=>#<Contracts::Maybe:0x2b974137 @vals=[String, nil]>, :password=>#<Contracts::Maybe:0x5d22604e @vals=[String, nil]>, :es_nodes_wan_only=>#<Contracts::Maybe:0x13374ca6 @vals=[Contracts::Bool, nil]>, :maxerror=>#<Contracts::Maybe:0x3f1d6a13 @vals=[Contracts::Num, nil]>, :comprows=>#<Contracts::Maybe:0x55c46ec1 @vals=[Contracts::Num, nil]>}>}, :monitoring=>{:tags=>#<Contracts::HashOf:0x39afe59f @key=Symbol, @value=String>, :logging=>{:level=>String}, :snowplow=>#<Contracts::Maybe:0x16f34376 @vals=[{:method=>String, :collector=>String, :app_id=>String}, nil]>}},
Value guarded in: Snowplow::EmrEtlRunner::Cli::load_config
    With Contract: Maybe, String => Hash
    At: /root/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:134 ):
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:69:in `Contract'
    org/jruby/RubyProc.java:271:in `call'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:147:in `failure_callback'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:164:in `common_method_added'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/root/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:37:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/root/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/root/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby6329844848439325027extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

Here is my config.yml:

aws:                                                                                                                                                                                                                                         
  # Credentials can be hardcoded or set in environment variables - user snowplow_instaler                                                                                                                                                    
  access_key_id: XXXXXXXXXXXXXXXXXXXXXXXX                                                                                                                                                                                                    
  secret_access_key: XXXXXXXXXXXXXXXXXXX                                                                                                                                                                                                     
  s3:                                                                                                                                                                                                                                        
    region: us-east-1                                                                                                                                                                                                                        
    buckets:                                                                                                                                                                                                                                 
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket                                                                                                                
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here                                                                                                   
      log: s3://emretlsnowplow/log/                                                                                                                                                                                                          
      raw:                                                                                                                                                                                                                                   
        in:                                                                                                                                                                                                                                  
          - s3://elasticbeanstalk-uXXXXXXXXXXXXXXXXXXX/resources/environments/logs/publish/XXXXXX/i-eXXXXXXX           # e.g. s3://my-in-bucket                                                                                              
          -                                                                                                                                                                                                                                  
        processing: s3://emretlraw/proc/                                                                                                                                                                                                     
        archive: s3://emretlrawarch/raw/    # e.g. s3://my-archive-bucket/raw                                                                                                                                                                
      enriched:                                                                                                                                                                                                                              
        good: s3://emretl/enriched/good/       # e.g. s3://my-out-bucket/enriched/good                                                                                                                                                       
        bad: s3://emretl/enriched/bad/        # e.g. s3://my-out-bucket/enriched/bad                                                                                                                                                         
        errors: s3://emretl/enriched/errors/    # Leave blank unless :continue_on_unexpected_error: set to true below                                                                                                                        
        archive: s3://emretl/enriched/    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched                                                                                                                        
      shredded:                                                                                                                                                                                                                              
        good: s3://emretl/shredded/good/       # e.g. s3://my-out-bucket/shredded/good                                                                                                                                                       
        bad: s3://emretl/shredded/bad/        # e.g. s3://my-out-bucket/shredded/bad                                                                                                                                                         
        errors: s3://emretl/shredded/errors/     # Leave blank unless :continue_on_unexpected_error: set to true below                                                                                                                       
        archive: s3://emretl/shredded/    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded                                                                                                                        
  emr:                                                                                                                                                                                                                                       
    ami_version: 4.5.0                                                                                                                                                                                                                       
    region: us-east-1        # Always set this                                                                                                                                                                                               
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles                                                                                                                                                         
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles                                                                                                                                                         
    placement:      # Set this if not running in VPC. Leave blank otherwise                                                                                                                                                                  
    ec2_subnet_id: subnet-4e46c621 # Set this if running in VPC. Leave blank otherwise                                                                                                                                                       
    ec2_key_name: XXXXXXXXX                                                                                                                                                                                                                  
    bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host:  # The endpoint as shown in the Redshift console
      database:  # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: 
      password: 
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow
    collector: clojXXXXurecXXollecXtor-eXnv.us-east-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net

Hi @juliopim,

I wonder if the empty item in raw:in causes this problem.

Could you remove the 2nd dash (-) and try again?
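In other words, the raw: section would become (your bucket path kept exactly as you posted it):

raw:
  in:
    - s3://elasticbeanstalk-uXXXXXXXXXXXXXXXXXXX/resources/environments/logs/publish/XXXXXX/i-eXXXXXXX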

Regards,
Ihor

I removed the 2nd dash (-), but the same error shows up.

Hi @juliopim,

You have a missing value in the targets: section. If you do not intend to use Redshift at the moment, please replace that (whole) section with targets: []. That is:

storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets: []

–Ihor

Ah ok, it’s running now.

Thanks

@ihor

Thank you for the continued support. I was finally able to get EmrEtlRunner running successfully.

Key learnings:

  • Spend the extra few hours/minutes on the YAML configuration; most errors stem from formatting mistakes.
  • Understand the Batch Pipeline Steps before the first execution; this greatly helps in debugging.

Thanks @ihor for the responsiveness.

Regards
SS10
