ReturnContractError when running EmrEtlRunner


#1

For the following command

./snowplow-emr-etl-runner run --config config.yml --resolver snowplow/3-enrich/config/iglu_resolver.json --enrichments snowplow/3-enrich/config/enrichments/ --targets snowplow/4-storage/config/targets/ --skip staging

I am getting

ReturnContractError (Contract violation for return value:
Expected: {:aws=>{:access_key_id=>String, :secret_access_key=>String, :s3=>{:region=>String, :buckets=>{:assets=>String, :jsonpath_assets=>(String or nil), :log=>String, :raw=>{:in=>(a collection Array of String), :processing=>String, :archive=>String}, :enriched=>{:good=>String, :bad=>String, :errors=>(String or nil), :archive=>(String or nil)}, :shredded=>{:good=>String, :bad=>String, :errors=>(String or nil), :archive=>(String or nil)}}}, :emr=>{:ami_version=>String, :region=>String, :jobflow_role=>String, :service_role=>String, :placement=>(String or nil), :ec2_subnet_id=>(String or nil), :ec2_key_name=>String, :bootstrap=>((a collection Array of String) or nil), :software=>{:hbase=>(String or nil), :lingual=>(String or nil)}, :jobflow=>{:master_instance_type=>String, :core_instance_count=>Num, :core_instance_type=>String, :core_instance_ebs=>#<Contracts::Maybe:0x5ccc971e @vals=[{:volume_size=>#<Proc:0x1b6683c4@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:26 (lambda)>, :volume_type=>#<Proc:0x69abeb14@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:25 (lambda)>, :volume_iops=>#<Contracts::Maybe:0x7db2b614 @vals=[#<Proc:0x1b6683c4@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:26 (lambda)>, nil]>, :ebs_optimized=>#<Contracts::Maybe:0x2c1a95a2 @vals=[Contracts::Bool, nil]>}, nil]>, :task_instance_count=>Num, :task_instance_type=>String, :task_instance_bid=>(Num or nil)}, :additional_info=>(String or nil), :bootstrap_failure_tries=>Num}}, :collectors=>{:format=>String}, :enrich=>{:job_name=>String, :versions=>{:hadoop_enrich=>String, :hadoop_shred=>String}, :continue_on_unexpected_error=>Bool, :output_compression=>#<Proc:0x5a07ae2f@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:24 (lambda)>}, :storage=>{:download=>{:folder=>(String or nil)}}, :monitoring=>{:tags=>(Hash<Symbol, String>), :logging=>{:level=>String}, :snowplow=>({:method=>String, :collector=>String, :app_id=>String} or nil)}},
Actual: {:aws=>{:access_key_id=>"", :secret_access_key=>"", :s3=>{:region=>"us-west-2", :buckets=>{:assets=>"s3://snowplow-hosted-assets", :jsonpath_assets=>"", :log=>"s3n://archivebucketsl/logs/", :raw=>{:in=>["s3n://emrrawstore/raw2017/10/25/11/"], :processing=>"s3n://stream-snowplow/Processing", :archive=>"s3://archivebucketsl/raw"}, :enriched=>{:good=>"s3://dataenrich/enriched/good", :bad=>"s3://dataenrich/enriched/bad", :errors=>"s3://dataenrich/enriched/errors", :archive=>"s3://dataenrich/enriched/archive"}, :shredded=>{:good=>"s3://dataenrich/shredded/good", :bad=>"s3://dataenrich/shredded/bad", :errors=>"s3://dataenrich/shredded/errors", :archive=>"s3://dataenrich/shredded/archive"}}}, :emr=>{:job_name=>"Snowplow_ETL", :ami_version=>"5.5.0", :region=>"us-west-2", :jobflow_role=>"EMR_EC2_DefaultRole", :service_role=>"EMR_DefaultRole", :placement=>"us-west-2a", :ec2_subnet_id=>nil, :ec2_key_name=>"", :bootstrap=>[], :software=>{:hbase=>"0.92.0", :lingual=>"1.1"}, :jobflow=>{:master_instance_type=>"m1.medium", :core_instance_count=>2, :core_instance_type=>"m1.medium", :core_instance_ebs=>{:volume_size=>100, :volume_type=>"gp2", :volume_iops=>400, :ebs_optimized=>false}, :task_instance_count=>0, :task_instance_type=>"m1.medium", :task_instance_bid=>0.015}, :bootstrap_failure_tries=>3, :configuration=>{:"yarn-site"=>{:"yarn.resourcemanager.am.max-attempts"=>"1"}, :spark=>{:maximizeResourceAllocation=>"true"}}, :additional_info=>nil}}, :collectors=>{:format=>"thrift"}, :enrich=>{:versions=>{:spark_enrich=>"1.9.0"}, :continue_on_unexpected_error=>false, :output_compression=>"NONE"}, :storage=>{:versions=>{:rdb_loader=>"0.12.0", :rdb_shredder=>"0.12.0", :hadoop_elasticsearch=>"0.1.0"}}, :monitoring=>{:tags=>{}, :logging=>{:level=>"DEBUG"}}}
Value guarded in: Snowplow::EmrEtlRunner::Cli::load_config
With Contract: Maybe, String => Hash

my config.yml
aws:
  access_key_id: ******************
  secret_access_key: *********************
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://archivebucketsl/logs/
      raw:
        in:
          - "s3n://elasticbeanstalk-us-west-2-/resources/environments/logs/publish/e-td2pe7ge4w/****/" # e.g. s3://my-in-bucket
        processing: s3n://stream-snowplow/Processing
        archive: s3://archivebucketsl/raw # e.g. s3://my-archive-bucket/in
      enriched:
        good: s3://dataenrich/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://dataenrich/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://dataenrich/enriched/errors # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://dataenrich/enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://dataenrich/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://dataenrich/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://dataenrich/shredded/errors # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://dataenrich/shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    job_name: Snowplow_ETL # Give your job a name
    ami_version: 5.5.0 # Don't change this
    region: us-west-2 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using aws emr create-default-roles
    placement: "us-west-2a" # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: *****
    bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase: "0.92.0" # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: "1.1" # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Spark cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production

my redshift.json

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "snowplow..us-west-2.redshift.amazonaws.com",
    "database": "",
    "port": 5439,
    "sslMode": "DISABLE",
    "username": "*",
    "password": "",
    "roleArn": "arn:aws:iam::******:role/RedshiftCopyUnload",
    "schema": "atomic",
    "maxError": 1,
    "compRows": 20000,
    "purpose": "ENRICHED_EVENTS"
  }
}

Please help me work out what the issue is here.


#3

@jayeeta-datta-narvar,

You are using conflicting versions of EmrEtlRunner and the configuration file format. It appears that the EmrEtlRunner you are running is for Snowplow R88, but the configuration file is for Snowplow R92.

If you mean to use the latest version (R92), please download the corresponding EmrEtlRunner binary. Additionally, I can see you are using rdb_loader 0.12.0; replace it with the latest, 0.13.0.
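
One way to see the mismatch for yourself is to compare the key paths in the Expected and Actual hashes from the error message. The following is only a rough Python sketch, not part of Snowplow: the `key_paths` helper is hypothetical, and the two dicts are hand-abbreviated copies of just the enrich and storage sections from the error above.

```python
def key_paths(d, prefix=""):
    """Return the set of dotted key paths found in a nested dict."""
    paths = set()
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        paths.add(path)
        if isinstance(value, dict):
            paths |= key_paths(value, path)
    return paths


# Abbreviated by hand from the "Expected" hash (what the runner's contract wants).
expected = {
    "enrich": {
        "job_name": str,
        "versions": {"hadoop_enrich": str, "hadoop_shred": str},
        "continue_on_unexpected_error": bool,
        "output_compression": str,
    },
    "storage": {"download": {"folder": str}},
}

# Abbreviated by hand from the "Actual" hash (what config.yml provided).
actual = {
    "enrich": {
        "versions": {"spark_enrich": "1.9.0"},
        "continue_on_unexpected_error": False,
        "output_compression": "NONE",
    },
    "storage": {
        "versions": {
            "rdb_loader": "0.12.0",
            "rdb_shredder": "0.12.0",
            "hadoop_elasticsearch": "0.1.0",
        }
    },
}

# Keys the runner expects but the config does not supply, and vice versa.
missing = sorted(key_paths(expected) - key_paths(actual))
unexpected = sorted(key_paths(actual) - key_paths(expected))

print("expected by the runner but missing from config:", missing)
print("in config but not in the runner's contract:", unexpected)
```

The first list (enrich.job_name, the hadoop_* versions, storage.download) shows what the runner's contract asks for, while the second (spark_enrich, storage.versions) shows what the config file provides instead, which is why the two do not line up.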


#4

You might want to remove/rotate the AWS access key and secret key, as you’ve included them in this post.