Upgrade EmrEtlRunner to use Spark-enrich


#1

We are using an older version of EMrEtlRunner, and would like to upgrade it to use Spark-enrich/ rdb_shredder/ rdb_loader. Is there upgrade documentation(s) available?

Thanks.


#2

Hi @RichardJ,

I believe easiest way would be to use Snowplow version matrix to find your current and target versions and then follow links to each blog post’s “Upgrade” section for each version inbetween.


#3

Not sure how to use your matrix, but in the old version we use ( ami_version: 5.9.0 , hadoop_enrich: 1.8.0, hadoop_shred: 0.10.0, hadoop_elasticsearch: 0.1.0 ), now we want to use Release 89 Plain of Jars (2017-06-12).

I downloaded snowplow_emr_r89_plain_of_jars.zip, and followed config.yml.sample on https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample. But got the following error -

ReturnContractError (Contract violation for return value:
        Expected: {:aws=>{:access_key_id=>String, :secret_access_key=>String, :s3=>{:region=>String, :buckets=>{:assets=>String, ......

My config.yml (based on config.yml.sample) is:

aws:
  access_key_id: iam
  secret_access_key: iam
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets: s3://<%= ENV['S3BUCKET'] %>/jsonpaths
      log: s3n://<%= ENV['S3BUCKET'] %>/etl/logs
      raw:
        in: ["s3n://<%= ENV['S3BUCKET'] %>/raw/"]
        processing: s3://<%= ENV['S3BUCKET'] %>/etl/processing
        archive: s3://<%= ENV['S3BUCKET'] %>/archive/raw
      enriched:
        good: s3://<%= ENV['S3BUCKET'] %>/enriched/good
        bad: s3://<%= ENV['S3BUCKET'] %>/enriched/bad
        errors: s3://<%= ENV['S3BUCKET'] %>/enriched/errors
        archive: s3://<%= ENV['S3BUCKET'] %>/enriched/archive
      shredded:
        good: s3://<%= ENV['S3BUCKET'] %>/shredded/good
        bad: s3://<%= ENV['S3BUCKET'] %>/shredded/bad
        errors: s3://<%= ENV['S3BUCKET'] %>/shredded/errors
        archive: s3://<%= ENV['S3BUCKET'] %>/shredded/archive
  emr:
    ami_version: 5.9.0
    region: us-west-2
    jobflow_role: arn:aws:iam::330838374348:instance-profile/sp-EMRInstanceProfile-1WK1SCGI04EKJ
    service_role: arn:aws:iam::330838374348:role/sp-EMRServiceRole-1F0EVLTF8LOT9
    placement:
    ec2_subnet_id: subnet-1111180f
    ec2_key_name: emr-dev
    bootstrap: []
    software:
      hbase:
      lingual:
    jobflow:
      job_name: clickstream-jwn-<%= ENV['SNOWPLOW_ENV'] %>
      master_instance_type: <%= ENV['SNOWPLOW_EMR_MASTER_TYPE'] %>
      core_instance_count: <%= ENV['SNOWPLOW_EMR_CORE_NUM'] %>
      core_instance_type: <%= ENV['SNOWPLOW_EMR_CORE_TYPE'] %>
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: thrift
enrich:
  versions:
    spark_enrich: 1.10.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  version:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: { "name" : "EmrEtlRunner" } # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    collector: <%= ENV['SNOWPLOW_COLLECTOR'] %> # e.g. d3rkrsqld9gmqf.cloudfront.net
    app_id: 'sp.batch' # e.g. snowplow

#4

The indentation of the config.yml looks not good, but if you click on teh edit button, it looks pretty good there.


#5

@RichardJ,

The upgrade guide is here: https://github.com/snowplow/snowplow/wiki/Upgrade-Guide

You surely have a mismatch between the configuration file and the version of EmrEtlRunner you use.

It’s hard to tell what is exactly wrong without the full error message. Could you provide it, please?

By examining the “Expected” contract and the “Actual” you should be able to figure out what is wrong with your configuration file. There’s a distinguishing difference between different versions of EmrEtlRunner (and thus configuration file). They are not compatible and you have to follow the appropriate syntax in your config.yml for the version of EmrEtlRunner you use.


#6

Hi IHor,

Thanks for your response. I followed your link https://github.com/snowplow/snowplow/wiki/Upgrade-Guide, and config.yml.sample from the link, but still got the error. The full error message is as below:

[ec2-user@ip-172-16-34-249 ~] ./snowplow-emr-etl-runner --config {RUNNER_CONFIG} --resolver {RUNNER_RESOLVER} --enrichments {RUNNER_ENRICHMENTS} --skip staging
F, [2017-12-11T21:24:48.525000 #2811] FATAL – :

ReturnContractError (Contract violation for return value:
Expected: {:aws=>{:access_key_id=>String, :secret_access_key=>String, :s3=>{:region=>String, :buckets=>{:assets=>String, :jsonpath_assets=>(String or nil), :log=>String, :raw=>{:in=>(a collection Array of String), :processing=>String, :archive=>String}, :enriched=>{:good=>String, :bad=>String, :errors=>(String or nil), :archive=>(String or nil)}, :shredded=>{:good=>String, :bad=>String, :errors=>(String or nil), :archive=>(String or nil)}}}, :emr=>{:ami_version=>String, :region=>String, :jobflow_role=>String, :service_role=>String, :placement=>(String or nil), :ec2_subnet_id=>(String or nil), :ec2_key_name=>String, :bootstrap=>((a collection Array of String) or nil), :software=>{:hbase=>(String or nil), :lingual=>(String or nil)}, :jobflow=>{:job_name=>String, :master_instance_type=>String, :core_instance_count=>Num, :core_instance_type=>String, :core_instance_ebs=>#<Contracts::Maybe:0x19ba0a49 @vals=[{:volume_size=>#<Proc:0x41e8505a@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:26 (lambda)>, :volume_type=>#<Proc:0x7af36683@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:25 (lambda)>, :volume_iops=>#<Contracts::Maybe:0x1cf6185 @vals=[#<Proc:0x41e8505a@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:26 (lambda)>, nil]>, :ebs_optimized=>#<Contracts::Maybe:0x5bde3f2 @vals=[Contracts::Bool, nil]>}, nil]>, :task_instance_count=>Num, :task_instance_type=>String, :task_instance_bid=>(Num or nil)}, :additional_info=>(String or nil), :bootstrap_failure_tries=>Num}}, :collectors=>{:format=>String}, :enrich=>{:versions=>{:spark_enrich=>String}, :continue_on_unexpected_error=>Bool, :output_compression=>#<Proc:0x10dc65c0@uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/contracts.rb:24 (lambda)>}, :storage=>{:versions=>{:rdb_shredder=>String, :hadoop_elasticsearch=>String}, :download=>{:folder=>(String or nil)}}, :monitoring=>{:tags=>(Hash<Symbol, String>), :logging=>{:level=>String}, :snowplow=>({:method=>String, :collector=>String, :app_id=>String} or nil)}},
Actual: {:aws=>{:access_key_id=>“iam”, :secret_access_key=>“iam”, :s3=>{:region=>“us-west-2”, :buckets=>{:assets=>“s3://snowplow-hosted-assets”, :jsonpath_assets=>“s3://jwn-snowplow-nonprod/jsonpaths”, :log=>“s3n://jwn-snowplow-nonprod/etl/logs”, :raw=>{:in=>[“s3n://jwn-snowplow-nonprod/raw/”], :processing=>“s3://jwn-snowplow-nonprod/etl/processing”, :archive=>“s3://jwn-snowplow-nonprod/archive/raw”}, :enriched=>{:good=>“s3://jwn-snowplow-nonprod/enriched/good”, :bad=>“s3://jwn-snowplow-nonprod/enriched/bad”, :errors=>“s3://jwn-snowplow-nonprod/enriched/errors”, :archive=>“s3://jwn-snowplow-nonprod/enriched/archive”}, :shredded=>{:good=>“s3://jwn-snowplow-nonprod/shredded/good”, :bad=>“s3://jwn-snowplow-nonprod/shredded/bad”, :errors=>“s3://jwn-snowplow-nonprod/shredded/errors”, :archive=>“s3://jwn-snowplow-nonprod/shredded/archive”}}}, :emr=>{:ami_version=>“5.5.0”, :region=>“us-west-2”, :jobflow_role=>“arn:aws:iam::330838374348:instance-profile/sp-dev-emr-EMRInstanceProfile-1WK1SCGI04EKJ”, :service_role=>“arn:aws:iam::330838374348:role/sp-dev-emr-EMRServiceRole-1F0EVLTF8LOT9”, :placement=>nil, :ec2_subnet_id=>“subnet-7925180f”, :ec2_key_name=>“sp-emr-dev”, :bootstrap=>[], :software=>{:hbase=>nil, :lingual=>nil}, :jobflow=>{:job_name=>“clickstream-jwn-nonprod”, :master_instance_type=>“m4.xlarge”, :core_instance_count=>3, :core_instance_type=>“i2.2xlarge”, :core_instance_ebs=>{:volume_size=>100, :volume_type=>“gp2”, :volume_iops=>400, :ebs_optimized=>false}, :task_instance_count=>0, :task_instance_type=>“m1.medium”, :task_instance_bid=>0.015}, :additional_info=>nil, :bootstrap_failure_tries=>3}}, :collectors=>{:format=>“thrift”}, :enrich=>{:versions=>{:spark_enrich=>“1.9.0”}, :continue_on_unexpected_error=>false, :output_compression=>“NONE”}, :storage=>{:version=>{:rdb_shredder=>“0.12.0”, :hadoop_elasticsearch=>“0.1.0”}, :download=>{:folder=>nil}}, :monitoring=>{:tags=>{:name=>“EmrEtlRunner”}, :logging=>{:level=>“DEBUG”}, :snowplow=>{:method=>“get”, :collector=>“sp-collector-167750917.us-west-2.elb.amazonaws.com”, :app_id=>“sp.batch”}}}
Value guarded in: Snowplow::EmrEtlRunner::Cli::load_config
With Contract: Maybe, String => Hash
At: uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:137 ):
uri:classloader:/gems/contracts-0.11.0/lib/contracts.rb:45:in block in Contract' uri:classloader:/gems/contracts-0.11.0/lib/contracts.rb:154:infailure_callback’
uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:80:in call_with' uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:inblock in load_config’
uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:108:in process_options' uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:94:inget_args_config_enrichments_resolver’
uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in send_to' uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:incall_with’
uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in block in get_args_config_enrichments_resolver' uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:37:in'
org/jruby/RubyKernel.java:973:in load' uri:classloader:/META-INF/main.rb:1:in'
org/jruby/RubyKernel.java:955:in require' uri:classloader:/META-INF/main.rb:1:in(root)‘
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `’

I checked the “Actual” against the “Expected”, but cannot find the problem. Btw, your “Expected” doesn’t provide expected value, rather just the fields. I include my config.yml below for your reference. I used snowplow_emr_r89_plain_of_jars.zip. Thanks for help.

aws:
access_key_id: iam
secret_access_key: iam
s3:
region: us-west-2
buckets:
assets: s3://snowplow-hosted-assets
jsonpath_assets: s3://<%= ENV[‘S3BUCKET’] %>/jsonpaths
log: s3n://<%= ENV[‘S3BUCKET’] %>/etl/logs
raw:
in: [“s3n://<%= ENV[‘S3BUCKET’] %>/raw/”]
processing: s3://<%= ENV[‘S3BUCKET’] %>/etl/processing
archive: s3://<%= ENV[‘S3BUCKET’] %>/archive/raw
enriched:
good: s3://<%= ENV[‘S3BUCKET’] %>/enriched/good
bad: s3://<%= ENV[‘S3BUCKET’] %>/enriched/bad
errors: s3://<%= ENV[‘S3BUCKET’] %>/enriched/errors
archive: s3://<%= ENV[‘S3BUCKET’] %>/enriched/archive
shredded:
good: s3://<%= ENV[‘S3BUCKET’] %>/shredded/good
bad: s3://<%= ENV[‘S3BUCKET’] %>/shredded/bad
errors: s3://<%= ENV[‘S3BUCKET’] %>/shredded/errors
archive: s3://<%= ENV[‘S3BUCKET’] %>/shredded/archive
emr:
ami_version: 5.5.0
region: us-west-2
jobflow_role: arn:aws:iam::330838374348:instance-profile/sp-dev-emr-EMRInstanceProfile-1WK1SCGI04EKJ
service_role: arn:aws:iam::330838374348:role/sp-dev-emr-EMRServiceRole-1F0EVLTF8LOT9
placement:
ec2_subnet_id: subnet-7623951f
ec2_key_name: sp-dev
bootstrap: []
software:
hbase:
lingual:
jobflow:
job_name: clickstream-jwn-<%= ENV[‘SNOWPLOW_ENV’] %>
master_instance_type: <%= ENV[‘SNOWPLOW_EMR_MASTER_TYPE’] %>
core_instance_count: <%= ENV[‘SNOWPLOW_EMR_CORE_NUM’] %>
core_instance_type: <%= ENV[‘SNOWPLOW_EMR_CORE_TYPE’] %>
core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
volume_size: 100 # Gigabytes
volume_type: "gp2"
volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
ebs_optimized: false # Optional. Will default to true
task_instance_count: 0 # Increase to use spot instances
task_instance_type: m1.medium
task_instance_bid: 0.015
additional_info:
bootstrap_failure_tries: 3
collectors:
format: thrift
enrich:
versions:
spark_enrich: 1.9.0
continue_on_unexpected_error: false
output_compression: NONE
storage:
version:
rdb_shredder: 0.12.0
hadoop_elasticsearch: 0.1.0
download:
folder:
monitoring:
tags: { “name” : “EmrEtlRunner” } # Name-value pairs describing this job
logging:
level: DEBUG # You can optionally switch to INFO for production
snowplow:
method: get
collector: <%= ENV[‘SNOWPLOW_COLLECTOR’] %> # e.g. d3rkrsqld9gmqf.cloudfront.net
app_id: “sp.batch” # e.g. snowplow


#7

@RichardJ,

It appears you are using EmrEtlRunner version R89. I can’t see anything obvious but I think you might need to replace

      raw:
        in: ["s3n://<%= ENV['S3BUCKET'] %>/raw/"]

with

      raw:
        in: 
          - "s3n://<%= ENV['S3BUCKET'] %>/raw/"

As well as modify

monitoring:
  tags: { "name" : "EmrEtlRunner" }

with

monitoring:
  tags: 
    name: EmrEtlRunner

#8

Hi Ihor

Thank you for helping us to get config.yml working.

Now we have made some changes in a few source codes in order to pull in InstanceProfileCredentials because we cannot explicitly put access_key_id & secret_access_key in config.yml. But we got the following error when re-build snowplow-emr-etl-runner.jar -

[Coveralls] Outside the CI environment, not sending data.
Unable to detect bundler spec under ‘/home/ec2-user/.rvm/gems/jruby-9.1.6.0/gems/bundler-1.16.0’’ and its sub-dirs
cat: deploy/snowplow-emr-etl-runner.jar: No such file or directory

What I did was: downloaded “snowplow-r89-plain-of-jars” from https://github.com/snowplow/snowplow/releases/tag/r89-plain-of-jars to aws EC2, made small changes for aws Credentials, then
cd /home/ec2-user/snowplow-r89-plain-of-jars/3-enrich/emr-etl-runner
./build.sh under /home/ec2-user/snowplow-r89-plain-of-jars/3-enrich/emr-etl-runner.

Below is the full output -

[ec2-user@ip-172-26-32-147 emr-etl-runner]$ ./build.sh
Already installed jruby-9.1.6.0.
To reinstall use:

rvm reinstall jruby-9.1.6.0

Using /home/ec2-user/.rvm/gems/jruby-9.1.6.0
Successfully installed bundler-1.16.0
1 gem installed
Using rake 11.2.2
Using CFPropertyList 2.3.2
Using public_suffix 2.0.5
Using addressable 2.5.1
Using awrence 0.1.0
Using jmespath 1.1.3
Using aws-sdk-core 2.2.9
Using aws-sdk-resources 2.2.9
Using aws-sdk 2.2.9
Using builder 3.2.2
Using bundler 1.16.0
Using contracts 0.11.0
Using json 2.0.2 (java)
Using docile 1.1.5
Using simplecov-html 0.10.0
Using simplecov 0.12.0
Using tins 1.12.0
Using term-ansicolor 1.3.2
Using thor 0.19.1
Using coveralls 0.8.15
Using diff-lcs 1.2.5
Using unf 0.1.4 (java)
Using domain_name 0.5.20170404
Using excon 0.52.0
Using formatador 0.2.5
Using fog-core 1.42.0
Using multi_json 1.12.1
Using fog-json 1.0.2
Using inflecto 0.0.2
Using fog-brightbox 0.11.0
Using nokogiri 1.6.8 (java)
Using fog-xml 0.1.2
Using fog-profitbricks 0.0.5
Using fog-radosgw 0.0.5
Using fog-sakuracloud 1.7.5
Using fog-softlayer 1.1.4
Using fog-terremark 0.1.0
Using fission 0.5.0
Using fog-vmfusion 0.1.0
Using fog-voxel 0.1.0
Using ipaddress 0.8.3
Using trollop 2.1.2
Using rbvmomi 1.9.2
Using opennebula 5.0.2
Using fog 1.25.0
Using http-cookie 1.0.3
Using mime-types 2.99.3
Using netrc 0.11.0
Using rest-client 1.8.0
Using elasticity 6.0.11
Using multi_xml 0.6.0
Using httparty 0.14.0
Using json-schema 2.7.0
Using iglu-ruby-client 0.1.0
Using jruby-jars 9.1.4.0
Using jruby-rack 1.1.20
Using net-ssh 2.9.4
Using rspec-core 2.99.2
Using rspec-expectations 2.99.2
Using rspec-mocks 2.99.4
Using rspec 2.99.0
Using rubyzip 1.2.0
Using sluice 0.4.0
Using snowplow-tracker 0.5.2
Using warbler 2.0.3
Bundle complete! 10 Gemfile dependencies, 65 gems now installed.
Use bundle info [gemname] to see where a bundled gem is installed.
Running RSpec
Coverage may be inaccurate; set the “–debug” command line option, or do JRUBY_OPTS="–debug" or set the “debug.fullTrace=true” option in your .jrubyrc
[Coveralls] Set up the SimpleCov formatter.
[Coveralls] Using SimpleCov’s default settings.
/home/ec2-user/.rvm/gems/jruby-9.1.6.0/gems/simplecov-0.12.0/lib/simplecov.rb:48: warning: tracing (e.g. set_trace_func) will not capture all events without --debug flag
/home/ec2-user/snowplow-r89-plain-of-jars/3-enrich/emr-etl-runner/spec/snowplow-emr-etl-runner/runner_spec.rb:32: warning: key (SymbolNode:skip 38) is duplicated and overwritten on line 32
/home/ec2-user/snowplow-r89-plain-of-jars/3-enrich/emr-etl-runner/spec/snowplow-emr-etl-runner/runner_spec.rb:19: warning: already initialized constant Cli
/home/ec2-user/snowplow-r89-plain-of-jars/3-enrich/emr-etl-runner/spec/snowplow-emr-etl-runner/runner_spec.rb:23: warning: already initialized constant ConfigError
/home/ec2-user/snowplow-r89-plain-of-jars/3-enrich/emr-etl-runner/spec/snowplow-emr-etl-runner/snowplow_spec.rb:19: warning: already initialized constant Snowplow
…true

Finished in 1.01 seconds
34 examples, 0 failures
[Coveralls] Outside the CI environment, not sending data.
Unable to detect bundler spec under ‘/home/ec2-user/.rvm/gems/jruby-9.1.6.0/gems/bundler-1.16.0’’ and its sub-dirs
cat: deploy/snowplow-emr-etl-runner.jar: No such file or directory

Thanks again for help,
Richard