EmrEtlRunner R92 hangs at initialization


#1

Hi everyone!

I have an issue with the latest release. I was previously running R77 and decided to upgrade to R92.
I followed the upgrade guide, and even removed and re-downloaded the sample config from scratch.

My issue is that when I run the script to process the batch events

./snowplow-emr-etl-runner run -c conf.yml -r iglu_resolver.json --targets targets/ --enrichments enrichments/

I get the output

D, [2017-10-04T22:27:33.998000 #2984] DEBUG -- : Initializing EMR jobflow

and the script just hangs there … no EMR cluster gets started.

I have this setup on an EC2 t2.small instance that I was using before (for R77), where it was working just fine.
I tried rebooting it (as suggested in other similar posts) and I tried installing Java 8 …

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

but nothing changed.

I noticed that if I leave it running for a long time, it eventually exits with:

Excon::Error::Socket (Unsupported record version Unknown-0.0 (OpenSSL::SSL::SSLError)):
    org/jruby/ext/openssl/SSLSocket.java:222:in `connect_nonblock'
    uri:classloader:/gems/excon-0.52.0/lib/excon/ssl_socket.rb:121:in `initialize'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:403:in `socket'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:100:in `request_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/mock.rb:48:in `request_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/instrumentor.rb:26:in `request_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:16:in `request_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:16:in `request_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:16:in `request_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:249:in `request'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/idempotent.rb:27:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:272:in `request'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/idempotent.rb:27:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:272:in `request'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/idempotent.rb:27:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:272:in `request'
    uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/sax_parser_connection.rb:35:in `request'
    uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/sax_parser_connection.rb:-1:in `request'
    uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:7:in `request'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/storage.rb:612:in `_request'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/storage.rb:-1:in `_request'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/storage.rb:607:in `request'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/requests/storage/get_bucket.rb:43:in `get_bucket'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/models/storage/directories.rb:21:in `get'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/models/storage/files.rb:30:in `all'
    uri:classloader:/gems/fog-aws-1.4.0/lib/fog/aws/models/storage/files.rb:51:in `each'
    uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:69:in `list_files'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:135:in `block in initialize'
    org/jruby/RubyArray.java:2564:in `select'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:133:in `initialize'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:100:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
    org/jruby/RubyKernel.java:979:in `load'
    uri:classloader:/META-INF/main.rb:1:in `<main>'
    org/jruby/RubyKernel.java:961:in `require'
    uri:classloader:/META-INF/main.rb:1:in `(root)'
    uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

Side note: everything is in the same region (us-east-1).

I’m wondering if anyone has ever encountered this, and if anyone has an idea of what’s happening.

I can provide more details if anyone is able and willing to guide me through this.

Thanks a lot for your help!

EDIT:
OK, I managed to get it working locally. The EMR job gets started. The initialization phase took almost 8 minutes; not sure if that’s expected, since new steps have been added and so on …
Not sure what I changed: I did some cleanup in the config file (removed the comments) and cleaned up some IAM permissions (removed probably-useless statements). Nothing game-changing really … I think I was just being impatient during my local debugging.

I’ll try to get it working on my EC2 instance now … I mean tomorrow; 3 AM is bedtime :slight_smile:


#2

Spinning up an EMR cluster unfortunately takes a long time; between starting and bootstrapping you’re in for a good 10 minutes at least.

I did encounter this SSL issue in the past, and it seems to be transient with respect to the EC2 box you’re launching EmrEtlRunner from, i.e. a restart usually solves it.

See Sluice fails on staging/archiving with OpenSSL error "Unsupported record version Unknown-0.0"
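If the error really is transient, one stopgap while debugging is to wrap the flaky call in a retry with backoff. This is purely an illustrative Ruby sketch, not something EmrEtlRunner supports out of the box; `with_retries` and the call in the usage comment are made-up names for illustration:

```ruby
# Hedged sketch: retry a block a few times with increasing backoff,
# the way you might paper over a transient OpenSSL::SSL::SSLError.
# `with_retries` is an illustrative helper, not part of EmrEtlRunner.
def with_retries(max_attempts: 3, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(base_delay * attempts) # linear backoff: 1s, 2s, ...
    retry
  end
end

# Hypothetical usage around a flaky S3 listing:
#   with_retries { some_s3_listing_call }
```

Of course this only helps if the failure is genuinely intermittent; if every attempt fails the same way, the box itself (or its TLS stack) is the problem.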


#3

Thanks, Ben!

When I said 8 minutes, I wasn’t talking about the bootstrapping time. It’s just the time it took the cluster to be “created”, i.e. to simply appear in my AWS cluster list. But that only happened the first time; now it’s faster. Can’t explain why.

I’m running this locally right now and encountering other issues that I should be able to fix. I’ll wait until I have a clean local run before retrying on EC2.

I had already tried restarting the EC2 host, but … nothing changed, still the same error. That’s why I opened a new thread.

Will keep you posted. Hopefully it’s a transient issue :slight_smile:

Thanks!


#5

OK … so I copied my local working setup to my t2.small EC2 instance, and it hangs.

I decided to launch a brand-new m3.medium instance and copy my local working setup over, and this happened:

[ec2-user@ip-xxxxxxxx enrich]$ ./snowplow-emr-etl-runner run -c conf.yml -e enrichments/ -r iglu_resolver.json --targets targets/
D, [2017-10-05T14:13:59.741000 #2745] DEBUG -- : Initializing EMR jobflow
D, [2017-10-05T14:21:51.794000 #2745] DEBUG -- : EMR jobflow j-XXXXXXXXXX started, waiting for jobflow to complete...

As you can see, it managed to start the jobflow this time so I’m happy about that.

But look at the timestamps and how long it took the process to initialize the jobflow. This is what I was talking about earlier. I think (and hope) it will get a lot faster on the second run, but I thought I’d share this just in case …
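For reference, the two DEBUG timestamps above can be diffed to get the exact initialization time; a quick Ruby one-off (nothing to do with the runner itself):

```ruby
require 'time'

# Diff the two DEBUG timestamps from the runner's log output above.
start  = Time.parse('2017-10-05T14:13:59.741000')
finish = Time.parse('2017-10-05T14:21:51.794000')
elapsed = (finish - start).round
puts format('initialization took %d min %d s', elapsed / 60, elapsed % 60)
# => initialization took 7 min 52 s
```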

I’ll probably try to go back to a smaller instance, and will let you know if I run into issues again.

Thanks to everyone who read this.