EMR ETL Runner job failing due to S3 errors


#1

We’ve had a couple of overnight ETL jobs fail with errors like the following, after the EMR job itself has completed:

Excon::Errors::ServiceUnavailable (Expected(200) <=> Actual(503 Service Unavailable)
excon.error.response
  :body          => "<Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><RequestId>3CCB8046D0016B8D</RequestId><HostId>km4LiLAPmwPQL6ds36vTlTiZXFC5Z1Sijr6q5nCk9O6TA3VPRwaIHhi9FO6iXRCJrIxRwjY/57M=</HostId></Error>"
  :headers       => {
    "Connection"       => "close"
    "Content-Type"     => "application/xml"
    "Date"             => "Thu, 27 Apr 2017 22:11:10 GMT"
    "Server"           => "AmazonS3"
    "x-amz-id-2"       => "km4LiLAPmwPQL6ds36vTlTiZXFC5Z1Sijr6q5nCk9O6TA3VPRwaIHhi9FO6iXRCJrIxRwjY/57M="
    "x-amz-request-id" => "3CCB8046D0016B8D"
  }
  :local_address => "172.31.39.233"
  :local_port    => 35390
  :reason_phrase => "Slow Down"
  :remote_ip     => "52.218.49.99"
  :status        => 503
  :status_line   => "HTTP/1.1 503 Slow Down\r\n"
):
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/excon-0.45.3/lib/excon/middlewares/expects.rb:6:in `response_call'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/excon-0.45.3/lib/excon/middlewares/response_parser.rb:8:in `response_call'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/excon-0.45.3/lib/excon/connection.rb:372:in `response'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/excon-0.45.3/lib/excon/connection.rb:236:in `request'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/fog-1.24.0/lib/fog/xml/sax_parser_connection.rb:35:in `request'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/fog-1.24.0/lib/fog/xml/connection.rb:17:in `request'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/fog-1.24.0/lib/fog/aws/storage.rb:547:in `_request'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/fog-1.24.0/lib/fog/aws/storage.rb:542:in `request'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/fog-1.24.0/lib/fog/aws/requests/storage/copy_object.rb:32:in `copy_object'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/fog-1.24.0/lib/fog/aws/models/storage/file.rb:92:in `copy'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/sluice-0.2.2/lib/sluice/storage/s3/s3.rb:642:in `retry_x'
    org/jruby/ext/timeout/Timeout.java:126:in `timeout'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/sluice-0.2.2/lib/sluice/storage/s3/s3.rb:641:in `retry_x'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/sluice-0.2.2/lib/sluice/storage/s3/s3.rb:564:in `process_files'
    org/jruby/RubyKernel.java:1511:in `loop'
    /home/ec2-user/snowplow-r73/snowplow-emr-etl-runner!/gems/sluice-0.2.2/lib/sluice/storage/s3/s3.rb:428:in `process_files'


Error running EmrEtlRunner, exiting with return code 1. StorageLoader not run

Any idea what we can do to prevent/mitigate this?


#2

Hi @iain, we have seen increased error rates for Amazon S3 in just eu-west-1 across our Managed Service customer base over the past three days (two distinct periods, lasting 2+ hours each time). Unfortunately these issues don’t seem to have been reported by Amazon.

The workaround is just to manually resume the failed jobs - eventually they complete.
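For anyone who wants to soften this without manual intervention, the general pattern is exponential backoff around the failing S3 call. This is just a minimal sketch, not Snowplow's actual retry logic (Sluice's `retry_x` in the trace above plays a similar role); the attempt count and delay values are illustrative assumptions.

```ruby
# Retry a block with exponential backoff so transient 503 SlowDown
# responses are retried instead of failing the whole run.
def with_backoff(max_attempts: 5, base_delay: 1.0)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts
    # Back off exponentially: base_delay, 2x, 4x, ... before retrying
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Hypothetical usage around a fog-style copy (the `storage` connection
# here is assumed, not from the thread):
# with_backoff { storage.copy_object(src_bucket, key, dst_bucket, key) }
```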


#3

We have a master (orchestrator) server that is in charge of managing Snowplow runs and copies between S3 buckets and, I can confirm, it has been having problems since Wednesday evening :tired_face:. Its network usage has decreased.


#4

Our issue turned out to be linked to a problem with our DNS server list.
Fixing it also allowed us to improve the performance of some EmrEtlRunner steps; you can read more here --> Performance managing S3 buckets
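If you suspect the same cause, a quick check is to time DNS resolution of the S3 endpoint from the orchestrator box: a dead or slow entry at the top of the resolver list adds that latency to every S3 request. This is just a diagnostic sketch of ours; the hostname is the standard eu-west-1 S3 endpoint.

```ruby
require 'resolv'
require 'benchmark'

# Return how many seconds a DNS lookup of the given host takes.
def dns_lookup_seconds(host)
  Benchmark.realtime { Resolv.getaddress(host) }
end

# Example (run from the orchestrator server):
# puts dns_lookup_seconds('s3-eu-west-1.amazonaws.com')
```

Anything consistently above a few hundred milliseconds is a sign the first nameserver in /etc/resolv.conf is timing out before the next one answers.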