EmrEtlRunner error: Excon::Error::MovedPermanently (Expected(200) <=> Actual(301 Moved Permanently))


#1

Hey,
I have been trying to run EmrEtlRunner on an AWS EMR machine and have been getting this error:

Excon::Error::MovedPermanently (Expected(200) <=> Actual(301 Moved Permanently)
excon.error.response
  :body          => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Bucket>processed_logs_bucketname</Bucket><Endpoint>processed_logs_bucketname.s3.ap-south-1.amazonaws.com</Endpoint><RequestId>9B5542F350E6853E</RequestId><HostId>y4or+Q7rmKf507hQG0fir376OJWO2RbAK0ULyO78qiujXD9azlrkVpUiNOZCOiRgJi1Ke4SWybg=</HostId></Error>"
  :cookies       => [
  ]
  :headers       => {
    "Content-Type"        => "application/xml"
    "Date"                => "Thu, 25 May 2017 05:01:12 GMT"
    "Server"              => "AmazonS3"
    "x-amz-bucket-region" => "ap-south-1"
    "x-amz-id-2"          => "y4or+Q7rmKf507hQG0fir376OJWO2RbAK0ULyO78qiujXD9azlrkVpUiNOZCOiRgJi1Ke4SWybg="
    "x-amz-request-id"    => "9B5542F350E6853E"
  }
  :host          => "<processed_logs_bucketname>.s3-ap-southeast-1.amazonaws.com"
  :local_address => "--"
  :local_port    => 54476
  :path          => "/"
  :port          => 443
  :reason_phrase => "Moved Permanently"
  :remote_ip     => "--"
  :status        => 301
  :status_line   => "HTTP/1.1 301 Moved Permanently\r\n"
):
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/expects.rb:7:in `response_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/response_parser.rb:9:in `response_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:388:in `response'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:252:in `request'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/idempotent.rb:27:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:272:in `request'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/idempotent.rb:27:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:272:in `request'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/idempotent.rb:27:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/base.rb:11:in `error_call'
    uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:272:in `request'
    uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/sax_parser_connection.rb:35:in `request'
    uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:7:in `request'
    uri:classloader:/gems/fog-1.25.0/lib/fog/aws/storage.rb:521:in `_request'
    uri:classloader:/gems/fog-1.25.0/lib/fog/aws/storage.rb:516:in `request'
    uri:classloader:/gems/fog-1.25.0/lib/fog/aws/requests/storage/get_bucket.rb:43:in `get_bucket'
    uri:classloader:/gems/fog-1.25.0/lib/fog/aws/models/storage/directories.rb:22:in `get'
    uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:66:in `list_files'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:128:in `is_empty?'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:314:in `initialize'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:73:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `<main>'
    org/jruby/RubyKernel.java:973:in `load'
    uri:classloader:/META-INF/main.rb:1:in `<main>'
    org/jruby/RubyKernel.java:955:in `require'
    uri:classloader:/META-INF/main.rb:1:in `(root)'
    uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>' 

Since we are using s3://-protocol bucket names, I can't figure out how to specify a region along with them.
I have already specified the region under the s3 subsection in config.yml.
Also note that my S3 buckets and EMR ETL jobs are running in separate regions: the S3 buckets are in the Mumbai region (ap-southeast-1) and the EMR instances are in the Oregon region (us-west-2).
Below are the contents of my config.yml.

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <access_key>
  secret_access_key: <secret_key>
  s3:
    region: ap-southeast-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://<emretlrunner_logs_bucketname>
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://<snowplowlogs_bucketname>         # e.g. s3://my-old-collector-bucket
        processing: s3://<processing_logs_bucketname>
        archive: s3://<archive_logs_bucketname>    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://<processed_logs_bucketname>/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://<processed_logs_bucketname>/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://<archive_logs_bucketname>/enriched    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://<processed_logs_bucketname>/shredded/good       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://<processed_logs_bucketname>/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors:     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://<archive_logs_bucketname>/shredded    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.5.0
    region: us-west-2        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-6ff6c30b # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: Snowplow-EMR
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.11.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
monitoring:
  tags: {'name':'snowplow-etl'} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  #snowplow:
   # method: get
   # app_id: ADD HERE # e.g. snowplow
   # collector: ADD HERE # e.g. d3rkrsqld9gmqf.cloudfront.net

And iglu_resolver.json

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}

The command I am using to run ETL runner is -
./snowplow-emr-etl-runner --skip staging,archive_raw --config config.yml --targets targets/ --resolver iglu_resolver.json
Can anybody help me figure out if I am missing out on something?


#2

Found the issue: I was using the wrong region under the s3 section. It should be ap-south-1 (Mumbai) instead of ap-southeast-1.
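
For anyone hitting the same redirect: a quick way to double-check which region a bucket actually lives in is the AWS CLI (a sketch, assuming the CLI is installed and configured; `<bucket-name>` is a placeholder):

```shell
# Ask S3 where the bucket actually lives.
# For a Mumbai bucket the LocationConstraint comes back as "ap-south-1",
# which is the value to put under s3 > region in config.yml.
aws s3api get-bucket-location --bucket <bucket-name>
```

The `x-amz-bucket-region` header in the 301 response body above carries the same information.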


#3

Glad to hear you managed to solve this @apoorva007.

You may want to consider running your EMR cluster in the same region as your S3 data. This means you avoid cross-region data transfer fees (for S3 data) and you also get a bit better performance as well.


#4

Hello @mike, I'm getting the same issue, but my buckets and my EMR ETL job are in the same region (sa-east-1).

I got the error when I tried to run the StorageLoader service.

I think the issue happens when the flow tries to access s3://snowplow-hosted-assets, returning the following message:

Unexpected error: Expected(200) <=> Actual(301 Moved Permanently) excon.error.response :body => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Bucket>snowplow-hosted-assets</Bucket><Endpoint>snowplow-hosted-assets.s3-eu-west-1.amazonaws.com</Endpoint><RequestId>D21532A2B00E281B</RequestId><HostId>+SPiK6wog3Nt0qOEbtIsLDe7k6cGx0SDLWLc2CXmWH0mfC5Y94yx78ePSw5A3gWbL0tMe+JEm9Y=</HostId></Error>

But this is configured under s3 > buckets > assets (with region sa-east-1):
assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket.

Did something change in the last day? The tests I was running with the StorageLoader service were working fine (I got my events into Redshift), but today I got the message above.

Can you help me figure out what the problem might be?

Thank you


#5

Yes, agreed @mike. That is the next step. Thanks for your suggestion.


#6

Hello Snowplowers!

Just to update about my issue:

I was using a region from which, I think, the snowplow-hosted-assets bucket was not accessible (or this changed for some reason). The message says that I need to use another region (eu-west-1) to get the files.

With all my buckets and services running in sa-east-1, I found that I needed to store the jarfiles and other files in my own bucket in sa-east-1, and point to it in my config file.

So I copied the contents of snowplow-hosted-assets to my own S3 bucket in sa-east-1 and the problem was fixed.
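
For reference, a sketch of that copy with the AWS CLI (bucket name `my-own-hosted-assets` is a placeholder; assumes the CLI is configured with credentials that can read the source bucket):

```shell
# Mirror the hosted assets from the canonical eu-west-1 bucket
# into a bucket in our own region (sa-east-1).
aws s3 sync s3://snowplow-hosted-assets s3://my-own-hosted-assets \
  --source-region eu-west-1 --region sa-east-1
```

After the copy, the assets entry under s3 > buckets in config.yml points at the new bucket (s3://my-own-hosted-assets).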

I just need to know whether this is the right approach and whether it is okay to maintain, or if there is a wiki page that explains this process; maybe I missed some part of the docs.

Thanks!