EmrEtlRunner Bad Request Error: KeyTooLongError


#1

Hello,

I am facing a weird issue. When I run EmrEtlRunner to process the logs, it fails at one particular step:

Problem copying archive/hadoop-hadoop-datanode-ip-10-237-134-206.us-west-2.compute.internal.log.us-west-2.i-4eda4689.us-west-2.archive.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.us-west-2.archive.us-west-2.processing.gz. Retrying.
F, [2016-08-02T08:31:15.449000 #29285] FATAL -- : 

Excon::Errors::BadRequest (Expected(200) <=> Actual(400 Bad Request)
excon.error.response
  :body          => "<Error><Code>KeyTooLongError</Code><Message>Your key is too long</Message><Size>1038</Size><MaxSizeAllowed>1024</MaxSizeAllowed><RequestId>069326F0B1EEF237</RequestId><HostId>KS3bimOrWGaBijSluiL6+7wOnWilee7/oy7SNgDZiK/J5l8MO3aVWcx3lhCKUHns1/ARXDKD3n4=</HostId></Error>"
  :headers       => {
    "Connection"       => "close"
    "Content-Type"     => "application/xml"
    "Date"             => "Tue, 02 Aug 2016 08:31:14 GMT"
    "Server"           => "AmazonS3"
    "x-amz-id-2"       => "KS3bimOrWGaBijSluiL6+7wOnWilee7/oy7SNgDZiK/J5l8MO3aVWcx3lhCKUHns1/ARXDKD3n4="
    "x-amz-request-id" => "069326F0B1EEF237"
  }
  :local_address => "10.214.35.217"
  :local_port    => 59037
  :reason_phrase => "Bad Request"
  :remote_ip     => "54.231.168.201"
  :status        => 400
  :status_line   => "HTTP/1.1 400 Bad Request\r\n"
):

I am using an old AMI version: 3.7.0.

Can someone point me to the issue, or to the direction I should look in?
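
For readers following the trace: the `<Size>` and `<MaxSizeAllowed>` fields in the error body reflect S3's limit of 1024 bytes per object key. The snippet below is only a minimal sketch of how one might pull those fields out of such a response. The helper name `parse_s3_error` is made up for illustration, and the body is copied from the trace above with `RequestId` and `HostId` left out.

```python
import xml.etree.ElementTree as ET


def parse_s3_error(body):
    """Return the child elements of an S3 <Error> document as a dict."""
    root = ET.fromstring(body)
    return {child.tag: child.text for child in root}


# Error body from the trace above (RequestId and HostId omitted for brevity).
body = (
    "<Error><Code>KeyTooLongError</Code>"
    "<Message>Your key is too long</Message>"
    "<Size>1038</Size><MaxSizeAllowed>1024</MaxSizeAllowed></Error>"
)

err = parse_s3_error(body)
excess = int(err["Size"]) - int(err["MaxSizeAllowed"])
print(f"{err['Code']}: key is {excess} bytes over the limit")
# prints "KeyTooLongError: key is 14 bytes over the limit"
```

The key in the "Problem copying" line carries the same `.us-west-2.archive.us-west-2.processing` suffix many times over, which suggests the same file was picked up and renamed repeatedly until the key outgrew the limit.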


#2

Hi @rajan, can you share your config.yml file, with credentials removed?
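
One possible way to strip credentials before posting a config is to mask the sensitive values line by line. The sketch below is only an illustration. It assumes the secrets sit under the keys `access_key_id`, `secret_access_key` and `password` (the key names used in the EmrEtlRunner config), each on one line. The function name `redact` and the sample values are hypothetical. Any trailing comment on a masked line is dropped along with the value.

```python
import re

# Keys whose values get masked. Extend the tuple if the config holds other secrets.
SECRET_KEYS = ("access_key_id", "secret_access_key", "password")

# Group 1 captures indentation, key and colon; group 2 is the value to mask.
_LINE = re.compile(
    r"^(\s*(?:%s)\s*:\s*)(\S.*)$" % "|".join(map(re.escape, SECRET_KEYS))
)


def redact(text, mask="XXX"):
    """Replace the value of every secret key with `mask`, line by line."""
    out = []
    for line in text.splitlines():
        match = _LINE.match(line)
        out.append(match.group(1) + mask if match else line)
    return "\n".join(out)


# Hypothetical sample; the real values must never be posted.
sample = """aws:
  access_key_id: AKIAEXAMPLE
  secret_access_key: not-a-real-secret
  s3:
    region: us-west-2"""

print(redact(sample))
```

It is still worth reading through the result by eye before posting, since a secret stored under an unexpected key name would pass through untouched.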


#4

Config file:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXX
  secret_access_key: XXX
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://canvas-snowplow-logs/etl-logs
      raw:
        in:
        - s3://canvas-snowplow-logs # Multiple in buckets are permitted
        processing: s3://canvas-snowplow-logs/processing
        archive: s3://canvas-snowplow-logs/archive # e.g. s3://my-archive-bucket/in
      enriched:
        good: s3://canvas-snowplow-logs/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://canvas-snowplow-logs/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://canvas-snowplow-logs/enriched/errors # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://canvas-snowplow-logs/enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://canvas-snowplow-logs/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://canvas-snowplow-logs/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://canvas-snowplow-logs/shredded/errors # Leave blank unless continue_on_unexpected_error: set to true below
        archive: s3://canvas-snowplow-logs/shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 3.6.0 # Don't change this
    region: us-west-2 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: us-west-2a # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: canvasSnowplowAnalytics
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # To launch on cluster, provide version, "0.92.0", keep quotes
      lingual: "1.1" # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: cloudfront # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow canvas ETL # Give your job a name
  versions:
    hadoop_enrich: 1.5.1 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.7.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
  - name: "Canvas snowplow database"
    type: redshift
    host: XXXX # The endpoint as shown in the Redshift console
    database: logs # Name of database
    port: XXX # Default Redshift port
    table: atomic.events
    username: canvas
    password: XXXX
    maxerror: 1 # Stop loading on first error, or increase to permit more load errors
    comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
    ssl_mode: disable
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: "Canvas snowplow" # e.g. snowplow
    collector: dm5gpcb96k7gy.cloudfront.net

#5

Hey @rajan,

Ah - you have your processing bucket inside your in bucket. Never do this, it creates a circular reference. There is a warning about this in the documentation:

Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.

Source: https://github.com/snowplow/snowplow/wiki/Common-configuration#s3
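
The rule quoted above can be checked mechanically before a run. The following is only a rough sketch, not part of EmrEtlRunner: the function names are invented for illustration, and it treats a location as "inside" another when its bucket and path segments start with the other's (an identical location also counts, which is an assumption on my part).

```python
def s3_parts(uri):
    """Split 's3://bucket/a/b' into ['bucket', 'a', 'b']."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    return [part for part in uri[len("s3://"):].split("/") if part]


def is_inside(inner, outer):
    """True if `inner` is the same location as `outer` or nested under it."""
    inner_parts, outer_parts = s3_parts(inner), s3_parts(outer)
    return inner_parts[:len(outer_parts)] == outer_parts


def circular_references(raw_in, processing, enriched_good):
    """Return one message per violation of the two documented rules."""
    problems = []
    for location in raw_in:
        if is_inside(processing, location):
            problems.append(
                f"raw:processing {processing} is inside raw:in {location}"
            )
    if is_inside(enriched_good, processing):
        problems.append(
            f"enriched:good {enriched_good} is inside raw:processing {processing}"
        )
    return problems


# The bucket layout from the config posted in this thread.
found = circular_references(
    raw_in=["s3://canvas-snowplow-logs"],
    processing="s3://canvas-snowplow-logs/processing",
    enriched_good="s3://canvas-snowplow-logs/enriched/good",
)
for problem in found:
    print(problem)
```

With `raw:in` set to the bucket root, every other folder in that bucket (including `processing`) counts as input, so the check reports one problem. Pointing `raw:in` at a dedicated sub-folder of the bucket makes the list come back empty.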


#6

Hey @alex,

Thanks for your reply.

If I change the config file as shown below, will that solve the circular reference problem?

buckets:
  log: s3://canvas-snowplow-logs/etl-logs
  raw:
    in:
    - s3://canvas-snowplow-logs/raw
    processing: s3://canvas-snowplow-logs/processing
    archive: s3://canvas-snowplow-logs/archive
  enriched:
    good: s3://canvas-snowplow-logs/enriched/good
    bad: s3://canvas-snowplow-logs/enriched/bad
    errors: s3://canvas-snowplow-logs/enriched/errors
    archive: s3://canvas-snowplow-logs/enriched/archive
  shredded:
    good: s3://canvas-snowplow-logs/shredded/good
    bad: s3://canvas-snowplow-logs/shredded/bad
    errors: s3://canvas-snowplow-logs/shredded/errors
    archive: s3://canvas-snowplow-logs/shredded/archive

#7

Yes, that should be fine!


#8

Thanks @alex.


#9

Hey @alex,

One more thing: my job failed at the processing stage, and there are a few more logs in the raw logs folder. I want both sets of logs to be loaded into the database. Can you tell me what changes I need to make in the config file to load these logs successfully?


#10

Please create a new thread for a new problem @rajan!