EmrEtlRunner EMR Jobflow hangs after status 200 on tracker pixel GET


#1

Hi All, starting with snowplow and AWS so this may be a rookie mistake… need help

my configuration:

  • self hosting emretlrunner on a linux vps,
  • going with PostgreSQL storage, self hosted as well
  • AWS with S3, Cloudfront and EMR
    :question: (One doubt is about the ec2_subnet_id – I am not sure if my config is running in a VPC - I did not purposely set that, I do see 3 VPC references in EMR and added the first one. Not sure what placement refers to if that’s the right choice)

./snowplow-emr-etl-runner --config config/enrich.yml.conf --resolver config/iglu-resolver.json.conf --skip staging

D, [2016-08-27T22:13:16.418000 #395] DEBUG – : Initializing EMR jobflow
D, [2016-08-27T22:13:19.936000 #395] DEBUG – : EMR jobflow j-22SGW5MD0JD66 started, waiting for jobflow to complete…
I, [2016-08-27T22:13:19.942000 #395] INFO – : SnowplowTracker::Emitter initialized with endpoint http://d170d0ewn0ezze.cloudfront.net:80/i
I, [2016-08-27T22:13:21.328000 #395] INFO – : Attempting to send 1 request
I, [2016-08-27T22:13:21.331000 #395] INFO – : Sending GET request to http://d170d0ewn0ezze.cloudfront.net:80/i
I, [2016-08-27T22:13:21.341000 #395] INFO – : GET request to http://d170d0ewn0ezze.cloudfront.net:80/i finished with status code 200

the job hangs after the above last line until I terminate it

my config file below (creds removed)

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: [My Key ID]
  secret_access_key: [My Access Key]
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://netgrasp-snowplow-enrich/log/emr-etl-runner
      raw:
        in:                  # Multiple in buckets are permitted
          - s3://netgrasp-snowplow-collector/log
        processing: s3://netgrasp-snowplow-enrich/processing
        archive: s3://netgrasp-snowplow-enrich/archive
      enriched:
        good: s3://netgrasp-snowplow-enrich/enriched/good
        bad:  s3://netgrasp-snowplow-enrich/enriched/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://netgrasp-snowplow-enrich/enriched/archive  # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://netgrasp-snowplow-enrich/shredded/good
        bad:  s3://netgrasp-snowplow-enrich/shredded/bad
        errors:  # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: netgrasp-snowplow-enrich/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.5.0
    region: us-west-2     # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:    # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-01e95464 # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: emr-keypair
    bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: /home/netgrasp/postgres-download # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
     - name: "My PostgreSQL database"
      type: postgres
      host: [My host] # Hostname of database server
      database: postgress # Name of database
      port: 5432 # Default Postgres port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: [username]
      password: [pwd]
      maxerror: # Not required for Postgres
      comprows: # Not required for Postgres
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow
    collector: d170d0ewn0ezze.cloudfront.net # e.g. d3rkrsqld9gmqf.cloudfront.net

#2

False alarm - all I had to do is wait… The process finished, though the storage PostgreSQL database was NOT populated. Maybe a question for another time!

D, [2016-08-27T22:37:33.004000 #395] DEBUG – : EMR jobflow j-22SGW5MD0JD66 completed successfully.
I, [2016-08-27T22:37:34.584000 #395] INFO – : Attempting to send 1 request
I, [2016-08-27T22:37:34.594000 #395] INFO – : Sending GET request to http://d170d0ewn0ezze.cloudfront.net:80/i
I, [2016-08-27T22:37:34.700000 #395] INFO – : GET request to http://d170d0ewn0ezze.cloudfront.net:80/i finished with status code 200
D, [2016-08-27T22:37:34.702000 #395] DEBUG – : Archiving CloudFront logs…
moving files from s3://netgrasp-snowplow-enrich/processing/ to s3://netgrasp-snowplow-enrich/archive/
(t1) MOVE netgrasp-snowplow-enrich/processing/E1HK6HQU8KOHA8.2016-08-28-01.05322bb2.us-east-1.log.gz -> netgrasp-snowplow-enrich/archive/2016-08-28/E1HK6HQU8KOHA8.2016-08-28-01.05322bb2.us-east-1.log.gz(t2) MOVE netgrasp-snowplow-enrich/processing/E1HK6HQU8KOHA8.2016-08-28-01.44956477.us-east-1.log.gz -> netgrasp-snowplow-enrich/archive/2016-08-28/E1HK6HQU8KOHA8.2016-08-28-01.44956477.us-east-1.log.gz

  +-> netgrasp-snowplow-enrich/archive/2016-08-28/E1HK6HQU8KOHA8.2016-08-28-01.05322bb2.us-east-1.log.gz
  +-> netgrasp-snowplow-enrich/archive/2016-08-28/E1HK6HQU8KOHA8.2016-08-28-01.44956477.us-east-1.log.gz
  x netgrasp-snowplow-enrich/processing/E1HK6HQU8KOHA8.2016-08-28-01.05322bb2.us-east-1.log.gz
  x netgrasp-snowplow-enrich/processing/E1HK6HQU8KOHA8.2016-08-28-01.44956477.us-east-1.log.gz

I, [2016-08-27T22:37:38.440000 #395] INFO – : Completed successfully


#3

Right - loading of Postgres (or Redshift) is currently performed by the StorageLoader component, not by EmrEtlRunner itself…


#4

@krisknap - the emr etl runner is spinning up your EMR cluster - which can be seen here:

  EMR jobflow j-22SGW5MD0JD66 started, waiting for jobflow to complete...

the emitter endpoint test is in response to the value in config:

monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow
    collector: d170d0ewn0ezze.cloudfront.net # e.g. d3rkrsqld9gmqf.cloudfront.net

if you want a visualization of the EMR job, you can open up the aws web console, navigate to EMR and view the cluster list - and it will have a green circle next to the running cluster (j-22SGW5MD0JD66 -> the jobflow id noted in the above output.

you can then click on the name of the cluster, scroll down and expand Steps - and see elapsed time to watch the job as it completes. (also has logging)


#5

Thanks @13scoobie, I was indeed able to follow this in the AWS console, and in my case (medium size instances) the overhead was about 10 minutes before the cluster was bootstrapped.

Something I wondered about is whether the current framework provides for using an EMR node that is already “running”. I had one node green in “waiting” state, idle, while the emr etl runner was activating another one. Would there be a way to refer to a specific live instance/node of EMR?


#6

Hi @krisknap - it’s a good question. The answer is: you can’t target an existing EMR cluster for your job currently within EmrEtlRunner.

However, we will be replacing EmrEtlRunner in time with a new Golang-based app, called Dataflow Runner (per this RFC), and that app will support targeting an existing cluster. You can see a sneak peak of the functionality of Dataflow Runner (WIP) in its README.