Snowplow Failing On EMR Step


#1

For the past couple of days, my Snowplow script has been failing. It hangs for about 38 minutes and then terminates with an internal error. The script runs on EC2-Classic, and no changes have been made to the environment. By skipping all but one step at a time, I narrowed the problem down to the EMR step. Oddly, if I run only the EMR step and skip the rest, then instead of hanging for 38 minutes the command finishes after a couple of seconds with "INFO -- : Completed successfully." Has anyone experienced this?

Here is the script:

#!/bin/bash
clear
 
# use jruby environment
# https://rvm.io/integration/cron#loading-rvm-environment-files-in-shell-scripts
source /usr/local/rvm/environments/jruby-1.7.19
 
echo "enrichment kick-off"
cd /home/ec2-user/snowplow/3-enrich/emr-etl-runner
bundle exec bin/snowplow-emr-etl-runner --config config/sp.yml --resolver config/resolver.json --enrichments ../config/enrichments
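
For reference, the step-skipping experiment described above can be done with EmrEtlRunner's --skip flag. A minimal sketch, assuming the skippable step names staging and archive_raw (these vary by EmrEtlRunner release, so check the options for your version); the command is echoed rather than executed so the sketch stays self-contained:

```shell
# Sketch: run only the EMR step by skipping the others.
# SKIP_STEPS is an assumption -- verify the skippable step names
# against your EmrEtlRunner release before using this.
RUNNER="bin/snowplow-emr-etl-runner"
SKIP_STEPS="staging,archive_raw"
CMD="bundle exec $RUNNER --config config/sp.yml --resolver config/resolver.json --skip $SKIP_STEPS"
echo "$CMD"
```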

Here is the sp.yml file:

aws:
  access_key_id: Aaxc
  secret_access_key: asdf
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets
      log: s3://rrsnowplow-log/emr
      raw:
        in:
        - s3://elasticbeanstalk-us-east-1-69/resources/environments/logs/publish/e-p/
        - s3://elasticbeanstalk-us-east-1-69/resources/environments/logs/publish/e-i/
        - s3://elasticbeanstalk-us-east-1-69/resources/environments/logs/publish/e-9/
        processing: s3://rrsnowplow-etl/processing
        archive: s3://rrsnowplow-archive/raw
      enriched:
        good: s3://rrsnowplow-data/enriched/good
        bad: s3://rrsnowplow-data/enriched/bad
        errors: s3://rrsnowplow-data/enriched/errors
        archive: s3://rrsnowplow-storage-archive/enriched/good
      shredded:
        good: s3://rrsnowplow-data/shredded/good
        bad: s3://rrsnowplow-data/shredded/bad
        errors: s3://rrsnowplow-data/shredded/errors
        archive: s3://rrsnowplow-storage-archive/shredded/good
      jsonpath_assets: 
  emr:
    ami_version: 3.6.0
    region: us-east-1
    placement: us-east-1c
    ec2_subnet_id:
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    ec2_key_name: Key_Name
    software:
      hbase: # not used for ami_version 3.6.0
      lingual: # not used for ami_version 3.6.0
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 3
      core_instance_type: c3.xlarge
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
collectors:
  format: clj-tomcat
enrich:
  job_name: Snowplow ETL
  versions:
    hadoop_enrich: 1.0.0
    hadoop_shred: 0.4.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  download:
    folder:
  targets:
  - name: RR Snowplow Events
    type: redshift
    host: snowplow.redshift.amazonaws.com
    database: db
    port: 5439
    table: atomic.table
    username: adm
    password: pw
    maxerror: 10
    comprows: 200000
monitoring:
  tags: {}
  logging:
    level: INFO
  snowplow:

#2

I’ll have to run the script at least one more time to be sure, but changing the AMI version from 3.6.0 to 3.9.0 seems to have fixed the issue.
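
For anyone following along, that fix is a one-line change in sp.yml (fragment only; everything else stays as posted above):

```yaml
aws:
  emr:
    ami_version: 3.9.0   # was 3.6.0
```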


#3

I am no longer getting the original error. However, the process now fails during the last step, “Shredded HDFS.” My colleague thinks this is due to an outdated hadoop_enrich or hadoop_shred version.


#4

Hi @wyip, what’s the error you’re getting now?


#5

This probably has to do with the hadoop_enrich and hadoop_shred versions. Here is the stderr output:

Exception in thread "main" java.lang.RuntimeException: Failed to get source file system
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:739)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:720)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:736)
… 9 more


#6

Maybe try AMI 3.11.0; it still uses the same Hadoop version (2.4.0).

If you don’t have any dependency on those particular enrich and shred versions, I would indeed advise upgrading.
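
The two suggestions combined would look like this in sp.yml. The version numbers below are purely illustrative assumptions; hadoop_enrich and hadoop_shred must be upgraded as a matching pair, so take the exact pair from the release notes of the Snowplow release you move to:

```yaml
aws:
  emr:
    ami_version: 3.11.0    # same Hadoop line (2.4.0) as 3.9.0
enrich:
  versions:
    hadoop_enrich: x.y.z   # example placeholder; was 1.0.0
    hadoop_shred: x.y.z    # example placeholder; was 0.4.0
```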


#7

Sorry for the delayed response. My colleague has gone ahead and upgraded to one of the most recent versions. Thanks for the help!