EMR ETL Runner fails without useful error on step "Elasticity Spark Step: Enrich Raw Events"


#1

Hi everyone!
I am running emr-etl-runner-r104-stopleistein, but the job is failing on the step "Elasticity Spark Step: Enrich Raw Events".

I ran this job with `--skip staging` because a previous job failed the same way and had already copied the raw logs to s3://snowplow-companyX-etl/processing (at least, I can see them there).

I couldn't find any useful information in the logs (below). Does anyone have an idea what might be going on? There are only a couple of megabytes of logs from the Clojure collector, so I don't think hardware should be the issue.

config.yml

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_SNOWPLOW_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SNOWPLOW_SECRET_KEY'] %>
  s3:
    region: ap-southeast-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://snowplow-companyX-out/emr-etl-logs
      raw:
        in:
          - s3n://elasticbeanstalk-ap-southeast-1-304977849758/resources/environments/logs/publish/e-btrqmphmmc
        processing: s3n://snowplow-companyX-etl/processing
        archive: s3://snowplow-companyX-out/archive  # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://snowplow-companyX-out/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://snowplow-companyX-out/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors:  # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-companyX-out/enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://snowplow-companyX-out/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://snowplow-companyX-out/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-companyX-out/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.9.0
    region: ap-southeast-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement:              # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-6b4dba32 # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow-companyX-emr-etl-runner
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow company ETL # Give your job a name
      master_instance_type: m4.large
      core_instance_count: 1
      core_instance_type: m4.large
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 10    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: c4.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.14.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags:
    description: emr etl run companyX # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow
    collector: collector.fooobar.com # e.g. d3rkrsqld9gmqf.cloudfront.net

controller log

2018-07-21T14:02:23.722Z INFO Ensure step 2 jar file command-runner.jar
2018-07-21T14:02:23.723Z INFO StepRunner: Created Runner for step 2
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.enrich.spark.EnrichJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-ap-southeast-1/3-enrich/spark-enrich/snowplow-spark-enrich-1.14.0.jar --input-format clj-tomcat --etl-timestamp 1532181419947 --iglu-config ewog9Cn0K --enrichments eyJzY2 --input-folder hdfs:///local/snowplow/raw-events/* --output-folder hdfs:///local/snowplow/enriched-events/ --bad-folder s3://snowplow-datadriven-out/enriched/bad/run=2018-07-21-13-56-59/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=ap-southeast-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-32SK2U9TW6WV4/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-172-31-0-233
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-32SK2U9TW6WV4
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-32SK2U9TW6WV4/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-32SK2U9TW6WV4/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-32SK2U9TW6WV4
INFO ProcessRunner started child process 8895 :
hadoop    8895  4018  0 14:02 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.enrich.spark.EnrichJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-ap-southeast-1/3-enrich/spark-enrich/snowplow-spark-enrich-1.14.0.jar --input-format clj-tomcat --etl-timestamp 1532181419947 --iglu-config ewog9<abbreviatedCn0K --enrichments eyJzY<abbreviated>9fX1dfQ== --input-folder hdfs:///local/snowplow/raw-events/* --output-folder hdfs:///local/snowplow/enriched-events/ --bad-folder s3://snowplow-datadriven-out/enriched/bad/run=2018-07-21-13-56-59/
2018-07-21T14:02:27.773Z INFO HadoopJarStepRunner.Runner: startRun() called for s-32SK2U9TW6WV4 Child Pid: 8895
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 36 seconds
2018-07-21T14:03:02.057Z INFO Step created jobs: 
2018-07-21T14:03:02.057Z WARN Step failed with exitCode 1 and took 36 seconds
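(For reference, the `--iglu-config` and `--enrichments` arguments in the step command above are base64-encoded JSON, so with the full, untruncated values you can decode them to double-check what the step actually received. A quick sketch with a placeholder blob, since the values above are abbreviated:)

```python
import base64
import json

# Placeholder blob; substitute the full --iglu-config or --enrichments
# value from the step arguments (the ones above are abbreviated and
# won't decode as-is).
blob = "eyJmb28iOiAiYmFyIn0="

# Decode base64, then parse the resulting JSON document.
decoded = json.loads(base64.b64decode(blob))
print(decoded)
```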

stderr log

Warning: Skip remote jar s3://snowplow-hosted-assets-ap-southeast-1/3-enrich/spark-enrich/snowplow-spark-enrich-1.14.0.jar.
18/07/21 14:02:29 INFO RMProxy: Connecting to ResourceManager at ip-172-31-0-233.ap-southeast-1.compute.internal/172.31.0.233:8032
18/07/21 14:02:30 INFO Client: Requesting a new application from cluster with 1 NodeManagers
18/07/21 14:02:30 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
18/07/21 14:02:30 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/07/21 14:02:30 INFO Client: Setting up container launch context for our AM
18/07/21 14:02:30 INFO Client: Setting up the launch environment for our AM container
18/07/21 14:02:30 INFO Client: Preparing resources for our AM container
18/07/21 14:02:32 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/07/21 14:02:33 INFO Client: Uploading resource file:/mnt/tmp/spark-7054aa11-2c97-4a67-8d19-d3a67aa30524/__spark_libs__7564400749392618887.zip -> hdfs://ip-172-31-0-233.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1532181544882_0002/__spark_libs__7564400749392618887.zip
18/07/21 14:02:37 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
18/07/21 14:02:38 INFO Client: Uploading resource s3://snowplow-hosted-assets-ap-southeast-1/3-enrich/spark-enrich/snowplow-spark-enrich-1.14.0.jar -> hdfs://ip-172-31-0-233.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1532181544882_0002/snowplow-spark-enrich-1.14.0.jar
18/07/21 14:02:38 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-ap-southeast-1/3-enrich/spark-enrich/snowplow-spark-enrich-1.14.0.jar' for reading
18/07/21 14:02:40 INFO Client: Uploading resource file:/mnt/tmp/spark-7054aa11-2c97-4a67-8d19-d3a67aa30524/__spark_conf__5438212996105278327.zip -> hdfs://ip-172-31-0-233.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1532181544882_0002/__spark_conf__.zip
18/07/21 14:02:40 INFO SecurityManager: Changing view acls to: hadoop
18/07/21 14:02:40 INFO SecurityManager: Changing modify acls to: hadoop
18/07/21 14:02:40 INFO SecurityManager: Changing view acls groups to: 
18/07/21 14:02:40 INFO SecurityManager: Changing modify acls groups to: 
18/07/21 14:02:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/07/21 14:02:40 INFO Client: Submitting application application_1532181544882_0002 to ResourceManager
18/07/21 14:02:40 INFO YarnClientImpl: Submitted application application_1532181544882_0002
18/07/21 14:02:41 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:41 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1532181760516
	 final status: UNDEFINED
	 tracking URL: http://ip-172-31-0-233.ap-southeast-1.compute.internal:20888/proxy/application_1532181544882_0002/
	 user: hadoop
18/07/21 14:02:42 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:43 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:44 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:45 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:46 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:47 INFO Client: Application report for application_1532181544882_0002 (state: ACCEPTED)
18/07/21 14:02:48 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:48 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 172.31.10.163
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1532181760516
	 final status: UNDEFINED
	 tracking URL: http://ip-172-31-0-233.ap-southeast-1.compute.internal:20888/proxy/application_1532181544882_0002/
	 user: hadoop
18/07/21 14:02:49 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:50 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:51 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:52 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:53 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:54 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:55 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:56 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:57 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:58 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:02:59 INFO Client: Application report for application_1532181544882_0002 (state: RUNNING)
18/07/21 14:03:00 INFO Client: Application report for application_1532181544882_0002 (state: FINISHED)
18/07/21 14:03:00 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 172.31.10.163
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1532181760516
	 final status: FAILED
	 tracking URL: http://ip-172-31-0-233.ap-southeast-1.compute.internal:20888/proxy/application_1532181544882_0002/
	 user: hadoop
18/07/21 14:03:00 INFO Client: Deleted staging directory hdfs://ip-172-31-0-233.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1532181544882_0002
Exception in thread "main" org.apache.spark.SparkException: Application application_1532181544882_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/21 14:03:00 INFO ShutdownHookManager: Shutdown hook called
18/07/21 14:03:00 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-7054aa11-2c97-4a67-8d19-d3a67aa30524
Command exiting with ret '1'

#2

Hello @pocin,

Useful information regarding the job failure should be in the YARN container logs — something like s3://snowplow-companyX-out/emr-etl-logs/j-YOUR_ID/containers/application_1532181544882_0002/container_application_1532181544882_0002_01_000001/stderr.gz (or stdout.gz).
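In case it helps others locate these files, the S3 key can be composed from the job flow ID and the YARN application ID. A small sketch mirroring the path layout in the example above (`container_stderr_key` is just an illustrative helper name, and the attempt/container numbers default to the first one):

```python
def container_stderr_key(log_prefix, jobflow_id, application_id,
                         attempt=1, container=1):
    """Compose the S3 key of a YARN container's archived stderr log.

    application_id looks like 'application_1532181544882_0002'; the
    container directory appends a two-digit application attempt and a
    six-digit container number to it.
    """
    container_dir = "container_{}_{:02d}_{:06d}".format(
        application_id, attempt, container)
    return "{}/{}/containers/{}/{}/stderr.gz".format(
        log_prefix, jobflow_id, application_id, container_dir)
```

You can then pull the file down with `aws s3 cp` and gunzip it to read the actual exception.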


#3

Sweet!

Does the application_1532181544882_0002 folder refer to step 0002 of the whole EMR job, and container_application_1532181544882_0002_01_000001 to the actual nodes (master, core, task)?

It turns out this is the error:

18/07/21 14:03:00 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: com.snowplowanalytics.snowplow.enrich.common.FatalEtlError: EtlError Errors:
  - error: NonEmptyList(error: Schema key iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/1-0-0 is not supported. A 'ip_lookups' enrichment must have schema 'iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-*'.
    level: "error"
)
    level: "error"
)

So I updated it, and the job now seems to be running fine.
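For anyone hitting the same error: the fix was pointing the enrichment file at the 2-0-* schema. Mine ended up looking roughly like this — treat the MaxMind database name and URI as placeholders, since those depend on your own setup:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLite2-City.mmdb",
        "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
      }
    }
  }
}
```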

BTW, is the valid SchemaVer X-0-0 hard-coded in the emr-etl-runner binary? I am asking because we have another Snowplow setup running (still using the R97 EMR ETL Runner), and there 1-0-0 works fine.
Thanks!


#4

Hi @pocin,

BTW, is the valid SchemaVer X-0-0 hard-coded in the emr-etl-runner binary? I am asking because we have another Snowplow setup running (still using the R97 EMR ETL Runner), and there 1-0-0 works fine.

This depends on the Spark Enrich (batch pipeline) and Stream Enrich (real-time pipeline) versions. The IP Lookups enrichment was upgraded in Snowplow R103, and you should use the corresponding configuration file from that release onwards.

When using the new version of the application, you need to make sure that all relevant configuration changes have been made. The Upgrade Guide will help in this case.

Hope this helps.