ETL EMR Failing on Step 2


#1

Hi There,

Please help!

I have the ETL-EMR job that is failing at Step two. I am running the job using --skip staging option. the job was running for over month without issues, I made the following change:

Changed

jobflow:
  master_instance_type: r3.xlarge
  core_instance_count: 2
  core_instance_type: r3.xlarge
  task_instance_count: 0 # Increase to use spot instances
  task_instance_type: r3.xlarge
  task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances

TO

jobflow:
  master_instance_type: m1.medium
  core_instance_count: 3
  core_instance_type: m3.xlarge
  task_instance_count: 0
  task_instance_type: m1.small
  task_instance_bid: 0.25

Error - “…running beyond physical memory limits”

Syslog for Step 2.

2017-01-09 23:17:46,603 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1484003463520
2017-01-09 23:17:46,604 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-3K5AFU21S2CDV:i-018b9b79d37f04a23:RunJar:08207 period:60 /mnt/var/em/raw/i-018b9b79d37f04a23_20170109_RunJar_08207_raw.bin
2017-01-09 23:17:48,259 INFO cascading.flow.hadoop.util.HadoopUtil (main): resolving application jar from found main method on: com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner$
2017-01-09 23:17:48,261 INFO cascading.flow.hadoop.planner.HadoopPlanner (main): using application jar: /mnt/var/lib/hadoop/steps/s-38789WQITNL6U/snowplow-hadoop-enrich-1.7.0.jar
2017-01-09 23:17:48,275 INFO cascading.property.AppProps (main): using app.id: 999FFF53599A42269AC131EB4FF52F53
2017-01-09 23:17:48,415 INFO org.apache.hadoop.conf.Configuration.deprecation (main): mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2017-01-09 23:17:49,763 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2017-01-09 23:17:50,237 INFO org.apache.hadoop.conf.Configuration.deprecation (main): mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2017-01-09 23:17:50,778 INFO cascading.util.Version (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): Concurrent, Inc - Cascading 2.6.0
2017-01-09 23:17:50,782 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] starting
2017-01-09 23:17:50,782 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] source: Hfs[“LzoByteArrayScheme[[UNKNOWN]->[ALL]]”][“hdfs:/local/snowplow/raw-events”]
2017-01-09 23:17:50,783 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] sink: Hfs[“TextDelimited[[‘json’]]”][“s3://launchpad-etl-emr/enriched/bad/run=2017-01-09-23-03-16”]
2017-01-09 23:17:50,783 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] sink: Hfs[“TextDelimited[[‘app_id’, ‘platform’, ‘etl_tstamp’, ‘collector_tstamp’, ‘dvce_created_tstamp’, ‘event’, ‘event_id’, ‘txn_id’, ‘name_tracker’, ‘v_tracker’, ‘v_collector’, ‘v_etl’, ‘user_id’, ‘user_ipaddress’, ‘user_fingerprint’, ‘domain_userid’, ‘domain_sessionidx’, ‘network_userid’, ‘geo_country’, ‘geo_region’, ‘geo_city’, ‘geo_zipcode’, ‘geo_latitude’, ‘geo_longitude’, ‘geo_region_name’, ‘ip_isp’, ‘ip_organization’, ‘ip_domain’, ‘ip_netspeed’, ‘page_url’, ‘page_title’, ‘page_referrer’, ‘page_urlscheme’, ‘page_urlhost’, ‘page_urlport’, ‘page_urlpath’, ‘page_urlquery’, ‘page_urlfragment’, ‘refr_urlscheme’, ‘refr_urlhost’, ‘refr_urlport’, ‘refr_urlpath’, ‘refr_urlquery’, ‘refr_urlfragment’, ‘refr_medium’, ‘refr_source’, ‘refr_term’, ‘mkt_medium’, ‘mkt_source’, ‘mkt_term’, ‘mkt_content’, ‘mkt_campaign’, ‘contexts’, ‘se_category’, ‘se_action’, ‘se_label’, ‘se_property’, ‘se_value’, ‘unstruct_event’, ‘tr_orderid’, ‘tr_affiliation’, ‘tr_total’, ‘tr_tax’, ‘tr_shipping’, ‘tr_city’, ‘tr_state’, ‘tr_country’, ‘ti_orderid’, ‘ti_sku’, ‘ti_name’, ‘ti_category’, ‘ti_price’, ‘ti_quantity’, ‘pp_xoffset_min’, ‘pp_xoffset_max’, ‘pp_yoffset_min’, ‘pp_yoffset_max’, ‘useragent’, ‘br_name’, ‘br_family’, ‘br_version’, ‘br_type’, ‘br_renderengine’, ‘br_lang’, ‘br_features_pdf’, ‘br_features_flash’, ‘br_features_java’, ‘br_features_director’, ‘br_features_quicktime’, ‘br_features_realplayer’, ‘br_features_windowsmedia’, ‘br_features_gears’, ‘br_features_silverlight’, ‘br_cookies’, ‘br_colordepth’, ‘br_viewwidth’, ‘br_viewheight’, ‘os_name’, ‘os_family’, ‘os_manufacturer’, ‘os_timezone’, ‘dvce_type’, ‘dvce_ismobile’, ‘dvce_screenwidth’, ‘dvce_screenheight’, ‘doc_charset’, ‘doc_width’, ‘doc_height’, ‘tr_currency’, ‘tr_total_base’, ‘tr_tax_base’, ‘tr_shipping_base’, ‘ti_currency’, ‘ti_price_base’, ‘base_currency’, ‘geo_timezone’, ‘mkt_clickid’, ‘mkt_network’, ‘etl_tags’, ‘dvce_sent_tstamp’, ‘refr_domain_userid’, ‘refr_dvce_tstamp’, ‘derived_contexts’, ‘domain_sessionid’, ‘derived_tstamp’, ‘event_vendor’, ‘event_name’, ‘event_format’, ‘event_version’, ‘event_fingerprint’, ‘true_tstamp’]]”][“hdfs:/local/snowplow/enriched-events”]
2017-01-09 23:17:50,784 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] parallel execution is enabled: true
2017-01-09 23:17:50,784 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] starting jobs: 3
2017-01-09 23:17:50,784 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] allocating threads: 3
2017-01-09 23:17:50,785 INFO cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] starting step: (1/3)
2017-01-09 23:17:50,907 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-1): Connecting to ResourceManager at ip-10-129-14-54.ap-southeast-2.compute.internal/10.129.14.54:8032
2017-01-09 23:17:51,186 INFO org.apache.hadoop.yarn.client.RMProxy (pool-5-thread-1): Connecting to ResourceManager at ip-10-129-14-54.ap-southeast-2.compute.internal/10.129.14.54:8032
2017-01-09 23:18:09,617 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (pool-5-thread-1): Total input paths to process : 30294
2017-01-09 23:18:09,617 INFO org.apache.hadoop.conf.Configuration.deprecation (pool-5-thread-1): mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
2017-01-09 23:18:21,403 INFO cascading.util.Update (UpdateRequestTimer): newer Cascading release available: 2.6.3
2017-01-09 23:18:28,537 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-1): number of splits:15146
2017-01-09 23:18:28,805 INFO org.apache.hadoop.mapreduce.JobSubmitter (pool-5-thread-1): Submitting tokens for job: job_1484003452716_0002
2017-01-09 23:18:29,010 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (pool-5-thread-1): Submitted application application_1484003452716_0002
2017-01-09 23:18:29,065 INFO org.apache.hadoop.mapreduce.Job (pool-5-thread-1): The url to track the job: http://ip-10-129-14-54.ap-southeast-2.compute.internal:20888/proxy/application_1484003452716_0002/
2017-01-09 23:18:29,067 INFO cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] submitted hadoop job: job_1484003452716_0002
2017-01-09 23:18:29,067 INFO cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] tracking url: http://ip-10-129-14-54.ap-southeast-2.compute.internal:20888/proxy/application_1484003452716_0002/
2017-01-09 23:25:27,282 WARN cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] hadoop job job_1484003452716_0002 state at FAILED
2017-01-09 23:25:27,284 WARN cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] failure info: Application application_1484003452716_0002 failed 2 times due to AM Container for appattempt_1484003452716_0002_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-10-129-14-54.ap-southeast-2.compute.internal:8088/cluster/app/application_1484003452716_0002Then, click on links to logs of each attempt.
Diagnostics: Container [pid=16209,containerID=container_1484003452716_0002_02_000001] is running beyond physical memory limits. Current usage: 528.6 MB of 512 MB physical memory used; 1.9 GB of 2.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1484003452716_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 16228 16209 16209 16209 (java) 3858 274 1918607360 134624 /usr/lib/jvm/java-openjdk/bin/java -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1484003452716_0002/container_1484003452716_0002_02_000001/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1484003452716_0002/container_1484003452716_0002_02_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
|- 16209 16207 16209 16209 (bash) 1 0 115806208 692 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1484003452716_0002/container_1484003452716_0002_02_000001/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1484003452716_0002/container_1484003452716_0002_02_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/log/hadoop-yarn/containers/application_1484003452716_0002/container_1484003452716_0002_02_000001/stdout 2>/var/log/hadoop-yarn/containers/application_1484003452716_0002/container_1484003452716_0002_02_000001/stderr

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
2017-01-09 23:25:27,298 WARN cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] task completion events identify failed tasks
2017-01-09 23:25:27,298 WARN cascading.flow.FlowStep (pool-5-thread-1): [com.snowplowanalytics…] task completion events count: 0
2017-01-09 23:25:27,302 WARN cascading.flow.FlowStep (pool-5-thread-3): [com.snowplowanalytics…] abandoning step: (3/3) …/snowplow/enriched-events, predecessor failed: (1/3)
2017-01-09 23:25:27,302 WARN cascading.flow.FlowStep (pool-5-thread-2): [com.snowplowanalytics…] abandoning step: (2/3) …d/run=2017-01-09-23-03-16, predecessor failed: (1/3)
2017-01-09 23:25:27,303 INFO cascading.flow.FlowStep (pool-5-thread-3): [com.snowplowanalytics…] stopping: (3/3) …/snowplow/enriched-events
2017-01-09 23:25:27,303 INFO cascading.flow.FlowStep (pool-5-thread-2): [com.snowplowanalytics…] stopping: (2/3) …d/run=2017-01-09-23-03-16
2017-01-09 23:25:27,304 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] stopping all jobs
2017-01-09 23:25:27,304 INFO cascading.flow.FlowStep (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] stopping: (3/3) …/snowplow/enriched-events
2017-01-09 23:25:27,305 INFO cascading.flow.FlowStep (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] stopping: (2/3) …d/run=2017-01-09-23-03-16
2017-01-09 23:25:27,305 INFO cascading.flow.FlowStep (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] stopping: (1/3)
2017-01-09 23:25:27,307 INFO cascading.flow.Flow (flow com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob): [com.snowplowanalytics…] stopped all jobs

Regards
SS


#2

As the error message suggests, you are probably running into memory limits. Try looking at your EMR cluster monitoring details and increasing EC2 instance types.


#3

I see that you’re running instances from the last instance generation, I would update your instances to m3 or m4, that’s the current EC2 instance operation and if you run into memory limits as it was already said, beef up the machines

P.S. nice overview of EMR instance types: https://aws.amazon.com/emr/pricing/


#4

A great resource for comparing instance types also - http://www.ec2instances.info/

Also note, that the 4th generation of instances no longer have instance store SSDs and provide EBS storage only by default. You could potentially run into space issues unless you use 3rd gen instances or mount additional EBS by using core_instance_ebs option in the runner config: https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample#L44