EMR failed with OOM @ Enrich

Guys -

I’m stuck debugging an EMR failure. The cluster fails about 20 minutes into the enrich job - monitoring shows a spike in memory allocation near the failure, and digging through the logs I found the culprit:

2017-06-02 11:56:20,622 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOfRange(Arrays.java:2694)
	at java.lang.String.<init>(String.java:203)
	at java.lang.StringBuilder.toString(StringBuilder.java:405)
	at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:349)
	at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
	at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
	at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
	at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
	at com.snowplowanalytics.snowplow.enrich.common.outputs.BadRow.toCompactJson(BadRow.scala:101)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13$$anonfun$apply$1.apply(EtlJob.scala:189)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13$$anonfun$apply$1.apply(EtlJob.scala:188)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13.apply(EtlJob.scala:188)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13.apply(EtlJob.scala:182)
	at com.twitter.scalding.FlatMapFunction.operate(Operations.scala:46)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
	at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
	at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
	at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
	at com.twitter.scalding.MapFunction.operate(Operations.scala:59)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)

[Monitoring charts]

I’m running this cluster with 4x m4.4xlarge instances, so it seems unlikely that it simply doesn’t have enough memory for this job. Moving to larger instance types and increasing the number of instances still results in job failures.

Any ideas what I could do to solve this?

Thanks!!

Hey @bernardosrulzon - the most likely scenario here is that you’re sending large batches of events via POST, combined with a very high event validation failure rate.

The problem is that each bad row contains the full raw payload, so it has a multiplicative effect:

  • 100 events per POST, plus
  • 90% failure rate, means
  • 90 bad rows, each one containing the raw payload of all 100 events

It’s this duplication that blows out the memory. Could that be happening here?
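
To make the arithmetic concrete, here’s a rough back-of-the-envelope sketch (illustrative Scala, not Snowplow code; the ~2 KB per raw event figure is an assumption) of how quickly this adds up:

// Back-of-the-envelope sketch (not Snowplow code): each failed event produces a
// bad row embedding the raw payload of the *entire* POST body, not just the one
// event that failed, so bad-row output grows with the square of events per POST.
object BadRowBlowUp extends App {
  val eventsPerPost = 100        // events batched into a single POST body
  val bytesPerEvent = 2 * 1024   // assumed ~2 KB per raw event (illustrative)
  val failureRate   = 0.9        // 90% of events failing validation

  val rawPayloadBytes = eventsPerPost * bytesPerEvent         // one POST body, ~200 KB
  val failedEvents    = (eventsPerPost * failureRate).toInt   // 90 bad rows emitted
  val badRowBytes     = failedEvents.toLong * rawPayloadBytes // each embeds the full payload

  println(f"raw POST payload:            ${rawPayloadBytes / 1024.0}%.0f KB")
  println(f"bad-row JSON for that batch: ${badRowBytes / 1024.0 / 1024.0}%.1f MB")
  // ~90x amplification per batch, and json4s/Jackson builds each bad-row string
  // in a single contiguous buffer, which is where the heap finally gives out.
}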

@alex Thanks! Very likely!

  • Is there any way I can see a sample of events that are failing at this stage?
  • Should this really result in a job failure? Can we log this error into the bad bucket, but continue processing events?

Hey @bernardosrulzon,

An OOM is a JVM killer unfortunately - there’s no coming back from that; think of it like an uncatchable exception.
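
For what it’s worth, here’s a small illustration (generic Scala, not Snowplow code) of why the usual “catch the error and write a bad row” pattern can’t save you here:

import scala.util.control.NonFatal

// OutOfMemoryError is a java.lang.Error (a VirtualMachineError), so the standard
// NonFatal handler that routes failures to the bad bucket never matches it --
// the error propagates up and kills the YARN child JVM.
def enrichOrBadRow(rawLine: String): Either[String, String] =
  try {
    Right(rawLine.toUpperCase) // stand-in for the real enrichment work
  } catch {
    case NonFatal(e) => Left(s"bad row: ${e.getMessage}")
    // an OutOfMemoryError falls straight through this handler
  }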

There is a strong argument that Snowplow should avoid this situation entirely, perhaps by truncating the raw payload embedded in each bad row, or by including only the specific event which failed rather than the whole payload.
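
As a sketch of what that truncation could look like (hypothetical code, not anything in the Snowplow codebase; maxPayloadChars and the bad-row shape are assumptions):

// Hypothetical sketch: cap the raw payload embedded in each bad row so a
// pathological batch cannot exhaust the heap. Names and limits are made up.
val maxPayloadChars = 10 * 1024 // e.g. keep at most ~10 KB of the offending payload

def truncatePayload(raw: String): String =
  if (raw.length <= maxPayloadChars) raw
  else raw.take(maxPayloadChars) + s"... [truncated, original length ${raw.length}]"

def toBadRowJson(rawPayload: String, errors: List[String]): String = {
  val line = truncatePayload(rawPayload).replace("\\", "\\\\").replace("\"", "\\\"")
  val errs = errors.map(e => "\"" + e.replace("\"", "\\\"") + "\"").mkString(",")
  s"""{"line":"$line","errors":[$errs]}"""
}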

Did anything get written out to the bad bucket before the crash?

Hey @alex - you nailed it :slight_smile:

Turns out we were sending events referencing schema version 1-0-2 (in batches of 1000 events), but that schema version hadn’t been deployed yet. It would be great to see only the specific events that failed, though - it helps a lot with debugging!
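
For anyone hitting the same thing, the failure mode looked roughly like this (vendor and schema name invented for illustration):

// Illustrative only -- vendor/schema names are made up. The tracker referenced a
// schema version that the Iglu registry didn't have yet, so every event in each
// 1000-event batch failed validation and became a bad row carrying the full payload.
val referencedByEvents = "iglu:com.acme/checkout/jsonschema/1-0-2"
val deployedToIglu = Set(
  "iglu:com.acme/checkout/jsonschema/1-0-0",
  "iglu:com.acme/checkout/jsonschema/1-0-1"
)

if (!deployedToIglu.contains(referencedByEvents))
  println(s"Could not resolve $referencedByEvents: the whole batch goes to bad rows")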

Thanks!!!
Bernardo