EMR failed with OOM @ Enrich

Guys -

I’m stuck debugging an EMR failure. The cluster fails about 20 minutes into the enrich job - monitoring shows a spike in memory allocation near the failure, and digging through the logs I found the culprit:

2017-06-02 11:56:20,622 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOfRange(Arrays.java:2694)
	at java.lang.String.<init>(String.java:203)
	at java.lang.StringBuilder.toString(StringBuilder.java:405)
	at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:349)
	at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
	at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2344)
	at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
	at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
	at com.snowplowanalytics.snowplow.enrich.common.outputs.BadRow.toCompactJson(BadRow.scala:101)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13$$anonfun$apply$1.apply(EtlJob.scala:189)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13$$anonfun$apply$1.apply(EtlJob.scala:188)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13.apply(EtlJob.scala:188)
	at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$13.apply(EtlJob.scala:182)
	at com.twitter.scalding.FlatMapFunction.operate(Operations.scala:46)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
	at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
	at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
	at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
	at com.twitter.scalding.MapFunction.operate(Operations.scala:59)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
	at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)

[Monitoring charts]

I’m running this cluster with 4x m4.4xlarge instances, so it seems unlikely that it simply doesn’t have enough memory for this job. Moving to larger instance types and increasing the number of instances still results in job failures.

Any ideas what I could do to solve this?

Thanks!!

Hey @bernardosrulzon - the most likely scenario here is that you’re sending large batches of events via POST, combined with a very high event validation failure rate.

The problem is that each bad row contains the full raw payload, so it has a multiplicative effect:

  • 100 events per POST, plus
  • 90% failure rate, means
  • 90 bad rows, each one containing the raw payload of all 100 events

It’s this duplication that blows out the memory. Could that be happening here?
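
To make the arithmetic concrete, here’s a rough back-of-the-envelope sketch (illustrative Scala, not Snowplow code; the ~2 KB per raw event figure is an assumption) of how quickly this adds up:

// Back-of-the-envelope sketch (not Snowplow code): each failed event produces a
// bad row embedding the raw payload of the *entire* POST body, not just the one
// event that failed, so bad-row output grows with the square of events per POST.
object BadRowBlowUp extends App {
  val eventsPerPost = 100        // events batched into a single POST body
  val bytesPerEvent = 2 * 1024   // assumed ~2 KB per raw event (illustrative)
  val failureRate   = 0.9        // 90% of events failing validation

  val rawPayloadBytes = eventsPerPost * bytesPerEvent         // one POST body, ~200 KB
  val failedEvents    = (eventsPerPost * failureRate).toInt   // 90 bad rows emitted
  val badRowBytes     = failedEvents.toLong * rawPayloadBytes // each embeds the full payload

  println(f"raw POST payload:            ${rawPayloadBytes / 1024.0}%.0f KB")
  println(f"bad-row JSON for that batch: ${badRowBytes / 1024.0 / 1024.0}%.1f MB")
  // ~90x amplification per batch, and json4s/Jackson builds each bad-row string
  // in a single contiguous buffer, which is where the heap finally gives out.
}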

@alex Thanks! Very likely!

  • Is there any way I can see a sample of events that are failing at this stage?
  • Should this really result in a job failure? Can we log this error into the bad bucket, but continue processing events?

Hey @bernardosrulzon,

An OOM is a JVM killer unfortunately - there’s no coming back from that; think of it like an uncatchable exception.
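
For what it’s worth, here’s a small illustration (generic Scala, not Snowplow code) of why the usual “catch the error and write a bad row” pattern can’t save you here:

import scala.util.control.NonFatal

// OutOfMemoryError is a java.lang.Error (a VirtualMachineError), so the standard
// NonFatal handler that routes failures to the bad bucket never matches it --
// the error propagates up and kills the YARN child JVM.
def enrichOrBadRow(rawLine: String): Either[String, String] =
  try {
    Right(rawLine.toUpperCase) // stand-in for the real enrichment work
  } catch {
    case NonFatal(e) => Left(s"bad row: ${e.getMessage}")
    // an OutOfMemoryError falls straight through this handler
  }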

There is a strong argument that Snowplow should avoid this situation entirely, perhaps by truncating the raw payload embedded in each bad row, or by including only the specific event which failed rather than the whole payload.
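
As a sketch of what that truncation could look like (hypothetical code, not anything in the Snowplow codebase; maxPayloadChars and the bad-row shape are assumptions):

// Hypothetical sketch: cap the raw payload embedded in each bad row so a
// pathological batch cannot exhaust the heap. Names and limits are made up.
val maxPayloadChars = 10 * 1024 // e.g. keep at most ~10 KB of the offending payload

def truncatePayload(raw: String): String =
  if (raw.length <= maxPayloadChars) raw
  else raw.take(maxPayloadChars) + s"... [truncated, original length ${raw.length}]"

def toBadRowJson(rawPayload: String, errors: List[String]): String = {
  val line = truncatePayload(rawPayload).replace("\\", "\\\\").replace("\"", "\\\"")
  val errs = errors.map(e => "\"" + e.replace("\"", "\\\"") + "\"").mkString(",")
  s"""{"line":"$line","errors":[$errs]}"""
}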

Did anything get written out to the bad bucket before the crash?

Hey @alex - you nailed it :slight_smile:

Turns out we were sending events referencing schema version 1-0-2 (in batches of 1000 events), but that schema version hadn’t been deployed yet. It would be great to see only the specific events that failed, though - it helps a lot with debugging!
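
For anyone hitting the same thing, the failure mode looked roughly like this (vendor and schema name invented for illustration):

// Illustrative only -- vendor/schema names are made up. The tracker referenced a
// schema version that the Iglu registry didn't have yet, so every event in each
// 1000-event batch failed validation and became a bad row carrying the full payload.
val referencedByEvents = "iglu:com.acme/checkout/jsonschema/1-0-2"
val deployedToIglu = Set(
  "iglu:com.acme/checkout/jsonschema/1-0-0",
  "iglu:com.acme/checkout/jsonschema/1-0-1"
)

if (!deployedToIglu.contains(referencedByEvents))
  println(s"Could not resolve $referencedByEvents: the whole batch goes to bad rows")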

Thanks!!!
Bernardo