Beam Enrich: Dataflow Worker logs reporting "java.lang.NullPointerException"

Earlier this week Beam Enrich started throwing errors out of the blue. Before 2020-03-10 09:00 AEDT, our Dataflow job was successfully pushing events onto our “Enriched-Good” topic, which were then being loaded just fine into our BQ dataset.

But after 2020-03-10 09:00 AEDT the pipeline stopped working. When the job kicks off, only the first ~5 seconds of enriched events show up in the enriched-good subscription. After that, an error fires off every ~10 seconds (below) and no more events land inside enriched-good.

Our setup is:

  • beam-enrich-1.0.0
    • us-central1-a
    • n1-standard-4
  • snowplow-bigquery-loader-0.2.0
    • us-central1-a
    • n1-standard-4
  • BigQuery located in australia-southeast

We haven’t made any changes recently that would have impacted this, though Google did recently announce changes to Pub/Sub cross-region access.

And the error being thrown is:

java.lang.NullPointerException
	at io.circe.Printer$PrintingFolder.onString(Printer.scala:278)
	at io.circe.Printer$PrintingFolder.onString(Printer.scala:262)
	at io.circe.Json$JString.foldWith(Json.scala:302)
	at io.circe.JsonObject$MapAndVectorJsonObject.appendToFolder(JsonObject.scala:433)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
	at io.circe.Json$JObject.foldWith(Json.scala:364)
	at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
	at io.circe.Json$JObject.foldWith(Json.scala:364)
	at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
	at io.circe.Json$JObject.foldWith(Json.scala:364)
	at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
	at io.circe.Json$JObject.foldWith(Json.scala:364)
	at io.circe.JsonObject$LinkedHashMapJsonObject.appendToFolder(JsonObject.scala:326)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:337)
	at io.circe.Printer$PrintingFolder.onObject(Printer.scala:262)
	at io.circe.Json$JObject.foldWith(Json.scala:364)
	at io.circe.Printer.pretty(Printer.scala:194)
	at io.circe.Json.noSpaces(Json.scala:116)
	at com.snowplowanalytics.snowplow.badrows.BadRow.compact(BadRow.scala:32)
	at com.snowplowanalytics.snowplow.badrows.BadRow.compact$(BadRow.scala:32)
	at com.snowplowanalytics.snowplow.badrows.BadRow$EnrichmentFailures.compact(BadRow.scala:140)
	at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$3(Enrich.scala:270)
	at cats.data.NonEmptyList.map(NonEmptyList.scala:76)
	at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$2(Enrich.scala:270)
	at cats.data.Validated.leftMap(Validated.scala:213)
	at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrich$1(Enrich.scala:270)
	at scala.collection.immutable.List.map(List.scala:286)
	at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.enrich(Enrich.scala:269)
	at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrichEvents$2(Enrich.scala:188)
	at com.snowplowanalytics.snowplow.enrich.beam.utils$.timeMs(utils.scala:153)
	at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$enrichEvents$1(Enrich.scala:185)
	at com.spotify.scio.util.Functions$$anon$6.processElement(Functions.scala:220)

Having trouble figuring out what’s happening here. It was rock solid up until the other day, and now a backlog is forming with the shadow of data loss hanging over us. Though we’re not yet dependent on our RT Pipeline, we want to make sure we have a few months of solid data before transitioning from Batch.

Hey @robkingston,

It’s not something we’ve seen before - we’ll jump on investigating straight away. Meanwhile, could you please try Beam Enrich 1.1.0? It should be available on DockerHub: https://hub.docker.com/layers/snowplow/beam-enrich/1.1.0/images/sha256-66dfafbd5dd909eae57bada014ae9be36dfbbda0ae5f1b7ef0f9abd077509871?context=explore

@anton - thanks. I managed to work around this: rolling back to 0.4.1 solved it. Backlog clearing :smile:

Both beam-enrich-1.0.0 and beam-enrich-1.1.0 fail in the same way. I noticed the new badrows functionality was added in beam-enrich-1.x.x but I’m not yet using that. It’s strange because it was working fine until the other day.

The fact that it was working before tells me that it probably stumbles upon some very unusual bad rows that our JSON library then struggles to encode. On the other hand, I still don’t see anything that could contain null there, so it must be something related to Beam Enrich passing nulls around.
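
For what it’s worth, here’s a minimal sketch (purely illustrative names, not actual Beam Enrich code) of how circe behaves when a raw null sneaks into a JSON string: the value is accepted at construction time and only blows up when the document is printed, which matches the Printer frames in your stack trace.

import io.circe.Json

// Illustrative only: Json.fromString does not null-check its argument,
// so a null is silently wrapped as a JSON "string"...
val payload = Json.obj(
  "parameters" -> Json.obj(
    "value" -> Json.fromString(null)
  )
)

// ...and the NullPointerException only surfaces here, inside circe's Printer
// (the same onString frame that appears at the top of the stack trace).
payload.noSpaces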

Anyway, we’ll investigate this. Any other information you can provide is highly appreciated.

Where might I find this problematic bad row? Happy to share it with you if I can find it, e.g. a null field, a null record, etc.

I don’t consume RT bad rows data right now, but it should be in our batch pipeline’s bad rows.

Hi @robkingston,

I’m afraid this bad row can’t be found anywhere: it’s created inside Beam Enrich and never emitted. You also won’t find it in your batch bad rows, since batch doesn’t use the new bad row format and the problem is specific to the encoding of these new bad rows.

The stack trace shows that the problem occurs for an EnrichmentFailures bad row, so maybe one of your enrichments (API, SQL, …) was working well before and has now stopped working, and for some reason the associated bad row can’t be encoded with circe.

I wonder if it could be that one of the bad row fields is set to null where a value is expected. We’ll continue to investigate and let you know as soon as we find something.
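
To illustrate what I mean, here is a hedged sketch (the case class below is made up for the example, not our actual bad row model): a derived circe encoder will happily encode a String field that is null, and the failure is deferred until the JSON is serialised.

import io.circe.Encoder
import io.circe.generic.semiauto.deriveEncoder
import io.circe.syntax._

// Made-up payload type for illustration; not the real EnrichmentFailures model.
final case class FailurePayload(querystring: String)

implicit val failurePayloadEncoder: Encoder[FailurePayload] =
  deriveEncoder[FailurePayload]

// Encoding succeeds and quietly produces a JSON string wrapping null...
val badRowJson = FailurePayload(querystring = null).asJson

// ...but printing it throws the NullPointerException seen in the worker logs.
badRowJson.noSpaces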

Cheers, @BenB. Hopefully this will help make others’ pipelines more resilient.

We’re not in any hurry to upgrade here. Backlog is clear now!

Hi @robkingston,

Great that you could clear your backlog.

Could you tell us which enrichments you activated in your pipeline please?

Hi @robkingston,

0.4.1 and 1.0.0 should emit the same bad rows, so in theory the problematic bad rows with the new format in 1.0.0 should also be emitted in 0.4.1 with the old format.

Could you tell us if you have bad rows with 0.4.1, and if so, would it be possible to get a sample please?

@BenB - here are the enrichments we have enabled:

  • campaign_attribution.json
  • currency_conversion_config.json
  • event_fingerprint_config.json
  • ip_lookups.json
  • pii_enrichment_config.json
  • referer_parser.json
  • ua_parser_config.json

The only change in our enrichments was upgrading referer_parser to support the changes in 1.x.x (and rolling it back for 0.4.1). Both schemas validate, and the 2-0-0 referer_parser schema had been working successfully for a few weeks, so I’m not sure whether there’s a config issue here.

Here are the two versions.

0.4.1:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/1-0-0",
    "data": {
        "vendor": "com.snowplowanalytics.snowplow",
        "name": "referer_parser",
        "enabled": true,
        "parameters": {
            "internalDomains": []
        }
    }
}

1.x.x:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/2-0-0",
    "data": {
        "vendor": "com.snowplowanalytics.snowplow",
        "name": "referer_parser",
        "enabled": true,
        "parameters": {
            "database": "referers-latest.json",
            "internalDomains": [],
            "uri": "gs://sp-referer-parser/third-party/referer-parser"
        }
    }
}

Thanks for the info @robkingston! Do you still have some bad rows after having rolled back to 0.4.1?

@BenB we always have at least some bad rows. Happy to share some of them if it helps.

Are there specific errors I can pull for you? I suspect it’s only a handful of bad rows across a few weeks of data. I can pull them from S3 for you.

@robkingston if you could send us some bad rows related to enrichment problems, that would be great. The bad rows that were failing in 1.0.0 should also be emitted by 0.4.1, just in a different format. It would help us to see what the 1.0.0 equivalent is and what could be wrong. Thanks a lot!

Hi @robkingston,

We might have found the origin of the problem: it would occur when one of the query string parameter values is null and something goes wrong in the enrichments. We’ve created an issue on GitHub. Thanks a lot for reporting.
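
To give a sense of the suspected mechanism, here is a sketch with illustrative names (not the actual enrich code): a parameter that appears in the query string without a value can surface as a null, and once that null is wrapped into the bad row’s JSON payload it triggers the same printer failure.

import io.circe.{Json, JsonObject}

// Illustrative parse result for a request like "...?utm_source&utm_medium=email",
// where the value-less parameter surfaces as null rather than an empty string.
val queryParams: List[(String, String)] = List(
  "utm_source" -> null,
  "utm_medium" -> "email"
)

// Rebuilding the payload as JSON wraps the null without complaint...
val payloadJson = Json.fromJsonObject(
  JsonObject.fromIterable(queryParams.map { case (k, v) => k -> Json.fromString(v) })
)

// ...and serialising the resulting bad row is where the NullPointerException is thrown.
payloadJson.noSpaces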

We’ll let you know when a new version with the fix is out. Meanwhile, if you could still send us some bad rows from 0.4.1 related to enrichment problems, that would be really great so that we can confirm our hypothesis.

We could confirm the origin of the issue: https://github.com/snowplow/snowplow/issues/4305#issuecomment-600692200

I’m experiencing the exact same issue and I’m happy to learn you seem to have found the origin.

I would like to avoid reverting to 0.4.1 as it uses quite an old SDK that doesn’t support the Dataflow streamingEngine. Is there any news on when to expect a release containing a fix?

Hi @sduck,

The RC with the fix will be published on Monday, and the final version sometime later next week.