R71: JSON validation in Scala Common Enrich


#1

Hi guys

We’re looking to upgrade our pipeline and were trying to reconcile differences between processing a set of raw logs on an older version of Snowplow vs a newer version.

We were trying to figure out why the number of events loaded into our Redshift events table was so much greater in the older version than in the newer one, and we think we have isolated it to a change made in r71 that moved JSON validation into Scala Common Enrich. Specifically, the note here: http://snowplowanalytics.com/blog/2015/10/02/snowplow-r71-stork-billed-kingfisher-released/#json-validation

Please note: if the unstructured event or any of the custom contexts fail validation against their respective JSON Schemas in Iglu, then the event will be failed and written to the bad bucket.

Please correct me if I’m wrong here, but it seems like prior to this version, enriched events would STILL be loaded via the storage loader even if there were validation errors during the shredding step. This means that it was possible for the event record to exist without any context records.

Following this change in r71, the same raw event (with a JSON validation error in its context data) would be moved to the bad bucket in its entirety. The end result is that the event leaves no trace in the resulting storage location.

  1. We’re seeking guidance on how this change has affected other Snowplow users. If there have been other discussions on this topic that you know of, we’d appreciate a link to them.

  2. Are there workarounds for preserving the old behavior, such as changing the version of hadoop_enrich that is used? Would such a change, if even possible, be recommended?

Thanks!


#2

Hi @AlexN,

You are exactly right - early versions of Snowplow did not perform schema validation in Hadoop Enrich, leaving it to the Hadoop Shred job (preparing data for loading Redshift) to validate those contexts. And even if an event’s context failed validation in Hadoop Shred, the architecture meant that the event itself could still land in atomic.events.

This changed in R71 - from that point Hadoop Enrich validated all unstructured events and custom contexts (apart from the derived contexts generated in Hadoop Enrich itself). No event will make it through to e.g. Redshift if any of its contexts fail validation. Stream Enrich works the exact same way.
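As a rough illustration of the difference, here is a toy sketch. It is not the actual Hadoop Enrich or Hadoop Shred code: the `is_valid` check, the event shape and the `com.acme` schema URIs are all made up for the example.

```python
def is_valid(context):
    # Stand-in for validating a context against its JSON Schema in Iglu.
    # Hypothetical rule: the "data" field must be a JSON object.
    return isinstance(context.get("data"), dict)


def route_pre_r71(event):
    """Pre-R71 behavior: the event always lands; invalid contexts are lost."""
    good_contexts = [c for c in event["contexts"] if is_valid(c)]
    return {"atomic_event": event["event_id"], "contexts": good_contexts, "bad": None}


def route_r71(event):
    """R71 onwards: a single invalid context fails the whole event."""
    if all(is_valid(c) for c in event["contexts"]):
        return {"atomic_event": event["event_id"], "contexts": event["contexts"], "bad": None}
    return {"atomic_event": None, "contexts": [], "bad": event["event_id"]}


# One event carrying one valid and one invalid context (hypothetical data).
event = {
    "event_id": "e1",
    "contexts": [
        {"schema": "iglu:com.acme/ok/jsonschema/1-0-0", "data": {"a": 1}},
        {"schema": "iglu:com.acme/broken/jsonschema/1-0-0", "data": "oops"},
    ],
}

before = route_pre_r71(event)  # event lands with only its valid context
after = route_r71(event)       # event goes to the bad bucket, nothing lands
```

With the old routing the event still reaches atomic.events minus the broken context; with the new routing it only shows up in the bad bucket.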

I’m not aware of a discussion about this with community users in Discourse or Google Groups, but this did come up with a couple of customers. There were two distinct issues:

  1. Users who were sending in events with multiple contexts, with perhaps 1 or 2 contexts which consistently failed validation. Post-upgrade, they went from all events landing in atomic.events along with all or most of their contexts, to no events arriving at all.
  2. Users (especially mobile users) who weren’t aware of self-describing JSON and Iglu, were not using Redshift, and were sending in regular JSONs rather than self-describing ones. Post-upgrade, they went from all events being available for analysis in S3 to none making it through (because Hadoop or Stream Enrich couldn’t find the self-describing schema URI to validate the JSONs).

It sounds like you are in the first bucket?
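For reference, the difference between a regular JSON and a self-describing one (the second group above) can be sketched as follows. This is a simplified check written for illustration, not the validation Snowplow performs: the regex is an approximation of the `iglu:vendor/name/format/version` URI shape, and the `com.acme` schema is hypothetical.

```python
import re

# Approximate shape of an Iglu schema URI:
#   iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
IGLU_URI = re.compile(
    r"^iglu:[A-Za-z0-9_.-]+/[A-Za-z0-9_-]+/[A-Za-z0-9_-]+/[0-9]+-[0-9]+-[0-9]+$"
)


def is_self_describing(obj):
    """True if obj looks like {"schema": "iglu:...", "data": ...}."""
    return (
        isinstance(obj, dict)
        and "schema" in obj
        and "data" in obj
        and isinstance(obj["schema"], str)
        and IGLU_URI.match(obj["schema"]) is not None
    )


# A regular JSON: nothing tells the pipeline which schema to validate against.
regular = {"page_type": "article", "author": "jane"}

# The same payload wrapped as a self-describing JSON (hypothetical schema URI).
self_describing = {
    "schema": "iglu:com.acme/page_meta/jsonschema/1-0-0",
    "data": {"page_type": "article", "author": "jane"},
}
```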

There’s no workaround within Snowplow for preserving the old behavior - we have doubled down on schema-based event description, and treat the old behavior as a bug (because it let invalid data through the pipeline) rather than a feature. Additionally, we plan over time to replace Snowplow’s current enriched event format (TSV with embedded JSONs) with a strongly-typed Avro format, and schema-free JSONs are incompatible with that. We do have plans to allow warnings rather than failures for invalid contexts (#351 Add finegrained validation to Snowplow Common Enrich), but these are at the very earliest stage currently.

Given this situation, if you want to upgrade then my recommendation would be to write a custom Hadoop job to fix/introduce self-describing schemas for the problematic JSONs. If the problem is ongoing (e.g. you have old tracker configurations out in the wild in mobile apps), then you would need to schedule this at the start of your EMR job flow each run.
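The core of such a fix-up step might look something like the sketch below. It only covers the wrapping itself; locating the JSONs inside your raw collector logs and running this as a Hadoop job are left out, and `DEFAULT_SCHEMA` is a hypothetical schema URI that you would replace with one you have actually published to your Iglu repository.

```python
import json

# Hypothetical schema URI for the legacy, schema-less JSONs.
DEFAULT_SCHEMA = "iglu:com.acme/legacy_event/jsonschema/1-0-0"


def to_self_describing(raw_json, schema_uri=DEFAULT_SCHEMA):
    """Wrap a schema-less JSON string in a self-describing envelope.

    JSONs that already carry "schema" and "data" keys are returned unchanged.
    """
    payload = json.loads(raw_json)
    if isinstance(payload, dict) and "schema" in payload and "data" in payload:
        return raw_json
    return json.dumps({"schema": schema_uri, "data": payload}, sort_keys=True)


legacy = '{"sku": "abc", "quantity": 2}'
fixed = to_self_describing(legacy)

# Running the step twice is safe: already-wrapped JSONs pass through as-is.
fixed_again = to_self_describing(fixed)
```

Making the step idempotent matters if it runs at the start of every EMR job flow, since newer trackers may already be sending correct self-describing JSONs.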

Sorry the answer isn’t better news - please follow up with any questions you have!


#3

hey, I am currently looking into event validation for raw standard events (no custom contexts or other customisations). Recently we experienced an issue where some trackers would send invalid data (e.g. the AMP tracker sending domain_userids of 64 chars) which is incompatible with the Redshift table schema (varchar(36)).

So I am looking for ways to filter out those events. Naturally the JSON validation in the EMR ETL process would be the perfect spot; however, I am having a hard time identifying exactly where the validation happens and what it validates against (I can’t find an Iglu schema for raw events either…?). Can somebody help?


#4

Hey @lehneres - the primitive properties (as opposed to the JSON properties) of the Snowplow Tracker Payload are largely not validated currently. You are right that it would be great to lean on our first-class JSON validation to also handle validating these, but this isn’t done today.
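In the meantime, a length check of your own before loading is one possible stopgap. The sketch below is not part of Snowplow: it assumes you have already parsed each event into a dict keyed by field name, and the only limit it knows about is the varchar(36) for domain_userid mentioned above.

```python
# Field length limits taken from the Redshift table definition.
# Only domain_userid (varchar(36)) comes from this thread; add others as needed.
MAX_LENGTHS = {"domain_userid": 36}


def oversized_fields(event, limits=MAX_LENGTHS):
    """Return the names of fields whose values exceed their length limit."""
    return [
        name
        for name, limit in limits.items()
        if event.get(name) is not None and len(event[name]) > limit
    ]


def partition(events, limits=MAX_LENGTHS):
    """Split events into (good, bad) according to the length limits."""
    good, bad = [], []
    for event in events:
        (bad if oversized_fields(event, limits) else good).append(event)
    return good, bad


events = [
    {"domain_userid": "a" * 36},  # fits varchar(36)
    {"domain_userid": "b" * 64},  # too long, e.g. the 64-char case
    {"domain_userid": None},      # missing value, nothing to check
]
good, bad = partition(events)
```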

At some point we will likely move to an all-JSON version 3 of the Tracker Protocol - at which point we should get validation of all properties in the protocol “for free”.