Skipping JSON validation and "validating" on enrichment instead?


#1

Hi,

We have recently set up a Snowplow batch pipeline (2-hour batches) with the Clojure collector and Redshift as the storage destination. We use a lot of unstruct events / custom contexts. We validate fields via JSON schema constraints, which sends the whole event to bad events whenever a single field in some context fails validation.

Consider a case: field X should be an integer, but for some reason the tracker sends it as a string. JSON validation then fails and the whole event line falls into bad events. Of course, we can debug the logs, find the problem and fix it so that it doesn’t happen again. However, from reading multiple threads it seems that recovering rows from bad events is generally either impossible or very time-consuming (correct me if I am wrong). If we skipped schema validation and let any data type pass for field X, the storage loader would fail instead, as Redshift would expect an integer. When you send many fields and events, you don’t really want your data to get lost or delayed because one field got corrupted. You just want to know that something is wrong with that one field and be able to fix it at a later point, if possible, while the rest of your events keep flowing. How we would imagine such cases being handled:

  • JSON validation passes
  • enrichment process takes field X and changes all string values to null + logs somewhere that it found an invalid type. Possibly this is not done on the raw data, but “duplicate” events / contexts are derived. Then you can still check raw data and fix it if, for example, numbers were sent as strings
  • storage loader is able to load all processed data as it is bound to match the right types
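
The steps above could be sketched roughly as follows. This is only an illustration of the idea in plain Python, not how Snowplow enrichment works: `coerce_context`, `JSON_TYPES`, the `field_types` mapping and the sample field names are all hypothetical, and a real version would read the expected types from the JSON schema itself.

```python
import copy
import logging

logger = logging.getLogger("lenient_enrich")  # hypothetical logger name

# Hypothetical mapping from JSON Schema type names to Python types.
JSON_TYPES = {
    "integer": int,
    "number": (int, float),
    "string": str,
    "boolean": bool,
}


def matches(value, json_type):
    """Check a single value against a JSON Schema type name."""
    # bool is a subclass of int in Python, so reject it for numeric types.
    if json_type in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[json_type])


def coerce_context(context, field_types):
    """Return (derived_copy, warnings); the raw context is left untouched."""
    derived = copy.deepcopy(context)
    warnings = []
    for field, json_type in field_types.items():
        value = derived.get(field)
        if value is not None and not matches(value, json_type):
            # Record what was wrong so the raw data can be fixed later.
            warnings.append(
                {"field": field, "expected": json_type, "got": type(value).__name__}
            )
            logger.warning(
                "Nulled field %r: expected %s, got %s",
                field, json_type, type(value).__name__,
            )
            derived[field] = None  # the loader sees a null, never a wrong type
    return derived, warnings


# Field "x" should be an integer but the tracker sent a string.
raw = {"x": "42", "label": "checkout"}
derived, warnings = coerce_context(raw, {"x": "integer", "label": "string"})
```

Here `derived` is the "duplicate" context that would go on to the storage loader with `x` set to null, `warnings` is the record of what was changed, and `raw` still holds the original string so it can be repaired later.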

It would be great to know the best practices for dealing with this problem, and your opinion on the process described above: could this work as part of enrichment?


#2

Hi @vytenisj - you are right, Snowplow is a very strongly typed pipeline - we validate types in Hadoop Enrich and in Hadoop Shred to make sure that all events can safely load into Redshift.

If your organization doesn’t have a mature process around testing new event types before putting them live (e.g. Snowplow Mini, Selenium integration tests etc), then as you say this can lead to events being rejected. And yes, Hadoop Event Recovery is a relatively involved process, not least because there are various ways that events can fail to process.

We have a ticket to explore replacing some validation failures with warnings:

https://github.com/snowplow/snowplow/issues/351

However this is not yet roadmapped, and causes complexities of its own: it will mean that some events are partially processed (e.g. loaded into Redshift with a couple of contexts missing), which will make recovery even harder to reason about and accomplish.

In any case - in the absence of support for partial event processing, it is well worth investing in a thorough internal testing process for your events and contexts. This will pay off in various ways (e.g. your analysts will have much greater confidence in the quality of the event data they are being sent).


#3

Thanks for the info, Alex!