We have recently set up a Snowplow batch pipeline (2-hour batches) with the Clojure collector and Redshift as the storage destination. We use a lot of unstruct events and custom contexts. We validate fields via JSON Schema constraints, which sends the whole event to bad events whenever a single field in any context fails validation.
Consider a case: field X should be an integer, but for some reason the tracker sends it as a string. JSON validation fails and the whole event line ends up in bad events. Of course we can dig through the logs, find the problem and fix it so that it doesn’t happen again. However, from reading multiple threads it seems that recovering rows from bad events is generally either impossible or very time-consuming (correct me if I am wrong). If we instead skip schema validation and let any data type pass for field X, the storage loader fails, because Redshift expects an integer.
When you send many fields and events, you don’t want your data to get lost or delayed because one field got corrupted. You just want to know that something is wrong with that one field and be able to fix it later, if possible, while the rest of your events keep flowing.
How we would imagine such cases being handled:
- JSON validation passes
- the enrichment process takes field X, replaces any value of the wrong type (here, a string) with null, and logs somewhere that it found an invalid type. Possibly this is not done on the raw data; instead, “duplicate” events / contexts are derived from it. That way you can still inspect the raw data and fix it if, for example, numbers were sent as strings
- the storage loader can load all processed data, since it is now guaranteed to match the expected types
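To make the idea concrete, here is a rough Python sketch of the null-and-log step from the list above. It is not Snowplow code and does not use any Snowplow API: the function name `null_invalid_fields`, the field names `x` and `label`, and the simplified type checks are all made up for illustration, and it only handles flat objects with the JSON Schema `type` keyword. It also assumes the schema allows null for the field (for example `["integer", "null"]`), otherwise the nulled value would itself be invalid.

```python
import copy

# Checks for a small subset of JSON Schema "type" values (illustration only).
# bool is excluded from the numeric checks because it is a subclass of int in Python.
_TYPE_CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
}


def null_invalid_fields(data, schema):
    """Hypothetical helper: return (cleaned_copy, problems).

    Fields whose value does not match the declared type are set to None in
    the copy; the raw value is recorded in `problems` so it can be fixed later.
    The input `data` is left untouched (the "raw" event).
    """
    cleaned = copy.deepcopy(data)
    problems = []
    for field, spec in schema.get("properties", {}).items():
        value = cleaned.get(field)
        if value is None:
            continue  # missing or already null: nothing to coerce
        types = spec.get("type")
        if types is None:
            continue  # no type constraint declared for this field
        if isinstance(types, str):
            types = [types]
        checks = [_TYPE_CHECKS[t] for t in types if t in _TYPE_CHECKS]
        if checks and not any(check(value) for check in checks):
            problems.append({
                "field": field,
                "expected": types,
                "got": type(value).__name__,
                "raw": value,
            })
            cleaned[field] = None
    return cleaned, problems


# Example: field x should be an integer but arrives as a string.
schema = {
    "properties": {
        "x": {"type": ["integer", "null"]},
        "label": {"type": "string"},
    }
}
raw = {"x": "42", "label": "checkout"}
cleaned, problems = null_invalid_fields(raw, schema)
# cleaned is {"x": None, "label": "checkout"}; raw still holds "42"
```

The `problems` list is what we would want written to some log or side table, so the original value ("42" here) is not lost and could be repaired afterwards.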
It would be great to hear the best practices for dealing with this problem, and your opinion on the process I described above: could this work as part of enrichment?