We’re looking to upgrade our pipeline and have been trying to reconcile differences between processing the same set of raw logs on an older version of Snowplow versus a newer one.
We were trying to figure out why the number of events loaded into our Redshift events table was so much greater on the older version than on the newer one, and we think we’ve isolated it to a change made in r71 that moved JSON validation into Scala Common Enrich. Specifically, this note: http://snowplowanalytics.com/blog/2015/10/02/snowplow-r71-stork-billed-kingfisher-released/#json-validation
> Please note: if the unstructured event or any of the custom contexts fail validation against their respective JSON Schemas in Iglu, then the event will be failed and written to the bad bucket.
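For context, here is roughly how we’ve been counting the bad rows produced under the new behavior, to confirm that events are being rejected for schema validation and not something else. This is just a sketch: the bucket name and run prefix are placeholders for our own setup, and it assumes the newline-delimited JSON bad-row layout (a "line" field plus an "errors" array) that the r71-era pipeline writes.

```python
# Sketch only: count bad rows for one run and print a few error payloads.
# BUCKET and PREFIX are placeholders; the bad-row layout assumed here is
# newline-delimited JSON objects with a "line" field and an "errors" array.
import json

import boto3

BUCKET = "my-snowplow-bucket"                     # placeholder
PREFIX = "enriched/bad/run=2015-10-10-00-00-00/"  # placeholder run folder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

bad_count = 0
samples = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        for raw in body.iter_lines():
            if not raw:
                continue
            row = json.loads(raw)
            bad_count += 1
            if len(samples) < 5:
                samples.append(row.get("errors"))

print(f"bad rows in run: {bad_count}")
for errors in samples:
    print(errors)
```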
Please correct me if I’m wrong here, but it seems that prior to this release, enriched events would STILL be loaded by the StorageLoader even if there were validation errors during the shredding step. This means it was possible for the event record to exist in atomic.events without any corresponding context records.
Following the change in r71, the same raw event (one with a JSON validation error in its context data) would be moved in its entirety to the bad bucket, so the end result would be an event with no trace at all in the resulting storage target.
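On the older pipeline, the symptom we believe corresponds to the pre-r71 behavior looks like this: events present in atomic.events with no matching rows in a shredded context table. Below is a hedged sketch of the check we ran. The context table name (com_acme_page_context_1) and the connection details are placeholders, and the check is only meaningful for a context that we attach to every event; shredded context tables join back to events on root_id = event_id.

```python
# Sketch only: count events in atomic.events that have no rows in one of
# their expected context tables. Table name and connection details are
# placeholders; shredded tables join to events on root_id = event_id.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="snowplow",
    user="readonly",
    password="...",
)

SQL = """
SELECT COUNT(*)
FROM atomic.events e
LEFT JOIN atomic.com_acme_page_context_1 c
       ON c.root_id = e.event_id
WHERE e.collector_tstamp BETWEEN %s AND %s
  AND c.root_id IS NULL;
"""

with conn, conn.cursor() as cur:
    cur.execute(SQL, ("2015-09-01", "2015-10-01"))
    print("events with no context rows:", cur.fetchone()[0])
```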
We’re seeking guidance on how this change has affected other Snowplow users. If you know of other discussions on this topic, we’d appreciate a link to them.
Are there workarounds for preserving the old behavior, such as changing the version of hadoop_enrich that is used? Would such a change, if it’s even possible, be recommended?
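For what it’s worth, our understanding is that EmrEtlRunner lets you pin the jar versions in config.yml, along these lines (the key names are from our reading of the r71-era sample config, and the version numbers are placeholders rather than a recommendation, so please correct us if they’re off). We’re unsure whether pinning hadoop_enrich back to a pre-r71 jar against a newer EmrEtlRunner is actually supported, hence the question.

```yaml
enrich:
  versions:
    hadoop_enrich: 1.0.0   # placeholder version, not a recommendation
    hadoop_shred: 0.5.0    # placeholder version
```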