You are exactly right - early versions of Snowplow did not perform schema validation in Hadoop Enrich, leaving it to the Hadoop Shred job (which prepares data for loading into Redshift) to validate those contexts. And even if an event's context failed validation in Hadoop Shred, the architecture meant that the event itself could still land in atomic.events.
This changed in R71 - from that point on, Hadoop Enrich has validated all unstructured events and custom contexts (apart from the derived contexts generated in Hadoop Enrich itself). No event will make it through to e.g. Redshift if any of its contexts fail validation. Stream Enrich works in exactly the same way.
I'm not aware of a discussion about this with community users in Discourse or Google Groups, but this did come up with a couple of customers. There were two distinct issues:
- Users who were sending in events with multiple contexts, with perhaps 1 or 2 contexts which consistently failed validation. Post-upgrade, they went from all events landing in atomic.events with all or most of their contexts, to no events arriving at all.
- Users (especially mobile users) who weren't aware of self-describing JSON and Iglu, were not using Redshift, and were sending in regular JSONs rather than self-describing ones. Post-upgrade, they went from all events being available for analysis in S3 to none making it through (because Hadoop or Stream Enrich couldn't find the self-describing schema URI to validate the JSONs against - see the example below for the difference).
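For anyone reading along who hasn't met self-describing JSON before, here is a minimal sketch of the difference. The field names and the `iglu:com.acme/...` schema URI are made-up placeholders - the URI must point at a JSON Schema you have actually published to your Iglu repository:

```python
# A plain JSON context as a pre-R71 tracker setup might have sent it
# (hypothetical payload):
plain_context = {
    "session_length": 42,
    "ab_test_group": "blue"
}

# The same context wrapped as a self-describing JSON. Enrich resolves the
# schema URI via Iglu and validates the "data" payload against it:
self_describing_context = {
    "schema": "iglu:com.acme/session_context/jsonschema/1-0-0",
    "data": plain_context
}
```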
It sounds like you are in the first bucket?
There's no workaround within Snowplow for preserving the old behavior - we have doubled down on schema-based event description, and treat the old behavior as a bug (because it let invalid data through the pipeline) rather than a feature. Additionally, we plan over time to replace Snowplow's current enriched event format (TSV with embedded JSONs) with a strongly-typed Avro format, and schema-free JSONs are incompatible with that. We do have plans to allow warnings rather than failures for invalid contexts (#351 Add fine-grained validation to Snowplow Common Enrich), but these are at the very earliest stage currently.
Given this situation, if you want to upgrade then my recommendation would be to write a custom Hadoop job to fix/introduce self-describing schemas for the problematic JSONs. If the problem is ongoing (e.g. you have old tracker configurations out in the wild in mobile apps), then you would need to schedule this at the start of your EMR job flow each run.
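To make that concrete, here is a minimal sketch of the core transformation such a job would perform, in Python. Everything here is illustrative: `SCHEMA_FOR_CONTEXT` and `make_self_describing` are hypothetical names, the schema URIs are placeholders for schemas you would publish to your own Iglu repository, and the surrounding work of reading, decoding and re-encoding the raw collector payloads in your EMR job flow is omitted:

```python
import json

# Hypothetical mapping from a distinguishing field in each problematic
# context to the Iglu schema URI you have published for it:
SCHEMA_FOR_CONTEXT = {
    "session_length": "iglu:com.acme/session_context/jsonschema/1-0-0",
    "ab_test_group": "iglu:com.acme/ab_test_context/jsonschema/1-0-0",
}

def make_self_describing(context):
    """Wrap a bare JSON context in a self-describing envelope, or pass it
    through untouched if it already has one."""
    if "schema" in context and "data" in context:
        return context  # already self-describing
    for field, schema_uri in SCHEMA_FOR_CONTEXT.items():
        if field in context:
            return {"schema": schema_uri, "data": context}
    # Unknown shape - surface it rather than silently passing bad data on
    raise ValueError("No schema known for context: " + json.dumps(context))
```

The same wrapping logic would apply whether you run it as a one-off backfill over historical data or as a recurring step at the start of each EMR run while old tracker configurations are still in the wild.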
Sorry the answer isn't better news - please follow up with any questions you have!