This is a great question, and it really comes down to the challenge of writing schemas for third-party data like Mandrill’s that we a) don’t control, and b) don’t have full visibility into.
Here’s the problem: let’s say that we have an “Acme Webhook” that supports Acme’s add_to_cart event. In our webhook adapter we will take the incoming add_to_cart event and tag it with the schema iglu:com.acme/add_to_cart/jsonschema/1-0-0.
And then we have a corresponding schema in, say, Iglu Central.
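As a rough illustration of the tagging step, here is a minimal Python sketch of an adapter wrapping an incoming payload as a self-describing JSON. The function name tag_event and the payload fields (sku, user_id) are made up for the example; they are not Acme's real data model or the actual adapter code.

```python
import json

SCHEMA_URI = "iglu:com.acme/add_to_cart/jsonschema/1-0-0"

def tag_event(payload: dict) -> dict:
    """Wrap a raw Acme payload as a self-describing JSON."""
    return {"schema": SCHEMA_URI, "data": payload}

# Hypothetical payload: the field names are invented for illustration.
event = tag_event({"sku": "ABC-123", "user_id": "u42"})
print(json.dumps(event, indent=2))
```

The adapter hard-codes the version, which is why every event keeps arriving as 1-0-0 no matter what properties it actually carries.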
Now let’s imagine that after a month of operation, we notice that a few Acme add_to_cart events include another property, quantity. What do we do? Well, this isn’t a schema evolution:
- The quantity field isn’t an addition to the schema - it’s been there forever, we just didn’t spot it originally
- Our webhook adapter is still happily attaching iglu:com.acme/add_to_cart/jsonschema/1-0-0 to each event
- All Acme events can be successfully validated against a copy of 1-0-0 which correctly includes quantity
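To make the validation point concrete, here is a small Python sketch comparing our original 1-0-0 with a copy of 1-0-0 that also lists quantity. Everything in it is an assumption for illustration: the field names and types are invented, the schemas are simplified dicts rather than real JSON Schema, and validates is a tiny stand-in for a proper validator that behaves as if the schema rejected unknown properties.

```python
# Hypothetical schemas: field names and types are assumptions, not Acme's real model.
ORIGINAL_1_0_0 = {
    "properties": {"sku": str, "user_id": str},
    "required": ["sku"],
}
PATCHED_1_0_0 = {
    "properties": {"sku": str, "user_id": str, "quantity": int},
    "required": ["sku"],
}

def validates(data: dict, schema: dict) -> bool:
    """Tiny stand-in for JSON Schema validation that rejects unknown properties."""
    props = schema["properties"]
    if any(key not in data for key in schema["required"]):
        return False
    return all(key in props and isinstance(value, props[key])
               for key, value in data.items())

events = [
    {"sku": "ABC-123", "user_id": "u42"},
    {"sku": "ABC-123", "user_id": "u42", "quantity": 2},
]
original_ok = [validates(e, ORIGINAL_1_0_0) for e in events]
patched_ok = [validates(e, PATCHED_1_0_0) for e in events]
print(original_ok)  # the event carrying quantity fails the original schema
print(patched_ok)   # both events pass once quantity is listed
```

Under these assumptions the events never changed; only our description of them did, which is the sense in which this is not an evolution.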
So, if schema evolution is off the table, then we are left with patching the existing schema. This is where we overwrite the existing schema in place with corrected content, keeping the same version number (here 1-0-0).
Of course, this is not ideal: patching is ugly, and doesn’t play nicely with the automated schema migration tech we are working on. But we don’t really have another option currently, because we don’t control the Acme data model - our Acme schema is just a best-effort estimate of the data’s structure, and if it’s wrong (or incomplete), then that’s “our problem”, not Acme’s.
The alternative would be to create a 1-0-1 with the additional field and rebuild the adapter to describe the incoming Acme event as 1-0-1. But this implies that the schema has evolved, which isn’t true - we’ve just improved our understanding of the schema.
A better approach we are exploring is offering schema inference for these kinds of scenarios. If we could attach the schema iglu:com.acme/add_to_cart/jsonschema/?-?-? to the event in our webhook adapter, then we could potentially use Schema Guru (running as part of Snowplow) to infer the best-current-guess version of the schema. Expect more on schema inference soon.
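To give a feel for what inference means here, the following Python sketch derives a best-guess schema from a handful of observed payloads. It is a toy: it is not Schema Guru's algorithm or API, the function name infer_schema and the sample events are hypothetical, and it only looks at top-level properties.

```python
# Map Python types of decoded JSON values to JSON Schema type names.
TYPE_NAMES = {
    str: "string", bool: "boolean", int: "integer", float: "number",
    list: "array", dict: "object", type(None): "null",
}

def infer_schema(events: list[dict]) -> dict:
    """Derive a best-guess JSON Schema fragment from observed event payloads."""
    seen_types: dict[str, set] = {}
    for event in events:
        for key, value in event.items():
            seen_types.setdefault(key, set()).add(TYPE_NAMES[type(value)])
    # A property is treated as required only if every observed event carries it.
    required = sorted(k for k in seen_types if all(k in e for e in events))
    return {
        "type": "object",
        "properties": {k: {"type": sorted(v)} for k, v in sorted(seen_types.items())},
        "required": required,
    }

# Hypothetical observations: most events lack quantity, a few include it.
observed = [
    {"sku": "ABC-123", "user_id": "u42"},
    {"sku": "XYZ-999", "user_id": "u7", "quantity": 2},
]
inferred = infer_schema(observed)
print(inferred)
```

With this approach a rarely seen property like quantity shows up as an optional field as soon as one event carries it, without anyone having to patch a schema by hand.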
A final note on the specific ticket linked about Mandrill: this is a rather low-friction patch as it a) is purely additive and b) doesn’t impact the corresponding JSON Paths file or Redshift table in any way.
Hope this helps!