Why are some schemas patched rather than changing the version?


#1

In order to get some understanding of SemVer in practice, I was looking through some of the schemas in the Iglu GitHub repo.

I’ve noticed that some of the schemas are locked to version 1-0-0, and that changes have been made to them without incrementing the version number, which contradicts the guidelines in: http://snowplowanalytics.com/blog/2014/05/13/introducing-schemaver-for-semantic-versioning-of-schemas/

I can see the value of making changes directly to the schema, as you can take advantage of GitHub diffs when submitting schema changes as pull requests - but this then breaks the SemVer convention.
Is there some other factor that you guys are using to decide on schema version bumps?

This carries the conversation over from the following:


#2

Hi @IAmFledge,

This is a great question, and really comes down to the challenge of schema’ing third-party data like Mandrill that we a) don’t have control of, and b) don’t have full visibility of.

Here’s the problem: let’s say that we have an “Acme Webhook” that supports Acme’s add_to_cart event. In our webhook adapter we will take the incoming add_to_cart event and tag it like so:

{
  "schema": "iglu:com.acme/add_to_cart/jsonschema/1-0-0",
  "data": {
    "sku": "ipad1"
  }
}

And then we have a corresponding schema in say Iglu Central.

Now let’s imagine that after a month of operation, we notice that a few Acme add_to_cart events include another property, quantity. What do we do? Well, this isn’t a schema evolution:

  1. This quantity field isn’t an addition to the schema - it’s been there forever, we just didn’t spot it originally
  2. Our webhook adapter is still happily attaching iglu:com.acme/add_to_cart/jsonschema/1-0-0 to each event
  3. All Acme events can be successfully validated against a copy of add_to_cart version 1-0-0 which correctly includes sku and quantity

So, if schema evolution is off the table, then we are left with patching the existing schema. This is where we overwrite an existing schema in place with corrected contents, keeping the same version number.
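To make the effect of a patch concrete, here is a minimal sketch in Python. It is not Iglu's real validator: the `validate` function, the `registry` dict and the two schema dicts are hypothetical stand-ins, and the sketch assumes the original schema rejects unknown properties. The point it illustrates is that the schema key attached to the event never changes, only the schema stored under that key.

```python
# Simplified stand-in for JSON Schema validation; NOT Iglu's real validator.
SCHEMA_KEY = "iglu:com.acme/add_to_cart/jsonschema/1-0-0"

# Our first best guess at Acme's data model (assumed to reject unknown fields).
original = {
    "properties": {"sku": str},
    "required": ["sku"],
}

# The patched schema: same key, but now it knows about quantity.
patched = {
    "properties": {"sku": str, "quantity": int},
    "required": ["sku"],
}


def validate(data, schema):
    """Return True if data has every required field, no unknown fields, and matching types."""
    props = schema["properties"]
    if any(name not in data for name in schema["required"]):
        return False
    for name, value in data.items():
        if name not in props or not isinstance(value, props[name]):
            return False
    return True


# Hypothetical in-memory registry standing in for a schema repository.
registry = {SCHEMA_KEY: original}
event = {"schema": SCHEMA_KEY, "data": {"sku": "ipad1", "quantity": 2}}

before = validate(event["data"], registry[event["schema"]])

# The patch: overwrite the schema in place; the key on the event is untouched.
registry[SCHEMA_KEY] = patched
after = validate(event["data"], registry[event["schema"]])
```

Because the patch is purely additive and `quantity` is optional, events that only carry `sku` keep validating after the overwrite.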

Of course, this is not ideal: patching is ugly, and doesn’t play nicely with the automated schema migration tech we are working on. But we don’t really have another option currently, because we don’t control the Acme data model - our Acme schema is just a best-effort estimate of the data’s structure, and if it’s wrong (or incomplete), then that’s “our problem”, not Acme’s.

The alternative would be to create a 1-0-1 with the additional field and rebuild the adapter to describe the incoming Acme event as 1-0-1. But this implies that the schema has evolved, which isn’t true - we’ve just improved our understanding of the schema.

A better approach we are exploring is offering schema inference for these kinds of scenarios. If we could attach the schema iglu:com.acme/add_to_cart/jsonschema/?-?-? to the event in our webhook adapter, then we can potentially use Schema Guru (running as part of Snowplow) to infer the best-current-guess version of the schema. Expect more on schema inference soon.
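As a rough illustration of what inference means here, the Python sketch below derives a best-current-guess description from a handful of sample payloads. It is not Schema Guru's algorithm; `infer_schema` and the output shape are hypothetical, chosen only to show the idea that properties seen in every sample can be treated as required and the rest as optional.

```python
def infer_schema(events):
    """Infer property types and required fields from sample payloads (illustrative only)."""
    types = {}
    for data in events:
        for name, value in data.items():
            # Record every Python type name observed for this property.
            types.setdefault(name, set()).add(type(value).__name__)
    # A property is treated as required only if every sample contains it.
    required = sorted(name for name in types if all(name in data for data in events))
    return {
        "properties": {name: sorted(seen) for name, seen in types.items()},
        "required": required,
    }


samples = [{"sku": "ipad1"}, {"sku": "ipad2", "quantity": 2}]
guess = infer_schema(samples)
```

With these two samples, `sku` comes out as required and `quantity` as an optional integer, which matches the Acme scenario described above.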

A final note on the specific ticket linked about Mandrill: this is a rather low-friction patch as it a) is purely additive and b) doesn’t impact the corresponding JSON Paths file or Redshift table in any way.

Hope this helps!


#3

Thanks for the clear explanation Alex. This completely makes sense.

I guess SemVer could be modified to include an optional fourth element, such as 1-0-0-a or 1-0-0-1, to indicate explicitly that the schema is potentially incomplete and to record which variant of the schema was in use when an event / object was captured. That way you could tell whether a field was truly null or just unknown at the time. But this is probably overkill, and if really needed it could be inferred from the object’s timestamp relative to the patch release date.
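For what it’s worth, here is a small Python sketch of how the numeric variant of that idea (1-0-0-1) could be parsed. The fourth `patch` element is purely hypothetical and not part of the versioning scheme in the linked post; the first three field names follow that post’s MODEL-REVISION-ADDITION layout, and `parse_version` is a made-up helper.

```python
import re
from typing import NamedTuple, Optional


class Version(NamedTuple):
    model: int
    revision: int
    addition: int
    patch: Optional[int]  # hypothetical fourth element; None when absent


# Three dash-separated numbers, optionally followed by a fourth.
_PATTERN = re.compile(r"(\d+)-(\d+)-(\d+)(?:-(\d+))?")


def parse_version(text):
    """Parse '1-0-0' or the hypothetical '1-0-0-1' form into a Version."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised version: {text!r}")
    model, revision, addition, patch = match.groups()
    return Version(
        int(model),
        int(revision),
        int(addition),
        None if patch is None else int(patch),
    )


plain = parse_version("1-0-0")
patched_once = parse_version("1-0-0-1")
```

An event stamped with 1-0-0-1 rather than 1-0-0 would then record which variant of the schema was known when it was captured.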

I’m very keen to find out more about the automated schema migration you mentioned, as I’m currently designing something similar, though more of a realtime transformation service (this was the reason I was trying to fully understand SemVer, and seeing that stuff could be patched was really breaking any design ideas).


#4

Hey @IAmFledge - you are exactly right - schema patching plays havoc with automated schema migration.

“I’m very keen to find out more about the automated schema migration you mentioned”

There’s not a lot documented yet, but the building blocks are:

  1. The comments we attach to tables in Schema Guru so we can know exactly what version of a schema a table is up-to-date with
  2. The migration SQL scripts which Schema Guru can automatically generate now for simple migrations
  3. A re-write of StorageLoader to integrate it much more deeply with Iglu (so e.g. StorageLoader could say “ah, I need to load some com.acme/add_to_cart/jsonschema/1-0-5s but the table is still on com.acme/add_to_cart/jsonschema/1-0-3, so I need to run a SQL upgrade script before running the load”)
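The decision described in point 3 can be sketched roughly as follows in Python. This is not how StorageLoader or Schema Guru are implemented; `parse` and `migration_steps` are hypothetical names, and the sketch only handles upgrades where the first two version components match.

```python
def parse(version):
    """Split a version such as '1-0-3' into a tuple of integers."""
    return tuple(int(part) for part in version.split("-"))


def migration_steps(table_version, incoming_version):
    """List the (from, to) upgrade steps needed before loading (illustrative only)."""
    t_model, t_rev, t_add = parse(table_version)
    i_model, i_rev, i_add = parse(incoming_version)
    if (t_model, t_rev) != (i_model, i_rev):
        raise ValueError("this sketch only handles upgrades of the last component")
    # One step per increment of the last component; empty if already up to date.
    return [
        (f"{t_model}-{t_rev}-{n}", f"{t_model}-{t_rev}-{n + 1}")
        for n in range(t_add, i_add)
    ]


# Table is on 1-0-3, incoming events are 1-0-5: two upgrade scripts to run.
steps = migration_steps("1-0-3", "1-0-5")

# A patched schema keeps its version, so no upgrade is ever detected.
patched_steps = migration_steps("1-0-0", "1-0-0")
```

The second call shows why patching plays havoc with this approach: the table and the events both claim 1-0-0, so a version comparison alone cannot tell that the schema underneath has changed.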

Stay tuned as we roll out more Iglu functionality!