RFC: Extending Iglu for easier interoperability between digital event vendors

Over the past few years, we have seen many vendors in the digital analytics industry standardising on JSON as the format of choice for capturing and processing event data. In the same period, as the number of tools and applications that companies are using for digital analytics has grown, the need to plan and iterate on how the data will be structured, so that consumers can work with it more effectively, has led many to adopt JSON Schema.

In the past, when you had a single system generating, processing and driving value from digital event data, schemas were less important; the single system simply needed to be able to meet the needs of its users. But in the current environment, schemas have emerged as a critical enabling technology. They facilitate the effective sharing of data from the many systems producing event data across multiple apps and platforms, to the many systems consuming that data.

Iglu was developed back in 2014 to address this need. By “decoupling” Iglu from the core Snowplow technology, our hope was that other companies would be able to adopt it to facilitate the definition of their event data, and drive the processing of data on their pipelines. However, the adoption of Iglu as the common standard for defining events in the wider digital industry hasn’t happened to date.

Recently, we have been prompted to revisit this idea following discussions with the teams at Mixpanel and Iterative.ly.

In the current environment, a company might work with multiple vendors to generate and process event data (e.g. Iterative.ly, Mixpanel and Snowplow). Is it possible that by all leveraging Iglu as an enabling technology, a company could define their events once as a set of Iglu schemas, and have those same schemas power Iterative.ly, Mixpanel and Snowplow?
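For readers less familiar with Iglu, the "define once" idea rests on self-describing JSON: every event payload names the schema that governs it via an Iglu URI. A minimal sketch (`com.acme` and `button_click` are placeholder names, not real schemas):

```python
import re

# A self-describing JSON event, as used by Iglu: the payload carries a
# "schema" URI naming the vendor, event name, format and version.
# (com.acme and button_click are placeholder names.)
event = {
    "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
    "data": {"id": "checkout-cta"},
}

# Iglu URIs follow iglu:<vendor>/<name>/<format>/<MODEL-REVISION-ADDITION>
IGLU_URI = re.compile(
    r"^iglu:"
    r"(?P<vendor>[a-zA-Z0-9_.-]+)/"
    r"(?P<name>[a-zA-Z0-9_-]+)/"
    r"(?P<format>[a-zA-Z0-9_-]+)/"
    r"(?P<version>[0-9]+-[0-9]+-[0-9]+)$"
)

parts = IGLU_URI.match(event["schema"]).groupdict()
```

Because the schema reference travels with the data, any consumer (Snowplow, Mixpanel, Iterative.ly, or anything else) can resolve the same schema from a registry and agree on what the event means.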

Having to define & evolve a schema for an event multiple times in multiple places is cumbersome and can cause inconsistencies; enabling organisations to focus on getting that definition right once seems like a worthwhile goal, especially since the number of technologies a company works with for digital data is expanding all the time.

The conversations to date with the teams at Mixpanel and Iterative.ly have been very enlightening. They have helped us to identify how the Iglu standard does not meet all their needs, and we have started to explore how the standard could be extended to better meet them as well as the needs of the myriad of vendors working with event data.

We have discussed a number of ways to evolve the technology to facilitate broader adoption, including the following potential initial steps:

  1. Make it possible for individual vendors to extend their own Iglu metaschemas for their own Iglu Servers
  • This would make Iglu a lot more flexible and better able to meet the needs of individual technology vendors. However, it would drive discrepancies in the way different companies adopt Iglu, so in parallel we would look to:
  2. Put together a working group to evolve the Iglu standard as a whole
  • A big focus for this working group would be to look at the way different vendors have extended the Iglu schema, with a view to:
    1. Incorporating extensions that are widely adopted, to promote interoperability, and
    2. Finding ways to enable easy co-existence of extensions that different vendors require but are specific to those vendors (i.e. that do not appear to be interesting for the industry as a whole to adopt)

Beyond extending Iglu to better meet the needs of vendors adopting it, we are also interested in working with related standards, e.g. CloudEvents.

Before we invest and go too far in any particular direction however, we are keen to get a feel for:

  1. How much interest there is in developing a common standard for defining digital events amongst the wider digital analytics community (beyond Snowplow, Iteratively and Mixpanel)
  2. Whether the initial approach outlined is a sensible set of first steps in starting to realise this vision

That is the motivation for posting this RFC. We look forward to your feedback!


@emilybe just saw this project idea posted on Measure Slack. Would you be interested in hearing from someone at Amplitude as well?

Hi @HintikkaKimmo - absolutely! I will DM you to get something arranged!

  1. How much interest there is in developing a common standard for defining digital events among the wider digital analytics community (beyond Snowplow, Iteratively and Mixpanel)
  • I think this is interesting for anyone working in the event collection space. The question is what kind of standardization does one have in mind (some examples)? Iglu already uses JSON schemas, which is a standard on its own. The only thing that’s specific is schema versioning and schema paths. I’m not sure that really needs any improvement, unless we want some compatibility with other vendors who store their schemas differently.
  • Other than the above, I think Iglu could usefully have some sort of UI, so that creating/updating schemas is approachable not only to developers. The other thing that comes to mind is schema portability/re-usability to other platforms/storages, for example Parquet, but that should probably be handled as an integration rather than a core feature - of course, defining how to implement such integrations could be useful for wider adoption.
  • There’s also validation by schema, but I’m not sure whether that is in scope here, given it’s handled in the enricher at the moment.
  • Another kind of standardization that comes to mind is not related to Iglu itself but rather to expanding the standard schema library. Right now it’s more or less limited to Snowplow and other integrations, and mostly ecommerce - there might be other vendors/domains that could benefit from predefined schemas; maybe that could fall under CloudEvents, depending on how broad their spec becomes.

I think this is interesting for anyone working in the event collection space. The question is what kind of standardization does one have in mind (some examples)? Iglu already uses JSON schemas, which is a standard on its own. The only thing that’s specific is schema versioning and schema paths. I’m not sure that really needs any improvement, unless we want some compatibility with other vendors who store their schemas differently.

Schema versioning was definitely the first thing that interested me about Iglu. If there was broader adoption (by end-users) and support (by vendors) of semantically versioned schemas, it would be very useful for both the end users and vendors who support them. For example, Mixpanel could quickly show customers how events have evolved over time using their schemas as the source of truth. But as you said, this already exists in Iglu so what needs to be extended?
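SchemaVer, Iglu's versioning scheme, uses MODEL-REVISION-ADDITION rather than semver's major.minor.patch: roughly, an ADDITION is fully backward compatible, a REVISION may affect some consumers, and a MODEL bump marks an incompatible change. A minimal sketch of how a vendor could surface this to customers (function names are illustrative):

```python
def schemaver(version):
    """Parse a SchemaVer string like '1-0-2' into (model, revision, addition)."""
    model, revision, addition = (int(p) for p in version.split("-"))
    return model, revision, addition

def is_breaking(old, new):
    """Under SchemaVer, a MODEL bump marks an incompatible change;
    REVISION and ADDITION bumps keep the same MODEL."""
    return schemaver(new)[0] != schemaver(old)[0]

assert not is_breaking("1-0-0", "1-0-1")  # e.g. an optional field was added
assert is_breaking("1-0-2", "2-0-0")      # e.g. a required field changed type
```

This is exactly the kind of "how has this event evolved?" view that versioned schemas make cheap to build, since the answer is encoded in the version string itself.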

  1. Metadata - We need a way to enrich the schema with additional metadata. This would allow us to use customers’ Iglu schemas as the source of truth for many of our “data governance” features. It seems like a natural place to keep such information. Some of this metadata would be broadly useful and some would be vendor-specific. Some examples of metadata that we would find useful:

    • displayName - an alternative name to use for display purpose e.g. btn_clk => Button Click
    • owners - who owns this data / who should I talk to about it?
    • tags - arbitrary tags for data classification
    • platforms - what platforms send this event?
    • hidden - should this event be hidden in our UI?
    • dropped - should we stop ingesting this event?
  2. Loosened naming constraints on name in the self-describing props. Right now, Iglu has a restrictive regex that makes backward compatibility with much of our customers’ data impossible. For example, an event named “Button Click” fails the regex. It’s impractical to ask people to re-implement their tracking to conform to the regex.

  3. Support for multiple entity types - we’d like to use Iglu schemas to describe more than just events. For example, we would like to use Iglu to describe our User Profiles, Group Profiles, and Lookup Tables as well. This could in theory be done in the metadata, but without supporting it in the Iglu URI format you cannot disambiguate two different types of data with the same name. Maybe this isn’t a real problem.

Here’s a sort of “straw-man” instance of how we’d like to use Iglu schemas: viewed_report_schema.json · GitHub
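To make the metadata idea above concrete, here is a hypothetical sketch of a vendor-metadata block sitting alongside the standard Iglu `self` block. The `metadata` key, its placement, and its fields are illustrative proposals, not part of the current Iglu spec (only `self` and the metaschema URI are standard Iglu today):

```python
# Hypothetical sketch: proposed vendor metadata alongside the standard
# Iglu "self" block. The "metadata" key and its fields are illustrative.
schema = {
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "self": {
        "vendor": "com.acme",          # placeholder vendor
        "name": "btn_clk",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "metadata": {                      # proposed extension (illustrative)
        "displayName": "Button Click", # human-friendly name for UIs
        "owners": ["analytics@acme.example"],
        "tags": ["ui", "engagement"],
        "platforms": ["web", "ios"],
        "hidden": False,               # hide from vendor UIs?
        "dropped": False,              # stop ingesting this event?
    },
    "type": "object",
    "properties": {"id": {"type": "string"}},
    "additionalProperties": False,
}
```

Keeping governance fields next to `self` would let any vendor read them from the same registry fetch that resolves the schema itself.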


So there’s a few things I’ve got on my wishlist so I’ll just do a brain dump here.

  • Support for schema aliases (at a vendor and event level)

This would allow for easy migration/aliasing when a vendor name or the name of an event changes, without having to write a new schema, which often creates either a new table or column. For example, if a vendor is acquired (com.google => com.alphabet), this avoids creating a large number of new events/columns.
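One way an alias could work (a sketch under the assumption that aliases live as a registry-level lookup table; names and the table shape are hypothetical) is to rewrite incoming URIs to their canonical form before resolution, so no schemas are re-issued and no new tables are created:

```python
# Illustrative registry-level vendor alias table (hypothetical mechanism).
VENDOR_ALIASES = {"com.google": "com.alphabet"}

def resolve(uri):
    """Rewrite iglu:<vendor>/... URIs through the alias table."""
    prefix, rest = uri.split(":", 1)
    vendor, remainder = rest.split("/", 1)
    vendor = VENDOR_ALIASES.get(vendor, vendor)
    return f"{prefix}:{vendor}/{remainder}"

canonical = resolve("iglu:com.google/page_view/jsonschema/1-0-0")
# canonical == "iglu:com.alphabet/page_view/jsonschema/1-0-0"
```

Event-level aliases could follow the same pattern, keyed on the vendor/name pair rather than the vendor alone.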

  • Documentation / data

Better support for metadata within schemas - e.g., first class support for events / entities (perhaps in self?) and the allowance for additionalProperties metadata somewhere within a schema which allows for some basic key-value management including authorship details, modification dates etc.

Starting to think about how to document the relationships between entities and events (at the moment this is Snowplow specific as I’m not sure if other vendors support this model yet). e.g., the addToCart event must be sent with 1+ product entities, 1 cart entity etc.

  • Support for newer JSON schema features (I’d skip v7 and go straight to 2019-09)

There is a bunch of useful stuff here like references, definitions, vocabulary, unevaluatedProperties etc. Some of these will need to be pushed into a client to resolve (e.g., references) but this opens the door to a lot of nice capabilities around not needing to repeat yourself in schemas, reusing properties across multiple schemas etc. There’s a wider discussion here around how to handle fields (i.e., should fields be local to an event / entity or globally shared?)
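As one example of the reuse that newer drafts enable, JSON Schema 2019-09 formalises `$defs` for local definitions referenced via `$ref`, so a shared shape can be declared once instead of being repeated in every schema (the `product` definition below is a placeholder):

```python
# Sketch of $defs + $ref reuse under JSON Schema 2019-09: a shared
# "product" definition (placeholder) used by several properties.
schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$defs": {
        "product": {
            "type": "object",
            "properties": {
                "sku": {"type": "string"},
                "price": {"type": "number"},
            },
            "required": ["sku"],
        }
    },
    "type": "object",
    "properties": {
        "item": {"$ref": "#/$defs/product"},
        "related_items": {
            "type": "array",
            "items": {"$ref": "#/$defs/product"},
        },
    },
}
```

As noted above, a client (or registry) would need to resolve these references before validation, but that is the price of not repeating the `product` shape in every event schema that uses it.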

  • Entity / event disambiguation

I think this is covered in the documentation but I think it’s worth considering a more generic event type - e.g., a schema for enrichment is not necessarily an entity or an event.

  • Support for extended types (e.g., geography)

Likely both in the schemas (via format) and in downstream sources like Redshift and BigQuery now that more of these native types are being supported.

  • Clarification within Schemaver around backwards compatibility across data versus backwards compatibility across data structures

Historically the two have been a little conflated (mostly due to limitations in Redshift), which often muddied the waters between data that was backwards compatible and data structures that were backwards compatible. The BQ / Snowflake data model changes this a little, so it’d be great to have a clearer structure for how a schema translates to a table / column.
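To illustrate the distinction (field names are hypothetical): widening a field from `integer` to `number` is backward compatible for *data*, because every old payload still validates, yet it can break the *data structure*, because a warehouse column already created as INTEGER cannot hold the new values.

```python
# Illustrative contrast between data compatibility and data-structure
# (table) compatibility. The "duration" field is a placeholder.
v1 = {"properties": {"duration": {"type": "integer"}}}
v2 = {"properties": {"duration": {"type": "number"}}}  # widened type

old_payload = {"duration": 42}
new_payload = {"duration": 42.5}

def fits(payload, schema):
    """Toy check of the 'duration' field against the schema's type."""
    kind = schema["properties"]["duration"]["type"]
    value = payload["duration"]
    if kind == "integer":
        return isinstance(value, int)
    return isinstance(value, (int, float))  # "number" accepts both

assert fits(old_payload, v2)      # old data still valid: data-compatible
assert not fits(new_payload, v1)  # but an INTEGER column/consumer breaks
```

Schemaver currently has to pick one version bump for both kinds of change, which is where the ambiguity creeps in.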

  • Some relation back to version control (outside of schemaver)

For non-breaking changes, such as adding metadata like descriptions or changelog information, how do we version this appropriately, independently (dependently?) of data structure changes, so that we can see changes over time? Git seems a bit like overkill here but does offer some nice capabilities.

  • Handling additionalProperties

How should additionalProperties be handled where the downstream data source requires information about types / fields up front - is this up to enrichment / downstream processor or should this be pushed upstream?

  • Should Iglu schemas drop the jsonschema component from the URI?

Unless the intention is to support other formats going forward, dropping jsonschema from the URI could save a sizeable number of bytes on higher-volume pipelines.
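As a rough measure of the saving, assuming only the literal `jsonschema/` segment is dropped (vendor and event names below are placeholders):

```python
# Per-reference saving from dropping the "jsonschema/" URI segment.
full = "iglu:com.acme/button_click/jsonschema/1-0-0"
short = "iglu:com.acme/button_click/1-0-0"

saved = len(full) - len(short)  # 11 bytes per schema reference
```

Eleven bytes per schema reference is small on its own, but every event carries at least one such reference (often several, once entities are attached), so it adds up across billions of events.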

We’ve been looking at Iglu closely at Iteratively as we search for a standard to expose our versioned schema registry via our API. There’s definite interest from customers to import and consume their Iteratively-managed schemas in the various tools and systems they send data to so a broadly adopted standard would certainly be welcome. A few ideas we’d +1 from the replies above:

  • Loosened name: this is a deal breaker for any consumer other than Snowplow as they all tend to support Unicode characters in event names.
  • Additional metadata: owners, tags, category, sources (platforms), etc. likely in the self field to communicate the context stored around the event.
  • Event-entity types and relationships: a clear standard for identifying events vs entities and how they may relate. The latter may be a Snowplow-only concern at the moment as everyone else expects a single merged payload, but I’d expect at least some providers to adopt the approach.
  • Support for newer versions of the JSON Schema standard.

This is a good point and perhaps has been historically constrained by downstream dependencies. Snowplow has for a long time relied on JSON schema components to create either tables or columns - however, some destinations haven’t supported Unicode characters in their data stores / databases, which I think may have been partly behind the decision to go ASCII-only (maybe? I think @anton is probably the authority on this one).

For example - BigQuery only added support for Unicode columns last month (!) - before this, JSON schemas that contained any non-ASCII characters (as part of the column name, not the value itself) would have failed to create columns in BigQuery. This is no longer an issue, so it’s probably worth investigating which (if any) downstream destinations have similar constraints (e.g., columns beginning with special characters, numbers, etc.) that may not play well with schemas defined upstream of destinations.
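For destinations that still require ASCII, a loader could fall back to sanitising names rather than rejecting the schema outright. A hypothetical sketch of that kind of transliteration (the function and its policy are illustrative, not how any current loader behaves):

```python
import re
import unicodedata

# Hypothetical sanitisation a loader might apply when a destination only
# accepts ASCII column names: strip accents, replace the rest with "_".
def to_ascii_column(name):
    normalized = unicodedata.normalize("NFKD", name)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    column = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_").lower()
    return column or "unnamed"

assert to_ascii_column("Bouton Cliqué") == "bouton_clique"
assert to_ascii_column("ボタン") == "unnamed"  # nothing recoverable
```

The second case shows the limitation: for names with no ASCII-representable characters at all (emoji, CJK), transliteration loses everything, so the schema itself would need to carry a separate column-safe name.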

This would be an issue with Amplitude data. We allow Unicode characters everywhere and they tend to be commonly used by our clients. In our case, this can even include emojis in event names and properties.

Another important thing that comes to mind is clear entity types. We have both group and user entities. Users and groups have their own attributes, which persist onto each event until changed/updated. For example, things like platform, paying subscriber status or event source library would be user properties. Group properties can be trickier, as they can be simple things like workspace name, but they can also be dynamically generated, like daily active users of a specific account.