Integrating event sourcing with Snowplow


#1

Hi Snowplowers,

We have recently started working on a system that uses event sourcing, and we’re planning to use JSON schemas to validate its events.

It would be really interesting for us to make those events part of the unified log and be able to analyze them with our Snowplow infrastructure, but there’s an issue: Snowplow events aren’t fully self-describing. Some fields that map to atomic.events (like user_id) aren’t part of the self-describing event; they’re set at the tracker level.

The way we’re currently thinking about it is to have an optional ‘metadata’ field in all event schemas, so that events would look something like:

{
  "schema": "iglu:com.burgermaster/customer_placed_order/jsonschema/1-0-0",
  "data": {
    "order_id": 1234,
    "menu_item": "cheese burger",
    "metadata": {
      "app_id": "burger_master",
      "true_timestamp": "2017-02-21 10:00:00",
      "user_id": "9876"
    }
  }
}

We would then have a consumer that picks up the event, applies the fields in metadata to the tracker, and removes the metadata entry before firing the Snowplow event.
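To make the consumer step concrete, here is a minimal Python sketch of the "take the metadata out" part only. The function name split_metadata is hypothetical, and the sketch deliberately stops before any tracker call: how the extracted fields get applied depends on which tracker we end up using.

```python
import copy
from typing import Any, Dict, Tuple


def split_metadata(event: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (event without data.metadata, extracted metadata).

    Hypothetical helper: the input is left untouched and a copy is returned.
    Because 'metadata' is optional, a missing entry yields an empty dict.
    """
    clean = copy.deepcopy(event)
    metadata = clean.get("data", {}).pop("metadata", {})
    return clean, metadata


raw_event = {
    "schema": "iglu:com.burgermaster/customer_placed_order/jsonschema/1-0-0",
    "data": {
        "order_id": 1234,
        "menu_item": "cheese burger",
        "metadata": {
            "app_id": "burger_master",
            "true_timestamp": "2017-02-21 10:00:00",
            "user_id": "9876",
        },
    },
}

clean_event, metadata = split_metadata(raw_event)
# clean_event is what would be fired as the self-describing event;
# metadata holds the fields that would be set on the tracker first.
```

Working on a copy means the original event in the event store is never modified by the consumer.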

The solution is OK, but it’s unfortunate that some of the fields within metadata (especially the user ones) would make more sense in the event definition itself, or maybe as a context that all events share.

I would like to get feedback about the approach. Does it make sense to try to tie event sourcing with Snowplow? Is anyone doing it already? Can you think of a better way of doing it?

Many thanks,
Dani


#2

Hi Dani,

You are right - at the moment there are important fields within a Snowplow event which are only modelled as part of the Snowplow Tracker Protocol. These fields, such as app_id and user_id, are routed directly from the Tracker Protocol to the Snowplow enriched event, and they are not formally expressed as part of a self-describing JSON anywhere.

We plan on changing this over the coming months. We are working on a new RFC related to refactoring the Snowplow enriched event and the Snowplow Tracker Protocol - we essentially want to move all the “legacy” pieces of this to self-describing JSONs. To put it another way: really a Tracker API should just be an ergonomic wrapper over a self-describing event and a collection of custom contexts; there shouldn’t be any data points outside of this.

Hopefully this RFC will be out in the next few weeks. In the meantime, I think your approach makes sense, though I would probably go for an overall envelope instead, something like:

{
  "event": {
    "schema": "iglu:com.burgermaster/customer_placed_order/jsonschema/1-0-0",
    "data": {
      "order_id": 1234,
      "menu_item": "cheese burger"
    }
  },
  "metadata": {
    "schema": "iglu:com.burgermaster.snowplow/tracker_metadata/jsonschema/1-0-0",
    "data": {
      "app_id": "burger_master",
      "true_timestamp": "2017-02-21 10:00:00",
      "user_id": "9876"
    }
  },
  "contexts": [
    ...
  ]
}

This decouples the tracker metadata fields from your self-describing events and any contexts, and should minimize disruption when we start implementing the RFC across Snowplow, which will involve, as you say, cleanup such as migrating the user fields into contexts.
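As a rough illustration of how a consumer might take such an envelope apart, here is a small Python sketch. The names Envelope, is_self_describing and unpack_envelope are made up for this example, and the check is only a shape check (an "iglu:" schema URI plus a data field), not validation against the actual schemas. The contexts list is left empty here because the example above elides it.

```python
from typing import Any, Dict, List, NamedTuple


class Envelope(NamedTuple):
    event: Dict[str, Any]           # self-describing event, passed through as-is
    metadata: Dict[str, Any]        # tracker-level fields (app_id, user_id, ...)
    contexts: List[Dict[str, Any]]  # self-describing contexts, passed through as-is


def is_self_describing(obj: Any) -> bool:
    """Shape check only: an 'iglu:' schema URI plus a 'data' field."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("schema"), str)
        and obj["schema"].startswith("iglu:")
        and "data" in obj
    )


def unpack_envelope(envelope: Dict[str, Any]) -> Envelope:
    """Split an envelope into event, tracker metadata and contexts."""
    event = envelope["event"]
    metadata = envelope.get("metadata")
    contexts = envelope.get("contexts", [])

    parts = [event, *contexts]
    if metadata is not None:
        parts.append(metadata)
    for part in parts:
        if not is_self_describing(part):
            raise ValueError(f"not a self-describing JSON: {part!r}")

    return Envelope(event, metadata["data"] if metadata else {}, contexts)


envelope = {
    "event": {
        "schema": "iglu:com.burgermaster/customer_placed_order/jsonschema/1-0-0",
        "data": {"order_id": 1234, "menu_item": "cheese burger"},
    },
    "metadata": {
        "schema": "iglu:com.burgermaster.snowplow/tracker_metadata/jsonschema/1-0-0",
        "data": {
            "app_id": "burger_master",
            "true_timestamp": "2017-02-21 10:00:00",
            "user_id": "9876",
        },
    },
    "contexts": [],  # assumption: no contexts attached in this example
}

unpacked = unpack_envelope(envelope)
```

Since the event and contexts are never modified, only the handling of unpacked.metadata should need to change if the user fields later move into contexts.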