Self-describing events versus the mega JSON-object property for Snowflake?

We’re rethinking our Snowplow setup for our migration to Snowflake. One aspect of Snowflake is that there is no performance penalty for using semi-structured data (source).

Let’s say we want to track a booking form.

We might then have a self-describing event that captures both button clicks and page views (same schema for user journey purposes), like

action -> button-clicked, label -> complete-button, button-design -> cool-design and booking-event-type -> ski-holiday

and

action -> screen-viewed, label -> confirmation-screen, button-design -> null and booking-event-type -> ski-holiday

What would be the pros and cons of using a proper custom self-describing event versus putting everything in a generalised property JSON object, similar to custom structured events? I.e. property -> {button-design: null, booking-event-type: ski-holiday}
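To make the comparison concrete, here is a minimal sketch of the two shapes in Python. The Iglu schema URI (vendor com.acme, name booking_action) is hypothetical, and the generic variant assumes the extra fields are serialised into the structured event's property string.

```python
import json

# Option A: a specific self-describing event. The schema URI below is
# hypothetical; the fields would be validated against that schema.
specific_event = {
    "schema": "iglu:com.acme/booking_action/jsonschema/1-0-0",
    "data": {
        "action": "button-clicked",
        "label": "complete-button",
        "button-design": "cool-design",
        "booking-event-type": "ski-holiday",
    },
}

# Option B: structured-event style, with everything extra packed into
# one free-form "property" JSON string that no schema validates.
generic_event = {
    "action": "button-clicked",
    "label": "complete-button",
    "property": json.dumps(
        {"button-design": "cool-design", "booking-event-type": "ski-holiday"}
    ),
}


def booking_type(event):
    """Read booking-event-type from either shape."""
    if "schema" in event:
        return event["data"]["booking-event-type"]
    # The generic shape needs an extra parse step, and nothing
    # guarantees the key is present or spelled consistently.
    return json.loads(event["property"])["booking-event-type"]
```

Both shapes carry the same information; the difference is where the field definitions live and whether anything checks them before the data lands in the warehouse.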

As I see it:
Pros of unstructured (self-describing) events

  • Unnesting and splitting events into tables would be easy, as the JSON is in the same format coming out of the Snowflake Loader
  • You would get more validation on the collector side out of the box.

Cons

  • You may encounter difficulties in analysing journeys across different types of events

Am I missing something?

Hey @medicinal-matt,

So this topic is in heavy ‘principles’ territory, on which I have some firm opinions, but there are arguments for the other perspective. I’ll do my best not to overstate it but do bear it in mind :)

Long story short, I’d generally discourage this kind of generic event definition, because I see it as breaking the principles around which Snowplow is designed - whereby a schema describes an event, an event represents a single action/behaviour at a moment in time, and one field of data represents one conceptual thing.

Following these principles creates more up-front effort, because you have to design tracking well, schema your data up front etc. But it guarantees that you have consistency in your data definitions forever, and those definitions are forever documented in schemas. Metadata about how the event was processed is also in the data (including the schema), so anyone, at any point in the future, can figure out what any given event represents using only the information in the data; no additional context is required. (Of course in practice they’ll need context, but the principle is that all of the relevant metadata about the event is contained in the event - hence ‘self-describing’).

Structured events are basically the opposite of this: they prioritise an easy implementation at the expense of both modeling/analysing the data and future auditability. Not only do the same fields represent different things in a structured event, but the same values of those fields can easily end up representing different things in practice too - I’ve seen some really messy implementations. You also need to somehow document, and keep maintaining documentation about, what these fields/values represent.

I see the idea of a generic self-describing event as the same broad idea as structured events. Now there’s certainly no technical barrier to doing things that way, and there are advantages, but my own opinion is that the trade off isn’t worth it in the long term. If you see things according to the principles that Snowplow designs itself around, it’s poor tracking design - because the same things represent many concepts, and there is no reliable mechanism to avoid the problems that stem from that.

I do recognise, however, that implementing separate events makes it hard to model the data for things like a user journey funnel, and I have a suggestion for this - instead of a single event representing the steps of a user journey, you could instrument a single entity (aka context) to represent it. Then you’d have screen view events and button click events, which are each independently schema’d and tracked in their own right, but you would attach a user journey context to each of them, which enables you to model the funnel more easily.

There are definitely good arguments for the alternative approach, but my own position on it is that I value the separation of concepts a lot, and so my own stance is heavily weighted towards preferring separate events for each step. :)


That’s interesting!

Can you elaborate on this part to give me a clearer picture?

I do recognise, however, that implementing separate events makes it hard to model the data for things like a user journey funnel, and I have a suggestion for this - instead of a single event representing the steps of a user journey, you could instrument a single entity (aka context) to represent it. Then you’d have screen view events and button click events, which are each independently schema’d and tracked in their own right, but you would attach a user journey context to each of them, which enables you to model the funnel more easily.

I’m thinking of something even more advanced than a static funnel where the steps are clearly defined: an entire Sankey plot where the start and end are less defined.

It seems to me as well that anything defined outside a common field or context will be tedious to filter on; is that what you meant? How specific would you make page-view and button-click events? Or, even better, a list-item-selected event?

Would it be a screen-viewed event with a screen parameter, a booking-screen-viewed-event with a screen parameter or a booking-confirmation-screen-viewed event? It seems it would be a lot of overhead for the most specific? What’s nice about making it very specific is that you will have a good key in the event_name column for the user journey. Otherwise you would have to limit your plot to a selected set of events and create an ad hoc column for the tuple [(screen-viewed, screen-name), (button-clicked, button-name)].
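As a rough illustration of the ad hoc column idea, the sketch below builds a journey key from the (event name, element name) tuple and counts transitions between consecutive steps, which is the link data a Sankey plot needs. The rows, user IDs and names are made up, and the rows are assumed to be already ordered by time.

```python
from collections import Counter

# Hypothetical rows: (user_id, event_name, name), already ordered by time.
events = [
    ("u1", "screen-viewed", "booking-screen"),
    ("u1", "button-clicked", "complete-button"),
    ("u1", "screen-viewed", "confirmation-screen"),
    ("u2", "screen-viewed", "booking-screen"),
    ("u2", "button-clicked", "complete-button"),
]


def sankey_links(rows):
    """Count transitions between consecutive steps for each user."""
    links = Counter()
    previous = {}  # user_id -> last step seen for that user
    for user_id, event_name, name in rows:
        step = f"{event_name}:{name}"  # the ad hoc journey key
        if user_id in previous:
            links[(previous[user_id], step)] += 1
        previous[user_id] = step
    return links


links = sankey_links(events)
```

With very specific event names the key would simply be event_name; with generic events the key has to be assembled from the event name plus whichever field names the element.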

For the structured/mega-JSON event, you wouldn’t have to limit yourself to a preselected set of events, assuming the naming convention is consistent, since all events would be in the same column. Maybe this is a trade-off between defining many events and making it easy to build a user journey?

The scenario perhaps becomes more clear with a drop down list. Would you make a drop-down-list event with a list-items parameter or would you make a helmet-drop-down-list event with a specific helmet-size parameter?

I mostly agree with @Colm on this one, but it is a tricky argument, as there’s no real correct answer. I think it sits somewhere between the old-school event category / action / label / value and the other extreme of a Snowplow event schema for every event.

Some events I think make a lot of sense as a generic schema otherwise there is too much overhead and analysis becomes super tedious - examples of this include screen views, link clicks and drop down lists (imo).

My general rule of thumb tends to be around:

  1. How important is the event from an analysis perspective (an add to cart button versus a menu category button)
  2. What is the commonality of properties between those two events e.g., an add to cart and a share button probably have some shared properties but also quite different properties.
  3. How fast is this event / entity likely to evolve over time?
  4. Can the control itself (button, link, modal) be modelled as an entity?

I would probably go with the former if you mean that you would otherwise have one event definition for every possible drop down. The idea of trying to analyse that is terrifying.


Now that I think about it, I’m actually leaning towards having very specific events and it being worth the overhead. I think my fear of having multiple specific events is irrational.

I would probably go with the former if you mean that you would otherwise have one event definition for every possible drop down. The idea of trying to analyse that is terrifying.

Why would that be terrifying? It seems like an OK deal? For funnels you have the event name, and for specific analysis you could then easily bring in the helmet size parameter.

A somewhat related question:

Why did Snowplow choose to load the unstructured events as new columns instead of as a single column of JSON objects? Then you would get the schema and not have to worry about which column you’re in. Is the reason that they think the main use case is to analyse each event separately, so they might as well separate them on entry?

If you only have a couple of drop downs I think this is fine - if you’ve got something like 30 dropdowns then you end up with 30 columns which can get annoying to analyse (still doable but you’d ideally want to COALESCE in a data model).
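A small sketch of what that COALESCE looks like, run against an in-memory SQLite table from Python so it can be tried anywhere. The table and its three dropdown columns are hypothetical stand-ins for the per-event columns; with 30 dropdowns the COALESCE list would have 30 entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One column per specific dropdown event (hypothetical names).
conn.execute(
    "CREATE TABLE events (event_name TEXT, helmet_dropdown TEXT, "
    "boot_dropdown TEXT, ski_dropdown TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("helmet_dropdown", "M", None, None),
        ("boot_dropdown", None, "42", None),
        ("ski_dropdown", None, None, "170cm"),
    ],
)

# Collapse the per-event columns into one selected_item column.
rows = conn.execute(
    "SELECT event_name, "
    "COALESCE(helmet_dropdown, boot_dropdown, ski_dropdown) AS selected_item "
    "FROM events ORDER BY event_name"
).fetchall()
```

Doing this once in a data model keeps the wide column list out of day-to-day analysis queries.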

Partly - originally each event and entity ended up in its own table (it still does in the Redshift model), and then things moved to a column-per-event/entity model. There was a time when all of this ended up in a single column, but it tended towards being less performant (no types) or started to hit database limits - like not being indexed properly, hitting VARCHAR(MAX) limits and being sometimes awkward to access and version control.


if you’ve got something like 30 dropdowns then you end up with 30 columns which can get annoying to analyse (still doable but you’d ideally want to COALESCE in a data model).

This is very abstract and hard to reason about.

It sounds nasty, but I can’t come up with a realistic example of when this occurs. If you had a general drop-down list event, you could do GROUP BY list_name, list_item, but why would you do this? You already get the journey from the specific event name (unless you want to include the size in that journey).

It would be a trade-off in the specific versus semi-specific events discussion, but would you say it relates to the single-column versus column-per-event approach? In the single column you would still have to filter with CASE WHEN event_name = 'helmet-drop-down-list' THEN property -> helmet-size instead of reading helmet-drop-down-list -> helmet-size directly.

UPDATE: I have found a concrete issue. Let me try to write it here

I knew even as I was writing my original response that I’d overstate the argument! As usual Mike’s input helps me ground my ideas - I totally agree with this perspective. I think the crux of the argument I was trying to make is that those principles I spoke about apply, but we’re talking about tracking behaviour, and behaviour isn’t a sharply enough defined concept to be rigid in our thinking.

So I still think that one event should represent one concept, but I should clarify that because that concept is a behaviour, this does leave scope for things that are technically distinct from each other to still be represented as one event. For example, depending on your analysis/how you conceptualise them, a button click and a link click can represent the same thing in terms of the behaviour you’re analysing, so in my book there are certainly use cases where they can be the same event. Equally there are many ways for a user to view a product but one might conceptualise all of them as a product view.

It’s a tricky balancing act but generally I would just bear in mind that a good tracking design both fits the current use case, and doesn’t get in the way of future use cases. If a future analysis might require a distinction between button clicks and link clicks, it’s less acceptable to track them both as the one thing (or at least, one should give it some consideration).

Some things, though, are distinct classes of behaviour, and generally hard to understand as anything but separate concepts - so to my mind regardless of the convenience of classifying a (for example) page view and a button click as the same event, it’s hard for me to see past that distinction.

So, long story short: yes, I agree with Mike’s take. Interesting discussion in this thread!

Yes, for sure! I think the same principle applies to a more complicated chart like a Sankey. The idea here is that you would track each event as a distinct event, but attach the same entity/context to each of them, which contains the fields that are relevant to the user journey analysis.

So this way, each event is its own distinct thing (with data in its own distinct column), but they all share a common context column, which is populated for all of the relevant events.

Would it be a screen-viewed event with a screen parameter, a booking-screen-viewed-event with a screen parameter or a booking-confirmation-screen-viewed event? It seems it would be a lot of overhead for the most specific?

In this example you would have a screen view event for all three, and you would have a ‘journey’ context (I’d want to pick a better name though, I’m terrible at naming!). Let’s say the journey context has a ‘milestone’ field (again bad name) - this might contain the screen name, or some identifier for the screen, so the milestone field would have ‘booking confirmation reached’, and ‘booking screen reached’ for example. In that example we don’t see the benefit since both are the same screen view event (which you can get out of the box).

However when you track a button click which pertains to this analysis, you also attach the journey context, and the milestone field might be ‘add to cart clicked’ (but more sensible naming).

Let’s say you in future integrate some server-side actions like ‘fraud detection check passed’ - this fits the model by using the same context, even though the server-side event itself might not structurally fit any of the events on the client side (eg the event is ‘fraud detection processed’ and its data represents mostly technical information which is aimed at the behaviour of the technology rather than human behaviour). Probably a bad example but I hope the point makes sense.

Now when you want to query the data, to find all the events relevant to your journey analysis you query rows that have the journey context populated. The event_name field will be useful in seeing what is what (and all event-specific data will be somewhere in the row for that event), but all the data relevant to the journey analysis will be in the same column for all relevant events - so you don’t need to coalesce in anything unless you’re digging up additional data from a specific event type.

You end up with a structure in the events table that’s something along these lines:

event_name   | unstruct_screen_view | unstruct_button_click | contexts_journey
screen_view  | {screen_view_data}   | null                  | [{user_journey_data}]
button_click | null                 | {button_click_data}   | [{user_journey_data}]

There’s more than one way to skin a cat, like I said, but this might be a good way to deliver more convenience at analysis time without needing to compromise on principled event design.

(Note that the context column will be an array, but if you make sure to attach only one per event, you can just query the first element of the array and you don’t need to do any unnesting: contexts_journey[0].milestone)
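Here is a minimal Python sketch of that query logic over rows shaped like the table above. The row contents and the link_click event are invented for illustration; the sketch assumes at most one journey context is attached per event, so taking the first array element is enough.

```python
# Hypothetical rows shaped like the events table sketched above.
rows = [
    {
        "event_name": "screen_view",
        "unstruct_screen_view": {"name": "booking"},
        "unstruct_button_click": None,
        "contexts_journey": [{"milestone": "booking screen reached"}],
    },
    {
        "event_name": "button_click",
        "unstruct_screen_view": None,
        "unstruct_button_click": {"label": "add-to-cart"},
        "contexts_journey": [{"milestone": "add to cart clicked"}],
    },
    {
        # Not part of the journey: no context attached.
        "event_name": "link_click",
        "unstruct_screen_view": None,
        "unstruct_button_click": None,
        "contexts_journey": None,
    },
]


def journey_steps(rows):
    """Keep rows with a journey context and read its first element."""
    steps = []
    for row in rows:
        contexts = row.get("contexts_journey")
        if not contexts:
            continue  # event is not relevant to the journey analysis
        # Equivalent of contexts_journey[0].milestone in the warehouse.
        steps.append((row["event_name"], contexts[0]["milestone"]))
    return steps


steps = journey_steps(rows)
```

The filter is on the shared context column only, so new event types join the journey analysis simply by attaching the context, without any change to this logic.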
