Early versus late joins in Snowplow


#1

We often get asked variations of the following question:

We’d like some advice on our data/event model in terms of when we should append data to an event (e.g. in a context), and when we should use a reference table after receiving the event. From your expertise, are there best practices with Redshift/S3 for this?

For example, if a player makes a purchase using virtual currency in the game, I would include an item_context with:

item_id: 1234
item_category: clothes
item_name: pink bunny t-shirt

OR… should I just send the item_id and append the rest of the data using a reference table when doing analysis?

In the above question, the Snowplow users asks about the relative merits of joining / appending metadata about one of the entities in the event (in this case an item) early (eagerly) i.e. in the tracker vs joining the data late (once it is loaded into Redshift). There are a few considerations that drive a decision like this, but before we go into them let’s highlight the three different places you can join additional data to an event:

Early vs mid vs late joins

There are three places in the Snowplow pipeline you can append additional data about the entities involved in an event:

1. In the tracker i.e. early / eager joins

You can append the additional metadata to the event as a context when recording the event. In the above example the additional item context is tracked directly with the event.

2. At the enrichment step i.e. middle join

If the data that you want to append is accessible via an API for example, you can append the data at the enrichment step using our API Request Enrichment. We are currently working on a SQL enrichment, that will enable you to enrich events with data stored in a SQL database. We have a long list of standard enrichments in addition.

In the above example the a request would be made to an API or SQL database with the item ID which returns the additional item information. This would then get appended to the event as a derived_context.

3. Once the data has been loaded into the datawarehouse (e.g. in Redshift) i.e. late / lazy joins

Once the data is loaded into Redshift it is generally straightforward to join with other data. (Assuming that has also been loaded into Redshift.) So you might have a separate pipeline that loads item metadata (e.g. from your CMS) into Redshift so it can be joined at the data modeling step.

Considerations

1. How rapidly evolving are the dimension values you’re widening the event data with?

If values are changing frequently, it is generally better to do a mid or ideally an early join, because if you wait to do a late join, you’re actually appending values that are different to those when the event occurred. In some cases this might be desirable (i.e. you might want to know the latest price of a product that was bought) but in most it is not (you’re more likely to want to know the price of an item when it was bought). A mid join is better than a late join here, though the joined values are still subject to change if you do a full re-process of your raw events (as seen with the MaxMind geo-location enrichment).

2. Is the data available easily available in the tracker?

If the data that you want to append to the event is already available in the tracker, it is probably a good idea to capture it there. If a lot of additional engineering is required to push the data so that it is available in the tracker, or it’s sensitive data you don’t want to share in a tracker (like PII or profit margin), then you’re generally better off joining it at enrichment or after the fact in Redshift.

3. Network bandwidth

If you’re tracking data from mobile clients, you may want to limit the amount of data you join in the tracker (early). This is particular true if you have users who have patchy network coverage or pay relatively high mobile bills.

4. Are you running the real-time pipeline?

If so, doing late joins in Redshift is not ideal, because the output of those joins will only be available in Redshift, not in Kinesis. If you need the joined data in Kinesis, join either in the tracker or at enrichment.