How Snowplow data is structured in Snowflake

Recently, we’ve seen an uptick in the number of Snowplow users who opt for using Snowflake as an alternative storage target to Redshift. However, the way Snowplow data is structured in Snowflake differs from Redshift. This post is meant as a primer on what the key differences are and how to deal with them.

In Redshift, you have the ‘main’ atomic.events table and a number of additional child tables in the the atomic schema (such as atomic.com_snowplowanalytics_snowplow_link_click_1) which you have to join back to atomic.events.

By contrast, in Snowflake everything is in the atomic.events table; with ‘shredded’ data populating its own columns in that table: one for each context or unstruct event. New columns get added to atomic.events automatically when you start tracking (and loading) a new self-describing (unstruct) event or context.

Columns that contain data for contexts begin with contexts_, followed by the name the shredded table would normally have in Redshift, eg contexts_com_snowplowanalytics_snowplow_web_page_1. Columns that contain data for unstruct events begin with unstruct_event_, followed by the name the shredded table would normally have in Redshift, eg unstruct_event_com_snowplowanalytics_snowplow_submit_form_1.

The data in these columns is stored as an object or array. It’s easy enough to query those columns:

SELECT
  contexts_com_snowplowanalytics_snowplow_web_page_1
FROM
  atomic.events
WHERE
  collector_tstamp > DATEADD('day', -1, CURRENT_DATE())::DATE
LIMIT 5;
row# CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1
1 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
2 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
3 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
4 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
5 [ { “id”: “51a95cde-a42e-4ca4-890c-d7b1bfd24e70” } ]

As you can see, each cell contains the full array of contexts for the event. With the web_page context, we’re only sending one JSON, but for other contexts you may have multiple JSONs in the array.

Here is an example of what an unstruct event looks like:

SELECT
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1
FROM
  atomic.events
WHERE
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
LIMIT 1;
row# UNSTRUCT_EVENT_COM_SNOWPLOWANALYTICS_SNOWPLOW_LINK_CLICK_1
1 { “elementClasses”: [ “icon”, “btn-flat” ], “elementId”: “search-button”, “elementTarget”: “”, “targetUrl”: “https://discourse.snowplow.io/search” }

In this case the value of the cell is a single JSON rather than an array.

In analysing the data, you will want to extract specific elements from the JSON and ‘join’ them to the data in the rest of the atomic.events table. To do that, you can call each element of the JSON by name.

Extracting the individual elements of the JSON works differently when the value is a JSON and when it is an array.

Let’s first take a look at the simpler case, in which we have a JSON. If we want to get each one of the fields from the example above, we can do it like this:

SELECT
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses AS element_classes,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId AS element_id,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementTarget AS element_target,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:targetUrl AS target_url
FROM
  atomic.events
WHERE
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
LIMIT 1;
row# ELEMENT_CLASSES ELEMENT_ID ELEMENT_TARGET TARGET_URL
1 [ “icon”, “btn-flat” ] “search-button” “” https://discourse.snowplow.io/search

To get the individual elements of the element_classes array, you would run a query like this one:

SELECT
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses[0] AS element_classes_1,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses[1] AS element_classes_2,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId AS element_id,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementTarget AS element_target,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:targetUrl AS target_url
FROM
  atomic.events
WHERE
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
  AND unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId = 'search-button'
LIMIT 1;
row# ELEMENT_CLASSES_1 ELEMENT_CLASSES_2 ELEMENT_ID ELEMENT_TARGET TARGET_URL
1 “icon” “btn-flat” “search-button” “” https://discourse.snowplow.io/search

For arrays of JSONs, you can use Snowflake’s FLATTEN function, which works similarly to JOIN but within the same table (no join key required):

SELECT
  e.event_id,
  wp.value:id AS web_page_id
FROM
  atomic.events e, -- Note the comma
  LATERAL FLATTEN(INPUT => contexts_com_snowplowanalytics_snowplow_web_page_1) wp
LIMIT 5;
row# EVENT_ID WEB_PAGE_ID
1 13fc90f6-a7a1-4b2e-b966-616e175d8c73 “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a”
2 db565e88-0979-49a0-a2a1-4487ef2af13a “1982ff91-4078-49ec-be92-8b032faaf774”
3 21852e22-d02a-430a-b65b-da302b43b053 “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a”
4 35f21fca-86e3-40d4-8594-9a1a47d78cad “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a”
5 a3b97738-0008-4e8b-8694-e10bd5004d71 “1982ff91-4078-49ec-be92-8b032faaf774”

You can find the full documentation for the FLATTEN function here.

6 Likes

To find out more about Snowplow’s support for Snowflake database, please checkout the release blog post:

Awesome guide!

I’m curious why the data in contexts_com_snowplowanalytics_snowplow_web_page_1 is stored in an array. I see in the Redshift equivalent that there is a generic contexts column which may store multiple contexts, e.g. custom ones, page_views etc and so an array of json makes sense there, but in Snowflake the contexts are all in their own columns. In this example is it even possible for an event to have multiple web_page_id values? I mean if its a different web page then its another event right?

So then if this is an array that always contains a single element, then it feels like lateral flatten is a bit overkill for it when you could just index the zeroth element of the array. i.e. SELECT e.contexts_com_snowplowanalytics_snowplow_web_page_1[0]

There’s a comment in the final example "-- Note the comma "
Using commas to separate a list of tables or datasets in the FROM clause is really just a SQL-92 way of doing a CROSS JOIN, i.e. a cartesian product. Better to be explicit about that.

I have 3 context fields in my snowplow.events table on Snowflake, and doing CROSS JOIN LATERAL FLATTEN 3 times feels like it is not very performant.

This is consistent across the Snowflake and BigQuery loaders - any context can contain an array of objects even though some specific contexts (like web_page in this instance) often only contain a single object but often when this is customised (as ‘entities’) there may be an array of products, values etc. If you know that the context is only going to contain a single object I tend to just use direct indexing and use [0] as you’ve mentioned.

1 Like

Hi @mike. Consistency is a good reason, thanks for clarifying that.