How Snowplow data is structured in Snowflake


#1

Recently, we’ve seen an uptick in the number of Snowplow users who opt for using Snowflake as an alternative storage target to Redshift. However, the way Snowplow data is structured in Snowflake differs from Redshift. This post is meant as a primer on what the key differences are and how to deal with them.

In Redshift, you have the ‘main’ atomic.events table and a number of additional child tables in the the atomic schema (such as atomic.com_snowplowanalytics_snowplow_link_click_1) which you have to join back to atomic.events.

By contrast, in Snowflake everything is in the atomic.events table; with ‘shredded’ data populating its own columns in that table: one for each context or unstruct event. New columns get added to atomic.events automatically when you start tracking (and loading) a new self-describing (unstruct) event or context.

Columns that contain data for contexts begin with contexts_, followed by the name the shredded table would normally have in Redshift, eg contexts_com_snowplowanalytics_snowplow_web_page_1. Columns that contain data for unstruct events begin with unstruct_event_, followed by the name the shredded table would normally have in Redshift, eg unstruct_event_com_snowplowanalytics_snowplow_submit_form_1.

The data in these columns is stored as an object or array. It’s easy enough to query those columns:

SELECT
  contexts_com_snowplowanalytics_snowplow_web_page_1
FROM
  atomic.events
WHERE
  collector_tstamp > DATEADD('day', -1, CURRENT_DATE())::DATE
LIMIT 5;
row# CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1
1 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
2 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
3 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
4 [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ]
5 [ { “id”: “51a95cde-a42e-4ca4-890c-d7b1bfd24e70” } ]

As you can see, each cell contains the full array of contexts for the event. With the web_page context, we’re only sending one JSON, but for other contexts you may have multiple JSONs in the array.

Here is an example of what an unstruct event looks like:

SELECT
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1
FROM
  atomic.events
WHERE
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
LIMIT 1;
row# UNSTRUCT_EVENT_COM_SNOWPLOWANALYTICS_SNOWPLOW_LINK_CLICK_1
1 { “elementClasses”: [ “icon”, “btn-flat” ], “elementId”: “search-button”, “elementTarget”: “”, “targetUrl”: “https://discourse.snowplowanalytics.com/search” }

In this case the value of the cell is a single JSON rather than an array.

In analysing the data, you will want to extract specific elements from the JSON and ‘join’ them to the data in the rest of the atomic.events table. To do that, you can call each element of the JSON by name.

Extracting the individual elements of the JSON works differently when the value is a JSON and when it is an array.

Let’s first take a look at the simpler case, in which we have a JSON. If we want to get each one of the fields from the example above, we can do it like this:

SELECT
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses AS element_classes,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId AS element_id,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementTarget AS element_target,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:targetUrl AS target_url
FROM
  atomic.events
WHERE
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
LIMIT 1;
row# ELEMENT_CLASSES ELEMENT_ID ELEMENT_TARGET TARGET_URL
1 [ “icon”, “btn-flat” ] “search-button” “” https://discourse.snowplowanalytics.com/search

To get the individual elements of the element_classes array, you would run a query like this one:

SELECT
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses[0] AS element_classes_1,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses[1] AS element_classes_2,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId AS element_id,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementTarget AS element_target,
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1:targetUrl AS target_url
FROM
  atomic.events
WHERE
  unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
  AND unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId = 'search-button'
LIMIT 1;
row# ELEMENT_CLASSES_1 ELEMENT_CLASSES_2 ELEMENT_ID ELEMENT_TARGET TARGET_URL
1 “icon” “btn-flat” “search-button” “” https://discourse.snowplowanalytics.com/search

For arrays of JSONs, you can use Snowflake’s FLATTEN function, which works similarly to JOIN but within the same table (no join key required):

SELECT
  e.event_id,
  wp.value:id AS web_page_id
FROM
  atomic.events e, -- Note the comma
  LATERAL FLATTEN(INPUT => contexts_com_snowplowanalytics_snowplow_web_page_1) wp
LIMIT 5;
row# EVENT_ID WEB_PAGE_ID
1 13fc90f6-a7a1-4b2e-b966-616e175d8c73 “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a”
2 db565e88-0979-49a0-a2a1-4487ef2af13a “1982ff91-4078-49ec-be92-8b032faaf774”
3 21852e22-d02a-430a-b65b-da302b43b053 “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a”
4 35f21fca-86e3-40d4-8594-9a1a47d78cad “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a”
5 a3b97738-0008-4e8b-8694-e10bd5004d71 “1982ff91-4078-49ec-be92-8b032faaf774”

You can find the full documentation for the FLATTEN function here.


Snowplow Snowflake Loader 0.3.0 released
Migrating from Redshift to Snowflake
#3

To find out more about Snowplow’s support for Snowflake database, please checkout the release blog post:


#4

Awesome guide!