When to use server-side sessionization?
domain_sessionidx, and both trackers now also populate the client session context. Snowplow users that use these trackers and enable client-side sessionzation can use these fields to aggregate events into sessions with little or no effort.
However, there are cases where client-side sessionization alone is not enough. For example:
You decide to reduce the time-out interval from 30 to 10 minutes because it fits better with observed user behavior. This change doesn’t affect historical data and is likely to increase future session numbers, making it harder to compare past and current performance. It might therefore make sense to re-sessionize old events using the new time-out interval in order to retain a consistent session definition.
When events belonging to a single user are captured using multiple trackers, and all events need to be taken into account when sessionizing, then no tracker is guaranteed to have all the information it needs to sessionze events client-side. This can happen when users are expected to use multiple devices during a single session, or when server-side events need to be taken into account. In those situations, sessionization will have to happen server-side.
You might decide to group events using criteria other than the time-out interval.
Time-based sessionization in SQL
This page documents one approach to server-side sessionization using SQL. Below are a set of SQL queries that create a custom session index (which increments when 30 or more minutes have elapsed between 2 consecutive events). The different steps are discussed further down this page.
WITH step_1 AS ( SELECT id, tstamp, LAG(tstamp, 1) OVER (PARTITION BY id ORDER BY tstamp) AS previous_tstamp FROM events ), step_2 AS ( SELECT id, tstamp, previous_tstamp, CASE WHEN EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) < 60*30 THEN 0 ELSE 1 END AS new_session FROM step_1 ), step_3 AS ( SELECT id, tstamp, previous_tstamp, new_session, SUM(new_session) OVER (PARTITION BY id ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_idx FROM step_2 ) SELECT id, session_idx, MIN(tstamp) AS min_tstamp FROM step_3 GROUP BY id, session_idx
Finding gaps between consecutive events
Let’s use the dataset below as an example.
User A had 4 events but appears to have been inactive between the second and third event. More than 30 minutes elapsed, so we decide that the third event should belong to a new session. User B was inactive between her first and second event (32 minutes elapsed). We therefore want the second event to belong to a new session.
Let’s now implement this in SQL. We need to compare timestamps between consecutive rows that belong to the same user. First, we use
LAG to create a new column with the previous timestamp. We use a window function to partition by ID so we don’t compare timestamps that belong to different users.
SELECT id, tstamp, LAG(tstamp, 1) OVER (PARTITION BY id ORDER BY tstamp) AS previous_tstamp FROM events
Now that we have
previous_tstamp we can compare both timestamps. Let’s set the time-out interval to 30 minutes. If more than 30 minutes have elapsed between two consecutive events, we want to flag the last event as one that starts a new session.
SELECT id, tstamp, previous_tstamp, CASE WHEN EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) < 60*30 THEN 0 ELSE 1 END AS new_session FROM (...)
EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) to get the time difference in seconds. This is then passed through a
CASE statement to determine whether this difference exceeds the time-out interval. Note that this also returns 1 when
NULL. This outputs the following table:
Creating a session index
To create a session index, we cumulative sum over
new_session. For each user ID,
new_session is first equal to 1, then 0 for as long as the session continues. Each time a new session begins
new_session is equal to 1, a cumulative sum will thus increase the session index.
SELECT id, tstamp, previous_tstamp, new_session, SUM(new_session) OVER (PARTITION BY id ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_idx FROM (...)
We now have a session index:
From events to sessions
Now that we have a session index, we are in a position to aggregate events into sessions:
SELECT id, session_idx, MIN(tstamp) AS min_tstamp, COUNT(*) AS event_count FROM (...) GROUP BY id, session_idx
The end result is a table where each session is rolled up into a single row:
The above example uses
tstamp, which are not actual fields in the Snowplow canonical event model. Instead, Snowplow captures a whole range of possible identifiers and timestamps. This gives our users the option to pick an approach that is best suited to their needs.
user_id(a custom user ID)
user_ipaddress(the user IP address)
domain_userid(a user ID generated by Snowplow and stored in a first-party cookie)
network_userid(another user ID generated by Snowplow but stored in a third-party cookie)
user_fingerprint(an ID based on individual browser features)
The mobile trackers also capture user identifiers, which are stored in the mobile context. These include:
apple_idfa(identifier for advertisers)
apple_idfv(identifier for vendors)
It’s of course also possible to extend the base model and send in custom identifiers using a context or unstructured event.
Snowplow captures various timestamps, including:
collector_tstamp(when the event hit the collector)
dvce_tstamp(when the event happened)
derived_tstamp(introduced with Snowplow R63)