When to use server-side sessionization?
The Snowplow Javascript, Android, and iOS trackers have client-side sessionization built in. The trackers increment the session index if the time since the last event exceeds the time-out interval. The Javascript tracker assigns the session index to domain_sessionidx
, and both trackers now also populate the client session context. Snowplow users that use these trackers and enable client-side sessionzation can use these fields to aggregate events into sessions with little or no effort.
However, there are cases where client-side sessionization alone is not enough. For example:
-
You decide to reduce the time-out interval from 30 to 10 minutes because it fits better with observed user behavior. This change doesn’t affect historical data and is likely to increase future session numbers, making it harder to compare past and current performance. It might therefore make sense to re-sessionize old events using the new time-out interval in order to retain a consistent session definition.
-
When events belonging to a single user are captured using multiple trackers, and all events need to be taken into account when sessionizing, then no tracker is guaranteed to have all the information it needs to sessionze events client-side. This can happen when users are expected to use multiple devices during a single session, or when server-side events need to be taken into account. In those situations, sessionization will have to happen server-side.
-
You might decide to group events using criteria other than the time-out interval.
Time-based sessionization in SQL
This page documents one approach to server-side sessionization using SQL. Below are a set of SQL queries that create a custom session index (which increments when 30 or more minutes have elapsed between 2 consecutive events). The different steps are discussed further down this page.
WITH step_1 AS (
SELECT
id,
tstamp,
LAG(tstamp, 1) OVER (PARTITION BY id ORDER BY tstamp) AS previous_tstamp
FROM events
), step_2 AS (
SELECT
id,
tstamp,
previous_tstamp,
CASE WHEN EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) < 60*30 THEN 0 ELSE 1 END AS new_session
FROM step_1
), step_3 AS (
SELECT
id,
tstamp,
previous_tstamp,
new_session,
SUM(new_session) OVER (PARTITION BY id ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_idx
FROM step_2
)
SELECT
id,
session_idx,
MIN(tstamp) AS min_tstamp
FROM step_3
GROUP BY id, session_idx
Finding gaps between consecutive events
Let’s use the dataset below as an example.
User A had 4 events but appears to have been inactive between the second and third event. More than 30 minutes elapsed, so we decide that the third event should belong to a new session. User B was inactive between her first and second event (32 minutes elapsed). We therefore want the second event to belong to a new session.
Let’s now implement this in SQL. We need to compare timestamps between consecutive rows that belong to the same user. First, we use LAG
to create a new column with the previous timestamp. We use a window function to partition by ID so we don’t compare timestamps that belong to different users.
SELECT
id,
tstamp,
LAG(tstamp, 1) OVER (PARTITION BY id ORDER BY tstamp) AS previous_tstamp
FROM events
The returns:
Now that we have previous_tstamp
we can compare both timestamps. Let’s set the time-out interval to 30 minutes. If more than 30 minutes have elapsed between two consecutive events, we want to flag the last event as one that starts a new session.
SELECT
id,
tstamp,
previous_tstamp,
CASE WHEN EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) < 60*30 THEN 0 ELSE 1 END AS new_session
FROM (...)
We use EXTRACT(EPOCH FROM (tstamp - previous_tstamp))
to get the time difference in seconds. This is then passed through a CASE
statement to determine whether this difference exceeds the time-out interval. Note that this also returns 1 when previous_tstamp
is NULL
. This outputs the following table:
Creating a session index
To create a session index, we cumulative sum over new_session
. For each user ID, new_session
is first equal to 1, then 0 for as long as the session continues. Each time a new session begins new_session
is equal to 1, a cumulative sum will thus increase the session index.
SELECT
id,
tstamp,
previous_tstamp,
new_session,
SUM(new_session) OVER (PARTITION BY id ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_idx
FROM (...)
We now have a session index:
From events to sessions
Now that we have a session index, we are in a position to aggregate events into sessions:
SELECT
id,
session_idx,
MIN(tstamp) AS min_tstamp,
COUNT(*) AS event_count
FROM (...)
GROUP BY id, session_idx
The end result is a table where each session is rolled up into a single row:
In practice
The above example uses id
and tstamp
, which are not actual fields in the Snowplow canonical event model. Instead, Snowplow captures a whole range of possible identifiers and timestamps. This gives our users the option to pick an approach that is best suited to their needs.
The canonical event model has the following user-related fields:
-
user_id
(a custom user ID) -
user_ipaddress
(the user IP address) -
domain_userid
(a user ID generated by Snowplow and stored in a first-party cookie) -
network_userid
(another user ID generated by Snowplow but stored in a third-party cookie) -
user_fingerprint
(an ID based on individual browser features)
The mobile trackers also capture user identifiers, which are stored in the mobile context. These include:
-
apple_idfa
(identifier for advertisers) -
apple_idfv
(identifier for vendors) android_idfa
It’s of course also possible to extend the base model and send in custom identifiers using a context or unstructured event.
Snowplow captures various timestamps, including:
-
collector_tstamp
(when the event hit the collector) -
dvce_tstamp
(when the event happened) -
derived_tstamp
(introduced with Snowplow R63)