Basic SQL recipes for web data


#1

The following are basic SQL recipes that can be useful for anyone starting to get familiar with the structure of Snowplow data collected from the web (Javascript tracker in particular):

Number of unique visitors per day, last 31 days

The number of unique visitors can be calculated by summing the number of distinct domain_userids in a specified time period. (Because each user is assigned a unique domain_userid, based on a lack of Snowplow tracking cookies on their browser):

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) as "Date",
COUNT(DISTINCT(domain_userid)) as "Uniques"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 1;

2. Number of visits per day, last 31 days

Because each user might visit a site more than once, summing the number of domain_userids returns the number if visitors, NOT the number of visits. Every time a user visits the site, however, Snowplow assigns that session with a domain_sessionidx (e.g. 1 for their first visit, 2 for their second.) Hence, to count the number of visits in a time period, we concatenate the unique domain_userid with the domain_sessionidx and then count the number of distinct concatenated entry in the events table:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) as "Date",
COUNT(DISTINCT(domain_userid || '-' || domain_sessionidx)) as "Visits"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 1;

3. Number of page views per day

Page views are one type of event that are stored in the Snowplow events table. They can easily be identified using the event field, which is set to ‘page_view’.

To count the number of page views by day, then we simply execute the following query:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) AS "Date",
COUNT(*) AS "page_views"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
AND event = 'page_view'
GROUP BY 1
ORDER BY 1;

4. Number of events

Although the number of page views is a standard metric in web analytics, this reflects the web’s history as a set of hyperlinked documents rather than the modern reality of web applications that are comprise lots of AJAX events (that need not necessarily result in a page load.)

As a result, counting the total number of events (including page views but also other AJAX events) is actually a more meaningful thing to do than to count the number of page views, as we have done above. We recommend setting up Snowplow so that all events / actions that a user takes are tracked. Hence, running the below queries should return a total sum of events on the site by time period:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) AS "Date",
event,
COUNT(*) AS "Number"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
ORDER BY 1,2;

5. Pages per visit

The number of pages per visit can be calculated by visit very straightforwardly:

/* Redshift / PostgreSQL */
SELECT
domain_userid || '-' || domain_sessionidx AS "session",
COUNT(*) as "pages_visited"
FROM events
WHERE event = 'page_view'
AND derived_tstamp > current_date - integer '31'
GROUP BY 1

We can then aggregate our data by number of pages per visit, to produce a frequency table:

/* Redshift / PostgreSQL */
CREATE VIEW basic_recipes.pages_per_visit AS
SELECT
pages_visited,
COUNT(*) as "frequency"
FROM (
	SELECT
	domain_userid || '-' || domain_sessionidx AS "session",
	COUNT(*) as "pages_visited"
	FROM events
	WHERE event = 'page_view'
	AND derived_tstamp > current_date - integer '31'
	GROUP BY 1
) AS page_view_per_visit
GROUP BY 1
ORDER BY 1;

6. Bounce rate

First we need to identify all the sessions that were ‘bounces’. These are visits where there is only a single event captured: the initial page view:

/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "time_first_touch",
COUNT(*) as "number_of_events",
CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounces
FROM events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2;

This query returns a line of data for every session. For each, it logs a timestamp, the number of events, and a flag that is set to 1 if the visitor bounced.

To calculate bounce rate by day, we take the above table, aggregate the results by day, sum the number of bounces and divide it by the total number of sessions:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', time_first_touch) AS "Date",
SUM(bounces)::REAL/COUNT(*) as "Bounce rate"
FROM (
	SELECT
	domain_userid,
	domain_sessionidx,
	MIN(derived_tstamp) as "time_first_touch",
	COUNT(*) as "number_of_events",
	CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounces
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	GROUP BY 1,2
) v
GROUP BY 1
ORDER BY 1;

Note that we have to cast sum(bounces) as a ‘real’ number, to force Redshift / PostgreSQL to output a real number rather than an integer for the bounce rate.

7. % new visits

A new visit is easily identified as a visit where the domain_sessionidx = 1. Hence, to calculate the % of new visits, we need to sum all the visits where domain_sessionidx = 1 and divide by the total number of visits, in the time period.

First, we create a table with every visit stored, and identify which visits were “new”:

/* Redshift / PostgreSQL */
SELECT
MIN(derived_tstamp) AS "time_first_touch",
domain_userid,
domain_sessionidx,
CASE WHEN domain_sessionidx = 1 THEN 1 ELSE 0 END AS "first_visit"
FROM events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY domain_userid, domain_sessionidx;

Then we aggregate the visits over our desired time period, and calculate the fraction of them that are new:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', time_first_touch) AS "Date",
SUM(first_visit)::REAL/COUNT(*) as "fraction_of_visits_that_are_new"
FROM (
	SELECT
	MIN(derived_tstamp) AS "time_first_touch",
	domain_userid,
	domain_sessionidx,
	CASE WHEN domain_sessionidx = 1 THEN 1 ELSE 0 END AS "first_visit"
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	GROUP BY domain_userid, domain_sessionidx) v
GROUP BY 1
ORDER BY 1;

8. Average visitor duration

To calculate this, 1st we need to calculate the duration of every visit:

/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "start_time",
MAX(derived_tstamp) as "finish_time",
MAX(derived_tstamp) - min(derived_tstamp) as "duration"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2;

Then we simply average visit durations over the time period we’re interested e.g. by day:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', start_time) AS "Date",
AVG(duration)/1000000 as "average_visit_duration_seconds"
FROM (
	SELECT
	domain_userid,
	domain_sessionidx,
	MIN(derived_tstamp) as "start_time",
	MAX(derived_tstamp) as "finish_time",
	MAX(derived_tstamp) - min(derived_tstamp) as "duration"
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	GROUP BY 1,2
) v
group by 1
order by 1;

9. Demographics: language

For each event the browser language is stored in the br_language field. As a result, counting the number of visitors in a time period by language is trivial:

/* Redshift / PostgreSQL */
SELECT
br_lang,
COUNT(DISTINCT(domain_userid)) as "visitors"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY br_lang
ORDER BY 2 DESC;

10. Demographics: location

We can identify the geographic location of users using the geo_country, geo_region, geo_city, geo_zipcode, geo_latitude and geo_longitude fields.

To calculate the number of visitors in the last month by country, simply execute:

/* Redshift / PostgreSQL */
SELECT
geo_country,
COUNT(DISTINCT(domain_userid)) as "visitors"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2 DESC;

11. Behavior: new vs returning

Within a given time period, we can compare the number of new visitors (for whom domain_sessionidx = 1) with returning visitors (for whom domain_sessionidx > 1):

/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as time_first_touch,
CASE WHEN domain_sessionidx = 1 THEN 'new' ELSE 'returning' END AS "new_vs_returning"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY domain_userid, domain_sessionidx;

Then we can aggregate them by time period, to get the total new vs returning e.g. by day:

/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', time_first_touch) AS "Date",
SUM(first_visit)::REAL/COUNT(*) as "fraction_of_visits_that_are_new"
FROM (
	SELECT
	MIN(derived_tstamp) AS "time_first_touch",
	domain_userid,
	domain_sessionidx,
	CASE WHEN domain_sessionidx = 1 THEN 1 ELSE 0 END AS "first_visit"
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	GROUP BY domain_userid, domain_sessionidx) v
GROUP BY 1
ORDER BY 1;

12. Behavior: frequency

We can plot the distribution of visits in a time period by the number of visits each visitor has performed:

SELECT
domain_sessionidx as "Number of visits",
COUNT(DISTINCT(domain_userid)) as "Frequency"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2;

13. Behavior: recency

We can plot the distribution of visits by the number of days since the previous visit. To do this, we first identify all the visits in our time period:

/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
domain_sessionidx - 1 as "previous_domain_sessionidx",
MIN(derived_tstamp) as "time_first_touch"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1, 2;

We can join the above table with a similar table, but join each visit to data for the previous visit, so we can calculate the number of days between visits:

/* Redshift / PostgreSQL */
SELECT
n.domain_userid,
n.domain_sessionidx,
EXTRACT(EPOCH FROM (n.time_first_touch - p.time_first_touch))/3600/24 as "days_between_visits",
CASE
	WHEN n.domain_sessionidx = 1 THEN '0'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 1 THEN '1'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 2 THEN '2'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 3 THEN '3'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 4 THEN '4'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 5 THEN '5'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 10 THEN '6-10'
	WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 25 THEN '11-25'
	ELSE '25+' END as "Days between visits"
FROM (
	SELECT
	domain_userid,
	domain_sessionidx,
	domain_sessionidx - 1 as "previous_domain_sessionidx",
	MIN(derived_tstamp) as "time_first_touch"
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	GROUP BY 1,2
) n
LEFT JOIN (
	SELECT
	domain_userid,
	domain_sessionidx,
	MIN(derived_tstamp) as "time_first_touch"
	FROM "atomic".events
	GROUP BY 1,2
) p on n.previous_domain_sessionidx = p.domain_sessionidx
and n.domain_userid = p.domain_userid;

Finally, we group the results by the number of days between visits, to plot a frequency table:

/* Redshift / PostgreSQL */
SELECT
"Days between visits",
COUNT(*) as "Number of visits"
FROM (
	SELECT
	n.domain_userid,
	n.domain_sessionidx,
	EXTRACT(EPOCH FROM (n.time_first_touch - p.time_first_touch))/3600/24 as "days_between_visits",
	CASE
		WHEN n.domain_sessionidx = 1 THEN '0'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 1 THEN '1'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 2 THEN '2'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 3 THEN '3'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 4 THEN '4'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 5 THEN '5'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 10 THEN '6-10'
		WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 25 THEN '11-25'
		ELSE '25+' END as "Days between visits"
	FROM (
		SELECT
		domain_userid,
		domain_sessionidx,
		domain_sessionidx - 1 as "previous_domain_sessionidx",
		MIN(derived_tstamp) as "time_first_touch"
		FROM "atomic".events
		WHERE derived_tstamp > current_date - integer '31'
		GROUP BY 1,2
	) n
	LEFT JOIN (
		SELECT
		domain_userid,
		domain_sessionidx,
		MIN(derived_tstamp) as "time_first_touch"
		FROM "atomic".events
		GROUP BY 1,2
	) p ON n.previous_domain_sessionidx = p.domain_sessionidx
	AND n.domain_userid = p.domain_userid
) t
GROUP BY 1
ORDER BY 1;

14. Behavior: engagement

Google Analytics provides two sets of metrics to indicate engagement:

  1. Visit duration
  2. Page depth (i.e. number of pages visited per session)

Both of these are flakey and unsophisticated measures of engagement. Nevertheless, they are easy to report on in Snowplow. To plot visit duration, we execute the following query:

/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
CASE
	WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 1800 THEN 'g. 1801+ seconds'
	WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 600 THEN 'f. 601-1800 seconds'
	WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 180 THEN 'e. 181-600 seconds'
	WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 60 THEN 'd. 61 - 180 seconds'
	WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 30 THEN 'c. 31-60 seconds'
	WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 10 THEN 'b. 11-30 seconds'
	ELSE 'a. 0-10 seconds' END AS "Visit duration"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2;

Then we aggregate the results for each bucket, so we have frequency by bucket:

/* Redshift / PostgreSQL */
SELECT
"Visit duration",
COUNT(*) as "Number of visits"
FROM (
	SELECT
	domain_userid,
	domain_sessionidx,
	CASE
		WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 1800 THEN 'g. 1801+ seconds'
		WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 600 THEN 'f. 601-1800 seconds'
		WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 180 THEN 'e. 181-600 seconds'
		WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 60 THEN 'd. 61 - 180 seconds'
		WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 30 THEN 'c. 31-60 seconds'
		WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 10 THEN 'b. 11-30 seconds'
		ELSE 'a. 0-10 seconds' END AS "Visit duration"
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	GROUP BY 1,2
) t
GROUP BY 1
ORDER BY 1;

We can also look at the number of page views per visit:

/* Redshift / PostgreSLQ */
select
domain_userid,
domain_sessionidx,
count(*) as "Page views per visit"
from events
where derived_tstamp > current_date - integer '31'
and event = 'page_view'
group by domain_userid, domain_sessionidx;

We then aggregate those results together by the number of page views per visit

/* Redshift / PostgreSLQ */
SELECT
"Page views per visit",
COUNT(*) as "Number of visits"
FROM (
	SELECT
	domain_userid,
	domain_sessionidx,
	COUNT(*) as "Page views per visit"
	FROM "atomic".events
	WHERE derived_tstamp > current_date - integer '31'
	AND event = 'page_view'
	GROUP BY 1,2
) t
GROUP BY 1
ORDER BY 1;

15. Technology: browser

Browser details are stored in the events table in the br_name, br_family, br_version, br_type, br_renderingengine, br_features and br_cookies fields.

Looking at the distribution of visits by browser is straightforward:

/* Redshift / PostgreSQL */
SELECT
br_family as "Browser",
COUNT(DISTINCT(domain_userid || domain_sessionidx)) as "Visits"
FROM "atomic"events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2 DESC;

16. Technology: operating system

Operating system details are stored in the events table in the os_name, os_family and os_manufacturer fields.

Looking at the distribution of visits by operating system is straightforward:

/* Redshift / PostgreSQL */
CREATE VIEW basic_recipes.technology_os AS
SELECT
os_name as "Operating System",
COUNT(DISTINCT(domain_userid || domain_sessionidx)) as "Visits"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2 DESC;

17. Technology: mobile

To work out how the number of visits in a given time period splits between visitors on mobile and those not, simply execute the following query:

/* Redshift / PostgreSQL */
CREATE VIEW basic_recipes.technology_mobile AS
SELECT
CASE WHEN dvce_ismobile=1 THEN 'mobile' ELSE 'desktop' END AS "Device type",
COUNT(DISTINCT(domain_userid || domain_sessionidx)) as "Visits"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1;

Snowplow analytics doc required
Are there still data cube sql files available
#2

@yali Thanks this is superb.

  • In query 14. Behavior: engagement I can’t fin “dvce_tstamp” in atomic.events is that to be replces with “dvce_sent_tstamp” ?

Thanks
SS


#3

You’re right @sachinsingh10 - we replaced dvce_tstamp with dvce_created_tstamp (at the same time that we introduced the dvce_sent_tstamp. More details on our thinking can be found here.


#4

Hi Yali,

These are amazing, and we use these regularly for inspiration on modelling our data!

I have a quick question, I ran a query similar to your Recency to determine time to purchase. However, I found that doing a subtraction of one date from the other, as in your query lead to inaccurate results. I however did this again using a DATEDIFF function instead lead to the correct results.

Is there any reason why subtracting dates (using the derived_tstamp) gives different results to DATEDIFF?

Thanks again.


#5

Hi @jrpeck1989 - could it be that using subtraction treats the values as unix timestamps - so you get a result in seconds rather than the datepart specified in the datediff function?


#6

Hi Yali, thanks for the reply.

Since you divide the extracted unix timestamp by 3600 then by 24, the result should come out in days anyway? (to a number of decimal points anyway)

Despite this, it is still seemed Inaccurate. I will send you queries and my results.

Thanks


#7

@yali What would queries look like to get landing page and exit pages? I understand the idea is to get the earliest and latest page view timestamps, grouped by session ID, but can’t quite get he syntax correct!

Thanks


#8

Hi @jrpeck1989 - two potential approaches:

  1. Use the first_value window function to fetch the first URL (or marketing campaigns) associated wth a session. (So you’d partition the data by domain_sessionid.) This is effectively what is done in the web model where a window function ROW_NUMBER window function is used to rank page views within a session, so that is easy to pick out the first value and use it to identify the landing page.
  2. Identify landing pages based on external referrers i.e. any page view where the referrer is from a different domain.

Does that help?


#9

This is super useful thanks. It is going straight into my bookmarks folder!


#10

Thanks Yali.

Is there a corresponding technique for finding exit pages as well? Given the standard 30 minute session-ending timeout.


#11

Use the last_value window function to get the exit page!


#12

Thanks for this list Yali - very helpful starting point.

I had a question regarding 17. Technology: mobile: the field ‘dvce_ismobile’ is completely null for all events in our data. Assuming this is not expected, why would that be? I can see data for dvce_screenwidth, dvce_screenheight but none for dvce_ismobile and dvce_type.

Thanks!


#13

Have you setup the user-agent-utils enrichment?


#14

I have set up the us_parser enrichment. I can see the enriched data in the us_parser_context table.
Though I could use the some combination of data from here (eg. use iOS and Android values from os_family) to track mobiles, but I am not too sure how to make the most accurate query based on the data available.

Thanks!