Reconciling Snowplow and Google Analytics session numbers

Most Snowplow users aggregate events into sessions using domain_sessionidx, which is generated by our Javascript tracker and increments when a user has been inactive for 30 minutes. When comparing with Google Analytics, Snowplow users might find that the two tools report slightly different session numbers. This is because Google Analytics also expires sessions at midnight and when a user changes campaigns (i.e. arrives via one campaign, leaves, and then comes back via a different campaign).

To reconcile these numbers, it’s possible to create a new session index in SQL which increments when a user has not been active for 30 minutes or when a user leaves and comes back via a different referrer.

WITH step_1 AS (

  SELECT

    domain_userid,
    domain_sessionidx, -- to compare with the custom session index

    collector_tstamp,
    dvce_tstamp,

    LAG(dvce_tstamp, 1) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp) AS previous_dvce_tstamp,
    NVL(page_referrer, '') || NVL(mkt_medium, '') || NVL(mkt_source, '') || NVL(mkt_term, '') || NVL(mkt_content, '') || NVL(mkt_campaign, '') AS referrer,

    refr_source,
    refr_medium,
    refr_term,

    mkt_source,
    mkt_medium,
    mkt_term,
    mkt_content,
    mkt_campaign

  FROM atomic.events
  WHERE collector_tstamp::date = 'YYYY-MM-DD' -- restrict the dataset
  ORDER BY domain_userid, dvce_tstamp

), step_2 AS (

  SELECT

    *, SUM(CASE WHEN refr_medium = 'internal' OR referrer IS NULL OR referrer = '' THEN 0 ELSE 1 END) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_referrer

  FROM step_1
  ORDER BY domain_userid, dvce_tstamp

), step_3 AS (

  SELECT

    *, FIRST_VALUE(referrer) OVER (PARTITION BY domain_userid, new_referrer ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS referrer_partition

  FROM step_2
  ORDER BY domain_userid, dvce_tstamp

), step_4 AS (

  SELECT

    *, LAG(referrer_partition, 1) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp) AS previous_referrer_partition

  FROM step_3
  ORDER BY domain_userid, dvce_tstamp

), step_5 AS (

  SELECT

    *,
    CASE
      WHEN ((EXTRACT(EPOCH FROM (dvce_tstamp - previous_dvce_tstamp)) < 60*30) AND (referrer_partition = previous_referrer_partition OR (referrer_partition IS NULL AND previous_referrer_partition IS NULL))) THEN 0
      ELSE 1
    END AS new_session

  FROM step_4
  ORDER BY domain_userid, dvce_tstamp

), step_6 AS (

  SELECT

    *, SUM(new_session) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_sessionidx

  FROM step_5
  ORDER BY domain_userid, dvce_tstamp

)

SELECT
  domain_userid,
  new_sessionidx
FROM step_6
GROUP BY 1,2
ORDER BY 1,2

The SQL is explained in more detail in our documentation on server-side sessionization in SQL. The main additions are steps 2 to 4, which partition events on referrer. This partition is used to identify events where the user arrived via a different external referrer.
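To illustrate the counter in step 2, here is a toy example with invented inline sample data (the VALUES-in-a-CTE syntax assumes a Postgres-style engine; the column values are hypothetical):

```sql
WITH sample (domain_userid, dvce_tstamp, refr_medium, referrer) AS (
  VALUES
    ('user-1', TIMESTAMP '2017-01-01 10:00:00', 'search',   'google'),
    ('user-1', TIMESTAMP '2017-01-01 10:05:00', 'internal', 'google'),
    ('user-1', TIMESTAMP '2017-01-01 11:00:00', 'social',   'facebook')
)
SELECT
  *,
  SUM(CASE WHEN refr_medium = 'internal' OR referrer IS NULL OR referrer = '' THEN 0 ELSE 1 END)
    OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_referrer
FROM sample
-- new_referrer: 1, 1, 2 — the internal page view stays in the partition
-- opened by the google referral; facebook opens a new partition
```

Steps 3 and 4 then label each of those partitions with its first referrer, so step 5 can compare an event's referrer partition with the previous one.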

Differences in allocation to source/medium

The above query should return session numbers that are closer to what is reported by Google Analytics. However, it’s also possible that the number of sessions per source/medium differs between Snowplow and Google Analytics, even when the total matches. This is because Google Analytics uses data from past sessions when no referrer information is available. The following SQL query replicates this logic:

-- steps 1 to 6 as defined in the previous query
WITH step_7 AS (

  SELECT
    domain_userid,
    new_sessionidx,

    collector_tstamp,
    dvce_tstamp,

    refr_source,
    refr_medium,
    mkt_source,
    mkt_medium

  FROM step_6
  WHERE new_session = 1
  ORDER BY domain_userid, new_sessionidx

), step_8 AS (

  SELECT

    *, SUM(CASE WHEN mkt_source IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY domain_userid ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mkt_partition

  FROM step_7
  ORDER BY domain_userid, new_sessionidx

), step_9 AS (

  SELECT
    *,
    FIRST_VALUE(dvce_tstamp) OVER (PARTITION BY domain_userid, mkt_partition ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partition_dvce_tstamp,
    FIRST_VALUE(mkt_source) OVER (PARTITION BY domain_userid, mkt_partition ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partition_mkt_source,
    FIRST_VALUE(mkt_medium) OVER (PARTITION BY domain_userid, mkt_partition ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partition_mkt_medium
  FROM step_8
  ORDER BY domain_userid, new_sessionidx

), step_10 AS (

  SELECT

    *,
    CASE
      WHEN mkt_source IS NOT NULL THEN 'mkt-' || mkt_source
      WHEN refr_source IS NOT NULL THEN 'refr-' || refr_source
      WHEN partition_mkt_source IS NOT NULL AND (EXTRACT(EPOCH FROM (dvce_tstamp - partition_dvce_tstamp)) < 60*60*24*183) THEN 'mkt-' || partition_mkt_source
      ELSE 'direct'
    END AS ga_source,
    CASE
      WHEN mkt_source IS NOT NULL THEN 'mkt-' || mkt_medium
      WHEN refr_source IS NOT NULL THEN 'refr-' || refr_medium
      WHEN partition_mkt_source IS NOT NULL AND (EXTRACT(EPOCH FROM (dvce_tstamp - partition_dvce_tstamp)) < 60*60*24*183) THEN 'mkt-' || partition_mkt_medium
      ELSE 'direct'
    END AS ga_medium

  FROM step_9
  ORDER BY domain_userid, new_sessionidx

)

-- count source:

SELECT
  ga_source,
  count(*)
FROM step_10
GROUP BY 1
ORDER BY 2 DESC

-- count medium:

SELECT
  ga_medium,
  count(*)
FROM step_10
GROUP BY 1
ORDER BY 2 DESC

Steps 8 and 9 create a new partition each time mkt_source is set. When both mkt_source and refr_source are NULL, previous campaign data is used if it is within the timeout period.


It is also useful to remember that Google will filter out bot traffic, which Snowplow does not.

Are there standard queries/filters to ignore recognised bot traffic?

Thanks
JP

Hi @jrpeck1989,

Many bots don’t execute Javascript, so no events will be generated. Some do self-identify as bots; you can exclude those by filtering on br_type IN ('Browser', 'Browser (mobile)'). We don’t have queries to identify other bots, but we always welcome contributions.
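A minimal sketch of that filter when counting sessions (column names per the standard atomic.events schema):

```sql
SELECT
  COUNT(DISTINCT domain_userid || '-' || domain_sessionidx::varchar) AS sessions
FROM atomic.events
WHERE br_type IN ('Browser', 'Browser (mobile)') -- keep only self-identified browsers
```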


I believe that the primary way Adobe recognise bot traffic is via the IAB / ABC International Spiders and Bots list. If a Snowplow user were to buy this list (I think it costs a few thousand $$$s) it should be straightforward to use it to filter traffic and compare it to the libraries we use. We’d certainly be interested in writing an enrichment to flag traffic from useragents on the list as identified as bots, but would need access to the list in order to build it.

Hi @christophe, first of all, thanks for a great post!

We have implemented Snowplow for our clickstream tracking and are able to match the overall number of sessions between Snowplow and GA. However, we ran into some trouble while trying to split the overall session into Direct, organic etc.

This is a great post that shows how GA handles sessions and how the source/medium is calculated. I used the same query from this post on our data; however, I see that the number of direct sessions is way too high. The numbers in the other source/medium buckets seem to be in the same ballpark (save a few), but the direct traffic sessions are the most concerning.

Here are the numbers in Snowplow and GA that we’re seeing for Jan 1st 2017:

GA
(direct)/(none): 9,628
google / organic: 10,910
facebook_ad / facebook_cpc: 7,986
criteo / criteo_cpc: 3,259
google / cpc: 423
m.facebook.com / referral: 308
bing / organic: 177
ghanabuysell.com / referral: 133
(and a bunch of others that are very small)

Snowplow (ga_source)
direct: 469,062
mkt-facebook_ad: 27,251
refr-Google: 10,626
mkt-criteo: 10,121
refr-Facebook: 453
refr-Bing: 181
(and a bunch of others that are very small)

Snowplow (ga_medium)
direct: 469,062
mkt-facebook_cpc: 27,251
refr-search: 10,938
mkt-criteo_cpc: 10,121
refr-social: 453
(and a bunch of others that are very small)

As you can see, the number of direct sessions that we are seeing from Snowplow is many times larger than the total number of sessions that we have.

Any ideas on why this might be happening will be extremely helpful and appreciated! Thanks in advance!

Best,
Anand

We are still having a very similar issue - the overall session counts do not match. There are significant differences across channels, but we expect that given the differences in rules between GA and our logic. However, the overall session count is at least 20% higher.

@christophe @yali Any help would be highly appreciated! Thanks in advance

Hmm, is it just a count(*) on the events table? Maybe it’s a double count.
I would prefer to use

count(distinct(domain_userid || domain_sessionidx))

to get the number of unique sessions. A difference between Snowplow and GA is normal; with Snowplow we measure 10-25% more pageviews etc. than GA.

Hi @arbhat85 and @moekiss,

Could the issue be that in GA, if a session is a direct session, GA will look at the previous session (and then the one before that) until it finds a marketing channel to attribute the session to? GA will look back up to 6 months, I believe. There’s a good post on this here:

https://www.e-nor.com/blog/google-analytics/direct-visits-google-analytics-attribution-precedence

Hi Yali, that’s a great article to explain how Direct works in GA. Are you saying that Snowplow doesn’t treat Direct by default in the same way?

GA also seems to differentiate between traffic report and attribution reports. In the attribution reports it does split out Direct…

Is there a similar article for Snowplow?

There is no built-in logic in the Snowplow protocol to determine that if a visit has no referral information (AKA Direct), it should look back across previous sessions until reliable referral info is available and then apply it to the latest session - a la Google Analytics.

This can be recreated in SQL (or any other data modelling process) should you so wish. I believe there is an article Christophe wrote on how to retroactively do this with Snowplow data - I can’t find it right now…


Thanks Jr, if anyone has the article at hand, it’ll be great to read.

On a related note, how easy is it to replicate the GA model and create a new session each time the user’s channel changes?

I think Christophe’s first post in this thread tackles that exact thing

Yep, that’s indeed the post.

@yali Another reason for tremendous direct traffic could be that Google Analytics itself detects referrals from other websites and marks their source/medium as url_host/referral, while Snowplow won’t do that by itself. Then, upon mkt/refr source/medium distribution, all the referrals could fall into direct traffic. Am I correct?

@Hasan_Shaukat I don’t think that’s quite accurate.

The Snowplow javascript tracker will provide all available referral information to each page view, including the url from which the page view was referred. The Snowplow enrichment process will do additional attribution of the referrer, according to the configuration of the relevant enrichment, which also allows you to define what constitutes an internal domain.

The difference is that GA makes assumptions for you and attributes according to those assumptions out of the box, whereas Snowplow provides all of the relevant underlying data, and leaves it to you to define your assumptions and attribution logic (which is the topic that the original post in this thread walks through - albeit it’s quite dated by now).

If you write some SQL for your Snowplow data whose logic is to:

  • take the referrer source/medium of the first page view of the session,
  • if not found look to previous sessions,
  • continue this process for a 6 month lookback,

then you’ll likely find that GA and Snowplow report very similar patterns (leaving room for the fact that we don’t necessarily know exactly what the under the hood logic is for GA). But of course there is room to define the logic according to whatever your business decision is on the definition of referral (which is the stronger use case for Snowplow over trying to replicate GA).
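A minimal sketch of that lookback, assuming a hypothetical sessions table (domain_userid, session_start_tstamp, mkt_source - one row per session, with mkt_source NULL for direct sessions) and an engine that supports LAST_VALUE ... IGNORE NULLS, such as Redshift; the 6-month cap is omitted for brevity:

```sql
SELECT
  domain_userid,
  session_start_tstamp,
  COALESCE(
    mkt_source,
    -- most recent non-direct source from earlier sessions:
    LAST_VALUE(mkt_source IGNORE NULLS) OVER (
      PARTITION BY domain_userid
      ORDER BY session_start_tstamp
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
    'direct'
  ) AS attributed_source
FROM sessions
```

To enforce the lookback window, you would additionally compare session_start_tstamp against the timestamp of the session that last set mkt_source, as steps 8 and 9 in the original post do with partition_dvce_tstamp.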

Hello everyone!

just wanted to know if it is correct to use the derived_tstamp as the datetime field in the queries here, or if we should keep using the dvce_tstamp (now dvce_created_tstamp). Mostly the two are of course the same, but reading https://discourse.snowplow.io/t/which-timestamp-is-the-best-to-see-when-an-event-occurred/538 it seems that the derived_tstamp should be the preferred choice.

Thanks

It depends on what you’re doing.

dvce_created_tstamp is the clock on the device - it might be set to the wrong time, it might be in a different timezone, etc. (the timezone might be available elsewhere in the data though!).

collector_tstamp is the server timestamp in UTC. It’s the most reliable indicator of the actual time. derived_tstamp uses this along with both dvce_ tstamps to give you as accurate a reading as possible of when the event actually happened, reducing the scope for those client-side problems to skew results. (The post you linked describes how that’s calculated, so I won’t repeat it here.)

In general - if you’re debugging the data itself, trying to figure out the sequence of actions on the device, and not worrying about looking across different users/devices - then it’s fine to use dvce. For most analytics purposes, derived_ is best. For anything where the time the event was received is involved (including partitioning and limiting table scans), use collector_. Also join on collector_.
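For example, computing session lengths could combine the two: limit the table scan on collector_tstamp, measure on derived_tstamp (a sketch against the standard atomic.events columns; the date range is illustrative):

```sql
SELECT
  domain_userid,
  domain_sessionidx,
  EXTRACT(EPOCH FROM (MAX(derived_tstamp) - MIN(derived_tstamp))) AS session_length_seconds
FROM atomic.events
WHERE collector_tstamp::date BETWEEN '2017-01-01' AND '2017-01-31' -- prunes the table scan
GROUP BY 1, 2
```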

Hope that makes sense!

(Edit to add: I’ve just noticed the thread that you asked the Q on. If you’re asking specifically regarding comparing with GA data, I’m not sure which is best - I would try to figure out what the GA timestamps represent and use whatever’s closest).


Thanks for the super fast reply.

So, let’s say that I am using this recipe to create an intermediate table, which I will then query as if it were the events table itself. The purpose mainly involves analyses over time and across different devices/countries/and so on. But at times, of course, I need to partition. At the beginning I was using the collector_tstamp, but I stumbled on problems that could be solved only by using the derived_tstamp (like analyzing session lengths).

I’ll look into the GA timestamp anyways, thanks a lot.