Reconciling Snowplow and Google Analytics session numbers


#1

Most Snowplow users aggregate events into sessions using domain_sessionidx, which is generated by our JavaScript tracker and increments when a user has not been active for 30 minutes. When comparing with Google Analytics, Snowplow users might find that the two tools report slightly different session numbers. This is because Google Analytics also expires sessions at midnight and when a user changes campaigns (i.e. arrives via one campaign, leaves, and then comes back via a different campaign).
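The difference between the two definitions can be sketched in a few lines of Python (an illustration only, not Snowplow or GA code; the event tuples and rules are simplified):

```python
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """Assign GA-style session indexes to a single user's events.

    events: list of (timestamp, campaign) tuples, sorted by timestamp.
    A new session starts after 30 minutes of inactivity, at midnight,
    or when the user comes back via a different campaign.
    Returns a list of 1-based session indexes, one per event.
    """
    indexes = []
    session_idx = 0
    prev_ts, prev_campaign = None, None
    for ts, campaign in events:
        new_session = (
            prev_ts is None
            or ts - prev_ts >= timeout                   # 30 minutes of inactivity
            or ts.date() != prev_ts.date()               # session expires at midnight
            or (campaign and campaign != prev_campaign)  # campaign changed
        )
        if new_session:
            session_idx += 1
        indexes.append(session_idx)
        prev_ts = ts
        # direct visits (no campaign) inherit the current campaign
        prev_campaign = campaign or prev_campaign
    return indexes
```

Dropping the midnight and campaign rules leaves the plain 30-minute timeout that domain_sessionidx implements, which is where the counts diverge.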

To reconcile these numbers, it’s possible to create a new session index in SQL which increments when a user has not been active for 30 minutes or when a user leaves and comes back via a different referrer.

WITH step_1 AS (

  SELECT

    domain_userid,
    domain_sessionidx, -- to compare with the custom session index

    collector_tstamp,
    dvce_tstamp,

    LAG(dvce_tstamp, 1) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp) AS previous_dvce_tstamp,
    NVL(page_referrer, '') || NVL(mkt_medium, '') || NVL(mkt_source, '') || NVL(mkt_term, '') || NVL(mkt_content, '') || NVL(mkt_campaign, '') AS referrer,

    refr_source,
    refr_medium,
    refr_term,

    mkt_source,
    mkt_medium,
    mkt_term,
    mkt_content,
    mkt_campaign

  FROM atomic.events
  WHERE collector_tstamp::date = 'YYYY-MM-DD' -- restrict the dataset
  ORDER BY domain_userid, dvce_tstamp

), step_2 AS (

  SELECT

    *, SUM(CASE WHEN refr_medium = 'internal' OR referrer IS NULL OR referrer = '' THEN 0 ELSE 1 END) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_referrer

  FROM step_1
  ORDER BY domain_userid, dvce_tstamp

), step_3 AS (

  SELECT

    *, FIRST_VALUE(referrer) OVER (PARTITION BY domain_userid, new_referrer ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS referrer_partition

  FROM step_2
  ORDER BY domain_userid, dvce_tstamp

), step_4 AS (

  SELECT

    *, LAG(referrer_partition, 1) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp) AS previous_referrer_partition

  FROM step_3
  ORDER BY domain_userid, dvce_tstamp

), step_5 AS (

  SELECT

    *,
    CASE
      WHEN ((EXTRACT(EPOCH FROM (dvce_tstamp - previous_dvce_tstamp)) < 60*30) AND (referrer_partition = previous_referrer_partition OR (referrer_partition IS NULL AND previous_referrer_partition IS NULL))) THEN 0
      ELSE 1
    END AS new_session

  FROM step_4
  ORDER BY domain_userid, dvce_tstamp

), step_6 AS (

  SELECT

    *, SUM(new_session) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_sessionidx

  FROM step_5
  ORDER BY domain_userid, dvce_tstamp

)

SELECT
  domain_userid,
  new_sessionidx
FROM step_6
GROUP BY 1,2
ORDER BY 1,2

The SQL is explained in more detail in our documentation on server-side sessionization in SQL. The main additions are steps 2 to 4, which partition events on referrer. This is used to detect events where the user arrived via a different external referrer.
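The partitioning in steps 2 to 4 can be sketched in plain Python (an illustration of the window-function logic, not the SQL itself): a running count of events that arrive with an external referrer assigns each event to a partition, and each partition inherits the referrer of its first event.

```python
def referrer_partitions(events):
    """events: list of (referrer, is_internal) tuples for one user,
    in time order. Returns one (partition_id, partition_referrer)
    tuple per event, mirroring new_referrer and referrer_partition
    in steps 2 and 3 of the SQL.
    """
    out = []
    partition = 0
    first_referrer = {}
    for referrer, is_internal in events:
        # an external, non-empty referrer starts a new partition
        # (the running SUM in step 2)
        if referrer and not is_internal:
            partition += 1
        # the first referrer seen in a partition labels the whole
        # partition (the FIRST_VALUE in step 3)
        first_referrer.setdefault(partition, referrer)
        out.append((partition, first_referrer[partition]))
    return out
```

Step 5 then starts a new session whenever the partition label changes between consecutive events, in addition to the 30-minute timeout.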

Differences in allocation to source/medium

The above query should return session numbers that are closer to what is reported by Google Analytics. However, it’s also possible that the number of sessions per source/medium differs between Snowplow and Google Analytics, even when the totals match. This is because Google Analytics uses data from past sessions when no referrer information is available. The following SQL query replicates this logic:

WITH step_7 AS (

  SELECT
    domain_userid,
    new_sessionidx,

    collector_tstamp,
    dvce_tstamp,

    refr_source,
    refr_medium,
    mkt_source,
    mkt_medium

  FROM step_6
  WHERE new_session = 1
  ORDER BY domain_userid, new_sessionidx

), step_8 AS (

  SELECT

    *, SUM(CASE WHEN mkt_source IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY domain_userid ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS mkt_partition

  FROM step_7
  ORDER BY domain_userid, new_sessionidx

), step_9 AS (

  SELECT
    *,
    FIRST_VALUE(dvce_tstamp) OVER (PARTITION BY domain_userid, mkt_partition ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partition_dvce_tstamp,
    FIRST_VALUE(mkt_source) OVER (PARTITION BY domain_userid, mkt_partition ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partition_mkt_source,
    FIRST_VALUE(mkt_medium) OVER (PARTITION BY domain_userid, mkt_partition ORDER BY new_sessionidx ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partition_mkt_medium
  FROM step_8
  ORDER BY domain_userid, new_sessionidx

), step_10 AS (

  SELECT

    *,
    CASE
      WHEN mkt_source IS NOT NULL THEN 'mkt-' || mkt_source
      WHEN refr_source IS NOT NULL THEN 'refr-' || refr_source
      WHEN partition_mkt_source IS NOT NULL AND (EXTRACT(EPOCH FROM (dvce_tstamp - partition_dvce_tstamp)) < 60*60*24*183) THEN 'mkt-' || partition_mkt_source
      ELSE 'direct'
    END AS ga_source,
    CASE
      WHEN mkt_source IS NOT NULL THEN 'mkt-' || mkt_medium
      WHEN refr_source IS NOT NULL THEN 'refr-' || refr_medium
      WHEN partition_mkt_source IS NOT NULL AND (EXTRACT(EPOCH FROM (dvce_tstamp - partition_dvce_tstamp)) < 60*60*24*183) THEN 'mkt-' || partition_mkt_medium
      ELSE 'direct'
    END AS ga_medium

  FROM step_9
  ORDER BY domain_userid, new_sessionidx

)

-- count source:

SELECT
  ga_source,
  count(*)
FROM step_10
GROUP BY 1
ORDER BY 2 DESC

-- count medium:

SELECT
  ga_medium,
  count(*)
FROM step_10
GROUP BY 1
ORDER BY 2 DESC

Steps 8 and 9 create a new partition each time mkt_source is set. When both mkt_source and refr_source are NULL, previous campaign data is used if it is within the timeout period.
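The fallback in step 10 can be sketched in Python (a hypothetical helper, mirroring the CASE expressions; the 183-day window approximates GA’s 6-month campaign timeout):

```python
from datetime import datetime, timedelta

def attribute_sessions(sessions, lookback=timedelta(days=183)):
    """sessions: list of (timestamp, mkt_source, refr_source) tuples,
    one per session for a single user, in time order.
    Returns the attributed source for each session, following the
    precedence in step 10: campaign, then referrer, then the most
    recent campaign within the lookback window, then direct.
    """
    attributed = []
    last_mkt, last_mkt_ts = None, None
    for ts, mkt_source, refr_source in sessions:
        if mkt_source is not None:
            source = "mkt-" + mkt_source
            last_mkt, last_mkt_ts = mkt_source, ts  # reset the lookback clock
        elif refr_source is not None:
            source = "refr-" + refr_source
        elif last_mkt is not None and ts - last_mkt_ts < lookback:
            source = "mkt-" + last_mkt              # inherited campaign
        else:
            source = "direct"
        attributed.append(source)
    return attributed
```

Note that, as in the SQL, a referral session does not reset the lookback clock; only a session with campaign parameters does.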


Snowplow and GA are reporting different visitor numbers
#2

It is also useful to remember that Google Analytics filters out bot traffic, which Snowplow does not.


#3

Are there standard queries/filters to ignore recognised bot traffic?

Thanks
JP


#4

Hi @jrpeck1989,

Many bots don’t execute JavaScript, so no events will be generated. Some do self-identify as bots; you can exclude those by filtering on br_type IN ('Browser', 'Browser (mobile)'). We don’t have queries to identify other bots, but we always welcome contributions.


#5

I believe that the primary way Adobe recognises bot traffic is via the IAB/ABC International Spiders and Bots List. If a Snowplow user were to buy this list (I think it costs a few thousand $$$s), it should be straightforward to use it to filter traffic and compare it to the libraries we use. We’d certainly be interested in writing an enrichment that flags traffic from user agents on the list as bots, but would need access to the list in order to build it.


#6

Hi @christophe, first of all, thanks for a great post!

We have implemented Snowplow for our clickstream tracking and are able to match the overall number of sessions between Snowplow and GA. However, we ran into some trouble while trying to split the overall sessions into direct, organic, etc.

This is a great post that shows how GA handles sessions and how the source/medium is calculated. I used the same query from this post on our data; however, I see that the number of direct sessions is way too high. The numbers in the other source/medium buckets seem to be in the same ballpark (save a few), but the direct traffic sessions are the most concerning.

Here are the numbers in Snowplow and GA that we’re seeing for Jan 1st 2017:

GA
(direct)/(none): 9,628
google / organic: 10,910
facebook_ad / facebook_cpc: 7,986
criteo / criteo_cpc: 3,259
google / cpc: 423
m.facebook.com / referral: 308
bing / organic: 177
ghanabuysell.com / referral: 133
(and a bunch of others that are very small)

Snowplow (ga_source)
direct: 469,062
mkt-facebook_ad: 27,251
refr-Google: 10,626
mkt-criteo: 10,121
refr-Facebook: 453
refr-Bing: 181
(and a bunch of others that are very small)

Snowplow (ga_medium)
direct: 469,062
mkt-facebook_cpc: 27,251
refr-search: 10,938
mkt-criteo_cpc: 10,121
refr-social: 453
(and a bunch of others that are very small)

As you can see, the number of direct sessions that we are seeing from Snowplow is many times larger than the total number of sessions that we have.

Any ideas on why this might be happening will be extremely helpful and appreciated! Thanks in advance!

Best,
Anand


#7

We are still having a very similar issue - overall session counts do not match. There are significant differences across channels, but we expect that given the differences in rules between GA and our logic. However, the overall session count is at least 20% higher.


#8

@christophe @yali Any help would be highly appreciated! Thanks in advance


#9

Hmm, is it just a count(*) on the events table? Maybe it’s a double count. I would prefer to use

count(distinct(domain_userid || domain_sessionidx))

to get the number of unique sessions. A difference between Snowplow and GA is normal; with Snowplow we measure 10-25% more pageviews etc. than GA.
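The difference between the two counts can be seen on a toy example (illustrative Python; a separator is added to the session key so that keys stay unambiguous, e.g. user "1" / session 11 vs user "11" / session 1):

```python
# Each event carries the user's id and their session index;
# several events belong to the same session.
events = [
    ("user_a", 1), ("user_a", 1), ("user_a", 2),  # two sessions, three events
    ("user_b", 1),                                # one session, one event
]

# count(*) counts events, not sessions
total_events = len(events)

# count(distinct user || '-' || sessionidx) counts unique sessions
unique_sessions = len({f"{uid}-{idx}" for uid, idx in events})
```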


#10

Hi @arbhat85 and @moekiss,

Could the issue be that, in GA, if a session is a direct session, GA will look at the previous session (and then the one before that) until it finds a marketing channel to attribute the session to? GA looks back up to 6 months, I believe. There’s a good post on this here:

https://www.e-nor.com/blog/google-analytics/direct-visits-google-analytics-attribution-precedence


#11

Hi Yali, that’s a great article explaining how Direct works in GA. Are you saying that Snowplow doesn’t treat Direct the same way by default?

GA also seems to differentiate between traffic reports and attribution reports. In the attribution reports it does split out Direct…

Is there a similar article for Snowplow?


#12

There is no built-in logic in the Snowplow protocol that, when a visit has no referral information (AKA direct), looks back across previous sessions until reliable referral information is available and applies it to the latest session - à la Google Analytics.

This can be recreated in SQL (or any other data modelling process) should you so wish. I believe there is an article Christophe wrote on how to do this retroactively with Snowplow data - I can’t find it right now…


#13

Thanks Jr, if anyone has the article at hand, it’ll be great to read.

On a related note, how easy is it to replicate the GA model and create a new session each time the user’s channel changes?


#14

I think Christophe’s first post in this thread tackles that exact thing


#15

Yep, that’s indeed the post.