I’m trying to compute page views from Snowplow data, but my numbers differ from GA by approximately 15%. Can someone help me find the possible reasons for such a large variation?
A couple of things you might check, based on our experience:
- Are you excluding bot traffic in GA but not Snowplow?
- Are you using Snowplow’s sql-runner pageview query? If so, make sure to aggregate (SUM) the page_view_count column in the snowplow_pivots.page_views table, since multiple pageviews from the same user on the same page in a single session are collapsed into a single row.
- Are you converting your timestamps to local time? Snowplow reports in UTC by default, while GA usually reports in the local time zone.
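As a rough illustration of the last two checks, here is a minimal Python sketch. The rows are made up, the timestamp and page fields are hypothetical (only page_view_count comes from the table mentioned above), and a fixed UTC offset stands in for a real time zone, so daylight saving is ignored.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical rows shaped like snowplow_pivots.page_views:
# (first UTC timestamp, page path, page_view_count). The data is invented.
rows = [
    ("2016-03-01 23:30:00", "/home", 2),
    ("2016-03-02 00:15:00", "/home", 1),
    ("2016-03-02 09:00:00", "/pricing", 3),
]

def daily_page_views(rows, utc_offset_hours=0):
    """Sum page_view_count per calendar day in a fixed-offset time zone."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    totals = defaultdict(int)
    for first_seen_utc, _page, page_view_count in rows:
        utc_time = datetime.strptime(
            first_seen_utc, "%Y-%m-%d %H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        local_day = utc_time.astimezone(local_tz).date().isoformat()
        # Add the count, not 1: one row can stand for several pageviews.
        totals[local_day] += page_view_count
    return dict(totals)

row_count = len(rows)                                        # counting rows undercounts
utc_totals = daily_page_views(rows)                          # days bucketed in UTC
local_totals = daily_page_views(rows, utc_offset_hours=-5)   # days bucketed at UTC-5

print(row_count, utc_totals, local_totals)
```

Counting rows gives 3 while summing page_view_count gives 6, and the per-day split changes once the timestamps are shifted, which is why both checks matter when comparing daily figures.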
To answer your questions: I am excluding bot traffic in both. I tried converting timestamps to local time, but that didn’t make any significant difference. My Snowplow numbers are 10-15% higher than GA’s.
Do you run de-duplication on your events table?
First of all - welcome, and thanks for using Snowplow!
I would indeed start by making sure all bot traffic is excluded, as @travisdevitt mentioned. You are already doing this, but for future reference, here’s a tutorial on how to do it in SQL: "Excluding bots from queries in Redshift".
I’d then compare the number of page views for the most popular pages. This will tell you whether the difference is about the same for all pages, or whether there’s something wrong with the instrumentation on certain pages.
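A minimal Python sketch of that per-page comparison, assuming you have already exported page view counts per page from both tools for the same date range. The function name and all numbers are hypothetical.

```python
def compare_pages(snowplow_counts, ga_counts):
    """Return (page, snowplow, ga, pct_diff) tuples, most popular GA pages first.

    pct_diff is the Snowplow count relative to GA, in percent,
    or None when GA recorded no views for that page.
    """
    pages = set(snowplow_counts) | set(ga_counts)
    result = []
    for page in sorted(pages, key=lambda p: ga_counts.get(p, 0), reverse=True):
        sp = snowplow_counts.get(page, 0)
        ga = ga_counts.get(page, 0)
        pct_diff = None if ga == 0 else round(100.0 * (sp - ga) / ga, 1)
        result.append((page, sp, ga, pct_diff))
    return result

# Invented example counts for three pages.
snowplow_counts = {"/home": 1100, "/pricing": 560, "/blog": 300}
ga_counts = {"/home": 1000, "/pricing": 500, "/blog": 200}

comparison = compare_pages(snowplow_counts, ga_counts)
for page, sp, ga, pct_diff in comparison:
    print(page, sp, ga, pct_diff)
```

If most pages sit around the same percentage and one page is far off (here the made-up /blog row), that page is the first place to check the instrumentation.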
That said, we still expect a difference between the two platforms, because fewer ad blockers target Snowplow. Ad blockers are becoming more and more common and can, in some cases, cause a measurable difference in the number of page views.
I hope this helps.
So, I followed your instructions for excluding bot traffic, and the difference came down to 7-8%. I also saw duplicate event_id values in my table, so I excluded those as well. Now the difference is around 3%, which seems pretty close.
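For anyone following along, the de-duplication step can be sketched in Python like this. It is only an illustration: the event dictionaries are invented, and keeping the first row per event_id is an assumption, not necessarily the right rule for every dataset.

```python
def dedupe_by_event_id(events):
    """Keep the first event seen for each event_id, preserving input order."""
    seen = set()
    unique = []
    for event in events:
        event_id = event["event_id"]
        if event_id in seen:
            continue  # a row with this event_id was already kept
        seen.add(event_id)
        unique.append(event)
    return unique

# Invented sample: the second row repeats the first row's event_id.
events = [
    {"event_id": "a1", "page": "/home"},
    {"event_id": "a1", "page": "/home"},
    {"event_id": "b2", "page": "/pricing"},
]

unique_events = dedupe_by_event_id(events)
print(len(events), len(unique_events))
```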
GA is the most popular web analytics platform, and there are plenty of Chrome plugins that block GA traffic.
My suggestion to get more clues is the following:
Create two unfiltered GA properties, with bots included.
A) Track pageviews server side to GA property 1 using the Measurement Protocol.
B) Track pageviews server side with a Snowplow tracker, with app_id='ss'.
C) Track pageviews client side (JS) to GA property 2.
D) Track pageviews client side (JS) to Snowplow, with app_id='cs'.
A) should match B) very closely. Treat these as the true number of pageviews.
Is C) or D) closer to the true number of pageviews?
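Once the four counts are in, the comparison could be worked out roughly as in this Python sketch. The counts are placeholders and not real results, and the 2% tolerance for "A matches B" is an arbitrary assumption.

```python
def capture_rates(counts, tolerance=0.02):
    """Compare client-side counts (C, D) against the server-side baseline (A, B)."""
    baseline = (counts["A"] + counts["B"]) / 2
    return {
        # True when the two server-side counts agree within the tolerance.
        "server_side_agree": abs(counts["A"] - counts["B"]) <= tolerance * baseline,
        # Share of the baseline pageviews seen by each client-side tracker.
        "ga_client_rate": counts["C"] / baseline,
        "snowplow_client_rate": counts["D"] / baseline,
    }

# Placeholder numbers only, chosen to show the calculation.
counts = {"A": 10000, "B": 10050, "C": 8600, "D": 9700}

rates = capture_rates(counts)
for name, value in rates.items():
    print(name, value)
```

Whichever client-side rate is closer to 1.0 is the tracker losing fewer pageviews relative to the server-side baseline.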
This should give you more clues. Let me know how it goes; I’m curious.