Snowplow and GA are reporting different visitor numbers

Compared to GA, my Snowplow number for unique visitors differs by 20-25% or more. I am counting distinct domain_userid and excluding bot traffic.
What could be the reasons behind this? Please let me know if I am doing something wrong.

Hi,

I’m also counting by distinct domain_userid, and I see only a 7% (+) difference.
What filters are you using? Are you counting only by actions that are being reported to GA?
My current filters are:

page_urlhost = '{THE SAME AS IN GA}'
and useragent not like '%bot%'
and useragent not like '%Nexus 5 Build/JOP40D%googleweblight%'
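
For reference, the filters above could be combined into a single unique-visitor count along these lines. This is only a sketch: it runs against an in-memory SQLite table with made-up rows rather than a real Snowplow events table, and the host name is a placeholder. Note that SQLite's LIKE is case-insensitive for ASCII while Redshift's LIKE is case-sensitive, so a pattern such as '%bot%' may match fewer user agents on Redshift.

```python
import sqlite3

# In-memory stand-in for the events table; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (domain_userid TEXT, page_urlhost TEXT, useragent TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "www.example.com", "Mozilla/5.0 (Windows NT 10.0)"),
        ("u1", "www.example.com", "Mozilla/5.0 (Windows NT 10.0)"),
        ("u2", "www.example.com", "Mozilla/5.0 (compatible; Googlebot/2.1)"),
        ("u3", "other.example.com", "Mozilla/5.0 (Macintosh)"),
        ("u4", "www.example.com",
         "Mozilla/5.0 (Linux; Android 4.2.1; Nexus 5 Build/JOP40D) Mobile googleweblight"),
    ],
)

# Same filters as in the post, applied to a distinct count of domain_userid.
QUERY = """
SELECT COUNT(DISTINCT domain_userid)
FROM events
WHERE page_urlhost = ?
  AND useragent NOT LIKE '%bot%'
  AND useragent NOT LIKE '%Nexus 5 Build/JOP40D%googleweblight%'
"""

# "www.example.com" is a placeholder for the host configured in GA.
unique_visitors = conn.execute(QUERY, ("www.example.com",)).fetchone()[0]
print(unique_visitors)
```

Only u1 survives the filters here: u2 is dropped as a bot, u3 is on another host, and u4 matches the googleweblight pattern.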

Have you guys seen this link? Lots of good pointers for reconciling with GA (it is specifically about session numbers).

Also - check the timezone for the data you are pulling in. Ours was off initially, and after correcting, got a lot closer to matching.

http://discourse.snowplow.io/t/reconciling-snowplow-and-google-analytics-session-numbers/80
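
On the timezone point: a small sketch of why it matters. It assumes the Snowplow timestamps are in UTC and that the GA view reports in a different timezone; the UTC+5:30 offset is purely a hypothetical example, so substitute the timezone configured in your GA view.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical GA reporting timezone (UTC+5:30); replace with your own view's setting.
REPORTING_TZ = timezone(timedelta(hours=5, minutes=30))


def reporting_date(ts_utc: datetime) -> date:
    """Return the calendar date an event falls on in the reporting timezone."""
    return ts_utc.astimezone(REPORTING_TZ).date()


# An event late on 31 March in UTC is already 1 April in the reporting timezone,
# so it lands in a different month depending on which clock is used for the cut-off.
event = datetime(2016, 3, 31, 23, 30, tzinfo=timezone.utc)
print(event.date(), reporting_date(event))
```

If the two tools bucket events on different clocks, visitors near the period boundaries end up in different reporting periods, which shifts the counts.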

Just thought of another thing - depending on the tracker - do you have ‘respect do not track’ enabled? I’m not sure on the numbers, but if GA respects it and Snowplow does not, that could also lead to a difference.

Thank you @Gal @13scoobie for your suggestions. I tried these, but the numbers are still not improving.

Failing the above it might be a problem with the cookies.

Does your site spread across multiple domains or subdomains? Can you share the tracking code with us and the rough domains setup?

When comparing Snowplow numbers against GA, we recommend starting by looking at page views by page URL. There is very little business logic associated with recording a page view, so these numbers should typically agree very closely, with Snowplow reporting higher numbers because we don’t remove bots from the list. (Note that GA will remove many more bots than are identified by the user agent parsing libraries that are available with Snowplow. The size of the discrepancy then reflects how much bot traffic your site attracts - we see it varying between 3 and 15% for e.g. jobs boards and other sites that attract crawlers.)

Often carrying out the above step throws up differences in tracking implementation between GA and Snowplow. (E.g. pages that are missed with one and not the other.)

If those two numbers agree then explore the difference in unique visitors by page. Both Google and Snowplow primarily base this number on a first party cookie, so again our expectation is that they should be pretty close, with Snowplow reporting higher numbers because of bots.
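
The two checks described above (page views by page, then unique visitors by page) might look roughly like this on the Snowplow side. It is a sketch against an in-memory SQLite table with invented rows; the column names event, page_urlpath and domain_userid follow the Snowplow events table, but the exact query for your warehouse may differ.

```python
import sqlite3

# In-memory stand-in for the events table; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event TEXT, page_urlpath TEXT, domain_userid TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("page_view", "/", "u1"),
        ("page_view", "/", "u1"),
        ("page_view", "/", "u2"),
        ("page_view", "/jobs", "u2"),
        ("page_ping", "/jobs", "u2"),  # not a page view, so it is excluded below
        ("page_view", "/jobs", "u3"),
    ],
)

# Page views and unique visitors per page, restricted to page view events.
QUERY = """
SELECT page_urlpath,
       COUNT(*) AS page_views,
       COUNT(DISTINCT domain_userid) AS unique_visitors
FROM events
WHERE event = 'page_view'
GROUP BY page_urlpath
ORDER BY page_urlpath
"""

by_page = conn.execute(QUERY).fetchall()
for path, page_views, unique_visitors in by_page:
    print(path, page_views, unique_visitors)
```

The resulting rows can then be lined up against the equivalent per-page report exported from GA.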

In general we recommend avoiding comparing session numbers. The GA sessionization logic is very specific (and advertiser-friendly) - the sessionization that the Snowplow JS tracker supports out of the box follows the Adobe simple 30 minute timeout model. If you must compare session numbers we have a guide (incl. SQL):

http://discourse.snowplow.io/t/reconciling-snowplow-and-google-analytics-session-numbers/80

So to summarise: my guess is that an implementation difference accounts for the very large discrepancy you’ve seen - unless you’re a site that attracts a lot of bots. (In which case - how do you filter these out when comparing?) I take it that your Snowplow numbers are higher, which makes me wonder if your GA coverage isn’t 100%? This should become clear once you start slicing the numbers by URL. It would also be worth understanding if you’ve done anything on the GA side to customize how uniques are identified (e.g. passing in your own user identifiers)?

To filter out bots, we are using:
where br_name <> 'Robot/Spider'

That will only get rid of a very limited set of bots and spiders - see

http://discourse.snowplow.io/t/excluding-bots-from-queries-in-redshift-tutorial/127

for details.

There is a list (which costs a few thousand dollars to buy) that vendors like GA and Adobe can use to filter bots, which I’d expect to be more extensive than the list included in the user agent parsing libraries bundled with Snowplow. I’d also expect Google to have proprietary tech for spotting bots. So it’s possible that these bots account for the difference: I’d still do the check by page URL. If the discrepancy is constant across pages (or bigger for pages that are more likely to be crawled) that would suggest bots account for the discrepancy. If the discrepancy is skewed for particular page URLs, that suggests an implementation issue.
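
One way to run that check is to compute the relative gap per page URL and see which pages sit far from the typical gap. The sketch below uses invented counts; the function names and the 10-point tolerance are assumptions for illustration, not a recommended threshold.

```python
from statistics import median


def relative_gaps(ga_counts, sp_counts):
    """Return {url: (snowplow - ga) / ga} for URLs present in both reports."""
    return {
        url: (sp_counts[url] - ga) / ga
        for url, ga in ga_counts.items()
        if ga > 0 and url in sp_counts
    }


def skewed_urls(gaps, tolerance=0.10):
    """Return URLs whose gap differs from the median gap by more than tolerance."""
    typical = median(gaps.values())
    return sorted(url for url, gap in gaps.items() if abs(gap - typical) > tolerance)


# Invented per-URL counts from each tool for the same period.
ga = {"/": 1000, "/jobs": 500, "/about": 200, "/checkout": 100}
sp = {"/": 1030, "/jobs": 520, "/about": 206, "/checkout": 160}

gaps = relative_gaps(ga, sp)
outliers = skewed_urls(gaps)
print(outliers)
```

A roughly uniform gap across pages points towards bots; a handful of outlier URLs, like /checkout in this made-up data, points towards a tracking difference on those pages.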

Should I do the check on page_url?

Yes I would!

@yali
So, I started with page views. We are getting a difference of around 2-3%, but the same is not true for unique visitors (UV).
The numbers we are getting for a particular month:

  • Page views - GA (162437659) and Snowplow (167671647) which gives difference of ~3%

  • UV - GA(1463973) and Snowplow (1986195) which gives difference of ~35%

We compared by page URL as well.
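
For clarity, the percentages above can be reproduced as follows, taking the GA figure as the baseline (an assumption about how the differences were computed).

```python
def relative_difference(ga: int, snowplow: int) -> float:
    """Difference between the two tools as a fraction of the GA figure."""
    return (snowplow - ga) / ga


# Figures quoted in the post for the month in question.
page_view_gap = relative_difference(162437659, 167671647)
unique_visitor_gap = relative_difference(1463973, 1986195)

print(f"page views: {page_view_gap:.1%}")
print(f"unique visitors: {unique_visitor_gap:.1%}")
```

Measured against the Snowplow figure instead, the unique visitor gap would come out lower (about 26%), so it is worth stating the baseline when quoting these numbers.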

The fact that the page views number agrees is great: suggests the tracking tags have been instrumented in a very similar way.

So the question becomes why Snowplow thinks more users have visited the page than GA: it means that Snowplow thinks two users have viewed a page where GA thinks one user’s viewed the page twice. Out-of-the-box, both use a first party cookie ID set on your own domain, so it’s hard to imagine in what circumstance the Snowplow cookie would be deleted but the GA cookie not. How are you identifying users in GA? Is it based on only first party cookie IDs? Or are you pushing in your own user-level identifiers?

@yali It’s based on first party cookie IDs in GA.

@yali
One more thing I would like to add.
We are working with multiple domains.
Domain X - the UV difference is around 35%, as mentioned above
Domain Y - the UV difference is around 3%
We have done the same setup for both domains.

Interesting! So it’s an issue that’s isolated to a specific domain.

Are you doing any cross domain tracking with GA? If so, that could have some impact.

On the domain with the issue: are you tracking across subdomains? (E.g. blog.mysite.com, www.mysite.com, app.mysite.com etc.)? If so - what domain have you set with the Snowplow JS tracker? With the Google JS tracker? (Have you set an explicit domain?)

Just trying to think of reasons why a cookie would reset for Snowplow but not for GA…

@yali
We do not do cross domain tracking with GA. Both domains have exclusive users from 2 different countries.
Also we are not doing tracking across subdomains.

That’s really odd @vivek291836. Can you share what the domain is? Without looking at it, it’s very hard to know in what situation the Snowplow first party cookie would be deleted but not the GA first party cookie… It’s not something we’ve seen before and in your case the issue is domain-specific, so we need to look at the domain for an explanation…

@yali
I shared our domains with you by message. Please let us know if you find anything that can help us solve the GA vs Snowplow issue.

Hi @vivek291836 - what about enabling the gaCookies context in the JS tracker for the domain in question? This should enable you to identify in the data exactly where your Snowplow cookie ID (i.e. the domain_userid field) updates but the corresponding GA cookie, stored in the com_google_analytics_cookies_1._ga column, doesn’t.
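
Once the GA cookie is captured alongside each event, one way to look for the problem is to list GA cookie IDs that map to more than one domain_userid. The sketch below assumes a hypothetical flattened table, events_with_ga, with a _ga column next to domain_userid; in a real warehouse the context data may sit in a separate table that has to be joined to the events first. The rows and cookie values are invented.

```python
import sqlite3

# Hypothetical flattened table: one row per event with both cookie IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_with_ga (_ga TEXT, domain_userid TEXT)")
conn.executemany(
    "INSERT INTO events_with_ga VALUES (?, ?)",
    [
        ("GA1.2.111.222", "a"),  # same GA cookie ...
        ("GA1.2.111.222", "b"),  # ... but the Snowplow ID has changed
        ("GA1.2.333.444", "c"),
        ("GA1.2.333.444", "c"),
    ],
)

# GA cookies seen with more than one Snowplow domain_userid.
QUERY = """
SELECT _ga, COUNT(DISTINCT domain_userid) AS snowplow_ids
FROM events_with_ga
GROUP BY _ga
HAVING COUNT(DISTINCT domain_userid) > 1
ORDER BY snowplow_ids DESC, _ga
"""

resets = conn.execute(QUERY).fetchall()
print(resets)
```

Each row returned is a visitor GA counts once and Snowplow counts several times; looking at the events around the point where domain_userid changes should hint at what resets the Snowplow cookie.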

If you’d like a member of our team to investigate the domain in question I can send you the details of a support contract?

All the best,

Yali