Google bot sending the same uuids back across multiple pages

I’m using the Snowplow JS Tracker. I noticed that traffic from Googlebot is hitting the Snowplow tracking code but sending the same UUIDs back repeatedly. I enabled contexts.webpage=true, and the bot is sending back iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0 with the same page UUID (data.id) on it across many pages.

Is this a known issue?

I don’t know if this is the case. Could this be an issue with uuid versions? I see an old dependency on uuid from request.

Yea, I see snowplow-tracker-core bringing in uuid ^3.3.3. This version is buggy for bots.

I’m not sure if this is documented anywhere, but this is quite common, specifically with bots including Googlebot. Certain bots that execute JavaScript behave in ways that reduce the entropy used to generate random seeds (or they generate entirely deterministic random numbers). That dramatically increases the chance of collision for anything seeded with things like the time or a random number generator, as is the case with the uuid library.
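To make the collision mechanism concrete, here is a minimal sketch. It assumes a bot whose random number generator starts from the same state on every page load. The seeded generator (mulberry32) and the helper uuidV4From are illustrative stand-ins, not code from the uuid library or the tracker.

```javascript
// Small seeded PRNG (mulberry32), standing in for a bot's deterministic Math.random.
// Assumption: the bot restarts its generator from the same state on each page load.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build a v4-style UUID from 16 bytes drawn from the supplied random() function.
// Hypothetical helper for illustration only.
function uuidV4From(random) {
  const bytes = [];
  for (let i = 0; i < 16; i++) bytes.push(Math.floor(random() * 256));
  bytes[6] = (bytes[6] & 0x0f) | 0x40; // set version 4
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set RFC 4122 variant
  const hex = bytes.map((b) => b.toString(16).padStart(2, "0"));
  return [
    hex.slice(0, 4).join(""),
    hex.slice(4, 6).join(""),
    hex.slice(6, 8).join(""),
    hex.slice(8, 10).join(""),
    hex.slice(10, 16).join(""),
  ].join("-");
}

// Two separate "page loads" whose generator starts from the same state.
const pageA = uuidV4From(mulberry32(42));
const pageB = uuidV4From(mulberry32(42));
console.log(pageA === pageB); // true: identical "random" UUIDs on different pages
```

Nothing here is specific to Googlebot. Any client that replays the same random sequence will produce the same id on every page.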

Got it. Any chance someone can update the uuid version in the Snowplow JS Tracker? That’ll probably fix most of the issues I’m seeing.

Are you seeing other issues outside of Googlebot duplicate ids? If so this is worth investigating (and potentially worth bumping the uuid version).

I don’t know. Most (if not all) appear to be bot traffic. My app does not generate enough traffic for me to notice other benefits related to incrementing the uuid version.

:wave: The uuid library is a little stuck at v3.x in the JS Tracker. From v7, they dropped support for IE 9 and 10, which we still support in the JS Tracker.

In v3, it checks for the crypto libraries for random number generation, and if they aren’t there (as with bots and old IE), it falls back to a Math.random implementation. Unfortunately, lots of bots have TERRIBLE Math.random implementations, so you end up with lots of UUID collisions when bots land on your page. This Math.random fallback no longer exists in newer versions of the library: it either uses crypto or fails to generate a uuid (so you’d get no uuid when the traffic was bot traffic and/or IE 9/10).
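Roughly, the two behaviours look like the sketch below. This is a simplified illustration of the feature detection described above, not the uuid library’s actual source. The names pickRng and pickRngStrict, the env parameter (a stand-in for the browser global) and the error message are all made up for the example.

```javascript
// Returns a 16-byte generator, preferring a crypto source when one exists.
// `env` is a hypothetical stand-in for the browser global (window).
function pickRng(env) {
  const cryptoObj = env.crypto || env.msCrypto;
  if (cryptoObj && typeof cryptoObj.getRandomValues === "function") {
    return {
      source: "crypto",
      bytes: () => Array.from(cryptoObj.getRandomValues(new Uint8Array(16))),
    };
  }
  // v3-style fallback: quality depends entirely on the host's Math.random.
  const random = env.Math ? env.Math.random : Math.random;
  return {
    source: "math-random",
    bytes: () => {
      const out = [];
      for (let i = 0; i < 16; i++) out.push(Math.floor(random() * 256));
      return out;
    },
  };
}

// Newer-style behaviour: no fallback, so generation fails without crypto.
function pickRngStrict(env) {
  const rng = pickRng(env);
  if (rng.source !== "crypto") {
    throw new Error("no crypto random source available"); // illustrative message
  }
  return rng;
}

// A bot-like environment: no crypto, and a Math.random that never varies.
const botLike = { Math: { random: () => 0.5 } };
const rng = pickRng(botLike);
console.log(rng.source); // "math-random"
console.log(rng.bytes().join(",") === rng.bytes().join(",")); // true
```

With the strict variant the same bot would produce no id at all rather than a colliding one, which is the trade-off described above.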

This is a tricky one to solve (I’ve been pondering this given the impending major release of the JS tracker). Upgrading to uuid v8.x means we’d have to drop IE 9 and 10 support. That might seem tempting at first, but the Snowplow JavaScript Tracker needs to work in as many places as possible so users can understand where all their traffic is coming from, which means supporting the widest possible range of browsers. It’s also hard (although not impossible, as I could wrap it) to create an IE 9/10 compatible version of the tracker, because the uuid library isn’t API-compatible between v3 and v8. I’ve been trying not to create an IE 9/10 specific version, as I don’t really want to live in a world with different code paths (a maintenance and debugging nightmare waiting to happen).

That was a bit of a brain dump on where we’re at with uuids in js. Open to thoughts and ideas!

It also looks like the latest uuid version still has issues with bots.

I’m not too surprised by this. Googlebot and other bots are often deliberately deterministic and take shortcuts in the JavaScript engine, as well as in other browser functionality, purely because executing JavaScript at that scale is so expensive.

Using the IAB enrichment to flag bots is a good solution, as many of these bots and spiders will have similar issues, and the fields that contain unexpected values are often not required in downstream data models.
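As a rough sketch of what filtering on that flag could look like downstream: the event shape here (a derivedContexts array of { schema, data } objects) is a simplification made up for this example. The schema prefix and the spiderOrRobot field reflect my understanding of the IAB enrichment’s output, so treat both as assumptions and check them against your own pipeline.

```javascript
// Assumed schema prefix for the context attached by the IAB enrichment.
const IAB_SCHEMA_PREFIX = "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/";

// True when an event carries an IAB context that marks it as a spider or robot.
// `derivedContexts` is a hypothetical, simplified event shape.
function isFlaggedBot(event) {
  return (event.derivedContexts || []).some(
    (ctx) =>
      typeof ctx.schema === "string" &&
      ctx.schema.startsWith(IAB_SCHEMA_PREFIX) &&
      Boolean(ctx.data) &&
      ctx.data.spiderOrRobot === true
  );
}

// Keep only the events that were not flagged.
function dropBots(events) {
  return events.filter((event) => !isFlaggedBot(event));
}
```

Flagged events can then be excluded before page view ids are used for deduplication or joins, so colliding ids from bots never reach those models.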
