We received this question from one of our users:
We see a lot of user_fingerprint corresponding to different domain_userid, and domain_userid corresponding to different user_fingerprint. Is it possible that neither the
Because the answer might help other users, we’re cross-posting it to Discourse.
The short answer is this: it should be impossible, in practice, for 2 different visitors to get assigned the same domain_userid, but it is possible for different visitors to have the same user_fingerprint. The user fingerprint should not be used as the main identifier, but it can be used in combination with other fields in rare cases, if there’s a strong need for additional identifiers.
Note that the
domain_userid and the
How is the domain user ID generated and is it unique?
The domain user ID (
domain_userid does suffer from the traditional limitations associated with cookie storage, but, all things considered, we still recommend it as the main identifier for visitors that are not logged in.
How is the user fingerprint generated?
The user fingerprint (user_fingerprint in atomic.events) is generated once with each page load, unless the user explicitly calls the
It takes the user agent, the screen dimensions and colour depth, the timezone, the existence of session storage and local storage, and the list of plugins as inputs, and uses the murmurhash function to convert those into the final fingerprint.
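To make the idea concrete, here is a minimal Python sketch of hashing a set of browser attributes into a single fingerprint. It is an illustration only: the real fingerprint is computed in the browser, and the exact murmurhash variant, seed, and input serialisation are not stated in this post. The 32-bit MurmurHash3 variant, the seed of 0, the JSON serialisation, and the names `murmur3_32`, `fingerprint` and the attribute keys are all assumptions.

```python
import json


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) of a byte string, as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    rounded = n - (n % 4)
    # Body: mix in 4-byte little-endian blocks.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalisation: mix in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def fingerprint(attrs: dict, seed: int = 0) -> int:
    """Hash browser attributes (hypothetical keys) into one integer."""
    # Assumption: serialise with sorted keys so equal inputs hash equally.
    payload = json.dumps(attrs, sort_keys=True).encode("utf-8")
    return murmur3_32(payload, seed)


# Made-up attribute values, for illustration only.
browser = {
    "useragent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen": [1920, 1080],
    "colour_depth": 24,
    "timezone_offset": -60,
    "session_storage": True,
    "local_storage": True,
    "plugins": ["PDF Viewer"],
}
same = fingerprint(browser)
# Installing one more plugin changes the input, and so the hash.
changed = fingerprint({**browser, "plugins": ["PDF Viewer", "Flash"]})
```

Because every input feeds into a single hash, any change to any attribute yields an unrelated value, and two machines with identical attributes yield the same value.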
Is the user fingerprint stable?
No, we expect it to change over time. For example, if the user agent or the list of plugins changes, the user_fingerprint will also change.
We can run a couple of queries to illustrate this. Let’s start with:

    SELECT
      fingerprints,
      COUNT(*)
    FROM (
      -- number of distinct fingerprints per visitor
      SELECT
        domain_userid,
        COUNT(DISTINCT user_fingerprint) AS fingerprints
      FROM atomic.events
      WHERE user_fingerprint IS NOT NULL
      GROUP BY 1
    ) AS per_visitor
    GROUP BY 1
    ORDER BY 1

For each visitor (domain_userid), this counts the number of unique fingerprints and returns the distribution. In the case of our website, we find that:
- 90% of domain_userid have a unique user_fingerprint
- 99.5% of domain_userid have between 1 and 5 user_fingerprint
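For readers who want to check what the query above computes, here is a small Python sketch of the same logic on made-up rows. The function name `fingerprint_distribution` and the toy data are hypothetical; only the column names come from the query.

```python
from collections import Counter, defaultdict


def fingerprint_distribution(events):
    """events: iterable of (domain_userid, user_fingerprint) pairs.

    Returns {number of distinct fingerprints: number of visitors}.
    """
    seen = defaultdict(set)
    for domain_userid, user_fingerprint in events:
        # Mirrors WHERE user_fingerprint IS NOT NULL.
        if user_fingerprint is not None:
            seen[domain_userid].add(user_fingerprint)
    counts = Counter(len(fps) for fps in seen.values())
    return dict(sorted(counts.items()))


# Made-up events: u1 and u3 keep one fingerprint, u2 has two,
# and u4 has no fingerprint at all, so it is left out.
toy_events = [
    ("u1", "f1"), ("u1", "f1"),
    ("u2", "f2"), ("u2", "f3"),
    ("u3", "f4"),
    ("u4", None),
]
distribution = fingerprint_distribution(toy_events)
# distribution == {1: 2, 2: 1}
```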
Let’s look at how these numbers change with the number of sessions (as an approximation for time elapsed):

    SELECT
      sessions,
      AVG(fingerprints),
      COUNT(*)
    FROM (
      -- distinct fingerprints and sessions per visitor
      SELECT
        domain_userid,
        COUNT(DISTINCT user_fingerprint)::FLOAT AS fingerprints,
        COUNT(DISTINCT domain_sessionidx) AS sessions
      FROM atomic.events
      WHERE collector_tstamp > '2015-01-01'
        AND user_fingerprint IS NOT NULL
      GROUP BY 1
    ) AS per_visitor
    GROUP BY 1
    ORDER BY 1

The results for our website look like this:
The longer a visitor is active, the higher the chance that the user fingerprint will change at least once. It will, on average, have changed at least 3 times for visitors that had 20 sessions or more. This is one of the reasons we recommend against using the user_fingerprint as the main identifier.
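The grouping by session count can be sketched in Python in the same way. This is a toy illustration under assumed data, not the results for our website; the function name `avg_fingerprints_by_sessions` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def avg_fingerprints_by_sessions(events):
    """events: iterable of (domain_userid, domain_sessionidx, user_fingerprint).

    Returns {number of sessions: average number of distinct fingerprints}.
    """
    fingerprints = defaultdict(set)
    sessions = defaultdict(set)
    for domain_userid, session_idx, user_fingerprint in events:
        if user_fingerprint is not None:
            fingerprints[domain_userid].add(user_fingerprint)
            sessions[domain_userid].add(session_idx)
    # Group visitors by how many sessions they had.
    by_sessions = defaultdict(list)
    for visitor, fps in fingerprints.items():
        by_sessions[len(sessions[visitor])].append(len(fps))
    return {n: mean(counts) for n, counts in sorted(by_sessions.items())}


# Made-up events: u1 has 1 session, u2 and u3 have 3 sessions each,
# and only u2 sees its fingerprint change.
toy_events = [
    ("u1", 1, "f1"),
    ("u2", 1, "f2"), ("u2", 2, "f2"), ("u2", 3, "f3"),
    ("u3", 1, "f4"), ("u3", 2, "f4"), ("u3", 3, "f4"),
]
averages = avg_fingerprints_by_sessions(toy_events)
# averages == {1: 1, 3: 1.5}
```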
Is the user fingerprint unique?
In other words, if 2 events have a different domain_userid but the same user_fingerprint, can we conclude that these are from the same visitor? The answer is also no.
You’ll find examples of events that belong to different visitors, and other examples where it’s clear that it is the same visitor. The question is how often each case occurs. This is a bit harder to measure, but one approach is to measure the ratio for visitors that are logged in:

    WITH prep AS (
      -- which fingerprints have more than 1 domain user ID?
      SELECT
        user_fingerprint,
        COUNT(DISTINCT domain_userid) AS domain_userids
      FROM atomic.events
      WHERE user_fingerprint IS NOT NULL
        AND user_id IS NOT NULL
      GROUP BY 1
    )
    SELECT
      users,
      COUNT(*)
    FROM (
      -- how many logged-in users does each of those fingerprints map onto?
      SELECT
        user_fingerprint,
        COUNT(DISTINCT user_id) AS users
      FROM atomic.events
      WHERE user_id IS NOT NULL
        AND user_fingerprint IN (SELECT user_fingerprint FROM prep WHERE domain_userids > 1)
      GROUP BY 1
    ) AS per_fingerprint
    GROUP BY 1
    ORDER BY 1

I ran this against 2 different datasets, both with more than 500 million logged-in events. In both cases, more than 50% of user_fingerprint mapped onto more than one user. This suggests that, if 2 events have a different domain_userid but the same user_fingerprint, the more probable scenario is that they come from different visitors.
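The logic of that query can be sketched in Python on made-up rows. The function name `users_per_shared_fingerprint` and the data are hypothetical; the point is only to show which fingerprints are kept and what is counted for each.

```python
from collections import Counter, defaultdict


def users_per_shared_fingerprint(events):
    """events: iterable of (user_fingerprint, domain_userid, user_id).

    Keeps fingerprints seen with more than one domain_userid and returns
    {number of distinct user_id: number of such fingerprints}.
    """
    domain_ids = defaultdict(set)
    user_ids = defaultdict(set)
    for user_fingerprint, domain_userid, user_id in events:
        # Only logged-in events with a fingerprint are considered.
        if user_fingerprint is None or user_id is None:
            continue
        if domain_userid is not None:
            domain_ids[user_fingerprint].add(domain_userid)
        user_ids[user_fingerprint].add(user_id)
    shared = [fp for fp, ids in domain_ids.items() if len(ids) > 1]
    counts = Counter(len(user_ids[fp]) for fp in shared)
    return dict(sorted(counts.items()))


# Made-up events: fA is one user on two cookies, fB is two different
# users, and fC has a single domain_userid, so it is not counted.
toy_events = [
    ("fA", "d1", "alice"), ("fA", "d2", "alice"),
    ("fB", "d3", "bob"), ("fB", "d4", "carol"),
    ("fC", "d5", "dave"),
    ("fB", "d6", None),
]
result = users_per_shared_fingerprint(toy_events)
# result == {1: 1, 2: 1}
```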
Does that mean the user fingerprint cannot be used? No, there are a couple of use cases. If we concatenate the user fingerprint and the IP address, the share of user_fingerprint (with different domain_userid) that map onto just one user goes up. It went up to 70% in one case, and to 90% in the other. The difference might be due to the former having a lot of schools (and therefore identical computers) among their customers.
The concatenation of the user fingerprint and the IP address is still not a strong signal, but it is an option if there’s a strong need for an identifier that can be used in addition to the domain user ID.
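To illustrate why adding the IP address can help, here is a Python sketch that compares the two identifiers on made-up rows. The function name `single_user_share` and the data are hypothetical, and the field name `user_ipaddress` is an assumption, since this post does not name the IP column.

```python
from collections import defaultdict


def single_user_share(events, key):
    """Share of identifiers (seen with more than one domain_userid)
    that map onto exactly one user_id. Returns None if there are none."""
    domain_ids = defaultdict(set)
    user_ids = defaultdict(set)
    for event in events:
        if event["user_id"] is None or event["user_fingerprint"] is None:
            continue
        identifier = key(event)
        domain_ids[identifier].add(event["domain_userid"])
        user_ids[identifier].add(event["user_id"])
    shared = [i for i, ids in domain_ids.items() if len(ids) > 1]
    if not shared:
        return None
    return sum(len(user_ids[i]) == 1 for i in shared) / len(shared)


def row(fingerprint, domain_userid, user_id, ip):
    # "user_ipaddress" is an assumed field name.
    return {"user_fingerprint": fingerprint, "domain_userid": domain_userid,
            "user_id": user_id, "user_ipaddress": ip}


# Made-up events: bob and carol share fingerprint fB but not an IP address.
toy_events = [
    row("fA", "d1", "alice", "1.1.1.1"),
    row("fA", "d2", "alice", "1.1.1.1"),
    row("fB", "d3", "bob", "2.2.2.2"),
    row("fB", "d4", "carol", "3.3.3.3"),
    row("fB", "d5", "bob", "2.2.2.2"),
]
by_fingerprint = single_user_share(
    toy_events, key=lambda e: e["user_fingerprint"])
by_fingerprint_and_ip = single_user_share(
    toy_events, key=lambda e: (e["user_fingerprint"], e["user_ipaddress"]))
# by_fingerprint == 0.5, by_fingerprint_and_ip == 1.0
```

In this toy data the combined key separates bob from carol, but two identical machines behind the same IP address would still collide.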