Is the user fingerprint unique and when should it be used?


#1

We received this question from one of our users:

We see a lot of user_fingerprint corresponding to different domain_userid and many domain_userid corresponding to different user_fingerprint. Is it possible that neither the domain_userid nor the user_fingerprint are unique?

Because the answer might help other users, we’re cross-posting it to Discourse.

The short answer is this: it should be impossible, in practice, for 2 different visitors to get assigned the same domain_userid, but it is possible for different visitors to have the same user_fingerprint. The user fingerprint should not be used as the main identifier, but it can be used in combination with other fields in rare cases, if there’s a strong need for additional identifiers.

Note that domain_userid and user_fingerprint are 2 fields specific to the JavaScript tracker.

How is the domain user ID generated and is it unique?

The domain user ID (domain_userid in atomic.events) is a UUID, which the JavaScript tracker generates and writes to a cookie for future visits. Because it’s a UUID, it should be impossible in practice for 2 different visitors to be assigned the same domain_userid. The domain_userid does suffer from the traditional limitations associated with cookie storage (the ID is lost when cookies are cleared, and it does not follow a visitor across browsers or devices), but, all things considered, we still recommend it as the main identifier for visitors that are not logged in.
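To give a feel for why a collision is so unlikely, here is a minimal Python sketch. It assumes the ID is a random (version 4) UUID with 122 random bits, which this post does not specify; new_domain_userid and collision_probability are hypothetical helper names, not part of any tracker.

```python
import math
import uuid


def new_domain_userid() -> str:
    """Illustrative stand-in for a visitor ID: a random (version 4) UUID."""
    return str(uuid.uuid4())


def collision_probability(n_visitors: int, random_bits: int = 122) -> float:
    """Birthday-bound approximation of the chance that any two IDs collide.

    p is approximately 1 - exp(-n^2 / 2^(random_bits + 1)).
    """
    exponent = (n_visitors ** 2) / 2 ** (random_bits + 1)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(new_domain_userid())
    print(collision_probability(10 ** 9))
```

Even with a billion visitors, the approximation stays many orders of magnitude below anything observable, which is why duplicates of domain_userid across visitors should not be the explanation for the pattern in the question.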

How is the user fingerprint generated?

The user fingerprint (user_fingerprint in atomic.events) is generated once with each page load, unless the user explicitly calls the setUserFingerprint method. This is how it’s generated in the JavaScript tracker:

It takes the user agent, the screen dimensions and colour depth, the timezone, the existence of session storage and local storage, and the list of plugins as inputs and uses the murmurhash function to convert those into the final fingerprint.
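The idea can be sketched in Python. This is not the tracker's code: the field names, the separator, the seed and the order of the inputs are all assumptions made for illustration, and murmur3_32 is a plain 32-bit MurmurHash3 written out here so the example is self-contained. The values it produces will not match the ones the tracker sends.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 of data, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    rounded = n - (n % 4)

    def mix(k: int) -> int:
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left by 15
        return (k * c2) & mask

    # Body: process the input 4 bytes at a time.
    for i in range(0, rounded, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))

    # Finalisation: mix in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def fingerprint(signals: dict, seed: int = 0) -> int:
    """Hash browser signals into one number (hypothetical field names)."""
    parts = [
        signals["user_agent"],
        f'{signals["screen_width"]}x{signals["screen_height"]}x{signals["colour_depth"]}',
        str(signals["timezone_offset"]),
        str(signals["has_session_storage"]),
        str(signals["has_local_storage"]),
        ",".join(signals["plugins"]),
    ]
    return murmur3_32("###".join(parts).encode("utf-8"), seed)


# Two machines with the same setup (for example, in a school lab).
lab_machine = {
    "user_agent": "ExampleBrowser/1.0",
    "screen_width": 1920,
    "screen_height": 1080,
    "colour_depth": 24,
    "timezone_offset": -60,
    "has_session_storage": True,
    "has_local_storage": True,
    "plugins": ["pdf-viewer"],
}
other_lab_machine = dict(lab_machine)

# The same machine after a browser update changes the user agent.
after_update = dict(lab_machine, user_agent="ExampleBrowser/1.1")

print(fingerprint(lab_machine) == fingerprint(other_lab_machine))
print(fingerprint(lab_machine) == fingerprint(after_update))
```

Nothing in the inputs identifies an individual visitor: identically configured machines hash to the same value, and any change to one of the inputs gives the same visitor a new value.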

Is the user fingerprint stable?

No, we expect it to change over time. For example, if the user agent or the list of plugins changes, the user_fingerprint will also change.

We can run a couple of queries to illustrate this. Let’s start with:

SELECT
  fingerprints,
  COUNT(*)
FROM (
  SELECT
    domain_userid,
    COUNT(DISTINCT user_fingerprint) AS fingerprints
  FROM atomic.events
  WHERE user_fingerprint IS NOT NULL
  GROUP BY 1
)
GROUP BY 1
ORDER BY 1

This query counts the number of distinct fingerprints for each visitor (domain_userid) and returns the distribution. In the case of our website, we find that:

  • 90% of domain_userid have exactly one user_fingerprint
  • 99.5% of domain_userid have between 1 and 5 user_fingerprint

Let’s look at how these numbers change with the number of sessions (as an approximation for time elapsed):

SELECT
  sessions,
  AVG(fingerprints),
  COUNT(*)
FROM (
  SELECT
    domain_userid,
    COUNT(DISTINCT user_fingerprint)::FLOAT AS fingerprints,
    COUNT(DISTINCT domain_sessionidx) AS sessions,
    COUNT(*)
  FROM atomic.events
  WHERE collector_tstamp > '2015-01-01'
  AND user_fingerprint IS NOT NULL
  GROUP BY 1
)
GROUP BY 1
ORDER BY 1

The results for our website look like this:

The longer a visitor is active, the higher the chance that the user fingerprint will change at least once. It will, on average, have changed at least 3 times for visitors that had 20 sessions or more. This is one of the reasons we recommend against using the user_fingerprint as the main identifier.

Is the user fingerprint unique?

In other words, if 2 events have a different domain_userid but the same user_fingerprint, can we conclude that these are from the same visitor? The answer is also no.

You’ll find examples of events that belong to different visitors, and other examples where it’s clear that it is the same visitor. The question is how often each case occurs. This is a bit harder to measure, but one approach is to measure the ratio for visitors that are logged in:

WITH prep AS ( -- which fingerprints have more than 1 domain user ID?

  SELECT
    user_fingerprint,
    COUNT(DISTINCT domain_userid)
  FROM atomic.events
  WHERE user_fingerprint IS NOT NULL
    AND user_id IS NOT NULL
  GROUP BY 1

)

SELECT
  users,
  COUNT(*)
FROM (
  SELECT
    user_fingerprint,
    COUNT(DISTINCT user_id) AS users
  FROM atomic.events
  WHERE user_id IS NOT NULL
    AND user_fingerprint IN (SELECT user_fingerprint FROM prep WHERE count > 1)
  GROUP BY 1
)
GROUP BY 1
ORDER BY 1

I ran this against 2 different datasets, both with more than 500 million logged-in events. In both cases, more than 50% of user_fingerprint mapped onto more than one user. This suggests that, if 2 events have a different domain_userid but the same user_fingerprint, the more probable scenario is that they come from different visitors.

Does that mean the user fingerprint cannot be used? No, there are a couple of use cases. If we concatenate the user fingerprint and the IP address, the share of user_fingerprint (with different domain_userid) that map onto just one user goes up. It went up to 70% in one case, and to 90% in the other. The difference might be due to the former having a lot of schools (and therefore identical computers) among their customers.
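The comparison can be sketched in Python on a handful of made-up events. The events below are toy data, not results from either dataset, and users_per_key and share_single_user are hypothetical helper names. Unlike the query above, this simplified version does not first restrict to fingerprints seen with more than one domain_userid.

```python
from collections import defaultdict


def users_per_key(events, key_fields):
    """Map each composite key to the set of logged-in user_ids seen with it."""
    users = defaultdict(set)
    for event in events:
        # Only logged-in events with a fingerprint can be compared.
        if event.get("user_id") is None or event.get("user_fingerprint") is None:
            continue
        key = tuple(event[field] for field in key_fields)
        users[key].add(event["user_id"])
    return users


def share_single_user(events, key_fields):
    """Fraction of keys that map onto exactly one logged-in user."""
    groups = users_per_key(events, key_fields)
    if not groups:
        return 0.0
    single = sum(1 for ids in groups.values() if len(ids) == 1)
    return single / len(groups)


# Toy events: two users share a fingerprint but sit behind different IPs.
events = [
    {"user_id": "alice", "user_fingerprint": "111", "user_ipaddress": "10.0.0.1"},
    {"user_id": "bob", "user_fingerprint": "111", "user_ipaddress": "10.0.0.2"},
    {"user_id": "carol", "user_fingerprint": "222", "user_ipaddress": "10.0.0.3"},
]

print(share_single_user(events, ["user_fingerprint"]))
print(share_single_user(events, ["user_fingerprint", "user_ipaddress"]))
```

Adding the IP address to the key splits the shared fingerprint into separate keys, so a larger share of keys maps onto a single user. Users behind the same IP with identical machines would still collide.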

The concatenation of the user fingerprint and the IP address is still not a strong signal, but it is an option if there’s a strong need for an identifier that can be used in addition to the domain user ID.


#2

Hi @christophe

I am receiving different domain_userid values for the same user for most entries. Any suggestions as to why it behaves like this?


#3

Found answer here: https://github.com/snowplow/snowplow/issues/2696#issuecomment-222757042

The user_id in the client session context is typically used as a synthetic (tracker or collector-created) proxy ID for the user, with this proxy user ID being held in some kind of storage (maybe cookie or SQLite or similar) so that it (mostly) survives across sessions

Correct me if I misunderstood the answer.


#4

Hi @v3nom,

Which user ID field was used to compare against the domain user ID?

Christophe