Identifying users (identity stitching)


#1

Identifying users in your Snowplow data

Any attempt at understanding your users has to start with identifying which events, in your Snowplow atomic data set, describe actions that were carried about by particular users. We call this process ‘identity stitching’: it is the process of stitching together the myriad individual events that describe individual user journeys.

Stages in identity stitching

  1. With each event, track as many different user identifiers as possible
  2. Use events where multiple identifiers are present to build a mapping table (graph) of user identifiers for each of your users
  3. Apply that mapping table / graph to your atomic event-level data to identify which event belongs to each of your users

1. Track as many different user identifiers as possible with each event

With each event tracked in Snowplow, we want to capture as many different user-level identifiers as possible. Note that at data capture time we are not interested in definitively deciding which user performed this action. We simply want to capture all the evidence (data points) so that we can make a decision later on in the data modeling process.

Snowplow trackers are build to automatically capture as many of these identifiers as possible automatically:

Collector provided fields

  • events.user_ipaddress - the IP address that the event occurred on
  • events.network_userid - third party cookie ID (set by the Clojure Collector and Scala Stream)

All trackers

  • events.user_id - a user-level identifier that you can set

Javascript tracker

  • events.domain_userid - first party cookie ID
  • events.domain_sessionid - third party session cookie
  • events.user_fingerprint - browser fingerprint

Mobile trackers (Objective-C and Android)

  • com_snowplowanalytics_snowplow_mobile_context_1.open_idfa - open IDFA (user identifier for advertisers)
  • com_snowplowanalytics_snowplow_mobile_context_1.apple_idfa - Apple user identifier for advertisers (Apple-only)
  • com_snowplowanalytics_snowplow_mobile_context_1.apple_idfv - Apple user identifier for vendors i.e. application owners
  • com_snowplowanalytics_snowplow_mobile_context_1.android_idfa - Android user identifier for advertisers
  • com_snowplowanalytics_snowplow_client_session_1.user_id - user ID generated client-side by Snowplow Objective-C and Android trackers
  • com_snowplowanalytics_snowplow_client_session_1.session_id - session ID generated client-side by Snowplow Objective-C and Android trackers

Adding your own identifiers

You can define and pass in as many of your own user-level identifiers, by identifying your own user context e.g.

{
	"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
	"description": "Schema for an ad click event",
	"self": {
		"vendor": "com.mycompany",
		"name": "user_context",
		"format": "jsonschema",
		"version": "1-0-0"
	},

	"type": "object",
	"properties": {
		"id": {
			"type": "string"
		},
		"email": {
			"type": "string"
		},
		"twitterHandle": {
			"type": "string"
		},
		"facebookId": {
			"type": "string"
		}
	},
	"additionalProperties": false
}

It is then possible to send this context i.e. any of these identifiers with any event recorded into Snowplow.

2. Build your user identifier mapping table (graph)

Now we have an atomic data set with a range of different events, often recorded against different platforms, each with a different set of one or more user identifiers.

To take a very typical example, we might have a webapp where users browse marketing material over several sessions, after which a fraction sign up to the service (creating a login ID), after which events are recorded against the login ID.

In the above example, we’d want to be able to:

  1. Correctly aggregate events from before the user signed up, with events after
  2. Correctly aggregate events recorded on different devices that the user accesses the service on

In this case, we’d start by identifying all the events where the user was logged in. For these events, we should have both a cookie ID (events.domain_userid) and a login ID events.user_id:

create table derived.user_mapping as (
	select
	domain_userid,
	user_id
	from atomic.events
	where domain_userid is not null
	and user_id is not null
	group by 1,2
);

The above table maps cookie IDs to user IDS. Note that if a user logs in on multiple devices, each with its own cookie ID, all those cookie IDs will be correctly mapped to the same user.

3. Apply that mapping table (graph) to your atomic event-level data to identify which event belongs to each of your users

We can, as part of the data modeling, use the above table to assign a user ID to a particular event ID, given the cookie ID. So if we run a query like the following:

select
em.user_id,
...
from atomic.events e
left join derived.user_mapping um
on e.domain_userid = um.domain_userid

The value of user_id will be set for every event whether or not our user happened to be logged in when the event occurred. Hence, events tracked when the user was browsing anonymously (prior to signing up with the service) will be correctly identified as belong to that user, with those events that were tracked subsequently.

The above example is pretty simple. You can build more complicated mapping tables, and mapping logic, for example, by:

  1. Adding in additional user identifiers e.g. for events recorded in mobile apps and other platforms
  2. Using more complicated business logic when applying the mapping table to the event level data set (e.g. controlling for particular devices which multiple different users share), or using a probabilistic rather than rule-based / deterministic approach

Network_userid and domain_userid
Cross domain tracking with Snowplow Mini?
First and last touch attribution models in SQL [tutorial]
#2

Hi @yali, I’m looking to extend these with additional 3rd party identifiers for example Google GCLID & CID and similar parameters passed by traffic sources or accessible from their cookies.
Is there a place where you have already documented such functions and parameters?
How can I share what we are doing in this area?


#3

Hi @zivbaram that’s great to hear! Please feel free to share what you’re doing either on this thread if it follows directly or by starting a new thread.

Looking forward to hearing more!

Yali