Scala Stream Collector: add support for cookie bounce


#1

Context

More and more web clients either do not support 3rd party cookies or block them. As a result, a single user browsing the website produces a series of events that share the same domain_userid but carry many different network_userids. For later processing the network_userid is then completely useless and could be set to a fixed value such as 00000000-0000-4000-A000-000000000000, which still adheres to the UUID rules.
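
A quick sketch to check the UUID claim using java.util.UUID from the JDK. The object and value names are made up for illustration; only the UUID string itself comes from the proposal.

```scala
import java.util.UUID

object FallbackIdCheck {
  // The fixed value suggested in the proposal.
  val fallback: UUID = UUID.fromString("00000000-0000-4000-A000-000000000000")

  // The version lives in the first hex digit of the third group ("4000" -> 4).
  val isVersion4: Boolean = fallback.version() == 4

  // Variant 2 is the RFC 4122 layout; it is encoded in the top bits of the fourth group ("A000").
  val isRfc4122Variant: Boolean = fallback.variant() == 2
}
```

Both checks hold for this value, so downstream code that validates network_userid as a version 4 UUID should keep accepting it.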

Why

It saves a lot of computation later in the process, because it cuts down the flood of useless network_userids, which will never be seen again.

How is it done?

Usually the cookie bounce works like this: the collector takes the request and checks for the presence of the defined cookie name.

  • If the cookie is there: use it and process the request.
  • If the cookie is missing, the collector issues a redirect to itself with a magic/unique/special query parameter and the Set-Cookie header. Once the client follows the redirect, the collector checks for the presence of the cookie again and, if it is now there, processes the request as usual.
  • If the client still does not present the cookie, it is clear that 3rd party cookies do not work.
  • In this case the network_userid is set to the defined value and the request is processed.
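
The steps above could be sketched roughly like this. It is a minimal illustration of the decision logic only, not the collector's actual code: the cookie name "sp", the query parameter "bounce", and all type and function names are assumptions made for the example.

```scala
object CookieBounce {
  // Assumed names for illustration; the real cookie name and query parameter may differ.
  val CookieName = "sp"
  val BounceParam = "bounce"
  // Fixed fallback value taken from the proposal.
  val FallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"

  sealed trait Decision
  // Process the request with the given network_userid.
  final case class Process(networkUserId: String) extends Decision
  // Redirect to the same URL with the bounce parameter and a Set-Cookie header carrying this ID.
  final case class Redirect(newNetworkUserId: String) extends Decision

  def decide(
      cookies: Map[String, String],
      queryParams: Map[String, String],
      freshId: () => String = () => java.util.UUID.randomUUID().toString
  ): Decision =
    cookies.get(CookieName) match {
      // Cookie present: use it, whether or not this request is a bounce.
      case Some(id) => Process(id)
      // Already bounced once and still no cookie: 3rd party cookies do not work.
      case None if queryParams.contains(BounceParam) => Process(FallbackNetworkUserId)
      // First contact without a cookie: try to set one and bounce.
      case None => Redirect(freshId())
    }
}
```

Because the bounce parameter is checked before redirecting, a client without cookie support is redirected at most once per request.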

Implications

This behaviour results in more traffic due to the additional redirect. We observed an increase of roughly 30%.

Ideally the value of the network_userid could be specified in the config. If nothing is specified, the collector would work as it does today.
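
One way to picture the opt-in behaviour is an optional setting that falls back to the current behaviour when it is absent. This is only a sketch under that assumption; the config shape and all names below are hypothetical and not the collector's real configuration.

```scala
object BounceConfigSketch {
  // Hypothetical config shape: None means the feature is switched off.
  final case class CookieBounceConfig(fallbackNetworkUserId: Option[String])

  // Picks the network_userid once it is known whether the client sent the cookie.
  def resolveNetworkUserId(
      cookieValue: Option[String],
      config: CookieBounceConfig,
      freshId: () => String
  ): String =
    cookieValue                              // 1. a cookie sent by the client always wins
      .orElse(config.fallbackNetworkUserId)  // 2. otherwise the configured fixed value, if any
      .getOrElse(freshId())                  // 3. nothing configured: generate a new ID as before
}
```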

See also

snowplow/snowplow#2697

Request for comments

Please reply in this thread.


#2

Hey @christoph-buente - thanks for sharing this! I’ve added it to a new sub-category, RFCs (Requests for Comment) under a new category, Roadmap. The intention of this category is to open up the Snowplow development process and make it more collaborative.

It’s awesome that our first RFC is from the community - I look forward to the comments and feedback on your proposal!


#3

Sounds interesting @christoph-buente !

What would the latency impact of the bounce be? I assume it may affect users who disable third-party (TP) cookies more, since all their requests are redirected.

From your post it sounds like you’re trialling this currently. Are you seeing fewer, more or about the same number of events through this cookie-bounce approach than you are with a stable SSC?


#4

Hi Rob,

Latency depends mainly on the network. But as the redirect goes to the same URL, the DNS lookup has already been done. So the main influence is the location of the user relative to the location of the collector.

We use this type of behaviour in other parts of our systems, and it does not create more events. Ideally it should not create fewer events either, but as with every redirect, a small percentage gets lost. Usually that happens when the browser cuts off with “Too many redirects”, which would not be the case here because there is just one.

But the main advantage is that you have a fixed network_userid that clearly indicates TP cookies were not allowed. You can then either sort those events out in the enrichment stage or handle them differently in later processing jobs.
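
As a rough illustration of handling those events separately in a later job, something like the following would do. The Event case class is a simplified stand-in with made-up field names, not the real enriched event format.

```scala
object FallbackFilter {
  // Fixed value from the proposal; compared case-insensitively below.
  val FallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"

  // Simplified stand-in for an event; field names are assumptions.
  final case class Event(eventId: String, domainUserId: String, networkUserId: String)

  // Returns (events with a usable network_userid, events from clients without TP cookies).
  def splitByCookieSupport(events: Seq[Event]): (Seq[Event], Seq[Event]) =
    events.partition(e => !e.networkUserId.equalsIgnoreCase(FallbackNetworkUserId))
}
```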


#5

@robkingston to cut down on the extra latency, the tracking client could add an extra param indicating that this is the first event in the session, which would allow the collector to do the redirect check only once per session.


#6

The RFC has been implemented and submitted as a pull request.

https://github.com/snowplow/snowplow/pull/2755