Snowplow R100 Epidaurus released with PII pseudonymization support

We are excited to announce the release of Snowplow R100 Epidaurus:

https://snowplowanalytics.com/blog/2018/02/27/snowplow-r100-epidaurus-released-with-pii-pseudonymization-support/

This streaming pipeline release adds support for pseudonymizing user PII (Personally Identifiable Information) through a new Snowplow enrichment.

We are initially adding this new PII Enrichment to the Snowplow streaming pipeline; extending this support to the batch pipeline will follow in due course.

This release is intended to help our users on their journey through GDPR:


@knservis no changes to bump version in emr config?

Hi @mjensen, no, this is a Stream Enrich release. Support for this enrichment in the batch pipeline (Spark Enrich) will arrive in a future Snowplow version.

@alex got it thanks

If this release introduces pseudonymisation using hashing, does anonymisation use two way encryption?

I would have thought it would be the other way around - pseudonymisation uses encryption (so that the original information can be re-extracted and used if necessary) and anonymisation would use one-way hashing algorithms like SHA-256 (where it’s impossible to get the original data back unless you already have it)

Have I misunderstood something along the way?

Hi @jrpeck1989, both encryption and hashing substitute a value with an alias (pseudonym). In the case of hashing, you could either hash all possible values (if that is feasible) and find out what the original value was, or you could build a lookup table of the hashed values (as we are doing in a subsequent release, although we are also adding salt, and the lookup table will be secured). The point is that accidental and casual use of a data subject’s PII is averted, but it is not impossible to recover at least some information given sufficient resources and internal knowledge. To me, true anonymisation would mean replacing each PII value with a random value, or downsampling it sufficiently (e.g. 192.168.255.1 -> 192.168.x.x or “Jim Beam” -> “J B”), before that information hits any permanent storage, although I cannot imagine how you would be able to do that on a per data subject basis. At least that is my understanding of the two terms. I am happy to be told otherwise. What are your thoughts?
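To make the distinction above concrete, here is a minimal Python sketch. It is only an illustration, not the enrichment's actual code: the function names are made up, and using HMAC as the "salted" hash is an assumption, since the thread only says that salt will be added. It shows why an unsalted hash of a small value space can be reversed by enumeration, and what downsampling looks like for the two examples given.

```python
import hashlib
import hmac
import ipaddress


def sha256_hex(value: str) -> str:
    """Plain, unsalted SHA-256 of a string, as a hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def salted_hash(value: str, salt: bytes) -> str:
    """Keyed hash (HMAC-SHA256). Assumption: one way 'adding salt' could work."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def truncate_ipv4(ip: str, keep_octets: int = 2) -> str:
    """Downsample an IPv4 address by masking the trailing octets with 'x'."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(octets[:keep_octets] + ["x"] * (4 - keep_octets))


def initials(name: str) -> str:
    """Downsample a name to its initials."""
    return " ".join(part[0] for part in name.split())


# Reversing an unsalted hash by hashing every candidate in a small value space.
candidates = [f"192.168.255.{i}" for i in range(256)]
lookup = {sha256_hex(ip): ip for ip in candidates}
observed = sha256_hex("192.168.255.1")
recovered = lookup.get(observed)  # the original address is found again

# With a secret salt, the same enumeration no longer matches the stored hash.
secret = b"hypothetical-secret-salt"
salted = salted_hash("192.168.255.1", secret)
still_found = salted in lookup  # False: the attacker would also need the salt

# Downsampling discards information instead of aliasing it.
masked_ip = truncate_ipv4("192.168.255.1")
masked_name = initials("Jim Beam")
```

The enumeration only works because the candidate space here is tiny; the same idea applies to any PII field whose possible values can be listed or guessed.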

So when you collect the information, the hashing algo randomises the information, and the data is then permanently stored (S3 and Redshift) in its hashed form - is this correct?

Or are you saying there is somewhere the data is stored in its original format, it’s actually hashed during the enrichment process, and should you need it you can use it for your purposes?

@jrpeck1989 As of R100 the value is just hashed. It is not randomised, meaning it is not substituted with a random value; each value is replaced with its hash. The original value is not kept in the enrichment, but could possibly be retrieved from raw logs if those logs are not discarded.

In a later release, there will be an option (which will need to be enabled) to keep the mapping of the original value to its hash. That mapping would be kept separate from the rest of the data, as good practice advises that this information, which constitutes PII of the data subject, should only be used with due justification and when consent is given by the data subject.

Additionally, in a later release we will add the capability to easily scrub PII from preexisting data on S3. (Removing PII from Redshift can currently be done as shown in this tutorial: GDPR: Deleting customer data from Redshift [tutorial].)
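As a rough sketch of "each value is replaced with its hash", assuming an enriched event represented as a plain Python dict: the `pseudonymize` function and the choice of fields are illustrative, not the enrichment's real configuration, and SHA-256 is used only because it was the algorithm mentioned earlier in the thread.

```python
import hashlib

# Hypothetical choice of fields to treat as PII for this illustration.
PII_FIELDS = ("user_id", "user_ipaddress")


def pseudonymize(event: dict, pii_fields=PII_FIELDS) -> dict:
    """Return a copy of the event with each configured field replaced by its hash."""
    out = dict(event)  # copy, so the caller's event is left untouched
    for field in pii_fields:
        value = out.get(field)
        if value is not None:
            out[field] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return out


event = {
    "user_id": "jim",
    "user_ipaddress": "192.168.255.1",
    "page_url": "https://example.com/",
}
hashed_event = pseudonymize(event)
# hashed_event carries hex digests in place of user_id and user_ipaddress;
# page_url is unchanged, and the original values are not kept anywhere in it.
```

Because the hash is deterministic, the same user still maps to the same pseudonym across events, so joins and counts on that field keep working.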


Thanks for this.

Apologies, I understand how hashing works; I was using ‘randomised’ as shorthand for “its hashed value”. I should have been clearer :slight_smile:

So is the hashed value sent over in the payload from the tracker? Or the original value?

I’d like to be conceptually clear in my mind about the process :+1:

Hey Jordan,

The original value is sent with the payload. The hashing happens in enrichment (so downstream of collecting).

@jrpeck1989 No worries. I just wanted to make sure I did not mislead anyone :slight_smile: The original value is sent from the tracker base64 encoded, and hashing takes place in the first component that contains any logic about the content (as opposed to handling its transmission). That is where the payload is decoded and where values are hashed, whether they were sent by the tracker or come from other enrichments (e.g. you could hash the location if you are using GeoIp lookup enrichment).
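The order of operations described above (tracker sends the original value base64 encoded, enrichment decodes and then hashes) can be sketched as follows. This is a simplified illustration: the JSON payload shape, the `encode_payload` and `enrich` names, and the `email` key are all hypothetical, and the real tracker protocol has more structure than this.

```python
import base64
import hashlib
import json


def encode_payload(data: dict) -> str:
    """Tracker side (sketch): serialise and base64-encode; nothing is hashed here."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def enrich(encoded: str, pii_keys: set) -> dict:
    """Enrichment side (sketch): decode first, then hash the configured keys."""
    decoded = json.loads(base64.urlsafe_b64decode(encoded))
    return {
        key: hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        if key in pii_keys
        else value
        for key, value in decoded.items()
    }


# The original value travels in the payload; base64 is an encoding, not protection.
tracker_payload = encode_payload({"email": "jim@example.com", "page": "/home"})
enriched = enrich(tracker_payload, pii_keys={"email"})
```

The point of the sketch is that anything upstream of enrichment, such as collector logs, still sees the original value, which matches the earlier remark about raw logs.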


Got it!

Thanks for clarifying.

As a firm outside of the EU and a site not targeting users in the EU, are there ways to apply pseudonymization or other GDPR features only to users who are based in the EU?

I’m thinking something along the lines of…
…with the Geolocation enrichment, there is an approximation of the country a user is in, IF the visitor is in one of the 28 member states THEN apply certain rules.
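Roughly, the rule being suggested might look like the Python sketch below. It is not an existing Snowplow feature, just the IF/THEN idea written out: `pseudonymize_if_eu` is a made-up name, the events are plain dicts, and it assumes the geolocation step has already filled a `geo_country` field with a two-letter country code. Events with no country are hashed as well, which is one possible (conservative) choice rather than a requirement.

```python
import hashlib

# ISO 3166-1 alpha-2 codes of the 28 EU member states at the time of this thread.
EU_28 = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR",
    "DE", "GR", "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL",
    "PL", "PT", "RO", "SK", "SI", "ES", "SE", "GB",
})


def pseudonymize_if_eu(event: dict,
                       pii_fields=("user_id", "user_ipaddress"),
                       country_field="geo_country") -> dict:
    """Hash PII fields only when the approximate country is in the EU (or unknown)."""
    out = dict(event)
    country = out.get(country_field)
    if country is not None and country.upper() not in EU_28:
        return out  # known non-EU country: leave the event as it is
    for field in pii_fields:
        if out.get(field) is not None:
            out[field] = hashlib.sha256(str(out[field]).encode("utf-8")).hexdigest()
    return out


german_visit = pseudonymize_if_eu({"user_id": "jim", "geo_country": "DE"})
us_visit = pseudonymize_if_eu({"user_id": "jim", "geo_country": "US"})
unknown_visit = pseudonymize_if_eu({"user_id": "jim"})
```

Since IP geolocation is only an approximation, a rule like this would misclassify some visitors (VPNs, travellers), so it is worth deciding up front which direction of error is acceptable.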

Hi @petervcook,

Thanks for the great suggestion - that’s something we thought of as well, and added this ticket:

Please do add any thoughts to that ticket on how this should all work!
