Running Snowplow in Minimal Mode for GDPR


#1

Curious if anyone is running Snowplow in “minimal” mode with the JavaScript Tracker until the user consents for GDPR. We want to disable user tracking as much as possible until the user consents, but would still like to capture pageviews and other minimal data in the meantime. We were thinking maybe we could disable cookies in the Snowplow JS config until they consent.
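
For illustration, here is a minimal sketch of switching storage on and off by consent. It assumes a JavaScript Tracker version that supports the `stateStorageStrategy` option on `newTracker`; `trackerOptions`, the app id and the collector host are made-up names, so treat it as a starting point rather than a tested setup.

```typescript
// Storage strategies used in this sketch; check which values your tracker version accepts.
type StorageStrategy = "none" | "cookie";

interface TrackerOptions {
  appId: string;
  stateStorageStrategy: StorageStrategy;
}

// Hypothetical helper: picks tracker options based on the current consent state.
function trackerOptions(appId: string, hasConsent: boolean): TrackerOptions {
  return {
    appId,
    // Without consent, ask the tracker not to persist any state in the browser.
    stateStorageStrategy: hasConsent ? "cookie" : "none",
  };
}

// In the page this would be handed to the tracker, for example:
// window.snowplow("newTracker", "sp", "collector.example.com", trackerOptions("my-site", false));
// window.snowplow("trackPageView");
```

This only covers what the tracker stores in the browser. The collector still sees the request IP either way.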


#2

One element to consider here is how minimal you want to be. If the JavaScript Tracker is sending directly to the collector (before or after consent), you’ll be recording and processing the IP address.

I don’t think there’s an easy way around this at the moment other than recording certain events - if you can - server side so that no user identifying information is captured.
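
As a rough sketch of the server-side idea: the function below builds a page view request for the collector’s GET endpoint without any user identifiers. The parameter names (`e`, `url`, `page`, `p`, `aid`) follow the Snowplow tracker protocol; `pageViewUrl`, the collector host and the app id are assumptions for illustration.

```typescript
// Builds a page view request URL to be sent from server-side code.
// No user identifiers are included in the payload.
function pageViewUrl(
  collectorHost: string,
  appId: string,
  pageUrl: string,
  pageTitle: string
): string {
  const params: Array<[string, string]> = [
    ["e", "pv"], // event type: page view
    ["url", pageUrl],
    ["page", pageTitle],
    ["p", "srv"], // platform: server-side application
    ["aid", appId],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://${collectorHost}/i?${query}`;
}
```

Because the request originates from your own server, the collector records the server’s IP rather than the visitor’s.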


#3

good point. the IP will still end up in the collector and go through ETL.
I wonder if i can edit the NGINX config to strip the last octet of the IP by default.


#4

Hey @mjensen,

Are the IP anonymisation enrichment or the PII enrichment of any use here, or do you need to ensure that this data never hits raw logs?

These will anonymise at enrichment stage. The un-anonymised values are still available in the collector logs, but the strategy here is to use lifecycle rules to permanently delete raw logs after a week - just for use in case of a failure which requires reprocessing.
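
As a rough sketch of the lifecycle-rule part, assuming the raw collector logs land in an S3 bucket under a known prefix: the object below follows the shape of an S3 lifecycle configuration (`Rules`, `Filter`, `Expiration`), while `rawLogLifecycle`, the prefix and the rule id are hypothetical.

```typescript
interface LifecycleRule {
  ID: string;
  Status: "Enabled" | "Disabled";
  Filter: { Prefix: string };
  Expiration: { Days: number };
}

// Builds a lifecycle configuration that expires objects under `prefix` after `days` days.
function rawLogLifecycle(prefix: string, days: number): { Rules: LifecycleRule[] } {
  return {
    Rules: [
      {
        ID: "expire-raw-collector-logs", // hypothetical rule name
        Status: "Enabled",
        Filter: { Prefix: prefix },
        Expiration: { Days: days },
      },
    ],
  };
}

// Example: keep raw logs for one week only.
const weekLongRetention = rawLogLifecycle("raw/", 7);
```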

I’m aware this approach might not satisfy your needs if you need to ensure that the data is never collected at all - just wanted to share in case you weren’t aware.


#5

@mjensen Interestingly there are a few ways to mask the last octets of your IP address in the combined log. Here is one nginx module that can also hash: https://github.com/masonicboom/ipscrub

What I thought was interesting was the YAGNI section from that: if you’re doing all that, is it useful? Would the approach that @Colm mentioned not be good enough?

I would be interested to know your requirements if you can share them.


#6

yeah, i’m definitely looking at those as well

requirements are really to be GDPR compliant.
so up until user consents to cookies, we are thinking about disabling any tracking until then including Snowplow. i would hate to stop Snowplow completely but the domain_userid cookie is a perm cookie that shouldn’t be enabled in Snowplow until they consent. session cookies are iffy.
the biggest problem is we use Snowplow data for marketing attribution. for EU, we will lose this ability unless we rely only on 3rd party tracking pixels. but we have a lot of stuff written in-house that relies on Snowplow pageview data.


#7

Using nginx to strip or remove parts of the header with an IP address will work, but it introduces another issue: you need to be able to do this selectively, i.e., only run the anonymising functionality where you don’t have consent rather than anonymising all events. Depending on how this is performed, it’s going to need some degree of inspection on the nginx side to determine what should be anonymised. This is why I think it’s probably just easier to record the page views with a server side tracker.
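
To make the “inspection” step concrete, here is a minimal sketch of the decision a proxy in front of the collector would have to make per request. It assumes the site sets a first-party cookie `tracking_consent=granted` after opt-in and that the proxy can see it; the cookie name, its value and both function names are hypothetical.

```typescript
// Returns true only when the (hypothetical) consent cookie is present in the Cookie header.
function hasConsent(cookieHeader: string | undefined): boolean {
  if (!cookieHeader) return false;
  return cookieHeader
    .split(";")
    .map((part) => part.trim())
    .some((part) => part === "tracking_consent=granted");
}

// Forward the real client IP only when consent is present; otherwise send a placeholder.
function ipToForward(clientIp: string, cookieHeader: string | undefined): string {
  return hasConsent(cookieHeader) ? clientIp : "0.0.0.0";
}
```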

Google Analytics gets around this (with anonymizeIP) by performing this masking at the load balancer level before any storage or analytics processing takes place.

If you’re on the Scala Stream Collector and trying to be GDPR compliant (while still collecting non-identifying analytics information), writing the un-anonymised payload to disk (either S3 or Kinesis/Kafka/Pubsub) isn’t compliant.


#8

I am not a lawyer, however that is not how I understand the spirit or the letter of GDPR. PII may be temporarily present in memory and in logs until you determine that it should not be there (because that user has withdrawn consent), in which case you remove it. In a world of dynamic IP addresses, you are not able to determine that by the IP alone and to me it sounds perfectly reasonable that the IP remains in a temporary log (or in memory) until you can make that determination. As long as it is temporary storage, be it disk or memory and only for as long as it is necessary it seems legitimate to me. As an aside, there are cases where the law allows you to keep PII even if the user has not given consent yet (see https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/lawful-basis-for-processing/contract/ )


#9

I agree that in-memory storage is one thing but writing out personal data (which may include but isn’t restricted to PII) to a more permanent storage mechanism like disk means that you are now storing data without seeking any consent from the end user and therefore no opportunity to opt out. IP address is considered a personal identifier under GDPR (Recital 30) so it falls under this remit.

As you’ve mentioned, there are exceptions to this if you need the information for legal purposes or a “legitimate business use case”, but these are legally far more difficult to make if the information is being collected for analytics purposes, and it remains unclear if some of these exceptions will remain acceptable under ePrivacy.


#10

TL;DR: Minimal mode for the tracker seems like a good idea to explore to me. The server would need to know the consent status of the tracker and hash the IP if no consent is given, and there may be other issues.

Hey @mike,

yes I get the strictly no PII argument, however (and this is by no means advice) it seems to me that unprocessed logs, temporarily stored until a determination is made, do not constitute such a violation. Rather, they are a consequence of a system design that requires raw data to be stored somewhere until it is processed. As long as you throw away the PII of a data subject who has requested it (and obviously the raw logs too) at the first opportunity during processing, that respects the data subject’s wishes in good faith.

As for the IP of a user, while that is understandably considered PII, when we are talking about consent you cannot know whether the data subject that has consented still has the same IP. Only when you have identified the event as pertaining to a non-consenting user can you be sure that you should throw it away (as a consenting user may now be using that IP).

In any case, all of these are technicalities that at some point some, hopefully technically competent, judge will make a determination on, but my reading of GDPR is that it is intended, understandably, to give back some level of control to the data subject when it comes to PII, and that processors should make good faith efforts to grant them that. It is not meant to harm companies by making them unable to analyse the company’s operational data or contact their customers.

Given that, I believe that dropping all PII, and not just the IP, at the point that it is known that the event pertains to a non-consenting data subject is the right thing to do. The only way I can imagine that being done at the collector would be to have a great monolithic system there that knows everything. In Snowplow the collector knows nothing about the data except that it seems legitimate.
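
A minimal sketch of what dropping PII at processing time could look like, assuming the consent status has already been resolved by the time the event is handled. The field names come from the Snowplow enriched event model, but the list is illustrative rather than exhaustive, and `scrubEvent` is a hypothetical helper.

```typescript
// A few identifying fields found on Snowplow enriched events (not a complete list).
const PII_FIELDS = [
  "user_ipaddress",
  "user_id",
  "domain_userid",
  "network_userid",
  "user_fingerprint",
] as const;

type EnrichedEvent = Record<string, string | null>;

// Returns the event untouched for consenting users, otherwise a copy with PII fields nulled.
function scrubEvent(event: EnrichedEvent, consented: boolean): EnrichedEvent {
  if (consented) return event;
  const scrubbed: EnrichedEvent = { ...event };
  for (const field of PII_FIELDS) {
    if (field in scrubbed) scrubbed[field] = null;
  }
  return scrubbed;
}
```

The raw logs behind the event would still need to be removed separately, as noted above.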

The tracker may know that the user has not consented in that platform used (e.g. browser A) but it would not know that for all platforms (e.g. browser B, phone, IoT, what have you) so it seems incomplete, and the determination would still need to be made at the processor end as to whether this is a non-consenting user, rather than the client end.

Still a given processor may have only one service delivery platform and/or by default make the decision to not collect almost any data until the user has explicitly consented, in which case the client-side approach probably makes sense.

At this point @alex or @yali may have a much better idea about what is possible and any pitfalls.


#11

This is a really interesting discussion!

I think a lot of it boils down to the interpretation of ‘legitimate business use case’, and I can imagine that the boundaries of this term will be pushed hard in a lot of cases, at least until GDPR starts to be enforced publicly.

Having said that, IMO there’s a strong argument to make that maintaining the ability to process your data and recover from failure is a legitimate use case, and therefore keeping raw logs for a short period is fine. Restricting access to this data, and keeping the period of retention short are good ideas though.

Not a lawyer, obviously. Just my ill informed opinion :slight_smile:


#12

thanks everyone. very helpful thread :slight_smile:

i updated nginx config:

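# first part: keep the first three IPv4 octets or the first two IPv6 groups
map $remote_addr $ip_anonym1 {
 default 0.0.0;
 "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" $ip;
 "~(?P<ip>[^:]+:[^:]+):" $ip;
}

# second part: the zeroed suffix (".0" for IPv4, "::" for IPv6)
map $remote_addr $ip_anonym2 {
 default .0;
 "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" .0;
 "~(?P<ip>[^:]+:[^:]+):" ::;
}

# join both parts into the anonymised address
map $ip_anonym1$ip_anonym2 $ip_anonymized {
 default 0.0.0.0;
 "~(?P<ip>.*)" $ip;
}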

    # note: $http_x_forwarded_for is logged as received, so it can still contain a full client IP
    log_format  main  '$ip_anonymized - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    include       conf.d/*.conf;

and then nginx proxies the requests to the collector. works fine so far:

[root@ip-X-X-X-X elasticbeanstalk]# more 00_application.conf 
location / {
    proxy_pass          http://127.0.0.1:5000;
    proxy_http_version  1.1;

    proxy_set_header    Connection          $connection_upgrade;
    proxy_set_header    Upgrade             $http_upgrade;
    proxy_set_header    Host                $host;
    proxy_set_header    X-Real-IP           $ip_anonymized;
    proxy_set_header    X-Forwarded-For     $ip_anonymized;
}
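
For anyone who wants to check what the map blocks above produce for a given address, here is a rough TypeScript equivalent. It is a sketch only: `anonymiseIp` is a made-up name, and the patterns are anchored here whereas the nginx ones are not, so unusual inputs (for example IPv4-mapped IPv6 addresses) may come out differently.

```typescript
// Keeps the first three IPv4 octets or the first two IPv6 groups and zeroes the rest.
function anonymiseIp(ip: string): string {
  const v4 = /^(\d+\.\d+\.\d+)\.\d+$/.exec(ip);
  if (v4) return `${v4[1]}.0`;

  const v6 = /^([^:]+:[^:]+):/.exec(ip);
  if (v6) return `${v6[1]}::`;

  // Same result the nginx defaults combine to when nothing matches.
  return "0.0.0.0";
}
```

For example, `anonymiseIp("192.168.1.25")` gives `192.168.1.0` and `anonymiseIp("2001:db8:85a3::8a2e:370:7334")` gives `2001:db8::`.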

#13

I think this depends on what determination is being made. If somebody uses consent as their legal basis for storing personal data and then stores or processes this data prior to obtaining consent, that’s a violation (Recital 42). If a user isn’t provided with an opportunity to consent to the processing (e.g., default opt-in or no opt-in option), this isn’t considered to meet the conditions for consent (Recital 32). It’s worth noting that GDPR has a broad definition of “processing” (Article 4) which includes the collection of data.

Absolutely and this is the approach some other vendors take (i.e., Google anonymises IP before collection and Adobe shortly after) - but ideally this should happen before the data is ever processed which would mean at or before the collector.