Controlling the order in which enrichments are run

Hi

Is it possible to control the order in which the Snowplow EMR ETL runs the custom enrichments?

We have a custom JavaScript enrichment which flags IP addresses, indicating whether they’re from our office traffic or the real world. When I enable IP anonymisation this enrichment can no longer tag the IPs correctly. I’m assuming this is because the IP address the JS enrichment sees is already anonymised, so the exact matching we use no longer works; we have some IP range matching that still does.
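For illustration, here’s a minimal sketch of what a range-matching version of such a JS enrichment could look like. Snowplow’s JavaScript enrichment calls a `process(event)` function on each enriched event and attaches whatever self-describing JSONs it returns as derived contexts; the office ranges and the schema URI below are made up.

```javascript
// Minimal sketch of the office-traffic enrichment with octet-prefix range
// matching instead of exact matching. Snowplow calls process(event) on each
// event; the ranges and schema URI here are illustrative only.
var officeRanges = [
  { base: '203.0.113.0', octets: 3 },  // hypothetical office /24
  { base: '198.51.100.0', octets: 3 }
];

// True if the first `octets` octets of ip match those of base. Matching on
// the leading octets keeps working even when IP anonymisation has masked the
// trailing ones (e.g. '203.0.113.x'), which is where exact matching breaks.
function inRange(ip, base, octets) {
  var ipParts = ip.split('.');
  var baseParts = base.split('.');
  for (var i = 0; i < octets; i++) {
    if (ipParts[i] !== baseParts[i]) {
      return false;
    }
  }
  return true;
}

function process(event) {
  var ip = event.getUser_ipaddress();
  if (!ip) {
    return [];
  }
  var isOffice = false;
  for (var i = 0; i < officeRanges.length; i++) {
    if (inRange(ip, officeRanges[i].base, officeRanges[i].octets)) {
      isOffice = true;
      break;
    }
  }
  // Attach the flag as a derived context (hypothetical schema)
  return [{
    schema: 'iglu:com.example/traffic_source/jsonschema/1-0-0',
    data: { officeTraffic: isOffice }
  }];
}
```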

Thanks
Gareth

Hi @gareth - that’s a nice feature request; the ticket for it is here:

Unfortunately it’s a lot of re-architecting to deliver this, so it’s not imminent.


Hi @alex

Thanks, that’s a useful ticket. With GDPR on the horizon we’d like to scrub our Snowplow events of personal information like IP addresses, so that downstream processing escapes the regulations. However, we do need to derive some information from the IP address before it’s anonymised, and unfortunately we won’t be able to do that with the current Snowplow tooling.

Thanks
Gareth

We’re doing a lot of work on GDPR - I’m not sure that scrubbing IPs from the enriched events happens early enough in the process for GDPR adherence, because those IP addresses will still exist in the raw collector logs in S3.

We have a ticket to add support for IP scrubbing in the event collector itself, although that would rule out your use case of enriching on the real IP first. The ticket is here:

@yali may be able to share more here.

Yes, that could be very useful. Do you have plans to still support enrichments like the GeoIP lookup you currently have?

Our current plan is to delete the source data (CloudFront logs at present) once they’ve been processed, and also the output of the EMR ETL once we’ve done our first post-processing. The aim is to minimise the exposure of the data.
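One way we could approximate the “delete once processed” part is a time-based S3 lifecycle rule, assuming the raw logs land under a known prefix and are always processed well inside the window. The prefix and retention period below are placeholders:

```json
{
  "Rules": [
    {
      "ID": "expire-raw-cloudfront-logs",
      "Prefix": "raw/",
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

Lifecycle expiry is time-based rather than triggered by a successful ETL run, though, so deleting immediately after processing would instead need a step in the pipeline itself.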

We’re very interested to see what you’re doing with respect to GDPR, partly because no one really knows what best practice is yet. It’s nice to see others thinking along similar lines.

Hi @gareth - some thoughts (more structured blog post to follow):

  1. We want to make it easier to capture “consent” as an event. This should make it easy for anyone working with the data to query an individual user’s data directly and understand what is and is not permissible to do with it. (So the consent lives as part of the data it governs - a tracker sketch follows this list.)
  2. There may be opportunities to get users to self-identify, if that means that data controllers can more effectively guarantee their rights under GDPR.
  3. We want to be able to pseudonymize any field that might contain personal data. This would include IP addresses, but it could also include cookie IDs and user-defined fields in specific self-describing events or custom contexts.
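To make (1) concrete, consent could be captured like any other self-describing event, e.g. from the JavaScript tracker along these lines; the schema URI and its fields are purely illustrative:

```javascript
// Hypothetical consent event from the Snowplow JavaScript tracker (v2
// syntax); the com.example schema and its fields are made up.
window.snowplow('trackSelfDescribingEvent', {
  schema: 'iglu:com.example/consent_given/jsonschema/1-0-0',
  data: {
    consentScope: 'analytics',       // what processing the user agreed to
    expiry: '2019-05-25T00:00:00Z'   // when the consent lapses
  }
});
```

Because it’s just an event, the consent record flows through the same pipeline as everything else, which is what keeps it living alongside the data it governs.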

Pseudonymization is really powerful because it means you can still collect the data for analytical purposes; you just can’t tie it back to the user to e.g. personalize their experience with it. So ideally, we’d have an enrichment that:

  1. Lets you specify which fields to pseudonymize
  2. Has some logic to determine which events to run on. (So you only pseudonymize where you don’t have consent, for example.) A rough sketch follows this list.
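Here is how (1) and (2) could fit together, written as plain Node.js rather than as a real enrichment; the field list, the hasConsent flag and the salt handling are all assumptions:

```javascript
// Rough sketch of the proposed pseudonymization logic in plain Node.js.
// None of these names are a real Snowplow API.
var crypto = require('crypto');

// (1) Specify which fields to pseudonymize
var FIELDS_TO_PSEUDONYMIZE = ['user_ipaddress', 'domain_userid'];
var SECRET_SALT = process.env.PSEUDO_SALT || 'change-me';

// Keyed one-way hash: stable enough to count distinct users, but not
// reversible to recover the original value
function pseudonymize(value) {
  return crypto.createHmac('sha256', SECRET_SALT)
    .update(String(value))
    .digest('hex');
}

// (2) Only run on events where we don't have consent. How consent is
// represented on the event is exactly the open question; a boolean flag
// stands in for it here.
function pseudonymizeEvent(event) {
  if (event.hasConsent) {
    return event;
  }
  FIELDS_TO_PSEUDONYMIZE.forEach(function (field) {
    if (event[field]) {
      event[field] = pseudonymize(event[field]);
    }
  });
  return event;
}

// Example: an event without consent has its IP and user ID hashed
console.log(pseudonymizeEvent({
  user_ipaddress: '203.0.113.7',
  domain_userid: 'abc-123',
  hasConsent: false
}));
```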

Ideally this would happen upstream of writing out the collector logs, ensuring that where you don’t have consent you never store personally identifiable data. However, it’s pretty hard to deliver that level of functionality on the event without first processing it, so your suggestion of deleting the raw collector logs has its appeal. You’d have to be careful with any bad rows as well.

I need to do some more thinking on how to meet GDPR obligations while keeping some of the robustness that comes with being able to reprocess the event stream from scratch and recover bad events safely. Any ideas from the community are appreciated!


Thanks @yali, I’m looking forward to the blog post.

Consent is an interesting one for us, as we integrate into other people’s websites, so it will require some negotiation with the host retailer over how to ask for consent.

We had thought about the bad rows too; they’re more difficult because you can’t know for sure what’s in the error message. The log lines themselves could be processed with the standard log line pseudonymisation code.

We’re planning on deleting the bad rows after a short period of time, to give ourselves a chance to reprocess them. The error logs are typically small in volume, and I believe there is scope within the GDPR for keeping this type of data temporarily.