GDPR: Deleting customer data from Snowflake [tutorial]


This tutorial is a follow-up to our guide to deleting customer data from Redshift. It is meant to help Snowplow users who use Snowflake as a storage target comply with the GDPR rules coming into effect later this year. Under GDPR, data subjects have the right to “be forgotten”. This means that an individual can request that any data about them be removed from all the data stores that a company uses.

Assumptions

  • A request has been made to delete all data belonging to a specific user. We’ll be using the user_id as the identifier in this tutorial, but the same concepts can be applied to other fields (e.g. domain_userid, user_ipaddress or any other field that can be used to identify someone).
  • The business runs a data model which is solely derived and recomputed from the atomic data daily. This means that once the customer data is removed from the atomic data in Snowflake, the modeled tables will also be cleared upon recomputation. Incremental data models require some further thought, which is beyond the scope of this tutorial.
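
Since the first assumption mentions that other fields can identify the same person, a query along the following lines may help find them before deleting anything. This is only a sketch: it assumes the atomic.events table and the 'Data Subject' placeholder used throughout this tutorial, and whether a given value (a shared IP address, for example) really identifies the data subject is a judgment you will need to make yourself.

```sql
-- List the other identifiers recorded alongside this user_id.
-- 'Data Subject' is a placeholder for the real user_id value.
SELECT DISTINCT
  domain_userid,
  user_ipaddress
FROM
  atomic.events
WHERE
  user_id = 'Data Subject';
```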

Deleting data from Snowflake

Deleting customer data from Snowflake is much simpler than deleting it from Redshift because the atomic data is contained within a single table.

1. Check what data will be deleted

Before actually deleting the data, it’s always worth doing a sanity check:

SELECT
  COUNT(*),
  MIN(collector_tstamp),
  MAX(collector_tstamp)
FROM
  atomic.events
WHERE
  user_id = 'Data Subject';

If the results make sense then we’re good to continue!

2. Delete the events

We can go ahead and delete the data:

DELETE FROM
  atomic.events
WHERE
  user_id = 'Data Subject';
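
If you prefer a safety net, a more cautious variant is to run the delete inside an explicit transaction and check the result before committing. The following is a minimal sketch, assuming the same table and placeholder value as above; the remaining_events alias is just an illustrative name.

```sql
-- Start an explicit transaction so the delete can still be rolled back.
BEGIN;

DELETE FROM
  atomic.events
WHERE
  user_id = 'Data Subject';

-- Should return 0 if every matching row was removed.
SELECT
  COUNT(*) AS remaining_events
FROM
  atomic.events
WHERE
  user_id = 'Data Subject';

-- Run ROLLBACK instead if the number of deleted rows looks wrong.
COMMIT;
```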

Time Travel and Fail-safe

Snowflake has two powerful features that allow deleted data to be queried and/or restored after it’s been removed from a table.

Time Travel

Time Travel enables accessing historical data (i.e. data that has been changed or deleted) at any point within a defined period.
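
To illustrate, the deleted events can typically still be counted by querying the table as it was at an earlier point in time with the AT clause. This is a sketch only: the one-hour offset is an arbitrary assumption, and the query will fail if that point in time falls outside the table's retention period.

```sql
-- Query atomic.events as it was one hour ago (the offset is in seconds).
SELECT
  COUNT(*)
FROM
  atomic.events AT(OFFSET => -60 * 60)
WHERE
  user_id = 'Data Subject';
```

A non-zero count here means the data subject's events are still reachable through Time Travel, even though they no longer appear in the current table.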

Time Travel is enabled automatically for all Snowflake accounts, with a standard retention period of 1 day (24 hours). The period can be configured, depending on the account edition:

  1. Standard Edition accounts can change the period to 0 (effectively disabling Time Travel)
  2. Enterprise Edition accounts can change the period to between 0 and 90 days

This means that, depending on your account settings, deleted data may still be accessible to you for up to 90 days after removing it from the atomic.events table. There is no way to set the data retention period for just these rows to a value different from the one for the rest of the atomic.events table.
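
To see which retention period currently applies, and to change it for the table as a whole, something like the following should work. This is a sketch, not a recommendation: the value of 1 day is purely illustrative, the change affects every row in atomic.events, and the range of values you can set depends on your account edition.

```sql
-- Show the Time Travel retention period that applies to the table.
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE atomic.events;

-- Change the retention period for the whole table (illustrative value).
ALTER TABLE atomic.events SET DATA_RETENTION_TIME_IN_DAYS = 1;
```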

So if you are using Time Travel with a longer window, make sure the data subject is aware that their data will be deleted with some delay. According to GDPR, your obligation is to “erase personal data without undue delay”.

Fail-safe

Separate and distinct from Time Travel, Fail-safe provides a (non-configurable) 7-day period during which historical data is recoverable by Snowflake.

This period starts immediately after the Time Travel retention period ends. In that period, you cannot access the data, but Snowflake can.