We are currently facing the following issue. We want to enable personalized tracking in our project, which means we will store a userId in our tracking data, of course only if the user has given us explicit consent to do so. In order to be GDPR-compliant, we need to guarantee the following two things:
- takeout (the user can get all the data we have collected about him or her)
- deletion (delete all the data associated with this specific userId)
To 1.): This should be straightforward in Redshift.
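For illustration, a minimal takeout sketch, assuming the events end up in an `atomic.events` table with the consented identifier in a `user_id` column (names are just placeholders, adjust to your schema) and a standard Postgres driver:

```python
import csv
import psycopg2  # Redshift speaks the Postgres wire protocol

# Hypothetical connection details; replace with your cluster's endpoint.
conn = psycopg2.connect(
    host="our-cluster.example.eu-central-1.redshift.amazonaws.com",
    port=5439, dbname="snowplow", user="gdpr_service", password="...",
)

def export_user_data(user_id: str, out_path: str) -> None:
    """Write every event row stored for this user_id to a CSV for takeout."""
    with conn.cursor() as cur, open(out_path, "w", newline="") as f:
        cur.execute("SELECT * FROM atomic.events WHERE user_id = %s", (user_id,))
        writer = csv.writer(f)
        writer.writerow([col.name for col in cur.description])
        writer.writerows(cur)
```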
To 2.): This should also be straightforward in Redshift (see GDPR: Deleting customer data from Redshift [tutorial]). However, we also store this userId in the raw data along the pipeline, and deleting it from Redshift does not seem to be enough, because the user data could easily be restored from the raw data.
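The Redshift side of the deletion would look something like this (same hypothetical connection and schema as above); the raw data in S3 is the part we are unsure about:

```python
def delete_user_data(user_id: str) -> None:
    """Remove all event rows associated with this user_id from Redshift."""
    with conn.cursor() as cur:
        # Shredded/child tables referencing these events would presumably
        # need the same treatment, e.g. via their root_id references.
        cur.execute("DELETE FROM atomic.events WHERE user_id = %s", (user_id,))
    conn.commit()
```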
Before the data is loaded into Redshift, it passes through the following buckets:
loader-target-bucket → enriched-bucket → shredded-bucket
(The first is the target bucket for the S3 loader; the last two are used in the shredding process.)
Currently we want to keep the enriched-bucket as the single source of truth for our raw data; data in the other buckets (loader and shredded) could easily be removed anyway.
Is there an efficient way to crawl through that bucket and delete data belonging to a given userId? The brute-force approach (sketched below) will become terribly time-consuming and inefficient as the amount of data grows. Is anyone else facing a similar issue?
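For context, this is roughly what the brute-force approach looks like, and why it scales with the total size of the bucket rather than with the amount of data per user (a sketch assuming gzipped TSV enriched files and a hypothetical bucket name, using boto3):

```python
import gzip
import boto3

s3 = boto3.client("s3")
BUCKET = "enriched-bucket"  # hypothetical name for the enriched archive

def delete_user_from_bucket(user_id: str) -> None:
    """Naive approach: rewrite every enriched file without the user's rows.

    Every object in the bucket has to be downloaded, filtered and re-uploaded,
    so each deletion request costs O(total data volume), not O(user's data).
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            lines = gzip.decompress(body).decode("utf-8").splitlines(keepends=True)
            # Keep only rows where no TSV field matches the userId.
            kept = [line for line in lines
                    if user_id not in line.rstrip("\n").split("\t")]
            if len(kept) != len(lines):
                s3.put_object(
                    Bucket=BUCKET,
                    Key=obj["Key"],
                    Body=gzip.compress("".join(kept).encode("utf-8")),
                )
```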