Removing an event from storage


Quick question for everyone!

We’ve got a custom event that we’ve been inadvertently collecting millions of data points on, filling our servers, S3, and Redshift.

The obvious first response is “what’s the harm in keeping the data? S3 is cheap!”. To that I say: “We know, but we want it gone anyway for one reason or another.” :slight_smile: We did consider that though, so the idea isn’t lost on me. We’ve made a calculated business decision to remove the data anyway.

So my questions are:

  1. Can we simply truncate the shredded rows in the database tables they are separated into?
  2. Can we simply delete the lines for the related events in the S3 EMR archived logs? Will they still be able to be enriched at a later date? (Or should we clear out the bulky event meta and keep the event rows?)
  3. Is there any cleanup that needs to be done on the collectors to remove this data? (I assume not as these are dumped into S3 on the hour)
  4. Are there any other places we should be removing this extra data to ‘slim’ down our storage of these events?

Thanks in advance!


Hi @bgd1229

  1. You should be able to truncate the shredded tables that these contexts populate without any issue. This makes the most sense as a first option, as Redshift disk storage isn’t particularly cheap. If you’ve got millions of rows in the contexts tables, you’ll also end up with a few million corresponding rows in the parent events table that are now ‘orphaned’ from their custom context. If you remove these, you’ll probably want to avoid a DELETE followed by a VACUUM if the table is large; a deep copy may be the better option, provided you have enough free disk space for it.

  2. You could delete these lines but I’d say going through raw S3 logs and deleting lines for rows that have already been enriched/inserted into Redshift is more effort than it’s worth.

  3. If you’ve stopped sending these contexts through the collector then there shouldn’t be anything you need to do here.

  4. Possibly removing the corresponding rows in the parent events table, as these will no longer join to the contexts.
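
To make the deep copy from point 1 a bit more concrete, here is a rough sketch that only builds the SQL statements for it (create a copy with `LIKE`, insert the rows you want to keep, drop the original, rename the copy). The schema, table, and column names are placeholders I made up for illustration, and the `_new` suffix is just a convention; check the statements against your own tables before running anything.

```python
def deep_copy_statements(schema, table, keep_condition):
    """Build the SQL for a Redshift-style deep copy that keeps only
    the rows matching keep_condition.

    All names passed in are hypothetical placeholders; nothing is
    executed here, the statements are only returned as strings.
    """
    new_table = f"{table}_new"  # temporary name for the copy (assumed convention)
    return [
        # 1. Empty copy with the same column definitions as the original
        f"CREATE TABLE {schema}.{new_table} (LIKE {schema}.{table});",
        # 2. Copy across only the rows that should survive
        f"INSERT INTO {schema}.{new_table} "
        f"(SELECT * FROM {schema}.{table} WHERE {keep_condition});",
        # 3. Drop the original table, which frees its disk space
        f"DROP TABLE {schema}.{table};",
        # 4. Rename the copy (the new name is not schema-qualified)
        f"ALTER TABLE {schema}.{new_table} RENAME TO {table};",
    ]


if __name__ == "__main__":
    # Placeholder schema, table, and filter; substitute your own.
    for stmt in deep_copy_statements(
        "my_schema", "my_events", "event_name <> 'unwanted_event'"
    ):
        print(stmt)
```

Because the copy is written out fresh, there is nothing left to VACUUM afterwards, but you do need room for both the original and the copy until the original is dropped.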

I wouldn’t worry too much about the data that’s sitting in S3 at the moment - if the data is costing you a bit to host on S3 you might want to consider moving it to S3 reduced redundancy or Glacier (cheaper to store data long term, increased cost associated with fetching this data in the future).
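
If you do go the Glacier route, a lifecycle rule on the bucket is one way to do it. The sketch below only builds the rule as a Python dict; the prefix, the number of days, and the rule ID are all made-up values, so treat it as a starting point rather than a recommended configuration.

```python
def glacier_lifecycle(prefix, days):
    """Build an S3 lifecycle configuration that transitions objects
    under `prefix` to Glacier after `days` days.

    The prefix, day count, and rule ID are hypothetical examples.
    """
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",      # arbitrary label for the rule
                "Filter": {"Prefix": prefix},    # only objects under this prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder prefix and age; adjust to your own archive layout.
    print(glacier_lifecycle("archive/raw/", 90))
```

A dict in this shape can be passed as `LifecycleConfiguration` to boto3's `put_bucket_lifecycle_configuration`, along with your own bucket name.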