Cascade FOREIGN KEYS in Redshift

Hi there,

I would like to understand why the foreign keys in Redshift are not set to cascade?

When i delete events, I would like all relating data to be deleted as well.

Enrico

Foreign keys (along with some other constraints) aren’t enforced by Redshift.

Thanks. We reprocessed some logs and have to deal with duplicated events now.

Are there any good ways of cleaning this up? I was under the impression events that already exists will not be imported again.

Enrico

The guide that @christophe has written here is probably the most comprehensive docs on deduplication (both pre EMR and post EMR).

Further to @mike’s points - there is no CASCADE in Redshift constraints:

http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

Hi I also want to ask about this and to see if there is any workaround?

My use case is different than estahn as I am just trying to delete events that have been loaded after a certain period of time. I think it would make sense to use cascade function so I can just delete some rows in atomic.events table and let the event_id that corresponds to that deletion cascade to all the context tables (since these context tables use root_id to refer the event_id)

My plan for this is:

  • delete rows in events table based on etl_tstamp
  • delete rows in all context tables based on root_tstamp

It will be tedious though especially if we have a lot of context tables.

@aditya there’s a difference between those tiestamps. root_tstamp corresponds to collector_tstamp. etl_tstamp is the time that enrich processed the record.

The best option here is to get all the event_ids from atomic.events for the timeframe you want to delete, and then delete any entries in all tables with those event_ids. You could write a script to get all table names for the atomic schema from PG_TABLE_DEF and plug those into your queries.

Using timestamps is simpler, so if all you care about is expiring data that’s more than x date old, then use the collector_tstamp for atomic and the root_tsamp for all other tables.

Best,

Gotcha. I process records daily so thats why I wanted to delete records processed on certain day.

But you are right, we dont have etl_tstamp on context table. I ended up filtering using collector tstamp and root tstamp for events and context table.