We received this question from one of our users:
Do people tend to modify the sortkey or distkey of the atomic tables in Redshift if they decide to use something other than
Because the answer might help other users, we’re cross-posting it to Discourse.
It’s possible but we do not recommend it, for the following reasons:
1. You want joins between the different atomic tables to be fast
We recommend joining the atomic tables on both
event_id = root_id and collector_tstamp = root_tstamp (this prevents issues with duplicates - more information). Updating the SORTKEY in
atomic.events will slow down these joins.
2. It will increase the number of unsorted rows in atomic
Events are sorted on
collector_tstamp when we load them into Redshift. In almost all cases, events in a later run will all have a
collector_tstamp that is greater than those in all earlier runs. This prevents the table from becoming unsorted over time.
derived_tstamp, it’s possible that events in a later run will have a
derived_timestamp smaller than an earlier arriving event (e.g. if that event was cached for a long time). The more unsorted rows there are, the more often a vacuum needs to be run.
As standard all our table definitions have the distkey set to the
root_id to ensure an even distribution.
Changing this (e.g. to
domain_userid) will have two negative impacts:
1.Increasing data skew
Data skew will affect query performance. Amazon has an article on data skew including a query to find the skew value: Identifying Tables with Data Skew or Unsorted Rows. A skew value of 4.00 or higher is considered undesirable.
2. Slow joins
It will slow down joins between the events table and your other atomic tables because data will need to be shuffled between nodes.