A lot of our data is sent into our database raw, with multiple lines per entity. For example, when a user signs up, we send in a record (user ID, registration date, email, etc.), and subsequently, whenever the user’s account changes (they log in, update their information, etc.), we send in another row, which accumulates in one long table. In the data modelling stage we then ‘flatten’ the table down to the most recent set of information per entity, as ordered by root_tstamp.
Physically, we achieve this mostly with the LAST_VALUE window function in Redshift, partitioned by the entity (user_id in this case). We build a prep table using a WITH statement (a Common Table Expression, I believe, is the technical term…) and then select from that and group to one row per entity identifier (for example, user ID).
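For context, here’s a stripped-down sketch of the pattern (the table and column names — atomic.users_raw, email, plan — are made up for illustration; our real tables are much wider):

```sql
-- Hypothetical raw table: atomic.users_raw (user_id, email, plan, root_tstamp).
-- One LAST_VALUE per column, framed over the whole partition so that every
-- row in the partition carries that user's most recent value.
WITH prep AS (
    SELECT
        user_id,
        LAST_VALUE(email) OVER (
            PARTITION BY user_id
            ORDER BY root_tstamp
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS email,
        LAST_VALUE(plan) OVER (
            PARTITION BY user_id
            ORDER BY root_tstamp
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS plan
    FROM atomic.users_raw
)
-- Collapse the duplicated rows down to one per user. Every selected column
-- ends up in the GROUP BY, which is what gets unwieldy on wide tables.
SELECT
    user_id,
    email,
    plan
FROM prep
GROUP BY
    user_id,
    email,
    plan;
```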
This has been working well for us; however, when the raw table is wide (lots of columns of information), the query becomes very long and difficult to follow. We’ve also come up against strange assert errors in Redshift when using lots of these functions in a single query, which stops the query from running at all. And with the long GROUP BY statements needed to aggregate this information, the risk of an accidental Cartesian product grows as well.
Therefore, is there a best practice when it comes to aggregating, or ‘flattening’, a table so that it returns only one line per entity, with the most recent information? Are window functions the best way to go about this, or are there other grouping and aggregating functions in Redshift better suited to the task?
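For concreteness, the kind of alternative I’m imagining would be something like deduplicating with a single ROW_NUMBER instead of one LAST_VALUE per column (again sketched against the made-up table above; I haven’t tested whether this behaves better at scale):

```sql
-- Rank each user's rows from newest to oldest, then keep only the newest.
-- No per-column window functions and no long GROUP BY list.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY root_tstamp DESC
        ) AS rn
    FROM atomic.users_raw
)
SELECT
    user_id,
    email,
    plan,
    root_tstamp
FROM ranked
WHERE rn = 1;
```

One caveat I can see: this takes the entire latest row per user, whereas per-column LAST_VALUE (with IGNORE NULLS) could pick the latest non-null value per column, so the two aren’t always equivalent if rows are sparsely populated.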
A long question, I know - so any help is hugely appreciated!