Redshift Indexing

Hi,

I am aiming to speed up query performance in Redshift. One way I aim to do that is to do indexing. I wonder what column to pick to do indexing? I read around snowplow’s github issue page and one way to do this in Postgres is to use collector_tstamp and dvce_created_tstamp to index.

Any more suggestion beside those?

EDIT: After further research about redshift, found out from this reading that there is no actual ‘indexing’ in redshift and that setting up DISTKEY and SORTKEY is what considered as setting index. We already have collector_tstamp and event_id used as key. I would like to edit my question as is there any more distkey/sortkey that can be used for indexing?

Snowplow sets reasonable sensible defaults for atomic.events and the context tables in Redshift (collector_tstamp / root_tstamp as SORTKEY and root_id / event_id as DISTKEY). You can change these from the defaults but in general you’d only do so if you were optimising for a certain kind of query pattern or patterns that differed dramatically from the Snowplow defaults (e.g., potentially considering a compound sort key with collector_tstamp and app_id for example).

Thanks @mike for the advice. I guess I just want a key that generally can be used to improve overall general performance. We are not yet at the stage to optimise for a particular pattern yet.

In that case I would stick with the Snowplow defaults for the moment.

1 Like