I had missed your last reply:
Thanks for the reply. We currently already have the events table fully compressed. The events table has so many fields that are just not used. It is a massive load on our storage requirements in redshift.
On this point: Redshift storage costs depend on the cluster size you use, and empty columns don’t take up much space (about 1MB per column if there’s no data in it, as far as I understand). It’s unlikely that these empty columns are a significant cost driver.
The volume of data you keep in Redshift is likely to be the main cost driver. There are strategies you can take to reduce this cost. Two common options are:
Archive from Redshift - A record of all the Enriched data which has been loaded to Redshift should be in S3. You only need to keep atomic data in Redshift for as long as you’ll need to recompute aggregations and data models over it. So to keep costs down you can:
- Set up and run a data model which aggregates the data as per your requirements, and outputs to a set of derived tables.
- Optionally set up a job to archive the atomic data to S3
- Set up a job on a schedule to delete all data older than a certain timeframe from atomic tables
You’d need to take care to keep enough recent data to ensure you can address any issues in aggregation - if you need to recompute over data that has already been deleted, you’d need to reprocess/reload it first, which might be a pain. Usually a year is enough; some people only keep a few months.
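To make the optional archive step and the scheduled delete step a bit more concrete, here is a rough sketch of how the two SQL statements could be generated. Treat it as an illustration under assumptions rather than a recommended implementation: the function names are made up for this example, `atomic.events` and `collector_tstamp` are assumed table/column names, and the S3 prefix and IAM role are placeholders. The cutoff is computed once and embedded as a literal, so the archive and the delete cover exactly the same rows.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(now, retention_days):
    """Return the timestamp before which rows are considered archivable."""
    return now - timedelta(days=retention_days)


def build_archive_statements(table, ts_column, cutoff, s3_prefix, iam_role):
    """Return (unload_sql, delete_sql) for rows older than `cutoff`.

    All arguments are assumed to be trusted values from your own config;
    this sketch does no SQL escaping beyond what UNLOAD itself requires.
    """
    cutoff_literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    predicate = f"{ts_column} < '{cutoff_literal}'"

    # UNLOAD takes its query as a quoted string, so single quotes
    # inside the query have to be doubled.
    inner_query = f"SELECT * FROM {table} WHERE {predicate}".replace("'", "''")
    unload_sql = (
        f"UNLOAD ('{inner_query}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET;"
    )

    # Run the delete only after the unload has succeeded.
    delete_sql = f"DELETE FROM {table} WHERE {predicate};"
    return unload_sql, delete_sql


if __name__ == "__main__":
    cutoff = retention_cutoff(datetime.now(timezone.utc), retention_days=365)
    unload_sql, delete_sql = build_archive_statements(
        table="atomic.events",                   # assumed table name
        ts_column="collector_tstamp",            # assumed timestamp column
        cutoff=cutoff,
        s3_prefix="s3://example-bucket/archive/events/",          # placeholder
        iam_role="arn:aws:iam::123456789012:role/example-unload",  # placeholder
    )
    print(unload_sql)
    print(delete_sql)
```

You would run the generated statements from whatever scheduler you already use. If the enriched data in S3 is already your archive, you can skip the UNLOAD and only run the delete.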
Consume data from S3 - some users choose to use Athena to query their data directly from S3 and shut off the Redshift load completely. Others use S3 for complicated granular analysis/data science and Redshift for reporting, while following the archive strategy above.
Aside from cost management strategies, if you’re tracking a lot, there’s a cost to storing that data. Usually the idea is that you’re deriving value that outweighs that cost, which is contingent on how effectively you’re using the data. If that’s not the case, then the other option is to rethink your tracking strategy and stop tracking the things that aren’t valuable.