Unstructured Events and the events table


#1

We are currently tracking a very large amount of unstructured events, and they do not require a second record in the atomic.events table.

Is there a best practice to address not tracking certain events in atomic.events, just their specific table?

Second question: has anyone altered the schema of atomic.events (dropping unnecessary columns)

Third question: atomic.events is not easily vacuum-able. There is no sort key on timestamp, so trying to vacuum it is taking way too long.


#2

@dbuscaglia,

It is not possible (in the pipeline itself). The events table is the parent to all the self-describing (unstruct) events and contexts. You might run data model post data load to remove the records from events for the self-describing events in question. We have a dedicated SQL Runner to assist in running the SQL scripts for you.

The events table has a predefined structure. Any interference with it would cause pipeline failure. You are free to create any structure for your own custom data but not the atomic one.

Sometimes, we do change the events table with a pipeline upgrade. However, it is still predefined and is not to be custom amended.

Does your events table support ZSTD compression? If not, we advise to migrate it to support it. It might reduce your disk usage by up to ~60%. Also, during the migration you would run deep copy which will fix your fragmented data and with the following regular VACUUM you should have no long-running job occurring.

The migration scripts are here: https://github.com/snowplow/snowplow/tree/master/4-storage/redshift-storage/sql.


#3

Thanks for the reply. We currently already have the events table fully compressed. The events table has so many fields that are just not used. It is a massive load on our storage requirements in redshift.


#4

I think Ihor has nicely captured everything but I’d like to pick up one additional point:

There is no sort key on timestamp, so trying to vacuum it is taking way too long.

I might be wrong here but feels like something’s amiss - there should be a sortkey on collector_tstamp (and a distkey on event_id). I can see where you’d run into lots of problems if there weren’t.


#5

Hey @dbuscaglia,

I had missed your last reply:

Thanks for the reply. We currently already have the events table fully compressed. The events table has so many fields that are just not used. It is a massive load on our storage requirements in redshift.

So on this point - Redshift storage costs are contingent on the cluster size you use - and empty columns don’t take up much space (about 1MB if there’s no data in the column as far as I understand). It’s likely that these empty columns aren’t actually a significant cost driver.

The volume of data you keep in Redshift is likely to be the main cost driver. There are strategies you can take to reduce this cost. Two common options are:

Archive from Redshift - A record of all the Enriched data which has been loaded to Redshift should be in S3. You only need to keep atomic data in Redshift for as long as you’ll need to recompute aggregations and data models over it. So to keep costs down you can:

  • Set up and run a data model which aggregates the data as per your requirements, and outputs to a set of derived tables.
  • Optionally set up a job to archive the atomic data to S3
  • Set up a job on a schedule to delete all data older than a certain timeframe from atomic tables

You’d need to take care to keep as much recent data to ensure you can address any issues in aggregation - if you need to recompute over it you’d need to reprocess/reload the data first, which might be a pain. Usually a year is enough, some people only keep a few months.

Consume data from S3 - some users choose to use Athena to query their data directly from S3, and shut off Redshift load completely, or use S3 for complicated granular analysis/data science and use Redshift for reporting while following the above archive strategy.

Aside from cost management strategies, if you’re tracking a lot there’s a cost to storing that data. But usually the idea is you’re deriving value that outweighs the cost of doing business - that’s contingent on how effectively you’re using the data. If that’s not the case then the other option is to rethink your tracking strategy and stop tracking the things that aren’t valuable.


#6

Colm has raised some really good points above. If there’s columns that you don’t need it’s easy enough to create a data model on top of these to only select the columns that you do need.

Redshift columns can potentially take up a little bit of space depending on how they are defined / compressed but you should be able to get a rough estimate of what that looks like (in megabytes) by running the following query

SELECT
  TRIM(name) as table_name,
  TRIM(pg_attribute.attname) AS column_name,
  COUNT(1) AS size
FROM
  svv_diskusage JOIN pg_attribute ON
    svv_diskusage.col = pg_attribute.attnum-1 AND
    svv_diskusage.tbl = pg_attribute.attrelid
WHERE
  table_name = 'events'
GROUP BY 1, 2
ORDER BY 3 DESC