Dealing with technical and economic limitations


#1

Hello snowplowers,
I would like to ask the following questions:

  • What is the size in bytes of one atomic event record in the Redshift database (with all enrichments)?
  • What is the largest number of rows you have seen in atomic.events with everything still working properly?
  • How do you deal with the situation when you can’t afford to keep all atomic data in Redshift?
  • What is the maximum number of events per day that you have seen running properly? How many collectors were in that setup, and how often was emr-etl-runner running?

The reason I ask is that I would like to store 40M+ events a day. I just finished the testing setup: the atomic.events table had 24649 rows and 526 MB, and after one more test run it has 48949 rows and 528 MB. That is 2 MB of growth (3 MB with +1 MB for safety) for 24300 new rows, so 3 MB ÷ (48949 − 24649) ≈ 0.12 kB per event. At 40M events a day that would translate to roughly 4 to 5 GB a day, or about 1.5 to 1.8 TB a year, which is kind of OK.
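
For anyone who wants to redo this estimate with their own numbers, here is a minimal Python sketch of the arithmetic above. It assumes decimal units (1 MB = 1,000,000 bytes) and a 365-day year; the function name `estimate_storage` and its parameters are made up for illustration and are not part of any Snowplow or Redshift tooling. Note that without rounding down to 0.1 kB, the per-event figure and the yearly total come out a bit higher.

```python
MB = 1_000_000  # assumption: decimal megabytes


def estimate_storage(rows_before, mb_before, rows_after, mb_after,
                     events_per_day, safety_mb=1):
    """Estimate storage from two (row count, table size) samples.

    Returns (bytes per event, GB per day, TB per year).
    """
    added_rows = rows_after - rows_before
    # Observed growth in table size plus a safety margin, in bytes.
    added_bytes = (mb_after - mb_before + safety_mb) * MB
    per_event = added_bytes / added_rows
    daily_gb = per_event * events_per_day / 1e9
    yearly_tb = daily_gb * 365 / 1000
    return per_event, daily_gb, yearly_tb


per_event, daily_gb, yearly_tb = estimate_storage(
    rows_before=24_649, mb_before=526,
    rows_after=48_949, mb_after=528,
    events_per_day=40_000_000,
)
print(f"{per_event:.0f} bytes/event, {daily_gb:.1f} GB/day, {yearly_tb:.1f} TB/year")
# prints: 123 bytes/event, 4.9 GB/day, 1.8 TB/year
```

This is only a linear extrapolation from two small samples, so treat the result as a rough order of magnitude rather than a sizing guarantee.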

Thank you for your help! 🙂
Cheers,
Filip


#2

In response to your 3rd point, one direction worth exploring is Redshift Spectrum. You can keep your raw event data in S3 and query it ad hoc when needed, which avoids the (much higher) Redshift storage costs, and keep only the modelled data in Redshift for ‘hot’ access.