How many events per day is maintainable for atomic.events?


#1

I realize this is a question without a concrete answer, but any general guideline or anecdotal evidence from others who are ingesting large volumes of atomic.events into Redshift is greatly appreciated.

There was some doubt internally here that pulling all events into Redshift as atomic.events was a good idea, specifically whether we could still run complex, useful SQL queries over such a large table. Is there a certain order of magnitude where it no longer makes sense to keep all the atomic events in Redshift? 100K events per day, 1M events per day, 10M events, 100M, etc…?

If you do need aggregate tables, do most people write EMR jobs to generate them from the logs, or do you generate the aggregate tables directly in Redshift with SQL queries (e.g., SELECT INTO)?


#2

This is a tricky question and really comes down to a few factors.

  1. How many events are you sending, or do you anticipate sending, per day? I’ve worked with Snowplow installations that collect between 10 and 50 million events per day, and that hasn’t been an issue. There are definitely installations out there that collect far more than that.

  2. What queries do you want to run? Running more complex ad hoc queries over a large dataset in Redshift may not work particularly well. That said, this depends on the complexity and on how much data you’re actually querying: is it going to be 30 days of history, 60 days, or all time? If you’ve got queries that you need to run on a regular basis (e.g., for reporting), it’s often easy to create a new table from events (or a derived version of events), either from scratch or by applying only the difference when updating. If you’re only running queries of a certain category or type, it may make sense to have a table that only includes a subset of events, e.g., only page_views.

  3. What hardware is Redshift running on, and how much can you spend?
    Dense compute (SSD) nodes will give you much better query performance than dense storage nodes, but if you’re collecting a large volume of data they will cost you more in the long run. (If you’re committed to using Redshift, though, and you can estimate your event volume, reserving nodes makes it quite a bit cheaper.)
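To make point 2 concrete, a derived table can be built directly in Redshift with plain SQL. This is only a sketch: it uses a few real atomic.events columns (event, derived_tstamp, domain_userid, page_urlpath), but the derived schema and page_views table name are hypothetical, and your column choices will differ.

```sql
-- Sketch: a narrow table containing only page views,
-- built once from atomic.events ("from scratch").
CREATE TABLE derived.page_views AS
SELECT
  domain_userid,
  derived_tstamp,
  page_urlpath
FROM atomic.events
WHERE event = 'page_view';

-- Incremental update ("using the difference"): append only
-- rows newer than what the derived table already holds.
INSERT INTO derived.page_views
SELECT
  domain_userid,
  derived_tstamp,
  page_urlpath
FROM atomic.events
WHERE event = 'page_view'
  AND derived_tstamp > (SELECT MAX(derived_tstamp) FROM derived.page_views);
```

Queries for regular reporting then hit the small derived.page_views table instead of scanning all of atomic.events.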

That doesn’t answer your question directly but I hope that helps a bit!


#3

Just to add to @mike’s great answer:

> do most people write EMR jobs to generate them from the logs

Yes, that’s definitely a sensible direction of travel for large event volumes / complicated data modeling.
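As a toy illustration of that direction of travel, an aggregate-building job over enriched event logs is essentially a filter-group-count. The sketch below is plain Python over an in-memory TSV snippet, not a real EMR job, and the three-column row layout (event, timestamp, path) is a simplified assumption, not the actual Snowplow enriched-event format.

```python
import csv
import io
from collections import Counter

def daily_page_view_counts(tsv_lines):
    """Count page_view events per (date, path) from simplified
    enriched-event rows laid out as: event<TAB>timestamp<TAB>path."""
    counts = Counter()
    for event, tstamp, path in csv.reader(tsv_lines, delimiter="\t"):
        if event == "page_view":
            day = tstamp.split(" ")[0]  # keep only the date part
            counts[(day, path)] += 1
    return counts

# Tiny in-memory "log" standing in for enriched events on S3
log = io.StringIO(
    "page_view\t2016-05-01 10:00:00\t/home\n"
    "page_view\t2016-05-01 11:30:00\t/home\n"
    "link_click\t2016-05-01 11:31:00\t/home\n"
    "page_view\t2016-05-02 09:00:00\t/pricing\n"
)
print(daily_page_view_counts(log))
```

A real job would run the same logic in Spark or Cascading on EMR and write the aggregate back out for loading into Redshift.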

In the next Snowplow release but one (R81), we are going to add support for running arbitrary EMR jobsteps (think Spark, Cascading, or Flink) after enrichment has run, while the cluster is still up. Then we want to make it easy for the outputs of those steps to be loaded into Redshift…