I was going to say something similar for larger setups. The primary Snowplow pipeline I work on sees 230-260M events/day (running at ~350k events/min for sustained periods), and @christoph-buente’s breakdown is almost exactly what I’ve accounted for.
For smaller pipelines I’ve found that Kinesis is the most expensive component, due to its shard-hour pricing model.
Shards cost $0.36/day each in us-east-1. Running six one-shard streams (collector good/bad, enricher good/bad, enricher pii, s3 sink bad) immediately puts you at ~$70/month. If you want to scale the primary streams (collector good, enricher good) up a couple of shards each, you’ll pay ~$100/month for Kinesis alone.
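The shard arithmetic above can be sketched in a few lines. This is a rough estimate only: it uses the $0.36/day/shard figure quoted here, assumes a flat 30-day month, and ignores Kinesis per-request (PUT payload) charges. The stream names and the `kinesis_monthly_cost` helper are illustrative, not part of any Snowplow or AWS tooling.

```python
# Shard price quoted in the post for us-east-1 (USD per shard per day).
SHARD_COST_PER_DAY = 0.36
# Assumption: a flat 30-day month.
DAYS_PER_MONTH = 30


def kinesis_monthly_cost(shards_per_stream):
    """Monthly shard cost in USD for a {stream_name: shard_count} mapping."""
    total_shards = sum(shards_per_stream.values())
    return total_shards * SHARD_COST_PER_DAY * DAYS_PER_MONTH


# Six one-shard streams; the names are illustrative labels only.
baseline = {
    "collector_good": 1,
    "collector_bad": 1,
    "enricher_good": 1,
    "enricher_bad": 1,
    "enricher_pii": 1,
    "s3_sink_bad": 1,
}

# Hypothetical scale-up: two extra shards on each primary stream.
scaled = {**baseline, "collector_good": 3, "enricher_good": 3}

print(f"baseline: ${kinesis_monthly_cost(baseline):.2f}/month")
print(f"scaled:   ${kinesis_monthly_cost(scaled):.2f}/month")
```

Under these assumptions the baseline comes to about $65/month and the scaled layout to about $108/month, in line with the rounded ~$70 and ~$100 figures.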
Application Load Balancers are billed on load balancer hours (the time the ALB is running) and LCU-hours. An LCU (“load balancer capacity unit”) is measured on four dimensions: new connections, active connections, processed bytes, and rule evaluations.
You’ll pay ~$20/month in us-east-1 just to keep the load balancer up, and not much beyond that until your traffic really starts heating up. $10/month in LCU charges is an overestimate for smaller installations (millions of events per month), but once you run a lot of traffic (hundreds of millions of events per month) through an ALB, LCU charges grow into a real part of the equation.
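To make the two-part ALB bill concrete, here is a minimal sketch. The hourly rates and per-LCU thresholds below are assumptions recalled from the AWS pricing page for us-east-1 and are not figures from this thread, so verify them before relying on the output. The key point it illustrates is that you are billed on whichever of the four dimensions consumes the most LCUs, not on their sum. The function names and the sample `usage` numbers are hypothetical.

```python
# Assumed us-east-1 rates (not from the post; check the current AWS price list).
ALB_HOURLY_USD = 0.0225
LCU_HOURLY_USD = 0.008
HOURS_PER_MONTH = 730  # assumption: average hours in a month

# Assumed amount of each dimension that one LCU covers.
LCU_CAPACITY = {
    "new_connections_per_sec": 25,
    "active_connections_per_min": 3000,
    "processed_gb_per_hour": 1.0,
    "rule_evaluations_per_sec": 1000,
}


def lcus_consumed(usage):
    """LCUs billed for an hour: the maximum across the four dimensions."""
    return max(usage[dim] / LCU_CAPACITY[dim] for dim in LCU_CAPACITY)


def alb_monthly_cost(usage):
    """Return (fixed, variable) monthly cost in USD for steady average usage."""
    fixed = ALB_HOURLY_USD * HOURS_PER_MONTH
    variable = lcus_consumed(usage) * LCU_HOURLY_USD * HOURS_PER_MONTH
    return fixed, variable


# Hypothetical steady-state traffic profile, for illustration only.
usage = {
    "new_connections_per_sec": 50,
    "active_connections_per_min": 1500,
    "processed_gb_per_hour": 0.5,
    "rule_evaluations_per_sec": 100,
}

fixed, variable = alb_monthly_cost(usage)
print(f"fixed: ${fixed:.2f}/month, LCU: ${variable:.2f}/month")
```

In this made-up profile new connections dominate (2 LCUs), so the other three dimensions do not add to the bill.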
Compute cost varies with the reliability/redundancy/risk profile you want, and with whether you’re using on-demand or reserved resources. To keep it simple:
Running three on-demand t3.small collector nodes costs ~$50/month in us-east-1 based on $0.0208/hr pricing.
Running three on-demand t3.small enricher nodes costs the same, while running a single on-demand m5.large enricher node costs ~$75.
These costs can be drastically reduced by switching to reserved instances, and by building to your risk profile and nothing more.
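The instance figures above are simple hourly-rate arithmetic, sketched below. The t3.small rate is the one quoted here; the m5.large rate is an assumption (it is not stated in this post), as are the 730-hour month and the `discount` parameter, which stands in for whatever a reserved-instance commitment would actually save you. `ec2_monthly_cost` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month

ON_DEMAND_HOURLY_USD = {
    "t3.small": 0.0208,  # quoted in the post for us-east-1
    "m5.large": 0.096,   # assumption, not from the post; check current pricing
}


def ec2_monthly_cost(instance_type, count, discount=0.0):
    """Monthly cost in USD; `discount` is a fraction in [0, 1) off on-demand."""
    hourly = ON_DEMAND_HOURLY_USD[instance_type]
    return hourly * count * HOURS_PER_MONTH * (1.0 - discount)


print(f"3 x t3.small: ${ec2_monthly_cost('t3.small', 3):.2f}/month")
print(f"1 x m5.large: ${ec2_monthly_cost('m5.large', 1):.2f}/month")

# Hypothetical 30% reserved discount, purely to show the effect.
print(f"3 x t3.small, reserved: ${ec2_monthly_cost('t3.small', 3, 0.30):.2f}/month")
```

With these inputs, three t3.small nodes come to roughly $46/month and one m5.large to roughly $70/month, close to the rounded figures above.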
While storage in S3 is cheap, this data definitely piles up fast. Pricing here is all over the place, and mostly depends on tracking/site volume. For low-traffic, high-margin companies this is barely even factored into the equation.
Warehouse cost varies, and it all depends on how you want to access events. A single-node Redshift dc1 is cheap; a Snowflake data warehouse is (usually) pretty pricey.
A very rough approximation I’ve found to work pretty well for small-to-medium-volume sites is $200 per month, pipeline infra only. With this being said, you can pay $50/month if you’re a thrill-seeker and $5000+/month if your site has a lot of traffic/event volume. Again, pipeline infra only.
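As a sanity check, summing the rounded per-component figures quoted earlier in this post lands close to that $200 number. A small sketch, with S3 and the data warehouse left out because no dollar figures are given for them, and with the dictionary keys being illustrative labels only:

```python
# Rounded monthly figures (USD) quoted earlier in the post, us-east-1.
components = {
    "kinesis_six_one_shard_streams": 70,
    "alb_base_charge": 20,
    "collectors_3x_t3_small": 50,
    "enrichers_3x_t3_small": 50,
}

total = sum(components.values())
print(f"pipeline infra, small setup: ~${total}/month")
```

This is only one possible small configuration; swapping in reserved instances or scaling shards moves the total in either direction.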
There are definitely ways to make this more efficient: ECS or ASGs are great for cutting costs if your traffic profile is spiky or if you just want to have a cool system. If you know the system will be up long term, reserving resources drastically cuts cost. If you don’t need everything in S3, you can merge objects and roll to Glacier, etc.
I’ve intentionally left out monitoring/instrumentation infra and engineering costs.
I’ve also intentionally left out all costs (explicit or implicit) associated with navigating points of scale, and knowing what to do when things happen.