Batch versus real-time: comparing infrastructure costs


I was hoping someone with experience running both the Snowplow batch and real-time pipelines could chime in on the difference in infrastructure costs. Roughly how much more did it cost you to run the real-time pipeline instead of the batch pipeline (2x? 5x? 10x?)? We are on the batch pipeline right now but want to understand what the cost might look like if we decide to go real-time later.



The biggest difference is that with real-time there’s a bunch of stuff
you’ll be running all the time, not just the collectors. So you need to
calculate how many Kinesis shards you’ll need, multiplied by the number of
steps in the pipeline that write to a stream. On top of that come the
instances for the different processing steps, which also run around the
clock. That’s where the big costs come in.
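
To make that calculation concrete, here is a rough sketch in Python. It is not a Snowplow tool, just back-of-the-envelope arithmetic. The function names, the number of streams, and both hourly prices are placeholders I made up for illustration, so plug in current AWS pricing for your region. The only fixed fact it relies on is the Kinesis write limit of about 1,000 records/s or 1 MB/s per shard. It ignores per-request (PUT payload) charges and data transfer.

```python
import math

HOURS_PER_MONTH = 730  # average number of hours in a month


def shards_needed(peak_events_per_sec, avg_event_kb):
    """Shards required for one stream at peak write load.

    A shard accepts roughly 1,000 records/s or 1 MB/s of writes,
    whichever limit is reached first.
    """
    by_records = peak_events_per_sec / 1000
    by_bytes = peak_events_per_sec * avg_event_kb / 1024
    return max(1, math.ceil(max(by_records, by_bytes)))


def monthly_realtime_cost(peak_events_per_sec, avg_event_kb, streams,
                          shard_hour_price, always_on_instances,
                          instance_hour_price):
    """Approximate monthly cost of the always-on parts of the pipeline.

    All prices are inputs (assumptions), not real AWS figures.
    """
    total_shards = shards_needed(peak_events_per_sec, avg_event_kb) * streams
    shard_cost = total_shards * shard_hour_price * HOURS_PER_MONTH
    instance_cost = always_on_instances * instance_hour_price * HOURS_PER_MONTH
    return shard_cost + instance_cost


# Example with made-up numbers: 2,500 events/s at peak, 1 KB each,
# 2 streams, 3 always-on instances.
estimate = monthly_realtime_cost(2500, 1, streams=2, shard_hour_price=0.02,
                                 always_on_instances=3, instance_hour_price=0.10)
print(round(estimate, 2))
```

Comparing that figure with what your batch EMR runs cost per month gives you your own multiplier, which is probably more useful than someone else's 2x or 10x.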

Batch is ridiculously cheap, especially if you use spot instances for extra
task nodes in the ETL. I wrote a simple script that calculated roughly how
many task nodes were needed to process the batch sitting in the incoming
bucket, so the cluster scaled up only when needed.
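
For anyone wanting to try the same thing, a sizing calculation along these lines might look like the sketch below. This is not the actual script, only a guess at its shape. The function name, the per-node throughput figure, and the node limits are all hypothetical, so you would have to measure throughput on your own EMR jobs. Getting the pending byte count (for example by summing object sizes in the incoming bucket) is left out.

```python
import math


def task_nodes_needed(pending_bytes, gb_per_node_hour, target_hours,
                      min_nodes=0, max_nodes=20):
    """Estimate how many extra task nodes to request for one batch run.

    pending_bytes    -- total size of the files waiting in the incoming bucket
    gb_per_node_hour -- measured throughput of one task node (assumption)
    target_hours     -- how long the run is allowed to take
    """
    pending_gb = pending_bytes / 1024 ** 3
    # Nodes required to finish the pending data within the target time.
    raw = math.ceil(pending_gb / (gb_per_node_hour * target_hours))
    # Clamp so a huge backlog cannot request an unbounded cluster.
    return max(min_nodes, min(max_nodes, raw))


# Example with made-up numbers: 50 GB waiting, 5 GB per node per hour,
# and a 2-hour target.
print(task_nodes_needed(50 * 1024 ** 3, gb_per_node_hour=5, target_hours=2))
```

The result would then be passed as the task instance count when launching the EMR job, with those task nodes requested as spot instances.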


How did you calculate the number of task nodes per file count? Mind sharing the rough numbers you used? Very cool idea, and a great cost saver using spot instances.