To give you a bit of context, the principal design priorities for the product that are relevant to this discussion are completeness and reliability.
A Kinesis stream between the collector and enricher, for example, ensures that the collector is self-contained, and there is minimal risk of data loss as long as the collector is up. In this respect, where there’s a trade-off between cost and reliability, the design favours reliability.
Within that design, we do optimise to keep operational costs down. For the pipelines we run as part of the Snowplow Insights product (for those unfamiliar: we run the infrastructure in the customer’s cloud), we have proprietary tech that we’ve built to manage scaling Kinesis (and other components), so we don’t always have to over-provision resources.
Having said all that, in our experience, even with that cost trade-off, the cost to run doesn’t normally come to a very high number. There’s a minimum provisioning, which means that below a certain volume it’s expensive per event; 200-300k events is just below that minimum scale. Doing some back-of-the-envelope maths, I would expect Kinesis costs to fall somewhere near the hundred dollar mark for 4 Kinesis streams (2 bad, 1 good, 1 raw). There is scope to bring this down if you choose retention periods of less than 7 days (which is generally well above how long you’d realistically need to ensure ‘safe’ recovery from issues, especially if you have S3 sinks).
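A rough sketch of that back-of-the-envelope maths, for anyone who wants to plug in their own numbers. Everything in it is an assumption rather than a quote: a monthly figure, one shard per stream, 730 hours in a month, and illustrative rates of $0.015 per shard-hour plus $0.02 per shard-hour for retention extended up to 7 days. Per-record PUT charges are left out because they should be small at this volume. Check current AWS pricing for your region before relying on any of this.

```python
# Assumption: average number of hours in a month.
HOURS_PER_MONTH = 730


def kinesis_monthly_cost(
    streams,
    shards_per_stream=1,                      # assumption: minimum provisioning
    shard_hour_usd=0.015,                     # assumed illustrative rate
    extended_retention_shard_hour_usd=0.02,   # assumed illustrative rate
    extended_retention=True,                  # True = retention beyond 24h, up to 7 days
):
    """Estimate the monthly cost of provisioned Kinesis streams in USD."""
    shard_hours = streams * shards_per_stream * HOURS_PER_MONTH
    cost = shard_hours * shard_hour_usd
    if extended_retention:
        # Extended retention is billed per shard-hour on top of the base rate.
        cost += shard_hours * extended_retention_shard_hour_usd
    return cost


# 4 streams: 2 bad, 1 good, 1 raw.
with_7_days = kinesis_monthly_cost(streams=4)
with_24_hours = kinesis_monthly_cost(streams=4, extended_retention=False)
print(f"7-day retention:  ${with_7_days:.2f}")
print(f"24-hour retention: ${with_24_hours:.2f}")
```

Under those assumed rates this comes to roughly $102 with 7-day retention and roughly $44 with the default 24-hour retention, which is where the saving from a shorter retention period comes from.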
I believe the GCP pipeline can work out cheaper to run than AWS at lower volumes, because Pub/Sub is natively flexible, so that minimum provisioning/over-provisioning problem disappears. I’m at risk of spending a lot of time on this comment, so forgive me for not pulling the numbers on that one.
Apologies for the essay. We have had a lot of recent activity on Discourse and across other forums from people who are just getting started with Snowplow, so I’m conscious of an audience that may not have all the context.
TL;DR: The direct answer to your question is that no, we don’t plan on changing the design to reduce the number of Kinesis streams. But we do actively work on an ongoing basis to reduce the cost to run where possible.
I hope that’s helpful.