We’re discussing how to set up our Snowplow pipeline, and the idea came up to add a Kinesis stream in front of the Collector as a buffer, to make sure we’re not losing events if the collector is unavailable for a while. Our tracking happens from within a mobile app, so we don’t really have to reply to received events with a cookie. Is anyone doing this, or are there good reasons not to set it up this way? Any advice is appreciated!
Quite an interesting approach, but I can still see bottlenecks: you need something in front of the Kinesis stream to put data into it (so you literally need a collector for the collector), and I don’t see what you gain here. Moreover, you would need a Kinesis stream consumer to push data to the current collector. IMHO it does not make any sense. Of course, you could rebuild the tracker to push data directly into the raw Kinesis stream, but in that case you don’t need anything additional.
TBH I would go for an HA/HR collector setup (LB + autoscaling). Any data loss you observe would be statistically negligible.
I pretty much agree with @grzegorzewald.
The key is this part:
to make sure we’re not losing events in case of the collector being unavailable for a while
If you set up more than one collector, each in a different availability zone (at least two, and more AZs = more availability), plus a load balancer and autoscaling, then the chances of what you’re concerned about happening are negligible.
We’ve got hundreds of pipelines and have been running them for years, and AFAIK we haven’t once had a collector availability issue with this strategy.
To add to what @Colm said, a bigger worry is that Kinesis cannot scale up quickly enough in case traffic from the collector spikes. We’re currently experimenting with adding SQS as a buffer for overflowing traffic and hopefully will be able to address it in a forthcoming release.
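To make the overflow idea concrete, here is a minimal sketch of the general pattern: write to the primary stream and spill to a secondary buffer only when the primary pushes back. This is not the collector's actual implementation. `OverflowWriter`, `ThroughputExceeded` and the two sink callables are hypothetical names for illustration; in a real setup they would wrap the Kinesis and SQS clients.

```python
from typing import Callable


class ThroughputExceeded(Exception):
    """Hypothetical error raised when the primary stream rejects a write."""


class OverflowWriter:
    """Sends records to a primary sink and spills to a fallback when throttled."""

    def __init__(
        self,
        primary: Callable[[bytes], None],
        fallback: Callable[[bytes], None],
    ) -> None:
        self._primary = primary    # assumed to wrap a write to the main stream
        self._fallback = fallback  # assumed to wrap a write to the overflow queue
        self.spilled = 0           # number of records diverted to the fallback

    def write(self, record: bytes) -> str:
        """Write one record and report which sink accepted it."""
        try:
            self._primary(record)
            return "primary"
        except ThroughputExceeded:
            # The primary is saturated: divert rather than drop the record.
            self._fallback(record)
            self.spilled += 1
            return "fallback"
```

Records that land in the fallback still need a separate process to move them back into the main stream once it has capacity again, which this sketch leaves out.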
To add to what has been mentioned, adding a Kinesis stream will just increase the likelihood of a failure scenario, since it is one more component that can fail.
Depending on the tracker you’re using, it may already keep a local buffer; many trackers do. This is important because mobile devices frequently go offline and won’t be able to send events. Likewise, almost all the trackers will attempt to queue and resend events if the collector responds with a non-200 status.
In ~5 years I’ve only seen one instance where there was an increased risk of data loss, and this was due to API issues that impacted an entire region (and all services within it). To mitigate this you can run either in multiple regions or multi-cloud, but there are cost implications with doing so.
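For illustration, the queue-and-resend behaviour looks roughly like the sketch below. It is not taken from any particular tracker: `BufferedEmitter` and the `send` callable are hypothetical, and the "anything other than 200 means retry" rule simply mirrors the description above.

```python
from collections import deque
from typing import Callable, Deque, Dict

Event = Dict[str, str]


class BufferedEmitter:
    """Hypothetical client-side buffer: events stay queued until the collector confirms them."""

    def __init__(self, send: Callable[[Event], int], max_size: int = 1000) -> None:
        # `send` is assumed to return an HTTP status code and to raise
        # OSError (e.g. ConnectionError) when the device is offline.
        self._send = send
        # Bounded queue: when full, the oldest events are dropped first.
        self._queue: Deque[Event] = deque(maxlen=max_size)

    def track(self, event: Event) -> None:
        """Queue an event locally; nothing is sent until flush() is called."""
        self._queue.append(event)

    def flush(self) -> int:
        """Send queued events in order, stopping at the first failure.

        Returns the number of events the collector accepted.
        """
        sent = 0
        while self._queue:
            event = self._queue[0]  # peek, so a failed event stays queued
            try:
                status = self._send(event)
            except OSError:
                break  # offline: keep everything for the next attempt
            if status != 200:
                break  # collector unavailable or erroring: retry later
            self._queue.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        """Number of events still waiting to be delivered."""
        return len(self._queue)
```

Because an event is only removed after the collector acknowledges it, a short collector outage delays delivery rather than losing data, as long as the local buffer does not overflow in the meantime.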