Using Snowplow as an event hub


#1

Hi guys,

Here at GetNinjas we currently use Snowplow to track client events and some transactional events, e.g. when a user pays an invoice, our app generates an event through Snowplow just to help our BI team.

Nowadays, we do a lot of things when a user pays an invoice, e.g. giving them credits on our platform, sending them a receipt, etc. These actions are triggered by an event that comes from AWS SQS: our app sends an event through SQS, and a daemon consumes these messages and performs all the actions described above.

We are using SQS because it’s reliable and we trust that the event will land in the right destination, basically because AWS ensures that for us.

For us, the events sent through SQS could perfectly well be sent using Snowplow; in other words, we are planning to use Snowplow as our event hub for transactional events. It may not be the best decision, though, because of Snowplow characteristics such as the buffer.

Moreover, we are planning to run Snowplow on Docker, and given how containers behave (someone can shut one down and start another), we are afraid we may lose a few events when we suffer some kind of outage in our clusters.

Is our worry reasonable? Can we safely use Snowplow as an event hub solution? Is there anything wrong with our architecture? Do you have any tips?

Best,


#2

Your concerns aren’t specific to Docker; all kinds of Snowplow deployment have a risk of event loss when a collector fails. If you are using Snowplow as an event bus rather than solely for analytics, then the cost of that risk is higher, but the principles are the same.

The solution is: make sure you don’t have any outages :slight_smile: That requires the same approach as for any high-availability and fault-tolerant system: detailed failure mode analysis and planning, multiple redundant deployments, automated failover, intelligent client retry strategies (i.e. in your trackers) and so on.

In Snowplow’s case the critical process is the ingestion step; everything downstream can be retried, but if you don’t capture the events in the first place they are lost forever.

A couple of Snowplow-specific tips:

  • Run the CloudFront collector as a last resort; you can use Route53 DNS health checking to fail over to CloudFront if none of your primary collectors are healthy.
  • Applications in different regions can post to the same Kinesis stream, so you can run collectors in multiple regions with the same sink. Again, you can use Route53 DNS to fail over to another region when the primary deployment is unhealthy.
  • Make sure your trackers retry when a request fails. Some trackers (JS) do this already; others don’t yet, so depending on which you use, this may require some custom engineering.
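
For trackers that don’t retry yet, the custom engineering could look roughly like the sketch below. It is only an illustration, not the API of any Snowplow tracker: `send_with_retry`, the `send` callable and the default values are all made-up names and numbers, and the exponential backoff is just one possible retry policy.

```python
import time
from typing import Any, Callable, Dict


def send_with_retry(
    send: Callable[[Dict[str, Any]], bool],
    event: Dict[str, Any],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one event, backing off exponentially between attempts.

    `send` is a hypothetical callable that posts the event to a collector and
    returns True on success. It is an assumption, not a real tracker method.
    """
    for attempt in range(max_attempts):
        try:
            if send(event):
                return True
        except OSError:
            # Network-level failure: treat it like a failed request and retry.
            pass
        if attempt < max_attempts - 1:
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    # The caller still has to decide what to do with the event (e.g. persist it).
    return False
```

If `send_with_retry` returns False, the event has still not been delivered, so the application would need to store it somewhere durable and try again later.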

We have a similar use case so I’d be interested to hear additions to this list from other Snowplowers too.


#3

Is it possible to have the collector emit some external confirmation after the cache flush? I can imagine a flow where messages are marked as “processed” by the collector, so that later I can detect the lost ones and queue them again.


#4

What cache flush in the collector are you referring to here? Do you mean after the events are flushed to Kinesis/another sink?


#5

Yep. Exactly.


#6

Hi @ll

Sorry for hijacking the thread, but reading it brought a related question to mind:

How are the flushed events handled if the Kinesis stream / Kafka topic that receives the collected events is down?

Best


#8

@spatialy

The records which have failed sinking to Kinesis (e.g. because Kinesis was down) will be retried according to a linear backoff starting with a minimum backoff and limited by a maximum backoff.
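
To make the idea concrete, here is a small sketch of what a linear backoff bounded by a minimum and a maximum could look like. It is one reading of the description above, not the collector’s actual code: the function name, the choice of the minimum backoff as the step size and the example values are all assumptions.

```python
from typing import List


def linear_backoff_delays(min_backoff_ms: int, max_backoff_ms: int, attempts: int) -> List[int]:
    """Return the wait before each retry, growing linearly and capped at the maximum.

    Assumption: the delay grows by `min_backoff_ms` on every retry.
    """
    return [min(min_backoff_ms * (i + 1), max_backoff_ms) for i in range(attempts)]


# Example with made-up values: start at 3s, never wait longer than 10s.
print(linear_backoff_delays(3000, 10000, 5))  # [3000, 6000, 9000, 10000, 10000]
```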

There is no such mechanism for Kafka, however. Instead, you will be able to leverage Kafka’s number of retries starting with R93 (cf. issue 3367).


#9

Thanks @BenFradet.

@spatialy - I think it’s important to stress that high availability for your Kinesis streams / Kafka topics is an essential part of running a robust Snowplow pipeline; you can’t rely on these retry mechanisms for sustained outages.

This kind of monitoring and scaling is an important part of what we deliver with Snowplow Insights.


#10

Thanks @BenFradet and @alex for the guidance.

We run both pipelines (Kafka and Kinesis) and are 100% with you on the HA of these parts, and on the point that the mitigation mechanisms are only meant for brief network outages.

best


#11

I am less worried about losses related to Kafka/Kinesis HA and more worried about outages in the Snowplow collector.

I’m trying to figure out a way not to lose events when a collector machine goes down. What about the events that exist only in the collector buffer?


#12

Hi @alanjds,

In our experience, events that do not reach the collector are stored in the client browser’s local storage and are sent in bulk once the tracker connects successfully again on a later event, such as the client revisiting our pages (the intended behavior as per the docs).

For the events in the collector buffer, we lost the data. A mitigation approach is to find a balance between the parameters that control the buffer; the right balance depends heavily on the impact that data loss has on your business.


#13

Yes, but what if I cannot afford to lose events once they have hit the collector?

As I understand it, right now it is not possible to “not lose” events inside the collector.

If I could hook in some post-flush code, I could hack together a “one or many” delivery scheme instead of the current “one or none”.

My question is: does the current collector offer such a feature?
The answer seems to be: No, it does not.

Am I right?


#14

Just thinking out loud here, but it could be interesting to have a two-track system, where:

  • Low-materiality events (e.g. page pings, ad impressions) are handled as they are today (no-ack, cached)
  • High-materiality events (e.g. ecommerce transactions) are handled in a cacheless, transactional way - meaning that the connection to the tracker stays open until the event has been confirmed as being dispatched, and is acked back to the tracker with 200

Sounds a bit complicated. Remember you can have thousands of tracker instances live, maybe 2-10 collector instances, and no server affinity. I don’t see how you get the right collector acks back to the right trackers, asynchronously/post-cache flush, in a performant way.
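
Leaving the ack-routing problem aside, the two-track idea itself can be illustrated with a toy sketch. Everything in it is hypothetical (the class, the `type` field, the set of high-materiality event types, the in-memory `sink` list standing in for Kinesis/Kafka); it only shows the difference between answering before and after the event has been dispatched.

```python
from typing import Any, Dict, List

# Hypothetical set of event types that must never sit in a buffer.
HIGH_MATERIALITY = {"transaction"}


class TwoTrackCollector:
    """Toy collector: buffers low-materiality events, writes the rest synchronously."""

    def __init__(self, buffer_size: int = 100) -> None:
        self.sink: List[Dict[str, Any]] = []  # stand-in for the real stream
        self._buffer: List[Dict[str, Any]] = []
        self._buffer_size = buffer_size

    def handle(self, event: Dict[str, Any]) -> int:
        if event.get("type") in HIGH_MATERIALITY:
            # Transactional track: dispatch first, then answer 200.
            self.sink.append(event)
            return 200
        # Cached track: answer 200 immediately, dispatch on a later flush.
        self._buffer.append(event)
        if len(self._buffer) >= self._buffer_size:
            self.flush()
        return 200

    def flush(self) -> None:
        self.sink.extend(self._buffer)
        self._buffer = []
```

In this sketch a 200 for a transaction means “already in the sink”, while a 200 for a page ping only means “received”; a crash before `flush()` would lose only the buffered low-materiality events.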


#15

Hi @alex

Interesting suggestion. Maybe a special route could be added to the collector so that events can be handled either sync or async; we would need to see how much refactoring that requires. (I am of little use here for now, as I only started programming in Scala recently.)

We have something similar, but driven more by data security constraints on certain data points: those are sent to another collector server from a second tracker instance.

Best


#16

@alex I plan to put some queue between the client and the collector. RabbitMQ, for example, only trashes the message after an ACK by default, not on fetch. But I need to make the collector “receive”->“flush”->“ack”.

There can be some dealer that passes messages from the queue to the collector, no problem. But the collector should later inform the dealer (or maybe any webhook) when it is safe to ACK on the queue and trash the message.
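
A rough, in-memory sketch of the “receive”->“flush”->“ack” flow I have in mind is below. Nothing in it is real Snowplow or RabbitMQ API: `AckQueue`, `BufferingCollector` and the `on_flush` hook are hypothetical names, and the post-flush hook is exactly the feature the collector would need to expose.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class AckQueue:
    """Stand-in for a broker that only drops a message after an explicit ACK."""

    def __init__(self, messages: List[str]) -> None:
        self._ready: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self._unacked: Dict[int, str] = {}

    def fetch(self) -> Optional[Tuple[int, str]]:
        if not self._ready:
            return None
        tag, body = self._ready.popleft()
        self._unacked[tag] = body  # delivered, but not yet safe to forget
        return tag, body

    def ack(self, tag: int) -> None:
        self._unacked.pop(tag, None)

    def requeue_unacked(self) -> None:
        """What a broker does when the consumer dies without acking."""
        for tag, body in sorted(self._unacked.items()):
            self._ready.append((tag, body))
        self._unacked.clear()

    def pending(self) -> int:
        return len(self._ready) + len(self._unacked)


class BufferingCollector:
    """Toy collector with a buffer and a hypothetical post-flush hook."""

    def __init__(self, buffer_size: int, on_flush: Callable[[List[int]], None]) -> None:
        self.sink: List[str] = []  # stand-in for Kinesis/Kafka
        self._buffer: List[Tuple[int, str]] = []
        self._buffer_size = buffer_size
        self._on_flush = on_flush

    def receive(self, tag: int, body: str) -> None:
        self._buffer.append((tag, body))
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self) -> None:
        flushed, self._buffer = self._buffer, []
        self.sink.extend(body for _, body in flushed)
        self._on_flush([tag for tag, _ in flushed])  # only now is it safe to ACK

    def crash(self) -> None:
        self._buffer = []  # buffered events are gone


def run_demo() -> Tuple[List[str], int]:
    queue = AckQueue(["e1", "e2", "e3"])

    def ack_all(tags: List[int]) -> None:
        for tag in tags:
            queue.ack(tag)

    collector = BufferingCollector(buffer_size=2, on_flush=ack_all)
    while (msg := queue.fetch()) is not None:
        collector.receive(*msg)
    collector.crash()          # "e3" was still in the buffer and is lost there
    queue.requeue_unacked()    # but it was never ACKed, so the queue redelivers it
    while (msg := queue.fetch()) is not None:
        collector.receive(*msg)
    collector.flush()
    return collector.sink, queue.pending()
```

In this toy run every event ends up in the sink even though the collector lost its buffer once. In a real setup a crash between the sink write and the ACK would cause a redelivery, so the result is “one or many”, and duplicates would have to be handled downstream.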


#17

What is the client environment - where is the tracker running?


#18

Using Snowplow as an event hub for, e.g., communication between microservices will require very low latency to work well.

What can we expect in terms of latency in the real-time pipeline, from event collection until the event is available in the enriched/good Kinesis streams?


#19

Hi @bernardosrulzon, for us the data is ready in the enriched stream around 60-120 seconds after being received by the collector, and it takes around 90-300 seconds to see the data in Elasticsearch.

Hope this helps.