Using Snowplow as an event hub

lucasas · August 9, 2017, 6:52pm

Hi guys,

Here at GetNinjas we currently use Snowplow to track client events and some transactional events. For example, when a user pays an invoice, my app generates an event through Snowplow to help our BI team.

We now do a lot of work when a user pays an invoice, e.g. giving them credits on our platform, sending them a receipt, etc. This work is triggered by an event that comes from AWS SQS: my app sends an event through SQS, and a daemon receives these messages and does everything described above.

We use SQS because it’s reliable and we trust that the event will reach the right destination, basically because AWS ensures that for us.

For us, the events sent through SQS could just as well be sent through Snowplow; in other words, we are planning to use Snowplow as our event hub for transactional events. It may not be the best decision, though, because of Snowplow characteristics such as the buffer.

Moreover, we are planning to run Snowplow on Docker, and given the nature of containers (someone can shut one down and start another) we are afraid we may lose a few events if we suffer an outage in our clusters.

Is our worry reasonable? Can we use Snowplow, safely, as an event hub solution? Is there anything wrong with our architecture? Do you have any tips?

Best,

acgray · August 10, 2017, 7:17am

Your concerns aren’t specific to Docker; all kinds of Snowplow deployment have a risk of event loss when a collector fails. If you are using Snowplow as an event bus rather than solely for analytics, then the cost of that risk is higher, but the principles are the same.

The solution is: make sure you don’t have any outages. That requires the same approach as any high-availability, fault-tolerant system: detailed failure mode analysis and planning, multiple redundant deployments, automated failover, intelligent client retry strategies (i.e. in your trackers) and so on.

In Snowplow’s case the critical process is the ingestion step; everything downstream can be retried, but if you don’t capture the events in the first place they are lost forever.

A couple of Snowplow-specific tips:

Run the CloudFront collector as a last resort; you can use Route53 DNS health checking to fail over to CloudFront if none of your primary collectors are healthy.
Applications in different regions can post to the same Kinesis stream, so you can run collectors in multiple regions with the same sink. Again, you can use Route53 DNS to fail over to another region when the primary deployment is unhealthy.
Make sure your trackers retry when a request fails. Some trackers (JS) do this already; others don’t yet, so depending on which you use, this may require some custom engineering.
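
For trackers that don’t retry yet, the custom engineering could look roughly like the sketch below. It is not taken from any Snowplow tracker: `send_with_retry`, its parameters and the `send` callable are hypothetical names, and exponential backoff is just one common choice of retry strategy.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[dict], bool],
    event: dict,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one event, waiting longer after each failed attempt.

    `send` is a hypothetical callable that posts the event to a collector
    and returns True on success (e.g. an HTTP 200).
    """
    for attempt in range(max_attempts):
        try:
            if send(event):
                return True
        except OSError:
            # Treat network-level errors the same as a failed request.
            pass
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    # The caller decides what to do with an undelivered event
    # (persist it locally, alert, drop it, ...).
    return False
```

Returning `False` instead of raising leaves it to the application to decide whether an undelivered event gets persisted locally or dropped.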

We have a similar use case so I’d be interested to hear additions to this list from other Snowplowers too.

alanjds · September 14, 2017, 10:04pm

Is it possible to have the collector emit some external confirmation after the cache flush? I can imagine a flow where messages are marked as “processed” by the collector, so that later I can detect the lost ones and queue them again.

mike · September 15, 2017, 4:06am

What cache flush in the collector are you referring to here? Do you mean after the events are flushed to Kinesis/another sink?

alanjds · September 15, 2017, 1:18pm

Yep. Exactly.

spatialy · September 19, 2017, 5:14am

Hi @ll

Sorry for hijacking the thread, but reading it brought a related question to mind:

How are flushed events handled if the Kinesis/Kafka sink that receives the collected events is down?

Best

BenFradet · September 19, 2017, 9:19am

@spatialy

The records which have failed sinking to Kinesis (e.g. because Kinesis was down) will be retried according to a linear backoff starting with a minimum backoff and limited by a maximum backoff.

There is no such mechanism for Kafka, however. Instead, you will be able to leverage Kafka’s number of retries starting with R93 (cf. issue 3367).
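
As a rough illustration of the linear backoff described above for the Kinesis sink (not the collector’s actual code): the function name and the `step_ms` increment are assumptions, since only the minimum and maximum backoff are mentioned here.

```python
from typing import List


def linear_backoff_delays(
    min_backoff_ms: int, max_backoff_ms: int, step_ms: int, attempts: int
) -> List[int]:
    """Delay before each retry: start at the minimum, grow by a fixed step,
    and never exceed the maximum. `step_ms` is an assumed parameter."""
    return [
        min(min_backoff_ms + i * step_ms, max_backoff_ms)
        for i in range(attempts)
    ]


# Four retries, starting at 3 s and capped at 10 s.
print(linear_backoff_delays(3000, 10000, 3000, 4))
# -> [3000, 6000, 9000, 10000]
```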

alex · September 19, 2017, 8:11pm

Thanks @BenFradet.

@spatialy - I think it’s important to stress that high availability for your Kinesis streams / Kafka topics is an essential part of running a robust Snowplow pipeline; you can’t rely on these retry mechanisms for sustained outages.

This kind of monitoring and scaling is an important part of what we deliver with Snowplow Insights.

spatialy · September 19, 2017, 8:30pm

Thanks @BenFradet and @alex for the guidance.

We run both pipelines (Kafka and Kinesis) and agree with you 100% about HA for these components, and that the mitigation mechanisms only cover brief network outages.

best

alanjds · September 20, 2017, 12:24pm

I am less worried about losses related to Kafka/Kinesis HA and more worried about outages in the Snowplow collector.

I’m trying to figure out a way to not lose events when a collector machine goes down. What about the events that exist only in the collector buffer?

spatialy · September 20, 2017, 7:52pm

Hi @alanjds,

In our experience, events that do not reach the collector are stored in the client browser’s local storage and sent in bulk once the tracker connects successfully again, for example on a later event when the client revisits our pages (the intended behavior as per the docs).

For the events in the collector buffer, we lose the data. A mitigation approach is to strike a balance between the parameters that control the buffer, and this depends heavily on the impact that data loss has on your business.
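
To make that trade-off concrete, here is a toy buffer, not the collector’s implementation. The class and the `record_limit`, `byte_limit` and `time_limit_s` names are hypothetical, chosen only to mirror the idea of record, byte and time thresholds. Whatever is still pending when the process dies is what gets lost.

```python
import time
from typing import Callable, List


class FlushingBuffer:
    """Collects events in memory and flushes them to `sink` in batches."""

    def __init__(
        self,
        sink: Callable[[List[bytes]], None],
        record_limit: int,
        byte_limit: int,
        time_limit_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sink = sink
        self.record_limit = record_limit
        self.byte_limit = byte_limit
        self.time_limit_s = time_limit_s
        self._clock = clock
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self._last_flush = clock()

    def add(self, event: bytes) -> None:
        self._pending.append(event)
        self._pending_bytes += len(event)
        # In this sketch the time limit is only checked when an event arrives.
        if (
            len(self._pending) >= self.record_limit
            or self._pending_bytes >= self.byte_limit
            or self._clock() - self._last_flush >= self.time_limit_s
        ):
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._sink(list(self._pending))
        self._pending.clear()
        self._pending_bytes = 0
        self._last_flush = self._clock()

    @property
    def at_risk(self) -> int:
        """Events that would be lost if the process died right now."""
        return len(self._pending)
```

Lower limits shrink `at_risk` at the cost of more, smaller writes to the sink; higher limits do the opposite.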

alanjds · September 20, 2017, 8:43pm

Yes, but what if I do not want to, or cannot, lose events after they have hit the collector?

As I understand it, right now it is not possible to “not lose” events inside the collector.

If I could hook in some post-flush code, I could hack together a “one or many” delivery scheme instead of the current “one or none”.

My question is: does the current collector offer such a feature?
The answer seems to be: no, it does not.

Am I right?

alex · September 20, 2017, 9:03pm

Just thinking out loud here, but it could be interesting to have a two-track system, where:

Low-materiality events (e.g. page pings, ad impressions) are handled as they are today (no-ack, cached)
High-materiality events (e.g. ecommerce transactions) are handled in a cacheless, transactional way - meaning that the connection to the tracker stays open until the event has been confirmed as being dispatched, and is acked back to the tracker with 200

Sounds a bit complicated. Remember you can have thousands of tracker instances live, maybe 2-10 collector instances, and no server affinity. I don’t see how you get the right collector acks back to the right trackers, asynchronously/post-cache flush, in a performant way.

spatialy · September 20, 2017, 9:16pm

Hi @alex

Interesting suggestion. Maybe add a special route in the collector so that events can be handled sync or async; we would need to see how much refactoring that takes. (I am not much help here for now, as I only recently started programming in Scala.)

We have something similar, but driven by data security constraints on certain data points: those are sent to another collector server from a second tracker instance.

Best

alanjds · September 20, 2017, 10:00pm

@alex I plan to put a queue between the client and the collector. RabbitMQ, for example, by default only discards a message after an ACK, not on fetch. But I need the collector to do “receive”->“flush”->“ack”.

There can be some dealer that passes messages from the queue to the collector, no problem. But the collector should later inform the dealer (or maybe a webhook) when it is safe to ACK on the queue and discard the message.
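
A toy, in-memory sketch of that “receive”->“flush”->“ack” flow is below. Nothing in it is real Snowplow or RabbitMQ API: `AckQueue`, `deal` and `flush_to_sink` are hypothetical stand-ins, and the post-flush confirmation is exactly the hook the collector does not expose today.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class AckQueue:
    """Stand-in for a broker that keeps a message until it is acked."""

    def __init__(self) -> None:
        self._ready: Deque[Tuple[int, dict]] = deque()
        self._unacked: Dict[int, dict] = {}
        self._next_tag = 0

    def publish(self, msg: dict) -> None:
        self._ready.append((self._next_tag, msg))
        self._next_tag += 1

    def fetch(self) -> Optional[Tuple[int, dict]]:
        if not self._ready:
            return None
        tag, msg = self._ready.popleft()
        self._unacked[tag] = msg  # fetched, but not yet safe to discard
        return tag, msg

    def ack(self, tag: int) -> None:
        self._unacked.pop(tag, None)

    def requeue_unacked(self) -> None:
        for tag, msg in sorted(self._unacked.items()):
            self._ready.append((tag, msg))
        self._unacked.clear()

    def __len__(self) -> int:
        return len(self._ready) + len(self._unacked)


def deal(queue: AckQueue, flush_to_sink: Callable[[List[dict]], bool]) -> int:
    """Fetch everything that is ready, flush it as one batch, and ack only
    after the flush is confirmed. Returns the number of messages acked."""
    batch: List[Tuple[int, dict]] = []
    while (item := queue.fetch()) is not None:
        batch.append(item)
    if not batch:
        return 0
    if flush_to_sink([msg for _, msg in batch]):
        for tag, _ in batch:
            queue.ack(tag)
        return len(batch)
    # Flush failed or was never confirmed: put the messages back.
    queue.requeue_unacked()
    return 0
```

If the flush succeeds but the ack is lost, the same message is delivered again, so this gives “one or many” delivery and downstream consumers would need to tolerate duplicates.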

alex · September 20, 2017, 10:18pm

What is the client environment - where is the tracker running?

bernardosrulzon · September 20, 2017, 11:34pm

Using Snowplow as an event hub, e.g. for communication between microservices, will require very low latency to work well.

What can we expect in terms of latency in the real-time pipeline, from event collection until the event is available in the enriched/good Kinesis streams?

spatialy · September 21, 2017, 8:24pm

Hi @bernardosrulzon, for us the data is ready in the enriched stream around 60-120 seconds after being received by the collector, and it takes around 90-300 seconds to see the data in Elasticsearch.

Hope this helps.
