Why is Snowplow using Kinesis/Kafka for real-time pipeline?


#1

Hello there,

I’m considering to move to Snowplow realtime pipeline and slightly confused with technology choice.
In particular, I’m wondering why AWS Kinesis has been chosen instead of (from
my point of view) almost default for this task technology like Spark Streaming or more modern Flink etc. For me, Kinesis/Kafka looks more like Pub/Sub wire rather then data-processing tool, unlike Spark. But probably I’m missing something as I have not much experience with them.

As I can see, nothing prevents us from including Scala Common Enrich into Spark Streaming and at the same time application currently called Stream Enrich would be much smaller.

So, I guess, my primary question is following: what design choices lead you to use Kinesis/Kafka instead of Spark Streaming or similar.

Thanks in advance!


On-premise Snowplow Realtime Pipeline with Spark Streaming Enrich
#2

Hey @morozov - great question, very happy to answer in some detail. Before I do though - just so I understand the alternative you’re suggesting:

What would be your source of raw Snowplow events (aka the payloads that come out of a Snowplow event collector), and where would your Snowplow enriched events be going downstream of Spark Streaming?


#3

Recently I’ve been looking at using Spark streaming to run analysis/queries on the enriched output stream from Kinesis - this seems to work really well and, with the right setup can give you something that enables near real time applications.

Although there’s some things I’m not particularly fond of in Kinesis (having to do checkpointing externally for one) it’s actually great as a specialised queuing system. For us the fact that it’s able to replicate both the raw and enriched streams across availability zones is a big win, using something like Kafka would mean running this on EC2 or something similar in multiple AZ’s - a fair bit of overhead. In addition the flexibility of modifying the retention period for a given stream (up to 7 days) means that you’ve got a reasonably durable, short term store that makes it easy to build consumer applications off these streams (using KCL).


#4

Hello @alex!

Thank you for your attention.

Seems I already answered my question (thanks to @mike’s aiming). But anyway, it would be awesome if you could correct me if I’m wrong somewhere (which is very likely, since we’ve just started to explore this area).

My initial “propose” was to just switch Kinesis with Spark Streaming, leaving Kinesis collector as a source and Elasticsearch or S3 as sinks. However, now I see Spark Streaming like more as a framework for running custom pipelines (with transformations and aggregations) whereas Kinesis is more like pure transfer layer responsible for balancing load and queuing enriched events. So my guess is that you chose Kinesis just because it can balancing nicely, while custom jobs isn’t what you need there.

Please, correct me if I’m wrong. Thanks.


#5

Hi @morozov - that’s exactly right. Think of Kinesis (and Kafka) as the glue layer between your asynchronous micro-services. We provide you with the “core” micro-services (Stream Collector, Stream Enrich, Elasticsearch Sink, S3 Sink), but then you can add your own.

Spark Streaming is a good choice as a stream processing framework for you to write your own custom event processing in - you can easily embed our Scala Analytics SDK into Spark Streaming.

Under the hood, Spark Streaming uses the Kinesis Client Library (KCL). Because Stream Enrich is only doing simple single-event processing, Spark Streaming doesn’t add anything here, so we “cut out the middle-man” and Stream Enrich just embeds the KCL directly.

But yes - you are very welcome to develop your own event-driven microservices that work off the Kinesis enriched event stream, and Spark Streaming is a good choice for that (as are raw KCL, AWS Lambda, Apache Flink etc).

Hope this helps!

Alex