Enrich with Kafka


#1

Hello, I am struggling to find an example of a Kafka -> Kafka transformation, namely how to configure / implement the transformation method.
In my case the collector saves raw events in Thrift format, and I would like to filter out unnecessary fields, maybe do some IP lookup, and store the enriched events in JSON format in a Kafka sink.
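To make the shape concrete, what I am after is essentially consume -> transform -> produce. A rough plain-Java sketch of just the transform step (field names and the IP check are placeholders for the real Thrift decoding and enrichment; the consume/produce ends would be a KafkaConsumer poll loop and a KafkaProducer.send call):

```java
import java.util.Map;

public class TransformSketch {
    // Keep only the fields we want and attach a lookup result - a stand-in for
    // "filter unnecessary fields, maybe do some IP lookup". The field names and
    // the toy "IP lookup" below are invented for illustration.
    static String enrichToJson(Map<String, String> raw) {
        String ip = raw.getOrDefault("user_ip", "");
        String country = ip.startsWith("10.") ? "internal" : "unknown"; // toy IP lookup
        return "{\"event_id\":\"" + raw.get("event_id") + "\","
             + "\"user_ip\":\"" + ip + "\","
             + "\"geo\":\"" + country + "\"}";
    }
}
```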

Thanks,
Milan


#2

Hi @magaton - the instructions for running our Stream Enrich process with Kafka, which does everything you require, are here:


#3

Thanks @alex, I’ve seen this.
Between these two components, it is now possible to stand up a Kafka-based pipeline, from Snowplow event tracking through to a Kafka topic containing Snowplow enriched events.

and in the config example I can see that it is possible to define topics in sources and sinks, brokers, etc., but not the transformation itself, which is my main question.
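For reference, the part of the example config I am looking at is roughly this shape (values are mine and field names are approximate - the bundled sample config is authoritative):

```hocon
# Rough shape of the Stream Enrich config - topics and brokers are declared,
# but no transformation logic appears here.
enrich {
  source = kafka
  sink = kafka

  streams {
    in { raw = "snowplow-raw" }          # topic written by the collector
    out {
      enriched = "snowplow-enriched"     # successfully enriched events
      bad = "snowplow-bad"               # events failing validation
    }
    kafka { brokers = "localhost:9092" }
  }
}
```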


#4

Figured it out in the meantime but forgot to post an update. Apologies.
I am reading about Kafka Streams at the moment and I see it as a potentially very good fit for enrichment.
Are you maybe considering it in the near future, or should I give it a try myself?

Thank you,
Milan


#5

Hey @magaton - we considered using Kafka Streams in Stream Enrich for our Kafka processing, but:

  • We didn’t identify anything it offered for this use case over vanilla Kafka producers/consumers
  • We weren’t sure if it would play nicely with Stream Enrich’s open architecture (any stream technology as source to any stream technology as sink)

Let us know if you think differently!


#6

Hi @alex
I fully agree that using Kafka Streams only to replace the producer - consumer pair in enrichment
(and only in the Kafka-based pipeline, especially if you want to keep Kinesis as an option) would not bring much value to Snowplow and its users.

But I can see Kafka REST proxy + Kafka Streams + Kafka Connect + Avro schema registry as a kind of equivalent of the current Snowplow architecture: Thrift-based collector + (Kafka/Kinesis enrichment) + Iglu + Sauna for alerts + storage,

where there are far fewer moving parts, with scaling, HA and monitoring inherited from Kafka / Confluent, and which is thus much easier to operate.

In this case one can have:

  • a Kafka cluster with the Kafka REST proxy, Kafka connectors and the schema registry enabled - there is then no need for a separate app that serves as collector. Raw events can go directly to the Kafka REST proxy.

  • an app that runs Kafka Streams, so that the collected data gets transformed / aggregated / enriched in different ways and ends up in different topics. Alerts / notifications could probably also be implemented using the low-level Processor API in Kafka Streams.

From there the data can be sent to ES, Spark, Neo4j, FS or an RDBMS using the available Kafka Connect sinks, depending on what kind of data analytics / storage you need.
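To illustrate the fan-out I have in mind, the per-record routing decision could look like the following plain-Java sketch. Topic names and the alert condition are made up; in Kafka Streams this predicate would sit in a branch of the topology, with each branch ending in a `.to("<topic>")`:

```java
public class TopicRouter {
    // Decide which downstream topic an enriched event belongs on.
    // "alerts", "pageviews-agg" and "enriched-default" are invented topic
    // names, and the error flag stands in for a real alerting condition.
    static String route(String eventType, boolean isError) {
        if (isError) return "alerts";                              // feeds notifications
        if ("page_view".equals(eventType)) return "pageviews-agg"; // aggregation input
        return "enriched-default";                                 // everything else
    }
}
```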

This app would either be a custom implementation, or there could be some kind of generic code with Groovy-based on-the-fly transformation, like in the Divolte collector, for instance.

I read your arguments about Iglu + Thrift vs Avro in another thread.
I think it is a similar question to this one. Solely replacing Thrift with Avro doesn’t make much sense,
but if you consider the Confluent stack as the backbone of your product, not as one of the options, maybe it does?


#7

Hi @magaton,

It’s definitely a thought-provoking proposal. My thoughts:

especially if you want to keep kinesis as an option

Certainly we have no plans to remove support for Kinesis - it is working great for the Snowplow community as a minimal-ops Kafka alternative on AWS.

Building on our Kinesis success, we are now also actively exploring Azure Event Hubs (see our recent RFC) and Google Cloud Pub/Sub (see our recent example project).

These hosted unified log technologies are great for those who can’t afford a 24/7 ops team with deep Kafka experience.

kafka cluster with enabled kafka rest proxy … - there is no need now for a separate app that serves as collector

I have a few concerns with this approach:

  1. REST is a poor paradigm for event collection - CQRS is the methodology we use
  2. The Snowplow collectors perform some important analytics functions such as cookies and redirects
  3. I’d be nervous about exposing a read/write API to your company’s Kafka cluster on the open internet
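On point 2, a toy sketch of why a dedicated collector matters: even a minimal collector endpoint does per-request work - like assigning a user cookie - that the REST proxy’s generic produce endpoint has no notion of. The endpoint path and cookie name below are invented for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class ToyCollector {
    // Start a tiny collector-like endpoint. A real collector would also
    // serialize the request into a raw event and produce it to Kafka;
    // here we only show the cookie-setting responsibility.
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/i", exchange -> {
            // Assign a long-lived network user id via cookie - something a
            // generic REST produce endpoint cannot do for you.
            exchange.getResponseHeaders().add("Set-Cookie",
                "sp=" + java.util.UUID.randomUUID() + "; Max-Age=31536000; Path=/");
            exchange.sendResponseHeaders(200, -1); // no body; real collectors return a 1x1 pixel
            exchange.close();
        });
        server.start();
        return server;
    }
}
```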

I can see … avro schema registry as a kind of equivalent of … iglu … I read your arguments about Iglu + Thrift vs Avro in another thread.

Confluent schema registry is a strict subset of Iglu, as covered in the last sub-section in this post:

http://discourse.snowplowanalytics.com/t/optionally-support-avro-and-the-confluent-schema-registry-for-kafka/1127/9

We believe that we can support Confluent schema registry as a downstream “lossy mirror” of an Iglu schema registry, but we cannot use it as our primary schema registry without losing most of the capabilities that make Snowplow, Snowplow.
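To make “lossy” concrete: an Iglu schema key carries vendor, name, format and a full SchemaVer (MODEL-REVISION-ADDITION), while a Confluent subject is a flat name with a single linear version sequence. A hypothetical one-way mapping (the naming convention here is invented) necessarily drops information:

```java
public class IgluToConfluent {
    // One-way, lossy mapping from an Iglu schema URI to a Confluent subject
    // name. The format segment and the full SchemaVer are simply dropped -
    // Confluent's auto-incrementing versions cannot represent them.
    static String toSubject(String igluUri) {
        // e.g. "iglu:com.acme/checkout/jsonschema/1-0-2"
        String[] parts = igluUri.replaceFirst("^iglu:", "").split("/");
        String vendor = parts[0], name = parts[1]; // parts[2] (format) and parts[3] (version) lost
        return vendor + "." + name + "-value";
    }
}
```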

app that runs kafka streams so that the collected data get transformed / aggregated / enriched in different ways and end up in different topics.

Yes, definitely - there is nothing stopping a Kafka user of Snowplow from writing their own awesome Kafka Streams (or Flink or Beam or Spark Streaming) jobs that work on the event stream and do all sorts of cool data modeling and aggregation.

That’s the nice thing about Snowplow’s async micro-service-based architecture, running on Kafka or Kinesis or Event Hub - you can write any app which plugs into our enriched event topic.

From there the data can be sent to ES, Spark, Neo4j, FS, RDBMS using available kafka connect sinks, depends what kind of data analytics / storage you need

We can’t use the generic Kafka Connect sinks because our enriched event model is too rich, and we want to support some pretty advanced behaviors (e.g. hot mutation of RDBMS tables to accommodate evolving schemas), but yes, there’s nothing stopping us from writing our own sinks on top of Kafka Connect.
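To sketch what “hot mutation” means in practice: when a schema gains a field, the loader widens the target table rather than failing. Table and column names below are invented, and a real loader would derive column types from the JSON Schema:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TableMutator {
    // Compare the columns a table has against the columns the latest schema
    // wants, and emit the ALTER statements needed to catch up. This is only
    // a sketch of the idea, not Snowplow's actual loader logic.
    static List<String> alterStatements(String table,
                                        Set<String> existing,
                                        Map<String, String> desired) {
        List<String> ddl = new ArrayList<>();
        for (Map.Entry<String, String> col : desired.entrySet()) {
            if (!existing.contains(col.getKey())) {
                ddl.add("ALTER TABLE " + table + " ADD COLUMN "
                        + col.getKey() + " " + col.getValue() + ";");
            }
        }
        return ddl;
    }
}
```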

I think that given the very similar semantics between Kafka, Kinesis and Event Hubs (albeit not Google Cloud Pub/Sub), there are some interesting opportunities for us around generalizing Kafka Connect further, so that it - or at least its patterns - can be reused across all these unified log technologies.

This app is either meant to be a custom implementation or there can be kind of generic code with groovy based on the fly transformation like in divolte collector, for instance.

Yes, we already support JavaScript-based enrichments:

We would love to add JRuby, Jython or Groovy to this. If you are interested in contributing any of these, please let us know!
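For reference, the JavaScript enrichment is switched on with a self-describing configuration JSON along these lines (the `script` parameter holds the base64-encoded JavaScript defining a `process(event)` function; the placeholder below is not valid base64, and the schema version shown should be checked against the current Iglu schema):

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "javascript_script_config",
    "enabled": true,
    "parameters": {
      "script": "<base64-encoded JavaScript defining process(event)>"
    }
  }
}
```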

Phew! Hopefully I’ve covered everything. I think I would sum up by saying: we are excited and impressed by what is happening in the Kafka/Confluent ecosystem, and we want to leverage as much of it as possible, but without reducing Snowplow’s capability-set, and without making things overly specific to Kafka in a multi-unified-log world.