Snowplow > Kafka > Druid

Is it possible to set up a pipeline as suggested in the title?

If so, what parts do I need to make this work?

  1. Is this assumption correct?

Scala Stream Collector
Scala Stream Collector installed on two CentOS instances with a load balancer in front of them, collecting the events from the trackers.
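To make the load-balancer part concrete, this is roughly what I have in mind (a sketch only; the instance IPs, port, and the use of nginx are my assumptions, not anything from the Snowplow docs):

```nginx
# Hypothetical nginx load balancer in front of the two collector instances.
upstream snowplow_collectors {
    server 10.0.0.11:8080;   # CentOS instance 1 (assumed address/port)
    server 10.0.0.12:8080;   # CentOS instance 2 (assumed address/port)
}

server {
    listen 80;
    location / {
        proxy_pass http://snowplow_collectors;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```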

Set up the Kafka Sink
As found on: https://github.com/snowplow/snowplow/wiki/Configure-the-Scala-Stream-Collector

The collector.streams.sink.enabled setting determines which of the supported sinks to write raw events to:
"kafka" for writing Thrift-serialized records and error rows to a Kafka topic
You should fill the rest of the collector.streams.sink section according to your selection as a sink.
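Based on that page, my understanding of the relevant section of the collector's `application.conf` is something like the sketch below (the topic names and broker list are placeholders I made up; the exact keys should be checked against the sample config shipped with the collector):

```hocon
# Sketch of the Kafka sink section of the collector config (application.conf).
# Topic names and brokers are assumptions, not defaults.
collector {
  streams {
    good = "snowplow-raw-good"   # raw Thrift-serialized events
    bad  = "snowplow-raw-bad"    # error rows

    sink {
      enabled = "kafka"
      brokers = "kafka1:9092,kafka2:9092"
      retries = 0
    }
  }
}
```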

I then read the Kafka Topic using the thrift extension:
https://druid.apache.org/docs/latest/development/extensions-contrib/thrift.html
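If I read that page correctly, the ingestion side would be a Kafka supervisor spec using the thrift parser from that contrib extension. A rough sketch of what I think it would look like (the datasource, topic, jar path, and Thrift class name are all placeholders; in particular I'd need to find the actual Snowplow collector-payload Thrift class and jar):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "snowplow_raw",
    "parser": {
      "type": "thrift",
      "jarPath": "collector-payload.jar",
      "thriftClass": "com.example.CollectorPayload",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": [] }
      }
    },
    "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "ioConfig": {
    "topic": "snowplow-raw-good",
    "consumerProperties": { "bootstrap.servers": "kafka1:9092" }
  },
  "tuningConfig": { "type": "kafka" }
}
```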

  2. Where would I find the settings for the Kafka sink?

Finally, I add the JavaScript tracker to my website, and that starts sending events?
3. Can I rename the Snowplow functions so ad blockers don’t pick up sp.js or the fired events?
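For question 3, my understanding is that the standard loader snippet queues calls on a global function whose name you choose, so renaming should just be a matter of picking a neutral global name and self-hosting the script under a different filename. A minimal sketch of the queueing pattern, assuming a renamed global `analytics` and a self-hosted script at `/js/track.js` (both names are my own placeholders):

```javascript
// Sketch of the Snowplow-style loader pattern with a renamed global.
// Assumption: sp.js is self-hosted under a neutral name (e.g. /js/track.js)
// and the global function name is something filter lists won't match.
(function (p, l, o, w, i) {
  if (!p[i]) {
    p.GlobalSnowplowNamespace = p.GlobalSnowplowNamespace || [];
    p.GlobalSnowplowNamespace.push(i);
    // Until the real tracker script loads, calls are queued on p[i].q.
    p[i] = function () { (p[i].q = p[i].q || []).push(arguments); };
    p[i].q = p[i].q || [];
    // In a browser, a <script> tag pointing at `w` would be injected here.
  }
})(globalThis, null, "script", "/js/track.js", "analytics");

// Calls made before the script loads are simply queued:
globalThis.analytics("newTracker", "cf", "collector.example.com", { appId: "web" });
globalThis.analytics("trackPageView");
```

Once the real script loads, it drains the queue, so early calls aren't lost.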

Is this roughly the right setup?

In addition, I think I need this: https://github.com/snowplow/snowplow/tree/master/3-enrich/stream-enrich
Does it read events from the collectors and push them to a Kafka topic, or does Stream Enrich read from the raw Kafka topic instead?
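If Stream Enrich does read from the raw topic, I'd expect its config to name an input topic matching the collector's output and separate output topics, roughly like this sketch (topic names and brokers are my placeholders; the exact keys should be checked against the stream-enrich example config):

```hocon
# Sketch of the Stream Enrich source/sink settings (application.conf).
# Topic names must match whatever the collector writes to.
enrich {
  source = "kafka"
  sink   = "kafka"

  streams {
    in { raw = "snowplow-raw-good" }          # the collector's raw topic

    out {
      enriched = "snowplow-enriched-good"     # enriched events for downstream
      bad      = "snowplow-enriched-bad"      # rows that failed enrichment
    }

    kafka { brokers = "kafka1:9092,kafka2:9092" }
  }
}
```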