How to configure a Kafka collector and an HDFS sink


#1

I need to track data coming from Kafka and, after processing it, dump the data to HDFS.
Can someone please help me out with configuring this in Snowplow - a Kafka tracker and HDFS as the sink?


#2

Hi @vishwas - we don’t have a component to do this yet, I’m afraid.

Are you talking about storing the enriched events to HDFS, or the raw collector payloads?


#3

Storing the enriched events to HDFS.
Collecting the data from Kafka, enriching it and then storing the enriched data to HDFS… is this possible through Snowplow?


#4

Hi @vishwas - we don’t currently have a component for this, but it’s something we’ll look at building in the New Year. In the meantime, I’d suggest doing a Google search for “Kafka to HDFS” and exploring the results which come up.
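
One route that search will likely surface is Kafka Connect with an HDFS sink connector (Confluent’s kafka-connect-hdfs, for example). A rough sketch of such a connector config - the topic name, namenode URL and directory below are placeholders, and the exact property set should be checked against the connector’s own docs:

```properties
# Sketch of a Kafka Connect HDFS sink (Confluent's kafka-connect-hdfs).
# Topic, namenode URL and directory below are placeholders.
name=snowplow-hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Kafka topic holding the enriched Snowplow events
topics=snowplow-enriched
# HDFS namenode and target directory for the written files
hdfs.url=hdfs://namenode:8020
topics.dir=/snowplow/enriched
# Number of records to buffer before committing a file to HDFS
flush.size=1000
```

Note that this piece sits entirely outside Snowplow: it simply reads whatever is on the topic and writes it to HDFS as-is.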


#5

Hi @alex - Is there a possibility to collect the data from a Kafka topic, enrich it and push it back to Kafka as another topic? If so, could you please help in configuring the same… Thanks


#6

Yes indeed it is possible - we don’t have documentation on the wiki yet, but the blog post should help you:
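
The short version: Stream Enrich can use Kafka for both its source and its sink, so a raw topic goes in and an enriched topic comes out. A rough sketch of the relevant part of the Stream Enrich config - the key names and nesting here are approximate, so treat the sample config shipped with the release as the reference:

```hocon
# Approximate sketch only - key names/nesting may differ from the sample
# config shipped with Stream Enrich; topic names are placeholders.
enrich {
  source = "kafka"   # read raw collector payloads from a Kafka topic
  sink   = "kafka"   # write enriched events back to another Kafka topic

  streams {
    in {
      raw = "snowplow-raw"            # topic the collector writes to
    }
    out {
      enriched = "snowplow-enriched"  # successfully enriched events
      bad      = "snowplow-bad"       # events that fail validation
    }
    kafka {
      brokers = "localhost:9092"
    }
  }
}
```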


#7

Hi @alex - Thanks for the information. I went through the documentation; one more clarification is required. How do I configure the collector to use Kafka as a source (to collect the data from a Kafka topic…)?
The overall flow is Kafka topic --> Enrichment --> Kafka topic. I need to set up this pipeline in Snowplow…
Thanks…


#8

Hi @vishwas - I think you’re a bit confused on the terminology: an event collector receives events over HTTP; there’s no concept of Kafka (or Kinesis or S3 or …) as a collector’s source. Of course Stream Enrich can take Kinesis or Kafka as a source.
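
To see the two halves side by side: the collector listens on HTTP and writes the raw payloads to a Kafka topic (Kafka is its sink), and Stream Enrich then reads that topic as its source. A very rough sketch of the collector’s sink section - the key names and nesting are illustrative assumptions rather than the exact schema, so start from the sample application.conf shipped with the Scala Stream Collector:

```hocon
# Illustrative assumption of the shape, not the exact schema - the collector
# writes raw events *to* Kafka (its sink); it never reads from Kafka.
collector {
  interface = "0.0.0.0"
  port      = 8080

  sink {
    enabled = "kafka"            # choose the Kafka sink
    kafka {
      brokers = "localhost:9092"
      topic {
        good = "snowplow-raw"    # raw payloads for Stream Enrich to read
        bad  = "snowplow-raw-bad"
      }
    }
  }
}
```

Stream Enrich’s raw input topic (see the sketch under #6 above) then points at the same snowplow-raw topic, and whatever HDFS loader you choose reads the enriched topic downstream.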