Collect to Kafka, enrich with Kafka then what?


#1

Hello,
I’m totally new to Snowplow and I’m just digging through the extensive documentation. I managed to set up a proof of concept using Apache Kafka for the collector and enrichment phases instead of Kinesis, but then I hit a wall: I have no idea how to move my data into something suitable for further analysis, both batch and real-time (PostgreSQL, for example). I grepped the 4-storage folder and there is not a single mention of Kafka; after some research I found out that Kafka support as a whole is quite new. But if this support exists, there must be some way to use the data collected via Kafka? :slight_smile:

My question is: once I’m past the enrichment phase and the data is back in a Kafka topic (the post-enrichment one), how do I store it in PostgreSQL or Elasticsearch? Or should I not pick Kafka as the final post-enrich store, and instead move the data to something for which an existing loader/sink is available?

Sorry for such a basic question, but I couldn’t find any remark on this, and I found out the hard way, after trying first storage-loader and then kinesis-elasticsearch-sink, that neither even mentions Kafka support.

Best regards,
K.


#2

Hi @noizex - it’s a fair question!

As you’ve spotted, our Kafka support is very new and incomplete. In terms of the specific storage targets you mention:

  • We plan on adding support for Kafka to the kinesis-elasticsearch-sink (which we’ll rename stream-elasticsearch-sink), but this will require some heavy refactoring, and we will only start on it in the New Year.
  • We have no plans to support loading from a Kafka cluster into single-node Postgres.
  • If there are any other distributed analytical databases you would be interested in us supporting from Kafka (EventQL? Greenplum? ClickHouse? Vertica? Redshift?), then let us know.

At the moment, as a workaround, if you wanted to load the data into Elasticsearch you could write a simple Kafka-to-Kinesis bridge and then connect that into kinesis-elasticsearch-sink.
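To make the bridge idea concrete, here is a minimal sketch in Python. It is not Snowplow code: the topic and stream names are placeholders, and it assumes the third-party kafka-python and boto3 libraries. The pure helpers at the top build Kinesis `PutRecords` entries and respect the 500-records-per-call limit; the I/O loop is separate.

```python
# Hypothetical Kafka -> Kinesis bridge sketch (names are illustrative).
import hashlib

def to_kinesis_entry(value, key=None):
    """Build one PutRecords entry from a Kafka message's value/key bytes.
    Falls back to an MD5 of the payload when the message has no key."""
    partition_key = key.decode("utf-8") if key else hashlib.md5(value).hexdigest()
    return {"Data": value, "PartitionKey": partition_key}

def batch(entries, size=500):
    """Kinesis PutRecords accepts at most 500 records per call."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]

def run_bridge(topic="enriched-good", stream="enriched-good",
               brokers="localhost:9092"):
    # Imports kept local so the helpers above stay dependency-free.
    from kafka import KafkaConsumer  # pip install kafka-python
    import boto3                     # pip install boto3

    kinesis = boto3.client("kinesis")
    consumer = KafkaConsumer(topic, bootstrap_servers=brokers)
    buffer = []
    for msg in consumer:
        buffer.append(to_kinesis_entry(msg.value, msg.key))
        if len(buffer) >= 500:
            for chunk in batch(buffer):
                kinesis.put_records(StreamName=stream, Records=chunk)
            buffer = []
```

A production version would also need to check the `FailedRecordCount` in the `put_records` response and retry failed entries, but the skeleton above shows the shape of the work.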

Thanks,

Alex


#3

Hi @alex, thanks for quick reply :slight_smile:

Thanks for explaining your plans regarding Kafka. We wanted to use Kafka with Elasticsearch (real-time) and PostgreSQL (batch) backends. The first seems achievable, as you pointed out, by writing a Kafka->Kinesis bridge or a straight Kafka->Elasticsearch converter. The other I have no idea how to approach.

About Kafka to Postgres: do you mean a straight Kafka->Postgres loader? Couldn’t it work in a similar way to the Kinesis LZO S3 sink + EmrEtlRunner + StorageLoader? Or is it a completely different thing?

Since we’re doing this as a proof of concept for now and don’t have many resources to put into writing our own converters, we’ll probably try the Amazon approach: we’ll use Kinesis + S3 and see how the whole pipeline works for us.

Cheers,
noizex


#4

Hi @noizex - sorry, you’re right: if you could get the raw events from Kafka to S3 (using a hypothetical equivalent of our kinesis-s3 component), then you could totally run the batch pipeline on EMR and load into Postgres.

My point was more that, Postgres being a non-clustered database, it doesn’t make sense for us to spend time supporting loading it directly from a Kafka cluster - the Postgres load would bottleneck badly.


#5

Hi @alex,

What do you think about supporting SnappyData as the database and Spark store? My GitHub has work on a Kubernetes version of SnappyData, and I could port it back to Ansible.

The flow would be collector -> kafka-s3 or kafka-hdfs -> SnappyData (SQL and Spark).

I might be able to help Snowplow add this functionality.

SnappyData is a clustered SQL database, which means it can scale.


#6

Using a Kafka-to-HDFS component (which would need to be custom-made) means the system could use Alluxio as an S3-to-HDFS bridge for both Spark and SnappyData.

StreamSets is another collection system that can move data from Kafka to S3 / JDBC / HDFS. The only problem is that Snowplow already has a collector/enricher system, so StreamSets would be yet another layer.


#7

SnappyData sounds like an interesting idea - keep the suggestions coming!


#8

Hi @noizex @alex

@alex do you think the new Kafka Connect can do the job?

@noizex did you solve the problem of moving data from Kafka -> Postgres? Any insights from what you learned?

Best


#9

Hi @spatialy - what job are you referring to?


#10

Hey @spatialy @noizex ,

If you’re still looking to move data from Kafka to Postgres I’d suggest looking at kafka-connect-jdbc.
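For reference, a kafka-connect-jdbc sink configuration might look something like the sketch below. The topic, database and credentials are made-up placeholders. One important caveat: the JDBC sink expects structured records (e.g. Avro, or JSON with schemas), whereas Snowplow enriched events are plain TSV, so you would need a conversion step (for instance via the Snowplow Analytics SDK) before this would work end to end.

```json
{
  "name": "snowplow-postgres-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "enriched-good",
    "connection.url": "jdbc:postgresql://localhost:5432/snowplow",
    "connection.user": "snowplow",
    "connection.password": "changeme",
    "insert.mode": "insert",
    "auto.create": "true"
  }
}
```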

Best,
Ben.


#11

Hi @alex, I was referring to moving data to a database … as @BenFradet noted, there seems to be a Kafka-native solution from the last month or so.

best


#12

Ideally, Kafka -> anything should be a Kafka Connect connector and completely out of scope for Snowplow, as long as Snowplow pushes the data to Kafka in a Kafka-friendly format (JSON or Avro).

Snowplow could then provide examples and guidance on configuring a Kafka connector, for example Confluent’s S3 sink connector, to show good usage.
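To illustrate the point, here is a sketch of a Confluent S3 sink connector configuration. The bucket, region, topic and flush size are placeholders, and `ByteArrayFormat` is chosen on the assumption that you want the enriched TSV payloads written to S3 unmodified:

```json
{
  "name": "snowplow-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "enriched-good",
    "s3.bucket.name": "my-snowplow-archive",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
    "flush.size": "10000"
  }
}
```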


#13

Hi @alex,
Where are you with support for Kafka to Kinesis? I am doing a quick proof of concept for our group and have managed to have the collector write to Kafka and enrichment read from and write to Kafka. I now need to get the data back into a Kinesis stream to support our existing Elasticsearch and other downstream clients of the data. I am investigating using Amazon’s KCL and rolling my own, but if there is something already implemented I would be very happy to try it.

Thanks,

Charlie


#14

Hey @charlie - we didn’t get round to this, but it shouldn’t be too difficult to write a generic Kafka-to-Kinesis bridge, probably using something like Kafka Connect. It doesn’t even need to be Snowplow-specific (it can simply mirror the Kafka stream)…


#15

Has there been any update on an enricher for the Kafka stream?
Right now we are trying to store data in PostgreSQL; to do so we are pulling from the collector and separating the TSV from our Kafka topic. We got it to sort of work with https://github.com/snowplow/snowplow/wiki/setting-up-stream-enrich, but we are still pulling and manipulating the TSV by hand.

We want to be able to stream straight into a PostgreSQL DB, but we need to keep all data internal, i.e. no cloud services.
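For anyone attempting the same thing, here is a minimal hand-rolled sketch of consuming the enriched TSV and inserting it into Postgres. It is an illustration, not Snowplow code: the table name and column choice are hypothetical, the enriched-event field order is version-dependent (the Snowplow Analytics SDK is the safer way to map TSV fields), and psycopg2 is an assumed dependency.

```python
# Sketch: load Snowplow enriched TSV lines into Postgres.
# Field positions are version-dependent; verify the column order for your
# Snowplow version, or use the Snowplow Analytics SDK to get JSON instead.

def parse_enriched(line):
    """Split one enriched-event TSV line, mapping empty fields to None."""
    fields = line.rstrip("\n").split("\t")
    return [f if f != "" else None for f in fields]

def load_lines(lines, dsn="dbname=snowplow user=snowplow"):
    # psycopg2 imported lazily so parse_enriched stays dependency-free.
    import psycopg2  # pip install psycopg2-binary
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        for line in lines:
            fields = parse_enriched(line)
            # Hypothetical narrow table: a couple of fields plus the raw line,
            # assuming app_id is the first field of the enriched event.
            cur.execute(
                "INSERT INTO events (app_id, raw_tsv) VALUES (%s, %s)",
                (fields[0], line),
            )
    conn.close()
```

Storing the raw line alongside a few parsed columns keeps the loader tolerant of format changes while still allowing reprocessing later.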

Thanks!


#16

Hi @builtbyproxy - do you mean an enricher, or a loader?

Our Stream Enrich component has supported Kafka for a long while, and that continues to be maintained and enhanced.

In terms of a loader for an on-premise data warehouse, the question/note from earlier in this thread is still live:

Postgres is an interesting one - although I suspect most on-premise users of Snowplow would need something more OLAP-focused and horizontally scalable.