Data modelling for a real-time Kafka pipeline

Hello,

Currently we are using the CloudFront collector and we are populating Redshift using the web data model defined by Snowplow.

As we have limited storage, we want to stop storing atomic.events. So we decided to build a real-time pipeline using Kafka and Confluent Cloud. Our assumption was that we would precompute the scratch tables in KSQL and send them to Redshift as streams.

However, the current data model computes the scratch tables with window processing (e.g. ROW_NUMBER(), PARTITION BY, GROUP BY). If we use real-time streams we won't be able to reduce the number of events, and hence we will end up with more or less the same number of records as we would have in atomic.events.
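
For illustration, this is roughly the kind of reduction those window functions give us (e.g. keeping one row per session), which in a stream would have to become stateful processing; the field names are from the canonical event model, and the "first event per session" rule is just an example, not our actual model:

```python
# Illustrative only: the "one row per domain_sessionid" reduction that the batch model
# gets from ROW_NUMBER() OVER (PARTITION BY domain_sessionid ORDER BY derived_tstamp)
# becomes stateful processing when events arrive one at a time from a stream.
from typing import Dict

first_event_per_session: Dict[str, dict] = {}  # state a stream processor has to keep


def reduce_event(event: dict) -> None:
    """Keep only the earliest event seen so far for each session."""
    session_id = event["domain_sessionid"]
    current = first_event_per_session.get(session_id)
    if current is None or event["derived_tstamp"] < current["derived_tstamp"]:
        first_event_per_session[session_id] = event
```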

My question is: is there a defined data modelling approach for a real-time Snowplow pipeline? If not, how can we implement the current model through the real-time pipeline without ending up with an abundance of records in Redshift?


Hi Manav.

Here at PEBMED we had the same limited-storage issue.
To solve it, we built an application that unloads data from Redshift to S3, keeping only the last 90 days of data in Redshift. We built this application in Python, running on Lambda.

With the data saved in S3, we now access it via Redshift Spectrum.
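
For reference, a minimal sketch of that kind of unload-and-trim job, assuming a Lambda packaged with psycopg2, connection details in environment variables, and placeholder bucket and IAM role names (the actual application may differ):

```python
# Minimal sketch of a Lambda handler that archives Redshift rows older than 90 days
# to S3 (as Parquet) and then deletes them from the hot table.
# Assumptions: psycopg2 is packaged with the Lambda, connection details come from
# environment variables, and the bucket / IAM role names below are placeholders.
import os

import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM atomic.events WHERE collector_tstamp < DATEADD(day, -90, GETDATE())')
    TO 's3://example-archive-bucket/atomic-events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-unload-role'
    FORMAT AS PARQUET
"""

DELETE_SQL = """
    DELETE FROM atomic.events
    WHERE collector_tstamp < DATEADD(day, -90, GETDATE())
"""


def handler(event, context):
    """Unload events older than 90 days to S3, then remove them from Redshift."""
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn, conn.cursor() as cur:
            cur.execute(UNLOAD_SQL)  # write the old rows to S3 as Parquet
            cur.execute(DELETE_SQL)  # then drop them from the hot table
        # A VACUUM DELETE ONLY afterwards (outside this transaction) would reclaim the space.
    finally:
        conn.close()
```

Unloading as Parquet also keeps the archived data cheap to scan when Redshift Spectrum queries it later.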

@Manav_Veer_Gulati, the atomic event is immutable in the sense that its fields are preset according to the canonical event model, regardless of whether it is processed in batch or in real time.

If you want to work with the enriched streamed data, you can use one of the Snowplow Analytics SDKs; the structure of the enriched data can be inferred from that code. You could define your own Lambda to compose your own event structure. However, it will not be possible to process such an event in the Snowplow pipeline; you would need to rely on a separate (your own) process for that.
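
To make that concrete, here is a minimal sketch using the Python Analytics SDK to turn an enriched TSV event into JSON and then compose a slimmed-down record; the field selection and the `to_custom_record` helper are purely illustrative, and the Kafka consumer that would feed it is omitted.

```python
# Minimal sketch: turn an enriched TSV event into JSON with the Python Analytics SDK,
# then compose a slimmed-down record of your own. The field selection below is purely
# illustrative; adapt it to whatever your downstream model needs.
from typing import Optional

from snowplow_analytics_sdk.event_transformer import transform
from snowplow_analytics_sdk.snowplow_event_transformation_exception import (
    SnowplowEventTransformationException,
)


def to_custom_record(enriched_tsv_line: str) -> Optional[dict]:
    """Turn one enriched event (a tab-separated line) into a reduced custom structure."""
    try:
        event = transform(enriched_tsv_line)  # dict keyed by canonical field names
    except SnowplowEventTransformationException as e:
        for message in e.error_messages:
            print(message)
        return None

    # Keep only the fields this hypothetical downstream structure cares about.
    return {
        "event_id": event.get("event_id"),
        "domain_sessionid": event.get("domain_sessionid"),
        "derived_tstamp": event.get("derived_tstamp"),
        "page_urlpath": event.get("page_urlpath"),
    }
```

As noted above, anything composed this way sits outside the Snowplow pipeline, so loading it into Redshift is left to your own process.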