Data modelling for a real-time Kafka pipeline

Hello,

Currently we are using the CloudFront collector and we are populating Redshift using the web data model defined by Snowplow.

As we have limited storage, we want to stop storing atomic.events. So we decided to build a real-time pipeline using Kafka and Confluent Cloud. Our assumption was that we would precompute the scratch tables in KSQL and send them to Redshift as streams.

However, the current data model computes the scratch tables with window processing (e.g. ROW_NUMBER(), PARTITION BY, GROUP BY). If we use real-time streams we won't be able to reduce the number of events, and hence we will end up with more or less the same number of records as we would have in atomic.events.
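
For illustration, this is roughly the kind of reduction those window functions give us (e.g. keeping one row per session), which in a stream would have to become stateful processing; the field names are from the canonical event model, and the "first event per session" rule is just an example, not our actual model:

```python
# Illustrative only: the "one row per domain_sessionid" reduction that the batch model
# gets from ROW_NUMBER() OVER (PARTITION BY domain_sessionid ORDER BY derived_tstamp)
# becomes stateful processing when events arrive one at a time from a stream.
from typing import Dict

first_event_per_session: Dict[str, dict] = {}  # state a stream processor has to keep


def reduce_event(event: dict) -> None:
    """Keep only the earliest event seen so far for each session."""
    session_id = event["domain_sessionid"]
    current = first_event_per_session.get(session_id)
    if current is None or event["derived_tstamp"] < current["derived_tstamp"]:
        first_event_per_session[session_id] = event
```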

My question is: is there a defined data modelling approach for a real-time Snowplow pipeline? If not, how can we implement the current model through the real-time pipeline without ending up with an abundance of records in Redshift?


Hi Manav.

Here at PEBMED we had the same limited-storage issue.
To solve it, we built an application that unloads data from Redshift to S3, keeping only the last 90 days of data in Redshift. We built this application in Python, running on Lambda.

With the data saved in S3, we now access it via Redshift Spectrum.
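
For reference, a minimal sketch of that kind of unload-and-trim job, assuming a Lambda packaged with psycopg2, connection details in environment variables, and placeholder bucket and IAM role names (the actual application may differ):

```python
# Minimal sketch of a Lambda handler that archives Redshift rows older than 90 days
# to S3 (as Parquet) and then deletes them from the hot table.
# Assumptions: psycopg2 is packaged with the Lambda, connection details come from
# environment variables, and the bucket / IAM role names below are placeholders.
import os

import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM atomic.events WHERE collector_tstamp < DATEADD(day, -90, GETDATE())')
    TO 's3://example-archive-bucket/atomic-events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-unload-role'
    FORMAT AS PARQUET
"""

DELETE_SQL = """
    DELETE FROM atomic.events
    WHERE collector_tstamp < DATEADD(day, -90, GETDATE())
"""


def handler(event, context):
    """Unload events older than 90 days to S3, then remove them from Redshift."""
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn, conn.cursor() as cur:
            cur.execute(UNLOAD_SQL)  # write the old rows to S3 as Parquet
            cur.execute(DELETE_SQL)  # then drop them from the hot table
        # A VACUUM DELETE ONLY afterwards (outside this transaction) would reclaim the space.
    finally:
        conn.close()
```

Unloading as Parquet also keeps the archived data cheap to scan when Redshift Spectrum queries it later.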

@Manav_Veer_Gulati, the atomic event is immutable in the sense that its fields are preset according to the canonical event model, regardless of whether it is processed in batch or in real time.

If you want to work with the enriched streamed data, you can use one of the Snowplow Analytics SDKs; the structure of the enriched data can be inferred from that code. You could define your own Lambda to compose your own event structure. However, it will not be possible to process such an event in the Snowplow pipeline; you would need to rely on a separate (your own) process for that.
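
To make that concrete, here is a minimal sketch using the Python Analytics SDK to turn an enriched TSV event into JSON and then compose a slimmed-down record; the field selection and the `to_custom_record` helper are purely illustrative, and the Kafka consumer that would feed it is omitted.

```python
# Minimal sketch: turn an enriched TSV event into JSON with the Python Analytics SDK,
# then compose a slimmed-down record of your own. The field selection below is purely
# illustrative; adapt it to whatever your downstream model needs.
from typing import Optional

from snowplow_analytics_sdk.event_transformer import transform
from snowplow_analytics_sdk.snowplow_event_transformation_exception import (
    SnowplowEventTransformationException,
)


def to_custom_record(enriched_tsv_line: str) -> Optional[dict]:
    """Turn one enriched event (a tab-separated line) into a reduced custom structure."""
    try:
        event = transform(enriched_tsv_line)  # dict keyed by canonical field names
    except SnowplowEventTransformationException as e:
        for message in e.error_messages:
            print(message)
        return None

    # Keep only the fields this hypothetical downstream structure cares about.
    return {
        "event_id": event.get("event_id"),
        "domain_sessionid": event.get("domain_sessionid"),
        "derived_tstamp": event.get("derived_tstamp"),
        "page_urlpath": event.get("page_urlpath"),
    }
```

As noted above, anything composed this way sits outside the Snowplow pipeline, so loading it into Redshift is left to your own process.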