Currently we are using the CloudFront collector and we are populating Redshift using the web data model defined by Snowplow.
As we have limited storage, we want to stop storing atomic.events. So we decided to build a real-time pipeline using Kafka and Confluent Cloud. Our plan was to precompute the scratch tables in ksqlDB and send them as streams into Redshift.
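To illustrate, this is roughly what we had in mind (a sketch only; the topic name `enriched-good` and the field list are assumptions, and the exact ksqlDB syntax may differ between versions):

```sql
-- Register the enriched events topic as a ksqlDB stream.
-- Topic name and field list are illustrative, not our real config.
CREATE STREAM enriched_events (
  event             VARCHAR,
  domain_sessionid  VARCHAR,
  domain_userid     VARCHAR,
  derived_tstamp    VARCHAR
) WITH (
  KAFKA_TOPIC  = 'enriched-good',
  VALUE_FORMAT = 'JSON'
);

-- Precompute a "scratch"-style aggregate as a continuously updated
-- table, instead of rebuilding it from atomic.events in Redshift.
CREATE TABLE sessions_scratch AS
  SELECT domain_sessionid,
         COUNT(*) AS event_count
  FROM enriched_events
  GROUP BY domain_sessionid
  EMIT CHANGES;
```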
But the current scratch data model relies on window processing (e.g. ROW_NUMBER(), PARTITION BY, GROUP BY). If we move to real-time streams we won't be able to reduce the number of events, so we will end up with more or less the same number of records as we would have in atomic.events.
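As a concrete example of the kind of step I mean, the batch model does something along these lines (heavily simplified; the table and column names here are approximate, e.g. `page_view_id` really comes from the web page context rather than atomic.events directly):

```sql
-- Batch-style dedupe in Redshift: keep one row per page view by
-- ranking events within a partition. A per-event stream can't do
-- this easily because it never sees the whole partition at once.
SELECT *
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY page_view_id
           ORDER BY derived_tstamp
         ) AS rn
  FROM atomic.events AS e
  WHERE event = 'page_view'
) deduped
WHERE rn = 1;
```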
My question is: is there a defined data modelling approach for a real-time Snowplow pipeline? If not, how can we implement the current model through the real-time pipeline without ending up with an overabundance of records in Redshift?