Batch vs. real-time enrichment


#1

Do the enrichment transformations that occur in both real-time and batch produce the same data?

Thanks!
–Kerry


#2

Hi @kerrylev,

It totally depends on the enrichment. If the underlying lookup data is the same, then yes, the data attached to the event by the enrichment should be the same!
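
To make that concrete, here is a minimal sketch of the idea. It is not Snowplow code: the `enrich` function and the two lookup dictionaries are made up for illustration, standing in for something like an IP lookup database that gets updated over time.

```python
from typing import Dict


def enrich(event: Dict[str, str], ip_lookup: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical enrichment: attach a country using the supplied lookup data."""
    enriched = dict(event)  # copy so the raw event is left untouched
    enriched["geo_country"] = ip_lookup.get(event["user_ipaddress"], "unknown")
    return enriched


event = {"event_id": "e1", "user_ipaddress": "203.0.113.7"}

# Two snapshots of the same (made-up) lookup database at different times.
lookup_old = {"203.0.113.7": "GB"}
lookup_new = {"203.0.113.7": "IE"}

realtime_result = enrich(event, lookup_old)   # enriched as the event arrives
batch_same_data = enrich(event, lookup_old)   # batch run with the same snapshot
batch_newer_data = enrich(event, lookup_new)  # batch run after the data changed
```

With the same snapshot the two results are identical; only the run that sees newer lookup data differs.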


#3

@alex's answer is technically correct, but the practical implications of the choice you're about to make are a little more complex.
Streaming and batch enrichment produce exactly the same output for the same input, provided the enrichments are configured identically and run in the same timeframe. However, batch ETL (which includes enrichment, shredding and data loading) will shred the enriched data into individual context tables, while similar functionality is not yet available for streaming solutions. Some progress is being made to bridge the gap, but it has not yet been tested, scheduled for a release or documented.
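
For anyone unfamiliar with shredding, the rough idea can be sketched as below. This is a simplified illustration and not the actual shredder: `table_name` and `shred` are hypothetical helpers, the `com.acme` schemas are invented, and the table naming and `root_id` key are assumptions modelled on the usual vendor/name/model convention.

```python
from collections import defaultdict
from typing import Any, Dict, List


def table_name(schema_uri: str) -> str:
    """Derive a table name from a schema URI such as
    'iglu:com.acme/checkout/jsonschema/1-0-0' (assumed naming convention)."""
    vendor, name, _format, version = schema_uri[len("iglu:"):].split("/")
    model = version.split("-")[0]  # keep only the major (model) version
    return f"{vendor.replace('.', '_')}_{name}_{model}"


def shred(event: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Split one enriched event into a flat events row plus one row per context."""
    tables: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    # Everything except the nested contexts goes into the main events table.
    tables["events"].append({k: v for k, v in event.items() if k != "contexts"})
    for ctx in event.get("contexts", []):
        # root_id lets each context row be joined back to its parent event.
        row = {"root_id": event["event_id"], **ctx["data"]}
        tables[table_name(ctx["schema"])].append(row)
    return dict(tables)


enriched_event = {
    "event_id": "e1",
    "geo_country": "GB",
    "contexts": [
        {"schema": "iglu:com.acme/checkout/jsonschema/1-0-0", "data": {"total": 42}},
        {"schema": "iglu:com.acme/user/jsonschema/2-1-0", "data": {"plan": "pro"}},
    ],
}

tables = shred(enriched_event)
```

Here `tables` ends up with an `events` entry plus one entry per context schema, which is the per-context layout the batch pipeline loads and the streaming side does not yet produce.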


#4

Perfect. Thanks for the responses, @alex and @dashirov-ga.

–Kerry


#5

Hi @alex and @dashirov-ga. Do the streaming and batch enrichment processes need to be configured identically and executed in the same time frame? I'm asking about current Snowplow. I know this thread is 2 years old.


#6

Snowplow is flexible to a certain extent. You can build it in various ways, so it is ambiguous what you mean by "current Snowplow".

Running the two enrichment processes in the same time frame is impossible by the sheer nature of the two architectures. The real-time pipeline processes events as they arrive, while the batch pipeline processes accumulated events in periodic runs.

You can combine the two by building a so-called Lambda architecture. The data will be available in (near) real time, yet you can still shred it and load it into Redshift (batch pipeline) if required.
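
As a toy illustration of the Lambda idea (not how Snowplow itself is wired up), the sketch below feeds every event to both a speed layer and a batch layer. The `LambdaPipeline` class, its method names and the `page` field are all hypothetical, and a simple page-view count stands in for the real enrichment, shredding and loading work.

```python
from collections import Counter
from typing import Dict, List


class LambdaPipeline:
    """Toy Lambda architecture: one input feeds a batch layer and a speed layer."""

    def __init__(self) -> None:
        self.raw_log: List[Dict[str, str]] = []   # batch layer: append-only raw events
        self.batch_counts: Counter = Counter()    # view rebuilt by each batch run
        self.realtime_counts: Counter = Counter() # speed layer: events since last run

    def ingest(self, event: Dict[str, str]) -> None:
        # Every event goes to both layers.
        self.raw_log.append(event)
        self.realtime_counts[event["page"]] += 1

    def run_batch(self) -> None:
        # Recompute the batch view from the full raw log, then reset the
        # speed layer, since the batch view now covers those events.
        self.batch_counts = Counter(e["page"] for e in self.raw_log)
        self.realtime_counts.clear()

    def query(self, page: str) -> int:
        # Serving layer: merge the batch view with the recent real-time view.
        return self.batch_counts[page] + self.realtime_counts[page]


pipeline = LambdaPipeline()
pipeline.ingest({"page": "/home"})
pipeline.ingest({"page": "/home"})
before_batch = pipeline.query("/home")  # answered by the speed layer only

pipeline.run_batch()
pipeline.ingest({"page": "/home"})
after_batch = pipeline.query("/home")   # batch view plus one newer event
```

The point of the design is that queries never wait for the batch run: recent events are served from the speed layer until the next batch run absorbs them.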