Stream vs Batch


Curious, how many are using Stream vs Batch or both?

We are using both.


Too bad Discourse doesn’t support polls! We’re (obviously) using both at Snowplow.


We’re just using batch at Metail.

What is your use case for both?


The common understanding I have is that you can use batch for the majority of analysis, and stream for real-time actioning or decisioning - to go into other applications (mobile notifications, triggered emails etc.)

Oh, and for real time dashboards as well.


We use batch, run it hourly, and trust this for our data marts downstream. We use stream to get real-time updates when we can’t wait for the hourly batch jobs. We see small differences between batch and stream and are trying to figure out why stream sometimes has fewer events than batch.


I think I’d like to know what a dual setup looks like; the closest I’ve gotten to the real-time pipeline is our test Snowplow Mini instance.

My naive imagination is something like running one collector which writes to Kinesis, which writes the enriched and shredded data to S3 for batch analyses, perhaps loading the batches into a DB, and certainly getting it onto a dashboard. You also run some custom analysis on the real-time stream which goes into your dashboards, but don’t persist this for very long. However, now that I’ve skimmed the stream enrich docs, it sounds like it only does the enrich step.

Where do you duplicate the events in the pipeline? Are you able to write the output of the stream enrich to S3 and start the batch from the shredding step? Got any pros and cons of your chosen solution?



The common term for the infrastructure stack you are describing is the so-called ‘Lambda architecture’ (no relation to the AWS Lambda service), as described in this post:

I think the most common use case would look like this (at least that’s what I set up recently):

  • an ASG of Scala Stream Collector instances behind a load balancer, writing to a raw event Kinesis stream
  • 2 consumers of the raw Kinesis stream:
    - the S3 Loader app, sinking raw LZO files to an S3 bucket
    - the Stream Enrich app (also in an ASG), sinking enriched events to enriched_good and enriched_bad Kinesis streams
  • an Elasticsearch Loader app consuming the enriched_good Kinesis stream and sinking events into your Elasticsearch cluster
  • a second Elasticsearch Loader app consuming the enriched_bad Kinesis stream and sinking events into the same cluster
  • EmrEtlRunner for the batch pipeline, reading the LZO files from the S3 bucket

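To make the fan-out concrete, here is a toy sketch in Python (lists standing in for Kinesis streams and S3, not the actual Snowplow components) showing why the same raw stream feeds both the batch and real-time paths, and why their counts can drift slightly:

```python
# Toy stand-ins for the real components: a "stream" is just a list of
# events, and each consumer reads the full raw stream independently,
# mirroring how the S3 Loader and Stream Enrich both consume the same
# raw Kinesis stream.

raw_stream = [
    {"event_id": 1, "payload": "page_view", "valid": True},
    {"event_id": 2, "payload": "garbled",   "valid": False},
    {"event_id": 3, "payload": "page_ping", "valid": True},
]

# Consumer 1: the "S3 Loader" sinks raw events untouched (batch input).
s3_bucket = list(raw_stream)

# Consumer 2: "Stream Enrich" splits events into good/bad streams
# that the Elasticsearch loaders consume.
enriched_good = [dict(e, enriched=True) for e in raw_stream if e["valid"]]
enriched_bad = [e for e in raw_stream if not e["valid"]]

# The batch pipeline (EmrEtlRunner) later reprocesses everything in
# the bucket from scratch, which is one reason batch and stream counts
# can differ: they are independent computations over the same raw data.
batch_good = [e for e in s3_bucket if e["valid"]]

print(len(s3_bucket), len(enriched_good), len(enriched_bad))  # 3 2 1
```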
It is convenient to have all instances in ASGs: it can be difficult to predict the performance of the RT pipeline, so you want monitoring in place, the ability to tune instance sizes and buffer settings, and ultimately autoscaling for your components, including the Kinesis shards. This is much easier than with e.g. the Clojure collector running on Elastic Beanstalk, where collector logs are published to S3 only once an hour, so downscaling requires extra work to prevent data loss.
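On sizing the Kinesis shards: each shard accepts writes of up to 1 MiB/s or 1,000 records/s (per the AWS quotas), so the shard count is driven by whichever constraint binds first. A minimal sketch of that calculation, which an autoscaler could recompute from CloudWatch metrics before calling UpdateShardCount (the function name and inputs here are illustrative, not from any Snowplow component):

```python
import math

# Kinesis write limits per shard (AWS documented quotas):
# 1 MiB/s of data and 1,000 records/s.

def required_shards(records_per_sec: float, avg_record_kb: float) -> int:
    """Shards needed to absorb a given event rate and average size."""
    by_throughput = math.ceil(records_per_sec * avg_record_kb / 1024)
    by_record_count = math.ceil(records_per_sec / 1000)
    return max(1, by_throughput, by_record_count)

# e.g. 5,000 events/s at ~2 KB each is throughput-bound:
print(required_shards(5000, 2))  # 10
```

The same arithmetic applies to the enriched streams, which carry larger (enriched) payloads at roughly the same record rate as the raw stream.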