Stream vs Batch

The common term for the infrastructure stack you are describing is the so-called ‘Lambda architecture’ (no relation to the AWS Lambda service), as described in this post:

https://discourse.snowplow.io/t/how-to-setup-a-lambda-architecture-for-snowplow/249

I think the most common setup would look like this (at least that’s what I set up recently); a quick provisioning sketch follows the list:

  • an ASG with Scala Stream Collector instances behind a load balancer, writing to a raw event Kinesis stream
  • 2 consumers of the raw Kinesis stream:
    - the S3 Loader app, sinking raw LZO files to an S3 bucket
    - the scala-stream-enrich app (also in an ASG), sinking enriched events to enriched_good and enriched_bad Kinesis streams
  • an Elasticsearch Loader app consuming the enriched_good Kinesis stream and sinking events into your Elasticsearch cluster
  • a second Elasticsearch Loader app consuming the enriched_bad Kinesis stream and sinking those events into the same cluster
  • EmrEtlRunner for the batch pipeline, reading the LZO files from the S3 bucket
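
For reference, here is a minimal sketch (Python/boto3) of provisioning the three Kinesis streams the pipeline above relies on. The stream names, shard counts and region are just assumptions for illustration; they need to match whatever you put in your collector, enrich and loader configs.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # assumed region

# Hypothetical stream names -- align these with your component configs
streams = {
    "snowplow-raw-good": 2,       # written by the collectors, read by S3 Loader and Stream Enrich
    "snowplow-enriched-good": 2,  # written by Stream Enrich, read by the Elasticsearch Loader
    "snowplow-enriched-bad": 1,   # failed enrichments, read by the second Elasticsearch Loader
}

for name, shards in streams.items():
    kinesis.create_stream(StreamName=name, ShardCount=shards)

# Wait until the streams are ACTIVE before starting collectors and enrichers
waiter = kinesis.get_waiter("stream_exists")
for name in streams:
    waiter.wait(StreamName=name)
```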

It is convenient to have all instances in ASGs, since the performance of the real-time pipeline is hard to predict: you want monitoring in place so you can tune instance sizes and buffer settings, and eventually set up autoscaling for your components, including the Kinesis shards. This is much easier than with, say, the Clojure Collector running on Elastic Beanstalk, where the collector logs are only published to S3 once an hour, so scaling down requires extra work to avoid data loss.
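
Kinesis has no built-in shard autoscaling, so this is roughly the kind of thing you end up scripting once the monitoring is in place. The sketch below is only illustrative: the stream name, the 15-minute window, and the ~1 MB/s-per-shard headroom heuristic are all assumptions.

```python
from datetime import datetime, timedelta, timezone

import boto3

STREAM = "snowplow-raw-good"  # hypothetical stream name
kinesis = boto3.client("kinesis", region_name="eu-west-1")
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")


def incoming_bytes_per_second(stream, minutes=15):
    """Average ingest rate over the last `minutes`, taken from CloudWatch."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=minutes * 60,
        Statistics=["Sum"],
    )
    datapoints = stats["Datapoints"]
    return datapoints[0]["Sum"] / (minutes * 60) if datapoints else 0.0


def rescale(stream):
    rate = incoming_bytes_per_second(stream)
    # A shard ingests at most ~1 MB/s; keep roughly 50% headroom (assumed heuristic)
    target = max(1, int(rate / (1024 * 1024 * 0.5)) + 1)
    current = kinesis.describe_stream_summary(StreamName=stream)[
        "StreamDescriptionSummary"]["OpenShardCount"]
    if target != current:
        kinesis.update_shard_count(
            StreamName=stream,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )


rescale(STREAM)
```

In practice you would run something like this on a schedule (or use alarms/step scaling), and remember UpdateShardCount only allows doubling/halving per call, so large changes take several passes.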
