Stream vs Batch

The common term for the infrastructure stack you are describing is the so-called ‘Lambda architecture’ (no relation to the AWS Lambda service), as described in this post:

https://discourse.snowplow.io/t/how-to-setup-a-lambda-architecture-for-snowplow/249

I think the most common setup would look like this (at least that’s what I set up recently); a quick provisioning sketch follows the list:

  • an ASG with Scala Stream Collector instances behind a load balancer, writing to a raw event Kinesis stream
  • 2 consumers of the raw Kinesis stream:
    - the S3 Loader app, sinking raw LZO files to an S3 bucket
    - the scala-stream-enrich app (also in an ASG), sinking enriched events to enriched_good and enriched_bad Kinesis streams
  • an Elasticsearch Loader app consuming the enriched_good Kinesis stream and sinking events into your Elasticsearch cluster
  • a second Elasticsearch Loader app consuming the enriched_bad Kinesis stream and sinking those events into the same cluster
  • EmrEtlRunner for the batch pipeline, reading the LZO files from the S3 bucket
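
For reference, here is a minimal sketch (Python/boto3) of provisioning the three Kinesis streams the pipeline above relies on. The stream names, shard counts and region are just assumptions for illustration; they need to match whatever you put in your collector, enrich and loader configs.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # assumed region

# Hypothetical stream names -- align these with your component configs
streams = {
    "snowplow-raw-good": 2,       # written by the collectors, read by S3 Loader and Stream Enrich
    "snowplow-enriched-good": 2,  # written by Stream Enrich, read by the Elasticsearch Loader
    "snowplow-enriched-bad": 1,   # failed enrichments, read by the second Elasticsearch Loader
}

for name, shards in streams.items():
    kinesis.create_stream(StreamName=name, ShardCount=shards)

# Wait until the streams are ACTIVE before starting collectors and enrichers
waiter = kinesis.get_waiter("stream_exists")
for name in streams:
    waiter.wait(StreamName=name)
```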

It is convenient to have all instances in ASGs, since the performance of the real-time pipeline is hard to predict: you want monitoring in place so you can tune instance sizes and buffer settings, and eventually set up autoscaling for your components, including the Kinesis shards. This is much easier than with, say, the Clojure Collector running on Elastic Beanstalk, where the collector logs are only published to S3 once an hour, so scaling down requires extra work to avoid data loss.
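
Kinesis has no built-in shard autoscaling, so this is roughly the kind of thing you end up scripting once the monitoring is in place. The sketch below is only illustrative: the stream name, the 15-minute window, and the ~1 MB/s-per-shard headroom heuristic are all assumptions.

```python
from datetime import datetime, timedelta, timezone

import boto3

STREAM = "snowplow-raw-good"  # hypothetical stream name
kinesis = boto3.client("kinesis", region_name="eu-west-1")
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")


def incoming_bytes_per_second(stream, minutes=15):
    """Average ingest rate over the last `minutes`, taken from CloudWatch."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "StreamName", "Value": stream}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=minutes * 60,
        Statistics=["Sum"],
    )
    datapoints = stats["Datapoints"]
    return datapoints[0]["Sum"] / (minutes * 60) if datapoints else 0.0


def rescale(stream):
    rate = incoming_bytes_per_second(stream)
    # A shard ingests at most ~1 MB/s; keep roughly 50% headroom (assumed heuristic)
    target = max(1, int(rate / (1024 * 1024 * 0.5)) + 1)
    current = kinesis.describe_stream_summary(StreamName=stream)[
        "StreamDescriptionSummary"]["OpenShardCount"]
    if target != current:
        kinesis.update_shard_count(
            StreamName=stream,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )


rescale(STREAM)
```

In practice you would run something like this on a schedule (or use alarms/step scaling), and remember UpdateShardCount only allows doubling/halving per call, so large changes take several passes.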
