What is the Lambda Architecture?
In the Big Data book, Nathan Marz came up with the term Lambda Architecture for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.
Note: when we say real-time, we actually mean near-real-time and the time delay between the occurrence of an event and the availability of any processed data from that event. In the Lambda architecture, real-time is the ability to process the delta of data that has been captured after the start of the batch layers current iteration and now.
Here’s how it looks like, from a high-level perspective (ref: http://lambda-architecture.net/):
- All data entering the system is dispatched to both the batch layer and the speed layer for processing.
- The batch layer has two functions:
- managing the master dataset (an immutable, append-only set of raw data)
- pre-compute the batch views.
- The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
- The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
- Any incoming query can be answered by merging results from batch views and real-time views.
Snowplow approach to Lambda Architecture
At Snowplow, the Real-Time pipeline is built on Lambda Architecture. Below is the diagram depicting the data flow and the components used to implement it.
To be more specific, the list of the Snowplow components is outlined below:
- Scala Stream Collector as a representation of the data source
- EmrEtlRunner and StorageLoader represent batch layer where Kinesis S3 processes streaming data and makes it available/suitable for batch consumption
- Kinesis Enrich and Kinesis Elasticsearch Sink are utilised to build speed (real-time) layer
- The serving layer is depicted as Redshift data warehouse but not limited to it
- Raw Events Stream is common to both Real-Time and Batch pipelines
Batch layer is implemented as a standard Snowplow batch pipeline with
config.yml) pointing to the Amazon S3 bucket to which the data is sinked with Kinesis S3
- Real-Time layer:
- you would need at least 2 streams (for enriched and bad events)
- you could add a custom component, a stream job such as in AWS Lambda for example
Please, feel free to express your thoughts and ideas.
Starting from R102 release the events enriched in Kinesis Enrich component of the real-time pipeline could be fed into the batch pipeline with EmrEtlRunner running in Stream Enrich mode. The relevant diagram depicting this flow follows.