'Serverless' Snowplow architecture

alex · May 30, 2017, 7:53pm

Hey @Graham-M - interesting architecture!

It mixes batch and the real-time functions, but I haven’t read anything that says this can’t be done at present.

Mixing batch and real-time is fine - after all it’s the basis for our recommended Lambda architecture.

As a novice, I’d like to know if it’s appropriate for the Kinesis S3 sink to be used at this point in the chain? Or is it to be used for un-enriched data only?

It’s not well documented yet, but yes the Kinesis S3 sink works fine with enriched data - just configure it with the gzip output.

Kinesis S3 sink (likely a CloudWatch event starting a container every minute, with a large buffer specified, so work won’t be undertaken every minute, only when the buffer’s filled).

This feels rather complex. If you have a steady stream of events, why not just leave a small Kinesis S3 sink running all the time in an ASG? If you use some kind of scheduled approach, then you have to reason about the schedule, the shards and the sink buffer settings to understand how the component will perform.
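To make that reasoning concrete, here is a rough sketch of the arithmetic involved. It assumes the sink flushes as soon as any one of three buffer limits (bytes, records, elapsed time) is reached and that events arrive at a steady rate. The function name, parameter names and numbers are all made up for illustration and are not taken from the sink's actual configuration.

```python
def time_to_flush(events_per_sec, avg_event_bytes,
                  byte_limit, record_limit, time_limit_sec):
    """Estimate when a sink buffer flushes at a steady event rate.

    Assumes (hypothetically) that the buffer flushes when the first of
    three limits is reached: total bytes, record count, or elapsed time.
    Returns (seconds_until_flush, name_of_limit_that_triggered).
    """
    if events_per_sec <= 0:
        # With no traffic, only the time limit can trigger a flush.
        return float(time_limit_sec), "time_limit"

    candidates = {
        "byte_limit": byte_limit / (events_per_sec * avg_event_bytes),
        "record_limit": record_limit / events_per_sec,
        "time_limit": float(time_limit_sec),
    }
    # The limit reached soonest is the one that causes the flush.
    trigger = min(candidates, key=candidates.get)
    return candidates[trigger], trigger


if __name__ == "__main__":
    # Illustrative numbers only: 100 events/s of ~1 KB each.
    seconds, trigger = time_to_flush(
        events_per_sec=100, avg_event_bytes=1000,
        byte_limit=10_000_000, record_limit=50_000, time_limit_sec=600,
    )
    print(f"flush after ~{seconds:.0f}s, triggered by {trigger}")
```

With a scheduled container you would also have to compare this figure against how long each invocation stays alive and how many shards it reads from, which is the extra reasoning an always-on sink avoids.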

AWS Lambda to start a container running storage-loader.

You are missing Hadoop Shred, which currently only runs on EMR. This component sits between the enriched events and the StorageLoader.

Hope this helps,

Alex
