'Serverless' Snowplow architecture

Hey @Graham-M - interesting architecture!

It mixes batch and the real-time functions, but I haven’t read anything that says this can’t be done at present.

Mixing batch and real-time is fine - after all, it’s the basis for our recommended Lambda architecture.

  1. As a novice, I’d like to know if it’s appropriate for the Kinesis S3 sink to be used at this point in the chain? Or is it to be used for un-enriched data only?

It’s not well documented yet, but yes, the Kinesis S3 sink works fine with enriched data - just configure it to write gzip output.

Kinesis S3 sink (likely a CloudWatch event starting a container every minute, with a large buffer specified, so work won’t be undertaken every minute, only when the buffer’s filled).

This feels rather complex. If you have a steady stream of events, why not just leave a small Kinesis S3 sink running all the time in an ASG? If you use some kind of scheduled approach, then you have to reason about the schedule, the shards and the sink buffer settings to understand how the component will perform.
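To give a feel for that reasoning, here is a rough Python sketch that estimates how long a sink buffer takes to fill under a steady event rate. The names `SinkBuffer` and `seconds_until_flush` are hypothetical, and the three limits (bytes, records, time) are only an assumption about how the sink's buffer is configured, so treat it as back-of-the-envelope arithmetic rather than a description of the sink's actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class SinkBuffer:
    """Hypothetical model of a sink buffer with three flush triggers."""
    byte_limit: int       # flush once this many bytes are buffered
    record_limit: int     # flush once this many records are buffered
    time_limit_s: float   # flush after this many seconds regardless


def seconds_until_flush(buf: SinkBuffer,
                        events_per_s: float,
                        avg_event_bytes: float) -> float:
    """Estimate the seconds until the first flush trigger fires.

    Assumes a constant event rate and a constant average event size.
    """
    if events_per_s <= 0:
        # With no traffic, only the time limit can trigger a flush.
        return buf.time_limit_s
    by_records = buf.record_limit / events_per_s
    by_bytes = buf.byte_limit / (events_per_s * avg_event_bytes)
    return min(by_records, by_bytes, buf.time_limit_s)


if __name__ == "__main__":
    # Illustrative numbers only, not recommended settings.
    buf = SinkBuffer(byte_limit=100_000_000,
                     record_limit=500_000,
                     time_limit_s=600.0)
    flush_s = seconds_until_flush(buf, events_per_s=200, avg_event_bytes=2000)
    schedule_s = 60.0  # a container started every minute
    print(f"estimated flush after {flush_s:.0f}s, "
          f"i.e. about {flush_s / schedule_s:.1f} schedule intervals")
```

With these made-up numbers the byte limit is hit first, several schedule intervals after the first event arrives. Change the event rate, the shard count feeding the sink or any one limit and the answer moves, which is why an always-on sink is easier to reason about.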

AWS Lambda to start a container running storage-loader.

You are missing Hadoop Shred, which currently only runs on EMR. This component sits between the enriched events and the StorageLoader.

Hope this helps,

Alex