This is very interesting @alex. We're also venturing into the real-time world and wondered if we could skip the batch pipeline's EMR step and load events straight from the Kinesis enriched stream into Redshift. Is that possible/recommended? It seems pointless to enrich the same data twice (Stream Enrich + EMR), but we would (1) lose the deduplication feature introduced in R89, and (2) have to adapt the Storage Loader to handle these files correctly.
Edit: Ah, I see now that Hadoop Shred currently only runs on EMR. Is porting that component to real-time on the product roadmap?
Edit 2: Never mind. This is all explained pretty well in the Spark RFC (Migrating the Snowplow batch jobs from Scalding to Spark) - some pretty exciting months ahead! Looking forward to it!