'Serverless' Snowplow architecture

This is very interesting @alex. We’re also venturing into the real-time world and wondered if we could skip the batch pipeline’s EMR step and load events straight from the Kinesis enriched stream into Redshift. Is that possible/recommended? It seems pointless to enrich the same data twice (Stream Enrich + EMR), but we would (1) lose the de-duplication feature introduced in R89, and (2) have to adapt the Storage Loader to handle these files correctly.

Edit: Ah, I see now that Hadoop Shred currently only runs on EMR. Is porting that component to real time on the product roadmap?
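For anyone else following along: conceptually, Hadoop Shred splits each enriched event's self-describing JSONs out into per-schema rows so they can be loaded into separate Redshift tables. Here is a toy sketch of that idea only; it is not the real Hadoop Shred logic, and the event structure and field names below are made up for illustration:

```python
import json

def shred(event):
    """Group an event's attached contexts by their Iglu schema,
    tagging each row with the parent event_id (toy sketch only;
    the real enriched event format differs)."""
    rows = {}
    for ctx in event.get("contexts", []):
        rows.setdefault(ctx["schema"], []).append(
            {"event_id": event["event_id"], **ctx["data"]}
        )
    return rows

# Hypothetical enriched event with two contexts of the same schema.
event = {
    "event_id": "e1",
    "contexts": [
        {"schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
         "data": {"target": "/home"}},
        {"schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
         "data": {"target": "/about"}},
    ],
}
print(json.dumps(shred(event), indent=2))
```

The point being: that per-schema fan-out is exactly what's missing if you bypass EMR and load the enriched stream directly, which is why the shredding step can't simply be skipped.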

Edit 2: Never mind. This is all pretty well explained in the Spark RFC (Migrating the Snowplow batch jobs from Scalding to Spark) — some pretty exciting months ahead. Looking forward to it!
