This is very interesting @alex. We're also venturing into the real-time world and wondered if we could skip the batch pipeline's EMR step and load events straight from the Kinesis enriched stream into Redshift. Is that possible/recommended? It seems pointless to enrich the same data twice (Stream Enrich + EMR), but we would (1) lose the deduplication feature introduced in R89, and (2) have to adapt the Storage Loader to handle these files correctly.
Edit: Ah, I see now that Hadoop Shred currently only runs on EMR. Is porting that component to real-time on the product roadmap?
Edit 2: Never mind. This is all explained pretty well in the Spark RFC (Migrating the Snowplow batch jobs from Scalding to Spark) - some pretty exciting months ahead! Looking forward to it!