Shredding & loading enriched events in near-real-time

Hi @rgabo - it sounds like we are thinking about all this in the same way.

Note that Sluice is no more - it was removed in Snowplow R91.

Yes, that’s correct. Although we still use EmrEtlRunner internally for all core Snowplow pipelines, we use Dataflow Runner for our R&D and non-standard/non-Snowplow jobs on EMR.

Dataflow Runner is built around the Unix philosophy - all it does is run jobflows, currently on EMR only. You can schedule it any way you like. And it’s fully declarative - it’s just Avro, so you can generate, lint or visualise a dataflow any way you like. (We are also planning a native integration between Factotum and Dataflow Runner in the future, so you get the kind of “view-through” that you described between Airflow and EmrEtlRunner.)
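To make the "generate, lint" point concrete, here is a minimal Python sketch of what treating a playbook as plain data could look like. Note that the `steps` and `name` fields and the `build_playbook` / `lint_playbook` helpers are hypothetical and only for illustration - this is not the actual Dataflow Runner playbook schema, so check the real schema before adapting it.

```python
import json


def build_playbook(step_names):
    """Generate a playbook-like dict from a list of step names.

    The {"steps": [{"name": ...}]} shape is an assumption for illustration,
    not the real Dataflow Runner playbook format.
    """
    return {"steps": [{"name": name} for name in step_names]}


def lint_playbook(playbook):
    """Return a list of human-readable problems found in a playbook-like dict."""
    problems = []
    steps = playbook.get("steps", [])
    if not steps:
        problems.append("playbook has no steps")
    seen = set()
    for index, step in enumerate(steps):
        name = step.get("name")
        if not name:
            problems.append(f"step {index} has no name")
        elif name in seen:
            problems.append(f"duplicate step name: {name}")
        else:
            seen.add(name)
    return problems


if __name__ == "__main__":
    playbook = build_playbook(["shred", "load"])
    # Because the playbook is just data, it can be serialised, diffed or linted
    # with ordinary tooling before anything is submitted to a cluster.
    print(json.dumps(playbook, indent=2))
    print(lint_playbook(playbook))
```

Because the dataflow is only data, whatever scheduler you use can generate and check it in a separate step before anything runs.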

Note that a future release of EmrEtlRunner will generate Dataflow Runner playbooks, and a later release will then remove the EMR orchestration functionality from EmrEtlRunner altogether: