Hi @jacob.baumbach, @mike, @dashirov-ga,
A ton of great thoughts in this discussion! At the risk of missing the wood for the trees, I'll pick out a few points for additional comment:
I think this is a great list. One other thing I would add is the ability to perform a complete historic reprocessing of the raw events, if, for example, your business logic (not least the enrichments you have configured on your pipeline) changes.
This actually happens very rarely, but it's nice to have this option available. While you could in theory achieve the same through a Kappa architecture, in practice it would be slow and rather costly (because you would have to "replay" the entire event archive into Kinesis first).
However, this said, you don't actually need to be running the batch pipeline operationally in order to keep this option available for the future - as long as you are storing your collector payloads to S3 using Kinesis S3, you can always use the batch pipeline later for a full reprocessing.
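To make the "replay into Kinesis" cost concrete: a replay has to push the archived payloads back through `PutRecords`, which is capped at 500 records and 5 MB per call, so the archive must be re-batched before sending. Here is a minimal sketch of that batching step - the bucket/stream names in the comments are hypothetical, and `read_archive` is a stand-in for however you iterate your Kinesis S3 archive:

```python
MAX_RECORDS_PER_CALL = 500        # Kinesis PutRecords hard limit
MAX_BYTES_PER_CALL = 5 * 1024**2  # 5 MB per PutRecords call

def batch_for_put_records(payloads):
    """Group raw payloads (bytes) into batches that respect PutRecords limits."""
    batch, batch_bytes = [], 0
    for p in payloads:
        size = len(p)
        # Start a new batch if adding this payload would breach either limit
        if batch and (len(batch) == MAX_RECORDS_PER_CALL
                      or batch_bytes + size > MAX_BYTES_PER_CALL):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(p)
        batch_bytes += size
    if batch:
        yield batch

# The actual replay loop would then look something like this (boto3 assumed;
# names are illustrative, not from any Snowplow component):
#
#   import uuid, boto3
#   kinesis = boto3.client("kinesis")
#   for batch in batch_for_put_records(read_archive("s3://my-raw-archive/")):
#       kinesis.put_records(
#           StreamName="raw-events-replay",
#           Records=[{"Data": p, "PartitionKey": str(uuid.uuid4())}
#                    for p in batch])
```

Even batched optimally, a multi-year archive is a lot of `PutRecords` calls, which is why replaying through Kinesis is slower and costlier than pointing the batch pipeline straight at the S3 archive.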
I'm actually more bullish on Redshift drip-feeding. We did a private PoC about 3 years ago which went into production and was able to load very substantial volumes into Redshift every 5-10 minutes.
The challenge with drip-feeding Redshift centres on schema evolution and table management - unless you fully automate these complex requirements, you end up with a near-real-time load process which requires regular operator intervention.
Very true - I keep meaning to write a post about how there is no such thing as true stream processing in analytics - almost every operational system quickly implements micro-batching on top of the stream, massively improving throughput at the cost of a little additional latency. AWS Lambda is a great example of this.
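The micro-batching pattern itself is simple: buffer incoming events and flush whenever a count threshold or a time window is hit, whichever comes first. A minimal sketch (class and threshold names are illustrative, not from any Snowplow component):

```python
import time

class MicroBatcher:
    """Buffer stream events; flush on a record-count or age threshold."""

    def __init__(self, flush_fn, max_records=100, max_age_secs=1.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn        # called with each completed batch
        self.max_records = max_records
        self.max_age_secs = max_age_secs
        self.clock = clock              # injectable, e.g. for testing
        self.buffer = []
        self.opened_at = None

    def add(self, event):
        if not self.buffer:
            self.opened_at = self.clock()   # batch age starts at first event
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_records
                or self.clock() - self.opened_at >= self.max_age_secs):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

Tuning `max_records` up trades latency for throughput; tuning `max_age_secs` down does the reverse - which is exactly the dial every "streaming" analytics system ends up exposing.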
You see, with no dependency of one event in the pipeline on another, there's a lot that can be done in the time between batch runs. Why sit idle and then ride a massive processing spike on an over-provisioned EMR cluster (and pay through the nose for it), when a handful of small boxes could do the same amount of work if they weren't stopping and starting over and over again?
In the early days of Snowplow, it was very common for users to schedule a single run of the pipeline overnight so that their analysts had the data ready for them in the event warehouse in the morning.
Over the years we have moved to much more frequent runs - for our batch Managed Service customers we now set them up with an hourly pipeline kick-off. This helps to reduce data volumes per run, and crucially it means that we can discover issues such as missing Redshift tables or JSON Paths files as soon as possible.
The next steps on this trend are definitely towards long-running enrichment and load processes, whether that's running Snowplow batch on a persistent EMR cluster, or adding Redshift drip-feeding to the real-time pipeline. If you are interested in the former, this ticket has some info:
I'm sorry to hear that! If the Snowplow Managed Service is not an option for you, then yes hopefully the various open source stability and performance initiatives we have ongoing will lighten the load over the coming months.