We are thinking of potentially eliminating our batch pipeline. We will continue to write raw logs to s3, we need to generate a way to implement shredding in real-time and we will have our real-time pipeline write directly to Redshift. Are there any reasons not do this? We are still maintaining a data lake(where we write a general program that can translate old data to new schemas, etc) which makes me believe we are fine completely operating in real-time(assuming we get shredding in real-time and writing to redshift working for real-time), but I do not know if there is some other reason to keep batch. Is there some benefit to batch that I am missing?
On-premise Snowplow Realtime Pipeline with Spark Streaming Enrich
This has been discussed. Migrating the Snowplow batch jobs from Scalding to Spark
There’s a few benefits to batch that I can think of off the top of my head
- Batch is idempotent and recovers well from errors - if there’s failures it’s not difficult to fix the issue and trigger another run. Doing the same thing in real time (staging, processing, enriching) can be a bit more tricky if errors occur.
- Batch is one of the oldest parts of the Snowplow pipeline - it also makes it one of the most reliable and most well tested components.
- Batch loader currently performs shredding (which real time doesn’t) though achieving shredding in real time wouldn’t be too difficult to achieve.
- The output of batch loader (and input to storage-loader) produces files containing a large number of rows and minimises the number of database transactions that occur - this is favourable for inserting into Redshift. Redshift demonstrates great performance when inserting data in this manner. I haven’t seen anyone solve streaming/microbatching into Redshift in a performant way yet - that’s not to say it’s impossible but Redshift hasn’t been designed as a streaming database.
I have extremely long familiarity with redshift ( going back to ParAccel days ). And yes, you’re correct - micro-batching into redshift is not a great idea on very large volumes. That being said, redshift is not the point here. There are other snowplow sinks suitable for streaming - elastic search for example. Every sink can be configured to buffer for predefined time span or volume prior to sending data through. Even the existing streaming pipeline defines buffering strategy - snowplow does not process one event at a time anyways. With that, I’d rather sit on processed files ready to be inserted into the database, knowing that at the top of the hour I will see the data in the DB. And if there are errors, I will have 30-40 minutes to fix them before that dreadful top of the hour is there. You see, with no dependency of one event in the pipeline on the other there’s a lot that can be done during the time between batch runs. Why sit and do nothing and then ride a massive processing spike on an over-provisioned EMR cluster (and subsequently pay through the nose ) if a handful of small boxes can do the same amount of work, if they weren’t stoping and going over and over again?
All I do on a daily basis is trying to recover EMR ETL it from one failure or another. Today, for once, I’m completely dumbfounded - no idea if I’ll get the batch pipeline back into operational shape. On a handful of files it takes 45 minutes for it to fail and with no indication as per the reasons. Unfortunately, there’s no sound alternative for batch just yet. Hopefully scalding to spark migration will bring some stability - I’m keeping my fingers crossed.
A ton of great thoughts in this discussion! At the risk of missing the wood for the trees, I’ll pick out a few points for additional comment:
I think this is a great list. One other thing I would add is the ability to perform a complete historic reprocessing of the raw events, if for example your business logic (not least the enrichments you have configured on your pipeline) change.
This actually happens very rarely, but it’s nice to have this option available. While you could in theory achieve the same through a Kappa architecture, in practice it would be slow and rather costly (because you would have to “replay” the entire event archive into Kinesis first).
However, this said you don’t actually need to be running the batch pipeline operationally in order to have this option available to you for the future - as long as you are storing your collector payloads to S3 using Kinesis S3, you can always use the batch pipeline later for a full reprocessing.
I’m actually more bullish on Redshift drip-feeding. We did a private PoC about 3 years ago which went into production and was able to load very substantial volumes into Redshift every 5-10 minutes.
The challenge around drip-feeding Redshift is all around schema evolution and table management - unless you fully automate these complex requirements, you have a near-real-time load process which requires regular operator intervention.
Very true - I keep meaning to write a post about how there is no such thing as real stream processing in analytics - almost every operational system quickly implements micro-batches on top of it, to massively improve performance at the cost of a little additional latency. AWS Lambda is a great example of this.
You see, with no dependency of one event in the pipeline on the other there’s a lot that can be done during the time between batch runs. Why sit and do nothing and then ride a massive processing spike on an over-provisioned EMR cluster (and subsequently pay through the nose ) if a handful of small boxes can do the same amount of work, if they weren’t stoping and going over and over again?
In the early days of Snowplow, it was very common for users to schedule a single run of the pipeline overnight so that their analysts had the data ready for them in the event warehouse in the morning.
Over the years we have moved to much more frequent runs - for our batch Managed Service customers we now set them up with an hourly pipeline kick-off. This helps to reduce data volume sizes per run, and crucially it means that we can discover issues such as missing Redshift tables or JSON Paths files as soon as possible.
The next steps on this trend are definitely towards long-running enrichment and load processes, whether that’s running Snowplow batch on a persistent EMR cluster, or adding Redshift drip-feeding to the real-time pipeline. If you are interested in the former, this ticket has some info:
I’m sorry to hear that! If the Snowplow Managed Service is not an option for you, then yes hopefully the various open source stability and performance initiatives we have ongoing will lighten the load over the coming months.