To give you a little bit of background: I’m working on a project where all the infrastructure is on AWS. We are using Auto Scaling groups for components like Collector and Enrich, and it’s working great so far. Since the capacity of the Kinesis streams limits the throughput of those components, we have also started scaling the Kinesis streams.
While working on making the overall capacity of the pipeline more elastic we decided to investigate how to autoscale our Spark applications.
We are using Spark Structured Streaming to run different aggregations on enriched events. After giving Dynamic Allocation (Spark’s built-in autoscaling feature) a try, we found that it doesn’t work well for streaming, which basically confirms this:
In Spark Streaming, data comes in batches, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed. However, if the executor idle timeout is greater than the batch duration, executors are never removed. Therefore, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
Spark Streaming and Dynamic Allocation
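For reference, the Cloudera recommendation comes down to a single property, e.g. passed at submit time (the jar name here is just a placeholder; `spark.executor.instances` then pins a fixed executor count):

```shell
# Disable Spark's built-in autoscaling for the streaming job and
# run with a fixed number of executors instead.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=4 \
  streaming-app.jar
```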
I concluded that it’s better to build custom autoscaling logic using the Spark developer API: monitor the duration of the last x batches and, based on that, request or remove executors using the
requestExecutors and killExecutors methods that are available on SparkContext. Very similar to what’s explained here.
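A minimal sketch of the kind of policy I have in mind. All names and thresholds are hypothetical, and the Spark wiring (a StreamingQueryListener feeding batch durations in, and SparkContext.requestExecutors / killExecutors acting on the decision) is only outlined in the comments; the scaling decision itself is kept as a pure function:

```scala
// Hypothetical scaling policy: compare the average processing time of
// the last few batches against the trigger interval and decide on a
// new target executor count.
object ScalingPolicy {
  // In the real job this would be fed from a StreamingQueryListener's
  // onQueryProgress callback (event.progress gives per-batch timings),
  // and the returned target would be applied via the developer API on
  // SparkContext: requestExecutors(n) to add, killExecutors(ids) to remove.
  def decideTarget(current: Int,
                   recentBatchMs: Seq[Long],
                   triggerMs: Long,
                   min: Int = 1,
                   max: Int = 20): Int = {
    if (recentBatchMs.isEmpty) current // no data yet, keep current size
    else {
      val avg = recentBatchMs.sum.toDouble / recentBatchMs.size
      val target =
        if (avg > triggerMs) current + 1        // falling behind: add one
        else if (avg < triggerMs * 0.5) current - 1 // lots of headroom: remove one
        else current                            // keeping up: no change
      math.max(min, math.min(max, target))      // clamp to [min, max]
    }
  }
}
```

Scaling by one executor at a time keeps the policy conservative; a smarter version could size the step by how far the average lags the trigger interval.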
Have you faced this same problem? Can you please provide some suggestions?
Thanks in advance!