Autoscaling Spark Structured Streaming applications

Hi everyone!

To give you a little bit of background: I’m working on a project where all the infrastructure is on AWS. We are using Auto Scaling groups for components like Collector and Enrich, and they are working great so far. Since the capacity of the Kinesis streams limits the throughput of those components, we have also started scaling the Kinesis streams themselves.

While working on making the overall capacity of the pipeline more elastic we decided to investigate how to autoscale our Spark applications.

We are using Spark Structured Streaming to run different aggregations on enriched events. After trying Dynamic Allocation (Spark’s built-in autoscaling feature), we found that it doesn’t work well for streaming workloads, which basically confirms this:

> In Spark Streaming, data comes in batches, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed. However, if the executor idle timeout is greater than the batch duration, executors are never removed. Therefore, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.

— Spark Streaming and Dynamic Allocation
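
For reference, that recommendation corresponds to a single entry in `spark-defaults.conf` (or the equivalent `--conf` flag to `spark-submit`):

```
spark.dynamicAllocation.enabled  false
```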

I concluded that it’s better to build custom autoscaling logic using the Spark developer API: monitor the duration of the last N batches and, based on that, request or remove executors using the requestExecutors and killExecutor methods available on SparkContext. This is very similar to what’s explained here.
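
For what it’s worth, here is a minimal sketch of what the decision part of such logic could look like. The function name, thresholds, and window size are all assumptions of mine, not anything prescribed by Spark; the actual SparkContext calls are only shown in comments since they need a running cluster:

```scala
// Hypothetical scaling decision based on the durations of recent batches.
object ScalingSketch {
  // Returns +1 to request an executor, -1 to remove one, 0 to hold steady,
  // depending on whether recent batches keep up with the trigger interval.
  // The 90% / 50% thresholds are arbitrary placeholders.
  def decideScale(recentDurationsMs: Seq[Long], triggerIntervalMs: Long): Int =
    if (recentDurationsMs.isEmpty) 0
    else {
      val avg = recentDurationsMs.sum.toDouble / recentDurationsMs.size
      if (avg > 0.9 * triggerIntervalMs) 1        // falling behind: scale out
      else if (avg < 0.5 * triggerIntervalMs) -1  // plenty of headroom: scale in
      else 0
    }

  // In the real application this would run from a StreamingQueryListener's
  // onQueryProgress callback, roughly:
  //
  //   decideScale(lastDurations, triggerMs) match {
  //     case 1  => sc.requestExecutors(1)
  //     case -1 => sc.killExecutor(someExecutorId)
  //     case _  => ()
  //   }
}
```

The only Spark-specific part is wiring the function into a listener; the decision itself is plain Scala, which makes it easy to unit-test the thresholds before touching the cluster.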

Have you faced this same problem? Can you please provide some suggestions?

Thanks in advance!

Hi @cmanrique,

Welcome to the Snowplow community!

Thanks for the details. Internally we use Amazon Kinesis Data Analytics to perform aggregations on Kinesis data streams. The big advantage of doing so is that it is serverless and autoscales out of the box.

Otherwise, have you considered computing the aggregates directly from a data warehouse?

Hi @BenB!

I don’t think computing aggregates directly from a data warehouse would be a good option for us, since we want as little latency as possible between the time we process the records and the time users see the results. Let me know if you think this assumption is incorrect.

On the other hand, Kinesis Data Analytics looks pretty interesting. Are you using Flink to compute the aggregates?

Hi @cmanrique,

Indeed, if you care about latency, then computing the aggregates directly from the stream sounds like a better approach.

Yes, we are using Flink. We used these libraries to write our aggregating job:

"com.amazonaws"     %  "aws-kinesisanalytics-runtime" % "1.1.0"
"com.amazonaws"     %  "aws-kinesisanalytics-flink"   % "1.1.0"
"org.apache.flink"  %% "flink-streaming-scala"        % "1.8.1" % Provided