How does the Snowplow batch pipeline scale?


#1

We’ve just been asked an excellent question by one of our users. We’re sharing it here so everyone has access to the answer:

We know that the Snowplow pipeline has some sophisticated scaling-up routines in place (that also scale down, as we discussed before :slight_smile: ), but could you help us out by describing them in a bit more detail? I.e. what are the scaling thresholds, what is the reaction time, what happens automatically and what needs manual input, etc.

  1. for the collector systems
  2. for the EMR jobs, and how they are linked to sudden spikes in the collector system.

#2

On the batch pipeline, the only component that autoscales is the collector.

Scaling the collector

Originally we set the collector to scale based on CPU utilization, e.g. add an instance when CPU utilization hits 60%. However, experience with a viral video publisher and load testing with our Avalanche framework suggest that this does not scale the collector cluster fast enough in all cases. We need to:

Scale on load balancer response latency e.g.:

  • Add 1 instance when 0.1 < load balancer latency < 0.15
  • Add 2 instances when 0.15 < load balancer latency < 0.6
  • Add 3 instances when 0.6 < load balancer latency

We measure average load balancer latency over a 5 minute period.
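
As a rough illustration of how the latency steps above fit together, here is a minimal Python sketch. It is not the actual autoscaling configuration: the names `LATENCY_STEPS` and `instances_to_add` are hypothetical, and because the thresholds above do not say which band a value exactly on a boundary belongs to, the sketch assumes it falls into the lower band.

```python
from statistics import mean

# (lower bound, instances to add), taken from the latency thresholds above.
# Ordered from the highest band down so the first match wins.
LATENCY_STEPS = [(0.6, 3), (0.15, 2), (0.1, 1)]


def instances_to_add(latency_samples):
    """Return how many collector instances to add for a window of samples.

    latency_samples: load balancer latency readings covering the
    5 minute measurement period (same unit as the thresholds).
    """
    avg = mean(latency_samples)
    for lower_bound, add in LATENCY_STEPS:
        # Assumption: a value exactly on a boundary stays in the lower band.
        if avg > lower_bound:
            return add
    return 0  # average latency is at or below 0.1: no scale-up


if __name__ == "__main__":
    print(instances_to_add([0.11, 0.12, 0.13]))
```

Averaging over the whole window before comparing means that a single slow response does not trigger a scale-up on its own.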

We still scale based on CPU utilization as follows:

  • Add 1 instance when 40% < CPU utilization < 65%
  • Add 2 instances when 65% < CPU utilization < 85%
  • Add 3 instances when 85% < CPU utilization

We scale the collector cluster down if CPU utilization drops below 20%. We use lifecycle hooks to ensure that when an instance is removed from an autoscaling group because of a scale-down, it stays alive for another 2 hours. During that time it can flush any remaining logs before it is terminated, which prevents data loss.
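
The CPU rules above can be sketched the same way. This is an illustrative sketch only: the function name `cpu_scaling_adjustment` is hypothetical, boundary values are assumed to fall into the lower band, and the size of the scale-down step (one instance) is an assumption, since only the 20% threshold is stated above.

```python
def cpu_scaling_adjustment(cpu_percent):
    """Return the change in collector instance count for a CPU reading.

    Positive values add instances, negative values remove them,
    and 0 means leave the cluster as it is.
    """
    if cpu_percent > 85:
        return 3
    if cpu_percent > 65:
        return 2
    if cpu_percent > 40:
        return 1
    if cpu_percent < 20:
        # Assumption: remove one instance per scale-down step.
        return -1
    return 0  # between 20% and 40%: no change


if __name__ == "__main__":
    for reading in (10, 30, 50, 70, 90):
        print(reading, cpu_scaling_adjustment(reading))
```

Note the gap between the scale-down threshold (20%) and the first scale-up threshold (40%): readings in that range change nothing, which helps keep the cluster from flapping between adding and removing instances.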

Scaling EMR

EMR does not scale automatically. Currently we get an alarm if an EMR job takes longer than usual because there’s been a traffic spike. At that stage we can manually bump up the cluster size.

A significant advantage of the Real Time pipeline over the Batch pipeline is that the full pipeline (including enrichment) autoscales.

Scaling Redshift

Again, this is not automatic. We recommend adding Redshift nodes when your disk utilization hits 75%.
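
Since this check is manual, a small helper can make the 75% rule explicit. The sketch below is only an illustration: the function name and its parameters are hypothetical, and it assumes you already have the used and total disk figures for the cluster from your own monitoring.

```python
def needs_more_redshift_nodes(used_gb, total_gb, threshold_pct=75.0):
    """Return True when disk utilization has reached the threshold."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    utilization_pct = 100.0 * used_gb / total_gb
    return utilization_pct >= threshold_pct


if __name__ == "__main__":
    # 1600 GB used out of 2000 GB is 80% utilization.
    print(needs_more_redshift_nodes(1600, 2000))
```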