On the batch pipeline, the only component that autoscales is the collector.
Scaling the collector
Originally we set the collector to scale based on CPU utilization, e.g. adding an instance when CPU utilization hits 60%. However, experience with a viral video publisher, and load testing with our Avalanche framework, suggest that this does not scale the collector cluster fast enough in all cases. We also need to scale on load balancer response latency, e.g.:
- Add 1 instance when 0.1 < load balancer latency < 0.15
- Add 2 instances when 0.15 < load balancer latency < 0.6
- Add 3 instances when 0.6 < load balancer latency
We measure average load balancer latency over a 5 minute period.
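The latency steps above can be sketched as a small lookup. This is only an illustration of the rule, not the actual autoscaling policy definition: the function names are hypothetical, latency is taken in the same unit as the thresholds above, and a value that falls exactly on a boundary (which the list leaves undefined) is assumed to belong to the lower step.

```python
from statistics import mean


def mean_latency(samples):
    """Average the latency samples collected over the 5 minute window."""
    return mean(samples)


def instances_to_add_for_latency(avg_latency):
    """Map the averaged load balancer latency to a number of instances to add.

    Assumption: a value exactly on a threshold stays in the lower step.
    """
    if avg_latency > 0.6:
        return 3
    if avg_latency > 0.15:
        return 2
    if avg_latency > 0.1:
        return 1
    return 0  # latency is healthy, no scale-up from this rule


if __name__ == "__main__":
    window = [0.2, 0.4, 0.3]  # hypothetical samples from one 5 minute period
    print(instances_to_add_for_latency(mean_latency(window)))
```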
We still scale based on CPU utilization as follows:
- Add 1 instance when 40% < CPU utilization < 65%
- Add 2 instances when 65% < CPU utilization < 85%
- Add 3 instances when 85% < CPU utilization
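As a rough sketch, the CPU steps above can be written as a table of lower bounds checked from highest to lowest. The names `CPU_STEPS` and `instances_to_add_for_cpu` are made up for this example, and a reading exactly on a boundary is assumed to fall into the lower step, since the list does not say.

```python
# (lower bound in percent, instances to add), highest step first.
CPU_STEPS = [
    (85.0, 3),
    (65.0, 2),
    (40.0, 1),
]


def instances_to_add_for_cpu(cpu_percent):
    """Return how many collector instances the CPU rule would add."""
    for lower_bound, count in CPU_STEPS:
        if cpu_percent > lower_bound:
            return count
    return 0  # at or below 40%: this rule adds nothing


if __name__ == "__main__":
    for reading in (30.0, 50.0, 70.0, 90.0):
        print(reading, instances_to_add_for_cpu(reading))
```

Keeping the steps in a table rather than a chain of conditions makes it easier to adjust the thresholds after a load test.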
We scale the collector cluster down if CPU utilization drops below 20%. We use lifecycle hooks to ensure that when an instance is removed from an autoscaling group because of a scale-down, it stays alive for another 2 hours, during which it can flush any remaining logs before it is terminated. This prevents data loss.
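A minimal sketch of the scale-down rule and the 2 hour flush window follows. It only builds the parameters for a termination lifecycle hook; the dictionary keys match the arguments of boto3's `put_lifecycle_hook` call for Auto Scaling, but the group name, hook name and the choice of `CONTINUE` as the default result are assumptions for illustration, not the configuration actually deployed.

```python
SCALE_DOWN_CPU_PERCENT = 20.0
FLUSH_WINDOW_SECONDS = 2 * 60 * 60  # 2 hours for the instance to flush its logs


def should_scale_down(cpu_percent):
    """True when CPU utilization has dropped below the scale-down threshold."""
    return cpu_percent < SCALE_DOWN_CPU_PERCENT


def termination_hook_params(group_name, hook_name="collector-flush-logs"):
    """Build keyword arguments for a terminating-instance lifecycle hook.

    Both names are hypothetical placeholders for this example.
    """
    return {
        "AutoScalingGroupName": group_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": FLUSH_WINDOW_SECONDS,
        # Assumption: proceed with termination once the window expires.
        "DefaultResult": "CONTINUE",
    }


if __name__ == "__main__":
    params = termination_hook_params("collector-asg")  # hypothetical group name
    print(should_scale_down(15.0), params["HeartbeatTimeout"])
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("autoscaling").put_lifecycle_hook(**params)`.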
EMR does not scale automatically. Currently we’ll get an alarm if an EMR job takes longer than usual because there’s been a traffic spike. At that stage we can manually bump up the cluster size.
A significant advantage of the Real Time pipeline over the Batch pipeline is that the full pipeline (including enrichment) autoscales.
Again, scaling Redshift is not automatic. We recommend adding additional Redshift nodes when your disk utilization hits 75%.
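Because this check is manual, a small helper can make the 75% rule explicit. This is only a sketch: the function name and the idea of feeding it used and total disk space are assumptions, and how those figures are obtained from the cluster is left out.

```python
DISK_UTILIZATION_THRESHOLD = 0.75  # the 75% recommendation above


def should_add_redshift_nodes(used_gb, capacity_gb):
    """Return True once disk utilization reaches the recommended threshold."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return used_gb / capacity_gb >= DISK_UTILIZATION_THRESHOLD


if __name__ == "__main__":
    # Hypothetical figures: 1200 GB used out of 1600 GB is exactly 75%.
    print(should_add_redshift_nodes(1200, 1600))
```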