The hidden complexities of Kubernetes autoscaling: Beyond the basics
We look at advanced techniques to optimise autoscaling in Kubernetes for applications with varied needs
Horizontal Autoscaling is a powerful feature in Kubernetes, enabling applications to handle varying workloads automatically by dynamically adjusting the number of application instances. Unlike vertical scaling, which adds resources to a single instance, horizontal scaling increases or decreases the number of instances to match the current workload. When traffic or demand spikes, autoscaling can increase replicas to handle the load; when demand drops, it can scale down to save resources.
At its core, it sounds simple: define a minimum and maximum number of replicas, select a metric (like CPU, memory usage or any pod-external metric), set a threshold and let Kubernetes manage the scaling.
But autoscaling isn’t always as straightforward as it seems. Different workloads have unique scaling needs, and bursty traffic patterns or delayed processing can introduce complexities that require a fine-tuned configuration.
Moreover, different applications come with unique needs, and effective autoscaling requires more than a one-size-fits-all approach. Some applications prioritise low latency and need to react instantly to traffic spikes, while others are more concerned with cost efficiency and can tolerate slower scaling. Some require high throughput to handle large volumes of data, while others might focus on stability, minimising fluctuations in resource allocation.
Understanding these diverse requirements is key to building a robust autoscaling configuration. In practice, achieving the right balance between responsiveness, resource efficiency and stability requires fine-tuning, especially when dealing with bursty traffic patterns or workloads with late load (a traffic peak that arrives after traffic has already dropped).
In this article, we’ll explore these complexities and look at some advanced techniques to optimise autoscaling in Kubernetes for applications with varied needs — we will focus on a common use case: scaling based on a Kafka topic lag metric.
Apache Kafka is a highly scalable, distributed event streaming platform designed for real-time data pipelines and applications. It acts as a publish/subscribe message broker, enabling producers to send data as records (messages) to topics, while consumers process these records independently. Kafka is optimised for high-throughput, fault tolerance and low-latency communication. This makes it ideal for use cases like log aggregation, stream processing, and event-driven architectures. Its core components include brokers (managing storage and coordination), topics (organising data streams) and partitions (ensuring scalability and parallelism).
The basics of Kubernetes HPA (KEDA)
Kubernetes Event-Driven Autoscaling (KEDA) extends Kubernetes autoscaling by allowing applications to scale based on external events, such as message queue length or HTTP requests. Instead of just CPU or memory metrics, KEDA scales workloads based on defined triggers, making it perfect for dynamic workloads.
Kafka trigger with KEDA
For Kafka, KEDA can scale consumers based on message lag in a topic. This is particularly useful for bursty traffic, where scaling needs to respond directly to spikes in message load.
Key configurations:
Polling interval: Sets how often KEDA checks the lag. A shorter interval (e.g., ten seconds) enables faster responses to sudden spikes.
Lag threshold: Defines when to scale up. For example, if lag exceeds 20 messages, KEDA can add more consumers to clear the queue. A lower threshold triggers scale-up more aggressively, adding more replicas for the same amount of lag, while a higher threshold adds fewer. This is because the HPA calculates the desired number of replicas from the formula below, where desiredMetricValue is our lag threshold (see the Kubernetes HPA documentation for details; a worked example follows the formula).
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
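As a quick worked example (illustrative numbers): with 2 running replicas, an average lag of 50 messages per replica and a threshold of 20, the HPA computes desiredReplicas = ceil[2 * (50 / 20)] = 5, so three replicas would be added, subject to the configured maximum.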
Now, let’s see what a KEDA Kafka trigger looks like:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer-deployment
  pollingInterval: 10
  cooldownPeriod: 300
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: '<YOUR_KAFKA_BROKER>:9092'
        consumerGroup: <YOUR_CONSUMER_GROUP>
        topic: <YOUR_TOPIC_NAME>
        lagThreshold: "20"
        allowIdleConsumers: "false"
      authenticationRef:
        name: keda-kafka-credentials
So, at this point, we’ve covered all the basics and we already have a number of parameters to tune:
Min replicas
Max replicas (for the Kafka example specifically, this should be at most the number of partitions in your topic)
Polling interval
Lag threshold
Some potential pitfalls in the Kafka trigger YAML config are an improper polling interval, or allowing idle consumers, which lets the scaler create more consumers than there are partitions and leaves the extra consumers idle.
Beyond the basics
At this point, the question is: “Why would the basics not be enough?”
While basic autoscaling covers foundational metrics and thresholds, real-world scenarios with Kafka require a more nuanced approach. Kafka lag, for example, can be misleading — it’s often subject to false positives. You might see a spike in lag one moment, but find it drops to zero the next. If your autoscaler is too reactive to these temporary spikes, you could end up over-scaling, adding unnecessary replicas and increasing costs.
Another important consideration is Kafka rebalancing. Every time the consumer group scales up or down, Kafka initiates a rebalancing event to assign partitions to consumers. While essential, rebalancing can be disruptive, as it pauses message processing momentarily to redistribute load. This process is often described as a “stop-the-world” event because it can cause a temporary halt, impacting latency and throughput — we will cover some Kafka Rebalancing tips later in this article.
To address these challenges, you need a carefully tuned configuration that balances responsiveness with stability. Now let’s dive into strategies to reduce rebalancing frequency and optimise autoscaling to better handle bursty traffic, while avoiding costly and disruptive over-scaling.
The Horizontal Pod Autoscaler (HPA) in Kubernetes offers advanced configurations that go beyond basic scaling settings. Understanding these defaults and how they affect your autoscaling strategy is crucial for fine-tuning performance.
HPA defaults
Stabilisation window: This setting controls how far back the HPA looks at previously computed scaling recommendations before acting, smoothing out short-lived fluctuations. For scaling down, the HPA uses the highest replica recommendation from the stabilisation window; for scaling up, it uses the lowest. For example, with a 300-second scale-down window, a momentary drop of lag to zero will not immediately remove pods, because a higher recommendation from the last five minutes still wins.
Behaviour settings: The HPA also supports scaleUp (adding replicas) and scaleDown (removing replicas) behaviours, which allow you to define how aggressively the autoscaler should respond to changes in load. For example, you can set how many replicas can be added per minute or limit the percentage of pods that can be removed during scaling down.
By default, the scaling behaviour is:
# Default scaling behavior
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max
The default autoscaling behaviour, as described in the Kubernetes Horizontal Pod Autoscaling documentation:
“For scaling down the stabilization window is 300 seconds. There is only a single policy for scaling down which allows a 100% of the currently running replicas to be removed which means the scaling target can be scaled down to the minimum allowed replicas. For scaling up there is no stabilization window. When the metrics indicate that the target should be scaled up the target is scaled up immediately. There are 2 policies where 4 pods or a 100% of the currently running replicas may at most be added every 15 seconds till the HPA reaches its steady state.”
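To make those default scale-up policies concrete, here is a rough worked example with illustrative numbers: with 4 running replicas, the Pods policy allows adding 4 pods and the Percent policy allows adding 100% of 4 = 4 pods, so selectPolicy: Max permits up to 4 new replicas every 15 seconds. With 20 running replicas, the Percent policy alone would allow up to 20 more every 15 seconds, so the default scale-up is very aggressive.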
Kubernetes’ advanced scaling behaviour is covered in depth in the official HPA documentation, in case you need to go deeper.
These advanced configurations can be added to the KEDA ScaledObject definitions, which KEDA uses to create and manage the underlying HPA objects. This integration allows you to take advantage of Kubernetes’ powerful scaling capabilities while leveraging KEDA’s event-driven model.
By fine-tuning these settings within your KEDA configuration, you can optimise how your application responds to changing workloads, minimising unnecessary scaling events and maintaining stability during periods of high demand.
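As a minimal sketch of where that configuration lives (reusing the placeholder names from the earlier example), the HPA behaviour block sits under spec.advanced.horizontalPodAutoscalerConfig in the ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer-deployment
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600
        scaleUp:
          stabilizationWindowSeconds: 300
  triggers:
    - type: kafka
      # Kafka trigger metadata as shown earlier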
Kafka trigger with default behaviour
Using the default settings for Horizontal Pod Autoscaler (HPA) with Kafka triggers can lead to excessive scaling events, both up and down.
Each time the scaling threshold is breached, whether due to a temporary spike in lag or a rapid drop, the autoscaler may quickly add or remove replicas. This behaviour often results in repeated scaling events within a short timeframe, which can negatively impact system stability and increase Kafka rebalancing overhead. The scale-up and scale-down stabilisation window values strongly influence how frequently these events occur.
Every scaling event triggers a Kafka rebalancing process, which is essentially a redistribution of partitions among the consumers in the group. This rebalancing is often referred to as a "stop-the-world" event because it momentarily halts message processing, impacting overall throughput and latency. During rebalancing, the system must pause to ensure that each consumer is assigned its new set of partitions before resuming message consumption, which can lead to delays and increased lag during high-demand periods. Thus, minimising scaling events and optimising the rebalancing process is crucial for maintaining system stability and performance in Kafka environments.
As promised, here’s a tip on optimising Kafka rebalancing: utilise Cooperative Sticky Rebalancing. This approach is designed to minimise disruption during the rebalancing process, allowing consumers to maintain their partition assignments as much as possible while avoiding stop-the-world events.
The Cooperative Sticky Rebalancing algorithm is:
Incremental because the final desired state of rebalancing is reached in stages. A globally balanced final state does not have to be reached at the end of each round of rebalancing. A small number of consecutive rebalancing rounds can be used for the group of Kafka clients to converge to the desired state of balanced resources. In addition, you can configure a grace period to allow a departing member to return and regain the previously assigned resources.
Cooperative because each process in the group is asked to voluntarily release resources that need to be redistributed. These resources are then made available for rescheduling, given that the client that was asked to release them does so on time.
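How you enable it depends on your Kafka client, but with the plain Java client it comes down to setting partition.assignment.strategy on the consumer. Below is a minimal, hypothetical sketch of delivering that property to a consumer deployment via a ConfigMap (the ConfigMap name and file key are made up for illustration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-consumer-config  # hypothetical name
data:
  consumer.properties: |
    group.id=<YOUR_CONSUMER_GROUP>
    # Switches from the default eager strategy to incremental cooperative rebalancing
    partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor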
This is how the number of consumer replicas (the number of runtime instances consuming from a Kafka topic) can fluctuate if the autoscaler is not tuned correctly. In the example images below, you can see how an HPA with a Kafka trigger behaves with default values, and how you could tune it (remember, this always depends on your workload’s needs).

[Image: consumer replica count over time with the default HPA behaviour, showing frequent scale-up and scale-down events]
That flapping behaviour (frequent scaling up and down) causes too much Kafka rebalancing and blocks the consumers until the lag metric stabilises. In this example, there are four scale-up events and one scale-down event with the default HPA behaviour (image above), and a single scale-up event with the tuned HPA behaviour (image below).

[Image: consumer replica count over time with the tuned HPA behaviour, showing a single scale-up event]
For this specific workload, every message can take minutes to process, so the cooperative sticky partition assignment strategy was a very good fit for avoiding the stuck behaviour; combined with adjusted stabilisation windows for both scale up and scale down, performance improved considerably.
For this specific scenario, I have set the scaling behaviour as follows:
horizontalPodAutoscalerConfig:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 30
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
Scale up with a five-minute stabilisation window to avoid reacting to false-positive lag spikes
Scale-up acceleration is a bit slower than the default behaviour, so even if we do get a false positive, we at least scale up slowly
Scale down with a ten-minute stabilisation window; this lets the lag get near zero before scaling down, which speeds up throughput on the last part of the spike
Scale down at most 30% of the pods per minute, whereas the default behaviour can drop straight to the minimum replicas; this lets us catch late load spikes, if there are any (see the worked example below)
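As a rough worked example of that last point, with illustrative numbers: starting from 10 replicas, the 30% per 60 seconds policy allows at most 3 pods to be removed in the first minute, a couple more in the next, and so on, so reaching the minimum takes several minutes instead of a single step and keeps capacity around in case a late peak arrives.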
After adjusting the scaling parameters in our Kubernetes setup, we achieved a significant improvement in performance, resulting in better throughput for our system. Specifically, the time to complete AI analysis reports on images and videos was reduced from several hours to just half an hour, demonstrating the effectiveness of our optimisations.
Conclusion
Tuning workload parameters is essential for effective autoscaling with Kafka and Kubernetes. While default behaviours can work well in straightforward cases, they often fall short for more dynamic workloads, especially those with spiky traffic or variable message-processing times. To create a configuration that meets your application’s unique demands, consider the following actionable steps:
Review your workload characteristics: Analyse traffic patterns, message-processing times, and resource usage to identify specific scaling challenges.
Gradually adjust HPA settings: Start by fine-tuning parameters such as min/max replicas, polling intervals, lag thresholds and stabilisation windows. Monitor the impact of each adjustment before making further changes.
Explore advanced strategies: Implement techniques like the cooperative sticky rebalancing strategy or queue-based scaling for enhanced stability in more complex environments.
While these steps can significantly improve autoscaling behaviour, it’s important to recognise that there’s no one-size-fits-all setup. Every environment and workload comes with unique requirements, so continual tuning, testing and measuring are critical to achieving optimal performance. By understanding the intricacies of your workload and iteratively fine-tuning your configuration, you can maintain a balance between resource efficiency and responsiveness, ensuring stable, cost-effective and reliable scaling.
FAQ and pitfalls
To help address common challenges, here’s a quick FAQ:
What happens if the polling interval is too short? A short polling interval can lead to excessive API calls and potentially overwhelm the Kubernetes API server. It may also cause frequent scaling events, reducing stability.
How do you determine optimal stabilisation window settings? Start with a value that reflects your workload’s typical processing time. Gradually adjust and monitor how the system responds to spikes and drops in traffic to find a balance that avoids flapping.
By proactively addressing these considerations, you can mitigate common issues and further enhance the effectiveness of your autoscaling configuration.