TLDR: HPA handles horizontal scaling on CPU, VPA right-sizes memory. Split their concerns with controlledResources: ["memory"] so they don’t fight. Drop CPU limits. Match memory requests to limits so your pods are evicted last under memory pressure. Only create PDBs when you have 2+ replicas. Don’t run HPA in staging.


I’ve been writing about autoscaling on GKE for a while now. It started with debugging HPA scaling, then understanding the four layers of GKE autoscaling, then optimising node costs with ComputeClass, and most recently hitting the VPA resourcePolicy trap when templating VPA in Helm.

Each post solved one problem, but I was still figuring out the bigger picture: how do all of these pieces fit together? This post is where I’ve landed after all of that.

The conflict everyone warns about

You’ll see plenty of blog posts and Stack Overflow answers saying “don’t use HPA and VPA together.” That’s only half true.

The problem is when both target the same resource. HPA scales on utilisation — actual usage divided by requested. VPA changes the denominator by adjusting requests. They create a feedback loop:

  1. HPA sees high CPU utilisation, scales up
  2. VPA sees the new utilisation pattern, adjusts CPU requests
  3. HPA recalculates utilisation against new requests, scales again
  4. Repeat
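
To make the loop concrete, here’s the arithmetic with made-up numbers (the formula comes from the upstream HPA docs; the figures are purely illustrative):

# desiredReplicas = ceil(currentReplicas * currentUtilisation / targetUtilisation)
# usage 450m against a 500m request = 90% utilisation, target 70%:
#   ceil(4 * 90 / 70) = 6 replicas            -> HPA scales up
# VPA then raises the CPU request to 650m:
#   the same 450m of usage is now 69% utilisation -> HPA wants to scale back down
# Each controller keeps reacting to the other's change.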

The fix: split their concerns. VPA handles memory, HPA handles CPU.

autoscaling:
  vpa:
    enabled: true
    updateMode: "Auto"
    controlledResources: ["memory"]  # VPA only touches memory
  hpa:
    enabled: true
    targetCPUUtilizationPercentage: 70  # HPA only scales on CPU

With controlledResources: ["memory"], VPA won’t touch CPU requests. HPA scales horizontally on CPU utilisation. They stop fighting.
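
For reference, those values render to a VerticalPodAutoscaler object along these lines. This is a sketch assuming a Deployment target; the names are placeholders, not from any real chart:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app              # placeholder
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # placeholder
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # CPU requests are left alone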

This approach is well-documented in the community — see kubernetes/autoscaler#6247 and #6060 for the upstream discussion.

Drop CPU limits

This was a separate realisation from the autoscaling work but closely related.

CPU is compressible — when a pod hits its CPU limit, the kernel throttles it. The pod doesn’t get killed, it just gets slower. The frustrating part: this happens even when the node has spare capacity. Your pod gets throttled because of a limit you set, not because of actual resource contention.

Keep CPU requests (the scheduler needs them to place pods), but drop limits entirely:

resources:
  limits:
    # no cpu limit — allows bursting
    memory: 2000Mi
  requests:
    cpu: 400m
    memory: 2000Mi

Robusta’s “Stop Using CPU Limits” article covers this well. There’s even an academic paper confirming the performance impact of CPU limits. Eric Khun also wrote about getting faster services by removing CPU limits.

Match memory requests to limits

Memory is incompressible — exceed the limit, get OOMKilled. No throttling, no graceful degradation.

Matching the memory request to the limit means the pod can never use more memory than it asked the scheduler for. That matters under node memory pressure: the kubelet evicts pods whose usage exceeds their requests first, so a pod that stays within its request is last in line. (Strictly, the pod’s QoS class is Burstable rather than Guaranteed here, since we dropped the CPU limit; Guaranteed requires requests to equal limits for both CPU and memory.)

resources:
  limits:
    memory: 2000Mi
  requests:
    cpu: 400m
    memory: 2000Mi  # matches limit

This also plays nicely with VPA Auto mode — when VPA adjusts memory, it moves both the request and the limit together, keeping them matched.
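
A quick illustration with made-up numbers: VPA preserves the limit-to-request ratio when it applies a new recommendation, so a 1:1 ratio stays 1:1.

# original:            requests.memory 2000Mi, limits.memory 2000Mi   (1:1)
# VPA recommendation:  2600Mi
# applied:             requests.memory 2600Mi, limits.memory 2600Mi   (still 1:1)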

VPA: Auto mode today, in-place resize tomorrow

VPA Auto mode evicts pods to apply new resource recommendations. It’s conservative — maximum one pod evicted per minute, and recommendations need to differ by more than 10% to trigger an update. But it’s still eviction-based, which means restarts.

This is the current trade-off: better right-sizing at the cost of occasional restarts. For most services behind a load balancer with 2+ replicas, this is fine.

The good news is this is improving. KEP-1287 graduated to GA in Kubernetes 1.35, enabling in-place pod resizing at the Kubernetes level. VPA’s InPlaceOrRecreate update mode is being tracked and will let VPA patch running pods without eviction. If you’re on GKE, it’s worth watching for when this lands.
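
Once that mode ships in the VPA version your cluster runs, the switch should be a one-line change to the values above. A sketch, assuming the mode name stays InPlaceOrRecreate as tracked upstream (check your VPA release notes before relying on it):

autoscaling:
  vpa:
    enabled: true
    updateMode: "InPlaceOrRecreate"   # try in-place resize first, fall back to eviction
    controlledResources: ["memory"]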

I wrote about a VPA resourcePolicy gotcha earlier — that’s worth reading if you’re templating VPA in Helm.

HPA: aggressive up, conservative down

I found the default HPA behaviour too conservative for scale-up and too aggressive for scale-down. Here’s the asymmetric configuration I landed on:

autoscaling:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 30
    targetCPUUtilizationPercentage: 70
    scaleUp:
      stabilizationWindowSeconds: 0       # scale up immediately
      policies:
        - type: Percent
          value: 100                      # double pod count
          periodSeconds: 15
        - type: Pods
          value: 4                        # or add 4 pods
          periodSeconds: 15
      selectPolicy: Max                   # whichever adds more
    scaleDown:
      stabilizationWindowSeconds: 600     # wait 10 min before scaling down
      policies:
        - type: Pods
          value: 1                        # remove 1 pod at a time
          periodSeconds: 15

The asymmetry is deliberate. Users notice when you’re slow to scale up — requests queue, latency spikes, errors surface. Slow scale-down is rarely a problem. A few extra pods sitting around for 10 minutes cost nearly nothing compared to a latency incident.

Scale-up: a 0-second stabilisation window means HPA reacts immediately. The two policies (double the pod count or add 4, whichever is larger) mean you scale fast when small and still scale meaningfully when large.

Scale-down: a 600-second stabilisation window means HPA waits 10 minutes after the last scale-up recommendation before removing pods. Then it removes 1 pod every 15 seconds, gently stepping down.
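
For reference, those chart keys are an abstraction over the behavior field of the autoscaling/v2 HorizontalPodAutoscaler API. A sketch of the rendered object, with placeholder names:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                 # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app               # placeholder
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 15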

Flightcrew’s article on stabilisation windows explains this pattern well. The upstream HPA docs cover the algorithm details. And if your HPA isn’t behaving, see my earlier post on HPA debugging.

PDB: only with 2+ replicas

A PodDisruptionBudget with a single-replica deployment can block node drains entirely. With minAvailable: 1 and 1 replica, nothing can ever evict the pod — the cluster autoscaler can’t drain the node, voluntary disruptions stall, and upgrades hang.

The fix: only create the PDB when you actually have multiple replicas.

In Helm, this looks like:

{{- $replicas := .Values.replicaCount }}
{{- if .Values.autoscaling.hpa.enabled }}
{{- $replicas = .Values.autoscaling.hpa.minReplicas }}
{{- end }}
{{- if gt ($replicas | int) 1 }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ .Release.Name }}
spec:
  selector:
    matchLabels:
      {{- include "app.selectorLabels" $ | nindent 6 }}
  maxUnavailable: {{ .Values.pdb.maxUnavailable | default "50%" }}
{{- end }}

In production with minReplicas: 2, the PDB allows 50% unavailability — at least 1 pod stays up during voluntary disruptions like node drains. In staging with 1 replica, no PDB exists, so node drains work fine.

I also prefer maxUnavailable over minAvailable. A PDB never blocks a Deployment’s rolling update (controllers delete pods directly rather than evicting them), but a tight minAvailable, say minAvailable: 2 with 2 replicas, blocks every voluntary disruption and has to be kept in sync as replica counts change. maxUnavailable: 50% scales with the replica count and stays predictable.
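
To make the difference concrete (illustrative, assuming 2 healthy replicas):

# with 2 replicas:
#   minAvailable: 2       -> allowed disruptions = 0; drains block until you scale up
#   maxUnavailable: 50%   -> allowed disruptions = 1; drains proceed one pod at a time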

See the PDB docs for the full API, and chkk.io’s article on PDB pitfalls for more gotchas.

Putting it together: production vs staging

Here’s the full picture. Production gets the works — HPA, VPA, PDB, aggressive scaling. Staging keeps it lean to minimise costs.

Production:

replicaCount: 2

resources:
  limits:
    memory: 2000Mi
  requests:
    cpu: 400m
    memory: 2000Mi

autoscaling:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 30
    targetCPUUtilizationPercentage: 70
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 15
  vpa:
    enabled: true
    updateMode: "Auto"
    controlledResources: ["memory"]

pdb:
  maxUnavailable: "50%"

Staging:

replicaCount: 1

resources:
  limits:
    memory: 2000Mi
  requests:
    cpu: 400m
    memory: 2000Mi

autoscaling:
  hpa:
    enabled: false    # keep staging lean
  vpa:
    enabled: true
    updateMode: "Auto"
    controlledResources: ["memory"]

# no PDB — single replica, node drains must work

The key differences: staging disables HPA and runs a single replica to keep costs down. VPA still runs — it right-sizes memory regardless of scale, which is exactly what you want when minimising spend. No PDB either, since it would block node drains with 1 replica.

Further reading

If you’re starting from scratch, I’d recommend reading these in order:

  1. GKE cluster autoscaler — understand the four layers
  2. Kubernetes HPA debugging — when HPA misbehaves
  3. Cost-optimising GKE with ComputeClass — get the node costs right
  4. Getting VPA resourcePolicy right in Helm charts — avoid the minAllowed trap
  5. This post — tie it all together

External references: