I’d been using HPAs for a while without really understanding them. They worked, so I never looked closely. Then I noticed a service was running way more replicas than the load justified — and I realised I didn’t actually know how to read what HPA was doing or why. Here’s what I learned debugging it.
What I saw
kubectl get hpa
NAME     REFERENCE           TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
my-app   Deployment/my-app   124%/70%   1         10        6          45m
Six replicas for a service that barely had any traffic. The utilisation was showing 124%, which seemed wrong — the pods weren’t under heavy load at all.
Reading describe output
The real debugging tool is kubectl describe hpa. Most of the useful information is in the Metrics, Conditions and Events sections at the bottom:
kubectl describe hpa my-app
Metrics:
  Resource cpu on pods (as a percentage of request):  124% (62m) / 70%
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Everything looked healthy on the surface: ScalingActive was True and there were no errors. If anything, the ScalingLimited condition showed the HPA wanted even more than the ten-replica maximum. But the Metrics line told the story: 62 millicores of actual CPU, reported as 124% of request. That's next to no load, so why was it reading 124%?
Because the CPU request was set to 50m. HPA calculates utilisation as actual usage divided by requested: 62m / 50m = 124%. The service looked overloaded on paper because the requests were too low, not because it was actually busy.
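That same maths explains the ScalingLimited condition above. The HPA algorithm (per the Kubernetes docs) is desiredReplicas = ceil(currentReplicas × currentUtilisation / targetUtilisation), so with six replicas already running:
desired = ceil(6 * 124 / 70)
        = ceil(10.63)
        = 11        # more than maxReplicas (10), hence "TooManyReplicas"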
The fix
The CPU request needed to reflect what the service actually uses at baseline:
# before - request too low, inflates utilisation
resources:
  requests:
    cpu: 50m
    memory: 256Mi
  limits:
    memory: 256Mi

# after - request matches realistic baseline usage
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 256Mi
With a 200m request and 62m actual usage, utilisation drops to 31% — well under the 70% target. HPA scales back down to the minimum:
NAME     REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-app   Deployment/my-app   31%/70%   1         10        1         52m
The key insight: HPA doesn’t know how busy your service is. It only knows what percentage of its request is being used. If your requests are wrong, HPA’s scaling decisions will be wrong too. Getting resource requests right is a prerequisite for sensible autoscaling.
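For completeness, here's roughly what an HPA producing the output above looks like. This is a minimal autoscaling/v2 sketch matching the numbers shown (70% CPU target, min 1, max 10, targeting Deployment/my-app), not the exact manifest from my cluster:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the pod's CPU request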
Other things that tripped me up
These are gotchas I ran into over time as I set up HPAs on more services:
<unknown> metrics. If kubectl get hpa shows <unknown>/70% instead of a number, HPA can’t read metrics at all. The most common cause is pods with no CPU requests defined — HPA has nothing to divide by. Less commonly, metrics-server isn’t running. Check with kubectl top pods; if that fails, metrics-server is the problem.
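A quick way to narrow it down (assuming metrics-server is installed the standard way in kube-system):
# does the metrics pipeline work at all?
kubectl top pods
# is the metrics API registered and available?
kubectl get apiservice v1beta1.metrics.k8s.io
# is metrics-server itself running?
kubectl -n kube-system get pods -l k8s-app=metrics-server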
Target not found. HPA targets a specific deployment by name. If the name in your HPA spec doesn’t match the actual deployment name (easy to get wrong with Helm template names), you’ll see ScalingActive: False with reason FailedGetScale. Double-check with kubectl get deployments and compare.
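To compare the two quickly (assuming the HPA is also named my-app):
# the name the HPA is trying to scale
kubectl get hpa my-app -o jsonpath='{.spec.scaleTargetRef.name}'
# the deployments that actually exist
kubectl get deployments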
Stabilisation window confusion. HPA won’t scale down immediately after scaling up. The default scale-down stabilisation window is 5 minutes, meaning it picks the highest recommendation from the last 5 minutes before scaling down. This is intentional — it prevents flapping — but it confused me when I expected instant response. You can tune this with behavior.scaleDown.stabilizationWindowSeconds if needed.
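For example, shortening the window to 60 seconds looks like this under the HPA spec (a sketch; the default is usually fine):
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60   # default is 300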
Min equals max. If you set minReplicas: 2 and maxReplicas: 2, HPA has nowhere to go. I’ve done this accidentally when copying values between environments. It doesn’t error — it just silently does nothing.
Further reading
- HPA algorithm details — how scaling decisions are calculated
- Running HPA and VPA together — scaling horizontally and vertically without conflicts
- GKE cluster autoscaler — node-level autoscaling that complements HPA
- Monitoring with watch and top — monitoring commands for live debugging