My Kubernetes workflow changed a lot this year. The biggest shift: I started using LLMs as a debugging partner. Combined with a few CLI tools I can’t live without, it made debugging noticeably faster.

Aliases that stuck

First, the obvious ones:

alias k="kubectl"
alias kdrain="kubectl drain --ignore-daemonsets --delete-emptydir-data"

k saves thousands of keystrokes a year. kdrain saves me from forgetting the flags every time I need to drain a node.
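
In practice (the node name below is a made-up example):

k get pods -A
kdrain gke-default-pool-1a2b3c4d-node1   # hypothetical node name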

For context switching, I use kubectx and kubens (brew install kubectx), wrapped in a single function:

# Switch cluster and, optionally, namespace in one call
kube() {
  kubectx ${1:-}   # no argument: kubectx lists contexts
  kubens ${2:-}    # no argument: kubens lists namespaces
}

Now kube staging my-app switches both cluster and namespace in one go. kubectx - switches back to the previous context.

LLMs for debugging

This was the game changer. I whitelist read-only kubectl commands in my Claude settings:

{
  "permissions": {
    "allow": [
      "Bash(kubectl get:*)",
      "Bash(kubectl describe:*)",
      "Bash(kubectl logs:*)",
      "Bash(kubectl top:*)",
      "Bash(helm list:*)",
      "Bash(helm status:*)",
      "Bash(helm get:*)"
    ]
  }
}

Now I can paste an error and say “debug this”. Claude can inspect the pod, check events, look at logs, and usually pinpoint the issue faster than I can. It’s like pair programming with someone who never gets tired of reading YAML.

The read-only constraint is intentional - I want help diagnosing, not an LLM running kubectl delete on my behalf.

stern for logs

kubectl logs works fine for a single pod. But when you’re tailing multiple replicas or need to filter by content, stern is much better:

stern my-app -n staging --since 5m           # tail all my-app pods from the last 5 minutes
stern "api-.*" -n prod -o raw | grep ERROR   # regex pod match, raw output for piping

I also have an alias for structlog output - it only shows lines with an event field:

alias sternx="stern -o raw -i 'event'"
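
Usage is the same as plain stern - the extra flags just get prepended (the app and namespace here are only examples):

sternx my-app -n staging --since 15m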

See Viewing Kubernetes pod logs for more.

The actual debugging flow

When something breaks, I usually work through the same checklist (the commands are spelled out after the list):

  1. Check events first - kubectl get events -n my-namespace --sort-by='.lastTimestamp' tells you what Kubernetes is trying to do
  2. Describe the resource - kubectl describe pod shows state, conditions, and recent events
  3. Check logs - current and --previous if it crashed
  4. Ask Claude - paste the error and let it poke around
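
Spelled out, that first pass looks roughly like this (my-pod and my-namespace are placeholders):

kubectl get events -n my-namespace --sort-by='.lastTimestamp'   # what is Kubernetes trying to do?
kubectl describe pod my-pod -n my-namespace                     # state, conditions, recent events
kubectl logs my-pod -n my-namespace                             # current logs
kubectl logs my-pod -n my-namespace --previous                  # previous container, if it crashed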

Helm debugging

Before deploying Helm changes, I always render locally first:

helm template my-release ./my-chart -f values-prod.yaml | less

This catches most template errors before they hit the cluster. See helm template for debugging.
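
When a chart is large, it can also help to render a single template while iterating - a sketch, assuming the file you care about is templates/deployment.yaml:

helm template my-release ./my-chart -f values-prod.yaml \
  --show-only templates/deployment.yaml | less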

For ArgoCD-managed apps, I skip the argocd CLI entirely and use an annotation to force a hard refresh. The command is impossible to remember, so I wrap it in a shell function:

# .profile
ksync() {
  kubectl patch app "${1}" -n argocd \
    -p '{"metadata": {"annotations":{"argocd.argoproj.io/refresh":"hard"}}}' \
    --type merge
}

Then just ksync my-app. See ArgoCD application sync for more on debugging sync issues.
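
To confirm the refresh actually landed, I sometimes check the application status afterwards - a sketch that assumes the standard Application status fields:

kubectl get app my-app -n argocd \
  -o jsonpath='{.status.sync.status}{"\n"}{.status.health.status}{"\n"}'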

Cost optimisation

Running Kubernetes isn’t cheap, but spot instances cut costs by 60-70%. If your database is off-cluster (Cloud SQL, RDS), most workloads can tolerate spot preemption - just run 2+ replicas.

I built gkecc to generate ComputeClass specs sorted by actual cost (CPU + RAM, not just CPU). It interleaves spot and on-demand by price, so you get the cheapest available option. See Cost-optimising GKE with ComputeClass.

For understanding how GKE autoscaling works (HPA, VPA, NAP, cluster autoscaler), see GKE cluster autoscaler.

What I’d do differently

Looking back on 2025:

  • Should have whitelisted kubectl earlier - I was skeptical about LLMs for ops work, but the read-only constraint makes it safe
  • kubectx is essential - I resisted installing “yet another tool” for too long
  • stern over kubectl logs - the multi-pod tailing is worth it

The theme: tools that reduce context switching. Faster cluster switching, faster log tailing, faster diagnosis. The LLM part still feels slightly futuristic, but it works.