My Kubernetes workflow changed a lot this year. The biggest shift: I started using LLMs as a debugging partner. Combined with a few CLI tools I can’t live without, debugging got noticeably faster.
Aliases that stuck
First, the obvious ones:
alias k="kubectl"
alias kdrain="kubectl drain --ignore-daemonsets --delete-emptydir-data"
k saves thousands of keystrokes a year. kdrain saves me from forgetting the flags every time I need to drain a node.
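If you're on bash and already source kubectl's completion, one extra line keeps tab completion working for the alias:

# let the k alias reuse kubectl's bash completion
complete -o default -F __start_kubectl k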
For context switching, I use kubectx and kubens (brew install kubectx), wrapped in a single function:
kube() {
  kubectx ${1:-}
  kubens ${2:-}
}
Now kube staging my-app switches both cluster and namespace in one go. Running kubectx - switches back to the previous context.
LLMs for debugging
This was the game changer. I whitelist read-only kubectl commands in my Claude settings:
{
  "permissions": {
    "allow": [
      "Bash(kubectl get:*)",
      "Bash(kubectl describe:*)",
      "Bash(kubectl logs:*)",
      "Bash(kubectl top:*)",
      "Bash(helm list:*)",
      "Bash(helm status:*)",
      "Bash(helm get:*)"
    ]
  }
}
Now I can paste an error and say “debug this”. Claude can inspect the pod, check events, look at logs, and usually pinpoint the issue faster than I can. It’s like pair programming with someone who never gets tired of reading YAML.
The read-only constraint is intentional - I want help diagnosing, not an LLM running kubectl delete on my behalf.
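Nothing mutating is allowed in the first place, but if you want the intent to be explicit, the same settings file also takes a deny list. A sketch, with the commands I'd least want run on my behalf:

{
  "permissions": {
    "deny": [
      "Bash(kubectl delete:*)",
      "Bash(kubectl apply:*)",
      "Bash(kubectl patch:*)",
      "Bash(helm upgrade:*)",
      "Bash(helm uninstall:*)"
    ]
  }
}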
stern for logs
kubectl logs works fine for a single pod. But when you’re tailing multiple replicas or need to filter by content, stern is much better:
stern my-app -n staging --since 5m
stern "api-.*" -n prod -o raw | grep ERROR
I also have an alias for structlog output that only shows lines with an event field:
alias sternx="stern -o raw -i 'event'"
See Viewing Kubernetes pod logs for more.
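When the logs are JSON, raw stern output also pipes cleanly into jq. A sketch assuming level and event fields - adjust to whatever your log schema actually uses:

# only the event text from error-level lines across all matching pods
stern my-app -n prod -o raw --since 10m | jq -r 'select(.level == "error") | .event'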
The actual debugging flow
When something breaks, I usually:
- Check events first - kubectl get events -n my-namespace --sort-by='.lastTimestamp' tells you what Kubernetes is trying to do
- Describe the resource - kubectl describe pod shows state, conditions, and recent events
- Check logs - current, and --previous if it crashed (the first three steps are strung together below)
- Ask Claude - paste the error and let it poke around
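With a hypothetical pod name (my-app-7d4b9c is illustrative), that sequence looks something like:

# what is Kubernetes trying (and failing) to do?
kubectl get events -n my-namespace --sort-by='.lastTimestamp' | tail -20
# state, conditions, and recent events for the pod itself
kubectl describe pod my-app-7d4b9c -n my-namespace
# logs from the previous container if it crashed
kubectl logs my-app-7d4b9c -n my-namespace --previous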
Helm debugging
Before deploying Helm changes, I always render locally first:
helm template my-release ./my-chart -f values-prod.yaml | less
This catches most template errors before they hit the cluster. See helm template for debugging.
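If I want the API server's opinion as well, the rendered output can go through a server-side dry run - nothing is persisted, but it surfaces schema and admission errors that plain templating can't see:

helm template my-release ./my-chart -f values-prod.yaml | kubectl apply --dry-run=server -f -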
For ArgoCD-managed apps, I skip the CLI entirely and use annotations to force a sync. The command is impossible to remember, so I have an alias:
# .profile
ksync() {
  kubectl patch app "${1}" -n argocd \
    -p '{"metadata": {"annotations": {"argocd.argoproj.io/refresh": "hard"}}}' \
    --type merge
}
Then just ksync my-app. See ArgoCD application sync for more on debugging sync issues.
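To check whether the refresh actually brought things back in line, the Application status has both answers - Synced and Healthy are what you want to see:

kubectl get app my-app -n argocd -o jsonpath='{.status.sync.status}{"\n"}{.status.health.status}{"\n"}'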
Cost optimisation
Running Kubernetes isn’t cheap, but spot instances cut costs by 60-70%. If your database is off-cluster (Cloud SQL, RDS), most workloads can tolerate spot preemption - just run 2+ replicas.
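“Just run 2+ replicas” only really helps if the replicas don't land on the same spot node, so it's worth pairing them with a topology spread constraint. A minimal sketch, with illustrative names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # illustrative
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # spread replicas across nodes so one preempted spot node
      # never takes out every copy at once
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: my-app:1.0.0   # illustrative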
I built gkecc to generate ComputeClass specs sorted by actual cost (CPU + RAM, not just CPU). It interleaves spot and on-demand by price, so you get the cheapest available option. See Cost-optimising GKE with ComputeClass.
For understanding how GKE autoscaling works (HPA, VPA, NAP, cluster autoscaler), see GKE cluster autoscaler.
What I’d do differently
Looking back on 2025:
- Should have whitelisted kubectl earlier - I was skeptical about LLMs for ops work, but the read-only constraint makes it safe
- kubectx is essential - I resisted installing “yet another tool” for too long
- stern over kubectl logs - the multi-pod tailing is worth it
The theme: tools that reduce context switching. Faster cluster switching, faster log tailing, faster diagnosis. The LLM part still feels slightly futuristic, but it works.