TL;DR: A common haproxy-in-front-of-redis-sentinel setup has three reliability traps: liveness probes cascading restarts across every replica during sentinel failover, haproxy hard-stopping in-flight connections on SIGTERM, and haproxy aborting at startup when any redis pod's DNS name is NXDOMAIN. The fixes are split liveness/readiness probes, a preStop hook that sleeps and then runs kill -USR1 1, and init-addr last,libc,none plus resolvers k8s on every server line.
The setup
A small redis high-availability cluster:
- 3 redis pods, each with a redis-sentinel sidecar (quorum: 2)
- 3 haproxy pods in front, routing client traffic to whichever redis is currently master
- Clients connect to the haproxy Service, never to redis pods directly
The haproxy config picks the master by asking each sentinel get-master-addr-by-name, and serves on :6379 only when it has a healthy master backend.
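My config resolves the master by querying sentinel, but the shape of bk_redis_master matters for everything below, so here is a minimal sketch of the widely used direct-check variant of the same idea: every redis node is a server in the backend, and only the one answering role:master passes its check. Names mirror the rest of this post; my real config differs in the check details.
backend bk_redis_master
  mode tcp
  option tcp-check
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  server R0 redis-ha-node-0.redis-ha-headless.redis.svc.cluster.local:6379 check inter 3s fall 1 rise 1
  server R1 redis-ha-node-1.redis-ha-headless.redis.svc.cluster.local:6379 check inter 3s fall 1 rise 1
  server R2 redis-ha-node-2.redis-ha-headless.redis.svc.cluster.local:6379 check inter 3s fall 1 rise 1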
How it kept failing
I was looking at recurring overnight pages. The on-call alert was a Connection refused error in a downstream service that relies on redis as part of its startup path. My first reading was that GCE spot nodes were getting preempted and dragging redis pods with them. I checked the GCP audit log and found something different:
- compute.instances.preempted events in 24 hours: 1
- compute.instances.delete events from the cluster autoscaler service account: about 50, all between 05:00 and 07:30 UTC
So the trigger was cluster autoscaler consolidating nodes during quiet hours, not spot preemption. Raising consolidationDelayMinutes on the ComputeClass from 5 to 30 reduced churn frequency by roughly 6x, and that alone took the alert volume down. But that only addresses how often redis pods get rescheduled, not what happens to clients when one does.
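For reference, roughly where that knob lives on a GKE ComputeClass. I am going from memory of the CRD here; the placement under autoscalingPolicy is an assumption worth checking against the current schema.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: batch-spot                   # illustrative name
spec:
  # priorities, nodePoolAutoCreation, etc. elided
  autoscalingPolicy:
    consolidationDelayMinutes: 30    # was 5; how long a node must look underutilized before consolidation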
The cascading restart
I had each haproxy pod watching the same redis cluster and reporting its own k8s health on /healthz:
listen health_check_http_url
bind :8888
mode http
monitor-uri /healthz
monitor fail if { nbsrv(bk_redis_master) lt 1 }
That monitor fail if { nbsrv(bk_redis_master) lt 1 } line caused the cascade. /healthz returns 503 whenever bk_redis_master has zero healthy servers. Both my liveness and readiness probes hit the same endpoint.
Sentinel takes 5 to 15 seconds in my config to elect a new master. During that window, no backend looks like a master to haproxy. Every haproxy replica returns 503 on /healthz at the same time. Three replicas, all watching the same cluster, all failing in lockstep. The liveness probe fires failureThreshold: 3 × periodSeconds: 3 = 9s later and kubelet kills each of them. The Service drops to zero endpoints. New replicas have to come up from scratch. A 15 second backend outage turns into a 30 to 60 second total outage with no haproxy at all.
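Reconstructed from those numbers, the probe block at the time looked roughly like this; both probes point at the same endpoint, which is the whole problem:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8888
  periodSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 8888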
The pattern shows up elsewhere too: any setup where multiple replicas check a shared external dependency, and tie k8s liveness to it, will wipe the fleet on a dependency blip.
Splitting liveness from readiness
I had collapsed two questions onto one endpoint. Liveness should answer “is the process alive (would restarting help)”. Readiness should answer “should this pod receive traffic”. My /healthz only answered the readiness question.
The fix is two listen blocks, one per probe semantic:
# Liveness: 200 if haproxy is alive. Decoupled from backend state.
listen health_alive
bind :8888
mode http
monitor-uri /alive
option dontlognull
# Readiness: 503 if there is no healthy redis master right now.
listen health_ready
bind :8889
mode http
monitor-uri /ready
monitor fail if { nbsrv(bk_redis_master) lt 1 }
option dontlognull
And the deployment:
livenessProbe:
  httpGet:
    path: /alive
    port: 8888
readinessProbe:
  httpGet:
    path: /ready
    port: 8889
During a sentinel failover now, readiness fails on all three replicas at the same time. kube-proxy removes them from the Service while they stay alive. New connections from clients route nowhere for a few seconds, but the haproxy pods themselves do not restart. Once sentinel finishes the election, readiness recovers within a probe cycle and the Service comes back to full capacity. No long restart cycle to wait through.
I should name a small correctness trade-off. With readiness failing on all replicas at once, the Service has zero endpoints during the failover window, so clients see Connection refused or routing-level errors. That is the same end state as the old setup during the same window, but without the cost of pod restarts. The cleaner answer would be a sentinel-aware client library, or a redis cluster with multi-master sharding, but neither was on the table for this setup.
Graceful shutdown in haproxy 3.x
After fixing the cascade, I wanted to confirm an individual pod could be removed cleanly during cluster autoscaler scale-down. That part took me down a longer detour than I had budgeted for.
I tested by deleting a haproxy pod and watching a probe client hit the Service. The probe saw no failures during sentinel failover, which was what I wanted. But during pod replacement, the probe was still seeing client errors. The cause is in the haproxy management docs (haproxy 3.0 management.html):
The hard stop is simple, when the SIGTERM signal is sent to the haproxy process, it immediately quits and all established connections are closed.
The graceful stop is triggered when the SIGUSR1 signal is sent to the haproxy process. It consists in only unbinding from listening ports, but continue to process existing connections until they close. Once the last connection is closed, the process leaves.
SIGTERM is what kubelet sends when a pod is terminated. With no preStop hook, haproxy hard-stops and RSTs every in-flight client connection.
SIGUSR1 is the signal I wanted: stop accepting new connections, let existing ones drain. To trigger it during pod termination, I needed a preStop hook.
The order matters
My first attempt was wrong:
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "kill -USR1 1; sleep 30"]
The thinking was “send the soft-stop signal, then wait for things to drain”. The problem: kube-proxy removes the pod from the Service asynchronously, 5 to 10 seconds after the pod is marked for deletion. If I send SIGUSR1 at t=0, haproxy stops accepting at t=0, but kube-proxy keeps routing new connections to this pod until it has propagated the EndpointSlice update. Those new connections see Connection refused.
The right order is: sleep first (so haproxy keeps accepting while kube-proxy propagates the EndpointSlice removal), then signal:
lifecycle:
  preStop:
    exec:
      # 1. sleep 10s so kube-proxy removes this pod from the Service while
      #    haproxy keeps accepting (any racing connection still gets served)
      # 2. SIGUSR1 starts the soft-stop, drains in-flight connections
      # 3. sleep 20s gives existing connections time to complete
      command: ["sh", "-c", "sleep 10; kill -USR1 1; sleep 20"]
And in haproxy.cfg, I added hard-stop-after so haproxy exits within the preStop window, before kubelet sends SIGTERM:
global
  # Bound the soft-stop: the drain starts at t=10s, so haproxy exits
  # by t=30s, right as the preStop hook ends and kubelet would send SIGTERM.
  hard-stop-after 20s
On the deployment, I raised terminationGracePeriodSeconds from the default 30 to 60 so the 30-second preStop has headroom.
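Put together, the relevant bits of the pod spec (container name illustrative):
spec:
  terminationGracePeriodSeconds: 60   # default is 30; the 30s preStop needs headroom
  containers:
    - name: haproxy
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10; kill -USR1 1; sleep 20"]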
Sequence end to end
t=0       Pod marked for deletion. preStop starts.
t=0-10s   haproxy still listening. kube-proxy removes the endpoint.
t=10s     kill -USR1 1 -> haproxy stops listening, drain begins. The hard-stop-after 20s timer starts.
t=10-30s  Existing connections complete naturally.
t=30s     hard-stop-after fires. haproxy exits. Container gone.
t=30s     preStop sleep ends. kubelet would send SIGTERM, but the container is already gone.
The startup-time DNS trap
The third failure mode was a quiet CrashLoopBackOff. One redis-haproxy replica had four restarts inside a 71-minute window. The previous container’s log said:
[ALERT] (1) : config : [/data/haproxy.cfg:56] : 'server check_if_redis_is_master_0/S1' :
could not resolve address 'redis-ha-node-1.redis-ha-headless.redis.svc.cluster.local'.
[ALERT] (1) : config : Failed to initialize server(s) addr.
haproxy resolves server addresses at config parse time, not at runtime. The haproxy.cfg lists each redis pod by its headless-service DNS name:
server R1 redis-ha-node-1.redis-ha-headless.redis.svc.cluster.local:6379 check inter 3s fall 1 rise 1
With the default publishNotReadyAddresses: false on a StatefulSet’s headless service, each pod’s DNS A record only exists while the pod is Ready. A redis-ha pod mid-reschedule returns NXDOMAIN on its own name. If a haproxy pod starts during that window, haproxy aborts with the ALERT above and exits 1. kubelet restarts it. CrashLoopBackOff until DNS recovers.
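The window is easy to observe with a throwaway pod (pod name arbitrary):
$ kubectl run -it --rm dns-check --image=busybox:1.36 --restart=Never -- \
    nslookup redis-ha-node-1.redis-ha-headless.redis.svc.cluster.local
# NXDOMAIN while the pod is mid-reschedule (not Ready); the A record
# reappears once the pod passes its readiness probe.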
The conditions for hitting this overlap with the state the redis-ha PDB deliberately creates. The PDB makes redis-ha pods reschedule one at a time during voluntary disruptions, so "one node temporarily NXDOMAIN" is the normal state during evictions, exactly the kind of state the rest of the system is designed to tolerate. Yet even one unresolvable redis-ha pod aborts haproxy startup, no matter how cleanly the other two servers resolve.
Two directives that have to be applied together
global
  ...

resolvers k8s
  parse-resolv-conf
  resolve_retries 3
  hold nx 10s
  ...

backend bk_redis_master
  ...
  server R1 redis-ha-node-1.redis-ha-headless...:6379 check inter 3s fall 1 rise 1 init-addr last,libc,none resolvers k8s
init-addr controls what happens at startup. last,libc,none tells haproxy to try the last-known-good cached address first, then libc; if both fail, start the server in unknown state instead of aborting the process.
resolvers gives haproxy its own runtime DNS resolver. parse-resolv-conf (haproxy 2.1+) tells it to read the container’s /etc/resolv.conf, which on a kubernetes pod points at kube-dns. The hold nx 10s setting caches NXDOMAIN responses for ten seconds before re-trying, so once the redis-ha pod becomes Ready, the server gets the new address within about ten seconds.
Without both, the fix is incomplete:
- init-addr alone lets haproxy start, but the server never gets an address, because there is no runtime resolver to update it.
- resolvers alone doesn't help if libc resolution at parse time aborts the process.
My first attempt had only init-addr. I had assumed check inter 3s would re-resolve the hostname on each check, but it does not. The resolver is a separate concern from health checks; without an explicit resolvers section pointing at kube-dns, the server keeps whatever address haproxy resolved at parse time.
I forced the failure mode in a test cluster by deleting a redis-ha pod and a haproxy pod within five seconds of each other:
$ kubectl delete pod -n redis redis-ha-node-1 --wait=false
$ sleep 5 # node-1 is now mid-reschedule, its DNS returns NXDOMAIN
$ kubectl delete pod -n redis $(kubectl get pods -n redis -l app.kubernetes.io/name=redis-haproxy -o name | head -1)
The new haproxy pod started cleanly, became Ready within thirteen seconds, with zero restarts. Pre-fix, the same scenario produced four restarts before recovery.
Helm extraObjects replaces lists, doesn’t merge them
While tightening the redis-haproxy PDB to limit how many pods cluster-autoscaler can take down at once, I spotted something odd in an adjacent service. Its helm chart defined a PodDisruptionBudget in the base values.yaml under extraObjects, but kubectl get pdb returned nothing.
The cause was helm’s list-merge semantics. The base values.yaml had:
extraObjects:
  - apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-service
    spec:
      minAvailable: 2
And the env-specific values file had:
extraObjects:
  - apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    ...
Helm -f base.yaml -f env.yaml does not deep-merge lists. It replaces the entire extraObjects value with whatever the env file provides. The base PDB disappears at render time with no error and no warning, and nothing on the cluster.
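A render-time check makes the disappearance visible before anything ships (release name and chart path are illustrative):
$ helm template my-service ./chart -f base.yaml -f env.yaml \
    | grep -c 'kind: PodDisruptionBudget'
0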
No alert pointed at this. I only caught it because tightening the redis-haproxy PDB made me wonder what other services in the same namespace had set. The fix for each affected service was to repeat the PDB block in every env values file that defined its own extraObjects. Keeping cross-cutting resources like PDBs out of extraObjects and in standalone manifests or a kustomize overlay would probably be cleaner, but for these services, repeating the PDB was the smaller diff.
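So each affected env values file now carries the full list, PDB included:
extraObjects:
  - apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-service
    spec:
      minAvailable: 2
  - apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    ...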
The honest version, in this particular case: the missing PDB on the downstream consumer was probably a bigger contributor to the overnight alerts than the haproxy work above. The original page was “consumer can’t reach the service it depends on”. A deployed PDB on the consumer would have stopped the autoscaler from taking multiple replicas down at once during consolidation, regardless of what redis-haproxy underneath was doing. The haproxy fixes are real improvements, but the simpler extraObjects override fix probably did most of the heavy lifting on the alerts that brought me here.
How I verified it
I ran a probe pod that loops redis-cli PING against the Service every 500ms, then deleted a haproxy pod from the new cohort and counted how many of the probe’s PINGs failed during the next minute.
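A minimal version of that probe, assuming the Service is named redis-haproxy in the redis namespace:
$ kubectl run -n redis probe --image=redis:7 --restart=Never -- \
    sh -c 'while true; do redis-cli -h redis-haproxy PING || echo "$(date +%T) FAIL"; sleep 0.5; done'
$ kubectl logs -n redis -f probe | grep FAIL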
Before the changes, a single pod deletion produced four Connection refused entries in the probe log over the 9-second kube-proxy lag window. After the changes, the same test produced zero. The fleet replacement during the rollout still produced some errors because the old pods predated the preStop config and got hard-stopped on terminate, but later pod deletions were clean.
The control test, hitting a running pod with kubectl exec -- kill -USR1 1, was useful too. It confirms the signal handling without changing the deployment:
$ kubectl exec haproxy-pod -- sh -c 'kill -USR1 1'
$ kubectl logs haproxy-pod | tail
Proxy health_alive stopped (cumulated conns: FE: 409, BE: 0).
Proxy health_ready stopped (cumulated conns: FE: 409, BE: 0).
Proxy ft_redis_master stopped (cumulated conns: FE: 879, BE: 0).
...
Existing connections stay ESTABLISHED, the pod is marked NotReady because readiness on :8889 can no longer connect, and new connections to the pod’s IP get refused, which matches the docs.
What I’d do differently
Liveness probes that depend on a shared external service create correlated failures across replicas. Once I named the pattern, I started seeing it elsewhere: webhook controllers checking the API server they front, sidecars checking the same upstream as the main container. The fix is the same shape each time: separate the “am I alive” question from the “can I serve” question.
The haproxy docs are explicit about SIGTERM, and I had read past them. My assumption was that graceful shutdown was the default on SIGTERM. In the default plain mode, SIGTERM is an immediate hard stop. Master-worker mode (opt-in via -W or the master-worker keyword) relays signals through the master process, but I was not running it.
Two of the three haproxy fixes had bugs I missed on first read: preStop ordering (SIGUSR1 before the sleep instead of after) and the missing resolvers block to back up init-addr. Both bugs were silent in my own reasoning. I had a clear mental model of what the code should do, but the actual semantics were different. Re-reading against the docs would have caught either one, except I had already convinced myself the code was correct.
Further reading
- haproxy 3.0 management.html: signal handling, soft-stop semantics
- haproxy 3.0 configuration.html#hard-stop-after: bounding the soft-stop duration
- haproxy 3.0 configuration.html#init-addr: startup-time address resolution fallbacks
- haproxy 3.0 configuration.html resolvers: runtime DNS resolvers
- Kubernetes EndpointSliceTerminatingCondition: how terminating pods are signalled to kube-proxy
- Pod lifecycle: preStop hooks: what runs when and against what timer