TL;DR: A helm release with pre-upgrade hooks is not atomic. The hook runs first and commits side effects. If the main upgrade fails after that, those side effects stay. helm rollback does not re-run hooks, so you end up in a split state the manifests do not describe. Plan migrations so no single helm release can half-apply.
## A quick word on SecretProviderClass
A SecretProviderClass (SPC) is the custom resource the Secrets Store CSI Driver uses to tell the driver which secrets to fetch from an external store (GCP Secret Manager, AWS Secrets Manager, Azure Key Vault, Vault) and what to do with them. Two things it can do:
- Mount secrets as files into a CSI volume inside the pod, at paths you define under `spec.parameters.secrets`. Your app reads `/mnt/secrets/MY_SECRET`.
- Sync those mounted files into a Kubernetes `Secret`, via `spec.secretObjects`. This lets legacy code that reads env vars via `secretKeyRef` keep working, by pointing the `secretKeyRef` at the synced `Secret`.
The `secretObjects` path is a compatibility shim for code that predates the CSI mount.
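As a rough sketch, an SPC wiring up both paths looks something like this for the GCP provider (the project, secret, and object names here are hypothetical):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-secrets-spc
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: "projects/my-project/secrets/API_TOKEN/versions/latest"
        path: "API_TOKEN"            # file name under the mount: /mnt/secrets/API_TOKEN
  secretObjects:                     # optional: sync the mounted files into a Secret
    - secretName: my-secrets
      type: Opaque
      data:
        - objectName: API_TOKEN      # must match the mounted file name
          key: API_TOKEN             # key in the synced Kubernetes Secret
```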
## What I was doing
A service I look after had both paths wired up: some env vars came from `secretKeyRef` against the synced `Secret`, and the same files were mounted in the CSI volume. I had already moved the app to read from the CSI mount for all its secrets, so the `secretKeyRef` plumbing was dead weight. I wrote one helm release that stripped, in lockstep, the `env.secretKeyRef` entries on the Deployment and the matching `secretObjects` entries on the SPC. The release also had a pre-upgrade Job that ran database migrations.
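For context, the plumbing being stripped looked roughly like this on the Deployment side (names hypothetical); the matching `secretObjects` entry on the SPC went away in the same release:

```yaml
env:
  - name: API_TOKEN
    valueFrom:
      secretKeyRef:
        name: my-secrets   # the Secret synced by the SPC's secretObjects
        key: API_TOKEN
```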
## What went wrong
The migration Job failed for an unrelated reason (bad IAM binding on its service account). Helm reported `UPGRADE FAILED: pre-upgrade hooks failed`. So far, so expected: helm would skip the new manifests.
The pods started failing. Every container under the Deployment hit `CreateContainerConfigError` with:

`Error: couldn't find key API_TOKEN in Secret default/my-secrets`
Two things did not add up. First, the old Deployment referenced `API_TOKEN` through `secretKeyRef`. The new one did not. If the upgrade had failed and rolled back, the old Deployment should still have been valid. Second, why was the key missing from the Secret?
I ran `helm rollback`. It succeeded. The pods kept failing.
## What I found
I had two assumptions wrong.
**Helm applied the SPC before running the Job.** Pre-upgrade hooks are ordinary Kubernetes resources that helm applies as part of the release. The hook phase for upgrade goes:
1. Render the chart.
2. Apply `pre-upgrade` hooks, which includes both the migration `Job` and the SPC.
3. Wait for each hook to report ready.
4. If any hook fails, stop and mark the release `failed`.
5. Only after all hooks succeed, diff and apply the main manifests.
I had annotated the SPC as a pre-upgrade hook so its lifecycle would match the Job’s. Helm applied the new SPC at step 2. The CSI driver reconciled a few seconds later, stripping the Secret to match. The Job failed at step 3. Helm stopped at step 4 and never ran step 5. But the SPC change in step 2 had already landed.
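The coupling is in the hook annotations: anything carrying `helm.sh/hook: pre-upgrade` lands in the hook phase, before any main manifest. Hook weights only order hooks relative to each other; they cannot push a hook after the main apply. A sketch of what both the Job and the SPC carried (the weight is illustrative):

```yaml
metadata:
  annotations:
    helm.sh/hook: pre-upgrade
    helm.sh/hook-weight: "-1"                    # orders hooks among themselves, nothing more
    helm.sh/hook-delete-policy: before-hook-creation
```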
**`helm rollback` does not re-run hooks.** I had expected rollback to walk the whole release back, including the SPC. Rollback only reconciles the main manifest set. Helm treats hooks as orchestration artefacts, fire-and-forget. The Deployment manifest went back to its old state. The SPC stayed stripped.
I was left with an old Deployment referencing `API_TOKEN` via `secretKeyRef`, and a new SPC no longer producing that key in the synced `Secret`.
## The second surprise
I expected the CSI driver to do one of two things when I removed all `secretObjects` entries: either drop every synced key to match, or leave the `Secret` alone as unmanaged. The driver did neither. 68 of the 69 synced keys stayed in the `Secret`. The driver pruned one. That one key was the only one the old Deployment still referenced via `secretKeyRef`. The other 68 orphans pointed at keys no code cared about any more.
I have no tidy explanation for which keys the driver keeps and which it drops. Assume zeroing out `secretObjects` does not give you a clean reset.
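Absent a tidy rule, the practical move is to diff the declared keys against the synced ones yourself after each change. A minimal sketch, assuming you have already pulled both key sets out with kubectl (the sample data here is made up):

```python
def diff_keys(declared, synced):
    """Compare keys declared in the SPC's secretObjects against keys
    actually present in the synced Secret.

    Returns (orphans, missing): keys still in the Secret but no longer
    declared, and declared keys the driver failed to sync."""
    declared, synced = set(declared), set(synced)
    return synced - declared, declared - synced

# e.g. after removing API_TOKEN from secretObjects:
declared = {"DB_PASSWORD", "SIGNING_KEY"}              # from the SPC manifest
synced = {"DB_PASSWORD", "SIGNING_KEY", "API_TOKEN"}   # from `kubectl get secret -o json`
orphans, missing = diff_keys(declared, synced)
print(sorted(orphans), sorted(missing))   # ['API_TOKEN'] []
```

Either set being non-empty after a batch is the signal to stop and look before removing the next batch.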
## Getting back to green
Two things unblocked me:
- I reapplied the old SPC by hand with `kubectl apply -f`. The CSI driver recreated the missing key in the `Secret` within a few seconds, and the old Deployment pods came up.
- I fixed the IAM binding that had caused the migration `Job` to fail, then rolled forward to the new manifests in a second release.
With the migration Job running, the follow-up upgrade went through and left the SPC in the stripped state I had intended.
## Rules I now follow
- **One concern per release.** Stripping `secretKeyRef` on the Deployment is one release. Removing `secretObjects` on the SPC is another. Changing the IAM username the prestart hook uses is a third. Each release should have a blast radius you can explain in one sentence.
- **Assume hooks commit.** A pre-upgrade hook change sticks, whether or not the main manifest lands and whether or not you then roll back. If the hook applies a resource the main app depends on, and the main manifest never lands, you have a split cluster and no automated recovery.
- **Stage the removal of synced keys.** If CSI sync depends on `secretObjects`, remove entries in small batches and confirm the `Secret` matches expectations after each batch. Or delete the `Secret` yourself and let CSI recreate it from the remaining entries.
- **Do not rely on library defaults to fill helm holes.** My setup worked before because a Python config class had a default value that stood in for a missing helm value. A later release set that value in helm to something different, the silent default stopped helping, and the prestart hook broke. Set values in helm so the behaviour does not depend on a default three repos away.
- **Test the failing prestart path.** If the migration `Job` can fail for auth, DB unavailability, or a bad migration, you are one transient failure away from a release stuck half-applied. Before relying on a pre-upgrade hook, work out what happens if it fails mid-upgrade.
- **Third-party libraries bring their own env readers.** The app's settings class uses a descriptor that reads from `/mnt/secrets/<KEY>` on attribute access. A vendored client library that builds its own `BaseSettings` and reads the same key from `os.environ` does not hook into that descriptor. When I removed the env var from helm, the library's default empty string won and the outbound API call failed on first use. Before stripping an env var, grep the vendored code for any module that reads it. The app's secret-loading path does not cover them.
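The last rule is easier to see in code. Below is a minimal sketch of the clash, with hypothetical class names; the real app used a descriptor like this and the real library used `pydantic.BaseSettings`, but plain classes keep the sketch dependency-free:

```python
import os

class MountSecret:
    """Descriptor that reads a secret file from the CSI mount on attribute access."""
    mount_dir = "/mnt/secrets"

    def __init__(self, key):
        self.key = key

    def __get__(self, obj, objtype=None):
        with open(os.path.join(self.mount_dir, self.key)) as f:
            return f.read().strip()

class AppSettings:
    # The app's path: no env var needed, the value comes from the mounted file.
    api_token = MountSecret("API_TOKEN")

class VendoredClientSettings:
    """Stand-in for a vendored library with its own env reader. It never
    goes through MountSecret, so stripping the env var silently leaves it
    with the empty-string default instead of failing fast."""
    def __init__(self):
        self.api_token = os.environ.get("API_TOKEN", "")
```

With the env var removed from helm, `AppSettings().api_token` keeps working because the file is still mounted, while `VendoredClientSettings().api_token` quietly becomes the empty string and only blows up on the first outbound call.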
## Follow-up
I hit the pattern in the last two rules a few hours after publishing this post. A separate cleanup removed an API key env var from the same service. The app's own config reads that key from the CSI mount now. A vendored client library had its own `pydantic.BaseSettings` reading the key from the environment, and the app's mount plumbing could not see it. Every request that hit that library started erroring with a `ValueError` about the missing API key. I restored the env injection as a hotfix, then published a library release that accepted the key as a constructor argument. The app now reads the key from the CSI mount and passes it into the client call.
## Further reading
- Helm Chart Hooks for the official ordering and deletion policy reference.
- Secrets Store CSI Driver concepts on how `secretObjects` syncs mounted files to Kubernetes `Secret` objects.
- `helm upgrade --atomic` rolls back main manifests on failure, but does not revert hook side effects.