TL;DR: A helm release with pre-upgrade hooks is not atomic. The hook runs first and commits side effects. If the main upgrade fails after that, those side effects stay. helm rollback does not re-run hooks, so you end up in a split state the manifests do not describe. Plan migrations so no single helm release can half-apply.
## A quick word on SecretProviderClass
A SecretProviderClass (SPC) is the custom resource the Secrets Store CSI Driver uses to tell the driver which secrets to fetch from an external store (GCP Secret Manager, AWS Secrets Manager, Azure Key Vault, Vault) and what to do with them. Two things it can do:
- Mount secrets as files into a CSI volume inside the pod, at paths you define under `spec.parameters.secrets`. Your app reads `/mnt/secrets/MY_SECRET`.
- Sync those mounted files into a Kubernetes `Secret`, via `spec.secretObjects`. This lets legacy code that reads env vars via `secretKeyRef` keep working, by pointing the `secretKeyRef` at the synced `Secret`.
The `secretObjects` path is a compatibility shim for code that predates the CSI mount.
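As a rough sketch, an SPC wiring up both paths looks something like this for the GCP provider (the project, secret, and object names here are hypothetical):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-secrets-spc
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: "projects/my-project/secrets/API_TOKEN/versions/latest"
        path: "API_TOKEN"            # file name under the mount: /mnt/secrets/API_TOKEN
  secretObjects:                     # optional: sync the mounted files into a Secret
    - secretName: my-secrets
      type: Opaque
      data:
        - objectName: API_TOKEN      # must match the mounted file name
          key: API_TOKEN             # key in the synced Kubernetes Secret
```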
## What I was doing
A service I look after had both paths wired up: some env vars came from `secretKeyRef` against the synced `Secret`, and the same files were mounted in the CSI volume. I had already moved the app to read from the CSI mount for all its secrets, so the `secretKeyRef` plumbing was dead weight. I wrote one helm release that stripped, in lockstep, the `env.secretKeyRef` entries on the Deployment and the matching `secretObjects` entries on the SPC. The release also had a pre-upgrade Job that ran database migrations.
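For context, the plumbing being stripped looked roughly like this on the Deployment side (names hypothetical); the matching `secretObjects` entry on the SPC went away in the same release:

```yaml
env:
  - name: API_TOKEN
    valueFrom:
      secretKeyRef:
        name: my-secrets   # the Secret synced by the SPC's secretObjects
        key: API_TOKEN
```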
## What went wrong
The migration Job failed for an unrelated reason (bad IAM binding on its service account). Helm reported `UPGRADE FAILED: pre-upgrade hooks failed`. So far, so expected: helm would skip the new manifests.
The pods started failing. Every container under the Deployment hit `CreateContainerConfigError` with:

`Error: couldn't find key API_TOKEN in Secret default/my-secrets`
Two things did not add up. First, the old Deployment referenced `API_TOKEN` through `secretKeyRef`. The new one did not. If the upgrade had failed and rolled back, the old Deployment should still have been valid. Second, why was the key missing from the Secret?
I ran `helm rollback`. It succeeded. The pods kept failing.
## What I found
I had two assumptions wrong.
**Helm applied the SPC before running the Job.** Pre-upgrade hooks are ordinary Kubernetes resources that helm applies as part of the release. The hook phase for upgrade goes:
1. Render the chart.
2. Apply `pre-upgrade` hooks, which includes both the migration `Job` and the SPC.
3. Wait for each hook to report ready.
4. If any hook fails, stop and mark the release `failed`.
5. Only after all hooks succeed, diff and apply the main manifests.
I had annotated the SPC as a pre-upgrade hook so its lifecycle would match the Job’s. Helm applied the new SPC at step 2. The CSI driver reconciled a few seconds later, stripping the Secret to match. The Job failed at step 3. Helm stopped at step 4 and never ran step 5. But the SPC change in step 2 had already landed.
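The coupling is in the hook annotations: anything carrying `helm.sh/hook: pre-upgrade` lands in the hook phase, before any main manifest. Hook weights only order hooks relative to each other; they cannot push a hook after the main apply. A sketch of what both the Job and the SPC carried (the weight is illustrative):

```yaml
metadata:
  annotations:
    helm.sh/hook: pre-upgrade
    helm.sh/hook-weight: "-1"                    # orders hooks among themselves, nothing more
    helm.sh/hook-delete-policy: before-hook-creation
```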
**`helm rollback` does not re-run hooks.** I had expected rollback to walk the whole release back, including the SPC. Rollback only reconciles the main manifest set. Helm treats hooks as orchestration artefacts, fire-and-forget. The Deployment manifest went back to its old state. The SPC stayed stripped.
I was left with an old Deployment referencing `API_TOKEN` via `secretKeyRef`, and a new SPC no longer producing that key in the synced `Secret`.
## The second surprise
I expected the CSI driver to do one of two things when I removed all `secretObjects` entries: either drop every synced key to match, or leave the `Secret` alone as unmanaged. The driver did neither. 68 of the 69 synced keys stayed in the `Secret`. The driver pruned one. That one key was the only one the old Deployment still referenced via `secretKeyRef`. The other 68 orphans pointed at keys no code cared about any more.
I have no tidy explanation for which keys the driver keeps and which it drops. Assume zeroing out `secretObjects` does not give you a clean reset.
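Absent a tidy rule, the practical move is to diff the declared keys against the synced ones yourself after each change. A minimal sketch, assuming you have already pulled both key sets out with kubectl (the sample data here is made up):

```python
def diff_keys(declared, synced):
    """Compare keys declared in the SPC's secretObjects against keys
    actually present in the synced Secret.

    Returns (orphans, missing): keys still in the Secret but no longer
    declared, and declared keys the driver failed to sync."""
    declared, synced = set(declared), set(synced)
    return synced - declared, declared - synced

# e.g. after removing API_TOKEN from secretObjects:
declared = {"DB_PASSWORD", "SIGNING_KEY"}              # from the SPC manifest
synced = {"DB_PASSWORD", "SIGNING_KEY", "API_TOKEN"}   # from `kubectl get secret -o json`
orphans, missing = diff_keys(declared, synced)
print(sorted(orphans), sorted(missing))   # ['API_TOKEN'] []
```

Either set being non-empty after a batch is the signal to stop and look before removing the next batch.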
## Getting back to green
Two things unblocked me:
- I reapplied the old SPC by hand with `kubectl apply -f`. The CSI driver recreated the missing key in the `Secret` within a few seconds, and the old Deployment pods came up.
- I fixed the IAM binding that had caused the migration `Job` to fail, then rolled forward to the new manifests in a second release.
With the migration Job running, the follow-up upgrade went through and left the SPC in the stripped state I had intended.
## Rules I now follow
- **One concern per release.** Stripping `secretKeyRef` on the Deployment is one release. Removing `secretObjects` on the SPC is another. Changing the IAM username the prestart hook uses is a third. Each release should have a blast radius you can explain in one sentence.
- **Assume hooks commit.** A pre-upgrade hook change sticks, whether or not the main manifest lands and whether or not you then roll back. If the hook applies a resource the main app depends on, and the main manifest never lands, you have a split cluster and no automated recovery.
- **Stage the removal of synced keys.** If CSI sync depends on `secretObjects`, remove entries in small batches and confirm the `Secret` matches expectations after each batch. Or delete the `Secret` yourself and let CSI recreate it from the remaining entries.
- **Do not rely on library defaults to fill helm holes.** My setup worked before because a Python config class had a default value that stood in for a missing helm value. A later release set that value in helm to something different, the silent default stopped helping, and the prestart hook broke. Set values in helm so the behaviour does not depend on a default three repos away.
- **Test the failing prestart path.** If the migration `Job` can fail for auth, DB unavailability, or a bad migration, you are one transient failure away from a release stuck half-applied. Before relying on a pre-upgrade hook, work out what happens if it fails mid-upgrade.
- **Third-party libraries bring their own env readers.** The app's settings class uses a descriptor that reads from `/mnt/secrets/<KEY>` on attribute access. A vendored client library that builds its own `BaseSettings` and reads the same key from `os.environ` does not hook into that descriptor. When I removed the env var from helm, the library's default empty string won and the outbound API call failed on first use. Before stripping an env var, grep the vendored code for any module that reads it. The app's secret-loading path does not cover them.
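The last rule is easier to see in code. Below is a minimal sketch of the clash, with hypothetical class names; the real app used a descriptor like this and the real library used `pydantic.BaseSettings`, but plain classes keep the sketch dependency-free:

```python
import os

class MountSecret:
    """Descriptor that reads a secret file from the CSI mount on attribute access."""
    mount_dir = "/mnt/secrets"

    def __init__(self, key):
        self.key = key

    def __get__(self, obj, objtype=None):
        with open(os.path.join(self.mount_dir, self.key)) as f:
            return f.read().strip()

class AppSettings:
    # The app's path: no env var needed, the value comes from the mounted file.
    api_token = MountSecret("API_TOKEN")

class VendoredClientSettings:
    """Stand-in for a vendored library with its own env reader. It never
    goes through MountSecret, so stripping the env var silently leaves it
    with the empty-string default instead of failing fast."""
    def __init__(self):
        self.api_token = os.environ.get("API_TOKEN", "")
```

With the env var removed from helm, `AppSettings().api_token` keeps working because the file is still mounted, while `VendoredClientSettings().api_token` quietly becomes the empty string and only blows up on the first outbound call.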
## Follow-up
I hit the pattern in the last two rules a few hours after publishing this post. A separate cleanup removed an API key env var from the same service. The app's own config reads that key from the CSI mount now. A vendored client library had its own `pydantic.BaseSettings` reading the key from the environment, and the app's mount plumbing could not see it. Every request that hit that library started erroring with a `ValueError` about the missing API key. I restored the env injection as a hotfix, then published a library release that accepted the key as a constructor argument. The app now reads the key from the CSI mount and passes it into the client call.
## Further reading
- Helm Chart Hooks for the official ordering and deletion policy reference.
- Secrets Store CSI Driver concepts on how `secretObjects` syncs mounted files to Kubernetes `Secret` objects.
- `helm upgrade --atomic` rolls back main manifests on failure, but does not revert hook side effects.