I was doing planned maintenance on a Kubernetes cluster. Within seconds of starting the work, my phone was buzzing with PagerDuty alerts. I needed to silence them, but not blindly - only the specific alerts related to the maintenance work.
Via the UI
Go to http://alertmanager:9093/#/silences and click “New Silence”.
The UI is actually good for this - it shows you which alerts will match your silence before you create it. It works well for one-off silences; amtool is better for scripted ones.
Via amtool
Create a silence:
amtool silence add alertname=MyAlert --duration=2h --comment="Maintenance: upgrading k8s nodes"
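amtool needs to know where Alertmanager lives. If you haven’t set that up in amtool’s config file, you can pass it with the global --alertmanager.url flag - the hostname below matches the UI example above, swap in your own:
amtool --alertmanager.url=http://alertmanager:9093 silence add alertname=MyAlert --duration=2h --comment="Maintenance: upgrading k8s nodes"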
List silences:
amtool silence query
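The query subcommand also accepts matchers to filter the list, and the global -o flag switches the output format if you want more detail:
amtool silence query alertname=MyAlert
amtool silence query -o extended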
Expire a silence early:
amtool silence expire SILENCE_ID
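There’s also a quiet mode that prints only silence IDs, which is handy for bulk expiry - use it carefully, since it expires everything the query matches:
amtool silence expire $(amtool silence query -q)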
Matchers
Silence by label (regex supported):
amtool silence add alertname=~".*Error.*" --duration=1h
Multiple matchers (AND logic):
amtool silence add alertname=MyAlert severity=critical
Silence all alerts for pods matching a name pattern (note the regex matcher):
amtool silence add pod=~"my-app-.*" --duration=1h --comment="Rolling restart"
Silence a specific namespace:
amtool silence add namespace=production --duration=30m
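The UI preview has a CLI equivalent: run the same matchers through alert query first to see exactly what a silence would catch before you create it. For example:
amtool alert query alertname=ServiceDown namespace=production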
When to silence vs when to fix
This is the important question. I have a simple rule:
Silence: Temporary, expected state (maintenance, known issue with a fix in progress, flaky test being investigated)
Fix: Permanent change needed (alert threshold wrong, alert fires too often, underlying issue needs addressing)
Examples from my experience:
Silence these
- Deploying a new version (expected downtime)
- Scaling down a service intentionally
- Running database migrations (high CPU expected)
- Testing alert routing (don’t wake up the on-call)
- Known issue with an open incident
Fix these
- Alert fires every day at the same time (adjust threshold or fix root cause)
- Alert is too sensitive (tune the PromQL query)
- Alert fires for expected behaviour (delete the alert)
- Service is chronically unhealthy (fix the service, not the alert)
If you find yourself renewing the same silence repeatedly, that’s a code smell. Either fix the underlying issue or delete the alert if it’s not valuable.
Naming conventions
Comments are crucial. When you’re debugging at 3am and see a silence, you need to know why it exists and whether it’s safe to expire.
My template:
[Type]: [What] - [Why] - [Who] - [Ticket]
Examples:
# Good
amtool silence add ... --comment="Maintenance: k8s upgrade - nodes will restart - @bart - OPS-123"
# Good
amtool silence add ... --comment="Known issue: memory leak in v2.1.0 - fix in v2.1.1 - @sarah - BUG-456"
# Bad (vague, no context)
amtool silence add ... --comment="Maintenance"
# Bad (no owner or ticket)
amtool silence add ... --comment="Silencing this for now"
Include:
- Type: Maintenance, Known issue, Testing, etc.
- What: What you’re silencing
- Why: Reason for the silence
- Who: Owner (use @ mentions if your org supports it)
- Ticket: Link to incident or Jira ticket
This makes it easy to audit silences and know who to ask if something looks wrong.
My silence workflow
When I need to silence alerts for maintenance:
- Be specific: Only silence what you need. Don’t use alertname=~".*" unless you really mean it.
- Keep it short: I default to 1-2 hours. It’s easy to extend if needed.
- Document: Include a detailed comment with ticket reference.
- Notify: Tell your team in Slack. Someone will inevitably ask why alerts stopped.
- Clean up: Expire the silence immediately after maintenance completes.
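Here’s a rough sketch of that flow as a script - the matchers, comment and Slack step are placeholders pulled from the examples above, not a drop-in tool:
# Create the silence and capture its ID (amtool prints the ID on creation)
SILENCE_ID=$(amtool silence add alertname=ServiceDown service=my-app \
  --duration=2h \
  --comment="Maintenance: k8s upgrade - nodes will restart - @bart - OPS-123")
echo "Silenced as ${SILENCE_ID} - tell the team in Slack"

# ... do the maintenance ...

# Clean up as soon as the work is done
amtool silence expire "${SILENCE_ID}"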
Checking what’s silenced
This is important - silences can hide real problems.
Check what’s being silenced (silenced alerts are hidden from alert query by default):
amtool alert query --silenced
Show everything, including silenced and inhibited alerts:
amtool alert query --silenced --inhibited
List active silences:
amtool silence query
I have a Slack bot that posts daily reports of active silences. Helps catch forgotten silences that are masking real issues.
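If you don’t have a bot handy, a cron job gets you most of the way there. A minimal sketch, assuming jq is installed and a Slack incoming webhook is in $SLACK_WEBHOOK_URL (both are assumptions about your setup):
#!/usr/bin/env sh
# Post a summary of active silences to Slack (SLACK_WEBHOOK_URL is a placeholder)
amtool silence query -o json \
  | jq '{text: ("Active silences:\n" + ([.[] | "\(.id)  \(.comment)"] | join("\n")))}' \
  | curl -s -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK_URL"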
Common mistakes
1. Silencing too broadly
# BAD - silences every critical alert across every service
amtool silence add severity=critical --duration=1h
This masks real problems. Be specific:
# GOOD - silences specific service
amtool silence add alertname=ServiceDown service=my-app --duration=1h
2. Forgetting to expire
Set a calendar reminder or use shorter durations. I’ve seen silences that were active for months because someone forgot to expire them.
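Short durations are low-risk because extending is cheap: amtool has a silence update subcommand that takes the same duration flag, so you can stretch an existing silence instead of creating a second one. The ID below is a placeholder - check amtool silence update --help for the exact flags in your version:
amtool silence update --duration=2h SILENCE_ID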
3. No comments
Future you (or your on-call colleague) will have no idea why the silence exists. Always include a comment.
4. Silencing instead of fixing
If you’re silencing the same alert repeatedly, fix the root cause or delete the alert.
Alertmanager routing and inhibition
Silences are a blunt tool. For more sophisticated suppression, use Alertmanager’s routing and inhibition rules.
Routing: Send different alerts to different receivers (or nowhere).
Inhibition: Suppress alerts based on other alerts (e.g., if the whole cluster is down, don’t alert on individual pods).
Example inhibition rule (in alertmanager.yml):
inhibit_rules:
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'PodDown'
    equal: ['node']
This says: “If a node is down, don’t alert on pods down on that node.”
This is more maintainable than constantly silencing alerts manually.
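amtool helps on this side too: it can validate the config file and show where an alert with a given label set would be routed, which beats guessing. The label set below is just an example:
# Validate alertmanager.yml
amtool check-config alertmanager.yml

# Show which receivers an alert with these labels would hit
amtool config routes test --config.file=alertmanager.yml alertname=PodDown severity=critical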
Resources
- Alertmanager configuration documentation
- Routing tree editor - visualise your routing config
- amtool documentation
- Inhibition rules
- Silence API reference