I was doing planned maintenance on a Kubernetes cluster. Within seconds of starting the work, my phone was buzzing with PagerDuty alerts. I needed to silence them, but not blindly - only the specific alerts related to the maintenance work.
Via the UI
Go to http://alertmanager:9093/#/silences and click “New Silence”.
The UI is actually good for this - it shows you which alerts will match your silence before you create it. It works well for one-off silences; amtool is better for scripted ones.
Via amtool
Create a silence:
amtool silence add alertname=MyAlert --duration=2h --comment="Maintenance: upgrading k8s nodes"
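amtool needs to know where Alertmanager lives. If you haven’t set that up in amtool’s config file, you can pass it with the global --alertmanager.url flag - the hostname below matches the UI example above, swap in your own:
amtool --alertmanager.url=http://alertmanager:9093 silence add alertname=MyAlert --duration=2h --comment="Maintenance: upgrading k8s nodes"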
List silences:
amtool silence query
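The query subcommand also accepts matchers to filter the list, and the global -o flag switches the output format if you want more detail:
amtool silence query alertname=MyAlert
amtool silence query -o extended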
Expire a silence early:
amtool silence expire SILENCE_ID
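There’s also a quiet mode that prints only silence IDs, which is handy for bulk expiry - use it carefully, since it expires everything the query matches:
amtool silence expire $(amtool silence query -q)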
Matchers
Silence by label (regex supported):
amtool silence add alertname=~".*Error.*" --duration=1h
Multiple matchers (AND logic):
amtool silence add alertname=MyAlert severity=critical
Silence all alerts for pods matching a name pattern (note the regex matcher):
amtool silence add pod=~"my-app-.*" --duration=1h --comment="Rolling restart"
Silence a specific namespace:
amtool silence add namespace=production --duration=30m
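The UI preview has a CLI equivalent: run the same matchers through alert query first to see exactly what a silence would catch before you create it. For example:
amtool alert query alertname=ServiceDown namespace=production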
When to silence vs when to fix
This is the important question. I have a simple rule:
Silence: Temporary, expected state (maintenance, known issue with a fix in progress, flaky test being investigated)
Fix: Permanent change needed (alert threshold wrong, alert fires too often, underlying issue needs addressing)
Examples from my experience:
Silence these
- Deploying a new version (expected downtime)
- Scaling down a service intentionally
- Running database migrations (high CPU expected)
- Testing alert routing (don’t wake up the on-call)
- Known issue with an open incident
Fix these
- Alert fires every day at the same time (adjust threshold or fix root cause)
- Alert is too sensitive (tune the PromQL query)
- Alert fires for expected behaviour (delete the alert)
- Service is chronically unhealthy (fix the service, not the alert)
If you find yourself renewing the same silence repeatedly, that’s a code smell. Either fix the underlying issue or delete the alert if it’s not valuable.
Naming conventions
Comments are crucial. When you’re debugging at 3am and see a silence, you need to know why it exists and whether it’s safe to expire.
My template:
[Type]: [What] - [Why] - [Who] - [Ticket]
Examples:
# Good
amtool silence add ... --comment="Maintenance: k8s upgrade - nodes will restart - @bart - OPS-123"
# Good
amtool silence add ... --comment="Known issue: memory leak in v2.1.0 - fix in v2.1.1 - @sarah - BUG-456"
# Bad (vague, no context)
amtool silence add ... --comment="Maintenance"
# Bad (no owner or ticket)
amtool silence add ... --comment="Silencing this for now"
Include:
- Type: Maintenance, Known issue, Testing, etc.
- What: What you’re silencing
- Why: Reason for the silence
- Who: Owner (use @ mentions if your org supports it)
- Ticket: Link to incident or Jira ticket
This makes it easy to audit silences and know who to ask if something looks wrong.
My silence workflow
When I need to silence alerts for maintenance:
- Be specific: Only silence what you need. Don’t use alertname=~".*" unless you really mean it.
- Keep it short: I default to 1-2 hours. It’s easy to extend if needed.
- Document: Include a detailed comment with ticket reference.
- Notify: Tell your team in Slack. Someone will inevitably ask why alerts stopped.
- Clean up: Expire the silence immediately after maintenance completes.
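Here’s a rough sketch of that flow as a script - the matchers, comment and Slack step are placeholders pulled from the examples above, not a drop-in tool:
# Create the silence and capture its ID (amtool prints the ID on creation)
SILENCE_ID=$(amtool silence add alertname=ServiceDown service=my-app \
  --duration=2h \
  --comment="Maintenance: k8s upgrade - nodes will restart - @bart - OPS-123")
echo "Silenced as ${SILENCE_ID} - tell the team in Slack"

# ... do the maintenance ...

# Clean up as soon as the work is done
amtool silence expire "${SILENCE_ID}"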
Checking what’s silenced
This is important - silences can hide real problems.
Check what’s being silenced (silenced alerts are hidden from alert query by default):
amtool alert query --silenced
Show everything, including silenced and inhibited alerts:
amtool alert query --silenced --inhibited
List active silences:
amtool silence query
I have a Slack bot that posts daily reports of active silences. Helps catch forgotten silences that are masking real issues.
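If you don’t have a bot handy, a cron job gets you most of the way there. A minimal sketch, assuming jq is installed and a Slack incoming webhook is in $SLACK_WEBHOOK_URL (both are assumptions about your setup):
#!/usr/bin/env sh
# Post a summary of active silences to Slack (SLACK_WEBHOOK_URL is a placeholder)
amtool silence query -o json \
  | jq '{text: ("Active silences:\n" + ([.[] | "\(.id)  \(.comment)"] | join("\n")))}' \
  | curl -s -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK_URL"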
Common mistakes
1. Silencing too broadly
# BAD - silences every critical alert across every service
amtool silence add severity=critical --duration=1h
This masks real problems. Be specific:
# GOOD - silences specific service
amtool silence add alertname=ServiceDown service=my-app --duration=1h
2. Forgetting to expire
Set a calendar reminder or use shorter durations. I’ve seen silences that were active for months because someone forgot to expire them.
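Short durations are low-risk because extending is cheap: amtool has a silence update subcommand that takes the same duration flag, so you can stretch an existing silence instead of creating a second one. The ID below is a placeholder - check amtool silence update --help for the exact flags in your version:
amtool silence update --duration=2h SILENCE_ID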
3. No comments
Future you (or your on-call colleague) will have no idea why the silence exists. Always include a comment.
4. Silencing instead of fixing
If you’re silencing the same alert repeatedly, fix the root cause or delete the alert.
Alertmanager routing and inhibition
Silences are a blunt tool. For more sophisticated suppression, use Alertmanager’s routing and inhibition rules.
Routing: Send different alerts to different receivers (or nowhere).
Inhibition: Suppress alerts based on other alerts (e.g., if the whole cluster is down, don’t alert on individual pods).
Example inhibition rule (in alertmanager.yml):
inhibit_rules:
  - source_match:
      alertname: 'NodeDown'
    target_match:
      alertname: 'PodDown'
    equal: ['node']
This says: “If a node is down, don’t alert on pods down on that node.”
This is more maintainable than constantly silencing alerts manually.
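amtool helps on this side too: it can validate the config file and show where an alert with a given label set would be routed, which beats guessing. The label set below is just an example:
# Validate alertmanager.yml
amtool check-config alertmanager.yml

# Show which receivers an alert with these labels would hit
amtool config routes test --config.file=alertmanager.yml alertname=PodDown severity=critical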
Resources
- Alertmanager configuration documentation
- Routing tree editor - visualise your routing config
- amtool documentation
- Inhibition rules
- Silence API reference