We Had 400 Alerts a Week and Maybe 12 of Them Mattered
A client's on-call rotation was burning people out — not because of real incidents, but because of noise. Here's how we cut alerts by 87% without missing anything that actually required a human.
The first thing I noticed when I joined this engagement wasn't the code. It was the Slack channel. Their #alerts-production channel had 63 unread messages from the past four hours, and nobody was reacting to any of them. Engineers had muted it. The on-call person told me he just skimmed it every 30 minutes and "used gut feel" to decide what was real.
This was a team paying for Datadog, PagerDuty, and a custom alerting pipeline built on top of CloudWatch. Three systems, nearly 200 alert rules, and the net result was that everyone ignored all of them.
The Audit Nobody Wanted to Do
I proposed something boring: let's count. For two weeks, we tagged every alert that fired with one of four labels — actionable (a human needed to do something), informational (interesting but no action needed), duplicate (already covered by another alert), or noise (shouldn't be an alert at all).
The numbers were grim. In those two weeks, 847 alerts fired. The breakdown:
- Actionable: 23 (2.7%)
- Informational: 91 (10.7%)
- Duplicate: 204 (24.1%)
- Noise: 529 (62.5%)
Almost two-thirds of all alerts were pure noise. The biggest offender was a set of CPU utilization alerts on their Kubernetes nodes, firing whenever any node crossed 70% for more than five minutes. In a cluster with autoscaling enabled, this was meaningless. The nodes were supposed to run hot before scaling kicked in. But someone had set those thresholds two years ago when they were on static EC2 instances, and nobody had revisited them after the migration.
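For concreteness, the stale rule was shaped roughly like this. The client's version lived in CloudWatch, but I'll use Prometheus-style YAML here and throughout since it's the lingua franca for this kind of rule; the metric and alert names are illustrative, not theirs:

```yaml
# Illustrative sketch of the stale rule (Prometheus-style, not the
# client's actual CloudWatch config). It fires for any node above
# 70% CPU for five minutes -- meaningless on an autoscaled cluster
# whose nodes are supposed to run hot before scale-up kicks in.
- alert: NodeHighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.70
  for: 5m
  labels:
    severity: warning
```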
The duplicates were almost as bad. A single database failover would trigger alerts from CloudWatch (RDS event), Datadog (connection pool errors), their application health checks (timeout), and PagerDuty (downstream service degradation). Four pages for one event. The on-call engineer would acknowledge the PagerDuty page, then spend ten minutes closing the others.
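The general fix for this pattern is inhibition: while the root-cause alert is firing, suppress the symptom alerts that describe the same event. Their pipeline wasn't Alertmanager, but in Alertmanager terms the idea looks like this (all names here are hypothetical):

```yaml
# Hypothetical inhibition rule: while the failover alert is firing,
# suppress the symptom alerts (connection pool errors, health check
# timeouts) that share the same cluster label.
inhibit_rules:
  - source_matchers:
      - alertname = "RDSFailoverDetected"
    target_matchers:
      - category = "database-symptom"
    equal: ["cluster"]
```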
What We Cut
We started with a rule: if an alert doesn't have a documented response — a specific thing a human should do when it fires — it's not an alert. It's a metric. It belongs on a dashboard, not in someone's pocket at 3 AM.
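In rule form, that contract lives in the annotations. Here's a sketch of a conforming alert, with hypothetical metric, name, and URL; the point is that the summary and runbook fields carry the documented response:

```yaml
# Sketch of a rule that honors the contract: the annotations say
# what happened and exactly what the on-call human should do.
# Metric, alert name, and runbook URL are hypothetical.
- alert: PaymentWebhookBacklog
  expr: payment_webhook_queue_depth > 1000
  for: 10m
  labels:
    severity: p1
  annotations:
    summary: "Payment webhooks are backing up; customer charges are delayed."
    runbook_url: "https://wiki.example.com/runbooks/payment-webhook-backlog"
```

If you can't fill in the runbook line, the rule fails the test and goes to a dashboard instead.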
We went through every alert rule — all 189 of them — in four sessions over two weeks. Each session was two hours with the team leads from platform, backend, and SRE. It was tedious. Nobody enjoyed it. But by the end we had 41 alert rules left.
Some of what we removed:
- CPU/memory thresholds on autoscaled infrastructure. Replaced with alerts on scaling failures (when autoscaling can't add capacity).
- Individual pod restart alerts. Replaced with a rate-based alert: more than 5 restarts of the same deployment in 10 minutes.
- SSL certificate expiry warnings at 30, 14, and 7 days. Cert-manager handled renewal automatically. We kept only a 3-day alert as a last-resort signal that automation had failed.
- Every 5xx error. Replaced with an error rate threshold: more than 2% of requests returning 5xx over a 5-minute window. (A few of these replacement rules are sketched below.)
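Here's roughly what the replacement rules look like, again as Prometheus-style sketches rather than the client's Datadog and CloudWatch configs. The thresholds are the ones from the list; the metric names assume kube-state-metrics, an instrumented HTTP service, and the blackbox exporter:

```yaml
groups:
  - name: replacement-rules
    rules:
      # Rate-based restart alert: more than 5 restarts of the same
      # deployment in 10 minutes. The restart counter is per-pod, so
      # we derive a deployment label from the pod name (deployment
      # name, ReplicaSet hash, then a 5-character pod suffix).
      - alert: DeploymentRestartLoop
        expr: |
          sum by (namespace, deployment) (
            label_replace(
              increase(kube_pod_container_status_restarts_total[10m]),
              "deployment", "$1", "pod", "^(.*)-[a-z0-9]+-[a-z0-9]{5}$"
            )
          ) > 5
        labels:
          severity: p2

      # Error-rate alert: more than 2% of requests returning 5xx over
      # a 5-minute window, instead of a page per individual error.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.02
        labels:
          severity: p2

      # Last-resort certificate alert: fires only when cert-manager's
      # automated renewal has failed and expiry is within 3 days.
      - alert: CertRenewalFailed
        expr: probe_ssl_earliest_cert_expiry - time() < 3 * 86400
        labels:
          severity: p2
```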
The 5xx change was the most contentious. The backend lead worried about missing one-off errors that indicated deeper problems. Fair concern. We compromised by routing individual 5xx errors to a weekly report that the team reviewed in their Friday engineering sync. Not every signal needs to wake someone up.
The Routing Problem
Cutting alert rules was half the battle. The other half was routing. Before the cleanup, every alert went to the same Slack channel and the same PagerDuty escalation policy. A non-critical Elasticsearch disk-space warning landed with the same urgency as a payment processing failure.
We set up three tiers (a routing sketch follows the list):
- P1 — Page immediately. Customer-facing outages, payment failures, data integrity issues. These woke people up. We had 6 of these.
- P2 — Slack notification during business hours. Degraded performance, elevated error rates, infrastructure warnings. These needed attention within hours, not minutes. We had 19 of these.
- P3 — Weekly review. Trends, capacity forecasts, non-critical deprecation warnings. These went to a digest. We had 16 of these.
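Wiring-wise, their routing lived in PagerDuty and the custom pipeline, so take this as a shape rather than their config: in Alertmanager terms, the tiers become a severity label, three receivers, and a business-hours window for P2 (all endpoint details are hypothetical placeholders):

```yaml
route:
  receiver: weekly-digest            # default catch-all: P3 goes to the digest
  routes:
    - matchers: ["severity = p1"]
      receiver: pagerduty-oncall     # pages immediately, day or night
    - matchers: ["severity = p2"]
      receiver: slack-business-hours
      active_time_intervals: ["business-hours"]   # off-hours P2s wait for morning

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "18:00"

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-business-hours
    slack_configs:
      - channel: "#alerts-production"
  - name: weekly-digest
    webhook_configs:
      - url: "https://internal.example.com/alert-digest"   # hypothetical collector
```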
The P1 list was deliberately short. If everything is urgent, nothing is.
Three Months Later
We tracked the same metrics for the quarter after the cleanup. Weekly alert volume dropped from around 400 to 51. PagerDuty pages dropped from 38 per week to 9. The mean time to acknowledge a page went from 14 minutes to 3 minutes — not because the engineers got faster, but because they actually trusted that a page meant something real.
The on-call satisfaction survey (yes, the client ran one) improved from 2.1 to 3.8 out of 5. One engineer told me it was the first time in a year that being on-call hadn't ruined his weekend.
What I Took Away
The hardest part of this project wasn't technical. It was convincing people to delete alerts. Engineers feel safe when they're monitoring everything. Removing an alert feels like removing a safety net. But a safety net that's tangled in a hundred other nets doesn't catch anyone — it just makes the whole system heavier.
The most useful framing I found was this: an alert is a contract with the on-call engineer. It says "when this fires, here is what you should do." If you can't fill in the second part of that sentence, you don't have an alert. You have anxiety encoded in YAML.
I still wonder about the alerts we deleted. Statistically, some of them were probably catching real signals buried in the noise. But the team couldn't act on 847 alerts a week. They could act on 51. Is a system that catches everything but acts on nothing better than one that catches less but acts on all of it?