
Alerts Are Not Incidents

Your on-call engineer gets paged at 3 AM. They pull up the alert -- CPU utilization on a single node crossed 85% for five minutes. They check the dashboard. The node is fine now. Traffic dipped back to normal. They acknowledge the alert and go back to sleep. Twenty minutes later, another page. Disk utilization on a different node crossed 90%. They check it. A log rotation job hadn't run yet. It's fine. They go back to sleep again.

By 7 AM, they've been paged four times. None of the alerts represented an actual problem. No customer was affected. No service degraded. But the engineer is exhausted, frustrated, and starting their day already behind. This is what alert fatigue looks like, and it's one of the most corrosive problems in modern operations.

The root issue isn't bad monitoring. It's a conceptual failure. Most teams never draw a clear line between an alert and an incident. They treat every alert as something that demands human attention, and they wonder why their on-call rotation is a revolving door of burnout.

What an Alert Actually Is

An alert is a signal. It's the system telling you that a condition has been met -- a threshold crossed, a pattern matched, an anomaly detected. That's it. An alert is not a statement that something is broken. It's not a declaration that customers are impacted. It's not an automatic justification for waking someone up.

Alerts exist on a spectrum. Some are informational -- disk usage trending upward, a deployment just completed, a batch job finished. Some are warnings -- a service is approaching its error budget, latency is creeping up, a dependency is responding slowly. And some are critical -- the system is actively degraded, users are being impacted, data integrity is at risk.

The problem is that most teams treat the entire spectrum as "page someone." They configure monitoring, set thresholds, and route everything to PagerDuty. Every alert becomes an interrupt. Every interrupt demands context-switching. And the cost of that context-switching is enormous -- not just in time, but in trust.

What an Incident Actually Is

An incident is a materially different thing. An incident means that something is wrong in a way that matters. Customers are affected. A service level objective is being violated. Business operations are impaired. Data is at risk. An incident has scope, impact, and urgency that justify pulling humans away from whatever else they're doing.

Here's the distinction that changes everything: an alert is a signal from your system. An incident is a decision by your team. Alerts are automated. Incidents are declared. The gap between them is where operational maturity lives.

A mature team receives alerts and evaluates them. Some alerts are acknowledged and dismissed -- they're informational, or the condition resolved itself, or it's a known issue with a scheduled fix. Some alerts are correlated with other signals and escalated to incident status. Some alerts trigger automated responses that handle the situation without human intervention. The key is that a human -- or an automated system with well-defined logic -- makes the determination about whether an alert constitutes an incident.

The Cost of Conflating Them

When alerts and incidents are treated as the same thing, the consequences compound quickly.

Alert fatigue. Engineers learn to ignore alerts because most of them don't matter. This is a rational response to an irrational system. If 80% of your pages are false positives or non-actionable, your on-call will develop the habit of deprioritizing all pages. Then the one that actually matters -- the real incident -- gets the same dismissive treatment. You've built a system that cries wolf, and now nobody comes running when the wolf shows up.

Eroded trust in monitoring. When dashboards are full of firing alerts that everyone knows don't matter, the monitoring system itself loses credibility. Engineers stop looking at dashboards. They stop trusting the data. They develop ad-hoc ways of checking system health -- SSHing into boxes, running manual queries, asking colleagues. You've invested in observability infrastructure and then trained your team to ignore it.

On-call burnout. On-call rotations should be sustainable. An engineer should be able to carry a pager for a week and maintain their normal work output, their sleep schedule, and their sanity. When every alert is treated as an incident, on-call becomes a marathon of interruptions. The best engineers start refusing on-call shifts. The ones who can't refuse start looking for jobs where they won't be paged for things that don't matter.

Slow incident response. When every alert triggers the incident response process, the process itself becomes bloated. War rooms get opened for non-issues. Status pages get updated for transient blips. Stakeholders get notified about things that resolved on their own. Eventually, the team starts cutting corners on the process to keep up with the volume, which means that when a real incident hits, the response muscle has atrophied.

Defining Your Incident Threshold

The fix starts with a definition. Your team needs to explicitly decide what constitutes an incident. This isn't a philosophical exercise -- it's a practical one. Write it down. Make it concrete. Here's a starting framework:

An alert becomes an incident when one or more of the following are true:

- Customers are affected in a way you can observe or measure.
- A service level objective is being violated or is at imminent risk.
- Business operations are impaired.
- Data integrity is at risk.

Everything else is an alert. Alerts get logged, tracked, and analyzed. Some get automated responses. Some generate tickets for follow-up. But they don't page anyone, they don't open war rooms, and they don't wake engineers up at 3 AM.
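
To make "write it down" concrete, here is a minimal sketch of an incident-threshold check. Every field name below is an illustrative assumption, not a prescribed schema -- map them to whatever signals your monitoring stack actually exposes:

```python
from dataclasses import dataclass

@dataclass
class AlertContext:
    """Signals gathered when an alert fires. All fields are hypothetical."""
    customers_affected: bool  # e.g. elevated error rate on user-facing endpoints
    slo_violated: bool        # an SLO is breached or at imminent risk
    business_impaired: bool   # e.g. checkout, billing, or another core flow is down
    data_at_risk: bool        # e.g. failed backups, replication lag, corruption

def is_incident(ctx: AlertContext) -> bool:
    """An alert becomes an incident when any criterion is true.
    Everything else stays an alert: logged, tracked, maybe ticketed."""
    return (ctx.customers_affected
            or ctx.slo_violated
            or ctx.business_impaired
            or ctx.data_at_risk)
```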

Alert Routing and Prioritization

Once you've defined the boundary between alerts and incidents, you need infrastructure to enforce it. This means tiered alert routing -- not everything goes to the same place.

Tier 1: Log and track. Informational alerts get sent to a logging system and a dashboard. Nobody is notified. These are the "good to know" signals -- deployment completions, batch job status, capacity utilization trends. Engineers can review them during business hours as part of routine operational review.

Tier 2: Notify during business hours. Warning-level alerts get sent to a Slack channel or email. They're visible, but they don't interrupt. These are conditions that need attention but aren't urgent -- a service approaching its resource limits, a certificate expiring in 30 days, a dependency showing intermittent errors.

Tier 3: Page the on-call. Critical alerts -- the ones that meet your incident threshold -- page a human. These should be rare. If your on-call is getting paged more than once or twice per shift for genuine incidents, either your systems have serious reliability problems or your incident threshold is set too low.

The goal is a funnel, not a firehose. Hundreds of signals come in. Monitoring systems process them. Most get logged. Some get surfaced. A handful page someone. And the ones that page someone actually matter.
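
As a sketch of what that funnel can look like -- the alert names, tier assignments, and default behavior here are all hypothetical, and in practice this logic usually lives in your alert manager's routing configuration rather than in application code:

```python
from enum import Enum

class Tier(Enum):
    LOG = 1     # Tier 1: record and dashboard only, reviewed in business hours
    NOTIFY = 2  # Tier 2: Slack/email, visible but not interrupting
    PAGE = 3    # Tier 3: meets the incident threshold, wake a human

# Hypothetical routing table for illustration.
ROUTES = {
    "deploy.completed":               Tier.LOG,
    "disk.usage.trending_up":         Tier.LOG,
    "cert.expires.30d":               Tier.NOTIFY,
    "dependency.errors.intermittent": Tier.NOTIFY,
    "slo.error_budget.burning":       Tier.PAGE,
    "service.unavailable":            Tier.PAGE,
}

def route(alert_name: str) -> Tier:
    # Unclassified alerts default to NOTIFY, not PAGE: an unknown alert
    # is a routing gap to close during business hours, not an emergency.
    return ROUTES.get(alert_name, Tier.NOTIFY)
```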

The False Positive Tax

Every false positive has a cost, and most teams dramatically underestimate it. A single false-positive page at 3 AM costs you roughly 30 minutes of engineer time -- acknowledging the alert, investigating, determining it's a non-issue, documenting the finding, and trying to fall back asleep. But the real cost is cumulative and behavioral.

After a week of false positives, your on-call engineer starts taking longer to respond. After a month, they start ignoring non-obvious alerts. After a quarter, your best people are negotiating their way off the on-call rotation entirely. The false positive tax isn't measured in minutes. It's measured in institutional trust and team retention.

Track your false positive rate. If more than 20% of your pages turn out to be non-actionable, you have a problem that deserves engineering investment. Every false positive alert should generate a follow-up: either tune the threshold, add a suppression rule, or automate the response so a human doesn't need to be involved next time.
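
A small sketch of what tracking that rate might look like, assuming your team records an actionable flag per page during post-shift review (the field names and sample data are hypothetical):

```python
def false_positive_rate(pages: list[dict]) -> float:
    """Fraction of pages marked non-actionable in post-shift review."""
    if not pages:
        return 0.0
    noise = sum(1 for page in pages if not page["actionable"])
    return noise / len(pages)

# Hypothetical week of pages: 2 of 5 were noise -> 40%, double the
# 20% line that should trigger engineering investment.
week = [
    {"alert": "cpu.node.high",            "actionable": False},
    {"alert": "disk.node.high",           "actionable": False},
    {"alert": "slo.error_budget.burning", "actionable": True},
    {"alert": "service.unavailable",      "actionable": True},
    {"alert": "db.replica.lag",           "actionable": True},
]
assert false_positive_rate(week) == 0.4
```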

Building the Feedback Loop

The distinction between alerts and incidents isn't a one-time decision. It requires ongoing calibration. Your systems change, your traffic patterns shift, your thresholds need adjustment. The way to maintain the distinction is with a feedback loop.

After every on-call shift, the outgoing engineer should answer three questions: Which pages were genuine incidents? Which pages were false positives or non-actionable? What changes would improve signal quality for the next rotation?

This takes ten minutes and generates enormous value over time. It turns on-call handoff from a procedural formality into a continuous improvement mechanism. Each rotation gets a slightly better signal-to-noise ratio than the last. Over months, the transformation is dramatic.

Some teams formalize this as a weekly "alert review" meeting. Fifteen minutes, once a week. Pull up the alert log. Classify each alert as signal or noise. Action the noise -- tune it, suppress it, or automate it. It sounds boring. It's one of the highest-leverage operational practices you can adopt.
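
One way to keep that review honest is to force every noisy alert to leave the meeting with a disposition and an owner. A hypothetical sketch of one week's output:

```python
from enum import Enum

class Disposition(Enum):
    KEEP = "keep"          # genuine signal, leave it alone
    TUNE = "tune"          # threshold too tight, adjust it
    SUPPRESS = "suppress"  # known or expected condition, add a rule
    AUTOMATE = "automate"  # real but mechanical, script the response

# Each noisy alert gets a disposition and an owner before the meeting ends.
review = [
    ("cpu.node.high",            Disposition.TUNE,     "alice"),
    ("disk.node.high",           Disposition.AUTOMATE, "bob"),  # e.g. trigger log rotation
    ("slo.error_budget.burning", Disposition.KEEP,     None),
]
```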

The Cultural Shift

The hardest part of this isn't technical. It's cultural. Many teams have an implicit belief that more alerts equals better monitoring. That turning off an alert is risky. That suppressing a page might mean missing something important. This belief is wrong, and it's the primary driver of alert fatigue.

Good monitoring isn't about volume. It's about precision. A system with ten well-tuned alerts that fire only when something genuinely needs attention is infinitely more valuable than a system with five hundred alerts that fire constantly and are mostly ignored. The first system builds trust. The second destroys it.

The cultural shift requires leadership support. When an engineer proposes suppressing a noisy alert, the response should be "yes, and let's make sure the condition is covered by a better mechanism" -- not "but what if we miss something?" The fear of missing something is valid, but it has to be weighed against the certainty of missing things because your team has learned to tune out the noise.

An alert that doesn't change behavior isn't monitoring. It's noise with a notification attached.

Start by defining what an incident is for your team. Write it down. Socialize it. Then audit your alerts against that definition. You'll find that the vast majority of what pages your on-call doesn't meet the bar. That's not a failure of your monitoring -- it's an opportunity to fix your operational model. The teams that get this right don't have fewer problems. They have fewer interruptions. And that distinction makes all the difference.
