
SLOs Are Contracts, Not Dashboards


I've lost count of how many organizations I've walked into where someone proudly shows me their SLO dashboard. Beautiful graphs. Color-coded thresholds. Real-time error rates broken down by service. Availability percentages calculated to four decimal places. Then I ask one question: "What happens when you miss this target?" And I get a blank stare.

This is the fundamental problem with how most teams implement SLOs. They treat them as metrics to observe rather than contracts to enforce. The dashboard exists. The numbers are accurate. But nothing changes when the numbers go red. Feature work continues. Sprint priorities stay the same. The error budget burns through, and nobody adjusts course. At that point, your SLO isn't an objective. It's a decoration.

The Alphabet Soup Problem

Before we get into what SLOs should be, let's clear up what they are. The confusion between SLIs, SLOs, and SLAs is pervasive, and getting the hierarchy wrong leads to broken implementations.

SLIs -- Service Level Indicators -- are measurements. They're the raw signals: request latency at the 99th percentile, error rate as a percentage of total requests, availability measured as successful health checks over time. SLIs are objective. They come from your monitoring systems. They describe what is happening.

SLOs -- Service Level Objectives -- are targets. They say "this SLI should meet this threshold over this time window." For example: 99.9% of requests should complete in under 300ms over a rolling 30-day window. SLOs are aspirational but grounded. They describe what should be happening.
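
To make the first two layers concrete, here's a minimal sketch in Python. The record shape and function names are invented for illustration, not any particular monitoring system's API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    succeeded: bool

def sli_fast_and_successful(requests: list[Request], threshold_ms: float) -> float:
    """SLI: the fraction of requests that succeeded under the latency threshold."""
    if not requests:
        return 1.0  # no traffic, nothing violated
    good = sum(1 for r in requests if r.succeeded and r.latency_ms < threshold_ms)
    return good / len(requests)

# The SLO from the example above: 99.9% of requests under 300ms,
# evaluated over a rolling 30-day window of request records.
SLO_TARGET = 0.999
window = [Request(120.0, True), Request(450.0, True), Request(80.0, False)]  # toy data
print(sli_fast_and_successful(window, threshold_ms=300.0) >= SLO_TARGET)
```

The SLI function is pure measurement; the comparison on the last line is the objective. Keeping those two things separate in code mirrors keeping them separate in your head.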

SLAs -- Service Level Agreements -- are contracts with consequences. They're typically between a service provider and a customer, and they carry penalties for violation -- credits, refunds, or contractual remedies. SLAs describe what must happen, and they specify what goes wrong when it doesn't.

Here's where teams go sideways: they implement SLIs, call them SLOs, and ignore SLAs entirely. They have measurements and call them objectives. But an objective without a consequence is just a wish. The entire point of an SLO is that it triggers action when it's violated. Without that trigger, you've built an expensive speedometer and never looked at the speed limit.

Error Budgets: The Mechanism That Actually Matters

The concept that turns SLOs from dashboard decorations into operational tools is the error budget. An error budget is the inverse of your SLO -- it's how much unreliability you can tolerate within your target window.

If your SLO is 99.9% availability over 30 days, your error budget is 0.1% -- roughly 43 minutes of downtime per month. That's not a lot. And that's the point. Those 43 minutes are a finite resource. Every deployment that causes a brief outage, every upstream dependency that introduces errors, every infrastructure hiccup that drops requests -- all of it consumes your error budget.
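
The arithmetic is worth internalizing, and it's a one-liner. A sketch:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of tolerable downtime implied by an availability SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60

print(error_budget_minutes(0.999))  # 43.2 minutes per 30-day window
```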

When teams actually track error budget consumption, something powerful happens: reliability and feature velocity become negotiable trade-offs with a shared currency. The product team wants to ship a risky migration? Fine -- but it will consume error budget, and if the budget is already low, the migration waits. The infrastructure team wants to do a rolling upgrade? Same calculus. Every change that carries risk has a cost, and that cost is measured in the same units as the reliability target.

This is what most teams miss. The error budget isn't a metric. It's a decision-making framework. It answers the question that every engineering organization struggles with: "Should we prioritize reliability work or feature work right now?" If you have error budget remaining, ship features. If your error budget is depleted, fix reliability. It's not always that simple in practice, but it gives you a rational starting point instead of the usual political negotiation.
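
At its bluntest, the framework fits in a few lines. This sketch reuses the budget arithmetic above; the names are illustrative:

```python
def budget_fraction_left(slo_target: float, bad_minutes: float,
                         window_days: int = 30) -> float:
    """Fraction of the error budget still unspent in the current window."""
    budget = (1.0 - slo_target) * window_days * 24 * 60  # total budget, minutes
    return max(0.0, 1.0 - bad_minutes / budget)

def priority(budget_left: float) -> str:
    # The blunt version: budget remaining -> ship features,
    # budget depleted -> reliability work comes first.
    return "ship features" if budget_left > 0 else "fix reliability"

# 12 bad minutes against a 43.2-minute budget leaves ~72% -- keep shipping.
print(priority(budget_fraction_left(0.999, bad_minutes=12.0)))
```

Real organizations need the graduated policy described below, but this is the core logic everything else elaborates.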

Why Most SLO Implementations Fail

I've seen the same pattern at a dozen organizations. An SRE team reads the Google SRE book. They get inspired. They define SLIs for their critical services. They set SLO targets. They build dashboards. They present the whole thing to leadership. Everyone nods approvingly. And then nothing changes.

The SLOs don't fail because the numbers are wrong. They fail because the organizational machinery to act on them doesn't exist. There are several common failure modes:

No error budget policy. The team defined SLOs but never wrote down what happens when the error budget is exhausted. Without a policy, there's no trigger for behavior change. The SLO gets violated, someone sends a message in Slack, and then everyone goes back to what they were doing. You need a written policy that says: when error budget drops below X%, we freeze feature deployments. When it drops below Y%, we redirect engineering capacity to reliability. When it's exhausted, we enter an incident-prevention mode where only critical fixes ship.
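
One way to keep the policy from living only on a wiki page is to encode it where deployments are gated. A minimal sketch mirroring the ladder above; X and Y are named constants with hypothetical values, since those are exactly the part your organization negotiates:

```python
from enum import Enum

class Action(Enum):
    NORMAL = "normal operations"
    FREEZE_FEATURES = "freeze feature deployments"
    REDIRECT_CAPACITY = "redirect engineering capacity to reliability"
    CRITICAL_ONLY = "incident-prevention mode: only critical fixes ship"

# Hypothetical thresholds standing in for the X% and Y% above.
X, Y = 0.50, 0.25

def policy_action(budget_left: float) -> Action:
    """Map remaining error budget to the agreed-upon response."""
    if budget_left > X:
        return Action.NORMAL
    if budget_left > Y:
        return Action.FREEZE_FEATURES
    if budget_left > 0.0:
        return Action.REDIRECT_CAPACITY
    return Action.CRITICAL_ONLY

print(policy_action(0.60).value)  # normal operations
print(policy_action(0.10).value)  # redirect engineering capacity to reliability
```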

SLOs set by engineers, not negotiated with stakeholders. If the product team and business leadership weren't involved in setting the SLO, they have no ownership of it. When the error budget runs out and the SRE team says "we need to pause feature work," the product team pushes back because they never agreed to that trade-off. SLOs need to be negotiated, not declared. The product team needs to understand that choosing 99.9% over 99.95% means more room for velocity, and choosing 99.99% means significantly less.

Too many SLOs. Some teams define SLOs for everything -- every service, every endpoint, every dependency. This creates so much signal that it becomes noise. You can't make decisions based on fifty competing objectives. Start with three to five SLOs for your most critical user journeys. The checkout flow. The login path. The API that your largest customers depend on. These are the SLOs that matter. Everything else is monitoring.

SLOs that don't reflect user experience. A service can report 100% availability while users are having a terrible experience. If your SLO measures "did the server return a 200?" but doesn't account for response time, content correctness, or downstream dependency health, it's measuring server happiness, not user happiness. Good SLOs are defined from the user's perspective: "Can the user complete this action successfully, within an acceptable time frame?"
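
In code, a user-centric SLI means the "good event" predicate bundles all of those conditions, not just the status code. A sketch with invented field names:

```python
def is_good_event(status: int, latency_ms: float, body_valid: bool) -> bool:
    """Count a request as good only if the user actually got what they
    asked for: a success response, fast enough, with correct content."""
    return (
        200 <= status < 300     # the server said yes...
        and latency_ms < 300.0  # ...quickly enough to feel responsive...
        and body_valid          # ...and the payload was actually correct
    )
```

The hard part isn't the predicate -- it's instrumenting something like `body_valid`, which usually requires checks beyond what a load balancer can see.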

The Negotiation Framework

Here's where SLOs become genuinely powerful -- when they're used as a negotiation tool between engineering and product. This is the conversation most organizations have never had explicitly, and it's the most important one.

Every system has a natural reliability level -- the level it achieves without dedicated reliability investment. For most systems, this is somewhere around 99% to 99.5%. Getting from there to 99.9% requires meaningful investment: better monitoring, redundancy, graceful degradation, automated failover. Getting from 99.9% to 99.99% requires significantly more: multi-region architecture, sophisticated load balancing, extensive chaos engineering, possibly a dedicated reliability team. Getting to 99.999% is a lifestyle choice that affects every engineering decision.
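
The ladder is easier to feel when you see what each tier allows. A quick computation over a 30-day window:

```python
# Tolerable downtime per 30-day window at each reliability tier.
for target in (0.99, 0.995, 0.999, 0.9999, 0.99999):
    minutes = (1 - target) * 30 * 24 * 60
    print(f"{target:.3%}  ->  {minutes:7.1f} min/month")
```

The output runs from about 432 minutes a month at 99% down to roughly 26 seconds at five nines -- not enough time to notice a page, let alone respond to it.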

Each jump in reliability costs more -- in engineering time, infrastructure spend, and reduced velocity. The question isn't "how reliable should we be?" It's "how much are we willing to pay for each increment of reliability?" And "pay" here includes the opportunity cost of features not shipped.

When you frame it this way, the conversation changes entirely. Product leadership isn't being told "we need to work on reliability." They're being asked to choose between options with explicit trade-offs. "We can target 99.9% and ship the new feature by Q3, or we can target 99.95% and the feature slips to Q4. Which do you prefer?" That's a business decision, not a technical mandate. And it's a decision that should be made consciously, not by default.

SLOs as Contracts: What This Actually Looks Like

When I say SLOs are contracts, I mean they should carry the weight of agreements between teams. Not legal agreements -- organizational ones. Here's what an SLO-as-contract looks like in practice:

The SLO is documented and agreed upon by engineering, product, and business stakeholders. It specifies the SLI, the target, the measurement window, and the error budget policy. Everyone signs off. It's not a wiki page that the SRE team maintains in isolation.

Error budget consumption is reported regularly -- weekly at minimum. It's included in sprint reviews, product meetings, and engineering leadership syncs. The error budget is as visible as the feature roadmap. When it's healthy, nobody worries. When it's trending toward exhaustion, the conversation starts early.
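
A simple linear projection is often enough to start that conversation early. A sketch, assuming you track days elapsed in the window and the fraction of budget already spent:

```python
def projected_exhaustion_day(days_elapsed: float, budget_spent: float,
                             window_days: int = 30) -> float | None:
    """If the current burn rate holds, on what day of the window does the
    error budget run out? None means it outlasts the window."""
    if budget_spent <= 0:
        return None
    burn_per_day = budget_spent / days_elapsed
    exhaustion_day = 1.0 / burn_per_day
    return exhaustion_day if exhaustion_day <= window_days else None

# Ten days in with 60% of the budget spent: exhaustion around day 17.
print(projected_exhaustion_day(days_elapsed=10, budget_spent=0.60))
```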

The error budget policy is enforced. When the policy says "freeze feature deployments," deployments actually freeze. This is the hardest part, and it's where organizational commitment is tested. The first time a deadline approaches and the error budget is exhausted, leadership has to decide whether the SLO means what it says. If they override the policy, the SLO is dead -- nobody will take it seriously again.

SLOs are reviewed and renegotiated quarterly. Business needs change. Systems evolve. The SLO that made sense six months ago might be too aggressive or too lenient now. Regular review keeps the targets meaningful and the buy-in genuine.

Starting from Zero

If your organization doesn't have SLOs yet -- or has dashboard-only SLOs that nobody acts on -- here's a practical path forward:

  1. Identify your critical user journeys. Not services -- journeys. "User can log in." "User can complete a purchase." "API consumer can retrieve account data." These are the things that matter from a business perspective, and they're the right level of abstraction for SLOs.
  2. Define SLIs for each journey. What can you measure that tells you whether the journey is working? Availability, latency, correctness -- pick the dimensions that matter most for each journey. Keep it simple. One or two SLIs per journey is plenty to start.
  3. Set conservative initial targets. Look at your historical data. If your login flow has been 99.7% available over the past six months, don't set an SLO of 99.99%. Start at 99.5% -- below your historical baseline. You want early wins, not early failures. You can tighten the target once the organizational muscle is developed. (A sketch of this step follows the list.)
  4. Write the error budget policy before you launch. This is non-negotiable. The policy is the entire point. Without it, you're building another dashboard. The policy doesn't need to be complex. It needs to specify thresholds and actions. "Below 50% error budget remaining: alert engineering leadership. Below 25%: pause non-critical deployments. Budget exhausted: all engineering capacity redirected to reliability until budget recovers."
  5. Get leadership buy-in on the policy, not just the targets. Anyone will agree to an availability target. The real agreement is on what happens when you miss it. Make sure the VP of Product and the VP of Engineering both understand and agree that error budget exhaustion means feature velocity slows down. If they won't agree to that, your SLOs are aspirational at best.
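
To ground step 3, the conservative starting target can come straight from monitoring history. A minimal sketch; the margin and rounding are illustrative choices, not a standard:

```python
def conservative_initial_target(daily_availability: list[float],
                                margin: float = 0.002) -> float:
    """Start safely below the historical baseline so the first quarters
    build credibility instead of burning it."""
    baseline = sum(daily_availability) / len(daily_availability)
    return round(baseline - margin, 4)

# Six months of ~99.7% availability suggests starting near 99.5%.
history = [0.997] * 180
print(conservative_initial_target(history))  # 0.995
```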

The Uncomfortable Truth

SLOs force an uncomfortable truth into the open: you can't have maximum reliability and maximum feature velocity at the same time. Every organization is making this trade-off already -- they're just making it implicitly, inconsistently, and without data.

Teams that ship constantly without measuring reliability are implicitly choosing velocity over stability. Teams that refuse to deploy because they're afraid of breaking something are implicitly choosing stability over velocity. Neither approach is optimal because neither is intentional.

SLOs make the trade-off explicit. They put a number on "how reliable do we need to be" and they create a mechanism -- the error budget -- for managing the tension between shipping and stability. When done right, they're not a constraint on velocity. They're a framework for sustainable velocity. They tell you when you can push hard and when you need to ease off. They replace gut feelings and political arguments with data-driven decisions.

An SLO that doesn't change your behavior when it's violated is just a number. Make it a contract, give it teeth, or don't bother.

The teams that get this right don't have fewer arguments about reliability versus features. They have better arguments -- ones grounded in shared objectives, measured trade-offs, and agreed-upon policies. The SLO isn't the end of the conversation. It's the framework that makes the conversation productive.
