SLI, SLO, and SLA: The Reliability Contract Stack

Overview

Service Level Indicators, Objectives, and Agreements form the foundational vocabulary of reliability engineering. They answer three distinct but related questions: what are we measuring, what do we commit to internally, and what do we promise to customers? Getting this hierarchy right determines whether engineering teams can make rational decisions about reliability investment, and whether those decisions align with what users actually experience.

The core insight, introduced systematically by the Google SRE organization in the early 2000s, is that reliability is not an abstract virtue — it is a measurable property of a system, and every reliability decision is a tradeoff with some other goal, most commonly velocity. Without a precise measurement framework, "we should be more reliable" is a wish. With SLIs, SLOs, and SLAs, it becomes a falsifiable claim with engineering consequences.

Prerequisites

Basic understanding of percentiles (p50, p99, p999) and what they mean for latency distributions
Familiarity with HTTP status codes and the concept of error rate
Understanding of uptime calculation (five nines = 99.999% = ~5 minutes downtime per year)
Awareness that distributed systems fail partially — not all-or-nothing

The SLI: What You Measure

A Service Level Indicator is a quantitative measure of some aspect of the service's behavior. The critical constraint is that a good SLI must track user happiness directly: it measures the user-facing behavior itself, rather than an infrastructure metric that merely correlates with user happiness.

Bad SLIs (infrastructure metrics):
- CPU utilization at 80%
- Memory usage at 4 GB
- Disk I/O wait at 15ms
- Network interface errors per second

These metrics are relevant for debugging, but none of them directly describe what a user experiences. A service can have 95% CPU utilization and still serve every request successfully. Conversely, a service can have 20% CPU utilization and still time out on every request.

Good SLIs (user-experience metrics):
- Request success rate: percentage of HTTP requests returning 2xx or 3xx
- Availability: fraction of time the service is serving requests successfully
- Latency: percentage of requests served within a threshold (e.g., under 200ms)
- Throughput: requests per second successfully handled
- Data durability: percentage of stored objects still retrievable after 30 days

The SLI formula:

SLI = (good events) / (total events) × 100%

Example — availability SLI:
SLI = (successful requests) / (total requests) × 100%

Example — latency SLI:
SLI = (requests completing in < 200ms) / (total requests) × 100%
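The two example formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` record, the function names, and the choice to report an empty window as 100% are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record; field names are illustrative."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request, in milliseconds


def sli_percent(good: int, total: int) -> float:
    """SLI = good events / total events x 100%.

    Assumption: a window with no traffic is reported as 100%.
    """
    if total == 0:
        return 100.0
    return 100.0 * good / total


def availability_sli(requests: list[Request]) -> float:
    """Share of requests that returned a 2xx or 3xx status."""
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return sli_percent(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Share of requests that completed in under threshold_ms."""
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli_percent(good, len(requests))


sample = [
    Request(200, 120.0),
    Request(200, 250.0),  # successful but slow: bad for latency only
    Request(503, 90.0),   # fast but failed: bad for availability only
    Request(302, 40.0),
]
print(availability_sli(sample))  # 75.0
print(latency_sli(sample))       # 75.0
```

Note that the same request can be good for one SLI and bad for another, which is why availability and latency are tracked as separate indicators.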

A critical choice when defining SLIs is the measurement point. A load balancer sees different traffic than an application server, which sees different traffic than a client-side SDK. Measure as close to the user as possible. Synthetic monitoring from outside your infrastructure captures what users actually experience; internal metrics capture what your code does.

The SLO: Your Internal Target

A Service Level Objective is a target value for an SLI over a time window, most often a rolling one. It is an internal commitment (not contractual, not customer-facing) that defines what "good enough" looks like for this service.

SLO = SLI target over a measurement window

Examples:
- 99.9% of requests succeed over a 28-day rolling window
- 99% of requests complete in < 200ms over a 28-day rolling window
- 99.95% availability measured monthly
- 99.999999% (eight nines) data durability over one year
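To make "SLI target over a measurement window" concrete, here is a rough sketch of a rolling-window check. It assumes per-day good/total counts are already available; the `RollingSLO` class and its method names are hypothetical, and real systems usually aggregate at finer granularity than one day.

```python
from collections import deque


class RollingSLO:
    """Evaluate an SLO over the most recent N days of (good, total) counts."""

    def __init__(self, target_percent: float, window_days: int = 28):
        self.target_percent = target_percent
        # deque with maxlen drops the oldest day automatically
        self.days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def sli(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        # Assumption: a window with no traffic counts as 100%
        return 100.0 * good / total if total else 100.0

    def is_met(self) -> bool:
        return self.sli() >= self.target_percent


slo = RollingSLO(target_percent=99.9, window_days=28)
slo.record_day(good=970, total=1000)       # one bad day: 3% errors
for _ in range(27):
    slo.record_day(good=1000, total=1000)  # 27 perfect days
print(round(slo.sli(), 3), slo.is_met())   # 99.893 False

slo.record_day(good=1000, total=1000)      # bad day rolls out of the window
print(slo.sli(), slo.is_met())             # 100.0 True
```

The second check shows the defining property of a rolling window: an old incident stops counting against the SLO once it ages out.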

SLOs are not set by wishful thinking. They are set by asking: what is the minimum reliability level that keeps users happy? For most user-facing services, a commonly cited rule of thumb is that users notice, and some abandon the service, when availability drops below roughly 99.9% or when interactive requests take longer than about 200ms. Treat these numbers as starting points to validate against your own user data, not as dogma.

SLOs by service type:

Service Type	Typical Availability SLO	Typical Latency SLO
User-facing web app	99.9%	p99 < 500ms
Payment processing	99.95%	p99 < 1000ms
Internal API (non-critical)	99.5%	p99 < 1000ms
Batch processing	99.0%	N/A (throughput SLO instead)
Data durability (object storage)	99.999999999% (11 nines)	N/A
CDN	99.99%	p95 < 50ms

The key discipline is not to make SLOs aspirational. An SLO you never miss is too loose — it means you are overinvesting in reliability and burning resources that could ship features. An SLO you always miss is useless. The goal is to set the SLO at the level where missing it actually causes user pain, and then treat violations as actionable signals.

The SLA: The Contractual Promise

A Service Level Agreement is a contractual commitment to customers, typically backed by financial penalties (service credits, refunds) when violated. SLAs are almost always looser than SLOs, for good reason: the SLO is your internal alarm system, giving you time to react before you breach the SLA.

Typical relationship:
SLA: 99.5% monthly availability (contractual, financial consequences)
SLO: 99.9% monthly availability (internal target, 5× stricter)
Current SLI: 99.95% (measured behavior)

If SLI drops below SLO → engineering alarm, fix before SLA breach
If SLI drops below SLA → financial penalties, customer trust damage
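The two rules above amount to a three-way classification, sketched below. The function name and the returned labels are illustrative assumptions; the thresholds in the usage lines come from the example relationship above.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify measured behavior against the internal and contractual targets.

    All arguments are percentages. Assumes the SLO is at least as strict
    as the SLA, which is the usual arrangement.
    """
    if slo < sla:
        raise ValueError("SLO should not be looser than the SLA")
    if sli < sla:
        return "sla_breach"  # financial penalties, customer trust damage
    if sli < slo:
        return "slo_breach"  # engineering alarm, fix before SLA breach
    return "ok"


print(reliability_status(sli=99.95, slo=99.9, sla=99.5))  # ok
print(reliability_status(sli=99.70, slo=99.9, sla=99.5))  # slo_breach
print(reliability_status(sli=99.20, slo=99.9, sla=99.5))  # sla_breach
```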

SLAs exist in customer contracts, terms of service, and vendor agreements. AWS, GCP, and Azure all publish SLAs; the Amazon EC2 SLA, for example, commits to 99.99% monthly uptime at the region level. If a provider breaches its SLA, customers receive service credits. Note that these credits rarely cover the full cost of an outage to the customer: they are symbolic, not compensatory.

SLI → SLO → SLA Relationship Diagram

  USER EXPERIENCE
       |
       | (measured by)
       v
  +---------+
  |   SLI   |  <-- What we measure: request success rate,
  |         |      latency percentile, availability %
  +---------+
       |
       | (target for)
       v
  +---------+
  |   SLO   |  <-- Internal target: 99.9% over 28 days
  |         |      Breach = engineering action required
  +---------+
       |
       | (looser than)
       v
  +---------+
  |   SLA   |  <-- External contract: 99.5% monthly
  |         |      Breach = financial penalties
  +---------+
       |
       v
  CUSTOMER CONTRACT

  SLO buffer = SLO - SLA = warning runway
  (Example: 99.9% SLO vs 99.5% SLA = 0.4% buffer)

The Error Budget

The error budget is the direct mathematical consequence of defining an SLO. If your SLO is 99.9% availability, you have a 0.1% error budget — that is how much unreliability you are permitted over the measurement window.

Error budget = 100% - SLO

99.9%  SLO → 0.1%  budget → 43.8 minutes/month or 8.76 hours/year
99.99% SLO → 0.01% budget → 4.38 minutes/month or 52.6 minutes/year
99.5%  SLO → 0.5%  budget → 3.65 hours/month
99.0%  SLO → 1.0%  budget → 7.3 hours/month

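(Monthly figures assume one twelfth of a 365-day year: 730 hours = 43,800 minutes. A 30-day month of 43,200 minutes gives slightly smaller budgets, e.g. 43.2 minutes at 99.9%.)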
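The arithmetic behind these figures is short enough to sketch directly. The function and constant names are illustrative; the month length used here is one twelfth of a 365-day year (43,800 minutes), which is the assumption that reproduces the monthly numbers listed above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12   # 43,800 (assumed average month)


def error_budget_minutes(slo_percent: float, window_minutes: float) -> float:
    """Error budget = (100% - SLO) applied to the length of the window."""
    return window_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.9, 99.99, 99.5, 99.0):
    per_month = error_budget_minutes(slo, MINUTES_PER_MONTH)
    per_year = error_budget_minutes(slo, MINUTES_PER_YEAR)
    print(f"{slo}% SLO: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```

Passing a different `window_minutes` (for example `28 * 24 * 60` for a 28-day rolling window) gives the budget for whatever window the SLO is actually defined over.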

Error budgets reframe the reliability conversation. They make explicit that some amount of unreliability is not just acceptable — it is paid for. Spending error budget on planned deployments, experiments, and configuration changes is legitimate. Spending it on incidents is waste. The error budget becomes the currency through which engineering teams negotiate between velocity and reliability.

Historical Context

The formal SLI/SLO/SLA framework was systematized by Ben Treynor Sloss when he founded Google's Site Reliability Engineering function in 2003. Treynor's core insight was that reliability work needed to be treated like a product — with quantifiable goals, clear ownership, and a way to say "this is good enough, now go build features." The framework was documented publicly in the 2016 Google SRE book, which became the seminal text in the field.

Before SRE, reliability was typically managed through threshold-based alerting: alert when CPU > 80%, alert when error rate > 1%. This approach fails because thresholds are arbitrary, infrastructure metrics are not user-facing, and every threshold eventually becomes stale. SLO-based alerting replaced "is something wrong?" with "are users experiencing the service we promised them?"

The concepts themselves predate Google — service agreements between IT departments and business units have existed since mainframe computing in the 1970s. But the SRE framework formalized the relationship between measurement, internal targets, and external contracts in a way that scaled to complex distributed systems.

SLO-Based vs Threshold-Based Alerting

THRESHOLD-BASED ALERTING:
  Alert if: error_rate > 1%
  Problem: 1% is arbitrary. Is 0.9% fine? Is 1.1% catastrophic?
           What about a 2% error rate for 5 minutes vs 0.5% for 12 hours?
           The first trips the alert; the second never does, yet at equal
           traffic it consumes about 36x more error budget.

SLO-BASED ALERTING (burn rate):
  Alert if: current error budget consumption rate is unsustainable
  Question: "At this rate, when will we exhaust the monthly error budget?"

  Burn rate = (current bad event rate) / (SLO bad event rate)

  If SLO is 99.9% (0.1% errors allowed):
  - Current error rate 0.1% → burn rate 1x → consuming exactly at budget
  - Current error rate 1%   → burn rate 10x → exhausts budget in 3 days
  - Current error rate 0.5% → burn rate 5x  → exhausts budget in 6 days

SLO-based alerting is context-aware: a brief spike matters less than a sustained elevated error rate. It connects every alert to user impact: you are burning the budget you paid for with your SLO, and the question is how fast.
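A small sketch of the burn-rate arithmetic follows. It assumes a 30-day window, a constant error rate, constant traffic, and a full budget at the start; the function names are hypothetical. It also computes how much of the budget each of the two scenarios from the threshold example (2% for 5 minutes, 0.5% for 12 hours) would consume under those assumptions.

```python
def burn_rate(error_rate_percent: float, slo_percent: float) -> float:
    """Burn rate = current bad-event rate / bad-event rate the SLO allows."""
    allowed = 100.0 - slo_percent
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_rate_percent / allowed


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    return float("inf") if rate <= 0 else window_days / rate


def budget_share_consumed(error_rate_percent: float, duration_minutes: float,
                          slo_percent: float, window_days: float = 30.0) -> float:
    """Fraction of the window's budget used by one episode (constant traffic)."""
    window_minutes = window_days * 24 * 60
    return burn_rate(error_rate_percent, slo_percent) * duration_minutes / window_minutes


for rate in (0.1, 1.0, 0.5):
    b = burn_rate(rate, 99.9)
    print(f"{rate}% errors: burn rate {b:.0f}x, "
          f"budget gone in {days_to_exhaustion(b):.0f} days")

spike = budget_share_consumed(2.0, 5, 99.9)       # 2% errors for 5 minutes
slow = budget_share_consumed(0.5, 12 * 60, 99.9)  # 0.5% errors for 12 hours
print(f"spike: {spike:.1%} of budget, slow burn: {slow:.1%} of budget")
```

Under these assumptions the slow burn costs far more budget than the spike, even though only the spike would cross a fixed 1% threshold.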

Production Examples

Example 1: E-commerce checkout service
- SLI: percentage of checkout requests returning HTTP 200 within 3 seconds
- SLO: 99.95% over a 28-day rolling window
- SLA: 99.9% monthly (in vendor contract with payment processor)
- Error budget: 21.6 minutes per 30-day month (about 20.2 minutes per 28-day window)

Example 2: Video streaming platform
- SLI: percentage of video play requests that successfully begin playback within 2 seconds
- SLO: 99.9% over 28 days
- SLA: N/A (consumer service, no contractual SLA; reputation risk only)
- Error budget: about 40 minutes per 28-day window; deployments typically cost 2-5 minutes each

Example 3: Object storage (S3-class)
- SLI: percentage of stored objects successfully retrieved over 12 months
- SLO: 99.999999999% (eleven nines) durability
- SLA: 99.9% availability, 99.999999999% durability
- Error budget for availability: 8.76 hours/year

Debugging Notes

When an SLO is being violated or the error budget is burning fast, the diagnostic sequence is:

Distinguish SLI measurement errors from actual problems. Is the measurement pipeline healthy? A broken metrics collector can make a healthy service look degraded. Verify with synthetic probes from outside your infrastructure.
Identify if the SLI is capturing the right population. Are health check endpoints inflating "successful request" counts? Is bot traffic skewing the denominator? The SLI should measure real user traffic.
Correlate with deployment events. Most SLO violations follow a deployment. Check for timing correlation between error budget burn and code or config changes.
Check the measurement window. A brief spike may not violate the 28-day SLO even if it is alarming. Understand whether you are looking at a burn rate alert (fast, actionable) or an actual SLO violation (slow, serious).
Separate user-facing from internal SLIs. An internal microservice SLO violation may not translate to a user-facing SLO violation if you have caching, retries, or graceful degradation upstream.
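The population check in the second step can be illustrated with a short sketch that compares the SLI over all logged requests with the SLI over user traffic only. Everything named here is an assumption for the example: the `LoggedRequest` fields, the health check paths, and the crude user-agent heuristic for bots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedRequest:
    """Hypothetical access-log record; field names are illustrative."""
    path: str
    user_agent: str
    status: int


HEALTH_PATHS = {"/healthz", "/ready"}       # hypothetical health endpoints
BOT_MARKERS = ("bot", "crawler", "spider")  # crude heuristic, not exhaustive


def is_user_traffic(r: LoggedRequest) -> bool:
    """Exclude health checks and obvious bots from the SLI population."""
    if r.path in HEALTH_PATHS:
        return False
    agent = r.user_agent.lower()
    return not any(marker in agent for marker in BOT_MARKERS)


def availability(requests: list[LoggedRequest]) -> float:
    """Percentage of requests with a 2xx or 3xx status (100% if empty)."""
    if not requests:
        return 100.0
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return 100.0 * good / len(requests)


log = (
    [LoggedRequest("/healthz", "kube-probe/1.29", 200)] * 6
    + [LoggedRequest("/profile", "Mozilla/5.0", 200)] * 3
    + [LoggedRequest("/profile", "Mozilla/5.0", 500)]
)
users_only = [r for r in log if is_user_traffic(r)]
print(availability(log))         # 90.0
print(availability(users_only))  # 75.0
```

A large gap between the two numbers suggests the SLI is measuring the wrong population rather than what users experience.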

Security Implications

SLIs can be manipulated. An adversary who can generate large volumes of successful "health check" requests inflates both the numerator and the denominator with good events, making a service appear more available than it is. SLI measurement pipelines must:

Use authenticated synthetic probes that test real user paths, not just health endpoints
Separate health check traffic from user traffic in SLI calculations
Protect metrics pipelines with the same rigor as production services — a compromised metrics store can be used to hide attacks by masking anomalous traffic patterns

DDoS attacks are a special case: a service under DDoS will show poor SLIs, but the errors are externally caused. SLO policies should distinguish between internally-caused and externally-caused budget consumption, though the user impact is identical.

Performance Implications

Measuring SLIs has overhead. Every request must be logged, timestamped, and classified as successful or failed. For high-volume services (100K+ requests/second), this measurement overhead is non-trivial:

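Sampling is acceptable for SLI measurement at high volume (for example, measuring 1% of traffic), provided the sample still contains enough failures to estimate the error rate at the precision the SLO requires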
Use efficient log pipelines; do not emit a synchronous metric write on every request
Pre-aggregate in the service (count successes per second, emit the count) rather than logging individual requests and aggregating later
Distinguish between p99 latency measurement (requires individual request timing) and availability measurement (requires only success/failure counts)
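The pre-aggregation point can be sketched as an in-process counter that a background task flushes periodically, so the request path only increments integers. The `SliCounter` class is hypothetical, and where the flushed counts are sent (a metrics agent, a log pipeline) is left out on purpose.

```python
import threading


class SliCounter:
    """Thread-safe good/total counter, flushed as one aggregate per interval."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._good = 0
        self._total = 0

    def observe(self, success: bool) -> None:
        """Called on the request path: two integer increments, no I/O."""
        with self._lock:
            self._total += 1
            if success:
                self._good += 1

    def flush(self) -> tuple[int, int]:
        """Return (good, total) since the last flush and reset the counts.

        Intended to be called once per interval by a background task that
        emits the pair to whatever metrics backend is in use.
        """
        with self._lock:
            snapshot = (self._good, self._total)
            self._good = 0
            self._total = 0
        return snapshot


counter = SliCounter()
for ok in (True, True, False, True):
    counter.observe(ok)
print(counter.flush())  # (3, 4)
print(counter.flush())  # (0, 0)
```

This covers availability-style SLIs, which need only success and failure counts. Latency SLIs need per-request timing, but the same pattern works if each request is classified against the threshold before it is counted.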

Failure Modes

The aspirational SLO: Setting 99.999% because it sounds impressive, when your actual reliability is 99.9%. The error budget is permanently exhausted, nobody believes the target, and the SLO has no engineering consequence.

The unmeasured SLI: Defining an SLO with no actual measurement backing it. SLO compliance is claimed but never verified. This is more common than it sounds — many organizations have SLOs on paper but the measurement pipeline either does not exist or measures the wrong thing.

The SLO for the wrong service: Measuring and setting SLOs on infrastructure metrics (load balancer uptime, database availability) without corresponding user-facing SLOs. The infrastructure can be "up" while users experience errors.

SLO theater: Going through the motions of setting SLOs with no error budget policy — no consequences for violating the SLO, no freeze on deployments when the budget runs out. Without consequences, the SLO is decorative.

Modern Usage

As of the mid-2020s, SLO-based reliability management has become standard practice across organizations of all sizes, not just hyperscalers. Tools like Google Cloud Operations Suite, Datadog, New Relic, Honeycomb, and Nobl9 provide first-class SLO management features: define an SLO, connect it to metrics, track burn rate, and alert on budget consumption.

The OpenSLO specification (openslo.com) defines a vendor-neutral YAML schema for declaring SLOs as code, enabling SLO definitions to live in version control alongside service definitions.

SLOs are increasingly part of platform engineering: internal developer platforms provide SLO templates for common service types, so teams do not have to design their measurement framework from scratch.

Future Directions

AI-assisted SLO setting: Using historical traffic and failure data to recommend SLO values rather than requiring engineers to determine them manually.

Client-side SLIs: Moving SLI measurement to the client (mobile app, browser SDK) to capture the full user experience including network conditions, rather than just server-side behavior. Google's Chrome UX Report and Web Vitals initiative are examples of this shift.

SLOs for ML systems: Traditional SLIs measure availability and latency. ML systems need new SLIs for prediction quality, model staleness, and data drift — all of which can degrade user experience without any infrastructure failure.

Cascading SLOs: Automatically computing composite SLOs for systems that depend on multiple services, with dependency graphs that propagate SLO risk from dependencies to dependents.
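As a rough illustration of the simplest composite case: if a request needs every dependency to succeed and failures are independent, the best achievable availability is the product of the individual availabilities. Both conditions are simplifying assumptions made for this sketch (real dependency graphs have retries, fallbacks, and correlated failures), and the function name is hypothetical.

```python
import math


def composite_availability(percentages: list[float]) -> float:
    """Availability of a chain of hard dependencies, in percent.

    Assumes every dependency must succeed and failures are independent.
    """
    return 100.0 * math.prod(p / 100.0 for p in percentages)


# A service at 99.9% calling two dependencies at 99.9% and 99.95%
chain = [99.9, 99.9, 99.95]
print(f"{composite_availability(chain):.3f}%")  # 99.750%
```

Under these assumptions, a dependent service cannot sustainably promise more than the combined figure, which is the risk a cascading SLO computation would propagate.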

Exercises

For a simple REST API serving user profiles, define three SLIs: one for availability, one for latency, and one for data correctness. For each, specify the exact measurement: numerator, denominator, measurement point, and time window.
A team's SLO is 99.9% availability over 28 days. In the first week, they had a 30-minute incident with 100% error rate. Calculate the remaining error budget for the rest of the 28-day window.
Your service currently has a 99.95% SLO. Marketing wants to launch a new feature that will require 3 hours of maintenance window. How does this affect the annual error budget? Should the SLO be revised?
A company's SLA promises 99.9% monthly uptime with a 10% service credit for each 0.1% shortfall. They have 99.7% uptime this month. Calculate the service credit owed on a $50,000 monthly contract.
Design a set of SLIs for a real-time bidding system that must respond in under 100ms. What makes latency SLI design particularly challenging at these timescales?

References

Beyer, B., Jones, C., Petoff, J., Murphy, N.R. (eds). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. Chapter 4: Service Level Objectives.
Beyer, B., Murphy, N.R., Rensin, D., Kawahara, K., Thorne, S. (eds). The Site Reliability Workbook. O'Reilly Media, 2018. Chapter 2: Implementing SLOs.
Sloss, B.T. "Keys to SRE." USENIX SREcon14, 2014.
Hausenblas, M., et al. OpenSLO Specification. https://openslo.com/
Nygard, M.T. Release It! Design and Deploy Production-Ready Software. 2nd ed., Pragmatic Bookshelf, 2018.
Limoncelli, T.A., Chalup, S.R., Hogan, C.J. The Practice of Cloud System Administration. Addison-Wesley, 2014.