07 — SLI, SLO, and SLA
Technical Overview
SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) form the framework that turns raw observability data into actionable reliability engineering. These three constructs (a metric, a target, and a contract) define the bridge between technical system behavior and business commitments. Properly implemented, they replace noisy, ad hoc alerting with a disciplined signal: the error budget, which tells engineering teams when to invest in reliability and when it is safe to ship features.
This framework was systematized by Google's SRE organization and published in Site Reliability Engineering (2016). It has since become the dominant approach for managing reliability in cloud-native systems.
Prerequisites
- Familiarity with Prometheus and PromQL
- Understanding of percentile metrics and rate calculations
- Experience with on-call and incident response
- Basic probability and statistics literacy
Core Content
SLI: Service Level Indicator
An SLI is the measured metric that reflects user experience. It must be a quantifiable, observable property of the service from the user's perspective.
Good SLI properties:
- Directly correlated with user happiness
- Observable and measurable (not theoretical)
- Low noise (not affected by unrelated system behavior)
- Actionable (teams can influence it)
SLI categories (from Google SRE):
| Category | Example SLI |
|---|---|
| Availability | Fraction of successful HTTP requests (non-5xx) |
| Latency | Fraction of requests completing in < 200ms |
| Quality | Fraction of search results including personalization |
| Coverage | Fraction of events processed within 5 minutes |
| Freshness | Fraction of data updated within 1 hour |
| Throughput | Fraction of time pipeline processes at target rate |
Ratio-based SLIs are preferred over raw counts because they normalize for traffic volume:
Availability SLI = good_requests / total_requests
Where:
good_requests = HTTP 2xx + 3xx + 4xx (user error, not service error)
total_requests = all requests (including 5xx)
Note: 4xx responses are typically "good" — the user made a bad request,
the service correctly reported an error.
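As a minimal sketch of the ratio above, the availability SLI can be computed from per-status request counts. The function name and the sample counts are illustrative assumptions, not part of any library.

```python
def availability_sli(status_counts: dict[int, int]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests in window")
    # 2xx, 3xx and 4xx all count as good; only 5xx is a service error.
    good = sum(n for status, n in status_counts.items() if status < 500)
    return good / total


# Hypothetical counts for one measurement window.
counts = {200: 9_700, 301: 100, 404: 150, 500: 30, 503: 20}
print(f"{availability_sli(counts):.4f}")  # 0.9950
```

The 150 responses with status 404 stay in the numerator, which is what keeps client mistakes from consuming the service's budget.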
PromQL for availability SLI:
# 5-minute availability SLI (fraction of good requests)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI: fraction of requests completing in <200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
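The latency query works because Prometheus histogram buckets are cumulative: the `le="0.2"` bucket already counts every request that took 0.2 s or less. A rough Python sketch of the same ratio, using made-up bucket counts and a hypothetical helper name, shows why the SLI threshold has to coincide with a bucket boundary:

```python
import math


def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests at or below threshold_s, from cumulative bucket counts."""
    if threshold_s not in cumulative_buckets:
        # Without a bucket at the threshold the ratio cannot be read off exactly.
        raise KeyError(f"no bucket boundary at {threshold_s}s")
    total = cumulative_buckets[math.inf]  # the +Inf bucket equals the request count
    if total == 0:
        raise ValueError("no requests in window")
    return cumulative_buckets[threshold_s] / total


# Hypothetical cumulative counts keyed by upper bound in seconds.
buckets = {0.05: 6_000, 0.1: 8_200, 0.2: 9_600, 0.5: 9_950, math.inf: 10_000}
print(f"{latency_sli(buckets, 0.2):.2%}")  # 96.00%
```

If the service's histogram has no 0.2 s boundary, the instrumentation needs a bucket added there before this SLI can be measured.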
Anti-patterns for SLIs:
- Internal metrics (CPU utilization, memory usage) — not directly correlated with user experience
- Metrics that can be gamed (error rate that excludes timeouts)
- Metrics that are technically accurate but don't reflect reality (uptime without health checks)
SLO: Service Level Objective
An SLO is a target value for an SLI, measured over a time window. It represents the reliability level the service commits to internally.
SLO = SLI_metric compared_to SLI_threshold over time_window
Examples:
99.9% of requests succeed (availability SLI >= 99.9%) over rolling 30 days
99% of requests complete in < 200ms (latency SLI >= 99%) over rolling 30 days
99.95% of messages processed within 5 minutes over calendar month
Choosing the right SLO:
- Start with measurement: what is your current actual availability? Setting an SLO above your current performance means you're already violating it.
- Understand user expectations: a batch analytics system may need only 99% availability, while a payments API may need 99.99%.
- Consider cost: 99.9% allows ~43 minutes of downtime per month; 99.99% allows ~4 minutes. As a rule of thumb, each additional nine requires roughly 10x more reliability engineering investment.
SLO targets and permitted downtime:
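99%     → 7.3 hours/month (438 minutes)
99.5%   → 3.65 hours/month (219 minutes)
99.9%   → 43.8 minutes/month
99.95%  → 21.9 minutes/month
99.99%  → 4.4 minutes/month
99.999% → 26 seconds/month
Figures assume an average month of about 730 hours; a fixed 30-day window allows slightly less (43.2 minutes at 99.9%).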
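The downtime figures follow directly from the SLO percentage and the length of the window. A small sketch, assuming an average month of 730 hours (8,760 hours per year divided by 12) and a hypothetical helper name:

```python
AVG_MONTH_HOURS = 730.0  # assumption: 8,760 hours per year / 12


def allowed_downtime_minutes(slo_percent: float,
                             window_hours: float = AVG_MONTH_HOURS) -> float:
    """Minutes of total outage that exactly use up the error budget."""
    return (100.0 - slo_percent) / 100.0 * window_hours * 60.0


for slo in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min/month")
```

Passing `window_hours=720` gives the figure for a fixed 30-day window instead, which is the basis to use when the SLO itself is defined over rolling 30 days.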
SLA: Service Level Agreement
An SLA is an externally facing, contractual commitment. It includes:
- The SLO (metric and target)
- Measurement methodology (how is compliance calculated?)
- Consequences for violation (service credits, refunds, termination rights)
- Exclusions (scheduled maintenance, force majeure, customer-caused outages)
The SLA target should always be looser than the internal SLO: if your internal SLO is 99.9%, your SLA might commit to 99.5%. The buffer allows time to detect and remediate SLO violations before they become SLA violations.
SLA design considerations:
- Exclude maintenance windows (but publish them in advance)
- Define the measurement agent (who calculates compliance? The customer's measurement can differ from the provider's)
- Specify credit calculation (e.g., 10% of monthly bill for each 0.1% below SLA)
- Account for the window: many cloud providers calculate SLAs per calendar month, not over a rolling window
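The credit rule in the example ("10% of monthly bill for each 0.1% below SLA") leaves details open that a contract has to pin down. The sketch below is one possible reading, with two assumptions that are not from the text above: only complete 0.1% steps earn credit, and the credit is capped at 100% of the bill. Availability is passed in basis points (99.5% is 9950) so that step boundaries are not subject to floating-point rounding.

```python
def service_credit_percent(sla_bp: int, measured_bp: int,
                           credit_per_step: int = 10,
                           step_bp: int = 10,
                           cap: int = 100) -> int:
    """Credit as a percentage of the monthly bill (hypothetical rule)."""
    shortfall_bp = max(0, sla_bp - measured_bp)
    full_steps = shortfall_bp // step_bp  # assumption: partial steps earn nothing
    return min(cap, full_steps * credit_per_step)


# SLA of 99.5%, measured availability of 99.27%: two full 0.1% steps below.
print(service_credit_percent(9950, 9927))  # 20
```

Whether partial steps round up or down, and whether a cap exists, are exactly the kind of measurement details the SLA text should state explicitly.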
Error Budget
The error budget is the quantification of acceptable unreliability:
Error Budget = 1 - SLO Target
For 99.9% SLO:
Error budget = 0.1% of requests may fail
Over 30 days at 100 req/s = 259,200,000 total requests
Budget = 259,200 failed requests
Or equivalently: 43.2 minutes of complete downtime
Error Budget Remaining = (budget - consumed) / budget * 100%
Where:
consumed / budget = (1 - current_SLI) / (1 - SLO_target)
current_SLI = good_requests_in_window / total_requests_in_window
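To make the budget arithmetic concrete, here is a short sketch that computes the budget, the fraction consumed, and the percentage remaining from request counts. The function name and the 64,800 failed requests are illustrative assumptions; the traffic volume matches the 100 req/s example.

```python
def error_budget(slo_target: float, total_requests: int,
                 bad_requests: int) -> dict[str, float]:
    """Budget size and consumption for one SLO window."""
    budget_fraction = 1 - slo_target                  # e.g. 0.001 for 99.9%
    current_sli = 1 - bad_requests / total_requests
    consumed = (1 - current_sli) / budget_fraction    # 1.0 means fully spent
    return {
        "allowed_bad_requests": budget_fraction * total_requests,
        "consumed_fraction": consumed,
        "remaining_percent": (1 - consumed) * 100,
    }


report = error_budget(0.999, total_requests=259_200_000, bad_requests=64_800)
print({name: round(value, 2) for name, value in report.items()})
```

With these inputs roughly a quarter of the budget is consumed. A `consumed_fraction` above 1.0 means the SLO has been missed for the window.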
PromQL error budget consumption (recording rules):
# Error ratio over 30 days
- record: slo:error_ratio:30d
  expr: |
    1 - (
      sum(increase(http_requests_total{status!~"5.."}[30d]))
      /
      sum(increase(http_requests_total[30d]))
    )
# Error budget consumed (fraction of budget used)
- record: slo:error_budget_consumed:30d
  expr: |
    slo:error_ratio:30d / (1 - 0.999)  # 0.999 = 99.9% SLO target
Error Budget Policy
The error budget policy defines what happens when the error budget is exhausted or nearly exhausted:
Example policy:
- >50% budget remaining: normal development cadence. Features ship.
- 25-50% budget remaining: increased reliability review for new deployments. Load testing required.
- 0-25% budget remaining: all non-critical feature work pauses. Every deployment requires reliability review and phased rollout.
- Budget exhausted (0%): feature freeze. Engineering focus shifts entirely to reliability improvements. No feature deployments until the budget is replenished; reliability fixes continue to ship.
- Budget exhausted for 2+ consecutive windows: escalation to engineering leadership; reliability postmortem required; public SLO review.
This policy transforms reliability from an engineering concern into a shared engineering + product concern. Product managers have incentive to support reliability work because it restores their ability to ship features.
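A policy like this is easy to encode so that dashboards or deployment tooling can report the current tier. The sketch below mirrors the example tiers; the function name and the handling of the exact 25% and 50% boundaries are assumptions, since the example policy leaves them open.

```python
def policy_action(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = untouched) to a policy tier."""
    if remaining_fraction <= 0.0:
        return "feature freeze: reliability work only"
    if remaining_fraction < 0.25:
        return "pause non-critical features; review and phase every deployment"
    if remaining_fraction <= 0.50:
        # Assumption: exactly 25% and exactly 50% fall into this tier.
        return "extra reliability review and load testing for new deployments"
    return "normal development cadence"


for remaining in (0.80, 0.40, 0.10, 0.0):
    print(f"{remaining:.0%} remaining -> {policy_action(remaining)}")
```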
Burn Rate Alerting
Alerting directly on SLI metrics ("error rate > 1%") has poor signal-to-noise characteristics: brief spikes may not matter for the monthly SLO; slow, sustained degradation may exhaust the budget before anyone notices.
Burn rate solves this: how fast is the error budget being consumed relative to the budget allocation rate?
Burn Rate = current_error_rate / (1 - SLO_target)
For 99.9% SLO (error budget = 0.1%):
Burn rate 1x = consuming budget at exactly the rate it replenishes
(error rate = 0.1% exactly)
Burn rate 14x = budget exhausted in 30/14 = 2.1 days
(error rate = 1.4%)
Burn rate 1000x = budget exhausted in 43 minutes
(error rate = 100%)
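The same arithmetic in a few lines of Python, as a sketch with hypothetical helper names. It assumes the error rate stays constant and that the full 30-day budget is available when the burn starts.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - slo_target)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days for a full budget to run out at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


for error_rate in (0.001, 0.014, 1.0):
    rate = burn_rate(error_rate, slo_target=0.999)
    print(f"error rate {error_rate:.1%}: burn {rate:.0f}x, "
          f"exhausted in {days_until_exhausted(rate):.2f} days")
```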
Multi-window, multi-burn-rate alerts (MWMBAs), after the Google SRE Workbook:
Each tier pairs a long window (to confirm the burn is sustained, not a spike) with a short window (to confirm the burn is still happening, so the alert fires promptly and clears soon after errors stop):
ALERT TIERS:
Tier 1 (page immediately):
Fast burn: 14x burn rate
Windows: 1h AND 5m (both must be true)
Meaning: budget exhausted in 2 days; detected within 5 minutes
Tier 2 (page):
Slow burn: 6x burn rate
Windows: 6h AND 30m (both must be true)
Meaning: budget exhausted in 5 days; detected within 30 minutes
Tier 3 (ticket):
Very slow burn: 3x burn rate
Windows: 3d AND 6h (both must be true)
Meaning: budget significantly consumed but not immediately critical
Tier 4 (report):
Any burn: 1x burn rate
Windows: 30d
Meaning: SLO target is being missed
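The "both windows must be true" rule can be sketched as a small evaluation function. The tier values below are copied from the first three tiers above; the class, the function, and the sample burn rates are hypothetical, and a real system would read the burn rates from recording rules rather than a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    threshold: float
    long_window: str
    short_window: str
    action: str


TIERS = [
    Tier("fast burn", 14, "1h", "5m", "page"),
    Tier("slow burn", 6, "6h", "30m", "page"),
    Tier("very slow burn", 3, "3d", "6h", "ticket"),
]


def firing_tiers(burn_rates: dict[str, float]) -> list[Tier]:
    """Tiers whose long AND short window both exceed the threshold."""
    return [
        t for t in TIERS
        if burn_rates[t.long_window] > t.threshold
        and burn_rates[t.short_window] > t.threshold
    ]


# A brief spike: short windows are hot, long windows are not.
spike = {"5m": 40.0, "30m": 8.0, "1h": 4.0, "6h": 0.9, "3d": 0.2}
# A sustained burn: the 1h and 6h windows have caught up.
sustained = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 7.0, "3d": 1.5}

print([t.name for t in firing_tiers(spike)])      # []
print([t.name for t in firing_tiers(sustained)])  # ['fast burn', 'slow burn']
```

The spike never pages because no long window crosses its threshold, which is the behavior a single short-window alert cannot provide.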
PromQL implementation (for 99.9% availability SLO):
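# Note: sum() without a by clause drops every label, including the
# service label referenced in the annotations; use sum by (service) (...)
# in all SLO recording rules if per-service burn rates are needed.
# 1-hour burn rate
- record: slo:burnrate1h
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
    /
    (1 - 0.999)
# 5-minute burn rate
- record: slo:burnrate5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    /
    (1 - 0.999)
# Tier 1 page alert (fast burn)
- alert: SLOFastBurn
  expr: |
    slo:burnrate1h > 14
    and
    slo:burnrate5m > 14
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Fast error budget burn on {{ $labels.service }}"
    # Prometheus templates have no arithmetic operators, so the time to
    # exhaustion is stated as a formula rather than computed.
    description: "Burn rate {{ $value | humanize }}x over the last hour. A full 30-day budget lasts 720 / burn rate hours at this pace."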
SLO-Based vs Threshold-Based Alerting
Threshold alerting (traditional): "alert if error rate > 1% for 5 minutes." Simple, but has three problems:
1. False positives: a 2-minute spike at 5% error rate may not matter for the monthly SLO
2. False negatives: a sustained 0.5% error rate (below the threshold) is a 5x burn against a 99.9% SLO and exhausts the 30-day budget in about 6 days
3. Doesn't express urgency: you don't know if you have 20 minutes of budget left or 20 days
SLO-based alerting (burn rate): alerts directly on budget consumption rate. Better precision: pages only when it matters (budget is at risk), with urgency proportional to burn rate.
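The contrast between the two styles shows up clearly when the two scenarios are expressed as a share of the 30-day budget. This is a rough sketch that assumes constant traffic, so that time and request volume are interchangeable; the function name is hypothetical.

```python
def budget_share_consumed(error_rate: float, duration_minutes: float,
                          slo_target: float = 0.999,
                          window_days: int = 30) -> float:
    """Fraction of the window's error budget used by one episode."""
    window_minutes = window_days * 24 * 60
    return (error_rate * duration_minutes) / ((1 - slo_target) * window_minutes)


spike = budget_share_consumed(error_rate=0.05, duration_minutes=2)
slow = budget_share_consumed(error_rate=0.005, duration_minutes=6 * 24 * 60)

print(f"2-minute spike at 5% errors: {spike:.2%} of budget")
print(f"6 days at 0.5% errors:       {slow:.0%} of budget")
```

A "1% for 5 minutes" threshold would ignore the second scenario entirely, even though it spends the whole budget, while the first costs well under 1% of it.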
SLO Coverage
Not everything should have an SLO. Criteria for what to SLO:
- SLO candidates: customer-facing services, services with SLA commitments, internal services depended upon by customer-facing services
- Not SLO candidates: internal tools, development services, services with no defined reliability requirements
Too many SLOs create alert fatigue and dilute attention. Google SRE recommends starting with one SLO per customer journey (not per service), then expanding.
Measuring SLOs from Multiple Sources
SLIs should be measured where the user experiences them, not deep inside the system. Options:
- Server-side request metrics (Prometheus instrumentation): measures what the server sees. Misses failures before the request arrives (network failures, load balancer failures).
- Load balancer / API gateway metrics: captures more of the user experience. Many failures are caught here before reaching backend services.
- Synthetic monitoring (black-box probing, Prometheus Blackbox Exporter, Datadog Synthetics): artificial requests from external locations. A good proxy for real user experience when real-user data is unavailable, though it only covers the paths the probes exercise.
- Real User Monitoring (RUM): JavaScript in the browser captures actual user-experienced latency, including DNS resolution, TLS handshake, rendering. Most accurate but requires frontend instrumentation.
Production recommendation: use at least two sources and compare. Discrepancy between load balancer SLI and server-side SLI indicates infrastructure-level failures. Discrepancy between synthetic monitoring and load balancer SLI indicates geographic or network issues.
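The comparison rule can be automated in a few lines. The sketch below applies the two heuristics from the recommendation; the function name, the tolerance of 0.05 percentage points, and the sample values are assumptions chosen for illustration.

```python
def diagnose(server_sli: float, lb_sli: float, synthetic_sli: float,
             tolerance: float = 0.0005) -> list[str]:
    """Flag gaps between SLIs measured at different vantage points."""
    findings = []
    if server_sli - lb_sli > tolerance:
        # The load balancer sees failures the backend never recorded.
        findings.append("infrastructure-level failures in front of the backend")
    if lb_sli - synthetic_sli > tolerance:
        # External probes fail on requests the load balancer never saw.
        findings.append("geographic or network issues in front of the load balancer")
    return findings


print(diagnose(server_sli=0.9995, lb_sli=0.9970, synthetic_sli=0.9968))
```

Here only the first finding is reported: the gap between server and load balancer is 0.25 percentage points, while the gap to the synthetic probes stays inside the tolerance.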
Historical Context
The SLI/SLO/SLA framework as a discipline was codified in Google's SRE book (2016), though Google had been practicing it internally since the mid-2000s. The innovation was not inventing the concepts (SLAs have existed in IT contracts for decades) but rather operationalizing them as an engineering tool: error budgets, burn rate alerting, and the policy framework that connects reliability metrics to product decisions.
Before SRE, reliability was typically managed by operations teams using threshold-based alerting: if a metric crosses a value, page someone. This worked for simple systems but produced alert fatigue (too many pages) in complex distributed systems. The SLO framework shifted from "is this metric bad right now?" to "are we on track to meet our reliability commitment over the next 30 days?"
The multi-window multi-burn-rate alert approach was published in the Google SRE Workbook (2018) as the solution to false-positive alert fatigue from single-window burn rate alerts.
Production Examples
# Check current error budget consumption
curl -sG http://prometheus:9090/api/v1/query \
--data-urlencode 'query=slo:error_budget_consumed:30d' | \
jq '.data.result[] | {service: .metric.service, consumed: ((.value[1] | tonumber * 100 | tostring) + "%")}'
# Find services currently burning their budget faster than 14x
curl -sG http://prometheus:9090/api/v1/query \
--data-urlencode 'query=slo:burnrate1h > 14' | \
jq '.data.result[] | {service: .metric.service, burnrate: .value[1]}'
# List all active SLO alerts
curl -s http://alertmanager:9093/api/v2/alerts | \
jq '.[] | select(.labels.alertname | startswith("SLO")) | {alert: .labels.alertname, service: .labels.service, status: .status.state}'
Grafana SLO dashboard query (error budget gauge):
# Remaining error budget as percentage
(1 - slo:error_budget_consumed:30d) * 100
# Days of error budget remaining at current burn rate
(1 - slo:error_budget_consumed:30d) / slo:burnrate1h * 720 / 24
# 720 = hours in 30 days
Debugging Notes
SLO violated but no active alerts: The SLO is measured over 30 days; the burn rate alerts may not have fired if the consumption was gradual. Check the 30-day error budget consumption metric directly. If less than 50% of the budget remains halfway through the window, that's already a concern even without an alert firing.
Burn rate alert fires briefly, then resolves: Common cause is a single-window burn rate alert reacting to a brief burst of errors. This is why MWMBAs require both a short window AND a long window to agree, which reduces false positives from transient spikes.
SLI shows 0% error rate but users are complaining: SLI is measured server-side; errors happening at the CDN, load balancer, or in DNS are not counted. Check synthetic monitoring for a complete picture. Also check for silent failures (requests that succeed with wrong responses — business logic errors not reflected in HTTP status codes).
SLO calculation different from SLA calculation: Your SLO may use a rolling 30-day window; your SLA uses a calendar month. At the beginning of a new month, error budget resets even if last month's budget was exhausted. Confirm which window your SLA uses and implement the SLO calculation accordingly.
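The difference between the two windows is easiest to see on the same data. In this sketch the daily counts are invented: a steady 1,000,000 requests per day with one bad day late in May. The helper name is hypothetical.

```python
from datetime import date, timedelta


def sli_between(daily: dict[date, tuple[int, int]], start: date, end: date) -> float:
    """SLI over the days in [start, end]; daily maps date -> (good, total)."""
    good = total = 0
    for day, (g, t) in daily.items():
        if start <= day <= end:
            good += g
            total += t
    return good / total


# Hypothetical traffic from 2024-05-01 to 2024-06-14, with 50,000 failures on May 28.
daily = {date(2024, 5, 1) + timedelta(days=i): (1_000_000, 1_000_000) for i in range(45)}
daily[date(2024, 5, 28)] = (950_000, 1_000_000)

today = date(2024, 6, 5)
rolling = sli_between(daily, today - timedelta(days=29), today)  # rolling 30 days
calendar = sli_between(daily, today.replace(day=1), today)       # calendar month to date

print(f"rolling 30d: {rolling:.4%}, calendar month: {calendar:.4%}")
```

On June 5 the rolling window still contains the May 28 incident and sits below a 99.9% target, while the calendar-month figure has reset to 100%.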
Security Implications
- SLO dashboards and error budget data may reveal business-sensitive information: which services are degraded, during which time periods (potentially correlating with business events), and by what magnitude. Apply access controls.
- Error budget policies that freeze deployments can create urgency to bypass security reviews ("we're on budget freeze but this security patch is critical"). Define a security exception process in the error budget policy.
- SLA contractual terms are legal documents. Ensure that SLO calculation methodology cannot be disputed by a customer — use a third-party status page or agree on measurement source.
Performance Implications
- SLO queries over a 30-day window are expensive in Prometheus. Pre-compute them as recording rules that run every 30 seconds, not on-demand. An increase(counter[30d]) query without a recording rule can cause a query timeout on large Prometheus instances.
- High-resolution SLO metrics (per-second granularity over 30 days) require substantial storage. Use step intervals of 1 minute minimum for 30-day SLO calculations.
Failure Modes and Real Incidents
SLO gaming (incident at anonymous e-commerce company): A team facing budget exhaustion started routing known-bad requests (that would result in 500s) to a different error category that was excluded from the SLI. The error budget showed green while users experienced failures. Fix: include all request paths in the SLI; use synthetic monitoring as an independent check; audit SLI exclusions in SLO reviews.
Error budget exhausted before product noticed (Google internal, SRE book): A storage service exhausted its 30-day error budget in 18 days due to a series of small incidents. Because the error budget policy was not enforced, features continued to ship, causing more instability. The policy must have executive enforcement to be effective.
Alert storm from burn rate alert misconfiguration: A team set burn rate alert threshold at 2x (very low) with a 5-minute window only (no long window). Any brief traffic spike that happened to skew error rates triggered an immediate page. On-call engineers were paged 40+ times per day for non-incidents. Fix: implement two-window approach; set tier-1 threshold at 14x minimum.
Modern Usage
- Nobl9, Chronosphere, Sloth: SLO management platforms that automate SLO creation, tracking, and alerting configuration on top of Prometheus/Datadog/New Relic.
- Sloth (open source): generates Prometheus recording rules and alerting rules from a simple SLO definition YAML:
version: "prometheus/v1"
service: "payment-api"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: PaymentAPIAvailability
      page_alert: {labels: {severity: critical}}
      ticket_alert: {labels: {severity: warning}}
Future Directions
- Multi-dimensional SLOs: SLOs that vary by customer tier (enterprise customers get 99.99%, standard get 99.9%), region, or time of day. Standard Prometheus cannot natively express these without significant relabeling.
- SLO-aware deployment systems: CI/CD pipelines that check error budget consumption before proceeding with deployments (automatic canary gate: if burn rate > 3x, pause rollout).
- Customer-facing SLO dashboards: public status pages that expose the actual measured SLI in real time, not just incident history. Builds trust; forces discipline.
Exercises
- SLI selection: For a ride-sharing mobile app's backend, define at least 5 SLIs covering different aspects of the user experience. For each, write the ratio formula and the PromQL expression that would compute it. Justify why each reflects user happiness.
- Error budget calculation: A service has a 99.9% availability SLO. Over the last 30 days: total requests = 100M, 5xx responses = 90,000. (a) Calculate the error budget as a fraction and as minutes of downtime. (b) Calculate how much budget was consumed. (c) Is the team in the "feature freeze" zone?
- Multi-window burn rate implementation: Implement the full four-tier MWMBA alerting system for a service with a 99.9% availability SLO in Prometheus recording rules and alerting rules YAML. Use windows: 1h/5m for fast burn (14x), 6h/30m for slow burn (6x). Test the alert with injected errors at controlled rates.
- Error budget policy: Write a complete error budget policy document for a hypothetical e-commerce checkout service with 99.9% SLO. Cover: how budget is measured, what actions are taken at 75%, 50%, 25%, and 0% remaining, who is responsible for enforcement, and how exceptions (critical security patches) are handled.
- SLO dashboard: In Grafana, build an SLO dashboard for a service with availability and latency SLOs. The dashboard should show: current SLI value, SLO target line, error budget gauge (% remaining), error budget burn down over 30 days, current burn rate, and projected days until budget exhaustion at current rate.
References
- Beyer, Betsy et al. (eds.). Site Reliability Engineering. O'Reilly, 2016. Chapters 4, 6.
- Beyer, Betsy et al. (eds.). The Site Reliability Workbook. O'Reilly, 2018. Chapter 5 (Alerting on SLOs).
- Google SRE Workbook — Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/
- Sloth project: https://github.com/slok/sloth
- Nobl9 SLO resources: https://www.nobl9.com/resources/
- Fong-Jones, Liz. "SLO Fundamentals." Honeycomb blog, 2020.
- Majors, Charity. "The Error Budget." InfoQ presentation, 2019.