07 — SLI, SLO, and SLA

Technical Overview

SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) form the framework that turns raw observability data into actionable reliability engineering. These three constructs (a metric, a target, and a contract) bridge technical system behavior and business commitments. Properly implemented, they replace noisy, ad hoc alerting with a disciplined signal: the error budget, which tells engineering teams when to invest in reliability and when it is safe to ship features.

This framework was systematized by Google's SRE organization and published in Site Reliability Engineering (2016). It has since become the dominant approach for managing reliability in cloud-native systems.

Prerequisites

Familiarity with Prometheus and PromQL
Understanding of percentile metrics and rate calculations
Experience with on-call and incident response
Basic probability and statistics literacy

Core Content

SLI: Service Level Indicator

An SLI is the measured metric that reflects user experience. It must be a quantifiable, observable property of the service from the user's perspective.

Good SLI properties:
- Directly correlated with user happiness
- Observable and measurable (not theoretical)
- Low noise (not affected by unrelated system behavior)
- Actionable (teams can influence it)

SLI categories (from Google SRE):

Category	Example SLI
Availability	Fraction of successful HTTP requests (non-5xx)
Latency	Fraction of requests completing in < 200ms
Quality	Fraction of search results including personalization
Coverage	Fraction of events processed within 5 minutes
Freshness	Fraction of data updated within 1 hour
Throughput	Fraction of time pipeline processes at target rate

Ratio-based SLIs are preferred over raw counts because they normalize for traffic volume:

Availability SLI = good_requests / total_requests

Where:
  good_requests = HTTP 2xx + 3xx + 4xx (user error, not service error)
  total_requests = all requests (including 5xx)

  Note: 4xx responses are typically "good" — the user made a bad request,
  the service correctly reported an error.
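As a minimal sketch of the ratio above, the following Python function computes the availability SLI from per-status-code request counts. The function name and the sample counts are illustrative assumptions, not part of any library.

```python
def availability_sli(status_counts: dict[int, int]) -> float:
    """Return good/total, counting every non-5xx response as good."""
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests in window; SLI is undefined")
    bad = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return (total - bad) / total


# Hypothetical counts for one measurement window.
counts = {200: 9_800, 301: 50, 404: 100, 500: 30, 503: 20}
print(f"{availability_sli(counts):.4f}")  # 0.9950
```

The 404s count as good here, matching the note above; only the 50 responses in the 5xx range reduce the SLI.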

PromQL for availability SLI:

# 5-minute availability SLI (fraction of good requests)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests completing in <200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
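The latency query works because Prometheus histogram buckets are cumulative: the bucket with le="0.2" counts every request that took 200 ms or less. A small Python sketch of the same arithmetic, assuming hypothetical bucket counts and a helper name chosen for illustration:

```python
import math


def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests at or below threshold_s, from cumulative bucket counts."""
    if threshold_s not in cumulative_buckets:
        # The threshold must be an existing bucket boundary.
        raise KeyError(f"no bucket boundary at {threshold_s}s")
    total = cumulative_buckets[math.inf]  # the +Inf bucket counts every request
    if total == 0:
        raise ValueError("no requests in window; SLI is undefined")
    return cumulative_buckets[threshold_s] / total


# Hypothetical cumulative counts, keyed by upper bound in seconds.
buckets = {0.05: 600, 0.1: 850, 0.2: 970, 0.5: 995, math.inf: 1000}
print(latency_sli(buckets, 0.2))  # 0.97
```

The KeyError branch reflects a practical constraint: a latency SLI at 200 ms is only exact if the histogram was instrumented with a bucket boundary at 0.2.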

Anti-patterns for SLIs:
- Internal metrics (CPU utilization, memory usage): not directly correlated with user experience
- Metrics that can be gamed (error rate that excludes timeouts)
- Metrics that are technically accurate but don't reflect reality (uptime without health checks)

SLO: Service Level Objective

An SLO is a target value for an SLI, measured over a time window. It represents the reliability level the service commits to internally.

SLO = SLI_metric compared_to SLI_threshold over time_window

Examples:
  99.9% of requests succeed (availability SLI >= 99.9%) over rolling 30 days
  99% of requests complete in < 200ms (latency SLI >= 99%) over rolling 30 days
  99.95% of messages processed within 5 minutes over calendar month

Choosing the right SLO:
- Start with measurement: what is your current actual availability? Setting an SLO above your current performance means you're already violating it.
- Understand user expectations: a batch analytics system may need only 99% availability, while a payments API may need 99.99%.
- Consider cost: 99.9% allows ~43 minutes of downtime per month; 99.99% allows ~4 minutes. As a rule of thumb, each additional nine requires roughly 10x more reliability engineering investment.

SLO targets and permitted downtime:

99%    → 7.3 hours/month (438 minutes/month)
99.5%  → 3.6 hours/month
99.9%  → 43.8 minutes/month
99.95% → 21.9 minutes/month
99.99% → 4.4 minutes/month
99.999%→ 26 seconds/month
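The 43.8-minute figure for 99.9% corresponds to an average month of 730 hours; a strict 30-day window gives slightly smaller values. A short sketch, with hypothetical constant and function names, that computes permitted downtime for both window lengths:

```python
HOURS_PER_AVERAGE_MONTH = 730  # 365 days * 24 h / 12 months
HOURS_PER_30_DAYS = 720


def allowed_downtime_minutes(slo_percent: float, window_hours: float) -> float:
    """Minutes of full downtime permitted by an SLO over the given window."""
    return (100 - slo_percent) / 100 * window_hours * 60


for slo in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    avg = allowed_downtime_minutes(slo, HOURS_PER_AVERAGE_MONTH)
    d30 = allowed_downtime_minutes(slo, HOURS_PER_30_DAYS)
    print(f"{slo}%: {avg:.2f} min (average month), {d30:.2f} min (30 days)")
```

Whichever window is used, it should match the window the SLO itself is defined over, so that the downtime figure and the error budget agree.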

SLA: Service Level Agreement

An SLA is an externally facing, contractual commitment. It includes:
- The SLO (metric and target)
- Measurement methodology (how is compliance calculated?)
- Consequences for violation (service credits, refunds, termination rights)
- Exclusions (scheduled maintenance, force majeure, customer-caused outages)

The SLA target should be looser than the internal SLO: if your internal SLO is 99.9%, your SLA might commit to 99.5%. The buffer gives the team time to detect and remediate SLO violations before they become SLA violations.

SLA design considerations:
- Exclude maintenance windows (but publish them in advance)
- Define the measurement agent (who calculates compliance? The customer's measurement can differ from the provider's)
- Specify credit calculation (e.g., 10% of monthly bill for each 0.1% below SLA)
- Many cloud providers calculate SLAs per calendar month, not over a rolling window
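The credit rule in the example above ("10% of monthly bill for each 0.1% below SLA") leaves details open that a contract has to pin down. The sketch below makes two assumptions that are not in the text: every started 0.1% step counts in full, and the credit is capped at 100% of the bill. Availability is passed in hundredths of a percent (99.5% = 9950) so the step arithmetic stays in integers.

```python
def service_credit(measured_bp: int, sla_bp: int, monthly_bill: float) -> float:
    """Credit owed, with availability in hundredths of a percent (99.5% = 9950)."""
    shortfall_bp = max(0, sla_bp - measured_bp)
    steps = -(-shortfall_bp // 10)     # ceiling division: each started 0.1% step counts
    credit_pct = min(steps * 10, 100)  # 10% per step, capped at the full bill (assumption)
    return monthly_bill * credit_pct / 100


# Measured 99.20% against a 99.50% SLA on a 2000.00 monthly bill.
print(service_credit(9920, 9950, 2000.0))  # 600.0
```

Integer units avoid floating-point surprises such as (99.5 - 99.2) / 0.1 evaluating to slightly less than 3, which would change the number of credited steps.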

Error Budget

The error budget is the quantification of acceptable unreliability:

Error Budget = 1 - SLO Target

For 99.9% SLO:
  Error budget = 0.1% of requests may fail
  Over 30 days at 100 req/s = 259,200,000 total requests
  Budget = 259,200 failed requests
  Or equivalently: 43.2 minutes of complete downtime (0.1% of 30 days)

Budget consumed (fraction) = (1 - current_SLI) / (1 - SLO_target)
  = bad_requests_in_window / (total_requests_in_window * (1 - SLO_target))
Error Budget Remaining = (1 - budget consumed) * 100%
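A worked version of the error budget arithmetic, as a sketch: the fraction of budget consumed is the observed error ratio divided by the allowed error ratio, and the remainder is one minus that. The dataclass, the function name, and the 64,800 failed requests are illustrative assumptions; the traffic volume matches the 100 req/s example.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_bad: float  # failed requests the SLO tolerates in the window
    consumed: float     # fraction of the budget used; 1.0 means exhausted
    remaining: float    # fraction of the budget left; negative when overspent


def error_budget_status(total: int, bad: int, slo_target: float) -> BudgetStatus:
    allowed_bad = (1 - slo_target) * total
    consumed = bad / allowed_bad
    return BudgetStatus(allowed_bad, consumed, 1 - consumed)


# 30 days at 100 req/s with 64,800 failed requests (hypothetical).
total_requests = 100 * 86_400 * 30
status = error_budget_status(total_requests, bad=64_800, slo_target=0.999)
print(f"allowed: {status.allowed_bad:,.0f}  consumed: {status.consumed:.0%}  "
      f"remaining: {status.remaining:.0%}")
```

With these inputs the service has used a quarter of its budget and has three quarters left.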

PromQL error budget consumption (recording rules):

# Error ratio over 30 days, per service
# (assumes http_requests_total carries a `service` label)
- record: slo:error_ratio:30d
  expr: |
    1 - (
      sum by (service) (increase(http_requests_total{status!~"5.."}[30d]))
      /
      sum by (service) (increase(http_requests_total[30d]))
    )

# Error budget consumed (fraction of budget used; 1.0 = exhausted)
- record: slo:error_budget_consumed:30d
  expr: |
    slo:error_ratio:30d / (1 - 0.999)  # 0.999 = 99.9% SLO target

Error Budget Policy

The error budget policy defines what happens when the error budget is exhausted or nearly exhausted:

Example policy:
- >50% budget remaining: normal development cadence. Features ship.
- 25-50% budget remaining: increased reliability review for new deployments. Load testing required.
- 0-25% budget remaining: all non-critical feature work pauses. Every deployment requires reliability review and phased rollout.
- Budget exhausted (0%): feature freeze. Engineering focus shifts entirely to reliability improvements. No feature deployments until the budget recovers; reliability fixes still ship.
- Budget exhausted for 2+ consecutive windows: escalation to engineering leadership; reliability postmortem required; public SLO review.

This policy transforms reliability from an engineering concern into a shared engineering + product concern. Product managers have incentive to support reliability work because it restores their ability to ship features.
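The example policy can be expressed as a simple lookup, which is one way to surface the current tier on a dashboard or in a deployment pipeline. This is only a sketch: the function name is hypothetical, the boundary values 25% and 50% are assigned to the "increased review" tier by assumption (the policy text does not say), and the escalation for consecutive exhausted windows is not modeled.

```python
def policy_action(remaining_pct: float) -> str:
    """Map remaining error budget (percent) to the example policy tier."""
    if remaining_pct <= 0:
        return "feature freeze"
    if remaining_pct < 25:
        return "pause non-critical feature work"
    if remaining_pct <= 50:  # boundary handling is an assumption
        return "increased reliability review"
    return "normal development cadence"


for pct in (80, 40, 10, 0):
    print(f"{pct:>3}% remaining -> {policy_action(pct)}")
```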

Burn Rate Alerting

Alerting directly on SLI metrics ("error rate > 1%") has poor signal-to-noise characteristics: brief spikes may not matter for the monthly SLO; slow, sustained degradation may exhaust the budget before anyone notices.

Burn rate solves this: how fast is the error budget being consumed relative to the budget allocation rate?

Burn Rate = current_error_rate / (1 - SLO_target)

For 99.9% SLO (error budget = 0.1%):
  Burn rate 1x = consuming budget at exactly the rate it replenishes
    (error rate = 0.1% exactly)
  Burn rate 14x = budget exhausted in 30/14 = 2.1 days
    (error rate = 1.4%)
  Burn rate 1000x = budget exhausted in 43 minutes
    (error rate = 100%)
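The burn rate examples above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes a 30-day (720-hour) window and a constant error rate.

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate as a multiple of the rate the SLO allows."""
    return error_rate / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = 720) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return math.inf
    return window_hours / rate


for error_rate in (0.001, 0.014, 1.0):
    rate = burn_rate(error_rate, slo_target=0.999)
    print(f"error rate {error_rate:.1%}: burn {rate:.0f}x, "
          f"budget lasts {hours_to_exhaustion(rate):.2f} h")
```

At 14x the budget lasts about 51 hours (2.1 days); at 1000x it lasts 0.72 hours, or roughly 43 minutes.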

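Google SRE multi-window, multi-burn-rate alerts (MWMBAs):

Each tier pairs a long window with a short window at the same burn-rate threshold. The long window shows that a meaningful share of the budget has been burned, which filters out brief spikes; the short window confirms that the burn is still happening, so the alert clears soon after the service recovers: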

ALERT TIERS:

Tier 1 (page immediately):
  Fast burn: 14x burn rate
  Windows: 1h AND 5m (both must be true)
  Meaning: at this rate the budget is exhausted in about 2 days; the 5m window lets the alert clear within minutes of recovery

Tier 2 (page):
  Slow burn: 6x burn rate
  Windows: 6h AND 30m (both must be true)
  Meaning: at this rate the budget is exhausted in 5 days; the 30m window lets the alert clear within about half an hour of recovery

Tier 3 (ticket):
  Very slow burn: 3x burn rate
  Windows: 3d AND 6h (both must be true)
  Meaning: budget significantly consumed but not immediately critical

Tier 4 (report):
  Any burn: 1x burn rate
  Windows: 30d
  Meaning: SLO target is being missed
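The tier logic above reduces to "every window in the tier must exceed the tier's threshold". The sketch below evaluates that rule against a dictionary of burn rates per window; the data structure, names, and sample burn rates are assumptions made for illustration, with thresholds and windows taken from the tiers as listed.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    threshold: float
    windows: tuple[str, ...]
    action: str


TIERS = [
    Tier("fast burn", 14, ("1h", "5m"), "page immediately"),
    Tier("slow burn", 6, ("6h", "30m"), "page"),
    Tier("very slow burn", 3, ("3d", "6h"), "ticket"),
    Tier("any burn", 1, ("30d",), "report"),
]


def firing_tiers(burn_rates: dict[str, float]) -> list[str]:
    """Names of tiers whose windows all exceed the tier's burn-rate threshold."""
    return [t.name for t in TIERS
            if all(burn_rates[w] > t.threshold for w in t.windows)]


# A brief spike: the 5m window is hot, but the 1h window is not.
spike = {"5m": 20, "30m": 8, "1h": 3, "6h": 0.6, "3d": 0.2, "30d": 0.1}
# A sustained incident: short and long windows agree.
sustained = {"5m": 16, "30m": 16, "1h": 15, "6h": 7, "3d": 1.5, "30d": 0.4}

print(firing_tiers(spike))      # []
print(firing_tiers(sustained))  # ['fast burn', 'slow burn']
```

The spike does not page because its long windows stay below threshold, which is the false-positive suppression the multi-window design is meant to provide.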

PromQL implementation (for 99.9% availability SLO):

# 1-hour burn rate, per service
# (assumes http_requests_total carries a `service` label)
- record: slo:burnrate1h
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum by (service) (rate(http_requests_total[1h]))
    /
    (1 - 0.999)

# 5-minute burn rate, per service
- record: slo:burnrate5m
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum by (service) (rate(http_requests_total[5m]))
    /
    (1 - 0.999)

# Tier 1 page alert (fast burn)
- alert: SLOFastBurn
  expr: |
    slo:burnrate1h > 14
    and
    slo:burnrate5m > 14
  for: 2m
  labels:
    severity: page
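  annotations:
    summary: "Fast error budget burn on {{ $labels.service }}"
    # Prometheus templates have no arithmetic operators, so the time to
    # exhaustion is described rather than computed here.
    description: "1h burn rate is {{ $value | humanize }}x. At a constant burn rate of N, a full 30-day budget lasts 720/N hours."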

SLO-Based vs Threshold-Based Alerting

Threshold alerting (traditional): "alert if error rate > 1% for 5 minutes." Simple, but has three problems:
1. False positives: a 6-minute spike at 5% error rate fires the alert but consumes under 1% of a 99.9% monthly budget
2. False negatives: a sustained 0.5% error rate (below the threshold) is a 5x burn against a 99.9% SLO and exhausts the 30-day budget in 6 days
3. No sense of urgency: you don't know if you have 20 minutes of budget left or 20 days

SLO-based alerting (burn rate): alerts directly on budget consumption rate. Better precision: pages only when it matters (budget is at risk), with urgency proportional to burn rate.
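To make the comparison concrete, the sketch below computes how much of a 30-day budget an incident consumes, given its error rate and duration. It assumes uniform traffic over the window and a 99.9% SLO; the function name and the two scenarios are illustrative.

```python
def budget_fraction_consumed(
    error_rate: float,
    duration_minutes: float,
    slo_target: float,
    window_minutes: float = 30 * 24 * 60,
) -> float:
    """Share of the window's error budget used by one incident (uniform traffic assumed)."""
    return (error_rate * duration_minutes) / ((1 - slo_target) * window_minutes)


# A 5% spike lasting 6 minutes: long enough to trip "error rate > 1% for 5 minutes".
spike = budget_fraction_consumed(0.05, 6, slo_target=0.999)
# A steady 0.5% error rate for 6 days: never crosses the 1% threshold.
slow = budget_fraction_consumed(0.005, 6 * 24 * 60, slo_target=0.999)

print(f"spike: {spike:.2%} of budget")
print(f"slow burn: {slow:.0%} of budget")
```

The spike that pages someone costs well under 1% of the budget, while the degradation that never alerts spends all of it.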

SLO Coverage

Not everything should have an SLO. Criteria for deciding what gets an SLO:
- SLO candidates: customer-facing services, services with SLA commitments, internal services depended upon by customer-facing services
- Not SLO candidates: internal tools, development services, services with no defined reliability requirements

Too many SLOs create alert fatigue and dilute attention. Google SRE recommends starting with one SLO per customer journey (not per service), then expanding.

Measuring SLOs from Multiple Sources

SLIs should be measured where the user experiences them, not deep inside the system. Options:

Server-side request metrics (Prometheus instrumentation): measures what the server sees. Misses failures before the request arrives (network failures, load balancer failures).
Load balancer / API gateway metrics: captures more of the user experience. Many failures are caught here before reaching backend services.
Synthetic monitoring (black-box probing, Prometheus Blackbox Exporter, Datadog Synthetics): artificial requests from external locations. A good proxy for real user experience when real-user data is not available.
Real User Monitoring (RUM): JavaScript in the browser captures actual user-experienced latency, including DNS resolution, TLS handshake, rendering. Most accurate but requires frontend instrumentation.

Production recommendation: use at least two sources and compare. Discrepancy between load balancer SLI and server-side SLI indicates infrastructure-level failures. Discrepancy between synthetic monitoring and load balancer SLI indicates geographic or network issues.

Historical Context

The SLI/SLO/SLA framework as a discipline was codified in Google's SRE book (2016), though Google had been practicing it internally since the mid-2000s. The innovation was not inventing the concepts (SLAs have existed in IT contracts for decades) but rather operationalizing them as an engineering tool: error budgets, burn rate alerting, and the policy framework that connects reliability metrics to product decisions.

Before SRE, reliability was typically managed by operations teams using threshold-based alerting: if a metric crosses a value, page someone. This worked for simple systems but produced alert fatigue (too many pages) in complex distributed systems. The SLO framework shifted from "is this metric bad right now?" to "are we on track to meet our reliability commitment over the next 30 days?"

The multi-window multi-burn-rate alert approach was published in the Google SRE Workbook (2018) as the solution to false-positive alert fatigue from single-window burn rate alerts.

Production Examples

# Check current error budget consumption
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=slo:error_budget_consumed:30d' | \
  jq '.data.result[] | {service: .metric.service, consumed: (((.value[1] | tonumber) * 100 | tostring) + "%")}'

# Find services currently burning their budget faster than 14x
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=slo:burnrate1h > 14' | \
  jq '.data.result[] | {service: .metric.service, burnrate: .value[1]}'

# List all active SLO alerts
curl -s http://alertmanager:9093/api/v2/alerts | \
  jq '.[] | select(.labels.alertname | startswith("SLO")) | {alert: .labels.alertname, service: .labels.service, status: .status.state}'

Grafana SLO dashboard query (error budget gauge):

# Remaining error budget as percentage
(1 - slo:error_budget_consumed:30d) * 100

# Days of error budget remaining at current burn rate
(1 - slo:error_budget_consumed:30d) / slo:burnrate1h * 720 / 24
# 720 = hours in 30 days

Debugging Notes

SLO violated but no active alerts: The SLO is measured over 30 days, and burn rate alerts only fire above their thresholds, so gradual consumption (for example, a steady 2x burn) can exhaust the budget without paging anyone. Check the 30-day error budget consumption metric directly. With a calendar-month window, having less than 50% of the budget left with 15 days to go is already a concern, even without an alert firing.

Burn rate alert fires briefly, then resolves: Common cause is a single-window burn rate spike (a brief burst of errors). This is why MWMBAs require both a short window AND a long window to agree — reduces false positives from transient spikes.

SLI shows 0% error rate but users are complaining: SLI is measured server-side; errors happening at the CDN, load balancer, or in DNS are not counted. Check synthetic monitoring for a complete picture. Also check for silent failures (requests that succeed with wrong responses — business logic errors not reflected in HTTP status codes).

SLO calculation different from SLA calculation: Your SLO may use a rolling 30-day window; your SLA uses a calendar month. At the beginning of a new month, error budget resets even if last month's budget was exhausted. Confirm which window your SLA uses and implement the SLO calculation accordingly.
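The difference between the two windows is easy to see with daily counts. The sketch below uses made-up data (one million requests per day, with a single bad day on 25 January) and hypothetical helper names to compute the SLI over a rolling 30-day window and over the calendar month to date.

```python
from datetime import date, timedelta

DailyCounts = dict[date, tuple[int, int]]  # day -> (good_requests, total_requests)


def window_sli(daily: DailyCounts, start: date, end: date) -> float:
    """SLI over the inclusive date range [start, end]."""
    good = sum(g for d, (g, _) in daily.items() if start <= d <= end)
    total = sum(t for d, (_, t) in daily.items() if start <= d <= end)
    return good / total


def rolling_sli(daily: DailyCounts, today: date, days: int = 30) -> float:
    return window_sli(daily, today - timedelta(days=days - 1), today)


def calendar_month_sli(daily: DailyCounts, today: date) -> float:
    return window_sli(daily, today.replace(day=1), today)


# 1 January to 10 February 2024, all requests good except on 25 January.
daily = {date(2024, 1, 1) + timedelta(days=i): (1_000_000, 1_000_000) for i in range(41)}
daily[date(2024, 1, 25)] = (950_000, 1_000_000)
today = date(2024, 2, 10)

print(f"rolling 30d:    {rolling_sli(daily, today):.4%}")
print(f"calendar month: {calendar_month_sli(daily, today):.4%}")
```

On 10 February the rolling window still contains the January incident and sits below 99.9%, while the calendar-month figure has reset to 100%.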

Security Implications

SLO dashboards and error budget data may reveal business-sensitive information: which services are degraded, during which time periods (potentially correlating with business events), and by what magnitude. Apply access controls.
Error budget policies that freeze deployments can create urgency to bypass security reviews ("we're on budget freeze but this security patch is critical"). Define a security exception process in the error budget policy.
SLA contractual terms are legal documents. Ensure that SLO calculation methodology cannot be disputed by a customer — use a third-party status page or agree on measurement source.

Performance Implications

30-day window PromQL queries are expensive in Prometheus. Pre-compute them as recording rules (for example, evaluated every 30 seconds) rather than running them on demand from dashboards. An increase(counter[30d]) query without a recording rule can time out on large Prometheus instances.
High-resolution SLO metrics (per-second granularity over 30 days) require substantial storage. Use step intervals of 1 minute minimum for 30-day SLO calculations.

Failure Modes and Real Incidents

SLO gaming (incident at anonymous e-commerce company): A team facing budget exhaustion started routing known-bad requests (that would result in 500s) to a different error category that was excluded from the SLI. The error budget showed green while users experienced failures. Fix: include all request paths in the SLI; use synthetic monitoring as an independent check; audit SLI exclusions in SLO reviews.

Error budget exhausted before product noticed (Google internal, SRE book): A storage service exhausted its 30-day error budget in 18 days due to a series of small incidents. Because the error budget policy was not enforced, features continued to ship, causing more instability. The policy must have executive enforcement to be effective.

Alert storm from burn rate alert misconfiguration: A team set burn rate alert threshold at 2x (very low) with a 5-minute window only (no long window). Any brief traffic spike that happened to skew error rates triggered an immediate page. On-call engineers were paged 40+ times per day for non-incidents. Fix: implement two-window approach; set tier-1 threshold at 14x minimum.

Modern Usage

Nobl9, Chronosphere: SLO management platforms that automate SLO creation, tracking, and alerting configuration on top of Prometheus/Datadog/New Relic.
Sloth (open source): generates Prometheus recording rules and alerting rules from a simple SLO definition YAML:

version: "prometheus/v1"
service: "payment-api"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: PaymentAPIAvailability
      page_alert: {labels: {severity: critical}}
      ticket_alert: {labels: {severity: warning}}

Future Directions

Multi-dimensional SLOs: SLOs that vary by customer tier (enterprise customers get 99.99%, standard get 99.9%), region, or time of day. Standard Prometheus cannot natively express these without significant relabeling.
SLO-aware deployment systems: CI/CD pipelines that check error budget consumption before proceeding with deployments (automatic canary gate: if burn rate > 3x, pause rollout).
Customer-facing SLO dashboards: public status pages that expose the actual measured SLI in real time, not just incident history. Builds trust; forces discipline.

Exercises

SLI selection: For a ride-sharing mobile app's backend, define at least 5 SLIs covering different aspects of the user experience. For each, write the ratio formula and the PromQL expression that would compute it. Justify why each reflects user happiness.
Error budget calculation: A service has a 99.9% availability SLO. Over the last 30 days: total requests = 100M, 5xx responses = 90,000. (a) Calculate the error budget as a fraction and as minutes of downtime. (b) Calculate how much budget was consumed. (c) Is the team in the "feature freeze" zone?
Multi-window burn rate implementation: Implement the full four-tier MWMBA alerting system for a service with a 99.9% availability SLO in Prometheus recording rules and alerting rules YAML. Use windows: 1h/5m for fast burn (14x), 6h/30m for slow burn (6x), 3d/6h for very slow burn (3x), and 30d for the 1x report tier. Test the alerts with injected errors at controlled rates.
Error budget policy: Write a complete error budget policy document for a hypothetical e-commerce checkout service with 99.9% SLO. Cover: how budget is measured, what actions are taken at 75%, 50%, 25%, and 0% remaining, who is responsible for enforcement, and how exceptions (critical security patches) are handled.
SLO dashboard: In Grafana, build an SLO dashboard for a service with availability and latency SLOs. The dashboard should show: current SLI value, SLO target line, error budget gauge (% remaining), error budget burn down over 30 days, current burn rate, and projected days until budget exhaustion at current rate.

References

Beyer, Betsy et al. Site Reliability Engineering. O'Reilly, 2016. Chapters 4, 6.
Beyer, Betsy et al. The Site Reliability Workbook. O'Reilly, 2018. Chapter 5 (Alerting on SLOs).
Google SRE Workbook — Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/
Sloth project: https://github.com/slok/sloth
Nobl9 SLO resources: https://www.nobl9.com/resources/
Fong-Jones, Liz. "SLO Fundamentals." Honeycomb blog, 2020.
Majors, Charity. "The Error Budget." InfoQ presentation, 2019.