Section 28: Reliability Engineering — Overview
Section Purpose and Scope
This section covers Site Reliability Engineering as a systems engineering discipline: the quantitative framework of SLIs, SLOs, and error budgets; capacity planning; reliability design patterns; chaos engineering; disaster recovery; and the organizational practices (incident response, postmortems, blameless culture) that make reliability sustainable. The focus is on the engineering substrate — what makes systems actually reliable — not on organizational politics or team structure. It draws from Google SRE practices, Netflix chaos engineering, and the broader distributed systems reliability literature.
Prerequisites
- Section 15: Networking (network failure modes, TCP behavior under stress)
- Section 17: Distributed Systems (consensus, replication, partial failure, CAP)
- Section 18: Database Internals (replication lag, durability guarantees)
- Section 21: Cloud Infrastructure (multi-region architecture, AZ failure modes)
- Section 23: Observability (SLI measurement, alerting — reliability requires observability)
Learning Objectives
- Define SLI, SLO, and SLA with precise examples and explain error budget arithmetic.
- Design a multi-window/multi-burn-rate alerting strategy based on error budget depletion.
- Explain the circuit breaker, retry with exponential backoff, and bulkhead patterns and when each applies.
- Design a chaos engineering experiment with hypothesis, steady-state definition, and abort conditions.
- Set RTO and RPO targets for a given system and design an architecture to meet them.
- Lead a blameless postmortem and extract actionable items beyond the immediate fix.
- Build a capacity planning model that anticipates failure domains.
- Distinguish between toil (manual, repetitive, tactical work) and engineering work.
Architecture Overview
Reliability Engineering Framework:
┌──────────────────────────────────────────────────────────────────┐
│ SLO / Error Budget Lifecycle │
│ │
│ SLI: request_success_rate = good_requests / total_requests │
│ SLO: 99.9% over 30 days → Error Budget = 0.1% = 43.2 min/mo │
│ │
│ Budget Consumption: │
│ 100% ──────────────────────────────────── 0% │
│ ▲ ▼ incident ▼ budget exhausted │
│ │ replenish spend → feature freeze │
│ └──────────────────────────────────────── new month │
└──────────────────────────────────────────────────────────────────┘
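The budget arithmetic in the box above can be checked with a few lines of Python. This is a minimal sketch: the `Slo` class and `budget_remaining` helper are illustrative names, not part of any library. A 30-day window gives 43.2 minutes of budget at 99.9% (an average calendar month of about 30.44 days would give roughly 43.8 minutes).

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for an SLO target and its rolling window."""
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # e.g. 30

    @property
    def budget_fraction(self) -> float:
        # Share of the window (or of requests) that is allowed to be bad.
        return 1.0 - self.target

    @property
    def budget_minutes(self) -> float:
        # Budget expressed as minutes of full outage over the window.
        return self.budget_fraction * self.window_days * MINUTES_PER_DAY


def budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total == 0:
        return 1.0
    allowed_bad = slo.budget_fraction * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


slo = Slo(target=0.999, window_days=30)
print(round(slo.budget_minutes, 1))                                    # 43.2
print(round(budget_remaining(slo, good=999_500, total=1_000_000), 3))  # 0.5
```

With one million requests in the window, 99.9% allows 1,000 bad requests; 500 bad requests leave half the budget.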
Reliability Design Patterns:
Circuit Breaker:
┌─────────────┐ normal ┌─────────────┐
│ Caller │──────────►│ Service │
└─────────────┘ └──────┬──────┘
│ │ failures > threshold
│ open (fail fast) ▼
│◄─────────────────── ┌──────────┐
│ │ OPEN │
│ half-open (probe) │ state │
└────────────────────►└──────────┘
│ success
▼
┌──────────┐
│ CLOSED │◄── return to normal
└──────────┘
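The three states in the diagram can be sketched as a small state machine. This is a simplified, single-threaded illustration under stated assumptions: the class name, the consecutive-failure threshold, and the `reset_timeout` parameter are hypothetical choices, and production libraries usually track a failure rate over a sliding window and limit the number of concurrent half-open probes.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected unattempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before probing
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "HALF_OPEN"  # timeout elapsed: allow a probe call through
        return "OPEN"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "OPEN":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the circuit.
            if state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the failure count and closes the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` where `fetch_profile` is a placeholder, and treats `CircuitOpenError` as a signal to serve a fallback instead of waiting on a timeout.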
Retry with Exponential Backoff + Jitter:
┌───────────────────────────────────────────────────────────────┐
│ attempt 1: t=0 (immediate) │
│ attempt 2: t=1s + jitter (base=1s, jitter=random[0,1]s) │
│ attempt 3: t=3s + jitter (base=2s) │
│ attempt 4: t=7s + jitter (base=4s) │
│ attempt 5: t=15s + jitter (base=8s, capped at max) │
│ Jitter prevents thundering herd on recovery │
└───────────────────────────────────────────────────────────────┘
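The schedule above adds a small random offset to each doubled delay. The sketch below uses the "full jitter" variant instead, where each delay is drawn uniformly between zero and the capped exponential ceiling. Function names, defaults, and the choice of retryable exceptions are assumptions for illustration only.

```python
import random
import time
from typing import Callable, Iterator, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, cap: float = 8.0, retries: int = 4,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    for attempt in range(retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0.0, ceiling)


def retry(fn: Callable[[], T], *, base: float = 1.0, cap: float = 8.0,
          retries: int = 4,
          retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
          sleep: Callable[[float], None] = time.sleep,
          rng: Optional[random.Random] = None) -> T:
    """Call fn up to retries + 1 times, sleeping a jittered delay between attempts."""
    for delay in backoff_delays(base, cap, retries, rng):
        try:
            return fn()
        except retryable:
            sleep(delay)
    # Final attempt: any exception now propagates to the caller.
    return fn()
```

Only errors that are safe to retry should be listed in `retryable`; retrying non-idempotent operations or client errors adds load without improving the outcome.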
Multi-Region Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Region A (primary) Region B (secondary) │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ ┌────┐ ┌──────┐ │ │ ┌────┐ ┌──────┐ │ │
│ │ │ AZ1│ │ AZ2 │ │◄──────►│ │ AZ3│ │ AZ4 │ │ │
│ │ └────┘ └──────┘ │ async │ └────┘ └──────┘ │ │
│ │ DB primary │ repl │ DB replica │ │
│ └──────────────────┘ └──────────────────┘ │
│ │ │ │
│ ┌───────▼────────────────────────────▼─────────────────────┐ │
│ │ Global Load Balancer / Anycast DNS │ │
│ │ Failover: DNS TTL or BGP withdrawal │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
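For the DNS-based failover path in the diagram, a rough worst-case estimate of recovery time and data loss can be sketched as follows. The model is a deliberate simplification and every field name is hypothetical: it assumes detection, replica promotion, and DNS cache expiry happen one after another, and that data loss under asynchronous replication is bounded by the replication lag at the moment of failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPlan:
    """Hypothetical inputs for a DNS-based failover estimate (all in seconds)."""
    health_check_interval_s: float
    failures_to_trip: int          # consecutive failed checks before failover
    replica_promotion_s: float     # time to promote the DB replica to primary
    dns_ttl_s: float               # clients may cache the old record this long
    replication_lag_s: float       # async replication lag at time of failure

    def estimated_rto_s(self) -> float:
        # Worst case: the three phases run strictly in sequence.
        detection = self.health_check_interval_s * self.failures_to_trip
        return detection + self.replica_promotion_s + self.dns_ttl_s

    def estimated_rpo_s(self) -> float:
        # Writes not yet replicated when the primary fails are lost.
        return self.replication_lag_s

    def meets(self, rto_s: float, rpo_s: float) -> bool:
        return self.estimated_rto_s() <= rto_s and self.estimated_rpo_s() <= rpo_s


plan = FailoverPlan(health_check_interval_s=10, failures_to_trip=3,
                    replica_promotion_s=30, dns_ttl_s=60, replication_lag_s=5)
print(plan.estimated_rto_s())           # 120
print(plan.meets(rto_s=300, rpo_s=60))  # True
print(plan.meets(rto_s=60, rpo_s=0))    # False
```

The last line shows why tight targets change the architecture: a 60-second RTO cannot be met with a 60-second DNS TTL, and an RPO of zero cannot be met with asynchronous replication.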
Incident Response Lifecycle:
Detect → Triage → Diagnose → Mitigate → Resolve → Review
↑ │
└──── postmortem → action items → prevention ──────┘
Key Concepts
- SLI (Service Level Indicator): A quantitative measure of service behavior from the user perspective. Availability SLI, latency SLI (p99 < threshold), throughput SLI, durability SLI. Must be directly measurable.
- SLO (Service Level Objective): Target for an SLI over a window, for example "99.95% availability over 28 days." The SLO is the internal engineering target.
- SLA (Service Level Agreement): Contractual commitment to customers, usually with financial penalties for breach. Set the internal SLO slightly tighter than the customer-facing SLA to provide headroom: the SLA is the floor, not the target.
- Error Budget: The allowed amount of SLI violation within the SLO window. If SLO = 99.9%, the error budget is 0.1% of the window, or 43.2 minutes of full outage over 30 days. The budget quantifies risk: it states how much unreliability a deployment or experiment may spend.
- Burn Rate: How fast the error budget is being consumed relative to the SLO window. A burn rate of 1 spends the budget exactly by the end of the window. A burn rate of 14.4 sustained for 1 hour consumes 2% of a 30-day budget (14.4 × 1 h / 720 h), which warrants paging.
- Toil: Manual, repetitive, automatable, tactical work that scales linearly with service growth. SRE principle: keep toil below 50% of each engineer's time so that the rest goes to engineering work. Toil is not inherently bad but should trend down as systems mature.
- Circuit Breaker: Proxy that tracks failure rate to a downstream service. States: CLOSED (normal), OPEN (failing fast, not forwarding), HALF-OPEN (probing recovery). Prevents cascade failures from a failing dependency.
- Bulkhead: Isolate subsystems so failure in one does not exhaust resources in others. Thread pool isolation, connection pool isolation, rate limiting per tenant. Named after ship compartment design.
- Retry with Backoff and Jitter: Retrying failed requests with exponentially growing delays limits retry storms against a struggling dependency. Jitter (randomization) spreads retries out to avoid a thundering herd when many clients recover from a shared failure at once. Common variants are full jitter, equal jitter, and decorrelated jitter.
- Chaos Engineering: Deliberately injecting failures to verify resilience. Principles (Netflix): define steady state (SLI baseline), hypothesize it will persist during failure, run experiment in production at minimal blast radius, measure deviation. Tools: Chaos Monkey, Gremlin, LitmusChaos, AWS Fault Injection Simulator.
- RTO (Recovery Time Objective): Maximum acceptable time from failure to restored service. Drives architecture decisions: active-active vs active-passive, automated vs manual failover.
- RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time). RPO=0 requires synchronous replication. RPO=1h allows hourly snapshots. Drives replication strategy.
- Runbook: Documented procedure for handling a known failure mode or operational task. A runbook invoked from an alert should lead to resolution without requiring knowledge not in the runbook. Living documents — must stay current.
- Blameless Postmortem: Analysis after an incident focused on systemic causes rather than individual culpability. Output: timeline, contributing factors (5-whys or fishbone), action items with owners and dates. Psychological safety required for honest postmortems.
- Capacity Planning: Forecasting resource requirements given traffic growth and failure scenarios. Includes: N+2 redundancy (survive two simultaneous failures), seasonal capacity (Black Friday), growth trends, load testing.
- Disaster Recovery (DR): Procedures and infrastructure for recovering from catastrophic failures (data center loss, data corruption). DR tiers: cold standby (RTO hours), warm standby (RTO minutes), hot standby/active-active (RTO seconds).
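The burn-rate definition above extends naturally to multi-window alerting. The sketch below is a minimal illustration: the thresholds follow the example commonly cited from the Google SRE Workbook for a 30-day window (14.4 over 1 hour, 6 over 6 hours, 1 over 3 days, each confirmed by a shorter window), and the function names and window labels are assumptions, not an alerting system's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent at a given burn rate."""
    return rate * hours / (window_days * 24)


@dataclass(frozen=True)
class BurnRule:
    long_window: str    # catches sustained burn
    short_window: str   # confirms the burn is still happening
    threshold: float
    severity: str


RULES: Tuple[BurnRule, ...] = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def evaluate(error_ratios: Dict[str, float], slo_target: float,
             rules: Tuple[BurnRule, ...] = RULES) -> Optional[str]:
    """Return the severity of the first rule whose long and short windows both exceed
    its threshold. error_ratios must hold an entry for every window named in rules."""
    for rule in rules:
        long_rate = burn_rate(error_ratios[rule.long_window], slo_target)
        short_rate = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_rate >= rule.threshold and short_rate >= rule.threshold:
            return rule.severity
    return None
```

Requiring the short window to agree keeps an alert from firing long after a brief spike has ended, which is the main reason for pairing windows rather than alerting on the long window alone.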
Major Historical Milestones
| Year | Event |
|---|---|
| 2003 | Google creates the first SRE team (Ben Treynor Sloss) |
| 2007 | Amazon publishes Dynamo paper: reliability through eventual consistency |
| 2011 | Netflix publicly describes Chaos Monkey and the wider Simian Army (Latency Monkey, Chaos Gorilla); chaos engineering born |
| 2012 | Google introduces error budgets as a deployment gating mechanism |
| 2012 | Netflix open-sources Chaos Monkey |
| 2015 | AWS introduces the Well-Architected Framework |
| 2016 | Google releases SRE Book (free online) — SRE methodology documented |
| 2016 | Principles of Chaos Engineering published (Netflix) |
| 2018 | Google releases "The Site Reliability Workbook" |
| 2018 | Multi-window alerting (Google SRE Workbook ch. 5) becomes standard |
| 2019 | AWS re:Invent "Cell-based architecture" talk — blast radius containment |
| 2020 | LitmusChaos donated to CNCF — cloud-native chaos engineering |
| 2021 | AWS Well-Architected Framework reliability pillar formalized |
| 2022 | FinOps + reliability intersection: cost of over-provisioning vs downtime |
| 2023 | AI-assisted incident triage tools begin production deployment |
Modern Relevance
SRE has become the dominant operational model for cloud-native services. The error budget framework resolves the fundamental conflict between development velocity (taking risks) and operational stability (avoiding risks) by quantifying the acceptable risk allowance. Teams that internalize error budget thinking make better architectural decisions about reliability investment.
Chaos engineering has matured from a Netflix experiment to a standard practice. GameDays (structured chaos experiments with full team participation) are now a reliability engineering best practice. The tooling has expanded from server instance termination to network partitions, latency injection, CPU/memory pressure, and database failover simulation.
Multi-region active-active architecture is now achievable for stateless services but remains architecturally expensive for stateful ones. Because of the consistency vs availability tradeoff under network partition (CAP theorem), active-active databases either accept eventual consistency, which applications must be designed to handle, or pay cross-region coordination latency on writes. Globally distributed databases (Spanner, CockroachDB, YugabyteDB) take the second route and push the cost of strong consistency to the infrastructure layer.
File Map
28-reliability-engineering/
├── 00-overview.md ← this file
├── 01-sre-principles.md ← SRE origin, toil, error budgets, ops/eng balance
├── 02-sli-slo-sla.md ← definitions, SLI design, SLO target setting
├── 03-error-budgets.md ← budget arithmetic, burn rates, policy
├── 04-toil-reduction.md ← toil identification, automation, runbooks
├── 05-capacity-planning.md ← forecasting, load testing, N+2 design
├── 06-reliability-patterns.md ← circuit breaker, retry, bulkhead, timeout
├── 07-chaos-engineering.md ← principles, tools, GameDays, blast radius
├── 08-disaster-recovery.md ← RTO/RPO, cold/warm/hot standby, runbooks
├── 09-multi-region.md ← active-active, global load balancing, data replication
├── 10-data-durability.md ← replication, erasure coding, backup validation
├── 11-incident-response.md ← lifecycle, roles, communication, escalation
├── 12-postmortems.md ← blameless culture, template, action items
├── 13-runbooks.md ← runbook design, automation, maintenance
└── 14-blameless-culture.md ← psychological safety, just culture, learning
Cross-References
- Section 17 (Distributed Systems): Consensus, replication, CAP theorem — reliability in distributed context
- Section 21 (Cloud Infrastructure): Multi-AZ/multi-region design, cloud failure modes
- Section 22 (Kubernetes Internals): Pod disruption budgets, topology spread, health probes
- Section 23 (Observability): SLI measurement requires solid observability; error budget burn rate alerting
- Section 26 (Security): Security incidents follow the same incident response lifecycle
- Section 40 (Failure History): Historical failures provide the empirical basis for reliability patterns