Reliability Patterns: Building Systems That Fail Gracefully

Overview

Reliability patterns are architectural and implementation techniques that improve a system's ability to tolerate failure, degrade gracefully, and recover quickly. These patterns recognize a fundamental truth about distributed systems: failures are not exceptions; they are the expected operating condition. Network calls fail, services become slow, disks fill, dependencies misbehave. The question is not whether your system will encounter these conditions but what it does when they occur.

This file covers the core reliability patterns used in production systems: circuit breakers, retry with jitter, bulkheads, timeouts, health checks, graceful degradation, load shedding, backpressure, and rate limiting. Each pattern addresses a specific failure mode. Combining them correctly builds a system that can absorb significant component failure without losing service.

Prerequisites

Understanding of distributed systems and network communication
Familiarity with HTTP clients, connection pools, and thread pools
Basic understanding of latency, throughput, and queuing theory
Experience with at least one production service that calls external dependencies

Circuit Breaker Pattern

The circuit breaker is the most important reliability pattern for services with dependencies. Named after the electrical circuit breaker that prevents overcurrent from damaging equipment, the software circuit breaker prevents a failing dependency from taking down the entire calling service.

The fundamental problem without circuit breakers: if Service A calls Service B, and Service B starts timing out after 10 seconds, every thread in Service A that calls B blocks for 10 seconds waiting for a response. If A has 100 concurrent threads and receives 200 requests/second, all of which call B, every thread is blocked within about half a second (100 threads ÷ 200 requests/second). Service A is now unable to respond to any requests: it has been brought down by B's failure, even though A's own code is fine.
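The arithmetic behind this scenario can be sketched as a rough estimate. The function name and the model are assumptions made for illustration: every request holds one thread for the full timeout, and arrivals are evenly spaced.

```python
def seconds_until_exhausted(threads, arrivals_per_sec, hold_sec):
    """Estimate when a thread pool fills if each request holds a thread for hold_sec.

    Returns None when steady-state concurrency (arrival rate x hold time,
    per Little's law) stays below the pool size, so the pool never fills.
    """
    if arrivals_per_sec * hold_sec < threads:
        return None
    # Threads are claimed faster than they are released, so the pool
    # fills after threads / arrival-rate seconds.
    return threads / arrivals_per_sec


# The example from the text: 100 threads, 200 requests/second, 10-second timeout.
print(seconds_until_exhausted(100, 200, 10))
```

With a healthy 100 ms response time the same pool never fills under this model, because only about 20 threads are busy at once.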

State Machine

                    +----------+
                    |  CLOSED  |  <-- Normal operation
                    |          |      Calls pass through
                    +----------+
                    /     |     \
              success  failure  threshold
                /        |      reached
               /    (count)      \
              /                   v
        stays               +----------+
        CLOSED              |   OPEN   |  <-- Failure threshold exceeded
                            |          |      All calls fail immediately
                            +----------+
                                  |
                            timeout elapsed
                            (e.g., 30 sec)
                                  |
                                  v
                            +-----------+
                            | HALF-OPEN |  <-- Test recovery
                            |           |      Allow N probe requests
                            +-----------+
                            /           \
                      probe           probe
                      success         fails
                         /                \
                        v                  v
                   +----------+       +----------+
                   |  CLOSED  |       |   OPEN   |
                   | (reset)  |       | (reopen) |
                   +----------+       +----------+

State transitions:
CLOSED → OPEN:  failure count exceeds threshold in time window
                (e.g., 5 failures in 10 seconds, or 50% error rate)
OPEN → HALF-OPEN: after timeout period (e.g., 30 seconds)
HALF-OPEN → CLOSED: probe requests succeed
HALF-OPEN → OPEN: probe request fails

When the circuit is OPEN, calls fail immediately without attempting the network call. This prevents:
- Thread exhaustion in the caller (no blocking on timeouts)
- Overloading the failing dependency with retries during recovery
- Cascade failures spreading to upstream services
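The state machine above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class and parameter names are hypothetical, it counts consecutive failures rather than failures within a time window, it allows a single probe in HALF-OPEN, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_timeout:
                # Fail fast: the dependency is not called at all.
                raise CircuitOpenError("circuit is open")
            self.state = "HALF_OPEN"  # timeout elapsed: let a probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success in CLOSED, or a successful probe in HALF_OPEN, resets the breaker.
        self.state = "CLOSED"
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Callers would typically catch CircuitOpenError and serve a fallback rather than let it propagate to the user.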

Implementation Considerations

# Conceptual circuit breaker configuration
CircuitBreaker(
    failure_threshold=5,        # open after 5 failures
    failure_window=10,          # ... in 10 seconds
    success_threshold=2,        # close after 2 consecutive successes
    timeout=30,                 # stay open for 30 seconds before half-open
    half_open_max_calls=3,      # allow 3 probe calls in half-open
    excluded_exceptions=[       # don't count these as failures
        NotFoundException,      # 404 is not a service failure
        AuthException,          # 401 is not a service failure
    ]
)

Libraries: Netflix Hystrix (Java, now in maintenance), Resilience4j (Java), Polly (.NET), circuitbreaker (Python), go-circuit (Go).

Retry with Exponential Backoff and Jitter

Retrying failed requests is essential for handling transient failures — brief network glitches, momentary service restarts, transient overload. But naive retries are dangerous.

The thundering herd problem:

T=0: 1000 clients all make request to Service B
T=1: Service B briefly overloaded, all 1000 requests fail
T=1: All 1000 clients immediately retry simultaneously
T=1: Service B receives 1000 concurrent retries → overloaded again
T=2: All 1000 retries fail → all clients retry again
     → perpetual storm preventing recovery

Exponential backoff:

Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds
Attempt 5: wait 16 seconds (cap at max, e.g., 32 seconds)

Formula: min(cap, base × 2^attempt), with attempt counted from 0 (so "Attempt 1" above is attempt = 0) and base = 1 second

Exponential backoff reduces retry frequency over time but does NOT solve the thundering herd: clients that failed together still retry together, on the same exponential schedule.

Jitter breaks synchrony:

Full jitter:  sleep = random(0, min(cap, base × 2^attempt))
Equal jitter: temp = min(cap, base × 2^attempt); sleep = temp/2 + random(0, temp/2)
Decorrelated: sleep = min(cap, random(base, prev_sleep × 3))

AWS's analysis ("Exponential Backoff And Jitter", Brooker 2015)
found that adding jitter substantially reduces both client work and
completion time, with full jitter among the best-performing variants.

Jitter distributes retries randomly across time, so a recovering service sees a trickle of retries rather than a synchronized wave.
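The jitter variants can be written as small Python functions. This is a sketch under stated assumptions: the function names and default values are illustrative, attempt counts from 0, and equal jitter follows the form in the Brooker post (half the backoff ceiling plus a random amount up to the other half).

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=32.0):
    """Un-jittered exponential backoff: min(cap, base * 2^attempt)."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, base=1.0, cap=32.0, rng=random):
    # Anywhere between zero and the exponential ceiling.
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))


def equal_jitter(attempt, base=1.0, cap=32.0, rng=random):
    # Keep half of the ceiling as a guaranteed wait, randomize the rest.
    half = backoff_ceiling(attempt, base, cap) / 2
    return half + rng.uniform(0.0, half)


def decorrelated_jitter(prev_sleep, base=1.0, cap=32.0, rng=random):
    # Depends on the previous sleep rather than on the attempt number.
    return min(cap, rng.uniform(base, prev_sleep * 3))
```

Passing a seeded random.Random instance as rng makes the delays reproducible in tests; the default uses the shared module-level generator.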

Retry policy design:
- Always set a maximum retry count (typically 3-5)
- Only retry on retryable errors (network errors, 429, 503; NOT 400, 401, 404)
- Respect Retry-After headers from servers
- Add a retry budget: if >50% of requests are retries, stop retrying (the service is in trouble, and more retries make it worse)
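A retry loop that follows most of these rules might look like the sketch below. HttpError, is_retryable, and call_with_retries are hypothetical names introduced for illustration, the delay uses full jitter, and the retry budget is left out to keep the example short.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}


class HttpError(Exception):
    """Hypothetical error type carrying a status code and optional Retry-After seconds."""

    def __init__(self, status, retry_after=None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def is_retryable(err):
    if isinstance(err, HttpError):
        return err.status in RETRYABLE_STATUS
    # Treat connection-level failures as transient.
    return isinstance(err, (ConnectionError, TimeoutError))


def call_with_retries(fn, max_attempts=3, base=0.1, cap=2.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:
            # Give up on non-retryable errors and on the final attempt.
            if not is_retryable(err) or attempt == max_attempts - 1:
                raise
            delay = random.uniform(0.0, min(cap, base * 2 ** attempt))
            retry_after = getattr(err, "retry_after", None)
            if retry_after is not None:
                # Never retry sooner than the server asked.
                delay = max(delay, retry_after)
            sleep(delay)
```

The sleep parameter is injectable so that tests can record delays instead of waiting.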

Bulkhead Pattern

The bulkhead pattern (named after the watertight compartments in ship hulls) isolates failures by partitioning resources so that failure in one partition cannot exhaust resources for others.

WITHOUT BULKHEAD:
  Shared thread pool (100 threads) serves all dependencies:

  Dependency A (fast): uses 10 threads
  Dependency B (slow): uses 10 threads
  Dependency C (broken, timeout 30s): fills remaining 80 threads → blocks ALL traffic

  Result: Dependency C's failure brings down A and B too.

WITH BULKHEAD (thread pool isolation):
  Dependency A: dedicated pool of 20 threads
  Dependency B: dedicated pool of 20 threads
  Dependency C: dedicated pool of 20 threads

  Dependency C breaks → exhausts its 20 threads → A and B unaffected
  C's failure is contained; overall service degraded but not down.

WITH BULKHEAD (semaphore isolation):
  Dependency A: max 50 concurrent requests (semaphore)
  Dependency B: max 50 concurrent requests
  Dependency C: max 20 concurrent requests

  Cheaper than thread pool isolation (no thread switching overhead),
  but the call runs on the caller's thread, so the bulkhead cannot
  time out a blocking synchronous call.

Bulkheads are also applicable at higher architectural levels:
- Separate Kubernetes pods/node pools for different customer tiers
- Separate database connection pools for read vs write paths
- Separate service instances for paying vs free tier users
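Semaphore isolation is small enough to sketch directly. The class names below are illustrative; the sketch rejects immediately when no slot is free instead of queueing, which is one reasonable choice among several.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the dependency's concurrency limit is already reached."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject rather than wait when no slot is free.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, so a saturated C cannot take slots from A or B.
bulkhead_a = SemaphoreBulkhead(50)
bulkhead_c = SemaphoreBulkhead(20)
```

Because the call runs on the caller's thread, this variant still relies on the client's own timeout to bound how long a slot is held.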

Timeout

Every network call must have an explicit timeout. The default for many HTTP clients is either infinite or a very large value (minutes). Both are dangerous in production.

TIMEOUT HIERARCHY:
  Connection timeout: time to establish TCP connection (1-5 seconds)
  Request timeout:    time for complete request/response cycle
  Read timeout:       maximum wait for the server to send data once the request is sent
                      (usually time to first byte; many clients also apply it between reads)

  Set independently! A server may accept a connection quickly
  but then process slowly.

TIMEOUT BUDGET (also called "deadline propagation"):
  Frontend (user-facing) SLO: 500ms
    ↓ calls
  Service A: timeout 400ms budget remaining
    ↓ calls
  Service B: timeout 300ms budget remaining
    ↓ calls
  Database: timeout 200ms budget remaining

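  Each hop reduces the remaining budget. This prevents wasted work
  where the frontend has already given up but backend services are
  still processing the abandoned request.

  gRPC sends the deadline with each call; some implementations
  (for example Go, via context) propagate it to outgoing calls automatically.
  HTTP: propagate via custom header (e.g., X-Request-Deadline)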
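Deadline propagation over HTTP can be sketched as follows. The header name comes from the text; its format (an absolute deadline in epoch milliseconds) and the function name are assumptions, and absolute deadlines presume reasonably synchronized clocks between services.

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"  # value assumed to be epoch milliseconds


class DeadlineExceeded(Exception):
    """Raised when no budget is left for another downstream call."""


def timeout_for_downstream(headers, now_ms=None, reserve_ms=50):
    """Return (timeout_seconds, headers_to_forward) for the next hop.

    reserve_ms is held back for this service's own work after the call returns.
    """
    if now_ms is None:
        now_ms = time.time() * 1000
    deadline_ms = float(headers[DEADLINE_HEADER])
    remaining_ms = deadline_ms - now_ms - reserve_ms
    if remaining_ms <= 0:
        # The caller has given up or is about to: do not start the call.
        raise DeadlineExceeded("request budget is spent")
    # Forward the same absolute deadline so the next hop can repeat this step.
    return remaining_ms / 1000.0, {DEADLINE_HEADER: headers[DEADLINE_HEADER]}
```

The returned timeout would be passed to the HTTP client for the downstream call, and the returned headers attached to that request.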

Timeout anti-patterns:
- No timeout (infinite wait): catastrophic under failure
- Same timeout for all services: a database needs a different timeout than an in-memory cache
- Timeout without circuit breaker: every call waits out the full timeout for as long as the dependency is down, because nothing fails fast
- Very long timeouts "just in case": accumulate thread holds, cause cascade failure

Health Checks

Health checks allow orchestrators (Kubernetes, load balancers, service meshes) to route traffic only to instances capable of serving it.

HEALTH CHECK TYPES:

Liveness probe: Is the process alive and not deadlocked?
  Response: HTTP 200 if process can respond at all
  Failure action: RESTART the container/process
  Implementation: lightweight check — can the process even respond?
  Do NOT check dependencies (database, etc.) — a database being down
  should not cause the process to restart.

Readiness probe: Can this instance serve user traffic right now?
  Response: HTTP 200 if ready to accept traffic
  Failure action: REMOVE from load balancer pool (but do not restart)
  Implementation: check database connection, cache connection, any
  critical dependencies. Return 503 if not ready.
  Used during: startup, graceful shutdown, reconnection after failure.

Startup probe: Has the application completed initialization?
  Response: HTTP 200 when fully initialized
  Failure action: RESTART if takes too long
  Use case: services with long startup times (JVM warmup, ML model
  loading) that would fail liveness checks during init.

                 Process starts
                      |
                 Startup probe
                 (waiting for init)
                      |
                   passes
                      |
                 Liveness probe ←-----+
                 (is process alive?)  |  Runs continuously
                      |               |
                   passes +-----------+
                      |
                 Readiness probe ←----+
                 (can serve traffic?) |  Runs continuously
                      |               |
                   passes +-----------+
                      |
                 Add to load balancer pool
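The difference between liveness and readiness can be shown with two plain handler functions. This is a framework-neutral sketch: the function names, the (status, body) return shape, and the dictionary of named checks are assumptions made for illustration.

```python
def liveness():
    """Liveness: succeed whenever the process can run this handler at all."""
    return 200, "alive"


def _passes(check):
    # A check that raises counts as failed rather than crashing the probe.
    try:
        return bool(check())
    except Exception:
        return False


def readiness(dependency_checks, shutting_down=False):
    """Readiness: 200 only if every critical dependency check passes.

    dependency_checks maps a name to a zero-argument callable returning True or False.
    """
    if shutting_down:
        # Drain traffic during graceful shutdown without restarting.
        return 503, "shutting down"
    failed = sorted(name for name, check in dependency_checks.items() if not _passes(check))
    if failed:
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"
```

Note that liveness deliberately ignores dependencies, so a database outage removes the instance from rotation without triggering a restart.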

Graceful Degradation

Graceful degradation means serving a reduced but functional response instead of an error when a dependency fails.

EXAMPLES:

E-commerce product page:
  Dependency: recommendation service
  Full experience: "Customers also bought: [list]"
  Degraded experience: no recommendations shown
  Error experience: entire page fails to load

  Correct: catch recommendation service failure,
           render page without recommendations.

Social media feed:
  Dependency: ad targeting service
  Full experience: personalized ads
  Degraded: generic/house ads
  Error: page fails

  Correct: on ad service failure, fall back to house ads.
           Revenue reduced but not zero.

Search results:
  Dependency: personalization service
  Full experience: personalized ranking
  Degraded: default relevance ranking
  Error: no results

  Correct: serve results in default relevance order when personalization fails.

STALE CACHE FALLBACK:
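  On fresh data fetch failure, return stale cached data. For HTTP
  caches, the Cache-Control extension stale-if-error (RFC 5861) expresses
  this; stale-while-revalidate covers the related case of serving stale
  data during a background refresh.
  This is often the best degradation strategy for read-heavy paths.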

The key discipline: decide in advance what the degraded behavior is for each dependency. Do not discover it during an incident.
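One way to make the degraded behavior explicit is to wrap each dependency call in a helper that names its fallbacks. The sketch below is illustrative: FallbackReader and its in-memory last-good cache are hypothetical, and a real service would bound the cache size and the maximum staleness.

```python
class FallbackReader:
    """Fetch fresh data; on failure serve the last good value, else a default."""

    def __init__(self, fetch, default):
        self._fetch = fetch
        self._default = default
        self._last_good = {}

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key], "stale"
            return self._default, "default"
        self._last_good[key] = value
        return value, "fresh"
```

Returning the source label alongside the value lets the caller record how often it is serving degraded responses.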

Load Shedding

Load shedding is the deliberate rejection of requests when a service is operating at or near capacity. It is better to fail some requests cleanly than to fail all requests slowly.

WITHOUT LOAD SHEDDING:
  Service capacity: 1000 req/sec at p99 < 100ms
  Incoming: 2000 req/sec
  Result: queue builds up, latency climbs to seconds,
          eventually all requests timeout → complete failure

WITH LOAD SHEDDING:
  Service capacity: 1000 req/sec at p99 < 100ms
  Incoming: 2000 req/sec
  Action: reject 1000 req/sec with HTTP 503
  Result: 50% of requests succeed quickly, 50% fail fast
          Users see errors but those who retry may succeed
          Service remains healthy

SHEDDING STRATEGIES:
  1. Queue depth: reject when internal queue exceeds threshold
  2. CPU utilization: reject when CPU > 80%
  3. Concurrency limit: reject when concurrent requests > limit
  4. Latency: reject when p99 latency exceeds target
             (adaptive load shedding)

PRIORITY-BASED SHEDDING:
  Not all requests are equal. Shed low-priority first:
  - Background batch jobs → first to shed
  - Anonymous/unauthenticated requests → shed early
  - Free tier users → shed before paid tier
  - Health checks → never shed
  - Payment flows → last to shed
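A concurrency-limit shedder with two priority levels can be sketched as below. The class name and the two-level scheme are simplifying assumptions: low-priority requests are rejected once in-flight work reaches a lower limit, which leaves headroom for high-priority traffic.

```python
import threading


class LoadShedder:
    def __init__(self, max_in_flight, low_priority_limit):
        self._max = max_in_flight
        self._low_limit = low_priority_limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self, high_priority=False):
        """Return True if the request may proceed, False if it should get a 503."""
        limit = self._max if high_priority else self._low_limit
        with self._lock:
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def done(self):
        """Call exactly once for every admitted request, after it finishes."""
        with self._lock:
            self._in_flight -= 1
```

A request handler would call try_admit on entry, return 503 immediately on False, and call done in a finally block otherwise.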

Backpressure

Backpressure propagates a signal upstream when a downstream component is saturated, allowing the system to self-regulate without unbounded queue growth.

WITHOUT BACKPRESSURE:
  Producer → [Queue: growing without bound] → Consumer (slow)

  Producer keeps producing, queue grows to memory limit,
  OOM kill, or long processing delays for queued items.

WITH BACKPRESSURE:
  Producer → [Queue: bounded] → Consumer
     ↑                              |
     |     "I'm full, slow down"    |
     +------------------------------+

  When queue reaches capacity, producer is told to slow down
  (block, retry, or drop). Consumer controls the flow rate.

IMPLEMENTATIONS:
  TCP: flow control via window size (built-in backpressure)
  Reactive Streams (RxJava, Project Reactor): request(n) protocol
  Kafka: consumer controls poll rate; producer blocks when buffer full
  gRPC: flow control per stream via HTTP/2 window
  Go channels: blocking send on full channel = backpressure

BACKPRESSURE VS LOAD SHEDDING:
  Backpressure: signal to slow down, upstream adapts
  Load shedding: flat rejection, upstream cannot adapt

  Use backpressure between cooperating internal services.
  Use load shedding at the edge where you cannot control the caller.
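Within a single process, a bounded queue gives the same effect as the diagram above. In this sketch the helper name offer and the short timeout are illustrative choices; the producer blocks briefly when the queue is full and is told when the item was not accepted.

```python
import queue


def offer(q, item, timeout=0.05):
    """Try to enqueue; block up to timeout seconds if the queue is full.

    Returns False when the consumer has not made room in time, which is the
    signal for the producer to slow down, retry later, or drop the item.
    """
    try:
        q.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False


work = queue.Queue(maxsize=2)  # bounded: the queue cannot grow without limit
print(offer(work, "a"), offer(work, "b"), offer(work, "c"))
```

Once the consumer takes an item with work.get(), the next offer succeeds again, so the consumer's pace sets the flow rate.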

Rate Limiting

Rate limiting protects services from being overwhelmed by any single client, and protects downstream services from being overwhelmed by the service itself.

Rate limiting algorithms:

TOKEN BUCKET:
  Bucket holds N tokens. Each request consumes 1 token.
  Tokens replenished at rate R per second.
  If bucket empty, request is rejected.

  Allows bursting up to bucket size N.
  Good for: client-facing APIs where brief bursts are acceptable.

  [tokens: ████████░░] (8 of 10 full)
  Request → consume 1 token → [tokens: ███████░░░]
  No request → refill 1/sec → [tokens: ████████░░]
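A token bucket fits in a short class. This is a single-process sketch with hypothetical names; it refills lazily on each call instead of running a timer, and it is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so tests can control time
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.updated_at
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity 10 and a refill rate of 1 per second, a client can burst 10 requests at once and then sustain one request per second.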

LEAKY BUCKET:
  Requests enter a queue. Process queue at fixed rate R.
  If queue full, reject request.

  No bursting. Output rate is always exactly R.
  Good for: rate-smoothing to downstream services.

  [queue: ▓▓▓▓░░░░] → process at 10/sec

FIXED WINDOW COUNTER:
  Count requests per time window (e.g., per minute).
  If count > limit, reject.

  Problem: boundary effect. With a limit of 60 per minute, 60 requests
  in the last second of window 1 + 60 requests in the first second of
  window 2 = 120 requests in 2 seconds, 2x the limit.

SLIDING WINDOW LOG:
  Keep timestamp of each request. Count requests in rolling window.
  If count > limit, reject.

  No boundary effect. Expensive: stores every request timestamp.

SLIDING WINDOW COUNTER:
  Hybrid: weighted blend of current and previous window counters.
  current_count + (previous_count × (1 - elapsed_fraction))

  Approximates sliding window cheaply.
  Used by several Redis-based rate limiting libraries.
  (Nginx's limit_req module uses the leaky bucket method instead.)
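The sliding window counter estimate can be sketched as follows. The class is a single-process illustration with assumed names; the caller passes the current time in seconds, and a request is rejected when the weighted estimate has already reached the limit.

```python
class SlidingWindowCounter:
    def __init__(self, limit, window_sec=60.0):
        self.limit = limit
        self.window_sec = window_sec
        self.current_window = None
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now):
        window = int(now // self.window_sec)
        if window != self.current_window:
            # Roll over. The old count only matters if it is the adjacent window.
            adjacent = self.current_window is not None and window == self.current_window + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_window = window
        elapsed_fraction = (now % self.window_sec) / self.window_sec
        # Weight the previous window by how much of it still overlaps the rolling window.
        estimate = self.current_count + self.previous_count * (1 - elapsed_fraction)
        if estimate >= self.limit:
            return False
        self.current_count += 1
        return True
```

Replaying the fixed-window boundary case (60 requests at the end of one minute, then more at the start of the next) shows the second burst being rejected almost entirely, because the previous window still carries nearly full weight.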

Rate limiting scope:
- Per-IP rate limiting: protects against individual misbehaving clients
- Per-user/API-key: fair usage enforcement
- Per-endpoint: protect expensive operations more strictly
- Global: protect downstream service from aggregate traffic

Production Example: Combined Pattern Usage

TYPICAL SERVICE DEPENDENCY PROTECTION STACK:

Client request arrives
       |
       v
[Rate Limiter] → 429 if client over limit
       |
       v
[Load Shedder] → 503 if service at capacity
       |
       v
[Bulkhead: thread pool for Dependency X]
       |
       v
[Circuit Breaker: Dependency X] → fail fast if open
       |
       v
[Timeout: 500ms] → fail if slow
       |
       v
Dependency X call
       |
  fails? → [Retry with exponential backoff + jitter] → max 3 attempts
       |
  still fails? → [Graceful Degradation] → serve cached/default response

Debugging Notes

Circuit breaker stuck open: If the circuit never transitions to HALF-OPEN, check the timeout configuration. If the dependency recovered quickly but the circuit stays open for minutes, the timeout is too long. Monitor circuit state as a metric, and alert on circuits that have been open for > 5 minutes.

Retry storms: If you see sudden spikes in request volume to a dependency, look for retry amplification. Retries multiply at every layer: if each of three services in the call chain independently makes up to 3 attempts, one failed request at the top level can become up to 3 × 3 × 3 = 27 requests to the backend. Use a retry budget and back off aggressively.

Bulkhead misconfiguration: A bulkhead with too few threads causes excessive rejection even when the dependency is healthy. Monitor thread pool utilization and rejection rate. Right-size by measuring peak concurrency under normal load, then adding headroom.

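Health check false positives: A readiness check that checks a database connection will mark the instance not ready during network blips, causing load balancer churn and potentially cascading. Require several consecutive failures before marking the instance not ready (for example, the Kubernetes failureThreshold setting), or add circuit breakers inside health check probes so a brief blip does not flip readiness.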

Security Implications

Rate limiting as DDoS defense: Rate limiting at the edge is a first line of defense against volumetric attacks. However, sophisticated attackers use distributed IPs, defeating per-IP limits. Use rate limiting in combination with WAF and anycast DDoS mitigation.
Circuit breakers and information disclosure: A fast-failing open circuit may behave differently from a slow-failing closed circuit, potentially leaking information about the internal system state to an attacker timing requests.
Retry amplification attacks: An attacker can deliberately trigger slow responses to cause the service to retry, multiplying load. Retry budgets and circuit breakers mitigate this.

Performance Implications

Circuit breakers have minimal overhead — a single in-memory counter check per request
Thread pool bulkheads have context-switching overhead; semaphore bulkheads are cheaper
Retry logic should add no overhead to requests that succeed on the first attempt; build backoff state and schedule timers only after a failure
Rate limiter implementations using Redis (for distributed rate limiting) add a network round trip. Use local token buckets for high-throughput paths; use Redis for cross-instance coordination only

Failure Modes

Missing timeout: The most common reliability mistake in production. In practice, most services turn out to have at least one dependency with no timeout configured.

Retry without backoff: Synchronous retries with no delay cause immediate thundering herds.

Circuit breaker on non-critical path: A circuit breaker on a non-critical dependency opens and causes the entire service to fail instead of degrading gracefully. An open circuit should trigger the graceful degradation path, not an unhandled exception.

Over-broad bulkheads: One bulkhead for all external dependencies. Dependency A failure still affects B and C. Bulkheads should be per-dependency.

Modern Usage

Service meshes (Istio, Linkerd) implement circuit breakers, retries, timeouts, and health checks at the infrastructure level, without requiring application code changes. This is increasingly the standard for Kubernetes-based architectures. The tradeoff: centralized policy is easier to manage consistently but harder to customize per-service.

Resilience4j (Java) and Polly (.NET) are common language-level implementations of these patterns; Envoy proxy, deployed as a sidecar, is a common infrastructure-level one.

Future Directions

Adaptive thresholds: Circuit breakers that automatically tune their thresholds based on historical traffic patterns rather than static configuration.

ML-driven load shedding: Predicting which requests to shed to maximize user-facing SLO compliance, rather than simple FIFO rejection.

eBPF-based reliability: Implementing timeout and retry policies at the kernel networking level via eBPF, reducing per-request overhead.

Exercises

Implement a circuit breaker with three states (CLOSED, OPEN, HALF-OPEN) using a state machine in a language of your choice. Test it with a mock service that alternates between healthy and failing.
Design a retry policy for a payment processing service. What makes payment retries different from typical API retries? What errors should never be retried?
A service has 200 concurrent threads shared across 5 dependencies. Dependency E starts timing out at 10 seconds. Model how quickly all threads become exhausted. How many bulkhead threads would you assign to each dependency?
Implement fixed window counter rate limiting using the Redis INCR command. How do you handle the race condition between checking the count and incrementing it?
Your service is experiencing 50% error rate from a dependency. Should you prefer load shedding, circuit breaking, or graceful degradation? What factors determine the choice?

References

Fowler, M. "CircuitBreaker." MartinFowler.com, 2014.
Nygard, M.T. Release It! Design and Deploy Production-Ready Software. 2nd ed., Pragmatic Bookshelf, 2018. Chapters 4-5.
Brooker, M. "Exponential Backoff And Jitter." AWS Architecture Blog, 2015.
Netflix Technology Blog. "Hystrix: Latency and Fault Tolerance for Distributed Systems." 2012.
Reactive Streams Specification: https://www.reactive-streams.org/
Richardson, C. Microservices Patterns. Manning, 2018. Chapter 3: Inter-process communication.