Capacity Planning

Overview

Capacity planning is the discipline of ensuring that systems have sufficient resources to handle anticipated demand before those resources run out. It sits at the intersection of forecasting, queueing theory, and economic tradeoffs: too little capacity means outages; too much means wasted money. In cloud environments, capacity planning has evolved from "buying servers six months in advance" to a continuous loop of forecasting, load testing, autoscaling configuration, and cost optimization.

The goal is not to predict the future perfectly — it is to avoid catastrophic under-provisioning while minimizing wasteful over-provisioning. This document covers the theory (Little's Law, Amdahl's Law, the USE method), practical methodology (demand forecasting, load testing), and real-world tradeoffs from production systems at scale.

Prerequisites

  • Understanding of system performance metrics: throughput, latency, utilization
  • Basic queueing theory concepts
  • Familiarity with load testing tools (locust, k6, wrk)
  • Understanding of autoscaling (horizontal pod autoscaler, cloud autoscaling groups)
  • Business context: traffic patterns, growth rate, major events

Historical Context

Capacity planning originated in the mainframe era of the 1960s-70s when computing resources were enormously expensive and lead times for hardware procurement were months. IBM developed formal models for predicting mainframe utilization, and the field produced foundational queueing theory work (Kleinrock, 1975).

In the 1990s-2000s web era, capacity planning meant buying physical servers 3-6 months ahead, over-provisioning by 2-3x as a safety margin, and doing annual capacity reviews. The cost of hardware meant these decisions had long-lived consequences.

The cloud era (AWS EC2 launched 2006, autoscaling 2009) changed the economics: resources could be provisioned in minutes, and the unit cost fell dramatically. This shifted capacity planning from "buy the right hardware" to "configure the right autoscaling policies and cost controls."

Notable failures that shape modern capacity planning practices:

  • 2011 PlayStation Network outage: A security incident plus under-provisioned recovery infrastructure led to a 23-day outage affecting about 77 million accounts and a $171M loss.
  • 2012 AWS US-East failure: The Elastic Load Balancing control plane was overwhelmed during an EBS maintenance event; even the systems meant to handle failover weren't capacity-planned for failure scenarios.
  • 2013 Target Black Friday: Database capacity was not planned for the omnichannel surge, causing checkout outages during peak.
  • 2021 Facebook outage: The BGP withdrawal that took down Facebook also overloaded their systems when billions of users tried to refresh simultaneously; recovery itself triggered a capacity crisis.
  • Twitter 2022-2023: The Musk-era 75% headcount reduction and infrastructure cuts revealed significant over-provisioning (which was subsequently cut), but also caused several reliability incidents as headroom was eliminated too aggressively.

Demand Forecasting

Accurate demand forecasting is the foundation of capacity planning. Without a good forecast, headroom targets are meaningless.

Linear Extrapolation

The simplest model: fit a line (or exponential curve) to historical traffic data.

  Monthly active users (MAU) growth:

  Jan: 1.0M
  Feb: 1.1M
  Mar: 1.2M
  Apr: 1.35M
  May: 1.5M

  Linear trend: +125k/month (0.5M gained over 4 months)
  2-month forecast: July = 1.75M

  Compound growth (better for tech products):
  Growth rate = (1.5/1.0)^(1/4) = 1.107 per month (10.7% MoM)
  July forecast = 1.5M × 1.107^2 = 1.84M

  Limitations:
  - Does not account for product growth inflection points
  - Cannot predict viral growth or sudden plateaus
  - Does not capture seasonality
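
The two extrapolations above can be reproduced with a short sketch. It assumes the simplest possible fit (slope and growth factor taken from the first and last data points, not a least-squares regression); the function names are illustrative only.

```python
def linear_forecast(history, periods_ahead):
    """Extrapolate using the average absolute change per period (endpoint fit)."""
    step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + step * periods_ahead


def compound_forecast(history, periods_ahead):
    """Extrapolate using the average period-over-period growth factor."""
    factor = (history[-1] / history[0]) ** (1 / (len(history) - 1))
    return history[-1] * factor ** periods_ahead


# Jan..May MAU in millions, taken from the figures above.
mau = [1.0, 1.1, 1.2, 1.35, 1.5]

# July is two months after the last observation (May).
print(f"linear July forecast:   {linear_forecast(mau, 2):.2f}M")
print(f"compound July forecast: {compound_forecast(mau, 2):.2f}M")
```

The gap between the two results widens quickly as the horizon grows, which is why the choice of model matters more for 12-month plans than for 2-month ones.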

Seasonality Modeling

Most consumer products have recurring patterns that must be accounted for:

  Seasonality patterns:

  Daily:   8am-10pm peak, overnight trough (~50% of peak)
  Weekly:  weekdays >1.5× weekends for B2B; reverse for consumer
  Annual:  Q4 peak for e-commerce (Nov-Dec ~3× normal)
           Summer peak for travel/food delivery
           January peak for fitness/banking apps

  Decomposition approach (STL: Seasonal and Trend decomposition using Loess):

  Observed = Trend + Seasonal + Residual

  Example: plan for Q4 e-commerce peak

  August baseline: 1000 req/s
  August → November trend: +5%/month = 1157 req/s
  November seasonal multiplier: 2.8× (historical)
  Black Friday spike multiplier: 4× of November baseline

  Required capacity for Black Friday:
    1157 × 2.8 × 4 = ~13,000 req/s

  Add 30% headroom: ~17,000 req/s

  Time to provision: 6 weeks before Black Friday
  (allows load testing + tuning of new infrastructure)
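
The Black Friday arithmetic above can be wrapped in a small helper so the multipliers are easy to vary. This is a sketch under the same assumptions as the example (compound monthly trend, multiplicative seasonal and spike factors); the function and parameter names are hypothetical.

```python
def peak_capacity(baseline_rps, monthly_growth, months,
                  seasonal_mult, spike_mult, headroom):
    """Return (trended baseline, projected peak, peak plus headroom) in req/s."""
    trended = baseline_rps * (1 + monthly_growth) ** months
    peak = trended * seasonal_mult * spike_mult
    return trended, peak, peak * (1 + headroom)


# August baseline, +5%/month for 3 months, 2.8x November, 4x Black Friday, 30% headroom.
trended, peak, target = peak_capacity(1000, 0.05, 3, 2.8, 4, 0.30)
print(f"November trend baseline: {trended:,.0f} req/s")
print(f"Black Friday peak:       {peak:,.0f} req/s")
print(f"Provisioning target:     {target:,.0f} req/s")
```

Because the factors multiply, a small error in any one of them (for example a 3.2× seasonal multiplier instead of 2.8×) moves the provisioning target by the same percentage.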

Event-Driven Spikes

Some spikes are predictable but not periodic:

  Known future events requiring capacity planning:

  Product launches: "Galaxy S25 announcement" → 10× spike in Samsung.com
  Marketing campaigns: Super Bowl ad → 50-100× spike in 30-second window
  External integrations: "Now available on iOS App Store" → unpredictable
  Competitor outages: "AWS us-east-1 down" → traffic to AWS competitors +300%

  Super Bowl example (advertising during game):

  Before ad: 50 req/s
  During ad: 5,000 req/s (100× surge in 30 seconds)

  Lessons:
  1. Pre-warm autoscaling groups: EC2 instances take 3-5 minutes to start
     → Pre-scale to 10× BEFORE the ad airs
  2. Test the pre-warm by running load tests at target scale
  3. CDN offload: static pages fully cached, reduces origin load
  4. Graceful degradation: under extreme load, serve simplified version
     (text-only, no images, no personalization)
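
A rough way to size the pre-warm is sketched below. The CDN offload fraction and per-instance throughput are assumptions made up for illustration (they would come from your own cache hit ratio and load tests), and the helper names are hypothetical.

```python
import math


def reactive_scaling_helps(ramp_seconds, startup_seconds):
    """New instances only help if they can boot before the surge has fully arrived."""
    return startup_seconds < ramp_seconds


def prewarm_instances(surge_rps, cdn_offload, per_instance_rps, headroom=0.30):
    """Instances that must already be running when the surge begins."""
    origin_rps = surge_rps * (1 - cdn_offload)          # traffic that reaches origin
    needed = origin_rps * (1 + headroom) / per_instance_rps
    return math.ceil(round(needed, 9))                  # round() guards against float noise


# 30-second ramp vs. a 3-minute instance start: autoscaling cannot react in time.
print(reactive_scaling_helps(ramp_seconds=30, startup_seconds=180))   # False

# Assumed: 75% of requests served by the CDN, 100 req/s per instance.
print(prewarm_instances(surge_rps=5000, cdn_offload=0.75, per_instance_rps=100))  # 17
```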

Load Testing Methodology

Load testing answers the question: "What is my system's saturation point?"

  Capacity Planning Load Test Process:

  Step 1: Establish baseline
    - What is current peak production load? (req/s, concurrent users)
    - What is current infrastructure size?
    - What are current p50/p95/p99 latencies at baseline?

  Step 2: Define success criteria
    - SLO: p99 latency < 500ms
    - Availability: 99.9% success rate (error rate < 0.1%) during test
    - Define "saturation": when SLO is violated or error rate > 0.1%

  Step 3: Ramp-up test (find breaking point)
    Tool: k6, locust, wrk2, Apache JMeter

    Start at: 1× current peak
    Increase: +10% every 5 minutes (or use k6 stages)
    Stop when: SLO violated or error rate exceeds threshold
    Record: saturation point (request rate when SLO breaks)

  Step 4: Identify bottleneck resource
    - Which resource saturates first: CPU, memory, DB connections, network?
    - Use USE method: check utilization + saturation per resource

  Step 5: Calculate headroom
    If saturation at 10,000 req/s and current peak is 3,000 req/s:
    Headroom = 233% (3.3× current load before saturation)
    Target: 70% utilization = 7,000 req/s. Plan to add capacity
    before peak load grows beyond that.

  Step 6: Translate to infrastructure
    Current: 10 app servers handle 3,000 req/s
    Saturation: 10 servers at 10,000 req/s (1,000 req/s per server)
    Target: be safe up to 17,000 req/s (Black Friday projection)
    Needed: 10 × (17,000/10,000) × 1.3 (headroom) = 22.1 → round up to 23 servers

# k6 ramp-up load test:
cat > load_test.js << 'EOF'
import http from 'k6/http';
import { check } from 'k6';

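// Note: stage targets are virtual users (VUs), not req/s. To drive a fixed
// request rate instead, use a scenario with the ramping-arrival-rate executor.
export const options = {
  stages: [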
    { duration: '5m', target: 1000 },   // ramp to 1000 VUs
    { duration: '10m', target: 1000 },  // hold
    { duration: '5m', target: 2000 },   // ramp to 2000
    { duration: '10m', target: 2000 },  // hold
    { duration: '5m', target: 5000 },   // ramp to 5000 — find saturation
    { duration: '5m', target: 0 },      // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],   // SLO: p99 < 500ms
    http_req_failed: ['rate<0.001'],    // <0.1% errors
  },
};

export default function() {
  const res = http.get('https://your-service/api/endpoint');
  check(res, { 'status was 200': (r) => r.status === 200 });
}
EOF
k6 run load_test.js
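
Once the load test has produced a saturation point, Steps 5 and 6 reduce to a few lines of arithmetic. The sketch below assumes throughput scales linearly with server count (which the Amdahl's Law discussion shows is optimistic) and rounds up to whole servers; the function names are illustrative.

```python
import math


def headroom_pct(saturation_rps, peak_rps):
    """How much current peak load can grow before saturation, in percent."""
    return (saturation_rps - peak_rps) / peak_rps * 100


def scale_trigger_rps(saturation_rps, target_utilization=0.70):
    """Peak load at which more capacity should already be in place."""
    return saturation_rps * target_utilization


def servers_needed(current_servers, saturation_rps, target_rps, headroom=0.30):
    """Whole servers required for target_rps plus headroom, assuming linear scaling."""
    per_server_rps = saturation_rps / current_servers
    needed = target_rps * (1 + headroom) / per_server_rps
    return math.ceil(round(needed, 9))   # round() guards against float noise before ceil


print(f"headroom: {headroom_pct(10_000, 3_000):.0f}%")           # 233%
print(f"scale trigger: {scale_trigger_rps(10_000):,.0f} req/s")  # 7,000 req/s
print(f"servers needed: {servers_needed(10, 10_000, 17_000)}")   # 23
```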

Little's Law

Little's Law is a fundamental theorem of queueing theory with direct applications to capacity planning:

  Little's Law: L = λW

  L = average number of items in the system (queue + service)
  λ = average arrival rate (items per second)
  W = average time an item spends in the system (seconds)

  Example: Web server

  Observed:
    λ = 1,000 requests/second (arrival rate)
    W = 200ms average response time (including queue time)

  Therefore:
    L = 1,000 × 0.2 = 200 concurrent requests in the system

  Application to capacity planning:

  Question: We want to handle 5,000 req/s at 200ms response time.
  How many concurrent request slots do we need?

  L = 5,000 × 0.2 = 1,000 concurrent requests

  Each server handles 100 concurrent requests (from load testing):
  Servers needed = 1,000 / 100 = 10 servers

  Alternative use: if an optimization cuts latency from 200ms to 150ms:
  L = 5,000 × 0.15 = 750 concurrent requests
  → Can reduce server count from 10 to 7.5 (round up to 8)

  Thread pool sizing:

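  If each request holds one thread for its full service time:
  Thread pool size = λ × W, where W is the time a thread is held per request
  (service time including blocking I/O, excluding time spent queued for a thread)

  For a service with 1,000 req/s, 50ms thread time per request:
  Threads needed = 1,000 × 0.05 = 50 threads minimum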
  Add 30% headroom = 65 threads
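
The Little's Law calculations above translate directly into code. This is a minimal sketch: the per-server concurrency limit is whatever your load test measured, and the function names are hypothetical.

```python
import math


def concurrency(arrival_rate_per_s, time_in_system_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * time_in_system_s


def servers_for(arrival_rate_per_s, time_in_system_s, slots_per_server):
    """Whole servers needed to hold the average number of in-flight requests."""
    in_flight = concurrency(arrival_rate_per_s, time_in_system_s)
    return math.ceil(round(in_flight / slots_per_server, 9))


def thread_pool_size(arrival_rate_per_s, thread_hold_s, headroom=0.30):
    """Average busy threads plus headroom, rounded up."""
    busy = concurrency(arrival_rate_per_s, thread_hold_s)
    return math.ceil(round(busy * (1 + headroom), 9))


print(concurrency(1000, 0.2))          # in-flight requests at 1,000 req/s and 200ms
print(servers_for(5000, 0.2, 100))     # servers at 200ms
print(servers_for(5000, 0.15, 100))    # servers at 150ms
print(thread_pool_size(1000, 0.05))    # threads with 30% headroom
```

Little's Law describes averages only. It says nothing about the distribution of concurrency, so the headroom factor is what protects against bursts.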

Amdahl's Law

Amdahl's Law bounds the speedup achievable by parallelizing a system:

  Amdahl's Law: Speedup(N) = 1 / (S + (1-S)/N)

  S = serial fraction of the work (cannot be parallelized)
  N = number of parallel processors/threads

  Examples:

  If 5% of work is serial (S = 0.05):
    N=2:   1 / (0.05 + 0.95/2)   = 1.90× speedup
    N=10:  1 / (0.05 + 0.95/10)  = 6.90× speedup
    N=100: 1 / (0.05 + 0.95/100) = 16.8× speedup
    N=∞:   1 / 0.05              = 20× maximum possible speedup

  Implication for capacity planning:

  Horizontal scaling (adding servers) has diminishing returns
  if there is ANY serial component:
  - Single-threaded database writes: S = high → can't scale writes horizontally
  - Session stickiness: S = high → can't route freely
  - Global rate limiter: serial path → bottleneck at scale

  Identify and eliminate serial paths before scaling:
  - Replace synchronous writes with async queues
  - Use consistent hashing to eliminate central coordinators
  - Shard rate limiters by user ID or region

  Gunther's Universal Scalability Law (USL):
  Extends Amdahl with a "crosstalk" term for distributed coordination overhead:

  Speedup(N) = N / (1 + σ(N-1) + κN(N-1))

  σ = contention coefficient (like Amdahl's serial fraction)
  κ = coherency/crosstalk coefficient (cost of coordination)

  When κ > 0, adding more nodes eventually DECREASES throughput
  (classic distributed systems problem: gossip/consensus overhead)
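
Both scaling laws are easy to evaluate numerically. In the sketch below, the σ and κ values are made-up inputs for illustration (in practice they are fitted to measured throughput), and the peak formula N* = sqrt((1 - σ) / κ) follows from setting the derivative of the USL expression to zero.

```python
import math


def amdahl_speedup(serial_fraction, n):
    """Amdahl's Law: 1 / (S + (1 - S) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def usl_speedup(n, sigma, kappa):
    """Universal Scalability Law: N / (1 + sigma*(N-1) + kappa*N*(N-1))."""
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_nodes(sigma, kappa):
    """Node count at which USL throughput is highest (requires kappa > 0)."""
    return math.sqrt((1.0 - sigma) / kappa)


for n in (2, 10, 100):
    print(f"Amdahl S=0.05, N={n}: {amdahl_speedup(0.05, n):.2f}x")

# Hypothetical coefficients: 5% contention, 0.1% crosstalk.
sigma, kappa = 0.05, 0.001
print(f"USL peak at about {usl_peak_nodes(sigma, kappa):.0f} nodes")
print(f"USL at 31 nodes:  {usl_speedup(31, sigma, kappa):.2f}x")
print(f"USL at 100 nodes: {usl_speedup(100, sigma, kappa):.2f}x")
```

With κ = 0 the USL expression reduces to Amdahl's Law, which is a quick sanity check when fitting the coefficients.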

USE Method for Capacity

The USE (Utilization, Saturation, Errors) method by Brendan Gregg provides a systematic framework for identifying resource bottlenecks:

  For each resource in the system:

  Utilization: what percentage of time is the resource busy?
    - CPU: 75% → approaching saturation
    - Disk I/O: 80% busy → likely bottleneck
    - Network: 60% bandwidth used
    - Memory: 90% used (but different semantics — high is OK unless swapping)

  Saturation: how much work is waiting (queued)?
    - CPU: run queue length > number of CPUs → saturated
    - Disk: I/O wait time > 1ms avg → queue forming
    - Network: TX queue drops → saturated
    - Memory: swap activity → RAM saturated

  Errors: are there errors from the resource?
    - Network: dropped packets → overload or NIC issue
    - Disk: I/O errors → hardware failure
    - CPU: machine check exceptions → hardware issue

  Capacity headroom targets by resource:

  +------------------+------------------+---------------------------+
  | Resource         | Warning          | Critical / Plan to scale  |
  +------------------+------------------+---------------------------+
  | CPU utilization  | >60% sustained   | >80% peak                 |
  | CPU run queue    | >1× core count   | >2× core count            |
  | Memory           | >85% used        | Any swap activity         |
  | Disk I/O         | >70% busy        | >90% busy                 |
  | Network          | >60% bandwidth   | >80% bandwidth            |
  | DB connections   | >70% pool used   | >90% pool used            |
  | Thread pool      | >70% active      | Any queue wait            |
  +------------------+------------------+---------------------------+
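
The utilization rows of the table can be encoded as a simple lookup for use in a report or alerting script. The metric keys below are hypothetical names, not identifiers from any monitoring system, and only the percentage-based rows are covered.

```python
# (warning, critical) thresholds as fractions, copied from the table above.
THRESHOLDS = {
    "cpu_utilization": (0.60, 0.80),
    "memory_used": (0.85, 1.00),      # critical is really "any swap"; 1.00 is a stand-in
    "disk_busy": (0.70, 0.90),
    "network_bandwidth": (0.60, 0.80),
    "db_pool_used": (0.70, 0.90),
}


def classify(resource, value):
    """Return 'ok', 'warning', or 'critical' for a utilization fraction."""
    warning, critical = THRESHOLDS[resource]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


sample = {"cpu_utilization": 0.75, "disk_busy": 0.93, "db_pool_used": 0.40}
for resource, value in sample.items():
    print(f"{resource}: {value:.0%} -> {classify(resource, value)}")
```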

Headroom Targets

A common rule of thumb is to target roughly 70% peak utilization as the operational ceiling:

  Why 70% and not 100%?

  1. Traffic is spiky: if the average already sits at 100% of capacity,
     every spike exceeds it. Running at 70% leaves 30% to absorb
     transient spikes.

  2. Autoscaling lag: cloud instances take 3-5 minutes to start.
     If you hit 100% before scaling triggers, you have a gap.

  3. Failure absorption: if one of N servers fails, remaining N-1
     must absorb the load. At 70%, each survivor runs at 70% × N/(N-1).
     For N=5 that is 87.5% per server, still within capacity.

  4. Maintenance: rolling restarts, deployments, and updates take
     servers offline temporarily.

  Calculating target capacity:

  Current peak load: P = 5,000 req/s
  Saturation point:  S = 10,000 req/s (from load test)
  Current utilization at peak: 5,000/10,000 = 50% ← good headroom now

  Forecast 6-month growth: 30% increase in traffic
  Expected peak in 6 months: 6,500 req/s
  Expected utilization: 6,500/10,000 = 65% ← still fine

  Forecast 12-month growth: 100% increase (doubling)
  Expected peak in 12 months: 10,000 req/s
  Expected utilization: 100% ← alarm! Must add capacity before then

  Action: scale to handle 14,300 req/s (10,000 / 0.70) by month 9.
  That is about 43% more capacity: if 10 servers provide today's 10,000 req/s,
  grow to 15 servers of the same spec (adding 5).
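
The headroom reasoning above can be checked with a short sketch. It assumes load spreads evenly across servers and that capacity scales linearly with server count; the 1,000 req/s per-server figure is an assumption carried over from the load-testing example, and the function names are illustrative.

```python
import math


def utilization_after_failure(utilization, n_servers, failed=1):
    """Per-server utilization once `failed` servers drop out and load is redistributed."""
    if failed >= n_servers:
        raise ValueError("at least one server must survive")
    return utilization * n_servers / (n_servers - failed)


def required_capacity(forecast_peak_rps, target_utilization=0.70):
    """Saturation capacity needed so the forecast peak sits at the target utilization."""
    return forecast_peak_rps / target_utilization


def servers_for_capacity(capacity_rps, per_server_rps):
    """Whole servers needed to provide capacity_rps."""
    return math.ceil(round(capacity_rps / per_server_rps, 9))


print(f"N=5 at 70%, one failure: {utilization_after_failure(0.70, 5):.1%}")
capacity = required_capacity(10_000)
print(f"capacity for a 10,000 req/s peak at 70%: {capacity:,.0f} req/s")
print(f"servers at 1,000 req/s each: {servers_for_capacity(capacity, 1_000)}")
```

Small fleets are the fragile case: with N=3, losing one server at 70% utilization pushes the survivors past 100%.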

Capacity Planning Workflow Diagram

  Capacity Planning Workflow

  ┌─────────────────────────────────────────────────────────────────┐
  │                     CAPACITY PLANNING CYCLE                      │
  └─────────────────────────────────────────────────────────────────┘

  Step 1: Data Collection
  ┌──────────────────┐    ┌──────────────────┐    ┌─────────────────┐
  │ Traffic Metrics  │    │  Resource Metrics │    │ Business Signals│
  │ - req/s by region│    │ - CPU/mem/disk    │    │ - Product roadmap│
  │ - User growth    │    │ - DB connections  │    │ - Marketing plans│
  │ - Historical     │    │ - Queue depths    │    │ - Seasonality   │
  └────────┬─────────┘    └────────┬──────────┘    └────────┬────────┘
           └──────────────────────┬┘                        │
                                  ▼                         │
  Step 2: Forecast              Demand Model ◄──────────────┘
                                  │
                         Traffic projections
                         (3/6/12 month horizons)
                                  │
                                  ▼
  Step 3: Load Testing         Saturation Point
                                  │
                         Per-resource bottleneck
                                  │
                                  ▼
  Step 4: Gap Analysis     Current vs Required Capacity
                         ┌──────────────────────────────┐
                         │ Current capacity: 10,000 rps  │
                         │ 12-month forecast: 20,000 rps │
                         │ Gap: need 2× scaling          │
                         └──────────────────────────────┘
                                  │
                                  ▼
  Step 5: Plan             Scaling Strategy
                         - Add 15 servers by month 9
                         - Pre-warm autoscaling for Black Friday
                         - Database read replicas for read scaling
                                  │
                                  ▼
  Step 6: Execute + Monitor     Continuous feedback
          ← (repeat cycle every quarter or on major events)

Cloud Capacity Planning Tradeoffs

The cloud changes capacity planning in important ways:

  Traditional (on-premises):
    + Resources paid for upfront → strong incentive to be efficient
    + No variable cost → simpler billing
    - 6-month procurement lead times
    - Under-provisioned = outage; Over-provisioned = sunk cost

  Cloud (IaaS/PaaS):
    + Provision in minutes → just-in-time provisioning possible
    + Autoscaling handles routine spikes automatically
    - Cost varies with load → spikes cost money
    - Spot/preemptible instances: cheap but can be reclaimed
    - Resource limits (vCPU quotas) can still cause provisioning failures

  Over-provisioning in cloud:
    Cost: expensive (e.g., 100 EC2 instances at $500/month = $50k/month)
    Benefit: protection against demand spikes, simple operations

  Just-in-time (autoscaling only):
    Cost: optimal under normal conditions
    Risk: startup lag during sudden spikes, quota limits

  Recommended hybrid approach:
    - Base capacity: 70% of expected peak, always running (reserved instances)
    - Burst capacity: autoscaling to 150% of expected peak (on-demand)
    - Emergency: pre-warmed AMIs / container images for fast startup
    - Chaos: quarterly "autoscaling fire drills" — kill servers, verify scale-up
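
One way to turn the hybrid approach into autoscaling group bounds is sketched below. The per-instance throughput is an assumed input from load testing, and the returned keys are hypothetical names rather than fields of any cloud provider's API.

```python
import math


def hybrid_plan(expected_peak_rps, per_instance_rps,
                base_fraction=0.70, burst_fraction=1.50):
    """Map the base/burst percentages onto minimum and maximum instance counts."""
    base = expected_peak_rps * base_fraction / per_instance_rps
    ceiling = expected_peak_rps * burst_fraction / per_instance_rps
    return {
        "min_instances": math.ceil(round(base, 9)),      # always on (reserved)
        "max_instances": math.ceil(round(ceiling, 9)),   # autoscaling upper bound
    }


plan = hybrid_plan(expected_peak_rps=10_000, per_instance_rps=1_000)
print(plan)  # {'min_instances': 7, 'max_instances': 15}
```

The maximum also acts as a cost ceiling, so it is worth confirming that the cloud account's quota actually allows that many instances.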

Real Incident: Twitter Over-Provisioning Post-Acquisition

The 2022 Twitter acquisition under Elon Musk revealed extreme over-provisioning:

  Twitter pre-acquisition (2022):
  - ~8,000 servers in data centers
  - Running at ~15-20% average utilization (highly over-provisioned)
  - Large "just in case" capacity for viral events (elections, breaking news)
  - Infrastructure cost: ~$1B/year

  Post-acquisition infrastructure cuts:
  - Reduced to ~1,000-2,000 servers (75-80% reduction)
  - Eliminated multiple data centers
  - Average utilization rose to 50-60%

  Outcome:
  - Most users experienced no degradation for routine use,
    which suggests substantial prior over-provisioning
  - BUT: several high-traffic events (World Cup, product launches)
    caused notable performance degradation and partial outages
  - Monitoring and observability degraded (tools cut too aggressively)
  - "Success" of cuts depended heavily on reduced feature set
    (many features removed = less compute needed)

  Lessons:
  1. Social media has extreme variance: average ≪ peak
     Peak event (election results): 10-20× average traffic
     Without headroom, these events cause outages
  2. Infrastructure teams carry institutional knowledge about
     WHERE headroom is needed (not just how much)
  3. Cost optimization and reliability are in tension;
     the right tradeoff depends on business risk tolerance

Debugging Notes

# Check current resource utilization (USE method starting point)
# CPU:
top -b -n 1 | head -5
mpstat -P ALL 1 1  # per-CPU utilization

# Memory:
free -m
vmstat 1 5         # look for si/so (swap in/out)

# Disk:
iostat -x 1 5      # look for %util column > 70%
iotop -b -n 5      # per-process I/O

# Network:
sar -n DEV 1 5     # bandwidth per interface
ss -s              # socket summary

# For Kubernetes workloads:
kubectl top nodes           # node utilization
kubectl top pods -A         # pod utilization
kubectl describe node <n> | grep -A10 "Allocated resources"

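# Database connection pool utilization (PostgreSQL; assumes psql can connect
# using your environment's default host/user/database settings):
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
psql -c "SHOW max_connections;"
# Connections near max_connections → scale connection pool or add replicas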

# Queue depth monitoring (Kafka):
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-consumer-group
# Look for LAG column

Security Implications

  • Capacity planning data (traffic volumes, database sizes, growth rates) is confidential business information. Restrict access to metrics dashboards.
  • Over-provisioned infrastructure can mask security breaches: unusual traffic patterns (DDoS, data exfiltration) are harder to detect when headroom is large.
  • Autoscaling triggered by an attacker-controlled load pattern (DDoS) can cause unexpected cost spikes ("bill shock"). Cap autoscaling groups with a maximum instance count and configure billing budgets and alerts in AWS/GCP.
  • Load test traffic must be clearly identified to avoid triggering security alerts (WAF blocks, rate limiting, fraud detection).

Performance Implications

  • Capacity headroom directly enables performance: a system at 50% utilization responds faster than the same system at 90% (queueing theory: response time rises non-linearly with utilization, especially above 70%).
  • Pre-warming (scaling up before load arrives) is generally safer than reactive scaling: reactive autoscaling under real load creates a window where the system is under-provisioned and degraded.
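
The non-linear rise in response time noted above can be illustrated with the simplest queueing model. The sketch assumes an M/M/1 queue (single server, random arrivals and service times), which is a modelling choice made here for illustration rather than something the text specifies; real systems will differ in detail but show the same shape.

```python
def mm1_response_time(service_time_s, utilization):
    """Mean response time in an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


service_time = 0.100  # hypothetical 100ms of work per request
for rho in (0.50, 0.70, 0.80, 0.90, 0.95):
    r_ms = mm1_response_time(service_time, rho) * 1000
    print(f"utilization {rho:.0%}: mean response time {r_ms:.0f} ms")
```

Under this model, moving from 50% to 90% utilization multiplies mean response time by five, even though throughput has not even doubled.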

Failure Modes

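  +------------------------------------+-----------------------------------------+--------------------------------------------------------+
  | Scenario                           | Cause                                   | Mitigation                                             |
  +------------------------------------+-----------------------------------------+--------------------------------------------------------+
  | Gradual outage on launch day       | Traffic grew faster than forecast       | Monitor leading indicators (signups, not just traffic) |
  | Autoscaling lags, users see errors | Scale-up lag 5min, burst arrived in 30s | Pre-warm + have emergency runbook                      |
  | DB saturates while app scales      | App scaled 10×, DB is the bottleneck    | Include DB in capacity plan; add read replicas         |
  | Cost spike from bot traffic        | Autoscaling responds to fake load       | Rate limiting + WAF before autoscaling reacts          |
  | Quota exhausted in cloud region    | Forgot to request EC2 quota increase    | Quarterly quota review + multi-region                  |
  +------------------------------------+-----------------------------------------+--------------------------------------------------------+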

Modern Usage

  • FinOps integration: Capacity planning and cost optimization are increasingly unified. Tools like AWS Compute Optimizer, GCP Recommender, and Kubecost provide continuous right-sizing recommendations.
  • Predictive autoscaling: Amazon EC2 Auto Scaling's predictive scaling uses ML to forecast demand and scale proactively. Google Cloud managed instance groups offer a comparable predictive autoscaling mode, alongside schedule-based scaling.
  • Load shedding as capacity planning complement: Rather than always provisioning for peak, design systems to gracefully degrade under overload (circuit breakers, rate limiting, feature flags to disable expensive features under load).

Future Directions

  • Carbon-aware capacity planning: Shift workloads to regions/times with lower carbon intensity; schedule batch jobs for low-carbon windows. Projects like electricitymap.org provide real-time grid carbon data.
  • ML-based demand forecasting: Models that incorporate product signals (new feature launches, A/B test rollouts) as leading indicators beyond historical traffic.
  • Serverless as capacity planning escape hatch: Serverless platforms (Lambda, Cloud Run) remove most compute capacity planning at the cost of cold start latency and vendor lock-in; concurrency limits and account quotas still need periodic review.

Exercises

  1. Take any web service you operate (or a demo app). Use wrk or k6 to find its saturation point under load. Use the USE method to identify which resource saturates first. Propose a scaling strategy.

  2. Model a seasonal traffic pattern for a hypothetical e-commerce site: baseline 1,000 req/s in August, historical Black Friday peak 15× baseline. Calculate required server count if each server handles 5,000 req/s at saturation and you want 30% headroom.

  3. Apply Little's Law to a thread pool sizing problem: a service handles 500 req/s, with average server-side processing time of 80ms. How many threads does it need? What if latency improves to 30ms after a database optimization?

  4. Build a Grafana dashboard with the USE method for a service: CPU utilization, CPU run queue length, memory utilization, disk I/O utilization, and network bandwidth. Add threshold annotations at 60% and 80%.

  5. Simulate the Twitter over-provisioning scenario: deploy a service with 10× the needed capacity. Measure response time and resource costs. Then gradually reduce to 1.3× headroom. Note where response time starts to degrade.

References

  • Brendan Gregg, "Systems Performance" (2nd ed., 2020) — USE Method, capacity analysis
  • Neil Gunther, "Guerrilla Capacity Planning" (2007) — Amdahl's Law, USL
  • John Little, "A Proof for the Queuing Formula: L = λW" — Operations Research, 1961
  • Adrian Cockcroft, "Capacity Planning for the Cloud" — Netflix Tech Blog (2012)
  • "Capacity Planning at Scale" — Stripe Engineering Blog (2019)
  • AWS Capacity Planning Whitepaper: aws.amazon.com/whitepapers/
  • "Post-Mortem: The Great Twitter Migration" — numerous journalistic accounts, 2022-2023
  • Kelsey Hightower, "Kubernetes as an Application Runtime" — capacity management in k8s context
  • "Cost of Cloud, a Trillion Dollar Paradox" — a16z essay, Sarah Wang and Martin Casado, 2021