Incident Response: Managing Failures Under Pressure

Overview

Incident response is the structured process for detecting, triaging, mitigating, and learning from service disruptions. Its goal is to restore service as quickly as possible while maintaining clear communication and avoiding mistakes made under pressure. Good incident response is a discipline, not a talent — it can be learned, practiced, and systematically improved.

The most important insight in incident response is the distinction between mitigation and fix. Mitigation restores service quickly, often without understanding the root cause. A fix addresses the underlying problem but may take days or weeks. During an active incident, the only goal is mitigation; the fix follows from the postmortem. Conflating the two, by trying to understand the root cause while also fighting the fire, is a common cause of extended incidents.

Prerequisites

Understanding of monitoring and alerting concepts
Familiarity with on-call operations and paging systems
Basic system administration and debugging skills
Awareness of the services and infrastructure in your environment
Error budget understanding (see 02-error-budgets.md)

Incident Lifecycle

INCIDENT LIFECYCLE:

  Alert fires / User reports
          |
          v
  +---------------+
  |   DETECTION   |
  |               |  <- Who noticed? How? How long after impact started?
  |               |     Gap between impact start and detection = MTTD
  +---------------+
          |
          v
  +---------------+
  |    TRIAGE     |
  |               |  <- Is this a real incident? What severity?
  |               |     Who is affected? What is the scope?
  +---------------+
          |
          v
  +---------------+
  |  MITIGATION   |
  |               |  <- Restore service, even without understanding why.
  |               |     Rollback, feature flag off, traffic shift.
  +---------------+
          |
          v
  +---------------+
  |  RESOLUTION   |
  |               |  <- Confirm service is healthy. Monitor for recurrence.
  |               |     Brief summary to stakeholders. Declare incident over.
  +---------------+
          |
          v
  +---------------+
  |  POSTMORTEM   |
  |               |  <- Blameless analysis. Root cause. Action items.
  |               |     Within 48-72 hours. (See 06-postmortems.md)
  +---------------+

Key time metrics:
  MTTD (Mean Time To Detect):   alert fires - impact started
  MTTM (Mean Time To Mitigate): service restored - alert fires
  MTTR (Mean Time To Recover):  service restored - impact started
  MTTF (Mean Time To Failure):  next impact start - previous recovery (healthy time between incidents)

  MTTR = MTTD + MTTM is a useful decomposition:
  improve detection → reduce MTTD → reduce MTTR
  improve mitigation tools → reduce MTTM → reduce MTTR
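
The decomposition can be checked with a few lines of Python. This is a minimal sketch, assuming mitigation time is measured from detection to service restoration (which is what makes MTTR = MTTD + MTTM hold exactly). The function names and the two incidents are hypothetical, not data from a real system.

```python
from datetime import datetime, timedelta

def incident_durations(impact_start, detected, restored):
    """Return (time to detect, time to mitigate, time to recover) for one incident."""
    time_to_detect = detected - impact_start
    time_to_mitigate = restored - detected
    time_to_recover = restored - impact_start
    return time_to_detect, time_to_mitigate, time_to_recover

def mean(durations):
    """Average a list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)

# Hypothetical incidents: (impact started, alert fired, service restored)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 9, 22, 15), datetime(2024, 2, 9, 22, 25), datetime(2024, 2, 9, 23, 5)),
]

per_incident = [incident_durations(*i) for i in incidents]
mttd = mean([d[0] for d in per_incident])
mttm = mean([d[1] for d in per_incident])
mttr = mean([d[2] for d in per_incident])

print("MTTD:", mttd, "MTTM:", mttm, "MTTR:", mttr)
```

Splitting the numbers this way shows which half of the timeline dominates, and therefore whether alerting or mitigation tooling deserves investment first.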

Incident Severity Levels

Severity levels give everyone a shared vocabulary for the seriousness of an incident. They determine escalation path, communication frequency, and required response.

SEV1 — CRITICAL (Complete Outage / Data Loss Risk)
  Definition: Core service completely unavailable to all users,
              or risk of permanent data loss, or security breach
  Examples:   - All users cannot log in
              - All checkout requests failing
              - Database corruption detected
              - Active security breach in progress
  Response:   - Immediate page to on-call + escalation to manager
              - Incident Commander assigned within 5 minutes
              - Executive notification within 15 minutes
              - Status page update within 10 minutes
              - Communication update every 30 minutes
  Target MTTR: < 30 minutes to mitigation

SEV2 — HIGH (Major Degradation)
  Definition: Significant portion of users affected, or
              critical feature unavailable, or major performance degradation
  Examples:   - Checkout 40% error rate (but some users still transact)
              - Search unavailable (users can still browse)
              - p99 latency 10x normal
  Response:   - Page on-call engineer
              - IC assigned within 15 minutes
              - Status page update within 20 minutes
              - Communication update every 60 minutes
  Target MTTR: < 2 hours to mitigation

SEV3 — MEDIUM (Minor Degradation)
  Definition: Small subset of users affected, non-critical feature
              unavailable, or mild performance degradation
  Examples:   - Recommendations unavailable (main feature works)
              - Email notifications delayed
              - Admin panel degraded
  Response:   - Non-urgent notification to on-call
              - Assigned as high-priority ticket
              - Fix within business day
  Target MTTR: < 24 hours

SEV4 — LOW (Cosmetic / Monitoring Issues)
  Definition: No user impact, but something is wrong and needs fixing
  Examples:   - Dashboard broken (data is fine)
              - Monitoring alert misfiring
              - Non-critical dependency degraded with no user impact
  Response:   - Ticket created, normal priority
  Target MTTR: < 1 week
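
Severity definitions are easier to enforce when they live in tooling instead of memory. The sketch below encodes the response timings listed above as data and derives deadlines from the declaration time. The class and function names are hypothetical, and SEV3/SEV4 carry only a target because the text gives them no paging timings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass(frozen=True)
class SeverityPolicy:
    ic_assigned_within: Optional[timedelta]
    status_page_within: Optional[timedelta]
    update_every: Optional[timedelta]
    target: timedelta

# Values copied from the severity definitions above.
POLICIES = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), timedelta(minutes=10),
                           timedelta(minutes=30), timedelta(minutes=30)),
    "SEV2": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=20),
                           timedelta(minutes=60), timedelta(hours=2)),
    "SEV3": SeverityPolicy(None, None, None, timedelta(hours=24)),
    "SEV4": SeverityPolicy(None, None, None, timedelta(weeks=1)),
}

def response_deadlines(severity, declared_at):
    """Map each applicable response step to its deadline for this incident."""
    policy = POLICIES[severity]
    deadlines = {"target": declared_at + policy.target}
    if policy.ic_assigned_within is not None:
        deadlines["ic_assigned"] = declared_at + policy.ic_assigned_within
    if policy.status_page_within is not None:
        deadlines["status_page"] = declared_at + policy.status_page_within
    if policy.update_every is not None:
        deadlines["first_update"] = declared_at + policy.update_every
    return deadlines

for step, due in response_deadlines("SEV1", datetime(2024, 11, 15, 14, 26)).items():
    print(step, due.strftime("%H:%M"))
```

An incident bot could post these deadlines into the channel at declaration time so the IC does not have to remember them.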

Incident Commander Role

The Incident Commander (IC) is the person responsible for coordinating the incident response. The IC does not need to be the most technically skilled engineer — in fact, technical depth is sometimes a liability, as highly technical ICs are tempted to investigate rather than coordinate.

IC RESPONSIBILITIES:

COORDINATION:
  - Owns the incident channel/bridge — sets the agenda
  - Delegates investigation to Subject Matter Experts (SMEs)
  - Prevents everyone from working on the same thing simultaneously
  - Tracks what has been tried and what has not
  - Makes go/no-go decisions for mitigations
  - Ensures someone is monitoring impact while others investigate

COMMUNICATION:
  - Writes internal status updates (Slack, Jira)
  - Approves customer-facing communication
  - Escalates to management when severity warrants
  - Calls in additional resources when needed

DOCUMENTATION:
  - Maintains incident timeline in real time (or delegates)
  - Every action taken is logged with timestamp
  - Every hypothesis and result is recorded
  - This timeline becomes the basis for the postmortem

WHAT IC DOES NOT DO:
  - Directly investigate the technical problem
  - Type commands on production servers
  - Make unilateral technical decisions without SME input
  - Ignore agreed stop conditions (the criteria for aborting a risky mitigation)

IC ROTATION:
  The IC role is rotated, not permanent. All senior engineers
  should be IC-capable. IC skills are trained, not innate.
  Practice through GameDays and simulated incidents.

On-Call Rotation Design

A sustainable on-call rotation is essential for long-term engineer retention and effective incident response.

SUSTAINABLE ON-CALL DESIGN:

Alert volume:
  Target: < 2 pages per 12-hour shift (not counting false alarms)
  Danger zone: > 5 pages per night shift
  Alert fatigue kills response quality more than complexity

Shift design:
  Primary:   First responder, woken up if needed
  Secondary: Backup if primary unreachable, for SEV1 escalation
  Manager:   Escalation for executive communication, resource calls

  Shadow/reverse shadow: New engineers shadow an experienced primary
                         before taking the pager. In reverse shadow, the
                         new engineer is primary and the mentor backs them up.

On-call runbooks:
  Every alert must have a corresponding runbook:
  - What does this alert mean?
  - What are the common causes?
  - Step-by-step diagnostic procedure
  - Step-by-step mitigation options
  - Who to escalate to and when

  A page without a runbook is an unfinished feature.

On-call compensation:
  On-call significantly impacts quality of life.
  Compensate appropriately: time-in-lieu, direct pay, or equivalent.
  Uncompensated on-call exhausts engineers and leads to attrition.

Rotation frequency:
  Weekly rotations are most common. Some teams prefer bi-weekly
  to reduce context switching overhead.

  Never: permanent on-call for a single person (heroism → burnout)
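
Page volume is only useful as a health signal if someone actually counts it. The following is a rough sketch that applies the thresholds above to every shift alike (the text states the danger zone for night shifts specifically, so treating all shifts the same is an assumption). The "watch" label, the function names, and the sample pages are hypothetical.

```python
from collections import Counter

TARGET_PAGES_PER_SHIFT = 2  # from the text: target is fewer than 2 pages per shift
DANGER_PAGES_PER_SHIFT = 5  # from the text: more than 5 pages is the danger zone

def classify_shift(page_count):
    """Label one shift by how many real (non-false-alarm) pages it received."""
    if page_count > DANGER_PAGES_PER_SHIFT:
        return "danger"
    if page_count < TARGET_PAGES_PER_SHIFT:
        return "ok"
    return "watch"

def shift_report(pages):
    """pages: iterable of (engineer, shift_id) tuples, one per real page.

    Shifts with zero pages do not appear in the input and are not reported.
    """
    counts = Counter(pages)
    return {key: classify_shift(n) for key, n in counts.items()}

# Hypothetical page log for three shifts.
pages = (
    [("alice", "2024-11-14-night")] * 6
    + [("bob", "2024-11-15-day")] * 1
    + [("carol", "2024-11-15-night")] * 3
)
report = shift_report(pages)
print(report)
```

Reviewing such a report weekly makes alert fatigue visible before it shows up as slow acknowledgements.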

Incident Response Flowchart

Page/Alert received
        |
        v
   Is this a real problem?
   (Quick 5-minute check)
       / \
      YES  NO
      |     \
      v      Resolve alert,
   Declare    document false
   incident   positive, improve
      |       alert conditions
      v
   Assign severity (SEV1-4)
        |
   +----+----+
   |         |
  SEV1      SEV2/3
   |         |
   v         v
Page IC,   On-call
escalate   investigates
manager    |
   |       |
   +---+---+
       |
       v
   Open incident channel
   (Slack #incident-YYYYMMDD)
       |
       v
   Post initial update:
   "Investigating [symptom] since [time].
    Scope: [affected users/services].
    IC: [name]. SME: [name]."
       |
       v
   Collect data:
   What changed recently?
   (Deployments, config, traffic patterns)
       |
       v
   Form hypotheses → test one at a time
       |
       +---------> Hypothesis confirmed?
       |                    |
       |               YES  |  NO
       |                |   |
       |                v   +--------> Next hypothesis
       |           Mitigation:
       |           rollback / feature flag / traffic shift
       |           restart / scale out / reroute
       |                |
       v                v
   Monitor: did mitigation restore steady state?
       |
       +---------> YES: Declare resolved
       |                    |
       |                    v
       |               Post final update.
       |               Schedule postmortem.
       |               Update status page.
       |
       +---------> NO: Continue investigation

Mitigation vs Fix

MITIGATION:          |  FIX:
Restore service now  |  Prevent recurrence
                     |
Rollback deployment  |  Find and fix the bug
Feature flag off     |  Add defensive coding
Traffic shift to     |  Root cause analysis
  healthy region     |  System improvement
Scale out            |  Architectural change
Restart service      |  Process improvement
Block malformed      |  Input validation added
  requests           |  
                     |
Minutes to hours     |  Days to weeks to do
Done during incident |  Done after postmortem
                     |
Can be temporary     |  Permanent improvement
May re-occur         |  Should prevent recurrence
                     |
EXAMPLE:             |  EXAMPLE:
Config error causes  |  Add validation to
errors. Rollback     |  deployment pipeline that
the config change.   |  rejects invalid configs.
Service restored     |  Future configs auto-verified.
in 5 minutes.        |

The discipline of keeping mitigation and fix separate is hard under pressure. Engineers want to understand the problem before acting. The IC's job is to enforce: "restore service first, investigate later." The postmortem is the right place for thorough investigation.
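
The "fix" column's example, a pipeline step that rejects invalid configs, might look roughly like the sketch below. It assumes a config that is already parsed into a dictionary. The key names (service_name, timeout_ms, max_connections) and both functions are hypothetical and stand in for whatever schema a real service uses.

```python
def is_positive_int(value):
    """True for integers greater than zero (booleans are rejected explicitly)."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0

def validate_config(config):
    """Return a list of problems; an empty list means the config may be deployed."""
    problems = []
    name = config.get("service_name")
    if not isinstance(name, str) or not name:
        problems.append("service_name must be a non-empty string")
    for key in ("timeout_ms", "max_connections"):
        if not is_positive_int(config.get(key)):
            problems.append(f"{key} must be a positive integer")
    return problems

def deploy_gate(config):
    """Raise before deployment if the config is invalid, so the pipeline stops."""
    problems = validate_config(config)
    if problems:
        raise ValueError("config rejected: " + "; ".join(problems))
    return True

good = {"service_name": "checkout", "timeout_ms": 500, "max_connections": 100}
bad = {"service_name": "checkout", "timeout_ms": -1}
print(validate_config(good))
print(validate_config(bad))
```

This kind of check is written after the postmortem, not during the incident; during the incident the rollback is enough.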

Incident Communication

Communication during incidents is as important as technical response. Poor communication causes:
  - Duplicate investigation effort (people don't know what's being tried)
  - Stakeholders escalating inappropriately (don't know status)
  - Users hearing about the incident from social media before the status page
  - Post-incident trust damage that could have been mitigated with proactive updates

COMMUNICATION CHANNELS:

Internal (engineers):
  Slack incident bridge: #incident-20241115-checkout
  Format: [HH:MM UTC] [IC/SME name] [update]
  All commands run should be logged: "15:32 [alice] kubectl rollout undo deploy/checkout"
  All findings logged: "15:45 [bob] confirmed: errors started at 15:22, correlates with deploy v2.8"
  All mitigations logged: "15:48 [alice] rollback complete. Monitoring recovery."

External (customers):
  Status page: a hosted status page such as Atlassian Statuspage (statuspage.io)

  INITIAL UPDATE (post within 10 min of SEV1 declaration):
  "We are investigating reports of [symptom] affecting [scope].
   Our team is actively working on this. Next update by [time]."

  PROGRESS UPDATE (every 30-60 min):
  "Our team has identified [finding]. We are implementing a fix.
   We expect to restore service by [time estimate].
   Next update by [time]."

  RESOLUTION UPDATE:
  "The issue affecting [service] has been resolved as of [time].
   [Brief description of what happened and what was done].
   We will publish a post-incident review within [48-72 hours]."

Customer communication rules:
  - Never speculate about root cause until confirmed
  - Never blame third-party services unless confirmed and they have consented
  - Always provide a time for the next update, then keep that commitment
  - Plain language: no jargon, no technical detail that creates more confusion
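
Templates are easier to follow under pressure when a helper fills them in. Below is a small sketch of formatters for the internal timeline entries and the initial customer update described above. The function names are hypothetical, and the code assumes timezone-aware datetimes so that the conversion to UTC is unambiguous.

```python
from datetime import datetime, timezone

def log_entry(when, author, message):
    """Format one timeline line as 'HH:MM [author] message', with the time in UTC."""
    return f"{when.astimezone(timezone.utc):%H:%M} [{author}] {message}"

def initial_update(symptom, scope, next_update):
    """Fill in the initial status page template with a committed next-update time."""
    return (
        f"We are investigating reports of {symptom} affecting {scope}. "
        f"Our team is actively working on this. "
        f"Next update by {next_update.astimezone(timezone.utc):%H:%M} UTC."
    )

ran_at = datetime(2024, 11, 15, 15, 32, tzinfo=timezone.utc)
print(log_entry(ran_at, "alice", "kubectl rollout undo deploy/checkout"))

next_at = datetime(2024, 11, 15, 16, 0, tzinfo=timezone.utc)
print(initial_update("checkout errors", "some customers", next_at))
```

Because the next-update time is a required argument, the helper makes it hard to publish an update without committing to the next one.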

MTTR as the Primary Metric

MTTR (Mean Time To Recover) is the most important reliability metric for incident response, even more than incident frequency. An organization that has 10 incidents per year but resolves each in 10 minutes has far less user impact than one that has 2 incidents per year but takes 8 hours each.
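
The comparison above is simple arithmetic, worked through here as a sketch. It assumes every incident is a full outage for its whole duration, which overstates impact for partial degradations; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def annual_impact_minutes(incidents_per_year, minutes_per_incident):
    """Total minutes of user-facing impact per year."""
    return incidents_per_year * minutes_per_incident

def availability(impact_minutes):
    """Fraction of the year the service was healthy."""
    return 1 - impact_minutes / MINUTES_PER_YEAR

frequent_but_fast = annual_impact_minutes(10, 10)   # 100 minutes
rare_but_slow = annual_impact_minutes(2, 8 * 60)    # 960 minutes

print(frequent_but_fast, f"{availability(frequent_but_fast):.4%}")
print(rare_but_slow, f"{availability(rare_but_slow):.4%}")
```

The organization with five times as many incidents still has almost ten times less downtime, which is why recovery time is weighted so heavily here.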

MTTR = MTTD (detection) + MTTM (mitigation)

To reduce MTTD:
  - Improve alerting coverage (more SLO-based alerts)
  - Reduce alert latency (alert fires within 1-2 minutes of SLO violation)
  - Synthetic monitoring from external vantage points
  - User-facing error reporting (let users report problems)

To reduce MTTM:
  - Improve runbooks (pre-written, practiced procedures)
  - Improve deployment tooling (fast rollback, feature flags)
  - Improve observability (dashboards that immediately show scope and cause)
  - Reduce blast radius (canary deployments, feature flags, circuit breakers)
  - Practice (GameDays, tabletop exercises)
  - Clear escalation paths (no time lost finding the right person)

Common Mistakes

Not declaring an incident early enough: "Hero" behavior — engineers work silently trying to fix the problem alone, hoping to resolve it before anyone notices. This delays communication, prevents coordination, and often extends MTTR. Declare early; downgrade later if it turns out to be minor.

Alert fatigue leading to ignored pages: When every alert sounds urgent, none are treated urgently. Pages must be actionable, specific, and rare enough to command attention. Monitor page volume per engineer per shift.

Changing too many things at once: Under pressure, engineers may apply multiple mitigations simultaneously. This makes it impossible to know which mitigation worked (or which change made things worse). Change one thing at a time. Document every change.

Mitigation causing a second incident: A hasty rollback or traffic shift can introduce a new failure mode. All mitigations carry risk. The IC should verify the mitigation plan before executing.

Hero culture: A small number of engineers who know everything and are always called. This does not scale, burns those engineers out, and prevents others from developing incident response skills.

Skipping postmortems: The postmortem is where learning happens. An organization that resolves incidents without postmortems is fated to repeat them.

Production Example: Checkout Failure Incident

Timeline (all times UTC):
14:00 - Deployment v3.2 of checkout service to production
14:22 - SLO burn rate alert fires: checkout error rate 3x normal
14:24 - On-call engineer [alice] acknowledges alert
14:26 - alice declares SEV2 incident, opens #incident-20241115-checkout
14:27 - alice pages IC [bob]; bob joins bridge
14:30 - bob posts status page update: "Investigating checkout issues"
14:31 - alice: "Error rate 8%. All errors are NullPointerException in 
         checkout/v3.2. Deploy at 14:00 is the likely cause."
14:32 - bob: "Hypothesis: v3.2 deploy. Mitigation: rollback to v3.1. 
         alice please initiate rollback."
14:33 - alice: "Initiating rollback: kubectl rollout undo deploy/checkout"
14:35 - alice: "Rollback complete. Monitoring."
14:38 - alice: "Error rate dropping: 5% → 3% → 1% → 0.05%. 
         Looks like rollback worked."
14:40 - bob: "Error rate at baseline. Declaring resolved."
14:40 - bob posts status page: "Issue resolved at 14:40 UTC. 
         Post-incident review within 48 hours."
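14:41 - bob: "Time to mitigate: 18 minutes from alert. Good rollback.
         Scheduling postmortem for tomorrow."

Impact: at least 18 minutes of elevated errors (~8% error rate), from the
        14:22 alert to resolution at 14:40; errors may have started as
        early as the 14:00 deploy
Error budget consumed: significant (8% × 18min of traffic, or more)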
Root cause: TBD in postmortem (NullPointerException in v3.2)
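
To put a number on "error budget consumed", the sketch below works the arithmetic under assumptions the timeline does not state: a 99.9% availability SLO, a 30-day budget window, uniform traffic, and a constant 8% error rate for 18 minutes. The function names are illustrative.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_rate / (1 - slo_target)

def budget_consumed(error_rate, duration_minutes, slo_target, window_days=30):
    """Fraction of the window's error budget used, assuming uniform traffic."""
    window_minutes = window_days * 24 * 60
    return (error_rate * duration_minutes) / ((1 - slo_target) * window_minutes)

# Assumed SLO of 99.9% over 30 days; 8% errors for 18 minutes from the timeline.
rate = burn_rate(0.08, 0.999)            # about 80x
used = budget_consumed(0.08, 18, 0.999)  # about 0.033, i.e. roughly 3.3% of the budget

print(f"burn rate: {rate:.0f}x, budget used: {used:.1%}")
```

Under these assumptions a single 18-minute incident spends about a thirtieth of the monthly budget, and a longer impact window would raise that proportionally.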

Debugging Notes

During an active incident, maintain diagnostic hygiene:

Work from latest metrics backward, not from assumptions. What do the dashboards show right now?
Check what changed recently: git log --since="2 hours ago", deployment logs, config change logs.
Correlate symptom onset time with change events. Most incidents correlate with a change.
Check dependencies before assuming the problem is local. Is an upstream or downstream service also degraded?
Use snapshots, not only streaming. Take a snapshot of current state (errors/second, p99 latency, connection pool usage) before applying any mitigation — you may need to compare before/after.
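
Correlating symptom onset with recent changes can be partly mechanical. The sketch below assumes change events have already been collected into a list of dictionaries with a timestamp and a description. That record shape, the function name, and the first two sample events are hypothetical.

```python
from datetime import datetime, timedelta

def changes_before_onset(changes, onset, window=timedelta(hours=2)):
    """Return changes that landed within `window` before symptom onset, newest first."""
    candidates = [c for c in changes if onset - window <= c["at"] <= onset]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)

# Sample change log; only the checkout deploy comes from the example timeline.
changes = [
    {"at": datetime(2024, 11, 15, 9, 0), "what": "config: raise connection pool size"},
    {"at": datetime(2024, 11, 15, 13, 10), "what": "deploy: search v1.9"},
    {"at": datetime(2024, 11, 15, 14, 0), "what": "deploy: checkout v3.2"},
]
onset = datetime(2024, 11, 15, 14, 22)

suspects = changes_before_onset(changes, onset)
for change in suspects:
    print(change["at"].strftime("%H:%M"), change["what"])
```

The most recent change is listed first because it is usually the cheapest hypothesis to test, typically by rolling it back.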

Security Implications

Incident response infrastructure must be hardened:
  - The incident communication channel should be separate from the compromised service in the event of a security breach
  - Runbooks should not contain credentials
  - Incident documentation tools (Jira, PagerDuty) should have MFA
  - Incident bridges should be authenticated; avoid public incident channels where attackers could monitor response progress
  - Post-incident, rotate any credentials that may have been exposed during the incident or investigation

Performance Implications

The act of investigating an incident can itself degrade performance:
  - Running many diagnostic queries on an already-stressed database can worsen the outage
  - Enabling verbose logging during an incident significantly increases disk I/O and write latency
  - Restarting services to clear a presumed memory leak causes brief unavailability
  - Every mitigation attempt has a risk profile; assess before executing

Failure Modes of Incident Response

No runbooks: Engineers must reconstruct diagnostic procedures from memory under pressure. Runbooks must be written proactively and kept up to date.

Runbook rot: Runbooks exist but are 18 months out of date, referencing services and commands that no longer exist. Runbooks must be reviewed quarterly.

IC vacuum: No clear IC assignment leads to everyone and no one coordinating. Always assign an IC explicitly within the first 5 minutes.

Communication debt: Deciding to communicate "once we understand the issue" results in stakeholders getting no updates for an hour. Communicate early and often, even with partial information.
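
The first two failure modes, missing and stale runbooks, lend themselves to an automated audit. This is a minimal sketch, assuming alert names and runbook review dates can be exported from your tooling. The 90-day interval approximates "quarterly", and all names and dates are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days

def audit_runbooks(alerts, runbooks, today):
    """alerts: alert names; runbooks: alert name -> date the runbook was last reviewed."""
    missing = sorted(a for a in alerts if a not in runbooks)
    stale = sorted(
        a for a in alerts
        if a in runbooks and today - runbooks[a] > REVIEW_INTERVAL
    )
    return {"missing": missing, "stale": stale}

# Sample inventory.
alerts = ["checkout-error-rate", "db-replication-lag", "search-latency"]
runbooks = {
    "checkout-error-rate": date(2024, 10, 1),
    "db-replication-lag": date(2023, 5, 1),
}

result = audit_runbooks(alerts, runbooks, today=date(2024, 11, 15))
print(result)
```

Running an audit like this on a schedule turns runbook upkeep into a visible backlog rather than something discovered mid-incident.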

Modern Usage

PagerDuty, Opsgenie, and Splunk On-Call (formerly VictorOps) provide on-call scheduling, alert routing, escalation policies, and incident timelines. FireHydrant, incident.io, and Rootly provide incident management workflows integrated with Slack, including automated status page updates and postmortem template generation. These tools reduce process overhead and help teams focus on technical response.

Future Directions

AI-assisted incident response: LLMs analyzing metrics, logs, and traces in real time to surface hypotheses and suggest mitigation steps. Early tools exist (Bedrock Agents, PagerDuty Copilot), but their output still needs human validation before it is acted on.

Automated mitigation: Runbooks executed automatically when certain alert conditions are met — e.g., automatic rollback triggered by error rate SLO burn. Requires mature runbooks and careful blast radius controls.
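
One way to picture the "careful blast radius controls" is as a guard function that decides whether automation may act at all. The sketch below is illustrative only: the thresholds, parameter names, and the rule of falling back to a human are assumptions, not a description of any particular tool.

```python
from datetime import datetime, timedelta

def should_auto_rollback(burn_rate, last_deploy_at, now, rollbacks_today,
                         burn_threshold=10.0,
                         deploy_window=timedelta(minutes=30),
                         max_rollbacks_per_day=1):
    """Return (allowed, reason) for an automated rollback attempt."""
    if burn_rate < burn_threshold:
        return False, "burn rate below threshold"
    if last_deploy_at is None or now - last_deploy_at > deploy_window:
        # Without a recent deploy, a rollback is unlikely to help.
        return False, "no recent deploy to roll back; page a human"
    if rollbacks_today >= max_rollbacks_per_day:
        # Cap automation so it cannot loop or mask a deeper problem.
        return False, "rollback limit reached; page a human"
    return True, "high burn rate shortly after a deploy"

now = datetime(2024, 11, 15, 14, 22)
deploy = datetime(2024, 11, 15, 14, 0)
print(should_auto_rollback(80.0, deploy, now, rollbacks_today=0))
print(should_auto_rollback(80.0, None, now, rollbacks_today=0))
```

Each refusal path hands control back to a person, which keeps the automation from compounding an incident it does not understand.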

Incident simulation training: VR or simulation environments where engineers practice realistic incident response without affecting production.

Exercises

Role-play an incident: one person plays the IC, one plays the SME, one plays the communications lead. Use a historical incident from your own experience or from a public postmortem. Practice the communication cadence and IC coordination patterns.
Audit your current alert configuration. For each alert, answer: does it have a runbook? Is the runbook current? Is this alert actionable, or is it informational? What is the MTTD for your most critical service?
Define severity levels for a service you work on. What thresholds separate SEV1 from SEV2 in your context? What is the maximum acceptable MTTR for each severity?
Write a runbook for a common failure in your service. Include: symptoms, diagnostic steps, mitigation options with expected outcomes, escalation criteria.
Measure your organization's MTTR for the last 10 incidents. What is the largest contributor to MTTR: detection time or mitigation time? What would improve each?

References

Beyer, B., Jones, C., Petoff, J., Murphy, N.R. (eds). Site Reliability Engineering. O'Reilly, 2016. Chapter 14: Managing Incidents.
Limoncelli, T.A. The Practice of System and Network Administration. 3rd ed., Addison-Wesley, 2016.
PagerDuty. PagerDuty Incident Response Documentation. https://response.pagerduty.com/
FireHydrant. Incident Response Guide. https://firehydrant.com/blog/
Allspaw, J. "Incidents as We Imagine Them Versus How They Actually Are." Adaptive Capacity Labs, 2018.
Hollnagel, E., Woods, D., Leveson, N. Resilience Engineering: Concepts and Precepts. Ashgate, 2006.