Postmortems: Learning from Failure

Overview

A postmortem (also called a post-incident review, or PIR) is a structured process for learning from service disruptions. Its purpose is not to assign blame, not to punish mistakes, and not to fulfill an administrative requirement — it is to understand what happened so the system and the organization can improve. A well-run postmortem is one of the highest-leverage activities in reliability engineering: a single incident can yield insights that prevent ten future ones.

The blameless postmortem, brought to the software industry largely through John Allspaw's work at Etsy and later codified in Google's SRE practice, rests on a foundational principle: people rarely make errors because they are careless or incompetent. They make errors because the systems and processes around them create conditions where errors are likely. An engineer who deploys a bad config is not someone to blame; the bad deploy is a signal that the deployment pipeline lacked sufficient validation. Correcting the person accomplishes little; fixing the system prevents the whole class of failure.

Prerequisites

  • Experience with incident response (see 05-incident-response.md)
  • Understanding of systems thinking and complex systems failure
  • Psychological safety: participants must feel safe reporting what actually happened without fear of career consequence
  • Authority: postmortem action items must be acted upon, requiring organizational support

Blameless Postmortem Philosophy

The blameless philosophy has deep intellectual roots in safety engineering, particularly the work of James Reason (Swiss Cheese Model, 1990) and Sidney Dekker (New View of Human Error, Safety Differently). These researchers studied high-consequence industries — aviation, nuclear power, healthcare — and found consistently that accidents are caused by latent system weaknesses, not individual failures.

OLD VIEW (Blame Model):
  Incident occurs
       ↓
  Find who was responsible
       ↓
  Discipline or retrain that person
       ↓
  "Problem solved"

  Why this fails:
  - Punishing the person does not fix the system
  - People hide information to avoid blame → less learning
  - Creates "don't be the last person to touch it" culture
  - Same incident recurs with a different person involved

NEW VIEW (Systems Model):
  Incident occurs
       ↓
  Understand what conditions made the incident possible
       ↓
  Find systemic factors: gaps in tooling, process, knowledge, monitoring
       ↓
  Fix those systemic factors
       ↓
  Incident class prevented, not just this instance

The shift is from "who did this?" to "what allowed this to happen?"

Psychological safety is not optional. If engineers fear career consequences for honest reporting, they will omit key details. The postmortem will have gaps. The root cause will be misidentified. The action items will be incomplete. Organizations that punish people for incidents lose the ability to learn from them.

Amazon, Google, and Etsy have all described their blameless cultures publicly. John Allspaw's "Blameless PostMortems and a Just Culture" (Etsy, 2012) is the blog post most often credited with bringing the concept to the software industry.

Postmortem Timing

TIMING PRINCIPLES:

As soon as possible after the incident:
  - Human memory degrades rapidly after stressful events
  - Details that seem unforgettable can fade within a day or two
  - "I'll remember that" is almost always wrong

But not immediately after the incident:
  - Engineers need to decompress after a stressful event
  - Postmortem requires preparation and time to collect data
  - Exhausted engineers make poor analysts

Recommended: 24-72 hours after the incident
  - Memory still fresh
  - Engineers recovered from stress
  - Data (logs, metrics, deployment records) collected and organized
  - Participants can prepare

For complex or high-severity incidents:
  - May require multiple sessions
  - First session: timeline reconstruction
  - Second session: root cause analysis and action items

For simple incidents:
  - Asynchronous written postmortem (no meeting needed)
  - Faster, lower overhead
  - Still reviewed by team

Postmortem Structure

Template Overview

POSTMORTEM DOCUMENT STRUCTURE:

1. HEADER
   Date/time of incident:
   Date/time of postmortem:
   Severity: SEV[1-4]
   Duration: [impact start] to [resolved]
   Author(s):
   Reviewers:
   Status: Draft / In Review / Final

2. INCIDENT SUMMARY (2-3 sentences)
   Brief description of what happened and user impact.
   "On November 15, checkout requests experienced elevated error rates
   (peak 8%) for 18 minutes due to a NullPointerException introduced
   in the v3.2 deployment."

3. IMPACT
   - User impact: how many users, which features, which regions
   - Revenue impact (if quantifiable)
   - Error budget consumed
   - SLA breached? (yes/no)

4. TIMELINE
   HH:MM UTC | Event
   14:00     | v3.2 deployed to production
   14:22     | Burn rate alert fires (checkout error rate 3x SLO)
   14:24     | On-call acknowledges
   14:26     | SEV2 declared, incident channel opened
   14:32     | Rollback decision made
   14:33     | Rollback initiated
   14:35     | Rollback complete
   14:40     | Error rate at baseline, resolved

5. ROOT CAUSE ANALYSIS
   (See 5 Whys section below)
   Proximate cause: What was the immediate trigger?
   Contributing factors: What made the system vulnerable?
   Root cause: What systemic condition allowed this?

6. DETECTION
   How was the incident detected?
   What was MTTD?
   Was the detection mechanism appropriate?
   Could detection have been faster?

7. RESPONSE
   What went well?
   What was harder than expected?
   Were runbooks useful? Accurate?
   Was the IC effective?
   Was communication adequate?

8. LESSONS LEARNED
   Bullet points: what do we know now that we did not know?
   Technical findings:
   Process findings:
   Tooling gaps:

9. ACTION ITEMS
   [Each item must have: owner, due date, issue/ticket link]

   Format:
   | Action Item | Owner | Due | Ticket | Priority |
   |-------------|-------|-----|--------|----------|
   | Add input validation to checkout config | @alice | Nov 22 | ENG-1234 | P1 |
   | Add deployment smoke test for NullPointerException | @bob | Nov 29 | ENG-1235 | P1 |
   | Improve rollback documentation in runbook | @charlie | Dec 6 | ENG-1236 | P2 |
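
The Detection and Response sections are easier to answer once the timeline is turned into durations. The following is a minimal sketch, assuming the timeline is kept as (HH:MM, event) pairs on a single UTC day. The names `minutes_between`, `find_time`, and `timeline_metrics` are hypothetical, and treating the deployment as the trigger is an assumption made for this example only.

```python
from datetime import datetime

# Timeline rows copied from the example template above: (HH:MM UTC, event).
TIMELINE = [
    ("14:00", "v3.2 deployed to production"),
    ("14:22", "Burn rate alert fires"),
    ("14:24", "On-call acknowledges"),
    ("14:26", "SEV2 declared, incident channel opened"),
    ("14:32", "Rollback decision made"),
    ("14:33", "Rollback initiated"),
    ("14:35", "Rollback complete"),
    ("14:40", "Error rate at baseline, resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both HH:MM on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def find_time(timeline, keyword: str) -> str:
    """Return the timestamp of the first event whose text contains keyword."""
    for timestamp, event in timeline:
        if keyword.lower() in event.lower():
            return timestamp
    raise ValueError(f"no timeline event matches {keyword!r}")


def timeline_metrics(timeline) -> dict:
    """Derive a few durations (in minutes) from named timeline milestones."""
    deployed = find_time(timeline, "deployed")   # assumed trigger
    alerted = find_time(timeline, "alert fires")
    acked = find_time(timeline, "acknowledges")
    resolved = find_time(timeline, "resolved")
    return {
        "trigger_to_alert_min": minutes_between(deployed, alerted),
        "alert_to_ack_min": minutes_between(alerted, acked),
        "alert_to_resolved_min": minutes_between(alerted, resolved),
    }


print(timeline_metrics(TIMELINE))
```

In this example the longest interval is the 22 minutes between the deployment and the first alert, which is the kind of finding the "Could detection have been faster?" question is meant to surface.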

The 5 Whys Technique

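The 5 Whys is a root cause analysis technique that originated at Toyota, where it is attributed to Sakichi Toyoda and was later popularized by Taiichi Ohno as part of the Toyota Production System. It has since been widely applied to software systems. The idea is simple: ask "why" repeatedly until you reach a systemic cause rather than a proximate one.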

EXAMPLE APPLICATION:

Incident: Checkout service had 8% error rate for 18 minutes.

Why 1: Why did checkout have an 8% error rate?
  Answer: v3.2 introduced a NullPointerException in the payment handler.

Why 2: Why did v3.2 have a NullPointerException?
  Answer: A new code path did not handle the case where a user's 
          saved payment method had been deleted.

Why 3: Why did the code path not handle the deleted payment method case?
  Answer: The developer was unaware this was a valid state — there was
          no documentation about payment method lifecycle.

Why 4: Why was there no documentation about payment method lifecycle?
  Answer: The payment team and checkout team had never formally 
          documented the contract between their services.

Why 5: Why had the contract never been documented?
  Answer: There is no process requiring service contract documentation 
          before dependent services are built.

ROOT CAUSE: No formal service contract documentation process between teams.

ACTION ITEMS:
1. Document payment method lifecycle states (immediate)
2. Add contract tests between checkout and payment services
3. Establish cross-team API contract review process

Note how "developer didn't check for null" is NOT the root cause.
The developer is not the problem. The system that allowed building
without contract documentation is the problem.

5 Whys Limitations

The 5 Whys is a powerful starting technique but has known limitations:

  • Different people analyzing the same incident may follow different "why" chains and reach different root causes
  • Complex incidents have multiple contributing causes, not a single root cause
  • "Why" questions assume linear causality; distributed system failures are often non-linear

For complex incidents, Fault Tree Analysis (FTA) or Fishbone (Ishikawa) diagrams may be more appropriate, mapping multiple contributing factors simultaneously.

Contributing Factors vs Root Cause

Most serious incidents have multiple contributing factors, not a single root cause. The 5 Whys is useful but may oversimplify. A more complete analysis identifies all the factors that, together, enabled the incident:

CONTRIBUTING FACTORS ANALYSIS:

Incident: Database primary OOM caused 45-minute outage

Contributing factors:
  1. Memory leak in new ORM version (code factor)
  2. Staging environment did not run the new ORM under production load (testing gap)
  3. Memory alerting threshold was 90% — too late to react before OOM (monitoring gap)
  4. Automated failover had not been tested in 6 months and failed silently (process gap)
  5. On-call runbook did not include manual failover procedure (documentation gap)
  6. On-call engineer was in first month and had never done manual failover (training gap)

"Root cause" alone (memory leak in ORM) would yield:
  Action item: fix memory leak

That's useful but insufficient. The SYSTEM failed in six different ways.
Each contributing factor is an action item:
  1. Fix memory leak, add regression test
  2. Add load testing in staging pipeline
  3. Alert at 75% memory, not 90%
  4. Test automated failover monthly
  5. Add manual failover procedure to runbook
  6. Require new engineers to shadow 3 incidents before solo on-call

Addressing only #1 leaves five failure modes unresolved.
The next incident may not involve a memory leak, but may expose #3, #4, or #5.
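
The rule that every contributing factor gets its own action item can be checked mechanically. This is a minimal sketch under assumed conventions: each factor carries a short category label, and each action item names the category it addresses. The `addresses` field and the `uncovered_factors` function are hypothetical names, not part of any standard template.

```python
# Contributing factors from the example above, keyed by a short category label.
FACTORS = {
    "code": "Memory leak in new ORM version",
    "testing": "Staging did not run the new ORM under production load",
    "monitoring": "Memory alert threshold of 90% was too late to react",
    "process": "Automated failover had not been tested in 6 months",
    "documentation": "Runbook did not include manual failover procedure",
    "training": "On-call engineer had never done manual failover",
}


def uncovered_factors(factors: dict, action_items: list) -> list:
    """Return the factor categories that no action item addresses."""
    covered = {item["addresses"] for item in action_items}
    return sorted(category for category in factors if category not in covered)


# The "root cause only" plan: a single action item for the memory leak.
root_cause_only = [
    {"title": "Fix memory leak, add regression test", "addresses": "code"},
]

print(uncovered_factors(FACTORS, root_cause_only))
```

With only the memory-leak fix in the plan, the check reports the five remaining categories as uncovered.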

Postmortem Anti-Patterns

Blame without system analysis: "The engineer deployed without testing." This terminates the analysis at the person rather than the system. It is the most common anti-pattern and produces the least learning.

Action items that are never done: Action items that are not tracked, not assigned to specific owners, or never completed turn the postmortem into paperwork. One postmortem whose action items are completed is worth more than ten with no follow-through. Action items must be in the engineering backlog, prioritized, and tracked.

"Human error" as root cause: This is the most insidious anti-pattern. "The engineer made a mistake" stops the analysis before it starts. Humans make mistakes under pressure, when tired, when documentation is unclear, when tooling is confusing. Human error is never a root cause; it is a symptom of a system that left room for an ordinary mistake to cause harm.

Postmortem theater: Going through the motions — writing a document, holding a meeting, generating action items — without genuine curiosity about what happened or organizational commitment to act on findings. This is worse than no postmortem because it creates the illusion of learning without the reality.

Scope creep: Postmortems that try to fix everything wrong with the system, not just what is relevant to this incident. Focus on the incident. Other improvements belong in a separate roadmap.

Secret postmortems: Postmortems that are not shared beyond the immediate team. One of the highest-value uses of postmortems is cross-team learning. A database failure in team A is almost always relevant to team B's operations.

ANTI-PATTERN CHECKLIST:
[ ] Are action items assigned to named individuals (not teams)?
[ ] Do all action items have due dates?
[ ] Are action items tracked in the team's engineering backlog?
[ ] Does the root cause analysis avoid "human error" as a conclusion?
[ ] Has the postmortem been shared with relevant stakeholders?
[ ] Have previous postmortem action items been completed?
    (If not, why not? Is there a systemic backlog problem?)
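
Parts of this checklist can be automated as a lint step before a postmortem is marked Final. This is a minimal sketch, assuming action items are available as structured records. The `ActionItem` fields, the `BLAME_PHRASES` list, and the "@name" owner convention are assumptions for illustration, and a phrase match is only a prompt for human review, not a verdict.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]   # a named individual such as "@alice", not a team
    due: Optional[date]
    ticket: Optional[str]  # backlog reference such as "ENG-1234"


# Phrases suggesting the analysis stopped at a person (illustrative list).
BLAME_PHRASES = ("human error", "operator error", "engineer made a mistake")


def lint_postmortem(root_cause: str, items: List[ActionItem]) -> List[str]:
    """Return a list of checklist violations; an empty list means none found."""
    problems = []
    lowered = root_cause.lower()
    for phrase in BLAME_PHRASES:
        if phrase in lowered:
            problems.append(f'root cause stops at a person: "{phrase}"')
    for item in items:
        if not item.owner or not item.owner.startswith("@"):
            problems.append(f"{item.title}: owner is not a named individual")
        if item.due is None:
            problems.append(f"{item.title}: no due date")
        if not item.ticket:
            problems.append(f"{item.title}: not tracked in the backlog")
    return problems


# The year below is arbitrary; the template example gives only "Nov 22".
good = ActionItem("Add input validation to checkout config", "@alice",
                  date(2024, 11, 22), "ENG-1234")
vague = ActionItem("Improve testing", "checkout-team", None, None)

for problem in lint_postmortem("Human error during deploy", [good, vague]):
    print(problem)
```

The remaining checklist questions, such as whether earlier action items were completed, need data from the backlog rather than from the document itself.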

Postmortem Review Process

REVIEW FLOW:

  Author(s) write draft
          |
          v
  Circulate to participants
  (those involved in the incident)
          |
  Review period: 24-48 hours
  Comments added asynchronously
          |
          v
  Group review meeting (optional for complex incidents)
  30-60 minutes
  Goal: shared understanding, not blame assignment
  Facilitator: neutral party (not the IC or on-call)
          |
          v
  Author(s) incorporate feedback
          |
          v
  Stakeholder review:
  - Engineering manager: approves action item priority
  - Product: aware of customer impact findings
  - Security: any security implications flagged
          |
          v
  Final document published
  Action items tracked
  Document added to searchable postmortem database

Postmortem Database

A searchable database of past postmortems is one of the most underutilized reliability tools. Benefits:

  • New engineers can understand the failure history of systems they work on
  • Recurring incidents become visible ("this is the third time we've had a Redis OOM")
  • Action items that keep getting deferred become visible
  • Similar incidents at other companies can be compared (Google, GitLab, Cloudflare all publish postmortems)

SEARCHABLE ATTRIBUTES:
  - Date
  - Severity
  - Service(s) affected
  - Root cause category (deployment, infrastructure, code bug, dependency, config)
  - MTTR
  - Error budget consumed
  - Action items: total, completed, pending
  - Tags: database, cache, networking, deployment, etc.

QUERIES THAT BECOME POSSIBLE:
  "Show all incidents where root cause was deployment-related"
  → Track whether deployment safety improvements are reducing incidents

  "Show all incidents for service X in last 12 months"
  → Service reliability trend, identify chronic vs occasional issues

  "Show all incidents with MTTR > 2 hours"
  → What slows down response? Common patterns?

  "Show all pending action items older than 60 days"
  → Are we completing our commitments?
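
The attributes and queries above map naturally onto a small relational schema. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and sample rows are invented for illustration, and a real deployment would likely add tags, error budget figures, and access controls.

```python
import sqlite3

SCHEMA = """
CREATE TABLE postmortems (
    id                  INTEGER PRIMARY KEY,
    incident_date       TEXT NOT NULL,     -- ISO 8601 date
    severity            INTEGER NOT NULL,  -- 1 (worst) to 4
    service             TEXT NOT NULL,
    root_cause_category TEXT NOT NULL,     -- deployment, infrastructure, ...
    mttr_minutes        INTEGER NOT NULL
);
CREATE TABLE action_items (
    id            INTEGER PRIMARY KEY,
    postmortem_id INTEGER NOT NULL REFERENCES postmortems(id),
    title         TEXT NOT NULL,
    created       TEXT NOT NULL,           -- ISO 8601 date
    completed     TEXT                     -- NULL while still pending
);
"""


def build_example_db() -> sqlite3.Connection:
    """Create an in-memory database with a few invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO postmortems VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "2024-09-01", 2, "checkout", "deployment", 18),
            (2, "2024-10-10", 1, "orders-db", "infrastructure", 150),
            (3, "2024-11-15", 2, "checkout", "deployment", 40),
        ],
    )
    conn.executemany(
        "INSERT INTO action_items VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "Add contract tests", "2024-09-03", None),
            (2, 2, "Test failover monthly", "2024-10-12", "2024-10-30"),
            (3, 3, "Add smoke test", "2024-11-18", None),
        ],
    )
    return conn


def incidents_by_category(conn, category: str) -> list:
    """All incidents whose root cause falls in the given category."""
    rows = conn.execute(
        "SELECT incident_date, service FROM postmortems "
        "WHERE root_cause_category = ? ORDER BY incident_date",
        (category,),
    )
    return rows.fetchall()


def slow_incidents(conn, min_mttr_minutes: int) -> list:
    """All incidents whose MTTR exceeded the given number of minutes."""
    rows = conn.execute(
        "SELECT incident_date, service, mttr_minutes FROM postmortems "
        "WHERE mttr_minutes > ? ORDER BY incident_date",
        (min_mttr_minutes,),
    )
    return rows.fetchall()


def stale_action_items(conn, today: str, max_age_days: int = 60) -> list:
    """Titles of pending action items created more than max_age_days ago."""
    rows = conn.execute(
        "SELECT title FROM action_items "
        "WHERE completed IS NULL AND julianday(?) - julianday(created) > ? "
        "ORDER BY created",
        (today, max_age_days),
    )
    return [title for (title,) in rows]


conn = build_example_db()
print(incidents_by_category(conn, "deployment"))
print(slow_incidents(conn, 120))
print(stale_action_items(conn, "2024-12-01"))
```

Passing `today` as a parameter rather than reading the clock keeps the stale-item query reproducible, which helps when the same report is compared month over month.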

Google SRE Postmortem Template

Google's SRE organization published its postmortem template structure, which has become a widely used starting point:

GOOGLE SRE POSTMORTEM TEMPLATE:

  Postmortem for [service] on [date]

  Summary:
  [2-3 sentence incident description]

  Impact:
  [Users affected, revenue, error budget]

  Root Causes:
  [Primary root cause]

  Trigger:
  [What triggered the incident]

  Resolution:
  [What resolved the incident]

  Detection:
  [How was the incident detected]

  Action Items:
  [Table: description, type, priority, bug, owner]
  Type categories: prevent, detect, mitigate, process

  Lessons Learned:
  What went well:
  What went wrong:
  Where we got lucky:

  Timeline:
  [Full chronological timeline in UTC]

The "where we got lucky" section is particularly valuable:
  documenting near-misses reveals additional vulnerabilities
  that could cause a worse incident next time.

Production Example: GitLab Database Incident (2017)

GitLab's January 31, 2017 incident, in which a database data directory was deleted on the wrong server and production data was lost, is one of the best-documented and most instructive public postmortems in the industry. Key findings:

While trying to repair lagging replication, an engineer ran a directory removal intended for the secondary PostgreSQL server on the primary instead. This was a human error, but the blameless analysis revealed:

  • Replication to the secondary had fallen too far behind and stopped, and the resync procedure was manual and poorly documented
  • Scheduled pg_dump backups to S3 had been failing silently, and the failure notifications were never delivered
  • The database restore process had not been regularly tested
  • LVM snapshots were taken only once every 24 hours by default
  • Cloud disk snapshots were not enabled for the database servers
  • The team was working late at night (around 23:00 UTC) after hours of firefighting

GitLab's own summary was that, of the five backup and replication techniques deployed, none was working reliably. The "human error" (wrong server) was the trigger, but the system had no defense in depth, and the action items addressed the failed safeguards rather than the individual. This incident became a landmark in public postmortem culture because GitLab worked from a public document, livestreamed the recovery, and published an extremely detailed report. It remains widely recommended reading for reliability engineers.

Debugging Notes

Postmortems are themselves processes that can fail. Watch for:

  • Participants who are reluctant to speak candidly (blame culture indicators)
  • Action items that are vague ("improve testing") rather than specific ("add integration test covering case X by date Y")
  • Timelines with gaps: what was happening during the gap? The gap may contain the most important information
  • Root causes that seem to terminate at a person or team ("team B didn't tell us about the API change")

Security Implications

Postmortems may contain sensitive information:

  • Credentials that were exposed
  • Internal system topology and architecture
  • Vendor reliability data that may be contractually confidential
  • Security vulnerabilities that have not yet been patched

Postmortem databases should have access controls. Public postmortems (increasingly popular for transparency) should be reviewed before publication to remove sensitive data while preserving the technical learnings.

Performance Implications

Postmortem processes consume engineering time. A good postmortem for a SEV2 incident might take 4-8 hours of engineering time across multiple people. This is a worthwhile investment when the action items prevent a recurring SEV2, which typically costs far more than the review itself, but organizations running 50+ incidents per month cannot postmortem all of them. Prioritize:

  • SEV1: always postmortem
  • SEV2: always postmortem
  • SEV3: postmortem if novel or recurring
  • SEV4: async write-up, no meeting needed
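
The prioritization rule above is small enough to encode directly, for example in incident tooling that opens a postmortem ticket automatically. This is a minimal sketch. The function name `postmortem_requirement` and the returned labels are hypothetical, and the "optional" result for a routine SEV3 is an assumption, since the rule above only says when a SEV3 does need a postmortem.

```python
def postmortem_requirement(severity: int, novel: bool = False,
                           recurring: bool = False) -> str:
    """Map an incident's severity (1-4) to the kind of review it needs."""
    if severity not in (1, 2, 3, 4):
        raise ValueError("severity must be an integer from 1 to 4")
    if severity <= 2:
        return "full postmortem"
    if severity == 3:
        # Assumption: a routine SEV3 is left to the team's discretion.
        return "full postmortem" if (novel or recurring) else "optional"
    return "async write-up"


for sev, kwargs in [(1, {}), (3, {}), (3, {"recurring": True}), (4, {})]:
    print(sev, kwargs, "->", postmortem_requirement(sev, **kwargs))
```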

Failure Modes

Action item graveyard: Postmortem action items that sit in the backlog for months and are never completed. Requires organizational commitment: action items from P1 postmortems must be prioritized above feature work.

Postmortem summary without depth: "The service was down because of a bug" — no timeline, no contributing factors, no action items. A summary is not a postmortem.

Missing historical context: A postmortem that does not reference previous related incidents. A database team that does not know their service had an OOM incident six months ago, handled by a different engineer, will repeat the same analysis from scratch.

Modern Usage

As of 2024-2025, dedicated incident management tools (FireHydrant, Incident.io, Rootly, PagerDuty) provide postmortem templates integrated with incident timelines, automatic timeline generation from Slack messages and PagerDuty alert history, action item tracking integrated with Jira/Linear, and postmortem databases with search. This reduces the overhead of postmortem creation, which improves completion rates.

The trend toward "production reviews" extends postmortem culture from reactive to proactive learning: teams review not only incidents but also near-misses, successful experiments, and expected failures.

Future Directions

AI-assisted postmortem generation: LLMs that analyze the incident channel, alert history, metrics, and deployment logs to auto-generate an initial postmortem timeline. Engineers review and add interpretation rather than reconstructing from scratch.

Postmortem similarity matching: Automatically surface similar past postmortems when a new one is created, enabling faster root cause identification and building on past learnings.
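
One simple way to approximate similarity matching is to compare the tag sets already stored in the postmortem database. This is a minimal sketch using Jaccard similarity (size of the intersection divided by size of the union). This is a stand-in technique chosen for brevity, not a description of how any particular tool works, and the titles, tags, and default threshold are invented.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_postmortems(new_tags: set, past: dict, threshold: float = 0.5) -> list:
    """Titles of past postmortems scoring at or above threshold, best first."""
    scored = [(jaccard(new_tags, tags), title) for title, tags in past.items()]
    return [title for score, title in sorted(scored, reverse=True)
            if score >= threshold]


# Invented examples of past postmortems and their tags.
PAST = {
    "Redis OOM in session cache": {"cache", "redis", "oom"},
    "Checkout deploy regression": {"deployment", "checkout"},
    "Primary database OOM": {"database", "oom", "failover"},
}

new_incident_tags = {"redis", "oom", "cache", "failover"}
print(similar_postmortems(new_incident_tags, PAST))
print(similar_postmortems(new_incident_tags, PAST, threshold=0.3))
```

Tag overlap only works as well as the tagging discipline behind it; matching on the free text of the postmortem is a natural next step but needs more machinery than fits here.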

Action item impact tracking: Closing the loop — measuring whether completed postmortem action items reduced incident frequency or severity. Quantifying the ROI of reliability improvements.

Exercises

  1. Find a public postmortem (GitLab 2017, Cloudflare incidents, AWS service health history) and apply the 5 Whys to identify the root cause chain. Do you agree with the published root cause?

  2. Write a postmortem for an incident you have personally experienced (or a fictional one based on a system you know). Include all sections. Identify contributing factors beyond the proximate cause.

  3. Review your organization's last three postmortems. What percentage of action items were completed? What do the incomplete ones have in common?

  4. Analyze the language in a postmortem you have access to. Find any instances where the language places blame on an individual rather than a system. Rewrite those sentences to be systemic.

  5. Design a postmortem database schema. What fields would you include? What queries would you want to support? How would you ensure action item follow-through?

References

  • Allspaw, J. "Blameless PostMortems and a Just Culture." Etsy Code as Craft Blog, 2012.
  • Dekker, S. The Field Guide to Understanding Human Error. 3rd ed., Ashgate, 2014.
  • Reason, J. "The Contribution of Latent Human Failures to the Breakdown of Complex Systems." Philosophical Transactions of the Royal Society, 1990.
  • Beyer, B., et al. (eds). Site Reliability Engineering. O'Reilly, 2016. Chapter 15: Postmortem Culture: Learning from Failure.
  • GitLab. "GitLab.com Database Incident, January 31, 2017." https://about.gitlab.com/blog/2017/02/01/gitlab-dot-com-database-incident/
  • Google SRE. "Example Postmortem." https://sre.google/sre-book/example-postmortem/
  • Hollnagel, E. Safety-I and Safety-II: The Past and Future of Safety Management. Ashgate, 2014.
  • Woods, D.D. "Essential Characteristics of Resilience." In Resilience Engineering: Concepts and Precepts, 2006.