10 — Distributed Transactions

Technical Overview

A distributed transaction is a transaction that spans multiple nodes, databases, or services. The goal is the same as a single-node transaction: either all operations commit atomically, or all are rolled back — no partial state. This guarantee is significantly harder to achieve when participants are on different machines with independent failure domains. The protocols developed to solve this problem — Two-Phase Commit, Three-Phase Commit, and Saga — represent decades of engineering trade-offs between correctness, availability, and performance.

The dominant practical wisdom in modern distributed systems: avoid distributed transactions when possible. When unavoidable, understand exactly what guarantee each protocol provides and what it costs.

Prerequisites

Distributed systems fundamentals (01-distributed-systems-fundamentals.md)
Replication (08-replication.md)
Sharding (09-sharding.md)
Basic understanding of database transactions (ACID)
Two Generals problem and consensus

Core Content

Why Distributed Transactions Are Hard

On a single node, atomicity is achieved via Write-Ahead Logging (WAL): write the changes to the log, then apply them. On rollback, undo from the log. The disk subsystem guarantees the WAL is durable before the commit record is written.

In a distributed setting:

Single-node commit:
  WAL → apply → done. (Atomic, fast.)

Distributed commit (2 nodes):
  Node A: "I'm ready to commit"
  Node B: "I'm ready to commit"

  Questions:
  - What if Node A crashes after "ready" but before commit?
  - What if the coordinator crashes after telling A to commit 
    but before telling B?
  - What if the network drops the "commit" message to B?
  - What if Node B's disk is full when it tries to write?

  These partial failures have no analog in single-node systems.

The fundamental difficulty is the Atomic Commitment Problem: get all participants to agree on commit-or-abort, with no participant "orphaned" (committed when others aborted).

Two-Phase Commit (2PC)

2PC is the foundational protocol for distributed atomic commit. It was developed in the 1970s, with formal analysis by Gray (1978) and Lampson & Sturgis (1979).

2PC Protocol:

Phase 1: Prepare (Voting Phase)

  Coordinator → All Participants: PREPARE(transaction_id)

  Each Participant:
  1. Write all changes to a redo/undo log (persistent)
  2. Acquire all locks needed for the transaction
  3. Reply: VOTE-YES (I can commit) or VOTE-NO (I cannot)

  Coordinator:
  If ALL votes are YES → proceed to Phase 2 with COMMIT
  If ANY vote is NO → proceed to Phase 2 with ABORT

Phase 2: Commit/Abort

  Coordinator writes COMMIT or ABORT to its own log (persistent).

  Coordinator → All Participants: COMMIT or ABORT

  Each Participant:
  If COMMIT: apply changes, release locks, write COMMITTED to log
  If ABORT:  undo changes, release locks, write ABORTED to log

  Participant → Coordinator: ACK

  Coordinator can now clean up the transaction.

Phase Diagram:

Coordinator          Participant A        Participant B
     │                     │                   │
     │──── PREPARE ─────→  │                   │
     │──── PREPARE ──────────────────────────→ │
     │                     │                   │
     │←─── VOTE-YES ───────│                   │
     │←─────────────────── VOTE-YES ───────────│
     │                     │                   │
     [WRITE COMMIT to log]  │                   │
     │                     │                   │
     │──── COMMIT ───────→ │                   │
     │──── COMMIT ───────────────────────────→ │
     │                     │                   │
     │←─── ACK ────────────│                   │
     │←─────────────────── ACK ────────────────│
     │                     │                   │
  [done]

The durability guarantee: Once the coordinator writes COMMIT to its log (the "commit point"), the transaction is committed — even if the coordinator crashes immediately after. Participants will eventually hear COMMIT or ask the coordinator for the status.

2PC Failure Modes

2PC is a blocking protocol. Participants that have voted YES are "in doubt" — they have locked resources and cannot unilaterally commit or abort. They must wait for the coordinator's decision.

2PC Failure Scenarios:

Scenario 1: Participant crashes before voting
  → Coordinator sees timeout, sends ABORT to all others.
  → Safe: no partial commit.

Scenario 2: Participant crashes after voting YES
  → Participant recovers, asks coordinator for status.
  → Coordinator sends COMMIT or ABORT.
  → Safe: recovery via coordinator.

Scenario 3: Coordinator crashes AFTER writing COMMIT to log
             but BEFORE sending COMMIT to participants
  → Participants hold locks, wait indefinitely.
  → Network timeout fires: "what happened?"
  → Participants CANNOT unilaterally decide.
     If they abort: what if the coordinator committed (and told some others)?
     If they commit: what if the coordinator aborted?
  → Participants are BLOCKED until coordinator recovers.

  ╔════════════════════════════════════════════════════════╗
  ║  This is the fundamental 2PC blocking problem.        ║
  ║  A coordinator failure can leave participants          ║
  ║  blocked indefinitely, holding locks.                 ║
  ╚════════════════════════════════════════════════════════╝

Scenario 4: Network partition during Phase 2
  → Coordinator sends COMMIT to A, network drops COMMIT to B.
  → A commits, B is in doubt.
  → B holds locks, waits.
  → If coordinator is in partitioned half, B waits until partition heals.

The blocking property is why 2PC is considered problematic: a single coordinator failure can block all participants for seconds to minutes. In a system with many concurrent distributed transactions, this creates cascading lock waits.

Three-Phase Commit (3PC)

Three-Phase Commit (Skeen, 1981) adds a pre-commit phase to make the protocol non-blocking — at the cost of assumptions about timing.

3PC Protocol:

Phase 1: CanCommit (same as 2PC Phase 1 prepare)
  Coordinator: Can everyone commit?
  Participants: YES or NO votes.

Phase 2: PreCommit
  If all YES:
    Coordinator → Participants: PRE-COMMIT
    Participants: Write PRE-COMMIT to log. Reply: ACK.

  (This phase is the key addition: participants now KNOW the coordinator
   intends to commit. If coordinator fails now, they can safely commit.)

Phase 3: DoCommit
  Coordinator → Participants: DO-COMMIT
  Participants: commit, release locks, reply ACK.

The insight: In 3PC, if a participant has received PRE-COMMIT, it knows the coordinator decided to commit (no participant voted NO). If the coordinator fails, the participant can ask other participants: "did you get PRE-COMMIT?" If any did, they all commit. If none did, they all abort. This eliminates blocking.

The catch: 3PC is non-blocking only in a synchronous network where message delivery is bounded. In an asynchronous network (or with network partitions), 3PC can still deadlock. Two partitions might independently decide different outcomes based on what each half knows. 3PC trades coordinator-crash blocking for network-partition blocking. Since network partitions are more common than pure coordinator crashes, 3PC is rarely used in practice.

Saga Pattern

The Saga pattern (Hector Garcia-Molina and Kenneth Salem, 1987) takes a fundamentally different approach: instead of distributed locking, use a sequence of local transactions with compensating transactions for rollback.

Saga Pattern:

  Business operation: "Book Flight + Hotel + Car"

  Decompose into local transactions:
  T1: Book Flight      (compensating: Cancel Flight C1)
  T2: Book Hotel       (compensating: Cancel Hotel C2)
  T3: Book Car         (compensating: Cancel Car C3)

  Happy path:
  T1 → T2 → T3 → done

  Failure at T3 (car unavailable):
  T1 → T2 → T3 FAILS → C2 → C1 → done (all compensated)

  Failure at T2 (hotel unavailable):
  T1 → T2 FAILS → C1 → done

  No distributed locking!
  Each T_i is a local transaction on one service/database.
  C_i is a compensating transaction (semantic undo).

Choreography vs Orchestration:

Choreography (event-based):
  Each service reacts to events and emits its own events.

  FlightService → "FlightBooked" event
  HotelService  hears "FlightBooked" → books hotel → "HotelBooked" event
  CarService    hears "HotelBooked" → books car → "CarBooked" event

  Pros: Loose coupling, services are independent.
  Cons: Hard to follow the business logic flow; distributed debugging.

Orchestration (central coordinator):
  A Saga Orchestrator tells each service what to do.

  SagaOrchestrator → FlightService: "Book flight X"
  SagaOrchestrator hears "FlightBooked" → HotelService: "Book hotel Y"
  SagaOrchestrator hears "HotelBooked" → CarService: "Book car Z"

  Pros: Business logic in one place; easier to observe.
  Cons: Orchestrator is a coordination point; potential bottleneck.

Saga limitations: - Transactions are not isolated from each other during the saga. Between T1 and T2, other sagas can read the partially-committed state. This is ACI (no Isolation) rather than ACID. - Compensating transactions must be idempotent (retryable) and should not fail. - Not suitable when "undoing" is semantically impossible (e.g., you already sent an email or printed a package label).

Distributed MVCC

Multi-Version Concurrency Control (MVCC) in distributed systems assigns globally-ordered timestamps to transactions. Each transaction reads a consistent snapshot of the database as of its start timestamp.

Google Spanner TrueTime:

Spanner assigns commit timestamps using TrueTime. TrueTime gives a time interval TT.now() → [earliest, latest] with the guarantee that the true time is within the interval. Before committing, Spanner waits until earliest > transaction_timestamp (commit-wait) to ensure the transaction's timestamp is in the past.

Spanner Commit Sequence:

  1. Transaction T starts at timestamp ts_start.
  2. Reads are at ts_start (any version with commit_ts <= ts_start).
  3. Writes are prepared with tentative commit timestamp ts_commit.
  4. Wait until TT.now().earliest > ts_commit. (Commit wait: ~1-7ms)
  5. Commit: writes become visible at ts_commit.

  Guarantee: If T1 commits before T2 starts (in real time):
    ts_commit(T1) < ts_start(T2)
    T2 will always see T1's writes.
    → External consistency (stronger than linearizability)

CockroachDB HLC-based MVCC:

CockroachDB uses Hybrid Logical Clocks (HLC) as timestamps. Every row version has an HLC timestamp. Transactions read at their HLC start timestamp. Conflicts are resolved by refreshing the read timestamp or retrying the transaction.

CockroachDB Read Refresh:

  Transaction T starts at HLC timestamp t=100.
  T reads x=5 (version at t=80).
  Another transaction commits x=6 at t=90. (Still < 100, so T should see it)

  Wait — T read at t=100 but the value at t=100 should be x=6 (committed at t=90).

  CockroachDB refreshes: re-reads x at latest timestamp, sees x=6.
  If the re-read returns the same value T used for decision-making: OK.
  If the re-read returns a different value: TRANSACTION RETRY required.

Google Percolator

Percolator (Peng & Dabek, Google, 2010) is a distributed transaction protocol built on top of Bigtable (a non-transactional store). It uses locks stored as data and supports transactions via MVCC.

Percolator Two-Phase Commit:

Phase 1 (Prewrite):
  1. Client picks a primary row and a transaction start timestamp ts_start.
  2. For each row to be written:
     a. Check for write-write conflicts: any existing lock or newer write?
     b. Write a "lock" cell: points to primary row.
     c. Write the new data as a tentative write at ts_start.

Phase 2 (Commit):
  1. Write a "committed" record to the primary row at commit timestamp ts_commit.
  2. For each secondary row: write commit pointer (pointing to primary).
     Delete the lock cell.

Read protocol:
  1. Read all versions with timestamp <= ts_read.
  2. If a lock exists at ts <= ts_read: another transaction is in flight.
     Wait for it to complete or resolve stale locks.

Percolator is the basis for TiKV's distributed transaction implementation (Pessimistic mode uses a similar two-phase approach).

XA Transactions

XA (eXtended Architecture) is a standard for distributed transactions from The Open Group (1991). It's the lowest common denominator for cross-system 2PC.

XA Two-Phase Commit:

  Transaction Manager (TM)
       │
       ├── Resource Manager 1 (RM1) — e.g., PostgreSQL
       ├── Resource Manager 2 (RM2) — e.g., Oracle
       └── Resource Manager 3 (RM3) — e.g., IBM MQ

  XA APIs:
  xa_start(xid)      — Start a transaction branch
  xa_end(xid)        — End a transaction branch
  xa_prepare(xid)    — Phase 1 prepare
  xa_commit(xid)     — Phase 2 commit
  xa_rollback(xid)   — Phase 2 abort
  xa_recover(xid)    — Recover in-doubt transactions after crash

  Supported by: Oracle, IBM DB2, PostgreSQL (xa_start), MySQL (XA START),
                most enterprise databases, JTA (Java Transaction API).

XA is widely used in enterprise Java applications (JBoss, WebLogic) but has a reputation for poor performance (2PC overhead × multiple heavyweight databases) and operational complexity (in-doubt transactions after coordinator failures).

Practical Guidance: Avoid Distributed Transactions When Possible

The most important advice for distributed systems engineers:

Design Alternatives to Distributed Transactions:

1. Collocate data: Put data that participates in the same transaction 
   on the same shard/node. No distribution → no distributed transaction.
   Example: Stripe keeps all data for one customer in one shard.

2. Eventual consistency + Saga: Accept that not all business operations 
   require strict atomicity. Use Sagas for business-level compensation.

3. Outbox pattern: Write to local DB + message queue in same local transaction.
   Reliably forward the message via the "outbox" table.

   App → DB Transaction: { INSERT order; INSERT outbox_msg }
   Background: reads outbox, sends to Kafka, marks as sent

   This avoids distributed transactions between DB and Kafka.

4. Idempotent operations: Design operations to be safely retried.
   Instead of "transfer $10 from A to B" (requires atomicity),
   use "if A has balance ≥ 10, debit A; then credit B" with 
   idempotency keys. Each step is independently retried.

5. Two-phase approach for immutable data: 
   "Reserve" inventory in step 1, "confirm" in step 2.
   Each step is a local transaction. The reservation is "pending",
   not committed. This limits the blast radius of failures.

Historical Context

1978: Jim Gray publishes "Notes on Database Operating Systems," introducing the formal model for transaction processing including atomic commit and 2PC concepts.

1979: Lampson and Sturgis publish "Crash Recovery in a Distributed Data Storage System," providing the formal correctness analysis of 2PC.

1981: Skeen and Stonebraker publish "A Formal Model of Crash Recovery in a Distributed System," introducing 3PC.

1987: Garcia-Molina and Salem publish "SAGAS," introducing the Saga pattern for long-running transactions in SIGMOD 1987. The paper was originally about financial transactions that couldn't hold locks for minutes or hours.

1991: The Open Group publishes the XA specification, standardizing the 2PC interface between transaction managers and resource managers.

2010: Google publishes the Percolator paper (Peng & Dabek), showing how to build distributed transactions on a non-transactional key-value store.

2012: Google Spanner paper demonstrates external consistency across globally distributed shards using TrueTime.

2015–present: The Saga pattern is rediscovered and popularized in the context of microservices by Chris Richardson and others. The eventuate.io and temporal.io frameworks provide Saga infrastructure.

Production Examples

Google Spanner: Uses Paxos per-group (shard) + 2PC across groups. Spanner's coordinator is typically the Paxos leader of the shard containing the "first" write in the transaction. Commit latency: 10–100ms per commit (dominated by TrueTime wait).

CockroachDB: Uses a Percolator-inspired 2PC with HLC timestamps. The transaction coordinator is usually the gateway node. The "transaction record" (similar to Percolator's primary row) is the key for locking and commit state. CockroachDB supports both optimistic (read-refresh) and pessimistic (lock before read) concurrency.

Temporal.io: A workflow engine that provides durable sagas. Temporal workflows are code that can be paused and resumed across machine failures. Each "activity" in a Temporal workflow is a local operation. Failed activities are automatically retried. This is orchestration-based Sagas with durable state.

Apache Flink (2PC checkpointing): Flink uses a 2PC-based checkpointing mechanism for exactly-once stream processing. The checkpoint coordinator sends "barrier" messages; when all operators receive a barrier, they snapshot state. The coordinator commits the checkpoint only when all operators have confirmed their snapshot. If any operator fails, the entire checkpoint is aborted and the previous checkpoint is restored.

PayPal SEDA/Saga: PayPal uses a saga-like pattern for payment processing. Each step of a payment (reserve funds, update ledger, notify bank) is a local transaction. Compensating transactions (refunds, reversals) are explicitly implemented for each step. The Saga state machine is persisted in the database to survive crashes.

Debugging Notes

Finding in-doubt 2PC transactions:

-- PostgreSQL: find prepared transactions
SELECT gid, transaction, prepared, owner, database
FROM pg_prepared_xacts;

-- These transactions hold locks. If the coordinator crashed,
-- they may never commit or abort automatically.
-- Manual resolution: COMMIT PREPARED 'gid' or ROLLBACK PREPARED 'gid'

-- MySQL: find in-doubt XA transactions
XA RECOVER;
-- Manual resolution: XA COMMIT or XA ROLLBACK

Monitoring 2PC health: - Alert on pg_prepared_xacts count > 0 for > 60 seconds. - Alert on XA_RECOVER returning rows for > recovery_timeout. - Monitor coordinator health separately from participant health.

Saga debugging: - Saga orchestrator state should be stored durably (in a database). - Every saga transition should be logged with the current step and the event that triggered it. - Failed compensating transactions are the hardest case: if C2 fails while rolling back T2, you have a partially-compensated saga. Design sagas so compensating transactions are always retryable.

Security Implications

Lock holding during 2PC prepare phase: In-doubt transactions hold database locks for potentially extended periods. A malicious or buggy coordinator can exploit this to create lock-based DoS attacks against the database.
Coordinator trust: In XA, the transaction manager has authority to commit or abort any in-doubt transaction. Ensure the TM is trusted and its communication channel with RMs is authenticated (mTLS).
Saga data visibility: Intermediate Saga states are visible to other transactions (no isolation). This means partially-completed sagas can be read. Ensure that partially-applied state doesn't expose sensitive data or trigger incorrect downstream actions.
Idempotency key security: Idempotency keys (used to make retries safe) must be tied to the requesting identity. An attacker who knows an idempotency key can suppress a retry or replay a past operation.

Performance Implications

Transaction Type              | Latency  | Throughput | Availability
------------------------------|----------|------------|-------------
Single-node transaction        | 1-5ms   | Very high  | High
2PC (same datacenter)          | 5-20ms  | Medium     | Medium*
2PC (cross-datacenter)         | 50-200ms| Low        | Low*
Saga (choreography)            | seconds | High       | High
Saga (orchestration)           | seconds | High       | High
Spanner 2PC + TrueTime wait   | 10-100ms| Medium     | High (Paxos)

* Availability reduced because coordinator failure blocks participants.

The cost of 2PC: two round-trips (prepare + commit), two fsyncs (coordinator + participants), plus lock holding time between phases. For a 3-shard transaction across a datacenter, this is typically 10–20ms.

Failure Modes and Real Incidents

2015, GitHub Distributed Transaction Lock Buildup: GitHub's MySQL cluster used XA transactions for cross-shard operations on their githost layer. A periodic job that ran XA transactions became slow due to a table scan. The XA transactions held locks while the coordinator processed the scan. Thousands of subsequent transactions queued waiting for locks. Resolution: kill the slow XA transactions and add index to the scanned table.

2018, Distributed Transaction Cascade at Uber: Uber's Schemaless system had a distributed transaction coordinator that used MySQL as its backing store. During a MySQL failover, the coordinator's state was temporarily unavailable. Participants in ~500 in-flight distributed transactions were stuck in the "prepared" state. Upon recovery, the coordinator correctly resolved all transactions, but the 30-second window of stuck transactions caused downstream service timeouts.

2019, Percolator-Style Lock Contention in TiDB: TiDB's distributed transactions use optimistic concurrency by default. A write-heavy workload caused many transaction retries (each failed optimistic transaction retries from scratch). Switching to pessimistic concurrency (lock before read) reduced retries at the cost of higher lock contention. The trade-off is workload-dependent.

Modern Usage

Temporal.io and Conductor (Netflix): Durable workflow engines that implement Saga orchestration as a managed service. Used for multi-step business processes.
CQRS + Event Sourcing: Command-Query Responsibility Segregation with Event Sourcing naturally implements Sagas: commands produce events, sagas react to events, the event log is the audit trail.
Microservice outbox pattern: Instead of distributed transactions, services write to a local "outbox" table in the same local transaction as their business logic. A relay service reads the outbox and publishes events to Kafka. This is 2PC without a distributed coordinator.
AWS Step Functions: Managed Saga orchestrator. Workflow steps are AWS Lambda functions or service integrations. Step Functions handles retry, compensation, and state persistence.

Future Directions

Deterministic databases: FaunaDB (and its successor systems) use a deterministic transaction ordering mechanism that eliminates the need for explicit distributed locking. All nodes process the same ordered transaction log; coordination is in the log ordering, not the execution.
Calvin / Aria / Sundial: Research on deterministic distributed transaction processing. Transactions are pre-ordered before execution, eliminating conflicts during execution. Achieves high throughput with strong consistency, but requires full transaction specification upfront (no interactive transactions).
Vector clocks for Sagas: Research into using vector clocks to track Saga causality, enabling more precise ordering guarantees for eventually-consistent Sagas.

Exercises

2PC Simulation: Implement a 2PC coordinator and 3 participants in Python using threads. Simulate the following failure scenarios: (a) participant A votes NO, (b) participant B crashes after voting YES, (c) coordinator crashes after receiving all YES votes but before sending COMMIT. For each scenario, determine the final state of all participants and whether the protocol is consistent.
2PC Blocking Measurement: Using PostgreSQL's PREPARE TRANSACTION and pg_prepared_xacts, measure: (a) how long a transaction is blocked by an in-doubt 2PC transaction on the same row, (b) what happens to that blocking transaction when you kill the coordinator process, (c) how long until PostgreSQL's recovery process resolves the in-doubt transaction.
Saga Implementation: Implement a Saga for a travel booking system (flight + hotel + car) using the orchestration pattern. Your orchestrator must: (a) execute steps in sequence, (b) run compensating transactions in reverse order on failure, (c) handle idempotent retries (a step that was partially applied should be safe to retry), (d) persist Saga state to a database so the orchestrator can crash and resume.
Outbox Pattern: Build a simple e-commerce order service with two concerns: (a) write order to PostgreSQL database, (b) publish "OrderPlaced" event to Kafka. Without distributed transactions, implement the outbox pattern so that: (a) if the app crashes before Kafka publish, the event is retried on restart, (b) if Kafka is unavailable, the order is still written to the DB and published later. Verify at-least-once delivery semantics.
Spanner-Style Timestamp Wait: Simulate Spanner's commit-wait mechanism. Create a distributed system with 3 nodes and a simulated TrueTime API (physical clock ± 5ms uncertainty). Implement: (a) transaction read at a snapshot timestamp, (b) commit at a timestamp that includes wait for TrueTime interval to pass, (c) verify that for 100 concurrent transactions, no transaction reads a value that was committed after it started.

References

Gray, J. (1978). "Notes on Database Operating Systems." Lecture Notes in Computer Science, 60, 393–481.
Lampson, B., & Sturgis, H. (1979). "Crash Recovery in a Distributed Data Storage System." Xerox PARC Technical Report.
Skeen, D. (1981). "Nonblocking Commit Protocols." SIGMOD 1981.
Garcia-Molina, H., & Salem, K. (1987). "SAGAS." SIGMOD 1987.
Peng, D., & Dabek, F. (2010). "Large-scale Incremental Processing Using Distributed Transactions and Notifications." OSDI 2010. (Percolator)
Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.
Richardson, C. (2018). Microservices Patterns. Manning. (Saga pattern in microservices context)
Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency Control and Recovery in Database Systems. Addison-Wesley.