Skip to content

04 — Logical Clocks

Technical Overview

In a distributed system, physical clocks drift apart, NTP synchronization is imprecise, and there is no shared global time. Yet many fundamental distributed algorithms require some notion of ordering: we need to know which event caused which, or at least establish a consistent ordering so all nodes can agree on the sequence of events. Logical clocks solve this by tracking causality rather than wall-clock time. They assign numbers to events such that the numbers respect the happened-before relationship, giving us a consistent way to reason about ordering without requiring synchronized physical clocks.


Prerequisites

  • Understanding of happened-before relation (see 01-distributed-systems-fundamentals.md)
  • Basic distributed systems concepts
  • Familiarity with vector data structures
  • Some exposure to distributed databases (helpful but not required)

Core Content

Physical Clock Problems in Distributed Systems

Every computer has a hardware oscillator that ticks at roughly a fixed rate. The OS uses this to maintain a wall clock. Problems arise in distributed settings:

Clock drift: Quartz oscillators drift by ~1-100 ppm (parts per million). At 100 ppm drift, a clock drifts by 8.64 seconds per day. After one day without synchronization, two machines can disagree by up to 17 seconds.

NTP accuracy: Network Time Protocol (NTP) corrects drift by adjusting the clock based on time servers. In a datacenter, NTP typically achieves 1–10ms accuracy. Over the internet, accuracy degrades to 10–100ms. This means two machines "synchronized" via NTP can disagree by up to 20ms about which event happened first.

Non-monotonic time: NTP can step the clock backward. CLOCK_REALTIME on Linux can jump backward when NTP corrects a forward drift. This breaks any code that uses wall-clock differences to measure elapsed time. Always use CLOCK_MONOTONIC for elapsed time within a process.

The ordering problem: Suppose Node A writes x=5 at time T=1000ms and Node B writes x=7 at time T=1001ms (according to each node's local clock). If Node A's clock is 5ms ahead of Node B's, the "real" order might be the opposite. You cannot use wall-clock timestamps to establish causal ordering across nodes without atomic-clock-level synchronization.

Clock Skew Example:

  Node A clock: 10:00:00.005 (5ms fast)
  Node B clock: 10:00:00.000 (reference)

  Event sequence:
  A writes x=1   at A-time 10:00:00.005
  A sends to B
  B receives, writes x=2 at B-time 10:00:00.003

  B's timestamp (10:00:00.003) < A's timestamp (10:00:00.005)
  Physical timestamps claim B's write came FIRST.
  But causally, A's write caused B's write.
  Physical clocks give the WRONG causal order.

Lamport Clocks

Leslie Lamport published "Time, Clocks, and the Ordering of Events in a Distributed System" in Communications of the ACM in July 1978. This paper introduced the happened-before relation and Lamport timestamps.

The happened-before relation (→):

For events a and b:
  a → b if:
    (1) a and b are in the same process, and a occurs before b
    (2) a is a send event, b is the corresponding receive event
    (3) ∃c such that a → c and c → b  (transitivity)

If neither a → b nor b → a, then a and b are concurrent: a ∥ b

Lamport Clock Algorithm:

Each process maintains a counter LC initialized to 0.

Rules:
  1. Before each event in process P: LC_P = LC_P + 1
  2. When process P sends message m: 
       LC_P = LC_P + 1
       attach LC_P to message m (as timestamp ts(m))
  3. When process P receives message m with timestamp ts(m):
       LC_P = max(LC_P, ts(m)) + 1

Pseudocode:

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):          # internal event
        self.time += 1
        return self.time

    def send(self):          # before sending message
        self.time += 1
        return self.time     # attach to message

    def receive(self, msg_time):   # on receiving message
        self.time = max(self.time, msg_time) + 1
        return self.time

Example trace:

Process A:           Process B:          Process C:
LC=0                 LC=0                LC=0

event(LC=1)          
send(LC=2) ─────────────────────────────→ recv
                     recv: LC=max(0,2)+1=3
                     event(LC=4)          
                     send(LC=5) ─────────→ recv
                                          recv: LC=max(0,5)+1=6
                                          event(LC=7)

The key property: If a → b, then LC(a) < LC(b).

The critical limitation: The converse is NOT guaranteed. If LC(a) < LC(b), it does NOT follow that a → b. They might be concurrent. Lamport timestamps cannot distinguish causally related events from concurrent events with the same ordering.

Limitation Example:

Process A: event(LC=5)
Process B: event(LC=3) ──then── event(LC=6)

LC(A's event) = 5, LC(B's first event) = 3, LC(B's second event) = 6
5 < 6, suggesting A's event happened before B's second event.
But if there's no message between them, they may be concurrent (A ∥ B's second event).

Vector Clocks

Vector clocks (independently developed by Colin Fidge and Friedemann Mattern in 1988) solve the limitation of Lamport clocks. A vector clock tracks a separate Lamport counter for each process.

Data structure: A vector of size N (number of processes), where V[i] represents the number of events process i has executed that are causally known to the holder.

Vector Clock Algorithm:

Each process P_i maintains vector VC_i of size N, initialized to all zeros.

Rules:
  1. Before each internal event at P_i:
       VC_i[i] = VC_i[i] + 1

  2. Before P_i sends message m:
       VC_i[i] = VC_i[i] + 1
       attach VC_i to m

  3. When P_i receives message m with vector timestamp VT_m:
       for j in 1..N:
           VC_i[j] = max(VC_i[j], VT_m[j])
       VC_i[i] = VC_i[i] + 1

Pseudocode:

class VectorClock:
    def __init__(self, process_id, n_processes):
        self.pid = process_id
        self.vc = [0] * n_processes

    def tick(self):
        self.vc[self.pid] += 1
        return list(self.vc)

    def send(self):
        self.vc[self.pid] += 1
        return list(self.vc)    # attach copy to message

    def receive(self, msg_vc):
        for i in range(len(self.vc)):
            self.vc[i] = max(self.vc[i], msg_vc[i])
        self.vc[self.pid] += 1
        return list(self.vc)

Comparing vector clocks (V and W are vector clocks):

V == W  iff  V[i] == W[i] for all i
V <= W  iff  V[i] <= W[i] for all i
V < W   iff  V <= W and V != W  (V happened-before W)
V ∥ W   iff  NOT (V < W) AND NOT (W < V)  (concurrent)

Vector Clock Update Diagram:

Process A (pid=0):   [1,0,0]  [2,0,0]                    [4,2,1]
                        ↑        ↑                            ↑
                     event1   send(m1)                    recv(m3)
                                  \                          /
                                   \     msg(VC=[2,0,0])   msg(VC=[3,2,1])
                                    ↓                      ↑
Process B (pid=1):               [2,1,0]  [2,2,0]  [3,2,0] [3,2,1]
                                    ↑        ↑        ↑       ↑
                                 recv(m1) event2  send(m2) send(m3)
                                              \    /
                                               ↓  ↑
Process C (pid=2):                          [2,2,1] [2,2,2]
                                                ↑     ↑
                                             recv(m2) event3

Reading vector clocks:
  A's event1: [1,0,0] → A has seen 1 A-event, 0 B-events, 0 C-events
  B recv(m1): [2,1,0] → B now knows about A's 2 events + its own 1 event
  C recv(m2): [2,2,1] → C knows about A's 2, B's 2, C's 1 events

Causal ordering:
  A's send(m1) [2,0,0] < B's recv(m1) [2,1,0] → causal (m1 sent before recv)
  B's event2 [2,2,0]  ∥ C's event3 [2,2,2] if no message between them

The full power of vector clocks: For any two events a and b: - If VC(a) < VC(b): a causally precedes b (a → b) - If VC(a) ∥ VC(b): a and b are concurrent (no causal relationship)

Vector clocks capture complete causal ordering information.

Version Vectors vs Vector Clocks: A Critical Distinction

These terms are often used interchangeably but have an important difference:

Vector Clocks track causality between events. Every event increments the local counter.

Version Vectors track the version of a specific piece of data across replicas. The counter represents "how many updates has this replica made to this object" rather than "how many events has occurred."

Vector Clock:
  Used to: determine causal ordering of events across processes
  Counter represents: number of EVENTS at a process
  Granularity: per event

Version Vector:
  Used to: determine if two object versions are causally related or concurrent
  Counter represents: number of UPDATES to an OBJECT at a replica
  Granularity: per object

Amazon Dynamo's original design used vector clocks on a per-key basis. This was actually version vectors. The distinction matters: if a key is written 1000 times, the version vector may accumulate thousands of entries if the key moves across many replicas. Dynamo bounded this by truncating old entries, which introduced subtle bugs (two objects might incorrectly appear causally related).

Riak (an open-source Dynamo-inspired database) used version vectors for conflict detection. When a get returns two conflicting versions (siblings), the application must resolve the conflict. Riak 2.0 introduced CRDTs to avoid this.

Hybrid Logical Clocks (HLC)

Proposed by Sandeep Kulkarni, Murat Demirbas, et al. in 2014, Hybrid Logical Clocks combine physical clock timestamps with logical clock counters. The goal: maintain the ordering properties of logical clocks while staying close to physical time (useful for debugging, TTL calculations, and human-readable timestamps).

HLC data structure: (physical_time, logical_counter)

HLC = (l, c)
  l = the physical time component (NTP wall clock)
  c = the logical counter (0 when l is fresh)

Rules:
  On send/local event:
    l_new = max(l_old, physical_now)
    if l_new == l_old: c = c + 1
    else: c = 0
    timestamp = (l_new, c)

  On receive(l_m, c_m):
    l_new = max(l_old, l_m, physical_now)
    if l_new == l_old == l_m: c = max(c_old, c_m) + 1
    elif l_new == l_old: c = c_old + 1
    elif l_new == l_m: c = c_m + 1
    else: c = 0
    timestamp = (l_new, c)

HLC bounds: The paper proves that l never drifts more than ε from physical time (where ε is the NTP error bound). This means: - HLC can be used for NTP-bounded staleness checks - HLC timestamps are human-readable (close to wall time) - HLC provides the same causal ordering as Lamport clocks

CockroachDB uses HLC extensively. Every transaction is assigned an HLC timestamp. The HLC allows CockroachDB to: 1. Provide serializable transactions across globally distributed nodes 2. Detect when a transaction's timestamp is "too far in the past" (transaction uncertainty windows) 3. Expose timestamps that are meaningful to humans for debugging


Historical Context

1978: Leslie Lamport publishes "Time, Clocks, and the Ordering of Events in a Distributed System" in CACM. This is one of the most cited papers in computer science. Lamport later won the 2013 Turing Award partly for this work.

1988: Colin Fidge ("Timestamps in Message-Passing Systems That Preserve the Partial Ordering") and Friedemann Mattern ("Virtual Time and Global States of Distributed Systems") independently develop vector clocks. Fidge's paper appeared in February 1988; Mattern's in October 1988.

1990s: Vector clocks become standard in distributed databases for conflict detection. Amazon's Dynamo (2007) popularizes version vectors for key-value stores.

2014: Kulkarni, Demirbas, Madeppa, Avva, and Leone publish "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases" at the 2014 International Symposium on Stabilization, Safety, and Security of Distributed Systems. CockroachDB adopts HLC in 2015.


Production Examples

Riak (vector clocks): Riak used vector clocks (actually version vectors) to track concurrent writes to the same key. If two clients wrote to the same key simultaneously, Riak would return both values as "siblings" and let the application merge them. Riak 2.0 (2014) introduced CRDTs, which avoid sibling proliferation for common data types (counters, sets, maps, flags).

DynamoDB (version vectors): The original Dynamo paper described using vector clocks to detect concurrent writes. In practice, Amazon simplified this to Last-Write-Wins with a timestamp (not a true vector clock), sacrificing conflict detection for simplicity. DynamoDB transactions (2018) provide stronger guarantees for multi-item operations.

CockroachDB (HLC): CockroachDB uses HLC timestamps for MVCC (Multi-Version Concurrency Control). Every row version has an HLC timestamp. A transaction reads at a specific HLC timestamp and sees a consistent snapshot of the database. The HLC ensures this snapshot is causally consistent with all events that happened before the transaction began.

Cassandra (Lamport-inspired timestamps): Cassandra uses microsecond-resolution wall-clock timestamps for Last-Write-Wins conflict resolution. This is NOT a proper Lamport clock (it doesn't do the max() merge on receive), which makes it vulnerable to clock skew bugs. If two clients write the same key at the same microsecond (or if clocks are skewed), the outcome depends on which replica receives each write first — a well-known source of data loss in Cassandra.


Debugging Notes

Diagnosing causality violations:

  1. Check timestamps: If event B claims to precede event A but was caused by A, you have a causality violation. This typically means a Lamport clock receive rule wasn't applied, or physical timestamps were used naively.

  2. Vector clock inspection: Add logging of full vector clocks to critical events. When you see a bug, inspect the vector clocks of the suspicious events. If VC(buggy_event) ∥ VC(causing_event), they were treated as concurrent when they shouldn't have been.

  3. HLC drift monitoring: In CockroachDB, monitor the maximum HLC offset. If HLC drifts too far from physical time (>500ms by default), CockroachDB will crash-stop rather than risk incorrect ordering.

Tools: - OpenTelemetry traces include timestamps that can expose ordering issues - Lamport/vector clocks can be added as trace attributes for post-hoc analysis


Security Implications

  • Replay attacks: Logical clocks are often used to prevent replaying old messages. If an attacker can forge a high logical timestamp, they can cause a message to appear "newer" than it is, potentially overwriting legitimate data.
  • LWW and timestamp manipulation: Cassandra's LWW conflict resolution uses physical timestamps. An attacker (or buggy client) that writes with a future timestamp can prevent any legitimate future writes from taking effect until the fake timestamp is in the past.
  • Clock manipulation: If an attacker can control NTP sources, they can manipulate physical clocks, which affects HLC-based systems. CockroachDB's maximum allowed clock offset (default: 500ms) limits the damage from clock manipulation.

Performance Implications

  • Lamport clocks: Essentially free — one integer per message, one comparison per receive. Overhead is negligible.
  • Vector clocks: O(N) space and comparison time where N is the number of processes. For a 3-node cluster, this is trivial. For a 1000-node cluster, attaching full vector clocks to every message is expensive. Amazon Dynamo addressed this by truncating vector clocks, but this introduced subtle bugs.
  • HLC: Slightly more complex than Lamport clocks but still very cheap. Two integers (physical time, logical counter). No coordination required.
  • Version vectors at scale: Storing per-object version vectors in a database with millions of keys is expensive in memory. Sharded databases like Riak use per-object version vectors, which scale with the number of objects × number of replicas.

Failure Modes and Real Incidents

2012, Cassandra LWW clock skew bug: A production Cassandra cluster experienced writes being silently discarded. Investigation revealed one node's clock was ~2 seconds ahead. Writes to that node received timestamps 2 seconds in the future. When the node was corrected via NTP, its timestamps became "old" and any subsequent writes to the same keys with correct timestamps were silently discarded — they had smaller timestamps. The data appeared correct to the node with the skewed clock but was stale everywhere else.

Dynamo vector clock truncation: The Dynamo paper acknowledged that vector clocks were pruned when they grew too large (over 10 entries per clock). This truncation lost causal history, meaning two conflicting versions might be incorrectly treated as causally ordered rather than concurrent. The application would receive a "consistent" value that was actually the wrong winner of an undetected conflict.

CockroachDB clock offset enforcement: CockroachDB will deliberately crash any node whose HLC drifts more than 500ms from its peers. This is a safety mechanism: rather than serve potentially incorrect data due to extreme clock skew, the node stops and requires operator intervention. In 2019, a misconfigured VMs had their NTP disabled, causing periodic CockroachDB crashes when the clock drifted. The crashes were confusing until the root cause (NTP) was identified.


Modern Usage

Logical clocks remain foundational to modern distributed systems:

  • etcd (Kubernetes backbone): Uses Raft's term and log index as a form of logical clock for consensus. The term+index together serve as a globally ordered identifier for commands.
  • Apache Kafka: Uses offsets (per-partition monotonic integers) as logical clocks within a partition. These don't capture causality across partitions, which is a known limitation when ordering events from multiple partitions.
  • Google Spanner: Uses TrueTime (physical clock with bounded uncertainty) rather than logical clocks, but the commit-wait mechanism is essentially an implementation of the Lamport clock property: wait until the true time is past your commit timestamp before releasing the lock.
  • OpenTelemetry distributed tracing: Trace and span IDs implement a form of causal tracking without full vector clocks. The parent-child relationship between spans captures causal ordering for observability purposes.

Future Directions

  • GPS-disciplined physical clocks as default: As GPS clock hardware becomes cheaper, more systems may use TrueTime-equivalent APIs. This would reduce the need for logical clocks in practice.
  • Causal+ consistency at scale: COPS (Lloyd et al., 2011) demonstrated geo-distributed causal consistency with near-eventual-consistency performance. Follow-on work (Eiger, 2013; Occult, 2017) has pushed this toward practical deployment.
  • Logical clocks in hardware: Research into SmartNIC (Network Interface Card) offloading suggests that logical clock maintenance could be moved to the network layer, reducing software overhead and enabling more consistent global ordering.

Exercises

  1. Lamport Clock Simulation: Implement a Lamport clock simulation in Python with 3 processes. Generate a random sequence of 10 events per process with 5 messages passed between processes. Print the Lamport timestamp for each event. Verify manually that the happened-before relation is respected.

  2. Vector Clock Concurrency Detection: Extend the above simulation to use vector clocks. For every pair of events in your simulation, determine whether they are causally related or concurrent using vector clock comparison. Count how many event pairs are concurrent vs. causally ordered. How does this fraction change as you add more inter-process messages?

  3. Clock Skew Impact: Write a program that simulates an LWW key-value store where Node A's clock is 100ms ahead. Have client 1 write key=1 via Node A, then have client 2 read key=1 from Node B (which has the replicated value with Node A's timestamp). Then have client 2 write key=2 (a "subsequent" write, but with a current physical timestamp). What happens to key=1 vs key=2 ordering? What would need to change to use Lamport clocks here instead?

  4. HLC Implementation: Implement a Hybrid Logical Clock in Python. Test it by: (a) creating 3 nodes with clocks that differ by 50ms, (b) sending 100 messages between them, (c) verifying that for every send/receive pair, HLC(send) < HLC(receive), (d) verifying that all HLC values are within 150ms of the maximum physical clock in the system.

  5. Version Vector Merge: Implement a simple distributed key-value store with version vectors for conflict detection. Have 3 replicas accept writes while "partitioned" from each other (no synchronization). After 10 writes each, "heal" the partition and implement conflict resolution: (a) detect conflicting versions using version vector comparison, (b) implement LWW resolution, (c) implement application-level merge for a set data type. Compare the results.


References

  1. Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 21(7), 558–565.
  2. Fidge, C. J. (1988). "Timestamps in Message-Passing Systems That Preserve the Partial Ordering." Proceedings of the 11th Australian Computer Science Conference, 56–66.
  3. Mattern, F. (1989). "Virtual Time and Global States of Distributed Systems." Parallel and Distributed Algorithms, 215–226.
  4. Kulkarni, S., et al. (2014). "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases." SSS 2014.
  5. DeCandia, G., et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store." SOSP 2007.
  6. Lloyd, W., et al. (2011). "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS." SOSP 2011.
  7. Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.