03 — TCP Timers

Technical Overview

TCP is a timer-driven protocol. Where most protocols rely on explicit control signals, TCP uses timers to detect failures, manage connection liveness, and optimize throughput. Five distinct timers govern different aspects of connection lifecycle: the retransmission timer fires when an ACK doesn't arrive in time; the persist timer prevents deadlock when the receiver's window drops to zero; the keepalive timer detects dead peers; the TIME_WAIT timer enforces port reuse safety; and the delayed ACK timer coalesces acknowledgements. Each timer has distinct failure modes and tuning implications.

Prerequisites

TCP state machine (see 01-tcp-state-machine.md)
TCP congestion control and RTO basics (see 02-congestion-control.md)
Linux socket options (SO_KEEPALIVE, TCP_NODELAY)
Familiarity with ss, sysctl, tcpdump

Core Content

Timer Taxonomy

TCP Connection Timers
======================

  Data in flight ─────────────────────────────────────────────────
  (unACKed segments)           Retransmission Timer
                                 |
                                 | fires if no ACK within RTO
                                 v
                              Retransmit, double RTO (backoff)

  rwnd == 0 ───────────────────────────────────────────────────────
  (receiver full)              Persist Timer
                                 |
                                 | fires periodically to probe window
                                 v
                              Send window probe (1-byte segment)

  Connection idle ─────────────────────────────────────────────────
  (no data exchange)           Keepalive Timer
                                 |
                                 | fires after tcp_keepalive_time idle
                                 v
                              Send keepalive probe (empty ACK-like)
                              Repeat tcp_keepalive_probes times
                              If no response: close connection (ETIMEDOUT)

  TIME_WAIT state ─────────────────────────────────────────────────
                               TIME_WAIT Timer
                                 |
                                 | fires after 2*MSL (fixed at 60s on Linux)
                                 v
                              Release port, free resources

  Data segment received ────────────────────────────────────────────
  (ACK not yet sent)           Delayed ACK Timer
                                 |
                                 | fires after 40–200ms
                                 v
                              Send ACK (coalescing multiple segments)

Retransmission Timeout (RTO)

The retransmission timer is started when a segment is sent and an ACK is expected. If no ACK arrives before the timer fires, the segment is retransmitted.

Computing RTO: Jacobson's Algorithm (RFC 6298)

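Van Jacobson's 1988 paper added deviation tracking to smoothed RTT estimation:

On each ACK with a new RTT sample R (RTTVAR is updated first, using the previous SRTT):

  RTTVAR = β * RTTVAR + (1 - β) * |SRTT - R|  /* mean deviation, β = 0.75 */
  SRTT = α * SRTT + (1 - α) * R      /* smoothed RTT, α = 0.875 */
  RTO = SRTT + max(G, 4 * RTTVAR)    /* G = clock granularity (1ms) */

On the first sample: SRTT = R, RTTVAR = R / 2.
RFC 6298 writes the same updates with α = 1/8 and β = 1/4 as the weights on the new sample.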

Bounds:
  RTO_min = 200ms (Linux default)
  RTO_max = 120s (TCP_RTO_MAX; a compile-time constant on older kernels, tunable via net.ipv4.tcp_rto_max_ms on recent ones)
  Initial RTO = 1s (RFC 6298) or 3s (older implementations following RFC 2988)

The key insight in Jacobson's algorithm: using RTT variance (not just mean) to set RTO. If RTTs are stable (variance low), RTO is tight (SRTT + small margin). If RTTs fluctuate (high-variance satellite link), RTO is wider.
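The update rules above can be sketched in a few lines of Python. This is an illustrative model, not kernel code: the class name RttEstimator and its fields are made up for the example, the constants mirror the values quoted in this section, and G = 1ms is an assumed clock granularity.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 0.875      # weight on the old SRTT (this document's convention)
BETA = 0.75        # weight on the old RTTVAR
G = 0.001          # assumed clock granularity: 1 ms
RTO_MIN = 0.200    # Linux lower bound quoted above
RTO_MAX = 120.0    # upper bound quoted above


@dataclass
class RttEstimator:
    """Hypothetical helper that tracks SRTT, RTTVAR and RTO in seconds."""
    srtt: Optional[float] = None
    rttvar: float = 0.0
    rto: float = 1.0   # initial RTO per RFC 6298

    def sample(self, r: float) -> float:
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the SRTT from before this sample.
            self.rttvar = BETA * self.rttvar + (1 - BETA) * abs(self.srtt - r)
            self.srtt = ALPHA * self.srtt + (1 - ALPHA) * r
        raw = self.srtt + max(G, 4 * self.rttvar)
        self.rto = min(RTO_MAX, max(RTO_MIN, raw))
        return self.rto
```

Feeding a steady stream of 100ms samples drives RTTVAR toward zero until the 200ms floor takes over, while samples alternating between 50ms and 150ms keep the RTO near 300ms.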

Linux implements this in net/ipv4/tcp_input.c:tcp_rtt_estimator():

/* Simplified kernel implementation */
static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
{
    struct tcp_sock *tp = tcp_sk(sk);
    long m = mrtt_us;   /* new RTT sample */

    if (m == 0) m = 1;

    if (tp->srtt_us != 0) {
        m -= (tp->srtt_us >> 3);   /* m = R - SRTT */
        tp->srtt_us += m;          /* SRTT += (R - SRTT) / 8 */
        if (m < 0) m = -m;
        m -= (tp->mdev_us >> 2);
        tp->mdev_us += m;          /* RTTVAR update */
    } else {
        /* First measurement */
        tp->srtt_us = m << 3;
        tp->mdev_us = m << 1;
    }

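    /* srtt_us holds 8 * SRTT and mdev_us holds 4 * mean deviation, so
     * RTO = SRTT + 4 * RTTVAR reduces to (srtt_us >> 3) + mdev_us.
     * The real kernel does this in tcp_set_rto() using icsk_rto and rttvar_us. */
    tp->rto = usecs_to_jiffies((tp->srtt_us >> 3) + tp->mdev_us);
    /* Clamp to [TCP_RTO_MIN, TCP_RTO_MAX] */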
}

Exponential Backoff

On each retransmission, RTO is doubled (binary exponential backoff):

RTO after 1st retransmit: RTO * 2
RTO after 2nd retransmit: RTO * 4
RTO after 3rd retransmit: RTO * 8
...
Maximum: TCP_RTO_MAX (default 120s)

This prevents retransmit storms and gives the network time to recover.

Retransmit limits:

# tcp_retries1: number of retransmits before informing network layer
# (triggers IP route cache refresh, PMTUD reset)
sysctl net.ipv4.tcp_retries1   # default 3

# tcp_retries2: number of retransmits before giving up the connection
# Timeout ≈ sum of RTO * 2^i for i = 0..retries2, with each term capped at
# TCP_RTO_MAX (120s); before the cap is reached this is (2^(retries2+1) - 1) * RTO
sysctl net.ipv4.tcp_retries2   # default 15 = ~924 seconds at the 200ms minimum RTO

For a timeout of roughly 13 seconds (useful for microservices where dead connections should fail fast):

sysctl -w net.ipv4.tcp_retries2=5
# (2^6 - 1) * 200ms = 12.6 seconds total retransmit time at the 200ms minimum RTO
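The arithmetic behind these tcp_retries2 figures can be checked with a short sketch. The function name give_up_after is hypothetical, and the model assumes the connection's RTO sits at the 200ms minimum; a connection with a larger measured RTO gives up later.

```python
def give_up_after(retries: int, rto: float = 0.2, rto_max: float = 120.0) -> float:
    """Seconds from first transmission until the connection is abandoned.

    The sender waits the initial RTO, then `retries` further intervals,
    each double the previous one and capped at `rto_max`.
    """
    return sum(min(rto * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    for n in (5, 8, 15):
        print(f"tcp_retries2={n}: about {give_up_after(n):.1f}s")
```

With the defaults, the first ten intervals grow from 0.2s to 102.4s and the remaining six are pinned at 120s, which is where the 924-second figure comes from.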

Persist Timer

The persist timer prevents deadlock when the receiver's advertised window drops to zero (receiver buffer full). The receiver sends a window update once the application drains the buffer, but that update is a pure ACK and is not retransmitted if it is lost. Without a persist timer, a lost update leaves both sides stuck:
- Sender cannot send (rwnd = 0)
- Receiver has already announced the open window and has nothing further to send
- Neither side makes progress → permanent deadlock

The persist timer fires periodically and sends a window probe: classically a 1-byte segment beyond the current window. The receiver answers with an ACK carrying its current rwnd, even if the window is still 0. If the window has opened, the sender resumes.

Sender                          Receiver
  | data (window = 1000)          |
  +------------------------------>|
  | data (window = 500)           |
  +------------------------------>|
  |<------------------------------ rwnd=0 (buffer full)
  | <persist timer starts>        |
  |                               | application reads 100 bytes
  | window probe (1 byte)         |
  +------------------------------>|
  |<------------------------------ ack, rwnd=100
  | send data (up to 100 bytes)   |
  +------------------------------>|

The persist timer uses the same backoff as RTO, but does not give up while the peer keeps acknowledging probes (unlike the retransmit timer). A zero window can persist indefinitely if the receiver never reads its buffer.
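To get a feel for how long a sender can stall after a window update is lost, the probe schedule can be modelled as below. This is a rough sketch: the function names are hypothetical, the first interval is assumed to equal the 200ms minimum RTO, and times are integer milliseconds to keep the arithmetic exact.

```python
def probe_times_ms(first_interval_ms: int = 200, max_interval_ms: int = 120_000):
    """Yield the times (ms after the zero-window ACK) at which probes are sent."""
    t, interval = 0, first_interval_ms
    while True:
        t += interval
        yield t
        # Same exponential backoff as the RTO, capped at the maximum interval.
        interval = min(interval * 2, max_interval_ms)


def resume_time_ms(window_opens_at_ms: int, **schedule) -> int:
    """Time of the first probe sent at or after the receiver's window reopens.

    Models the worst case where the window update is lost, so the sender
    learns about the open window only from the ACK to a probe.
    """
    for t in probe_times_ms(**schedule):
        if t >= window_opens_at_ms:
            return t
```

Under these assumptions probes go out at 200ms, 600ms, 1400ms, and so on, so a window that reopens after 1 second is noticed at 1.4 seconds.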

Diagnose zero-window conditions:

# tcpdump: look for "win 0" in output (-l keeps the pipe line-buffered)
tcpdump -lni eth0 -v 'tcp' | grep -w 'win 0'

# ss: a large Recv-Q on the receiving side means the application is not
# draining its buffer, which is what leads to a zero window
ss -tni dst :5432

# bpftrace: trace zero-window ACK events
bpftrace -e 'kprobe:tcp_send_ack { $sk = (struct sock *)arg0;
    $tp = (struct tcp_sock *)$sk;
    if ($tp->rcv_wnd == 0) { printf("zero window: %s\n", comm); }
}'

TCP Keepalive Timer

TCP keepalives detect half-open connections — connections where one end has lost state (crashed, NAT timeout, firewall RST) but the other end remains in ESTABLISHED state, unaware of the failure.

Without keepalive, a server holding a connection to a crashed client will wait indefinitely for data. The socket occupies resources forever.

Keepalive parameters (Linux defaults):

tcp_keepalive_time = 7200s (2 hours idle before first probe)
tcp_keepalive_intvl = 75s  (interval between probes)
tcp_keepalive_probes = 9   (unacknowledged probes before ETIMEDOUT)

Total dead-connection detection time = 7200 + 75*9 = 7875 seconds (~2.2 hours)

This default is absurdly long for microservices. Tune per-socket or system-wide:

# System-wide (affects only sockets that enable SO_KEEPALIVE)
sysctl -w net.ipv4.tcp_keepalive_time=30
sysctl -w net.ipv4.tcp_keepalive_intvl=5
sysctl -w net.ipv4.tcp_keepalive_probes=3
# Now: 30 + 5*3 = 45 seconds to detect dead connection

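/* Per-socket (preferred: does not affect other applications).
 * Needs <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h>.
 * enable_keepalive is a hypothetical helper; returns 0, or -1 with errno set. */
static int enable_keepalive(int fd)
{
    int on = 1, idle = 30, intvl = 5, count = 3;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}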
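The same per-socket tuning, together with the detection-time arithmetic used above, might look like this in Python. It is a sketch that assumes Linux, where the socket module exposes TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; the two function names are hypothetical.

```python
import socket


def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Worst-case time to declare an idle, unreachable peer dead."""
    return idle + interval * probes


def enable_keepalive(sock: socket.socket, idle: int = 30,
                     interval: int = 5, probes: int = 3) -> int:
    """Apply per-socket keepalive settings and return the detection time."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return keepalive_detection_seconds(idle, interval, probes)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("dead peer detected after", enable_keepalive(s), "seconds")
    print("kernel defaults:", keepalive_detection_seconds(7200, 75, 9), "seconds")
```

Returning the detection time from the setter keeps the configured values and the resulting timeout in one place, which is convenient for logging at connection setup.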

Keepalive vs application-level heartbeat: TCP keepalive operates at the TCP layer and is answered by the peer's kernel, so it only detects dead TCP connections. It does NOT detect:
- Application failures that leave the socket open while the kernel keeps answering probes (for example a process stopped with SIGSTOP)
- Proxy/load balancer failures beyond the first TCP hop (keepalive only checks the directly connected endpoint)
- Application deadlocks (the TCP stack is alive, the app isn't processing)

For these cases, application-level heartbeats or timeouts (HTTP/2 PING frames, gRPC keepalive pings, PostgreSQL's idle_in_transaction_session_timeout) are required.

Delayed ACK Timer

TCP delayed ACK (RFC 1122) defers sending an ACK, hoping to piggyback it on data going the other direction or to cover two segments with one ACK. RFC 1122 requires the delay to stay under 500ms and an ACK for at least every second full-sized segment, which roughly halves the ACK count on bulk transfers. When only one segment arrives and there is nothing to send back, the timer decides:

Sender                    Receiver
  | segment 1               |
  +------------------------>| <delayed ACK timer starts: 40–200ms>
  |                         | no second segment, no outgoing data
  |                         | timer fires
  |                ACK 1 <--|

On Linux, the delayed ACK timeout starts at 40ms (TCP_DELACK_MIN = HZ/25) and adapts up to a maximum of 200ms (TCP_DELACK_MAX = HZ/5). The implementation is in net/ipv4/tcp_input.c:__tcp_ack_snd_check().
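The receiver-side decision can be modelled with a small sketch. It is a simplification under stated assumptions: every segment is full-sized, the receiver never has data to piggyback on, the delay is fixed at 40ms rather than adaptive, and the function name is hypothetical. Times are integer milliseconds.

```python
from typing import Iterable, List, Optional


def delayed_ack_times(arrivals_ms: Iterable[int], delay_ms: int = 40) -> List[int]:
    """Return the times at which ACKs are sent for the given segment arrivals."""
    acks: List[int] = []
    deadline: Optional[int] = None      # set while one segment is unacknowledged
    for t in sorted(arrivals_ms):
        if deadline is not None and t >= deadline:
            acks.append(deadline)       # timer fired before this segment arrived
            deadline = None
        if deadline is None:
            deadline = t + delay_ms     # first unacknowledged segment: start timer
        else:
            acks.append(t)              # second segment: acknowledge immediately
            deadline = None
    if deadline is not None:
        acks.append(deadline)           # last segment is covered by the timer
    return acks
```

Two segments 10ms apart produce a single immediate ACK at 10ms, whereas a lone segment is acknowledged only when the 40ms timer fires.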

The Nagle-Delayed ACK interaction (the most common latency bug):

Client sends a request as two writes: write(header) then write(body) with Nagle enabled
The header goes out immediately; Nagle holds the body (small segment, header not yet ACKed)
Server's delayed ACK timer holds the ACK for the header, because the server has no response to piggyback it on until the full request arrives
Stall: Nagle waits for the ACK, delayed ACK waits for more data or its timer → 40–200ms added latency

Fix options:
- Client: send header and body in a single write (or writev()) so no small segment is left waiting
- Client: TCP_NODELAY to disable Nagle
- Client: TCP_CORK to batch the two writes into one send, then uncork
- Server: disable delayed ACK with TCP_QUICKACK: setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)), where one is an int set to 1. Note that TCP_QUICKACK is not persistent; it must be set again after each recv()
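A client-side fix could be sketched as follows: disable Nagle and hand the kernel the header and body in one call, so a lone small header is never left waiting for an ACK. The function name send_request is hypothetical, and joining the two byte strings is just one way to get a single write.

```python
import socket


def send_request(sock: socket.socket, header: bytes, body: bytes) -> None:
    """Send header and body without leaving a small write for Nagle to hold."""
    # Disable Nagle so small segments are not held back waiting for an ACK.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # One call for both parts: the kernel sees a single contiguous payload.
    sock.sendall(header + body)
```

Either measure alone usually removes the stall; combining them also covers later code paths that write small messages on the same socket.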

Observe the interaction:

tcpdump -ni lo -v 'tcp' | grep -E '(Flags|seq|ack|length)'
# Look for a 40–200ms gap between the first small segment and its ACK (Nagle + delayed ACK)

Timer Implementation in Kernel

Linux drives the retransmission, delayed ACK, and keepalive timers from the jiffies-based timer wheel (struct timer_list). High-resolution timers (hrtimers) are used where sub-jiffy precision matters, such as packet pacing.

/* Simplified view of timer-related state. The RTT fields and pacing_timer are in
 * struct tcp_sock (include/linux/tcp.h); the RTO and the retransmit timer are in
 * struct inet_connection_sock as icsk_rto and icsk_retransmit_timer. */
struct tcp_sock {
    /* ... */
    u32         srtt_us;      /* smoothed round trip time << 3 in usecs */
    u32         mdev_us;      /* medium deviation */
    u32         mdev_max_us;  /* maximal mdev for the last rtt period */
    u32         rttvar_us;    /* smoothed mdev_max */
    u32         rtt_seq;      /* sequence number to update rttvar */

    u32         rto;          /* retransmit timeout in jiffies (icsk_rto) */
    u32         rto_min;      /* minimum retransmit timeout, jiffies */
    /* ... */
    struct hrtimer  pacing_timer;    /* for TCP pacing */
    struct timer_list retransmit_timer;  /* timer wheel (icsk_retransmit_timer) */
};

Tracing timers:

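# Count retransmission timer expiries (the handler runs in softirq context,
# so comm is whichever task happened to be on the CPU)
bpftrace -e 'kprobe:tcp_retransmit_timer { @[comm] = count(); }'

# Monitor RTO estimation per connection (needs BTF; tcp_rtt_estimator is a
# static function, so the probe is unavailable if the compiler inlined it)
bpftrace -e 'kprobe:tcp_rtt_estimator {
    $tp = (struct tcp_sock *)arg0;
    $icsk = (struct inet_connection_sock *)arg0;
    printf("srtt_us=%u rto_jiffies=%u\n", $tp->srtt_us >> 3, $icsk->icsk_rto);
}'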

Historical Context

Van Jacobson's 1988 paper added the mean-deviation term to the smoothed RTT estimate and made the case for exponential backoff of the retransmit timer. Earlier implementations following RFC 793 set the RTO to a fixed multiple of SRTT, which reacted poorly to rising delay variance and contributed to retransmit storms when the network became congested. Karn's algorithm (Phil Karn, 1987) contributed the rule that RTT samples must not be taken from retransmitted segments (since you can't tell if the ACK is for the original or the retransmit).

RFC 6298 (2011) obsoleted RFC 2988 (2000). Its main change was lowering the initial RTO from 3 seconds to 1 second.

The TCP keepalive mechanism was added to BSD Unix in the early 1980s, primarily to prevent idle connections from holding server resources across modem disconnections. The 2-hour default dates to the modem era, when a dropped call was expected to self-recover within 2 hours, and RFC 1122 later required the default idle interval to be no less than two hours.

Production Examples

Database connection pool tuning (PostgreSQL):

# postgresql.conf
tcp_keepalives_idle = 30         # seconds before first probe
tcp_keepalives_interval = 5      # seconds between probes
tcp_keepalives_count = 3         # max unacknowledged probes
# Detects dead client in 45 seconds (vs 2.2 hours default)

Nginx upstream connection timeouts:

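# nginx.conf (HTTP-level timeouts; these are separate from TCP keepalive probes)
upstream backend {
    server 10.0.0.1:8080;
    keepalive 16;            # enable the idle connection cache (size is an example value)
    keepalive_timeout 65;    # close cached idle upstream connections after 65s
}
proxy_connect_timeout 5s;   # application-level connect timeout (independent of TCP's SYN retransmit timer)
proxy_read_timeout 60s;     # upstream response timeout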

Debugging Notes

# Check current RTO per connection (in milliseconds)
ss -ti 'dst :443' | grep rto

# Observe retransmit events
ss -ti | grep retrans

# Watch for delayed ACK interactions (check for 40–200ms gaps)
tcpdump -ni eth0 -tt 'tcp and port 5432' | awk '{print $1, $NF}' | head -50

# Monitor keepalive timer expiries (tcp_send_probe0 is the zero-window probe
# path, not keepalive; this assumes tcp_keepalive_timer is probeable on your kernel)
bpftrace -e 'kprobe:tcp_keepalive_timer { @expiries = count(); }'

# Check RTO bounds: the floor is TCP_RTO_MIN (200ms) unless a route overrides it
ip route show | grep rto_min
# ip route change 10.0.0.0/24 via 10.0.0.1 rto_min 5ms  # per-route minimum

# Count zero-window probes sent by the persist timer
bpftrace -e 'kprobe:tcp_send_probe0 { @probes = count(); }'

# Full connection timer state
cat /proc/net/tcp  # hex state dump including timer info
ss -tio state established | head -20   # -o adds the timer:(...) column
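When scanning many connections, the rto and rtt fields printed by ss -ti can be compared programmatically. The sketch below assumes the info line contains fields of the form rto:<ms> and rtt:<ms>/<var>; the exact layout varies between iproute2 versions, and the function name and the sample line in the usage example are illustrative, not captured output.

```python
import re
from typing import Optional

_RTO = re.compile(r"\brto:(\d+(?:\.\d+)?)")
_RTT = re.compile(r"\brtt:(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")


def rto_to_rtt_ratio(info_line: str) -> Optional[float]:
    """Return rto / rtt for one ss info line, or None if a field is missing."""
    rto = _RTO.search(info_line)
    rtt = _RTT.search(info_line)
    if not rto or not rtt or float(rtt.group(1)) == 0:
        return None
    return float(rto.group(1)) / float(rtt.group(1))


if __name__ == "__main__":
    # Illustrative line only; real output carries many more fields.
    sample = "cubic wscale:7,7 rto:204 rtt:4/2 mss:1448 cwnd:10"
    print(rto_to_rtt_ratio(sample))
```

A large ratio is not automatically a problem: on low-latency links the 200ms floor dominates, so the ratio is mostly useful for comparing connections that share a path.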

Security Implications

Keepalive as DoS vector: an attacker can hold millions of TCP connections open with minimal data, consuming memory. Each ESTABLISHED socket is ~2KB. Without keepalive or idle timeouts, connections persist until application timeout. Implement both SO_KEEPALIVE and application-level idle timeouts.
RTO manipulation: a path attacker who can delay packets can influence RTT samples, inflating RTO. An inflated RTO means longer wait before retransmit — reducing throughput. Karn's algorithm mitigates this for retransmits but not initial measurements.
Persist timer and DoS: an attacker can advertise zero window, forcing the server to continuously send persist probes — keeping the connection alive indefinitely and consuming server resources. Mitigation: implement application-level read timeouts independent of TCP.
Delayed ACK and timing attacks: the predictable 200ms delayed ACK timer can be used to time side-channel attacks measuring server processing time. Applications with security sensitivity should disable delayed ACK.

Performance Implications

Setting	Default	Typical production value	Reason
`tcp_keepalive_time`	7200s	30–120s	Fast dead-connection detection
`tcp_keepalive_intvl`	75s	5–10s	Shorter probe interval
`tcp_keepalive_probes`	9	3	Faster conclusive failure
`tcp_retries2`	15	5–8	Faster connection abandonment
`tcp_fin_timeout`	60s	15–30s	Faster cleanup of orphaned FIN_WAIT_2 sockets (TIME_WAIT itself is fixed at 60s)

Delayed ACK can have a significant performance impact on request-response protocols when a client splits a request across several small writes with Nagle enabled:
- HTTP/1.1 keep-alive without pipelining: a request written as separate header and body segments can wait on the delayed ACK
- Redis commands: clients that send a command in multiple writes hit the Nagle/delayed ACK interaction
- PostgreSQL queries: a client that writes a query in several small sends without TCP_NODELAY or TCP_CORK hits the same issue

Fix: TCP_NODELAY on the client side. This is the standard recommendation for database clients, Redis clients, and any RPC framework.

Failure Modes and Real Incidents

Incident: Delayed ACK causing 200ms latency spikes in Redis (2013, common pattern)
Redis clients (Python redis-py, Java Jedis) send multi-part commands in two writes: command length + command body. Server-side delayed ACK held the ACK for the first write. Nagle held the second write on client. 200ms stall per command. Fix: TCP_NODELAY in redis-py (added in 2.9.0). Detected by tcpdump showing 200ms gaps.

Incident: RTO inflation on GCP (2019, latency anomaly)
A Google Cloud internal network upgrade temporarily increased RTT variance. tcp_rtt_estimator computed inflated RTTVAR, setting RTO to 3x the actual RTT. Every retransmit waited 3x longer than necessary, causing cascading timeouts. Diagnosed via ss -ti showing RTO values far above measured RTT. Fix: Google tuned tcp_rto_min to 5ms via per-route routing attributes.

Failure Mode: Persist timer exhaustion
A slow consumer application processes messages at 100/s but receives at 10,000/s. Receive buffer fills → zero window. Server persist timer fires every few seconds. Server holds 1 million connections, so 1M persist timers fire every few seconds → timer storm. Fix: backpressure (flow control), increase receiver buffer, or TCP_NOTSENT_LOWAT to limit sender-side buffering.

Modern Usage

TCP_USER_TIMEOUT (Linux 2.6.37): socket option that sets the maximum time transmitted data may remain unacknowledged before the connection is closed. It is a time-based bound that does not depend on the number of retransmits. PostgreSQL exposes it as the tcp_user_timeout setting, and gRPC sets it alongside its keepalive pings.
SO_BUSY_POLL (Linux 3.11): busy-poll on socket receive, cutting receive wakeup latency for latency-sensitive applications at the cost of CPU. It does not change delayed ACK behavior.
net.ipv4.tcp_min_rtt_wlen: window length in seconds for minimum RTT tracking (used by BBR). Reducing this makes RTT estimates more responsive to network changes.
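Setting TCP_USER_TIMEOUT from application code is a one-line setsockopt. The sketch below assumes Linux and a Python build that exposes socket.TCP_USER_TIMEOUT; the helper name is hypothetical, and the value is in milliseconds.

```python
import socket


def set_user_timeout(sock: socket.socket, milliseconds: int) -> int:
    """Set TCP_USER_TIMEOUT and return the value the kernel reports back."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, milliseconds)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Fail the connection if sent data stays unacknowledged for 5 seconds.
        print(set_user_timeout(s, 5000))
```

Reading the option back is a cheap way to confirm the kernel accepted the value before relying on it for failure detection.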

Future Directions

Hardware timer offload: SmartNICs are beginning to implement TCP keepalive probing in NIC firmware, freeing host CPU from keepalive overhead at scale (millions of connections)
QUIC timers: QUIC's loss detection (RFC 9002) combines packet-number and time thresholds with a probe timeout (PTO) in place of TCP's RTO. Because packet numbers are never reused, RTT samples from retransmitted data are unambiguous. TCP has adopted similar ideas through RACK-TLP (RFC 8985)
Adaptive RTO: machine learning-based RTO prediction (estimating RTT from packet-level patterns) could reduce false retransmits on high-variance paths — an active research area

Exercises

Measure the Nagle-delayed ACK interaction on your local system. Write a client that sends a 1-byte header and 1023-byte body in two separate write() calls to a server with SO_RCVBUF=65536. Use tcpdump with timestamps (-tt) to measure the delay between the two writes and the server ACK. Then enable TCP_NODELAY and measure again.
Reproduce RTO exponential backoff: on a tc netem link with 100% packet loss (tc qdisc add dev lo root netem loss 100%), initiate a TCP connection and observe retransmit timer doubling via bpftrace. Capture the RTO sequence and verify it matches the binary exponential pattern.
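Tune TCP keepalive for a database connection pool. Set TCP_KEEPIDLE=10, TCP_KEEPINTVL=2, TCP_KEEPCNT=3. Then simulate a dead server by silently dropping its packets (for example with an iptables DROP rule on the server port). Measure how long the client takes to detect the failure and return an error. Repeat with the server process suspended via SIGSTOP and explain why keepalive does not detect that case.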
Instrument the persist timer using bpftrace (kprobe:tcp_probe_timer). Create a zero-window condition by writing to a socket whose read side is suspended. Observe persist timer firing and measure the backoff interval between probes.
Compare TCP_USER_TIMEOUT vs tcp_retries2 for dead-connection detection. Set TCP_USER_TIMEOUT=5000 (5 seconds) on one connection and tcp_retries2=5 system-wide. Simulate network failure with iptables -A OUTPUT -j DROP. Which method detects failure faster and more consistently? Why?

References

RFC 6298 — Computing TCP's Retransmission Timer (Paxson, Allman, et al., 2011)
RFC 1122 — Requirements for Internet Hosts: delayed ACK specification
RFC 9293 — Transmission Control Protocol (2022 update of RFC 793)
Jacobson, V. & Karels, M. Congestion Avoidance and Control, SIGCOMM 1988. (SRTT algorithm)
Karn, P. & Partridge, C. Improving Round-Trip Time Estimates in Reliable Transport Protocols, SIGCOMM 1987.
net/ipv4/tcp_input.c — tcp_rtt_estimator(), __tcp_ack_snd_check()
net/ipv4/tcp_timer.c — retransmit, persist, keepalive timer implementations
include/net/tcp.h — timer constants, TCP_RTO_MIN, TCP_RTO_MAX
man 7 tcp — TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT, TCP_USER_TIMEOUT
Kurose, J.F. & Ross, K.W. Computer Networking: A Top-Down Approach. Chapter 3 (TCP timers).