01 — TCP State Machine

Technical Overview

TCP is a stateful protocol: every connection progresses through a defined set of states from creation to destruction. The state machine governs which packets are valid at each point, what actions the kernel takes on receipt, and how resources are allocated and released. Misunderstanding the state machine leads to TIME_WAIT exhaustion, port reuse bugs, half-open connection accumulation, and SYN flood vulnerability. For anyone operating production TCP services, the state machine is fundamental knowledge.


Prerequisites

  • TCP/IP fundamentals: SYN, ACK, FIN, RST flags, sequence numbers
  • Basic socket API: connect(), accept(), close() (see 02-sockets.md)
  • Linux kernel socket buffer concepts
  • Familiarity with ss, netstat, tcpdump

Core Content

The 11 TCP States

TCP State Machine (RFC 793 + RFC 1122 clarifications)
======================================================

                      CLOSED
                        |
                 [server: listen()]
                        |
                        v
                      LISTEN
                        |
               [SYN received from client]
                        |
                        v
                    SYN_RCVD <---------------------------------+
                        |                                      |
              [ACK received (3-way complete)]     [simultaneous open: SYN]
                        |                                      |
                        v                                      |
                   ESTABLISHED <----- [SYN_SENT + SYN rcvd] --+
                    /      \
     [active close]          [passive close]
     close() / FIN sent      FIN received
          |                        |
          v                        v
      FIN_WAIT_1              CLOSE_WAIT
          |                        |
    [ACK received]           [close() called]
          |                        |
          v                        v
      FIN_WAIT_2               LAST_ACK
          |                        |
    [FIN received]           [ACK received]
          |                        |
          v                        v
       TIME_WAIT                CLOSED
          |
    [2*MSL timer expires]
          |
          v
        CLOSED


Client (active open)          Server (passive open)
====================          ====================

CLOSED                        CLOSED
  |  send SYN                   |
  |  seq=x                      | listen()
  v                             v
SYN_SENT                      LISTEN
  |                              |  recv SYN
  |                              |  send SYN-ACK
  |  recv SYN-ACK                |  seq=y, ack=x+1
  |  send ACK                    v
  |  ack=y+1                  SYN_RCVD
  v                              |  recv ACK
ESTABLISHED <-------------------> ESTABLISHED
  |                              |
  |  (data transfer)             |
  |                              |
  |  send FIN                    |
  v                              v
FIN_WAIT_1                    CLOSE_WAIT
  |  recv ACK                    |  (application reads remaining data)
  v                              |  close()
FIN_WAIT_2                       |  send FIN
  |  recv FIN                    v
  |  send ACK                 LAST_ACK
  v                              |  recv ACK
TIME_WAIT                        v
  |  (2*MSL = 60-120s)         CLOSED
  v
CLOSED
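
The transitions drawn above, plus the CLOSING and SYN_SENT paths that the first diagram leaves out, can be sketched as a lookup function. This is a simplified illustration, not kernel code: the ST_ and EV_ names are hypothetical, and RST handling, timeouts other than 2*MSL, and close() from LISTEN, SYN_SENT, or SYN_RCVD are deliberately omitted.

```c
#include <assert.h>

/* The 11 states from RFC 793 (ST_ prefix added for this sketch). */
enum tcp_state {
    ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RCVD, ST_ESTABLISHED,
    ST_FIN_WAIT_1, ST_FIN_WAIT_2, ST_CLOSING, ST_TIME_WAIT,
    ST_CLOSE_WAIT, ST_LAST_ACK
};

/* Hypothetical event names; these are not kernel identifiers. */
enum tcp_event {
    EV_LISTEN,         /* application calls listen()             */
    EV_CONNECT,        /* application calls connect(), SYN sent  */
    EV_RECV_SYN,
    EV_RECV_SYN_ACK,
    EV_RECV_ACK,       /* ACK of our SYN or of our FIN           */
    EV_RECV_FIN,
    EV_CLOSE,          /* application calls close(), FIN sent    */
    EV_TIMEOUT_2MSL
};

/* Return the next state, or the unchanged state if the event is not
 * handled in this simplified model. */
enum tcp_state tcp_next_state(enum tcp_state s, enum tcp_event e)
{
    switch (s) {
    case ST_CLOSED:
        if (e == EV_LISTEN)  return ST_LISTEN;
        if (e == EV_CONNECT) return ST_SYN_SENT;
        break;
    case ST_LISTEN:
        if (e == EV_RECV_SYN) return ST_SYN_RCVD;
        break;
    case ST_SYN_SENT:
        if (e == EV_RECV_SYN_ACK) return ST_ESTABLISHED;
        if (e == EV_RECV_SYN)     return ST_SYN_RCVD;   /* simultaneous open */
        break;
    case ST_SYN_RCVD:
        if (e == EV_RECV_ACK) return ST_ESTABLISHED;
        break;
    case ST_ESTABLISHED:
        if (e == EV_CLOSE)    return ST_FIN_WAIT_1;     /* active close  */
        if (e == EV_RECV_FIN) return ST_CLOSE_WAIT;     /* passive close */
        break;
    case ST_FIN_WAIT_1:
        if (e == EV_RECV_ACK) return ST_FIN_WAIT_2;
        if (e == EV_RECV_FIN) return ST_CLOSING;        /* FINs crossed  */
        break;
    case ST_FIN_WAIT_2:
        if (e == EV_RECV_FIN) return ST_TIME_WAIT;
        break;
    case ST_CLOSING:
        if (e == EV_RECV_ACK) return ST_TIME_WAIT;
        break;
    case ST_TIME_WAIT:
        if (e == EV_TIMEOUT_2MSL) return ST_CLOSED;
        break;
    case ST_CLOSE_WAIT:
        if (e == EV_CLOSE) return ST_LAST_ACK;
        break;
    case ST_LAST_ACK:
        if (e == EV_RECV_ACK) return ST_CLOSED;
        break;
    }
    return s;
}
```

Feeding it EV_CONNECT, EV_RECV_SYN_ACK, EV_CLOSE, EV_RECV_ACK, EV_RECV_FIN, EV_TIMEOUT_2MSL in order walks the client column of the second diagram from CLOSED back to CLOSED.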

Three-Way Handshake

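The three-way handshake establishes:

  1. The initial sequence numbers (ISN) for both directions
  2. That both endpoints can send and receive
  3. Synchronized sequence state on both sides (each peer acknowledges the other's ISN), along with options such as MSS, window scaling, and SACK that are negotiated in the SYN segments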

Client                              Server
  |  SYN (seq=1000)                  |
  +--------------------------------->|
  |                                  |  SYN_RCVD
  |  SYN-ACK (seq=5000, ack=1001)    |
  |<---------------------------------+
  |  ESTABLISHED                     |
  |  ACK (seq=1001, ack=5001)        |
  +--------------------------------->|
                                     |  ESTABLISHED

The ISN must be pseudo-random to prevent sequence number prediction attacks. Linux uses a cryptographic hash of the 4-tuple + secret + time to generate ISNs.
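
Because the ISN is effectively random, it can sit right at the top of the 32-bit range, so every "+1" in the handshake is arithmetic modulo 2^32. The following minimal sketch checks the numbers of the three handshake segments; struct tcp_seg and handshake_numbers_ok are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the two fields this check needs; not a real TCP header. */
struct tcp_seg {
    uint32_t seq;
    uint32_t ack;
};

/* Verify the sequence/ack numbers of SYN, SYN-ACK, and the final ACK.
 * uint32_t arithmetic wraps modulo 2^32, matching TCP sequence space. */
bool handshake_numbers_ok(struct tcp_seg syn, struct tcp_seg syn_ack,
                          struct tcp_seg ack)
{
    uint32_t client_next = syn.seq + 1u;      /* a SYN consumes one sequence number */
    uint32_t server_next = syn_ack.seq + 1u;

    return syn_ack.ack == client_next
        && ack.seq == client_next
        && ack.ack == server_next;
}
```

With the values from the diagram (SYN seq=1000, SYN-ACK seq=5000 ack=1001, ACK seq=1001 ack=5001) the check passes; it also passes for a client ISN of 0xFFFFFFFF, where the acknowledged value wraps to 0.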

SYN queue: when the server receives a SYN, it adds an entry to the SYN queue (incomplete connections, SYN_RCVD state) before sending SYN-ACK. The completed connection (after final ACK) moves to the accept queue (complete, awaiting accept()).

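# Monitor accept queue depth on listening sockets
ss -lnt
# Recv-Q: current accept queue depth
# Send-Q: accept queue max (listen backlog)

# If Recv-Q stays near Send-Q: accept() is too slow or the server is under attack

# SYN queue depth is not shown by ss -lnt; count sockets in SYN_RCVD instead
ss -Htan state syn-recv | wc -l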
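
For scripted monitoring, the Recv-Q/Send-Q comparison can be turned into a fill percentage. The sketch below parses one data row of `ss -lnt` output, assuming the usual column order (State, Recv-Q, Send-Q, ...); parse_listen_row and accept_queue_fill_pct are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one row of `ss -lnt`. Returns 0 and fills recv_q/send_q for a
 * LISTEN row, or -1 for the header line and anything else. */
int parse_listen_row(const char *line, unsigned *recv_q, unsigned *send_q)
{
    char state[16];
    unsigned r, s;

    if (sscanf(line, "%15s %u %u", state, &r, &s) != 3)
        return -1;
    if (strcmp(state, "LISTEN") != 0)
        return -1;
    *recv_q = r;
    *send_q = s;
    return 0;
}

/* Accept queue fill as an integer percentage of the backlog
 * (rounded down; 0 when the backlog is reported as 0). */
unsigned accept_queue_fill_pct(unsigned recv_q, unsigned send_q)
{
    if (send_q == 0)
        return 0;
    return (unsigned)((100ull * recv_q) / send_q);
}
```

A row such as `LISTEN 120 128 0.0.0.0:443 0.0.0.0:*` yields a fill of 93%, which is the "Recv-Q near Send-Q" condition described above.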

Four-Way Teardown

TCP connections close independently in each direction (half-close). Each direction requires a FIN + ACK:

Active closer                       Passive closer
     |  FIN (seq=n)                      |
     +---------------------------------->|
 FIN_WAIT_1                          CLOSE_WAIT
     |  ACK (ack=n+1)                    |
     |<----------------------------------+
 FIN_WAIT_2                              |
     |                                   |
     |  (passive closer reads remaining data, calls close())
     |                                   |
     |  FIN (seq=m)                      |
     |<----------------------------------+
 TIME_WAIT                           LAST_ACK
     |  ACK (ack=m+1)                    |
     +---------------------------------->|
     |                                CLOSED
 (2*MSL timer expires)
 CLOSED

If both sides call close() simultaneously, the CLOSING state handles FIN-FIN crossing (a FIN received before the local FIN was ACKed).


TIME_WAIT: Purpose and Duration

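TIME_WAIT lasts 2*MSL (Maximum Segment Lifetime):

  • RFC 793 specifies MSL = 2 minutes → TIME_WAIT = 4 minutes
  • Linux: fixed at 60 seconds by the compile-time constant TCP_TIMEWAIT_LEN. It is not tunable via sysctl; net.ipv4.tcp_fin_timeout controls how long orphaned FIN_WAIT_2 sockets persist, not the TIME_WAIT duration
  • Common in practice: 60–120 seconds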

TIME_WAIT serves two purposes:

  1. Ensures the final ACK reaches the passive closer: if the final ACK is lost, the passive closer retransmits its FIN. The active closer (in TIME_WAIT) will re-send the ACK. Without TIME_WAIT, a new connection with the same 4-tuple could receive the retransmitted FIN.

  2. Prevents old duplicate segments: segments delayed in the network (up to 1 MSL) that arrive after the connection closes could corrupt a new connection with the same 4-tuple. TIME_WAIT ensures all old segments expire before allowing reuse.

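# Count TIME_WAIT sockets (-H suppresses the header line)
ss -Htan state time-wait | wc -l

# Distribution by remote IP (useful for identifying source)
# With a state filter the State column is omitted, so the peer address is field 4
ss -Htan state time-wait | awk '{sub(/:[0-9]+$/, "", $4); print $4}' | sort | uniq -c | sort -rn | head

# Kernel TIME_WAIT count for the current network namespace (the "tw" field)
grep '^TCP:' /proc/net/sockstat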

TIME_WAIT Exhaustion

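High-rate short-lived connections (microservices, REST APIs) create TIME_WAIT sockets faster than they expire. With a 60-second lifetime and at most 65,535 source ports, the upper bound is:

Max connections/second = 65535 / 60 = ~1092 connections/second per destination IP and port

In practice the ceiling is lower: the Linux default ephemeral range (net.ipv4.ip_local_port_range = 32768–60999) provides 28,232 ports, or about 470 connections/second to a single destination.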

This limit can be hit by a single aggressive service. Symptoms: connect() returns EADDRNOTAVAIL.
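
The arithmetic behind that limit is easy to parameterize. The sketch below computes the sustainable connection rate to one destination from a port range and a TIME_WAIT lifetime; max_conn_rate is a hypothetical helper, and the 32768–60999 range used in the usage note is the Linux default for net.ipv4.ip_local_port_range.

```c
#include <assert.h>

/* Sustainable new connections per second to a single destination
 * (one remote IP and port) before ephemeral ports run out.
 * Each port stays unusable for time_wait_s seconds after close. */
unsigned max_conn_rate(unsigned port_lo, unsigned port_hi, unsigned time_wait_s)
{
    assert(port_hi >= port_lo && time_wait_s > 0);
    unsigned ports = port_hi - port_lo + 1u;   /* inclusive range */
    return ports / time_wait_s;                /* rounded down */
}
```

max_conn_rate(1, 65535, 60) gives the theoretical 1092 per second, while max_conn_rate(32768, 60999, 60) gives 470 per second, which is the figure a stock Linux client actually hits.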

Mitigations:

SO_REUSEADDR: allows a server to bind() to a local port that still has connections in TIME_WAIT. It is not enabled by default; most server software sets it explicitly before bind(). Does NOT help client-side TIME_WAIT exhaustion.

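net.ipv4.tcp_tw_reuse=1: allows a client to reuse a TIME_WAIT socket for a new outgoing connection when TCP timestamps (RFC 7323, formerly RFC 1323) are in use and the new connection's timestamp is later than the last one recorded for the old connection, which lets the peer reject stray old segments. It applies to outgoing connections only and is generally safe to enable. Recent kernels default to 2, which enables reuse for loopback traffic only.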

sysctl -w net.ipv4.tcp_tw_reuse=1

net.ipv4.tcp_tw_recycle: DO NOT USE in modern Linux (removed in Linux 4.12). Was supposed to accelerate TIME_WAIT cleanup but broke NAT — all clients behind a NAT sharing an IP but different timestamps would have connections randomly rejected.

Connection pooling: the real fix. HTTP keep-alive, database connection pooling, gRPC persistent connections — reduce connection churn and eliminate TIME_WAIT accumulation at the source.


SYN Flood and SYN Cookies

A SYN flood attack sends millions of SYN packets with spoofed source IPs. Each SYN allocates a struct tcp_request_sock in the SYN queue. The queue is finite (bounded by the listen backlog, which net.core.somaxconn capped at 128 by default before Linux 5.4 and at 4096 since), so it fills and legitimate connections fail.

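SYN Cookies (net.ipv4.tcp_syncookies=1, the Linux default, activates them when the SYN queue is full): instead of allocating state, encode connection parameters in the ISN:

ISN = f(keyed_hash(src_ip, src_port, dst_ip, dst_port, secret, time), time counter, MSS index)
       ^ recomputed on demand, nothing stored per connection

Older kernels used SHA-1 for the hash; current kernels use SipHash.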

When the final ACK arrives, the kernel reconstructs the parameters from the ACK number (which echoes the ISN + 1) and verifies the hash. Valid → connection accepted. No state allocation in the SYN queue.
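
The encode-then-verify cycle can be illustrated with a toy cookie. The sketch below follows the bit layout described on Bernstein's SYN cookies page (5 bits of a coarse time counter, 3 bits of MSS index, 24 bits of hash); Linux's real encoding differs in detail. Everything here is hypothetical illustration: struct flow, toy_hash, cookie_make, and cookie_check are invented names, and toy_hash is a simple mixer standing in for a keyed cryptographic hash, so it must not be used for anything real.

```c
#include <assert.h>
#include <stdint.h>

struct flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Stand-in for a keyed hash such as SipHash. NOT cryptographic. */
static uint32_t toy_hash(const struct flow *f, uint32_t secret,
                         uint32_t t, uint32_t mss_idx)
{
    uint32_t in[5] = { f->saddr, f->daddr,
                       ((uint32_t)f->sport << 16) | f->dport, t, mss_idx };
    uint32_t h = secret ^ 0x9e3779b9u;

    for (int i = 0; i < 5; i++) {
        h ^= in[i];
        h *= 0x01000193u;
        h ^= h >> 15;
    }
    return h;
}

/* Cookie layout: [5 bits: t mod 32][3 bits: MSS index][24 bits: hash]. */
uint32_t cookie_make(const struct flow *f, uint32_t secret,
                     uint32_t t, uint32_t mss_idx)
{
    mss_idx &= 7u;
    return ((t & 31u) << 27) | (mss_idx << 24)
         | (toy_hash(f, secret, t, mss_idx) & 0xffffffu);
}

/* Validate the final ACK of the handshake. Returns the MSS index
 * recovered from the cookie, or -1 if the cookie is forged or stale. */
int cookie_check(const struct flow *f, uint32_t secret,
                 uint32_t now, uint32_t ack_num)
{
    uint32_t cookie = ack_num - 1u;            /* the ACK echoes ISN + 1 */
    uint32_t mss_idx = (cookie >> 24) & 7u;

    for (uint32_t age = 0; age < 2; age++) {   /* current or previous tick */
        uint32_t t = now - age;

        if ((t & 31u) != (cookie >> 27))
            continue;
        if ((toy_hash(f, secret, t, mss_idx) & 0xffffffu) == (cookie & 0xffffffu))
            return (int)mss_idx;
    }
    return -1;
}
```

The server keeps no per-connection state between cookie_make and cookie_check: the flow tuple arrives again with the ACK, and the secret and time counter are global. Accepting only the current and previous tick bounds how long a captured cookie stays valid.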

# Enable SYN cookies (default on modern Linux)
sysctl -w net.ipv4.tcp_syncookies=1

# Expand SYN queue capacity (for legitimate high-connection-rate servers)
# Note: the effective limit is also bounded by net.core.somaxconn and the listen() backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=65536

# Monitor SYN cookie activity
netstat -s | grep 'SYN cookies'
# "SYN cookies sent" increasing → under SYN flood

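SYN Cookie limitations: the cookie has only a few spare bits, so it carries an approximate MSS and nothing else. Window scaling and SACK survive only when both ends use TCP timestamps, in which case Linux stores those options in the low bits of the timestamp value. Without timestamps, connections completed via SYN cookie fall back to an unscaled window and no SACK, which can reduce performance.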


Half-Open Connections

A half-open connection occurs when one side's TCP state is ESTABLISHED but the other side has lost state (crashed and restarted, or NAT timeout). The still-connected side keeps sending data; the crashed side responds with RST when packets arrive.

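Detection: TCP keepalive probes (SO_KEEPALIVE). After tcp_keepalive_time seconds of inactivity, the kernel sends a probe every tcp_keepalive_intvl seconds. If tcp_keepalive_probes consecutive probes go unacknowledged, it sends RST, closes the connection, and the application sees ETIMEDOUT.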

For applications, set keepalive per-socket:

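#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT */
#include <sys/socket.h>   /* setsockopt, SOL_SOCKET, SO_KEEPALIVE */

/* fd is an already-created TCP socket */
int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &opt, sizeof(opt));
// Default: first probe after 7200s idle, then 9 probes at 75s intervals
//          = 7200 + 9*75 = 7875s, roughly 2.2 hours to detect a dead peer
// Tune for faster detection:
int idle = 30, intvl = 5, count = 3;
setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
// Detects dead connection in 30 + 5*3 = 45 seconds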
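
When choosing keepalive values it helps to compute the worst-case detection time up front. A minimal sketch, where keepalive_detect_seconds is a hypothetical helper rather than part of any API:

```c
#include <assert.h>

/* Worst-case time to declare a silent peer dead:
 * idle time before the first probe, plus one interval per unanswered probe. */
unsigned keepalive_detect_seconds(unsigned idle_s, unsigned intvl_s, unsigned probes)
{
    return idle_s + intvl_s * probes;
}
```

The Linux defaults (7200, 75, 9) come to 7875 seconds, about 2 hours 11 minutes; the tuned values above (30, 5, 3) come to 45 seconds.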

RST Handling

A RST segment immediately terminates a connection from either side. The kernel clears all socket state and returns ECONNRESET to the application.

Common RST causes:

  • Application calls close() with SO_LINGER enabled and l_linger=0
  • Packet arrives for a port with no listener (kernel sends RST)
  • iptables -j REJECT --reject-with tcp-reset
  • Remote host reboots, or a NAT/firewall loses its mapping, so segments arrive for a connection the receiver no longer knows
  • Firewall injection of forged RST (TCP reset attacks: Comcast 2007, GFW)

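RST injection attack mitigation: TCP sequence number validation. RFC 793 accepts any RST whose sequence number falls inside the receive window; RFC 5961, which Linux implements, accepts a RST only when its sequence number exactly equals the next expected one and answers an in-window but inexact RST with a challenge ACK. A blind attacker must therefore guess one exact 32-bit value. PAWS timestamps (RFC 7323, formerly RFC 1323) add no protection here, because RFC 7323 recommends accepting RST segments regardless of their timestamp.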
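
The RST acceptance rule from RFC 5961 (exact match resets, in-window triggers a challenge ACK, anything else is dropped) fits in a few lines. This is a sketch of the decision only; enum rst_action and classify_rst are hypothetical names, and a real stack applies further checks around it.

```c
#include <assert.h>
#include <stdint.h>

enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_RESET };

/* Decide what to do with an incoming RST.
 * seg_seq: sequence number of the RST segment
 * rcv_nxt: next sequence number the receiver expects
 * rcv_wnd: current receive window size in bytes */
enum rst_action classify_rst(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
    if (seg_seq == rcv_nxt)
        return RST_RESET;                      /* exact match: tear down */

    /* Unsigned subtraction keeps the window test correct across
     * sequence number wraparound. */
    if ((uint32_t)(seg_seq - rcv_nxt) < rcv_wnd)
        return RST_CHALLENGE_ACK;              /* in window but inexact */

    return RST_DROP;                           /* outside the window */
}
```

A genuine peer that receives the challenge ACK replies with a RST carrying the exact expected sequence number, so legitimate resets still get through on the second attempt.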


Historical Context

The TCP state machine was defined in RFC 793 (Jon Postel, DARPA, 1981). The TIME_WAIT state was included from the beginning, reflecting lessons from ARPANET where segment duplication caused protocol confusion.

The SYN flood attack was first publicly exploited in 1996 — the send_synflood attack by "daemon9" demonstrated that the SYN queue was a finite resource. SYN cookies were invented by Daniel J. Bernstein (djb) in 1996, though the implementation in Linux was later independently contributed by Eric Schenk.

tcp_tw_recycle was added in Linux 2.6.x and later removed in 4.12 (2017) by Florian Westphal after years of production incidents where NAT gateways with multiple clients caused connection drops. The code was simply deleted — the dangers outweighed any benefit.


Production Examples

Kubernetes service connection exhaustion:

# Pod making 2000 req/s to external API → TIME_WAIT exhaustion
ss -tan state time-wait | wc -l
# > 60000 TIME_WAIT sockets → EADDRNOTAVAIL on new connects

# Diagnosis: the "tw" field in /proc/net/sockstat is the TIME_WAIT count
grep '^TCP:' /proc/net/sockstat
# TCP: inuse N orphan M tw N alloc N mem N

# Fix: enable tw_reuse, add connection pooling to application
sysctl -w net.ipv4.tcp_tw_reuse=1

SYN flood monitoring:

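# Real-time accept queue depth on the listener (Recv-Q vs Send-Q)
watch -n0.5 'ss -lnt | grep :443'

# Real-time SYN queue depth (connections still in SYN_RCVD)
watch -n0.5 'ss -Htan state syn-recv | wc -l'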

# tcpdump for SYN flood source analysis
tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' \
    -w /tmp/synflood.pcap &

# Counter monitoring
while true; do
    netstat -s | grep 'SYNs to LISTEN'
    sleep 1
done

Debugging Notes

# Complete socket state distribution
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

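# Track state transitions (bpftrace); states print as kernel numeric values
# (tracepoint:sock:inet_sock_set_state is a more complete alternative on Linux 4.16+)
bpftrace -e '
kprobe:tcp_set_state {
    $sk = (struct sock *)arg0;
    $new = arg1;    // second argument is the NEW state
    // at function entry the socket still holds the OLD state
    printf("%-20s %d -> %d\n", comm, $sk->__sk_common.skc_state, $new);
}'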

# Find sockets stuck in CLOSE_WAIT (application not calling close())
ss -tan state close-wait | head -20
# Large number = application bug (not closing accepted sockets)

# SYN_RCVD accumulation (SYN flood or server too slow)
ss -tan state syn-recv | wc -l

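# Check the kernel log for SYN flood warnings
dmesg | grep -i 'Possible SYN flooding'

# Check for TIME_WAIT bucket overflow (net.ipv4.tcp_max_tw_buckets exceeded)
dmesg | grep -i 'time wait bucket table overflow'

# Capture SYN retransmits (client not receiving SYN-ACK → loss)
tcpdump -v -ni eth0 'tcp[tcpflags] == tcp-syn' | grep 'seq '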
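
The bpftrace probe above prints states as integers. The mapping below uses the values from the kernel's include/net/tcp_states.h; tcp_state_name and tcp_state_number are hypothetical helper names for decoding trace output.

```c
#include <assert.h>
#include <string.h>

/* Map a Linux kernel TCP state number to its name. */
const char *tcp_state_name(int state)
{
    static const char *const names[] = {
        [1]  = "ESTABLISHED", [2]  = "SYN_SENT",   [3]  = "SYN_RECV",
        [4]  = "FIN_WAIT1",   [5]  = "FIN_WAIT2",  [6]  = "TIME_WAIT",
        [7]  = "CLOSE",       [8]  = "CLOSE_WAIT", [9]  = "LAST_ACK",
        [10] = "LISTEN",      [11] = "CLOSING",    [12] = "NEW_SYN_RECV",
    };

    if (state < 1 || state > 12)
        return "UNKNOWN";
    return names[state];
}

/* Reverse lookup; returns -1 for an unrecognized name. */
int tcp_state_number(const char *name)
{
    for (int i = 1; i <= 12; i++)
        if (strcmp(name, tcp_state_name(i)) == 0)
            return i;
    return -1;
}
```

A trace line reading `1 -> 8` is therefore ESTABLISHED to CLOSE_WAIT, and `8 -> 9` is CLOSE_WAIT to LAST_ACK. Note that the kernel numbers CLOSE as 7 rather than 0, and that NEW_SYN_RECV is a kernel-internal state with no RFC 793 counterpart.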

Security Implications

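  • SYN flood: net.ipv4.tcp_syncookies=1 enables cookies when the SYN queue overflows, which is enough to keep a flood from exhausting the queue. The value 2 sends cookies unconditionally; the kernel documentation describes it as intended for testing, and it subjects every connection to the SYN cookie limitations described earlier.
  • TCP sequence prediction (CVE historical): predictable ISNs allow blind data injection. Linux derives ISNs from a keyed cryptographic hash of the 4-tuple plus a clock component, in the manner of RFC 6528.
  • TIME_WAIT assassination: RFC 1337 documents that a RST segment arriving during TIME_WAIT can prematurely close it. Linux accepts such RSTs by default (net.ipv4.tcp_rfc1337=0); setting net.ipv4.tcp_rfc1337=1 makes the kernel drop RSTs for sockets in TIME_WAIT.
  • tcp_tw_recycle and NAT: never enable it in environments with NAT (cloud, home networks, corporate NAT); it caused intermittent connection drops. The option no longer exists on Linux 4.12 and later.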
  • Half-open detection: without keepalive, a half-open connection holds resources indefinitely. Production services should always configure SO_KEEPALIVE or application-level heartbeats.

Performance Implications

State                       Resource consumption
--------------------------  ----------------------------------------------------------
SYN_RCVD (no SYN cookie)    struct tcp_request_sock (~200B)
ESTABLISHED                 struct tcp_sock (~1.5–2KB) + sk_buff buffers
TIME_WAIT                   struct tcp_timewait_sock (~168B), smaller than ESTABLISHED
CLOSE_WAIT                  Same as ESTABLISHED (still has recv buffer)

TIME_WAIT sockets use a reduced tcp_timewait_sock structure — significantly smaller than an ESTABLISHED socket. 50,000 TIME_WAIT sockets consume ~8MB — generally not a memory concern. The concern is port exhaustion, not memory.
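
The memory estimate is a straight multiplication, shown here so other socket counts can be plugged in. The per-socket sizes are the approximate figures from the table above, and sockets_memory_bytes is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate kernel memory held by `count` sockets of a given
 * per-socket structure size (buffers not included). */
uint64_t sockets_memory_bytes(uint64_t count, uint64_t bytes_each)
{
    return count * bytes_each;
}
```

sockets_memory_bytes(50000, 168) is 8,400,000 bytes, the ~8MB quoted above. The same count of ESTABLISHED sockets at roughly 2KB each would be about 100MB before any buffers are counted.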


Failure Modes and Real Incidents

Incident: CLOSE_WAIT leak in Java service (2018)

A Java service accepted connections, but a bug in the thread pool caused some connections to never have close() called on the socket. CLOSE_WAIT sockets accumulated, each holding the full socket buffer. After 24 hours: 500,000 CLOSE_WAIT sockets consuming 100GB of kernel socket buffer memory. OOM killed the Java process. Diagnosis: ss -tan state close-wait | wc -l, process maps.

Incident: tcp_tw_recycle causing intermittent connection failures (2015, many sites)

A major social network enabled tcp_tw_recycle on their frontend servers. Users behind corporate NAT (all sharing one external IP) experienced 5–10% connection failure rates. The cause: with tcp_tw_recycle the server tracks the last TCP timestamp seen per source IP and drops SYNs whose timestamp is older. Clients behind a NAT each run their own timestamp clock, so the values arriving from the shared IP are not monotonically increasing, and SYNs from clients with lower timestamps were silently dropped. Took 2 weeks to diagnose.

Failure Mode: SYN queue + SYN cookies + SACK negotiation

When the peer does not use TCP timestamps, SYN cookies cannot preserve SACK negotiation. Under SYN flood with SYN cookies active, such connections run without SACK, so the sender learns about only one lost segment per round trip and may retransmit data the receiver already holds. On a lossy path this causes severe throughput degradation.


Modern Usage

  • TCP Fast Open (TFO): eliminates the 1-RTT cost of the three-way handshake for repeated connections by sending data with the SYN packet using a cached cookie. sysctl net.ipv4.tcp_fastopen=3 enables both client and server support; listening sockets must also set the TCP_FASTOPEN socket option.
  • MPTCP (Multipath TCP): Linux 5.6+ — uses the same state machine but manages multiple subflows simultaneously. Each subflow is a standard TCP connection; MPTCP adds a meta-state machine above it.
  • QUIC: replaces TCP with UDP + QUIC state machine, eliminating TIME_WAIT and enabling 0-RTT reconnection (see 04-quic-protocol.md)

Future Directions

  • TCP with hardware state offload: some SmartNICs (Bluefield, Stingray) implement TCP state machine in NIC firmware, moving connection state out of the host kernel
  • Connection migration: MPTCP and QUIC both support changing the underlying network path (IP address change) without resetting the connection — the TCP state machine has no equivalent
  • Formal verification: there are ongoing efforts to formally verify the TCP state machine implementation in Linux against the RFC specification using tools like TLA+

Exercises

  1. Use bpftrace to trace all TCP state transitions on your system for 30 seconds. What is the most common state transition? What does a large number of ESTABLISHED → CLOSE_WAIT transitions without corresponding CLOSE_WAIT → LAST_ACK transitions indicate?

  2. Reproduce TIME_WAIT exhaustion: write a client that opens and immediately closes 100,000 TCP connections to a local server. Observe ss -tan state time-wait | wc -l. Implement SO_REUSEADDR and tcp_tw_reuse and observe the effect.

  3. Reproduce SYN flood and SYN cookie activation: use hping3 -S --flood -V -p 8080 127.0.0.1 while monitoring netstat -s | grep 'SYN cookies'. Verify that net.ipv4.tcp_syncookies=2 causes cookies to be used for all connections, not just when the queue is full.

  4. Write a test that creates a pair of TCP sockets, allows the connection to establish, then kills one side with SIGKILL (simulating a crash). Use TCP keepalive on the other side and measure how long until the surviving socket transitions from ESTABLISHED to CLOSED.

  5. Capture a complete TCP connection with tcpdump -w cap.pcap and analyze each state transition in Wireshark (or tcpdump -r cap.pcap -v). Verify that the sequence numbers in SYN/ACK, FIN/ACK packets match the state machine diagram.


References

  • RFC 793 — Transmission Control Protocol (original specification)
  • RFC 1122 — Requirements for Internet Hosts — Communication Layers (state machine clarifications)
  • RFC 1337 — TIME-WAIT Assassination Hazards in TCP
  • RFC 7413 — TCP Fast Open
  • net/ipv4/tcp.c: tcp_set_state(), tcp_close()
  • net/ipv4/tcp_minisocks.c: TIME_WAIT socket management
  • net/ipv4/syncookies.c: SYN cookies
  • net/ipv4/tcp_input.c: tcp_rcv_state_process()
  • Stevens, W.R. TCP/IP Illustrated, Volume 1. Chapters 17–24 (TCP).
  • Bernstein, D.J. SYN Cookies. cr.yp.to/syncookies.html. 1996.
  • man 7 tcp — Linux TCP socket documentation