Section 16: TCP/IP Internals
Purpose and Scope
TCP/IP is the protocol bedrock of the internet, and understanding it at the implementation level, not merely the specification level, is what separates a systems engineer from a network administrator. This section examines TCP's state machine, the 3-way handshake and 4-way teardown, the major congestion control algorithms (Reno, CUBIC, BBR), RACK/TLP loss detection, flow control and the sliding window, TCP timer mechanics, and the full set of TCP options. It extends to UDP internals, the QUIC protocol's architectural innovations, IP routing (FIB lookup, routing decisions), ARP/NDP, ICMP semantics, DNS resolution internals, the TLS/SSL layer, and the generational evolution of HTTP.
The orientation is implementation-first: understanding why BBR behaves differently than CUBIC in a bufferbloat scenario, or why TIME_WAIT causes port exhaustion, requires reading behavior from the kernel source and measuring it empirically.
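As a rough illustration of why TIME_WAIT leads to port exhaustion, the arithmetic can be sketched in a few lines. The function name is hypothetical, and the inputs are assumptions not stated in this section: the common Linux default `net.ipv4.ip_local_port_range` of 32768 to 60999 and a 60-second TIME_WAIT interval. The bound applies per (source IP, destination IP, destination port) tuple.

```python
def max_sustained_connect_rate(port_low: int, port_high: int, time_wait_s: float) -> float:
    """Upper bound on new outbound connections per second from one source IP
    to one (destination IP, destination port), assuming every closed
    connection parks its ephemeral port in TIME_WAIT for time_wait_s."""
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_s


# Assumed inputs: default Linux ephemeral range and a 60 s TIME_WAIT.
rate = max_sustained_connect_rate(32768, 60999, 60.0)
print(f"{rate:.0f} connections/s")  # 28232 ports / 60 s, about 471 per second
```

A client that opens and actively closes connections to a single backend faster than this bound will eventually fail to bind a local port, which is why connection pooling or letting the peer perform the active close are common mitigations.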
Prerequisites
- Section 15 (Networking): network stack, socket layer, sk_buff, netfilter
- Section 03 (OS Fundamentals): process model, file descriptors, socket API
- Familiarity with Wireshark/tcpdump packet analysis
- Basic understanding of IP addressing, subnets, and routing concepts
Learning Objectives
Upon completing this section you will be able to:
- Draw and explain the complete TCP state machine (11 states, transitions, TIME_WAIT purpose).
- Explain why TIME_WAIT lasts 2*MSL and how to mitigate port exhaustion in high-connection-rate services.
- Describe CUBIC congestion control: W_cubic function, fast convergence, and behavior on high-BDP links.
- Explain BBR's model-based approach: how it estimates bottleneck bandwidth and RTprop independently.
- Explain TCP's sliding window: send window, receive window, rwnd, cwnd, ssthresh.
- Describe every major TCP timer: RTO (exponential backoff, Karn's algorithm), keepalive, TIME_WAIT, delayed ACK.
- Explain QUIC's innovations: connection ID, 0-RTT, stream multiplexing without HoL blocking, connection migration.
- Trace a DNS resolution from stub resolver through recursive resolver to authoritative server.
- Explain TLS 1.3 handshake: key exchange, certificate verification, session resumption (PSK).
- Describe the HoL blocking problem in HTTP/1.1, how HTTP/2 addressed it at L7, and why HTTP/3 (QUIC) solves it at L4.
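For the CUBIC objective above, a minimal sketch of the window growth function may help. It assumes the form and constants given in RFC 8312 (C = 0.4, multiplicative decrease factor 0.7), which this section does not state; the function names are hypothetical, and real implementations add TCP-friendliness and fast-convergence logic that is omitted here.

```python
C = 0.4      # scaling constant (assumed from RFC 8312)
BETA = 0.7   # window is reduced to BETA * W_max on loss (assumed from RFC 8312)


def cubic_k(w_max: float) -> float:
    """Time in seconds for the window to climb back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def w_cubic(t: float, w_max: float) -> float:
    """Congestion window (segments) t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100.0
k = cubic_k(w_max)
for t in (0.0, k / 2, k, k + 1.0):
    print(f"t={t:5.2f}s  cwnd={w_cubic(t, w_max):7.2f}")
```

The curve starts at 0.7 * W_max, flattens as it approaches W_max (the concave region), then accelerates past it (the convex probing region). Growth depends on elapsed time rather than on ACK arrival rate, which is why CUBIC scales better than Reno on high-BDP links.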
Architecture Overview
Application
┌───────────────────────────────────────────────────────────────┐
│  HTTP/1.1          HTTP/2 (ALPN)         HTTP/3 (QUIC/UDP)    │
│  text framing      binary multiplexed    UDP-based streams    │
└────────────────────────┬───────────────────────┬──────────────┘
                         │                       │
┌────────────────────────▼──────┐  ┌─────────────▼─────────────┐
│ TLS 1.3                       │  │ QUIC (built-in crypto)    │
│ 1-RTT handshake / 0-RTT PSK   │  │ Connection ID based       │
└────────────────────────┬──────┘  └─────────────┬─────────────┘
                         │                       │
┌────────────────────────▼──────┐  ┌─────────────▼─────────────┐
│ TCP                           │  │ UDP                       │
│ State machine (11 states)     │  │ Checksums only            │
│ Congestion: BBR/CUBIC/Reno    │  │ No ordering/retransmit    │
│ Flow: rwnd, cwnd, ssthresh    │  └───────────────────────────┘
│ Timers: RTO, keepalive, 2MSL  │
└────────────────────────┬──────┘
                         │
┌────────────────────────▼──────────────────────────────────────┐
│ IP Layer                                                      │
│ IPv4: FIB lookup (LPM), TTL, fragmentation, DSCP              │
│ IPv6: extension headers, flow labels, NDP (replaces ARP)      │
│ Routing: neighbor table → ARP/NDP → link layer                │
└────────────────────────┬──────────────────────────────────────┘
                         │
┌────────────────────────▼──────────────────────────────────────┐
│ ARP (IPv4) / NDP (IPv6) / ICMP / ICMPv6                       │
└───────────────────────────────────────────────────────────────┘
TCP Congestion Control State:
Slow Start → Congestion Avoidance → Fast Retransmit → Fast Recovery
cwnd: 1 → ssthresh (SS) → +1/cwnd per ACK (CA) → halve on loss
BBR: models BtlBw and RTprop; sends at BtlBw regardless of buffer state
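The loss-based progression summarized above can be sketched as a per-RTT simulation. This is a simplified model, not kernel behavior: the function name and parameters are hypothetical, losses are assumed to be detected by duplicate ACKs (so the window halves rather than collapsing to 1), and slow start is capped at ssthresh.

```python
from __future__ import annotations


def simulate_reno(rounds: int, ssthresh: float, loss_rounds: set[int],
                  initial_cwnd: float = 1.0) -> list[float]:
    """Return cwnd (in segments) at the start of each RTT round."""
    cwnd = initial_cwnd
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        if rnd in loss_rounds:
            # Fast retransmit / fast recovery: halve the window.
            ssthresh = max(cwnd / 2.0, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: +1 segment per ACK doubles cwnd every RTT.
            cwnd = min(cwnd * 2.0, ssthresh)
        else:
            # Congestion avoidance: +1/cwnd per ACK is about +1 segment per RTT.
            cwnd += 1.0
    return history


trace = simulate_reno(rounds=10, ssthresh=16, loss_rounds={7})
print(trace)  # [1.0, 2.0, 4.0, 8.0, 16.0, 17.0, 18.0, 19.0, 9.5, 10.5]
```

The trace shows the exponential ramp to ssthresh, the linear climb in congestion avoidance, and the multiplicative decrease after the loss in round 7.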
Key Concepts
- TCP State Machine: CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT — 11 states governing connection lifecycle.
- 3-Way Handshake: SYN → SYN-ACK → ACK; establishes ISNs, MSS, window scale, SACK options.
- TIME_WAIT: After an active close, the connection lingers for 2*MSL (fixed at 60 seconds on Linux) so that delayed duplicates expire and a lost final ACK can be answered; under a high connection rate this exhausts ephemeral ports.
- Sliding Window: Receiver advertises rwnd; sender limits unacknowledged data to min(cwnd, rwnd); throughput is capped at window/RTT, so filling a path requires a window of at least the bandwidth-delay product (BDP = bandwidth × RTT).
- Congestion Control: Algorithms that infer network congestion and throttle sending rate; Reno (loss-based), CUBIC (cubic W function, faster ramp), BBR (model-based).
- BBR (Bottleneck Bandwidth and Round-trip propagation time): Estimates bottleneck bandwidth (BtlBw) and minimum RTT (RTprop) independently; aims to keep about one BDP in flight, the operating point that maximizes throughput without filling bottleneck buffers.
- RTO (Retransmission Timeout): Computed from smoothed RTT and variance (Jacobson/Karels algorithm); doubles on each timeout (binary exponential backoff).
- SACK (Selective Acknowledgment): TCP option allowing receiver to report non-contiguous received segments; enables selective retransmission rather than go-back-N.
- TCP Fast Open (TFO): Sends data in the SYN packet using a cookie; reduces latency for short-lived connections by 1 RTT.
- QUIC: UDP-based transport protocol by Google, standardized as RFC 9000; provides stream multiplexing, 0-RTT connection establishment, connection migration, and built-in TLS 1.3.
- Head-of-Line (HoL) Blocking: In TCP, a lost segment blocks delivery of all subsequent data in the byte stream until it is retransmitted; HTTP/2 multiplexing over TCP therefore still suffers HoL blocking at the transport layer; QUIC confines the stall to the stream that lost data, so other streams keep flowing.
- IP FIB (Forwarding Information Base): The kernel's forwarding table, derived from the routing table; an LPM (Longest Prefix Match) lookup determines the next hop; stored in trie or hash structures (Linux IPv4 uses an LC-trie).
- ARP/NDP: Address Resolution Protocol (IPv4) maps IP to MAC address; Neighbor Discovery Protocol (IPv6) adds router discovery and SLAAC.
- DNS: Hierarchical distributed naming system; stub resolver → recursive resolver → root → TLD → authoritative; responses cached per TTL.
- TLS 1.3: Removes RSA key exchange and static DH; uses ephemeral ECDH for forward secrecy; 1-RTT full handshake, 0-RTT resumption with PSK.
- HTTP/2: Binary framing, multiplexed streams, HPACK header compression, server push; all over a single TCP connection.
- HTTP/3: HTTP/2 semantics over QUIC; independent stream loss recovery eliminates transport-layer HoL blocking.
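To make the RTO entry above concrete, here is a minimal sketch of the smoothed-RTT estimator with exponential backoff and Karn's rule. It assumes the update rules and constants from RFC 6298 (gains of 1/8 and 1/4, a multiplier of 4, a 1-second minimum); the class name, method names, and `granularity` parameter are hypothetical. Linux uses a lower minimum (200 ms), which the usage lines imitate through `min_rto`.

```python
class RtoEstimator:
    """RTO estimator following the RFC 6298 update rules (times in seconds)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT (SRTT)
    BETA = 1 / 4    # gain for the RTT variation (RTTVAR)
    K = 4           # RTTVAR multiplier

    def __init__(self, min_rto=1.0, max_rto=60.0, granularity=0.001):
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.granularity = granularity  # assumed clock granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any sample is taken

    def on_rtt_sample(self, rtt, retransmitted=False):
        if retransmitted:
            # Karn's algorithm: an ACK for a retransmitted segment is
            # ambiguous, so it must not update the estimator.
            return self.rto
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self):
        # Binary exponential backoff: double the RTO on each expiry.
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto


est = RtoEstimator(min_rto=0.2)
first = est.on_rtt_sample(0.100)        # 0.1 + 4 * 0.05   = 0.30 s
second = est.on_rtt_sample(0.100)       # 0.1 + 4 * 0.0375 = 0.25 s
backed_off = est.on_timeout()           # doubled to 0.50 s
ignored = est.on_rtt_sample(5.0, retransmitted=True)  # unchanged, 0.50 s
print(first, second, backed_off, ignored)
```

Note how a stable RTT shrinks RTTVAR and pulls the RTO toward SRTT, while a timeout doubles it regardless of the estimate.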
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1974 | Cerf & Kahn publish TCP (Transmission Control Program) |
| 1978 | TCP and IP separated into distinct protocols |
| 1981 | RFC 793: TCP specification; RFC 791: IPv4 |
| 1983 | BSD 4.2 TCP/IP implementation; ARPANET transition |
| 1987 | Karn's algorithm for RTT estimation in the presence of retransmits (Karn & Partridge) |
| 1988 | Van Jacobson publishes congestion control algorithms (slow start, CA) |
| 1996 | TCP SACK (Selective Acknowledgment) published as RFC 2018 |
| 1996 | RFC 1948: protection against sequence number attacks |
| 1997 | RFC 2001 documents slow start, congestion avoidance, fast retransmit, and fast recovery (the Reno algorithms, by then widely deployed) |
| 1998 | IPv6 specification published (RFC 2460) |
| 2001 | Linux 2.4: TCP rework; scalable socket infrastructure |
| 2004 | BIC congestion control published; its successor CUBIC follows in 2005 and is the Linux default since 2.6.19 |
| 2009 | Google announces SPDY, the experimental precursor to HTTP/2, to address HTTP/1.1 latency limitations |
| 2012 | Google proposes QUIC (Quick UDP Internet Connections) |
| 2015 | HTTP/2 standardized (RFC 7540) |
| 2016 | BBR congestion control published by Google; merged in Linux 4.9 |
| 2018 | TLS 1.3 standardized (RFC 8446) |
| 2020 | TCP Pacing + RACK/TLP loss detection in Linux mainstream |
| 2021 | QUIC standardized (RFC 9000) |
| 2022 | HTTP/3 standardized (RFC 9114) |
| 2022 | BBRv2 experimental in Linux; addresses BBR fairness issues |
Modern Relevance and Production Use Cases
Cloud CDN and web serving (Cloudflare, Fastly, Akamai) have deployed BBR as a congestion control algorithm; because BBR does not treat every loss as a congestion signal, it sustains throughput on lossy last-mile and high-RTT paths where loss-based algorithms back off. Google reported that BBR raised YouTube network throughput by about 4% on average and by more than 14% in some countries.
gRPC and microservices rely on HTTP/2 multiplexing to avoid per-request connection overhead between services; understanding TCP HoL blocking explains why a single lost segment can stall every concurrent gRPC call sharing that connection, including calls whose data already arrived intact.
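One way to see why loss-based algorithms struggle on lossy, high-RTT paths is the well-known Mathis approximation for Reno-style steady-state throughput, rate ≈ (MSS / RTT) × (√1.5 / √p). The sketch below is an approximation under that model only; the function name and the example path parameters are assumptions, and the formula does not describe BBR.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state throughput (bits/s) of a Reno-style flow
    with random loss probability loss_rate, per the Mathis model."""
    return (mss_bytes * 8 / rtt_s) * (math.sqrt(1.5) / math.sqrt(loss_rate))


# Hypothetical path: 1460-byte MSS, 100 ms RTT, 1% packet loss.
bps = mathis_throughput_bps(1460, 0.100, 0.01)
print(f"{bps / 1e6:.2f} Mbit/s")  # about 1.43 Mbit/s, whatever the link capacity
```

Under this model throughput falls with both RTT and the square root of the loss rate, so a 1% random loss rate caps a long-haul flow at a few megabits per second no matter how fast the link is.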
Mobile applications benefit from QUIC/HTTP/3 because connection migration survives network handoffs (WiFi to LTE) without a new handshake; QUIC deployments at Google and Meta show 5–15% improvement in video rebuffering rates.
High-frequency trading platforms use custom TCP stacks (kernel bypass with modified CUBIC or no congestion control) and kernel-bypass UDP; TIME_WAIT and RTO are replaced by application-level retransmit with tighter timeouts.
Kubernetes services rely on iptables/conntrack for ClusterIP; TIME_WAIT and conntrack table exhaustion are persistent production issues at scale; kube-proxy IPVS mode reduces conntrack pressure.
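When investigating TIME_WAIT buildup on a node, one quick check is to tally socket states from `/proc/net/tcp`, whose fourth column holds the state as a hex code (06 is TIME_WAIT). The sketch below parses text in that format. The function name and the embedded sample rows are illustrative, not taken from a real host, and this counts kernel sockets only; conntrack entries are tracked separately.

```python
from collections import Counter

# Hex state codes used by the Linux kernel in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


# Illustrative sample; on Linux, pass open("/proc/net/tcp").read() instead.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000
   1: 0100007F:C350 0100007F:1F90 06 00000000:00000000
   2: 0100007F:C351 0100007F:1F90 06 00000000:00000000
   3: 0100007F:C352 0100007F:1F90 01 00000000:00000000
"""

states = count_tcp_states(SAMPLE)
print(dict(states))  # {'LISTEN': 1, 'TIME_WAIT': 2, 'ESTABLISHED': 1}
```

A TIME_WAIT count approaching the size of the ephemeral port range toward a single destination is the usual signature of the exhaustion problem described above.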
File Map
| File | Description |
|---|---|
| 01-tcp-state-machine.md | 11 states, transition triggers, TIME_WAIT duration and purpose |
| 02-three-way-handshake.md | SYN/SYN-ACK/ACK mechanics, ISN generation, options negotiation |
| 03-four-way-teardown.md | FIN/FIN-ACK sequence, half-close, TIME_WAIT exhaustion |
| 04-sliding-window.md | rwnd, cwnd, BDP, window scaling option, zero window |
| 05-congestion-control-reno.md | Slow start, congestion avoidance, fast retransmit, fast recovery |
| 06-congestion-control-cubic.md | W_cubic function, BIC origin, fast convergence |
| 07-congestion-control-bbr.md | BtlBw/RTprop model, pacing, PROBE_BW/PROBE_RTT phases |
| 08-tcp-timers.md | RTO computation (Jacobson), keepalive, delayed ACK, TIME_WAIT |
| 09-tcp-options.md | MSS, SACK, timestamps, window scale, TFO, MD5 |
| 10-flow-control.md | Receiver-side rwnd, zero window probing, Nagle algorithm |
| 11-tcp-loss-recovery.md | RACK, TLP (Tail Loss Probe), SACK-based retransmit |
| 12-udp-internals.md | UDP socket, checksum, fragmentation, multicast |
| 13-quic-protocol.md | QUIC streams, connection ID, 0-RTT, ACK frequency, migration |
| 14-ip-routing.md | FIB structure, LPM lookup, route cache, policy routing |
| 15-arp-ndp.md | ARP request/reply, gratuitous ARP, NDP, SLAAC, router advertisement |
| 16-icmp-diagnostics.md | ICMP types, ping, traceroute, PMTUD, ICMPv6 |
| 17-dns-internals.md | Resolution chain, DNSSEC, DNS-over-TLS/HTTPS, negative caching |
| 18-tls-ssl-layer.md | TLS 1.3 handshake, cipher suites, certificate pinning, mTLS |
| 19-http-evolution.md | HTTP/1.1 pipelining, HTTP/2 frames/streams, HTTP/3 over QUIC |
| 20-tcp-tuning.md | net.core, net.ipv4.tcp_* sysctl knobs, ss/netstat analysis |
Cross-References
- Section 15 (Networking): socket layer, sk_buff, netfilter, XDP — the platform TCP runs on
- Section 10 (Synchronization): TCP socket lock, receive queue spinlocks
- Section 17 (Distributed Systems): distributed system failure modes emerge from TCP semantics (timeout, partial failure)
- Section 19 (Virtualization): virtio-net TCP offload, network namespaces in containers
- Section 18 (Database Internals): database replication streams over TCP; understanding replication lag requires TCP congestion understanding