Section 16: TCP/IP Internals

Purpose and Scope

TCP/IP is the protocol bedrock of the internet, and understanding it at the implementation level, not merely the specification level, is what separates a systems engineer from a network administrator. This section examines TCP's state machine, the 3-way handshake and 4-way teardown, the major congestion control algorithms (Reno, CUBIC, BBR), loss detection and recovery (SACK, RACK/TLP), flow control and the sliding window, TCP timer mechanics, and the commonly used TCP options. It extends to UDP internals, the QUIC protocol's architectural innovations, IP routing (FIB lookup, routing decisions), ARP/NDP, ICMP semantics, DNS resolution internals, the TLS/SSL layer, and the generational evolution of HTTP.

The orientation is implementation-first: understanding why BBR behaves differently than CUBIC in a bufferbloat scenario, or why TIME_WAIT causes port exhaustion, requires reading behavior from the kernel source and measuring it empirically.
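As a small illustration of measuring empirically, the sketch below counts sockets per TCP state from text in the format of Linux's `/proc/net/tcp`, where the fourth column holds the state as a hex code. The helper name `count_tcp_states` and the sample rows are made up for illustration; on a real Linux host you would pass the open file instead of the sample.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(lines):
    """Count sockets per TCP state from /proc/net/tcp-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and any line too short to hold a state column.
        if len(fields) < 4 or fields[0] == "sl":
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts

# Made-up sample rows (not captured from a real system).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20001
   1: 0100007F:C350 0100007F:1F90 06 00000000:00000000 03:00000BB8 00000000     0        0 0
   2: 0100007F:C351 0100007F:1F90 06 00000000:00000000 03:00000BB8 00000000     0        0 0
   3: 0100007F:C352 0100007F:1F90 01 00000000:00000000 00:00000000 00000000  1000        0 20002
"""

state_counts = count_tcp_states(SAMPLE.splitlines())
print(dict(state_counts))
```

On a live host, `with open("/proc/net/tcp") as f: count_tcp_states(f)` gives the same breakdown for IPv4 sockets; a steadily growing TIME_WAIT count is the usual early sign of port pressure.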


Prerequisites

  • Section 15 (Networking): network stack, socket layer, sk_buff, netfilter
  • Section 03 (OS Fundamentals): process model, file descriptors, socket API
  • Familiarity with Wireshark/tcpdump packet analysis
  • Basic understanding of IP addressing, subnets, and routing concepts

Learning Objectives

Upon completing this section you will be able to:

  1. Draw and explain the complete TCP state machine (11 states, transitions, TIME_WAIT purpose).
  2. Explain why TIME_WAIT lasts 2*MSL and how to mitigate port exhaustion in high-connection-rate services.
  3. Describe CUBIC congestion control: W_cubic function, fast convergence, and behavior on high-BDP links.
  4. Explain BBR's model-based approach: how it estimates bottleneck bandwidth and RTprop independently.
  5. Explain TCP's sliding window: send window, receive window, rwnd, cwnd, ssthresh.
  6. Describe every major TCP timer: RTO (exponential backoff, Karn's algorithm), keepalive, TIME_WAIT, delayed ACK.
  7. Explain QUIC's innovations: connection ID, 0-RTT, stream multiplexing without HoL blocking, connection migration.
  8. Trace a DNS resolution from stub resolver through recursive resolver to authoritative server.
  9. Explain TLS 1.3 handshake: key exchange, certificate verification, session resumption (PSK).
  10. Describe the HoL blocking problem in HTTP/1.1, how HTTP/2 addressed it at L7, and why HTTP/3 (QUIC) solves it at L4.

Architecture Overview

  Application
  ┌────────────────────────────────────────────────────────────────┐
  │  HTTP/1.1          HTTP/2 (ALPN)         HTTP/3 (QUIC/UDP)    │
  │  text framing      binary multiplexed    UDP-based streams     │
  └────────────────────────┬───────────────────────┬──────────────┘
                           │                       │
  ┌────────────────────────▼──────┐  ┌─────────────▼─────────────┐
  │         TLS 1.3               │  │  QUIC (built-in crypto)   │
  │  1-RTT handshake / 0-RTT PSK  │  │  Connection ID based      │
  └────────────────────────┬──────┘  └─────────────┬─────────────┘
                           │                       │
  ┌────────────────────────▼──────┐  ┌─────────────▼─────────────┐
  │         TCP                   │  │         UDP                │
  │  State machine (11 states)    │  │  Checksums only            │
  │  Congestion: BBR/CUBIC/Reno   │  │  No ordering/retransmit    │
  │  Flow: rwnd, cwnd, ssthresh   │  └───────────────────────────┘
  │  Timers: RTO, keepalive, 2MSL │
  └────────────────────────┬──────┘
                           │
  ┌────────────────────────▼──────────────────────────────────────┐
  │                     IP Layer                                   │
  │  IPv4: FIB lookup (LPM), TTL, fragmentation, DSCP             │
  │  IPv6: extension headers, flow labels, NDP (replaces ARP)     │
  │  Routing: neighbor table → ARP/NDP → link layer               │
  └────────────────────────┬──────────────────────────────────────┘
                           │
  ┌────────────────────────▼──────────────────────────────────────┐
  │  ARP (IPv4) / NDP (IPv6) / ICMP / ICMPv6                      │
  └───────────────────────────────────────────────────────────────┘

  TCP Congestion Control State:
  Slow Start → Congestion Avoidance → Fast Retransmit → Fast Recovery
  cwnd:  1 → ssthresh (SS) → +1/cwnd per ACK (CA) → halve on loss
  BBR:   models BtlBw and RTprop; paces at the estimated BtlBw and bounds inflight data near the BDP rather than reacting to loss
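The loss-based progression in the diagram can be sketched as a toy per-RTT model. This is a simplification under stated assumptions: cwnd is counted in whole segments, every loss is detected by duplicate ACKs (no RTO), and the function name `reno_cwnd_trace` and its parameters are hypothetical.

```python
def reno_cwnd_trace(rounds, ssthresh, loss_rounds, initial_cwnd=1):
    """Return cwnd (in segments) at the start of each RTT round."""
    cwnd = initial_cwnd
    trace = []
    for rnd in range(rounds):
        trace.append(cwnd)
        if rnd in loss_rounds:
            # Fast retransmit / fast recovery: halve the window and
            # continue in congestion avoidance from the new ssthresh.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: +1 segment per ACK, i.e. doubling per RTT,
            # capped at ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: +1/cwnd per ACK, i.e. +1 segment per RTT.
            cwnd += 1
    return trace

trace = reno_cwnd_trace(rounds=10, ssthresh=8, loss_rounds={6})
print(trace)  # [1, 2, 4, 8, 9, 10, 11, 5, 6, 7]
```

The trace shows the familiar sawtooth: exponential growth to ssthresh, linear growth afterwards, and a halving at the round where loss occurs.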

Key Concepts

  • TCP State Machine: CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT — 11 states governing connection lifecycle.
  • 3-Way Handshake: SYN → SYN-ACK → ACK; establishes ISNs, MSS, window scale, SACK options.
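  • TIME_WAIT: After an active close, the connection lingers for 2*MSL to absorb delayed duplicates and to re-send the final ACK if the peer retransmits its FIN. RFC 793 suggests an MSL of 2 minutes; Linux hard-codes the TIME_WAIT period at 60 seconds (TCP_TIMEWAIT_LEN). Under a high connection rate this causes ephemeral port exhaustion.
  • Sliding Window: Receiver advertises rwnd; sender limits unacknowledged data to min(cwnd, rwnd). Throughput is therefore capped at roughly window/RTT, and the window must reach the bandwidth-delay product (BDP = bandwidth × RTT) to fill the path.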
  • Congestion Control: Algorithms that infer network congestion and throttle sending rate; Reno (loss-based), CUBIC (cubic W function, faster ramp), BBR (model-based).
  • BBR (Bottleneck Bandwidth and Round-trip propagation time): Estimates bottleneck bandwidth (BtlBw) and minimum RTT (RTprop) independently; aims to operate near the point of maximum delivery rate and minimum delay, without filling bottleneck buffers.
  • RTO (Retransmission Timeout): Computed from smoothed RTT and variance (Jacobson/Karels algorithm); doubles on each timeout (binary exponential backoff).
  • SACK (Selective Acknowledgment): TCP option allowing receiver to report non-contiguous received segments; enables selective retransmission rather than go-back-N.
  • TCP Fast Open (TFO): Sends data in the SYN packet, authorized by a cookie obtained on an earlier connection to the same server; saves 1 RTT on repeat short-lived connections.
  • QUIC: UDP-based transport protocol by Google, standardized as RFC 9000; provides stream multiplexing, 0-RTT connection establishment, connection migration, and built-in TLS 1.3.
  • Head-of-Line (HoL) Blocking: In TCP, a lost packet blocks delivery of all subsequent data in the stream; HTTP/2 multiplexing over TCP still suffers HoL at the transport layer; QUIC eliminates it per stream.
  • IP FIB (Forwarding Information Base): The kernel's forwarding table, built from the routing tables; LPM (Longest Prefix Match) lookup determines next hop; stored in trie or hash structures (Linux uses an LC-trie for IPv4).
  • ARP/NDP: Address Resolution Protocol (IPv4) maps IP to MAC address; Neighbor Discovery Protocol (IPv6) adds router discovery and SLAAC.
  • DNS: Hierarchical distributed naming system; stub resolver → recursive resolver → root → TLD → authoritative; responses cached per TTL.
  • TLS 1.3: Removes RSA key exchange and static DH; uses ephemeral ECDH for forward secrecy; 1-RTT full handshake, 0-RTT resumption with PSK.
  • HTTP/2: Binary framing, multiplexed streams, HPACK header compression, server push; all over a single TCP connection.
  • HTTP/3: HTTP/2 semantics over QUIC; independent stream loss recovery eliminates transport-layer HoL blocking.
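The RTO entry above can be made concrete with a minimal sketch of the Jacobson/Karels estimator using the constants from RFC 6298 (alpha = 1/8, beta = 1/4, K = 4). The class name `RtoEstimator` and its parameters are hypothetical; RFC 6298 specifies a 1 second minimum RTO, while Linux uses a 200 ms floor, which the demo passes in explicitly.

```python
class RtoEstimator:
    """Sketch of RFC 6298 RTO computation with binary exponential backoff."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, min_rto=1.0, max_rto=60.0, clock_granularity=0.001):
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.g = clock_granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any sample (RFC 6298)

    def on_rtt_sample(self, rtt):
        """Feed an RTT sample. Per Karn's algorithm, the caller must only
        pass samples from segments that were not retransmitted."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.g, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self):
        """Retransmission timer fired: double the RTO, up to the cap."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto

est = RtoEstimator(min_rto=0.2)   # Linux-style 200 ms floor
first = est.on_rtt_sample(0.100)  # SRTT = 0.1, RTTVAR = 0.05, RTO ≈ 0.3
backed_off = est.on_timeout()     # RTO doubles to ≈ 0.6
second = est.on_rtt_sample(0.100) # RTTVAR decays to 0.0375, RTO ≈ 0.25
```

Note how a stable RTT shrinks the variance term, so the RTO converges toward SRTT from above, while each timeout doubles it until a fresh valid sample arrives.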

Major Historical Milestones

Year Milestone
1974 Cerf & Kahn publish TCP (Transmission Control Program)
1978 TCP and IP separated into distinct protocols
1981 RFC 793: TCP specification; RFC 791: IPv4
1983 4.2BSD TCP/IP implementation; ARPANET transition
1987 Karn's algorithm for RTT estimation in the presence of retransmits
1988 Van Jacobson publishes congestion control algorithms (slow start, CA)
1996 TCP SACK (Selective Acknowledgment) RFC 2018
1996 RFC 1948: protection against sequence number attacks
1997 RFC 2001 standardizes the TCP Reno algorithms (Fast Retransmit + Fast Recovery), by then widely deployed
1998 IPv6 specification published (RFC 2460)
2001 Linux 2.4: TCP rework; scalable socket infrastructure
2005 CUBIC congestion control published (successor to BIC, 2004); default in Linux since 2.6.19
2009 Google announces SPDY, the precursor to HTTP/2; HTTP/1.1 limitations documented
2012 Google proposes QUIC (Quick UDP Internet Connections)
2015 HTTP/2 standardized (RFC 7540)
2016 BBR congestion control published by Google; merged in Linux 4.9
2018 TLS 1.3 standardized (RFC 8446)
2020 TCP Pacing + RACK/TLP loss detection in Linux mainstream
2021 QUIC standardized (RFC 9000)
2022 HTTP/3 standardized (RFC 9114); BBRv2 experimental in Linux; addresses BBR fairness issues

Modern Relevance and Production Use Cases

Cloud CDN and web serving providers (Cloudflare, Fastly, Akamai) deploy BBR; it improves throughput on lossy last-mile and high-RTT paths. Google reported that BBR raised YouTube network throughput by about 4% on average globally and by more than 14% in some countries.

gRPC and microservices rely on HTTP/2 multiplexing to avoid per-request connection overhead between services; understanding TCP HoL blocking explains why a single lost segment can stall every concurrent gRPC call multiplexed over the same connection, including calls unrelated to the lost data.
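The stall can be illustrated with a toy model that contrasts one shared in-order byte stream (HTTP/2 over TCP) with per-stream ordering (as in QUIC). It is a deliberate simplification: each packet carries data for exactly one stream, nothing is retransmitted yet, and the function names are hypothetical.

```python
def deliverable_single_stream(packets, lost):
    """One in-order byte stream for all calls: the first gap blocks
    everything behind it until the retransmission arrives."""
    delivered = []
    for seq, stream in packets:
        if seq in lost:
            break
        delivered.append((seq, stream))
    return delivered

def deliverable_per_stream(packets, lost):
    """Independent ordering per stream: a gap only blocks later data
    that belongs to the same stream."""
    blocked = set()
    delivered = []
    for seq, stream in packets:
        if seq in lost:
            blocked.add(stream)
        elif stream not in blocked:
            delivered.append((seq, stream))
    return delivered

# Two concurrent calls, A and B, interleaved on one connection.
packets = [(0, "A"), (1, "B"), (2, "A"), (3, "B"), (4, "A"), (5, "B")]
lost = {1}  # one packet of call B is dropped

tcp_view = deliverable_single_stream(packets, lost)
quic_view = deliverable_per_stream(packets, lost)
print(tcp_view)   # [(0, 'A')]
print(quic_view)  # [(0, 'A'), (2, 'A'), (4, 'A')]
```

With the shared stream, call A makes no progress after the gap even though none of its own data was lost; with per-stream ordering only call B waits.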

Mobile applications benefit from QUIC/HTTP/3 because connection migration survives network handoffs (WiFi to LTE) without a new handshake; QUIC deployments at Google and Meta show 5–15% improvement in video rebuffering rates.

High-frequency trading platforms use custom TCP stacks (kernel bypass with modified CUBIC or no congestion control) and kernel-bypass UDP; TIME_WAIT and RTO are replaced by application-level retransmit with tighter timeouts.

Kubernetes services rely on iptables/conntrack for ClusterIP; TIME_WAIT and conntrack table exhaustion are persistent production issues at scale; kube-proxy IPVS mode reduces conntrack pressure.
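A back-of-the-envelope calculation shows why TIME_WAIT limits connection rate. The sketch assumes the Linux default ephemeral range of 32768 to 60999 (`net.ipv4.ip_local_port_range`), a 60 second TIME_WAIT, and a client that opens and actively closes connections to a single destination IP and port; the function name is hypothetical.

```python
def max_sustained_conn_rate(port_low=32768, port_high=60999, time_wait_s=60):
    """Upper bound on new connections per second from one source IP to one
    (destination IP, destination port) when every closed connection holds
    its source port in TIME_WAIT for time_wait_s seconds."""
    ports = port_high - port_low + 1
    return ports, ports / time_wait_s

ports, rate = max_sustained_conn_rate()
print(ports)        # 28232
print(round(rate))  # 471
```

Roughly 470 connections per second per destination is easy to exceed in a proxy or load generator. Common mitigations are connection reuse (keep-alive, pooling), widening `ip_local_port_range`, and enabling `net.ipv4.tcp_tw_reuse` for outgoing connections.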


File Map

File Description
01-tcp-state-machine.md 11 states, transition triggers, TIME_WAIT duration and purpose
02-three-way-handshake.md SYN/SYN-ACK/ACK mechanics, ISN generation, options negotiation
03-four-way-teardown.md FIN/FIN-ACK sequence, half-close, TIME_WAIT exhaustion
04-sliding-window.md rwnd, cwnd, BDP, window scaling option, zero window
05-congestion-control-reno.md Slow start, congestion avoidance, fast retransmit, fast recovery
06-congestion-control-cubic.md W_cubic function, BIC origin, fast convergence
07-congestion-control-bbr.md BtlBw/RTprop model, pacing, PROBE_BW/PROBE_RTT phases
08-tcp-timers.md RTO computation (Jacobson), keepalive, delayed ACK, TIME_WAIT
09-tcp-options.md MSS, SACK, timestamps, window scale, TFO, MD5
10-flow-control.md Receiver-side rwnd, zero window probing, Nagle algorithm
11-tcp-loss-recovery.md RACK, TLP (Tail Loss Probe), SACK-based retransmit
12-udp-internals.md UDP socket, checksum, fragmentation, multicast
13-quic-protocol.md QUIC streams, connection ID, 0-RTT, ACK frequency, migration
14-ip-routing.md FIB structure, LPM lookup, route cache, policy routing
15-arp-ndp.md ARP request/reply, gratuitous ARP, NDP, SLAAC, router advertisement
16-icmp-diagnostics.md ICMP types, ping, traceroute, PMTUD, ICMPv6
17-dns-internals.md Resolution chain, DNSSEC, DNS-over-TLS/HTTPS, negative caching
18-tls-ssl-layer.md TLS 1.3 handshake, cipher suites, certificate pinning, mTLS
19-http-evolution.md HTTP/1.1 pipelining, HTTP/2 frames/streams, HTTP/3 over QUIC
20-tcp-tuning.md net.core, net.ipv4.tcp_* sysctl knobs, ss/netstat analysis

Cross-References

  • Section 15 (Networking): socket layer, sk_buff, netfilter, XDP — the platform TCP runs on
  • Section 10 (Synchronization): TCP socket lock, receive queue spinlocks
  • Section 17 (Distributed Systems): distributed system failure modes emerge from TCP semantics (timeout, partial failure)
  • Section 19 (Virtualization): virtio-net TCP offload, network namespaces in containers
  • Section 18 (Database Internals): database replication streams over TCP; understanding replication lag requires TCP congestion understanding