02 — TCP Congestion Control

Technical Overview

TCP congestion control is the mechanism that prevents the Internet from collapsing under its own load. Without it, every sender would transmit as fast as possible, routers would drop everything, and throughput would approach zero. Congestion control adapts the sending rate to match available bandwidth while sharing capacity fairly among competing flows. The algorithms have evolved from Van Jacobson's crisis-driven 1988 design through Cubic's high-BDP optimization to BBR's model-based approach, and each reflects the network conditions of its era.

Prerequisites

TCP state machine and three-way handshake (see 01-tcp-state-machine.md)
TCP sequence numbers, ACKs, and retransmission fundamentals
Bandwidth-Delay Product (BDP) concept
Familiarity with ss, tc, iperf3, sysctl

Core Content

The 1988 Congestion Collapse

On October 27, 1988, the ARPANET (predecessor to the Internet) effectively ground to a halt. Throughput between UC Berkeley and MIT dropped from 32 kbps to 40 bps — a 800x reduction — due to congestion collapse.

The mechanism: routers dropped packets when buffers filled. Each dropped packet triggered TCP's timeout-based retransmit. But at timeout, TCP retransmitted the old packet while still sending new data at full rate — more packets, more drops, more retransmits, runaway positive feedback loop.

Van Jacobson and Michael Karels at LBNL diagnosed and fixed it in the weeks that followed, producing the algorithms (slow start, congestion avoidance, fast retransmit, fast recovery) that are still the foundation of TCP congestion control today.

Congestion Window (cwnd)

The sender limits the amount of unacknowledged data in flight to the minimum of: 1. cwnd (congestion window) — set by congestion control algorithm 2. rwnd (receiver window) — advertised by the receiver (flow control)

Bytes in flight = min(cwnd, rwnd)

The sender cannot have more than min(cwnd, rwnd) bytes unacknowledged at any time. If cwnd = 1 MSS (maximum segment size, typically 1460 bytes), the sender can send 1 segment and must wait for ACK before sending the next. If cwnd = 100 MSS, 100 segments can be in flight.

Slow Start

After connection establishment, cwnd starts at 10 MSS (Linux default since tcp_init_cwnd=10). Slow start grows cwnd exponentially: for each ACK received, cwnd += 1 MSS. Since one window of data generates one window of ACKs, the doubling happens once per RTT:

RTT 0: send 10 segments
RTT 1: 10 ACKs received → cwnd = 20, send 20 segments
RTT 2: 20 ACKs received → cwnd = 40
RTT 3: cwnd = 80
...until cwnd reaches ssthresh

Slow start terminates when cwnd >= ssthresh (slow start threshold). Initial ssthresh is usually set to the receive window size (effectively unbounded until first congestion event).

Congestion Avoidance (AIMD)

Once cwnd >= ssthresh, the algorithm switches to additive increase: increase cwnd by 1 MSS per RTT (regardless of window size). This is linear growth:

cwnd += MSS * MSS / cwnd   /* per ACK — approximates 1 MSS/RTT */

This is the AI (Additive Increase) part of AIMD. The MD (Multiplicative Decrease) happens on congestion detection:

On packet loss (timeout):
    ssthresh = cwnd / 2
    cwnd = 1 MSS    /* reset to slow start */

On 3 duplicate ACKs (fast retransmit):
    ssthresh = cwnd / 2
    cwnd = ssthresh  /* TCP Reno: don't restart slow start */

Fast Retransmit and Fast Recovery

The timeout-based retransmit waits the full RTO (200ms–3s) before retransmitting. Fast retransmit (Van Jacobson, 1988) triggers retransmit immediately upon receiving 3 duplicate ACKs — indicating the receiver got 3 segments after the lost one:

Sender                    Receiver
   | send seg 1               |
   | send seg 2               |
   | send seg 3 (lost)        |
   | send seg 4               |  recv seg 4 → dup ACK for seg 2
   |             dup ACK 2 <--|
   | send seg 5               |  recv seg 5 → dup ACK for seg 2
   |             dup ACK 2 <--|
   | send seg 6               |  recv seg 6 → dup ACK for seg 2
   |             dup ACK 2 <--|
   | retransmit seg 3 (no wait!)

Fast recovery (TCP Reno): after fast retransmit, instead of restarting slow start, inflate cwnd by the number of duplicate ACKs (since those segments have been received by the receiver, they're "out of the way"). This keeps the pipe full during recovery.

TCP Cubic (Default Linux Since 2.6.19)

TCP Reno's linear AIMD is pessimistic for high-bandwidth, high-latency (high-BDP) networks. On a 10 Gbps, 100ms RTT path, after a single packet loss:

Reno: ssthresh = cwnd/2, then linear growth at 1 MSS/RTT
Time to reach full window again: (cwnd/2) RTTs = potentially minutes

TCP Cubic (Injong Rhee & Lisong Xu, Linux 2.6.19, 2006) uses a cubic function of time since the last congestion event instead of ACK-based linear growth:

W(t) = C * (t - K)^3 + W_max
where:
  W_max = window at last congestion event
  K = time to reach W_max = cbrt(W_max * β / C)
  C = 0.4 (scaling constant)
  β = 0.7 (multiplicative decrease factor, larger than Reno's 0.5)

cwnd
 ^
 |                              /
 |                   Cubic   /
 |                       /  /
W_max|----............/   /
 |      "    /        /
 |        / concave /
 |       / region  / convex region
 |      /         /
 |     /         /
 +--+--+---------+---------+--- time
    |  Wmax/2    Wmax      Wmax+
  loss

Cubic properties: - Fast recovery: cubic function grows fast initially after loss (concave phase) - Plateau near Wmax: slows growth as it approaches the window where loss occurred (probing for new bandwidth) - Aggressive growth beyond Wmax: if no loss, assumes more bandwidth is available - High-BDP benefit: cubic function's growth is time-based, not RTT-based — faster recovery on high-RTT paths

Check and set congestion control:

sysctl net.ipv4.tcp_congestion_control      # current
sysctl net.ipv4.tcp_available_congestion_control  # available modules

sysctl -w net.ipv4.tcp_congestion_control=cubic  # set globally
# Per-socket: setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "cubic", 6)

TCP BBR (Bottleneck Bandwidth and RTT)

BBR (Bottleneck Bandwidth and RTT — Google, 2016, authors: Cardwell, Cheng, Yeganeh, Van Jacobson) is a fundamentally different approach. All previous algorithms (Reno, Cubic, HTCP, Vegas) are loss-based: they increase until loss, then back off. BBR is model-based: it estimates the network's true bottleneck bandwidth and RTT, then paces sending to match the model.

The BBR insight: loss-based algorithms probe for bandwidth by filling router queues. By the time a packet is lost, the queue is already full — adding 100–200ms of queuing delay. BBR separates bandwidth estimation from queue occupancy.

BBR state machine:

Phase	Duration	cwnd	Purpose
STARTUP	Until bandwidth stops growing	doubles each RTT	Slow start (like Reno)
DRAIN	Until queue drained	reduce to BDP	Clear queue built during STARTUP
PROBE_BW	Steady state cycling	varies ±25%	Probe for more bandwidth periodically
PROBE_RTT	200ms every 10s	4 packets	Measure minimum RTT (drain queue)

BBR BW estimation:
  bottleneck_bw = max(delivered_bytes / elapsed_time) over 10 RTTs

BBR pacing:
  pacing_rate = 1.25 * bottleneck_bw  (in PROBE_BW high phase)
  pacing_rate = 0.75 * bottleneck_bw  (in PROBE_BW low phase, give others room)

BBR vs Cubic performance comparison:

Scenario	Cubic	BBR
Low-loss LAN	Similar	Similar
100ms RTT, 0.01% loss	~5 Gbps	~9 Gbps (10G link)
High-loss satellite (1%)	~200 Mbps	~1 Gbps
Buffered path (200ms queue)	High delay	Low delay (BBR probes gently)
Many flows competing	Fair	Less fair (BBR tends to dominate Cubic)

# Enable BBR (requires Linux 4.9+, fq qdisc required for pacing)
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Monitor BBR state per connection
ss -ti dst :443 | grep bbr
# Output includes: bbr:(bw:X,mrtt:Y,pacing_gain:Z,cwnd_gain:W)

Google BBR deployment (2016): Google rolled out BBR on all its content servers (YouTube, Google.com). Results reported: 4% throughput improvement globally, up to 14% in high-loss regions. More significantly: video rebuffering rates dropped 50% in developing countries with higher-loss networks.

DCTCP: Data Center TCP

DCTCP (Mohammad Alizadeh et al., Microsoft Research, 2010) targets data center networks where RTTs are <1ms and switches support ECN (Explicit Congestion Notification).

ECN: routers mark packets (set CE bit in IP header) when queues exceed a threshold — before dropping. The receiver echoes these marks back via ECE flag in ACK. The sender reduces cwnd proportionally to the fraction of marked packets, not binary (one drop → halve cwnd).

DCTCP cwnd reduction:
  α = (1 - g) * α + g * F     /* F = fraction of ACKs with CE mark */
  cwnd = cwnd * (1 - α/2)     /* proportional reduction */

With ECN and DCTCP: queue depths stay at 1–5 packets (vs 50–500 for Cubic). End-to-end latency: <100 µs (vs 1–10ms with Cubic).

# Enable ECN
sysctl -w net.ipv4.tcp_ecn=1

# Enable DCTCP (requires ECN and switches with ECN marking)
sysctl -w net.ipv4.tcp_congestion_control=dctcp
modprobe tcp_dctcp

cwnd Evolution Diagram

cwnd (MSS)
  ^
  |                         Cubic                BBR
  |                            /               -------
  |                      /   /             ___/       \___
  |                  /   / /          ___/               \___
  |              /   / /          ___/                       \___
  |          /   / /          ___/                               \___
  |      /   / /          /¯¯¯
  |  /   / /      -------
  | / / /  [loss] [loss]
  |/+/                 Time -->
     ^       ^      ^
  slow start CA     CA
   (exp)   (linear) (cubic)

BBR: occupies a steady range around estimated BDP, occasionally
     probing high/low — no sharp drops on loss

Historical Context

Van Jacobson's 4 algorithms (slow start, congestion avoidance, fast retransmit, fast recovery) were implemented in BSD 4.3 Reno (hence "TCP Reno") and Linux simultaneously in 1988–1990. They represent one of the most impactful software contributions to network infrastructure.

TCP Vegas (Brakmo & Peterson, 1994) was an early attempt at a delay-based algorithm — it used RTT increase as a congestion signal rather than loss. It performed well in isolation but was aggressive against loss-based algorithms in shared networks.

TCP Cubic was developed at NCSU by Injong Rhee and Lisong Xu, originally published in 2005. It replaced HTCP as Linux's default in 2.6.19 (2006) after extensive testing on high-BDP networks.

BBR was the first major congestion control algorithm to come from an operational internet company (Google) rather than academia, reflecting the shift in network research from universities to hyperscalers with global infrastructure and the ability to run large-scale A/B tests.

Debugging Notes

# Per-connection congestion control and state
ss -ti 'dst :443'
# Shows: cwnd, ssthresh, rtt, retrans, bytes_acked, bytes_received, busy time

# Monitor cwnd evolution over time (bpftrace)
bpftrace -e '
kprobe:tcp_cong_avoid {
    $sk = (struct sock *)arg0;
    $tp = (struct tcp_sock *)$sk;
    printf("%s cwnd=%u ssthresh=%u\n", comm, $tp->snd_cwnd, $tp->snd_ssthresh);
}'

# Check retransmit counts per connection
ss -ti | grep retrans

# Global retransmit statistics
netstat -s | grep -i retran

# Compare congestion control algorithms
iperf3 -c <server> -C cubic -t 30 -i 5
iperf3 -c <server> -C bbr   -t 30 -i 5

Security Implications

BBR bandwidth estimation attacks: an attacker sending ACKs at a specific rate can manipulate BBR's bandwidth estimate, causing the sender to over- or under-send. Mitigated by ACK rate limiting and duplicate ACK detection.
Cubic AIMD fairness and exploitation: an application can set TCP_CONGESTION=cubic per socket while others use BBR, gaining throughput advantage since Cubic backs off less aggressively than BBR in some scenarios.
ECN and DCTCP in the public Internet: ECN-capable packets on the public internet can be exploited if routers mark them under attack conditions, causing senders to reduce rate — effectively a covert congestion-based DoS.
Low-rate TCP attacks (LRAT, 2003): periodic short bursts timed to expire retransmit timers keep TCP cwnd near minimum. The attacker uses very little bandwidth to impose large throughput reduction on the victim. Mitigated by TCP SACK and modern RTO estimation.

Performance Implications

Algorithm	Goodput (high-BDP)	Latency	CPU overhead
Reno	Low (over-conservative)	Low	Lowest
Cubic	Good	Medium	Low
BBR	Best on lossy paths	Low (no queue buildup)	Slightly higher
DCTCP	Best in DC	Very low (<100µs)	Low

For a 1 Gbps, 10ms RTT connection: - Cubic with 0% loss: ~950 Mbps (near line rate) - BBR with 0% loss: ~950 Mbps (similar) - Cubic with 1% loss: ~50 Mbps (Reno-like) - BBR with 1% loss: ~800 Mbps (model-based, loss-tolerant)

Failure Modes and Real Incidents

Incident: BBR causing starvation of Cubic flows (2017, academic study) When BBR and Cubic flows share a bottleneck link, BBR can consume 80–95% of available bandwidth in some configurations, reducing Cubic throughput to near-zero. The mechanism: BBR probes high while Cubic backs off on loss — BBR sees less loss during Cubic's backoff, increasing its window. Google acknowledged this and BBRv2 added explicit congestion signals to be fairer.

Incident: DCTCP mis-deployment outside DC (2019, cloud provider) DCTCP was enabled globally on a cloud provider's edge servers — including connections to the public internet where ECN marking is inconsistent. The CE bit misinterpretation caused random cwnd reduction on unrelated TCP flows. Fix: DCTCP should only be used on controlled DC networks with ECN-marking switches.

Failure Mode: Cubic + high-BDP + single flow A single Cubic TCP flow on a 100 Gbps, 100ms RTT link needs a cwnd of 1.25 GB to fill the pipe. At 1 MSS/RTT additive increase, recovering from loss takes 625,000 RTTs = 17+ hours. Parallel streams (iperf3 -P 8) are the practical workaround.

Modern Usage

BBR is the default for Google's infrastructure (GFE, YouTube, Cloud CDN) and is deployable via sysctl on any Linux 4.9+ kernel
BBRv2 (submitted to kernel, not yet merged as of Linux 6.x): adds ECN signal integration and better fairness with Cubic flows
QUIC congestion control uses the same algorithms (Cubic, BBR, NewReno) as TCP — QUIC implementations select which to use per-connection
eBPF congestion control (Linux 5.13): BPF_PROG_TYPE_STRUCT_OPS allows loading custom CC algorithms as eBPF programs, enabling per-connection CC selection and rapid iteration

Future Directions

Copa (MIT CSAIL, 2018): delay-based algorithm that is provably fair with itself and competitive with loss-based algorithms — shows promise for public internet deployment
Paced Chirping (Johansson & Zander): sub-RTT bandwidth probing that converges faster than BBR's 10-RTT probing window
Hardware pacing offload: NICs that implement TCP pacing in firmware enable BBR to operate without the kernel fq qdisc — currently under development in Mellanox OFED

Exercises

Run iperf3 -c <server> -C cubic and iperf3 -c <server> -C bbr on a lossy network (simulate 1% loss with tc qdisc add dev eth0 root netem loss 1%). Compare throughput. Explain why BBR outperforms Cubic on lossy links using the algorithm models.
Use bpftrace to instrument tcp_cong_avoid and log cwnd every 100ms for a 30-second iperf3 run. Plot the cwnd over time. Identify slow start, congestion avoidance, and loss events from the cwnd curve.
Implement the AIMD algorithm in Python, simulating a single TCP flow competing with a UDP flood for a 100 Mbps bottleneck link. Vary the UDP flood bandwidth from 0% to 90% of link capacity and plot TCP throughput vs congestion level.
Configure two iperf3 clients: one using BBR and one using Cubic, both targeting the same server simultaneously. Monitor ss -ti for both flows every second. Observe and explain the fairness behavior between BBR and Cubic.
Enable DCTCP on loopback (both client and server, with ECN enabled). Compare RTT under load to Cubic on loopback. Use tc qdisc add dev lo root fq_codel ecn to add ECN marking, and measure cwnd stability.

References

RFC 5681 — TCP Congestion Control (Reno)
RFC 8312 — CUBIC for Fast Long-Distance Networks
Cardwell, N. et al. BBR: Congestion-Based Congestion Control. ACM Queue, 2016.
Alizadeh, M. et al. Data Center TCP (DCTCP). ACM SIGCOMM 2010.
Jacobson, V. & Karels, M. Congestion Avoidance and Control. ACM SIGCOMM 1988. (The seminal paper)
net/ipv4/tcp_cubic.c — Linux Cubic implementation
net/ipv4/tcp_bbr.c — Linux BBR implementation
net/ipv4/tcp_cong.c — congestion control framework
Documentation/networking/tcp.rst — kernel TCP documentation
Floyd, S. & Jacobson, V. Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM ToN, 1993.