02 — TCP Congestion Control
Technical Overview
TCP congestion control is the mechanism that prevents the Internet from collapsing under its own load. Without it, every sender would transmit as fast as possible, routers would drop everything, and throughput would approach zero. Congestion control adapts the sending rate to match available bandwidth while sharing capacity fairly among competing flows. The algorithms have evolved from Van Jacobson's crisis-driven 1988 design through Cubic's high-BDP optimization to BBR's model-based approach, and each reflects the network conditions of its era.
Prerequisites
- TCP state machine and three-way handshake (see
01-tcp-state-machine.md) - TCP sequence numbers, ACKs, and retransmission fundamentals
- Bandwidth-Delay Product (BDP) concept
- Familiarity with
ss,tc,iperf3,sysctl
Core Content
The 1988 Congestion Collapse
On October 27, 1988, the ARPANET (predecessor to the Internet) effectively ground to a halt. Throughput between UC Berkeley and MIT dropped from 32 kbps to 40 bps — a 800x reduction — due to congestion collapse.
The mechanism: routers dropped packets when buffers filled. Each dropped packet triggered TCP's timeout-based retransmit. But at timeout, TCP retransmitted the old packet while still sending new data at full rate — more packets, more drops, more retransmits, runaway positive feedback loop.
Van Jacobson and Michael Karels at LBNL diagnosed and fixed it in the weeks that followed, producing the algorithms (slow start, congestion avoidance, fast retransmit, fast recovery) that are still the foundation of TCP congestion control today.
Congestion Window (cwnd)
The sender limits the amount of unacknowledged data in flight to the minimum of:
1. cwnd (congestion window) — set by congestion control algorithm
2. rwnd (receiver window) — advertised by the receiver (flow control)
Bytes in flight = min(cwnd, rwnd)
The sender cannot have more than min(cwnd, rwnd) bytes unacknowledged at any time. If cwnd = 1 MSS (maximum segment size, typically 1460 bytes), the sender can send 1 segment and must wait for ACK before sending the next. If cwnd = 100 MSS, 100 segments can be in flight.
Slow Start
After connection establishment, cwnd starts at 10 MSS (Linux default since tcp_init_cwnd=10). Slow start grows cwnd exponentially: for each ACK received, cwnd += 1 MSS. Since one window of data generates one window of ACKs, the doubling happens once per RTT:
RTT 0: send 10 segments
RTT 1: 10 ACKs received → cwnd = 20, send 20 segments
RTT 2: 20 ACKs received → cwnd = 40
RTT 3: cwnd = 80
...until cwnd reaches ssthresh
Slow start terminates when cwnd >= ssthresh (slow start threshold). Initial ssthresh is usually set to the receive window size (effectively unbounded until first congestion event).
Congestion Avoidance (AIMD)
Once cwnd >= ssthresh, the algorithm switches to additive increase: increase cwnd by 1 MSS per RTT (regardless of window size). This is linear growth:
cwnd += MSS * MSS / cwnd /* per ACK — approximates 1 MSS/RTT */
This is the AI (Additive Increase) part of AIMD. The MD (Multiplicative Decrease) happens on congestion detection:
On packet loss (timeout):
ssthresh = cwnd / 2
cwnd = 1 MSS /* reset to slow start */
On 3 duplicate ACKs (fast retransmit):
ssthresh = cwnd / 2
cwnd = ssthresh /* TCP Reno: don't restart slow start */
Fast Retransmit and Fast Recovery
The timeout-based retransmit waits the full RTO (200ms–3s) before retransmitting. Fast retransmit (Van Jacobson, 1988) triggers retransmit immediately upon receiving 3 duplicate ACKs — indicating the receiver got 3 segments after the lost one:
Sender Receiver
| send seg 1 |
| send seg 2 |
| send seg 3 (lost) |
| send seg 4 | recv seg 4 → dup ACK for seg 2
| dup ACK 2 <--|
| send seg 5 | recv seg 5 → dup ACK for seg 2
| dup ACK 2 <--|
| send seg 6 | recv seg 6 → dup ACK for seg 2
| dup ACK 2 <--|
| retransmit seg 3 (no wait!)
Fast recovery (TCP Reno): after fast retransmit, instead of restarting slow start, inflate cwnd by the number of duplicate ACKs (since those segments have been received by the receiver, they're "out of the way"). This keeps the pipe full during recovery.
TCP Cubic (Default Linux Since 2.6.19)
TCP Reno's linear AIMD is pessimistic for high-bandwidth, high-latency (high-BDP) networks. On a 10 Gbps, 100ms RTT path, after a single packet loss:
Reno: ssthresh = cwnd/2, then linear growth at 1 MSS/RTT
Time to reach full window again: (cwnd/2) RTTs = potentially minutes
TCP Cubic (Injong Rhee & Lisong Xu, Linux 2.6.19, 2006) uses a cubic function of time since the last congestion event instead of ACK-based linear growth:
W(t) = C * (t - K)^3 + W_max
where:
W_max = window at last congestion event
K = time to reach W_max = cbrt(W_max * β / C)
C = 0.4 (scaling constant)
β = 0.7 (multiplicative decrease factor, larger than Reno's 0.5)
cwnd
^
| /
| Cubic /
| / /
W_max|----............/ /
| " / /
| / concave /
| / region / convex region
| / /
| / /
+--+--+---------+---------+--- time
| Wmax/2 Wmax Wmax+
loss
Cubic properties: - Fast recovery: cubic function grows fast initially after loss (concave phase) - Plateau near Wmax: slows growth as it approaches the window where loss occurred (probing for new bandwidth) - Aggressive growth beyond Wmax: if no loss, assumes more bandwidth is available - High-BDP benefit: cubic function's growth is time-based, not RTT-based — faster recovery on high-RTT paths
Check and set congestion control:
sysctl net.ipv4.tcp_congestion_control # current
sysctl net.ipv4.tcp_available_congestion_control # available modules
sysctl -w net.ipv4.tcp_congestion_control=cubic # set globally
# Per-socket: setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "cubic", 6)
TCP BBR (Bottleneck Bandwidth and RTT)
BBR (Bottleneck Bandwidth and RTT — Google, 2016, authors: Cardwell, Cheng, Yeganeh, Van Jacobson) is a fundamentally different approach. All previous algorithms (Reno, Cubic, HTCP, Vegas) are loss-based: they increase until loss, then back off. BBR is model-based: it estimates the network's true bottleneck bandwidth and RTT, then paces sending to match the model.
The BBR insight: loss-based algorithms probe for bandwidth by filling router queues. By the time a packet is lost, the queue is already full — adding 100–200ms of queuing delay. BBR separates bandwidth estimation from queue occupancy.
BBR state machine:
| Phase | Duration | cwnd | Purpose |
|---|---|---|---|
| STARTUP | Until bandwidth stops growing | doubles each RTT | Slow start (like Reno) |
| DRAIN | Until queue drained | reduce to BDP | Clear queue built during STARTUP |
| PROBE_BW | Steady state cycling | varies ±25% | Probe for more bandwidth periodically |
| PROBE_RTT | 200ms every 10s | 4 packets | Measure minimum RTT (drain queue) |
BBR BW estimation:
bottleneck_bw = max(delivered_bytes / elapsed_time) over 10 RTTs
BBR pacing:
pacing_rate = 1.25 * bottleneck_bw (in PROBE_BW high phase)
pacing_rate = 0.75 * bottleneck_bw (in PROBE_BW low phase, give others room)
BBR vs Cubic performance comparison:
| Scenario | Cubic | BBR |
|---|---|---|
| Low-loss LAN | Similar | Similar |
| 100ms RTT, 0.01% loss | ~5 Gbps | ~9 Gbps (10G link) |
| High-loss satellite (1%) | ~200 Mbps | ~1 Gbps |
| Buffered path (200ms queue) | High delay | Low delay (BBR probes gently) |
| Many flows competing | Fair | Less fair (BBR tends to dominate Cubic) |
# Enable BBR (requires Linux 4.9+, fq qdisc required for pacing)
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Monitor BBR state per connection
ss -ti dst :443 | grep bbr
# Output includes: bbr:(bw:X,mrtt:Y,pacing_gain:Z,cwnd_gain:W)
Google BBR deployment (2016): Google rolled out BBR on all its content servers (YouTube, Google.com). Results reported: 4% throughput improvement globally, up to 14% in high-loss regions. More significantly: video rebuffering rates dropped 50% in developing countries with higher-loss networks.
DCTCP: Data Center TCP
DCTCP (Mohammad Alizadeh et al., Microsoft Research, 2010) targets data center networks where RTTs are <1ms and switches support ECN (Explicit Congestion Notification).
ECN: routers mark packets (set CE bit in IP header) when queues exceed a threshold — before dropping. The receiver echoes these marks back via ECE flag in ACK. The sender reduces cwnd proportionally to the fraction of marked packets, not binary (one drop → halve cwnd).
DCTCP cwnd reduction:
α = (1 - g) * α + g * F /* F = fraction of ACKs with CE mark */
cwnd = cwnd * (1 - α/2) /* proportional reduction */
With ECN and DCTCP: queue depths stay at 1–5 packets (vs 50–500 for Cubic). End-to-end latency: <100 µs (vs 1–10ms with Cubic).
# Enable ECN
sysctl -w net.ipv4.tcp_ecn=1
# Enable DCTCP (requires ECN and switches with ECN marking)
sysctl -w net.ipv4.tcp_congestion_control=dctcp
modprobe tcp_dctcp
cwnd Evolution Diagram
cwnd (MSS)
^
| Cubic BBR
| / -------
| / / ___/ \___
| / / / ___/ \___
| / / / ___/ \___
| / / / ___/ \___
| / / / /¯¯¯
| / / / -------
| / / / [loss] [loss]
|/+/ Time -->
^ ^ ^
slow start CA CA
(exp) (linear) (cubic)
BBR: occupies a steady range around estimated BDP, occasionally
probing high/low — no sharp drops on loss
Historical Context
Van Jacobson's 4 algorithms (slow start, congestion avoidance, fast retransmit, fast recovery) were implemented in BSD 4.3 Reno (hence "TCP Reno") and Linux simultaneously in 1988–1990. They represent one of the most impactful software contributions to network infrastructure.
TCP Vegas (Brakmo & Peterson, 1994) was an early attempt at a delay-based algorithm — it used RTT increase as a congestion signal rather than loss. It performed well in isolation but was aggressive against loss-based algorithms in shared networks.
TCP Cubic was developed at NCSU by Injong Rhee and Lisong Xu, originally published in 2005. It replaced HTCP as Linux's default in 2.6.19 (2006) after extensive testing on high-BDP networks.
BBR was the first major congestion control algorithm to come from an operational internet company (Google) rather than academia, reflecting the shift in network research from universities to hyperscalers with global infrastructure and the ability to run large-scale A/B tests.
Debugging Notes
# Per-connection congestion control and state
ss -ti 'dst :443'
# Shows: cwnd, ssthresh, rtt, retrans, bytes_acked, bytes_received, busy time
# Monitor cwnd evolution over time (bpftrace)
bpftrace -e '
kprobe:tcp_cong_avoid {
$sk = (struct sock *)arg0;
$tp = (struct tcp_sock *)$sk;
printf("%s cwnd=%u ssthresh=%u\n", comm, $tp->snd_cwnd, $tp->snd_ssthresh);
}'
# Check retransmit counts per connection
ss -ti | grep retrans
# Global retransmit statistics
netstat -s | grep -i retran
# Compare congestion control algorithms
iperf3 -c <server> -C cubic -t 30 -i 5
iperf3 -c <server> -C bbr -t 30 -i 5
Security Implications
- BBR bandwidth estimation attacks: an attacker sending ACKs at a specific rate can manipulate BBR's bandwidth estimate, causing the sender to over- or under-send. Mitigated by ACK rate limiting and duplicate ACK detection.
- Cubic AIMD fairness and exploitation: an application can set
TCP_CONGESTION=cubicper socket while others use BBR, gaining throughput advantage since Cubic backs off less aggressively than BBR in some scenarios. - ECN and DCTCP in the public Internet: ECN-capable packets on the public internet can be exploited if routers mark them under attack conditions, causing senders to reduce rate — effectively a covert congestion-based DoS.
- Low-rate TCP attacks (LRAT, 2003): periodic short bursts timed to expire retransmit timers keep TCP cwnd near minimum. The attacker uses very little bandwidth to impose large throughput reduction on the victim. Mitigated by TCP SACK and modern RTO estimation.
Performance Implications
| Algorithm | Goodput (high-BDP) | Latency | CPU overhead |
|---|---|---|---|
| Reno | Low (over-conservative) | Low | Lowest |
| Cubic | Good | Medium | Low |
| BBR | Best on lossy paths | Low (no queue buildup) | Slightly higher |
| DCTCP | Best in DC | Very low (<100µs) | Low |
For a 1 Gbps, 10ms RTT connection: - Cubic with 0% loss: ~950 Mbps (near line rate) - BBR with 0% loss: ~950 Mbps (similar) - Cubic with 1% loss: ~50 Mbps (Reno-like) - BBR with 1% loss: ~800 Mbps (model-based, loss-tolerant)
Failure Modes and Real Incidents
Incident: BBR causing starvation of Cubic flows (2017, academic study) When BBR and Cubic flows share a bottleneck link, BBR can consume 80–95% of available bandwidth in some configurations, reducing Cubic throughput to near-zero. The mechanism: BBR probes high while Cubic backs off on loss — BBR sees less loss during Cubic's backoff, increasing its window. Google acknowledged this and BBRv2 added explicit congestion signals to be fairer.
Incident: DCTCP mis-deployment outside DC (2019, cloud provider) DCTCP was enabled globally on a cloud provider's edge servers — including connections to the public internet where ECN marking is inconsistent. The CE bit misinterpretation caused random cwnd reduction on unrelated TCP flows. Fix: DCTCP should only be used on controlled DC networks with ECN-marking switches.
Failure Mode: Cubic + high-BDP + single flow
A single Cubic TCP flow on a 100 Gbps, 100ms RTT link needs a cwnd of 1.25 GB to fill the pipe. At 1 MSS/RTT additive increase, recovering from loss takes 625,000 RTTs = 17+ hours. Parallel streams (iperf3 -P 8) are the practical workaround.
Modern Usage
- BBR is the default for Google's infrastructure (GFE, YouTube, Cloud CDN) and is deployable via
sysctlon any Linux 4.9+ kernel - BBRv2 (submitted to kernel, not yet merged as of Linux 6.x): adds ECN signal integration and better fairness with Cubic flows
- QUIC congestion control uses the same algorithms (Cubic, BBR, NewReno) as TCP — QUIC implementations select which to use per-connection
- eBPF congestion control (Linux 5.13):
BPF_PROG_TYPE_STRUCT_OPSallows loading custom CC algorithms as eBPF programs, enabling per-connection CC selection and rapid iteration
Future Directions
- Copa (MIT CSAIL, 2018): delay-based algorithm that is provably fair with itself and competitive with loss-based algorithms — shows promise for public internet deployment
- Paced Chirping (Johansson & Zander): sub-RTT bandwidth probing that converges faster than BBR's 10-RTT probing window
- Hardware pacing offload: NICs that implement TCP pacing in firmware enable BBR to operate without the kernel
fqqdisc — currently under development in Mellanox OFED
Exercises
-
Run
iperf3 -c <server> -C cubicandiperf3 -c <server> -C bbron a lossy network (simulate 1% loss withtc qdisc add dev eth0 root netem loss 1%). Compare throughput. Explain why BBR outperforms Cubic on lossy links using the algorithm models. -
Use
bpftraceto instrumenttcp_cong_avoidand logcwndevery 100ms for a 30-secondiperf3run. Plot the cwnd over time. Identify slow start, congestion avoidance, and loss events from the cwnd curve. -
Implement the AIMD algorithm in Python, simulating a single TCP flow competing with a UDP flood for a 100 Mbps bottleneck link. Vary the UDP flood bandwidth from 0% to 90% of link capacity and plot TCP throughput vs congestion level.
-
Configure two
iperf3clients: one using BBR and one using Cubic, both targeting the same server simultaneously. Monitorss -tifor both flows every second. Observe and explain the fairness behavior between BBR and Cubic. -
Enable DCTCP on loopback (both client and server, with ECN enabled). Compare RTT under load to Cubic on loopback. Use
tc qdisc add dev lo root fq_codel ecnto add ECN marking, and measure cwnd stability.
References
- RFC 5681 — TCP Congestion Control (Reno)
- RFC 8312 — CUBIC for Fast Long-Distance Networks
- Cardwell, N. et al. BBR: Congestion-Based Congestion Control. ACM Queue, 2016.
- Alizadeh, M. et al. Data Center TCP (DCTCP). ACM SIGCOMM 2010.
- Jacobson, V. & Karels, M. Congestion Avoidance and Control. ACM SIGCOMM 1988. (The seminal paper)
net/ipv4/tcp_cubic.c— Linux Cubic implementationnet/ipv4/tcp_bbr.c— Linux BBR implementationnet/ipv4/tcp_cong.c— congestion control frameworkDocumentation/networking/tcp.rst— kernel TCP documentation- Floyd, S. & Jacobson, V. Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM ToN, 1993.