08 — Benchmarking

Technical Overview

Benchmarking is the act of measuring system performance under defined, reproducible conditions. Done correctly, it reveals capacity limits, validates optimization hypotheses, and catches regressions before production. Done incorrectly, it produces misleading numbers that cause engineers to invest in the wrong areas—or worse, to ship products with undetected performance regressions while believing they are fast.

The field is littered with methodological traps: JIT warmup, CPU frequency scaling, timer resolution, coordinated omission. This document addresses these systematically and surveys the benchmarking tools appropriate for each layer of the stack.

Prerequisites

Statistical concepts: mean, median, variance, standard deviation, confidence intervals.
CPU frequency scaling and power states.
Basic profiling with perf.
Application-layer knowledge for the system under test.

Core Content

Benchmarking Fundamentals

1. Define what you are measuring. Throughput (requests/second), latency (a distribution: p50, p99), scalability (how throughput changes with core count), or efficiency (ops/watt). These are different measurements and require different benchmark designs.

2. Isolate variables. One change at a time. When comparing two systems, everything must be identical except the one dimension being tested: same hardware, same OS version, same kernel config, same network path.

3. Control the environment. Disable CPU frequency scaling, ASLR, and other sources of non-determinism when seeking reproducible results:

# Disable frequency scaling (set to performance governor); requires root
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done

# Disable turbo boost (Intel, systems using the intel_pstate driver); requires root
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# Set CPU affinity to avoid migration
taskset -c 0-7 ./benchmark

# Disable ASLR for reproducible addresses
echo 0 > /proc/sys/kernel/randomize_va_space

4. Warm up. Caches, JIT compilers, connection pools, OS page cache, and DRAM row buffers all need time to reach steady state. Discard the first 30–60 seconds of data.
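The discard step can also be applied in post-processing. The sketch below assumes samples were logged as (seconds since start, latency in ms) pairs; the function name, the data layout, and the synthetic run are illustrative assumptions, not part of any specific tool.

```python
from statistics import mean

def discard_warmup(samples, warmup_s=30.0):
    """Keep only samples recorded at or after warmup_s seconds into the run.

    samples: list of (seconds_since_start, latency_ms) pairs (assumed layout).
    """
    return [(t, lat) for t, lat in samples if t >= warmup_s]

# Synthetic run: latency is 20 ms while caches fill, then settles at 5 ms.
run = [(t, 20.0 if t < 30 else 5.0) for t in range(120)]
steady = discard_warmup(run, warmup_s=30.0)

mean_all = mean(lat for _, lat in run)
mean_steady = mean(lat for _, lat in steady)
print(f"mean including warm-up: {mean_all} ms")
print(f"mean after discarding:  {mean_steady} ms")
```

In this synthetic run the warm-up period alone moves the mean from 5 ms to 8.75 ms, which is the kind of distortion the discard step is meant to remove.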

5. Sufficient sample size. Run at least 5–10 trials. Report median ± IQR or mean ± standard deviation. For significance testing, use the Mann-Whitney U test (non-parametric, appropriate for non-normal latency distributions) or a t-test (Welch's variant if the variances differ) when the samples are approximately normally distributed.
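A minimal sketch of the reporting step, using only Python's standard library. The trial values are hypothetical throughput numbers chosen to include one outlier; the helper name summarize is an assumption for illustration.

```python
import statistics

def summarize(trials):
    """Return median, IQR, mean and sample standard deviation for trial results."""
    q1, med, q3 = statistics.quantiles(trials, n=4, method="inclusive")
    return {
        "median": med,
        "iqr": q3 - q1,
        "mean": statistics.mean(trials),
        "stdev": statistics.stdev(trials),
    }

# Hypothetical throughput (requests/second) from 7 trials; the last is an outlier.
trials = [9800, 10100, 9950, 10050, 9900, 10000, 12000]
s = summarize(trials)
print(f"median {s['median']:.0f} ± IQR {s['iqr']:.0f}")
print(f"mean   {s['mean']:.0f} ± stdev {s['stdev']:.0f}")
```

The single outlier pulls the mean well above the median and inflates the standard deviation, while the median and IQR barely move. That robustness is the usual argument for reporting median ± IQR when runs are noisy.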

Microbenchmarking Pitfalls

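JVM JIT warmup. The JVM interprets bytecode initially, then JIT-compiles hot methods in tiers: the C1 compiler after a few hundred to a few thousand invocations, and the C2 compiler after on the order of ten thousand invocations or loop iterations (exact thresholds depend on the JVM version and flags). A benchmark that runs only 100 iterations measures interpreted performance, not production performance.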

Fix: use JMH (Java Microbenchmark Harness). JMH handles warmup, forking fresh JVMs, and computing statistics correctly.

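import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
public class MyBenchmark {
    // Benchmark method parameters must be @State classes so JMH can create and inject them
    @State(Scope.Thread)
    public static class MyState {
        int x = 42;
        int doWork() { return x * 31; }  // placeholder workload
    }

    @Benchmark
    public int compute(MyState state) {
        return state.doWork();  // returning the value lets JMH consume it
    }
}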

CPU frequency scaling. If the governor is ondemand or powersave, early iterations run at low frequency and warm up the CPU. The "warm" results appear faster because the CPU has scaled up, not because of caching or JIT.

CPU migration. The OS may migrate the benchmark process to a different CPU core between iterations. Each migration incurs cache cold misses. Fix: taskset -c 0 ./benchmark.
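Pinning can also be done from inside the benchmark process. The sketch below uses Python's os.sched_setaffinity, which exists only on Linux; the helper name and the choice of the lowest-numbered allowed CPU are assumptions for illustration.

```python
import os

def pin_to_one_cpu():
    """Pin the current process to a single CPU and return its index.

    Returns None on platforms where os.sched_setaffinity is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently run on
    cpu = min(allowed)                  # arbitrary choice: lowest-numbered allowed CPU
    os.sched_setaffinity(0, {cpu})      # pid 0 means "the calling process"
    return cpu

cpu = pin_to_one_cpu()
print("pinned to CPU:", cpu)
```

Pinning from inside the process has the same effect as launching under taskset, and it keeps the affinity setting in the benchmark's own source where reviewers can see it.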

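Timer resolution. System.currentTimeMillis() (Java) has millisecond resolution. gettimeofday() and Python's time.time() report microseconds or finer, but both read the wall clock, which NTP or an administrator can step or slew in the middle of a measurement. Measuring operations that complete in microseconds requires a monotonic, nanosecond-resolution timer: System.nanoTime() (Java), time.perf_counter_ns() (Python), or clock_gettime() (C).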

Use RDTSC or clock_gettime(CLOCK_MONOTONIC_RAW) for sub-microsecond timing:

#include <stdint.h>

// Read the x86 time-stamp counter (GCC/Clang inline assembly)
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Usage: ticks_per_ns (a double) must be calibrated first, e.g. against clock_gettime()
uint64_t start = rdtsc();
// ... work ...
uint64_t end = rdtsc();
double ns = (end - start) / ticks_per_ns;

Note: RDTSC counts reference clock cycles (invariant TSC); it is not affected by frequency scaling on modern x86.
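Before trusting any timer, it helps to check what the platform actually delivers. The sketch below does this for Python's clocks: it prints the resolution each clock reports for itself, then measures the smallest step observed between consecutive readings. The helper name and sample count are illustrative assumptions, and the printed values vary by OS and hardware.

```python
import time

# Resolution and monotonicity each clock reports for itself (platform-dependent).
for name in ("time", "monotonic", "perf_counter"):
    info = time.get_clock_info(name)
    print(f"{name:13s} resolution={info.resolution:.1e} s  monotonic={info.monotonic}")

def smallest_observed_tick_ns(samples=100_000):
    """Smallest non-zero gap seen between consecutive perf_counter_ns() readings."""
    smallest = None
    prev = time.perf_counter_ns()
    for _ in range(samples):
        now = time.perf_counter_ns()
        delta = now - prev
        if delta > 0 and (smallest is None or delta < smallest):
            smallest = delta
        prev = now
    return smallest

tick = smallest_observed_tick_ns()
print("smallest observed perf_counter_ns step:", tick, "ns")
```

If the operation under test takes less than a few multiples of the observed step, time a batch of iterations and divide rather than timing each call.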

Dead code elimination. The compiler may optimize away computations whose results are never used. In JMH, return the result from the @Benchmark method or pass it to Blackhole.consume(); in Google Benchmark, use benchmark::DoNotOptimize(); elsewhere, a volatile memory write can serve the same purpose.

// Google Benchmark: prevent optimizer from removing work
benchmark::DoNotOptimize(result);
benchmark::ClobberMemory();  // compiler barrier: forces pending writes out to memory so they cannot be elided

Coordinated Omission (Gil Tene)

The most pervasive latency benchmarking bug. It occurs when:

1. The benchmark sends requests at a target rate.
2. The system under test gets slow or stalls.
3. The benchmark waits for the slow response before sending the next request.
4. Result: the benchmark only measures responses during the good periods; slow periods are not sampled proportionally.

Time:    1   2   3   4   5   6   7   8   9   10
Request: ●   ●   ●   ●   ●       ●   ●   ●   ●  (4 missed due to stall)
Latency: 5ms 5ms 5ms 5ms 500ms  5ms 5ms 5ms 5ms

Closed-loop (coordinated omission) p99: 500ms  [1 in 9 slow = 11%]
Actual user experience (open-loop):   500ms+ for requests 6,7,8,9 too
  because they were queued behind request 5

A closed-loop benchmark that stalls behind a slow response never issues requests 6–9 at the scheduled time. An open-loop benchmark does. The latency experienced by real users is what the open-loop benchmark measures.
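The effect can be reproduced with a small deterministic simulation. The sketch below assumes a single FIFO server with a fixed 5 ms service time that freezes once for one second, and a client scheduled to send every 10 ms that waits for each response before sending again. All parameters and function names are illustrative. The same run is scored twice: from the actual send time (what a closed-loop tool records) and from the scheduled send time (what an open-loop client would experience in this model).

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def simulate(n=1000, interval_ms=10, service_ms=5, stall_start_ms=5000, stall_ms=1000):
    """Closed-loop client against a server that freezes once.

    Returns (measured, corrected): latencies timed from the actual send time
    and from the scheduled send time, respectively.
    """
    measured, corrected = [], []
    client_free_at = 0                      # when the previous response arrived
    for i in range(n):
        intended = i * interval_ms          # when the request should have been sent
        sent = max(intended, client_free_at)
        start = sent
        if stall_start_ms <= start < stall_start_ms + stall_ms:
            start = stall_start_ms + stall_ms   # server is frozen; work resumes afterwards
        done = start + service_ms
        measured.append(done - sent)
        corrected.append(done - intended)
        client_free_at = done
    return measured, corrected

measured, corrected = simulate()
print("p99 timed from actual send:   ", percentile(measured, 99), "ms")   # 5 ms
print("p99 timed from scheduled send:", percentile(corrected, 99), "ms")  # 955 ms
```

Timed from the actual send, only one request in a thousand looks slow, so p99 stays at 5 ms. Timed from the schedule, the 200 requests that were due during the stall and the catch-up period all count as late, and p99 rises to 955 ms.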

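Fix: use an open-loop load generator: wrk2 (Gil Tene's fork of wrk), which sends at a constant rate and times each request from its scheduled send time, or Gatling with an open (arrival-rate) injection profile. JMH's Mode.Throughput is closed-loop; it suits microbenchmarks but does not correct for coordinated omission.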

# wrk2: open-loop, 10,000 req/s, 30 seconds, 4 threads, 100 connections
wrk2 -t 4 -c 100 -d 30s -R 10000 --latency http://server/endpoint

HDR Histogram (High Dynamic Range)

Traditional histograms have fixed bucket boundaries. If you don't know the expected range in advance, you either waste memory (too many buckets) or lose precision at the tail (too few buckets).

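Gil Tene's HDR Histogram (2012) records values to a configurable number of significant decimal digits (two in the example below) across a configurable range (e.g., 1 µs to 3600 seconds). Its memory footprint is fixed when the histogram is created: it depends on the configured range and precision (tens of kilobytes for a configuration like this one), not on how many values are recorded.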

HDR histograms solve the precision problem for latency recording:

- p50: 5.2 ms
- p99: 48 ms
- p99.9: 234 ms
- p99.99: 1.2 s
- max: 3.4 s

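All recorded precisely. In a 1-million-request benchmark, a traditional histogram with 100 buckets over 0–1 s has 10 ms-wide buckets and no bucket at all for the 1.2 s and 3.4 s values: they are dropped or clamped into an overflow bucket, and the tail is lost.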

// Java HdrHistogram
Histogram histogram = new Histogram(3_600_000_000_000L, 2); // max 1 hour, 2 sig digits
histogram.recordValue(latencyNs);
System.out.println("p99: " + histogram.getValueAtPercentile(99.0));

Available in Java, C/C++, Go, Rust, Python.
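To show why log-linear bucketing keeps relative precision across a huge range, here is a toy histogram that retains only the top 8 significant bits of each value. It illustrates the bucketing idea only; it is not the HdrHistogram library or its API, and the class and method names are assumptions.

```python
import math

class LogLinearHistogram:
    """Toy histogram keeping the top `bits` significant bits of each value.

    Every power-of-two range is split into equal-width sub-buckets, so the
    relative error stays bounded (under 1% for bits=8) however large the value.
    """

    def __init__(self, bits=8):
        self.bits = bits
        self.counts = {}    # bucket lower bound -> count
        self.total = 0

    def record(self, value):
        shift = max(0, value.bit_length() - self.bits)
        bucket = (value >> shift) << shift   # round down to the bucket's lower bound
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        self.total += 1

    def value_at_percentile(self, p):
        rank = max(1, math.ceil(p * self.total / 100))
        seen = 0
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen >= rank:
                return bucket
        raise ValueError("histogram is empty")

h = LogLinearHistogram()
for _ in range(999):
    h.record(5_200_000)          # 5.2 ms in nanoseconds
h.record(3_400_000_000)          # one 3.4 s outlier

print("p50:", h.value_at_percentile(50), "ns")
print("max:", h.value_at_percentile(100), "ns")
```

Both the 5.2 ms median and the 3.4 s maximum come back within 1% of the recorded values, using two buckets and no advance knowledge of where the tail would land.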

Benchmarking Tools

sysbench (CPU, memory, I/O, MySQL):

# CPU benchmark: find prime numbers up to N
sysbench cpu --cpu-max-prime=20000 --threads=8 run

# Memory throughput
sysbench memory --memory-block-size=1M --memory-total-size=10G --threads=8 run

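# MySQL OLTP benchmark (create the tables with "prepare" before the first "run";
# add --mysql-user/--mysql-password/--mysql-db as your server requires)
sysbench oltp_read_write --mysql-host=localhost --tables=10 --table-size=1000000 prepare
sysbench oltp_read_write --mysql-host=localhost --tables=10 --table-size=1000000 \
  --threads=16 --time=300 run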

fio (storage I/O):

See 04-io-performance.md for fio configurations. Key patterns:

# Latency percentile test
fio --name=latency --rw=randread --bs=4k --iodepth=1 \
    --filename=/dev/nvme0n1 --direct=1 --runtime=60 \
    --lat_percentiles=1 --percentile_list=50:90:99:99.9

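# Mixed read/write ratio (70/30). WARNING: writing to a raw device destroys its data.
# iodepth > 1 only takes effect with an asynchronous I/O engine such as libaio.
fio --name=mixed --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio \
    --iodepth=32 --filename=/dev/nvme0n1 --direct=1 --runtime=60 --time_based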

iperf3 (network throughput):

# Server
iperf3 -s

# Client: TCP, 10 seconds, 4 parallel streams
iperf3 -c <server_ip> -t 10 -P 4

# UDP throughput
iperf3 -c <server_ip> -u -b 10G -t 10

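# Latency (RTT) with small packets: 16-byte payload, 1 ms interval.
# On 64-bit Linux, payloads under 16 bytes leave no room for ping's timestamp,
# so no RTT is printed; intervals this short require root.
ping -s 16 -i 0.001 <server_ip>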

wrk (closed-loop HTTP load test):

# 4 threads, 100 connections, 30 seconds
wrk -t 4 -c 100 -d 30s http://server/endpoint

wrk2 (open-loop with constant rate and HDR histogram):

# 4 threads, 100 connections, 30 seconds, 10K req/s
wrk2 -t 4 -c 100 -d 30s -R 10000 --latency http://server/endpoint
# Outputs full HDR histogram including extreme percentiles

phoronix-test-suite (comprehensive OS/application benchmarks):

phoronix-test-suite benchmark compress-7zip
phoronix-test-suite benchmark apache         # Apache web server benchmark
phoronix-test-suite benchmark pgbench        # PostgreSQL benchmark

SPEC CPU 2017 (industry-standard CPU benchmark): the gold standard for CPU performance comparison. Includes integer (SPECint) and floating-point (SPECfp) suites. Strict run rules prevent benchmarketing.

TPC benchmarks (database):

- TPC-C: OLTP, measures transactions per minute (tpmC). Models an order-entry system.
- TPC-H: decision support (OLAP), measures Query-per-Hour (QphH). 22 SQL queries on large tables.
- TPC-E: updated OLTP, more realistic than TPC-C.

# pgbench (PostgreSQL TPC-B approximation)
pgbench -i -s 100 benchdb                    # initialize at scale factor 100
pgbench -c 16 -j 4 -T 300 benchdb            # 16 clients, 4 worker threads, 300 seconds

Statistical Significance

Two benchmark results are only meaningfully different if the difference is statistically significant—i.e., it's unlikely to be due to random variation.

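Rule of thumb: if the 95% confidence intervals of the two results do not overlap, the difference is significant. The converse does not hold: overlapping intervals can still hide a significant difference, so run a proper test before concluding there is none.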
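A confidence interval on the difference itself avoids the overlap question: if the interval excludes zero, the data support a real difference. The sketch below uses a percentile bootstrap with only the standard library. The function name, the resample count, and the fixed seed are assumptions for illustration, and with five samples per group the interval is only a rough guide.

```python
import random
import statistics

def bootstrap_ci_mean_diff(a, b, resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)            # fixed seed so the example is repeatable
    diffs = []
    for _ in range(resamples):
        ra = rng.choices(a, k=len(a))    # resample each group with replacement
        rb = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(ra) - statistics.fmean(rb))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]

before = [5.1, 5.3, 5.0, 5.2, 5.4]   # latency samples (ms)
after = [4.8, 4.9, 5.0, 4.7, 4.8]
lo, hi = bootstrap_ci_mean_diff(before, after)
print(f"95% CI for mean(before) - mean(after): [{lo:.2f}, {hi:.2f}] ms")
```

An interval that lies entirely above zero indicates that the "after" runs are faster by more than run-to-run noise explains.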

Mann-Whitney U test (for non-normal distributions, typical in latency):

from scipy import stats
before = [5.1, 5.3, 5.0, 5.2, 5.4]  # latency samples (ms)
after  = [4.8, 4.9, 5.0, 4.7, 4.8]
stat, p = stats.mannwhitneyu(before, after, alternative='greater')
print(f"p-value: {p:.4f}")  # p < 0.05 = statistically significant improvement

benchstat (Go ecosystem; also works for any results written in the Go benchmark output format):

# Compare two benchmark runs
benchstat before.txt after.txt
# Output:
# name        old time/op  new time/op  delta
# Sort1K-8     45.1µs ±1%  42.3µs ±1%  -6.2%  (p=0.001 n=10+10)

Historical Context

SPEC (Standard Performance Evaluation Corporation) was founded in 1988 by a consortium of workstation vendors (HP, DEC, MIPS, Sun) tired of vendor-defined benchmarks that were meaningless for comparisons. SPEC CPU89 was the first standardized CPU benchmark; SPEC CPU 2017 is the current revision.

TPC (Transaction Processing Performance Council, 1988) standardized database benchmarks in response to similarly divergent vendor claims. TPC-A (simple debit/credit), TPC-B (batch version), TPC-C (OLTP, 1992) established the framework still used today.

Gil Tene's "How NOT to Measure Latency" talk (Strange Loop 2015) popularized the term coordinated omission and showed why averages and 95th percentiles are inadequate for latency measurement. HdrHistogram solved the precision problem for tail latency recording.

The benchmarketing era of the Wintel platform (1990s–2000s) led to notorious abuses: compilers that detected benchmark patterns and special-cased them. The practice outlived that era. SPEC's run rules prohibit benchmark-specific optimizations, and in 2024 SPEC marked a large batch of published SPEC CPU 2017 results as non-compliant after determining that an Intel compiler applied an optimization targeted at one of the suite's benchmarks.

Production Examples

Case: False p99 latency from closed-loop benchmark. An API team used wrk (closed-loop) to benchmark their service at 10,000 req/s and reported p99 of 45 ms. Product launched; real users saw p99 of 350 ms. Investigation: the real service had garbage collection pauses of 200–400 ms every 30 seconds. wrk never measured these because it backed off behind the slow requests. wrk2 with coordinated omission correction showed the true p99. The team added GC tuning and a circuit breaker before re-launch.

Case: SPEC CPU 2017 and the Intel compiler. An Intel C++ compiler was found to apply an optimization aimed specifically at one SPEC CPU 2017 benchmark, producing better code for it than equivalent user code would get. This inflated the affected scores relative to results built with GCC or Clang. SPEC's run rules prohibit benchmark-specific optimizations, and SPEC marked the affected results as non-compliant.

Debugging Notes

perf stat -r 5 ./benchmark: run 5 times and report mean ± stddev for hardware counters.
High variance between runs (> 5% stddev) usually indicates: CPU frequency scaling, thermal throttling, or memory bandwidth contention from other processes. Eliminate each.
isolcpus=2,3 kernel parameter: isolate CPUs from the scheduler for dedicated benchmark execution (bind with taskset -c 2,3 ./benchmark).
nohz_full=2,3 kernel parameter: stop the periodic scheduler tick on the listed CPUs while each has a single runnable task, sharply reducing OS jitter for latency-critical benchmarks.
Linux perf bench includes built-in microbenchmarks: perf bench sched all, perf bench mem all, perf bench numa all.

Security Implications

Benchmarks that involve external services (database, HTTP server) should use isolated environments. Running throughput benchmarks against production risks DoS. Always benchmark against staging with representative data sizes.

Benchmark data may reveal architectural details (query patterns, data distributions, throughput limits) that are security-sensitive. Treat benchmark results with the same care as architecture documentation.

Performance Implications

The benchmark itself must not be the bottleneck. If wrk's 4-thread client is saturated before the server is, the results measure wrk, not the server. Monitor client-side CPU utilization during benchmarks; if it's > 80%, add more client threads/machines or use a more efficient load generator.

For latency measurement, adding many concurrent connections (wrk -c 100) with a low-throughput server inflates latency due to queuing. Match the concurrency to the target open-loop rate to avoid artificial queue buildup.
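The matching rule follows from Little's Law: average requests in flight = arrival rate × average latency. A short worked example follows; the 10,000 req/s rate and 5 ms service latency are illustrative numbers, and the function names are assumptions.

```python
def in_flight(rate_per_s, latency_ms):
    """Little's Law: average concurrent requests = arrival rate x average latency."""
    return rate_per_s * latency_ms / 1000

def closed_loop_latency_ms(connections, throughput_per_s):
    """Little's Law rearranged: average latency seen when `connections` are kept busy."""
    return connections * 1000 / throughput_per_s

needed = in_flight(10_000, 5)                       # 50 requests in flight
observed = closed_loop_latency_ms(100, 10_000)      # 10 ms per request
print(f"connections needed at 10,000 req/s and 5 ms: {needed:.0f}")
print(f"latency with 100 busy connections at 10,000 req/s: {observed:.0f} ms")
```

If the server tops out at 10,000 req/s, keeping 100 connections busy doubles the measured latency from 5 ms to 10 ms. The extra 5 ms is time spent queued, not time spent being served.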

Failure Modes and Real Incidents

Benchmark warmup missing. A database team reported a 2x speedup in query latency after adding an index. The "before" benchmark was run on a cold system; the "after" was run immediately after (warm cache). Re-running with controlled warmup showed only a 15% improvement. The 2x was page cache, not index efficiency.

The Fizz Buzz "optimization" incident (2019). An engineer "optimized" a web service by making it return a cached constant for the most common request. The benchmark showed 50x throughput improvement. The benchmark used a single URL; the optimization only applied to that URL. The fix was rejected after a realistic benchmark showed no improvement for the actual request distribution.

Modern Usage

Continuous benchmarking in CI: Go's benchstat, Rust's criterion, and C++'s Google Benchmark all output machine-readable results. CI pipelines (GitHub Actions, Buildkite) compare each commit's benchmark against the baseline and fail the build if regression exceeds a threshold. Examples: ClickHouse runs perf benchmarks on every PR; TiDB has an automated benchmark regression system.
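A regression gate can be very small. The sketch below compares median time per operation between a baseline and the current commit, and returns a non-zero exit code if the slowdown exceeds a threshold. The function names, the 5% threshold, and the sample timings are assumptions for illustration; a real pipeline would load the numbers from its benchmark tool's output and would usually add a significance test.

```python
import statistics

def regression_pct(baseline_ns, current_ns):
    """Percent change in median time per operation (positive means slower)."""
    base = statistics.median(baseline_ns)
    cur = statistics.median(current_ns)
    return (cur - base) / base * 100.0

def gate(baseline_ns, current_ns, threshold_pct=5.0):
    """Return a process exit code: 0 if within the threshold, 1 if regressed."""
    change = regression_pct(baseline_ns, current_ns)
    print(f"median change: {change:+.1f}% (threshold {threshold_pct:.1f}%)")
    return 1 if change > threshold_pct else 0

# Hypothetical timings (ns/op) from five runs of each build.
baseline = [450, 452, 449, 451, 450]
current = [480, 478, 483, 479, 481]
exit_code = gate(baseline, current)
print("exit code:", exit_code)
```

In a CI job the result would be passed to sys.exit() so that a regression fails the build.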

Production load testing: k6 (Grafana), Gatling, and locust generate realistic HTTP/gRPC workloads from scripts, supporting open-loop testing with configurable ramp-up patterns. Essential for validating auto-scaling behavior before traffic surges.

Future Directions

LLM-assisted benchmark design: generating realistic synthetic workloads from production traces.
Benchmarking ML inference: specialized tools for model serving (MLPerf Inference, triton benchmark, orca) with new metrics (tokens/second, first-token latency) becoming standard.
eBPF-based micro-benchmarking: low-overhead in-kernel measurement that doesn't suffer from userspace timer overhead for sub-microsecond operations.

Exercises

Write a JMH benchmark (Java) or Google Benchmark test (C++) that measures string sorting throughput. Introduce the "dead code elimination" bug (don't use the result), observe the unrealistically fast result, then fix it with Blackhole.consume() (JMH) or benchmark::DoNotOptimize() (Google Benchmark). Explain the difference.
Use wrk (closed-loop) and wrk2 (open-loop) to benchmark the same HTTP server. wrk has no rate option, so first measure the throughput it sustains, then set wrk2's -R to a rate at or below that. Compare the p99 latency. Add an artificial 500 ms sleep to 1% of requests server-side. Explain how coordinated omission changes what each tool measures.
Run sysbench memory --memory-total-size=10G at 1, 2, 4, 8, 16 threads. Plot throughput vs. threads. Identify where additional threads stop helping and explain why using the memory bandwidth model.
Configure fio with queue depth 1 and queue depth 32 for random 4 KB reads on an NVMe device. Report IOPS and average latency for each. Use Little's Law (queue depth = IOPS × average latency, with latency in seconds) to verify the queue depth relationship.
Take any two benchmark results from runs on the same hardware with different code. Apply the Mann-Whitney U test or benchstat to determine if the difference is statistically significant at p < 0.05. If not, describe what changes would be needed to establish significance.

References

Tene, G. "How NOT to Measure Latency." Strange Loop 2015. https://www.youtube.com/watch?v=lJ8ydIuPFeU
Tene, G. HdrHistogram: https://github.com/HdrHistogram/HdrHistogram
Jain, R. The Art of Computer Systems Performance Analysis. Wiley, 1991. (Chapter 12: Experimental Design.)
SPEC CPU 2017: https://www.spec.org/cpu2017/
TPC benchmarks: https://www.tpc.org/
wrk2: https://github.com/giltene/wrk2
fio documentation: https://fio.readthedocs.io/
JMH: https://openjdk.java.net/projects/code-tools/jmh/
Google Benchmark: https://github.com/google/benchmark
benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
criterion (Rust): https://github.com/bheisler/criterion.rs