Performance Analysis Labs
Overview
These five labs build a systematic toolkit for diagnosing CPU, memory, lock, network, and I/O performance problems on Linux. Each lab is self-contained: it includes a subject program you write or configure, the tools you run against it, expected output ranges, analysis questions that guide interpretation, and extension challenges that push deeper. All labs assume a modern Linux system (kernel 5.15+) with standard development tools installed. A bare-metal machine or dedicated VM is strongly preferred — virtualization distorts perf measurements significantly for CPU and memory labs.
Lab 1 — CPU Profiling with perf and Flame Graphs
Objective
Profile a CPU-intensive program, generate a flame graph, identify the dominant hot function, and apply an optimization to reduce its CPU share.
Setup
Install dependencies:
sudo apt install linux-perf linux-tools-common linux-tools-generic \
build-essential git
# Allow perf by non-root users
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
Download Brendan Gregg's FlameGraph scripts:
git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
Create the subject program (hotloop.c):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Intentionally inefficient: recompute string length on every iteration */
size_t sum_lengths(const char **strings, int count, int iters) {
size_t total = 0;
for (int i = 0; i < iters; i++)
for (int j = 0; j < count; j++)
total += strlen(strings[j]); /* hot function */
return total;
}
int main(void) {
const char *strs[64];
for (int i = 0; i < 64; i++) {
strs[i] = malloc(128);
memset((void *)strs[i], 'A' + (i % 26), 127);
((char *)strs[i])[127] = '\0';
}
printf("total: %zu\n", sum_lengths(strs, 64, 5000000));
return 0;
}
Compile with debug info and frame pointers (both required for accurate profiling):
gcc -O2 -g -fno-omit-frame-pointer hotloop.c -o hotloop
Profiling Steps
Run perf record with call-graph capture:
perf record -g --call-graph fp -F 999 ./hotloop
Flags: -g enables call graph; --call-graph fp uses frame pointer unwinding; -F 999 samples at ~999 Hz (slightly under 1000 to avoid aliasing with periodic timer activity).
Inspect the text report:
perf report --stdio | head -40
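Expected output: strlen and sum_lengths should together account for 70–90% of samples, with the strlen PLT stub visible in the call chain. On glibc the samples are usually attributed to an architecture-specific implementation such as __strlen_avx2 rather than to a symbol named strlen.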
Generate the flame graph:
perf script | ~/FlameGraph/stackcollapse-perf.pl | \
~/FlameGraph/flamegraph.pl > hotloop.svg
Open hotloop.svg in a browser. The widest tower is the dominant code path.
Optimization
Replace strlen with a precomputed length table:
size_t sum_lengths_opt(const char **strings, size_t *lens, int count, int iters) {
size_t total = 0;
for (int i = 0; i < iters; i++)
for (int j = 0; j < count; j++)
total += lens[j];
return total;
}
Re-profile and regenerate the flame graph. The strlen tower should collapse to noise.
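The optimized function expects a lens array, which has to be filled before the timed loop. A minimal sketch of that step, assuming a hypothetical helper named build_length_table (not part of the original program) that calls strlen once per string:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: measure every string once so the hot loop
 * can read a cached length instead of calling strlen again. */
size_t *build_length_table(const char **strings, int count) {
    assert(count > 0);
    size_t *lens = malloc((size_t)count * sizeof *lens);
    if (lens == NULL)
        return NULL;
    for (int j = 0; j < count; j++)
        lens[j] = strlen(strings[j]);
    return lens; /* caller frees */
}
```

In main, this would be called as build_length_table(strs, 64) and the result passed to sum_lengths_opt(strs, lens, 64, 5000000). The table is only valid while the strings stay unmodified.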
What to Measure
- Percentage of CPU time in strlen before and after the fix.
- Total wall-clock time (time ./hotloop) before and after.
- Expected speedup: 3–6x depending on processor cache behavior.
Analysis Questions
- Why does compiling with -O3 (without -fno-omit-frame-pointer) produce misleading flame graphs?
- What does the [unknown] frame in the flame graph represent, and how do you eliminate it?
- Why sample at 999 Hz instead of 1000 Hz?
- What is the difference between perf stat and perf record? When would you use each?
Extension Challenges
- Profile a multithreaded version (-lpthread, 4 worker threads) and observe whether strlen still dominates or whether locking overhead appears.
- Use perf annotate to view per-instruction sample counts for the hot loop.
- Try async-profiler on a Java equivalent to compare profiling methodologies.
Lab 2 — Memory Analysis with Valgrind Massif
Objective
Profile heap allocation patterns of a program, identify the allocation site responsible for peak memory usage, and measure the runtime impact of Transparent Huge Pages (THP).
Setup
sudo apt install valgrind
Create the subject program (memhog.c):
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
/* Simulates a cache that accumulates entries without eviction */
typedef struct Entry { char data[256]; struct Entry *next; } Entry;
Entry *head = NULL;
size_t total_entries = 0;
void cache_insert(void) {
Entry *e = malloc(sizeof(Entry));
memset(e->data, 0xAB, 256);
e->next = head;
head = e;
total_entries++;
}
int main(void) {
/* Phase 1: fill cache (leaky) */
for (int i = 0; i < 100000; i++) cache_insert();
printf("Phase 1: %zu entries\n", total_entries);
/* Phase 2: simulate processing — allocate and free temporaries */
for (int i = 0; i < 50000; i++) {
void *tmp = malloc(4096);
memset(tmp, 0, 4096);
free(tmp);
}
printf("Phase 2 done\n");
return 0;
}
gcc -O1 -g memhog.c -o memhog
Profiling Steps
Run Massif:
valgrind --tool=massif --pages-as-heap=yes \
--massif-out-file=massif.out ./memhog
--pages-as-heap=yes captures total virtual memory including the allocator's overhead, giving a more realistic picture.
Generate the text tree:
ms_print massif.out | head -80
The output shows a timeline of snapshots (normal and "detailed"). Detailed snapshots include an allocation tree identifying the call chain responsible for each allocation.
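Expected output: Peak heap will be approximately 27–30 MB. Each Entry is 264 bytes (256 bytes of data plus an 8-byte pointer on a 64-bit system), which glibc rounds up to a 272-byte chunk, so 100,000 entries occupy about 27.2 MB before any other allocator overhead. Because --pages-as-heap=yes counts every mapped page, including program code, shared libraries, and the stack, the total Massif reports will be somewhat higher than the heap alone. The dominant allocation site should point to cache_insert → malloc.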
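The chunk-size arithmetic behind the 272-byte figure can be checked with a small sketch. It assumes 64-bit glibc ptmalloc (8-byte size header, 16-byte alignment, 32-byte minimum chunk); the function names glibc_chunk_size and heap_bytes are hypothetical, and other allocators round differently.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 64-bit glibc layout: 8-byte size header, 16-byte alignment,
 * 32-byte minimum chunk. Other allocators use different rules. */
size_t glibc_chunk_size(size_t request) {
    size_t padded = (request + 8 + 15) & ~(size_t)15;
    return padded < 32 ? 32 : padded;
}

/* Heap bytes consumed by n allocations of the same request size. */
size_t heap_bytes(size_t request, size_t n) {
    assert(n > 0);
    return glibc_chunk_size(request) * n;
}
```

With sizeof(Entry) equal to 264, glibc_chunk_size(264) evaluates to 272 and heap_bytes(264, 100000) to 27,200,000 bytes, which is the low end of the expected range.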
Visualize (optional, if massif-visualizer is available):
massif-visualizer massif.out
THP Comparison Benchmark
Transparent Huge Pages collapse 4 KB pages into 2 MB pages, reducing TLB pressure but potentially increasing memory waste for small allocations.
Create a benchmark (thp_bench.c) that allocates and traverses a large array with a stride that stresses TLB coverage:
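#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#define N (1L << 27) /* 128 MB */
#define STRIDE 4096  /* one access per 4 KB page */
int main(void) {
    char *buf = malloc(N);
    if (buf == NULL) { perror("malloc"); return 1; }
    /* Touch every page first so the timed loop measures TLB behavior,
       not first-touch page faults on untouched memory */
    memset(buf, 1, N);
    volatile long sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i += STRIDE) sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
              (t1.tv_nsec - t0.tv_nsec);
    /* A single pass is short, so report microseconds rather than milliseconds */
    printf("sum=%ld time=%ld us\n", sum, ns / 1000);
    free(buf);
    return 0;
}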
Run with THP enabled vs disabled:
# Enable THP
sudo sh -c 'echo always > /sys/kernel/mm/transparent_hugepage/enabled'
gcc -O2 thp_bench.c -o thp_bench && ./thp_bench
# Disable THP
sudo sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
./thp_bench
Expected results: With a large strided workload, THP typically reduces traversal time by 15–40% due to TLB reach improvement. Results vary by CPU (TLB size and L1/L2 TLB structure matter).
Verify THP promotion actually occurred:
grep -E 'thp_fault_alloc|thp_collapse_alloc' /proc/vmstat
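To compare these counters before and after a run from inside a test harness, the same fields can be parsed in C. This is a sketch only: vmstat_counter is a hypothetical helper that scans text in the one "name value" pair per line layout of /proc/vmstat and returns -1 when the key is absent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find "key value" in /proc/vmstat-style text.
 * Returns the value, or -1 if the key does not appear. */
long long vmstat_counter(const char *text, const char *key) {
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        /* Match the whole field name, followed by a space */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
            long long v;
            if (sscanf(line + klen, "%lld", &v) == 1)
                return v;
        }
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
    return -1;
}
```

A caller would read /proc/vmstat into a buffer and look up thp_fault_alloc before and after ./thp_bench; a positive difference suggests that huge pages were allocated at fault time.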
Analysis Questions
- Why does --pages-as-heap=yes produce a higher peak number than the default Massif mode?
- What is the difference between a memory leak and a memory hog? Which is cache_insert exhibiting?
- For what allocation sizes does THP help vs hurt? Why does THP cause latency spikes in some real-time workloads?
- What tool would you use to find actual memory leaks and memory errors such as use-after-free or double free? (valgrind --tool=memcheck)
Extension Challenges
- Add a free_cache() function that walks and frees the list. Verify with Massif that peak memory is unchanged (peak is measured at the point of maximum live allocations, not at program exit).
- Use jemalloc as a drop-in allocator (LD_PRELOAD) and compare its heap behavior to glibc's ptmalloc.
Lab 3 — Lock Contention Analysis
Objective
Write a program with deliberate lock contention, measure the contention with perf lock or BCC's lockstat, and eliminate the bottleneck with finer-grained locking.
Setup
# BCC tools (preferred for detail)
sudo apt install bpfcc-tools linux-headers-$(uname -r)
# Or use perf lock (included with linux-perf)
Create the contended program (contended.c):
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define THREADS 8
#define OPS 1000000
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
uint64_t counter = 0;
void *worker(void *arg) {
for (int i = 0; i < OPS; i++) {
pthread_mutex_lock(&global_lock);
counter++;
pthread_mutex_unlock(&global_lock);
}
return NULL;
}
int main(void) {
pthread_t threads[THREADS];
for (int i = 0; i < THREADS; i++)
pthread_create(&threads[i], NULL, worker, NULL);
for (int i = 0; i < THREADS; i++)
pthread_join(threads[i], NULL);
printf("counter = %llu\n", (unsigned long long)counter); /* portable format for uint64_t */
}
gcc -O2 -g -pthread contended.c -o contended
Measuring Lock Contention
Method 1: perf lock (no BCC required)
sudo perf lock record ./contended
sudo perf lock report
Key fields in the report: acquired (successful lock grabs), contended (times a thread had to wait), avg wait (average wait time per contention event), total wait.
Expected output: With 8 threads and a single global lock, contention rate will be 85–95%. Average wait time will be in the range of several microseconds.
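Method 2: BCC lockstat (klockstat)
sudo klockstat-bpfcc -p $(pgrep contended) -d 5
In the BCC collection this tool is named klockstat; the Debian and Ubuntu bpfcc-tools package installs it as klockstat-bpfcc, while source builds place it at /usr/share/bcc/tools/klockstat. It traces kernel mutexes for 5 seconds and prints per-caller tables of wait and hold times, so it reports the kernel-side locks taken on behalf of the process rather than the pthread mutex itself. Start it while contended is still running, since pgrep needs a live process.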
The Fix — Striped Locking
Replace the single global lock with an array of locks, distributing counter across N shards:
#define SHARDS 16
pthread_mutex_t shard_lock[SHARDS];
uint64_t shard_counter[SHARDS];
void *worker_sharded(void *arg) {
unsigned int tid = (unsigned int)(uintptr_t)arg;
for (int i = 0; i < OPS; i++) {
int shard = (tid * 2654435761u + i) % SHARDS; /* multiplicative hash */
pthread_mutex_lock(&shard_lock[shard]);
shard_counter[shard]++;
pthread_mutex_unlock(&shard_lock[shard]);
}
return NULL;
}
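Before starting the threads, initialize every shard_lock with pthread_mutex_init and pass each thread's index as its arg pointer. After the benchmark, sum all shards: uint64_t total = 0; for (int i = 0; i < SHARDS; i++) total += shard_counter[i];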
Re-run with BCC lockstat. Contention should drop to near zero. Measure wall-clock time improvement (time ./contended vs time ./contended_sharded).
Expected improvement: 4–7x throughput increase with 8 threads; the global-lock version barely exceeds single-threaded performance.
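The sharded fragment leaves out lock initialization and thread startup. A self-contained sketch of the whole flow follows, under the assumption that each thread receives its index through the arg pointer; run_sharded, ops_per_thread, and MAX_THREADS are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define SHARDS 16
#define MAX_THREADS 64

static pthread_mutex_t shard_lock[SHARDS];
static uint64_t shard_counter[SHARDS];
static int ops_per_thread;

static void *worker_sharded(void *arg) {
    unsigned int tid = (unsigned int)(uintptr_t)arg;
    for (int i = 0; i < ops_per_thread; i++) {
        /* Multiplicative hash spreads each thread's updates over the shards */
        unsigned int shard = (tid * 2654435761u + (unsigned int)i) % SHARDS;
        pthread_mutex_lock(&shard_lock[shard]);
        shard_counter[shard]++;
        pthread_mutex_unlock(&shard_lock[shard]);
    }
    return NULL;
}

/* Hypothetical driver: runs nthreads workers and returns the summed count. */
uint64_t run_sharded(int nthreads, int ops) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS && ops >= 0);
    pthread_t threads[MAX_THREADS];
    int started = 0;
    ops_per_thread = ops;
    for (int s = 0; s < SHARDS; s++) {
        pthread_mutex_init(&shard_lock[s], NULL);
        shard_counter[s] = 0;
    }
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&threads[t], NULL, worker_sharded,
                           (void *)(uintptr_t)t) != 0)
            break; /* join only the threads that actually started */
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);
    uint64_t total = 0;
    for (int s = 0; s < SHARDS; s++) {
        total += shard_counter[s];
        pthread_mutex_destroy(&shard_lock[s]);
    }
    return total;
}
```

A main that calls run_sharded(8, 1000000) and prints the result, built with gcc -O2 -g -pthread, would serve as the contended_sharded binary in the timing comparison. The shards are deliberately left unpadded here, so adjacent counters can still share a cache line.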
What to Measure
- perf lock report: contended count and total wait time before/after.
- Wall-clock time (time command): before/after.
- CPU utilization (top or perf stat): look for reduced %sys time in the fixed version.
Analysis Questions
- Why does the contended version not scale beyond ~1 thread even though the critical section is tiny?
- What is the difference between a spinlock and a mutex in terms of CPU utilization under contention?
- When does fine-grained locking hurt rather than help (hint: false sharing, lock acquisition overhead)?
- How would you use perf stat -e cache-misses to detect false sharing in the sharded version?
Extension Challenges
- Replace the mutex with _Atomic uint64_t and atomic_fetch_add. Benchmark. Explain why atomics outperform mutexes for this specific pattern.
- Measure the effect of padding shard structures to 64 bytes (one cache line) to eliminate false sharing.
Lab 4 — Network Performance Baseline with iperf3
Objective
Establish a network throughput baseline, tune socket buffer sizes, enable TCP BBR congestion control, and quantify the improvement.
Setup
sudo apt install iperf3
# Two hosts required (or two VMs on the same hypervisor)
# HOST_A = server, HOST_B = client
Baseline Measurement
On the server:
iperf3 -s
On the client (replace SERVER_IP):
iperf3 -c SERVER_IP -t 30 -P 4 -i 5
Flags: -t 30 runs for 30 seconds; -P 4 opens 4 parallel streams (needed to saturate a multi-queue NIC); -i 5 reports every 5 seconds.
Record: sender throughput (Gbps), receiver throughput, retransmits, and the final sender / receiver summary line.
Check current socket buffer limits:
sysctl net.core.rmem_max net.core.wmem_max \
net.ipv4.tcp_rmem net.ipv4.tcp_wmem
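The default rmem_max is often 212992 bytes (208 KB). It caps buffers that an application requests explicitly with SO_RCVBUF (for example with iperf3 -w), while autotuned TCP sockets are bounded by the third value of tcp_rmem. Whichever limit applies becomes the bottleneck once the link's bandwidth-delay product (BDP) exceeds it. BDP = bandwidth × RTT; on a 10 Gbps link with 1 ms RTT, BDP = 10^10 bits/s × 0.001 s = 10^7 bits = 1.25 MB.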
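The BDP arithmetic is easy to get wrong by a factor of 8 or 1000, so a small sketch may help when sizing buffers for other links. The function name bdp_bytes is hypothetical; it takes the RTT in microseconds to stay in integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* BDP in bytes = bandwidth (bits/s) * RTT (s) / 8.
 * RTT is given in microseconds, hence the extra division by 1,000,000. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_us) {
    /* Guard against overflow of the intermediate product */
    assert(rtt_us == 0 || bits_per_sec <= UINT64_MAX / rtt_us);
    return bits_per_sec * rtt_us / 8 / 1000000;
}
```

For the example in the text, bdp_bytes(10000000000ULL, 1000) evaluates to 1,250,000 bytes, matching the 1.25 MB figure.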
Socket Buffer Tuning
Increase socket buffers:
sudo sysctl -w net.core.rmem_max=134217728 # 128 MB
sudo sysctl -w net.core.wmem_max=134217728
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
sudo sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
Re-run the iperf3 benchmark with identical parameters. Compare throughput and retransmit count.
TCP BBR Congestion Control
The default TCP congestion control in most Linux kernels is CUBIC, which uses packet loss as a signal. BBR (Bottleneck Bandwidth and RTT) uses bandwidth and RTT measurements instead, achieving higher throughput in lossy or high-latency environments.
Check current algorithm:
sysctl net.ipv4.tcp_congestion_control
Enable BBR:
sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl net.ipv4.tcp_available_congestion_control # verify bbr is listed
Re-run iperf3. For a local LAN with low loss, the improvement will be modest. To see BBR's full advantage, simulate packet loss:
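# On the client (the data sender in a default iperf3 run), add 1% random packet loss with tc.
# netem shapes egress traffic only, and congestion control is chosen by the sender,
# so set CUBIC or BBR on this host as well. Replace eth0 with your interface name.
sudo tc qdisc add dev eth0 root netem loss 1%
# Run iperf3 on the client and compare CUBIC vs BBR throughput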
sudo tc qdisc del dev eth0 root # remove after test
Expected result: Under 1% simulated loss, BBR typically maintains 2–4x higher throughput than CUBIC.
What to Measure
- Throughput (Gbps) under: baseline, buffer-tuned, BBR.
- Retransmit count under each configuration.
- Throughput under 1% loss: CUBIC vs BBR.
Analysis Questions
- What is the Bandwidth-Delay Product and why does it determine the required socket buffer size?
- Why does running a single iperf3 stream often under-report link capacity on a multi-queue NIC?
- What is TCP slow start, and how does it interact with connection duration in a microservices environment?
- How would you use ss -ti to inspect the congestion algorithm and current cwnd of a live TCP connection?
Extension Challenges
- Use netperf in addition to iperf3 to measure request/response latency (RR mode) rather than bulk throughput.
- Enable Generic Receive Offload (GRO) and Generic Segmentation Offload (GSO) with ethtool and measure the CPU overhead reduction.
Lab 5 — io_uring vs epoll HTTP Server Benchmark
Objective
Implement a minimal HTTP server twice — once using epoll and once using io_uring — benchmark both with wrk2, and compare p99 latency and throughput.
Setup
sudo apt install liburing-dev wrk
# wrk2 (constant-rate load generator — better for latency measurement than wrk)
git clone https://github.com/giltene/wrk2 ~/wrk2
cd ~/wrk2 && make
epoll HTTP Server (Skeleton)
The server listens on port 8080, accepts connections with epoll, and replies with a static 200 OK for every request:
// Key epoll loop structure (fill in with your socket boilerplate).
// listen_fd must be non-blocking: edge-triggered mode reports each readiness
// change once, so the accept queue has to be drained until accept4 fails.
int epfd = epoll_create1(0);
struct epoll_event ev, events[MAX_EVENTS];
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        if (fd == listen_fd) {
            // Accept every pending connection, not just the first
            int conn;
            while ((conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK)) >= 0) {
                ev.events = EPOLLIN | EPOLLET;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            }
        } else {
            // Read the request, write the response, then close (no keep-alive)
            char buf[1024];
            ssize_t r = read(fd, buf, sizeof(buf));
            if (r > 0)
                write(fd, HTTP_OK, strlen(HTTP_OK));
            close(fd);
        }
    }
}
Compile: gcc -O2 -g epoll_server.c -o epoll_server
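The skeleton assumes that listen_fd already exists. One possible version of that socket boilerplate is sketched below; make_listener and its parameters are hypothetical names, and the socket is created non-blocking because the edge-triggered epoll loop relies on that.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: non-blocking TCP listener on all interfaces.
 * Returns the listening fd, or -1 on error.
 * Port 0 asks the kernel for any free port. */
int make_listener(uint16_t port, int backlog) {
    assert(backlog > 0);
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0)
        return -1;
    int one = 1;
    /* Allow quick restarts while old connections sit in TIME_WAIT */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

For the epoll server this would be called as make_listener(8080, 512). A non-blocking listener is a choice made for the epoll loop; an io_uring accept on such a socket can complete immediately with -EAGAIN, so that server may prefer a blocking listener.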
io_uring HTTP Server (Skeleton)
io_uring submits I/O operations as ring buffer entries, batching system calls and enabling kernel-side polling:
#include <liburing.h>
// ACCEPT_TYPE, RECV_TYPE and SEND_TYPE are distinct small integer tags:
// the low 32 bits of user_data carry the tag, the high 32 bits the fd.
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
// Submit the first accept
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
sqe->user_data = ACCEPT_TYPE;
io_uring_submit(&ring);
// Event loop
struct io_uring_cqe *cqe;
while (1) {
    if (io_uring_wait_cqe(&ring, &cqe) < 0) continue;
    uint32_t type = (uint32_t)(cqe->user_data & 0xFFFFFFFF);
    int conn_fd = (int)(cqe->user_data >> 32);
    if (type == ACCEPT_TYPE) {
        if (cqe->res >= 0) { // res is the new fd, or -errno on failure
            conn_fd = cqe->res;
            // Submit recv (buf is shared here; a real server needs one buffer per connection)
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_recv(sqe, conn_fd, buf, sizeof(buf), 0);
            sqe->user_data = RECV_TYPE | ((uint64_t)conn_fd << 32);
        }
        // Re-arm accept
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
        sqe->user_data = ACCEPT_TYPE;
        io_uring_submit(&ring);
    } else if (type == RECV_TYPE && cqe->res > 0) {
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_send(sqe, conn_fd, HTTP_OK, strlen(HTTP_OK), 0);
        sqe->user_data = SEND_TYPE | ((uint64_t)conn_fd << 32);
        io_uring_submit(&ring);
    } else {
        // Send completed, or recv returned EOF or an error: close the connection
        close(conn_fd);
    }
    io_uring_cqe_seen(&ring, cqe);
}
Compile: gcc -O2 -g uring_server.c -o uring_server -luring
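The skeleton packs an operation tag and a file descriptor into the 64-bit user_data field. That encoding is easier to get right when isolated in small helpers, as in the sketch below; the tag values and the function names are assumptions, since the lab does not define them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag values; any distinct 32-bit constants would work */
enum { ACCEPT_TYPE = 1, RECV_TYPE = 2, SEND_TYPE = 3 };

/* Pack an operation tag (low 32 bits) and a file descriptor (high 32 bits). */
uint64_t pack_user_data(uint32_t type, int fd) {
    assert(fd >= 0);
    return (uint64_t)type | ((uint64_t)(uint32_t)fd << 32);
}

/* Recover the operation tag from a completion's user_data. */
uint32_t user_data_type(uint64_t ud) {
    return (uint32_t)(ud & 0xFFFFFFFFu);
}

/* Recover the file descriptor from a completion's user_data. */
int user_data_fd(uint64_t ud) {
    return (int)(ud >> 32);
}
```

Using these in the event loop keeps the shift and mask in one place, which matters once more operation types or per-connection state indices are added.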
Benchmarking with wrk2
wrk2 generates load at a constant request rate (unlike wrk, which generates as much load as possible). This allows fair p99 latency comparison.
# Warm up
~/wrk2/wrk -t4 -c100 -d10s -R 10000 http://localhost:8080/
# Benchmark at 80% of maximum throughput (find max first with wrk)
~/wrk2/wrk -t4 -c100 -d60s -R 50000 --latency http://localhost:8080/
Record from the output: throughput (requests/sec), p50 latency, p99 latency, p99.9 latency, and max latency. Run each server three times and take the median of p99.
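When post-processing latency samples or summarizing repeated runs by hand, a nearest-rank percentile is one simple definition to use. The sketch below is illustrative only: wrk2 computes its own percentiles internally, and the names percentile and median3 are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile; per_mille = 990 means p99, 999 means p99.9.
 * Sorts the samples array in place. */
double percentile(double *samples, size_t n, unsigned per_mille) {
    assert(n > 0 && per_mille >= 1 && per_mille <= 1000);
    qsort(samples, n, sizeof *samples, cmp_double);
    size_t rank = (per_mille * n + 999) / 1000; /* ceil(per_mille * n / 1000) */
    return samples[rank - 1];
}

/* Median of three runs, as used when summarizing repeated benchmarks. */
double median3(double a, double b, double c) {
    double v[3] = {a, b, c};
    return percentile(v, 3, 500);
}
```

Passing the three p99 values from repeated runs to median3 gives the figure to enter in the results table.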
What to Measure
| Metric | epoll | io_uring |
|---|---|---|
| Max throughput (req/s) | — | — |
| p50 latency (us) | — | — |
| p99 latency (us) | — | — |
| p99.9 latency (us) | — | — |
| Syscalls per req (strace -c) | — | — |
Expected results: For a simple static-response server, io_uring reduces syscall count by batching accept/recv/send into ring submissions. p99 latency advantage is typically 20–50% at high load. At low load, epoll and io_uring perform similarly.
Measure syscall overhead directly:
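sudo perf stat -e syscalls:sys_enter_epoll_wait,\
syscalls:sys_enter_read,syscalls:sys_enter_write \
./epoll_server &
~/wrk2/wrk -t4 -c100 -d10s -R 20000 http://localhost:8080/
# perf stat prints its counts only when the server exits, so stop it after the run
sudo pkill -INT epoll_server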
Analysis Questions
- Why is p99 (not average) latency the critical metric for user-facing services?
- What is the kernel-poll mode of io_uring (IORING_SETUP_SQPOLL) and when is it appropriate?
- Why does the constant-rate load generator wrk2 give more informative latency results than the closed-loop wrk?
- What would you need to change in the io_uring server to handle 10,000 concurrent connections efficiently (hint: fixed file table, registered buffers)?
Extension Challenges
- Add HTTP keep-alive to both servers and re-benchmark. The io_uring advantage typically grows with connection reuse.
- Implement a variant that uses registered files (io_uring_register_files together with the IOSQE_FIXED_FILE flag) and measure the overhead reduction from eliminating per-request file descriptor table lookups.
- Use bpftrace to trace the path of a single request through the kernel in each implementation.