Performance Analysis Labs

Overview

These five labs build a systematic toolkit for diagnosing CPU, memory, lock, network, and I/O performance problems on Linux. Each lab is self-contained: it includes a subject program you write or configure, the tools you run against it, expected output ranges, analysis questions that guide interpretation, and extension challenges that push deeper. All labs assume a modern Linux system (kernel 5.15+) with standard development tools installed. A bare-metal machine or dedicated VM is strongly preferred — virtualization distorts perf measurements significantly for CPU and memory labs.

Lab 1 — CPU Profiling with perf and Flame Graphs

Objective

Profile a CPU-intensive program, generate a flame graph, identify the dominant hot function, and apply an optimization to reduce its CPU share.

Setup

Install dependencies:

# Ubuntu package names shown; on Debian, perf ships in the linux-perf package instead
sudo apt install linux-tools-common linux-tools-generic \
     build-essential git
# Allow perf by non-root users
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

Download Brendan Gregg's FlameGraph scripts:

git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph

Create the subject program (hotloop.c):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Intentionally inefficient: recompute string length on every iteration */
size_t sum_lengths(const char **strings, int count, int iters) {
    size_t total = 0;
    for (int i = 0; i < iters; i++)
        for (int j = 0; j < count; j++)
            total += strlen(strings[j]);   /* hot function */
    return total;
}

int main(void) {
    const char *strs[64];
    for (int i = 0; i < 64; i++) {
        strs[i] = malloc(128);
        memset((void *)strs[i], 'A' + (i % 26), 127);
        ((char *)strs[i])[127] = '\0';
    }
    printf("total: %zu\n", sum_lengths(strs, 64, 5000000));
    return 0;
}

Compile with debug info and frame pointers (both required for accurate profiling):

gcc -O2 -g -fno-omit-frame-pointer hotloop.c -o hotloop

Profiling Steps

Run perf record with call-graph capture:

perf record -g --call-graph fp -F 999 ./hotloop

Flags: -g enables call-graph recording (already implied by --call-graph, so it is optional here); --call-graph fp selects frame-pointer unwinding; -F 999 samples at ~999 Hz (slightly under 1000 so that samples do not fall in lockstep with periodic timer activity).

Inspect the text report:

perf report --stdio | head -40

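Expected output: strlen and sum_lengths should together account for 70–90% of samples. On x86-64 glibc, strlen usually appears under the name of the CPU-specific variant selected at load time (for example __strlen_avx2), and a strlen@plt stub may also be visible in the call chain.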

Generate the flame graph:

perf script | ~/FlameGraph/stackcollapse-perf.pl | \
    ~/FlameGraph/flamegraph.pl > hotloop.svg

Open hotloop.svg in a browser. The widest tower is the dominant code path.

Optimization

Replace strlen with a precomputed length table:

size_t sum_lengths_opt(const char **strings, size_t *lens, int count, int iters) {
    size_t total = 0;
    for (int i = 0; i < iters; i++)
        for (int j = 0; j < count; j++)
            total += lens[j];
    return total;
}

Re-profile and regenerate the flame graph. The strlen tower should collapse to noise.
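sum_lengths_opt needs a lens array that the listing does not build. A minimal sketch of that setup step follows, assuming the strings are not modified after the table is built; build_length_table is a hypothetical helper name, not part of the original lab code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not in the original lab code): call strlen once per
 * string and cache the results so the hot loop never has to.
 * Returns a malloc'd array of `count` lengths, or NULL on failure. */
size_t *build_length_table(const char **strings, int count) {
    if (count <= 0)
        return NULL;
    size_t *lens = malloc((size_t)count * sizeof *lens);
    if (lens == NULL)
        return NULL;
    for (int j = 0; j < count; j++)
        lens[j] = strlen(strings[j]);
    return lens;
}

/* Same loop as sum_lengths_opt in the lab text, repeated here so the
 * sketch compiles on its own. */
size_t sum_lengths_opt(const char **strings, size_t *lens, int count, int iters) {
    (void)strings;  /* no longer read inside the loop */
    size_t total = 0;
    for (int i = 0; i < iters; i++)
        for (int j = 0; j < count; j++)
            total += lens[j];
    return total;
}
```

In main, build the table once with build_length_table(strs, 64), pass it to sum_lengths_opt(strs, lens, 64, 5000000), and free it afterwards. The printed total should match the unoptimized version.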

What to Measure

Percentage of CPU time in strlen before and after the fix.
Total wall-clock time (time ./hotloop) before and after.
Expected speedup: 3–6x depending on processor cache behavior.

Analysis Questions

Why does compiling with -O3 (without -fno-omit-frame-pointer) produce misleading flame graphs?
What does the [unknown] frame in the flame graph represent, and how do you eliminate it?
Why sample at 999 Hz instead of 1000 Hz?
What is the difference between perf stat and perf record? When would you use each?

Extension Challenges

Profile a multithreaded version (compile with -pthread, 4 worker threads) and observe whether strlen still dominates or whether locking overhead appears.
Use perf annotate to view per-instruction sample counts for the hot loop.
Try async-profiler on a Java equivalent to compare profiling methodologies.

Lab 2 — Memory Analysis with Valgrind Massif

Objective

Profile heap allocation patterns of a program, identify the allocation site responsible for peak memory usage, and measure the runtime impact of Transparent Huge Pages (THP).

Setup

sudo apt install valgrind

Create the subject program (memhog.c):

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Simulates a cache that accumulates entries without eviction */
typedef struct Entry { char data[256]; struct Entry *next; } Entry;

Entry *head = NULL;
size_t total_entries = 0;

void cache_insert(void) {
    Entry *e = malloc(sizeof(Entry));
    memset(e->data, 0xAB, 256);
    e->next = head;
    head = e;
    total_entries++;
}

int main(void) {
    /* Phase 1: fill cache (leaky) */
    for (int i = 0; i < 100000; i++) cache_insert();
    printf("Phase 1: %zu entries\n", total_entries);

    /* Phase 2: simulate processing — allocate and free temporaries */
    for (int i = 0; i < 50000; i++) {
        void *tmp = malloc(4096);
        memset(tmp, 0, 4096);
        free(tmp);
    }
    printf("Phase 2 done\n");
    return 0;
}

gcc -O1 -g memhog.c -o memhog

Profiling Steps

Run Massif:

valgrind --tool=massif --pages-as-heap=yes \
         --massif-out-file=massif.out ./memhog

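--pages-as-heap=yes switches Massif from malloc-level to page-level profiling: it counts every page the process maps (via brk, mmap, and similar calls), so the total includes allocator overhead and memory that never passed through malloc, giving a more realistic picture of the process footprint.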

Generate the text tree:

ms_print massif.out | head -80

The output shows a timeline of snapshots (normal and "detailed"). Detailed snapshots include an allocation tree identifying the call chain responsible for each allocation.

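Expected output: Peak heap will be approximately 27–30 MB. Each Entry is 264 bytes (256 bytes of data plus an 8-byte pointer), which becomes a 272-byte chunk once allocator overhead is included, and 100,000 × 272 bytes ≈ 27.2 MB; with --pages-as-heap=yes the program's other mapped pages add to that figure. The dominant allocation site should trace back to cache_insert → malloc.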
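The expected-peak arithmetic can be sketched as a small helper. This is an estimate under stated assumptions, not a description of what Massif itself computes: it assumes glibc malloc on 64-bit Linux (8-byte chunk header, 16-byte alignment, 32-byte minimum chunk), and the function names are hypothetical.

```c
#include <stddef.h>

/* Assumptions (glibc malloc on 64-bit Linux, not stated in the lab text):
 * each chunk carries an 8-byte size header, chunks are aligned to 16 bytes,
 * and the minimum chunk size is 32 bytes. */
enum { CHUNK_HEADER = 8, CHUNK_ALIGN = 16, CHUNK_MIN = 32 };

/* Estimated heap bytes consumed by one malloc(request) call. */
size_t estimated_chunk_size(size_t request) {
    size_t size = request + CHUNK_HEADER;
    size = (size + CHUNK_ALIGN - 1) / CHUNK_ALIGN * CHUNK_ALIGN;  /* round up */
    return size < CHUNK_MIN ? CHUNK_MIN : size;
}

/* Estimated heap bytes for `count` live allocations of the same size. */
size_t estimated_heap_bytes(size_t request, size_t count) {
    return estimated_chunk_size(request) * count;
}
```

For the Entry struct, estimated_chunk_size(264) gives 272, and estimated_heap_bytes(264, 100000) gives 27,200,000 bytes, the low end of the expected range.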

Visualize (optional, if massif-visualizer is available):

massif-visualizer massif.out

THP Comparison Benchmark

Transparent Huge Pages collapse 4 KB pages into 2 MB pages, reducing TLB pressure but potentially increasing memory waste for small allocations.
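To make "TLB pressure" concrete, the sketch below computes TLB reach (how much memory the TLB can map at once) and how many pages a buffer spans. The entry count used in the usage note is an illustrative assumption; real TLB sizes vary by CPU.

```c
#include <stdint.h>

/* TLB reach: bytes of address space covered without a TLB miss,
 * i.e. the number of TLB entries times the page size each entry maps. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size) {
    return entries * page_size;
}

/* Number of pages (and therefore distinct TLB entries) needed to map a
 * buffer of `bytes` bytes, rounding up to whole pages. */
uint64_t pages_spanned(uint64_t bytes, uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}
```

Assuming a hypothetical 1536-entry TLB, tlb_reach_bytes(1536, 4096) is 6 MB while tlb_reach_bytes(1536, 2097152) is 3 GB. A 128 MB buffer spans 32,768 pages of 4 KB but only 64 pages of 2 MB, so with huge pages the whole buffer can stay mapped in the TLB.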

Create a benchmark (thp_bench.c) that allocates and traverses a large array with a stride that stresses TLB coverage:

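#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#define N (1L << 27)  /* 128 MB */
#define STRIDE 4096   /* one access per 4 KB page */
#define PASSES 100    /* repeat so the timed region is long enough to measure */

int main(void) {
    char *buf = malloc(N);
    if (buf == NULL) { perror("malloc"); return 1; }
    /* Fault in every page before timing; otherwise the loop measures
       page faults on untouched memory rather than TLB misses */
    memset(buf, 1, N);
    volatile long sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < PASSES; p++)
        for (long i = 0; i < N; i += STRIDE) sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
              (t1.tv_nsec - t0.tv_nsec);
    printf("sum=%ld time=%ld us\n", sum, ns / 1000);
    free(buf);
    return 0;
}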

Run with THP enabled vs disabled:

# Enable THP
sudo sh -c 'echo always > /sys/kernel/mm/transparent_hugepage/enabled'
gcc -O2 thp_bench.c -o thp_bench && ./thp_bench

# Disable THP
sudo sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
./thp_bench

Expected results: With a large strided workload, THP typically reduces traversal time by 15–40% due to TLB reach improvement. Results vary by CPU (TLB size and L1/L2 TLB structure matter).

Verify THP promotion actually occurred:

grep -E 'thp_fault_alloc|thp_collapse_alloc' /proc/vmstat
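The same check can be done from inside a program by sampling a counter before and after the region of interest. The sketch below assumes the "name value" line format of /proc/vmstat; read_vmstat_counter is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Read one named counter from a /proc/vmstat-style file ("name value" per
 * line). Returns the value, or -1 if the file or the counter is missing. */
long long read_vmstat_counter(const char *path, const char *name) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char key[128];
    long long value;
    long long result = -1;
    while (fscanf(f, "%127s %lld", key, &value) == 2) {
        if (strcmp(key, name) == 0) {
            result = value;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Call read_vmstat_counter("/proc/vmstat", "thp_fault_alloc") before the allocation and again after the timed loop; a positive difference suggests huge pages were faulted in. The counter is system-wide, so other processes can contribute to the difference.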

Analysis Questions

Why does --pages-as-heap=yes produce a higher peak number than the default Massif mode?
What is the difference between a memory leak and a memory hog? Which is cache_insert exhibiting?
For what allocation sizes does THP help vs hurt? Why does THP cause latency spikes in some real-time workloads?
What tool would you use to find actual memory leaks and memory errors such as use-after-free or double free? (valgrind --tool=memcheck)

Extension Challenges

Add a free_cache() function that walks and frees the list. Verify with Massif that peak memory is unchanged (peak is measured at the point of maximum live allocations, not at program exit).
Use jemalloc as a drop-in allocator (LD_PRELOAD) and compare its heap behavior to glibc's ptmalloc.

Lab 3 — Lock Contention Analysis

Objective

Write a program with deliberate lock contention, measure the contention with perf lock or BCC's lockstat, and eliminate the bottleneck with finer-grained locking.

Setup

# BCC tools (preferred for detail)
sudo apt install bpfcc-tools linux-headers-$(uname -r)
# Or use perf lock (included with linux-perf)

Create the contended program (contended.c):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define THREADS 8
#define OPS     1000000

pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
uint64_t counter = 0;

void *worker(void *arg) {
    for (int i = 0; i < OPS; i++) {
        pthread_mutex_lock(&global_lock);
        counter++;
        pthread_mutex_unlock(&global_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);
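    /* uint64_t is not unsigned long on every platform, so cast for printf */
    printf("counter = %llu\n", (unsigned long long)counter);
    return 0;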
}

gcc -O2 -g -pthread contended.c -o contended

Measuring Lock Contention

Method 1: perf lock (no BCC required)

sudo perf lock record ./contended
sudo perf lock report

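Key fields in the report: acquired (successful lock grabs), contended (times a thread had to wait), avg wait (average wait time per contention event), total wait. Note that perf lock instruments kernel locks rather than pthread mutexes directly; contention on global_lock shows up indirectly, as activity on the kernel locks behind the futex calls that blocked threads make.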

Expected output: With 8 threads and a single global lock, contention rate will be 85–95%. Average wait time will be in the range of several microseconds.

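Method 2: BCC lockstat (klockstat)

# The BCC tool is named klockstat; the Debian/Ubuntu bpfcc-tools package installs it as klockstat-bpfcc
# Start ./contended in another terminal first (raise OPS if it exits before tracing begins)
sudo klockstat-bpfcc -p $(pgrep -n contended) -d 5

This prints average, maximum, and total wait (spin) and hold times per lock call site. klockstat traces kernel mutexes rather than user-space pthread mutexes, so treat its output as a view of kernel-side locking while the program runs, not as a direct measurement of global_lock.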

The Fix — Striped Locking

Replace the single global lock with an array of locks, distributing counter across N shards:

#define SHARDS  16
pthread_mutex_t shard_lock[SHARDS];
uint64_t        shard_counter[SHARDS];

void *worker_sharded(void *arg) {
    unsigned int tid = (unsigned int)(uintptr_t)arg;
    for (int i = 0; i < OPS; i++) {
        int shard = (tid * 2654435761u + i) % SHARDS;  /* multiplicative hash */
        pthread_mutex_lock(&shard_lock[shard]);
        shard_counter[shard]++;
        pthread_mutex_unlock(&shard_lock[shard]);
    }
    return NULL;
}

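Before starting the threads, initialize every shard_lock with pthread_mutex_init and pass each thread's index as the pthread_create argument, since worker_sharded reads its tid from arg. After the benchmark, sum all shards: uint64_t total = 0; for (int i = 0; i < SHARDS; i++) total += shard_counter[i];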
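A minimal sketch of the full sharded run, including lock initialization and the final sum, might look like the following. It is an illustration rather than the lab's reference solution: run_sharded and struct worker_args are hypothetical names, and the worker takes its thread id and iteration count through a struct instead of a cast pointer so the sketch can run with small inputs.

```c
#include <pthread.h>
#include <stdint.h>

#define SHARDS 16
#define MAX_THREADS 64

static pthread_mutex_t shard_lock[SHARDS];
static uint64_t        shard_counter[SHARDS];

/* Hypothetical argument block: thread index plus iteration count. */
struct worker_args { unsigned int tid; int ops; };

/* Same locking pattern as worker_sharded in the lab text. */
static void *worker_sharded(void *arg) {
    const struct worker_args *a = arg;
    for (int i = 0; i < a->ops; i++) {
        int shard = (a->tid * 2654435761u + (unsigned int)i) % SHARDS;
        pthread_mutex_lock(&shard_lock[shard]);
        shard_counter[shard]++;
        pthread_mutex_unlock(&shard_lock[shard]);
    }
    return NULL;
}

/* Initialize the locks, run `nthreads` workers, and return the sum over all
 * shards. Returns 0 if nthreads is outside 1..MAX_THREADS. */
uint64_t run_sharded(int nthreads, int ops) {
    pthread_t threads[MAX_THREADS];
    struct worker_args args[MAX_THREADS];
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return 0;

    for (int s = 0; s < SHARDS; s++) {
        pthread_mutex_init(&shard_lock[s], NULL);
        shard_counter[s] = 0;
    }

    int started = 0;
    for (int t = 0; t < nthreads; t++) {
        args[t].tid = (unsigned int)t;
        args[t].ops = ops;
        if (pthread_create(&threads[t], NULL, worker_sharded, &args[t]) != 0)
            break;  /* join only the threads that actually started */
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    uint64_t total = 0;
    for (int s = 0; s < SHARDS; s++) {
        total += shard_counter[s];
        pthread_mutex_destroy(&shard_lock[s]);
    }
    return total;
}
```

Calling run_sharded(8, 1000000) reproduces the lab's workload; the returned total should equal threads × ops, which is a quick correctness check before comparing timings.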

Re-run with BCC lockstat. Contention should drop to near zero. Measure wall-clock time improvement (time ./contended vs time ./contended_sharded).

Expected improvement: 4–7x throughput increase with 8 threads; the global-lock version typically runs no faster than a single thread, and often slower.

What to Measure

perf lock report: contended count and total wait time before/after.
Wall-clock time (time command): before/after.
CPU utilization (top or perf stat): look for reduced %sys time in the fixed version.

Analysis Questions

Why does the contended version not scale beyond ~1 thread even though the critical section is tiny?
What is the difference between a spinlock and a mutex in terms of CPU utilization under contention?
When does fine-grained locking hurt rather than help (hint: false sharing, lock acquisition overhead)?
How would you use perf stat -e cache-misses to detect false sharing in the sharded version?

Extension Challenges

Replace the mutex with _Atomic uint64_t and atomic_fetch_add. Benchmark. Explain why atomics outperform mutexes for this specific pattern.
Measure the effect of padding shard structures to 64 bytes (one cache line) to eliminate false sharing.

Lab 4 — Network Performance Baseline with iperf3

Objective

Establish a network throughput baseline, tune socket buffer sizes, enable TCP BBR congestion control, and quantify the improvement.

Setup

sudo apt install iperf3
# Two hosts required (or two VMs on the same hypervisor)
# HOST_A = server, HOST_B = client

Baseline Measurement

On the server:

iperf3 -s

On the client (replace SERVER_IP):

iperf3 -c SERVER_IP -t 30 -P 4 -i 5

Flags: -t 30 runs for 30 seconds; -P 4 opens 4 parallel streams (needed to saturate a multi-queue NIC); -i 5 reports every 5 seconds.

Record: sender throughput (Gbps), receiver throughput, retransmits, and the final sender / receiver summary line.

Check current socket buffer limits:

sysctl net.core.rmem_max net.core.wmem_max \
       net.ipv4.tcp_rmem net.ipv4.tcp_wmem

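net.core.rmem_max (often 212992 bytes, or 208 KB, by default) caps receive buffers that an application requests explicitly with SO_RCVBUF, for example through iperf3 -w. Sockets that rely on TCP autotuning are instead capped by the third value of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem. Whichever limit applies becomes the bottleneck on any link whose BDP exceeds it. BDP = bandwidth × RTT; on a 10 Gbps link with 1 ms RTT, BDP = 10 Gbit/s × 0.001 s = 10 Mbit = 1.25 MB.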
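The BDP arithmetic can be sketched as a small helper for checking a buffer limit against a given link. bdp_bytes is a hypothetical name, and the function uses integer math with RTT in microseconds to avoid floating-point rounding.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes.
 * bandwidth_bps: link rate in bits per second.
 * rtt_us: round-trip time in microseconds.
 * bits/s * us / 1e6 gives bits in flight; dividing by 8 gives bytes.
 * The intermediate product can overflow for extreme inputs
 * (roughly above 1.8e19 bit-microseconds). */
uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t rtt_us) {
    return bandwidth_bps * rtt_us / 1000000u / 8u;
}

/* Returns 1 if a buffer of `buffer_bytes` can hold a full BDP, else 0. */
int buffer_covers_bdp(uint64_t buffer_bytes, uint64_t bandwidth_bps,
                      uint64_t rtt_us) {
    return buffer_bytes >= bdp_bytes(bandwidth_bps, rtt_us);
}
```

For the 10 Gbps, 1 ms example, bdp_bytes(10000000000, 1000) is 1,250,000 bytes, so a 212,992-byte buffer cannot keep the link full.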

Socket Buffer Tuning

Increase socket buffers:

sudo sysctl -w net.core.rmem_max=134217728      # 128 MB
sudo sysctl -w net.core.wmem_max=134217728
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
sudo sysctl -w net.ipv4.tcp_moderate_rcvbuf=1

Re-run the iperf3 benchmark with identical parameters. Compare throughput and retransmit count.

TCP BBR Congestion Control

The default TCP congestion control in most Linux kernels is CUBIC, which uses packet loss as a signal. BBR (Bottleneck Bandwidth and RTT) uses bandwidth and RTT measurements instead, achieving higher throughput in lossy or high-latency environments.

Check current algorithm:

sysctl net.ipv4.tcp_congestion_control

Enable BBR:

sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl net.ipv4.tcp_available_congestion_control   # verify bbr is listed
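The sysctl changes the system-wide default. Linux also lets a program inspect or choose the algorithm per socket through the TCP_CONGESTION socket option; the sketch below shows that route, with hypothetical wrapper names. It is an alternative to the sysctl, not something the lab steps require.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Copy the congestion control algorithm name of TCP socket `fd` into `name`
 * (always NUL-terminated). Returns 0 on success, -1 on error. */
int get_congestion_control(int fd, char *name, size_t size) {
    if (size == 0)
        return -1;
    socklen_t len = (socklen_t)(size - 1);   /* keep room for the NUL */
    memset(name, 0, size);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) < 0)
        return -1;
    return 0;
}

/* Request algorithm `name` (for example "bbr") for this socket only.
 * Returns -1 if the algorithm is not loaded or not permitted for the caller. */
int set_congestion_control(int fd, const char *name) {
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
                      (socklen_t)strlen(name));
}
```

After creating a TCP socket, get_congestion_control(fd, buf, sizeof buf) should report the system default (for example cubic, or bbr once the sysctl above has been applied).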

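Re-run iperf3. Congestion control is chosen by the sending side, so enable BBR on the host that transmits the bulk data (the client, unless you run iperf3 with -R). For a local LAN with low loss, the improvement will be modest. To see BBR's full advantage, simulate packet loss:

# On the sending host, add 1% random loss to outgoing packets with tc
# (netem on the receiver would only drop ACKs); replace eth0 with your interface
sudo tc qdisc add dev eth0 root netem loss 1%
# Run iperf3 on the client and compare CUBIC vs BBR throughput
sudo tc qdisc del dev eth0 root   # remove after test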

Expected result: Under 1% simulated loss, BBR typically maintains 2–4x higher throughput than CUBIC.

What to Measure

Throughput (Gbps) under: baseline, buffer-tuned, BBR.
Retransmit count under each configuration.
Throughput under 1% loss: CUBIC vs BBR.

Analysis Questions

What is the Bandwidth-Delay Product and why does it determine the required socket buffer size?
Why does running a single iperf3 stream often under-report link capacity on a multi-queue NIC?
What is TCP slow start, and how does it interact with connection duration in a microservices environment?
How would you use ss -ti to inspect the congestion algorithm and current cwnd of a live TCP connection?

Extension Challenges

Use netperf in addition to iperf3 to measure request/response latency (RR mode) rather than bulk throughput.
Toggle Generic Receive Offload (GRO) and Generic Segmentation Offload (GSO) with ethtool -K (both are usually enabled by default) and measure the difference in CPU overhead.

Lab 5 — io_uring vs epoll HTTP Server Benchmark

Objective

Implement a minimal HTTP server twice — once using epoll and once using io_uring — benchmark both with wrk2, and compare p99 latency and throughput.

Setup

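# libssl-dev and zlib1g-dev are needed to build wrk2
sudo apt install liburing-dev wrk libssl-dev zlib1g-dev
# wrk2 (constant-rate load generator, better for latency measurement than wrk)
# Note: the wrk2 build also produces a binary named wrk, at ~/wrk2/wrk
git clone https://github.com/giltene/wrk2 ~/wrk2
cd ~/wrk2 && make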

epoll HTTP Server (Skeleton)

The server listens on port 8080, accepts connections with epoll, and replies with a static 200 OK for every request:

// Key epoll loop structure (fill in with your socket boilerplate).
// listen_fd must be non-blocking: edge-triggered mode drains it in a loop.
int epfd = epoll_create1(0);
struct epoll_event ev, events[MAX_EVENTS];
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == listen_fd) {
            // Edge-triggered: accept until the queue is empty (accept4 fails
            // with EAGAIN), otherwise pending connections are not reported again.
            int conn;
            while ((conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK)) >= 0) {
                ev.events = EPOLLIN | EPOLLET;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            }
        } else {
            // read request, write response, then close (no keep-alive)
            char buf[1024];
            ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
            if (r > 0)
                write(events[i].data.fd, HTTP_OK, strlen(HTTP_OK));
            close(events[i].data.fd);   // also removes the fd from the epoll set
        }
    }
}

Compile: gcc -O2 -g epoll_server.c -o epoll_server
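The skeleton leaves the socket boilerplate and the HTTP_OK response undefined. One possible way to fill them in is sketched below; make_listener is a hypothetical helper name, and the response text is an assumption, since the lab only specifies "a static 200 OK".

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumed response; "Connection: close" matches a server that closes the
 * socket after every reply. */
static const char HTTP_OK[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 2\r\n"
    "Connection: close\r\n"
    "\r\n"
    "OK";

/* Create a non-blocking TCP listener bound to all interfaces.
 * Returns the listening fd, or -1 on failure. Port 0 lets the kernel choose. */
int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0)
        return -1;

    /* Allow quick restarts while old connections sit in TIME_WAIT */
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this in place, the server would start with int listen_fd = make_listener(8080); before entering the event loop. Note that accept4 requires #define _GNU_SOURCE before the includes on glibc.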

io_uring HTTP Server (Skeleton)

io_uring submits I/O operations as ring buffer entries, batching system calls and enabling kernel-side polling:

#include <liburing.h>

// ACCEPT_TYPE, RECV_TYPE, SEND_TYPE: distinct 32-bit tags you define.
// The tag lives in the low 32 bits of user_data, the connection fd in the high 32.
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
static char buf[1024];   // shared scratch buffer; request contents are ignored

// Submit accept
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
sqe->user_data = ACCEPT_TYPE;
io_uring_submit(&ring);

// Event loop
struct io_uring_cqe *cqe;
while (1) {
    if (io_uring_wait_cqe(&ring, &cqe) < 0)
        continue;
    uint64_t data = cqe->user_data;
    int res = cqe->res;              // result of the completed operation
    io_uring_cqe_seen(&ring, cqe);   // fields are copied, so release the CQE

    if (data == ACCEPT_TYPE) {
        if (res >= 0) {              // res is the new connection fd
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_recv(sqe, res, buf, sizeof(buf), 0);
            sqe->user_data = RECV_TYPE | ((uint64_t)res << 32);
        }
        // Re-arm accept
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
        sqe->user_data = ACCEPT_TYPE;
        io_uring_submit(&ring);
    } else if ((data & 0xFFFFFFFF) == RECV_TYPE) {
        int conn_fd = data >> 32;
        if (res <= 0) {              // peer closed or recv failed
            close(conn_fd);
            continue;
        }
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_send(sqe, conn_fd, HTTP_OK, strlen(HTTP_OK), 0);
        sqe->user_data = SEND_TYPE | ((uint64_t)conn_fd << 32);
        io_uring_submit(&ring);
    } else {
        close(data >> 32);           // send completed: close the connection
    }
}

Compile: gcc -O2 -g uring_server.c -o uring_server -luring
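The skeleton packs an operation tag and a file descriptor into the 64-bit user_data field but leaves the tag constants undefined. A minimal sketch of that encoding follows; the tag values and helper names are hypothetical, and any distinct 32-bit values would work.

```c
#include <stdint.h>

/* Hypothetical tag values; the lab text leaves ACCEPT_TYPE, RECV_TYPE and
 * SEND_TYPE undefined. */
enum { ACCEPT_TYPE = 1, RECV_TYPE = 2, SEND_TYPE = 3 };

/* Pack an operation tag (low 32 bits) and a file descriptor (high 32 bits)
 * into the 64-bit user_data field of a submission queue entry. */
static inline uint64_t pack_user_data(uint32_t type, int fd) {
    return (uint64_t)type | ((uint64_t)(uint32_t)fd << 32);
}

/* Extract the operation tag from a completion's user_data. */
static inline uint32_t user_data_type(uint64_t user_data) {
    return (uint32_t)(user_data & 0xFFFFFFFFu);
}

/* Extract the file descriptor from a completion's user_data. */
static inline int user_data_fd(uint64_t user_data) {
    return (int)(uint32_t)(user_data >> 32);
}
```

In the event loop, sqe->user_data = pack_user_data(RECV_TYPE, conn_fd) replaces the inline shift, and user_data_type and user_data_fd replace the mask and shift on the completion side. For larger servers, a pointer to a per-connection struct is a common alternative to packing integers.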

Benchmarking with wrk2

wrk2 generates load at a constant request rate (unlike wrk, which generates as much load as possible). This allows fair p99 latency comparison.

# Warm up
~/wrk2/wrk -t4 -c100 -d10s -R 10000 http://localhost:8080/

# Benchmark at 80% of maximum throughput: find the maximum first with wrk,
# then set -R to 0.8 times that rate (50000 below is a placeholder)
~/wrk2/wrk -t4 -c100 -d60s -R 50000 --latency http://localhost:8080/

Record from the output: throughput (requests/sec), p50 latency, p99 latency, p99.9 latency, and max latency. Run each server three times and take the median of p99.
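For reference, the percentile definition behind these figures can be sketched with the nearest-rank method. This only illustrates the definition on raw samples; wrk2 itself reports percentiles from an HdrHistogram with coordinated-omission correction, so its numbers are not computed this way. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison for qsort that avoids overflow on subtraction. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort latency samples in place, ascending. */
void sort_latencies(uint64_t *samples, size_t n) {
    qsort(samples, n, sizeof *samples, cmp_u64);
}

/* Nearest-rank percentile of an ascending-sorted array.
 * per_mille: 500 for p50, 990 for p99, 999 for p99.9. Requires n > 0. */
uint64_t percentile_nearest_rank(const uint64_t *sorted, size_t n,
                                 unsigned per_mille) {
    size_t rank = ((size_t)per_mille * n + 999) / 1000;  /* ceil(p * n) */
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return sorted[rank - 1];
}
```

The same function gives the median of three runs: sort the three p99 values and take percentile_nearest_rank(runs, 3, 500), which selects the middle one.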

What to Measure

Metric	epoll	io_uring
Max throughput (req/s)	—	—
p50 latency (us)	—	—
p99 latency (us)	—	—
p99.9 latency (us)	—	—
Syscalls per req (`strace -c`)	—	—

Expected results: For a simple static-response server, io_uring reduces syscall count by batching accept/recv/send into ring submissions. p99 latency advantage is typically 20–50% at high load. At low load, epoll and io_uring perform similarly.

Measure syscall overhead directly:

sudo perf stat -e syscalls:sys_enter_epoll_wait,\
syscalls:sys_enter_read,syscalls:sys_enter_write \
./epoll_server &
~/wrk2/wrk -t4 -c100 -d10s -R 20000 http://localhost:8080/
sudo pkill -x epoll_server   # stop the server so perf stat prints its counts

Analysis Questions

Why is p99 (not average) latency the critical metric for user-facing services?
What is the kernel-poll mode of io_uring (IORING_SETUP_SQPOLL) and when is it appropriate?
Why does the constant-rate load generator wrk2 give more informative latency results than the closed-loop wrk, which sends a new request on a connection only after the previous response arrives?
What would you need to change in the io_uring server to handle 10,000 concurrent connections efficiently (hint: fixed file table, registered buffers)?

Extension Challenges

Add HTTP keep-alive to both servers and re-benchmark. The io_uring advantage typically grows with connection reuse.
Implement an IOSQE_FIXED_FILE variant (descriptors registered with io_uring_register_files) and measure the overhead reduction from eliminating per-request file descriptor table lookups.
Use bpftrace to trace the path of a single request through the kernel in each implementation.