Exokernels

Technical Overview

The exokernel, proposed by Dawson Engler, M. Frans Kaashoek, and James O'Toole at MIT in 1995, takes the most radical position in OS architecture: the kernel should expose hardware resources directly to applications and enforce only resource ownership and protection, with no abstractions of its own. The slogan comes from the title of a companion 1995 HotOS paper by Engler and Kaashoek: "Exterminate All Operating System Abstractions."

Traditional OS kernels provide abstractions over hardware: files instead of disk blocks, sockets instead of network packets, virtual address spaces instead of physical pages. The exokernel thesis argues that these abstractions are simultaneously too general (unsuitable for any specific workload) and too restrictive (preventing applications from implementing better policies). The solution: eliminate the abstraction layer from the kernel entirely.

Prerequisites

Physical memory management (page frames, page tables)
Disk block layout and raw I/O concepts
Virtual memory abstraction vs physical memory
CPU scheduling fundamentals
Traditional OS abstractions (VFS, socket API, VM system)

Core Concepts

The Core Thesis

Traditional OS Abstraction Stack
==================================

Application
    |
    v
File API (read/write/seek)
    |          <- abstraction layer hides:
    v              - block layout
Block Device       - caching policy
    |              - prefetch strategy
    v
Physical Disk


Exokernel Design
=================

Application / LibOS
    |
    v
Exokernel (just multiplexing)
    |  <- kernel only:
    v       - securely allocates disk blocks to apps
Physical    - tracks who owns which block
Disk        - prevents unauthorized access
            - NOTHING ELSE

An application that wants a specific caching policy implements it. An application that wants a specific disk layout implements it. A database that wants to bypass all file system overhead talks directly to disk blocks it owns.

Exokernel Mechanisms

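Secure Binding: The exokernel securely binds a hardware resource to an application. Authorization is checked once, at bind time, when the application proves it has rights to the resource; the kernel then records the binding. From that point on the application uses the resource directly, and the kernel needs only a cheap ownership check on each access, without understanding what the resource is used for.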

Aegis Exokernel Resource Allocation
=====================================

App requests pages:
  exo_alloc_pages(n, protection_key)
  --> kernel verifies app has sufficient capability
  --> kernel marks pages as owned by this app
  --> returns physical page numbers to app

App directly manipulates its pages:
  exo_prot_ctrl(ppn, permissions)
  --> sets hardware protection bits directly
  --> kernel validates (still owned by app)
  --> TLB update to hardware directly
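
The bind-then-use flow above can be sketched as a toy model. This is a minimal illustration, not Aegis code: the class `ToyExokernel`, its `quota` field (standing in for a capability), and the method names are all hypothetical, chosen to mirror the pseudo-calls in the diagram.

```python
from dataclasses import dataclass, field


class ExoError(Exception):
    """Raised when a request fails a capability or ownership check."""


@dataclass
class ToyExokernel:
    """Hypothetical model: tracks who owns each page, never what is in it."""
    num_pages: int
    quota: dict = field(default_factory=dict)  # app -> pages it may still bind
    owner: dict = field(default_factory=dict)  # physical page number -> app
    prot: dict = field(default_factory=dict)   # physical page number -> permissions

    def alloc_pages(self, app, n):
        # Bind time: the authorization check happens once, here.
        if self.quota.get(app, 0) < n:
            raise ExoError(f"{app} lacks the capability for {n} pages")
        free = [p for p in range(self.num_pages) if p not in self.owner][:n]
        if len(free) < n:
            raise ExoError("out of physical pages")
        for ppn in free:
            self.owner[ppn] = app
            self.prot[ppn] = "rw"
        self.quota[app] -= n
        return free  # physical page numbers are exposed to the app

    def prot_ctrl(self, app, ppn, permissions):
        # Use time: only an ownership lookup, no policy decision.
        if self.owner.get(ppn) != app:
            raise ExoError(f"{app} does not own page {ppn}")
        self.prot[ppn] = permissions


kernel = ToyExokernel(num_pages=8, quota={"db": 4, "web": 2})
db_pages = kernel.alloc_pages("db", 3)
kernel.prot_ctrl("db", db_pages[0], "r")

try:
    kernel.prot_ctrl("web", db_pages[0], "rw")  # another app's page
    cross_app_blocked = False
except ExoError:
    cross_app_blocked = True
```

The point of the split is that `alloc_pages` carries the policy question (may this app have the resource?) while `prot_ctrl` only answers the protection question (does this app own it?).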

Visible Resource Revocation: When the kernel needs to reclaim a resource (e.g., physical page), it tells the application before taking it. The application gets a chance to save state. This "visible revocation" protocol is the key to allowing applications to manage their own state.

Revocation Protocol
====================

Kernel → App: "I need page frame 0x1234 back"
  [revocation notice via software exception or IPC]

App:
  - copies data to secondary storage if dirty
  - updates internal page table
  - acknowledges revocation

Kernel:
  - receives acknowledgment
  - reassigns physical frame

vs. Traditional Kernel:
  - kernel decides to page out, no notification
  - app never knows physical frame was reused

Abort Protocol: If an application fails to acknowledge revocation within a deadline, the kernel forcibly revokes the resource. This prevents a slow or buggy application from holding resources indefinitely.
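
Visible revocation and the abort fallback can be sketched together. This is a simplified model under stated assumptions: `ToyLibOS`, `revoke`, and the `cooperative` flag are hypothetical names, and the deadline is modeled as a boolean return value rather than a real timer.

```python
class ToyLibOS:
    """Hypothetical libOS that knows which of its virtual pages map to which frame."""

    def __init__(self, name, cooperative=True):
        self.name = name
        self.cooperative = cooperative
        self.page_table = {}  # virtual page number -> physical frame
        self.dirty = set()    # frames with unsaved data
        self.swapped = []     # frames whose contents were written out

    def on_revoke(self, frame):
        """Handle a revocation notice; return True to acknowledge in time."""
        if not self.cooperative:
            return False  # models an app that misses the deadline
        if frame in self.dirty:
            self.swapped.append(frame)  # stand-in for a write to secondary storage
            self.dirty.discard(frame)
        self.page_table = {v: f for v, f in self.page_table.items() if f != frame}
        return True


def revoke(owner, apps, frame):
    """Ask the owner to give up a frame; break the binding by force if it does not."""
    acked = apps[owner[frame]].on_revoke(frame)
    # Either way the kernel ends up with the frame. In the abort case the
    # libOS never cleaned up, so its own page table is left stale.
    del owner[frame]
    return "acked" if acked else "aborted"


good = ToyLibOS("good")
good.page_table = {0x10: 0x1234, 0x11: 0x1235}
good.dirty = {0x1234}
slow = ToyLibOS("slow", cooperative=False)
slow.page_table = {0x20: 0x2000}

owner = {0x1234: "good", 0x1235: "good", 0x2000: "slow"}
apps = {"good": good, "slow": slow}

first = revoke(owner, apps, 0x1234)   # cooperative path
second = revoke(owner, apps, 0x2000)  # abort path
```

In the cooperative case the application chose what to save and updated its own mappings; in the abort case it loses the frame without having had that chance, which is the incentive to respond promptly.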

Secure Dispatchers: Hardware events (network packets arriving, disk I/O completing) are dispatched directly to the application that owns the resource, bypassing the kernel where possible.
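
For the network case, dispatch amounts to matching each arriving packet against filters that applications have bound. The sketch below is illustrative only: `PacketDemux` and the dictionary packet format are hypothetical, and it omits the check a real exokernel needs to stop one application's filter from claiming another's traffic.

```python
class PacketDemux:
    """Hypothetical demultiplexer: first matching filter receives the packet."""

    def __init__(self):
        self.bindings = []  # list of (app, predicate, receive_queue)

    def bind(self, app, predicate):
        # The returned queue stands in for a receive buffer owned by the app.
        queue = []
        self.bindings.append((app, predicate, queue))
        return queue

    def dispatch(self, packet):
        for app, predicate, queue in self.bindings:
            if predicate(packet):
                queue.append(packet)  # delivered straight to the owner
                return app
        return None  # nobody has bound this traffic; drop it


demux = PacketDemux()
web_q = demux.bind("web", lambda p: p["proto"] == "tcp" and p["dst_port"] == 80)
dns_q = demux.bind("dns", lambda p: p["proto"] == "udp" and p["dst_port"] == 53)

packets = [
    {"proto": "tcp", "dst_port": 80, "payload": b"GET /"},
    {"proto": "udp", "dst_port": 53, "payload": b"query"},
    {"proto": "tcp", "dst_port": 22, "payload": b"ssh"},
]
owners = [demux.dispatch(p) for p in packets]
```

The demultiplexer decides only who receives a packet; parsing and protocol handling stay in the application's libOS.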

Library OS (LibOS)

The LibOS is the key enabling mechanism. Because the exokernel provides only raw resource access, applications need some OS functionality. The LibOS provides conventional OS abstractions (files, sockets, virtual addresses) as a user-space library:

Exokernel System Structure
===========================

+------------------------------------------+
|         Application Code                 |
+------------------------------------------+
|         Library OS (libOS)               |
|                                          |
|  +----------+ +---------+ +-----------+  |
|  | Virtual  | |  File   | |  Socket   |  |
|  |  Memory  | | System  | |   API     |  |
|  | Manager  | |  (ext2) | |  (TCP/IP) |  |
|  +----------+ +---------+ +-----------+  |
|                                          |
|  Cache Manager | Scheduler | VM Pager    |
+------------------------------------------+
         |         exokernel API
+------------------------------------------+
|              Aegis Exokernel             |
|                                          |
|  - Physical page allocation              |
|  - CPU time allocation (time slices)     |
|  - Disk block allocation                 |
|  - Network packet demultiplexing         |
|  - TLB management (app-controlled)       |
+------------------------------------------+
|            Physical Hardware             |
+------------------------------------------+
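
The layering in the diagram can be made concrete with a tiny model in which the "kernel" side knows only blocks and owners, and the file abstraction lives entirely in library code. Everything here is a hypothetical sketch: `RawDisk`, `TinyFileLib`, and the 8-byte block size are invented for illustration and are not the ExOS filesystem.

```python
BLOCK_SIZE = 8  # deliberately tiny so the example spans several blocks


class RawDisk:
    """Stand-in for the exokernel side: owned blocks, no notion of files."""

    def __init__(self, nblocks):
        self.data = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.owner = {}  # block number -> app

    def alloc(self, app):
        for blk in range(len(self.data)):
            if blk not in self.owner:
                self.owner[blk] = app
                return blk
        raise RuntimeError("disk full")

    def _check(self, app, blk):
        if self.owner.get(blk) != app:
            raise PermissionError(f"{app} does not own block {blk}")

    def write(self, app, blk, chunk):
        self._check(app, blk)
        self.data[blk] = chunk.ljust(BLOCK_SIZE, b"\0")

    def read(self, app, blk):
        self._check(app, blk)
        return self.data[blk]


class TinyFileLib:
    """LibOS side: names, lengths, and block lists are purely library state."""

    def __init__(self, disk, app):
        self.disk, self.app = disk, app
        self.inodes = {}  # file name -> (list of block numbers, byte length)

    def write_file(self, name, payload):
        blocks = []
        for off in range(0, len(payload), BLOCK_SIZE):
            blk = self.disk.alloc(self.app)
            self.disk.write(self.app, blk, payload[off:off + BLOCK_SIZE])
            blocks.append(blk)
        self.inodes[name] = (blocks, len(payload))

    def read_file(self, name):
        blocks, length = self.inodes[name]
        raw = b"".join(self.disk.read(self.app, b) for b in blocks)
        return raw[:length]


disk = RawDisk(nblocks=16)
fs = TinyFileLib(disk, "kvstore")
fs.write_file("log", b"hello exokernel")
contents = fs.read_file("log")
```

A different application could link a different `TinyFileLib` with another layout or caching policy against the same `RawDisk`, which is the flexibility the exokernel design is after.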

Aegis and ExOS

The MIT 1995 paper introduced two systems:

Aegis: The exokernel itself. Written in ~2,000 lines of C. Implements:
- Physical memory management (page frame allocation, TLB loading)
- CPU multiplexing (time slices, preemption notification)
- Disk I/O (block-level, owned by application)
- Network demultiplexing (packet filters, owned receiver)

ExOS: The library OS that ran on top of Aegis. Implemented:
- Virtual memory with application-controlled page tables
- A simple UNIX-like filesystem
- TCP/IP stack
- fork(), exec(), UNIX processes

The performance results from the 1995 paper were striking. On VM-intensive benchmarks, ExOS with application-managed VM matched or exceeded Ultrix, the mature production UNIX it was measured against.

Exokernel vs Traditional OS Comparison

Exokernel vs Traditional OS
=============================

Feature                | Traditional Kernel     | Exokernel
-----------------------|------------------------|-------------------------------
Abstraction level      | Files, sockets, VM     | Raw blocks, packets, pages
Cache policy           | Kernel-controlled      | Application-controlled
Scheduling granularity | Kernel decides         | App hints + kernel enforces
VM page replacement    | LRU or kernel algo     | App implements own pager
Disk layout            | Filesystem abstraction | App manages block placement
Network protocol       | Kernel TCP/IP          | App implements in libOS
Trust required         | Trust kernel policy    | Trust only exokernel isolation
New OS abstraction     | Kernel modification    | New libOS
Context switch cost    | Kernel overhead        | Minimal (libOS-controlled)
Crash isolation        | Process boundary       | LibOS boundary
Debug difficulty       | Moderate               | High (debugging libOS)
POSIX compatibility    | Native                 | Through libOS

Key insight: A database can implement:
  - Its own buffer manager (beats kernel page cache for DB workloads)
  - Its own I/O scheduler (sequential scan prefetch > general LRU)
  - Its own transaction log layout (aligned to SSD erase blocks)
  All without a single kernel modification.
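
The buffer-manager claim is easy to check on a small case. The sketch below replays a repeated sequential scan over five blocks through a four-block cache, once with LRU eviction and once evicting the most recently used block (MRU), a policy a database might choose for scans. The function name `hit_count` and the workload are illustrative assumptions, not taken from the paper.

```python
from collections import OrderedDict


def hit_count(accesses, capacity, evict_mru):
    """Replay an access trace and count cache hits under LRU or MRU eviction."""
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
            continue
        if len(cache) >= capacity:
            # last=True removes the most recently used entry, last=False the least.
            cache.popitem(last=evict_mru)
        cache[block] = True
    return hits


scan = list(range(5)) * 4  # scan blocks 0..4 four times in a row
lru_hits = hit_count(scan, capacity=4, evict_mru=False)
mru_hits = hit_count(scan, capacity=4, evict_mru=True)
```

On this trace LRU scores no hits at all, because each block is evicted just before the scan returns to it, while MRU hits on 12 of the 20 accesses. A general-purpose kernel cannot know the access pattern is a loop; the application can.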

Why Abstractions Hurt Specific Workloads

The paper's central argument with examples:

Databases: A database management system (PostgreSQL, MySQL) maintains its own buffer pool. The kernel's page cache holds a second copy of the same data. The database's ideal replacement policy is also quite different from LRU: the transaction log is written sequentially and rarely reread, while frequently accessed hot rows should never be evicted. The OS abstraction actively hurts the database.

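Databases still work around this as of 2024. MySQL's InnoDB is commonly configured with innodb_flush_method=O_DIRECT so that data files bypass the page cache. PostgreSQL, by contrast, still reads and writes data files through the page cache and accepts the double buffering; direct I/O is available there only as a developer option (debug_io_direct, added in PostgreSQL 16). Either way, the OS abstraction fights the workload.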

Network Servers: A high-performance HTTP server wants to bypass the TCP stack for specific operations (TLS acceleration, scatter-gather DMA directly from file content). The OS socket abstraction doesn't allow this. Zero-copy sendfile() is a partial workaround added specifically because the general abstraction was too costly.
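
As a concrete look at the sendfile() workaround, the sketch below pushes a file's contents into a socket without staging them in a user-space buffer. It assumes a Linux or other Unix system where Python exposes `os.sendfile`; a local `socketpair` stands in for a client connection, and the payload is arbitrary test data.

```python
import os
import socket
import tempfile

payload = b"static file contents\n" * 100

with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    f.flush()

    server, client = socket.socketpair()  # stand-in for an accepted connection
    with server, client:
        in_fd = os.open(f.name, os.O_RDONLY)
        try:
            sent = 0
            while sent < len(payload):
                # The kernel moves data from the file to the socket; the bytes
                # never pass through a buffer owned by this process.
                n = os.sendfile(server.fileno(), in_fd, sent, len(payload) - sent)
                if n == 0:
                    break  # unexpected end of file
                sent += n
        finally:
            os.close(in_fd)

        server.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        received = b""
        while True:
            chunk = client.recv(65536)
            if not chunk:
                break
            received += chunk
```

The application still cannot choose how the kernel stages or schedules the transfer, which is why the text calls this a partial workaround rather than application-level control.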

Real-Time Systems: An RTOS application needs deterministic scheduling. The kernel's general-purpose scheduler doesn't provide this. PREEMPT_RT Linux patches the kernel; an exokernel-based system could implement a deterministic scheduler as a libOS without kernel modification.

Historical Context

The 1995 MIT Paper

"Exokernel: An Operating System Architecture for Application-Level Resource Management" by Engler, Kaashoek, and O'Toole. Published at SOSP 1995, it was one of the most influential OS papers of the decade. Its abstract argues, in paraphrase:

Traditional operating systems limit the performance, flexibility, and functionality of applications by fixing the interface and implementation of OS abstractions. The exokernel architecture addresses this by exporting hardware resources through a low-level interface, allowing untrusted library operating systems to implement traditional operating system abstractions.

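The paper reported that most primitive kernel operations (such as exception handling and protected control transfer) were 10 to 100 times faster in Aegis than in Ultrix, and that application-level virtual memory and IPC primitives were 5 to 40 times faster than Ultrix's equivalents.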

The Influence on Research

The exokernel was never widely deployed as a production system, but it profoundly influenced OS research:

Library OS research: The concept of putting OS code in application address space was picked up by unikernel research (MirageOS, IncludeOS — see 05-unikernels.md)
Bypass I/O: DPDK, SPDK, io_uring, RDMA — all are "exokernel ideas without the exokernel"
Application-level paging: hugepages, madvise(), io_uring — giving applications more control over memory and I/O

Production Examples

Dune: Exokernel Ideas in Linux (2012)

Stanford's Dune project used Intel VT-x hardware virtualization to give processes direct access to hardware features (page tables, TLB management, exceptions) while maintaining OS compatibility:

Dune Architecture
==================

Normal Process             Dune Process
   User space (ring 3)       VMX non-root mode (ring 0 and ring 3)
      |                      direct access to page tables,
      |                      exceptions, privilege modes
      |                         |
   Syscall                   VMCALL (VT-x)
      |                         |
   Linux kernel              Linux kernel + Dune module
                             (VMX root mode)

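A Dune process runs in VMX non-root mode with access to ring 0, so it can handle its own exceptions and manipulate its own page tables without entering the Linux kernel. The Dune paper reports that these in-process operations are much cheaper than their Linux equivalents (signals, mprotect()), while ordinary system calls become somewhat more expensive because each one is a VMCALL into the kernel. The paper's example applications were a sandbox for untrusted code, a privilege separation facility, and a garbage collector.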

Arrakis (2014)

University of Washington's Arrakis explicitly implemented exokernel principles for cloud workloads, using hardware I/O virtualization (SR-IOV and the IOMMU) to give applications direct access to virtualized network and storage devices, with the kernel reduced to a control plane:

Network I/O bypassing the kernel: 1.7µs latency vs 13µs through kernel networking
Disk I/O bypassing the kernel: eliminated OS noise for storage latency
Used as a platform for a Redis implementation; the paper reports 2-5x lower latency and up to 9x higher throughput than a well-tuned Linux baseline

Haven (Microsoft Research, 2014)

Microsoft Research's Haven project ran a library OS inside Intel SGX enclaves, essentially a userspace OS for trusted execution. The libOS, derived from Drawbridge, provides Windows OS services to unmodified applications (the paper's examples include SQL Server) running inside the enclave.

The DPDK/SPDK Lineage

Intel's DPDK (Data Plane Development Kit) and SPDK (Storage Performance Development Kit) implement the exokernel principle for production use:

DPDK Architecture (exokernel principle in practice)
=====================================================

Traditional:                    DPDK:

App → socket() → kernel        App → DPDK → NIC driver (PMD)
      TCP/IP stack                   (runs entirely in userspace)
      net driver                     
      NIC hardware              NIC: polled directly by app code
                                     no kernel involvement

Latency: 10-50µs               Latency: 0.5-2µs
Throughput: ~1-10 Mpps         Throughput: 80-100 Mpps (10GbE+)

DPDK is production-deployed in Cisco routers, Juniper network appliances, cloud provider VPC implementations (AWS VPC, Azure vNet), and telecom 5G packet processing. It's the exokernel idea — "let applications manage the hardware" — applied to the most performance-critical path in network infrastructure.

io_uring as Partial Exokernel

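Linux's io_uring (2019) achieves exokernel-like performance for file I/O by:
- Sharing submission and completion ring buffers between app and kernel, so request descriptors and results are not copied across the boundary
- Allowing submission and completion without syscalls on hot paths (for example, with submission-queue polling enabled)
- Letting the application control submission batching and ordering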

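io_uring benchmarks show roughly 350ns per file I/O operation on fast NVMe compared to ~1µs for pread(). These figures are best read as per-request software overhead rather than end-to-end device latency, since the device itself takes microseconds to respond. The savings come from eliminating the syscall and per-operation overhead, approaching exokernel performance within the Linux architecture.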
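
The shared-ring idea can be modeled in a few lines. This is a conceptual sketch, not the io_uring ABI: `Ring`, `kernel_poll`, and the dictionary-shaped entries are hypothetical, and in the real interface the two rings live in memory mapped into both the application and the kernel.

```python
class Ring:
    """Fixed-size single-producer, single-consumer ring with free-running indices."""

    def __init__(self, entries):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced by the consumer
        self.tail = 0  # advanced by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


storage = {0: b"alpha", 1: b"beta"}  # pretend block device
sq, cq = Ring(4), Ring(4)            # submission and completion queues


def kernel_poll(sq, cq):
    """Stand-in for the kernel side: drain submissions, post completions."""
    handled = 0
    while True:
        sqe = sq.pop()
        if sqe is None:
            return handled
        cq.push({"user_data": sqe["user_data"], "result": storage[sqe["block"]]})
        handled += 1


# The application queues a batch; no per-request kernel entry is needed.
for user_data, block in enumerate([1, 0]):
    sq.push({"op": "read", "block": block, "user_data": user_data})

handled = kernel_poll(sq, cq)
completions = [cq.pop(), cq.pop()]
```

Batching falls out of the structure: the application decides how many entries to queue before anyone drains them, and `user_data` lets it match completions back to requests.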

Debugging Notes

# Debugging in an exokernel/libOS environment is harder because
# the OS is in application space. Traditional kernel debuggers
# don't see libOS state.

# For Dune-based systems:
# Use hardware breakpoints via VT-x debug facilities
# Monitor VMCALLs for kernel interaction tracing

# For DPDK applications (practical exokernel debugging):

# DPDK metrics (packet drop, burst efficiency):
dpdk-proc-info -- --stats

# DPDK testpmd for driver testing:
testpmd -l 0-3 -n 4 -- -i --portmask=0x1
testpmd> show port stats 0

# Observe raw packet flow without kernel involvement:
testpmd> start tx_first
testpmd> show port xstats 0

# SPDK trace for storage bypass (shm name and PID of the running SPDK app):
spdk_trace -s <app_name> -p <pid> -t

LibOS Debugging Challenges

In a libOS, a bug in the libOS itself manifests as an application crash with no kernel-level diagnosis. The libOS must implement its own panic/diagnostic infrastructure. This is a significant development overhead vs. relying on the OS.

Security Implications

Protection Without Abstraction

The exokernel maintains protection without abstraction: it enforces who owns which resources but doesn't define how they're used. This means:

Strength: A compromise of one application's libOS cannot access another application's physical resources — the exokernel enforces ownership.

Weakness: The libOS itself becomes a significant security component. If the libOS is buggy (buffer overflow in the TCP implementation, use-after-free in the VM manager), the application is fully compromised. In a traditional kernel, these bugs exist in trusted kernel code that's more carefully reviewed. In a libOS, each application carries its own potentially buggy OS implementation.

Trusted Computing Base

Traditional OS: TCB = kernel (large, but reviewed)
Exokernel: TCB = exokernel (tiny) + libOS (per-application, varies)

The exokernel's TCB is smaller, and it is the only code that every application must trust. But each application also depends on its own libOS, so the total TCB per application can be larger once the libOS is included.

DPDK Security Considerations

DPDK gives user-space applications direct access to NIC hardware by mapping device registers and DMA rings into the process (typically through VFIO or UIO). A bug in a DPDK application can:
- Corrupt the NIC's hardware queues
- Interfere with other VFs on SR-IOV hardware (if hypervisor protection is incomplete)
- Generate packets with arbitrary source MAC/IP (bypassing kernel netfilter rules)

Production DPDK deployments use SR-IOV with hypervisor-level isolation between VFs, and restrict DPDK processes to dedicated cores and NICs.

Performance Implications

The 1995 Numbers (Historical)

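From the Engler et al. paper, measured on MIPS-based DECstations (the paper reports the DECstation 2100, 3100, and 5000/125, at 12.5 to 25 MHz):

Operation              | Ultrix  | ExOS (LibOS) | Improvement
-----------------------|---------|--------------|------------
IPC (null RPC)         | 9.9 µs  | 0.5 µs       | 20x
Protected control xfer | 19.8 µs | 2.0 µs       | 10x
Exception dispatch     | 9.5 µs  | 0.5 µs       | 19x
VM map (page)          | 16.3 µs | 0.9 µs       | 18x
VM unmap (page)        | 14.6 µs | 0.5 µs       | 29x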

These numbers reflect the overhead eliminated by having the libOS implement these operations directly without kernel interaction.

Modern DPDK Numbers (2024)

On Intel 100GbE (E810 NIC) with DPDK 23.11:
- Packet forwarding throughput: ~100 Mpps (64-byte packets)
- Latency (kernel bypass): ~500 ns
- vs Linux kernel networking: ~20-50 Mpps, 5-15 µs

The exokernel principle, applied to networking via DPDK, delivers 2-5x throughput and 10-30x latency improvement for packet processing workloads.
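
To put packet rates like these in perspective, it helps to work out the theoretical line rate and the time available per packet. The arithmetic below uses standard Ethernet framing (each frame occupies 20 extra bytes on the wire for preamble, start-of-frame delimiter, and inter-frame gap); the 3 GHz clock is an assumed figure for illustration only.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame + 12 inter-frame gap


def line_rate_pps(link_bps, frame_bytes):
    """Maximum frames per second a link can carry at a given frame size."""
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def per_packet_budget_ns(pps):
    """Average time available per packet at a given packet rate."""
    return 1e9 / pps


line_rate_100g = line_rate_pps(100e9, 64)        # about 148.8 Mpps
line_rate_10g = line_rate_pps(10e9, 64)          # about 14.88 Mpps
budget_at_100mpps = per_packet_budget_ns(100e6)  # 10 ns per packet, system-wide

ASSUMED_CLOCK_GHZ = 3.0  # hypothetical core frequency
cycles_per_packet = budget_at_100mpps * ASSUMED_CLOCK_GHZ  # 30 cycles, summed over all cores
```

At 100 Mpps the whole system has 10 ns per packet on average, so with N cores each core gets N times that. A single system call already costs more than this budget on most hardware, which is why per-packet kernel crossings are ruled out at these rates.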

Failure Modes and Real Incidents

The LibOS Isolation Problem

In a 2001 implementation study of Exokernel systems (Kaashoek et al.), a bug in one libOS's memory allocator caused silent data corruption that was invisible to the exokernel. Because the exokernel doesn't interpret the content of application pages, it cannot detect this class of error.

In production: DPDK applications that share huge pages between processes face similar risks. A buffer overrun in process A can corrupt process B's packet buffers, causing network corruption that's extremely difficult to diagnose.

DPDK Incident: Incorrect RSS Configuration

In large cloud deployments, DPDK Receive Side Scaling (RSS) misconfiguration caused certain flows to be processed by the wrong CPU core, creating apparent packet drops. Debugging required understanding the NIC's hardware hash function — something the kernel would normally abstract away. The bypass of the kernel networking stack requires operators to understand hardware internals that are normally hidden.
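
The hardware path involved looks roughly like this: the NIC hashes each flow's address and port fields, then uses the hash to index an indirection table that names a receive queue. The sketch below implements the Toeplitz hash commonly used for RSS; the key is an arbitrary placeholder rather than any vendor default, the table size of 128 is an assumption, and real NICs differ in which fields they hash and how they index the table.

```python
import ipaddress
import struct


def toeplitz_hash(key, data):
    """Toeplitz hash: XOR a sliding 32-bit window of the key for each set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for bit_index in range(len(data) * 8):
        byte = data[bit_index // 8]
        if byte & (0x80 >> (bit_index % 8)):
            # Window of 32 key bits starting at this bit position.
            result ^= (key_int >> (key_bits - 32 - bit_index)) & 0xFFFFFFFF
    return result


def flow_tuple_bytes(src_ip, dst_ip, src_port, dst_port):
    """Hash input for IPv4 + TCP/UDP: addresses then ports, network byte order."""
    return (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack("!HH", src_port, dst_port))


def rx_queue(key, reta, flow):
    """Map a flow to a receive queue through the indirection table (RETA)."""
    h = toeplitz_hash(key, flow_tuple_bytes(*flow))
    return reta[h % len(reta)]


PLACEHOLDER_KEY = bytes(range(1, 41))        # 40 bytes; not a real NIC default
reta_spread = [i % 4 for i in range(128)]    # intended: spread over queues 0..3
reta_collapsed = [0] * 128                   # misconfigured: everything on queue 0

flow = ("10.0.0.1", "10.0.0.2", 40000, 443)
queue_spread = rx_queue(PLACEHOLDER_KEY, reta_spread, flow)
queue_collapsed = rx_queue(PLACEHOLDER_KEY, reta_collapsed, flow)
```

With the collapsed table every flow lands on queue 0 regardless of its hash, so one core saturates and drops packets while the others sit idle. Diagnosing that requires reasoning about the key, the hashed fields, and the table contents, exactly the details a kernel network stack would normally keep out of sight.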

Modern Usage

The exokernel as a complete system never went into production. Its ideas live on in:

DPDK (Intel, 2010): Kernel bypass for network packet processing
SPDK (Intel, 2015): Kernel bypass for NVMe storage
io_uring (Linux, 2019): Partial kernel bypass for general I/O
RDMA/RoCE: Kernel bypass for datacenter interconnects
IOMMU/VFIO: Direct hardware assignment to VMs and containers
Unikernels: LibOS concept applied to cloud functions (see 05-unikernels.md)

Future Directions

SmartNIC and DPU: Modern SmartNICs (NVIDIA BlueField, Intel IPU) run ARM cores embedded in the NIC. Applications can offload packet processing to the NIC itself — an even more extreme version of the exokernel idea.

Persistent Memory (PMEM) / CXL: Intel Optane persistent memory (since discontinued) gave applications byte-addressable persistent storage, and CXL memory expansion extends the same load/store access model to attached memory devices. The FS-DAX interface in Linux allows applications to bypass the page cache entirely for PMEM and implement their own persistence layer, directly applying exokernel principles to storage.

eBPF as Exokernel Extension: eBPF allows verified user code to run in the kernel's data path. A high-performance XDP program that processes packets in the kernel before they reach the socket layer is approaching the exokernel model from the opposite direction — moving verified user logic into the kernel path rather than moving kernel functionality into user space.

Exercises

DPDK Zero-Copy Benchmark: Set up DPDK with a single NIC (or use DPDK's software PMD for testing). Write a packet forwarding application that reads from one port and sends to another without any kernel involvement. Benchmark: packets per second, latency distribution (p50/p99/p999), CPU utilization. Compare to AF_PACKET or AF_XDP kernel-based alternatives.
io_uring Deep Dive: Write a file copy program using io_uring's IORING_OP_READ_FIXED/IORING_OP_WRITE_FIXED operations (the variants that take registered buffers) together with fixed files. Compare latency and CPU usage to a read()/write() implementation. Explain which kernel abstractions io_uring bypasses and which it retains.
Application-Level Paging: Write a C program that uses mmap(MAP_ANONYMOUS) + madvise(MADV_DONTNEED) + userfaultfd to implement an application-level pager. Handle page faults in user space, implementing your own replacement policy. This demonstrates the exokernel VM control model within Linux.
LibOS Design Exercise: Design (not implement) a minimal libOS for a key-value store. Specify: physical memory layout, page fault handling, persistence strategy (fsync vs. direct block write), and how you would bypass the Linux page cache. What exokernel system calls would you need?
VFIO Exploration: Set up a Linux system with VFIO to assign a PCI device directly to a userspace application. Write a minimal userspace driver using the VFIO IOMMU API that can read the device's configuration space. What kernel protections remain active vs. what's bypassed?

References

Engler, D., Kaashoek, M.F., and O'Toole, J. "Exokernel: An Operating System Architecture for Application-Level Resource Management." SOSP '95. 1995. [The original paper]
Kaashoek, M.F., et al. "Application Performance and Flexibility on Exokernel Systems." SOSP '97. 1997.
Belay, A., et al. "Dune: Safe User-level Access to Privileged CPU Features." OSDI '12. 2012.
Peter, S., et al. "Arrakis: The Operating System is the Control Plane." OSDI '14. 2014.
Baumann, A., et al. "Shielding Applications from an Untrusted Cloud with Haven." OSDI '14. 2014.
Intel DPDK documentation: https://doc.dpdk.org/
Intel SPDK documentation: https://spdk.io/doc/
Axboe, J. "Efficient IO with io_uring." Kernel.org, 2019. https://kernel.dk/io_uring.pdf