User Threads and Green Threads

Technical Overview

User-level threads (also called green threads) are threading abstractions implemented entirely in user space, without direct kernel support for each thread. The kernel sees only one or a few kernel threads; the user-space threading library multiplexes many logical threads onto those kernel threads.

The model has several variants:
- M:1 (also written N:1): many user threads on one kernel thread. Maximum portability, no parallelism.
- N:M (also written M:N): N user threads multiplexed onto M kernel threads. This is the model behind Go goroutines and Java virtual threads.
- 1:1: each user-visible thread is backed by one kernel thread. This is the modern POSIX model and is not really "user threads".

The key motivation is reducing the overhead of kernel threads (creation time, stack memory, context switch latency) while maintaining high concurrency.

Prerequisites

Kernel threads (01-kernel-threads.md)
CPU context switch mechanics (register save/restore)
Stack layout and execution context
Cooperative vs. preemptive scheduling concepts
Basic understanding of setjmp/longjmp or ucontext

Core Concepts

M:1 Threading Model

M:1 (User Threads on 1 Kernel Thread)
=======================================

User Space:
  Thread A    Thread B    Thread C    Thread D
    |           |            |           |
    +-----+-----+----+--------+----------+
                     |
              User-space scheduler
              (cooperative or preemptive via SIGALRM)
                     |
Kernel Space:        |
              [ Kernel Thread ]
                     |
                  CPU Core 0

Properties:
  + Fast creation (no syscall)
  + Fast context switch (no kernel involvement)
  + Low memory (no kernel stack per thread)
  - Blocking syscall blocks ALL threads (Thread A calling read() stalls B, C, D)
  - No true parallelism (one kernel thread runs on one core at a time)

The M:1 model's fatal flaw for I/O-bound workloads: when any thread makes a blocking system call (read, write, connect, accept), the underlying kernel thread blocks, and the entire user-space scheduler stops. All other user threads are frozen until the syscall returns.

Workarounds:
1. Use only non-blocking I/O (O_NONBLOCK) with select()/poll()/epoll() in the scheduler loop
2. Use asynchronous I/O (aio_* or io_uring)
3. Hand blocking calls to a dedicated kernel thread so the scheduler thread never blocks
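The first workaround can be sketched in a few lines. This is a minimal illustration, not a production scheduler: green threads are modeled as Python generators, and the names reader, sender, and run are invented for the example. A task yields a socket when it would block; the scheduler parks it in a selector and keeps running other tasks on the same kernel thread.

```python
import selectors
import socket
from collections import deque

def reader(sock, log):
    """Green thread that waits for one message."""
    yield sock                    # would block: ask the scheduler to park us
    log.append(sock.recv(1024))   # resumed only once the socket is readable

def sender(sock, log):
    """Green thread that yields once, then sends."""
    yield None                    # plain cooperative yield
    sock.sendall(b"ping")
    log.append("sent")

def run(tasks):
    """Round-robin scheduler; parked tasks wait in the selector."""
    sel = selectors.DefaultSelector()
    ready = deque(tasks)
    parked = 0
    while ready or parked:
        if parked:
            # Poll if other work is runnable, otherwise sleep until I/O arrives.
            timeout = 0 if ready else None
            for key, _ in sel.select(timeout):
                sel.unregister(key.fileobj)
                ready.append(key.data)
                parked -= 1
        if not ready:
            continue
        task = ready.popleft()
        try:
            wait_on = next(task)
        except StopIteration:
            continue              # task finished
        if wait_on is None:
            ready.append(task)    # voluntary yield: still runnable
        else:
            sel.register(wait_on, selectors.EVENT_READ, task)
            parked += 1
    sel.close()

a, b = socket.socketpair()
a.setblocking(False)
log = []
run([reader(a, log), sender(b, log)])
a.close()
b.close()
print(log)
```

The reader parks first, the sender still gets to run, and the reader resumes only after data arrives. With a blocking recv() in reader, the sender would never have been scheduled.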

N:M Threading Model

N:M (N User Threads on M Kernel Threads)
==========================================

User Space:
  G1   G2   G3   G4   G5   G6   G7   G8   ... GN
   |    |    |    |    |    |    |    |
   +----+----+    +----+----+    +----+---...
        |              |
   P (processor 1) P (processor 2)  ... P (processor M)
   [kernel thread]  [kernel thread]
        |              |
   CPU Core 0     CPU Core 1

Properties:
  + True parallelism (M cores running simultaneously)
  + Fast creation (most goroutines don't need new kernel threads)
  + Work stealing (idle Ps steal goroutines from busy Ps)
  - Complex scheduler implementation
  - Need to handle blocking syscalls (park kernel thread, spin up new one)
  - Preemption requires OS signal or safepoint

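This is the Go GMP model: G is a goroutine, M is an OS thread ("machine"), and P is a scheduling context (processor) that an M must hold to run Go code. Note that Go's M names a thread, not the thread count used in the N:M notation above. See 06-go-goroutines.md for details.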
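Work stealing is easiest to see in a single-threaded simulation. The sketch below is illustrative only: the P class, steal(), and run_round() are invented names, and the victim choice (longest queue, take half from the tail) is a simplification. Go's real scheduler picks victims at random and uses lock-free queues.

```python
from collections import deque

class P:
    """A scheduling context with a local run queue (simplified)."""
    def __init__(self, pid):
        self.pid = pid
        self.runq = deque()

def steal(thief, victims):
    """Move half of the longest victim queue to the idle P."""
    victim = max(victims, key=lambda p: len(p.runq))
    n = len(victim.runq) // 2
    for _ in range(n):
        thief.runq.append(victim.runq.pop())   # take from the tail
    return n

def run_round(ps, executed):
    """One scheduling round: each P runs one task, stealing first if idle."""
    for p in ps:
        if not p.runq:
            steal(p, [q for q in ps if q is not p])
        if p.runq:
            executed.append((p.pid, p.runq.popleft()))

p0, p1 = P(0), P(1)
p0.runq.extend(["G1", "G2", "G3", "G4", "G5", "G6"])   # all work starts on P0
executed = []
while p0.runq or p1.runq:
    run_round([p0, p1], executed)
print(executed)
```

Although every task starts on P0, the idle P1 picks up part of the load as soon as it finds its own queue empty.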

Green Threads: JVM History

Java's original threading model (Java 1.0-1.3) used green threads on many platforms:
- JVM provided a cooperative user-space scheduler
- All Java threads ran on a single OS thread (on Solaris and early Linux)
- Blocked I/O froze all Java threads
- No true parallelism

Sun Microsystems' HotSpot JVM switched to 1:1 kernel threads (native threads) in Java 1.3 (2000) for Solaris, and this became universal by Java 1.4. The reason: as multiprocessor servers became common, the inability to use multiple CPUs was a dealbreaker.

Green threads persisted longer in some non-Sun JVMs, but the industry consensus moved firmly to 1:1 by ~2005. (Android's Dalvik VM, sometimes cited as a green-thread holdout, mapped Java threads onto native Linux threads.)

Java Virtual Threads (Project Loom — N:M Return)

Java 21 (2023) introduced virtual threads, reversing the 1:1 decision for high-concurrency workloads:

// Platform (kernel) threads: a fixed pool caps concurrency at the pool size
// (handleRequest() is a placeholder for application code)
ExecutorService threadPool = Executors.newFixedThreadPool(200);
threadPool.submit(() -> {
    // Each task occupies a kernel thread for its whole duration, so
    // 200 threads serve at most 200 of 10,000 connections at a time
    handleRequest();
});

// Virtual thread (lightweight, N:M):
ExecutorService vThreadPool = Executors.newVirtualThreadPerTaskExecutor();
vThreadPool.submit(() -> {
    // This runs on a virtual thread
    // 10,000 virtual threads share a few carrier (kernel) threads,
    // by default one per available processor
    handleRequest();  // blocking I/O suspends virtual thread, not carrier
});

// Direct creation:
Thread vThread = Thread.ofVirtual().start(() -> {
    // Stack lives on the heap and starts at roughly a kilobyte
    // (a platform thread reserves ~1MB by default on 64-bit Linux)
    // creation: ~1µs (vs 50-500µs for kernel thread)
    System.out.println("Virtual thread: " + Thread.currentThread());
});

Key innovation: when a virtual thread blocks on I/O, it is unmounted from its carrier (kernel) thread. The carrier thread continues running other virtual threads. When the I/O completes, the virtual thread is remounted on an available carrier.

This is N:M threading: N virtual threads on M carrier (kernel) threads. The JVM scheduler handles the virtual→carrier multiplexing transparently.

Java Virtual Thread Lifecycle
==============================

virtual thread T1 runs on carrier K1
  T1 calls blocking I/O (e.g., socket read)
    |
    +-- JVM unmounts T1 from K1
    |   (T1 is suspended, stack saved to heap)
    |
    +-- K1 is now free
    |   JVM scheduler: find next runnable virtual thread T2
    |
    +-- T2 mounts on K1, continues running

... I/O completes for T1 ...
    |
    +-- T1 is marked runnable
    +-- Next available carrier thread picks up T1
    +-- T1 resumes from exactly where it suspended
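The lifecycle above can be replayed as a small simulation. This is a sketch under simplifying assumptions: one carrier (K1), virtual threads modeled as Python generators, and pending I/O that "completes" whenever the carrier would otherwise be idle. The names vthread and simulate are invented for the example and are not JVM APIs.

```python
from collections import deque

def vthread(name, log):
    log.append(f"{name} runs")
    yield "io"                       # blocking call: ask to be unmounted
    log.append(f"{name} resumes")

def simulate(names):
    log = []
    runnable = deque((n, vthread(n, log)) for n in names)
    waiting = deque()                # unmounted threads with I/O in flight
    while runnable or waiting:
        if not runnable:
            # Carrier is idle: pretend the oldest pending I/O completes.
            name, t = waiting.popleft()
            log.append(f"I/O done for {name}, marked runnable")
            runnable.append((name, t))
        name, t = runnable.popleft()
        log.append(f"mount {name} on K1")
        try:
            next(t)                  # run until the thread blocks or finishes
            log.append(f"unmount {name} from K1")
            waiting.append((name, t))
        except StopIteration:
            log.append(f"{name} finished")
    return log

log = simulate(["T1", "T2"])
for line in log:
    print(line)
```

The log shows T2 mounting on K1 between T1's unmount and T1's resumption, which is the point of the design: the carrier never sits idle while a virtual thread waits.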

Continuation-Passing Style (CPS) for Scheduling

The theoretical basis for user-space scheduling: a computation can be expressed as a continuation — a captured representation of "the rest of the computation from this point." Suspending a thread means capturing its continuation; resuming means calling it.
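A minimal sketch makes "capture the continuation, call it to resume" concrete. Here a continuation is just a zero-argument closure, and the scheduler is a queue of them. The names suspend, count_to, and run are invented for this illustration.

```python
from collections import deque

run_queue = deque()   # each entry is a continuation: a zero-argument callable

def suspend(k):
    """Suspend: store 'the rest of the computation' k for later."""
    run_queue.append(k)

def count_to(name, i, limit, out):
    """CPS-style task: each step ends by handing its continuation to the scheduler."""
    if i > limit:
        return
    out.append(f"{name}{i}")
    suspend(lambda: count_to(name, i + 1, limit, out))

def run():
    while run_queue:
        k = run_queue.popleft()
        k()           # resume: calling the continuation runs the next step

out = []
suspend(lambda: count_to("a", 1, 2, out))
suspend(lambda: count_to("b", 1, 2, out))
run()
print(out)
```

The two tasks interleave even though neither owns a stack between steps: everything a task needs to continue lives in its closure.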

In practice, there are two implementation approaches:

Stackful (goroutines, Lua coroutines, Java virtual threads):
- The full call stack is saved when suspending
- Resume restores the entire stack state
- Natural control flow (looks like regular sequential code)
- Memory cost: one stack per suspended thread

Stackless (C++20 coroutines, Python generators and async/await coroutines):
- Only the coroutine's own local variables (the "frame") are saved
- The callers' stack frames are not preserved, so suspension is only possible from the coroutine body itself
- Requires transforming code into a state machine
- Memory cost: one frame per suspended coroutine (no stack)
- Performance advantage: no stack allocation per coroutine

# Python generator: stackless coroutine
def counter(start):
    n = start  # this is the frame — only 'n' is saved between yields
    while True:
        yield n    # suspend here, saving 'n'
        n += 1     # resume here, 'n' is restored

gen = counter(0)
next(gen)  # → 0
next(gen)  # → 1
# Each 'next()' resumes from the yield point
# No stack saved/restored — just the 'n' variable
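To show what "transforming code into a state machine" means, here is the counter generator next to a hand-written equivalent. The class CounterStateMachine is a hypothetical illustration of the kind of object a compiler generates; it is not how CPython implements generators internally.

```python
def counter(start):
    n = start
    while True:
        yield n
        n += 1

class CounterStateMachine:
    """Hand-written equivalent of counter(): saved frame plus a resume label."""
    START, AFTER_YIELD = 0, 1

    def __init__(self, start):
        self.n = start               # the saved "frame"
        self.state = self.START      # where to continue on the next resume

    def resume(self):
        if self.state == self.START:
            self.state = self.AFTER_YIELD
            return self.n            # first 'yield n'
        # Resuming after the yield: run 'n += 1', loop, and yield again.
        self.n += 1
        return self.n

gen = counter(0)
sm = CounterStateMachine(0)
from_generator = [next(gen) for _ in range(3)]
from_machine = [sm.resume() for _ in range(3)]
print(from_generator, from_machine)
```

Both produce the same sequence. The state machine needs only two fields, which is why stackless coroutines are so cheap per instance.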

User Thread Context Switch Implementation

A user-space context switch saves and restores execution context without the kernel:

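// Simplified user-thread context switch using ucontext
// (real implementations use hand-crafted assembly for performance)
#include <stdbool.h>
#include <ucontext.h>

#define STACK_SIZE  (64 * 1024)
#define MAX_THREADS 16

typedef struct {
    ucontext_t ctx;
    char stack[STACK_SIZE];
    void (*func)(void);
    bool done;
} green_thread_t;

// Slot 0 represents the main context. Its ctx is filled in the first time
// main calls yield(), so created threads start at slot 1.
green_thread_t threads[MAX_THREADS];
int current_thread = 0;
int n_threads = 1;

void yield(void) {
    int prev = current_thread;
    // Round-robin: advance to the next thread that has not finished.
    // Slot 0 is never marked done, so this loop always terminates.
    do {
        current_thread = (current_thread + 1) % n_threads;
    } while (threads[current_thread].done);
    if (current_thread == prev)
        return;
    // swapcontext saves current, restores next
    // This is the core of a user-thread context switch:
    swapcontext(&threads[prev].ctx, &threads[current_thread].ctx);
}

// Entry point for every green thread: run the body, mark the slot finished,
// then yield for the last time (a finished thread is never resumed).
static void trampoline(void) {
    threads[current_thread].func();
    threads[current_thread].done = true;
    yield();
}

int create_thread(void (*func)(void)) {
    if (n_threads == MAX_THREADS)
        return -1;
    int id = n_threads++;
    threads[id].func = func;
    threads[id].done = false;
    getcontext(&threads[id].ctx);
    threads[id].ctx.uc_stack.ss_sp    = threads[id].stack;
    threads[id].ctx.uc_stack.ss_size  = STACK_SIZE;
    threads[id].ctx.uc_link           = &threads[0].ctx;  // fallback if trampoline returns
    makecontext(&threads[id].ctx, trampoline, 0);
    return id;
}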

A real-world user-thread context switch (stack-based) on x86-64:

# Minimal context switch: save callee-saved registers, swap stacks
# (AT&T syntax; assumed C prototype: void context_switch(void **prev_sp, void **next_sp))

context_switch:
    # Save current thread state (callee-saved registers per x86-64 ABI)
    pushq %rbx
    pushq %rbp
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15

    # Switch stacks (this is the key operation)
    movq %rsp, (%rdi)   # save current stack pointer to '*prev_sp'
    movq (%rsi), %rsp   # load next thread's saved stack pointer

    # Restore next thread's callee-saved registers
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbp
    popq %rbx

    ret                  # return into the next thread's earlier context_switch call

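The key insight: a user-thread context switch is just saving 6 registers and swapping the stack pointer. There is no kernel call and no TLB flush. There is also no FPU/SIMD state save: the switch is an ordinary function call, and under the System V ABI the vector registers are caller-saved, so the compiler has already spilled anything live. (Production switchers such as Boost.Context additionally preserve the MXCSR and x87 control words.) Cost: ~20-50 cycles (~10-25 ns at 2 GHz).

For comparison, a kernel thread context switch costs roughly 1000-2000 cycles (~500 ns - 1 µs).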
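The cycle and nanosecond figures are related by the clock frequency. As a quick check, assuming a 2 GHz clock (an assumption; the quoted numbers are consistent with it), the conversion looks like this:

```python
def cycles_to_ns(cycles, clock_ghz):
    """At f GHz one cycle lasts 1/f ns, so time = cycles / f."""
    return cycles / clock_ghz

CLOCK_GHZ = 2.0  # assumed clock rate

user_ns = [cycles_to_ns(c, CLOCK_GHZ) for c in (20, 50)]
kernel_ns = [cycles_to_ns(c, CLOCK_GHZ) for c in (1000, 2000)]
best_case_ratio = kernel_ns[0] / user_ns[1]   # cheapest kernel vs. costliest user switch

print(user_ns, kernel_ns, best_case_ratio)
```

Even comparing the cheapest kernel switch with the most expensive user-space switch, the gap is about 20x under these estimates.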

Scheduling Models

Cooperative vs. Preemptive User Threads
=========================================

Cooperative (yield-based):
  Thread runs until it explicitly yields or blocks
  + No preemption overhead
  + No locks or atomics needed on a single kernel thread (switches happen only at yields)
  - Long-running CPU-bound task starves other threads
  - Must yield at logical points

  Used by: Lua coroutines, Python 2 gevent, early Node.js

Preemptive (timer-based):
  Scheduler uses SIGALRM (or similar) to periodically preempt
  + Fairness: CPU-bound tasks can't starve others
  + More transparent to application code
  - Signal delivery adds overhead (~µs per preemption)
  - Shared data needs locks or atomics (a switch can occur between any two instructions)

  Used by: Go (signal-based async preemption since 1.14)

Hybrid (goroutine model):
  Cooperative yields at scheduling points (channel ops, syscalls)
  Preemptive via SIGURG for long-running CPU tasks
  Best of both worlds at cost of scheduler complexity
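The starvation difference can be demonstrated with a toy simulation. In this sketch (all names invented for illustration), each yield of a task generator stands for one unit of CPU work, and the yielded boolean says whether the task volunteers to give up the CPU. Passing a quantum turns the cooperative scheduler into a preemptive one.

```python
from collections import deque

def task(units, polite):
    """Each yield is one unit of CPU work; the value says whether the task
    offers to give up the CPU at that point."""
    for _ in range(units):
        yield polite

def schedule(tasks, quantum=None):
    """Return the order in which work units ran. quantum=None means cooperative."""
    ready = deque(tasks)
    trace = []
    while ready:
        name, gen = ready.popleft()
        used = 0
        while True:
            try:
                wants_yield = next(gen)
            except StopIteration:
                break                          # task finished
            trace.append(name)
            used += 1
            if wants_yield or (quantum is not None and used >= quantum):
                ready.append((name, gen))      # back of the queue
                break
    return trace

def make_tasks():
    return [("hog", task(4, polite=False)), ("nice", task(2, polite=True))]

cooperative = schedule(make_tasks())
preemptive = schedule(make_tasks(), quantum=2)
print(cooperative)
print(preemptive)
```

Under cooperative scheduling the "hog" task runs all four units before "nice" gets any CPU. With a quantum of 2, the scheduler forces an interleaving.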

Historical Context

Green Threads Etymology

The term "green threads" comes from the Green Team at Sun Microsystems who designed Java's first threading model in 1995. The Green Team built a user-space threading library that could run on systems where OS threading support was primitive or absent.

The name stuck even after the technical meaning shifted — "green threads" now generally means user-level threads, regardless of implementation.

Erlang's Actor Model (1987)

Erlang's lightweight processes (the "process" model, not OS processes) are arguably the most successful production user-threading system in history. Erlang processes:
- Start at roughly 300 machine words (a few KB on 64-bit systems)
- Are created in microseconds
- Are scheduled preemptively by the Erlang VM (BEAM), with time slices measured in reductions
- Share no memory (message passing only)
- Get crash isolation via supervision trees

WhatsApp ran 2 million connections per server using Erlang's process model. Discord uses Elixir, which runs on the same BEAM VM. RabbitMQ is Erlang. The model predates Go goroutines by more than 20 years.

Goroutines (2009) and the M:N Renaissance

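Go's goroutines (2009) re-demonstrated that M:N user threading can work at scale:
- Initial stack: 2-8KB depending on the Go release (2KB in current releases), vs. 8MB for kernel threads
- Up to GOMAXPROCS kernel threads run Go code at once (default = CPU count); extra threads are created when one blocks in a syscall
- Work stealing between processors
- Transparent blocking (network I/O parks only the goroutine; a blocking syscall ties up its kernel thread while the P moves to another)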

The combination of M:N scheduling, a garbage collector, and built-in channel primitives made Go highly productive for networked servers. Go's success influenced Java Project Loom.

Production Examples

Go HTTP Server Scalability

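A Go HTTP server can handle 100,000+ concurrent connections with far fewer kernel threads than connections, because goroutines waiting on network I/O are parked rather than blocking a thread. Budgeting ~8KB of stack per goroutine (stacks start smaller and grow on demand), 100,000 goroutines ≈ 800MB RAM, versus 800GB of reserved stack address space for 100,000 kernel threads with 8MB stacks.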
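The memory comparison is plain multiplication, shown here with decimal units as in the text. The 2KB row is an added data point, based on Go's current minimum goroutine stack size, and is not one of the figures quoted above.

```python
KB, MB, GB = 10**3, 10**6, 10**9     # decimal units, matching the text

def total_stack_bytes(threads, stack_bytes):
    return threads * stack_bytes

n = 100_000
goroutines_mb = total_stack_bytes(n, 8 * KB) / MB    # 8KB per goroutine
kernel_gb = total_stack_bytes(n, 8 * MB) / GB        # 8MB per kernel thread
minimum_mb = total_stack_bytes(n, 2 * KB) / MB       # 2KB minimum goroutine stack

print(goroutines_mb, "MB vs", kernel_gb, "GB; at 2KB:", minimum_mb, "MB")
```

Note that the kernel-thread figure is reserved virtual address space, not resident memory: the kernel commits stack pages only as they are touched.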

// Go HTTP server: net/http runs each connection's handler in its own goroutine
// (fragment: needs imports "log" and "net/http"; fetchFromDatabase is a
// placeholder that returns []byte)
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    // This runs in a goroutine (user thread)
    // The goroutine will park when waiting for network I/O
    // The underlying kernel thread continues with other goroutines
    data := fetchFromDatabase(r.URL.Query().Get("id"))  // blocks goroutine, not thread
    w.Write(data)
})
log.Fatal(http.ListenAndServe(":8080", nil))  // ListenAndServe returns only on error

Erlang/WhatsApp

A January 2012 WhatsApp engineering blog post reported more than 2 million TCP connections on a single machine using FreeBSD + Erlang. The Erlang VM ran millions of lightweight processes on a fixed pool of kernel threads. Each connected client had a dedicated lightweight Erlang process: conceptually one thread per connection, but without OS thread overhead.

Nginx + Python gevent

Before async/await syntax was standard, Python servers used gevent (based on libev + greenlet) — M:1 user threads with a non-blocking I/O event loop:

# gevent: monkey-patching converts blocking calls to cooperative
import gevent.monkey
gevent.monkey.patch_all()  # patches socket, threading, etc.

from gevent.pywsgi import WSGIServer

def application(env, start_response):
    # This runs as a greenlet (user thread)
    # socket.recv() here is cooperative — yields to event loop
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World']

server = WSGIServer(('', 8080), application)
server.serve_forever()

Debugging Notes

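# Java virtual threads: observe carrier and virtual threads
java -Djdk.trackAllThreads=true -jar app.jar
# Carrier threads are daemon ForkJoinPool workers with names like:
#   "ForkJoinPool-1-worker-1"
# Virtual threads are unnamed by default; Thread.toString() prints e.g.
#   VirtualThread[#21]/runnable@ForkJoinPool-1-worker-1
# Traditional thread dumps (jstack) omit virtual threads; use the JSON dump:
jcmd <pid> Thread.dump_to_file -format=json threads.json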

# Go goroutine debugging (see 06-go-goroutines.md for detailed coverage)
# GOTRACEBACK=full: show all goroutine stacks on panic
GOTRACEBACK=full ./myapp

# Get goroutine dump from running Go process:
kill -SIGQUIT <pid>

# Python: greenlet debugging
import greenlet
greenlet.settrace(lambda event, args: print(f"greenlet {event}"))

M:N Deadlock Detection

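A common user-thread deadlock scenario in runtimes with a fixed kernel thread pool:
- All M kernel threads are blocked on I/O (no kernel thread is free)
- One of the pending I/O operations can only complete after some other user thread does work
- No kernel thread is available to run that user thread
- Result: every thread waits and nothing makes progress

Go avoids this particular case by starting extra kernel threads when one blocks in a syscall. Its runtime detector covers a different case: when every goroutine is blocked on a channel or lock, the program aborts with "fatal error: all goroutines are asleep - deadlock!"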
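A run-queue-based detector of this kind is simple to sketch: if nothing is runnable and at least one thread is parked, no future event can wake anyone. The code below is a toy model, not Go's implementation. Channels are plain deques, tasks are generators that yield the channel they want to receive from, and Deadlock, run, waiter, and producer are names invented for the example.

```python
from collections import deque

class Deadlock(Exception):
    pass

def run(tasks):
    ready = deque(tasks)
    parked = []                       # (task, channel) pairs waiting to receive
    while ready or parked:
        # Wake every parked task whose channel now holds data.
        still_parked = []
        for task, ch in parked:
            if ch:
                ready.append(task)
            else:
                still_parked.append((task, ch))
        parked = still_parked
        if not ready:
            # Nothing runnable and nobody left who could send: no progress possible.
            raise Deadlock("all green threads are asleep")
        task = ready.popleft()
        try:
            ch = next(task)           # a task yields the channel it waits on
        except StopIteration:
            continue
        if ch is None:
            ready.append(task)        # plain yield
        else:
            parked.append((task, ch))

def waiter(ch, out):
    yield ch                          # park until ch has data
    out.append(ch.popleft())

def producer(ch):
    ch.append("hello")
    yield None

out = []
ch = deque()
run([waiter(ch, out), producer(ch)])  # completes normally

deadlocked = False
try:
    run([waiter(deque(), out), waiter(deque(), out)])   # nobody ever sends
except Deadlock:
    deadlocked = True
print(out, deadlocked)
```

A detector like this only sees threads blocked on primitives the scheduler itself manages. A thread waiting on an external event (a socket, a timer) still counts as potentially wakeable.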

Security Implications

Green Thread Isolation

User threads within the same process share the entire address space. A bug in one green thread can corrupt another thread's stack or heap. This is the same risk as kernel threads sharing an address space, with additional hazards:
- Green threads often don't have guard pages between them (some do, most don't)
- Stack overflows in green threads can silently corrupt adjacent green thread stacks
- This is a significant concern for security-sensitive applications

Scheduling Side Channels

User-space schedulers run in the same address space as the threads they schedule. If a scheduler makes access decisions based on a shared state that an attacker can influence (timing channels), the scheduler itself becomes an attack vector. Kernel-level scheduling has some separation; user-level scheduling has none.

Performance Implications

Context Switch Cost Comparison

Context Switch Overhead Comparison
====================================

Kernel thread context switch:
  - CPU register save/restore: ~10 ns
  - TLB flush (if different address space): 100-500 ns
  - Scheduler run queue manipulation: ~100 ns
  - FPU/SIMD state save: 50-300 ns (if dirty)
  Total: ~500 ns - 2 µs

User-space (green) thread context switch:
  - Register save/restore (callee-saved only, 6 regs): ~5 ns
  - Stack pointer swap: ~1 ns
  - No TLB flush (same address space)
  - FPU/SIMD: not saved (the switch is an ordinary function call, so vector
    registers are caller-saved and already spilled by the compiler)
  Total: ~10-50 ns

Goroutine context switch:
  - Save the goroutine's SP/PC/BP into its G struct
  - Switch to the scheduler stack (g0) and pick the next runnable G
  - Restore the next G's SP/PC and update the current-G pointer in the M
  Total: ~100-200 ns (includes scheduling logic overhead)

Virtual thread (Java 21) mount/unmount:
  - Stack copy to/from heap: variable (stack size dependent)
  - Carrier thread assignment: ~200 ns
  Total: ~300-500 ns (at or below the cost of a kernel thread switch)

Failure Modes and Real Incidents

Java Green Thread I/O Blocking Bug (Pre-1.4)

Green-thread JVMs had a well-known issue: a System.in.read() call would block the entire JVM. Any Java program that read from stdin and also needed timer callbacks or other threads to run would hang. The fix was native threads, which let the kernel schedule the other Java threads while one was blocked in a kernel call.

Go Goroutine Leak Detection

Goroutine leaks are a silent failure mode in production Go services:

// Common goroutine leak: goroutine waiting for channel that nobody writes to
func processRequest(data chan []byte) {
    go func() {
        result := <-data  // if nobody writes to data, goroutine leaks forever
        process(result)
    }()
}

// Fix: use context for cancellation
func processRequestFixed(ctx context.Context, data chan []byte) {
    go func() {
        select {
        case result := <-data:
            process(result)
        case <-ctx.Done():
            return  // clean up on cancellation
        }
    }()
}

Production incident: a services team at a major cloud provider had a service whose memory use grew by ~1GB per day. Root cause: a goroutine leak in an error handling path. A goroutine waiting on a channel was never informed of the error and leaked. With ~8KB per goroutine, it took ~125,000 leaked goroutines to consume 1GB. Detection: goroutine count metric in Prometheus.
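The same leak pattern exists in any runtime with cheap tasks. As a hedged analogue in Python asyncio (not the Go code from the incident; the function names are invented), a task that awaits a queue nobody writes to stays pending forever, while a bounded wait lets it finish:

```python
import asyncio

async def leaky_consumer(queue):
    return await queue.get()          # nobody ever puts: waits forever

async def bounded_consumer(queue, timeout):
    try:
        return await asyncio.wait_for(queue.get(), timeout)
    except asyncio.TimeoutError:
        return None                   # give up so the task can finish

async def main():
    q = asyncio.Queue()
    leaked = asyncio.ensure_future(leaky_consumer(q))
    bounded = asyncio.ensure_future(bounded_consumer(q, 0.05))
    await asyncio.sleep(0.2)
    status = (leaked.done(), bounded.done())
    leaked.cancel()                   # clean up so the demo exits quietly
    try:
        await leaked
    except asyncio.CancelledError:
        pass
    return status

status = asyncio.run(main())
print(status)
```

Counting live tasks over time plays the same role as the goroutine count metric mentioned above: a count that only ever grows is the signature of a leak.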

Modern Usage

Go goroutines: Dominant in cloud infrastructure software (Kubernetes, Docker, Prometheus, Terraform — all Go)

Java Virtual Threads: Java 21+ production-ready, targeting Spring Boot and reactive frameworks (replaces WebFlux reactive model for many use cases)

Python asyncio: Event loop with native async/await coroutines, dominant for Python I/O-bound services; greenlet/gevent remains the stackful alternative

Erlang/Elixir processes: BEAM VM's lightweight processes — Phoenix framework (Elixir web framework) runs millions of lightweight processes per node

Future Directions

Structured Concurrency: Go's errgroup package (golang.org/x/sync/errgroup, built on sync.WaitGroup), Java's StructuredTaskScope (a preview API in Java 21), and Kotlin's coroutine scopes add lifecycle guarantees to goroutine/virtual thread creation: every spawned thread must complete before the parent scope exits, which rules out leaks wherever the scope is enforced.
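The core guarantee fits in a few lines. The scope() helper below is a hypothetical minimal sketch written for this example. It is not the API of any of the libraries named above, and Python 3.11+ ships a fuller version as asyncio.TaskGroup.

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def scope():
    """Minimal task scope: children cannot outlive the 'async with' block."""
    children = []

    def spawn(coro):
        task = asyncio.ensure_future(coro)
        children.append(task)
        return task

    try:
        yield spawn
    except BaseException:
        for t in children:            # body failed: cancel whatever is still running
            t.cancel()
        raise
    finally:
        # Always wait for every child before leaving the scope.
        await asyncio.gather(*children, return_exceptions=True)

async def child(name, delay, done):
    await asyncio.sleep(delay)
    done.append(name)

async def main():
    done = []
    async with scope() as spawn:
        spawn(child("a", 0.02, done))
        spawn(child("b", 0.01, done))
    return done                       # both children have finished by here

finished = asyncio.run(main())
print(finished)
```

Because the wait happens in the scope's exit path, a caller cannot forget it, which is the difference from calling a wait function manually.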

Goroutine-aware Profiling: Go tooling continues to improve goroutine-level profiling (goroutine blocking profiles, goroutine ID in traces). This addresses the "goroutine leak is hard to debug" operational problem.

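WASM Threads: WebAssembly's threads proposal provides shared linear memory and atomics; the threads themselves come from the host (Web Workers in browsers), so runtimes that want green-thread behavior layer a cooperative user-space scheduler on top. This could bring green thread behavior to browser and edge computing environments.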

Exercises

Green Thread Scheduler: Implement a minimal cooperative green thread scheduler in C using ucontext_t. Support: create_thread(), yield(), and exit_thread(). Demonstrate scheduling 10 threads that each print their ID and yield in a round-robin. Measure context switch overhead with clock_gettime().
Java Virtual Thread Benchmark: Write a Java program that creates 100,000 tasks, each sleeping for 100ms then returning a result. Run it with (a) a fixed thread pool of 200 threads, (b) virtual threads. Measure: throughput (tasks/second), peak memory usage (JVM heap + native), and wall-clock time. Explain the difference.
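Goroutine Blocking Behavior: Write a Go program that creates 1,000 goroutines, each calling time.Sleep(10*time.Second), and compare runtime.NumGoroutine() with the number of OS threads (ps -L) and with GOMAXPROCS. Then replace the sleep with a genuinely blocking syscall (for example, a raw syscall.Read on a pipe file descriptor) and compare again. time.Sleep parks the goroutine inside the runtime without occupying a kernel thread, whereas a blocking syscall ties one up. Observe how the Go runtime handles many goroutines blocked in a syscall.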
Python gevent vs. asyncio: Implement the same HTTP client that makes 1,000 concurrent HTTP GET requests using (a) gevent monkey patching, (b) asyncio with aiohttp. Compare performance, CPU usage, and memory. Which model is more composable with existing synchronous libraries?
Stack Growth Tracing: Create a Go program with deeply recursive functions that grow goroutine stacks. Use runtime.Stack() to measure stack sizes at different recursion depths. Trace where the Go runtime grows (reallocates and copies) the stack. Measure the latency cost of a stack growth event.

References

Drepper, U. and Molnar, I. "The Native POSIX Thread Library for Linux." Red Hat, 2003.
Go team. "Go Concurrency Patterns." Google I/O talks (2012, 2013). https://talks.golang.org/
Liang, S., Bracha, G. "Dynamic Class Loading in the Java Virtual Machine." OOPSLA '98.
Lopes, N.P., and Rybalchenko, A. "Verifying Concurrent Programs with Coroutines." (coroutine theory)
Project Loom JEP: https://openjdk.org/jeps/444 (Virtual Threads, Java 21)
Armstrong, J. "Making Reliable Distributed Systems in the Presence of Software Errors." PhD Thesis, 2003. [Erlang process model]
Kerrisk, M. The Linux Programming Interface. Chapter 33: Threads: Further Details (covers thread implementation models).
Golang blog: "Go Concurrency Patterns: Context" https://blog.golang.org/context