Fibers and Coroutines
Technical Overview
Fibers and coroutines are program constructs that allow a function to suspend execution at a specific point and resume later. Unlike threads, coroutines suspend and resume cooperatively — there is no preemption. Unlike functions, coroutines maintain state between calls.
The distinction between fibers and coroutines is primarily one of API context: fibers are user-space threads with explicit switching (OS term), while coroutines are programming language abstractions for suspendable functions (PL term). Mechanically, they're similar.
The spectrum ranges from stackful coroutines (full call stack preserved on suspend) to stackless coroutines (only local frame preserved, as in C++20 co_await). Each has different implementation complexity, memory overhead, and composability characteristics.
Prerequisites
- Function call mechanics (stack frames, calling conventions)
- Heap vs. stack memory distinction
- Basic async I/O concept (why suspension is useful)
- Understanding of cooperative threading (from
02-user-threads-and-green-threads.md) - For C++20 sections: template metaprogramming basics
Core Concepts
Fibers: User-Space Cooperative Threads
A fiber is a cooperative thread that explicitly transfers control to another fiber. The operating system sees only the underlying kernel thread(s); fibers are entirely user-space.
Fiber Control Flow
===================
Main fiber Fiber A Fiber B
start
|
SwitchToFiber(A)
(A starts)
A does work
SwitchToFiber(B)
(B starts)
B does work
SwitchToFiber(A)
(A resumes)
A finishes
SwitchToFiber(main)
|
main continues
Windows Fiber API
Windows has a native fiber API in Win32:
#include <windows.h>
#include <stdio.h>
LPVOID mainFiber;
LPVOID workerFiber;
void WINAPI FiberWorker(LPVOID param) {
printf("Fiber worker: started, param=%p\n", param);
SwitchToFiber(mainFiber); // yield back to main
printf("Fiber worker: resumed\n");
SwitchToFiber(mainFiber); // yield back to main again
printf("Fiber worker: finishing\n");
// Implicitly returns to mainFiber when function returns
// (because mainFiber was set as the return target)
}
int main() {
// Convert main thread to a fiber (required before SwitchToFiber)
mainFiber = ConvertThreadToFiber(NULL);
// Create worker fiber with 64KB stack
workerFiber = CreateFiber(64 * 1024, FiberWorker, (LPVOID)0xDEAD);
printf("Main: switching to worker\n");
SwitchToFiber(workerFiber);
printf("Main: back from first yield\n");
SwitchToFiber(workerFiber);
printf("Main: back from second yield\n");
SwitchToFiber(workerFiber);
printf("Main: worker done\n");
DeleteFiber(workerFiber);
ConvertFiberToThread();
return 0;
}
Windows fibers are used by Microsoft SQL Server's scheduling system (UMS — User Mode Scheduling) to avoid kernel-level context switch overhead for OLTP workloads. SQL Server creates one fiber per session on one kernel thread per CPU, achieving O(1) context switches via SwitchToFiber().
Symmetric vs. Asymmetric Coroutines
Coroutine Control Transfer Models
===================================
Asymmetric (caller/callee relationship maintained):
main() coroutine c
|
resume(c) ---------->|
c executes
yield(v) ------> main resumes with v
|
resume(c) ---------->|
c executes
yield(v) ------> main resumes with v
main is always the "caller", c is always the "callee"
Used by: Python generators, Lua coroutines (resume/yield API)
Symmetric (any coroutine can transfer to any other):
c1 c2 c3
|
transfer(c2) ------>|
c2 executes
transfer(c3) ------->|
c3 executes
transfer(c1) → c1 resumes
Any coroutine can transfer to any other directly
Used by: Windows Fibers (SwitchToFiber), some Lua variants
More flexible, more careful management required
Python generators use asymmetric coroutines — yield always returns to next() caller. Lua provides both models: coroutine.resume/yield (asymmetric) and coroutine.wrap/yield for symmetric-style.
Stackful vs. Stackless Coroutines
Stackful Coroutines (complete stack preserved)
================================================
When suspended, the ENTIRE call stack is saved:
coroutine() {
step1() ← stack frame: [coroutine, step1]
step2() ← stack frame: [coroutine, step2]
inner() ← stack frame: [coroutine, step2, inner]
yield() ← SUSPENDED HERE, full stack saved
← [coroutine, step2, inner, yield]
step3() ← after resume, continues here
}
Memory: one stack per suspended coroutine (typically 4KB-64KB)
Flexibility: can yield from deep in the call stack
Examples: Go goroutines, Lua coroutines, Windows Fibers, Java Virtual Threads
Stackless Coroutines (only frame preserved)
=============================================
Code is transformed into a state machine by the compiler.
Only local variables are saved, not the call stack.
// Original code:
async Task example() {
var a = step1(); // state 0
await someIO(); // SUSPEND POINT → state 1
var b = step2(a); // state 1 code (after resume)
await anotherIO(); // SUSPEND POINT → state 2
return b + step3(); // state 2 code
}
// Compiler-generated state machine (pseudocode):
struct ExampleStateMachine {
int state;
int a; // only local vars, NOT the full stack
int b;
void MoveNext() {
switch (state) {
case 0:
a = step1();
state = 1;
schedule(someIO()); // arrange to call MoveNext when done
return; // return to caller immediately
case 1:
b = step2(a);
state = 2;
schedule(anotherIO());
return;
case 2:
set_result(b + step3());
}
}
}
Memory: only local vars per coroutine (no stack, no heap allocation in Rust async)
Limitation: cannot yield from within a regular function call
Examples: C++20 co_await, Rust async/await, C# async/await, Python asyncio
The stackless model's key limitation: if step1() above is not itself an async/await function, you cannot yield in the middle of it. Every function that needs to yield must be declared async. This is the "function coloring" problem (Bob Nystrom's 2015 essay "What Color is Your Function?").
Stackless Coroutines in C++20
C++20 introduced coroutine support with three new keywords:
#include <coroutine>
#include <iostream>
#include <optional>
// Generator: yields values lazily
template<typename T>
struct Generator {
struct promise_type {
T current_value;
Generator get_return_object() {
return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
}
std::suspend_always initial_suspend() { return {}; }
std::suspend_always final_suspend() noexcept { return {}; }
std::suspend_always yield_value(T value) {
current_value = value;
return {};
}
void return_void() {}
void unhandled_exception() { std::terminate(); }
};
using handle_type = std::coroutine_handle<promise_type>;
handle_type handle;
explicit Generator(handle_type h) : handle(h) {}
~Generator() { if (handle) handle.destroy(); }
bool next() {
handle.resume();
return !handle.done();
}
T value() { return handle.promise().current_value; }
};
// The coroutine itself:
Generator<int> fibonacci() {
int a = 0, b = 1;
while (true) {
co_yield a; // suspend, make 'a' available to caller
auto next = a + b;
a = b;
b = next;
}
}
int main() {
auto gen = fibonacci();
for (int i = 0; i < 10; i++) {
gen.next();
std::cout << gen.value() << " ";
}
// 0 1 1 2 3 5 8 13 21 34
}
The compiler transforms fibonacci() into a state machine. The co_yield is the suspend point. Between calls to gen.next(), the coroutine is suspended with only a and b saved — no heap allocation for the stack.
// Async coroutine with co_await
// (Requires a coroutine task type — simplified version)
Task<std::string> fetchData(std::string url) {
auto response = co_await http::get(url); // suspend until HTTP response
auto body = co_await response.read_body();
co_return body;
}
Task<void> main_coro() {
// These run concurrently (if the executor supports it):
auto [result1, result2] = co_await
when_all(fetchData("http://api1.example.com"),
fetchData("http://api2.example.com"));
std::cout << result1 << "\n" << result2 << "\n";
}
Python Generators: Simplified Stackless Coroutines
Python generators are the simplest form of stackless coroutines:
def lazy_range(n):
"""Generator: produces values on demand"""
i = 0
while i < n:
yield i # suspend, send i to caller
i += 1 # resume here next time
# Usage:
for x in lazy_range(1000000):
process(x)
# Memory: O(1) — only one value at a time, no list allocation
# Generator as coroutine (send values in):
def accumulator():
total = 0
while True:
value = yield total # yield current total, receive new value
if value is None:
break
total += value
acc = accumulator()
next(acc) # prime the generator (advance to first yield)
acc.send(10) # → 10
acc.send(20) # → 30
acc.send(5) # → 35
Python's asyncio is built on generators extended with async def and await syntax — which is itself syntactic sugar over generators plus a scheduler.
Lua Coroutines: Stackful, Full-Featured
Lua's coroutines are stackful and symmetric, making them the most flexible coroutine implementation in a mainstream language:
-- Lua coroutines: producer-consumer pattern
local function producer()
local items = {1, 2, 3, 4, 5}
for _, item in ipairs(items) do
coroutine.yield(item) -- suspend, send item to consumer
end
end
local function consumer(prod)
while true do
local status, item = coroutine.resume(prod)
if not status or item == nil then break end
print("Consumed:", item)
end
end
local prod_coro = coroutine.create(producer)
consumer(prod_coro)
-- Output:
-- Consumed: 1
-- Consumed: 2
-- ...
Lua coroutines are stackful: yield can happen from deep within a call chain, unlike Python generators where yield must be directly in the generator function.
-- Lua: yield from nested function (stackful coroutine power)
local function deep_function()
coroutine.yield("from deep") -- yield from nested call!
end
local function coro_func()
deep_function() -- calls nested function
coroutine.yield("from top")
end
local c = coroutine.create(coro_func)
print(coroutine.resume(c)) -- → true "from deep"
print(coroutine.resume(c)) -- → true "from top"
-- Python generators CANNOT do this — yield must be in generator function directly
Kotlin Coroutines
Kotlin's coroutines (JVM-based) use a stackless transformation with structured concurrency:
import kotlinx.coroutines.*
// Suspend function: can be called from coroutine context
suspend fun fetchUser(id: Int): User {
return withContext(Dispatchers.IO) { // switch to IO thread pool
// blocking I/O is fine here — it's on IO dispatcher
database.findUser(id)
}
}
// Coroutine scope: structured concurrency
suspend fun processUsers(ids: List<Int>): List<User> {
return coroutineScope {
ids.map { id ->
async { fetchUser(id) } // each runs concurrently
}.awaitAll() // wait for all to complete
}
// When coroutineScope block exits, ALL launched coroutines are done
// This prevents goroutine/coroutine leaks — structured concurrency
}
// Launch coroutines from regular code:
fun main() = runBlocking {
val users = processUsers(listOf(1, 2, 3, 4, 5))
println("Got ${users.size} users")
}
Kotlin's coroutines have:
- Stackless compiler transformation (like C#/Rust async)
- Structured concurrency (coroutineScope ensures all children complete)
- Multiple dispatchers (Main for UI, IO for blocking, Default for CPU)
- Cancellation via Job.cancel() propagated to all children
Historical Context
Simula 67 and Process Concept
Simula 67 (Ole-Johan Dahl and Kristen Nygaard) introduced the concept of coroutines in the context of simulation — modeling concurrent entities that could be suspended and resumed. This was the first programming language with a formal coroutine concept.
The Great Async Convergence (2012-2018)
Between 2012 and 2018, virtually every major language added async/await or coroutine syntax: - C# async/await (2012): the first mainstream stackless async - Python 3.4 asyncio (2014): generator-based coroutines - Python 3.5 async/await (2015): native syntax - C++20 coroutines (2017 proposal, 2020 standard) - Kotlin coroutines (2018) - Rust async/await (2019)
The convergence was driven by the realization that the callback hell of Node.js-style async programming was unsustainable, and that stackless coroutines gave sequential-looking code with async behavior.
Production Examples
SQL Server UMS (User-Mode Scheduling)
Microsoft SQL Server uses Windows Fibers for its UMS scheduling system. A SQL Server worker is a fiber, not a kernel thread. The scheduler:
- Runs on top of N kernel threads (one per CPU)
- Switches between SQL Server worker fibers using SwitchToFiber()
- Avoids kernel-level context switches for normal query processing
- Saves ~1-2µs per context switch (fiber switch vs. kernel thread switch)
At 100,000 queries/second, saving 1µs per query context switch saves 0.1 CPU cores worth of overhead.
Nginx Coroutine-Style Event Handling
Nginx's architecture effectively implements coroutines manually: each connection has an associated state machine that is an implicit coroutine (suspend on I/O wait, resume on I/O ready). While not using explicit coroutine primitives, the nginx event loop is structurally equivalent to running millions of suspended stackless coroutines.
C++ Game Entity AI
Game engines (Unreal Engine 5, various indie engines) use coroutines for game AI:
// Game AI with C++20 coroutines
Task<void> patrol(NPC& npc) {
while (true) {
co_await npc.move_to(waypoint_A); // suspend until movement complete
co_await npc.wait(seconds(2)); // suspend for 2 game seconds
co_await npc.move_to(waypoint_B);
co_await npc.wait(seconds(2));
}
}
// Each NPC has a coroutine; the game engine resumes them each frame
// No threads needed — the coroutine scheduler is deterministic
This replaces complex state machines (switch on state variable) with sequential-looking code that naturally expresses behavior over time.
Debugging Notes
# Python: inspect generator state
def gen():
yield 1
yield 2
g = gen()
print(g.gi_frame) # None before first next()
next(g)
print(g.gi_frame.f_locals) # {'...': ...} — locals at suspension point
print(g.gi_frame.f_lineno) # line number where suspended
# Inspect all frames in a generator's call stack
import inspect
# For asyncio coroutines:
import asyncio
async def coro():
await asyncio.sleep(0)
task = asyncio.ensure_future(coro())
# task.get_coro().cr_frame — the coroutine's current frame
// C++ coroutine debugging is limited in current toolchains (2024)
// Key challenge: coroutine handle address + frame state not easily inspectable
// Workaround: add explicit logging at co_yield/co_await points
// MSVC: improved coroutine debugging in VS 2022 17.x
// clang: limited coroutine debugging
// GDB: basic coroutine frame inspection with 'info frame' on coro handles
Stackless Coroutine Memory Inspection
// Inspect C++20 coroutine frame size
// (compile-time, via sizeof on the coroutine frame struct)
// Or at runtime:
#include <coroutine>
auto handle = std::coroutine_handle<promise_type>::from_promise(p);
// handle.address() gives the heap address of the coroutine frame
// Frame size = next_frame_address - handle.address() (fragile heuristic)
// Better: use sanitizers to track allocation sizes
Security Implications
Coroutine State Leakage
Stackless coroutines store their state (local variables) in heap-allocated frame objects. If an attacker can read heap memory (e.g., via a use-after-free or heap spray), they may read the coroutine's local state — potentially including credentials, cryptographic keys, or session tokens stored in local variables at yield points.
Mitigation: clear sensitive local variables before yielding. Use volatile for sensitive data (forces compiler to not optimize away the clear).
C++20 Coroutine Memory Safety
C++20 coroutines allocate their frame on the heap. If the coroutine frame is destroyed while the coroutine is still alive (common mistake: destroying the Task object before awaiting it), the coroutine's frame is freed but the underlying operation may continue — classic use-after-free.
// DANGEROUS: use-after-free
void dangerous() {
auto task = launchBackgroundTask(); // creates coroutine frame
// task goes out of scope — frame destroyed
// But the background coroutine may still be running!
} // use-after-free when coroutine accesses freed frame
// SAFE: structured concurrency prevents this
Task<void> safe() {
co_await launchBackgroundTask(); // wait for task to complete before returning
// coroutine frame lives until we explicitly co_await it
}
Performance Implications
Stackless vs. Stackful Performance
Memory Cost per Suspended Coroutine
=====================================
Stackful (Go goroutine, initial):
Initial stack: 2-8KB
Grows dynamically (stack copying)
Full registers saved on suspension
Stackless (C++20 co_await):
Frame: only local vars + state int
Typical: 32-512 bytes for simple coroutines
Heap-allocated (one malloc per coroutine creation)
Rust async (stackless, zero-cost):
Frame: only local vars + state enum
No heap allocation if executor supports inline storage
Typically: 50-200 bytes (inline in parent future)
Comparison at 1,000,000 concurrent coroutines:
Go goroutines: ~4 GB (4KB each)
C++20 coroutines: ~100-500 MB (100-500 bytes each)
Rust async: ~50-200 MB (50-200 bytes each)
Context Switch Cost
Context Switch: Stackful vs. Stackless
=========================================
Stackful (Lua, Windows Fibers):
Save/restore 6 callee-saved registers
Swap stack pointer
Total: ~20-50 cycles (~10-25 ns)
Stackless (C++20 resumption):
Call handle.resume() → jumps to coroutine via vtable
Switch state variable, load local vars from frame
Total: ~5-15 cycles (~2-7 ns)
(Stackless is faster because no register save/restore needed
— the state machine transformation eliminates that need)
Failure Modes and Real Incidents
The Python Generator Finalization Problem
Python generators that hold resources must be explicitly closed, or resources may leak:
def resource_generator():
conn = open_database_connection() # acquires resource
try:
for row in conn.query("SELECT ..."):
yield row
finally:
conn.close() # always runs on generator.close() or normal completion
gen = resource_generator()
next(gen) # opens connection
next(gen) # reads row
# If we DON'T call gen.close(), the finally block runs at GC time
# In CPython: GC is deterministic, finally runs when gen is garbage collected
# In PyPy/Jython: GC may be delayed — connection leak until GC runs
Best practice: always use generators as context managers or explicitly call .close().
Kotlin Coroutine Cancellation Bugs
Kotlin's structured concurrency means cancellation is propagated. But cancellation exceptions can be accidentally swallowed:
// BUG: swallowing CancellationException prevents structured cancellation
suspend fun buggy() {
try {
delay(1000) // suspendable delay
} catch (e: Exception) {
// CancellationException is a subclass of Exception
// Catching Exception here eats the cancellation signal!
logger.error("caught: $e")
// should rethrow: throw e — but doesn't
}
}
// FIX: explicitly rethrow CancellationException
suspend fun fixed() {
try {
delay(1000)
} catch (e: CancellationException) {
throw e // always rethrow cancellation
} catch (e: Exception) {
logger.error("caught: $e")
}
}
This caused real production bugs in Android apps where background coroutines continued running after the Activity was destroyed.
Modern Usage
C++20 coroutines: Used in game engines, network libraries (ASIO with C++20), high-performance servers. Still complex (the boilerplate for custom promise types is significant).
Python asyncio: Standard library, used by FastAPI, aiohttp. Every Python async framework uses coroutines.
Kotlin coroutines: Standard in Android development, Ktor server framework.
Lua coroutines: Game scripting (Roblox engine, many game mod systems), OpenResty (nginx + LuaJIT for web).
Windows Fibers: SQL Server, some game engines, specialized scheduling systems.
Future Directions
C++ Executors (P2300): The std::execution proposal (C++26) standardizes how coroutines interact with executors and schedulers — addressing the current situation where each library (ASIO, folly, libunifex) has incompatible coroutine infrastructure.
Stackful Coroutines in Rust: The async-task and context crates provide stackful coroutines for Rust. The Rust team is evaluating whether stackful coroutines should enter the standard library, given the use cases (C++ interop, game scripting) that stackless can't cover.
Structured Concurrency Standardization: C++26 is adding structured concurrency concepts (std::task, scope-bound execution). This addresses the coroutine leak problem that affects current C++20 code.
Exercises
-
Coroutine Scheduler: Implement a round-robin coroutine scheduler using
ucontext_tthat can schedule N coroutines cooperatively. Add ayield_for_io()primitive that suspends the current coroutine, initiates a non-blocking I/O operation, and resumes the coroutine when I/O completes. Measure overhead vs. blocking I/O. -
C++20 Generator Chain: Implement a lazy pipeline using C++20 generators:
source | filter | transform | take(N). Each stage is a generator coroutine. Measure memory usage for a pipeline of 10 stages processing 1 million elements. Compare to a vector-based pipeline. -
Python asyncio Internals: Read CPython's
Lib/asyncio/base_events.py. Trace howloop.run_forever()schedules coroutines. Where is the event loop? How is a coroutine fromasyncio.sleep(1)resumed after 1 second? Draw the call graph fromasyncio.gather()to the finalsend()call on each coroutine. -
Stackful vs. Stackless Benchmark: Implement the same producer-consumer pipeline in (a) Lua coroutines (stackful), (b) Python generators (stackless), (c) C++20 generators (stackless). Benchmark: 1 million items, context switch count, memory usage, wall-clock time. Analyze which model performs better for which pipeline depth.
-
Cancellation Protocol: Design a coroutine cancellation protocol for a stackful coroutine system (no language support for cancellation). How do you signal cancellation to a coroutine blocked at an arbitrary yield point? Implement it using a flag + cooperative polling. Compare to Kotlin's structured concurrency cancellation. What are the tradeoffs?
References
- Moura, A.L. and Ierusalimschy, R. "Revisiting Coroutines." TOPLAS, 2009. [Formal taxonomy of coroutine types]
- Rossberg, A. "Weakening WebAssembly." SPLASH, 2018. [Stack-switching for WASM]
- Nystrom, B. "What Color is Your Function?" https://journal.stuffwithstuff.com/2015/02/26/. 2015.
- C++20 Coroutines proposal: P0912R5 — https://wg21.link/P0912
- Kowalski, J. "Coroutines in C++20." ACCU 2021 talk. [Practical guide]
- Dreyer, D., et al. "A Type-and-Effect System for Flexible-Aliasing." POPL '11. [Type theory basis]
- Lua 5.4 Reference Manual — Coroutines section: https://www.lua.org/manual/5.4/
- Python asyncio documentation: https://docs.python.org/3/library/asyncio.html
- Kotlin coroutines guide: https://kotlinlang.org/docs/coroutines-guide.html
- cppreference.com — Coroutines (C++20): https://en.cppreference.com/w/cpp/language/coroutines