Skip to content

Fibers and Coroutines

Technical Overview

Fibers and coroutines are program constructs that allow a function to suspend execution at a specific point and resume later. Unlike threads, coroutines suspend and resume cooperatively — there is no preemption. Unlike functions, coroutines maintain state between calls.

The distinction between fibers and coroutines is primarily one of API context: fibers are user-space threads with explicit switching (OS term), while coroutines are programming language abstractions for suspendable functions (PL term). Mechanically, they're similar.

The spectrum ranges from stackful coroutines (full call stack preserved on suspend) to stackless coroutines (only local frame preserved, as in C++20 co_await). Each has different implementation complexity, memory overhead, and composability characteristics.

Prerequisites

  • Function call mechanics (stack frames, calling conventions)
  • Heap vs. stack memory distinction
  • Basic async I/O concept (why suspension is useful)
  • Understanding of cooperative threading (from 02-user-threads-and-green-threads.md)
  • For C++20 sections: template metaprogramming basics

Core Concepts

Fibers: User-Space Cooperative Threads

A fiber is a cooperative thread that explicitly transfers control to another fiber. The operating system sees only the underlying kernel thread(s); fibers are entirely user-space.

Fiber Control Flow
===================

Main fiber          Fiber A            Fiber B

start
  |
  SwitchToFiber(A)
                  (A starts)
                  A does work
                  SwitchToFiber(B)
                                     (B starts)
                                     B does work
                                     SwitchToFiber(A)
                  (A resumes)
                  A finishes
                  SwitchToFiber(main)
  |
main continues

Windows Fiber API

Windows has a native fiber API in Win32:

#include <windows.h>
#include <stdio.h>

LPVOID mainFiber;
LPVOID workerFiber;

void WINAPI FiberWorker(LPVOID param) {
    printf("Fiber worker: started, param=%p\n", param);
    SwitchToFiber(mainFiber);  // yield back to main

    printf("Fiber worker: resumed\n");
    SwitchToFiber(mainFiber);  // yield back to main again

    printf("Fiber worker: finishing\n");
    // Implicitly returns to mainFiber when function returns
    // (because mainFiber was set as the return target)
}

int main() {
    // Convert main thread to a fiber (required before SwitchToFiber)
    mainFiber = ConvertThreadToFiber(NULL);

    // Create worker fiber with 64KB stack
    workerFiber = CreateFiber(64 * 1024, FiberWorker, (LPVOID)0xDEAD);

    printf("Main: switching to worker\n");
    SwitchToFiber(workerFiber);

    printf("Main: back from first yield\n");
    SwitchToFiber(workerFiber);

    printf("Main: back from second yield\n");
    SwitchToFiber(workerFiber);

    printf("Main: worker done\n");

    DeleteFiber(workerFiber);
    ConvertFiberToThread();
    return 0;
}

Windows fibers are used by Microsoft SQL Server's scheduling system (UMS — User Mode Scheduling) to avoid kernel-level context switch overhead for OLTP workloads. SQL Server creates one fiber per session on one kernel thread per CPU, achieving O(1) context switches via SwitchToFiber().

Symmetric vs. Asymmetric Coroutines

Coroutine Control Transfer Models
===================================

Asymmetric (caller/callee relationship maintained):

  main()                 coroutine c
    |
    resume(c) ---------->|
                         c executes
                         yield(v) ------> main resumes with v
    |
    resume(c) ---------->|
                         c executes
                         yield(v) ------> main resumes with v

  main is always the "caller", c is always the "callee"
  Used by: Python generators, Lua coroutines (resume/yield API)

Symmetric (any coroutine can transfer to any other):

  c1                  c2                  c3
   |
   transfer(c2) ------>|
                       c2 executes
                       transfer(c3) ------->|
                                            c3 executes
                                            transfer(c1) → c1 resumes
  Any coroutine can transfer to any other directly
  Used by: Windows Fibers (SwitchToFiber), some Lua variants
  More flexible, more careful management required

Python generators use asymmetric coroutines — yield always returns to next() caller. Lua provides both models: coroutine.resume/yield (asymmetric) and coroutine.wrap/yield for symmetric-style.

Stackful vs. Stackless Coroutines

Stackful Coroutines (complete stack preserved)
================================================

When suspended, the ENTIRE call stack is saved:

coroutine() {
    step1()        ← stack frame: [coroutine, step1]
    step2()        ← stack frame: [coroutine, step2]
      inner()      ← stack frame: [coroutine, step2, inner]
        yield()    ← SUSPENDED HERE, full stack saved
                   ← [coroutine, step2, inner, yield]
    step3()        ← after resume, continues here
}

Memory: one stack per suspended coroutine (typically 4KB-64KB)
Flexibility: can yield from deep in the call stack
Examples: Go goroutines, Lua coroutines, Windows Fibers, Java Virtual Threads


Stackless Coroutines (only frame preserved)
=============================================

Code is transformed into a state machine by the compiler.
Only local variables are saved, not the call stack.

// Original code:
async Task example() {
    var a = step1();         // state 0
    await someIO();          // SUSPEND POINT → state 1
    var b = step2(a);        // state 1 code (after resume)
    await anotherIO();       // SUSPEND POINT → state 2
    return b + step3();      // state 2 code
}

// Compiler-generated state machine (pseudocode):
struct ExampleStateMachine {
    int state;
    int a;  // only local vars, NOT the full stack
    int b;

    void MoveNext() {
        switch (state) {
        case 0:
            a = step1();
            state = 1;
            schedule(someIO());  // arrange to call MoveNext when done
            return;              // return to caller immediately
        case 1:
            b = step2(a);
            state = 2;
            schedule(anotherIO());
            return;
        case 2:
            set_result(b + step3());
        }
    }
}

Memory: only local vars per coroutine (no stack, no heap allocation in Rust async)
Limitation: cannot yield from within a regular function call
Examples: C++20 co_await, Rust async/await, C# async/await, Python asyncio

The stackless model's key limitation: if step1() above is not itself an async/await function, you cannot yield in the middle of it. Every function that needs to yield must be declared async. This is the "function coloring" problem (Bob Nystrom's 2015 essay "What Color is Your Function?").

Stackless Coroutines in C++20

C++20 introduced coroutine support with three new keywords:

#include <coroutine>
#include <iostream>
#include <optional>

// Generator: yields values lazily
template<typename T>
struct Generator {
    struct promise_type {
        T current_value;

        Generator get_return_object() {
            return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(T value) {
            current_value = value;
            return {};
        }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };

    using handle_type = std::coroutine_handle<promise_type>;
    handle_type handle;

    explicit Generator(handle_type h) : handle(h) {}
    ~Generator() { if (handle) handle.destroy(); }

    bool next() {
        handle.resume();
        return !handle.done();
    }

    T value() { return handle.promise().current_value; }
};

// The coroutine itself:
Generator<int> fibonacci() {
    int a = 0, b = 1;
    while (true) {
        co_yield a;          // suspend, make 'a' available to caller
        auto next = a + b;
        a = b;
        b = next;
    }
}

int main() {
    auto gen = fibonacci();
    for (int i = 0; i < 10; i++) {
        gen.next();
        std::cout << gen.value() << " ";
    }
    // 0 1 1 2 3 5 8 13 21 34
}

The compiler transforms fibonacci() into a state machine. The co_yield is the suspend point. Between calls to gen.next(), the coroutine is suspended with only a and b saved — no heap allocation for the stack.

// Async coroutine with co_await
// (Requires a coroutine task type — simplified version)

Task<std::string> fetchData(std::string url) {
    auto response = co_await http::get(url);  // suspend until HTTP response
    auto body = co_await response.read_body();
    co_return body;
}

Task<void> main_coro() {
    // These run concurrently (if the executor supports it):
    auto [result1, result2] = co_await 
        when_all(fetchData("http://api1.example.com"),
                 fetchData("http://api2.example.com"));

    std::cout << result1 << "\n" << result2 << "\n";
}

Python Generators: Simplified Stackless Coroutines

Python generators are the simplest form of stackless coroutines:

def lazy_range(n):
    """Generator: produces values on demand"""
    i = 0
    while i < n:
        yield i   # suspend, send i to caller
        i += 1    # resume here next time

# Usage:
for x in lazy_range(1000000):
    process(x)
# Memory: O(1) — only one value at a time, no list allocation

# Generator as coroutine (send values in):
def accumulator():
    total = 0
    while True:
        value = yield total  # yield current total, receive new value
        if value is None:
            break
        total += value

acc = accumulator()
next(acc)          # prime the generator (advance to first yield)
acc.send(10)       # → 10
acc.send(20)       # → 30
acc.send(5)        # → 35

Python's asyncio is built on generators extended with async def and await syntax — which is itself syntactic sugar over generators plus a scheduler.

Lua's coroutines are stackful and symmetric, making them the most flexible coroutine implementation in a mainstream language:

-- Lua coroutines: producer-consumer pattern
local function producer()
    local items = {1, 2, 3, 4, 5}
    for _, item in ipairs(items) do
        coroutine.yield(item)  -- suspend, send item to consumer
    end
end

local function consumer(prod)
    while true do
        local status, item = coroutine.resume(prod)
        if not status or item == nil then break end
        print("Consumed:", item)
    end
end

local prod_coro = coroutine.create(producer)
consumer(prod_coro)

-- Output:
-- Consumed: 1
-- Consumed: 2
-- ...

Lua coroutines are stackful: yield can happen from deep within a call chain, unlike Python generators where yield must be directly in the generator function.

-- Lua: yield from nested function (stackful coroutine power)
local function deep_function()
    coroutine.yield("from deep")  -- yield from nested call!
end

local function coro_func()
    deep_function()       -- calls nested function
    coroutine.yield("from top")
end

local c = coroutine.create(coro_func)
print(coroutine.resume(c))  -- → true  "from deep"
print(coroutine.resume(c))  -- → true  "from top"
-- Python generators CANNOT do this — yield must be in generator function directly

Kotlin Coroutines

Kotlin's coroutines (JVM-based) use a stackless transformation with structured concurrency:

import kotlinx.coroutines.*

// Suspend function: can be called from coroutine context
suspend fun fetchUser(id: Int): User {
    return withContext(Dispatchers.IO) {  // switch to IO thread pool
        // blocking I/O is fine here — it's on IO dispatcher
        database.findUser(id)
    }
}

// Coroutine scope: structured concurrency
suspend fun processUsers(ids: List<Int>): List<User> {
    return coroutineScope {
        ids.map { id ->
            async { fetchUser(id) }  // each runs concurrently
        }.awaitAll()                 // wait for all to complete
    }
    // When coroutineScope block exits, ALL launched coroutines are done
    // This prevents goroutine/coroutine leaks — structured concurrency
}

// Launch coroutines from regular code:
fun main() = runBlocking {
    val users = processUsers(listOf(1, 2, 3, 4, 5))
    println("Got ${users.size} users")
}

Kotlin's coroutines have: - Stackless compiler transformation (like C#/Rust async) - Structured concurrency (coroutineScope ensures all children complete) - Multiple dispatchers (Main for UI, IO for blocking, Default for CPU) - Cancellation via Job.cancel() propagated to all children

Historical Context

Simula 67 and Process Concept

Simula 67 (Ole-Johan Dahl and Kristen Nygaard) introduced the concept of coroutines in the context of simulation — modeling concurrent entities that could be suspended and resumed. This was the first programming language with a formal coroutine concept.

The Great Async Convergence (2012-2018)

Between 2012 and 2018, virtually every major language added async/await or coroutine syntax: - C# async/await (2012): the first mainstream stackless async - Python 3.4 asyncio (2014): generator-based coroutines - Python 3.5 async/await (2015): native syntax - C++20 coroutines (2017 proposal, 2020 standard) - Kotlin coroutines (2018) - Rust async/await (2019)

The convergence was driven by the realization that the callback hell of Node.js-style async programming was unsustainable, and that stackless coroutines gave sequential-looking code with async behavior.

Production Examples

SQL Server UMS (User-Mode Scheduling)

Microsoft SQL Server uses Windows Fibers for its UMS scheduling system. A SQL Server worker is a fiber, not a kernel thread. The scheduler: - Runs on top of N kernel threads (one per CPU) - Switches between SQL Server worker fibers using SwitchToFiber() - Avoids kernel-level context switches for normal query processing - Saves ~1-2µs per context switch (fiber switch vs. kernel thread switch)

At 100,000 queries/second, saving 1µs per query context switch saves 0.1 CPU cores worth of overhead.

Nginx Coroutine-Style Event Handling

Nginx's architecture effectively implements coroutines manually: each connection has an associated state machine that is an implicit coroutine (suspend on I/O wait, resume on I/O ready). While not using explicit coroutine primitives, the nginx event loop is structurally equivalent to running millions of suspended stackless coroutines.

C++ Game Entity AI

Game engines (Unreal Engine 5, various indie engines) use coroutines for game AI:

// Game AI with C++20 coroutines
Task<void> patrol(NPC& npc) {
    while (true) {
        co_await npc.move_to(waypoint_A);  // suspend until movement complete
        co_await npc.wait(seconds(2));     // suspend for 2 game seconds
        co_await npc.move_to(waypoint_B);
        co_await npc.wait(seconds(2));
    }
}

// Each NPC has a coroutine; the game engine resumes them each frame
// No threads needed — the coroutine scheduler is deterministic

This replaces complex state machines (switch on state variable) with sequential-looking code that naturally expresses behavior over time.

Debugging Notes

# Python: inspect generator state
def gen():
    yield 1
    yield 2

g = gen()
print(g.gi_frame)         # None before first next()
next(g)
print(g.gi_frame.f_locals) # {'...': ...} — locals at suspension point
print(g.gi_frame.f_lineno) # line number where suspended

# Inspect all frames in a generator's call stack
import inspect
# For asyncio coroutines:
import asyncio
async def coro():
    await asyncio.sleep(0)

task = asyncio.ensure_future(coro())
# task.get_coro().cr_frame — the coroutine's current frame
// C++ coroutine debugging is limited in current toolchains (2024)
// Key challenge: coroutine handle address + frame state not easily inspectable
// Workaround: add explicit logging at co_yield/co_await points
// MSVC: improved coroutine debugging in VS 2022 17.x
// clang: limited coroutine debugging
// GDB: basic coroutine frame inspection with 'info frame' on coro handles

Stackless Coroutine Memory Inspection

// Inspect C++20 coroutine frame size
// (compile-time, via sizeof on the coroutine frame struct)
// Or at runtime:
#include <coroutine>

auto handle = std::coroutine_handle<promise_type>::from_promise(p);
// handle.address() gives the heap address of the coroutine frame
// Frame size = next_frame_address - handle.address() (fragile heuristic)
// Better: use sanitizers to track allocation sizes

Security Implications

Coroutine State Leakage

Stackless coroutines store their state (local variables) in heap-allocated frame objects. If an attacker can read heap memory (e.g., via a use-after-free or heap spray), they may read the coroutine's local state — potentially including credentials, cryptographic keys, or session tokens stored in local variables at yield points.

Mitigation: clear sensitive local variables before yielding. Use volatile for sensitive data (forces compiler to not optimize away the clear).

C++20 Coroutine Memory Safety

C++20 coroutines allocate their frame on the heap. If the coroutine frame is destroyed while the coroutine is still alive (common mistake: destroying the Task object before awaiting it), the coroutine's frame is freed but the underlying operation may continue — classic use-after-free.

// DANGEROUS: use-after-free
void dangerous() {
    auto task = launchBackgroundTask();  // creates coroutine frame
    // task goes out of scope — frame destroyed
    // But the background coroutine may still be running!
}  // use-after-free when coroutine accesses freed frame

// SAFE: structured concurrency prevents this
Task<void> safe() {
    co_await launchBackgroundTask();  // wait for task to complete before returning
    // coroutine frame lives until we explicitly co_await it
}

Performance Implications

Stackless vs. Stackful Performance

Memory Cost per Suspended Coroutine
=====================================

Stackful (Go goroutine, initial):
  Initial stack: 2-8KB
  Grows dynamically (stack copying)
  Full registers saved on suspension

Stackless (C++20 co_await):
  Frame: only local vars + state int
  Typical: 32-512 bytes for simple coroutines
  Heap-allocated (one malloc per coroutine creation)

Rust async (stackless, zero-cost):
  Frame: only local vars + state enum
  No heap allocation if executor supports inline storage
  Typically: 50-200 bytes (inline in parent future)

Comparison at 1,000,000 concurrent coroutines:
  Go goroutines: ~4 GB (4KB each)
  C++20 coroutines: ~100-500 MB (100-500 bytes each)
  Rust async: ~50-200 MB (50-200 bytes each)

Context Switch Cost

Context Switch: Stackful vs. Stackless
=========================================

Stackful (Lua, Windows Fibers):
  Save/restore 6 callee-saved registers
  Swap stack pointer
  Total: ~20-50 cycles (~10-25 ns)

Stackless (C++20 resumption):
  Call handle.resume() → jumps to coroutine via vtable
  Switch state variable, load local vars from frame
  Total: ~5-15 cycles (~2-7 ns)

(Stackless is faster because no register save/restore needed
 — the state machine transformation eliminates that need)

Failure Modes and Real Incidents

The Python Generator Finalization Problem

Python generators that hold resources must be explicitly closed, or resources may leak:

def resource_generator():
    conn = open_database_connection()  # acquires resource
    try:
        for row in conn.query("SELECT ..."):
            yield row
    finally:
        conn.close()  # always runs on generator.close() or normal completion

gen = resource_generator()
next(gen)    # opens connection
next(gen)    # reads row
# If we DON'T call gen.close(), the finally block runs at GC time
# In CPython: GC is deterministic, finally runs when gen is garbage collected
# In PyPy/Jython: GC may be delayed — connection leak until GC runs

Best practice: always use generators as context managers or explicitly call .close().

Kotlin Coroutine Cancellation Bugs

Kotlin's structured concurrency means cancellation is propagated. But cancellation exceptions can be accidentally swallowed:

// BUG: swallowing CancellationException prevents structured cancellation
suspend fun buggy() {
    try {
        delay(1000)  // suspendable delay
    } catch (e: Exception) {
        // CancellationException is a subclass of Exception
        // Catching Exception here eats the cancellation signal!
        logger.error("caught: $e")
        // should rethrow: throw e  — but doesn't
    }
}

// FIX: explicitly rethrow CancellationException
suspend fun fixed() {
    try {
        delay(1000)
    } catch (e: CancellationException) {
        throw e  // always rethrow cancellation
    } catch (e: Exception) {
        logger.error("caught: $e")
    }
}

This caused real production bugs in Android apps where background coroutines continued running after the Activity was destroyed.

Modern Usage

C++20 coroutines: Used in game engines, network libraries (ASIO with C++20), high-performance servers. Still complex (the boilerplate for custom promise types is significant).

Python asyncio: Standard library, used by FastAPI, aiohttp. Every Python async framework uses coroutines.

Kotlin coroutines: Standard in Android development, Ktor server framework.

Lua coroutines: Game scripting (Roblox engine, many game mod systems), OpenResty (nginx + LuaJIT for web).

Windows Fibers: SQL Server, some game engines, specialized scheduling systems.

Future Directions

C++ Executors (P2300): The std::execution proposal (C++26) standardizes how coroutines interact with executors and schedulers — addressing the current situation where each library (ASIO, folly, libunifex) has incompatible coroutine infrastructure.

Stackful Coroutines in Rust: The async-task and context crates provide stackful coroutines for Rust. The Rust team is evaluating whether stackful coroutines should enter the standard library, given the use cases (C++ interop, game scripting) that stackless can't cover.

Structured Concurrency Standardization: C++26 is adding structured concurrency concepts (std::task, scope-bound execution). This addresses the coroutine leak problem that affects current C++20 code.

Exercises

  1. Coroutine Scheduler: Implement a round-robin coroutine scheduler using ucontext_t that can schedule N coroutines cooperatively. Add a yield_for_io() primitive that suspends the current coroutine, initiates a non-blocking I/O operation, and resumes the coroutine when I/O completes. Measure overhead vs. blocking I/O.

  2. C++20 Generator Chain: Implement a lazy pipeline using C++20 generators: source | filter | transform | take(N). Each stage is a generator coroutine. Measure memory usage for a pipeline of 10 stages processing 1 million elements. Compare to a vector-based pipeline.

  3. Python asyncio Internals: Read CPython's Lib/asyncio/base_events.py. Trace how loop.run_forever() schedules coroutines. Where is the event loop? How is a coroutine from asyncio.sleep(1) resumed after 1 second? Draw the call graph from asyncio.gather() to the final send() call on each coroutine.

  4. Stackful vs. Stackless Benchmark: Implement the same producer-consumer pipeline in (a) Lua coroutines (stackful), (b) Python generators (stackless), (c) C++20 generators (stackless). Benchmark: 1 million items, context switch count, memory usage, wall-clock time. Analyze which model performs better for which pipeline depth.

  5. Cancellation Protocol: Design a coroutine cancellation protocol for a stackful coroutine system (no language support for cancellation). How do you signal cancellation to a coroutine blocked at an arbitrary yield point? Implement it using a flag + cooperative polling. Compare to Kotlin's structured concurrency cancellation. What are the tradeoffs?

References

  • Moura, A.L. and Ierusalimschy, R. "Revisiting Coroutines." TOPLAS, 2009. [Formal taxonomy of coroutine types]
  • Rossberg, A. "Weakening WebAssembly." SPLASH, 2018. [Stack-switching for WASM]
  • Nystrom, B. "What Color is Your Function?" https://journal.stuffwithstuff.com/2015/02/26/. 2015.
  • C++20 Coroutines proposal: P0912R5 — https://wg21.link/P0912
  • Kowalski, J. "Coroutines in C++20." ACCU 2021 talk. [Practical guide]
  • Dreyer, D., et al. "A Type-and-Effect System for Flexible-Aliasing." POPL '11. [Type theory basis]
  • Lua 5.4 Reference Manual — Coroutines section: https://www.lua.org/manual/5.4/
  • Python asyncio documentation: https://docs.python.org/3/library/asyncio.html
  • Kotlin coroutines guide: https://kotlinlang.org/docs/coroutines-guide.html
  • cppreference.com — Coroutines (C++20): https://en.cppreference.com/w/cpp/language/coroutines