Go Runtime

Technical Overview

The Go runtime is a small, embedded kernel that ships inside every Go binary. Unlike the JVM or CPython, there is no separate runtime to install or launch: the runtime is linked into the executable, initializes itself at program startup, bootstraps the scheduler and GC, and then calls main.main. It is written almost entirely in Go itself (with a small assembly layer for bootstrap and context switching), making it readable and hackable in the source tree at src/runtime/.

The runtime's three primary responsibilities are: scheduling goroutines onto OS threads (the GMP model), memory management (a tcmalloc-inspired allocator with a tiered cache hierarchy), and garbage collection (tricolor concurrent mark-sweep). These interact intimately — the GC must know about every goroutine's stack, the allocator integrates with the GC write barrier, and the scheduler must pause for GC safepoints.

Prerequisites

Understanding of OS threads, context switching, and user-space scheduling
Basic understanding of memory allocators (free list, slab allocation)
Familiarity with concurrent GC concepts (tricolor marking, write barriers)
Go language basics (goroutines, channels, defer)

GMP Model Diagram

+----------------------------------------------------------+
|                  Go Scheduler (GMP Model)                |
|                                                          |
|  P0                   P1                   P2            |
|  +---------------+    +---------------+    +----------+  |
|  | Run Queue     |    | Run Queue     |    | Run Queue|  |
|  | [G1][G2][G3]  |    | [G4][G5]      |    | [G6]     |  |
|  +-------+-------+    +-------+-------+    +----+-----+  |
|          |                    |                 |         |
|          v                    v                 v         |
|       M0 (OS thread)       M1 (OS thread)    M2 (OS thr) |
|       running G_cur        running G_cur     running G_cur|
|                                                          |
|  Global Run Queue: [G7][G8]  (overflow from P queues)   |
|                                                          |
|  Work stealing: P2's queue empty -> steals half of P0's |
+----------------------------------------------------------+

G states: _Gidle -> _Grunnable -> _Grunning -> _Gdead
                        ^             |
                        |             v
                        +-------- _Gwaiting   (unblocked / wake)

Core Content

GMP Scheduler Deep Dive

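The Go scheduler implements M:N threading: many goroutines are multiplexed onto a smaller pool of OS threads. GOMAXPROCS (default: the number of logical CPUs) limits how many threads execute Go code simultaneously; additional threads may exist while blocked in system calls. The three entities: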

G (Goroutine): The unit of concurrent execution. A goroutine struct (runtime.g) contains a stack (pointer + bounds), saved scheduling context (program counter and stack pointer), a status, and a link to the M currently running it. Goroutines begin life as _Gidle, become _Grunnable once created and queued, _Grunning while executing, _Gwaiting while blocked on a channel, mutex, or timer, _Gsyscall while inside a system call, and _Gdead after the goroutine function returns.

M (Machine, an OS thread): An OS thread that executes goroutines. Each M has a g0, a special goroutine whose larger stack is used for scheduler functions (switching goroutines, running GC work). An M must hold a P to run user goroutines.

P (Processor): A logical processor, the token an M must hold to run Go code. There are exactly GOMAXPROCS Ps. Each P has a local run queue (runq, a ring buffer of up to 256 Gs), a runnext slot for the goroutine that should run next, and a cache of memory spans for allocation (mcache). The P also holds the GC write barrier buffer (p.wbBuf).

Scheduling decisions:
1. When a goroutine calls runtime.Gosched(), blocks on I/O or a channel, or is preempted, it gives up its M.
2. The M's scheduler function (running on g0) picks the next runnable G: normally from the local P run queue, then the global run queue, then the netpoller, then by stealing from other Ps. On every 61st scheduling tick of a P the global run queue is checked first so that goroutines waiting there are not starved.
3. Work stealing: If a P's local queue is empty, it steals half the goroutines from a randomly chosen victim P's run queue. This is the primary load balancing mechanism.
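
The selection order described above can be sketched as a toy model. This is a simplified illustration, not the runtime's code: toyP, toySched, pop, and next are hypothetical names, queues are plain slices instead of lock-free ring buffers, the netpoller step is omitted, and the victim is the first non-empty P rather than a random one.

```go
package main

import "fmt"

// toyP is a hypothetical stand-in for a P. The real runtime uses a fixed
// 256-slot lock-free ring buffer rather than a slice.
type toyP struct {
	runq []int // IDs of runnable goroutines
	tick int   // counts scheduling rounds on this P
}

// toySched is a hypothetical model of the scheduler's queues.
type toySched struct {
	ps     []*toyP
	global []int
}

// pop removes and returns the head of q.
func pop(q *[]int) int {
	g := (*q)[0]
	*q = (*q)[1:]
	return g
}

// next picks the goroutine that P number i runs next: the global queue first
// on every 61st round, then the local queue, then the global queue, then stealing.
func (s *toySched) next(i int) (int, bool) {
	p := s.ps[i]
	p.tick++
	if p.tick%61 == 0 && len(s.global) > 0 {
		return pop(&s.global), true
	}
	if len(p.runq) > 0 {
		return pop(&p.runq), true
	}
	if len(s.global) > 0 {
		return pop(&s.global), true
	}
	// Steal half (rounded up) of the first non-empty victim's queue.
	// The real scheduler visits victims in random order.
	for j, victim := range s.ps {
		if j == i || len(victim.runq) == 0 {
			continue
		}
		n := (len(victim.runq) + 1) / 2
		p.runq = append(p.runq, victim.runq[:n]...)
		victim.runq = victim.runq[n:]
		return pop(&p.runq), true
	}
	return 0, false
}

func main() {
	s := &toySched{
		ps:     []*toyP{{runq: []int{1, 2, 3, 4}}, {}},
		global: []int{7},
	}
	for k := 0; k < 3; k++ {
		g, ok := s.next(1) // P1 starts with an empty local queue
		fmt.Println(g, ok)
	}
	fmt.Println("P0 keeps", s.ps[0].runq)
}
```

In this run P1 first drains the global queue (G7), then steals G1 and G2 from P0 and runs them, leaving G3 and G4 on P0.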

Syscall handling: When a goroutine makes a blocking syscall (e.g., read() on a file), the M makes the syscall and blocks. The P detaches from the blocked M and is picked up by a new M (either from the idle M pool or a newly created OS thread) to continue running other goroutines. When the syscall returns, the original M tries to reacquire a P; if none is available, the goroutine is placed on the global run queue and the M goes idle.

Network I/O is handled differently: the Go runtime uses the OS's readiness-notification mechanism (epoll on Linux, kqueue on macOS) through the netpoller, a runtime component polled by the scheduler and the sysmon thread rather than a goroutine of its own. Sockets are set non-blocking and registered with epoll; when an operation would block, the goroutine parks itself (_Gwaiting) without tying up an M. When epoll reports readiness, the runtime marks the goroutine _Grunnable and it retries the I/O operation.

Goroutine Stack: Growth and Contiguous Stacks

Go goroutines start with a 2KB stack (significantly smaller than the typical OS thread default of 1–8MB). This allows creating millions of goroutines without exhausting virtual address space.

Stack growth check: On entry to almost every function (small leaf functions and those marked nosplit are exempt), the compiled prologue compares the stack pointer against the goroutine's stack guard (stackguard0). If the frame would not fit, the runtime calls morestack, which:
1. Allocates a new, larger stack (double the current size, or more if the frame needs it)
2. Copies all stack frames from the old stack to the new stack
3. Adjusts all pointers that pointed into the old stack to point into the new stack
4. Frees the old stack
5. Restarts the function
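
The copy step can be observed from user code, as in the following minimal sketch. It assumes the standard gc toolchain, where a local whose address is only converted to uintptr stays on the stack; deepen and stackMoved are hypothetical helper names, and reading addresses through uintptr is for demonstration only.

```go
package main

import (
	"fmt"
	"unsafe"
)

// deepen recurses n levels. The padding inflates each frame so the
// goroutine quickly outgrows its initial stack.
//
//go:noinline
func deepen(n int) int {
	var pad [256]byte
	pad[n%len(pad)] = 1
	if n == 0 {
		return int(pad[0])
	}
	return deepen(n-1) + int(pad[n%len(pad)])
}

// stackMoved reports whether a stack-allocated variable changed address
// across a deep call chain, measured on a fresh goroutine.
func stackMoved() bool {
	result := make(chan bool)
	go func() {
		var local int
		before := uintptr(unsafe.Pointer(&local))
		deepen(10000) // roughly 2.5 MB of frames, far beyond the initial stack
		after := uintptr(unsafe.Pointer(&local))
		result <- before != after
	}()
	return <-result
}

func main() {
	fmt.Println("stack moved:", stackMoved())
}
```

The variable keeps its value and its offset within the stack, but the stack itself now lives in a different allocation, which is why every pointer into it had to be adjusted.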

Contiguous stacks: The copy-and-adjust model means goroutine stacks are always a single contiguous memory region. This differs from segmented stacks (the original Go design, 1.2 and earlier), where each stack segment was a separate allocation chained together. Contiguous stacks eliminated the "hot split" performance problem where a function near a segment boundary triggers frequent segment allocation/deallocation.

Shrinkage: Stacks are shrunk at GC time if a goroutine's stack is less than 1/4 used. The stack is copied to a smaller allocation to reclaim memory.

The cost of stack growth: the copy is proportional to the amount of stack in use, but doubling amortizes it across calls. Copying is only safe because pointers into a stack live on that same stack: the compiler's escape analysis heap-allocates any variable whose address escapes the current function, so the runtime needs to adjust only the pointers found in the stack's own frames (located through compiler-generated stack maps) plus a few runtime-tracked references.

Go Memory Allocator

Go uses a tcmalloc-inspired three-tier allocator:

mheap: The global heap. Manages memory obtained from the OS via mmap. Divided into spans — contiguous runs of one or more 8KB pages. Each span serves one size class. mheap is protected by a global lock (a coarse-grained lock for span-level operations).

mcentral: One per span class (each size class in a pointer-containing and a pointer-free variant). A central list of spans for that class, shared across all Ps. Historically guarded by a per-mcentral lock; recent releases keep the spans in mostly lock-free span sets.

mcache: Per-P allocation cache, one per logical processor. Contains a span for each span class. Allocation from mcache needs no lock: because only one goroutine runs on a P at a time, the mcache is effectively thread-local. When a class's span in mcache is exhausted, mcache swaps it for a span with free slots from mcentral.

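Size classes: The runtime's size class table has 68 entries: class 0 is reserved for large objects, and 67 small-object classes range from 8 bytes to 32KB. Objects >32KB are "large" and get a dedicated span directly from mheap, which maps more memory from the OS only when it has no free pages. Each size class has a corresponding span size chosen to limit internal fragmentation.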
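
To make the rounding concrete, here is a small sketch that maps a request to the smallest class that fits it. Only the first ten class sizes are listed; smallClasses and roundToClass are hypothetical names, and the sketch ignores the runtime's separate tiny allocator for very small pointer-free objects.

```go
package main

import "fmt"

// smallClasses lists the first few object sizes in bytes. The full table is
// generated into the runtime's sizeclasses.go and continues up to 32768.
var smallClasses = []int{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundToClass returns the smallest listed class that fits size and the
// bytes wasted in that slot. ok is false when size is beyond this excerpt.
func roundToClass(size int) (class, waste int, ok bool) {
	for _, c := range smallClasses {
		if size <= c {
			return c, c - size, true
		}
	}
	return 0, 0, false
}

func main() {
	for _, size := range []int{1, 24, 33, 100} {
		class, waste, _ := roundToClass(size)
		fmt.Printf("request %3d B -> class %3d B (waste %2d B)\n", size, class, waste)
	}
}
```

A 33-byte request lands in the 48-byte class and wastes 15 bytes; the spacing between classes is what bounds this internal fragmentation.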

Allocation path:
1. new(T) or make([]T, n) → when the value escapes to the heap, the compiler emits a call that reaches runtime.mallocgc
2. If the object is small (≤32KB) and its size class has a free slot in mcache, take it (fast path, no lock)
3. If mcache is exhausted for that class, fetch a span from mcentral (shared across Ps)
4. If mcentral has no available span, get memory from mheap (global lock)
5. If mheap needs more memory, call mmap to obtain pages from the OS

Small allocations that contain no pointers (scalar types: int, float64, arrays thereof) are handled separately — the GC doesn't need to scan them. They are allocated from noscan spans, which the GC skips during marking.

Go GC: Tricolor Concurrent Mark-Sweep

Go uses a tricolor mark-sweep GC that runs concurrently with application goroutines.

Three colors:
- White: Not yet visited. Initially, all objects are white. At GC end, white objects are garbage.
- Gray: Discovered but not yet fully scanned (some references may not be marked).
- Black: Fully scanned. All objects referenced by a black object are at least gray.

Algorithm:
1. Mark setup (STW, brief): Finish any outstanding sweeping, enable the write barrier, and queue root-marking jobs (stacks, globals)
2. Concurrent mark: Worker goroutines scan roots and the gray set, marking referenced objects gray (then black when fully scanned). Runs concurrently with the application.
3. Mark termination (STW): Brief final STW that confirms no gray work remains and disables the write barrier
4. Sweep: Reclaim memory held by white (garbage) objects. This is concurrent in Go: a background sweeper runs, and spans are also swept lazily as new allocations need them.
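
The marking loop itself is short. The following single-threaded sketch runs it over a tiny object graph; object, mark, and sweep are hypothetical names, and the real collector works on spans and mark bits rather than per-object color fields.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	gray
	black
)

// object is a hypothetical heap object: a name plus outgoing pointers.
type object struct {
	name string
	refs []*object
	c    color
}

// mark runs the tricolor loop from the given roots and leaves every
// reachable object black.
func mark(roots []*object) {
	var grayQueue []*object
	shade := func(o *object) {
		if o != nil && o.c == white {
			o.c = gray
			grayQueue = append(grayQueue, o)
		}
	}
	for _, r := range roots {
		shade(r)
	}
	for len(grayQueue) > 0 {
		o := grayQueue[0]
		grayQueue = grayQueue[1:]
		for _, child := range o.refs {
			shade(child)
		}
		o.c = black // all children are now at least gray
	}
}

// sweep returns the names of objects that are still white after marking.
func sweep(heap []*object) []string {
	var garbage []string
	for _, o := range heap {
		if o.c == white {
			garbage = append(garbage, o.name)
		}
	}
	return garbage
}

func main() {
	a, b := &object{name: "a"}, &object{name: "b"}
	c, d := &object{name: "c"}, &object{name: "d"}
	a.refs = []*object{b}
	b.refs = []*object{a} // cycle a <-> b, still reachable from the root
	c.refs = []*object{d} // c and d are unreachable
	heap := []*object{a, b, c, d}
	mark([]*object{a})
	fmt.Println("garbage:", sweep(heap))
}
```

Cycles need no special handling: b's pointer back to a finds a already shaded and is skipped, while c and d stay white and are reported as garbage.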

Write barrier (hybrid of Yuasa deletion and Dijkstra insertion): When the application overwrites a pointer field while marking is in progress, the write barrier keeps the tricolor invariant intact. A pure Dijkstra insertion barrier shades the newly written pointer gray, so a black object can never point to a white object that is not already in the gray set. Since Go 1.8 the runtime uses a hybrid barrier that also shades the pointer being overwritten (the Yuasa deletion barrier). Shading both values lets the collector scan each goroutine stack once and treat it as black afterward, without a stop-the-world stack re-scan at the end of marking.
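
The need for a barrier shows up in a three-object scenario: the mutator stores a pointer to a white object into an already-black object and then erases the only gray path to it. The sketch below is a single-threaded model under that assumption; collector, writePointer, and run are hypothetical names, and the barrier shown shades both the old and the new pointer.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	gray
	black
)

type object struct {
	ref *object // a single pointer field keeps the model small
	c   color
}

// collector is a hypothetical, single-threaded model of the mark phase.
type collector struct {
	grayQueue []*object
	barrier   bool
}

func (gc *collector) shade(o *object) {
	if o != nil && o.c == white {
		o.c = gray
		gc.grayQueue = append(gc.grayQueue, o)
	}
}

// writePointer models the mutator executing "slot.ref = val" during marking.
// With the barrier enabled it shades the overwritten pointer (deletion
// barrier) and the new pointer (insertion barrier) before storing.
func (gc *collector) writePointer(slot, val *object) {
	if gc.barrier {
		gc.shade(slot.ref)
		gc.shade(val)
	}
	slot.ref = val
}

// drain scans gray objects until none remain.
func (gc *collector) drain() {
	for len(gc.grayQueue) > 0 {
		o := gc.grayQueue[0]
		gc.grayQueue = gc.grayQueue[1:]
		gc.shade(o.ref)
		o.c = black
	}
}

// run returns c's final color after the mutator hides c behind a black object.
func run(barrier bool) color {
	a := &object{c: black}        // already scanned
	c := &object{}                // white, reachable only through b
	b := &object{ref: c, c: gray} // queued but not yet scanned
	gc := &collector{grayQueue: []*object{b}, barrier: barrier}
	gc.writePointer(a, c)   // a.ref = c: black now points to white
	gc.writePointer(b, nil) // b.ref = nil: the gray path to c disappears
	gc.drain()
	return c.c
}

func main() {
	fmt.Println("without barrier, c stays white (lost):", run(false) == white)
	fmt.Println("with barrier, c ends black (kept):    ", run(true) == black)
}
```

Without the barrier the collector would free c even though a still points to it; with the barrier, the store into a shades c and it survives the cycle.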

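Write barrier code is inserted by the compiler at pointer stores that may target the heap (stores to the current stack frame need none). Outside the mark phase the barrier reduces to a check of the barrier-enabled flag followed by the plain store. While marking is active the extra cost per pointer write is small, on the order of a few nanoseconds, though it varies with hardware and workload.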

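GC pacing: The GOGC environment variable controls the GC trigger. GOGC=100 (default) means the next GC triggers when the heap has grown to about twice the live heap left by the previous cycle. Setting GOGC=200 delays GC to 3x live heap size (less GC, more memory). GOGC=off disables the proportional trigger, and with no memory limit set that turns GC off entirely (for benchmarking). Go 1.19+ adds runtime/debug.SetMemoryLimit (and the GOMEMLIMIT environment variable), a soft limit on the total memory managed by the Go runtime: as usage approaches the limit the GC runs more often regardless of GOGC, but the runtime may exceed the limit rather than fail.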

GODEBUG and Runtime Introspection

# Scheduler tracing (trace every 1000ms)
GODEBUG=schedtrace=1000 ./myapp
# Output: SCHED 1000ms: gomaxprocs=8 idleprocs=2 threads=10 ...

# GC tracing
GODEBUG=gctrace=1 ./myapp
# Output: gc 1 @0.023s 2%: 0.005+1.2+0.004 ms clock, ...

# Verify GC object marking correctness (expensive)
GODEBUG=gccheckmark=1 ./myapp

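// Inspect memory statistics at runtime (Go code, unlike the shell commands above)
import "runtime"
var m runtime.MemStats
runtime.ReadMemStats(&m) // briefly stops the world; avoid calling it in hot paths
// m.HeapAlloc, m.HeapSys, m.NumGC, m.PauseNs, etc.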

Stack Growth Performance

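Stack growth is usually invisible but can be a bottleneck when:
- Many short-lived goroutines each call deep enough to outgrow the initial 2KB stack, so every goroutine pays for one or more morestack copies. Unlike the segmented-stack "hot split", a function sitting at the boundary does not re-trigger growth on every call: once grown, the stack stays large until the GC shrinks it.
- A goroutine's stack grows from 2KB to megabytes and back: each growth step is fast, but the 2x doubling produces large allocations, and after a GC-time shrink the next deep call pays for growth again.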

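Diagnostic: stack memory does not show up in heap profiles, so look for runtime.morestack, runtime.newstack, and runtime.copystack in a CPU profile, and watch StackInuse in runtime.MemStats. Compiling with go build -gcflags="-m" prints escape analysis decisions, which shows which variables stay on the stack.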

Historical Context

The Go 1.0 scheduler multiplexed goroutines onto threads through a single global run queue guarded by one lock, which scaled poorly on multicore machines. The current GMP work-stealing scheduler was introduced in Go 1.1 by Dmitry Vyukov (Google), following his design document "Scalable Go Scheduler Design." Go's original GC was a simple stop-the-world collector. Concurrent GC was introduced in Go 1.5 (2015) with a design goal of pauses under 10 ms, down from the hundreds of milliseconds large heaps had seen before. Go 1.8 then adopted the hybrid write barrier (Yuasa deletion + Dijkstra insertion), which eliminated the STW stack re-scan and brought most pauses below 1 ms.

Production Examples

// Check GC overhead
import (
    "fmt"
    "runtime"
    "time"
)

func gcStats() {
    var stats runtime.MemStats
    runtime.ReadMemStats(&stats) // briefly stops the world; call sparingly
    fmt.Printf("GC cycles: %d, pause total: %v\n",
        stats.NumGC,
        time.Duration(stats.PauseTotalNs))
}

// Control GC aggressiveness (call from main or init)
import "runtime/debug"
debug.SetGCPercent(200)          // Less frequent GC (more memory used)
debug.SetMemoryLimit(4 << 30)    // Soft limit: 4 GiB of memory managed by the Go runtime

// goroutine dump on SIGQUIT (built-in)
// kill -QUIT <pid>  -> prints all goroutine stacks to stderr, then the process exits

# Profile a running Go server (the server must import net/http/pprof)
go tool pprof http://localhost:6060/debug/pprof/goroutine
go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

Debugging Notes

kill -QUIT <pid> dumps all goroutine stacks to stderr and then terminates the process; it is the closest Go equivalent of jstack, except that the program does not survive it
Goroutine leaks (goroutines blocked forever on a channel) are detected by the goleak library or by monitoring runtime.NumGoroutine() over time
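pprof heap profile: inuse_space and inuse_objects describe what is live on the heap now, while alloc_space and alloc_objects cover everything allocated since program start. Use inuse_space to find what's consuming live heap memory, and alloc_space or alloc_objects to find the code driving the allocation rate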
go tool trace captures a full execution trace (goroutine scheduling, GC events, blocking events) — visualized in the browser. Reveals scheduler imbalance and GC preemption patterns.
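runtime.GOMAXPROCS(1) (or GOMAXPROCS=1 in the environment) limits execution to one thread running Go code at a time. Comparing behavior with and without it helps separate bugs that need true parallelism from those caused by interleaving alone. It does not make racy code safe, because goroutines are still preempted, so run the race detector (go run -race) as well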
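
Monitoring runtime.NumGoroutine() for leaks can be as simple as comparing the count before and after a piece of work. In this minimal sketch, leak and goroutineDelta are hypothetical helpers, and the leak is deliberate so the count has something to show.

```go
package main

import (
	"fmt"
	"runtime"
)

// leak starts n goroutines that block on a channel nobody sends on.
// It returns a release function so the demo can clean up afterwards.
func leak(n int) (release func()) {
	never := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-never }()
	}
	return func() { close(never) }
}

// goroutineDelta reports how many more goroutines exist after f than before.
func goroutineDelta(f func()) int {
	before := runtime.NumGoroutine()
	f()
	return runtime.NumGoroutine() - before
}

func main() {
	var release func()
	delta := goroutineDelta(func() { release = leak(100) })
	fmt.Println("goroutines added:", delta)
	release() // closing the channel unblocks every parked goroutine
}
```

In a real service the same comparison would run periodically, with a steadily rising count as the signal to capture a goroutine profile.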

Security Implications

Stack scanning safety: The Go GC must scan goroutine stacks for pointers. Misidentifying a non-pointer integer as a pointer could follow it into an invalid memory region. Go's precise GC uses stack maps (generated by the compiler) to know exactly which stack slots contain pointers — no conservative scanning.
Goroutine ID: Go does not expose a goroutine ID to user code (unlike Java's Thread.getId()). This is a deliberate design decision to discourage goroutine-local storage patterns that complicate reasoning about concurrent code.
CGo safety: Code mixing CGo (C calls from Go) must follow the cgo pointer-passing rules: C code may not keep a copy of a Go pointer after the call returns, and Go memory passed to C may not itself contain Go pointers unless those are pinned (runtime.Pinner, Go 1.21+). The reason is that the runtime can move goroutine stacks and the GC cannot see references held in C memory. The runtime checks a subset of these rules dynamically by default (GODEBUG=cgocheck=1); cgocheck=0 disables the checks.

Performance Implications

Goroutine context switch cost: ~100ns (vs OS thread context switch: ~1–10 µs). This is the key advantage of Go's user-space scheduler for concurrent I/O-bound workloads.
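Memory allocator: the lock-free mcache path allocates 8-byte objects in roughly 5–10 ns. For comparison, glibc malloc is often put at ~15–30 ns under contention, though it too uses per-thread caches and per-arena locks rather than one global lock, so treat these as order-of-magnitude figures and benchmark your own workload.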
GC STW pauses in Go 1.21+: typically 0.1–1ms. P99 pause times for most production workloads are well under 5ms.
High allocation rate applications: if allocating >1GB/s, GC overhead becomes significant. Profile with GODEBUG=gctrace=1 and look at the GC %CPU column.
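
Figures like the context switch cost above are easy to sanity-check locally. The sketch below bounces a token between two goroutines over unbuffered channels; pingPong is a hypothetical helper, and the result includes channel overhead and, with GOMAXPROCS above 1, possible thread wakeups, so it is an upper bound rather than a pure switch cost.

```go
package main

import (
	"fmt"
	"time"
)

// pingPong bounces a token between two goroutines over unbuffered channels.
// Each round trip forces at least two goroutine handoffs.
func pingPong(rounds int) (completed int, elapsed time.Duration) {
	ping, pong := make(chan struct{}), make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()
	start := time.Now()
	for i := 0; i < rounds; i++ {
		ping <- struct{}{}
		<-pong
		completed++
	}
	elapsed = time.Since(start)
	close(ping) // lets the echo goroutine exit
	return completed, elapsed
}

func main() {
	const rounds = 100000
	n, d := pingPong(rounds)
	perTrip := float64(d.Nanoseconds()) / float64(n)
	fmt.Printf("%d round trips in %v (%.0f ns per round trip)\n", n, d, perTrip)
}
```

For publishable numbers, a testing.B benchmark with several runs is the better tool; this version is only meant as a quick check.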

Failure Modes

Goroutine leak: A goroutine blocked on a nil channel or an unbuffered channel that no sender ever sends to will park forever, consuming stack memory. Multiply by thousands and memory exhausts.
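Mutex starvation: A goroutine holds a sync.Mutex across a long blocking operation (e.g., a slow syscall), so every goroutine trying to acquire the mutex queues behind it. Separately, under heavy contention newly arriving goroutines can repeatedly win the lock ahead of queued waiters; Go 1.9+ added a starvation mode to sync.Mutex that hands the lock directly to a waiter that has been blocked for more than 1ms.
Stack overflow: Possible with deeply recursive functions, because doubling stops at a maximum stack size (1GB by default on 64-bit systems, 250MB on 32-bit, adjustable with debug.SetMaxStack). The runtime then aborts with "fatal error: stack overflow", which recover cannot catch.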
GC "bang": A burst of allocations after a quiet period causes a large GC cycle. Use debug.SetMemoryLimit to smooth out GC trigger timing.

Modern Usage

Go 1.21 introduced sync.OnceFunc, the min, max, and clear built-ins, and the generic slices and maps packages in the standard library. Because these utilities are generic, the compiler can often inline and specialize them, which can avoid heap allocation and reduce allocation pressure.

Go's scheduler handles os.File I/O (disk) differently from network I/O — disk I/O uses the OS blocking syscall path (M detaches from P) rather than the netpoller, because most OS kernels do not support async disk I/O uniformly (Linux io_uring changes this; Go has not yet integrated io_uring into the runtime).

Future Directions

io_uring integration: Replacing the thread-blocking syscall path for disk I/O with io_uring on Linux would eliminate OS thread parking/unparking, reducing context-switch overhead for file-heavy workloads
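GC improvements for high-allocation workloads: Generational GC in Go has been discussed for years, and the Go team's experiments with it are described in the "Getting to Go" keynote; it remains unimplemented. The difficulties cited include Go's non-moving heap, the cost of keeping a write barrier enabled at all times, and the fact that escape analysis already keeps many short-lived objects on the stack.
WASM runtime support: The Go runtime's WASI target (GOOS=wasip1 GOARCH=wasm, added in Go 1.21) is maturing; the runtime must implement its own stack management and scheduling on top of the WASM execution model.
Profile-guided optimization (PGO): Available as a preview in Go 1.20 and generally available since Go 1.21. The compiler uses CPU profiles collected from running programs to guide inlining and devirtualization on hot paths, which reduces call overhead and can let more values stay on the stack.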

Exercises

Write a Go program that creates 100,000 goroutines, each waiting on a channel. Measure the total RSS. Then create 1000 OS threads (runtime.LockOSThread) and measure the RSS difference. Quantify the goroutine-vs-thread memory advantage.
Implement a goroutine leak detector: a function that periodically samples runtime.NumGoroutine() and runtime.Stack(buf, true), then compares goroutine stack traces across samples to identify goroutines that appear stuck.
Trigger stack growth in a hot path: write a recursive function with a ~1.8KB frame (arrays on the stack) and measure the cost of the first call (stack growth) vs subsequent calls (no growth). Use runtime.ReadMemStats (the StackInuse and StackSys fields) to observe how much stack memory is allocated.
Instrument the Go GC using runtime.SetFinalizer on objects of known sizes. Count how many GC cycles occur during a fixed workload. Then vary GOGC (50, 100, 200) and measure throughput vs memory usage tradeoff.
Use go tool trace to generate an execution trace of a goroutine-intensive program. Identify: (a) goroutines that block on channel operations, (b) GC STW pause events, (c) any work-stealing events showing scheduler imbalance across Ps.

References

Dmitry Vyukov, "Scalable Go Scheduler Design." https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/
Richard Hudson & Austin Clements, "Getting to Go: The Journey of Go's Garbage Collector." GopherCon 2018. https://go.dev/blog/ismmkeynote
Austin Clements, "Proposal: Eliminate STW stack re-scanning." Go issue #17503, 2016. (Hybrid barrier design)
Go runtime source: src/runtime/ — proc.go (scheduler), malloc.go (allocator), mgc.go (GC), stack.go (stack management)
Dave Cheney, "High Performance Go Workshop." https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html
William Kennedy, "Scheduling in Go." https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part1.html