Go Goroutines

Technical Overview

A goroutine is a lightweight, user-space concurrency primitive provided by the Go runtime. Goroutines are often described as "lightweight threads", but the mechanics differ from both kernel threads and classic green threads. Each goroutine starts with a stack of about 2KB that grows dynamically (up to a default limit of 1GB on 64-bit platforms), is multiplexed onto a pool of OS threads by the Go scheduler, and typically communicates via channels following the CSP (Communicating Sequential Processes) model.

The Go runtime's scheduler implements a work-stealing M:N model called GMP (Goroutines, OS threads M, Processors P) that enables efficient use of multiple CPU cores while keeping goroutine creation and switching overhead much lower than kernel threads.

Prerequisites

M:N threading concepts (02-user-threads-and-green-threads.md)
Work-stealing scheduler concepts
Channel / CSP model basics
Linux kernel threads and goroutine-to-OS-thread mapping

Core Concepts

The GMP Model

GMP Model: Goroutines, OS Threads (M), Processors (P)
=======================================================

                    Global Run Queue (GRQ)
                    [G9, G10, G11, ...]
                          |
         +----------------+----------------+
         |                                 |
    +----v----+                       +----v----+
    |    P1   |                       |    P2   |
    | [LRQ:   |                       | [LRQ:   |
    |  G1,G2] |                       |  G5,G6] |
    +----+----+                       +----+----+
         |                                 |
    +----v----+                       +----v----+
    |    M1   | (OS thread/kernel)    |    M2   | (OS thread/kernel)
    | running |                       | running |
    |    G3   |                       |    G7   |
    +----+----+                       +----+----+
         |                                 |
    CPU Core 0                        CPU Core 1

Key:
  G  = Goroutine (user-space execution unit)
  M  = OS Thread (kernel thread, machine)
  P  = Processor (logical CPU, runs goroutines)
  LRQ = Local Run Queue (per-P, holds goroutines to run)
  GRQ = Global Run Queue (overflow, accessed less frequently)

GOMAXPROCS = number of Ps = max parallel goroutines
Default: GOMAXPROCS = runtime.NumCPU()

The P is the key abstraction that decouples goroutines from OS threads. A P holds:
- A local run queue (LRQ) of goroutines ready to run
- A pointer to the M currently executing on it
- Per-P heap caches, defer slabs, and timer heaps

When a P's LRQ is empty, it checks the GRQ and the network poller, then tries to steal from another P's LRQ (work stealing).

Goroutine Creation

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    runtime.GOMAXPROCS(4)  // 4 Ps: at most 4 goroutines run in parallel (default: NumCPU)

    var wg sync.WaitGroup

    for i := 0; i < 100000; i++ {
        wg.Add(1)
        go func(n int) {  // goroutine: ~2KB initial stack, ~1µs creation
            defer wg.Done()
            _ = fmt.Sprintf("goroutine %d", n)  // trivial work; result discarded
        }(i)
    }

    wg.Wait()

    // Check runtime stats:
    fmt.Println("Goroutines:", runtime.NumGoroutine())
    fmt.Println("GOMAXPROCS (Ps):", runtime.GOMAXPROCS(0))  // number of Ps, not OS threads
}

Goroutine creation cost comparison:
- kernel thread (pthread_create): ~5-50 µs, 8MB stack reservation
- Go goroutine: ~0.5-2 µs, 2-8KB actual stack

A server creating 100,000 goroutines uses ~200-800MB of RAM for stacks, versus ~800GB of virtual address space reserved for 100,000 kernel threads with 8MB stacks (actual RAM usage in both cases depends on how much stack each one touches).
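The arithmetic behind this comparison can be checked with a few lines of Go. This is a minimal sketch: the helper name stackBytes is hypothetical, and the per-stack sizes (2KB, 8KB, 8MB) are the figures quoted above, not measurements. Results are printed in binary units, so they come out slightly below the rounded decimal figures.

```go
package main

import "fmt"

const (
	kib = 1 << 10
	mib = 1 << 20
	gib = 1 << 30
)

// stackBytes returns the total stack footprint of n execution units
// that each hold perStack bytes (hypothetical helper for illustration).
func stackBytes(n, perStack int64) int64 {
	return n * perStack
}

func main() {
	const n = 100_000
	fmt.Printf("goroutines at 2 KiB each: %d MiB\n", stackBytes(n, 2*kib)/mib)
	fmt.Printf("goroutines at 8 KiB each: %d MiB\n", stackBytes(n, 8*kib)/mib)
	fmt.Printf("threads at 8 MiB each:    %d GiB (virtual reservation)\n", stackBytes(n, 8*mib)/gib)
}
```

The thread figure is address space reserved, not resident memory; the kernel only backs the stack pages a thread actually touches.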

Work Stealing

When a P runs out of goroutines in its LRQ, it steals from other Ps:

Work Stealing Algorithm (simplified)
======================================

func schedule() goroutine:
    // 1. Check local run queue first (the common case):
    if g := p.runq.pop(); g != nil:
        return g

    // 2. Check global queue (the real scheduler also polls it first on every 61st tick to prevent starvation):
    if g := globalRunQueue.pop(); g != nil:
        return g

    // 3. Network poller (non-blocking check for I/O-ready goroutines):
    if g := netpoll(nonblocking); g != nil:
        return g

    // 4. Try to steal from other P's local queues:
    for _, victim := range randomize(allPs):
        if victim == p { continue }
        if g := victim.runq.stealHalf(); g != nil:
            return g

    // 5. Block: wait for work (GRQ check + syscall goroutine check)
    stopm()  // park M, return P to idle pool

The "steal half" policy takes half of the victim's LRQ at once (batching reduces contention on the victim's queue). This amortizes the stealing overhead.
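The steal-half policy can be illustrated with a toy model. This sketch is not the runtime's lock-free implementation: runQueue is a hypothetical slice-backed queue, ints stand in for goroutines, and the only thing it mirrors is the batching rule (take half of the victim's queue, rounded up, from the head).

```go
package main

import "fmt"

// runQueue is a toy stand-in for a P's local run queue.
type runQueue struct {
	items []int
}

// stealHalf removes half of the queued items (rounded up) from the head
// of the victim queue and returns them as one batch.
func (q *runQueue) stealHalf() []int {
	n := len(q.items) - len(q.items)/2 // ceil(len/2)
	stolen := append([]int(nil), q.items[:n]...)
	q.items = q.items[n:]
	return stolen
}

func main() {
	victim := &runQueue{items: []int{1, 2, 3, 4, 5, 6}}
	thief := &runQueue{}

	thief.items = append(thief.items, victim.stealHalf()...)
	fmt.Println("thief: ", thief.items)  // [1 2 3]
	fmt.Println("victim:", victim.items) // [4 5 6]
}
```

One batch move costs a single synchronization with the victim, where stealing item by item would cost one per goroutine.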

Goroutine States and Transitions

Goroutine States
=================

_Grunnable   → on a run queue, ready to run
_Grunning    → currently running on an M
_Gsyscall    → inside a system call (M is running OS code)
_Gwaiting    → blocked (on channel, timer, GC, etc.)
_Gdead       → exited or not yet started
_Gcopystack  → stack is being copied (during stack growth)
_Gpreempted  → preempted by async preemption signal

Key transitions:

go func() {...}  →  _Grunnable (added to P's LRQ)

P picks up G  →  _Grunning (G starts executing)

G calls channel op, blocks  →  _Gwaiting
  + G stored in channel's send/recv queue
  + P picks up next G from LRQ

Channel send/recv completes  →  _Grunnable (back on a P's LRQ)

G enters syscall  →  _Gsyscall
  M keeps P at first; if the syscall blocks, sysmon detaches P from M
  (P can then run other goroutines)
  M continues in syscall (M is now "floating" without P)

Syscall returns:
  If P is available: G → _Grunning (G resumes on that P)
  If no P: G → _Grunnable (added to GRQ, wait for a P)

Goroutine Parking: Blocking I/O

The magic of goroutines for network I/O: when a goroutine makes a blocking I/O call, the runtime parks the goroutine and continues with other work:

Network I/O: Goroutine Parking
================================

G1 calls net.Read(conn)
  |
  +-- Go runtime checks: is data available? (via netpoll)
  |
  +-- Data NOT available:
  |     park G1: set G1.state = _Gwaiting
  |     register conn with netpoll (epoll on Linux)
  |     P continues running other goroutines (G2, G3, ...)
  |
  +-- Data arrives at NIC:
  |     netpoll (epoll_wait on Linux), called from the scheduler or sysmon, reports conn ready
  |     netpoller marks G1 as _Grunnable
  |     G1 added to global run queue (or directly to a P)
  |
  +-- P picks up G1, resumes net.Read()
        G1 reads data, continues

The netpoller is not a dedicated goroutine: the scheduler and sysmon call it to check epoll/kqueue for I/O readiness and to wake parked goroutines. It is the Go runtime's equivalent of an event loop, but invisible to application code.

Blocking Syscalls: Thread Handoff

For blocking syscalls (not network I/O — which is handled by netpoll), the runtime must not let the OS thread block with a P attached:

Blocking Syscall Handling (e.g., file I/O, cgo)
=================================================

G1 enters blocking syscall (e.g., os.File.Read → read() on filesystem):

  1. Before entering the kernel, runtime sets G1.state = _Gsyscall
     (P1 stays loosely associated with M1)

  2. M1 (OS thread) calls the syscall and blocks in the kernel

  3. Sysmon (the runtime's monitor thread, which runs without a P)
     notices the syscall is taking long and detaches P1 from M1.
     P1 is picked up by:
     - Existing idle M2, or
     - New OS thread M3 (runtime may spawn one)

  4. P1 continues running other goroutines on M2/M3

  5. M1's syscall returns:
     - If a P is available: M1 acquires it, G1 → _Grunning
     - If no P: G1 → _Grunnable (GRQ), M1 → idle pool (sleep)

The cost: each blocking syscall may spawn/wake an OS thread.
For many concurrent blocking syscalls: many OS threads.
(This is the underlying reason for "goroutines are cheap, blocking syscalls aren't")

Channel Internals

Channels are Go's primary synchronization and communication primitive. Internally, a channel is an hchan struct:

// Internal representation (simplified from runtime/chan.go):
type hchan struct {
    qcount   uint           // number of elements in queue
    dataqsiz uint           // capacity of circular queue
    buf      unsafe.Pointer // points to circular queue array
    elemsize uint16
    closed   uint32
    elemtype *_type
    sendx    uint           // send index in circular buffer
    recvx    uint           // receive index in circular buffer
    recvq    waitq          // list of goroutines waiting to receive
    sendq    waitq          // list of goroutines waiting to send
    lock     mutex
}

type waitq struct {
    first *sudog  // list head
    last  *sudog
}

type sudog struct {
    g      *g             // the goroutine
    next   *sudog
    prev   *sudog
    elem   unsafe.Pointer // pointer to the data element (may point into a goroutine's stack)
    // ...
}

Channel operation scenarios:

Channel send/receive scenarios
================================

Unbuffered channel ch (dataqsiz=0):

Scenario A: receiver waiting
  G2 is blocked on <-ch (recv):
    G2 is in ch.recvq
  G1 sends ch <- val:
    runtime directly copies val to G2's stack frame
    G2 moves to runnable queue
    G1 continues running

Scenario B: no receiver
  G1 sends ch <- val (no receiver waiting):
    G1 is parked (added to ch.sendq with val)
    G1.state = _Gwaiting
    P continues running other goroutines
  Later G2 receives <-ch:
    G2 finds G1 in sendq
    G2 copies val from G1's sudog
    G1 is unparked → _Grunnable

Buffered channel ch = make(chan int, N):

  If buffer not full: send copies val to buffer, continues
  If buffer full: same as unbuffered (sender blocks in sendq)

  If buffer not empty: receive copies from buffer, continues
  If buffer empty: same as unbuffered (receiver blocks in recvq)
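The blocking rules above can be observed from user code with a non-blocking send. In this small sketch, trySend is a hypothetical helper that uses select with a default case, so it reports whether a send would have proceeded immediately instead of parking the goroutine.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether the value
// was accepted immediately.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan int)
	fmt.Println(trySend(unbuffered, 1)) // false: no receiver is waiting

	buffered := make(chan int, 1)
	fmt.Println(trySend(buffered, 1)) // true: the buffer has room
	fmt.Println(trySend(buffered, 2)) // false: the buffer is full
	fmt.Println(<-buffered)           // 1
	fmt.Println(trySend(buffered, 3)) // true: room again after the receive
}
```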

Select Statement Internals

select allows waiting on multiple channel operations:

select {
case v := <-ch1:     // receive from ch1
    processV(v)
case ch2 <- x:       // send x to ch2
    processSent()
case <-time.After(5 * time.Second):
    handleTimeout()
default:             // non-blocking: execute if no case is ready
    handleNoReady()
}

Implementation: when no case is ready, the goroutine is enqueued on the send/recv queues of ALL channels named in the select. When any of those channels becomes ready, the goroutine is woken, dequeued from all the other queues, and resumed with the winning case. This is one of the trickiest parts of the runtime's channel code: to register and de-register consistently, the runtime locks every involved channel in a fixed order (sorted by address) so that concurrent selects cannot deadlock.

Goroutine Stack Growth

Go uses contiguous stacks that grow and shrink:

Stack Growth (Go 1.3+: contiguous stacks)
============================================

Initial goroutine stack: 2KB since Go 1.4 (8KB in Go 1.2 and 1.3)

When stack needs more space:
  1. Function prologue detects stack overflow (stack guard check)
  2. Runtime allocates a NEW stack (2x or larger)
  3. Copies ENTIRE old stack to new location
     (pointers into the old stack are rewritten; this works because escape
      analysis moves variables whose address outlives the frame to the heap,
      and it is why C code must not hold pointers into Go stacks)
  4. Old stack freed

Stack shrink:
  At GC time, if less than 1/4 of the stack is in use: shrink by copying to a stack half the size

History:
  Go 1.2 and earlier: segmented stacks (linked list of segments)
    Problem: "hot split" — frequent boundary crossings in tight loops
    caused repeated stack segment allocation/deallocation
  Go 1.3+: contiguous stacks, which eliminated the hot split at the cost of an O(n) copy

Cost of stack growth:
  Copy is O(stack size in use)
  A goroutine with 1MB live stack: ~1MB memcpy on growth
  Triggered only when stack is exhausted (infrequent for most goroutines)
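Stack growth can be provoked with deep recursion. The following sketch assumes that runtime.MemStats.StackInuse is a reasonable proxy for goroutine stack memory. The names recurse and measureGrowth are hypothetical, and the printed sizes vary by Go version and platform, so no particular numbers are claimed.

```go
package main

import (
	"fmt"
	"runtime"
)

// recurse descends n frames and returns n. The local array pads each
// frame so the 2KB starting stack is exhausted quickly.
func recurse(n int) int {
	var pad [256]byte
	pad[n%len(pad)] = 1
	if n == 0 {
		return 0
	}
	return recurse(n-1) + int(pad[n%len(pad)])
}

// measureGrowth runs recurse(depth) in a fresh goroutine and samples
// StackInuse before and after, while that goroutine is still alive.
func measureGrowth(depth int) (result int, before, after uint64) {
	type sample struct {
		n             int
		before, after uint64
	}
	ch := make(chan sample)
	go func() {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		b := m.StackInuse
		n := recurse(depth)
		runtime.ReadMemStats(&m)
		ch <- sample{n, b, m.StackInuse}
	}()
	s := <-ch
	return s.n, s.before, s.after
}

func main() {
	n, before, after := measureGrowth(10_000)
	fmt.Println("depth reached:", n)
	fmt.Printf("StackInuse: %d KiB before, %d KiB after\n", before/1024, after/1024)
}
```

The second sample is taken inside the goroutine right after the recursion returns, before a GC cycle has had a chance to shrink the stack again.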

Goroutine Preemption: Async Preemption (Go 1.14)

Before Go 1.14, goroutines could only be preempted at function calls (via stack overflow check). A tight loop with no function calls could monopolize a P:

// Before Go 1.14: this could starve other goroutines!
go func() {
    for {
        // No function calls, no allocation, no preemption point
        count++  // tight loop, never yields
    }
}()

Go 1.14 added async preemption: sysmon (the runtime's monitor thread, not a goroutine) sends a SIGURG signal to the M running any goroutine that has held its P for more than 10ms. The signal handler checks that the goroutine is at a safe point, saves its state, and switches to the scheduler. This is similar to how an operating system preempts threads on a timer interrupt.

// After Go 1.14: sysmon eventually preempts the tight loop via SIGURG,
// so other goroutines still run, although the loop keeps burning CPU.
// Most practical code also contains function calls, which remain cooperative preemption points.

Goroutine Leak Detection

Goroutine leaks are common in production Go code. Detection:

// goleak: test-time goroutine leak detection
// go get go.uber.org/goleak

import (
    "testing"
    "go.uber.org/goleak"
)

func TestMyService(t *testing.T) {
    defer goleak.VerifyNone(t)  // fail if goroutines leak during test

    svc := NewMyService()
    svc.Start()

    // ... test ...

    svc.Stop()
    // goleak checks: are there unexpected goroutines still running?
}

// Runtime monitoring (for production):
import "runtime"

func goroutineCount() int {
    return runtime.NumGoroutine()
}

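// Prometheus metric (client_golang's default Go collector already exports go_goroutines;
// define it yourself only on a custom registry):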
prometheus.NewGaugeFunc(prometheus.GaugeOpts{
    Name: "go_goroutines",
    Help: "Number of goroutines that currently exist.",
}, func() float64 {
    return float64(runtime.NumGoroutine())
})

Scheduler Profiling and Debugging

# GODEBUG scheduler trace:
GODEBUG=schedtrace=1000 ./myprogram
# Every 1000ms: prints scheduler state
# Example:
# SCHED 1000ms: gomaxprocs=8 idleprocs=3 threads=7 spinningthreads=1
#   idlethreads=2 runqueue=4 [0 0 2 1 0 0 0 1]
#                                    ^   per-P run queue sizes

GODEBUG=schedtrace=100,scheddetail=1 ./myprogram
# Very verbose: individual goroutine states

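// Write a goroutine profile (Go code; needs "log", "os", and "runtime/pprof"):
f, err := os.Create("goroutine.pprof")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
pprof.Lookup("goroutine").WriteTo(f, 0)  // debug=0: binary format for go tool pprof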

# go tool pprof: interactive goroutine analysis
go tool pprof goroutine.pprof
(pprof) top10
(pprof) list myFunction

# Get goroutine dump from running process:
kill -QUIT <pid>  # SIGABRT has a similar effect
# The Go runtime prints goroutine stacks and exits with status 2
# (unless the program handles SIGQUIT itself via os/signal)

// Get goroutine stacks programmatically:
buf := make([]byte, 1<<20)  // 1MB buffer
n := runtime.Stack(buf, true)  // true = all goroutines
fmt.Printf("Goroutine dump:\n%s\n", buf[:n])

Historical Context

CSP: Tony Hoare (1978)

Go's channel model is based on Tony Hoare's CSP (Communicating Sequential Processes), introduced in his 1978 paper in CACM. The Go community summarizes the approach with a proverb (popularized by Rob Pike, not taken from Hoare's paper):

"Do not communicate by sharing memory; instead, share memory by communicating."

CSP models concurrency as independent sequential processes communicating via message-passing channels. This is the theoretical foundation of goroutines + channels.

Goroutine Design Decisions (2009)

Rob Pike, Ken Thompson, and Robert Griesemer designed goroutines building on:
- Plan 9's lightweight process model (Rob Pike's previous work)
- Squeak (a concurrent language by Luca Cardelli and Rob Pike)
- Newsqueak (another Pike language)
- Limbo (the Inferno OS language, a direct predecessor)

Key decision: stackful (not stackless) goroutines. This avoids the "function coloring" problem — any function can block, any function can be launched as a goroutine. The runtime handles the complexity.

Production Examples

Kubernetes Control Plane

Kubernetes (written in Go) runs thousands of goroutines in each controller. The controller manager runs ~40+ control loops, each as a goroutine polling for state changes. The kube-apiserver handles each incoming request in a goroutine.

A typical kube-apiserver under load:
- ~500-2,000 active goroutines
- GOMAXPROCS = node's CPU count
- ~15-50 OS threads

Docker (Moby)

The Docker daemon uses goroutines for:
- One goroutine per container's log stream
- One goroutine per container's stats collection
- Multiple goroutines for image layer operations

A Docker host running 100 containers typically has 200-500 goroutines in the daemon.

Prometheus Time-Series Database

Prometheus uses goroutines for:
- Per-scrape-target goroutines (one per monitored service)
- Background goroutines for TSDB compaction, head truncation, WAL replay

A Prometheus instance monitoring 1,000 targets maintains 1,000+ goroutines. This is feasible with goroutines; it would be prohibitive with kernel threads.

Debugging Notes

// Detect goroutine leaks in production:
http.HandleFunc("/debug/goroutines", func(w http.ResponseWriter, r *http.Request) {
    buf := make([]byte, 1<<20)
    n := runtime.Stack(buf, true)
    w.Write(buf[:n])
})
// Access: curl http://server:port/debug/goroutines

// Also available via pprof:
// import _ "net/http/pprof"
// Access: http://server:port/debug/pprof/goroutine?debug=2

# Analyze goroutine pprof dump:
go tool pprof http://localhost:6060/debug/pprof/goroutine
(pprof) top20                    # top goroutine creation sites
(pprof) list processRequest      # goroutines in processRequest
(pprof) traces                   # full stack traces grouped

# Scheduler analysis:
go tool trace trace.out          # visual scheduler trace (requires trace capture)
// Capture (Go code; needs "os" and "runtime/trace"):
trace.Start(os.Stderr)           // writes the trace to stderr; run with 2> trace.out
defer trace.Stop()

Security Implications

Goroutine Leaks as DoS

Each leaked goroutine holds at least its stack, ~2-8KB. A handler that leaks one goroutine per request can exhaust memory on a busy server. At 100k req/s with 1 leak/request: 100k goroutines/second, or ~200-800MB/second of memory growth.

Detection: monitor runtime.NumGoroutine() as a Prometheus metric. Alert if it grows unboundedly.

Race Conditions

Go's race detector (-race flag) instruments memory accesses to detect data races:

go test -race ./...
go build -race -o myserver-race ./myserver && ./myserver-race
# Overhead: typically 2-20x slower, 5-10x more memory
# NOT for production — use in testing
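A typical race the detector reports is an unsynchronized counter such as total++ executed from many goroutines. The sketch below shows two conventional fixes, a mutex and an atomic counter. The function names are hypothetical, and atomic.Int64 assumes Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countMutex increments a shared counter from n goroutines, guarding
// every access with a mutex. Dropping the Lock/Unlock pair would make
// total++ a data race that go test -race reports.
func countMutex(n int) int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		total int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			total++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

// countAtomic does the same with an atomic counter and no lock.
func countAtomic(n int) int64 {
	var (
		wg    sync.WaitGroup
		total atomic.Int64
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			total.Add(1)
		}()
	}
	wg.Wait()
	return total.Load()
}

func main() {
	fmt.Println(countMutex(1000))  // 1000
	fmt.Println(countAtomic(1000)) // 1000
}
```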

Channel Closes and Panics

Closing a channel twice, or sending to a closed channel, panics:

ch := make(chan int)
close(ch)
close(ch)     // PANIC: close of closed channel
ch <- 1       // PANIC: send on closed channel (if called after close)

In security contexts: an attacker controlling timing of close() calls can cause panics and denial of service if channels are accessible across trust boundaries.
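One common defence against double-close panics is to route every close through sync.Once. The sketch below shows that pattern next to a demonstration of the panic itself. The closer type and doubleClosePanics function are hypothetical names. The pattern does not protect against sends on a closed channel, so the usual rule still applies: only the sending side should close.

```go
package main

import (
	"fmt"
	"sync"
)

// closer wraps a channel so that Close is idempotent.
type closer struct {
	once sync.Once
	ch   chan int
}

func newCloser() *closer { return &closer{ch: make(chan int)} }

// Close closes the channel on the first call and is a no-op afterwards.
func (c *closer) Close() { c.once.Do(func() { close(c.ch) }) }

// doubleClosePanics closes a plain channel twice and reports whether
// the second close panicked (recover turns the panic into a result).
func doubleClosePanics() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch := make(chan int)
	close(ch)
	close(ch) // panics: close of closed channel
	return false
}

func main() {
	fmt.Println(doubleClosePanics()) // true

	c := newCloser()
	c.Close()
	c.Close() // safe: the second call does nothing
	_, ok := <-c.ch
	fmt.Println(ok) // false: the channel is closed
}
```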

Performance Implications

Goroutine vs. Thread Performance

Benchmark: N concurrent HTTP handlers (simple JSON response)
=============================================================

N=100:
  Goroutines:     ~50,000 req/s, ~8MB RSS
  kernel threads: ~45,000 req/s, ~800MB of reserved (virtual) stack
  (similar throughput, 100x difference in stack memory)

N=10,000:
  Goroutines:     ~48,000 req/s, ~80MB RSS
  kernel threads: ~10,000 req/s (OOM risk at 80GB virtual)
  (goroutines scale, threads don't)

N=100,000:
  Goroutines:     ~45,000 req/s, ~800MB RSS
  kernel threads: OOM or scheduler thrash

Channel Performance

Channel operation latency (benchmark, Go 1.22, x86-64)
========================================================

Unbuffered channel send+receive (two goroutines, same P):
  ~150-300 ns

Buffered channel send (buffer not full, same P):
  ~30-50 ns

Buffered channel receive (buffer not empty, same P):
  ~30-50 ns

Channel send causing goroutine wakeup (cross-P):
  ~300-600 ns (includes scheduler overhead)
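Figures like these depend heavily on hardware and Go version, so it is worth measuring locally. The following is a rough timing sketch, not a rigorous benchmark: the function names are hypothetical, it uses wall-clock timing rather than the testing package, and it does not control which P the goroutines run on.

```go
package main

import (
	"fmt"
	"time"
)

// timeBuffered times n send+receive pairs on a buffered channel from a
// single goroutine, so no goroutine is ever parked.
func timeBuffered(n int) time.Duration {
	ch := make(chan int, 1)
	start := time.Now()
	for i := 0; i < n; i++ {
		ch <- i
		<-ch
	}
	return time.Since(start)
}

// timeUnbuffered times n handoffs from the caller to a receiver
// goroutine over an unbuffered channel.
func timeUnbuffered(n int) time.Duration {
	ch := make(chan int)
	done := make(chan struct{})
	go func() {
		for range ch { // drain until the channel is closed
		}
		close(done)
	}()
	start := time.Now()
	for i := 0; i < n; i++ {
		ch <- i
	}
	elapsed := time.Since(start)
	close(ch)
	<-done
	return elapsed
}

func main() {
	const n = 1_000_000
	fmt.Printf("buffered send+recv: %v per pair\n", timeBuffered(n)/n)
	fmt.Printf("unbuffered handoff: %v per send\n", timeUnbuffered(n)/n)
}
```

For publishable numbers, a testing.B benchmark run with go test -bench is the more reliable tool, since it handles warm-up and iteration counts.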

Failure Modes and Real Incidents

Kubernetes "Goroutine Bomb" Bug (2019)

A Kubernetes pod controller bug caused exponential goroutine creation under specific error conditions. Each error triggered a retry, and the retry logic leaked a goroutine. Under load with many failing pods, goroutine count grew to millions, causing OOM and cluster-wide instability.

The fix was in the error handling path — properly canceling the goroutine's context on retry bounds exhaustion.

Go Goroutine Scheduler Regression (Go 1.14 SIGURG)

When Go 1.14 introduced async preemption via SIGURG, programs began receiving many more signals than before. According to the Go 1.14 release notes, code using packages such as syscall or golang.org/x/sys/unix saw more slow system calls fail with EINTR, and programs that did not retry on EINTR broke. Non-Go code linked via cgo that does not handle EINTR can be affected in the same way.

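Workaround: retry system calls that fail with EINTR, or disable async preemption for the whole process with GODEBUG=asyncpreemptoff=1. Note that runtime.LockOSThread() pins a goroutine to an OS thread but does not block preemption signals, and Go offers no user-level directive for marking code as non-preemptible.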

Modern Usage

Go is the dominant language for cloud infrastructure (Kubernetes, Docker, Prometheus, Terraform, Consul, Vault, Istio, etcd). Goroutines are the reason Go is preferred for this domain: cheap concurrency maps naturally to handling many concurrent connections, pods, and background tasks.

Future Directions

Goroutine IDs: Go deliberately does not expose goroutine IDs to application code. They exist internally (the runtime's goid field, visible in stack traces), but there is no public API for them. Requests to expose them recur because they would help with tracing and debugging.

Structured Concurrency in Go: Go has no built-in structured concurrency primitive (a scope that cannot exit until every goroutine started inside it has completed). The golang.org/x/sync/errgroup package partially addresses this, and Go 1.25 added sync.WaitGroup.Go as a convenience for the spawn-and-wait pattern. Formal structured concurrency may come in a future version.

Scheduler NUMA Awareness: For NUMA machines (multiple CPU sockets with distinct memory banks), placing goroutines on Ps close to the memory they access could reduce cross-NUMA memory access. This is an active research area for Go on large servers.

Exercises

GMP Model Visualization: Write a Go program that uses runtime.NumGoroutine() and runtime.NumCPU() to monitor the GMP state. Launch goroutines that do CPU-bound work (tight loops) and I/O-bound work (network calls). Monitor OS thread count via /proc/self/status Threads. Explain what you observe.
Work Stealing Demonstration: Write a Go program where one goroutine creates a burst of work (1000 goroutines) and then blocks. Verify via GODEBUG=schedtrace=100 that work is stolen by other Ps. Measure the steal rate and latency.
Channel Select Implementation: Implement a timeout wrapper for any channel operation using select and time.After. Then implement a "first of N" primitive: given N channels, return the first value from any of them. Measure the overhead of a select over a direct channel receive.
Goroutine Leak Injection and Detection: Write a service that intentionally leaks one goroutine per request (a goroutine blocked waiting on an abandoned channel). Add a NumGoroutine() metric. Generate 1000 requests. Observe the leak. Then fix it using context cancellation and verify the fix with goleak in tests.
Stack Growth Profiling: Write a deeply recursive Go function (to depth 10,000). Profile with go tool pprof to find stack growth events. Instrument with runtime.ReadMemStats to measure total goroutine stack size before and after. Force a stack shrink by making the goroutine idle and waiting for GC. Observe the GC's stack shrinking behavior.

References

Go team. "Go Concurrency Patterns." Google I/O 2012. https://talks.golang.org/2012/concurrency.slide
Pike, R. "Concurrency is not parallelism." Heroku Waza 2012. https://blog.golang.org/waza-talk
Hoare, C.A.R. "Communicating Sequential Processes." Communications of the ACM 21(8), 1978.
Golang scheduler internals: runtime/proc.go in Go source
Golang channel internals: runtime/chan.go in Go source
Deshpande, N., Sponsler, E., and Weiss, N. "Analysis of the Go runtime scheduler." Columbia University, 2012.
go.uber.org/goleak documentation
Gregg, B. "Systems Performance." Prentice Hall, 2020. Chapter on Go profiling.
Go 1.14 release notes (async preemption): https://go.dev/doc/go1.14
"The Go Memory Model." https://go.dev/ref/mem