Compilation Pipeline
Technical Overview
A compiler transforms source code — text in a programming language — into executable machine code. The transformation is not a single operation but a pipeline of well-defined passes, each consuming the output of the previous stage and producing a more machine-oriented representation. The separation of concerns in the pipeline is deliberate: each pass is independently testable, front-ends and back-ends can be combined modularly (LLVM's core architectural principle), and optimization passes can be inserted, removed, or reordered.
The canonical C compilation pipeline, as implemented by Clang/LLVM or GCC, passes through: preprocessing, lexing, parsing, semantic analysis, IR generation, optimization, code generation, and linking. Each stage has a distinct input/output format. Understanding where your error originates — is it a preprocessor failure, a parse error, a type error, or an ABI mismatch at link time — is fundamental to working with the toolchain.
Prerequisites
- C/C++ language familiarity at an intermediate level
- Basic understanding of formal language theory (tokens, grammars, ASTs)
- Familiarity with assembly language basics
- Understanding of operating system process model (sections, segments)
Full Pipeline Diagram
Source File (hello.c)
|
v
+-------------------+ Tool: cpp / clang -E
| 1. Preprocessor | Input: .c source text
| | Output: .i expanded source text
| - Macro expansion |
| - #include merge |
| - Conditional |
| compilation |
+-------------------+
|
v (textual C, all includes inlined)
+-------------------+ Tool: clang -fsyntax-only or cc1
| 2. Lexer/Scanner | Input: character stream (.i)
| | Output: token stream
| - DFAs for tokens |
| - Identifiers, |
| numbers, strings|
| operators |
+-------------------+
|
v (token stream)
+-------------------+ Tool: parser in cc1
| 3. Parser | Input: token stream
| | Output: Abstract Syntax Tree (AST)
| - Recursive |
| descent or LALR |
| - Syntax checking |
| - AST construction|
+-------------------+
|
v (AST)
+-------------------+ Tool: semantic analyzer in cc1
| 4. Semantic | Input: AST
| Analysis | Output: Decorated AST
| |
| - Type checking |
| - Name resolution |
| - Overload |
| resolution (C++)|
| - Scope analysis |
+-------------------+
|
v (decorated AST)
+-------------------+ Tool: IRGen in cc1
| 5. IR Generation | Input: Decorated AST
| | Output: LLVM IR (.ll) or GCC GIMPLE
| - AST → SSA IR |
| - Lowers language |
| constructs to |
| primitive ops |
+-------------------+
|
v (LLVM IR / GIMPLE)
+-------------------+ Tool: opt (LLVM) / gimple passes (GCC)
| 6. Optimization | Input: IR
| Passes | Output: Optimized IR
| |
| - Constant fold |
| - DCE, CSE, LICM |
| - Inlining |
| - Loop transforms |
| - Vectorization |
+-------------------+
|
v (optimized IR)
+-------------------+ Tool: llc / GCC backend
| 7. Code Generation| Input: Optimized IR
| | Output: Assembly (.s) or object (.o)
| - Instruction |
| selection |
| - Register alloc |
| - Instruction |
| scheduling |
+-------------------+
|
v (.o object files)
+-------------------+ Tool: ld / lld / gold
| 8. Linker | Input: Object files + libraries
| | Output: Executable or shared library
| - Symbol resolution|
| - Relocation |
| - Section merging |
+-------------------+
|
v
Executable / .so
Core Content
Phase 1: Preprocessing
The C preprocessor (cpp) runs before the compiler sees the source. It operates purely on text:
Macro expansion: #define MAX(a,b) ((a)>(b)?(a):(b)) is textually expanded at each use site. Object-like macros (#define PI 3.14159) replace the token. Function-like macros replace with argument substitution. The # operator stringifies an argument; ## concatenates tokens.
File inclusion: #include <stdio.h> textually inserts the entire content of stdio.h at that point. A typical C++ TU after preprocessing is 100–500KB due to cascaded header inclusion, even if the original source was 50 lines. This is the central cause of C++ compilation slowness and is addressed by precompiled headers (PCH) and C++20 modules.
Conditional compilation: #ifdef NDEBUG ... #endif enables build-time feature selection. The preprocessor evaluates #if expressions using the token-level integer arithmetic of the preprocessor (distinct from C arithmetic).
Run just the preprocessor: clang -E hello.c -o hello.i
Phase 2: Lexing / Scanning
The lexer (lexical analyzer) converts a stream of characters into a stream of tokens. Tokens are the atomic vocabulary units of the language: keywords (if, while, struct), identifiers (main, count), literals (42, 3.14f, "hello"), operators (+, ->, <<), and punctuation ({, ;).
Lexers are typically implemented as DFAs (Deterministic Finite Automata) — one DFA per token class. For example, the integer literal DFA accepts the regex [0-9]+. The lexer tries to match the longest token at the current position (maximal munch rule).
Lexer generators: flex generates a C lexer from a specification. LLVM implements its own hand-written lexer in lib/Support/SourceMgr.cpp and language-specific lexers in each frontend.
Tokens carry a source location (file, line, column) for error message generation. This is why the compiler can say "error at hello.c:42:13."
Phase 3: Parsing
The parser consumes the token stream and produces an AST (Abstract Syntax Tree). The AST is a tree representation of the syntactic structure — an if statement becomes an IfStmt node with three children: condition expression, then-body, else-body.
Recursive descent parsing: A hand-written parser where each grammar rule is a function. parseExpression() calls parseAddSub(), which calls parseMulDiv(), which calls parseUnary(), which calls parsePrimary(). The call stack mirrors the grammar derivation. Clang uses recursive descent.
LALR(1) parsing: A table-driven bottom-up parsing algorithm. The grammar is pre-analyzed to produce a set of parse tables (ACTION and GOTO tables). The parser uses a stack to recognize handles. GCC historically used LALR parsing generated by bison; it switched to a hand-written recursive descent parser in GCC 4.x for better error messages.
The AST is "abstract" in that it drops purely syntactic information (parentheses are represented implicitly by the tree structure, semicolons are not in the tree).
Phase 4: Semantic Analysis
Type checking and name resolution operate on the AST:
Name resolution: When the parser encounters count + 1, semantic analysis resolves count to the specific variable declaration in scope. It builds and queries a symbol table (scope-aware map from name to declaration).
Type checking: Each expression is annotated with its type. The binary + operator requires compatible operand types; if they don't match, implicit conversion rules are applied (integer promotion, arithmetic conversion) or an error is reported. Function calls are checked against the declared parameter types.
C++ specific: Overload resolution (selecting among overloaded functions based on argument types), template instantiation (generating concrete function/class code for template arguments), and ADL (Argument-Dependent Lookup) are performed during semantic analysis. C++ semantic analysis is notoriously complex — the C++ standard dedicates 400+ pages to it.
Phase 5: IR Generation
The decorated AST is lowered into the compiler's Intermediate Representation (IR). IR is a language-independent, machine-neutral code representation designed for optimization.
LLVM IR uses Static Single Assignment (SSA) form — every variable is assigned exactly once. The IR is explicitly typed, control flow is represented as a CFG of basic blocks, and phi nodes at block entry points merge values from different predecessor blocks:
; Example: int factorial(int n) { return n <= 1 ? 1 : n * factorial(n-1); }
define i32 @factorial(i32 %n) {
entry:
%cmp = icmp sle i32 %n, 1
br i1 %cmp, label %base, label %rec
base:
ret i32 1
rec:
%n_minus_1 = sub i32 %n, 1
%rec_result = call i32 @factorial(i32 %n_minus_1)
%result = mul i32 %n, %rec_result
ret i32 %result
}
GCC GIMPLE: GCC's IR is GIMPLE — a simplified subset of C where expressions are broken into three-address code (x = y op z or x = op y). GCC also has a lower-level IR called RTL (Register Transfer Language) for the back-end.
Phase 6: Optimization Passes
Optimization operates on IR. The canonical LLVM optimization pipeline at -O2 includes approximately 100+ passes. Key passes:
Constant folding: Evaluate constant expressions at compile time. 2 + 3 → 5. Constant propagation propagates known constant values through SSA use-def chains.
Dead Code Elimination (DCE): Remove instructions whose results are never used. Combined with constant folding, this can eliminate entire conditional branches whose condition is always true or false.
Common Subexpression Elimination (CSE): If a + b is computed twice with the same values (and no intervening modification), compute it once and reuse.
Inlining: Replace a function call with the function body. The inliner's cost model computes the call overhead savings vs the code size increase. This is the highest-impact optimization because it enables all subsequent optimizations to see across the former function boundary.
Loop Invariant Code Motion (LICM): Hoist computations out of loops when the result doesn't change across iterations. for (i=0; i<n; i++) { x = a*b; arr[i] += x; } → hoist a*b before the loop.
Loop Vectorization (SLP, Loop Vectorizer): Recognize loops over arrays and emit SIMD instructions. At -O3 or with -fvectorize, LLVM emits SSE/AVX on x86, NEON on AArch64.
Phase 7: Code Generation
The optimized IR is lowered to machine code for a target architecture:
Instruction selection: Map IR operations to target instructions. add i32 %a, %b might map to addl %eax, %ebx on x86. LLVM uses the SelectionDAG or GlobalISel framework. GCC uses RTL patterns.
Register allocation: IR has unlimited virtual registers; physical machines have limited registers (16 GP registers on x86-64). Register allocation assigns virtual registers to physical registers, spilling to the stack when registers run out. Linear Scan and graph-coloring are the standard algorithms.
Instruction scheduling: Reorder instructions to avoid pipeline stalls (data hazards). Modern out-of-order CPUs handle much of this in hardware, but software scheduling still helps on in-order cores (ARM Cortex-A53) and for avoiding structural hazards.
Phase 8: Assembly and Linking
The assembler converts assembly mnemonics to machine code bytes plus relocation entries (unresolved symbol references). The linker then combines multiple object files, resolves symbol references, applies relocations, and produces the final executable. See 03-linkers-and-loaders.md for full detail.
Translation Unit Model
In C/C++, each source file (.c, .cpp) is an independent translation unit (TU). The compiler processes each TU independently. The linker combines TUs. This is why declarations (function prototypes) in headers enable cross-TU calls even though the compiler only processes one TU at a time — the linker resolves the actual address.
This model enables separate compilation: changing one .c file requires recompiling only that TU (make's incremental build). The cost: the compiler cannot optimize across TU boundaries. Link-Time Optimization (LTO) overcomes this by writing the IR (not machine code) into object files and performing optimization at link time, after all TUs have been combined.
Historical Context
The structure of the compilation pipeline was established in the 1960s–70s. Dennis Ritchie's C compiler for Unix (1972) was one of the first self-compiling compilers. The Dragon Book (Aho, Sethi, Ullman, 1977/1986) codified the phases as a pedagogical framework. LALR parser generation by Yacc/bison made parser construction routine. The GCC project (1987, Richard Stallman) demonstrated a high-quality, free, multi-target compiler. LLVM (2000, Chris Lattner) modernized the IR design and pass framework, enabling clean front-end/back-end separation.
Production Examples
# See intermediate outputs at each stage
clang -E hello.c -o hello.i # After preprocessing
clang -fsyntax-only hello.c # Stop after semantic analysis
clang -emit-llvm -S hello.c -o hello.ll # LLVM IR (text)
clang -emit-llvm -c hello.c -o hello.bc # LLVM IR (bitcode, binary)
clang -S hello.c -o hello.s # Assembly
clang -c hello.c -o hello.o # Object file (no linking)
clang hello.c -o hello # Full pipeline
# Inspect LLVM optimization passes
clang -O2 -mllvm -print-after-all hello.c 2>&1 | head -200
# Measure compilation time by phase
clang -ftime-report hello.c 2>&1 | grep -v "^$"
Debugging Notes
- Macro expansion bugs:
clang -Eshows what the preprocessor produces; many subtle C bugs are macro expansion issues - Header include order issues:
clang -Hprints the include tree, showing which headers are included and how deep the tree is - Template instantiation errors in C++: Clang's error messages include a backtrace of instantiation chains; GCC's are more verbose.
-ftemplate-backtrace-limit=0removes the default limit. - LTO debugging: if a symbol is optimized away differently under LTO vs non-LTO, compare IR with
llvm-dison both the per-TU and the LTO-combined bitcode
Security Implications
- Undefined behavior in the optimizer: C/C++ UB (null pointer dereference, signed overflow, out-of-bounds) enables the optimizer to assume that UB never occurs, eliminating security-critical null checks that appear after the UB point.
-fsanitize=undefined(UBSan) catches these at runtime; compiling without UBSan masks them while the optimizer exploits them. -D_FORTIFY_SOURCE=2: Preprocessor macro that replacesmemcpy,strcpy, etc., with bounds-checking variants when the size is statically known. Zero cost when not violated, catches buffer overflows at runtime when violated.- Inlining security implications: Inlining can cause security checks to be optimized away if the check becomes trivially provable after inlining. Example: inlining a callee shows that the caller always passes positive n, so bounds checks on n<0 are eliminated.
Performance Implications
- Compilation time scales with TU size. C++ templates are the main offender: a TU that includes
<algorithm>and<string>has ~50,000 tokens before your code is reached. - LTO increases link time significantly (from seconds to minutes for large projects) but improves runtime performance by 5–15% by enabling cross-TU inlining and whole-program optimization.
-O0produces unoptimized code with one-to-one IR-to-instruction mapping. Useful for debugging (no variable elimination, no reordering). Production builds use-O2or-O3.
Failure Modes
- ODR (One Definition Rule) violations in C++: Multiple TUs define the same function differently. The linker picks one definition silently. The runtime behavior depends on which definition executes. Detected by LSAN's
-fsanitize=address -fno-omit-frame-pointeror LTO's whole-program analysis. - Incomplete optimization: UB allows the optimizer to remove security checks, bounds checks, or null checks, leading to security vulnerabilities in optimized builds that don't appear in debug builds.
- Register allocation spill performance cliff: A function with too many live variables for available registers causes many spills to stack — can cause 5–10x slowdown compared to a register-resident version.
Modern Usage
C++20 Modules replace the textual #include mechanism with a compiled module interface. The compiler emits a Binary Module Interface (BMI) file instead of a header, which can be read directly without re-parsing. This promises 10–100x compilation speedup for large C++ codebases. Clang and MSVC have production-quality module support; GCC support is maturing.
Future Directions
- MLIR: Multi-Level IR extends LLVM's approach by allowing custom IR dialects for domain-specific languages (ML frameworks, hardware description languages). The LLVM project is increasingly structured around MLIR as the foundation.
- Bolt (Binary Optimizer): Applies profile-guided optimization at the binary level (after linking), restructuring code layout to improve instruction cache utilization. Used at Meta to improve production binaries by 5–10%.
- Incremental compilation in C++: Avoiding full TU recompilation on header changes is an active research area; modules address part of this.
Exercises
- Trace a specific C program through each pipeline stage. Write
int main() { return 2+3; }. Produce the preprocessed output, LLVM IR, and assembly at-O0and-O2. Explain how constant folding eliminates the addition in the optimized IR. - Write a bison grammar for a simple arithmetic expression language. Plug it into a flex lexer. Generate a parser and observe the shift/reduce actions for an ambiguous grammar.
- Implement a tiny constant folding pass in Python: take LLVM IR as text, parse
add i32 N, Mwhere N and M are integer literals, replace with the result. Run it on a file and verify the output is valid IR. - Measure the impact of inlining: write a function that calls a small function in a tight loop. Compile at
-O0,-O1,-O2, and-O3. Useobjdump -dto verify that inlining occurs at-O2+. Benchmark the result. - Observe an LTO optimization that is impossible without it: write two TUs where TU1 has a function that TU2 calls in a hot loop. Verify that without LTO the function is called via a real
callinstruction; with LTO, it is inlined.
References
- Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (Dragon Book, 2nd ed.). Addison-Wesley, 2006.
- Chris Lattner & Vikram Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation." CGO 2004.
- GCC Internals Manual: https://gcc.gnu.org/onlinedocs/gccint/
- LLVM Language Reference Manual: https://llvm.org/docs/LangRef.html
- Chandler Carruth, "Undefined Behavior: What Happened to My Code?" LLVM Developers' Meeting 2016.