JIT Compilation

Technical Overview

Just-In-Time compilation occupies the design space between pure interpretation (execute bytecode/AST directly, no compilation, slow) and ahead-of-time compilation (compile everything statically, fast execution but no runtime profile information). A JIT compiler observes program behavior at runtime — which methods are called frequently, what types actually appear at polymorphic call sites, which branches are almost always taken — and generates highly optimized native code for hot paths while allowing cold paths to remain interpreted.

The fundamental insight is that most programs spend the bulk of their time in a small fraction of their code. This is the "90/10" rule of thumb; Knuth's 1971 empirical study of FORTRAN programs found that less than 4% of a program typically accounted for more than half of its running time. A JIT compiler concentrates compilation resources on that hot fraction. In mature implementations like HotSpot C2 or V8 TurboFan, the resulting code can match, and on some workloads outperform, equivalent statically compiled C++, because the JIT can specialize on runtime information that a static compiler cannot assume.

Prerequisites

Understanding of compiler basics: AST, IR, code generation
Familiarity with x86 or AArch64 instruction sets (basic level)
Understanding of CPU microarchitecture: branch prediction, instruction caches
JVM architecture (see 01-jvm-architecture.md) for HotSpot specifics

Tiered Compilation State Machine

                  Method first called
                         |
                         v
              +--------------------+
              | Level 0: Interpret |  <-- counters increment
              |   (no compilation) |
              +--------------------+
                /                \
   thresholds crossed,     thresholds crossed
   C2 queue is long        (common path)
           |                    |
           v                    v
+-------------------+   +-------------------+
| Level 2: C1       |-->| Level 3: C1       |
| (limited profile) |   | (full profiling)  |
| invocation and    |   | type profiles,    |
| back-edge counts  |   | branch counts     |
+-------------------+   +-------------------+
           |                    |
    method turns out     still hot, profile
    to be trivial        handed to C2
           |                    |
           v                    v
+-------------------+   +-------------------+
| Level 1: C1       |   | Level 4: C2       |
| (no profiling)    |   | (optimizing)      |
| terminal tier     |   | inlining, escape  |
|                   |   | analysis, SIMD    |
+-------------------+   +-------------------+
                                |
                         Deoptimization
                         (invalidated)
                                |
                                v
                    Back to Level 0/3
                    (recompile from
                     fresh profiles)

Core Content

Interpretation vs Compilation vs JIT

Pure interpretation: The interpreter fetches each bytecode, decodes it, performs the operation, advances the PC. No compilation. Overhead: ~10–50x vs native code for compute-intensive work. CPython, early Java 1.0, early Ruby MRI use this model.

AOT compilation: The compiler runs before execution. All code is compiled. Benefit: no warmup time. Drawback: without profile data from an earlier run, the compiler cannot use runtime type information (which concrete class appears at a virtual call site), cannot speculate on branch behavior, and cannot devirtualize calls that are de facto monomorphic at runtime. C and C++ use this model.

JIT compilation: Compilation occurs during execution. Runtime profile data, collected during interpretation or by lightly optimized compiled code, guides aggressive speculative optimization. Examples: HotSpot, V8, LuaJIT, PyPy, SpiderMonkey.

The JIT tradeoff: compilation itself consumes CPU and allocates code cache memory. A method compiled once and called 10 million times pays back the compilation cost quickly. A method called once should never be compiled.
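
The payback argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the class name and all cost figures (compile time, per-call interpreted cost, per-call compiled cost) are assumed round numbers, not measurements from any JVM.

```java
// Hypothetical cost model; every number here is an illustrative assumption.
final class JitBreakEven {
    /**
     * Smallest number of calls n for which compiling is cheaper than interpreting,
     * i.e. the smallest n with  compileNs + n * compiledNs < n * interpretedNs.
     */
    static long breakEvenCalls(long compileNs, long interpretedNs, long compiledNs) {
        if (compiledNs >= interpretedNs) {
            return Long.MAX_VALUE; // compiled code is no faster, so it never pays off
        }
        long savedPerCall = interpretedNs - compiledNs;
        return compileNs / savedPerCall + 1; // smallest n with n * savedPerCall > compileNs
    }

    public static void main(String[] args) {
        // Assumed: 1 ms to compile, 500 ns per interpreted call, 50 ns per compiled call.
        long n = breakEvenCalls(1_000_000, 500, 50);
        System.out.println("break-even after " + n + " calls");
    }
}
```

Under these assumed costs the method has to be called a couple of thousand times before compilation pays for itself, which is why a method called once is better left to the interpreter.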

HotSpot Tiered Compilation in Detail

Level 0 — Interpreter: Bytecode executed directly. The interpreter maintains two counters per method: invocation count (incremented on each call) and back-edge count (incremented on each loop back-edge). When these exceed thresholds, the method is submitted to the compiler queue.
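
The two-counter scheme can be modeled in a few lines. This is a toy sketch, not HotSpot's implementation: the class, its method names, and both threshold values are assumptions chosen for illustration, and the real policy also scales thresholds with compiler queue load.

```java
// Toy model of per-method hotness counters. Names and thresholds are
// illustrative assumptions, not HotSpot's data structures or defaults.
final class HotnessCounters {
    static final int INVOCATION_THRESHOLD = 200;  // assumed value
    static final int BACK_EDGE_THRESHOLD = 1_000; // assumed value

    private int invocations;
    private int backEdges;
    private boolean queuedForCompile;

    /** Called by the interpreter on every method entry. */
    void onInvoke() {
        invocations++;
        checkThresholds();
    }

    /** Called by the interpreter on every backward branch (one loop iteration). */
    void onBackEdge() {
        backEdges++;
        checkThresholds();
    }

    private void checkThresholds() {
        if (!queuedForCompile
                && (invocations >= INVOCATION_THRESHOLD || backEdges >= BACK_EDGE_THRESHOLD)) {
            queuedForCompile = true; // a real VM would enqueue a compile task here
        }
    }

    boolean isQueuedForCompile() {
        return queuedForCompile;
    }

    public static void main(String[] args) {
        HotnessCounters loopy = new HotnessCounters();
        loopy.onInvoke(); // called only once...
        for (int i = 0; i < 1_000; i++) {
            loopy.onBackEdge(); // ...but its loop runs 1000 times
        }
        System.out.println("queued for compilation: " + loopy.isQueuedForCompile());
    }
}
```

The back-edge counter is what lets a method that is called once but loops for a long time still become a compilation candidate.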

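Level 1 — C1 (no profiling): C1 is a fast, lightly optimizing compiler. Level 1 compiles without inserting profiling probes, so the code carries no instrumentation overhead. It is used for trivial methods (simple getters and setters, for example) where C2 would not produce meaningfully better code, and it is a terminal state: the method is not promoted further. Compilation is quick, often on the order of hundreds of microseconds for small methods.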

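Level 2 — C1 (limited profiling): Collects only invocation and back-edge counts, so the code runs faster than fully instrumented Level 3 code. Used when the C2 queue is long: rather than leaving a hot method in slow Level 3 code while it waits for C2, HotSpot compiles it at Level 2 first and moves it to Level 3 once the queue drains. Intermediate state.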

Level 3 — C1 (full profiling): Inserts instrumentation to collect: method invocation counts, loop back-edge counts, receiver type profiles at invokevirtual/invokeinterface call sites, branch frequency data. This profile data is later used by C2.

Level 4 — C2 (optimizing compiler): HotSpot's server-side optimizing compiler. The JVM's most aggressive optimization tier. C2 operates on an intermediate representation called the "Sea of Nodes" (Click & Paleczny 1995) — a graph where both data flow and control flow are represented as nodes in a single unified graph. This allows aggressive value numbering and code motion.

C2 uses the profile data collected at Level 3 to make speculative decisions:
- Inlines call targets that are always the same type (monomorphic call site)
- Eliminates branches that profile data shows are never taken
- Vectorizes loops where element count and type allow SIMD

C2 Optimizations

Method Inlining: The single most impactful optimization. The body of a callee is inserted at the call site, eliminating call overhead and enabling further optimization across the combined code. C2 inlines aggressively, subject to size limits measured in bytes of bytecode: on x86-64 the defaults are 325 bytes for frequently executed callees (-XX:FreqInlineSize=325) and 35 bytes for other callees (-XX:MaxInlineSize=35). Inlining enables the downstream optimizations that follow.

Escape Analysis: Determines whether an object reference escapes the current method after inlining (stored in a heap field, passed to another thread, returned). If an object does not escape:
- Scalar replacement: The object is decomposed into its fields, which are kept in registers or stack slots. There is no heap allocation, no object header, and nothing for the GC to reclaim. This is what is often loosely called "stack allocation"; C2 does not place whole objects on the stack.
- Lock elision: If an object doesn't escape, synchronized(obj) is eliminated, because no other thread can observe it.
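
A minimal candidate for this optimization looks like the sketch below. The class and method names are made up for illustration, and whether C2 actually removes the allocation depends on the JVM version, flags, and inlining decisions.

```java
// Illustrative only: a short-lived object that never leaves the method that creates it.
final class EscapeDemo {
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // The Point is not stored in a field, not returned, and not passed to other code,
    // so after the constructor is inlined the compiler can work on x and y directly.
    static int distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 20_000_000; i++) {
            sum += distanceSquared(i & 1023, 3);
        }
        System.out.println(sum); // printed so the loop cannot be discarded as dead code
    }
}
```

One way to check the effect on a product JVM is to run this with -Xlog:gc, once normally and once with -XX:-DoEscapeAnalysis; if the allocation is being eliminated, the second run should show noticeably more young-generation collections.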

Loop Optimizations: Loop unrolling (reduce loop overhead by processing multiple iterations per loop control check), loop vectorization (use SIMD instructions to process multiple elements simultaneously), loop invariant code motion (hoist computations that don't change across iterations out of the loop body).

Devirtualization: An invokevirtual is a virtual dispatch — potentially expensive. If the type profile shows that the call site has seen only one concrete type (monomorphic), C2 generates an optimistic inline with a type guard:

if (obj.getClass() == ExpectedType.class) {
    // inlined method body (fast path)
} else {
    // deoptimize / virtual dispatch (uncommon trap)
}

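SIMD Auto-Vectorization: C2 recognizes simple counted loops over arrays and emits SSE/AVX (x86) or NEON (AArch64) vector instructions. A loop that adds two float[] arrays element by element can process 8 floats per instruction with 256-bit AVX registers instead of 1. Floating-point reductions such as summing a float[] benefit less, because reordering the additions would change the rounding.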
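
The loop shape that auto-vectorizers generally handle best is a counted loop with no dependency between iterations. The example below is a sketch with assumed names; it shows the shape only and does not by itself prove that vector instructions are emitted on a given JVM or CPU.

```java
// Illustrative element-wise loop: each iteration touches only index i,
// so several iterations can be executed by one vector instruction.
final class VectorizableLoop {
    // Assumes a and b are at least as long as out.
    static void add(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int n = 1 << 16;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        for (int rep = 0; rep < 5_000; rep++) { // repeat so the method becomes hot
            add(a, b, out);
        }
        System.out.println(out[10]);
    }
}
```

Comparing run time with and without -XX:-UseSuperWord (the flag that controls C2's loop vectorizer) gives a rough indication of how much the vectorized version contributes.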

On-Stack Replacement (OSR)

OSR addresses the scenario where a long-running method is in the middle of execution in the interpreter when C2 finishes compiling it. Without OSR, the method would continue interpreting until it returns. With OSR:

A compiled OSR entry point is created at a loop back-edge
At the next back-edge, the JVM transfers execution to the compiled version
The interpreter state (local variables, operand stack) is converted to the compiled frame layout
Execution continues in compiled code mid-method

OSR is technically complex because the compiled code must be able to start execution at a non-entry point with pre-populated register/memory state.
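
A simple way to see OSR happen is a method that is called exactly once but loops for a long time. The program below is a sketch with made-up names; the iteration count is an arbitrary assumption chosen to be large enough for the loop to get hot.

```java
// Illustrative OSR trigger: sumTo is invoked once, so its invocation counter
// never gets hot; only the loop's back-edge counter can trigger compilation.
final class OsrDemo {
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i % 7; // cheap work per iteration
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumTo(200_000_000));
    }
}
```

Run with -XX:+PrintCompilation and look for a line for OsrDemo::sumTo marked with %, which is how HotSpot flags an OSR compilation.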

Deoptimization

Compiled code is based on speculative assumptions from profile data. When an assumption is violated at runtime, the JVM must deoptimize:

The compiled code hits an uncommon trap instruction (a call to the deoptimization handler)
The JVM reconstructs the interpreter state from the compiled frame
Execution resumes in the interpreter
Profiling continues; eventually a new, corrected compilation is triggered

Common deoptimization triggers:
- A new class is loaded that subclasses the receiver type of a call site assumed to be monomorphic
- A field assumed to be a constant (never written) is written
- A branch that was never taken is suddenly taken
- A reference that was assumed non-null turns out to be null

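Frequent deoptimizations severely harm performance. To observe them, look for made not entrant lines in -XX:+PrintCompilation output, enable -Xlog:deoptimization=debug on recent JDKs, or record uncommon traps with -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation.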
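
The first trigger in the list can usually be provoked on purpose. The sketch below uses assumed names and iteration counts: it warms a call site with a single receiver type, then introduces a second one. Whether and when the JVM deoptimizes depends on its version and flags.

```java
// Illustrative deoptimization trigger: one call site, first monomorphic, then not.
final class DeoptDemo {
    interface Op {
        int apply(int x);
    }

    static final class Inc implements Op {
        public int apply(int x) { return x + 1; }
    }

    static final class Dbl implements Op {
        public int apply(int x) { return x * 2; }
    }

    static int run(Op op, int rounds) {
        int v = 1;
        for (int i = 0; i < rounds; i++) {
            v = op.apply(v) & 0xFFFF; // the call site the JIT speculates on
        }
        return v;
    }

    public static void main(String[] args) {
        int sink = 0;
        // Phase 1: only Inc ever reaches the call site, so the JIT can inline it.
        for (int i = 0; i < 20_000; i++) {
            sink += run(new Inc(), 100);
        }
        // Phase 2: Dbl is loaded and reaches the same call site, breaking the assumption.
        for (int i = 0; i < 20_000; i++) {
            sink += run(new Dbl(), 100);
        }
        System.out.println(sink); // printed so the work cannot be optimized away
    }
}
```

With -XX:+PrintCompilation, the start of phase 2 is where a made not entrant line for DeoptDemo::run would be expected, followed by a fresh compilation.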

JIT and Security: JIT Spray

JIT spray is an exploitation technique where an attacker controls constants in JIT-compiled code (e.g., through scripted computation in JavaScript). Since JIT-compiled code is executable memory, the attacker can craft constants that, when interpreted as x86 instructions starting at misaligned offsets, form useful gadgets. Combined with a heap spray to put the JIT-compiled buffer at a predictable address, this provides a ROP gadget source that bypasses static ASLR of the binary itself.

Mitigations:
- Constant blinding: XOR each attacker-influenced constant with a random key at compile time. The JIT emits the blinded value and the key, plus an XOR instruction that unmasks the value at runtime, so the attacker's chosen bytes never appear verbatim in executable memory.
- Guard pages / randomized JIT code placement: Randomize the address of JIT code regions so that a sprayed buffer does not land at a predictable address.
- W^X enforcement: A JIT page is never writable and executable at the same time. Apple Silicon supports this in hardware by switching JIT regions between writable and executable per thread; other platforms rely on permission flips or OS policy. Some JITs instead maintain two mappings of the same memory, one writable (for emitting code) and one executable (for running it).
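
The arithmetic behind constant blinding is small enough to show directly. This sketch models only the masking and unmasking; the class and method names are assumptions, and a real JIT would emit the masked value and key as machine-code immediates rather than Java fields.

```java
import java.security.SecureRandom;

// Illustrative model of constant blinding; not taken from any real JIT.
final class ConstantBlinding {
    /** What the JIT would place in the instruction stream instead of the raw constant. */
    static final class Blinded {
        final int masked;
        final int key;

        Blinded(int masked, int key) {
            this.masked = masked;
            this.key = key;
        }
    }

    static Blinded blind(int constant, SecureRandom rng) {
        int key;
        do {
            key = rng.nextInt();
        } while (key == 0); // a zero key would leave the constant unchanged
        return new Blinded(constant ^ key, key);
    }

    /** What the emitted XOR instruction computes at runtime. */
    static int unblind(Blinded b) {
        return b.masked ^ b.key;
    }

    public static void main(String[] args) {
        int attackerChosen = 0x3C909090; // bytes an attacker might want in executable memory
        Blinded b = blind(attackerChosen, new SecureRandom());
        System.out.printf("emitted immediates: 0x%08X and 0x%08X%n", b.masked, b.key);
        System.out.printf("value at runtime:   0x%08X%n", unblind(b));
    }
}
```

Because the key is chosen fresh at compile time, the attacker cannot predict either immediate, while the program still computes with the original value.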

V8 JavaScript JIT

V8 (used in Node.js and Chrome) was designed around a two-tier pipeline, Ignition and TurboFan, and has since gained two tiers in between:

Ignition (bytecode interpreter): Source JavaScript is compiled to V8 bytecode by the parser + bytecode generator. Ignition interprets the bytecode. Ignition is a register machine (not stack machine), which simplifies the IR needed for the optimizer.

TurboFan (optimizing compiler): TurboFan takes Ignition bytecode plus feedback collected by Ignition's inline caches (ICs). ICs track what types appear at each operation. TurboFan produces highly optimized machine code. It was built on a "Sea of Nodes" IR similar to HotSpot C2's; the V8 team has since been migrating it to a more conventional control-flow-graph IR (Turboshaft).

V8 also features Sparkplug (a fast non-optimizing compiler that translates Ignition bytecode directly to machine code, filling the gap between Ignition and TurboFan) and Maglev (a mid-tier optimizing compiler that began rolling out in Chrome 114 and is enabled by default in Node.js 22).

LuaJIT

LuaJIT by Mike Pall is widely regarded as one of the best JIT implementations ever written. It uses a tracing JIT approach distinct from method-based JIT (HotSpot, V8):

Tracing starts at a hot loop back-edge
The interpreter records a trace — the linear path of instructions executed for one iteration
The trace is compiled to native code for that exact path
Side exits handle deviations (branches not taken in the trace)
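
The record, guard, and side-exit cycle can be modeled in miniature. The sketch below is a deliberately simplified toy, not LuaJIT's design: the loop body, the hotness threshold, and all names are assumptions, and the "compiled trace" is just a stored function rather than native code.

```java
import java.util.function.IntUnaryOperator;

// Toy model of a tracing JIT for one loop with one branch.
final class TraceSketch {
    static final int HOT_THRESHOLD = 3; // assumed: iterations interpreted before recording
    static int sideExits;               // how often execution left the trace

    static boolean rareBranch(int i) {
        return i % 8 == 0;
    }

    /** The "interpreter": handles both arms of the branch. */
    static int interpretIteration(int i, int acc) {
        return rareBranch(i) ? acc + 100 : acc + 1;
    }

    /** A recorded trace: the branch direction seen while recording, plus its straight-line work. */
    static final class Trace {
        final boolean guard;
        final IntUnaryOperator body;

        Trace(boolean guard, IntUnaryOperator body) {
            this.guard = guard;
            this.body = body;
        }
    }

    /** Records the path taken by iteration i. */
    static Trace record(int i) {
        if (rareBranch(i)) {
            return new Trace(true, acc -> acc + 100);
        }
        return new Trace(false, acc -> acc + 1);
    }

    static int runLoop(int n) {
        sideExits = 0;
        int acc = 0;
        int i = 0;
        for (; i < n && i < HOT_THRESHOLD; i++) {
            acc = interpretIteration(i, acc); // loop is still cold
        }
        if (i == n) {
            return acc;
        }
        Trace trace = record(i); // loop is hot: record the path of the next iteration
        for (; i < n; i++) {
            if (rareBranch(i) == trace.guard) {
                acc = trace.body.applyAsInt(acc); // guard holds: stay on the trace
            } else {
                sideExits++;
                acc = interpretIteration(i, acc); // guard fails: side exit to the interpreter
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println("result = " + runLoop(100) + ", side exits = " + sideExits);
    }
}
```

A real tracing JIT records an IR of the executed operations and compiles additional side traces for exits that themselves become hot; the toy keeps only the guard and fallback structure.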

For loop-heavy numeric workloads, LuaJIT can approach, and in some benchmarks match, the speed of equivalent C code, because traces eliminate interpreter dispatch overhead and the generated native code is tight. LuaJIT is used in NGINX/OpenResty for Lua scripting in deployments that serve very high request rates.

Warmup Time in Serverless

Cold-start latency is critical for serverless functions (AWS Lambda, Google Cloud Functions). A Java Lambda function can take 1–5 seconds to start (JVM initialization + class loading + application startup) before serving any request. After that, the JIT may need a further 30–120 seconds of traffic to reach steady state and fully optimize hot paths.

Mitigations:
- GraalVM Native Image: Compile the function to a native binary ahead of time. Cold start: 10–100ms.
- AWS SnapStart: Snapshot the initialized execution environment (a Firecracker microVM snapshot) and restore it on cold start instead of initializing from scratch.
- Quarkus / Micronaut: Frameworks designed for fast Java startup, moving as much initialization as possible to build time.
- Go / Rust: Languages with near-instant startup, no JIT warmup required.

Historical Context

McCarthy's LISP (1960) used an early form of dynamic compilation. The term "JIT" was coined in the 1990s. HotSpot was developed in the mid-1990s at Longview Technologies (which did business as Animorphic Systems) and acquired by Sun in 1997. The insight that adaptive optimization based on runtime profiles could beat static AOT was validated by the Self programming language at Stanford/Sun Labs (David Ungar, Craig Chambers, Urs Hölzle). Self's compilation techniques directly influenced HotSpot's design. Lars Bak, who had worked on the Self and HotSpot virtual machines, later led the development of V8.

Production Examples

# Print JIT compilation decisions
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions \
     -XX:+PrintInlining MyApp 2>&1 | head -100

# Output format: timestamp  compile_id  flags  tier  class::method  (size in bytes)
#   %   = OSR compilation
#   s   = synchronized method
#   !   = method has exception handler
#   b   = blocking compilation
#   n   = native method wrapper

# Disable tiered compilation (force C2 only)
java -XX:-TieredCompilation MyApp

# Check JIT code cache size
java -XX:+PrintCodeCache -version

# async-profiler for flame graphs (avoids safepoint bias; newer releases ship the launcher as asprof)
./profiler.sh -d 30 -f profile.html <pid>

Debugging Notes

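PrintCompilation output: a made not entrant line means a compiled method was invalidated (by deoptimization, or because a higher-tier version replaced it) and will not be entered by new calls; a made zombie line, on older JDKs, means the code was no longer in use and could be reclaimed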
Code cache exhaustion (-XX:ReservedCodeCacheSize) causes the JIT to stop compiling; the application continues in interpreted mode — severe performance cliff
Warm-up measurement: JMH (Java Microbenchmark Harness) handles JVM warmup correctly in benchmarks by using warm-up iterations before measuring
Method too large to inline: check with -XX:+PrintInlining for too big messages; consider refactoring hot methods into smaller units

Security Implications

JIT spray (described above) — mitigated by constant blinding and W^X pages
JIT bugs can produce incorrect native code silently — this has occurred in V8 and HotSpot; security-critical math should not rely on JIT correctness without validation
Disabling JIT (-Xint in HotSpot, --jitless in V8) eliminates the JIT attack surface at the cost of severe performance degradation (often 10x or more)

Performance Implications

First call to a method: no compilation cost (interpreter). After a few hundred calls: C1 compiled. After several thousand calls: C2 compiled and fully optimized. The exact thresholds depend on the JVM version, flags, and compiler queue load.
Code cache size affects how many methods can remain JIT-compiled. Default on 64-bit JVMs with tiered compilation: 240MB. Under load with many classes, this can be exhausted.
Escape analysis is fragile: a single escape point (a log statement that takes Object... varargs, boxing the value) can prevent scalar replacement of an otherwise non-escaping object.
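
The tier progression described in the first item can be made visible with a crude timing loop. This is a rough sketch with assumed names and batch sizes, not a substitute for JMH; the printed timings vary by machine and JVM, so none are shown here.

```java
// Illustrative warmup probe: time identical batches of work and watch them get faster.
final class WarmupTimer {
    /** A small compute kernel with a loop-carried dependency. */
    static long work(int n) {
        long h = 17;
        for (int i = 0; i < n; i++) {
            h = h * 31 + (i ^ (h >>> 7));
        }
        return h;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int batch = 0; batch < 10; batch++) {
            long start = System.nanoTime();
            for (int call = 0; call < 2_000; call++) {
                sink += work(1_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("batch " + batch + ": " + micros + " us");
        }
        System.out.println("(ignore) " + sink); // keeps the result live
    }
}
```

The first batches typically run mostly in the interpreter and C1 code and later batches in C2 code; running with -Xint for comparison shows what the timings look like when no tier ever kicks in.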

Failure Modes

JIT regression: An update to the JVM or a new class loaded at runtime causes a previously well-compiled method to be deoptimized and recompiled sub-optimally. Manifests as sudden throughput drop.
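Pathological deoptimization loop: A speculative optimization is compiled, then deoptimized, then re-compiled with the same incorrect assumption. Each cycle wastes CPU. Check for repeated made not entrant lines for the same method in -XX:+PrintCompilation output, or use -Xlog:deoptimization=debug on recent JDKs.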
Code cache thrash: When code cache is nearly full, methods are evicted and recompiled repeatedly. -XX:+UseCodeCacheFlushing (enabled by default) manages this, but sizing the cache appropriately (-XX:ReservedCodeCacheSize=512m) is preferable.

Modern Usage

Java 21's virtual threads (Loom) interact with JIT: JIT-compiled code running in a virtual thread can be parked/unparked transparently. The JIT does not need to know about virtual thread mechanics — the Loom scheduler operates below the JIT-compiled frame.

GraalVM's Graal JIT (written in Java itself) provides a substrate for custom language JITs via Truffle and can replace C2 in HotSpot via the JVMCI interface (-XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler).

Future Directions

Project Leyden (OpenJDK): Static images and "premain" — compile Java programs to partially-resolved, partially-optimized forms that start faster without full Native Image constraints on dynamism
CRaC (Coordinated Restore at Checkpoint): JVM-level CRIU checkpoint captures JIT-compiled state; restored JVM has full JIT warmup immediately
MLIR-based JIT backends: Research into replacing the C2 Sea-of-Nodes IR with MLIR dialects to enable better vectorization and hardware targeting
Profile-guided AOT: Use runtime profiles from a production run to guide ahead-of-time compilation, closing the gap between AOT and JIT quality

Exercises

Using JMH, benchmark a polymorphic call site with 1, 2, and 4 concrete receiver types. Observe how deoptimization frequency and throughput change as you introduce more types.
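Write a method that allocates a small Point object, performs a computation on it, and discards it. Confirm that the allocation is eliminated by scalar replacement: -XX:+PrintEscapeAnalysis and -XX:+PrintEliminateAllocations report this on debug builds of the JVM; on a product build, compare the allocation rate against a run with -XX:-DoEscapeAnalysis. Then introduce a logging statement that captures the object in a varargs call; verify that escape analysis is defeated.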
Reproduce a JIT spray scenario conceptually: write a JavaScript function (in Node.js) that embeds a large constant, run it in a tight loop, then use /proc/<pid>/maps to locate the JIT code region. Examine the memory with xxd.
Implement a microbenchmark measuring OSR overhead. Create a method with a long loop. Compare performance when the loop runs to completion once (OSR occurs mid-run) vs. running the same total work split across many short method calls (no OSR, method-level JIT applies from the first call).
Disable the JIT (-Xint) and measure the performance of a compute-heavy benchmark. Re-enable JIT with each tier disabled selectively (-XX:TieredStopAtLevel=1, =3, =4) to observe the contribution of each tier.

References

Cliff Click & Michael Paleczny, "A Simple Graph-Based Intermediate Representation." 1995 ACM SIGPLAN Workshop on Intermediate Representations.
Urs Hölzle & David Ungar, "Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback." PLDI 1994.
Mike Pall, LuaJIT: http://luajit.org/luajit.html — design notes in the mailing list archives
Andreas Gal et al., "Trace-based Just-in-Time Type Specialization for Dynamic Languages." PLDI 2009.
V8 Blog: https://v8.dev/blog — Ignition, TurboFan, Maglev, Sparkplug design articles
Ben Titzer, "A Tour Through the WebAssembly Optimizing Compiler." Strange Loop 2019.