From Cranelift to LLVM: How Perry Got 24x Faster
Perry's backend migration from Cranelift to LLVM is finished. As of v0.5.12, LLVM is the sole code generation backend, and Perry now matches or beats Node.js on 14 of 15 benchmarks, with outright wins on 12 of them by margins from 1.06x to 24.6x.
Getting here was not a straight line. The initial cutover in v0.5.0 made some benchmarks nearly 70x slower than the Cranelift versions they replaced. This post is the long version of what happened: why we made the switch anyway, what broke, what fixed it, and what the numbers look like on the other side.
If you're building a compiler, evaluating codegen backends, or just curious why “switch to LLVM” is rarely as simple as it sounds, this is for you.
Part 1: Why Switch at All?
Perry compiles TypeScript directly to native machine code. No Node, no V8, no Electron, no WebView. The pitch is “write TypeScript, ship a native binary,” and the entire value proposition collapses if that binary isn't actually fast.
For Perry's first several minor versions, the codegen backend was Cranelift. Cranelift is excellent — it's the codegen behind wasmtime, it's used by SpiderMonkey's baseline JIT, and it's the tool of choice when you need fast, predictable compilation with a clean embedding story. For a project bootstrapping a new language, it was the right starting point.
But two things eventually pushed us off it.
1. The optimizer ceiling
Cranelift is intentionally a fast, single-tier optimizing compiler. Its mandate is “produce decent code quickly,” not “produce the best possible code given unlimited time.” That's the right tradeoff for a JIT. It's the wrong tradeoff for an AOT compiler whose entire selling point is native performance.
LLVM has had over two decades of work poured into its middle-end. Loop vectorization, LICM, GVN, SCCP, instruction combining, inlining heuristics, fast-math reassociation, alias analysis — there is no realistic universe in which a smaller project catches up. If Perry is going to claim “faster than Node,” we need that machinery.
2. The arm64_32 problem
The immediate forcing function was Apple Watch. arm64_32 is an ABI Apple introduced for the Series 4 onward — 64-bit instructions, 32-bit pointers. Cranelift doesn't support it, and there was no realistic path to it landing. For Perry to credibly claim “9 platforms from one codebase,” watchOS could not be missing. LLVM supports arm64_32 out of the box.
Once we accepted that some targets would require LLVM, maintaining two backends became untenable. Two backends means two sets of bugs, two sets of optimization passes, two test matrices, two performance baselines. The honest answer was: pick one.
We picked LLVM.
Part 2: A Word on Cranelift
Before going further: this post is not a Cranelift teardown. Cranelift is a brilliant piece of engineering, and if you're building a JIT, a sandboxed runtime, or anything where compile latency matters more than peak throughput, it should be near the top of your list. wasmtime ships it for good reason. The Bytecode Alliance has been doing exemplary work.
Perry's needs are just different. We compile ahead of time, we ship the binary once, and the user runs it millions of times. That asymmetry — compile rarely, execute always — is exactly the regime where LLVM's heavier optimizer pays for itself. Different tool for a different job.
Part 3: The Cutover Disaster
v0.5.0 was the first release with LLVM as the sole backend. We expected a small regression in compile time and a meaningful improvement in runtime performance. We got the compile-time regression as expected, and the exact opposite of the runtime improvement.
Here's the table I did not want to post at the time:
| Benchmark | Cranelift | LLVM v0.5.0 | Delta |
|---|---|---|---|
| method_calls | 16ms | 1,084ms | 68x slower |
| object_create | 5ms | 318ms | 64x slower |
| matrix_multiply | 61ms | 184ms | 3x slower |
| math_intensive | 370ms | 131ms | 2.8x faster |
| nested_loops | 32ms | 57ms | 1.8x slower |
| fibonacci(40) | 505ms | 1,156ms | 2.3x slower |
Some workloads got faster. Most got dramatically worse. method_calls — one of the most important benchmarks because it represents idiomatic TypeScript class usage — was nearly 70x worse than what we shipped two releases prior.
What actually went wrong
Perry uses NaN-boxing for value representation. Every TypeScript value is a 64-bit word. f64 numbers are stored directly; everything else (objects, strings, booleans, undefined, null) is encoded into the unused bits of an IEEE 754 quiet NaN.
The advantage: numbers are zero-cost. No boxing, no tagging, no allocation for arithmetic.
The disadvantage: every operation on a non-number value requires bit manipulation to unpack, operate, and repack. If those sequences live as inline IR in your codegen, the optimizer can fuse and simplify them. If they live as calls into runtime helper functions, the optimizer sees an opaque call and gives up.
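To make the encoding concrete, here is a minimal NaN-boxing sketch in Rust. The specific constants (`QNAN`, `TAG_PTR`) are illustrative, not Perry's actual bit layout:

```rust
// Minimal NaN-boxing sketch. QNAN/TAG_PTR are illustrative values,
// not Perry's actual bit assignment.
const QNAN: u64 = 0x7FF8_0000_0000_0000;    // quiet-NaN exponent + MSB of mantissa
const TAG_PTR: u64 = 0x0001_0000_0000_0000; // hypothetical "heap pointer" tag bit
const PAYLOAD: u64 = 0x0000_FFFF_FFFF_FFFF; // low 48 bits carry the pointer

fn box_number(n: f64) -> u64 {
    n.to_bits() // numbers are stored directly: zero cost
}

fn unbox_number(v: u64) -> f64 {
    f64::from_bits(v)
}

fn box_ptr(p: usize) -> u64 {
    QNAN | TAG_PTR | (p as u64 & PAYLOAD)
}

fn unbox_ptr(v: u64) -> usize {
    (v & PAYLOAD) as usize
}

fn is_number(v: u64) -> bool {
    // Real f64s, including the canonical hardware NaN (0x7FF8_0000_0000_0000),
    // never have both the quiet-NaN bits and a tag bit set.
    (v & (QNAN | TAG_PTR)) != (QNAN | TAG_PTR)
}
```

Every check like `is_number` is a couple of integer ops when inlined, and an opaque call when it is not. That distinction is the whole story of Part 3.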
Our Cranelift backend had grown a large number of inline lowerings for hot operations — property loads, method dispatch, object allocation, integer arithmetic on f64-tagged values. The LLVM cutover, in the interest of getting correct code out the door first, routed almost all of those through runtime helpers in perry-runtime. Each helper was a call instruction in LLVM IR.
LLVM is excellent, but it cannot inline a function whose body it has never seen. perry-runtime is compiled separately, linked in at the end, and from the optimizer's perspective every helper call is a black box. The result was that hot loops which the Cranelift backend had been compiling to ~5 instructions of inline arithmetic were now compiling to function calls — register saves, stack frame setup, the works — repeated millions of times.
That's where the 70x came from. Not bad codegen. Bad inlining boundaries.
Part 4: The Fix
The work to recover and surpass the Cranelift numbers fell into roughly six categories. None of them are exotic. Most are textbook compiler optimizations that just had to be applied in the right places.
1. Inline bump allocator for object allocation
object_create was the worst regression after method_calls. The old path called js_object_alloc_class_with_keys for every new Point() — a function call, a thread-local arena access, a shape-cache lookup, and a write of the GC header + object header.
The fix: emit the bump allocation inline in LLVM IR. Each function that allocates objects gets a cached pointer to a thread-local InlineArenaState struct. Allocation becomes:
```llvm
; state is a ptr to InlineArenaState { data: ptr, offset: i64, size: i64 }
  %off_ptr = getelementptr i8, ptr %state, i64 8
  %offset  = load i64, ptr %off_ptr   ; current bump offset
  %new_off = add i64 %offset, 96      ; GcHeader(8) + ObjectHeader(24) + 8 fields(64)
  %sz_ptr  = getelementptr i8, ptr %state, i64 16
  %size    = load i64, ptr %sz_ptr    ; current block capacity
  %fits    = icmp ule i64 %new_off, %size
  br i1 %fits, label %fast, label %slow

fast:
  store i64 %new_off, ptr %off_ptr    ; bump the offset
  %data = load ptr, ptr %state        ; data pointer at offset 0
  %raw  = getelementptr i8, ptr %data, i64 %offset
  store i64 <packed_gc_header>, ptr %raw  ; GcHeader as one i64
  ; ... object header and field stores follow ...

slow:
  call ptr @js_inline_arena_slow_alloc(ptr %state, i64 96, i64 8)
```

The fast path is ~13 instructions of inline IR that LLVM can see, schedule around, and hoist out of loops. object_create went from 318ms to 9ms.
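The same fast/slow split, written as a Rust sketch of a runtime the IR could target. `InlineArenaState` mirrors the field layout in the IR comment above; `slow_alloc` is a simplified stand-in for `js_inline_arena_slow_alloc`, not Perry's actual refill logic:

```rust
// Sketch of the inline bump allocator. Field order matches the IR:
// data at offset 0, offset at 8, size at 16.
struct InlineArenaState {
    data: *mut u8, // base of the current block
    offset: usize, // current bump offset
    size: usize,   // block capacity
}

unsafe fn alloc_object(state: &mut InlineArenaState, bytes: usize) -> *mut u8 {
    let new_off = state.offset + bytes;
    if new_off <= state.size {
        // Fast path: bump the offset, return a pointer into the current block.
        let raw = state.data.add(state.offset);
        state.offset = new_off;
        raw
    } else {
        // Slow path: out of line, refill and retry.
        slow_alloc(state, bytes)
    }
}

unsafe fn slow_alloc(state: &mut InlineArenaState, bytes: usize) -> *mut u8 {
    // Grab a fresh block (leaked for simplicity in this sketch).
    let cap = bytes.max(64 * 1024);
    let block = Box::leak(vec![0u8; cap].into_boxed_slice());
    state.data = block.as_mut_ptr();
    state.size = cap;
    state.offset = bytes;
    state.data
}
```

The point is not the allocator itself, which is the textbook design, but that the `if` lives in IR the optimizer can see rather than behind a call boundary.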
2. i32 loop counters
NaN-boxing means every TypeScript number is f64. That includes loop counters. A `for (let i = 0; i < 100_000_000; i++)` loop with f64 induction variables is a disaster: f64 increment, f64 compare, f64-to-i64 conversion every time you index an array.
The codegen detects for-loops where the induction variable is provably integer-valued and allocates a parallel i32 stack slot. The loop condition switches from fcmp to icmp slt i32, eliminating the f64 counter entirely.
This moved array_write from 11ms to 3ms, nested_loops from 18ms to 9ms, and array_read from 11ms to 4ms.
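The shape of the rewrite, expressed in Rust, assuming the analysis has already proved the counter integer-valued and within i32 range (function names are illustrative):

```rust
// Before the rewrite: an f64 counter, with a float-to-integer
// conversion on every array index.
fn sum_f64_counter(arr: &[f64]) -> f64 {
    let mut total = 0.0;
    let mut i = 0.0f64;
    while i < arr.len() as f64 {
        total += arr[i as usize]; // conversion each iteration
        i += 1.0;
    }
    total
}

// After the rewrite: a parallel i32 slot drives the loop,
// so fcmp becomes icmp and the conversions disappear.
fn sum_i32_counter(arr: &[f64]) -> f64 {
    let mut total = 0.0;
    let mut i = 0i32;
    while (i as usize) < arr.len() {
        total += arr[i as usize];
        i += 1;
    }
    total
}
```

The two functions return identical results whenever the rewrite's preconditions hold, which is exactly what the provability check guarantees.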
3. Fast-math flags
We attach the `reassoc` and `contract` flags to every f64 arithmetic instruction. `reassoc` lets LLVM break serial accumulator chains into parallel ones, and `contract` allows fused multiply-add. We keep `nnan` and `ninf` off because Perry uses NaN bits as value tags.
With those flags, LLVM's loop vectorizer kicks in on math_intensive, which dropped from 131ms to 14ms — beating Node by 3.5x.
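To see why `reassoc` matters, compare a serial f64 sum with a manually four-way-split one in Rust. Without the flag, LLVM must preserve the serial evaluation order of the first form; with it, LLVM is licensed to rewrite toward the second, which breaks the loop-carried dependency and opens the door to vectorization:

```rust
// Serial sum: each add depends on the previous one.
fn sum_serial(xs: &[f64]) -> f64 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

// Four independent accumulators: the adds can overlap in the pipeline
// or be vectorized. This is the rewrite reassoc permits automatically.
fn sum_parallel4(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    for chunk in xs.chunks(4) {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    acc.iter().sum()
}
```

The two can differ in the last bits for general floats, which is precisely why reassociation is opt-in; Perry accepts that tradeoff for arithmetic while keeping the NaN-sensitive flags off.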
4. Integer-modulo fast path
`%` on f64 in JavaScript is fmod, which is a libm call on ARM. But for integer-valued f64 operands, we can do `fptosi → srem → sitofp` and skip the libm round-trip entirely. The codegen uses static analysis to detect integer-valued operands — no runtime check needed.
This is the entire reason factorial went from 1,553ms to 24ms — and from Node's 591ms to 24ms. 24.6x faster than Node.
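The transformation itself is tiny. A Rust sketch of the fast path, valid only under the compiler-proven precondition that both operands are integer-valued and the divisor is nonzero:

```rust
// fptosi -> srem -> sitofp, as Rust. JavaScript's % is a truncated
// remainder, and so is integer %, so the results agree exactly for
// integer-valued operands (including negative ones).
fn js_mod_fast(a: f64, b: f64) -> f64 {
    debug_assert!(a.fract() == 0.0 && b.fract() == 0.0 && b != 0.0);
    ((a as i64) % (b as i64)) as f64
}
```

An integer remainder is a single instruction plus conversions, versus a full libm call with its own register-save discipline; inside factorial's hot loop that difference compounds.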
5. LICM for nested loops
LLVM does loop-invariant code motion out of the box, but NaN-boxing hides the structure. arr.length lowers to a load through a NaN-boxed pointer with a tag check — not obviously invariant.
The codegen detects the for (...; i < arr.length; ...) pattern and pre-loads the length into a stack slot before the loop, with a static walker verifying the loop body can't change the array's length. When the counter is bounded by this hoisted length, IndexGet/IndexSet skip bounds checks entirely.
6. Shape-cached objects
When the codegen knows the class of an object, it resolves field offsets at compile time and emits direct indexed loads — no runtime dispatch. For method dispatch, obj.method(args) becomes a direct call @perry_method_Class_name(this, args) — no vtable, no inline cache, no hash lookup.
The LLVM cutover had regressed this to the universal slow path. Restoring static dispatch gave us the method_calls recovery — from 1,084ms back down to 1ms. 11x faster than Node.
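A sketch of the offset resolution in Rust, with illustrative names. The 32-byte header constant follows the GcHeader(8) + ObjectHeader(24) layout used by the allocator above; Perry's actual shape representation is not shown here:

```rust
// Compile-time shape: an ordered list of field names for a known class.
struct Shape {
    fields: Vec<&'static str>,
}

// When the object's class is statically known, a property access resolves
// to a fixed byte offset at compile time: no hash lookup, no inline cache.
fn field_offset(shape: &Shape, name: &str) -> Option<usize> {
    const HEADER: usize = 32; // GcHeader(8) + ObjectHeader(24) precede the fields
    shape
        .fields
        .iter()
        .position(|f| *f == name)
        .map(|i| HEADER + i * 8) // each NaN-boxed field is one 64-bit word
}
```

The emitted IR for `p.y` then becomes a single indexed load at the resolved offset, which is what lets method_calls collapse from a dispatch-heavy loop into straight-line code.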
Part 5: The Numbers Today
Median of three runs, macOS ARM64 (Apple Silicon, M1 Max), Node.js v25:
| Benchmark | Perry | Node.js | vs Node |
|---|---|---|---|
| factorial | 24ms | 591ms | 24.6x |
| method_calls | 1ms | 11ms | 11x |
| loop_overhead | 12ms | 53ms | 4.4x |
| math_intensive | 14ms | 49ms | 3.5x |
| array_read | 4ms | 13ms | 3.2x |
| closure | 97ms | 303ms | 3.1x |
| array_write | 3ms | 8ms | 2.6x |
| string_concat | 1ms | 2ms | 2x |
| nested_loops | 9ms | 16ms | 1.7x |
| prime_sieve | 4ms | 7ms | 1.7x |
| matrix_multiply | 21ms | 34ms | 1.6x |
| fibonacci(40) | 932ms | 991ms | 1.06x |
| binary_trees | 9ms | 9ms | tied |
| mandelbrot | 24ms | 24ms | tied |
| object_create | 9ms | 8ms | 0.9x |
Twelve wins, two ties, one loss — Perry matches or beats Node on 14 of 15. The one loss is object_create, where V8's allocator is genuinely excellent and we're within 12%.
Part 6: The Compile-Time Question
The number-one reason people pick Cranelift over LLVM is compile speed. So let's talk about it.
LLVM increased Perry's per-file compile time by 20-50ms, or roughly 8-19%. Not 5x. Not 2x. Single-digit-to-low-double-digit percent.
The reason is that codegen is not the bottleneck in Perry's pipeline. The breakdown for a typical file:
- SWC parsing: ~30%
- HIR lowering (AST → IR, type inference): ~25%
- IR transform passes (closure conversion, async lowering, inlining): ~15%
- Codegen (LLVM IR text emission + `clang -c -O3`): ~20%
- Linking (`cc` + runtime library): ~10%
Codegen is one slice of five. Even doubling that slice only moves the total by 5-10%. If you're building an AOT compiler where the user types `perry compile` once and then runs the binary forever, the calculus is: spend 20-50ms more at compile time, save up to 24x at every single execution.
Part 7: What I'd Do Differently
If I were starting Perry today and could skip straight to LLVM, I would not. The Cranelift phase was genuinely valuable. It let us iterate on the frontend without LLVM's complexity tax, it gave us a working baseline to compare against, and it forced us to keep our HIR clean enough to be portable across backends.
What I would do differently is the cutover itself. We shipped v0.5.0 with most operations going through runtime helper calls, intending to inline them later. That was wrong. The right order would have been: identify the hot paths first, lower them inline before the cutover, and only release once the LLVM backend was at least at parity.
The lesson is the boring one: optimization boundaries matter more than optimizer quality. LLVM is a remarkable piece of software, but it cannot help you with code it cannot see. If your codegen routes everything through opaque runtime calls, you have built a wall between your source program and every optimization pass that exists.
Wrapping Up
Perry is now LLVM-only, matching or beating Node on 14 of 15 benchmarks, and shipping. The migration took longer than I planned, hurt more than I expected in the middle, and is unambiguously the right call in retrospect. Cranelift got us to v0.5; LLVM is taking us the rest of the way.
If you want to try Perry:
```shell
brew install perryts/perry/perry
perry init my-app && cd my-app
perry compile src/main.ts -o my-app && ./my-app
```

Source: github.com/PerryTS/perry — Docs: docs.perryts.com — Run the benchmarks yourself: `cd benchmarks/suite && ./run_benchmarks.sh`
If you have questions, find bugs, or want to argue about codegen backends, the GitHub issues are open. I read them all.
— Ralph