Smith: Building a GPU Acceleration Library in JavaScript
A Float32Array and a Metal shader walked into the same memory address.
I had written a JavaScript GPT implementation called TinyFormer. It trained transformers from scratch using nothing but Float32Array loops. It ran on CPU. It was correct. It was also, by any reasonable measure, glacially slow. Training a tiny model on Shakespeare took minutes where it should take seconds, and the M3 Max's GPU sat idle the entire time, presumably bored.
Apple Silicon has a GPU accessible through Metal compute shaders, and — this is the part that matters — unified memory. The CPU and GPU share the same physical RAM. No cudaMemcpy. No staging buffers. No upload queues. A pointer allocated by Metal can be read directly by JavaScript through Bun's FFI, and vice versa. When I realized that a Float32Array in JavaScript and a device float* in a Metal shader point to the same bytes in memory, the project became inevitable.
Smith is the result. A GPU-accelerated tensor, autograd, and transformer library for Bun. Zero npm dependencies. About 2,500 lines of JavaScript and 500 lines of Metal shaders. The name is not complicated — a smith works metal.
The bridge
The entire native layer is one Objective-C file — gpu_bridge.m, roughly 200 lines — exposing about twenty C functions with a flat API. No Objective-C types cross the FFI boundary. JavaScript never touches MTLBuffer or MTLComputePipelineState directly. It sees opaque pointers, and that's deliberate. The native layer is a dumb pipe: allocate a buffer, load a shader, dispatch a kernel, wait.
import { dlopen, FFIType } from "bun:ffi"

// LIB_PATH is the path to the compiled native bridge library (defined elsewhere).
const { symbols: lib } = dlopen(LIB_PATH, {
  smith_init:            { returns: FFIType.ptr,  args: [] },
  smith_alloc:           { returns: FFIType.ptr,  args: [FFIType.ptr, FFIType.u64, FFIType.u32] },
  smith_buffer_contents: { returns: FFIType.ptr,  args: [FFIType.ptr] },
  smith_begin:           { returns: FFIType.ptr,  args: [FFIType.ptr] },
  // One encoder pointer, then three grid sizes and three threadgroup sizes.
  smith_dispatch:        { returns: FFIType.void, args: [FFIType.ptr, FFIType.u64, FFIType.u64, FFIType.u64, FFIType.u64, FFIType.u64, FFIType.u64] },
  smith_end_sync:        { returns: FFIType.void, args: [FFIType.ptr] },
  // ...
})
All intelligence lives in JavaScript. This is the central design decision and everything follows from it. The autograd graph, the operator dispatch logic, the model architecture, the training loop — all JavaScript. The GPU does multiplication. JavaScript decides what to multiply.
The zero-copy trick
Apple Silicon makes Smith possible in a way it couldn't be on CUDA. You call smith_alloc to create a Metal buffer. You call smith_buffer_contents to get the raw pointer. You wrap that pointer with Bun's toArrayBuffer. Now you have a Float32Array that both JavaScript and the GPU can read and write.
That's it. That's the entire host-device transfer story. There isn't one.
When you create a tensor in Smith, the backing store is a Metal shared-mode buffer. When a shader reads from it, it reads the same physical bytes. When JavaScript writes a gradient into it, the GPU sees the update on the next dispatch. The FFI overhead per kernel launch is roughly 1–5 microseconds. For comparison, a single cudaMemcpy on NVIDIA hardware typically starts at 5–10 microseconds for small buffers, and that's before the actual copy.
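The aliasing that the zero-copy path relies on can be illustrated without a GPU. In the sketch below a plain ArrayBuffer stands in for the memory that smith_buffer_contents would expose; in Smith that memory would come from wrapping the Metal pointer with Bun's toArrayBuffer, which is not shown here. The names sharedMemory, hostView, and fakeKernel are hypothetical and exist only for this illustration.

```javascript
// Stand-in for a Metal shared-mode allocation. This is an assumption for
// illustration: the real buffer would be created natively and wrapped via FFI.
const sharedMemory = new ArrayBuffer(4 * Float32Array.BYTES_PER_ELEMENT)

// The "host" view that JavaScript code reads and writes.
const hostView = new Float32Array(sharedMemory)

// Hypothetical stand-in for a GPU kernel: it builds its own view over the
// same bytes and doubles every element in place.
function fakeKernel(memory) {
  const deviceView = new Float32Array(memory)
  for (let i = 0; i < deviceView.length; i++) deviceView[i] *= 2
}

hostView.set([1, 2, 3, 4])
fakeKernel(sharedMemory)

// No copy happened: the host view observes the kernel's writes directly.
console.log(Array.from(hostView)) // [2, 4, 6, 8]
```

Two typed-array views over one ArrayBuffer behave the way a Float32Array and a device float* behave over one shared-mode buffer: a write through either is visible through the other.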
Autograd: porting the brain
TinyFormer's autograd is a DAG — each variable carries _deps (parent variables) and _backward (a gradient closure). Call backward() on the loss and it topological-sorts the graph, walks it in reverse, and accumulates gradients. This handles weight tying, residual connections, any topology where a single tensor feeds multiple consumers. A tape-based autograd can't do this without extra bookkeeping.
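To make the DAG walk concrete, here is a minimal scalar sketch of the same idea. Only the field names _deps and _backward come from the description above; the Value class and the add, mul, and backward helpers are hypothetical simplifications, not TinyFormer's or Smith's actual code.

```javascript
// Minimal scalar autograd node. _deps and _backward mirror the field names
// mentioned above; everything else is a hypothetical simplification.
class Value {
  constructor(data, deps = []) {
    this.data = data
    this.grad = 0
    this._deps = deps
    this._backward = () => {}
  }
}

function add(a, b) {
  const out = new Value(a.data + b.data, [a, b])
  out._backward = () => {
    a.grad += out.grad
    b.grad += out.grad
  }
  return out
}

function mul(a, b) {
  const out = new Value(a.data * b.data, [a, b])
  out._backward = () => {
    a.grad += b.data * out.grad
    b.grad += a.data * out.grad
  }
  return out
}

function backward(loss) {
  // Depth-first post-order yields a topological order: inputs before consumers.
  const order = []
  const seen = new Set()
  const visit = (v) => {
    if (seen.has(v)) return
    seen.add(v)
    for (const d of v._deps) visit(d)
    order.push(v)
  }
  visit(loss)
  loss.grad = 1
  // Walk in reverse so every consumer has contributed before a node propagates.
  for (let i = order.length - 1; i >= 0; i--) order[i]._backward()
}

// w feeds two consumers, like a tied weight: loss = w*x + w*w
const w = new Value(3)
const x = new Value(2)
const loss = add(mul(w, x), mul(w, w))
backward(loss)
console.log(w.grad, x.grad) // 8 3
```

Because gradients accumulate with += and each node runs its closure exactly once, w collects contributions from both consumers (x + 2w = 8) without any special handling.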
I ported it almost verbatim. Every backward formula — matmul (grad @ B^T, A^T @ grad), GELU (the five-term derivative), layernorm (three gradients from one kernel), the broadcast-aware addGrad reduction — came directly from TinyFormer, where they'd been validated by training convergence. I replaced T.mul(grad, mask) with gpuMul(grad, mask). The mathematical definitions stayed identical. The compute moved to Metal.
This is the part of the project that felt almost unfair. TinyFormer had spent months getting the backward pass right. Smith inherited all of it in an afternoon.
The contiguous problem (twice)
Smith's first real bug was a class of problem that PyTorch solved years ago, and that I had to rediscover from first principles.
gpuTranspose produces a virtual view — same underlying buffer, different shape and strides. The GPU shaders, however, read memory linearly. They don't know about strides. So when the matmul backward computed grad @ B^T, it was actually computing grad @ B, because the transpose was a polite fiction that the shader couldn't see.
The fix was contiguous() — detect non-standard strides, copy the data into a fresh buffer in the correct physical order. Same approach PyTorch uses. The insight arrived about four hours after the confusion started.
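A small CPU-only sketch shows both the failure and the fix for the 2D case. The tensor record, transposeView, isContiguous, and this contiguous are hypothetical stand-ins written for illustration; Smith's real versions operate on Metal buffers and may differ in detail.

```javascript
// Hypothetical 2D tensor record: flat data plus shape and strides
// (strides are measured in elements, not bytes).
function tensor(data, shape) {
  return { data: Float32Array.from(data), shape, strides: [shape[1], 1] }
}

// Transpose as a view: same data, swapped shape and strides.
function transposeView(t) {
  return {
    data: t.data,
    shape: [t.shape[1], t.shape[0]],
    strides: [t.strides[1], t.strides[0]],
  }
}

function isContiguous(t) {
  return t.strides[0] === t.shape[1] && t.strides[1] === 1
}

// Copy into row-major order so a kernel that reads memory linearly sees
// the logical layout.
function contiguous(t) {
  if (isContiguous(t)) return t
  const [rows, cols] = t.shape
  const out = new Float32Array(rows * cols)
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      out[r * cols + c] = t.data[r * t.strides[0] + c * t.strides[1]]
    }
  }
  return { data: out, shape: [rows, cols], strides: [cols, 1] }
}

const b = tensor([1, 2, 3, 4, 5, 6], [2, 3]) // [[1, 2, 3], [4, 5, 6]]
const bT = transposeView(b)
console.log(Array.from(bT.data))             // [1, 2, 3, 4, 5, 6] (what a stride-blind kernel reads)
console.log(Array.from(contiguous(bT).data)) // [1, 4, 2, 5, 3, 6] (the actual transpose)
```

The first log line is the bug in miniature: the view's data is byte-for-byte the original matrix, so any kernel that ignores strides silently computes with B instead of B^T.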
Then the same bug appeared again in Phase 5, wearing a different hat. Multi-head attention concatenates heads by reshaping a transposed tensor: reshape(transpose(attnOut, [1,0,2]), [seqLen, dim]). The transpose created a virtual view. The reshape blindly assigned contiguous strides to the same buffer. For seqLen=1 — the cached inference path — this accidentally worked, because there was nothing to interleave. For seqLen>1, head data got mixed with position data. The model still "trained" because the corruption was consistent. It wasn't computing attention correctly, but it was computing something confidently.
Fix: reshape now calls contiguous() on non-contiguous inputs. This is the second time the same conceptual bug appeared, and both times the symptom was "the math is subtly wrong in a way that doesn't crash."
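The head-merging case can be reproduced on the CPU with a three-axis example. This is a sketch under assumptions: rowMajorStrides, permuteView, and this version of contiguous and reshape are hypothetical helpers written for illustration, and the tensor values are chosen so that each number encodes head*100 + position*10 + feature.

```javascript
// Row-major strides for a shape, e.g. [2, 3, 4] -> [12, 4, 1].
function rowMajorStrides(shape) {
  const strides = new Array(shape.length).fill(1)
  for (let i = shape.length - 2; i >= 0; i--) strides[i] = strides[i + 1] * shape[i + 1]
  return strides
}

// Axis permutation as a view: same data, reordered shape and strides.
function permuteView(t, axes) {
  return { data: t.data, shape: axes.map(a => t.shape[a]), strides: axes.map(a => t.strides[a]) }
}

function contiguous(t) {
  const expected = rowMajorStrides(t.shape)
  if (t.strides.every((s, i) => s === expected[i])) return t
  const size = t.shape.reduce((a, b) => a * b, 1)
  const out = new Float32Array(size)
  const idx = new Array(t.shape.length).fill(0)
  for (let n = 0; n < size; n++) {
    let src = 0
    for (let d = 0; d < idx.length; d++) src += idx[d] * t.strides[d]
    out[n] = t.data[src]
    // Advance the multi-index like an odometer, last axis fastest.
    for (let d = idx.length - 1; d >= 0; d--) {
      if (++idx[d] < t.shape[d]) break
      idx[d] = 0
    }
  }
  return { data: out, shape: t.shape, strides: expected }
}

// The guarded reshape: materialize first, then reinterpret the shape.
function reshape(t, shape) {
  const c = contiguous(t)
  return { data: c.data, shape, strides: rowMajorStrides(shape) }
}

// [heads=2, seqLen=2, headDim=2]
const attnOut = {
  data: Float32Array.from([0, 1, 10, 11, 100, 101, 110, 111]),
  shape: [2, 2, 2],
  strides: [4, 2, 1],
}
const merged = reshape(permuteView(attnOut, [1, 0, 2]), [2, 4])
console.log(Array.from(merged.data)) // [0, 1, 100, 101, 10, 11, 110, 111]
```

Each output row now holds one position with head 0's features followed by head 1's. Without the contiguous() call the reshape would have relabeled the original bytes, putting positions 0 and 1 of head 0 in the first row.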
Shaders: where the actual work happens
Smith's Metal shaders follow a pattern. Each operation gets a dedicated .metal file with forward and backward kernels. The dispatch helper is 40 lines of JavaScript that translates tensor-level calls into Metal command encoding:
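// Encode and run one compute kernel, blocking until the GPU finishes.
function run(kernel, buffers, grid, group, params) {
  const pso = device.pipeline(kernel)   // pipeline state for the named kernel
  const enc = device.begin()            // start encoding commands
  device.setPipeline(enc, pso)
  for (const b of buffers) device.setBuffer(enc, b.buffer, b.index)
  if (params) device.setBytes(enc, params.data, params.data.byteLength, params.index)
  // Grid and threadgroup sizes default to 1 in any unused dimension.
  device.dispatch(enc, grid.x, grid.y || 1, grid.z || 1, group.x, group.y || 1, group.z || 1)
  device.endSync(enc)                   // finish encoding and wait for completion
}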
Everything dispatches through this function. Matmul, softmax, layernorm, flash attention, convolutions — they all call run() with a kernel name, some buffers, and a grid size. The shader does the work. JavaScript does the bookkeeping.
The matmul shader uses 32×32 tiles loaded into threadgroup shared memory, with each thread computing a 4×4 sub-tile. Standard GPU GEMM technique — memory coalescing on load, compute-bound on accumulate. For small matrices below 64 per dimension, a naive one-thread-per-element fallback avoids the tiling overhead. The softmax shader fuses max-reduction, exp, and normalize into a single kernel — five separate dispatches compressed into one, eliminating roughly 20 microseconds of FFI overhead per call. When softmax runs inside every attention head at every layer, that overhead compounds.
Flash attention: the one fused kernel that matters
Standard attention materializes a [seqLen, seqLen] score matrix per head. At sequence length 2048, that's 16MB per head in f32. Across 12 heads and 12 layers, you're looking at 2.3GB of transient memory allocated, written once during softmax, read once, and discarded.
Flash attention eliminates this. The trick is online softmax — maintaining running max and sum-of-exp statistics per tile, rescaling the accumulator when a new tile produces a larger maximum. The rescaling is exact, not an approximation. You can compute softmax over a million elements using memory for only 32 at a time.
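The rescaling step is easier to trust once you see it next to a plain two-pass version. The sketch below handles a single query row with scalar values instead of vectors, which is a simplification I am assuming for brevity; the function names are hypothetical and this is CPU JavaScript, not the Metal kernel.

```javascript
// Streaming softmax-weighted sum over one row of scores, `tile` at a time.
// State kept between tiles: running max m, running sum of exponentials l,
// and the running weighted accumulator acc.
function streamingAttentionRow(scores, values, tile = 32) {
  let m = -Infinity
  let l = 0
  let acc = 0
  for (let start = 0; start < scores.length; start += tile) {
    const end = Math.min(start + tile, scores.length)
    let tileMax = -Infinity
    for (let i = start; i < end; i++) tileMax = Math.max(tileMax, scores[i])
    const mNew = Math.max(m, tileMax)
    // Re-express everything accumulated so far relative to the new maximum.
    const scale = Math.exp(m - mNew)
    l *= scale
    acc *= scale
    for (let i = start; i < end; i++) {
      const p = Math.exp(scores[i] - mNew)
      l += p
      acc += p * values[i]
    }
    m = mNew
  }
  return acc / l
}

// Two-pass reference that looks at the whole row before normalizing.
function referenceAttentionRow(scores, values) {
  let m = -Infinity
  for (const s of scores) m = Math.max(m, s)
  let l = 0
  let acc = 0
  for (let i = 0; i < scores.length; i++) {
    const p = Math.exp(scores[i] - m)
    l += p
    acc += p * values[i]
  }
  return acc / l
}

// Deterministic example inputs.
const n = 1000
const scores = Array.from({ length: n }, (_, i) => Math.sin(i) * 10)
const values = Array.from({ length: n }, (_, i) => Math.cos(i * 0.5))
console.log(streamingAttentionRow(scores, values), referenceAttentionRow(scores, values))
```

The two results agree up to floating-point rounding. The multiplication by exp(m - mNew) is an algebraic identity, not an approximation, which is why the tiled version never needs the full row in memory.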
Smith implements FlashAttention-2 with Br=Bc=32 tile sizes, causal masking with early-exit, and a backward pass that recomputes attention weights from two saved scalars per row (the softmax max and sum). The entire O(n²) attention matrix is never materialized in either direction.
The backward kernel was the source of what I now think of as the trench coat bug — three separate problems stacked on top of each other pretending to be one. A missing threadgroup barrier let threads corrupt shared memory across steps. The buffer pool returned stale data through T.zeros() because recycled buffers aren't zeroed by Metal. And addGrad stored non-contiguous gradient views, making physically-transposed data look sign-flipped when compared element-by-element. Each fix was straightforward. Finding them required peeling back layers of indirection until the GPU kernel itself was vindicated — it had been correct the entire time.
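The stale-buffer problem is easy to reproduce in miniature. The pool below is a hypothetical, CPU-only stand-in keyed by length; Smith's real pool manages Metal buffers, and the class and method names here are assumptions for illustration.

```javascript
// Hypothetical size-keyed pool of Float32Array buffers.
class BufferPool {
  constructor() {
    this.free = new Map() // length -> array of released buffers
  }
  acquire(length) {
    const list = this.free.get(length)
    return list && list.length ? list.pop() : new Float32Array(length)
  }
  release(buf) {
    if (!this.free.has(buf.length)) this.free.set(buf.length, [])
    this.free.get(buf.length).push(buf)
  }
  zeros(length) {
    // A recycled buffer still holds its previous contents, so clear it.
    return this.acquire(length).fill(0)
  }
}

const pool = new BufferPool()
const first = pool.acquire(4)
first.set([1, 2, 3, 4])
pool.release(first)

const recycled = pool.acquire(4)
console.log(Array.from(recycled))      // [1, 2, 3, 4] (stale contents)
pool.release(recycled)
console.log(Array.from(pool.zeros(4))) // [0, 0, 0, 0]
```

A freshly allocated buffer happens to be zero here, which is exactly what makes this class of bug hide: it only shows up once buffers start being reused.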
From GPT-2 to GGUF: loading the world's models
Phase 6 added a safetensors parser (30 lines of DataView and TextDecoder, no dependencies) that loads GPT-2 weights from HuggingFace. The tricky part is GPT-2's fused c_attn projection: Q, K, and V packed into a single [dim, 3*dim] tensor. Smith decomposes attention into separate projections, so the loader walks the fused weight row by row, slicing each row into its Q, K, and V column blocks at load time, and re-fuses them for export.
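For a sense of how little the format demands, here is a minimal parser sketch. It relies on the published safetensors layout: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and data_offsets, then the raw tensor bytes. The function names parseSafetensors and buildExampleFile are hypothetical, only F32 is handled, and this is not Smith's actual loader.

```javascript
// Parse a safetensors file held in an ArrayBuffer (F32 tensors only).
function parseSafetensors(buffer) {
  const view = new DataView(buffer)
  // Bytes 0..7: little-endian u64 giving the JSON header length in bytes.
  const headerLen = Number(view.getBigUint64(0, true))
  const headerBytes = new Uint8Array(buffer, 8, headerLen)
  const header = JSON.parse(new TextDecoder().decode(headerBytes))
  const dataStart = 8 + headerLen

  const tensors = {}
  for (const [name, info] of Object.entries(header)) {
    if (name === '__metadata__') continue // optional free-form metadata entry
    if (info.dtype !== 'F32') throw new Error(`unsupported dtype ${info.dtype}`)
    // data_offsets are relative to the start of the data section.
    const [begin, end] = info.data_offsets
    // slice() copies the bytes, which also avoids Float32Array's alignment rule.
    const bytes = buffer.slice(dataStart + begin, dataStart + end)
    tensors[name] = { shape: info.shape, data: new Float32Array(bytes) }
  }
  return tensors
}

// Build a tiny in-memory file so the example runs without downloading anything.
function buildExampleFile() {
  const header = JSON.stringify({
    'wte.weight': { dtype: 'F32', shape: [2, 2], data_offsets: [0, 16] },
  })
  const headerBytes = new TextEncoder().encode(header)
  const file = new Uint8Array(8 + headerBytes.length + 16)
  const view = new DataView(file.buffer)
  view.setBigUint64(0, BigInt(headerBytes.length), true)
  file.set(headerBytes, 8)
  const dataStart = 8 + headerBytes.length
  for (let i = 0; i < 4; i++) view.setFloat32(dataStart + 4 * i, i + 1, true)
  return file.buffer
}

const tensors = parseSafetensors(buildExampleFile())
console.log(tensors['wte.weight'].shape, Array.from(tensors['wte.weight'].data))
// [ 2, 2 ] [ 1, 2, 3, 4 ]
```

This sketch copies each tensor out of the file. A loader that targets Metal-backed typed arrays could instead .set() the bytes straight into the destination buffer.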
Phase 9 added GGUF — the lingua franca of local LLM inference. Every llama.cpp model, every Ollama download, every quantized checkpoint on HuggingFace. If Smith can't read GGUF, it's a toy. If it can, it has access to thousands of pretrained models overnight.
The GGUF parser is a cursor-based reader over DataView, handling variable-length strings, typed metadata arrays, alignment padding, and block dequantizers for Q4_0, Q4_1, Q8_0, F16, and BF16. About 120 lines. Architecture mappings for Llama, Phi, and GPT-2 are plain objects — a lookup table that routes tensor names like blk.15.attn_q.weight to internal paths like blocks.15.mha.qProj.weight. Weight transposition is detected by comparing GGUF tensor shapes against model variable shapes. If they disagree, transpose. If they agree, don't. This heuristic handles every architecture correctly because the mapping already routes each tensor to the right variable — we only need to fix the memory layout.
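As an illustration of what a block dequantizer involves, here is a sketch for Q8_0, the simplest of the listed formats. It assumes the ggml block layout of one f16 scale followed by 32 signed 8-bit quants (34 bytes per block). The function names are hypothetical and the half-to-float conversion is written by hand so nothing depends on runtime support for 16-bit floats.

```javascript
// Convert an IEEE 754 half-precision bit pattern to a JS number.
function halfToFloat(bits) {
  const sign = bits & 0x8000 ? -1 : 1
  const exp = (bits >> 10) & 0x1f
  const frac = bits & 0x3ff
  if (exp === 0) return sign * frac * 2 ** -24 // subnormals and zero
  if (exp === 31) return frac ? NaN : sign * Infinity
  return sign * (1 + frac / 1024) * 2 ** (exp - 15)
}

const QK8_0 = 32               // elements per block
const BLOCK_BYTES = 2 + QK8_0  // f16 scale + 32 int8 quants

// Dequantize numElements values starting at byteOffset in buffer.
function dequantizeQ8_0(buffer, byteOffset, numElements) {
  if (numElements % QK8_0 !== 0) throw new Error('element count must be a multiple of 32')
  const view = new DataView(buffer, byteOffset)
  const out = new Float32Array(numElements)
  const numBlocks = numElements / QK8_0
  for (let b = 0; b < numBlocks; b++) {
    const base = b * BLOCK_BYTES
    const d = halfToFloat(view.getUint16(base, true)) // per-block scale
    for (let j = 0; j < QK8_0; j++) {
      out[b * QK8_0 + j] = view.getInt8(base + 2 + j) * d
    }
  }
  return out
}

// One hand-built block: scale 0.5 (0x3800 in f16), quants -16..15.
const block = new ArrayBuffer(BLOCK_BYTES)
const blockView = new DataView(block)
blockView.setUint16(0, 0x3800, true)
for (let j = 0; j < QK8_0; j++) blockView.setInt8(2 + j, j - 16)

const weights = dequantizeQ8_0(block, 0, 32)
console.log(weights[0], weights[31]) // -8 7.5
```

Q4_0 and Q4_1 follow the same shape with two 4-bit quants packed per byte (plus a per-block minimum for Q4_1), so the inner loop changes but the cursor arithmetic stays the same.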
Convolutions and the im2col bargain
Smith started as a transformer library, but convolutions share the same building blocks. Adding conv2d, pooling, and batchnorm made it general-purpose — and made CLIP possible.
The initial conv2d uses direct convolution: one thread per output element, inner loop over channels and kernel positions. For 3×3 kernels on Apple Silicon, this is competitive because unified memory means no im2col copy penalty. Then came Winograd for the common case — F(2×2, 3×3) computes a 2×2 output tile from a 4×4 input tile using 16 multiplications instead of 36. The Winograd matrices are fixed 4×4 transforms that you can hardcode as a sequence of adds and subtracts. No matrix library, no eigendecomposition. The gap between "sounds hard" and "is hard" turned out to be enormous.
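The 16-versus-36 claim can be checked directly on the CPU. The sketch below uses the standard F(2×2, 3×3) transform matrices from the Winograd minimal filtering literature and compares the result against direct convolution for one tile. The helper names are hypothetical, and the generic matmul is used only for brevity; a shader would expand the constant transforms into hardcoded adds and subtracts.

```javascript
// Standard F(2x2, 3x3) transform matrices.
const BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
const G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
const AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

// Small dense helpers on arrays of arrays.
const matmul = (a, b) =>
  a.map(row => b[0].map((_, j) => row.reduce((s, v, k) => s + v * b[k][j], 0)))
const transpose = m => m[0].map((_, j) => m.map(row => row[j]))

// d: 4x4 input tile, g: 3x3 filter, returns the 2x2 output tile.
function winograd2x2_3x3(d, g) {
  const U = matmul(matmul(G, g), transpose(G))    // transformed filter, 4x4
  const V = matmul(matmul(BT, d), transpose(BT))  // transformed input, 4x4
  // The only data-dependent multiplications: 16 elementwise products.
  const M = U.map((row, i) => row.map((u, j) => u * V[i][j]))
  return matmul(matmul(AT, M), transpose(AT))     // inverse transform, 2x2
}

// Direct version: 4 outputs x 9 multiplications each = 36.
function direct2x2_3x3(d, g) {
  const y = [[0, 0], [0, 0]]
  for (let i = 0; i < 2; i++)
    for (let j = 0; j < 2; j++)
      for (let r = 0; r < 3; r++)
        for (let c = 0; c < 3; c++)
          y[i][j] += d[i + r][j + c] * g[r][c]
  return y
}

const tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
const filter = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
console.log(winograd2x2_3x3(tile, filter), direct2x2_3x3(tile, filter))
```

In a real layer the filter transform U is computed once per filter and reused across every tile, so the per-tile cost is the input transform, the 16 products, and the output transform.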
For everything Winograd doesn't cover — 5×5, 7×7, strided 3×3 — im2col converts the problem into matrix multiplication and reuses the existing tiled GEMM shader. The three-way auto-dispatch selects Winograd, im2col, or direct based on structural criteria, no heuristics. User code doesn't change.
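The im2col path is mostly index arithmetic, sketched below for the single-channel, no-padding case. Those restrictions, along with the function names im2col and conv2dIm2col, are assumptions made to keep the example short; the inner dot product stands in for the tiled GEMM shader.

```javascript
// Unroll every K x K patch of an H x W image into one row of a matrix.
// The result is row-major with shape [outH * outW, K * K].
function im2col(image, H, W, K, stride) {
  const outH = Math.floor((H - K) / stride) + 1
  const outW = Math.floor((W - K) / stride) + 1
  const cols = new Float32Array(outH * outW * K * K)
  let p = 0
  for (let oy = 0; oy < outH; oy++)
    for (let ox = 0; ox < outW; ox++)
      for (let ky = 0; ky < K; ky++)
        for (let kx = 0; kx < K; kx++)
          cols[p++] = image[(oy * stride + ky) * W + (ox * stride + kx)]
  return { cols, outH, outW }
}

// Convolution becomes a [outH * outW, K * K] x [K * K, 1] matrix product.
function conv2dIm2col(image, H, W, kernel, K, stride) {
  const { cols, outH, outW } = im2col(image, H, W, K, stride)
  const out = new Float32Array(outH * outW)
  for (let r = 0; r < out.length; r++) {
    let sum = 0
    for (let c = 0; c < K * K; c++) sum += cols[r * K * K + c] * kernel[c]
    out[r] = sum
  }
  return { out, outH, outW }
}

// 5x5 image holding 0..24, 3x3 kernel of ones, stride 2 (a strided 3x3 case).
const image = Float32Array.from({ length: 25 }, (_, i) => i)
const ones = new Float32Array(9).fill(1)
const result = conv2dIm2col(image, 5, 5, ones, 3, 2)
console.log(result.outH, result.outW, Array.from(result.out)) // 2 2 [54, 72, 144, 162]
```

The cost of this bargain is the unrolled matrix: overlapping patches are duplicated, so memory grows by up to a factor of K×K in exchange for reusing one well-tuned matmul.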
With all three convolution paths plus pooling and batchnorm in place, Phase 15 built ResNet and CLIP as pure orchestration — composing existing ops in the right order. The ResNet builder is 340 lines. CLIP is 400. Every multiplication, every normalization, every attention score flows through the same GPU kernels tested in earlier phases. The value of having solid primitives became aggressively clear.
Mixed precision and the accumulator question
Apple Silicon runs f16 at double the throughput of f32. Phase 8 added _f16 variants of every shader — half-precision I/O with f32 internal accumulators for anything that sums: matmul inner products, softmax reductions, layernorm statistics. The pattern is simple to describe and touches every shader differently. Matmul loads half from device memory into f32 threadgroup shared memory. Softmax computes max and sum in f32, writes results as half. Flash attention does everything internally in f32 with half I/O at tile boundaries. Each kernel has its own "where does the cast happen" decision, and getting it wrong means either numerical garbage or a performance regression.
A k() helper selects the kernel variant: k('elementwise_add', tensor.dtype) appends _f16 when appropriate. One line replaced dozens of if/else chains across every op file.
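A helper like that can be very small. The version below is a guess at its shape, not Smith's actual code: the call form k(name, dtype) and the _f16 suffix come from the description above, while the dtype strings 'f16' and 'f32' are assumptions.

```javascript
// Hypothetical kernel-name selector: append _f16 for half-precision tensors.
// The dtype strings 'f16' and 'f32' are assumed for this sketch.
function k(name, dtype) {
  return dtype === 'f16' ? `${name}_f16` : name
}

console.log(k('elementwise_add', 'f16')) // elementwise_add_f16
console.log(k('elementwise_add', 'f32')) // elementwise_add
```

Keeping the naming rule in one place means an op file only ever names the base kernel, and adding another precision later touches a single function.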
The profiler that took 200 lines
After sixteen phases of compute kernels, model builders, and autograd infrastructure, I had no idea where time was actually spent. The fix took about an hour. Metal command buffers expose GPUStartTime and GPUEndTime. The profiler instruments the existing run() function — when profiling is enabled, it calls endTimed instead of endSync. Single boolean check on the hot path. Zero overhead when disabled.
The entire profiler is about 150 lines of JavaScript plus a short C function that reads two properties off a command buffer into a 24-byte struct. The gap between "no visibility" and "per-kernel GPU timing with memory tracking" was about 200 lines of code total. Bun's FFI and Metal's unified memory model made this trivially easy. No driver queries, no special compilation flags, no separate profiling builds.
What it adds up to
Smith, at the end of sixteen phases: GPU-backed tensors with full reverse-mode autograd. Flash attention. Winograd and im2col convolutions. Mixed precision f16 with dynamic loss scaling. Safetensors and GGUF weight loading. ResNet and CLIP model builders. KV-cached generation. A profiler. About 3,000 lines of JavaScript and 800 lines of Metal shaders. No dependencies beyond Bun and the macOS SDK.
It loads GPT-2 from HuggingFace and generates text. It loads GGUF Llama models and runs inference with KV cache. It classifies images through ResNet and computes CLIP embeddings. It trains from scratch with fused AdamW and cosine learning rate schedules.
What I learned
The unified memory architecture on Apple Silicon isn't just a performance feature — it's an API simplification feature. Half the code I didn't have to write is code that manages the boundary between host and device memory. Checkpoints load by .set()-ing into a typed array that happens to be a Metal buffer. The profiler reads GPU timing by casting a 24-byte C struct into a Float64Array. The GGUF loader copies dequantized weights directly into model parameters without a staging step. Every system that would normally require a transfer abstraction collapses into a pointer cast.
The other lesson is about the value of porting from a correct reference implementation. TinyFormer's autograd was battle-tested across months of development. Smith inherited every backward formula, every gradient reduction, every edge case around broadcasting and weight tying. The bugs I hit were all in the new layer — the GPU dispatch, the strided views, the shared memory synchronization. The math was always right. The plumbing took some convincing.
A GPU compute library in JavaScript sounds like a stunt. It's less absurd than it appears — JavaScript is the orchestration layer, Metal is the compute layer, and Bun's FFI is the 5-microsecond bridge between them. The language that manages the graph doesn't need to be fast. The shaders that do the arithmetic do.
The code is available on GitHub if you want to have a rummage around!
https://github.com/fredrikpaulin/SMITH