Porting Karpathy's Autoresearch to JavaScript: A Study in Voluntary Suffering

Andrej Karpathy built a GPT pretrainer targeting eight H100s. I ported it to a single Mac, in JavaScript, then let an AI agent experiment with it overnight.

At this point the pattern should be familiar. Someone publishes a clean, well-documented ML project in Python. I look at it. I think: "I have a GPU library. I have tensors. How hard could it be?" Then I spend a week building infrastructure that the Python version got for free from import torch, and at the end I have the same model architecture running on a single laptop in a language that doesn't even have operator overloading.

Karpathy's autoresearch is a from-scratch GPT pretrainer. It downloads public domain text, trains a BPE tokenizer, and runs a training loop with a hybrid optimizer called MuonAdamW that uses Newton-Schulz orthogonalization on matrix parameters. The reference implementation targets eight H100 GPUs with torch.compile, bfloat16, and gradient accumulation across devices. The model — attention, MLP, embeddings, loss — is maybe 300 lines of Python. The scaling infrastructure around it is the rest.

I wanted the 300 lines. I got the rest too, plus an autonomous research loop, a memory management crisis, and the discovery that Metal compute buffers are, for all practical purposes, immortal.

The optimizer that does linear algebra homework

Standard AdamW tracks running averages of the gradient and its square, then updates parameters using the ratio. Karpathy's autoresearch uses it for embeddings, biases, norms, and the LM head — the non-matrix parameters.
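For reference, the AdamW update described above can be sketched on plain arrays. This is a minimal CPU sketch under stated assumptions, not Smith's GPU kernel: the name adamWStep, the state layout ({ m, v, t }), and the default hyperparameters are all illustrative.

```javascript
// Minimal AdamW step on plain typed arrays (illustrative; not Smith's kernel).
// state = { m, v, t } holds the two running averages and the step counter.
function adamWStep(param, grad, state, opts = {}) {
  const { lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8, weightDecay = 0 } = opts
  state.t += 1
  const bias1 = 1 - Math.pow(beta1, state.t) // bias correction for the mean
  const bias2 = 1 - Math.pow(beta2, state.t) // bias correction for the square
  for (let i = 0; i < param.length; i++) {
    state.m[i] = beta1 * state.m[i] + (1 - beta1) * grad[i]
    state.v[i] = beta2 * state.v[i] + (1 - beta2) * grad[i] * grad[i]
    const mHat = state.m[i] / bias1
    const vHat = state.v[i] / bias2
    param[i] -= lr * weightDecay * param[i]         // decoupled weight decay
    param[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps) // ratio of the two averages
  }
}
```

On the very first step the ratio is close to the sign of the gradient, so each parameter moves by roughly lr regardless of gradient scale.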

For the 2D matrix parameters (every linear projection in attention and the MLP) it uses Muon. Muon takes the gradient, applies Nesterov momentum, then runs five iterations of Newton-Schulz orthogonalization to approximate the orthogonal factor of the polar decomposition. That factor is the nearest semi-orthogonal matrix to the gradient: the same singular vectors, with every singular value pushed toward 1. Each update therefore moves all singular directions by a comparable amount instead of being dominated by the few largest singular values of the raw gradient.

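function newtonSchulz(g, nsSteps) {
  const [rows, cols] = g.shape
  const tall = rows > cols

  // Scale by the Frobenius norm so no singular value exceeds 1 before iterating
  let X = gpuScale(g, 1.0 / (frobeniusNorm(g) * 1.02 + 1e-6))

  for (let i = 0; i < nsSteps; i++) {
    const [a, b, c] = POLAR_COEFFS[i]
    // Gram matrix on the smaller side: X^T X for tall matrices, X X^T otherwise
    const A = tall ? matmul2d(transpose(X), X) : matmul2d(X, transpose(X))
    const AA = matmul2d(A, A)
    const B = T.create(A.shape, A.dtype)
    nsPoly(A, AA, B, b, c) // fills B with the polynomial in A (b*A + c*AA)
    const product = tall ? matmul2d(X, B) : matmul2d(B, X)
    nsCombine(X, product, a) // updates X in place (a*X + product)
  }

  return X
}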

Five sets of polynomial coefficients, computed offline to minimize approximation error, one set per iteration. Each iteration issues three matmuls (A, A times A, and the final product), so five iterations cost 15 matmuls per parameter. For a model with 20 linear layers, that's 300 matmul dispatches per optimizer step just for Newton-Schulz. On top of that: NorMuon variance reduction, Nesterov momentum, and cautious weight decay that masks the decay where gradient and parameter disagree in sign. All of this per parameter, per step, in JavaScript, dispatched to Metal compute shaders.
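A CPU reference makes the iteration easier to test than the GPU path. The sketch below mirrors the structure of the loop above on plain nested arrays. Everything in it is an assumption for illustration: the helper names, the array representation, and the reading of the update as X = a*X + (b*A + c*A^2)*X. The tuned POLAR_COEFFS are not reproduced here, so the caller supplies the coefficient list.

```javascript
// CPU reference for a Newton-Schulz style iteration on nested arrays (illustrative).
function matmul(a, b) {
  const out = a.map(() => new Array(b[0].length).fill(0))
  for (let i = 0; i < a.length; i++)
    for (let k = 0; k < b.length; k++)
      for (let j = 0; j < b[0].length; j++) out[i][j] += a[i][k] * b[k][j]
  return out
}
const transpose = (m) => m[0].map((_, j) => m.map((row) => row[j]))
const frobeniusNorm = (m) => Math.sqrt(m.flat().reduce((s, v) => s + v * v, 0))

// coeffs is a list of [a, b, c] triples, one per iteration.
function newtonSchulzRef(g, coeffs) {
  const tall = g.length > g[0].length
  const scale = 1 / (frobeniusNorm(g) + 1e-6)
  let X = g.map((row) => row.map((v) => v * scale))
  for (const [a, b, c] of coeffs) {
    const A = tall ? matmul(transpose(X), X) : matmul(X, transpose(X))
    const AA = matmul(A, A)
    const B = A.map((row, i) => row.map((v, j) => b * v + c * AA[i][j]))
    const product = tall ? matmul(X, B) : matmul(B, X)
    X = X.map((row, i) => row.map((v, j) => a * v + product[i][j]))
  }
  return X
}
```

With the classic cubic coefficients [1.5, -0.5, 0] repeated a dozen times, a symmetric positive definite input such as [[3, 1], [1, 2]] converges to the identity, which is its orthogonal polar factor. That gives a cheap sanity check before swapping in tuned coefficients that trade exactness for fewer steps.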

The Frobenius norm is computed on CPU via unified memory because a single scalar isn't worth a reduction kernel. The reference stacks same-shape parameters into batch tensors for torch.compile; I process each parameter individually because Metal kernel dispatch is cheap (~2µs) and I didn't want the stack/unstack copy overhead.
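The cautious weight decay mentioned above is simple enough to show on plain arrays. This is a sketch of the masking idea only, assuming the update has already been computed. The function name is hypothetical, and treating a zero product as "agreeing" is an assumption.

```javascript
// Apply an update with sign-masked ("cautious") weight decay (illustrative).
// Decay is applied only where update[i] and param[i] agree in sign.
function applyCautiousDecay(param, update, lr, weightDecay) {
  for (let i = 0; i < param.length; i++) {
    const agree = update[i] * param[i] >= 0 // assumption: zero counts as agreement
    const decay = agree ? weightDecay * param[i] : 0
    param[i] -= lr * (update[i] + decay)
  }
}
```

Where the update would push a weight away from zero, the decay term is dropped instead of working against it.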

Loss = 8.3178, forever

The first training run produced a loss of 8.3178 and held it there. Every step. No change. I blamed the optimizer. I blamed the learning rate. I blamed the gradient flow through the residual connections.

8.3178 is ln(4096). The cross-entropy of a uniform distribution over a 4096-token vocabulary. I should have recognized it immediately. The model was outputting all-zero logits, softmax was producing uniform 1/V probabilities, and the loss was reporting the mathematically correct entropy of a random guess. Forward and backward were both broken, but the loss looked like "model hasn't learned yet" rather than "fundamental operation is returning zeros."
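The number is easy to reproduce. A minimal sketch, with a hypothetical crossEntropy helper that works on a single logit vector:

```javascript
// Cross-entropy (in nats) of one logit vector against a target index,
// using the log-sum-exp trick for numerical stability.
function crossEntropy(logits, target) {
  const max = Math.max(...logits)
  let sumExp = 0
  for (const z of logits) sumExp += Math.exp(z - max)
  return Math.log(sumExp) + max - logits[target]
}

const V = 4096
const zeroLogits = new Array(V).fill(0)
console.log(crossEntropy(zeroLogits, 7).toFixed(4)) // 8.3178
console.log(Math.log(V).toFixed(4))                 // 8.3178
```

Any loss that sits at exactly ln(vocabSize) is worth treating as a zero-output bug first and an optimization problem second.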

The tell came from sequence-length bisection. At T=32, healthy gradient norms. At T=64, exactly zero. The threshold mapped to TILE_THRESHOLD = 64 in matmul.js — the cutoff where Smith switches from the simple kernel to the tiled GEMM with threadgroup shared memory.

The tiled matmul kernel declared threadgroup float* shared [[threadgroup(0)]] — Metal's dynamic shared memory binding. This requires the host to call setThreadgroupMemoryLength before dispatch. The native bridge never made that call. Zero bytes allocated. The shared memory pointer was invalid. Tile loads wrote to nothing, tile reads returned zero, every output element was zero.

The fix: static threadgroup arrays instead of dynamic bindings. threadgroup float As[TILE_M * TILE_K] — Metal allocates those automatically. Three lines changed. The reduce_sum and reduce_max kernels had the same bug. The most dangerous GPU bug is the one that produces a plausible-looking wrong answer.
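A size sweep against a naive CPU reference would have caught this in seconds. The sketch below is an assumption about how such a check could look, not Smith's test suite: firstFailingSize, naiveMatmul, and the deliberately broken buggyKernel are hypothetical stand-ins.

```javascript
// Naive row-major matmul on flat arrays: (n x n) times (n x n).
function naiveMatmul(a, b, n) {
  const out = new Float64Array(n * n)
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++)
      for (let j = 0; j < n; j++) out[i * n + j] += a[i * n + k] * b[k * n + j]
  return out
}

// Returns the first size where `kernel` disagrees with the reference, or null.
function firstFailingSize(sizes, kernel, tol = 1e-6) {
  for (const n of sizes) {
    const a = Float64Array.from({ length: n * n }, (_, i) => ((i % 7) + 1) / 7)
    const b = Float64Array.from({ length: n * n }, (_, i) => ((i % 5) + 1) / 5)
    const want = naiveMatmul(a, b, n)
    const got = kernel(a, b, n)
    for (let i = 0; i < want.length; i++)
      if (Math.abs(want[i] - got[i]) > tol) return n
  }
  return null
}

// Stand-in for the broken tiled path: all zeros at or above the threshold.
const TILE_THRESHOLD = 64
const buggyKernel = (a, b, n) =>
  n >= TILE_THRESHOLD ? new Float64Array(n * n) : naiveMatmul(a, b, n)

console.log(firstFailingSize([16, 32, 63, 64, 65, 128], buggyKernel)) // 64
```

Testing sizes on both sides of every dispatch threshold matters because each branch is effectively a different kernel.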

The ops Smith didn't have

Autoresearch needed four things: grouped query attention with sliding windows, ReluSquared, RoPE that works with multi-head 3D tensors, and cross-entropy loss.

GQA and sliding windows went into the flash attention shader. The KV head for Q head h is h * numKVHeads / numQHeads with integer division, which is clean when numQHeads is a multiple of numKVHeads. Sliding window: if row - col > windowSize, score is -inf. Block-level optimization: entire column blocks outside the window are skipped with a continue rather than computed. For a window of 256 on a 512-length causal sequence, up to about a quarter of the score computation is skipped. One continue statement doing the work of a research paper.
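The indexing rules above fit in three small functions. This is a CPU sketch of the logic, not the Metal shader. The function names and the block layout (square blocks identified by their first row and first column) are assumptions.

```javascript
// Q head -> KV head mapping for grouped query attention (integer division).
function kvHeadFor(qHead, numQHeads, numKVHeads) {
  return Math.floor((qHead * numKVHeads) / numQHeads)
}

// Causal sliding-window mask: true when the score should be -inf.
function isMasked(row, col, windowSize) {
  return col > row || row - col > windowSize
}

// Block-level skip: a column block can be skipped for a row block when even
// its closest pair (first row of the row block, last column of the column
// block) is already outside the window.
function canSkipBlock(rowStart, colStart, blockSize, windowSize) {
  const lastCol = colStart + blockSize - 1
  return rowStart - lastCol > windowSize
}

console.log([0, 1, 2, 3, 4, 5, 6, 7].map((h) => kvHeadFor(h, 8, 2)))
// [0, 0, 0, 0, 1, 1, 1, 1]
```

The block test only has to look at one pair because row - col is smallest at that corner. If the corner is outside the window, every other pair in the block is too.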

The RoPE bug was subtle. Smith's RoPE kernel is 2D: [seqLen, dim]. The model needs RoPE applied to 3D [nHead, T, headDim] tensors. Flatten naively and head 1's rows get position indices T through 2T-1 instead of 0 through T-1. The fix — tileRoPETable() — creates a cos/sin table where positions 0..T-1 repeat for each head block. Four hours to diagnose. One function to fix.
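The idea behind the fix can be sketched as follows. This is not the actual tileRoPETable() from the repo. The function names, the half-dimension table layout, and the base of 10000 are assumptions chosen to show the tiling.

```javascript
// Build a RoPE cos/sin table for positions 0..T-1 (layout: [T, headDim / 2]).
function buildRopeTable(T, headDim, base = 10000) {
  const half = headDim / 2
  const cos = new Float32Array(T * half)
  const sin = new Float32Array(T * half)
  for (let t = 0; t < T; t++) {
    for (let i = 0; i < half; i++) {
      const angle = t / Math.pow(base, (2 * i) / headDim)
      cos[t * half + i] = Math.cos(angle)
      sin[t * half + i] = Math.sin(angle)
    }
  }
  return { cos, sin }
}

// Repeat the table once per head, so row h*T + t of a flattened
// [nHead * T, headDim] tensor uses position t, not position h*T + t.
function tileRopeTableSketch(nHead, T, headDim) {
  const { cos, sin } = buildRopeTable(T, headDim)
  const tiledCos = new Float32Array(nHead * cos.length)
  const tiledSin = new Float32Array(nHead * sin.length)
  for (let h = 0; h < nHead; h++) {
    tiledCos.set(cos, h * cos.length)
    tiledSin.set(sin, h * sin.length)
  }
  return { cos: tiledCos, sin: tiledSin }
}
```

A 2D kernel can then consume the flattened tensor unchanged, at the cost of a table that is nHead times larger.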

The zero-init gradient blockade was the educational one. The reference initializes output projections to zeros — intentionally making each transformer block an identity function at init. Side effect: no gradient flows back through attention on the first step, because dL/dAttnProj @ cProj^T = 0. My tests checked that cQ.grad was non-zero. They failed correctly. The tests were right, the model was right, and I was confused. Once I understood the design — identity residual at init — the fix was to test downstream parameters instead.
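A scalar stand-in shows the effect. This is a toy model with made-up names, not the real block: one "inner" weight feeding one zero-initialized "output" weight on a residual path, with the gradients worked out by hand.

```javascript
// Scalar stand-in for one residual block: y = x + wOut * (wIn * x), loss = y^2 / 2.
function residualBlockGrads(x, wIn, wOut) {
  const h = wIn * x      // branch output before the output projection
  const y = x + wOut * h // residual add
  const dY = y           // dL/dy
  const dWOut = dY * h   // gradient reaches the output projection
  const dH = dY * wOut   // gradient into the branch is scaled by wOut
  const dWIn = dH * x
  return { y, dWIn, dWOut }
}

console.log(residualBlockGrads(2, 0.5, 0)) // { y: 2, dWIn: 0, dWOut: 2 }
```

With wOut at zero the block is an identity (y equals x), the output weight gets a gradient, and everything behind it gets exactly zero until wOut has moved.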

Letting an AI loose on it

At this point I had a working GPT pretrainer. Loss went down. Validation BPB was 2.86. The architecture matched the reference. I could have stopped.

Instead I wrote an autonomous research loop. The concept is the Karpathy pattern: one machine, one file, one metric. An AI agent reads the model code, forms a hypothesis, edits model.js or train.js, commits, runs training via research.js, checks if val_bpb improved. If yes, the commit stays. If no, git reset --hard HEAD~1. Loop forever.

cd examples/autoresearch && bash start.sh

The start script launches Claude Code with --allowedTools "Bash(*),Edit,Read,Write" and a CLAUDE.md that says: "You are an ML researcher. Run experiments. Keep improvements. Discard regressions. Never stop."
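The keep-or-revert logic itself is small. Here is a sketch with the side effects injected so it can be dry-run without git or a GPU. The callbacks propose, run, keep, and revert are hypothetical and do not reflect the real research.js interface; in the actual setup the agent drives these steps itself.

```javascript
// Keep-or-revert loop with injected side effects (illustrative only).
function researchLoop({ baseline, maxExperiments, propose, run, keep, revert }) {
  let best = baseline
  const log = []
  for (let i = 0; i < maxExperiments; i++) {
    const change = propose(i)
    let valBpb
    try {
      valBpb = run(change) // train under the time budget, then evaluate val_bpb
    } catch (err) {
      revert(change)
      log.push({ i, status: "crashed" })
      continue
    }
    if (valBpb < best) { // lower bits-per-byte is better
      best = valBpb
      keep(change)
      log.push({ i, status: "kept", valBpb })
    } else {
      revert(change) // e.g. git reset --hard HEAD~1
      log.push({ i, status: "discarded", valBpb })
    }
  }
  return { best, log }
}
```

Comparing against the best result so far, rather than the previous run, is what makes every kept commit a strict improvement on the metric.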

I went to bed.

126 experiments later

The agent ran 126 experiments overnight. 28 kept, 94 discarded, 4 crashed. BPB improved from 2.86 to 2.23 — a 22% reduction. The agent's log reads like a compressed version of a research paper's ablation study:

The first insight was that the 60-second time budget limited training to about 20 steps. With 20 steps, the biggest lever is getting more steps per run, not making each step smarter. The agent shrank the model — depth 4 to depth 2, dim 256 to dim 128 — and gained steps. It removed warmdown entirely, because spending any of those 20 steps decaying the learning rate was a waste.

Then it got creative. It replaced ReluSquared with SwiGLU. Removed the logit soft-capping. Removed the initial embedding normalization. Switched to GQA with 2 KV heads instead of full multi-head attention. Cranked the sequence length from 512 to 4096 — the single largest improvement, because BPB rewards context. Increased the batch size to 32768 tokens per step.

The final configuration looks nothing like where I started:

// What I ported from the reference:
depth=4, dim=256, seq=512, batch=8192, MHA, ReluSquared, softcap, 7.34M params

// What the agent found:
depth=1, dim=128, seq=4096, batch=32768, GQA-2, SwiGLU, no softcap, 0.67M params

One transformer layer. The agent had empirically discovered that under a fixed time budget on a single Apple Silicon Mac, a tiny model with long context beats a larger model with short context. Every additional layer costs step time that would be better spent seeing more data.

The hit rate was about 22% — roughly one in five changes improved the metric. Most gains came from the first 30 experiments. After experiment 85, diminishing returns set in. The agent kept trying, because the CLAUDE.md told it to never stop, and it followed those instructions with the literal-mindedness that makes AI agents both useful and slightly unnerving.

The memory leak that 60 seconds hid

Encouraged by the agent's results, I tried a longer training run. --time-budget 600 instead of 60. The process was OOM-killed at step 36.

Every forward() call creates intermediate tensors — matmul outputs, attention scores, normalization statistics. Every backward() creates gradient tensors. Every muonAdamWStep() runs five Newton-Schulz iterations per Muon parameter, each producing six intermediate tensors. None of these were freed. At 60 seconds and ~20 steps, the accumulated tensors fit comfortably in memory. At 600 seconds and ~200 steps, they didn't.

The agent had run 126 experiments without encountering this. The 60-second time budget was short enough to mask the leak entirely. The architecture of the experiment — short runs, many iterations — selected against discovering it.

The fix was smith.using() — scoped tensor cleanup. Model weights and optimizer state are created before the training loop and survive. Everything inside the scope — activations, gradients, Newton-Schulz intermediates — gets returned to the buffer pool at scope exit.

for (let seq = 0; seq < seqsPerStep; seq++) {
  smith.using(() => {
    const { input, target } = trainLoader.next()
    const { loss } = forward(model, input, target)
    const scaledLoss = smith.scale(loss, 1.0 / seqsPerStep)
    smith.backward(scaledLoss)
    for (const p of params) if (p.grad) smith.retain(p.grad)
  })
}

The retain() calls are the interesting part. Gradients need to survive the scope because they accumulate across sequences. But the scope wants to dispose everything it allocated. So you retain the gradients, let the scope clean up everything else, then dispose the gradients manually after the optimizer step. It's a manual memory management dance in a garbage-collected language — Metal's own retain/release pattern, ported to JavaScript.
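The scope-and-retain mechanics can be modeled in a few lines. This toy tracker is an assumption about the general pattern, not Smith's implementation: the class name, the track() hook, and the dispose() method on tensors are all hypothetical.

```javascript
// Toy scope tracker: objects tracked inside using() are disposed at scope exit
// unless retain() was called on them first.
class ScopeTracker {
  constructor() {
    this.stack = []
  }
  track(tensor) {
    const scope = this.stack[this.stack.length - 1]
    if (scope) scope.add(tensor) // outside any scope, nothing is tracked
    return tensor
  }
  retain(tensor) {
    for (const scope of this.stack) scope.delete(tensor)
    return tensor
  }
  using(fn) {
    const scope = new Set()
    this.stack.push(scope)
    try {
      return fn()
    } finally {
      this.stack.pop()
      for (const t of scope) t.dispose() // runs even if fn throws
    }
  }
}
```

Anything created before the first using() call is never tracked, which is how weights and optimizer state would survive in this model. Retained objects become the caller's responsibility to dispose.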

Metal buffers are forever

This fixed the OOM at step 36. It moved it to step 61. Then step 182. Each fix revealed the next layer of the problem.

The final layer: Metal compute buffers, once used in a dispatch, cannot be freed through CFRelease. The command buffer takes an internal retain when you bind a buffer to a compute encoder. After the dispatch completes, waitUntilCompleted returns, ARC releases the command buffer, but device.currentAllocatedSize never decreases. I tried @autoreleasepool wrapping, explicit CFRelease, local ARC variables with = nil, nested autorelease pools. A buffer used in 8 dispatches had CFGetRetainCount of 9. Nothing I did through Objective-C could reduce it below 2.

The fix wasn't to make release work. It was to stop trying to release.

The buffer pool exists to avoid allocation overhead by reusing buffers. I had been undermining it by draining the pool between steps. With two changes — remove poolDrain() from the training loop, set MAX_PER_BIN = Infinity — every buffer stays pooled and gets reused. GPU memory stabilizes at ~450MB after step 2. The pool reaches 318 buffers at steady state with a 99% hit rate.
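A pool that never frees is simple to model. The sketch below is a toy, not Smith's pool: the class name, the power-of-two binning, and the injected allocate function are assumptions, and the only detail taken from the text is that there is no per-bin cap.

```javascript
// Toy size-binned buffer pool: release() never frees, acquire() reuses when it can.
class BufferPool {
  constructor(allocate) {
    this.allocate = allocate // injected allocator, called only on a miss
    this.bins = new Map()    // bin size -> list of free buffers
    this.hits = 0
    this.misses = 0
  }
  binFor(bytes) {
    let size = 1
    while (size < bytes) size *= 2 // round up to the next power of two
    return size
  }
  acquire(bytes) {
    const size = this.binFor(bytes)
    const free = this.bins.get(size)
    if (free && free.length > 0) {
      this.hits++
      return free.pop()
    }
    this.misses++
    return { size, data: this.allocate(size) }
  }
  release(buffer) {
    if (!this.bins.has(buffer.size)) this.bins.set(buffer.size, [])
    this.bins.get(buffer.size).push(buffer) // no cap, as with MAX_PER_BIN = Infinity
  }
  hitRate() {
    const total = this.hits + this.misses
    return total === 0 ? 0 : this.hits / total
  }
}
```

Because every training step requests the same shapes, the misses are front-loaded in the first steps, and from then on nearly every request is a hit.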

Sometimes the correct response to "I can't free this memory" is "then don't."

The actual ML is smaller than you think

The reference train.py is about 800 lines. Of those, roughly 300 are the model. Those 300 lines port almost line-for-line. The remaining 500 are scaling infrastructure: batch parallelism, torch.compile annotations, bfloat16, distributed gradients, NCCL. I threw away all 500 lines. Every one solves a problem I don't have.

Smith's version is about 800 lines too, but the composition is inverted. Instead of 300 lines of model and 500 lines of infrastructure, it's 400 lines of model (more verbose without operator overloading), 250 lines of data pipeline, and 150 lines of training loop. Zero lines of distributed anything.

And then an AI agent rewrote the model in 126 iterations, stripping it down to a single transformer layer with SwiGLU and GQA, because the constraint that matters on a single Mac isn't model capacity — it's steps per second.

The self-punishment question

This is the third time I've ported a significant ML system to JavaScript on Metal. GPT-2 inference. Whisper speech-to-text. Now GPT pretraining from scratch with an autonomous research agent. Each time the same sequence: look at a Python project, build the missing ops, wire them together, discover three bugs that produce plausible-looking wrong answers.

The Muon optimizer is 300 lines of JavaScript that does linear algebra I wouldn't have learned from a torch.optim.AdamW call. The tiled matmul zero-output bug taught me more about Metal's threadgroup memory model than any documentation. The buffer-retention discovery — that Metal drivers hold references below the Objective-C layer — took eight hours of CFGetRetainCount debugging to reach a conclusion I could have stated in one sentence.

Would pip install torch have been faster? Unquestionably. Would I have understood any of this? Not a chance.

The Python version of this project takes four commands: install, prepare, train, evaluate. My version took three new ops, a new optimizer with five GPU kernels, a data pipeline that downloads books from Project Gutenberg, a tensor lifecycle system, a memory management strategy built on the realization that Metal buffers are permanent residents, and an AI agent that ran 126 experiments while I slept and concluded that the model I spent a week porting should have one layer.

The self-punishment isn't the point. The self-punishment is the method. You learn the shape of a system by rebuilding it with your hands, not by calling its API. The calluses are the curriculum.

Or I'm addicted. Possibly both.

The code can be found here: https://github.com/fredrikpaulin/smith/examples/autoresearch