Whisper in JavaScript: Speech Recognition on Apple Silicon, For Shame

Someone built insanely-fast-whisper in Python with one pipeline call. I built mine from WAV parsing to mel filterbanks to cross-attention, because apparently that's who I am.


The project that made me do this is called insanely-fast-whisper. It transcribes 150 minutes of audio in under 98 seconds using OpenAI's Whisper Large v3. The implementation is four lines of Python wrapping transformers.pipeline(). Four lines. The rest is HuggingFace Transformers, Optimum, and flash attention doing all the work behind a function call that could reasonably be described as cheating.

I looked at those four lines and thought: if they can do it, so can we.

"We" being Smith — a GPU acceleration library written in JavaScript that runs on Apple Silicon via Metal compute shaders. Smith had tensors, autograd, flash attention, convolutions, GGUF model loading, and sixteen phases of infrastructure that theoretically covered everything a transformer needs. Whisper is a transformer. How hard could it be?

The audio stack gap

Whisper needs audio in a very specific format: 16kHz mono float32 samples, chopped into 30-second chunks, transformed into 80-band mel spectrograms. The Python version gets this from librosa (audio decoding), torch.stft (spectrograms), and a handful of numpy operations that its authors presumably never had to think twice about.

Smith had none of this. Smith had matmul. Smith had attention. Smith did not have "parse a WAV file."

The gap between "has all the tensor operations" and "can accept an audio file" turned out to be about 600 lines of JavaScript that will probably never change. A WAV reader that parses RIFF chunks, decodes PCM int16/int32/float32, mixes stereo to mono, and resamples via linear interpolation to 16kHz. A radix-2 Cooley-Tukey FFT. A Short-Time Fourier Transform with Hann windowing. A triangular mel filterbank. Log normalization matching whisper.cpp's exact formula.

Each piece is small. The WAV reader is 70 lines — half of which is the RIFF chunk walker:

function readWav(buffer) {
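  // Accept an ArrayBuffer, a typed array, or a Node Buffer. A view can start at a
  // nonzero byteOffset inside a larger backing buffer, so honor offset and length.
  const view = buffer instanceof ArrayBuffer
    ? new DataView(buffer)
    : new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength)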
  if (view.getUint32(0, false) !== 0x52494646) throw new Error('Not a WAV file (missing RIFF header)')
  if (view.getUint32(8, false) !== 0x57415645) throw new Error('Not a WAV file (missing WAVE marker)')

  let offset = 12
  let fmt = null
  let dataChunk = null

  while (offset < view.byteLength - 8) {
    const id = String.fromCharCode(view.getUint8(offset), view.getUint8(offset + 1),
      view.getUint8(offset + 2), view.getUint8(offset + 3))
    const size = view.getUint32(offset + 4, true)
    if (id === 'fmt ') {
      fmt = { /* sample rate, channels, bit depth */ }
    } else if (id === 'data') {
      dataChunk = { offset: offset + 8, size }
    }
    offset += 8 + size
    if (size % 2 !== 0) offset++ // WAV chunks are word-aligned
  }
  // ...
}
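To give a feel for what the rest of a reader like this has to do once the data chunk is located, here is a minimal sketch of the int16 decode, mono mix, and linear-interpolation resample steps described above. The names pcm16ToMono and resampleLinear are hypothetical and not taken from Smith, and the sketch assumes little-endian interleaved 16-bit PCM.

```javascript
// Hypothetical helpers for illustration; not Smith's actual API.

// Convert interleaved 16-bit PCM to float32 in [-1, 1), averaging channels to mono.
function pcm16ToMono(view, offset, byteLength, channels) {
  const frames = Math.floor(byteLength / (2 * channels))
  const out = new Float32Array(frames)
  for (let i = 0; i < frames; i++) {
    let sum = 0
    for (let c = 0; c < channels; c++) {
      sum += view.getInt16(offset + 2 * (i * channels + c), true) / 32768
    }
    out[i] = sum / channels
  }
  return out
}

// Resample by linear interpolation between the two nearest source samples.
function resampleLinear(samples, fromRate, toRate) {
  if (fromRate === toRate) return samples
  const outLength = Math.floor(samples.length * toRate / fromRate)
  const out = new Float32Array(outLength)
  const step = fromRate / toRate
  for (let i = 0; i < outLength; i++) {
    const pos = i * step
    const i0 = Math.floor(pos)
    const i1 = Math.min(i0 + 1, samples.length - 1) // clamp at the last sample
    const frac = pos - i0
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac
  }
  return out
}
```

Linear interpolation with no low-pass filter in front of it will alias when downsampling from something like 44.1kHz. That is the price of the simple approach.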

The FFT is 35 lines — bit-reversal permutation, iterative butterfly, done. The mel filterbank is another 40. None of this code is interesting in isolation. The interesting thing is that it exists at all — six modules of audio processing that the Python ecosystem hides inside import librosa.
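For readers who have not written one, the bit-reversal-then-butterflies structure looks roughly like the sketch below. This is a textbook in-place radix-2 transform over separate real and imaginary arrays, written for illustration; the function name fftRadix2 is an assumption and this is not Smith's actual module.

```javascript
// In-place iterative radix-2 Cooley-Tukey FFT (forward transform).
// re and im hold the real and imaginary parts; length must be a power of two.
function fftRadix2(re, im) {
  const n = re.length
  if (n === 0 || n !== im.length || (n & (n - 1)) !== 0) {
    throw new Error('FFT length must be a nonzero power of two')
  }
  // Bit-reversal permutation: move each element to its bit-reversed index
  for (let i = 1, j = 0; i < n; i++) {
    let bit = n >> 1
    for (; j & bit; bit >>= 1) j ^= bit
    j ^= bit
    if (i < j) {
      const tr = re[i]; re[i] = re[j]; re[j] = tr
      const ti = im[i]; im[i] = im[j]; im[j] = ti
    }
  }
  // log2(n) butterfly stages, doubling the sub-transform size each time
  for (let size = 2; size <= n; size <<= 1) {
    const half = size >> 1
    const theta = -2 * Math.PI / size // negative sign = forward transform
    for (let start = 0; start < n; start += size) {
      for (let k = 0; k < half; k++) {
        const wr = Math.cos(theta * k)
        const wi = Math.sin(theta * k)
        const a = start + k
        const b = a + half
        const tr = re[b] * wr - im[b] * wi
        const ti = re[b] * wi + im[b] * wr
        re[b] = re[a] - tr
        im[b] = im[a] - ti
        re[a] += tr
        im[a] += ti
      }
    }
  }
}
```

A production version would normally precompute the twiddle factors instead of calling Math.cos and Math.sin inside the inner loop.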

This is what "zero dependencies" costs. Not in difficulty, but in surface area.

Cross-attention: the op Smith didn't have

Smith's multi-head attention only does self-attention — Q, K, and V all come from the same input. Whisper's decoder needs cross-attention, where Q comes from the decoder state but K and V come from the encoder output. Two different inputs, one attention operation.

The cleanest approach turned out to be the simplest: reuse the existing createMultiHeadAttention for the projection weights (they're the same shapes), and write a new forward path that takes a separate kv source:

function decoderBlock(x, encoderOut, block, causalMask) {
  // Self-attention: Q, K, V all from x
  const norm1 = layernorm(x, block.selfAttnLnW, block.selfAttnLnB)
  const { output: selfOut } = multiHeadAttention(norm1, block.selfAttn, causalMask)
  const x2 = add(x, selfOut)

  // Cross-attention: Q from decoder, K/V from encoder
  const norm2 = layernorm(x2, block.crossAttnLnW, block.crossAttnLnB)
  const crossOut = multiHeadCrossAttention(norm2, encoderOut, block.crossAttn)
  const x3 = add(x2, crossOut)

  // FFN
  const norm3 = layernorm(x3, block.ffnLnW, block.ffnLnB)
  return add(x3, linear(gelu(linear(norm3, block.ffn1)), block.ffn2))
}

This was the first time Smith encountered an encoder-decoder architecture. Every previous model — GPT-2, Llama, CLIP's text encoder — was decoder-only or encoder-only. Cross-attention generalized the library from "transformers" to "all transformers."
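Stripped of heads, projections, and GPU dispatch, the difference between self-attention and cross-attention is only where K and V come from. The CPU reference below is a sketch for illustration, not Smith's multiHeadCrossAttention: crossAttention is a hypothetical name, and it handles a single head on row-major Float32Arrays with no mask.

```javascript
// Single-head scaled dot-product attention.
// q: [tq, d] from the decoder; k and v: [tk, d] from the encoder.
// tq and tk are independent, which is the whole point of cross-attention.
function crossAttention(q, k, v, tq, tk, d) {
  const out = new Float32Array(tq * d)
  const scores = new Float32Array(tk)
  const scale = 1 / Math.sqrt(d)
  for (let i = 0; i < tq; i++) {
    // Scores of decoder position i against every encoder position
    let max = -Infinity
    for (let j = 0; j < tk; j++) {
      let dot = 0
      for (let c = 0; c < d; c++) dot += q[i * d + c] * k[j * d + c]
      scores[j] = dot * scale
      if (scores[j] > max) max = scores[j]
    }
    // Numerically stable softmax over the encoder axis
    let sum = 0
    for (let j = 0; j < tk; j++) {
      scores[j] = Math.exp(scores[j] - max)
      sum += scores[j]
    }
    // Weighted sum of encoder values
    for (let j = 0; j < tk; j++) {
      const w = scores[j] / sum
      for (let c = 0; c < d; c++) out[i * d + c] += w * v[j * d + c]
    }
  }
  return out
}
```

Passing the same array for q, k, and v recovers unmasked self-attention, which is why the projection weights can share shapes between the two paths.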

Conv1d via the im2col bargain (again)

Whisper's encoder starts with two 1D convolutions, both with kernel size 3, the second with stride 2 to halve the time dimension. These two layers are the only convolutions in the entire model.

Writing a dedicated 1D convolution Metal shader for two layers felt excessive. I had already solved this problem in 2D: im2col rearranges input patches into a matrix, then you call your existing matmul shader. The same trick works in 1D — extract overlapping patches into a column matrix, dispatch tiled GEMM, done. Conv1d became ~80 lines of JavaScript reusing the existing GPU matmul. No new shader.
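The 1D version of the trick fits in a few lines. The sketch below is a CPU reference under assumed layouts (input as [cIn, T], weights as [cOut, cIn, K], both row-major), with a plain matmul standing in for the GPU GEMM. The function names are hypothetical, not Smith's.

```javascript
// im2col for 1D: input [cIn, T] -> columns [cIn * K, tOut], with zero padding.
function im2col1d(input, cIn, T, K, stride, pad) {
  const tOut = Math.floor((T + 2 * pad - K) / stride) + 1
  const cols = new Float32Array(cIn * K * tOut)
  for (let c = 0; c < cIn; c++) {
    for (let k = 0; k < K; k++) {
      const row = c * K + k
      for (let t = 0; t < tOut; t++) {
        const src = t * stride + k - pad
        cols[row * tOut + t] = src >= 0 && src < T ? input[c * T + src] : 0
      }
    }
  }
  return { cols, tOut }
}

// Row-major matmul, a stand-in for the tiled GPU GEMM: [m, k] x [k, n] -> [m, n].
function matmul(a, b, m, k, n) {
  const out = new Float32Array(m * n)
  for (let i = 0; i < m; i++) {
    for (let p = 0; p < k; p++) {
      const aip = a[i * k + p]
      for (let j = 0; j < n; j++) out[i * n + j] += aip * b[p * n + j]
    }
  }
  return out
}

// Weights [cOut, cIn, K] flatten to [cOut, cIn * K], matching the column rows.
function conv1d(input, weight, cIn, cOut, T, K, stride, pad) {
  const { cols, tOut } = im2col1d(input, cIn, T, K, stride, pad)
  return { output: matmul(weight, cols, cOut, cIn * K, tOut), tOut }
}
```

The cost is memory: the column matrix repeats each input sample up to K times, which is negligible for K = 3.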

Later — and this is the part I find satisfying — I moved conv1d to the core library and wrote the GPU im2col kernel for it. The example built the op, proved it worked, and the promotion to core was a clean extraction with no API changes. That's how a library should grow.

Butterflies in shared memory

The CPU mel spectrogram worked fine. For a 30-second chunk it took about 50ms, and the transformer inference dominated runtime by orders of magnitude. There was no performance reason to move the FFT to the GPU.

I moved it to the GPU anyway.

The radix-2 Cooley-Tukey FFT is one of those algorithms that maps perfectly to GPU shared memory. All n threads in one threadgroup process one FFT together — bit-reversal permutation, then log₂(n) butterfly stages with barriers between them. The working set fits entirely in shared memory (4KB for Whisper's n=512). No global memory round-trips between stages. One kernel call does the whole transform.

function dispatchFFT(complexIn, complexOut, fftN, batch, inverse) {
  if (fftN > 1024) throw new Error(`GPU FFT max size is 1024 (got ${fftN})`)
  const params = fftParams(fftN, batch, inverse)
  run('fft_radix2', [
    { buffer: complexIn.buffer, index: 0 },
    { buffer: complexOut.buffer, index: 1 },
  ], { x: fftN, y: batch }, { x: fftN, y: 1 }, { data: params, index: 2 })
}

A single FFT on the GPU doesn't outperform the CPU — the data is too small for the GPU to stretch its legs. The batch mode is where it earns its existence. An STFT over 30 seconds of audio at hop_length=160 produces roughly 3,000 frames, each needing its own 512-point FFT. On the CPU, that's 3,000 sequential transforms. On the GPU, it's one dispatch with 3,000 threadgroups, each churning through butterflies in its own shared memory. The forward and inverse kernels are identical except for the sign of the twiddle factor and a 1/N scaling — a single inverse flag in the params struct.
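The frame count falls out of the hop size: 480,000 samples divided by a hop of 160 is 3,000 frames. A sketch of the framing step that would feed such a batch is below. It assumes a 400-sample periodic Hann window zero-padded to the 512-point FFT and zero padding past the end of the audio; both are assumptions for illustration, and the function names are hypothetical.

```javascript
// Periodic Hann window: w[i] = 0.5 - 0.5 * cos(2 * pi * i / n)
function hannWindow(n) {
  const w = new Float32Array(n)
  for (let i = 0; i < n; i++) w[i] = 0.5 - 0.5 * Math.cos(2 * Math.PI * i / n)
  return w
}

// Slice samples into overlapping windowed frames, each zero-padded to fftN.
// Returns a [numFrames, fftN] row-major batch of real inputs.
function frameSignal(samples, winLength, hop, fftN) {
  if (winLength > fftN) throw new Error('window must fit inside the FFT size')
  const numFrames = Math.ceil(samples.length / hop)
  const window = hannWindow(winLength)
  const frames = new Float32Array(numFrames * fftN)
  for (let f = 0; f < numFrames; f++) {
    const start = f * hop
    for (let i = 0; i < winLength; i++) {
      const s = start + i
      if (s >= samples.length) break // remainder of the frame stays zero
      frames[f * fftN + i] = samples[s] * window[i]
    }
  }
  return { frames, numFrames }
}
```

Each row of the result is one independent transform, which is what makes the one-threadgroup-per-frame dispatch possible.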

The mel pipeline chains five GPU stages: Hann windowing → batch FFT → magnitude → mel filterbank → log normalization. The filterbank is technically a dense matmul (80 mel filters × 257 frequency bins), but at roughly 20K multiplies per frame it's not worth the overhead of a sparse dispatch. The log normalization required a two-pass approach: compute log₁₀ on GPU, read the max on CPU (it's 240K floats, essentially free), then clamp and normalize on GPU. Whisper is particular about its normalization formula, and getting it wrong means the model sees garbage and hallucinates with great confidence.
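The two-pass shape of that normalization is easier to see on the CPU. The sketch below follows the formula as published in OpenAI's reference Whisper code (log₁₀ with a floor, clamp to 8 below the global max, then add 4 and divide by 4). The function name is hypothetical, and whether Smith's kernels match it value for value is an assumption.

```javascript
// Log-mel normalization in two passes over a flat [nMels * nFrames] array.
function logMelNormalize(mel) {
  const out = new Float32Array(mel.length)
  let max = -Infinity
  // Pass 1: log10 with a small floor to avoid log(0), tracking the global max
  for (let i = 0; i < mel.length; i++) {
    out[i] = Math.log10(Math.max(mel[i], 1e-10))
    if (out[i] > max) max = out[i]
  }
  // Pass 2: clamp to 8 log10 units below the peak, then shift and scale
  const floor = max - 8
  for (let i = 0; i < mel.length; i++) {
    out[i] = (Math.max(out[i], floor) + 4) / 4
  }
  return out
}
```

The clamp depends on a global maximum, which is why the work cannot be done in a single elementwise pass.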

Loading whisper.cpp weights

OpenAI's Whisper weights are available through HuggingFace in several formats. The most accessible for local inference is the whisper.cpp GGML format — single .bin files that contain hyperparameters, the mel filter matrix, the BPE vocabulary, and all tensor data in one binary blob.

GGML is simpler than GGUF. Linear header with hparams, then tensors with 32-byte alignment. No nested metadata, no typed arrays of typed arrays. The parser is about 120 lines of DataView calls. I split it into ggml_parser.js (pure JS, no Smith dependency, testable anywhere) and loader.js (creates Smith tensors from the parsed data). This split means the parser tests run on Linux CI without a Metal dylib, which turns out to matter when you're writing 25 tests for a module that only executes on macOS.
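As a rough illustration of how little ceremony the header involves, here is a sketch that reads the magic number and hyperparameters with DataView. The field order reflects my reading of whisper.cpp's loader (a 4-byte magic followed by eleven little-endian int32 values) and should be checked against its source. The function and field names are hypothetical, not those of ggml_parser.js.

```javascript
// Assumed field order; verify against whisper.cpp before relying on it.
const HPARAM_FIELDS = [
  'nVocab', 'nAudioCtx', 'nAudioState', 'nAudioHead', 'nAudioLayer',
  'nTextCtx', 'nTextState', 'nTextHead', 'nTextLayer', 'nMels', 'ftype',
]

const GGML_MAGIC = 0x67676d6c // the bytes "ggml" read as a little-endian uint32

function parseGgmlHeader(arrayBuffer) {
  const view = new DataView(arrayBuffer)
  if (view.byteLength < 4 + 4 * HPARAM_FIELDS.length) {
    throw new Error('File too small to contain a GGML header')
  }
  if (view.getUint32(0, true) !== GGML_MAGIC) {
    throw new Error('Not a GGML file (bad magic)')
  }
  const hparams = {}
  let offset = 4
  for (const name of HPARAM_FIELDS) {
    hparams[name] = view.getInt32(offset, true)
    offset += 4
  }
  // offset now points at whatever follows the hyperparameters
  return { hparams, offset }
}
```

Because this touches nothing but an ArrayBuffer, it can be exercised against a hand-built buffer on any platform, which is the property the parser/loader split is after.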

The model registry handles the rest. Six Whisper models — tiny, tiny-en, base, base-en, small, medium — each with a HuggingFace URL, SHA-256 checksum, and loader metadata:

bun examples/whisper/cli.js --model whisper-tiny --file audio.wav

First run downloads the model, verifies the checksum, caches it in models/whisper-tiny/. Subsequent runs load from cache. The entire fetch-verify-cache dance is about 250 lines that also serve GGUF models, ResNets, and CLIP checkpoints.

Making cross-attention caches earn their keep

The first working Whisper implementation re-projected the encoder output through every cross-attention layer at every decode step. This is correct and spectacularly wasteful — the encoder output doesn't change during decoding, so K and V are recomputed identically 224 times per transcription.

The fix follows the same prefill/decode split I used for Llama's KV cache. precomputeEncoderKV runs each decoder block's cross-attention K/V projections once on the full encoder output, producing [numHeads, audioCtx, headDim] tensors that get passed to every decode step unchanged. For Whisper tiny, that's 4 blocks × 2 tensors × 6 heads × 1500 positions × 64 dims — about 4.6 million floats projected once instead of 224 times.
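The arithmetic and the shape of the fix can be sketched in a few lines. The code below is illustrative only: precomputeCrossKV and its project callback are hypothetical stand-ins, not Smith's precomputeEncoderKV, and the per-block weight names wk and wv are assumptions.

```javascript
// Run each block's K and V projection once over the encoder output.
// `project` stands in for a linear projection plus head reshape.
function precomputeCrossKV(encoderOut, blocks, project) {
  return blocks.map(block => ({
    k: project(encoderOut, block.wk),
    v: project(encoderOut, block.wv),
  }))
}

// Floats held by the cache: one K and one V tensor per decoder block.
function crossKvFloats({ blocks, heads, audioCtx, headDim }) {
  return blocks * 2 * heads * audioCtx * headDim
}

// Whisper tiny, using the figures from the text
const tinyConfig = { blocks: 4, heads: 6, audioCtx: 1500, headDim: 64 }
const tinyFloats = crossKvFloats(tinyConfig) // 4,608,000 floats
const tinyBytes = tinyFloats * 4             // stored as float32
```

With the cache in place the number of projection calls is fixed at two per block, no matter how many tokens the decoder goes on to emit.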

Self-attention still needs a growing KV cache, same as any autoregressive decoder. The pattern is: prefill processes the prompt tokens (usually just <|startoftranscript|>, language token, <|transcribe|>) in one pass, then each decode step appends one token to the cache and runs attention against the full history.

Smith's cached attention functions were built speculatively — multiHeadAttentionCached in Phase 5, multiHeadCrossAttentionCached in Phase 17 — without a real encoder-decoder model to test them. Whisper was the first validation. They accepted the pre-projected K/V exactly as designed. The speculation paid off, but mostly because the API was minimal enough that there wasn't much to get wrong.

Chunking: the last 140 lines

Whisper processes 30 seconds at a time. Real audio is longer than 30 seconds. The solution is predictable: overlapping windows with text-based deduplication at the boundaries.

chunkAudio uses Float32Array.subarray() — zero-copy views into the original audio buffer. A 10-minute podcast at 16kHz creates 21 chunk objects referencing the same underlying ArrayBuffer. Only the mel spectrogram per chunk allocates new memory.
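A minimal sketch of that zero-copy windowing, assuming 30-second windows that advance by 29 seconds so neighbours share a 1-second overlap. The function name chunkSamples and the shape of the returned objects are hypothetical, not the signature of Smith's chunkAudio.

```javascript
// Split audio into overlapping windows without copying any sample data.
function chunkSamples(samples, sampleRate = 16000, windowSec = 30, overlapSec = 1) {
  const window = windowSec * sampleRate
  const stride = (windowSec - overlapSec) * sampleRate
  const chunks = []
  for (let start = 0; start < samples.length; start += stride) {
    const end = Math.min(start + window, samples.length)
    // subarray returns a view over the same ArrayBuffer, not a copy
    chunks.push({ start, end, samples: samples.subarray(start, end) })
    if (end === samples.length) break // last window reached the end of the audio
  }
  return chunks
}

// Ten minutes at 16kHz
const podcast = new Float32Array(600 * 16000)
const podcastChunks = chunkSamples(podcast)
// podcastChunks.length === 21, and every chunk shares podcast.buffer
```

The early break matters: without it, a final window that already reaches the end of the audio would be followed by a redundant chunk made up almost entirely of overlap.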

The stitching logic walks the suffix of accumulated text, checks progressively longer prefixes of the new chunk, and removes the overlap at the first match. Conversational English runs at roughly 2 to 3 words per second, so the 1-second overlap yields only a few redundant words, which is usually enough to find the splice point. The whole chunking module is 140 lines including the async pipeline. whisper.cpp's equivalent is embedded in the main inference loop and interleaved with memory management. Keeping it as pure functions with no GPU dependency made it testable on any platform.
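One way to read that description is as a word-level suffix/prefix match. The sketch below is an interpretation, not the module's actual code: stitchText, normalizeWord, and the maxOverlapWords cap are all hypothetical, and the comparison ignores case and surrounding punctuation.

```javascript
// Lowercase and strip leading/trailing punctuation so "everyone," matches "everyone".
function normalizeWord(word) {
  return word.toLowerCase().replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '')
}

// Append `next` to `accumulated`, dropping the words the two have in common
// at the boundary. Tries overlap lengths 1, 2, 3... and stops at the first match.
function stitchText(accumulated, next, maxOverlapWords = 12) {
  const prev = accumulated.trim().split(/\s+/).filter(Boolean)
  const incoming = next.trim().split(/\s+/).filter(Boolean)
  const limit = Math.min(maxOverlapWords, prev.length, incoming.length)
  for (let n = 1; n <= limit; n++) {
    let match = true
    for (let i = 0; i < n; i++) {
      if (normalizeWord(prev[prev.length - n + i]) !== normalizeWord(incoming[i])) {
        match = false
        break
      }
    }
    if (match) return prev.concat(incoming.slice(n)).join(' ')
  }
  // No overlap found: plain concatenation
  return prev.concat(incoming).join(' ')
}
```

Stopping at the first match keeps the logic simple, but a single repeated word at the boundary can produce a false splice. A stricter variant could prefer the longest match instead.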

What it took

The final tally for Whisper speech-to-text in JavaScript on Apple Silicon:

About 1,400 lines of pure JS audio processing — WAV decoding, FFT, STFT, mel filterbank, tokenizer, chunker. About 800 lines of model code — encoder, decoder, weight loader, GGML parser. 300 lines of Metal shaders for GPU FFT and mel spectrogram. 400 lines of CLI. 26 tests in the example, 18 in the core library.

Plus the ops that got promoted from example code to Smith's core: conv1d, cross-attention, sinusoidal positional embeddings. Three features that any encoder-decoder model would need, discovered by building an actual encoder-decoder model. The Whisper example went from "lots of custom code" to "thin orchestration layer over Smith primitives." The library grew because a real use case demanded it, not because someone speculated about what might be needed.

The punchline

$ bun examples/whisper/cli.js --model whisper-tiny --file meeting.wav
Fetching whisper-tiny...  done (75 MB, cached)
Loading model...          done (39M params)
Transcribing...           done (4 chunks, 2.1 minutes)

Good morning everyone, let's get started with the weekly sync...

Speech recognition. In JavaScript. On Metal shaders dispatched through Bun's FFI from a language that was invented to validate form fields.

The Python version does this in four lines and a pip install. Our version required building a WAV parser, an FFT, a mel filterbank, a GGML binary format parser, a cross-attention mechanism, a conv1d operator, a GPU butterfly kernel, a model registry with checksum verification, and an audio chunker with text-based deduplication.

insanely-fast-whisper had the right idea — make it fast and make it easy. I went a different direction. I made it from scratch and made it mine. Every byte of audio flows through code I wrote, through shaders I compiled, through a Metal bridge built from an Objective-C file and a dream of zero dependencies.

Was it worth it? I can explain every stage of the pipeline from PCM samples to transcribed text. I can tell you exactly where the butterflies happen in shared memory and why the mel normalization uses (mel + 4) / 4 and not something more obvious. I can tell you that the hardest part wasn't the transformer — it was getting the mel spectrogram to match whisper.cpp's output closely enough that the model didn't hallucinate.

Four lines of Python would have been faster to write. But I wouldn't have learned anything, and the GPU would still be bored.

All the code is available on GitHub: https://github.com/fredrikpaulin/SMITH/