TL;DR

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B — a 397 billion parameter model — on a MacBook Pro with 48GB RAM. Key facts:

  • 4.4+ tokens/second at 4-bit quantization with full tool calling
  • 1,847 GitHub stars in one week, 393 points on Hacker News
  • No Python, no frameworks — just C, Objective-C, and hand-tuned Metal shaders
  • 209GB model streams from SSD through a custom Metal compute pipeline
  • Built in 24 hours by Dan Woods (CVS Health VP of AI) using Claude Opus 4.6 and the autoresearch pattern
  • Inspired by Apple’s 2023 “LLM in a Flash” paper
  • License: Open source on GitHub

The project proves that a single developer with an AI coding agent can implement a research paper and get frontier-class model intelligence running on consumer hardware in a weekend.


Why This Matters

Running large language models locally usually means one of two things: use a smaller model that fits in RAM, or spend thousands on GPU servers. Flash-MoE breaks this tradeoff by exploiting a key property of Mixture-of-Experts (MoE) models.

Qwen3.5-397B-A17B has 397 billion total parameters, but each token activates only 17 billion of them. The model has 512 experts per layer, and Flash-MoE routes each token to just 4 of them (the stock model uses 10). It keeps the non-expert weights (5.5GB) resident in memory and streams the active experts from SSD on demand.

The result: frontier-class model intelligence on a laptop that costs $3,499.

The Numbers

Configuration                 Tokens/sec   Disk Size   Quality
4-bit experts (production)    4.36         209GB       Excellent — full tool calling
2-bit experts                 5.74         120GB       Good — breaks JSON output
2-bit peak burst              7.05         120GB       Good — warm cache only
M1 Ultra (community)          ~20          n/a         256K context, 87.86% MMLU

The 4-bit configuration is the sweet spot. 2-bit quantization is faster but produces \name\ instead of "name" in JSON, making tool calling unreliable.


How It Works

Flash-MoE’s architecture is surprisingly simple. Here’s the pipeline for each token:

GPU: attention + routing → CPU: top-K selection → SSD: load 4 experts → GPU: expert forward pass

Each step takes milliseconds:

Step                                 Time       Component
Attention projections + delta-net    1.22ms     GPU
Output projection + norm + routing   0.55ms     GPU
Softmax + top-K routing              0.003ms    CPU
Parallel pread of 4 experts          2.41ms     SSD
Expert forward + combine + norm      deferred   GPU

The SSD read is the bottleneck. Apple’s NVMe SSD delivers 17.5 GB/s sequential reads, and each expert is ~6.75MB. Four experts = ~27MB per token, loaded in parallel using GCD dispatch groups.

Key Technical Decisions

1. “Trust the OS” — No custom cache

The most counterintuitive finding. Every custom caching approach — Metal LRU, malloc cache, LZ4 compressed cache — was slower than letting macOS manage the page cache. The OS achieves ~71% cache hit rate naturally through LRU. Deleting the custom Metal LRU cache gave a 38% speed improvement.

Why? On Apple Silicon, the GPU and SSD share the same memory controller. Custom caches create GPU memory pressure and fight with the page cache for bandwidth.

2. FMA-optimized dequantization

The inner loop of 4-bit matrix-vector multiply was rearranged from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). Pre-computing scale*x and bias*x lets the GPU’s fused multiply-add unit handle dequant and multiply in one instruction. 12% faster.

3. No GPU/SSD overlap

Another surprising finding. On Apple Silicon, SSD DMA and GPU compute share the same memory controller. Running them simultaneously causes memory arbitration conflicts — the GPU slows down 73%. The serial pipeline (GPU → SSD → GPU) is actually hardware-optimal.

4. Accelerate BLAS for linear attention

The GatedDeltaNet recurrence (45 of 60 layers use linear attention) runs on CPU using Apple’s Accelerate framework — cblas_sscal, cblas_sgemv, cblas_sger. 64% faster than scalar code for the 64-head × 128×128 state matrix update.


What Failed (And Why It’s Interesting)

The project logged 90+ experiments. Many promising optimizations made things worse:

Approach                     Result         Why It Failed
LZ4 expert compression       -13%           Decompression overhead exceeded cache savings
SSD prefetch (F_RDADVISE)    net 0%         Unified memory — SSD DMA slowed GPU by 73%
Temporal expert prediction   -18%           Only 25% hit rate; wasted SSD bandwidth
MLP routing predictor        31% accuracy   Worse than the temporal baseline
GPU LUT dequant kernel       -2%            Indirect register access serialized
Speculative early routing    -38%           Cache pollution plus overhead
mmap expert files            5x slower      Per-page fault overhead on cold data
MTP speculative decoding     break-even     MoE I/O scales per token, unlike dense models

The “Trust the OS” philosophy came from repeated failure. Every attempt to outsmart macOS’s virtual memory system backfired on Apple Silicon’s unified memory architecture.


How to Run It

Requirements

  • MacBook Pro with Apple M-series chip (M3 Max tested)
  • 48GB+ unified memory
  • 209GB free SSD space (4-bit) or 120GB (2-bit)
  • macOS 26.2+

Build and Run

git clone https://github.com/danveloper/flash-moe.git
cd flash-moe/metal_infer

# Build the inference engine
make

# Basic inference
./infer --prompt "Explain quantum computing" --tokens 100

# Interactive chat with tool calling
./chat

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing

You’ll need to download and convert the model weights first using extract_weights.py and repack_experts.py (instructions in the repo).

Codebase Overview

The entire engine is remarkably compact:

  • infer.m — Complete inference engine (~7,000 lines)
  • shaders.metal — Metal compute kernels (~1,200 lines)
  • chat.m — Interactive chat TUI with tool calling
  • tokenizer.h — C BPE tokenizer (449 lines, single header)

No dependencies. No package managers. Just make.


Community Reception

The HN thread (393 points, 121 comments) was largely positive but raised valid concerns:

The praise:

  • “I’ve had great success (~20 t/s) running it on a M1 Ultra with room for 256K context” — with benchmark results showing 87.86% MMLU, 82.32% GPQA diamond
  • Simon Willison covered it, noting the autoresearch pattern used to build it

The skepticism:

  • 2-bit quantization quality degrades in longer sessions — “They look promising in short sessions but then you try to do real work”
  • Reducing experts from 10 to 4 is “another layer of quality degradation”
  • “Running a smaller dense model like 27B produces better results than 2-bit quants”

The 4-bit upgrade addressed the biggest complaint. At 4-bit with 4 experts, the model handles tool calling reliably and maintains coherent output over longer conversations.


The Autoresearch Angle

Flash-MoE wasn’t just built with AI — it was built by AI following the autoresearch pattern popularized by Andrej Karpathy. Dan Woods fed Apple’s “LLM in a Flash” paper to Claude Opus 4.6 and had it:

  1. Design the experiment structure
  2. Implement the C/Metal inference engine
  3. Run 90+ optimization experiments
  4. Write a full academic paper documenting the results

The entire process took 24 hours. The paper PDF in the repo was “mostly written by Claude Opus 4.6.”

This is a compelling proof point for the autoresearch workflow: one domain expert + one AI coding agent = publishable research and working code in a day.


Limitations

  • Apple Silicon only — Metal shaders don’t run on NVIDIA/AMD GPUs
  • 48GB minimum — 5.5GB resident memory + page cache needs room
  • 4.4 tok/s is slow — Fine for interactive chat, painful for batch processing
  • Model quality at K=4 experts — The original model uses K=10. Quality impact isn’t fully characterized
  • No long-context benchmarks — Community reports are anecdotal
  • SSD wear — Streaming 27MB per token means significant SSD read volume over time

FAQ

Can I run Flash-MoE on a MacBook Air or base MacBook Pro?

Not with the 397B model. You need at least 48GB unified memory. The 24GB models may work at 2-bit quantization but quality would be poor. For smaller machines, a 27B dense model (like Qwen3.5-27B) is a better choice.

How does Flash-MoE compare to llama.cpp?

llama.cpp supports MoE models through GGUF quantization and can also run Qwen3.5-397B on Apple Silicon. Flash-MoE’s advantage is the hand-tuned Metal pipeline optimized specifically for expert streaming. The community reports comparable speeds — some users get 20 t/s on M1 Ultra with llama.cpp.

Is the output quality as good as the full model?

At 4-bit quantization with 4 experts (vs. the original 10), there’s some quality loss. Community benchmarks show 87.86% MMLU and 82.32% GPQA diamond, compared to the original BF16 scores of ~88% MMLU. For practical use, the 4-bit configuration handles coding, analysis, and tool calling well.

What about SSD lifespan?

At 27MB per token, a typical chat session of a few hundred tokens reads a few gigabytes from disk. Modern Apple SSDs are rated for hundreds of TBW (terabytes written), and reads don't count toward write endurance, so expert streaming causes no meaningful wear; the concern is theoretical, not practical.

Can this approach work for other MoE models?

In principle, yes. Any MoE model with sparse expert activation could benefit from SSD streaming. The specific Metal shaders and pipeline would need adaptation, but the architecture is generalizable. Mixtral, DeepSeek-V3, and future MoE models are candidates.


Flash-MoE is open source at github.com/danveloper/flash-moe. Built by Dan Woods with Claude Opus 4.6 in 24 hours.