TL;DR
- What: Pure C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro with 48GB RAM
- Creator: Dan Woods (danveloper), VP of AI at CVS Health
- Traction: Trending on GitHub; 393 points and 121+ comments on Hacker News
- Speed: 4.4 tokens/second at 4-bit quantization with full tool calling support
- Model size: 209GB on disk, streamed from SSD — only 6GB in memory at any time
- Language: Pure C, Objective-C, and hand-tuned Metal shaders — no Python, no frameworks
- License: Open source on GitHub
- Key insight: Only 4 of 512 experts activate per token, so the entire 397B model streams from NVMe SSD through Apple’s unified memory architecture
Quick Reference Table
| Metric | Value |
|---|---|
| GitHub | danveloper/flash-moe |
| Model | Qwen3.5-397B-A17B (Alibaba) |
| Parameters | 397B total, 17B active per token |
| Hardware | MacBook Pro M3 Max, 48GB RAM |
| Speed (4-bit) | 4.36 tok/s |
| Speed (2-bit) | 5.74 tok/s (breaks JSON) |
| Disk Usage | 209GB (4-bit) / 120GB (2-bit) |
| RAM Usage | ~6GB (non-expert weights + buffers) |
| Language | C, Objective-C, Metal |
| Lines of Code | ~8,200 (infer.m + shaders.metal) |
| Built In | 24 hours (human + AI pair programming) |
What Is Flash-MoE?
Flash-MoE is a from-scratch inference engine that does something most people assumed was impossible: running a 397 billion parameter language model on a laptop with 48GB of RAM. No cloud API. No multi-GPU server. Just a MacBook Pro and its NVMe SSD.
The trick? Mixture-of-Experts (MoE) architecture. Qwen3.5-397B-A17B has 397 billion parameters spread across 512 experts per layer, but only 4 experts activate per token — meaning only about 17 billion parameters are “hot” at any given moment. Flash-MoE exploits this sparsity by streaming expert weights from SSD on demand rather than loading everything into RAM.
The entire codebase is roughly 8,200 lines of C, Objective-C, and Metal shader code. No Python. No PyTorch. No GGML. No llama.cpp. Just raw metal (pun intended) that talks directly to Apple’s GPU compute pipeline and NVMe storage.
Why This Matters Right Now
The Local LLM Landscape Just Changed
Until now, running models above ~70B parameters locally required either expensive GPU servers or compromises that made the output useless. Flash-MoE proves that MoE architecture combined with smart SSD streaming can bypass the memory wall entirely.
Consider the economics. Running Qwen3.5-Plus through Alibaba’s API costs $0.40 per million input tokens. Flash-MoE gives you the same model — running locally, fully private, with zero per-token cost — on hardware many developers already own.
Apple Silicon Was Built for This
Apple’s unified memory architecture is the unsung hero here. On a traditional PC, data must cross the PCIe bus between CPU, GPU, and NVMe controller. On Apple Silicon, SSD, GPU, and CPU share the same memory controller and fabric. Flash-MoE’s pipeline exploits this:
GPU compute (attention) → SSD read (4 experts) → GPU compute (expert forward)
Each SSD read fetches about 27MB (4 experts × 6.75MB each) and takes 2.41ms — fast enough that the GPU never stalls. The OS page cache (~35GB) handles caching with a ~71% hit rate, meaning most expert reads come from RAM anyway.
The “Trust the OS” Philosophy
One of the most fascinating aspects of Flash-MoE is what the developer didn't do. He tried building custom expert caches in Metal, LRU eviction schemes, LZ4 compression, mmap, prefetching — and every single approach was slower than just letting macOS manage the page cache.
Here’s the full list of approaches that failed:
| Approach | Result | Why It Failed |
|---|---|---|
| LZ4 expert compression | -13% | Decompression overhead exceeded savings |
| F_RDADVISE prefetch | Net 0% | SSD DMA slows GPU by 73% |
| Temporal expert prediction | -18% | Only 25% hit rate, wastes bandwidth |
| MLP routing predictor | 31% accuracy | Worse than baseline |
| GPU LUT dequant kernel | -2% | Register access serialized |
| Spin-poll GPU wait | -23% | CPU thermal throttles GPU |
| mmap expert files | -5x | Per-page fault overhead |
| dispatch_io | -70% | dispatch_data management overhead |
| Speculative early routing | -38% | Cache pollution |
The lesson: on unified memory hardware, the OS kernel team has already optimized the data path better than you can from userspace.
How Flash-MoE Works: Architecture Deep Dive
The Pipeline
Flash-MoE processes each token through 60 transformer layers — 45 GatedDeltaNet (linear attention) layers and 15 standard full attention layers. The pipeline for each layer follows a strict serial sequence:
```
CMD3 (prev layer) → CMD1: attention projections + delta-net  [1.22ms GPU]
                  → CPU: flush results                       [0.01ms]
                  → CMD2: output projection + norm + routing [0.55ms GPU]
                  → CPU: softmax + topK routing              [0.003ms]
                  → I/O: parallel pread K=4 experts          [2.41ms SSD]
                  → CMD3: expert forward + combine + norm    [deferred]
```
The key optimization: CMD3 (the expert computation) is submitted to the GPU without waiting for completion. The GPU executes it while the CPU starts preparing the next layer. This overlap is critical for maintaining throughput.
The FMA Kernel Trick
The biggest single optimization (+12% speed) came from rearranging the dequantization math. The naive 4-bit dequant does:

```
result = (nibble * scale + bias) * x
```

Flash-MoE rewrites this as:

```
scale_x = scale * x                     // precompute once
bias_x  = bias * x                      // precompute once
result  = fma(nibble, scale_x, bias_x)  // single FMA instruction
```

This lets the GPU's fused multiply-add unit handle dequantization and multiplication in one instruction instead of three separate operations.
Memory Management
Flash-MoE is remarkably conservative with RAM:
- Non-expert weights: 5.5GB (mmap’d, read-only)
- Metal scratch buffers: ~200MB
- Total engine footprint: ~6GB
- Remaining for OS page cache: ~42GB
No OOM risk. Expert data streams from SSD on demand. The engine explicitly avoids any allocation pattern that could pressure the GPU’s memory controller.
Community Reactions
Hacker News (393 points, 121+ comments)
The HN discussion was substantive, with several experienced local LLM users testing the approach:
One user running an M1 Ultra reported ~20 tokens/second with room for 256K context, along with impressive benchmark scores:
- MMLU: 87.86%
- GPQA Diamond: 82.32%
- GSM8K: 86.43%
- IFEval: 75.90%
Reddit (r/LocalLLaMA)
The discussion on r/LocalLLaMA was more skeptical but constructive. Several users pointed out that the “397B” headline is misleading since only 17B parameters are active per token:
“The ‘397B’ is the full parameter space. The ‘17B’ is what actually runs. This is not a quirk of Qwen specifically — it’s the fundamental property of MoE architecture.”
Others noted that with 256GB DDR4 and 24GB VRAM they were already getting 6.5 tokens/second with a Q5_K_S quantization through llama.cpp — faster than Flash-MoE, albeit on hardware with far more total memory.
Spinoff Projects
The project has already spawned derivatives. Matt K. Wong built mlx-flash, adapting the approach for Nemotron 30B with a hybrid RAM/disk control knob that lets users tune the tradeoff between speed and storage usage.
Getting Started
Requirements
- macOS 13+ on Apple Silicon (M1/M2/M3/M4 series)
- 48GB+ unified memory recommended
- ~209GB free SSD space for 4-bit weights (120GB for 2-bit)
- Xcode command line tools
Installation
```sh
# Clone the repository
git clone https://github.com/danveloper/flash-moe.git
cd flash-moe

# Build the engine
make

# Download the pre-packed 4-bit expert weights (209GB)
./download_weights.sh

# Run inference
./infer --prompt "Explain quantum computing in simple terms" --tokens 100

# Interactive chat with tool calling
./chat

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```
What You’ll Actually Experience
First run will be slow as the OS page cache is cold — expect 2-3 tok/s. After a few prompts, the cache warms up and you’ll see consistent 4.4+ tok/s. The 209GB download takes a while (budget 30-60 minutes on a fast connection).
The --timing flag is worth running at least once. It shows per-layer breakdowns that make the architecture tangible — you can see exactly where time goes between GPU compute, SSD reads, and pipeline overhead.
Flash-MoE vs. Other Local Inference Options
| Feature | Flash-MoE | llama.cpp | MLX | Ollama |
|---|---|---|---|---|
| Max model | 397B MoE | ~70B dense (48GB) | ~70B dense | ~70B dense |
| Language | C/Metal | C/C++ | Python/Metal | Go (wraps llama.cpp) |
| SSD streaming | ✅ Core design | Partial (mmap) | ❌ | ❌ |
| MoE optimized | ✅ | Basic | Basic | Basic |
| Tool calling | ✅ (4-bit) | ✅ | ✅ | ✅ |
| Ease of use | Manual build | Easy | Easy | One-liner |
| Platform | macOS only | Cross-platform | macOS only | Cross-platform |
Flash-MoE is not trying to replace llama.cpp or Ollama for everyday use. It’s a research project that proves a specific architectural point: with the right SSD streaming pipeline, MoE models can run on consumer hardware far beyond what RAM alone would allow.
Limitations and Honest Assessment
This is an experimental research project, not production software. Here’s what to know:
- macOS only — The entire engine depends on Apple’s Metal GPU framework and unified memory architecture. No Linux, no Windows, no CUDA.
- 209GB download — You need serious SSD space, and the download itself takes time.
- 4.4 tok/s is slow — For interactive chat, this is usable but noticeably delayed. Claude or GPT-4o via API is 10-50x faster.
- 2-bit quant breaks JSON — The faster 2-bit mode produces \name\ instead of "name", making tool calling unreliable. You’re locked to 4-bit for real work.
- Single-model focus — This is built specifically for Qwen3.5-397B-A17B. It’s not a general-purpose inference engine (yet).
- Active parameters are “only” 17B — Critics rightly point out that comparing this to dense 397B is misleading. The quality ceiling is closer to a very good 17B model with access to specialist knowledge.
Who Should Care About Flash-MoE
Use it if:
- You want to run frontier-class models fully locally for privacy
- You’re researching MoE inference optimization
- You have a Mac with 48GB+ RAM and want to push it to its limits
- You want to understand how SSD streaming inference actually works
Skip it if:
- You need fast interactive responses (use API or smaller local models)
- You’re on Linux or Windows
- You don’t have 209GB of free SSD space
- You just want a local model that works (use Ollama instead)
What This Means for the Future
Flash-MoE is a proof of concept that will age well. As Apple ships Macs with faster SSDs and more GPU cores, the same architecture will get faster for free. The M4 Ultra with 192GB unified memory could potentially run this at 15-20+ tok/s with warm cache.
More importantly, it validates a design principle: SSD is the new VRAM for sparse models. As more model architectures adopt MoE (and the trend is clearly heading that way), SSD streaming inference will become a standard technique rather than a research curiosity.
The paper that accompanies the project documents 90+ experiments and the full story of building the engine in 24 hours through human-AI collaboration. It’s worth reading not just for the technical content but as an example of what a skilled engineer can accomplish with AI assistance in a single day.
FAQ
Can I run Flash-MoE on a Mac with 32GB RAM?
Technically possible but not recommended. With only 32GB, the OS page cache would be ~26GB, reducing the expert cache hit rate significantly. You’d see 2-3 tok/s instead of 4.4. The 48GB configuration leaves 42GB for page cache, which is the sweet spot for the 209GB model.
Is 4.4 tokens/second fast enough for real work?
For interactive chat, it’s usable but slow — about one short sentence every 3-4 seconds. For batch processing, code generation, or any task where you can wait, it’s perfectly fine. The quality of a 397B MoE model often justifies the wait, especially for complex reasoning tasks.
How does this compare to running smaller models at full speed?
A Qwen 27B model at Q6 quantization on the same hardware runs at 30-40 tok/s. For most tasks, that’s a better tradeoff. Flash-MoE shines on tasks that benefit from the 397B model’s broader knowledge and reasoning — complex multi-step problems, nuanced analysis, or tasks requiring specialized domain knowledge.
Will this work on the M4 MacBook Pro?
Yes, and it should be faster. The M4 Pro and M4 Max have improved memory bandwidth and GPU cores. An M4 Max with 48GB should hit 5-6 tok/s, and an M4 Ultra with 192GB could potentially hit 15-20+ tok/s with most experts cached in RAM.
Can I use this with other MoE models besides Qwen3.5-397B?
Not currently. Flash-MoE is hardcoded for Qwen3.5-397B-A17B’s specific architecture (512 experts, K=4 routing, GatedDeltaNet attention). Adapting it to other MoE models would require modifying the weight extraction and inference code, though the core principles transfer directly.
Is this actually “running 397B parameters” or is it misleading?
Both are true. The model has 397B total parameters, and all of them influence the output through the routing mechanism. But only ~17B parameters are active for any given token. Comparing it directly to a dense 397B model (like GPT-4’s rumored architecture) isn’t fair — the quality ceiling is closer to a very good 17B-30B dense model with access to specialist knowledge through the expert routing.