TL;DR
- What: Pure C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro with 48GB RAM
- Creator: Dan Woods (danveloper), VP of AI at CVS Health
- Traction: Trending on GitHub; 393 points and 121+ comments on Hacker News
- Speed: 4.4 tokens/second at 4-bit quantization with full tool calling support
- Model size: 209GB on disk, streamed from SSD — only 6GB in memory at any time
- Language: Pure C, Objective-C, and hand-tuned Metal shaders — no Python, no frameworks
- License: Open source on GitHub
- Key insight: Only 4 of 512 experts activate per token, so the entire 397B model streams from NVMe SSD through Apple’s unified memory architecture
Quick Reference Table
| Metric | Value |
|---|---|
| GitHub | danveloper/flash-moe |
| Model | Qwen3.5-397B-A17B (Alibaba) |
| Parameters | 397B total, 17B active per token |
| Hardware | MacBook Pro M3 Max, 48GB RAM |
| Speed (4-bit) | 4.36 tok/s |
| Speed (2-bit) | 5.74 tok/s (breaks JSON) |
| Disk Usage | 209GB (4-bit) / 120GB (2-bit) |
| RAM Usage | ~6GB (non-expert weights + buffers) |
| Language | C, Objective-C, Metal |
| Lines of Code | ~8,200 (infer.m + shaders.metal) |
| Built In | 24 hours (human + AI pair programming) |
What Is Flash-MoE?
Flash-MoE is a from-scratch inference engine that does something most people assumed was impossible: running a 397 billion parameter language model on a laptop with 48GB of RAM. No cloud API. No multi-GPU server. Just a MacBook Pro and its NVMe SSD.
The trick? Mixture-of-Experts (MoE) architecture. Qwen3.5-397B-A17B has 397 billion parameters spread across 512 experts per layer, but only 4 experts activate per token — meaning only about 17 billion parameters are “hot” at any given moment. Flash-MoE exploits this sparsity by streaming expert weights from SSD on demand rather than loading everything into RAM.
The entire codebase is roughly 8,200 lines of C, Objective-C, and Metal shader code. No Python. No PyTorch. No GGML. No llama.cpp. Just raw metal (pun intended) that talks directly to Apple’s GPU compute pipeline and NVMe storage.
Why This Matters Right Now
The Local LLM Landscape Just Changed
Until now, running models above ~70B parameters locally required either expensive GPU servers or compromises that made the output useless. Flash-MoE proves that MoE architecture combined with smart SSD streaming can bypass the memory wall entirely.
Consider the economics. Running Qwen3.5-Plus through Alibaba’s API costs $0.40 per million input tokens. Flash-MoE gives you the same model — running locally, fully private, with zero per-token cost — on hardware many developers already own.
Apple Silicon Was Built for This
Apple’s unified memory architecture is the unsung hero here. On a traditional PC, data must cross the PCIe bus between CPU, GPU, and NVMe controller. On Apple Silicon, SSD, GPU, and CPU share the same memory controller and fabric. Flash-MoE’s pipeline exploits this:
GPU compute (attention) → SSD read (4 experts) → GPU compute (expert forward)
Each SSD read fetches about 27MB (4 experts × 6.75MB each) and takes 2.41ms — fast enough that the GPU never stalls. The OS page cache (~35GB) handles caching with a ~71% hit rate, meaning most expert reads come from RAM anyway.
The “Trust the OS” Philosophy
One of the most fascinating aspects of Flash-MoE is what the developer didn't do. He tried building custom expert caches in Metal, LRU eviction schemes, LZ4 compression, mmap, prefetching — and every single approach was slower than just letting macOS manage the page cache.
Here’s the full list of approaches that failed:
| Approach | Result | Why It Failed |
|---|---|---|
| LZ4 expert compression | -13% | Decompression overhead exceeded savings |
| F_RDADVISE prefetch | Net 0% | SSD DMA slows GPU by 73% |
| Temporal expert prediction | -18% | Only 25% hit rate, wastes bandwidth |
| MLP routing predictor | 31% accuracy | Worse than baseline |
| GPU LUT dequant kernel | -2% | Register access serialized |
| Spin-poll GPU wait | -23% | CPU thermal throttles GPU |
| mmap expert files | -5x | Per-page fault overhead |
| dispatch_io | -70% | dispatch_data management overhead |
| Speculative early routing | -38% | Cache pollution |
The lesson: on unified memory hardware, the OS kernel team has already optimized the data path better than you can from userspace.
How Flash-MoE Works: Architecture Deep Dive
The Pipeline
Flash-MoE processes each token through 60 transformer layers — 45 GatedDeltaNet (linear attention) layers and 15 standard full attention layers. The pipeline for each layer follows a strict serial sequence:
```
CMD3 (prev layer) → CMD1: attention projections + delta-net  [1.22ms GPU]
                  → CPU: flush results                       [0.01ms]
                  → CMD2: output projection + norm + routing [0.55ms GPU]
                  → CPU: softmax + topK routing              [0.003ms]
                  → I/O: parallel pread K=4 experts          [2.41ms SSD]
                  → CMD3: expert forward + combine + norm    [deferred]
```
The key optimization: CMD3 (the expert computation) is submitted to the GPU without waiting for completion. The GPU executes it while the CPU starts preparing the next layer. This overlap is critical for maintaining throughput.
The FMA Kernel Trick
The biggest single optimization (+12% speed) came from rearranging the dequantization math. The naive 4-bit dequant does:

```
result = (nibble * scale + bias) * x
```

Flash-MoE rewrites this as:

```
scale_x = scale * x                     // precompute once
bias_x  = bias * x                      // precompute once
result  = fma(nibble, scale_x, bias_x)  // single FMA instruction
```

This lets the GPU's fused multiply-add unit handle dequantization and multiplication in one instruction instead of three separate operations.
Memory Management
Flash-MoE is remarkably conservative with RAM:
- Non-expert weights: 5.5GB (mmap’d, read-only)
- Metal scratch buffers: ~200MB
- Total engine footprint: ~6GB
- Remaining for OS page cache: ~42GB
No OOM risk. Expert data streams from SSD on demand. The engine explicitly avoids any allocation pattern that could pressure the GPU’s memory controller.
Community Reactions
Hacker News (393 points, 121+ comments)
The HN discussion was substantive, with several experienced local LLM users testing the approach:
One user running an M1 Ultra reported ~20 tokens/second with room for 256K context, along with impressive benchmark scores:
- MMLU: 87.86%
- GPQA Diamond: 82.32%
- GSM8K: 86.43%
- IFEval: 75.90%
Reddit (r/LocalLLaMA)
The discussion on r/LocalLLaMA was more skeptical but constructive. Several users pointed out that the “397B” headline is misleading since only 17B parameters are active per token:
“The ‘397B’ is the full parameter space. The ‘17B’ is what actually runs. This is not a quirk of Qwen specifically — it’s the fundamental property of MoE architecture.”
Others noted that with 256GB DDR4 and 24GB VRAM they were already getting 6.5 tokens/second with a Q5_K_S quantization through llama.cpp — faster than Flash-MoE, albeit on hardware with far more total memory.
Spinoff Projects
The project has already spawned derivatives. Matt K. Wong built mlx-flash, adapting the approach for Nemotron 30B with a hybrid RAM/disk control knob that lets users tune the tradeoff between speed and storage usage.
Getting Started
Requirements
- macOS 13+ on Apple Silicon (M1/M2/M3/M4 series)
- 48GB+ unified memory recommended
- ~209GB free SSD space for 4-bit weights (120GB for 2-bit)
- Xcode command line tools
Installation
```sh
# Clone the repository
git clone https://github.com/danveloper/flash-moe.git
cd flash-moe

# Build the engine
make

# Download the pre-packed 4-bit expert weights (209GB)
./download_weights.sh

# Run inference
./infer --prompt "Explain quantum computing in simple terms" --tokens 100

# Interactive chat with tool calling
./chat

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```
What You’ll Actually Experience
First run will be slow as the OS page cache is cold — expect 2-3 tok/s. After a few prompts, the cache warms up and you’ll see consistent 4.4+ tok/s. The 209GB download takes a while (budget 30-60 minutes on a fast connection).
The --timing flag is worth running at least once. It shows per-layer breakdowns that make the architecture tangible — you can see exactly where time goes between GPU compute, SSD reads, and pipeline overhead.
Flash-MoE vs. Other Local Inference Options
| Feature | Flash-MoE | llama.cpp | MLX | Ollama |
|---|---|---|---|---|
| Max model | 397B MoE | ~70B dense (48GB) | ~70B dense | ~70B dense |
| Language | C/Metal | C/C++ | Python/Metal | Go (wraps llama.cpp) |
| SSD streaming | ✅ Core design | Partial (mmap) | ❌ | ❌ |
| MoE optimized | ✅ | Basic | Basic | Basic |
| Tool calling | ✅ (4-bit) | ✅ | ✅ | ✅ |
| Ease of use | Manual build | Easy | Easy | One-liner |
| Platform | macOS only | Cross-platform | macOS only | Cross-platform |
Flash-MoE is not trying to replace llama.cpp or Ollama for everyday use. It’s a research project that proves a specific architectural point: with the right SSD streaming pipeline, MoE models can run on consumer hardware far beyond what RAM alone would allow.
Limitations and Honest Assessment
This is an experimental research project, not production software. Here’s what to know:
- macOS only — The entire engine depends on Apple’s Metal GPU framework and unified memory architecture. No Linux, no Windows, no CUDA.
- 209GB download — You need serious SSD space, and the download itself takes time.
- 4.4 tok/s is slow — For interactive chat, this is usable but noticeably delayed. Claude or GPT-4o via API is 10-50x faster.
- 2-bit quant breaks JSON — The faster 2-bit mode produces \name\ instead of "name", making tool calling unreliable. You’re locked to 4-bit for real work.
- Single-model focus — This is built specifically for Qwen3.5-397B-A17B. It’s not a general-purpose inference engine (yet).
- Active parameters are “only” 17B — Critics rightly point out that comparing this to dense 397B is misleading. The quality ceiling is closer to a very good 17B model with access to specialist knowledge.
Who Should Care About Flash-MoE
Use it if:
- You want to run frontier-class models fully locally for privacy
- You’re researching MoE inference optimization
- You have a Mac with 48GB+ RAM and want to push it to its limits
- You want to understand how SSD streaming inference actually works
Skip it if:
- You need fast interactive responses (use API or smaller local models)
- You’re on Linux or Windows
- You don’t have 209GB of free SSD space
- You just want a local model that works (use Ollama instead)
What This Means for the Future
Flash-MoE is a proof of concept that will age well. As Apple ships Macs with faster SSDs and more GPU cores, the same architecture will get faster for free. The M4 Ultra with 192GB unified memory could potentially run this at 15-20+ tok/s with warm cache.
More importantly, it validates a design principle: SSD is the new VRAM for sparse models. As more model architectures adopt MoE (and the trend is clearly heading that way), SSD streaming inference will become a standard technique rather than a research curiosity.
The paper that accompanies the project documents 90+ experiments and the full story of building the engine in 24 hours through human-AI collaboration. It’s worth reading not just for the technical content but as an example of what a skilled engineer can accomplish with AI assistance in a single day.
FAQ
Can I run Flash-MoE on a Mac with 32GB RAM?
Technically possible but not recommended. With only 32GB, the OS page cache would be ~26GB, reducing the expert cache hit rate significantly. You’d see 2-3 tok/s instead of 4.4. The 48GB configuration leaves 42GB for page cache, which is the sweet spot for the 209GB model.
Is 4.4 tokens/second fast enough for real work?
For interactive chat, it’s usable but slow — about one short sentence every 3-4 seconds. For batch processing, code generation, or any task where you can wait, it’s perfectly fine. The quality of a 397B MoE model often justifies the wait, especially for complex reasoning tasks.
How does this compare to running smaller models at full speed?
A Qwen 27B model at Q6 quantization on the same hardware runs at 30-40 tok/s. For most tasks, that’s a better tradeoff. Flash-MoE shines on tasks that benefit from the 397B model’s broader knowledge and reasoning — complex multi-step problems, nuanced analysis, or tasks requiring specialized domain knowledge.
Will this work on the M4 MacBook Pro?
Yes, and it should be faster. The M4 Pro and M4 Max have improved memory bandwidth and GPU cores. An M4 Max with 48GB should hit 5-6 tok/s, and an M4 Ultra with 192GB could potentially hit 15-20+ tok/s with most experts cached in RAM.
Can I use this with other MoE models besides Qwen3.5-397B?
Not currently. Flash-MoE is hardcoded for Qwen3.5-397B-A17B’s specific architecture (512 experts, K=4 routing, GatedDeltaNet attention). Adapting it to other MoE models would require modifying the weight extraction and inference code, though the core principles transfer directly.
Is this actually “running 397B parameters” or is it misleading?
Both are true. The model has 397B total parameters, and all of them influence the output through the routing mechanism. But only ~17B parameters are active for any given token. Comparing it directly to a dense 397B model (like GPT-4’s rumored architecture) isn’t fair — the quality ceiling is closer to a very good 17B-30B dense model with access to specialist knowledge through the expert routing.