TL;DR

Rapid-MLX is a new open-source LLM inference server for Apple Silicon that bills itself as the “fastest local AI engine for Mac” — and the benchmarks back it up. It’s an OpenAI-compatible drop-in replacement that runs Qwen, DeepSeek, Gemma, Llama and friends directly on Apple’s MLX framework, with results that beat even Ollama’s new MLX backend by 2-4x on the same hardware. It picked up ~800 stars this week and is climbing fast on GitHub trending.

Key facts:

  • 2-4x faster than Ollama on Apple Silicon (community benchmarks: 108 tok/s vs 41 tok/s on Qwen3.5-9B, M3 Ultra)
  • 0.08s cached TTFT (Time To First Token) — feels instant in coding agents
  • 100% tool calling on Qwen3.5/Qwopus models — works with Claude Code, Cursor, Aider, OpenCode out of the box
  • Drop-in OpenAI API — just point any client at http://localhost:8000/v1
  • 17 tool parsers built in (Hermes, Llama 3 JSON, Qwen, Gemma function call, etc.)
  • Prompt cache + reasoning separation for chains-of-thought models like DeepSeek R1
  • Smart cloud routing — auto-forwards large-context requests to cloud LLMs when local prefill would be slow
  • 58 model aliases across 21 families, including day-0 DeepSeek V4 Flash 158B-A13B
  • MIT licensed, Python 3.10+, MLX 0.18+, Apple Silicon only (M1/M2/M3/M4/M5)

If you’ve already invested in a 32-128GB Mac mini, Studio or MacBook Pro specifically to run local models, Rapid-MLX is the upgrade that finally makes that hardware feel like a frontier-grade workstation. If you’re new to local AI, this is also one of the easiest ways in — brew install and you’re done.

Why This Matters Now

Ollama’s 0.19 release in late April was the biggest performance jump local AI has had on Mac all year. We reviewed it here — 2x faster decode, 134 tok/s on Qwen3.5-35B-A3B, real-time coding assistance suddenly possible on a Mac mini.

Two weeks later, Rapid-MLX shipped a benchmark page showing it beating that by another 2-4x. In fairness, everyone assumed someone would do this eventually: Ollama is written in Go and wraps MLX through a service boundary, so a pure-Python server sitting directly on mlx-lm was always going to be faster. What no one expected was how large the gap would be, or how good the agent harness compatibility would be on day one.

Andrei from the Hetzner side of this blog runs a 32GB Mac mini M5 Pro for local AI experiments. After installing Rapid-MLX last weekend, his attitude toward running Qwen3.5-27B on coding tasks went from “I’ll wait for Claude” to “I’ll just let the local one do it.” That’s the whole pitch.

What Rapid-MLX Actually Is (Architecture)

Rapid-MLX is three things bundled into one pip install rapid-mlx:

  1. An MLX-native serving layer — built on top of mlx-lm and mlx-vlm, but with continuous batching, KV cache reuse across requests, and reasoning-token separation (DeepSeek R1’s <think> blocks are stripped before they hit the tool-calling parser).
  2. An OpenAI-compatible HTTP server — exposes /v1/chat/completions, /v1/embeddings, /v1/models, plus an Anthropic-shaped /v1/messages endpoint for Claude SDK clients (see the sketch after this list).
  3. 17 tool-call parsers — Hermes XML, Llama 3.1 JSON, Qwen <tool_call>, Gemma function-call, Phi-4, Mistral V3, GLM 4.5, MiniMax M2.5, NVIDIA Nemotron, plus generic fallbacks. The right parser is selected automatically per model.
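
The Anthropic-shaped endpoint in item 2 means Claude SDK clients can talk to the local server directly. A minimal sketch, assuming Rapid-MLX’s /v1/messages mirrors the Anthropic Messages API closely enough for the official anthropic Python SDK; the model alias and max_tokens value are placeholders:

# Sketch: pointing the anthropic SDK at Rapid-MLX's /v1/messages endpoint.
# Assumes the local endpoint follows the Anthropic Messages API shape.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8000",  # SDK appends /v1/messages itself
    api_key="not-needed",              # Rapid-MLX doesn't check keys
)

resp = client.messages.create(
    model="default",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize what MLX is in one sentence."}],
)
print(resp.content[0].text)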

The “100% tool calling” claim refers to the result of rapid-mlx agents --test, which runs a 14-tool deterministic test against any harness/model combination. Qwen3.5 27B, Qwen3.6 35B-A3B, and Qwopus 27B all hit 100% on Hermes, PydanticAI, and LangChain harnesses. Llama 3.3 70B also hits 100% but only via smolagents (which generates code instead of JSON).

There’s also MHI — Model-Harness Index — a composite score combining tool calling (50%), HumanEval (30%), and tinyMMLU (20%). It’s the metric that matters most if you’re using these models inside coding agents. Qwopus 27B scored 92, Llama 3.3 70B scored 83, Qwen3.5 27B scored 82.
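
To make the weighting concrete, here is the arithmetic as a tiny sketch, assuming all three components are reported on a 0-100 scale and MHI is a plain weighted average; the component values below are illustrative, not measured:

# Illustrative MHI calculation: weighted average of the three component
# scores, assuming each is on a 0-100 scale. Example inputs are made up.
def mhi(tool_calling: float, humaneval: float, tiny_mmlu: float) -> float:
    return 0.5 * tool_calling + 0.3 * humaneval + 0.2 * tiny_mmlu

print(mhi(tool_calling=100, humaneval=85, tiny_mmlu=80))  # 91.5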

Installation (Three Ways, Pick One)

The README is unusually careful about Python versioning — macOS still ships Python 3.9, which doesn’t satisfy the >=3.10 requirement, so there are three install paths:

# Option 1 — Homebrew (recommended, handles Python automatically)
brew install raullenchai/rapid-mlx/rapid-mlx

# Option 2 — pip (need your own Python 3.10+)
pip install rapid-mlx

# Option 3 — one-liner with auto-Python install
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

The base install is ~460MB. If you want vision models like Gemma 4 or Qwen-VL, add the extras: pip install 'rapid-mlx[vision]' (~322MB more for mlx-vlm + opencv + torch).

Serving a Model in 30 Seconds

Once installed:

rapid-mlx serve qwen3.5-9b

First run downloads from Hugging Face (~5GB for 9B at 4bit), then prints Ready: http://localhost:8000/v1. Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    "stream": false
  }'

That’s it. You now have an OpenAI-compatible inference server running on Apple Silicon, and any tool that supports OpenAI’s API works against it.
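
The same request through the official openai Python client looks like this; the api_key is any placeholder string, since Rapid-MLX doesn’t check keys:

# Same request via the openai client pointed at the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
)
print(resp.choices[0].message.content)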

To see all available aliases: rapid-mlx models — currently 58 across 21 families, including Qwen3.5/3.6, Qwopus, DeepSeek R1 and V4 Flash, Gemma 3/4, Llama 3, GLM 4.5/4.7, MiniMax M2.5, Kimi 48B/K2.5, Mistral 24B, Devstral V2, and GPT-OSS 20B.

Real Code Example: PydanticAI with Local Tool Calling

This is where Rapid-MLX shines compared to mlx-lm directly — the tool-call parsing just works. Here’s a fully working typed agent:

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic import BaseModel

class WeatherResponse(BaseModel):
    location: str
    temp_c: float
    condition: str

model = OpenAIChatModel(
    model_name="default",
    provider=OpenAIProvider(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",  # Rapid-MLX doesn't check keys
    ),
)

agent = Agent(model, output_type=WeatherResponse)

@agent.tool_plain
def get_weather(city: str) -> dict:
    # Pretend API call
    return {"city": city, "temp": 18.0, "sky": "rain"}

result = agent.run_sync("What's the weather in Tallinn?")
print(result.output)
# WeatherResponse(location='Tallinn', temp_c=18.0, condition='rain')

Run this against rapid-mlx serve qwen3.5-9b and the model issues a real tool call, Rapid-MLX parses it via the Qwen <tool_call> parser, PydanticAI executes get_weather, returns to the model, and you get a typed Python object. No instructor, no custom JSON-mode hacks — it just works because Qwen3.5 has 100% tool calling on this harness.
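
If you would rather see the raw wire format, the same parsing shows up as standard OpenAI-style tool_calls when you pass a tools array yourself. A minimal sketch, with a hand-written JSON schema mirroring the get_weather tool from the example above:

# Sketch: low-level tool calling without a framework. Rapid-MLX is expected
# to return a standard OpenAI-style tool_calls array; the schema is ours.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Tallinn?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
# e.g. a single tool call with name='get_weather', arguments='{"city": "Tallinn"}'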

Connecting It to Coding Agents

The README documents wiring Rapid-MLX into every popular coding agent — and most are one-liners.

Cursor: Settings → Models → Add OpenAI-compatible model. Base URL http://localhost:8000/v1, key not-needed, model name default. Cursor’s composer/agent mode picks up tool calls automatically.

Claude Code:

OPENAI_BASE_URL=http://localhost:8000/v1 claude

With OPENAI_BASE_URL overridden, Claude Code routes all requests through Rapid-MLX. Pair it with qwen3.5-27b for a no-API-cost coding loop on a 32GB Mac.

Aider:

aider --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed

OpenCode (the open-source Claude Code clone): rapid-mlx agents opencode --setup writes opencode.json for you. Same for codex --setup or hermes-agent --setup.

Continue.dev (~/.continue/config.yaml):

models:
  - name: rapid-mlx
    provider: openai
    model: default
    apiBase: http://localhost:8000/v1
    apiKey: not-needed

OpenClaw: works out of the box. Set agents.defaults.codingModel to openai/default and point OPENAI_BASE_URL at the local server.

Benchmarks: How Much Faster, Really?

The README’s headline number is “4.2x faster than Ollama.” Community benchmarks tell a more nuanced story — but Rapid-MLX wins everywhere:

Model | Hardware | Rapid-MLX | Ollama 0.19 (MLX) | Speedup
Qwen3.5-9B 4bit | M3 Ultra | 108 tok/s | 41 tok/s | 2.6x
Gemma 4 26B 4bit | M3 Ultra | 85 tok/s | 68 tok/s | 1.3x
Phi-4 Mini 14B | M3 Ultra | 180 tok/s | 56 tok/s | 3.2x
Qwen3.5-27B 4bit | M5 Max 64GB | 64 tok/s | 39 tok/s | 1.6x
Nemotron-Nano 30B | Mac mini M5 32GB | 141 tok/s | n/a | (new)
Qwen3.6-35B-A3B | M5 Max 64GB | 95 tok/s | 73 tok/s | 1.3x
DeepSeek V4 Flash 158B-A13B | Mac Studio Ultra 128GB | 31-56 tok/s | n/a | (Rapid-MLX only)

The 0.08s cached TTFT is the other underrated win. In Cursor or Claude Code, the model regularly receives the same long system prompt followed by short edits. Rapid-MLX’s prompt cache reuses the KV cache for the unchanged prefix, so from the second turn onwards it starts generating tokens in about 80 milliseconds. Ollama’s prompt cache is less aggressive — the same workload typically takes 400-800ms TTFT.
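
You can sanity-check the prompt-cache behaviour yourself by sending the same long system prompt twice and timing the first streamed chunk. A rough sketch; the actual numbers depend on your machine and model:

# Rough TTFT measurement: the second run should be much faster than the
# first if the KV cache for the shared system prompt is being reused.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
system = "You are a careful coding assistant. " * 500  # long shared prefix

def ttft(user_msg: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user_msg}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")  # no content received

print(f"cold TTFT:   {ttft('Rename variable x to count.'):.2f}s")
print(f"cached TTFT: {ttft('Now rename y to total.'):.2f}s")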

Smart Cloud Routing — The Killer Feature Nobody’s Talking About

Buried in the README is a feature called Smart Cloud Routing. The problem: prefill on Apple Silicon is much slower than decode. A 50K-token paste into a coding agent might take 30+ seconds before the first token streams on a 32GB Mac mini. Rapid-MLX optionally detects when the input exceeds a configurable threshold and forwards that specific request to a cloud LLM (Anthropic, OpenAI, or Groq), then transparently returns to the local model for follow-ups in the same session.

Configure in ~/.rapid-mlx/config.yaml:

cloud_routing:
  enabled: true
  prefill_threshold_tokens: 16000
  provider: anthropic
  model: claude-sonnet-4-7
  api_key_env: ANTHROPIC_API_KEY

For me on a 32GB Mac mini, this is the difference between “local AI is a toy” and “local AI is my daily driver.” The expensive long-context calls (whole-repo questions, big-PDF analysis) hit the cloud once; everything else stays on-device and free.
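
If you want a feel for which requests will cross that 16K-token threshold, a rough character-count heuristic is enough. This uses the usual back-of-the-envelope figure of about 4 characters per token for English and code, not Rapid-MLX’s own tokenizer:

# Back-of-the-envelope check: would this prompt trigger cloud routing?
# ~4 chars/token is an approximation, not how Rapid-MLX counts tokens.
PREFILL_THRESHOLD_TOKENS = 16_000

def will_route_to_cloud(prompt: str) -> bool:
    est_tokens = len(prompt) / 4
    return est_tokens > PREFILL_THRESHOLD_TOKENS

big_paste = "def handler(event):\n    ...\n" * 5_000  # ~140K chars, ~35K tokens
print(will_route_to_cloud(big_paste))  # True -> this request would go to the cloud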

Community Reactions

From the HN thread that launched it (item 47816238, ~340 points):

“The MHI methodology is what sold me. Other servers benchmark perplexity; this one benchmarks the thing that actually matters for coding — does the tool call work end-to-end.”

“0.08s TTFT in Cursor changes the feel completely. Local Qwen now feels faster than cloud Sonnet because there’s zero network round-trip.”

“DFlash support on day 0 is wild. Got DeepSeek V4 Flash running on my M4 Max 128GB with 1M context the day after Anthropic released it.”

From r/LocalLLaMA, the criticism is consistent and worth knowing:

“It’s MLX-only. If you have a Linux box with a 4090, llama.cpp + vLLM is still the play.”

“The cloud routing is brilliant but feels like an admission that Apple Silicon can’t handle long contexts.”

“Documentation is dense. The README is 4000 lines. A quickstart that just covers brew install → serve → Cursor would 10x adoption.”

Fair criticisms, all three.

Honest Limitations

Apple Silicon only. No CUDA, no AMD, no Linux. If your fleet is mixed, you’ll need llama.cpp + Ollama anyway.

MLX model zoo lag. Models need to be converted to MLX format. The mlx-community Hugging Face org is fast (DeepSeek V4 Flash shipped day-0), but if you want bleeding-edge custom finetunes from random research orgs, GGUF still has a 1-2 week head start.

Prefill is still the bottleneck. On a 16GB MacBook Air, ingesting a 32K-token prompt takes 15-25 seconds even with Rapid-MLX. Decode is fast, prefill is unavoidable physics. Smart cloud routing mitigates this but doesn’t solve it.
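
The arithmetic behind that 15-25 second figure is straightforward: it implies a prefill throughput of roughly 1,300-2,100 tok/s on that machine, and prefill throughput is what governs long-context latency. A quick sanity-check using the numbers quoted above:

# Implied prefill throughput from the figures above: 32K tokens in 15-25s.
prompt_tokens = 32_000
for seconds in (15, 25):
    print(f"{seconds}s -> ~{prompt_tokens / seconds:,.0f} tok/s prefill")
# 15s -> ~2,133 tok/s prefill
# 25s -> ~1,280 tok/s prefill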

Memory pressure. macOS doesn’t isolate model memory the way Linux does. If you serve Qwen3.5-27B (15GB) and then open Chrome, Activity Monitor will go yellow within minutes. Stick to the official sizing table.

Single-user. No multi-tenant auth, no request quotas, no team management. This is a single-developer-machine tool. Don’t expose it to your team’s network without putting a proxy in front.

No web UI. You bring your own client (Cursor, Claude Code, Open WebUI, LibreChat). For pure chat, install Open WebUI with the one-line Docker command in the README.

FAQ

Is Rapid-MLX faster than Ollama for everyone?

For Apple Silicon Macs running MLX-format models, yes — 1.3x to 4.2x faster depending on model and hardware. For Linux/NVIDIA, llama.cpp or vLLM remain the right choice. Ollama is still the better pick if you need cross-platform support (Windows, Linux, Intel Mac) or if you’re not comfortable with Python.

Can I use Rapid-MLX with Claude Code?

Yes. Set OPENAI_BASE_URL=http://localhost:8000/v1 before launching claude. Use a tool-calling-capable model like qwen3.5-27b or qwopus-27b and Claude Code’s full agentic loop works against a local model with zero API cost. See the Claude Code skill we wrote for prompt patterns that work well with smaller local models.

What’s the best model for a 32GB Mac mini?

For coding: qwen3.5-27b (15.3GB, 39 tok/s, 100% tool calling) is the sweet spot. For chat and reasoning: nemotron-nano-30b (18GB, 141 tok/s) — fastest 30B available. If you have 32GB exactly and want to push it, qwen3.6-35b-a3b (20GB, 95 tok/s, 262K context, 256 MoE experts) is the most capable that still fits.

Does Rapid-MLX support vision/multimodal models?

Yes — install with pip install 'rapid-mlx[vision]' and you get Gemma 4 26B/31B, Qwen-VL 4B/8B/30B, and the rest of the mlx-vlm zoo. Same OpenAI-compatible endpoint, just pass image_url content blocks.
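
A minimal sketch of an image request, assuming the vision extras are installed and a vision-capable model is being served; the image URL is a placeholder:

# Sketch: multimodal request via the OpenAI-style image_url content block.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)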

How does it compare to LM Studio?

LM Studio has a polished GUI and built-in chat UI; Rapid-MLX is server-first and faster. If you want point-and-click for non-developers, LM Studio. If you’re wiring it into Cursor/Claude Code/Aider and want maximum throughput, Rapid-MLX.

Is it production-ready?

For single-developer local use, yes. For team servers or production inference, no — there’s no auth, no rate limiting, no request observability beyond logs. Use Bento, Modal, or vLLM behind an API gateway for production.

Verdict

Rapid-MLX is the most exciting Apple Silicon AI release of the year, and it isn’t close. It takes the foundation Apple’s mlx-lm team has built and adds the missing pieces — tool calling, prompt cache, cloud routing, 58 model aliases — that turn local AI from “fun toy” into “actual development environment.”

If you’ve got a 32GB+ Mac, install it tonight and try qwen3.5-27b against Claude Code. The combination is genuinely production-grade. The 2-4x speed advantage over Ollama isn’t just bragging rights — it’s the difference between watching tokens stream and feeling like a cloud model is responding.

The downsides — Apple-only, documentation density, no multi-tenant story — are real but manageable for the target audience (individual developers on Mac who already chose this platform for AI work). For everyone else, Ollama 0.19 is still the right cross-platform default.

GitHub: raullenchai/Rapid-MLX · MIT licensed · MLX 0.18+ · Python 3.10+ · Apple Silicon only


Reviewing local AI tooling on andrew.ooo is part of a longer thread — see Ollama 0.19 MLX, LiteRT-LM, and Flash MoE for the rest of the Apple Silicon inference landscape.