TL;DR

Ollama 0.19 replaces its Metal backend with Apple’s MLX framework on Apple Silicon Macs, delivering massive speed improvements for local LLM inference. Key facts:

  • 1.6-2x faster prompt processing and token generation on all Apple Silicon
  • 1,851 tokens/sec prefill (NVFP4) and 134 tokens/sec decode (int4) on M5 Max
  • 166K+ GitHub stars, 52 million monthly downloads in Q1 2026
  • NVFP4 support — NVIDIA’s new format for production-quality quantization
  • Smarter caching — reduced memory usage, intelligent checkpoints, better for coding agents
  • M5 Neural Accelerator support for additional hardware acceleration
  • Currently supports Qwen3.5-35B-A3B (more models coming)
  • License: MIT

This is the biggest performance leap for Mac-based local AI since Ollama launched. If you run models locally on a Mac, update now.


Why This Matters

Ollama is the de facto standard for running LLMs locally — 52 million downloads per month, used as the backend for Claude Code, OpenClaw, OpenCode, and dozens of other tools. But on Mac, it’s always left performance on the table.

The problem: Ollama used Apple’s Metal framework (designed for general GPU compute), not MLX (designed specifically for machine learning on Apple Silicon’s unified memory architecture). It’s like running a race in hiking boots — functional, but not optimal.

Ollama 0.19 fixes this. By switching to MLX, it properly exploits the unified memory where CPU and GPU share the same memory pool, eliminating unnecessary data copies and unlocking hardware-specific optimizations.


Benchmarks

Ollama tested with Qwen3.5-35B-A3B on M5 Max:

Prefill Performance (Prompt Processing)

Version       Framework        Speed
Ollama 0.18   Metal (Q4_K_M)   ~1,100 tok/s
Ollama 0.19   MLX (NVFP4)      ~1,851 tok/s
Improvement                    ~1.7x faster

Decode Performance (Token Generation)

Version       Framework        Speed
Ollama 0.18   Metal (Q4_K_M)   ~58 tok/s
Ollama 0.19   MLX (int4)       ~134 tok/s
Improvement                    ~2.3x faster

Third-party benchmarks from byteiota confirm even stronger numbers: ~230 tokens/sec sustained throughput with MLX, representing a 20-30% advantage over llama.cpp on identical hardware.

However, some users on r/LocalLLaMA report that Ollama’s Go wrapper still adds up to 30% overhead compared to running MLX directly. If you need absolute maximum performance, tools like mlx-lm may still edge ahead — but Ollama’s ease of use and ecosystem make it the practical choice.


What Changed Under the Hood

MLX vs Metal: The Architecture Difference

Metal is Apple’s general-purpose GPU framework. It handles everything from gaming to photo processing. It works for ML, but it’s not optimized for it.

MLX is Apple’s purpose-built ML framework. Key advantages:

  • Unified memory aware — Tensors live in shared CPU/GPU memory, zero-copy access
  • Lazy evaluation — Computations are deferred and fused for optimal execution
  • M-series optimized — Leverages Neural Engine and AMX hardware blocks
  • M5 Neural Accelerators — The latest chips get additional hardware acceleration paths
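
The lazy-evaluation point is easier to see concretely. Here is a generic sketch of the idea in plain Python — not MLX's actual API: operations build a graph of deferred nodes, nothing runs until a result is forced, and that deferral window is what lets a framework like MLX fuse adjacent operations into a single GPU kernel.

```python
# Generic lazy-evaluation sketch (illustration of the concept,
# not MLX's API). Operations return graph nodes; computation is
# deferred until .eval() forces the result.

class Lazy:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps
        self.cached = None

    def eval(self):
        # Compute once, on demand, after all dependencies are evaluated.
        if self.cached is None:
            self.cached = self.fn(*(d.eval() for d in self.deps))
        return self.cached

def const(v):
    return Lazy(lambda: v)

def add(a, b):
    return Lazy(lambda x, y: x + y, a, b)

def mul(a, b):
    return Lazy(lambda x, y: x * y, a, b)

# Build (2 + 3) * 4 without computing anything yet...
graph = mul(add(const(2), const(3)), const(4))
# ...then force evaluation only when the result is needed.
result = graph.eval()  # 20
```

In a real ML framework the deferred graph holds tensor ops rather than scalars, and the scheduler rewrites it (fusion, dead-code elimination) before anything touches the GPU.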

NVFP4: Production-Quality 4-Bit Quantization

Ollama 0.19 introduces support for NVIDIA’s NVFP4 (4-bit floating point) format. This matters because:

  • Better accuracy than GGUF Q4 — NVFP4 maintains model quality closer to full precision
  • Production parity — Same quantization format used by cloud inference providers
  • NVIDIA Model Optimizer — Models optimized by NVIDIA’s tools work directly in Ollama

This means your local inference results match what you’d get from cloud providers using the same format.
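
To make the format less abstract, here is a toy sketch of NVFP4-style block quantization — not Ollama's or NVIDIA's implementation. NVFP4 stores FP4 (E2M1) elements in 16-value blocks with an FP8 (E4M3) scale per block; this sketch uses a smaller block and a plain float scale to show the core idea of snapping values to the FP4 grid.

```python
# Toy sketch of NVFP4-style block quantization (illustrative only).
# Real NVFP4: 16-element blocks, FP8 (E4M3) per-block scales, plus a
# per-tensor scale. Here the scale is kept as an ordinary float.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    """Map a block of floats to (scale, signed FP4 grid values)."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # 6.0 = largest E2M1 magnitude
    codes = []
    for x in block:
        # Snap |x| / scale to the nearest representable FP4 magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.02, -0.11, 0.30, 0.07, -0.24, 0.15, 0.01, -0.05]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
```

Because the grid is floating point rather than uniform integer steps, small values keep proportionally more precision — one reason FP4 formats tend to track full-precision quality more closely than integer Q4 schemes.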

Smarter Caching

The cache improvements are particularly relevant for coding agents:

  • Cross-conversation cache reuse — When Claude Code or OpenClaw sends multiple requests with the same system prompt, the cache is reused instead of reprocessed
  • Intelligent checkpoints — Ollama stores cache snapshots at smart locations, reducing recomputation
  • Smarter eviction — Shared prefixes survive longer when old branches are dropped

For coding workflows where you’re sending many requests with the same context, this alone can significantly reduce latency.
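
The mechanics behind cross-conversation reuse can be sketched with a toy model — this is an illustration of the general prefix-cache idea, not Ollama's internals. When two requests share a leading run of tokens (for example, an identical system prompt), only the differing suffix needs fresh prompt processing:

```python
# Toy illustration of prefix-cache reuse (not Ollama's internals).
# Token IDs below are stand-ins for a real tokenizer's output.

def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the new request can reuse from cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

system_prompt = list(range(900))           # same 900-token system prompt both times
req1 = system_prompt + [1001, 1002, 1003]  # first user message
req2 = system_prompt + [2001, 2002]        # second user message

reused = shared_prefix_len(req1, req2)
to_process = len(req2) - reused  # only the new suffix needs prefill
```

Here the second request reprocesses 2 tokens instead of 902 — which is why agents that resend a large, fixed system prompt benefit so disproportionately.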


Getting Started

Install or Update

# macOS (Homebrew)
brew upgrade ollama

# Or use the install script
curl -fsSL https://ollama.com/install.sh | sh

# Verify version
ollama --version
# Should show 0.19.x

Requirements: Apple Silicon Mac (M1 or later); the current MLX-accelerated model needs 32GB+ unified memory.

Run the Optimized Model

# Chat mode
ollama run qwen3.5:35b-a3b-coding-nvfp4

# Launch for Claude Code
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

# Launch for OpenClaw
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4

The qwen3.5:35b-a3b-coding-nvfp4 model is a Mixture-of-Experts model with 35B total parameters but only 3B active per token — fast enough for real-time coding assistance.

Memory Requirements

Mac Config   What Runs Well
32GB         Qwen3.5-35B-A3B (~20GB model + 12GB cache)
48GB         Comfortable headroom for larger contexts
64GB+        Multiple models or 70B+ parameter models
128GB+       Largest open models with full context

Users on MacRumors confirm that 32GB works but is tight — the model uses ~20GB, leaving ~12GB for KV cache. For longer conversations, 48GB+ is recommended.
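
The ~20GB figure checks out with back-of-envelope arithmetic. In this rough sketch, the 10% overhead factor for block scales and metadata is an assumption, not a published number:

```python
# Back-of-envelope check of the ~20GB model-size figure.
# SCALE_OVERHEAD is an assumed factor, not a published spec; real model
# files also carry embeddings, norms, and tokenizer data.

TOTAL_PARAMS = 35e9   # Qwen3.5-35B-A3B total parameters
BITS_PER_WEIGHT = 4   # NVFP4 / int4
SCALE_OVERHEAD = 1.10 # assumed ~10% for per-block scales and metadata

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # 17.5 GB raw weights
model_gb = weights_gb * SCALE_OVERHEAD                 # ~19.25 GB on disk/in memory
```

That lands right around the reported ~20GB, leaving ~12GB on a 32GB machine for the KV cache and the OS — consistent with the "works but tight" reports.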


Community Reactions

The Hacker News thread (front page, hundreds of comments) sparked a broad debate about local AI:

“On-device models are the future. Users prefer them. No privacy issues. No dealing with connectivity, tokens, or changes to vendors’ implementations.” — HN commenter

“If we continue to see improvements in running local models, and RAM prices continue to fall… suddenly you don’t have to worry about token counts anymore” — HN commenter

The r/LocalLLaMA community celebrated but with caveats — some users noted Ollama’s Go wrapper overhead compared to raw MLX, and the current limitation to one model architecture.

Banking and enterprise users pointed out that local inference eliminates their data privacy concerns entirely — a major factor given GitHub’s recent announcement that Copilot data will be used for training starting April 24.


Current Limitations

  1. Model support is limited — Only Qwen3.5-35B-A3B is MLX-accelerated in this preview. More architectures are coming.
  2. Preview status — Some edge cases may have issues. Not recommended for critical production yet.
  3. 32GB minimum — The optimized model needs significant memory. 16GB Macs are out.
  4. Go wrapper overhead — Up to 30% slower than raw MLX according to community benchmarks. Fine for practical use, but speed demons may prefer mlx-lm directly.
  5. Apple Silicon only — Intel Macs get no benefit. NVIDIA GPU users should stick with CUDA backend.

Ollama 0.19 vs Alternatives

Tool          Backend   Speed (M4 Max)   Best For
Ollama 0.19   MLX       ~134 tok/s       Ease of use, ecosystem
mlx-lm        MLX       ~170 tok/s       Raw speed on Mac
llama.cpp     Metal     ~85 tok/s        Cross-platform, flexibility
LM Studio     MLX       ~130 tok/s       GUI users
vLLM          CUDA      ~400+ tok/s      Production servers (NVIDIA)

Ollama wins on ecosystem and ease of use. It’s the backend for Claude Code, OpenClaw, and most local LLM workflows. The MLX update closes the performance gap that previously pushed speed-sensitive users to alternatives.


Who Should Upgrade

  • Mac users running Ollama → Update immediately. Free 1.6-2x speed boost.
  • Claude Code / OpenClaw users → Your local model responses will be noticeably faster.
  • Privacy-conscious developers → Local inference means zero data leaves your machine. Particularly relevant given GitHub Copilot’s upcoming training data policy change.
  • Intel Mac users → No benefit. Consider upgrading hardware.
  • NVIDIA GPU users → Stick with CUDA. This is Apple Silicon only.

FAQ

Q: Do I need to change any settings after upgrading? A: No. Ollama 0.19 automatically uses MLX on Apple Silicon. Just update and run.

Q: Will my existing GGUF models work? A: GGUF models still work but run on the Metal path. MLX acceleration currently requires the NVFP4 format. GGUF models will transition to MLX as more architectures are supported.

Q: Is Ollama 0.19 stable for daily use? A: It’s labeled “preview” but community reports are positive. The 32GB setup works well with the recommended model. Edge cases with very long contexts may hit memory limits.

Q: When will more models be supported? A: Ollama says they’re “actively working” to support more architectures. Expect broader support within weeks, not months.

Last verified: April 2, 2026