TL;DR

ds4 is Salvatore “antirez” Sanfilippo’s brand-new inference engine for the DeepSeek V4 Flash model. The creator of Redis spent the last few weeks writing a pure-C runtime that does one thing: run DeepSeek V4 Flash as fast as physically possible on a MacBook or a DGX Spark. It’s not a generic GGUF runner. It is not a wrapper around llama.cpp. It’s a deliberately narrow engine that has gained 8,056 GitHub stars in roughly four days and topped the Hacker News front page with 497 points and 157 comments.

The headline result: a 284B-parameter Mixture-of-Experts model running at 26.68 tokens/s generation on a 128 GB MacBook Pro M3 Max, with a 1-million-token context window and persistent on-disk KV cache. That’s not a typo. That’s a frontier-class model on a laptop you can buy at the Apple Store.

Key facts:

  • Pure C inference engine for one model: DeepSeek V4 Flash (284B params, ~13B active)
  • Metal (macOS) and CUDA (Linux, DGX Spark optimized) primary backends
  • 2-bit asymmetric quantization — only MoE experts quantized, shared layers untouched
  • 1M token context window with KV cache as a “first-class disk citizen”
  • OpenAI/Anthropic-compatible HTTP server with tool calling baked in
  • 96 GB MacBooks confirmed working at 250k context by community reports
  • MIT licensed, built openly with GPT-5.5 assistance (antirez says so himself)
  • Status: alpha — “this exists only for a few days,” but it works

If you have a 128 GB Mac and you’ve been waiting for the moment when “local frontier model” stops being aspirational, this is the moment.

Quick Reference

FieldValue
Repoantirez/ds4
AuthorSalvatore Sanfilippo (creator of Redis)
LanguageC (no GGML link, no llama.cpp runtime)
LicenseMIT
ModelDeepSeek V4 Flash only
Weightshuggingface.co/antirez/deepseek-v4-gguf
BackendsMetal (macOS), CUDA (Linux/DGX Spark), ROCm (separate branch)
Min RAM96 GB (q2-imatrix), 128 GB recommended, 256 GB+ for q4
Quantsq2-imatrix, q4-imatrix, q2, q4, optional MTP for speculative decoding
ContextUp to 1,000,000 tokens (disk-backed KV cache)
First commit2026-05-06
Stars8,056 (as of 2026-05-13, ~4 days after launch)

What ds4 Actually Is

The README is unusually candid about scope:

“DwarfStar 4 is a small native inference engine specific for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime: it is completely self-contained.”

That single sentence explains the whole project. The local inference world is dominated by llama.cpp, Ollama, MLX, and vLLM — runtimes that try to support hundreds of model architectures. Every time a new model drops, those teams scramble to add support, which means every model gets an okay implementation but nothing gets a great one. The MoE routing might be generic, the KV cache layout might be conservative, the tokenizer might be wrapped through three abstraction layers.

ds4 takes the opposite bet: one model, one engine, finished end-to-end. The GGUF files are custom — antirez built his own quantization pipeline specifically for V4 Flash’s architecture. The 2-bit quants aren’t just q2_K applied uniformly; they’re “asymmetric”: only the routed Mixture-of-Experts (MoE) expert weights get squeezed to IQ2_XXS and Q2_K, while the shared experts, attention projections, and routing layers stay at full precision. The result is a ~70 GB model that fits in 128 GB of unified memory with room left for KV cache, OS, and your editor.

The other architectural bet is the KV cache. DeepSeek V4 Flash already compresses KV state aggressively (it’s part of why the model is fast). ds4 takes that one step further: KV cache lives on disk by default. Modern MacBook SSDs read at ~7 GB/s — fast enough that paging KV chunks during decode is viable. That’s how you get a 1M-token context window on a laptop without 1 TB of RAM.

Three things colliding in the same week:

  1. DeepSeek V4 Flash dropped in late April 2026 — a 284B-parameter MoE with only ~13B active per token, 1M context, and quality that benchmarks competitively with frontier closed models.
  2. antirez published ds4 on May 6, 2026 — six days after Apple started shipping the M3 Ultra Mac Studio with 512 GB unified memory. Hardware caught up to the model the same month an engine appeared.
  3. The author is antirez. When the creator of Redis writes 12,000 lines of pure C to make a specific model run on his MacBook, the open-source world pays attention. The repo crossed 7,000 stars in 4 days without a major Twitter push — pure HN + word of mouth.

The Hacker News thread (497 points, 157 comments at item 48142108) is mostly people being mildly stunned that frontier-class inference on a laptop is no longer hypothetical.

Real Performance Numbers

These are antirez’s own benchmarks from the repo, run with --ctx 32768 --nothink greedy decoding:

MachineQuantPrompt LengthPrefillGeneration
MacBook Pro M3 Max 128 GBq2short58.52 t/s26.68 t/s
MacBook Pro M3 Max 128 GBq211,709 tokens250.11 t/s21.47 t/s
Mac Studio M3 Ultra 512 GBq2short84.43 t/s36.86 t/s
Mac Studio M3 Ultra 512 GBq211,709 tokens468.03 t/s27.39 t/s
Mac Studio M3 Ultra 512 GBq4short78.95 t/s35.50 t/s
Mac Studio M3 Ultra 512 GBq412,018 tokens448.82 t/s26.62 t/s
DGX Spark GB10 128 GBq27,047 tokens343.81 t/s13.75 t/s

To put 26 t/s in context: that’s roughly twice the speed of casual reading. For a 284B model, on a laptop, with 1M context support, this is the first time those three constraints have aligned.

The DGX Spark numbers being slower than the Mac Studio at generation is worth a pause. DGX Spark prefill is excellent (343 t/s), but Apple Silicon’s unified memory architecture wins on the memory-bandwidth-bound decode loop. That’s a structural advantage Nvidia can’t easily close without an HBM-class consumer card.

Getting Started

The install path is genuinely refreshing — no Python environment, no Docker, no conda:

# Clone and download a model
git clone https://github.com/antirez/ds4.git
cd ds4

# 96/128 GB Mac → q2-imatrix (~70 GB download)
./download_model.sh q2-imatrix

# 256 GB+ machine → q4-imatrix (~140 GB)
./download_model.sh q4-imatrix

# Build for your platform
make                 # macOS Metal (default)
make cuda-spark      # Linux CUDA on DGX Spark / GB10
make cuda-generic    # Linux CUDA on other GPUs
make cpu             # CPU diagnostics only (do NOT run on macOS — kernel bug)

Then run the CLI:

./ds4 --ctx 32768 --nothink

Or start the OpenAI-compatible HTTP server for coding agents:

./ds4-server --port 8080 --ctx 200000

The server exposes OpenAI and Anthropic-compatible endpoints, so dropping it into Aider, Continue.dev, OpenClaw, or any tool that speaks those APIs is a matter of changing the base URL:

# Example: point Aider at local ds4
aider --openai-api-base http://localhost:8080/v1 \
      --openai-api-key dummy \
      --model deepseek-v4-flash

Tool calling works in both formats. The README is explicit that 2-bit quants are “not a joke” for tool use — they call tools reliably under coding agents, which is the failure mode that usually kills aggressive quantization.

The “Thinking” Mode Trick

DeepSeek V4 Flash has a configurable reasoning mode. ds4 exposes it directly:

./ds4 --think low      # Short thinking section
./ds4 --think medium   # Default
./ds4 --think high     # Long deliberation
./ds4 --nothink        # Skip thinking entirely

antirez argues this is one of V4 Flash’s distinguishing features: the thinking section length scales with problem complexity, and it’s typically 1/5 the length of competing thinking models like QwQ or DeepSeek R1. That makes “thinking enabled” actually usable for coding work, where every extra second of thought is a tax on iteration speed.

KV Cache on Disk: The Architectural Bet

This is the part most local-LLM tooling doesn’t do, and it’s the reason the 1M-context claim isn’t marketing:

  • DeepSeek V4 Flash’s KV cache is already compressed at the architecture level (similar to MLA from V2/V3).
  • ds4 serializes the KV cache to disk per-session.
  • On a 7 GB/s SSD, paging KV chunks during decode is faster than recomputing them.
  • Sessions persist across restarts — you can resume a 500k-token conversation tomorrow.

The practical implication: you can pre-prefill a giant codebase, save the KV state, and start every coding session with that context already loaded. Cold-start cost amortizes to zero after the first run.

Community Reaction

The Hacker News thread and r/LocalLLaMA discussion converge on a few takes:

  • “This is the antirez software philosophy applied to AI.” Small, single-purpose, written in C, opinionated about scope. Same approach as Redis in 2009.
  • “96 GB Macs work.” Multiple commenters confirmed running q2-imatrix on M3 Max 96 GB at up to 250k context — antirez’s official spec says 128 GB but the community is pushing lower.
  • “Tool calling actually works.” This is the surprise. Most 2-bit quants break structured output. The asymmetric quantization (full-precision shared weights) seems to preserve enough of the model’s reliability for agents to function.
  • “The KV-on-disk bet is the real innovation.” Several commenters singled this out as the structural insight — not the C code, not the quantization, but the decision to stop treating KV cache as a pure RAM resource.
  • “This will not generalize.” ds4 won’t run Llama 4, won’t run Qwen 3.6, won’t ever be a generic engine. That’s the point. The community is split on whether that’s a feature or a limitation.

One representative comment from r/LocalLLaMA: “I’ve been running llama.cpp with V4 Flash and getting ~15 t/s on the same M3 Max. ds4 gave me 26. The KV cache disk persistence alone is worth switching.”

Honest Limitations

Worth knowing before you spend a Saturday downloading 70 GB:

  1. One model only. When DeepSeek V5 or V4.1 drops, you’ll wait for antirez to update or fork the project. There’s no fallback to other models.
  2. Hardware floor is real. Below 96 GB unified memory on Apple Silicon, this project is not for you. Period.
  3. Alpha quality. README says it explicitly: “this exists only for a few days. It will take months to reach a more stable form.” Expect breaking changes, tokenizer edge cases, and tool-calling malforms in early releases.
  4. macOS CPU build crashes the kernel. Genuinely. A macOS VM bug means make cpu on Mac will hard-crash your machine. The README warns about it. Don’t try it.
  5. AMD ROCm is community-maintained. antirez doesn’t have AMD hardware, so the rocm branch lags behind main.
  6. Custom GGUFs only. You can’t point ds4 at GGUFs from other sources. The tensor layout, quant mix, and metadata are bespoke. You download what antirez built or nothing works.
  7. AI-assisted code disclosure. From the README: “This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging.” If that bothers you, antirez is upfront that this project isn’t for you.

Who Should Use This

Good fit:

  • You own or have access to a 128 GB+ Apple Silicon machine
  • You want a frontier-class model for coding agents that runs offline
  • You’re comfortable with C tooling and Make
  • You care about long-context work (50k+ tokens routinely)
  • You’re okay with alpha-quality software and tracking a fast-moving project

Not a fit:

  • You have a 16/32/64 GB laptop (use Ollama with smaller models)
  • You want one runtime for many models (use llama.cpp or Ollama)
  • You need Windows native support
  • You need production stability today

ds4 vs. The Alternatives

Aspectds4llama.cppOllamaMLX
ScopeDeepSeek V4 Flash onlyMany modelsMany models (llama.cpp wrapper)Many models, Apple only
V4 Flash speed (M3 Max)26 t/s~15 t/s~15 t/sUntested
1M contextYes (disk KV)RAM-limitedRAM-limitedRAM-limited
Custom GGUF neededYesNoNoN/A
Tool calling at 2-bitReliableOften brokenOften brokenN/A
Production-readyAlphaYesYesYes

If you want the fastest possible V4 Flash experience on a Mac today, ds4 wins. If you want one tool for all your models, llama.cpp or Ollama still wins.

FAQ

Q: Do I really need 128 GB of RAM to run this? A: Officially yes. Practically, the r/LocalLLaMA community has confirmed q2-imatrix running on M3 Max 96 GB machines, even at 250k context. Below 96 GB you’ll swap heavily or run out of memory. There’s no path to running this on a 32 GB MacBook — the model is fundamentally too large.

Q: How is ds4 different from running DeepSeek V4 Flash in Ollama or llama.cpp? A: Three things: (1) ds4 is ~75% faster on the same hardware because it’s hand-tuned for V4 Flash’s specific architecture; (2) ds4 supports 1M context with KV cache on disk, which generic runtimes don’t; (3) ds4’s 2-bit quants are asymmetric (only experts quantized), so tool calling actually works at q2.

Q: Can I use ds4 with Aider, Continue.dev, OpenClaw, or other coding agents? A: Yes. ./ds4-server exposes OpenAI and Anthropic-compatible HTTP APIs. Point your agent’s base URL at http://localhost:8080/v1 and it should work as a drop-in replacement.

Q: What happens when DeepSeek V5 comes out? A: ds4 won’t run it until antirez (or a fork) adds support. That’s the explicit tradeoff — depth over breadth. The README says the “exact model may change as the landscape evolves” but only one model at a time.

Q: Is it safe to run AI-assisted code that antirez wrote with GPT-5.5? A: The MIT license and full source are public. antirez disclosed the AI collaboration upfront in the README. You can audit the C code yourself — it’s ~12,000 lines and the project is small enough to review. That’s actually more transparent than most modern projects.

Q: Why C and not Rust or Zig? A: antirez has written C for 20 years. Redis is C. Familiarity, predictable performance, easy ABI for Metal/CUDA bindings. Also, “C is for people who have stopped trying to impress people,” to paraphrase a different antirez essay.

Q: Does this work on the DGX Spark? A: Yes, with make cuda-spark. Prefill is excellent (343 t/s on a 7k prompt), but generation throughput is lower than Apple Silicon (13.75 t/s vs 26.68 t/s on M3 Max). The unified memory architecture wins on decode.

Q: Where do the GGUF weights come from? Are they official? A: They’re built by antirez specifically for ds4, hosted at huggingface.co/antirez/deepseek-v4-gguf. They are not compatible with llama.cpp or other runtimes — the tensor layout and quantization mix are bespoke. The model weights themselves come from DeepSeek’s official open release.


Bottom line: ds4 is the most interesting local inference project of May 2026 — not because it does everything, but because it does one thing exceptionally well. If you have the hardware, this is your weekend project. If you don’t, watch the repo; antirez has 20 years of shipping habits and this won’t stay alpha forever.