How to Run DeepSeek V4 Flash Locally (Step-by-Step, 2026)
DeepSeek V4 Flash launched yesterday (April 24, 2026) — the cheapest 1M-context open-weight model ever released. Here’s how to run it on your own hardware as of April 25, 2026.
Last verified: April 25, 2026
Why run it locally?
- Privacy — no data leaves your network
- Cost at scale — past ~10B tokens/month, self-hosting beats API
- Latency — sub-100ms TTFT vs 200-400ms via API
- Tinkering — fine-tuning, custom adapters, modified inference
For most developers, the API is still cheaper. V4-Flash is $0.14/$0.28 per million tokens. Self-hosting only makes sense at high volume or for compliance reasons.
Hardware requirements
| Setup | Hardware | Quantization | Throughput |
|---|---|---|---|
| Minimum dev | 4× RTX 5090 (96GB) | INT4 (GPTQ/AWQ when available) | ~30 req/sec |
| Solid prod | 1× H200 SXM5 (141GB) | INT8 (FP8) | ~50 req/sec |
| Best perf | 8× A100 80GB | FP16 | ~80 req/sec |
| Apple Silicon | M3 Ultra 192GB | INT4 via MLX | ~25 tok/sec |
| Huawei Ascend | Ascend 950 supernode | w8a8 | Varies by cluster |
V4-Pro needs 16× H200 minimum. Use the API unless you have a compelling reason not to.
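The table above can be sanity-checked with a quick weights-only estimate (KV cache and activations come on top). The 200B parameter count below is a placeholder, not an official DeepSeek figure; plug in the total from the model card:

```python
def weight_vram_gb(total_params_b: float, bits_per_param: int) -> float:
    """Approximate GB needed just to hold the weights at a given precision."""
    bytes_total = total_params_b * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB, matching how GPU specs are quoted

# Hypothetical 200B-parameter checkpoint (illustrative, NOT the real figure):
for bits, name in [(16, "FP16"), (8, "FP8/INT8"), (4, "INT4")]:
    print(f"{name}: ~{weight_vram_gb(200, bits):.0f} GB")
```

Under that assumption the spread is ~100GB at INT4 up to ~400GB at FP16, which lines up with the 90-400GB download sizes mentioned below.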
Method 1: vLLM on Nvidia (recommended)
This is the production-grade path. vLLM 0.7.x ships with native DeepSeek V4 support.
Step 1: Install vLLM
pip install "vllm>=0.7.0"
# Or for the bleeding edge:
pip install git+https://github.com/vllm-project/vllm.git
Step 2: Download weights
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./deepseek-v4-flash \
--max-workers 8
Expect 90-400GB depending on quantization. Use --include "*-fp8/*" for the FP8 release if you only have ~140GB VRAM.
Step 3: Start the server
vllm serve ./deepseek-v4-flash \
--tensor-parallel-size 4 \
--max-model-len 1048576 \
--gpu-memory-utilization 0.92 \
--quantization fp8 \
--enable-chunked-prefill \
--port 8000
Adjust --tensor-parallel-size to your GPU count.
Step 4: Test it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./deepseek-v4-flash",
"messages": [{"role": "user", "content": "Write a Python LRU cache."}]
}'
vLLM exposes an OpenAI-compatible API. Most clients (LangChain, OpenAI SDK, Cursor with custom endpoint) work without changes.
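For smoke tests beyond curl, a stdlib-only client is enough. This sketch assumes the serve command above (port 8000, model path ./deepseek-v4-flash); adjust both if your setup differs:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Assemble a /v1/chat/completions payload (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, payload: dict) -> str:
    """POST the payload and return the first choice's message text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_chat_request("./deepseek-v4-flash", "Write a Python LRU cache.")
# chat("http://localhost:8000", payload)  # uncomment with the server running
```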
Method 2: SGLang (best throughput)
SGLang often outperforms vLLM by 20-30% on DeepSeek MoE models. Worth trying if you’re at scale.
pip install "sglang[all]>=0.4.0"
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V4-Flash \
--tp 4 \
--context-length 1048576 \
--quantization fp8 \
--port 30000
Method 3: Apple Silicon (MLX)
For M3 Ultra / M4 Max with ≥128GB unified memory.
pip install mlx-lm
# Convert weights to MLX format (one-time):
python -m mlx_lm.convert \
--hf-path deepseek-ai/DeepSeek-V4-Flash \
--mlx-path ./mlx-deepseek-v4-flash \
-q --q-bits 4
# Run inference:
mlx_lm.server \
--model ./mlx-deepseek-v4-flash \
--port 8080
Expect ~25 tokens/sec on M3 Ultra 192GB at INT4. Not fast enough for production, but excellent for solo dev work.
Method 4: Huawei Ascend 950
Officially supported on launch day. China-friendly deployment path.
pip install vllm-ascend>=0.7.0
# Download weights from ModelScope (faster from China):
modelscope download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./deepseek-v4-flash
# Set Ascend env vars:
export USE_MULTI_BLOCK_POOL=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
# Start server:
vllm serve ./deepseek-v4-flash \
--tensor-parallel-size 8 \
--quantization w8a8 \
--max-model-len 1048576
Use npu-smi info to verify Ascend cards are detected.
Method 5: Ollama (when GGUF lands)
GGUF quants for V4 Flash are not yet on Ollama as of April 25. Expect them within 1-2 weeks based on prior DeepSeek release patterns.
When available, the command will be:
ollama run deepseek-v4-flash
Watch ollama.com/library and unsloth’s HF page for GGUF releases.
Connecting to coding tools
Cursor (custom endpoint)
Settings → Models → Add custom OpenAI endpoint:
- Base URL: http://localhost:8000/v1
- Model name: deepseek-v4-flash
Claude Code via LiteLLM proxy
litellm --model openai/deepseek-v4-flash \
--api_base http://localhost:8000/v1 \
--port 4000
Then point Claude Code at http://localhost:4000.
Continue.dev (VS Code)
# config.yaml
models:
  - title: DeepSeek V4 Flash (local)
    provider: openai
    model: deepseek-v4-flash
    apiBase: http://localhost:8000/v1
    contextLength: 1048576
Performance tuning checklist
- Quantization: FP8 > INT8 > INT4 in quality. INT4 only if VRAM-constrained.
- Chunked prefill: Always enable for long contexts (--enable-chunked-prefill).
- Tensor parallel: Match TP size to GPU count. PCIe-only setups suffer; NVLink/NVSwitch helps a lot.
- Max model len: Don’t set 1M unless you actually use it — cuts KV cache budget.
- Speculative decoding: Use V4-Flash as the draft model for V4-Pro for a ~2× speedup on Pro.
- Batching: vLLM’s continuous batching is on by default. Tune --max-num-seqs.
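The max-model-len point can be made concrete with back-of-envelope KV-cache arithmetic. The layer and head dimensions below are hypothetical, and the formula assumes a plain multi-head-attention cache; DeepSeek's MLA attention compresses KV well below this, so treat it as an upper bound on why 1M-token limits are expensive:

```python
def kv_cache_gb(ctx_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 1) -> float:
    """KV cache for ONE sequence: 2 (K and V) * layers * heads * dim * len."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical 60-layer model, 8 KV heads of dim 128, FP8 (1-byte) cache:
print(f"{kv_cache_gb(1_048_576, 60, 8, 128):.1f} GB per 1M-token sequence")
print(f"{kv_cache_gb(131_072, 60, 8, 128):.1f} GB per 128K-token sequence")
```

Even under these made-up dimensions, dropping the limit from 1M to 128K frees roughly 8× the per-sequence cache budget, which translates directly into more concurrent sequences.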
Costs at the break-even point
When does self-hosting beat the API?
- V4-Flash API: $0.14 / $0.28 per million tokens
- Single H200 box: ~$2.50/hour on RunPod, ~$1,800/month
- Throughput: ~50 req/sec, roughly 2-3M tokens/minute sustained, ~100B tokens/month
Break-even: roughly 5-10 billion tokens/month for V4-Flash. Below that, the API is strictly cheaper.
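That figure follows from simple arithmetic using the prices quoted above; the 50/50 input/output token split is an assumption, so shift it to match your workload:

```python
API_IN, API_OUT = 0.14, 0.28    # $/1M tokens, V4-Flash API pricing above
SELF_HOST_MONTHLY = 1800.0      # ~$2.50/hr H200 on RunPod, quoted above

def api_cost(tokens_b: float, out_frac: float = 0.5) -> float:
    """Monthly API bill in dollars for tokens_b billion tokens."""
    millions = tokens_b * 1000
    return millions * ((1 - out_frac) * API_IN + out_frac * API_OUT)

for tokens_b in (1, 5, 10, 50):
    marker = "<- self-host wins" if api_cost(tokens_b) > SELF_HOST_MONTHLY else ""
    print(f"{tokens_b}B tok/mo: API ${api_cost(tokens_b):,.0f} "
          f"vs self-host ${SELF_HOST_MONTHLY:,.0f} {marker}")
```

At a 50/50 split the crossover sits between 5B and 10B tokens/month, consistent with the range above; output-heavy workloads cross over sooner.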
Common pitfalls
- ❌ Insufficient VRAM — V4-Flash needs ≥80GB even at INT4. Don’t try on a single 4090.
- ❌ Mixing CUDA versions — vLLM is picky. Use a fresh venv with CUDA 12.4+.
- ❌ Ollama too early — wait for official GGUF.
- ❌ Underestimating bandwidth — first download is 90-400GB. Plan for it.
- ❌ Ignoring context len — defaulting to 1M context wastes KV cache budget. Set what you need.
Last verified: April 25, 2026. Sources: DeepSeek V4 model cards on Hugging Face, vLLM docs, vLLM-Ascend repo, MLX-LM docs, Ollama library tracking, RunPod pricing.