AI agents · OpenClaw · self-hosting · automation

How to Run DeepSeek V4 Flash Locally (Step-by-Step, 2026)

DeepSeek V4 Flash launched yesterday (April 24, 2026) — the cheapest 1M-context open-weight model ever released. Here’s how to run it on your own hardware as of April 25, 2026.

Last verified: April 25, 2026

Why run it locally?

  • Privacy — no data leaves your network
  • Cost at scale — past ~10B tokens/month, self-hosting beats API
  • Latency — sub-100ms TTFT vs 200-400ms via API
  • Tinkering — fine-tuning, custom adapters, modified inference

For most developers, the API is still cheaper. V4-Flash is $0.14/$0.28 per million tokens. Self-hosting only makes sense at high volume or for compliance reasons.

Hardware requirements

Setup         | Hardware             | Quantization                   | Throughput
Minimum dev   | 4× RTX 5090 (96GB)   | INT4 (GPTQ/AWQ when available) | ~30 req/sec
Solid prod    | 1× H200 SXM5 (141GB) | FP8                            | ~50 req/sec
Best perf     | 8× A100 80GB         | FP16                           | ~80 req/sec
Apple Silicon | M3 Ultra 192GB       | INT4 via MLX                   | ~25 tok/sec
Huawei Ascend | Ascend 950 supernode | w8a8                           | Varies by cluster

V4-Pro needs 16× H200 at minimum. Use the API unless you have a hard compliance or volume reason to self-host it.

Method 1: vLLM (recommended)

This is the production-grade path. vLLM 0.7.x ships with native DeepSeek V4 support.

Step 1: Install vLLM

pip install "vllm>=0.7.0"
# Or for the bleeding edge:
pip install git+https://github.com/vllm-project/vllm.git

Step 2: Download weights

huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash \
  --max-workers 8

Expect 90-400GB depending on quantization. Use --include "*-fp8/*" for the FP8 release if you only have ~140GB VRAM.
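The 90-400GB spread is just parameter count times bytes per parameter. A rough sketch of the arithmetic, assuming a hypothetical ~200B-total-parameter MoE (check the model card for the real size):

```python
def checkpoint_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough on-disk checkpoint size: params x bits per param, in GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Hypothetical ~200B total parameters across the three common precisions:
for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{label}: ~{checkpoint_size_gb(200, bits):.0f} GB")
```

At 200B parameters that works out to ~400GB for FP16, ~200GB for FP8, and ~100GB for INT4, which is why the FP8 release is the practical choice for a single 141GB H200.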

Step 3: Start the server

vllm serve ./deepseek-v4-flash \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --gpu-memory-utilization 0.92 \
  --quantization fp8 \
  --enable-chunked-prefill \
  --port 8000

Adjust --tensor-parallel-size to your GPU count.

Step 4: Test it

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Write a Python LRU cache."}]
  }'

vLLM exposes an OpenAI-compatible API. Most clients (LangChain, OpenAI SDK, Cursor with custom endpoint) work without changes.
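Because the endpoint is OpenAI-compatible, anything that can POST the standard chat-completions JSON works. A minimal standard-library sketch (the `ask` helper is illustrative; the model name must match the path you passed to `vllm serve`, and the base URL assumes the `--port 8000` setup above):

```python
import json
from urllib import request

def chat_payload(prompt: str, model: str = "./deepseek-v4-flash") -> bytes:
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def ask(prompt: str, base: str = "http://localhost:8000") -> str:
    """POST to the local vLLM server and return the first choice's text."""
    req = request.Request(
        f"{base}/v1/chat/completions",
        data=chat_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The official OpenAI SDK works the same way: point its `base_url` at `http://localhost:8000/v1` with any non-empty API key.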

Method 2: SGLang (best throughput)

SGLang often outperforms vLLM by 20-30% on DeepSeek MoE models. Worth trying if you’re at scale.

pip install "sglang[all]>=0.4.0"

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp 4 \
  --context-length 1048576 \
  --quantization fp8 \
  --port 30000

Method 3: Apple Silicon (MLX)

For M3 Ultra / M4 Max with ≥128GB unified memory.

pip install mlx-lm

# Convert weights to MLX format (one-time):
python -m mlx_lm.convert \
  --hf-path deepseek-ai/DeepSeek-V4-Flash \
  --mlx-path ./mlx-deepseek-v4-flash \
  -q --q-bits 4

# Run inference:
mlx_lm.server \
  --model ./mlx-deepseek-v4-flash \
  --port 8080

Expect ~25 tokens/sec on M3 Ultra 192GB at INT4. Not fast enough for production, but excellent for solo dev work.

Method 4: Huawei Ascend 950

Officially supported on launch day. China-friendly deployment path.

pip install "vllm-ascend>=0.7.0"

# Download weights from ModelScope (faster from China):
modelscope download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

# Set Ascend env vars:
export USE_MULTI_BLOCK_POOL=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1

# Start server:
vllm serve ./deepseek-v4-flash \
  --tensor-parallel-size 8 \
  --quantization w8a8 \
  --max-model-len 1048576

Use npu-smi info to verify Ascend cards are detected.

Method 5: Ollama (when GGUF lands)

GGUF quants for V4 Flash are not yet on Ollama as of April 25. Expect them within 1-2 weeks based on prior DeepSeek release patterns.

When available, the command will be:

ollama run deepseek-v4-flash

Watch ollama.com/library and unsloth’s HF page for GGUF releases.

Connecting to coding tools

Cursor (custom endpoint)

Settings → Models → Add custom OpenAI endpoint:

  • Base URL: http://localhost:8000/v1
  • Model name: deepseek-v4-flash

Claude Code via LiteLLM proxy

litellm --model openai/deepseek-v4-flash \
  --api_base http://localhost:8000/v1 \
  --port 4000

Then point Claude Code at http://localhost:4000.

Continue.dev (VS Code)

# config.yaml
models:
  - title: DeepSeek V4 Flash (local)
    provider: openai
    model: deepseek-v4-flash
    apiBase: http://localhost:8000/v1
    contextLength: 1048576

Performance tuning checklist

  1. Quantization: FP8 > INT8 > INT4 in quality. INT4 only if VRAM-constrained.
  2. Chunked prefill: Always enable for long contexts (--enable-chunked-prefill).
  3. Tensor parallel: Match TP size to GPU count. PCIe-only setups suffer; NVLink/NVSwitch helps a lot.
  4. Max model len: Don’t set 1M unless you actually use it — cuts KV cache budget.
  5. Speculative decoding: use V4-Flash as the draft model when serving V4-Pro for a ~2× speedup on Pro.
  6. Batching: vLLM’s continuous batching is on by default. Tune --max-num-seqs.
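Point 4 is worth quantifying. A back-of-envelope KV-cache estimator, using hypothetical GQA-style dimensions (60 layers, 8 KV heads, head dim 128, FP8 cache) rather than V4-Flash's real config, which likely uses MLA and caches considerably less:

```python
def kv_cache_gb(context_len: int, num_layers: int = 60, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    """Per-sequence KV cache: 2 (K and V) x layers x kv_heads x head_dim
    x context_len x bytes per element, in GB. Dimensions are illustrative."""
    return 2 * num_layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# One sequence at the full 1M window vs a 128K cap:
print(f"1M context:   ~{kv_cache_gb(1_048_576):.0f} GB per sequence")
print(f"128K context: ~{kv_cache_gb(131_072):.0f} GB per sequence")
```

Even with these made-up dimensions the shape of the problem is clear: the cache grows linearly with `--max-model-len`, so an 8× larger window means 8× less room for concurrent sequences.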

Costs at the break-even point

When does self-hosting beat the API?

  • V4-Flash API: $0.14 / $0.28 per million tokens
  • Single H200 box: ~$2.50/hour on RunPod, ~$1,800/month
  • Throughput: ~50 req/sec, roughly 3M tok/min assuming ~1,000 tokens per request, sustaining ~100B tokens/month

Break-even: roughly 5-10 billion tokens/month for V4-Flash. Below that, the API is strictly cheaper.
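The arithmetic behind that figure, assuming a 50/50 input/output token split at the listed prices (a hypothetical mix; tune `output_frac` to your workload):

```python
def api_cost(tokens_billion: float, in_price: float = 0.14,
             out_price: float = 0.28, output_frac: float = 0.5) -> float:
    """Monthly API cost in USD. Prices are per 1M tokens."""
    tokens_m = tokens_billion * 1000
    return tokens_m * ((1 - output_frac) * in_price + output_frac * out_price)

SELF_HOST_MONTHLY = 1800  # ~$2.50/hr H200 on RunPod x ~720 hrs

for b in (5, 10, 20):
    print(f"{b}B tok/mo: API ${api_cost(b):,.0f} vs self-host ${SELF_HOST_MONTHLY}")
```

At a $0.21/M blended rate, 5B tokens/month costs ~$1,050 on the API and 10B costs ~$2,100, so the crossover with a ~$1,800/month box lands between the two, consistent with the 5-10B range above.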

Common pitfalls

  • Insufficient VRAM — V4-Flash needs ≥80GB even at INT4. Don’t try on a single 4090.
  • Mixing CUDA versions — vLLM is picky. Use a fresh venv with CUDA 12.4+.
  • Ollama too early — wait for official GGUF.
  • Underestimating bandwidth — first download is 90-400GB. Plan for it.
  • Ignoring context len — defaulting to 1M context wastes KV cache budget. Set what you need.

Last verified: April 25, 2026. Sources: DeepSeek V4 model cards on Hugging Face, vLLM docs, vLLM-Ascend repo, MLX-LM docs, Ollama library tracking, RunPod pricing.