How to Run DeepSeek V4 Flash Locally (Step-by-Step, 2026)
DeepSeek V4 Flash launched yesterday (April 24, 2026) — the cheapest 1M-context open-weight model ever released. Here’s how to run it on your own hardware as of April 25, 2026.
Last verified: April 25, 2026
Why run it locally?
- Privacy — no data leaves your network
- Cost at scale — past ~10B tokens/month, self-hosting beats API
- Latency — sub-100ms TTFT vs 200-400ms via API
- Tinkering — fine-tuning, custom adapters, modified inference
For most developers, the API is still cheaper. V4-Flash is $0.14/$0.28 per million tokens. Self-hosting only makes sense at high volume or for compliance reasons.
Hardware requirements
| Setup | Hardware | Quantization | Throughput |
|---|---|---|---|
| Minimum dev | 4× RTX 5090 (96GB) | INT4 (GPTQ/AWQ when available) | ~30 req/sec |
| Solid prod | 1× H200 SXM5 (141GB) | INT8 (FP8) | ~50 req/sec |
| Best perf | 8× A100 80GB | FP16 | ~80 req/sec |
| Apple Silicon | M3 Ultra 192GB | INT4 via MLX | ~25 tok/sec |
| Huawei Ascend | Ascend 950 supernode | w8a8 | Varies by cluster |
V4-Pro needs 16× H200 minimum. Use the API unless you have a compelling reason not to.
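The table above can be sanity-checked with a quick weights-only estimate (KV cache and activations come on top). The 200B parameter count below is a placeholder, not an official DeepSeek figure; plug in the total from the model card:

```python
def weight_vram_gb(total_params_b: float, bits_per_param: int) -> float:
    """Approximate GB needed just to hold the weights at a given precision."""
    bytes_total = total_params_b * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB, matching how GPU specs are quoted

# Hypothetical 200B-parameter checkpoint (illustrative, NOT the real figure):
for bits, name in [(16, "FP16"), (8, "FP8/INT8"), (4, "INT4")]:
    print(f"{name}: ~{weight_vram_gb(200, bits):.0f} GB")
```

Under that assumption the spread is ~100GB at INT4 up to ~400GB at FP16, which lines up with the 90-400GB download sizes mentioned below.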
Method 1: vLLM on Nvidia (recommended)
This is the production-grade path. vLLM 0.7.x ships with native DeepSeek V4 support.
Step 1: Install vLLM
pip install "vllm>=0.7.0"
# Or for the bleeding edge:
pip install git+https://github.com/vllm-project/vllm.git
Step 2: Download weights
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./deepseek-v4-flash \
--max-workers 8
Expect 90-400GB depending on quantization. Use --include "*-fp8/*" for the FP8 release if you only have ~140GB VRAM.
Step 3: Start the server
vllm serve ./deepseek-v4-flash \
--tensor-parallel-size 4 \
--max-model-len 1048576 \
--gpu-memory-utilization 0.92 \
--quantization fp8 \
--enable-chunked-prefill \
--port 8000
Adjust --tensor-parallel-size to your GPU count.
Step 4: Test it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./deepseek-v4-flash",
"messages": [{"role": "user", "content": "Write a Python LRU cache."}]
}'
vLLM exposes an OpenAI-compatible API. Most clients (LangChain, OpenAI SDK, Cursor with custom endpoint) work without changes.
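For smoke tests beyond curl, a stdlib-only client is enough. This sketch assumes the serve command above (port 8000, model path ./deepseek-v4-flash); adjust both if your setup differs:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Assemble a /v1/chat/completions payload (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, payload: dict) -> str:
    """POST the payload and return the first choice's message text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_chat_request("./deepseek-v4-flash", "Write a Python LRU cache.")
# chat("http://localhost:8000", payload)  # uncomment with the server running
```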
Method 2: SGLang (best throughput)
SGLang often outperforms vLLM by 20-30% on DeepSeek MoE models. Worth trying if you’re at scale.
pip install "sglang[all]>=0.4.0"
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V4-Flash \
--tp 4 \
--context-length 1048576 \
--quantization fp8 \
--port 30000
Method 3: Apple Silicon (MLX)
For M3 Ultra / M4 Max with ≥128GB unified memory.
pip install mlx-lm
# Convert weights to MLX format (one-time):
python -m mlx_lm.convert \
--hf-path deepseek-ai/DeepSeek-V4-Flash \
--mlx-path ./mlx-deepseek-v4-flash \
-q --q-bits 4
# Run inference:
mlx_lm.server \
--model ./mlx-deepseek-v4-flash \
--port 8080
Expect ~25 tokens/sec on M3 Ultra 192GB at INT4. Not fast enough for production, but excellent for solo dev work.
Method 4: Huawei Ascend 950
Officially supported on launch day. China-friendly deployment path.
pip install vllm-ascend>=0.7.0
# Download weights from ModelScope (faster from China):
modelscope download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./deepseek-v4-flash
# Set Ascend env vars:
export USE_MULTI_BLOCK_POOL=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
# Start server:
vllm serve ./deepseek-v4-flash \
--tensor-parallel-size 8 \
--quantization w8a8 \
--max-model-len 1048576
Use npu-smi info to verify Ascend cards are detected.
Method 5: Ollama (when GGUF lands)
GGUF quants for V4 Flash are not yet on Ollama as of April 25. Expect them within 1-2 weeks based on prior DeepSeek release patterns.
When available, the command will be:
ollama run deepseek-v4-flash
Watch ollama.com/library and unsloth’s HF page for GGUF releases.
Connecting to coding tools
Cursor (custom endpoint)
Settings → Models → Add custom OpenAI endpoint:
- Base URL: http://localhost:8000/v1
- Model name: deepseek-v4-flash
Claude Code via LiteLLM proxy
litellm --model openai/deepseek-v4-flash \
--api_base http://localhost:8000/v1 \
--port 4000
Then point Claude Code at http://localhost:4000.
Continue.dev (VS Code)
# config.yaml
models:
  - title: DeepSeek V4 Flash (local)
    provider: openai
    model: deepseek-v4-flash
    apiBase: http://localhost:8000/v1
    contextLength: 1048576
Performance tuning checklist
- Quantization: FP8 > INT8 > INT4 in quality. INT4 only if VRAM-constrained.
- Chunked prefill: Always enable for long contexts (--enable-chunked-prefill).
- Tensor parallel: Match TP size to GPU count. PCIe-only setups suffer; NVLink/NVSwitch helps a lot.
- Max model len: Don’t set 1M unless you actually use it — cuts KV cache budget.
- Speculative decoding: Use V4-Flash as the draft model for V4-Pro for a ~2× speedup on Pro.
- Batching: vLLM’s continuous batching is on by default. Tune --max-num-seqs.
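The max-model-len point can be made concrete with back-of-envelope KV-cache arithmetic. The layer and head dimensions below are hypothetical, and the formula assumes a plain multi-head-attention cache; DeepSeek's MLA attention compresses KV well below this, so treat it as an upper bound on why 1M-token limits are expensive:

```python
def kv_cache_gb(ctx_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 1) -> float:
    """KV cache for ONE sequence: 2 (K and V) * layers * heads * dim * len."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical 60-layer model, 8 KV heads of dim 128, FP8 (1-byte) cache:
print(f"{kv_cache_gb(1_048_576, 60, 8, 128):.1f} GB per 1M-token sequence")
print(f"{kv_cache_gb(131_072, 60, 8, 128):.1f} GB per 128K-token sequence")
```

Even under these made-up dimensions, dropping the limit from 1M to 128K frees roughly 8× the per-sequence cache budget, which translates directly into more concurrent sequences.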
Costs at the break-even point
When does self-hosting beat the API?
- V4-Flash API: $0.14 / $0.28 per million tokens
- Single H200 box: ~$2.50/hour on RunPod, ~$1,800/month
- Throughput: ~50 req/sec, roughly 2-3M tokens/minute sustained, ~100B tokens/month
Break-even: roughly 5-10 billion tokens/month for V4-Flash. Below that, the API is strictly cheaper.
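That figure follows from simple arithmetic using the prices quoted above; the 50/50 input/output token split is an assumption, so shift it to match your workload:

```python
API_IN, API_OUT = 0.14, 0.28    # $/1M tokens, V4-Flash API pricing above
SELF_HOST_MONTHLY = 1800.0      # ~$2.50/hr H200 on RunPod, quoted above

def api_cost(tokens_b: float, out_frac: float = 0.5) -> float:
    """Monthly API bill in dollars for tokens_b billion tokens."""
    millions = tokens_b * 1000
    return millions * ((1 - out_frac) * API_IN + out_frac * API_OUT)

for tokens_b in (1, 5, 10, 50):
    marker = "<- self-host wins" if api_cost(tokens_b) > SELF_HOST_MONTHLY else ""
    print(f"{tokens_b}B tok/mo: API ${api_cost(tokens_b):,.0f} "
          f"vs self-host ${SELF_HOST_MONTHLY:,.0f} {marker}")
```

At a 50/50 split the crossover sits between 5B and 10B tokens/month, consistent with the range above; output-heavy workloads cross over sooner.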
Common pitfalls
- ❌ Insufficient VRAM — V4-Flash needs ≥80GB even at INT4. Don’t try on a single 4090.
- ❌ Mixing CUDA versions — vLLM is picky. Use a fresh venv with CUDA 12.4+.
- ❌ Ollama too early — wait for official GGUF.
- ❌ Underestimating bandwidth — first download is 90-400GB. Plan for it.
- ❌ Ignoring context len — defaulting to 1M context wastes KV cache budget. Set what you need.
Last verified: April 25, 2026. Sources: DeepSeek V4 model cards on Hugging Face, vLLM docs, vLLM-Ascend repo, MLX-LM docs, Ollama library tracking, RunPod pricing.