DeepSeek V4-Pro Self-Hosted vs API (April 2026): Cost & Tradeoffs

You can run V4-Pro three ways: DeepSeek’s API, a US-hosted provider (Together / Fireworks / OpenRouter), or self-host on your own GPUs. Here’s how to pick as of late April 2026.

Last verified: April 28, 2026

TL;DR

| Path | Best for | $/M output | Latency (US) | Setup |
|---|---|---|---|---|
| DeepSeek API direct | Asian latency, lowest cost | $3.48 | 300-500ms | 5 min |
| OpenRouter | Multi-model routing | ~$3.80 | 200-400ms | 5 min |
| Together AI | US-hosted, compliance | ~$4.00 | 150-300ms | 15 min |
| Fireworks | US-hosted, fast | ~$4.20 | 150-300ms | 15 min |
| Self-host (8x H200) | High volume, sovereignty | ~$2.50 (at 60% util) | 100-200ms | days |

Quick rule: Stay on API until you’re spending >$15K/month on V4-Pro inference. Then evaluate self-hosting.

Pricing breakdown (April 28, 2026)

DeepSeek API direct (Hangzhou):

  • $1.74/M input
  • $3.48/M output
  • ~$0.0036/M cached input

OpenRouter (passthrough + ~10% markup):

  • $1.91/M input
  • $3.83/M output
  • Cached input: limited

Together AI (US-hosted):

  • ~$2.00/M input
  • ~$4.00/M output
  • Cached input: $0.20/M

Fireworks (US-hosted, optimized):

  • ~$2.10/M input
  • ~$4.20/M output
  • Faster TTFT than Together
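To compare these list prices for a given workload, a quick sketch (the per-million-token prices are hardcoded from the figures above; treat them as point-in-time, not authoritative):

```python
# Estimate monthly spend per provider for a given workload.
# Prices ($ per million tokens) are the April 2026 list prices quoted above.
PRICES = {
    "deepseek_direct": {"input": 1.74, "output": 3.48},
    "openrouter":      {"input": 1.91, "output": 3.83},
    "together":        {"input": 2.00, "output": 4.00},
    "fireworks":       {"input": 2.10, "output": 4.20},
}

def monthly_cost(provider: str, input_mtok: float, output_mtok: float) -> float:
    """Dollars per month for input_mtok / output_mtok (millions of tokens)."""
    p = PRICES[provider]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Example: 100M input + 30M output tokens per month.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 100, 30):,.2f}")
```

The input/output split matters: agentic workloads are input-heavy, so the gap between providers is smaller than the output prices alone suggest.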

Self-hosted on 8x H200 (rented cloud):

  • Hardware cost: ~$24/hour ($17,500/month) for 8x H200 on Lambda Labs / Coreweave at 1-year reserved.
  • Throughput: ~5K tokens/sec aggregate across the node at full load; at 60% utilization that’s ~11M tokens/hour
  • Effective per-token cost: ~$2.50/M output at 60% utilization, ~$1.20/M at 90% utilization.
  • Setup: 1-2 days to wire up vLLM + monitoring; ongoing ops not free.
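Whether self-hosting beats the API comes down almost entirely to utilization. A rough break-even sketch using the numbers above ($24/hour rental, ~5K tokens/sec peak, API output at $3.48/M; all point-in-time estimates from this section):

```python
# Rough break-even: rented 8x H200 node vs DeepSeek API, output tokens only.
# Assumptions (from the figures above): $24/hr node rental, ~5K tokens/sec
# aggregate peak throughput, API output price of $3.48/M tokens.
NODE_USD_PER_HOUR = 24.0
PEAK_TOKENS_PER_SEC = 5_000
API_USD_PER_MTOK = 3.48

def self_host_usd_per_mtok(utilization: float) -> float:
    """Effective $/M tokens on the rented node at a given utilization (0-1)."""
    tokens_per_hour = PEAK_TOKENS_PER_SEC * 3600 * utilization
    return NODE_USD_PER_HOUR / (tokens_per_hour / 1e6)

for util in (0.3, 0.6, 0.9):
    cost = self_host_usd_per_mtok(util)
    verdict = "cheaper than API" if cost < API_USD_PER_MTOK else "API wins"
    print(f"{util:.0%} utilization: ${cost:.2f}/M ({verdict})")
```

The shape of the curve is the point: cost per token is inversely proportional to utilization, which is why spiky traffic (see below) kills the economics.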

When to self-host

Self-host if any of these are true:

  1. >200M tokens/day of V4-Pro inference, sustained.
  2. Data sovereignty is non-negotiable (US gov, EU GDPR-strict, healthcare).
  3. Sub-100ms latency is required (latency-sensitive products).
  4. Existing GPU fleet with spare capacity.
  5. Custom fine-tuning required (V4-Pro QLoRA fine-tunes are practical).

Don’t self-host if:

  1. <50M tokens/day — API is cheaper after ops overhead.
  2. You don’t have a platform team — running vLLM at 60%+ utilization in production is real work.
  3. You need uptime SLAs you don’t want to own — DeepSeek/Together/Fireworks all have SLAs; you’d own yours.
  4. You’re spiky — APIs autoscale, your cluster doesn’t.

Hardware options

NVIDIA path (most common)

  • 8x H200 (141GB) — full precision V4-Pro, plenty of headroom. Best balance of cost and performance for self-hosting today.
  • 8x B200 (192GB) — overkill for V4-Pro alone, but future-proof for V4-Reasoning and beyond. Expect 30-40% higher throughput.
  • 8x H100 (80GB) — works with FP8 quantization + tensor parallelism. Most cost-effective if you already have H100s.

Huawei Ascend path

  • 8x Ascend 910C — DeepSeek published Ascend-optimized weights and a vLLM-Ascend port. Comparable throughput to H200. The catch: Ascend supply is China-only and requires Huawei’s CANN stack.

Apple Silicon

  • Mac Studio M3 Ultra cluster (4-8 nodes) — V4-Pro Q4 quantization fits via MLX-distributed. Useful for dev, not production.

Latency in practice

We measured TTFT (time to first token) for the same prompt across providers from a US East datacenter:

| Provider | P50 TTFT | P95 TTFT | Tokens/sec |
|---|---|---|---|
| DeepSeek API (direct) | 387ms | 720ms | 78 |
| OpenRouter | 312ms | 580ms | 75 |
| Together AI | 198ms | 340ms | 92 |
| Fireworks | 184ms | 305ms | 105 |
| Self-host (8x H200) | 124ms | 210ms | 145 |

For latency-sensitive products (chatbots, agents with frequent short calls), self-hosting or Fireworks/Together wins. For batch / RAG / async, DeepSeek direct is fine.
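Reproducing these measurements against any OpenAI-compatible streaming endpoint is a few lines. A minimal sketch — the timing helper is generic; the endpoint and model names in the comment are illustrative, not prescriptive:

```python
import time
from typing import Iterable, Tuple

def measure_ttft(stream: Iterable[str]) -> Tuple[float, int]:
    """Consume a token stream; return (seconds to first token, total tokens)."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _token in stream:
        if ttft is None:
            ttft = time.monotonic() - start
        count += 1
    return (ttft if ttft is not None else float("inf"), count)

# In practice the stream comes from a streaming chat-completions call,
# e.g. (endpoint and model name are placeholders):
#   client = openai.OpenAI(base_url="https://api.together.xyz/v1", api_key=KEY)
#   stream = (c.choices[0].delta.content or ""
#             for c in client.chat.completions.create(
#                 model="deepseek-ai/DeepSeek-V4-Pro", stream=True,
#                 messages=[{"role": "user", "content": "ping"}]))
#   ttft, n = measure_ttft(stream)
```

Run it from the same region as your users; TTFT from a laptop on home Wi-Fi tells you little.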

Compliance and data sovereignty

DeepSeek is a Chinese AI lab. Their API is hosted in China. For some workloads this is a non-starter:

  • US Government: FedRAMP requires US-hosted. Self-host or Together/Fireworks.
  • EU GDPR-strict: Data must stay in the EU. Self-host in the EU or use an EU-hosted provider.
  • Healthcare (HIPAA): You need a BAA. Together has one available; Fireworks offers one on enterprise plans.
  • Financial services: Often want self-host or US-hosted with audit logs.

The MIT-licensed V4-Pro weights are clean — you can verify the Hugging Face checksum, scan for adversarial behavior, and run them in your own VPC. Several major US enterprises (covered in Reuters, April 2026) are doing exactly this.
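Verifying a downloaded checkpoint is a few lines of stdlib Python (the expected digests would come from the model card or your own pinned manifest; none are shown here):

```python
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dir(model_dir: str, expected: dict[str, str]) -> bool:
    """Check every file in `expected` ({filename: sha256 hex}) under model_dir."""
    return all(sha256_of(str(Path(model_dir) / name)) == digest
               for name, digest in expected.items())
```

Pin the digests in your deploy repo so a re-download or mirror swap can never silently change the weights you serve.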

Self-hosting reference setup (8x H200, vLLM)

# Pull V4-Pro weights
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro --local-dir ./v4-pro

# Run vLLM with tensor parallelism = 8
docker run --gpus all --ipc=host -p 8000:8000 \
  -v "$(pwd)/v4-pro:/model" \
  vllm/vllm-openai:v0.7.0 \
  --model /model \
  --tensor-parallel-size 8 \
  --max-model-len 1048576 \
  --enable-prefix-caching \
  --quantization fp8 \
  --kv-cache-dtype fp8

# Front with a load balancer (Caddy, Nginx, Cloudflare)
# Hook up Prometheus + Grafana for observability
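Once the container is serving, it speaks the OpenAI-compatible chat API on port 8000. A stdlib-only smoke test (the model path matches the --model flag above):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat_request(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST one completion to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Smoke test once the container is up:
# print(chat_request("http://localhost:8000", "/model", "Say OK."))
```

Because the surface is OpenAI-compatible, the same client code works unchanged against Together, Fireworks, or OpenRouter by swapping `base_url`.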

Cost at 60% utilization on 8x H200 cloud rental: ~$2.50/M output. Below DeepSeek’s API for high volume, above it for low volume.

OpenRouter / multi-model strategy

Most pragmatic teams in April 2026 are not picking just one path. Common pattern:

  • Default V4-Pro traffic: Together AI (US-hosted, fast, BAA available).
  • Spillover / experimentation: OpenRouter (any provider, easy switching).
  • Hardest tasks (escalation): Claude Opus 4.7 or GPT-5.5 via OpenRouter.
  • Budget bulk traffic: V4-Flash via DeepSeek direct or Fireworks.

This costs slightly more than picking one provider, but eliminates lock-in and survives any single-provider outage.
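The pattern above is mostly a dispatch table plus failover. A control-flow sketch (provider names are labels standing in for real API clients, not actual SDK objects):

```python
# Tiered routing with failover: try the tier's default provider, spill to the next.
# Provider entries are callables; in production they'd wrap real API clients.
from typing import Callable

ROUTES: dict[str, list[str]] = {
    "default":    ["together", "openrouter"],        # main V4-Pro traffic
    "escalation": ["openrouter"],                    # hardest tasks, frontier models
    "bulk":       ["deepseek_direct", "fireworks"],  # budget batch traffic
}

def route(task: str, providers: dict[str, Callable[[str], str]], prompt: str) -> str:
    """Try each provider for the task tier in order; raise only if all fail."""
    errors = []
    for name in ROUTES[task]:
        try:
            return providers[name](prompt)
        except Exception as e:  # any provider failure spills to the next in line
            errors.append(f"{name}: {e}")
    raise RuntimeError(f"all providers failed for {task!r}: {errors}")
```

The same table doubles as your outage runbook: reorder a tier's list and traffic moves, with no code changes elsewhere.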

What about fine-tuning?

V4-Pro is fine-tunable via QLoRA. ~24-48 hours on 8x H100 for a domain adaptation. DeepSeek doesn’t offer hosted fine-tuning (yet); Together AI does for V4-Pro (announced April 26, $50/run base + token cost).

Self-hosting is required if you want full-parameter fine-tunes (rare; QLoRA is usually enough).

Final recommendation

| You are | Use |
|---|---|
| Solo dev / startup, <10M tokens/day | DeepSeek API direct or OpenRouter |
| US startup, want compliance, 10-50M/day | Together AI |
| Latency-sensitive product, 50-200M/day | Fireworks or Together |
| US enterprise with platform team, >200M/day | Self-host (8x H200) |
| Regulated industry (healthcare, gov) | Self-host or Together (with BAA) |
| EU sovereignty | Self-host EU or EU provider |
| China / HK / Asia native | DeepSeek API direct |

For most teams: stay on API. The crossover is higher than you think.


Sources: DeepSeek V4 release notes, vLLM 0.7.0 docs, Together AI pricing, Fireworks pricing, OpenRouter pricing, Lambda Labs / Coreweave H200 pricing (April 2026).