DeepSeek V4 on Huawei Ascend vs Nvidia H200 (April 2026)
DeepSeek V4 launched on April 24, 2026 with full first-day support for both Nvidia H200 and Huawei Ascend 950. This is the first time a major frontier-tier open model has shipped with first-class non-Nvidia inference. Here’s what it actually means.
Last verified: April 25, 2026
TL;DR
| | Nvidia H200 SXM5 | Huawei Ascend 950 |
|---|---|---|
| Memory | 141 GB HBM3e | 144 GB (supernode config) |
| FP16 throughput | ~67 TFLOPS | ~58 TFLOPS (est.) |
| V4 inference stack | vLLM, SGLang | vLLM-Ascend |
| V4-Pro support | ✅ Full (FP8) | ✅ Full (w8a8) |
| V4-Flash support | ✅ Full (FP8) | ✅ Full (w8a8) |
| Power consumption | ~700W | ~600W |
| Availability outside China | Yes (constrained) | Limited |
| Price per FLOP | Higher | ~40-60% lower in China |
The launch-day reality
When DeepSeek V4 went live on April 24, 2026, two things shipped simultaneously:
- Hugging Face weights for V4-Pro and V4-Flash, ready for vLLM serving on Nvidia.
- Huawei Ascend supernode endorsement — Huawei Technologies publicly announced same-day full support on Ascend 950-based clusters.
Reuters reported Huawei chips were used in part of V4-Flash’s training. Hours later, ModelScope hosted the w8a8-quantized weights specifically tuned for Ascend deployment.
This is unprecedented for a non-Nvidia AI chip in the frontier-model tier.
Inference performance
Nvidia H200 SXM5
- VRAM: 141 GB HBM3e
- Bandwidth: 4.8 TB/s
- V4-Flash on a single H200: ~50 req/sec at FP8, ~220 tokens/sec output per request
- V4-Pro: Needs a multi-H200 cluster (16× minimum)
- Stack: vLLM 0.7+, SGLang 0.4+
- Quantization: Native FP8
Huawei Ascend 950 (supernode)
- Memory per node: 144 GB
- Topology: Supernode optimized for MoE — high inter-chip bandwidth
- V4-Flash on Ascend 950 supernode: Comparable throughput to H200 (within 15-25% based on initial Huawei-published numbers)
- V4-Pro: Officially supported on multi-node Ascend 950 supernodes
- Stack: vLLM-Ascend with dedicated DeepSeek V4 launch scripts
- Quantization: w8a8 (8-bit weights, 8-bit activations)
The performance gap is real but smaller than it was for prior generations. For inference workloads in China, Ascend 950 is now a credible alternative.
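To make the w8a8 scheme concrete, here is a minimal sketch of symmetric per-tensor int8 quantization applied to both weights and activations, with the matmul accumulated in int32 and rescaled to float. This is an illustration of the general technique only; the actual Ascend kernels use their own (per-channel, fused) implementations.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: returns (int8 tensor, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# w8a8: both the weight matrix and the activations are int8; the
# matmul accumulates in int32, then is rescaled back to float.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # toy weight matrix
a = rng.standard_normal((1, 64)).astype(np.float32)   # toy activations

qW, sW = quantize_int8(W)
qa, sa = quantize_int8(a)
y_int8 = (qa.astype(np.int32) @ qW.astype(np.int32).T) * (sa * sW)
y_fp32 = a @ W.T  # full-precision reference

rel_err = np.abs(y_int8 - y_fp32).max() / np.abs(y_fp32).max()
print(f"max relative error: {rel_err:.3%}")
```

The relative error stays small because both operands are quantized against their own dynamic range; this is what lets w8a8 roughly match FP8 quality at the same bit width.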
Why DeepSeek’s Ascend support matters
1. Sovereignty and supply chain
US export controls on H100/H200 to China have tightened repeatedly since 2022. Ascend 950 gives Chinese AI teams a viable supply chain that doesn’t depend on US chip access.
2. Cost structure
In China, Ascend 950 is dramatically cheaper than smuggled or downgraded H200s. Ascend supernode pricing for cloud providers is reportedly 40-60% below H200 equivalents on a per-FLOP basis. That cost advantage is partly why V4 can be priced at $1.74/$3.48.
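A back-of-envelope sketch shows how a per-FLOP hardware discount flows through to serving economics. All dollar figures and rates below are illustrative assumptions, not published prices; the only anchor from this article is the ~50% midpoint of the reported 40-60% discount and the single-chip throughput figures.

```python
def cost_per_million_output_tokens(hw_cost_per_hour: float,
                                   tokens_per_second: float) -> float:
    """Hardware cost attributed to one million output tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hw_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical cloud rates: an H200 at $4.00/hr vs. an Ascend 950
# priced ~50% lower per FLOP, with the throughputs cited above.
h200 = cost_per_million_output_tokens(hw_cost_per_hour=4.00, tokens_per_second=220)
ascend = cost_per_million_output_tokens(hw_cost_per_hour=2.00, tokens_per_second=185)

print(f"H200:   ${h200:.2f} / 1M output tokens")
print(f"Ascend: ${ascend:.2f} / 1M output tokens")
```

Even with the throughput gap, the cheaper hardware wins on cost per token under these assumptions, which is the lever behind aggressive API pricing.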
3. Training optimization
DeepSeek V4-Flash was partly trained on Ascend 950 hardware. This is a major signal that Chinese-domestic training is now viable for frontier-tier models. Earlier Chinese frontier models (V3, R1) were Nvidia-trained.
4. Optionality for everyone
Even US/EU teams benefit. The lower cost basis in China lets DeepSeek price aggressively on the open market. OpenRouter, Together, and Fireworks all serve V4 from US/EU infrastructure at prices that wouldn’t be possible without DeepSeek’s underlying cost advantage.
What you actually deploy on
If you’re in China
Ascend 950 supernodes are the cost-optimal choice. Major Chinese cloud providers (Alibaba, Tencent, Huawei Cloud) all offer V4 inference on Ascend.
```shell
# vLLM-Ascend example
export USE_MULTI_BLOCK_POOL=1
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
vllm serve ./deepseek-v4-flash-w8a8 \
  --tensor-parallel-size 8 \
  --quantization w8a8 \
  --max-model-len 1048576
```
If you’re in the US/EU
Nvidia H200 remains the default inference chip. vLLM with FP8 is the production stack.
```shell
# vLLM on H200 (FP8)
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 1048576 \
  --enable-chunked-prefill
```
For most US/EU teams: don’t self-host at all. Use OpenRouter or Together AI’s V4 endpoints. They handle the H200 fleet for you.
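If you go the hosted-endpoint route, OpenRouter exposes an OpenAI-compatible chat completions API. A minimal stdlib-only sketch is below; the model slug `deepseek/deepseek-v4-flash` is an assumption for illustration, so check the provider's model list for the real ID before use.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "deepseek/deepseek-v4-flash"  # hypothetical slug -- verify before use

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Builds an OpenAI-compatible chat request; send it with urlopen()."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        OPENROUTER_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Summarize chunked prefill in one sentence.", "sk-...")
print(req.full_url)
```

Because the endpoint is OpenAI-compatible, the official `openai` client also works by pointing `base_url` at OpenRouter.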
Performance comparison: real numbers
DeepSeek and Huawei published partial benchmarks on launch. Independent verification is still rolling in, but early numbers suggest:
| Workload | H200 (FP8) | Ascend 950 (w8a8) | H200 advantage |
|---|---|---|---|
| V4-Flash, 8K context, single-stream | 220 tok/s | 185 tok/s | +19% |
| V4-Flash, 8K context, batched throughput | 50 req/s | 42 req/s | +19% |
| V4-Pro, 32K context, single-stream | 110 tok/s | 88 tok/s | +25% |
| Long-context (500K), single-stream | 45 tok/s | 38 tok/s | +18% |
The H200 advantage is real but not the 2-3× of prior generations. For most production workloads, the gap is engineering-noise level — and the price difference more than makes up for it in China.
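The advantage percentages follow directly from the raw throughput numbers in the table; a quick recomputation:

```python
# Recompute the H200 advantage from the published throughput pairs
# (H200 figure, Ascend 950 figure).
pairs = {
    "V4-Flash single-stream": (220, 185),
    "V4-Flash batched": (50, 42),
    "V4-Pro single-stream": (110, 88),
    "Long-context single-stream": (45, 38),
}
gaps = {name: (h200 / ascend - 1) * 100 for name, (h200, ascend) in pairs.items()}
for name, gap in gaps.items():
    print(f"{name}: H200 +{gap:.0f}%")
```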
What this means for the AI hardware market
A few honest predictions for the rest of 2026:
- Nvidia retains the training crown for at least another generation. Multi-node FP8 training on H200/B200 is still dominant.
- Ascend takes inference share in China. By end of Q3 2026, expect Ascend to handle the majority of LLM inference inside China, not because it's better, but because it's available, cheap, and now officially supported by frontier models.
- Other Chinese chips follow. Cambricon, Biren, and Moore Threads are all racing to get DeepSeek V4 inference certified.
- US/EU teams care indirectly. You won't run on Ascend, but the model you serve via OpenRouter or self-host on H200 was made possible by China's ability to keep training cheaply on Ascend.
Bottom line
The DeepSeek V4 launch isn’t just a model release — it’s a signal that the AI compute monopoly is fragmenting. For the first time, a frontier-tier model ships with first-class non-Nvidia inference support. Whether you serve V4 on H200 in Virginia or Ascend 950 in Shanghai, the result is the same: near-Opus-4.7 quality at one-seventh the price.
For deployment decisions in late April 2026:
- In China: Ascend 950 supernodes
- In the US/EU: OpenRouter / Together / Fireworks (H200 fleet under the hood)
- At scale (>10B tokens/month): Self-hosted H200 cluster
The hardware doesn’t actually matter much for app developers. What matters is that the underlying cost curve just got steeper — in users’ favor.
Last verified: April 25, 2026. Sources: Reuters reporting on Huawei Ascend supernode V4 support, DeepSeek V4 release notes, Hugging Face deepseek-ai/DeepSeek-V4-Pro model card, vLLM-Ascend project documentation, Fortune coverage of DeepSeek-Huawei integration.