GLM-5.1 vs Qwen 3.6 Plus: Self-Hosted Coding (May 2026)
GLM-5.1 (Z.ai / Zhipu AI) and Qwen 3.6 Plus (Alibaba) are the two best open-weights coding models for self-hosted enterprise deployment in May 2026. GLM-5.1 wins on raw coding capability (58.4% SWE-Bench Pro) but needs 754B parameters worth of hardware. Qwen 3.6 Plus is smaller, multilingual-strong, and licensed under Apache 2.0 — better for cost-conscious or multilingual-heavy deployments. Here’s how to choose.
Last verified: May 5, 2026
Side-by-side comparison
| Dimension | GLM-5.1 | Qwen 3.6 Plus |
|---|---|---|
| Released | April 7, 2026 | Q1 2026 |
| Parameters | 754B MoE | 35B-A3B + Plus variant |
| Active params per forward | ~32B | ~3B (35B-A3B) |
| License | MIT | Apache 2.0 |
| SWE-Bench Pro | 58.4% | ~57% |
| GDPval-AA | 1535 | ~1450 (estimate) |
| Context window | 256K | 256K (extended variants to 1M) |
| Multilingual | Strong | Best (CJK) |
| Min self-host hardware | 8× H200 | 4× H100 / 2× H200 |
| Hourly self-host cost | ~$16 | ~$8 |
| Hosted API pricing ($/1M output) | ~$1.20 | ~$1.00 |
Sources: BenchLM Chinese leaderboard (April 2026), Atlas Cloud comparison (April 2026), Z.ai and Alibaba Cloud documentation, artificialanalysis.ai (May 2026).
Why pick GLM-5.1
1. Top-3 open-weights coding capability. SWE-Bench Pro 58.4% places GLM-5.1 within rounding distance of Kimi K2.6 (58.6%) and DeepSeek V4 Pro (~58%). On GDPval-AA (1535), GLM-5.1 is among the top three open-weights models for agentic real-world performance.
2. Pure MIT license. GLM-5.1 is one of the few frontier-class open-weights models under pure MIT. No restrictions on training competing frontier models, no commercial-use thresholds, no derivative-work clauses. Maximum legal clarity.
3. NVIDIA-independent training. GLM-5.1 was trained on 100,000 Huawei Ascend chips, demonstrating that frontier open-weights models can be built without NVIDIA dependency. Strategically important for sovereign deployment in regions facing export controls.
4. Strong on full-repo coding. The 754B parameter count (with ~32B active per forward pass) handles whole-repository context better than smaller models. For enterprise codebases, the larger model often makes meaningfully better architectural decisions.
Why pick Qwen 3.6 Plus
1. Cheaper to self-host (2-3x). With ~3B active parameters per forward pass (35B-A3B) and the Plus variant adding capacity selectively, Qwen 3.6 Plus runs on roughly half the hardware of GLM-5.1 at comparable throughput.
2. Apache 2.0 license. The most enterprise-recognized open-source license. Includes explicit patent grants. Most enterprise legal teams approve Apache 2.0 with minimal review.
3. Best multilingual coding. Qwen 3.6 Plus is the strongest open-weights model for Chinese / Japanese / Korean code comments, documentation, and mixed-language codebases. For teams with significant CJK code or documentation, this is decisive.
4. Mature tool-use behavior. Qwen 3.6 Plus has been refined specifically for agent loops with strong tool-use reliability. The 35B-A3B variant in particular has been tuned heavily for agentic coding workflows.
5. Stronger Alibaba ecosystem. If you’re already in Alibaba Cloud or using the Qwen ecosystem, integration is simpler.
Where they tie
- Context window. Both default to 256K, both have extended-context variants pushing to 1M.
- License clarity. Both are highly permissive (MIT vs Apache 2.0), and both clear enterprise legal review easily.
- SWE-Bench Pro performance. Within a percentage point of each other.
- Hosted API pricing. ~$1.00-$1.20 per 1M output tokens.
- Community support. Both have active communities and major hosted-API provider support.
Hardware deep-dive
For self-hosting in production:
GLM-5.1 (754B MoE):
- INT8 inference: 8× H200 (~1.1TB HBM3e total). Throughput: 30-60 tokens/sec per stream, 8-16 concurrent.
- INT4 inference: 4× H200 (~565GB HBM3e). Throughput: 50-100 tokens/sec per stream, 16-24 concurrent.
- Cost: ~$8/hour (4× H200) to ~$16/hour (8× H200) on cloud spot. ~$5-10/hour on long-term reserved.
- Software: vLLM, SGLang, and TensorRT-LLM all support GLM-5.1 with various quantization options.
Qwen 3.6 Plus (35B-A3B):
- INT8 inference: 2× H200 or 4× H100 (~280-320GB HBM total). Throughput: 80-150 tokens/sec per stream, 16-32 concurrent (the 3B active params make this fast).
- INT4 inference: 1× H200 or 2× H100. Throughput: 100-200 tokens/sec.
- Cost: ~$4-8/hour on cloud. The smaller hardware footprint compounds favorably.
- Software: vLLM and SGLang both have well-tuned Qwen 3.6 support.
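Both models are typically served through the same stack. Below is a minimal vLLM sketch of an offline deployment check, not an official recipe: the checkpoint name, context cap, and parallelism degree are assumptions standing in for whatever Z.ai and Alibaba actually publish. For Qwen 3.6 Plus the same pattern applies with tensor_parallel_size=2 on 2× H200.

```python
# Minimal vLLM sketch for serving a self-hosted coding model.
# The checkpoint name below is hypothetical; vLLM picks up supported
# quantization schemes (e.g. INT8 compressed-tensors, AWQ, GPTQ) from the
# checkpoint's own config, so pre-quantized builds need no extra flag.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1-INT8",   # hypothetical pre-quantized checkpoint
    tensor_parallel_size=8,          # one shard per GPU on an 8x H200 node
    max_model_len=131072,            # cap context so the KV cache fits in HBM
    gpu_memory_utilization=0.92,     # leave headroom for activations
)

params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(
    ["Refactor this function to remove the global state:\n..."],
    params,
)
print(outputs[0].outputs[0].text)
```

For production traffic you would normally run the equivalent OpenAI-compatible server (vllm serve <model> --tensor-parallel-size 8) behind a load balancer rather than the offline API.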
Break-even vs hosted APIs (approximate monthly volume above which self-hosting becomes cheaper):
- GLM-5.1: ~50B tokens/month.
- Qwen 3.6 Plus: ~30B tokens/month.
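Treat these as planning figures: they presumably fold in more than raw GPU rental (redundancy, ops staffing, and a blended input/output price mix). A minimal sketch of the underlying cost comparison, using the illustrative hourly and per-token figures quoted above:

```python
# Back-of-the-envelope monthly cost comparison: self-hosting vs a hosted API.
# Inputs are the illustrative figures from this article; real break-even
# depends on blended input/output pricing, cluster redundancy, utilization,
# and the engineering time spent operating the stack.

HOURS_PER_MONTH = 730

def monthly_costs(tokens_per_month: float,
                  cluster_cost_per_hour: float,
                  api_price_per_million_tokens: float) -> tuple[float, float]:
    """Return (self_hosted_usd, hosted_api_usd) for a given monthly volume."""
    self_hosted = cluster_cost_per_hour * HOURS_PER_MONTH
    hosted = tokens_per_month / 1_000_000 * api_price_per_million_tokens
    return self_hosted, hosted

# GLM-5.1 on 8x H200 (~$16/hr) vs ~$1.20 per 1M output tokens
print(monthly_costs(50e9, 16.0, 1.20))
# Qwen 3.6 Plus on 2x H200 (~$8/hr) vs ~$1.00 per 1M output tokens
print(monthly_costs(30e9, 8.0, 1.00))
```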
Use case mapping
Pick GLM-5.1 if any of these apply:
- You’re doing whole-repo refactors or making architectural decisions where the 754B parameter count helps.
- You need maximum capability per benchmark point (58.4% SWE-Bench Pro).
- You have NVIDIA-export-control concerns and want a model proven to train on non-NVIDIA hardware.
- Your legal team prefers MIT to Apache 2.0 (rare but happens).
- You’re already running GLM-4 / GLM-5 in production.
Pick Qwen 3.6 Plus if any of these apply:
- Hardware budget is constrained (2-3x lower self-hosting cost).
- You have significant CJK code or documentation in your codebase.
- Your legal team prefers Apache 2.0 (most common).
- You’re already in the Alibaba Cloud ecosystem.
- You need very fast inference (3B active params is hard to beat).
- You’re running heavy agent-loop workloads where Qwen’s tool-use tuning helps.
Comparison to alternatives
For context, here’s how GLM-5.1 and Qwen 3.6 Plus stack up against the broader open-weights field:
| Model | SWE-Bench Pro | License | Self-host complexity |
|---|---|---|---|
| GLM-5.1 | 58.4% | MIT | High (754B MoE) |
| Kimi K2.6 | 58.6% | Modified MIT | Moderate |
| DeepSeek V4 Pro | ~58% | DeepSeek License | High |
| Qwen 3.6 Plus | ~57% | Apache 2.0 | Low |
| MiniMax M2.7 | — | Restrictive | High |
For self-hosted enterprise, the practical shortlist is GLM-5.1 (capability) or Qwen 3.6 Plus (efficiency). Kimi K2.6 is competitive but the Modified MIT license creates more legal review than pure MIT or Apache 2.0.
Going to production: best practices
For either model:
- Start with INT8 quantization. It’s the best capability/cost trade-off. Move to INT4 only if you’ve validated that the capability loss is acceptable.
- Use vLLM or SGLang. Both have excellent open-weights model support with continuous batching, paged attention, and prefix caching. TensorRT-LLM is faster but harder to operate.
- Run a router pattern. Default to open-weights, escalate to Claude Opus 4.7 or Mythos Preview for hard tasks (~10-20% of tickets). This captures most of the cost savings while preserving ceiling capability; a minimal routing sketch follows this list.
- Monitor the gap. SWE-Bench Pro and GDPval-AA scores update monthly. Track whether your model is keeping pace with the field, and plan to update model versions every 2-4 months.
- Build evals on your codebase. Public benchmarks tell you relative ordering. Your own evals tell you whether model A or B works better for your specific code style. Run quarterly internal evals.
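To make the router pattern concrete, here is a minimal sketch assuming an OpenAI-compatible endpoint in front of the self-hosted model (vLLM and SGLang both expose one) and a hosted frontier endpoint for escalation. The URLs, model IDs, and escalation heuristic are placeholders, not any vendor’s documented values.

```python
# Minimal router sketch: default to the self-hosted open-weights endpoint,
# escalate hard tasks to a hosted frontier model. URLs, model names, and the
# escalation heuristic are placeholders; in production the routing signal is
# usually a task classifier or a retry-after-failure policy.
import os
from openai import OpenAI

local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")
frontier = OpenAI(base_url="https://api.example-provider.com/v1",
                  api_key=os.environ["FRONTIER_API_KEY"])

def looks_hard(task: str) -> bool:
    # Placeholder heuristic: very long context or whole-repo scope.
    return len(task) > 20_000 or "whole repo" in task.lower()

def complete(task: str) -> str:
    if looks_hard(task):
        client, model = frontier, "frontier-coder"   # hosted escalation target (placeholder ID)
    else:
        client, model = local, "glm-5.1"              # self-hosted default (placeholder ID)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

print(complete("Add a unit test for the retry logic in http_client.py"))
```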
Bottom line
For self-hosted enterprise coding deployment in May 2026, GLM-5.1 wins on capability (SWE-Bench Pro 58.4%, 754B MoE, MIT license, NVIDIA-independent). Qwen 3.6 Plus wins on efficiency (2-3x cheaper hardware, Apache 2.0, best CJK multilingual). For most enterprises, the right answer is to evaluate both on your codebase for two weeks and pick based on which one performs better on your real tasks. For multi-vendor strategies, run both and route based on language and task complexity.
Sources: BenchLM.ai Chinese leaderboard (April 2026), Atlas Cloud “Kimi K2.6 vs GLM 5.1 vs Qwen 3.6 Plus vs MiniMax M2.7” (April 2026), Z.ai GLM-5.1 documentation, Alibaba Cloud Qwen 3.6 documentation, artificialanalysis.ai (May 2026).