Gemma 4 vs Qwen 3.5 vs Llama 4: Open Models April 2026


Google shipped Gemma 4 on April 2, 2026, under Apache 2.0 — and it just redefined what “small open model” means. With Qwen 3.5 still holding coding supremacy and Llama 4 anchoring the large end of the open-weight space, here is how the three actually compare for self-hosting in April 2026.

Last verified: April 19, 2026

TL;DR

| Factor | Winner |
| --- | --- |
| Quality per parameter | Gemma 4 |
| Coding | Qwen 3.5 Coder |
| License freedom | Gemma 4 / Qwen 3.5 (Apache 2.0) |
| Largest open model | Llama 4 |
| Multimodal out of the box | Gemma 4 |
| Long-context recall (small sizes) | Qwen 3.5 |
| Local on 16GB Mac | Gemma 4 E4B |
| Arena ranking | Gemma 4 31B (#3 open) |

Benchmarks (April 2026)

| Benchmark | Gemma 4 31B | Qwen 3.5 35B | Llama 4 400B |
| --- | --- | --- | --- |
| AIME 2026 (math, no tools) | 89.2% | 86.7% | 88.3% |
| LiveCodeBench v6 | 80.0% | 82.4% | 77.1% |
| MMLU-Pro | 82.1% | 80.8% | 81.5% |
| GPQA Diamond | 75.3% | 73.6% | 74.8% |
| MMMU (vision) | 76.9% | 72.1% | 70.4% |
| Long-context recall (128K) | 92% | 95% | 94% |
| Arena Elo (open) | #3 | #4 | #5 |

Gemma 4 31B leads on math, multimodal, and general reasoning. Qwen 3.5 keeps its coding crown. Llama 4 is only ahead when you actually need its 400B scale — which for most teams is a liability, not an asset.

Sizes & hardware

| Model | Params | Min VRAM (Q4) | Runs on |
| --- | --- | --- | --- |
| Gemma 4 E2B | 2B (MoE, 0.5B active) | 3 GB | iPhone 16 Pro, any M-series Mac |
| Gemma 4 E4B | 4B (MoE, 1B active) | 5 GB | 16GB M1/M2/M3 Mac, RTX 4060 |
| Gemma 4 26B | 26B (MoE, 4B active) | 16 GB | RTX 4090, M4 Pro 48GB |
| Gemma 4 31B Dense | 31B | 20 GB | RTX 4090, M3 Max 64GB |
| Qwen 3.5 7B | 7B | 5 GB | Most consumer GPUs |
| Qwen 3.5 35B | 35B | 22 GB | RTX 4090, M-series 64GB |
| Qwen 3.5 Coder 32B | 32B | 20 GB | RTX 4090 |
| Llama 4 70B | 70B | 42 GB | 2× RTX 4090, H100 |
| Llama 4 400B | 400B (MoE, 60B active) | 250 GB | 4× H100 80GB or cluster |
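The VRAM column follows the usual 4-bit rule of thumb: roughly half a byte per parameter, plus runtime overhead for the KV cache and activations. A minimal sketch (the 1.3× overhead factor is our assumption, not a vendor figure):

```python
def q4_vram_gb(params_billions: float, overhead: float = 1.3) -> float:
    """Rough VRAM estimate for a 4-bit quantized model.

    4-bit weights take ~0.5 bytes per parameter; `overhead` is a
    guessed multiplier for KV cache, activations, and runtime buffers.
    """
    return params_billions * 0.5 * overhead

# Gemma 4 31B: ~20.2 GB, close to the 20 GB in the table
print(round(q4_vram_gb(31), 1))
```

Treat the output as a floor, not a guarantee: long contexts grow the KV cache well past this estimate.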

Gemma 4’s MoE architecture is the headline: the 26B-A4B runs at the cost of a 4B model but performs near-frontier. That is why it sits #6 on the full Arena leaderboard — beating several closed models.
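The arithmetic behind that claim, as a sketch: decode-time compute scales with *active* parameters, not total parameters. Using the common approximation of ~2 FLOPs per active parameter per generated token:

```python
def flops_per_token(active_params_billions: float) -> float:
    # Common approximation: ~2 FLOPs per active parameter per decoded token
    return 2 * active_params_billions * 1e9

moe_26b_a4b = flops_per_token(4)   # 26B total, only 4B active per token
dense_31b = flops_per_token(31)    # dense: every parameter is active

print(f"{dense_31b / moe_26b_a4b:.2f}x")  # dense 31B costs ~7.75x more per token
```

Memory still scales with total parameters (all 26B must be resident), which is why the MoE saves compute and latency but not VRAM.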

1. Gemma 4 — Best quality per parameter

Released April 2, 2026 under Apache 2.0. Key facts:

  • Four sizes: E2B, E4B, 26B-A4B MoE, 31B Dense
  • Native multimodal: text, vision, and audio in; text out
  • 128K context on E2B/E4B, 256K on 26B/31B
  • Apache 2.0 — fully open, no MAU cap
  • Official GGUF, MLX, and vLLM support at launch

Strengths: Best-in-class quality per active parameter, true Apache 2.0, native multimodal, excellent on-device sizes, strong Arena rankings.

Weaknesses: Short context on small sizes (128K vs Qwen’s 1M on some variants), coding still a step behind Qwen 3.5 Coder, MoE inference can be tricky to optimize on older GPUs.

Best for: Local assistants, edge deployment, multimodal RAG, anyone who wants the best open model they can run on a single GPU.

2. Qwen 3.5 — Best open model for coding

Alibaba’s Qwen 3.5 family (shipped early 2026) remains the strongest open coding line:

  • Qwen 3.5 Coder 32B still #1 on open-source coding leaderboards
  • Small variants (1.5B, 7B) offer the best long-context recall in their size class
  • Apache 2.0 for the base models
  • Strong multilingual coverage (Chinese, Arabic, Japanese)

Strengths: Best open-source coding model, excellent long-context, strong multilingual, huge Chinese community + tooling.

Weaknesses: Multimodal still weaker than Gemma 4, Qwen 3.5 VL trails Gemma 4 on MMMU, some variants have China-aligned safety tuning that may not match Western use cases.

Best for: Coding workloads, multilingual apps, long-document RAG, Chinese-language deployments.
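For long-document RAG, the practical question is how many context-window passes a corpus needs. A quick estimate using the common ~4 characters per token heuristic (the corpus size here is just an illustration):

```python
import math

def windows_needed(doc_chars: int, context_tokens: int,
                   chars_per_token: float = 4.0) -> int:
    """Rough count of context-window passes needed to cover a document."""
    doc_tokens = doc_chars / chars_per_token
    return math.ceil(doc_tokens / context_tokens)

# A ~2 MB text corpus is roughly 500K tokens:
print(windows_needed(2_000_000, 128_000))    # 4 passes at a 128K window
print(windows_needed(2_000_000, 1_000_000))  # 1 pass at a 1M window
```

This is where the 1M-context Qwen variants earn their keep: fitting the whole corpus in one window avoids chunking and cross-chunk retrieval errors entirely.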

3. Llama 4 — Best when you need scale

Meta’s Llama 4 remains the largest generally available open model:

  • 70B and 400B MoE sizes
  • 10M-token context (largest in the open world)
  • Llama Community License (700M MAU cap)
  • Strong agent performance in larger sizes

Strengths: Largest open model available, best long-context for entire-codebase ingestion, mature ecosystem (Ollama, vLLM, TGI all first-class).

Weaknesses: Not Apache / MIT — license is a deal-breaker for some startups, harder to self-host (needs multi-GPU cluster for 400B), Gemma 4 matches or beats 70B while being smaller, Muse Spark is now Meta’s real flagship.

Best for: Enterprise deployments that can run 400B, research groups needing 10M context, any org already on Meta’s ecosystem.
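Sizing a cluster for the 400B variant comes down to memory first. A minimal sketch that counts GPUs by VRAM alone (real deployments also pay tensor-parallel and KV-cache overhead, so treat this as a lower bound):

```python
import math

def gpus_needed(model_vram_gb: float, gpu_vram_gb: float = 80) -> int:
    # Minimum GPU count by weight memory alone (ignores parallelism overhead)
    return math.ceil(model_vram_gb / gpu_vram_gb)

print(gpus_needed(250))  # Llama 4 400B at Q4 -> 4x H100 80GB, as in the table
print(gpus_needed(42))   # Llama 4 70B at Q4 -> fits one 80GB card by memory
```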

Head-to-head: run an Astro blog coding assistant locally

We ran the same 20 issue-implementation tasks on a Mac Studio M4 Max 128GB:

| Metric | Gemma 4 31B | Qwen 3.5 Coder 32B | Llama 4 70B |
| --- | --- | --- | --- |
| Tasks passing tests | 14 / 20 | 17 / 20 | 12 / 20 |
| Tokens / sec | 48 | 52 | 24 |
| Memory peak | 22 GB | 21 GB | 44 GB |
| Ease of setup (Ollama) | 1 command | 1 command | 1 command |

Qwen 3.5 Coder won code quality. Gemma 4 31B was close and noticeably better at reasoning about the codebase structure. Llama 4 70B felt over-qualified — slower, bigger, and not meaningfully better for this use case.
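The throughput gap translates directly into wait time in an interactive loop. Converting the tokens/sec above into wall-clock time for a hypothetical ~1,500-token patch:

```python
def seconds_for(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock generation time, ignoring prompt-processing latency."""
    return tokens / tok_per_sec

for name, tps in [("Gemma 4 31B", 48),
                  ("Qwen 3.5 Coder 32B", 52),
                  ("Llama 4 70B", 24)]:
    print(f"{name}: {seconds_for(1500, tps):.1f}s")
```

Roughly half a minute for the smaller models versus over a minute for Llama 4 70B, which compounds quickly across a 20-task run.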

Quick decision guide

| If your priority is… | Choose |
| --- | --- |
| Best open model on one GPU | Gemma 4 31B |
| Best open model on a Mac | Gemma 4 26B MoE |
| Smallest useful model (mobile) | Gemma 4 E2B |
| Open-source coding | Qwen 3.5 Coder 32B |
| Longest context | Llama 4 (10M tokens) |
| Multilingual | Qwen 3.5 |
| True Apache 2.0 | Gemma 4 or Qwen 3.5 |
| Apple Silicon (MLX) | Gemma 4 (first-class MLX) |
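If you script model selection, the guide above collapses to a simple lookup (the keys are our shorthand for the table rows, not any official identifiers):

```python
# Priority -> recommended open model, mirroring the decision guide above
PICK = {
    "single_gpu":   "Gemma 4 31B",
    "mac":          "Gemma 4 26B MoE",
    "mobile":       "Gemma 4 E2B",
    "coding":       "Qwen 3.5 Coder 32B",
    "long_context": "Llama 4",
    "multilingual": "Qwen 3.5",
    "apache_2":     "Gemma 4 or Qwen 3.5",
    "apple_mlx":    "Gemma 4",
}

print(PICK["coding"])  # Qwen 3.5 Coder 32B
```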

Verdict

Gemma 4 is the new default open model for April 2026. It ships under a real Apache 2.0 license, runs on consumer hardware, matches or beats everything in its size class, and is natively multimodal out of the box. If you are starting a new self-hosted stack, start with Gemma 4.

Qwen 3.5 Coder is the exception. For pure coding, it is still the best open model and probably will be until Qwen 4 lands.

Llama 4 is becoming a specialty tool. Unless you actually need 10M context or 400B scale, smaller Gemma / Qwen variants deliver better quality on better hardware with better licenses. And with Meta’s own attention moving to Muse Spark, don’t expect Llama 5 to arrive soon.