TL;DR
VoxCPM2 is OpenBMB’s April 2026 release of a tokenizer-free Text-to-Speech system that skips discrete audio tokens entirely and models continuous speech through an end-to-end diffusion autoregressive architecture. It’s the clearest shot yet at open-source, self-hostable voice cloning that actually competes with ElevenLabs. Key facts:
- 2B-parameter model built on a MiniCPM-4 backbone
- Trained on 2M+ hours of multilingual speech across 30 languages and 9 Chinese dialects
- 48kHz studio-quality output from 16kHz reference input — super-resolution is built in
- Three generation modes: Voice Design from text description, Controllable Cloning with style steering, Ultimate Cloning from reference + transcript
- ~8 GB VRAM on an RTX 4090, RTF ~0.3 (real-time factor), down to ~0.13 with Nano-vLLM
- Apache-2.0 license — commercial use is explicitly allowed
- 14,686 GitHub stars, 5,051 new this week
- OpenAI-compatible serving via vLLM-Omni for production deploys
If you’ve been waiting for an open-source TTS that doesn’t sound like a robotic Google Translate demo and actually handles languages outside English/Chinese well, VoxCPM2 is the model to try this week.
Quick Reference
| Field | Value |
|---|---|
| Repo | OpenBMB/VoxCPM |
| Weights | 🤗 openbmb/VoxCPM2 |
| Playground | HF Spaces demo |
| Install | pip install voxcpm |
| License | Apache-2.0 (commercial-ready) |
| Backbone | MiniCPM-4 (2B params) |
| Sample rate | 48kHz output, 16kHz reference |
| Languages | 30 (Arabic → Vietnamese) |
| VRAM | ~8 GB |
| Release date | April 2026 |
What VoxCPM2 Actually Is
The “tokenizer-free” part matters more than it sounds. Most modern open-source TTS systems (XTTS, CosyVoice, IndexTTS, F5-TTS) work in two stages: a language model predicts discrete audio tokens, then a separate codec decoder turns those tokens back into waveforms. That pipeline has two weak points — the codec loses detail, and the quantization step introduces artifacts you can hear as metallic warble on sibilants or clipped breaths.
VoxCPM2 skips that. The model predicts continuous speech representations directly from text using a diffusion autoregressive objective, and those representations go through a single AudioVAE v2 decoder with built-in super-resolution to produce 48kHz output. No discrete codebook, no external tokenizer dependency, no upsampler bolted on at the end.
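To make the distinction concrete, here is a deliberately toy sketch (my illustration, not OpenBMB's code) of what a continuous-latent autoregressive loop looks like: a backbone summarizes the context, a small diffusion-style denoiser emits one continuous frame per step, and there is no codebook lookup anywhere.

```python
import numpy as np

# Toy sketch only — stand-ins for the real backbone and diffusion head.
rng = np.random.default_rng(0)
LATENT_DIM = 8

def backbone(context):
    # Stand-in for the LM backbone: collapse context into a conditioning vector.
    return context.mean(axis=0) if len(context) else np.zeros(LATENT_DIM)

def diffusion_head(cond, steps=10):
    # Stand-in denoiser: start from noise, iteratively pull toward the condition.
    x = rng.normal(size=LATENT_DIM)
    for _ in range(steps):
        x = x + 0.3 * (cond - x)  # one "denoising" step
    return x

context = np.zeros((0, LATENT_DIM))
for _ in range(5):  # generate 5 continuous frames autoregressively
    frame = diffusion_head(backbone(context))
    context = np.vstack([context, frame])
```

The point of the sketch is the output type: every generated frame stays a real-valued vector end to end, which is exactly the quantization step the tokenizer-based pipelines cannot avoid.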
What you actually get at the API surface:
- Voice Design — write a prompt like `"(A young woman, gentle and sweet voice)Hello!"` and the model invents a voice that matches the description. No reference audio needed.
- Controllable Voice Cloning — upload a reference clip, the model clones the timbre, and you can still inject style instructions (`"slightly faster, cheerful tone"`) to steer emotion and pacing.
- Ultimate Cloning — provide the reference clip and its exact transcript. The model treats it as an audio continuation task and reproduces every nuance — micro-pauses, breathing, speaking rhythm, emotional register.
- Streaming — chunked generation with RTF low enough that real-time agents (phone bots, live narration) become plausible on a single consumer GPU.
Why It’s Trending This Week
Three things pushed VoxCPM2 up GitHub Trending fast:
- It fixes the main complaint about VoxCPM 0.5B/1.5. The old models were only bilingual (Chinese + English) and had voice-consistency issues toward the end of long sentences. VoxCPM2 adds 28 more languages and trains on ~5x the data.
- 48kHz output is rare in open-source TTS. Most open models top out at 22kHz or 24kHz. If you're producing podcast audio, audiobook narration, or dubbing, the extra bandwidth the AudioVAE decoder preserves audibly matters.
- Apache-2.0 matters for anyone shipping a product. A lot of the strongest TTS options right now (F5-TTS, some IndexTTS releases) carry non-commercial or research-only restrictions. Commercial teams can drop VoxCPM2 into a paid product without a license review.
The r/LocalLLaMA thread on VoxCPM2 hit 105 upvotes in the first day, with commenters specifically calling out that cross-language cloning (English speaker saying Japanese lines in their own voice) works better than they expected from any open model.
Installation
Baseline install is one line if you already have PyTorch and a modern CUDA setup:
```bash
pip install voxcpm
```
Requirements are strict: Python 3.10–3.12, PyTorch ≥ 2.5.0, CUDA ≥ 12.0. Python 3.13 is not supported yet because some of the audio deps (specifically soundfile and a custom kernel) don’t have 3.13 wheels.
If you want to pin the model locally instead of letting HuggingFace cache it in ~/.cache, use ModelScope:
```bash
pip install modelscope
```

```python
from modelscope import snapshot_download

snapshot_download(
    "OpenBMB/VoxCPM2",
    local_dir="./pretrained_models/VoxCPM2",
)
```
First pull is ~4.5 GB of weights plus the AudioVAE decoder. Plan accordingly on metered connections.
Hello, World: Basic Synthesis
Minimum viable VoxCPM2 call — text in, wav out:
```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("saved: demo.wav")
```
A few knobs to know:
- `cfg_value` — classifier-free guidance strength. 2.0 is the default. Below 1.5 the voice drifts off the prompt; above 3.0 it gets oversaturated and starts clipping in places.
- `inference_timesteps` — diffusion steps. 10 is a good default; 20 buys you slightly cleaner sibilants at ~2x the latency.
- `load_denoiser=False` — skips loading the optional denoiser model if your inputs are already clean. Saves ~1 GB of VRAM.
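If you're tuning these by ear, a tiny helper makes it easy to render one take per setting and A/B them. The `sweep` function below is my own wrapper, not part of the voxcpm API:

```python
from itertools import product

def sweep(generate, cfgs=(1.5, 2.0, 2.5), steps=(10, 20)):
    """Render one take per (cfg_value, inference_timesteps) pair for listening tests."""
    return {(c, s): generate(cfg_value=c, inference_timesteps=s)
            for c, s in product(cfgs, steps)}

# Hypothetical wiring against a loaded model:
#   takes = sweep(lambda **kw: model.generate(text="test line", **kw))
#   for (c, s), wav in takes.items():
#       sf.write(f"take_cfg{c}_steps{s}.wav", wav, model.tts_model.sample_rate)
```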
Voice Design (No Reference Audio)
This is the feature competitors don't offer cleanly. You describe a voice in plain English and the model invents one. The convention is: put the description in parentheses at the start of the text.
```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```
What works in the description:
- Gender — “male”, “female”, “androgynous”
- Age — “young”, “middle-aged”, “elderly”, or explicit “in her twenties”
- Tone / emotion — “gentle”, “authoritative”, “warm”, “cynical”, “excited”
- Pace — “slow”, “measured”, “brisk”, “rapid-fire”
- Character traits — “slight smile in the voice”, “breathy”, “gravelly”, “nasal”
What doesn’t work reliably:
- Named voices (“sounds like Morgan Freeman”) — the model wasn’t trained on labeled celebrity data.
- Regional accents in English — “British accent” sometimes comes through, but it’s inconsistent. Use Ultimate Cloning with a reference clip if accent accuracy matters.
- Very specific prosody instructions (“emphasize the word ‘critical’”) — phrase-level control is weaker than sentence-level mood.
The team is upfront about this: the repo's limitations section notes that "Voice Design and Controllable Voice Cloning results can vary between runs — you may try to generate 1–3 times to obtain the desired voice or style." Run the same prompt twice and you'll get two different voices matching the description. For production that's a feature (draft three takes and pick one), but it's a pain if you need determinism.
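The "generate a few, keep the best" loop is simple enough to automate. This helper is my own sketch, not part of the voxcpm API, and `score_fn` is a placeholder for whatever selection signal you have (a speaker-similarity model, a MOS predictor, or just returning everything and picking by ear):

```python
def best_take(generate, score_fn, n=3):
    """Draw n takes from a nondeterministic generator and keep the highest-scoring."""
    takes = [generate() for _ in range(n)]
    return max(takes, key=score_fn)

# Hypothetical wiring against a loaded model:
#   wav = best_take(lambda: model.generate(text=line, cfg_value=2.0),
#                   score_fn=similarity_to_reference)
```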
Voice Cloning from a Reference Clip
The more interesting mode for most real-world use cases — you’ve got a 10-second voice sample and you want the model to read a script in that voice.
```python
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)
```
You can layer a style hint on top of the clone:
```python
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```
For maximum fidelity, there’s Ultimate Cloning — provide the reference audio and its transcript, and the model treats the new text as a continuation:
```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",  # optional, improves similarity
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```
In practice, Ultimate Cloning with both `prompt_wav_path` and `reference_wav_path` pointing at the same clip gives the most convincing results. It's the mode that comes closest to ElevenLabs Instant Voice Clone quality on clean reference audio.
Streaming Generation
Real-time applications (voice agents, live narration) need audio flowing out before the full utterance is rendered. VoxCPM2 exposes a streaming iterator:
```python
import numpy as np
import soundfile as sf

chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
):
    chunks.append(chunk)

wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```
On an RTX 4090, the first chunk lands in roughly 400–600ms, and subsequent chunks arrive at RTF ~0.3. If you run VoxCPM2 under Nano-vLLM-VoxCPM, RTF drops to ~0.13 with batched concurrent requests — that’s the configuration you want for any multi-user deployment.
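Collecting chunks into a list and writing a file at the end defeats the point for a live agent; you want audio reaching the device as each chunk arrives. A minimal producer/consumer sketch follows. The queue wiring is mine, and `play_fn` stands in for whatever audio sink you use (for instance a sounddevice output stream's write method):

```python
import queue
import threading

def stream_to_player(chunk_iter, play_fn, max_buffered=8):
    """Hand chunks from a (possibly slow) generator to an audio callback as they arrive."""
    q = queue.Queue(maxsize=max_buffered)  # bounded: backpressure if playback lags
    SENTINEL = object()

    def producer():
        for chunk in chunk_iter:
            q.put(chunk)
        q.put(SENTINEL)  # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while (chunk := q.get()) is not SENTINEL:
        play_fn(chunk)  # e.g. write to a sounddevice.OutputStream

# Hypothetical wiring:
#   stream_to_player(model.generate_streaming(text=line), stream.write)
```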
Production Deployment with vLLM-Omni
For anything beyond a local demo, skip `pip install voxcpm` and run the model under vLLM-Omni, the official vLLM project's omni-modal extension. You get PagedAttention, continuous batching, and — most importantly — a drop-in OpenAI-compatible `/v1/audio/speech` endpoint:
```bash
# Install vllm-omni from source
uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni
uv pip install -e .

# Launch the server
vllm serve openbmb/VoxCPM2 --omni --port 8000
```
Then call it from any client that already speaks OpenAI TTS:
```bash
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2 on vLLM-Omni!","voice":"default"}' \
  --output out.wav
```
This is the setup to use if you’re migrating from OpenAI TTS or ElevenLabs and you want to keep your existing client code.
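The same call works from Python with only the standard library. The sketch below builds the identical request payload; the commented `urlopen` block does the actual fetch once the server is running:

```python
import json
from urllib import request

def tts_request(base_url, text, model="openbmb/VoxCPM2", voice="default"):
    """Build an OpenAI-style /v1/audio/speech request for the vLLM-Omni server."""
    payload = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = tts_request("http://localhost:8000", "Hello from VoxCPM2 on vLLM-Omni!")
# with request.urlopen(req) as resp:       # requires the server to be running
#     open("out.wav", "wb").write(resp.read())
```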
How It Compares
| Model | Params | Languages | Sample Rate | License | Voice Design |
|---|---|---|---|---|---|
| VoxCPM2 | 2B | 30 | 48kHz | Apache-2.0 | ✅ |
| VoxCPM 1.5 | 0.6B | 2 | 44.1kHz | Apache-2.0 | — |
| F5-TTS | 0.3B | 2 | 24kHz | Research-only | — |
| XTTS v2 | 0.4B | 17 | 24kHz | CPML (non-commercial) | — |
| CosyVoice 3 | 1.5B | ~20 | 24kHz | Apache-2.0 | Partial |
| IndexTTS 2 | 1B | 2 | 24kHz | Apache-2.0 | — |
| Qwen3-TTS | 7B | ~15 | 24kHz | Apache-2.0 | Partial |
VoxCPM2 is the only current Apache-2.0 option combining 30-language coverage, 48kHz output, and voice-design-from-text in a single model. The tradeoff is VRAM — 2B params plus the AudioVAE decoder need ~8 GB, versus ~4 GB for F5-TTS or XTTS.
On the Seed-TTS-eval and CV3-eval benchmarks, the team reports state-of-the-art or comparable scores against every open-source competitor, with specific wins on InstructTTSEval (which measures controllability from natural-language style prompts — exactly the thing Voice Design is designed for).
Community Reactions
The early Reddit and X reaction has been notably less cynical than most TTS launches:
- A r/TextToSpeech comment from March 2026 called the original VoxCPM “a good option: little hallucinations, and has bundled finetune scripts that actually work” — specifically citing successful Latvian fine-tuning on Mozilla Common Voice data. The fine-tuning story has always been VoxCPM’s strongest selling point; VoxCPM2 ships with both SFT and LoRA scripts that work with as little as 5–10 minutes of reference audio.
- The LocalLLaMA community flagged the 30-language support as the single biggest jump — being able to clone an English speaker’s voice and then have them speak Japanese, Turkish, or Hindi in the same timbre is genuinely new for Apache-2.0 models.
- Skeptics note that 48kHz output can be misleading: the model is trained to produce 48kHz, but if your reference audio is 16kHz lossy (e.g., a compressed voice memo), the upsampled output will still carry the reference’s frequency ceiling, just in a 48kHz container.
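You can measure a reference clip's real frequency ceiling yourself before blaming the model. A rough numpy sketch (the 99%-of-energy cutoff is my own heuristic, not anything VoxCPM2 uses):

```python
import numpy as np

def spectral_ceiling(wav, sr, energy_frac=0.99):
    """Frequency (Hz) below which energy_frac of the signal's spectral energy lies.
    A 16kHz voice memo repackaged in a 48kHz file still ceilings near 8kHz."""
    spec = np.abs(np.fft.rfft(wav)) ** 2
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    cum = np.cumsum(spec) / np.sum(spec)
    return float(freqs[np.searchsorted(cum, energy_frac)])

# Usage sketch: wav, sr = sf.read("reference.wav"); print(spectral_ceiling(wav, sr))
```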
Honest Limitations
The README is refreshingly upfront about what doesn’t work yet. The main ones to plan around:
- Run-to-run variance in Voice Design and Controllable Cloning. Expect to generate 1–3 takes and pick the best one. Not a deal-breaker for offline content production, but awkward in deterministic production pipelines.
- Cross-language accent bleed. Clone an English speaker and have them say a Spanish line — it works, but a subtle English accent comes through. For native-accent output, you need a native-speaker reference clip in the target language.
- No phoneme-level control. You can’t force a specific pronunciation of a word, and the model will occasionally mispronounce proper nouns or technical jargon. The only workaround today is SSML-style phonetic spellings in the input text.
- Long-context generation drift. Past ~45 seconds of continuous generation, the voice can drift slightly in timbre — Ultimate Cloning mode is much less prone to this than Voice Design.
- Python 3.13 and CUDA <12 unsupported. If you're on an older RTX 20-series box or a cluster still pinned to CUDA 11, you're waiting for a wheel update.
- Explicit safety clause. The README's use-case disclaimer bans using the model for "impersonation, fraud, or disinformation." It's a policy statement layered on top of the Apache-2.0 license rather than a license term, but worth knowing if you're designing a product around voice cloning.
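For the long-context drift specifically, the usual workaround is to split the script at sentence boundaries and synthesize chunk by chunk. A minimal splitter follows; the 400-character budget is my rough proxy for staying under ~45 seconds of speech, not a number from the repo:

```python
import re

def split_for_tts(text, max_chars=400):
    """Split a script at sentence boundaries into chunks short enough to avoid
    timbre drift, for per-chunk synthesis and concatenation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + len(s) + 1 > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks

# Per-chunk synthesis keeps each generate() call under the drift horizon:
#   wav = np.concatenate([model.generate(text=c, reference_wav_path=ref)
#                         for c in split_for_tts(script)])
```

Ultimate Cloning per chunk (same reference every time) also keeps the timbre anchored across chunk boundaries.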
FAQ
Can VoxCPM2 run on a MacBook? Not officially. The distribution assumes CUDA ≥ 12.0 on an NVIDIA GPU. There’s no MPS (Apple Silicon) backend yet, and no one has published a working MLX port. If you must run it on a Mac, the only realistic option today is via a remote vLLM-Omni server on a Linux box with an NVIDIA GPU.
What’s the minimum GPU I need? The VoxCPM2 model card lists ~8 GB VRAM as the baseline. In practice that means an RTX 3060 12GB, RTX 4060 Ti 16GB, or any RTX 4070 / A5000 / L4 / A10 or better. The 2B backbone plus AudioVAE decoder won’t quite fit on an 8 GB card without quantization, and quantization support isn’t upstream yet.
How does it compare to ElevenLabs? On clean reference audio with Ultimate Cloning, VoxCPM2 is close enough that most listeners can’t tell the difference in blind A/B tests under 20 seconds. Where ElevenLabs still wins: studio-grade consistency across hour-long narrations, more robust emotion controls, and mature enterprise features (per-voice fine-tuning UI, dubbing tools). Where VoxCPM2 wins: zero marginal cost per character, commercial-friendly license, full control over the weights, and 30-language support without per-language pricing tiers.
Can I fine-tune it on my own voice? Yes. The repo includes both full SFT and LoRA fine-tuning scripts. The team recommends 5–10 minutes of clean reference audio for LoRA and 30+ minutes for full SFT. LoRA is what you want for cloning a single voice; SFT is for specializing the whole model (e.g., a new language or a specific speaking domain like sports commentary).
Does it work with LangChain / LlamaIndex / existing agent frameworks?
Indirectly. Run it under vLLM-Omni and it exposes an OpenAI-compatible /v1/audio/speech endpoint, which means any tool that already calls OpenAI’s TTS API (LangChain, LlamaIndex, LibreChat, every voice-agent framework) will work with a base URL swap. Set OPENAI_BASE_URL=http://your-server:8000/v1 and you’re done.
Is the Apache-2.0 license actually clean for commercial use? The license itself is, yes — Apache-2.0 is as permissive as it gets. The separate use-case disclaimer in the README bans impersonation, fraud, and disinformation, but that’s a policy statement rather than a license restriction. In practice, commercial products using VoxCPM2 for podcast narration, audiobook production, in-game voices, or accessibility TTS are cleanly in bounds. Deepfake-style impersonation of real people without consent is not.
The Bottom Line
VoxCPM2 is the first open-source TTS model I’d reach for in 2026 when the constraint is commercial-ready, multilingual, and actually sounds human. It’s not ElevenLabs, but for most podcast, audiobook, and voice-agent workloads the quality gap has closed enough that license and self-hosting advantages dominate the decision.
If you already run a local LLM stack on an 8GB+ NVIDIA GPU, VoxCPM2 is a one-line install and vLLM-Omni gives you an OpenAI-compatible endpoint for free. Start there before renewing any TTS subscription.
Related reads on andrew.ooo: Self-Hosting Email for AI Agents, Flash-MoE: Run 397B model on MacBook, AnythingLLM Review.