MOSS-TTS Review: Open-Source 8B TTS Family with Voice Cloning

TL;DR

MOSS-TTS is OpenMOSS and MOSI.AI’s open-source speech and sound generation family that hit GitHub Trending this week with 3,015 stars and 974 new in seven days. Unlike single-model releases like VoxCPM2, MOSS-TTS ships five production-ready models under one umbrella — flagship TTS, multi-speaker dialogue, voice design, real-time streaming, and sound effects — designed to be used independently or composed into a pipeline. Key facts:

Five-model family: MOSS-TTS (8B flagship), MOSS-TTSD (8B dialogue), MOSS-VoiceGenerator (1.7B voice design), MOSS-TTS-Realtime (voice agents), MOSS-SoundEffect (8B → 1.3B DiT v2.0)
Latest release MOSS-TTS-v1.5 (May 26, 2026) adds explicit pause control via [pause X.Ys], stronger multilingual synthesis with language tags, and punctuation-following prosody
Two architectures: MossTTSDelay (multi-head parallel RVQ with delay-pattern scheduling, production-tuned) and MossTTSLocal (time-synchronous RVQ blocks with depth transformer, streaming-friendly)
PyTorch-free inference via llama.cpp + ONNX Runtime; GGUF quantized weights mean the 8B model fits on 8 GB GPUs
180 ms TTFB on MOSS-TTS-Realtime; full LLM-first-sentence + Realtime-TTFB latency is 377 ms — fast enough for live voice agents
MOSS-TTSD v1.0 is reported by its authors to outperform Doubao and Gemini 2.5-pro in subjective dialogue evaluations
MOSS-TTS-Nano: separate 0.1B sibling that runs realtime on a 4-core CPU at 48 kHz stereo
SGLang backend gives ~3× faster generation throughput on the Delay architecture
Recent r/LocalLLaMA TTS benchmark thread: “Very good TTS voice cloning in my experience” (May 2026)

If you’ve been juggling separate models for voiceover, podcasts, agent voices, and game SFX, MOSS-TTS is the first credible open-source attempt to put all of it under one roof — with weights you can actually run locally.

Quick Reference

Field	Value
Repo	OpenMOSS/MOSS-TTS
Org	OpenMOSS team + MOSI.AI
Stars	3,015 (974 this week, Trending #12)
Latest	MOSS-TTS-v1.5 + MOSS-SoundEffect-v2.0 (May 26, 2026)
Backbones	8B (flagship), 1.7B (Local + VoiceGenerator), 1.3B DiT (SoundEffect-v2), 0.1B (Nano)
Architectures	MossTTSDelay, MossTTSLocal, MossTTSRealtime
Inference	PyTorch, SGLang (~3× faster), llama.cpp + ONNX (Torch-free, 8 GB VRAM)
Weights	HF: OpenMOSS-Team + ModelScope
Paper	arXiv 2603.18090 (MOSS-TTS), 2603.19739 (TTSD), 2603.28086 (VoiceGenerator)
Studio	studio.mosi.cn (hosted API)
Apple Silicon	Supported via mlx-audio since May 6, 2026

What Makes MOSS-TTS Different

Most open-source TTS releases in 2025-2026 — Kokoro, F5-TTS, IndexTTS2, Qwen3-TTS, VoxCPM2 — solve one job well. You get zero-shot voice cloning, or streaming, or sound effects, but not all three from a coherent design.

MOSS-TTS picks a different bet: one architectural family, five specialized heads.

MOSS-TTS — the flagship 8B model. Best zero-shot voice cloning quality in the family. Supports long-form synthesis (tens of minutes), fine-grained control over Pinyin, phonemes, and per-token duration, and code-switched multilingual output.
MOSS-TTSD — spoken dialogue generation for multi-speaker, ultra-long conversations. The v1.0 release claims to outperform Doubao and Gemini 2.5-pro in human preference tests on dialogue quality. This is the model to use for podcast-style content.
MOSS-VoiceGenerator — a 1.7B voice design model. Generate a fresh voice from a text description ("middle-aged woman, smoky tone, slight rasp") without any reference audio. Use it standalone or as a design layer feeding into MOSS-TTS for synthesis.
MOSS-TTS-Realtime — multi-turn context-aware streaming for voice agents. 180 ms TTFB. When paired with an LLM, full pipeline latency (LLM-first-sentence + TTS-TTFB) is 377 ms — competitive with closed-source phone-bot stacks.
MOSS-SoundEffect — 48 kHz bilingual environmental, urban, biological, and musical sound effects. The v2.0 release moved to a DiT backbone with Flow Matching at 1.3B params and can generate up to 30 seconds.
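
The "design layer feeding into MOSS-TTS" idea can be sketched as plain function composition. The two callables below are hypothetical stand-ins, not the project's API: `design_voice` maps a text description to reference audio bytes, and `synthesize` maps text plus reference audio to speech bytes. Wire them to whatever entry points the real models expose.

```python
from typing import Callable

# Hypothetical signatures for illustration; the real MOSS entry points may differ.
DesignVoice = Callable[[str], bytes]        # voice description -> reference audio
Synthesize = Callable[[str, bytes], bytes]  # (text, reference audio) -> speech audio


def make_character_voice(
    design_voice: DesignVoice,
    synthesize: Synthesize,
    description: str,
) -> Callable[[str], bytes]:
    """Design a voice once, then reuse it for every line the character speaks."""
    reference = design_voice(description)  # the design step runs a single time

    def speak(text: str) -> bytes:
        return synthesize(text, reference)

    return speak
```

Running the design step once and closing over the result keeps a character's voice identical across lines, which is the point of pairing the two models.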

Two backbone architectures power most of the family, with a separate third design for real-time synthesis:

MossTTSDelay — multi-head parallel RVQ prediction with delay-pattern scheduling. Tuned for long-context stability and production throughput.
MossTTSLocal — time-synchronous RVQ blocks with a depth transformer. Lighter, streaming-oriented.
MossTTSRealtime — a separate, capability-driven design with hierarchical text–audio inputs for real-time synthesis. It is purpose-built rather than a third baseline to compare against Delay and Local.
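
To make "delay-pattern scheduling" concrete, here is a minimal sketch of the generic delay pattern used by codec language models: codebook k is shifted right by k steps, so one forward pass per step can predict all codebooks in parallel while each finer codebook still sees the coarser ones for its frame. This is an illustration of the general technique, not MOSS-TTS's exact scheme; the `PAD` value and function names are assumptions.

```python
from typing import List

PAD = -1  # placeholder id for positions with no token yet (an assumption)


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps. Input is K rows of T tokens each."""
    k_total = len(codes)
    return [
        [pad] * k + row + [pad] * (k_total - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay, recovering the time-aligned K x T grid."""
    k_total = len(delayed)
    t_total = len(delayed[0]) - (k_total - 1)
    return [row[k:k + t_total] for k, row in enumerate(delayed)]
```

With K codebooks and T frames the delayed grid is T + K - 1 steps long, so the cost of the pattern is K - 1 extra decoding steps per utterance.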

This isn’t “TTS with extra steps.” It’s a coherent answer to the question “what would a serious open-source competitor to ElevenLabs’ product family look like?”

Install and Run

Option 1: Conda

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
conda create -n moss-tts python=3.10
conda activate moss-tts
pip install -e .

Option 2: uv (faster)

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
uv venv && source .venv/bin/activate
uv pip install -e .

Optional: FlashAttention 2

pip install flash-attn --no-build-isolation

Basic generation

from moss_tts import MossTTS

model = MossTTS.from_pretrained("OpenMOSS-Team/MOSS-TTS-v1.5")

# Zero-shot voice cloning
audio = model.synthesize(
    text="Hello world. This is a test of the MOSS-TTS family.",
    reference_audio="./samples/reference.wav",
    reference_text="The reference clip transcript goes here.",
)
audio.save("output.wav")

# Explicit pause control (v1.5)
audio = model.synthesize(
    text="First sentence. [pause 0.8s] Second sentence, after a beat.",
    reference_audio="./samples/reference.wav",
)

Torch-free inference with llama.cpp

For deployment without a PyTorch dependency, the OpenMOSS fork of llama.cpp ships a first-class MOSS-TTS pipeline (GGUF backbone + ONNX audio codec):

git clone -b moss-tts-firstclass https://github.com/OpenMOSS/llama.cpp
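# For an NVIDIA GPU build, add -DGGML_CUDA=ON to the configure step (upstream llama.cpp flag; confirm it applies to this fork)
cd llama.cpp && cmake -B build && cmake --build build -j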

# 8B GGUF backbone (fits on 8 GB GPU)
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF \
  --local-dir ./models/moss-tts-gguf

# ONNX audio tokenizer
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
  --local-dir ./models/moss-tokenizer-onnx

./build/bin/moss-tts-cli \
  --model ./models/moss-tts-gguf/moss-tts-v1.5-Q4_K_M.gguf \
  --tokenizer ./models/moss-tokenizer-onnx \
  --text "Production deployment, zero Torch wheel." \
  --reference ./samples/reference.wav \
  --out ./output.wav

End-to-end docs live in docs/moss-tts-firstclass-e2e.md.
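
The "8B fits on 8 GB" claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below estimates weight memory only. The bits-per-weight figures for the quantized formats are approximations from general llama.cpp usage, and the KV cache, the ONNX codec, and runtime buffers are not counted.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Bits-per-weight values are rough assumptions, not measured on MOSS-TTS files.
    for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
        print(f"{name}: ~{weight_gb(8e9, bpw):.1f} GB")
```

Under these assumptions an 8B model needs about 16 GB at FP16 and just under 5 GB at Q4_K_M, which leaves some headroom on an 8 GB card for cache and codec.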

SGLang backend (~3× faster)

For higher throughput on the Delay architecture:

pip install sglang

python -m moss_tts.serve.sglang \
  --model OpenMOSS-Team/MOSS-TTS-v1.5 \
  --host 0.0.0.0 --port 30000

Then POST to /generate with {"text": "...", "reference_audio": "..."} and you get streamed PCM back.
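
A client for that endpoint might look like the sketch below, using only the standard library. The `/generate` path and the two JSON keys come from the sentence above. Everything else is an assumption to check against the server docs: that `reference_audio` accepts a file path, that the response body is raw 16-bit mono PCM, and the 24 kHz sample rate.

```python
import io
import json
import urllib.request
import wave

SAMPLE_RATE = 24_000  # assumption; use the rate the server actually emits


def build_request(url: str, text: str, reference_audio: str) -> urllib.request.Request:
    """Build the JSON POST for the /generate endpoint."""
    body = json.dumps({"text": text, "reference_audio": reference_audio}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def pcm_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit mono PCM (assumed format) in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def generate(url: str, text: str, reference_audio: str, chunk_size: int = 4096) -> bytes:
    """Read the streamed response chunk by chunk and return it as WAV bytes."""
    pcm = bytearray()
    with urllib.request.urlopen(build_request(url, text, reference_audio)) as resp:
        while chunk := resp.read(chunk_size):
            pcm.extend(chunk)  # a real agent would play each chunk immediately
    return pcm_to_wav(bytes(pcm))
```

Calling `generate("http://localhost:30000/generate", "Hello", "./samples/reference.wav")` and writing the result to disk gives a playable file, provided the format assumptions hold.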

Apple Silicon via mlx-audio

Since May 6, 2026, MOSS-TTS and the audio tokenizer ship as first-class mlx-audio backends — no CUDA, no Rosetta, just Metal. That’s the path I’d take on an M3/M4 Mac.

The v1.5 Upgrade That Actually Matters

The May 26 v1.5 release is small on the changelog but big in practice:

Language tags strengthen multilingual synthesis — prepending <|en|>, <|zh|>, <|ja|> etc. routes the model toward the correct phoneme inventory. Code-switched text becomes much more stable.
Long-reference short-text cloning — a previous failure mode where a 30-second reference would over-fit and degrade short outputs is fixed.
Punctuation-following prosody — commas, ellipses, em-dashes actually shape pauses and intonation now. You can write "Wait... what?" and get the beat.
Explicit pause control via [pause X.Ys] — inline directive for any duration. Critical for podcast and audiobook workflows where rhythm matters.
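
Both directives are plain text, so they are easy to generate programmatically. This sketch builds tagged input from a list of sentences. The `[pause X.Ys]` and `<|en|>` forms are taken from the list above; the helper names, the one-decimal formatting, and placing the tag directly before the text are my assumptions.

```python
from typing import Iterable


def pause(seconds: float) -> str:
    """Format an inline pause directive such as '[pause 0.8s]'."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    return f"[pause {seconds:.1f}s]"


def build_script(sentences: Iterable[str], gap: float, lang: str = "") -> str:
    """Join sentences with a fixed pause and optionally prepend a language tag."""
    body = f" {pause(gap)} ".join(s.strip() for s in sentences)
    return f"<|{lang}|>{body}" if lang else body
```

For audiobook work, a per-paragraph gap passed through `build_script` keeps rhythm consistent without hand-editing every line.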

If you tried MOSS-TTS 1.0 back in February and bounced off voice-cloning stability, v1.5 is worth a re-test.

Benchmarks and Community Reactions

Voice cloning quality (community)

The May 2026 r/LocalLLaMA “TTS Benchmark Comparison” thread highlighted MOSS-TTS for voice cloning specifically:

“One more to add to the list, MOSS-TTS. Very good TTS voice cloning in my experience (just don’t try the sound effects model, it’s awful).” — r/LocalLLaMA, May 2026

A separate February 2026 thread reported the same:

“MOSS-TTS is better in voice cloning, even compared to the RL version of GLM-TTS.”

But another commenter on a “best quality” thread pushed back on the smaller 1.7B variant:

“MOSS-TTS 1.7B – audio not clean; noticeable clicks and noise artifacts.”

The picture that emerges: the 8B Delay-architecture flagship (MOSS-TTS-v1.5) is the version to deploy. The 1.7B Local variant trades quality for streaming-friendly latency, and that trade-off is audible.

Dialogue benchmarks (claimed)

OpenMOSS published MOSS-TTSD v1.0 evaluation results showing it outperforms top closed-source competitors in subjective evaluation on multi-speaker dialogue:

vs. Doubao TTS: preferred by human raters
vs. Gemini 2.5-pro TTS: preferred by human raters
Objective metrics: industry-leading on standard MOS / speaker-similarity benchmarks

These are the team’s own numbers, not an independent benchmark. But for podcast-style use cases, no open-source model has previously made a credible claim against Gemini 2.5-pro on dialogue, and the dedicated MOSS-TTSD repo and v1.0 paper are worth reading.

Real-time latency

MOSS-TTS-Realtime reports 180 ms TTFB. When paired with a modern LLM, the team measures full pipeline latency at 377 ms from user-utterance-end to first audible byte. That’s in the same envelope as proprietary voice-agent stacks like Vapi, Retell, and Cartesia’s Sonic — and you own the weights.
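
Those two numbers imply a budget for the LLM. Assuming the 377 ms figure is a simple sum of LLM time-to-first-sentence and TTS TTFB, as the team's breakdown suggests, the arithmetic is:

```python
TTS_TTFB_MS = 180  # reported MOSS-TTS-Realtime time to first byte
PIPELINE_MS = 377  # reported LLM-first-sentence + TTS-TTFB total


def llm_first_sentence_ms(pipeline_ms: int = PIPELINE_MS,
                          tts_ttfb_ms: int = TTS_TTFB_MS) -> int:
    """Time left for the LLM to emit its first sentence, assuming a simple sum."""
    return pipeline_ms - tts_ttfb_ms


def fits_budget(llm_ms: int, tts_ttfb_ms: int = TTS_TTFB_MS,
                target_ms: int = PIPELINE_MS) -> bool:
    """Check whether a measured LLM latency keeps the pipeline at or under target."""
    return llm_ms + tts_ttfb_ms <= target_ms
```

That leaves 197 ms for the LLM's first sentence. Any ASR or network time in your own stack comes on top, unless the reported figure already includes it.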

Honest Limitations

A balanced view, because no model is magic:

SoundEffect v1 was rough — the original MOSS-SoundEffect drew specific community criticism. The May 26 v2.0 rewrite (DiT + Flow Matching) is supposed to fix this; I’d recommend trying v2 before forming an opinion on the family’s SFX capabilities.
Chinese is the strongest language — both training data and team are China-based. English voice cloning is competitive; long-form English narration is good but you may still pick VoxCPM2 or F5-TTS for English-only podcasts. Japanese, Korean, and European-language quality varies.
The 1.7B Local variant has audible artifacts — clicks and noise on some content. Use the 8B flagship for any production-facing work; reserve Local for cases where latency truly trumps quality.
8B model is not small — even with GGUF quantization, you need a real GPU (or Apple Silicon with enough unified memory) for the Delay architecture. The 0.1B MOSS-TTS-Nano sibling is the CPU-only path.
Licensing is per-model — the umbrella repo is Apache-2.0, but individual model weights on Hugging Face have their own licenses; check the model card before commercial deployment. Several recent OpenMOSS releases use a research-only license for the largest weights.
Docs assume Chinese-first — the English README is excellent, but several deep tutorials and fine-tuning guides are clearer in README_zh.md and the MOSI.AI blog. Pipe through a translator if you go deep.
No batched dialogue inference yet — MOSS-TTSD generates conversations turn-by-turn. For high-throughput podcast production at scale, you’ll be CPU-bound on the orchestration layer, not the model.

Who Should Use Each Model

Use case	Pick
Voiceover, audiobook narration	MOSS-TTS-v1.5 (8B Delay)
Two-speaker podcasts, NotebookLM-style	MOSS-TTSD-v1.0
AI character voices for games	MOSS-VoiceGenerator + MOSS-TTS
Real-time voice agents (phone bots, live chat)	MOSS-TTS-Realtime
Sound effects, ambient audio	MOSS-SoundEffect-v2.0
CPU-only / edge / mobile	MOSS-TTS-Nano (0.1B)
Apple Silicon Mac	8B v1.5 via mlx-audio
Production server (high throughput)	8B v1.5 + SGLang backend
No PyTorch in production	GGUF + llama.cpp + ONNX

How It Compares to Other Open-Source TTS

Model	Best at	License	Why pick MOSS-TTS instead
VoxCPM2 (OpenBMB)	30-language English-strong cloning	Apache-2.0	If you need dialogue, voice design, and SFX in one family
Kokoro 82M	Tiny + fast English	Apache-2.0	If you need voice cloning at all
F5-TTS	English voice cloning, simple API	CC-BY-NC	If you need a commercial-friendlier path or non-English
IndexTTS2	Mandarin TTS	Apache-2.0	If you need real-time + dialogue + sound design
Qwen3-TTS	Multilingual, Qwen-ecosystem	Tongyi license	If you want llama.cpp/SGLang first-class support
GLM-TTS	Voice cloning quality	Custom	Community reports MOSS-TTS edges it on cloning

Where MOSS-TTS uniquely wins: dialogue at MOSS-TTSD quality, the 377 ms full-pipeline voice-agent latency, and the single-family coherence when you need more than one capability.

FAQ

How big is the MOSS-TTS Family — is it really five models?

Yes, with two extras. The five product models are MOSS-TTS (8B), MOSS-TTSD (8B dialogue), MOSS-VoiceGenerator (1.7B voice design), MOSS-TTS-Realtime (voice agents), and MOSS-SoundEffect (8B v1 → 1.3B DiT v2). The published lineup also includes MOSS-TTS-Local-Transformer (1.7B), the MossTTSLocal variant, plus the separate-repo MOSS-TTS-Nano (0.1B). All weights are on Hugging Face under OpenMOSS-Team and on ModelScope.

Can MOSS-TTS run without a GPU?

The 8B flagship effectively needs a GPU (or Apple Silicon with sufficient unified memory). For CPU-only deployment, use MOSS-TTS-Nano — 0.1B parameters, runs realtime on 4 CPU cores at 48 kHz stereo. It’s a separate repo at OpenMOSS/MOSS-TTS-Nano.

What about commercial use?

The repo is Apache-2.0, but check each model card on Hugging Face — individual weights have their own license terms, and some recent OpenMOSS releases use a research-only license for the largest models. The Nano variant has historically been the most commercial-friendly.

How does it compare to ElevenLabs?

For English-only voiceover and short-form, ElevenLabs still leads on raw polish and tooling. For multilingual code-switched synthesis, Mandarin dialogue, and anything you need to self-host with weight access, MOSS-TTS-v1.5 is now the strongest open-source competitor. The 377 ms voice-agent latency closes the gap on real-time use cases where ElevenLabs’ API round-trip used to win by default.

Does MOSS-TTS support streaming?

Yes, in three ways. MOSS-TTS-Realtime is the dedicated streaming/voice-agent model with 180 ms TTFB. The MossTTSLocal architecture (1.7B) is also streaming-oriented but trades audio quality for latency. For the flagship 8B model, the SGLang backend gives chunked streaming output with ~3× the throughput of the reference PyTorch path.

Why “MOSS”?

MOSS is the long-running open foundation-model program from Fudan University’s OpenMOSS team, dating back to the 2023 MOSS LLM. The TTS family is one branch of that broader effort, jointly developed with MOSI.AI, the commercial speech spin-out. The MOSS-TTS technical report is on arXiv (2603.18090).

The Take

MOSS-TTS-v1.5 is the first open-source TTS release that feels like a product family rather than a model dump. The Delay/Local/Realtime backbone split lets the team specialize each head for its job without re-architecting the world. The v1.5 polish (pause tokens, punctuation prosody, language tags) is exactly the kind of incremental quality work that closed-source providers spend months on between announcements.

If you’re building anything voice-agent-shaped — IVR, customer support, AI characters, podcast pipelines, accessibility tools — MOSS-TTS-v1.5 plus MOSS-TTSD plus MOSS-TTS-Realtime gives you a credible open-source stack in one ecosystem. Pair it with a fast LLM and you’re shipping voice products with no proprietary API dependency at all.

The honest caveats: SoundEffect v1 was bad and v2 is too new to fully judge, the 1.7B Local variant has audible artifacts, and per-model licensing means you need to read each Hugging Face card before going commercial. But for the use cases the family does well — flagship 8B voice cloning, dialogue, and real-time agents — there’s no open competitor that matches all three under one design.

This is the TTS family to evaluate this week.

Try it: github.com/OpenMOSS/MOSS-TTS · HF Space demo · studio.mosi.cn (hosted)

Published 2026-06-04. Stars and benchmark numbers reflect the state of the MOSS-TTS repo and OpenMOSS team’s published results at time of writing. Community reactions sampled from r/LocalLLaMA threads from February through May 2026.