MOSS-TTSD is a multi-speaker spoken dialogue synthesis model built on a discrete speech generation paradigm, pairing a Qwen3-8B backbone with the MOSS-Audio-Tokenizer. It achieves state-of-the-art performance in generating long-form (up to 60 minutes), multi-party (up to 5 speakers) conversations with high acoustic consistency and zero-shot voice cloning.
TL;DR
MOSS-TTSD is a breakthrough in Spoken Dialogue Generation (SDG). Unlike standard TTS that focuses on single sentences, MOSS-TTSD handles up to 5 speakers in a single pass, maintaining acoustic consistency for sessions as long as 60 minutes. By combining a 2kbps high-efficiency audio tokenizer with a Qwen3-8B backbone, it solves the long-standing "stitching artifact" and "speaker drift" problems in AI-generated podcasts and audiobooks.
The "Dialogue Gap": Why Standard TTS Fails
Most current TTS models are "short-sighted": they generate speech utterance by utterance. In a dialogue, this leads to three catastrophic failure modes:
- Acoustic Drift: The same speaker sounds different in Turn 1 and Turn 10.
- Unnatural Turn-taking: The rhythm between speakers is mechanical because the model doesn't "hear" the previous speaker's prosody.
- Scaling Limits: Audio quality degrades significantly as the duration exceeds a few minutes.
Methodology: Discrete Tokens & Curriculum Learning
MOSS-TTSD treats speech as a "text-like" sequence. It uses the MOSS-Audio-Tokenizer to convert audio into discrete Residual Vector Quantization (RVQ) tokens.
1. Architecture & Delay Pattern
The model predicts 16 layers of RVQ tokens per frame using a multi-head delay pattern. This allows the LLM to process audio at a low bitrate (2 kbps) and a low frame rate (12.5 Hz), which is the secret sauce for fitting 60 minutes of audio tokens into the model's context window: at 12.5 Hz, a full hour of dialogue is only 45,000 autoregressive steps.
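To make the delay pattern concrete, here is a minimal sketch assuming a one-step offset per codebook layer (the paper's exact offsets and padding scheme may differ); the constants come from the numbers above.

```python
# Minimal sketch of an RVQ delay pattern, assuming a one-step delay per
# codebook layer (the offsets MOSS-TTSD actually uses are not specified here).
import numpy as np

N_LAYERS = 16        # RVQ codebook layers per frame
FRAME_RATE = 12.5    # audio frames per second
PAD = -1             # placeholder token id for delayed positions

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift layer k right by k steps, so layer k of frame t is
    predicted at decoding step t + k. codes: (n_layers, n_frames)."""
    n_layers, n_frames = codes.shape
    out = np.full((n_layers, n_frames + n_layers - 1), PAD, dtype=codes.dtype)
    for k in range(n_layers):
        out[k, k:k + n_frames] = codes[k]
    return out

# Demo on a toy 10-frame clip: 16 heads, 10 + 16 - 1 = 25 decoding steps.
codes = np.random.randint(0, 1024, size=(N_LAYERS, 10))
print(apply_delay_pattern(codes).shape)  # (16, 25)

# Back-of-envelope: why 12.5 Hz makes 60-minute sessions feasible.
frames_per_hour = int(60 * 60 * FRAME_RATE)
print(frames_per_hour)  # 45,000 steps -- well within a 64k-token context
```

Because all 16 heads fire in parallel at each step, the sequence length scales with the frame count rather than with frames times codebooks, which is what keeps hour-long audio inside the context window.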
Figure: The inference pipeline showcasing how explicit speaker tags and reference audio guide the multi-speaker generation.
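The post does not spell out the script format, but a hypothetical example of how explicit speaker tags might delimit turns looks like this (the actual tag syntax may differ):

```python
# Hypothetical dialogue script with explicit speaker tags; purely
# illustrative -- MOSS-TTSD's real tag vocabulary may be different.
script = (
    "[S1] Welcome back to the show. "
    "[S2] Thanks, glad to be here. "
    "[S1] Let's dive right in."
)
```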
2. Three-Stage Curriculum Training
The team didn't just throw data at the model. They used a staged approach:
- Stage 1 (Pre-training): Adaptation from single-speaker TTS to long-context sequences (64k tokens).
- Stage 2 (High Fidelity): Filtering for high-quality audio (DNSMOS ≥ 3.4) to boost clarity; a minimal filtering sketch follows this list.
- Stage 3 (Dialogue Mastery): Adding real-world multi-speaker recordings and synthetic "concatenated" conversations to perfect turn-switching.
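As a rough illustration of the Stage 2 quality gate, here is a sketch assuming a `dnsmos_score()` helper that wraps a DNSMOS predictor (a placeholder, not part of the authors' released pipeline):

```python
# Hypothetical Stage 2 data filter: keep only clips whose DNSMOS overall
# score meets the high-fidelity threshold quoted in the post.
DNSMOS_THRESHOLD = 3.4

def stage2_filter(clips, dnsmos_score):
    """Return the subset of training clips passing the quality gate.
    `dnsmos_score` is an assumed callable: clip -> float MOS estimate."""
    return [clip for clip in clips if dnsmos_score(clip) >= DNSMOS_THRESHOLD]
```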
Technical Innovation: TTSD-eval
One of the most impressive parts of this paper isn't the model itself, but how they measured it. Traditional metrics (cpWER) rely on speaker diarization tools, which are notoriously error-prone. The authors proposed TTSD-eval, which uses Forced Alignment instead. By knowing exactly when each line of text occurs in the audio, they can calculate two metrics (sketched in code after the list):
- ACC (Speaker Attribution Accuracy): Did the right voice speak the right lines?
- SIM (Speaker Similarity): Does the voice match the prompt?
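Both metrics are straightforward once forced alignment is in hand. Here is a hedged sketch, assuming alignment has already produced per-line segments and that `identify_speaker()` and `embed()` wrap a speaker-verification model (placeholder names, not the released TTSD-eval API):

```python
# Illustrative TTSD-eval metrics. Each aligned segment is assumed to be a
# dict with keys "audio" (waveform slice) and "speaker_id" (scripted speaker).
import numpy as np

def attribution_accuracy(aligned_segments, identify_speaker):
    """ACC: fraction of segments whose generated voice is attributed to
    the speaker the script assigned that line to."""
    correct = sum(
        identify_speaker(seg["audio"]) == seg["speaker_id"]
        for seg in aligned_segments
    )
    return correct / len(aligned_segments)

def speaker_similarity(aligned_segments, prompt_audio, embed):
    """SIM: mean cosine similarity between each generated segment's
    embedding and its speaker's reference-prompt embedding."""
    sims = []
    for seg in aligned_segments:
        a = embed(seg["audio"])
        b = embed(prompt_audio[seg["speaker_id"]])
        sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))
```

Because the segment boundaries come from forced alignment rather than diarization, a mis-attributed line counts against ACC directly instead of being blurred into a word-error rate.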

Experiments: Beating the Giants
In head-to-head battles, MOSS-TTSD outperformed both open-source (VibeVoice, FireRedTTS) and proprietary (ElevenLabs V3, Gemini TTS) models in dialogue scenarios.
| Model | ZH ACC ↑ | ZH SIM ↑ | EN ACC ↑ | EN SIM ↑ |
| :--- | :--- | :--- | :--- | :--- |
| MOSS-TTSD | 0.9587 | 0.7949 | 0.9626 | 0.7326 |
| VibeVoice 7B | 0.9222 | 0.7590 | 0.9554 | 0.7140 |
| Eleven V3 | 0.9653 | 0.6970 | 0.9498 | 0.6730 |
Note: While proprietary models can post high ACC scores, MOSS-TTSD consistently wins on SIM (timbre similarity) and natural rhythm.
Conclusion & Future Outlook
MOSS-TTSD proves that the "Speech-as-Language" paradigm is superior for long-form content. By treating speaker tags as special tokens and using a staged training approach, the model achieves a level of conversational flow that was previously only possible via manual editing.
Future Implications: We are moving toward a world where entire multi-host podcasts or tabletop RPG sessions can be synthesized in real-time with perfect consistency. The release of the code and the TTSD-eval framework sets a new gold standard for the community to follow.
