[ICLR 2025 Submission] LLaDA-TTS: Flipping the AR Bottleneck with Masked Diffusion and Bidirectional Insights
Abstract

LLaDA-TTS is a novel speech synthesis system that transforms a pretrained Autoregressive (AR) Large Language Model into a Masked Diffusion Model. By employing bidirectional attention and a 1/t-weighted masked prediction objective, it achieves state-of-the-art results (0.98% CER on Seed-TTS-Eval) while providing 2x inference acceleration and native zero-shot speech editing capabilities.

TL;DR

LLaDA-TTS breaks the sequential bottleneck of traditional Autoregressive (AR) Text-to-Speech by converting a standard LLM into a Masked Diffusion Model. By utilizing bidirectional attention, it achieves a 2x speedup over state-of-the-art AR baselines like CosyVoice 3 while matching their quality. Most impressively, it enables zero-shot speech editing (insertion, deletion, substitution) out-of-the-box without any task-specific training.

The Bottleneck: Why is TTS still slow?

Modern TTS systems (like VALL-E or Seed-TTS) treat speech as a "second language," using LLMs to predict discrete acoustic tokens. However, these are almost universally Autoregressive (AR): to generate 10 seconds of speech (approx. 250 tokens), the model must run 250 sequential forward passes. This dependency makes real-time deployment difficult for long-form content.
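The latency argument above can be made concrete with a back-of-the-envelope sketch. The token rate (~25 tokens/sec) and step counts come from the article; the latency model (cost proportional to the number of forward passes) is my simplification and ignores per-pass cost differences such as KV caching.

```python
# Sketch: forward-pass counts for AR decoding vs. parallel masked-diffusion
# decoding. AR runs one pass per token; diffusion runs one pass per
# denoising step, independent of sequence length.

def ar_forward_passes(duration_sec: float, tokens_per_sec: int = 25) -> int:
    """One sequential forward pass per generated acoustic token."""
    return int(duration_sec * tokens_per_sec)

def diffusion_forward_passes(num_steps: int = 64) -> int:
    """One forward pass per denoising step, regardless of length."""
    return num_steps

if __name__ == "__main__":
    ar = ar_forward_passes(10.0)        # ~250 passes for 10 s of speech
    nar = diffusion_forward_passes(64)  # 64 passes at the paper's setting
    print(ar, nar)                      # 250 64
```

Under this simplified model, 10 seconds of speech drops from ~250 sequential passes to 64, which is where the reported speedup comes from.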

The challenge of moving to Non-Autoregressive (NAR) models has always been the loss of "planning" and coherence. The authors of LLaDA-TTS argue that we need not choose between the speed of NAR and the quality of AR: we can have both by adapting the pre-existing knowledge of AR models into a bidirectional diffusion framework.

Methodology: The Architecture of LLaDA-TTS

LLaDA-TTS retains the standard LLM-based TTS pipeline—tokenizer, text encoder, and vocoder—but fundamentally changes the LLM Stage.

1. Bidirectional Attention & Masked Diffusion

Instead of a causal mask (where a token only sees its past), LLaDA-TTS uses Full Bidirectional Attention. During training, target speech tokens are randomly masked. The model learns to predict these masked tokens by looking at both the preceding and following context.
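The training loop pairs this random masking with the 1/t-weighted objective mentioned in the abstract. Here is a minimal sketch of that corruption-and-weighting step; the exact normalization (per-token vs. per-sequence averaging) is my assumption, and the real model of course operates on tensors, not Python lists.

```python
import random

def mask_targets(tokens, t, mask_id=-1):
    """Corrupt target speech tokens: mask each position independently
    with probability t (the diffusion 'time'). Returns the corrupted
    sequence and a boolean mask of which positions were hidden."""
    corrupted, is_masked = [], []
    for tok in tokens:
        m = random.random() < t
        corrupted.append(mask_id if m else tok)
        is_masked.append(m)
    return corrupted, is_masked

def weighted_loss(token_losses, is_masked, t):
    """1/t-weighted masked-prediction loss: average the per-token
    cross-entropy over masked positions only, then scale by 1/t so that
    lightly-masked (small t) samples are upweighted."""
    masked = [loss for loss, m in zip(token_losses, is_masked) if m]
    if not masked:
        return 0.0
    return (1.0 / t) * sum(masked) / len(masked)
```

With bidirectional attention, the model sees every unmasked token on both sides of a gap when predicting it, which is what makes the editing capability in Section 6 fall out for free.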

2. The Weight Transfer "Magic" (Label Shift)

The authors don't start from scratch. They initialize LLaDA-TTS with weights from a pretrained Qwen2-0.5B AR model. To bridge the gap between "predicting the next token" (AR) and "predicting the current masked token" (Diffusion), they use Label Shift: shifting the hidden states so that the pretrained output projection remains valid.
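The index bookkeeping behind Label Shift can be illustrated in a few lines. In an AR model the hidden state at position i feeds the output head to predict token i+1; masked diffusion instead needs position i to predict token i. Shifting the hidden states right by one keeps the pretrained projection's semantics valid. This is a toy sketch of that idea (strings stand in for hidden-state vectors, and `bos_state` is a hypothetical stand-in for the state preceding position 0); the paper's exact mechanism may differ.

```python
def shift_hidden_states(hidden, bos_state):
    """hidden[i] predicts token i+1 under a pretrained AR output head.
    After shifting, shifted[i] = hidden[i-1], so the same head now
    predicts token i, as masked-diffusion training requires."""
    return [bos_state] + hidden[:-1]

states = ["h0", "h1", "h2"]
print(shift_hidden_states(states, "h_bos"))  # ['h_bos', 'h0', 'h1']
```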

Figure 1: The LLaDA-TTS architecture. A bidirectional Transformer backbone iteratively unmasks speech tokens in T parallel steps.

Theoretical Breakthrough: Why AR Init works

A core contribution of this paper is Theorem 1 (Bounded Suboptimality). The authors prove that for signals with "temporal locality" (like speech, where the next 40ms is mostly determined by the immediate past), an AR predictor is actually a near-optimal starting point for a bidirectional predictor.

This explains why LLaDA-TTS converges so rapidly: it is not learning speech from scratch; it is simply learning to "refine" its existing sequential predictions using the extra information provided by the "future" context.

Experiments: Faster and Sharper

LLaDA-TTS was evaluated on the Seed-TTS-Eval benchmark. At 64 steps, it delivered:

  • 0.98% CER (Chinese): Even better than the CosyVoice 3 AR baseline (1.21%).
  • 1.96% WER (English): Significantly outperforming other NAR models like MaskGCT.
  • 2x Speedup: Despite not having the KV cache optimization that helps AR models, the reduction in total iterations (from ~250 down to 64) slashes latency.
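The "64 steps" figure refers to the iterative-unmasking sampler: the sequence starts fully masked and tokens are committed over a fixed number of parallel rounds. Below is a generic confidence-based unmasking loop of the kind used by NAR token models; the paper's exact reveal schedule is not specified here, so treat this as an illustrative sketch (`predict` is a hypothetical stand-in for the model).

```python
def iterative_unmask(length, num_steps, predict, mask_id=-1):
    """Start from an all-masked sequence and reveal tokens over
    num_steps parallel rounds. `predict` maps the current sequence to a
    (token, confidence) pair per position; each round commits the most
    confident predictions among the still-masked positions."""
    seq = [mask_id] * length
    for step in range(num_steps):
        remaining = [i for i, tok in enumerate(seq) if tok == mask_id]
        if not remaining:
            break
        preds = predict(seq)
        # Reveal roughly an equal share of the remaining masks per step.
        k = max(1, len(remaining) // (num_steps - step))
        remaining.sort(key=lambda i: preds[i][1], reverse=True)
        for i in remaining[:k]:
            seq[i] = preds[i][0]
    return seq
```

Because every round is one forward pass over the whole sequence, total latency scales with `num_steps` (here 64) rather than with sequence length (~250 tokens for 10 s of speech), which is the source of the reported speedup even without a KV cache.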

Figure 2: Speed-Quality Tradeoff. LLaDA-TTS surpasses the AR baseline quality at just 48 steps, offering a 2.6x speedup.

Emergent Feature: Native Speech Editing

Because the model is bidirectional, it naturally understands how to fill in gaps. To perform a "word-level substitution," one simply masks the tokens corresponding to the word to be changed and lets the diffusion process "regenerate" that specific segment.

The authors found that Layer 11 of the model spontaneously develops a monotonic text-to-speech alignment, which they use to automatically find which speech tokens to mask for a given text edit.
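The editing recipe reduces to simple index arithmetic once an alignment is available. This sketch assumes a monotonic text-to-speech alignment like the emergent Layer-11 one (`alignment[j]` gives the first speech-token index for text token j, a representation I am assuming for illustration); masking the covered span leaves the sequence ready for the diffusion sampler to regenerate.

```python
def edit_by_masking(speech_tokens, alignment, word_span, mask_id=-1):
    """Word-level substitution sketch: mask the speech tokens covering
    word_span = (start_word, end_word) so the diffusion process can
    regenerate just that segment, conditioned on the surrounding audio."""
    start_word, end_word = word_span
    lo = alignment[start_word]
    hi = (alignment[end_word + 1]
          if end_word + 1 < len(alignment) else len(speech_tokens))
    return speech_tokens[:lo] + [mask_id] * (hi - lo) + speech_tokens[hi:]
```

Insertion and deletion follow the same pattern: insert extra mask tokens at an alignment boundary, or splice out the covered span before resynthesis.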

Figure 3: The editing process utilizes emergent attention alignment to pinpoint regions for masking and regeneration.

Critical Analysis & Conclusion

Takeaway: LLaDA-TTS represents a "best of both worlds" approach. It proves that we can repurpose the massive investments made in AR LLM pretraining for NAR diffusion tasks.

Limitations:

  1. Length Prediction: The model still needs to know the total length of the speech (the number of mask tokens) beforehand.
  2. Non-Streaming: Unlike some recent AR models, the diffusion process requires seeing the whole sequence, making it less suitable for ultra-low-latency streaming applications.

Future Outlook: The success of transferring AR weights to diffusion via Label Shift suggests a general recipe for other modalities. We might soon see an "LLaDA-Video" or "LLaDA-Music" that uses the same logic to accelerate generation and enable editing in other generative fields.
