AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow


[INTERSPEECH 2025] AlphaFlowTSE: Redefining One-Step Generative Target Speaker Extraction

Abstract

AlphaFlowTSE is a state-of-the-art one-step generative framework for Target Speaker Extraction (TSE) that achieves high-fidelity speech recovery with a single network evaluation (NFE=1). By utilizing a JVP-free AlphaFlow objective on a deterministic mixture-to-target trajectory, it eliminates the need for multi-step diffusion sampling and auxiliary mixing-ratio (MR) predictors while setting new SOTA performance on Libri2Mix and REAL-T benchmarks.

TL;DR

AlphaFlowTSE is a one-step generative model for Target Speaker Extraction (TSE) that recovers high-fidelity target speech from a mixture in a single forward pass (NFE=1). By avoiding the iterative sampling of traditional diffusion models and the training instability of previous one-step flows, it achieves SOTA results on Libri2Mix and generalizes well, without fine-tuning, to real-world conversational data such as REAL-T.

The "Latency vs. Fidelity" Dilemma

In the world of speech separation, we face a classic trade-off:

Discriminative Models: Fast, but prone to "robotic" artifacts and over-suppression.
Generative Models (Diffusion/Flow): Produce natural, high-fidelity speech but require 10-50+ iterations, making them too slow for live meetings or hearing aids.

Previous attempts at One-Step Generation (like MeanFlow-TSE) tried to solve this by predicting the "average velocity" needed to reach the target in one jump. However, training these models is notoriously unstable, often requiring complex Jacobian-vector products (JVP) or auxiliary Mixing-Ratio (MR) predictors that break down when applied to real-world recordings where the "mixture ratio" isn't a known constant.

Methodology: The AlphaFlow Intuition

AlphaFlowTSE moves away from the complex "Background-to-Target" path and simplifies the problem to a direct Mixture-to-Target trajectory.
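To make the one-step idea concrete, here is a minimal sketch, assuming a straight-line path with t=0 at the mixture and t=1 at the target (the paper's actual time convention may differ). The names `path_point`, `one_step_extract`, and `oracle_mean_velocity` are hypothetical, and the oracle stands in for a trained network.

```python
import numpy as np

def path_point(mixture, target, t):
    """Point on the assumed straight-line path: t=0 is the mixture, t=1 the target."""
    return (1.0 - t) * mixture + t * target

def one_step_extract(mixture, mean_velocity_fn):
    """Single evaluation (NFE=1): jump across the whole interval [0, 1]."""
    r, t = 0.0, 1.0
    return mixture + (t - r) * mean_velocity_fn(mixture, r, t)

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + 0.5 * interferer

def oracle_mean_velocity(z, r, t):
    # Stand-in for the trained network (assumption): on a straight path the
    # mean velocity over any interval is simply target - mixture.
    return target - mixture

estimate = one_step_extract(mixture, oracle_mean_velocity)
print(np.allclose(estimate, target))
```

In practice the network only sees the mixture and the speaker enrollment, so it has to learn this mean velocity rather than read it off the ground truth.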

1. JVP-Free Training

Instead of calculating expensive mathematical derivatives (JVPs) to ensure consistency across different time steps, AlphaFlowTSE uses a Teacher-Student setup.

The Student: Tries to predict the jump from the mixture to the clean speech.
The Teacher: Evaluates a smaller, "easier" segment of the path and provides a stable target for the student.
Adaptive Weighting: A specialized loss function ensures that the model focuses on the most informative parts of the learning process.
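The teacher-student idea can be illustrated on a toy problem. The sketch below is not the paper's loss; it assumes a hypothetical 1-D ODE dz/dt = -z whose mean velocity is known in closed form, and it builds a derivative-free target for the interval [r, t] from one short Euler step plus a teacher evaluated on the shorter interval [s, t]. The parameter `alpha` (relative length of the short step) and all function names are assumptions made for illustration.

```python
import numpy as np

def velocity(z, t):
    """Instantaneous velocity of the toy ODE dz/dt = -z (hypothetical stand-in)."""
    return -z

def mean_velocity(z, r, t):
    """Exact average velocity from state z at time r to time t for the toy ODE."""
    if t == r:
        return velocity(z, r)
    return z * np.expm1(-(t - r)) / (t - r)

def consistency_target(z_r, r, t, alpha, teacher):
    """Derivative-free target for the student's mean velocity over [r, t].

    A short Euler step of relative length alpha uses the instantaneous
    velocity; the teacher covers the remaining, shorter interval [s, t].
    """
    s = r + alpha * (t - r)
    v = velocity(z_r, r)
    z_s = z_r + (s - r) * v  # first-order step to the intermediate time
    return alpha * v + (1.0 - alpha) * teacher(z_s, s, t)

z_r, r, t = 2.0, 0.0, 1.0
exact = mean_velocity(z_r, r, t)
for alpha in (0.5, 0.1, 0.01):
    approx = consistency_target(z_r, r, t, alpha, teacher=mean_velocity)
    print(f"alpha={alpha:<5} target={approx:+.5f} exact={exact:+.5f}")
```

The target approaches the exact mean velocity as the short step shrinks, and no Jacobian-vector product is needed to form it. In real training the teacher would be a frozen copy of the network rather than a closed-form function.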

2. Architecture: UDiT

The backbone is a U-Net Diffusion Transformer (UDiT), which combines the global modeling power of Transformers with the structural inductive bias of U-Nets. It processes complex-valued STFT features and is conditioned on both the speaker enrollment and the "time interval" of the jump.
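As a rough sketch of the input representation, assuming "complex STFT features" means a complex-valued spectrogram fed to the network as stacked real and imaginary channels. The FFT size, hop length, window, and 16 kHz sample rate below are illustrative assumptions, not values from the paper, and the conditioning inputs are not shown.

```python
import numpy as np

def stft_features(wave, n_fft=512, hop=128):
    """Return a (2, frames, bins) real array: real and imaginary STFT parts."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)  # complex, shape (frames, n_fft // 2 + 1)
    return np.stack([spec.real, spec.imag])

wave = np.random.default_rng(0).standard_normal(16000)  # 1 s at an assumed 16 kHz
feats = stft_features(wave)
print(feats.shape)
```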

Figure 1: Overall architecture of the AlphaFlowTSE pipeline. Note the optional MR Predictor, which the authors show is no longer strictly necessary thanks to their robust training objective.

Experimental Showdown

The model was put to the test on Libri2Mix (synthetic) and REAL-T (real-world conversations).

Performance on Libri2Mix

AlphaFlowTSE outperformed existing one-step models across every major metric:

SI-SDR: 19.17 dB (Clean) | 13.16 dB (Noisy)
PESQ: 3.27 (Clean) | 2.28 (Noisy)
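For readers unfamiliar with the metric, SI-SDR can be computed as below. This is a minimal sketch of the standard definition, not the paper's evaluation script; the test signals and noise level are made up for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the scaled target part.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    projection = scale * reference
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
estimate = reference + 0.1 * rng.standard_normal(16000)

print(f"{si_sdr(estimate, reference):.2f} dB")
# Rescaling the estimate leaves the score unchanged, hence "scale-invariant".
print(f"{si_sdr(3.0 * estimate, reference):.2f} dB")
```

Higher is better: each additional 10 dB means the residual error energy is ten times smaller relative to the target.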

The "MR-Free" Advantage

Perhaps the most significant finding concerns MR-free operation (Table 2). While previous models like AD-FlowTSE and MeanFlowTSE suffer a large SI-SDR drop when the auxiliary Mixing-Ratio predictor is removed, AlphaFlowTSE degrades only slightly.

Table 2: Libri2Mix experimental results. AlphaFlowTSE shows minimal degradation (-0.67 dB SI-SDR) compared to the catastrophic failure of MeanFlowTSE (-24.80 dB) when switching to MR-free mode.

Generalization to Real Conversations (REAL-T)

When applied to real-world meetings (zero-shot, no fine-tuning), AlphaFlowTSE achieved the lowest Word Error Rate (WER) and Character Error Rate (CER). This suggests that the mean velocity learned by the model is not just memorizing synthetic mixing patterns but captures structure that transfers to real recordings.
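WER and CER are both edit-distance ratios between an ASR transcript of the extracted speech and a reference transcript. The sketch below shows the generic computation only; the benchmark's ASR system and text normalization are not described here, and counting spaces as characters in CER is an assumption of this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def char_error_rate(reference, hypothesis):
    # Spaces are counted as characters here (an assumption of this sketch).
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(char_error_rate("abc", "abd"))
```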

Critical Analysis & Conclusion

Takeaway

AlphaFlowTSE shows that we don't need iterative sampling for high-fidelity generative TSE. By enforcing consistency across time intervals rather than only endpoint accuracy, we can build models that are both fast and robust.

Limitations

While the model is SOTA for one-step generation, it still trails behind multi-step diffusion models in terms of absolute "naturalness" (DNSMOS OVRL). There is still a small gap between "Instantaneous One-Step" and "Iterative Perfection."

Future Outlook

The JVP-free AlphaFlow approach is likely to expand into other domains like Speech Enhancement and Multi-modal Extraction (using lip-reading or gestures), where latency is equally critical.

Senior Technical Editor's Note: AlphaFlowTSE is a masterclass in "Simplification through Sophistication." By removing the need for auxiliary predictors and iterative steps, it moves generative AI one step closer to being a standard component in every smartphone's audio stack.
