Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction


[Interspeech 2025] Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction


Abstract

Plug-and-Steer is a novel audio-visual target speaker extraction (AV-TSE) framework that decouples the acoustic separation process from the target selection task. By utilizing a frozen, pre-trained audio-only speech separation (AOSS) backbone and an ultra-lightweight Latent Steering Matrix (LSM), the method anchors a specific speaker to a designated output channel using visual cues without retraining the separation engine.

TL;DR

Existing audio-visual speaker extraction systems often compromise audio quality because they try to re-learn separation using noisy video-audio datasets. Plug-and-Steer changes the game by "plugging" into a frozen, high-fidelity audio-only model and using a tiny "steering wheel" (the Latent Steering Matrix) guided by lip-sync to simply route the right voice to the right channel. It keeps the audio crystal clear while solving the identity problem, retaining up to 99.91% of the backbone's original separation performance.

The Fidelity Ceiling: Why Deep Fusion Might Be Hurting Your Audio

The "Cocktail Party Problem" (isolating one voice among many) is now handled very well by Audio-Only Speech Separation (AOSS) models, at least on standard benchmarks. These models, trained on large, clean datasets like LibriMix, produce very clean output.

However, they have one fatal flaw: Permutation Ambiguity. They separate the voices but don't know who is who.
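To make the ambiguity concrete, here is a minimal sketch in plain Python. The toy signals and the helper names (`mse`, `best_permutation`) are illustrative assumptions, not taken from the paper. It shows that a separator's outputs can match the references exactly and still arrive in the opposite channel order, which is why audio-only scoring has to search over permutations.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_permutation(estimates, references):
    """Find the channel order that minimizes the summed error.

    Returns (perm, error), where estimates[perm[i]] is matched to references[i].
    """
    best = None
    for perm in permutations(range(len(estimates))):
        err = sum(mse(estimates[p], ref) for p, ref in zip(perm, references))
        if best is None or err < best[1]:
            best = (perm, err)
    return best


# Toy "speakers": two short signals standing in for clean speech (hypothetical data).
speaker_a = [1.0, 0.0, -1.0, 0.0]
speaker_b = [0.5, 0.5, 0.5, 0.5]
references = [speaker_a, speaker_b]

# A perfect separator may still emit the speakers in swapped channel order.
estimates = [speaker_b, speaker_a]

# Scoring with a fixed channel order penalizes the swap...
fixed_order_error = sum(mse(e, r) for e, r in zip(estimates, references))

# ...while a permutation search recovers the zero-error assignment.
perm, pit_error = best_permutation(estimates, references)

print("fixed-order error:", fixed_order_error)
print("best permutation:", perm, "error:", pit_error)
```

The permutation search tells you how good the separation is, but not which output belongs to the person you care about. That missing piece is what a visual cue can supply.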

To fix this, traditional Audio-Visual Target Speaker Extraction (AV-TSE) models fuse audio and video features (lip movements) deep inside the network. But here is the catch: Audio-visual datasets collected from the web (like VoxCeleb) are noisy and reverberant. When you train a model on this noisy data, you inadvertently "teach" the model to produce noisy audio, effectively throwing away the high-fidelity priors of the original audio-only engine.

Methodology: The "Steering Wheel" Approach

Instead of re-inventing the wheel, the authors of Plug-and-Steer treat the AOSS model as a high-performance engine and add a steering mechanism.

1. The Latent Steering Matrix (LSM)

The authors discovered that even in complex models like TF-GridNet or MossFormer2, speaker identities are represented as distinct paths in the latent space. By applying a simple linear transformation $W$ (the LSM) to the latent features, they can "swap" the speakers between channels:

$$f_{i}^{'} = (I + g \cdot W) f_{i}$$

If $g = 0$, the features pass through normally. If $g = 1$, the speakers are swapped.
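The gating equation can be illustrated on a toy latent vector. This is a sketch under a strong simplifying assumption: the latent is just the concatenation of two speaker slots, so an ideal steering matrix is $W = P - I$, where $P$ is the permutation that exchanges the slots. In the paper $W$ is learned on the backbone's real latent features; the slot layout and the names `matvec`, `swap_permutation`, and `steer` are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def swap_permutation(slot_dim):
    """Permutation matrix P that exchanges two adjacent slots of size slot_dim."""
    n = 2 * slot_dim
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        p[i][(i + slot_dim) % n] = 1.0
    return p


def steer(f, w, g):
    """Apply f' = (I + g * W) f for a scalar gate g."""
    n = len(f)
    eye = identity(n)
    m = [[eye[i][j] + g * w[i][j] for j in range(n)] for i in range(n)]
    return matvec(m, f)


slot_dim = 2
n = 2 * slot_dim
p = swap_permutation(slot_dim)
eye = identity(n)

# Idealized steering matrix (assumption): W = P - I, so that I + W = P.
w = [[p[i][j] - eye[i][j] for j in range(n)] for i in range(n)]

f = [1.0, 2.0, 10.0, 20.0]         # [speaker-1 slot | speaker-2 slot]
passthrough = steer(f, w, g=0.0)   # gate closed: features unchanged
swapped = steer(f, w, g=1.0)       # gate open: the two slots trade places

print(passthrough)
print(swapped)
```

In this toy, an intermediate gate such as $g = 0.5$ would blend the two slots, so the sketch only makes sense when the gate is driven to 0 or 1.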

[Figure: Training process of the LSM]

2. Visual Steering Module

A lightweight module looks at the target speaker's lip movements and the internal audio features to decide whether to activate the LSM. This module doesn't touch the separation process; it only controls the "routing logic."
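One way to picture the routing decision is as a comparison between a lip-movement embedding and one audio embedding per channel. The sketch below is an assumption for illustration, not the paper's module: `cosine`, `steering_gate`, the toy embeddings, the shared embedding space, and the convention that the target should land on channel 0 are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def steering_gate(visual_emb, channel_embs):
    """Return the gate g for a 2-speaker mixture.

    g = 0.0 keeps the channels as they are (target already on channel 0),
    g = 1.0 requests a swap (target currently on channel 1).
    """
    sim0 = cosine(visual_emb, channel_embs[0])
    sim1 = cosine(visual_emb, channel_embs[1])
    return 0.0 if sim0 >= sim1 else 1.0


# Hypothetical embeddings: the lip stream resembles the first audio pattern.
visual = [0.9, 0.1, 0.0]
channels_in_order = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
channels_swapped = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

g_keep = steering_gate(visual, channels_in_order)  # target already on channel 0
g_swap = steering_gate(visual, channels_swapped)   # target ended up on channel 1

print(g_keep, g_swap)
```

The point of the sketch is the division of labor: the gate only outputs a routing decision, and nothing in it modifies the separated audio itself.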

Experiments: Higher Fidelity, Lower Cost

The researchers tested this on four major architectures. The results were striking:

Near-Lossless Routing: In models like TF-GridNet, the LSM preserved 99.91% of the original separation performance while correctly identifying the target.
Perceptual Superiority: "Residual" adaptation (fine-tuning) caused audio quality to drop sharply, as measured by NISQA scores, whereas Plug-and-Steer maintained the high-fidelity "studio" sound of the backbone.
Efficiency: Because it avoids the redundant decoding and re-encoding steps required by post-hoc selection methods, it has a lower real-time factor (RTF 0.147 vs 0.209, roughly 30% less compute per second of audio).
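For reference, real-time factor (RTF) is processing time divided by audio duration, so lower is faster. The short calculation below only restates the two quoted RTF figures as a relative saving and a speedup. It assumes, following the order in the text, that 0.147 belongs to Plug-and-Steer and 0.209 to post-hoc selection; the 60-second clip is a made-up example.

```python
# RTF = processing time / audio duration (lower is faster).
rtf_plug_and_steer = 0.147  # assumed to be the Plug-and-Steer figure
rtf_post_hoc = 0.209        # assumed to be the post-hoc selection figure

relative_saving = 1.0 - rtf_plug_and_steer / rtf_post_hoc
speedup = rtf_post_hoc / rtf_plug_and_steer

# Compute time implied by each RTF for a hypothetical 60-second clip.
clip_seconds = 60.0
time_plug_and_steer = rtf_plug_and_steer * clip_seconds
time_post_hoc = rtf_post_hoc * clip_seconds

print(f"relative saving: {relative_saving:.1%}")
print(f"speedup: {speedup:.2f}x")
print(f"60 s clip: {time_plug_and_steer:.2f} s vs {time_post_hoc:.2f} s")
```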

[Table: Performance comparison]

Why This Matters: Scaling for the Future

The beauty of Plug-and-Steer is its modularity. We are currently seeing a rapid evolution in speech separation engines (e.g., the move toward State-Space Models like Mamba).

With Plug-and-Steer, you don't need to rebuild your audio-visual model every time a better audio-only model comes out. You can simply "plug" the new backbone in and train a tiny steering matrix. It bridges the gap between the clean world of acoustic research and the messy world of multi-modal applications.

Conclusion

Plug-and-Steer makes the case that in the era of foundation models, decoupling is often better than fusion. By treating the visual modality strictly as a selector and the audio modality as a generator, we can achieve the best of both worlds: near-perfect identity matching and studio-quality sound.

Limitations: Currently, the system is designed for 2-speaker mixtures. Extending this to an arbitrary number of speakers would require a more complex routing matrix beyond a simple binary swap.
