[Interspeech 2024] MSR-HuBERT: Breaking the 16kHz Barrier in Speech Self-Supervised Learning
Abstract

MSR-HuBERT is a multi-sampling-rate adaptive self-supervised learning (SSL) framework for speech processing. It introduces a novel adaptive downsampling CNN that maps raw waveforms at various sampling rates (16 to 48 kHz) to a shared temporal resolution, achieving state-of-the-art performance in mixed-rate ASR and full-band speech reconstruction.

TL;DR

Existing speech SSL models (HuBERT, Wav2Vec 2.0) are "locked" to a 16 kHz sampling rate, creating a resolution mismatch when faced with high-fidelity audio. MSR-HuBERT solves this with an adaptive CNN front-end that maps each supported sampling rate (16 to 48 kHz) onto a shared 20 ms temporal grid. It retains the original Transformer backbone while significantly improving speech reconstruction and ASR, without resorting to destructive resampling.

The "Resolution Mismatch" Problem

Most modern speech SSL models use a convolutional feature extractor that performs a fixed 320x downsampling. For a 16 kHz signal, this yields a 20 ms frame shift, the "gold standard" for phoneme bottleneck modeling.

However, if you feed 48 kHz audio into the same model, the frame shift shrinks to roughly 6.7 ms. This creates a fundamental mismatch for the Transformer encoder, which was designed to process 20 ms chunks.

  • Resampling to 16kHz? You lose everything above 8kHz (crucial for high-fidelity reconstruction).
  • Training separate models? Computationally expensive and fragments the data.
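The mismatch is plain arithmetic: with the downsampling factor fixed at 320x, the frame shift shrinks as the input rate grows. A quick sketch (320 is the standard HuBERT/Wav2Vec 2.0 factor; the helper name here is ours, not the paper's):

```python
# Frame shift produced by a fixed 320x convolutional downsampler
# at different input sampling rates.
DOWNSAMPLE_FACTOR = 320  # HuBERT / Wav2Vec 2.0 front-end

def frame_shift_ms(sample_rate_hz: int, factor: int = DOWNSAMPLE_FACTOR) -> float:
    """Duration (in ms) covered by one output frame of the CNN front-end."""
    return 1000.0 * factor / sample_rate_hz

for rate in (16_000, 22_050, 24_000, 48_000):
    print(f"{rate} Hz -> {frame_shift_ms(rate):.2f} ms per frame")
# 16 kHz gives the expected 20 ms; 48 kHz collapses to ~6.67 ms.
```

Only the 16 kHz case lands on the 20 ms grid the Transformer was trained for; every higher rate produces denser frames.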

Methodology: Adaptive Downsampling

The core innovation of MSR-HuBERT is the Multi-sampling-rate Adaptive Downsampling CNN. Instead of a "one-size-fits-all" stride, the authors designed specific convolutional configurations for different rates (16, 22.05, 24, 48 kHz) to ensure the output always aligns to the same temporal resolution.
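Each branch's total stride must therefore equal 20 ms worth of samples at its rate; the per-layer kernel/stride splits are in the paper, but the totals follow directly from the target grid. A minimal sketch of that constraint (helper names are illustrative):

```python
# Total CNN stride needed so every branch emits one frame per 20 ms.
# The per-layer decomposition in the paper may differ; only the
# product of strides is constrained by the shared grid.
TARGET_FRAME_MS = 20

def required_total_stride(sample_rate_hz: int) -> int:
    samples = sample_rate_hz * TARGET_FRAME_MS / 1000
    assert samples == int(samples), "rate must divide evenly into 20 ms"
    return int(samples)

for rate in (16_000, 22_050, 24_000, 48_000):
    print(f"{rate} Hz needs total stride {required_total_stride(rate)}")
# 16 kHz -> 320, 22.05 kHz -> 441, 24 kHz -> 480, 48 kHz -> 960
```

Note that all four supported rates divide evenly into 20 ms, which is what makes a single shared frame grid possible.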

[Figure: Overall architecture of MSR-HuBERT]

Why it works:

  1. Rate-Specific Strides: By adjusting the stride (e.g., more aggressive downsampling for 48kHz), the model compensates for the higher input density.
  2. Shared Feature Space: By applying Layer Normalization (LN) after each rate-specific branch, the model forces features from disparate sampling rates into a common distribution, allowing the subsequent Transformer to be rate-agnostic.
  3. Single Codebook: Even with mixed inputs, the model uses one shared codebook for mask-prediction, forcing the model to learn representations that capture the "essence" of speech regardless of its frequency resolution.
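The branch-plus-LN design can be pictured as a routing table of rate-specific downsamplers feeding one shared normalization. The toy NumPy version below stands in a framed linear projection for the learned CNN; all names and sizes are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET_FRAME_MS = 20
FEATURE_DIM = 8  # toy embedding size; a real front-end uses hundreds of dims

def make_branch(rate: int):
    """Stand-in for a rate-specific CNN: frame the waveform into 20 ms
    windows, then project to a shared feature size."""
    hop = rate * TARGET_FRAME_MS // 1000          # 320 at 16 kHz, 960 at 48 kHz
    proj = rng.normal(size=(hop, FEATURE_DIM)) / np.sqrt(hop)
    def branch(wave: np.ndarray) -> np.ndarray:
        n = len(wave) // hop
        return wave[: n * hop].reshape(n, hop) @ proj
    return branch

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Shared LN applied after every branch, pulling all rates
    into a common feature distribution."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

branches = {rate: make_branch(rate) for rate in (16_000, 48_000)}

# One second of audio at either rate lands on the same 50-frame grid
# with the same feature size, ready for a shared, rate-agnostic Transformer.
for rate, branch in branches.items():
    feats = layer_norm(branch(rng.normal(size=rate)))
    print(rate, feats.shape)
```

Because every branch emits the same frame rate and feature width, a single codebook and a single Transformer can serve all input rates.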

Experimental Performance

The researchers tested MSR-HuBERT on two opposing fronts: ASR (which values low-frequency semantics) and Speech Reconstruction (SR) (which requires high-frequency textures).

[Figure: Experimental results comparison]

Key Insights:

  • ASR Superiority: MSR-HuBERT achieved a WER of 5.89 at 16kHz, outperforming the standard HuBERT Base (6.41). This suggests that multi-rate training actually acts as a regularizer, improving the robustness of low-frequency semantic extraction.
  • Preservation of Detail: In full-band reconstruction (STOI metric), MSR-HuBERT significantly outperformed models that were forced to resample to 16kHz during pre-training, proving its ability to "see" and utilize high-frequency signal components.
  • Efficiency: The multi-rate overhead is minimal—adding support for new rates increases the parameter count by only ~3%, as the heavy-lifting Transformer layers are shared.

Compatibility & Future Work

One of the strongest selling points of MSR-HuBERT is that it is a drop-in replacement for HuBERT. The authors demonstrated that existing "hacks" to improve HuBERT—like Intermediate Layer Supervision—work just as well on MSR-HuBERT, yielding further gains (see Table 3 in the paper).

Limitations

While highly effective, the model currently requires the sampling rate to be known a priori to route the signal to the correct CNN branch. Future iterations might benefit from an automated "Rate-ID" module to make the pipeline fully end-to-end for "in-the-wild" audio.

Conclusion

MSR-HuBERT marks a transition from "Narrow-band SSL" to "Full-band SSL." By solving the resolution mismatch at the architectural level, it allows the community to leverage the vast amounts of high-resolution audio data available today without sacrificing the efficiency of established self-supervised learning paradigms.
