FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

[ArXiv 2025] FireRedASR2S: Xiaohongshu’s All-in-One Powerhouse for Industrial-Grade Speech Recognition

Abstract

FireRedASR2S is an industrial-grade, all-in-one speech recognition system from Xiaohongshu Inc. that integrates state-of-the-art modules for automatic speech recognition (ASR), voice activity detection (VAD), spoken language identification (LID), and punctuation prediction. It ships two ASR variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters). The LLM variant reaches a 2.89% average CER on Mandarin benchmarks and outperforms competitors such as Qwen3-ASR and Doubao-ASR.

TL;DR

Xiaohongshu recently unveiled FireRedASR2S, a comprehensive, industrial-grade ASR system that doesn't just transcribe audio but handles the entire pipeline: Voice Activity Detection (VAD), Spoken Language Identification (LID), ASR, and Punctuation. By scaling training data to 200,000 hours and utilizing human-annotated events rather than weak supervision, they've set a new SOTA for Mandarin and Chinese dialects, even outperforming heavyweights like Doubao and Qwen3.

Background: Beyond Simple Transcription

In the real world, "ASR" is rarely just about turning clear speech into text. We deal with long-form recordings, background music, singing, code-switching, and a plethora of dialects. Most industrial systems are "Frankenstein" pipelines, cobbled together from different toolkits.

FireRedASR2S addresses this by providing a unified, modular architecture where every component is optimized to work together.

The Modular Architecture

The system follows a sequential pipeline where each stage adds another layer of metadata to the raw waveform.

[Figure: Overview of FireRedASR2S]
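
The sequential flow described above can be sketched as a chain of stages, each attaching metadata to a segment. This is only an illustration of the idea, not the FireRedASR2S API: the `Segment` schema and the four stub functions (`run_vad`, `run_lid`, `run_asr`, `run_punc`) are hypothetical placeholders that return fixed values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One speech region plus the metadata each stage attaches (hypothetical schema)."""
    start_s: float
    end_s: float
    language: Optional[str] = None
    text: Optional[str] = None


def run_vad(waveform: List[float], sample_rate: int) -> List[Segment]:
    # Stub: treats the whole clip as a single speech region.
    return [Segment(0.0, len(waveform) / sample_rate)]


def run_lid(seg: Segment) -> Segment:
    seg.language = "zh"  # Stub: a real LID model would predict this.
    return seg


def run_asr(seg: Segment) -> Segment:
    seg.text = "你好世界"  # Stub: a real ASR model would transcribe the audio.
    return seg


def run_punc(seg: Segment) -> Segment:
    seg.text = (seg.text or "") + "。"  # Stub: appends a full stop.
    return seg


def transcribe(waveform: List[float], sample_rate: int) -> List[Segment]:
    """VAD first, then LID, ASR, and punctuation applied to every segment in order."""
    segments = run_vad(waveform, sample_rate)
    for stage in (run_lid, run_asr, run_punc):
        segments = [stage(seg) for seg in segments]
    return segments


result = transcribe([0.0] * 16000, 16000)  # one second of silence at 16 kHz
print(result)
```

Because every stage consumes the previous stage's output, a mistake made early (for example, a badly placed VAD boundary) is visible to all later stages.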

1. FireRedVAD: The Precision Scalpel

Most VADs are trained using "forced alignment" labels from ASR models. This is a circular dependency that often fails when the audio isn't speech (e.g., singing or music).

Innovation: FireRedVAD is trained on thousands of hours of human-annotated acoustic events.
Structure: Uses a Deep Feedforward Sequential Memory Network (DFSMN).
Result: At only 0.6M parameters, it hits 99.60% AUC-ROC, making it both "ultra-lightweight" and "ultra-robust."
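
For readers unfamiliar with the metric: AUC-ROC can be read as the probability that a randomly chosen speech frame gets a higher score than a randomly chosen non-speech frame. A minimal sketch of that computation follows; the scores and labels are made-up toy values, not data from the paper.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy frame-level speech scores (1 = speech frame, 0 = non-speech frame).
toy_scores = [0.95, 0.80, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 0, 1, 0]
toy_auc = auc_roc(toy_scores, toy_labels)
print(f"AUC-ROC = {toy_auc:.4f}")
```

The pairwise loop is quadratic in the number of frames, which is fine for illustration; rank-based implementations are preferable on real evaluation sets.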

2. FireRedLID: Solving the Dialect Maze

LID is critical for routing audio to the right model. FireRedLID uses a Hierarchical Decoding strategy. It predicts the language (e.g., zh, en) first; if it's Chinese, it emits a second token for the specific dialect (e.g., yue, min, wu). This narrows the search space and improves accuracy significantly (88.47% on dialect benchmarks).
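
The two-step decision can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `hierarchical_lid` and the score dictionaries are hypothetical, and a real decoder would condition the second token on the first rather than read from a fixed table.

```python
from typing import Dict, Optional, Tuple


def hierarchical_lid(
    language_scores: Dict[str, float],
    dialect_scores: Dict[str, float],
) -> Tuple[str, Optional[str]]:
    """Pick the language first; only if it is Chinese, pick a dialect as a second token."""
    language = max(language_scores, key=language_scores.get)
    if language != "zh":
        # Non-Chinese audio never enters the dialect search space.
        return language, None
    dialect = max(dialect_scores, key=dialect_scores.get)
    return language, dialect


# Hypothetical scores for two clips.
chinese_clip = hierarchical_lid(
    {"zh": 0.9, "en": 0.1},
    {"yue": 0.7, "min": 0.2, "wu": 0.1},
)
english_clip = hierarchical_lid(
    {"zh": 0.2, "en": 0.8},
    {"yue": 0.7, "min": 0.2, "wu": 0.1},
)
print(chinese_clip, english_clip)
```

The second lookup only runs for Chinese audio, which is what narrows the search space: dialect labels never compete with unrelated languages.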

3. FireRedASR2: The Heavy Lifter

The core ASR comes in two flavors:

FireRedASR2-LLM: Uses an Encoder-Adapter-LLM (8B+ parameters) foundation for maximum linguistic context.
FireRedASR2-AED: A standard Attention-based Encoder-Decoder (1B+ parameters) for better efficiency and precise word-level timestamps via a post-hoc CTC branch.

[Figure: Architecture Comparison]
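
The post does not spell out how the CTC branch produces timestamps. One common way to get token times from a CTC output, shown here purely as an illustration, is to collapse a frame-level label path and read the boundaries off the frame indices. The function name, the `<blank>` symbol, and the 40 ms frame shift are assumptions, not values from the paper.

```python
from typing import List, Tuple

BLANK = "<blank>"  # assumed blank symbol


def ctc_path_to_timestamps(
    path: List[str], frame_shift_ms: int
) -> List[Tuple[str, int, int]]:
    """Collapse a per-frame CTC label path into (token, start_ms, end_ms) spans."""
    spans: List[List] = []
    prev = BLANK
    for i, label in enumerate(path):
        if label != BLANK:
            if label != prev:
                # A new token starts at this frame.
                spans.append([label, i, i + 1])
            else:
                # Same token continues: extend the current span by one frame.
                spans[-1][2] = i + 1
        prev = label
    return [(tok, start * frame_shift_ms, end * frame_shift_ms) for tok, start, end in spans]


toy_path = [BLANK, "你", "你", BLANK, "好", "好", "好", BLANK]
timestamps = ctc_path_to_timestamps(toy_path, frame_shift_ms=40)
print(timestamps)
```

A blank between two identical labels starts a new span, so repeated characters are kept as separate tokens, matching the usual CTC collapse rule.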

Evaluation: Crushing the Baselines

The most impressive part of FireRedASR2S is its performance on diverse Chinese dialects and singing.

| Test Set | FireRedASR2-LLM | Doubao-ASR | Qwen3-ASR | Fun-ASR |
| :--- | :---: | :---: | :---: | :---: |
| Avg-Mandarin-4 (CER%) | 2.89 | 3.69 | 3.76 | 4.16 |
| Avg-Dialect-19 (CER%) | 11.55 | 15.39 | 11.85 | 12.76 |
| Singing (Opencpop, CER%) | 1.12 | 4.36 | 2.57 | 3.05 |

The model achieves a 2.89% CER on Mandarin, which is near human parity on public benchmarks. Its singing CER (1.12%) is roughly 2.3x to 3.9x lower than the other systems in the table (2.57% to 4.36%), likely due to diverse training data that includes musical content. On dialects the margin is narrower: 11.55% versus 11.85% for Qwen3-ASR.
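
For reference, CER is the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below computes it on a made-up sentence pair and then derives the singing-CER ratios directly from the table values above; the example strings are illustrative, not benchmark data.

```python
from typing import Dict


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Made-up example: one substituted character out of six.
example_cer = cer("今天天气很好", "今天天汽很好")

# Singing (Opencpop) CER values copied from the table.
fireredasr2_llm_singing = 1.12
others: Dict[str, float] = {"Doubao-ASR": 4.36, "Qwen3-ASR": 2.57, "Fun-ASR": 3.05}
ratios = {name: value / fireredasr2_llm_singing for name, value in others.items()}

print(f"example CER = {example_cer:.2f}%")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the FireRedASR2-LLM singing CER")
```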

Critical Analysis & Takeaways

Why does it work?

Data Diversity: 200k hours of supervised data is the "secret sauce." Scaling the variety of dialects and acoustic environments is more effective than just scaling model parameters.
Human Alignment: By moving away from forced-alignment labels for VAD, the system handles the "non-speech" segments that usually break industrial pipelines.

Limitations

While FireRedASR2S is "all-in-one," it still relies on a sequential pipeline. Errors in the VAD stage (e.g., cutting a word in half) can still propagate. Future iterations might explore "End-to-End" joint training where VAD and ASR share the same latent space more deeply.

Conclusion

FireRedASR2S is a masterclass in industrial AI engineering. It proves that a well-orchestrated pipeline of specialized modules, backed by high-quality supervised data, can still outperform general-purpose foundation models in specific localized tasks like Chinese dialect recognition.

Model weights and code are released on GitHub, making this one of the most powerful open assets for the speech community in 2025/2026.
