FireRedASR2S is an industrial-grade, all-in-one speech recognition system from Xiaohongshu Inc. that integrates SOTA modules for ASR, VAD, LID, and Punctuation Prediction. It features two ASR variants: FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters). The LLM variant reaches a 2.89% average CER on Mandarin benchmarks and scores lower error rates than competitors such as Qwen3-ASR and Doubao-ASR on the Mandarin, dialect, and singing test sets reported here.
TL;DR
Xiaohongshu recently unveiled FireRedASR2S, a comprehensive, industrial-grade ASR system that doesn't just transcribe audio but handles the entire pipeline: Voice Activity Detection (VAD), Spoken Language Identification (LID), ASR, and Punctuation. By scaling training data to 200,000 hours and utilizing human-annotated events rather than weak supervision, they've set a new SOTA for Mandarin and Chinese dialects, even outperforming heavyweights like Doubao and Qwen3.
Background: Beyond Simple Transcription
In the real world, "ASR" is rarely just about turning clear speech into text. We deal with long-form recordings, background music, singing, code-switching, and a plethora of dialects. Most industrial systems are "Frankenstein" pipelines—cobbled together from different toolkits.
FireRedASR2S addresses this by providing a unified, modular architecture where every component is optimized to work together.
The Modular Architecture
The system follows a sequential pipeline where each stage adds another layer of metadata to the raw waveform.
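The sequential flow can be sketched in a few lines of Python. This is a minimal illustration, not the FireRedASR2S API: the `Segment` type, the `run_pipeline` function, and the four stage callables are hypothetical names, and the toy stages only stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    """One speech region plus the metadata each stage attaches to it."""
    start_s: float
    end_s: float
    language: str = ""
    text: str = ""


def run_pipeline(
    samples: Sequence[float],
    sample_rate: int,
    vad: Callable[[Sequence[float], int], List[Tuple[float, float]]],
    lid: Callable[[Sequence[float]], str],
    asr: Callable[[Sequence[float], str], str],
    punc: Callable[[str], str],
) -> List[Segment]:
    """Run VAD -> LID -> ASR -> punctuation, one speech segment at a time."""
    results = []
    for start_s, end_s in vad(samples, sample_rate):
        chunk = samples[int(start_s * sample_rate): int(end_s * sample_rate)]
        language = lid(chunk)            # language tag used to condition ASR
        raw_text = asr(chunk, language)  # unpunctuated transcript
        results.append(Segment(start_s, end_s, language, punc(raw_text)))
    return results


# Toy stand-ins (assumptions for illustration only).
def toy_vad(samples, sample_rate):
    return [(0.0, 1.0), (1.5, 2.0)]


def toy_lid(chunk):
    return "zh"


def toy_asr(chunk, language):
    return f"{len(chunk)} samples in {language}"


def toy_punc(text):
    return text + "."


segments = run_pipeline([0.0] * 32000, 16000, toy_vad, toy_lid, toy_asr, toy_punc)
for seg in segments:
    print(seg)
```

Because each stage only consumes the previous stage's output, a module can be swapped or skipped without touching the others. The same property is why an early mistake, such as a bad VAD boundary, is inherited by every later stage.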

1. FireRedVAD: The Precision Scalpel
Most VADs are trained using "forced alignment" labels from ASR models. This is a circular dependency that often fails when the audio isn't speech (e.g., singing or music).
- Innovation: FireRedVAD is trained on thousands of hours of human-annotated acoustic events.
- Structure: Uses a Deep Feedforward Sequential Memory Network (DFSMN).
- Result: At only 0.6M parameters, it hits 99.60% AUC-ROC, making it both "ultra-lightweight" and "ultra-robust."
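To make the DFSMN idea concrete, here is a simplified sketch of a single FSMN-style memory block: each output frame is the input frame plus a learned, per-dimension weighted sum of a few past and future frames. The tap counts, stride, and weights are illustrative assumptions, since the article does not give FireRedVAD's layer configuration. The linear projections and inter-layer skip connections of a full DFSMN are left out.

```python
from typing import List

Frames = List[List[float]]  # T frames, each a D-dimensional vector


def fsmn_memory_block(
    frames: Frames,
    back_taps: Frames,   # weights for frames t, t-stride, t-2*stride, ...
    ahead_taps: Frames,  # weights for frames t+stride, t+2*stride, ...
    stride: int = 1,
) -> Frames:
    """Add a weighted window of past/future context to every frame."""
    num_frames = len(frames)
    outputs = []
    for t in range(num_frames):
        memory = list(frames[t])  # identity term: start from the frame itself
        for i, weights in enumerate(back_taps):
            src = t - i * stride
            if src < 0:
                continue  # no history before the first frame
            for d, w in enumerate(weights):
                memory[d] += w * frames[src][d]
        for j, weights in enumerate(ahead_taps, start=1):
            src = t + j * stride
            if src >= num_frames:
                continue  # no look-ahead past the last frame
            for d, w in enumerate(weights):
                memory[d] += w * frames[src][d]
        outputs.append(memory)
    return outputs


# Hypothetical 1-dimensional example: two look-back taps, one look-ahead tap.
frames = [[1.0], [2.0], [3.0]]
out = fsmn_memory_block(frames, back_taps=[[0.5], [0.25]], ahead_taps=[[0.1]])
print(out)  # approximately [[1.7], [3.55], [5.0]]
```

The block has no recurrence, only a fixed window of taps, so it can be evaluated frame by frame with a small parameter count. That fits the lightweight profile described above.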
2. FireRedLID: Solving the Dialect Maze
LID is critical for routing audio to the right model. FireRedLID uses a Hierarchical Decoding strategy. It predicts the language (e.g., zh, en) first; if it's Chinese, it emits a second token for the specific dialect (e.g., yue, min, wu). Because the second step only has to choose among Chinese dialects, the search space is smaller, and the system reaches 88.47% accuracy on dialect benchmarks.
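The two-step decision can be sketched as follows. This is a hedged illustration of the decoding logic only: `hierarchical_decode` and the score dictionaries are hypothetical, and in a real model the scores would come from the decoder's output distribution rather than hand-written numbers.

```python
from typing import Callable, Dict, Optional, Tuple


def hierarchical_decode(
    language_scores: Dict[str, float],
    dialect_scorer: Callable[[str], Dict[str, float]],
) -> Tuple[str, Optional[str]]:
    """Pick the language first; pick a dialect only if the language is Chinese."""
    language = max(language_scores, key=language_scores.get)
    if language != "zh":
        return language, None  # no second token for non-Chinese audio
    # The second step is conditioned on the first token and only ranks dialects.
    dialect_scores = dialect_scorer(language)
    dialect = max(dialect_scores, key=dialect_scores.get)
    return language, dialect


def toy_dialect_scorer(language: str) -> Dict[str, float]:
    # Made-up scores for the dialect labels mentioned in the text.
    return {"yue": 0.7, "wu": 0.2, "min": 0.1}


print(hierarchical_decode({"zh": 0.9, "en": 0.1}, toy_dialect_scorer))
print(hierarchical_decode({"zh": 0.2, "en": 0.8}, toy_dialect_scorer))
```

Compared with a flat classifier over every language and dialect label at once, the second decision here never competes with unrelated languages.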
3. FireRedASR2: The Heavy Lifter
The core ASR comes in two flavors:
- FireRedASR2-LLM: Uses an Encoder-Adapter-LLM (8B+ parameters) foundation for maximum linguistic context.
- FireRedASR2-AED: A standard Attention-based Encoder-Decoder (1B+ parameters) for better efficiency and precise word-level timestamps via a post-hoc CTC branch.

Evaluation: Crushing the Baselines
The most impressive part of FireRedASR2S is its performance on diverse Chinese dialects and singing.

| Test Set | FireRedASR2-LLM | Doubao-ASR | Qwen3-ASR | Fun-ASR |
| :--- | :---: | :---: | :---: | :---: |
| Avg-Mandarin-4 (CER%) | 2.89 | 3.69 | 3.76 | 4.16 |
| Avg-Dialect-19 (CER%) | 11.55 | 15.39 | 11.85 | 12.76 |
| Singing (Opencpop, CER%) | 1.12 | 4.36 | 2.57 | 3.05 |

The model achieves a 2.89% CER on Mandarin, the lowest of the four systems compared. Its singing transcription error (1.12% CER) is roughly 2.3x to 3.9x lower than that of the other systems in the table, likely due to the diverse training data, which includes musical content. On dialects the margin over Qwen3-ASR is much narrower (11.55% vs. 11.85%).
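The size of these gaps is easy to check with a little arithmetic. The sketch below uses only the CER values from the table. The helper names are illustrative, and "relative reduction" is taken to mean (baseline - ours) / baseline.

```python
# CER (%) values copied from the evaluation table.
CER = {
    "Avg-Mandarin-4": {
        "FireRedASR2-LLM": 2.89, "Doubao-ASR": 3.69, "Qwen3-ASR": 3.76, "Fun-ASR": 4.16,
    },
    "Avg-Dialect-19": {
        "FireRedASR2-LLM": 11.55, "Doubao-ASR": 15.39, "Qwen3-ASR": 11.85, "Fun-ASR": 12.76,
    },
    "Singing (Opencpop)": {
        "FireRedASR2-LLM": 1.12, "Doubao-ASR": 4.36, "Qwen3-ASR": 2.57, "Fun-ASR": 3.05,
    },
}
OURS = "FireRedASR2-LLM"


def relative_reduction(ours: float, baseline: float) -> float:
    """Percentage of the baseline's errors that are removed."""
    return 100.0 * (baseline - ours) / baseline


def error_ratio(ours: float, baseline: float) -> float:
    """How many times more errors the baseline makes."""
    return baseline / ours


def compare(test_set: str) -> dict:
    """Map each baseline to (relative reduction in %, error ratio)."""
    scores = CER[test_set]
    ours = scores[OURS]
    return {
        name: (round(relative_reduction(ours, value), 1), round(error_ratio(ours, value), 2))
        for name, value in scores.items()
        if name != OURS
    }


for test_set in CER:
    for name, (reduction, ratio) in compare(test_set).items():
        print(f"{test_set:20s} vs {name:10s}: {reduction:5.1f}% fewer errors, {ratio:.2f}x ratio")
```

On the singing set the ratio ranges from about 2.3x (Qwen3-ASR) to about 3.9x (Doubao-ASR). On the dialect average the relative reduction against Qwen3-ASR is only about 2.5%.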
Critical Analysis & Takeaways
Why does it work?
- Data Diversity: 200k hours of supervised data is the "secret sauce." The results suggest that scaling the variety of dialects and acoustic environments pays off more than scaling model parameters alone.
- Human Alignment: By moving away from forced-alignment labels for VAD, the system handles the "non-speech" segments that usually break industrial pipelines.
Limitations
While FireRedASR2S is "all-in-one," it still relies on a sequential pipeline. Errors in the VAD stage (e.g., cutting a word in half) can still propagate. Future iterations might explore "End-to-End" joint training where VAD and ASR share the same latent space more deeply.
Conclusion
FireRedASR2S is a masterclass in industrial AI engineering. It proves that a well-orchestrated pipeline of specialized modules, backed by high-quality supervised data, can still outperform general-purpose foundation models in specific localized tasks like Chinese dialect recognition.
Model weights and code are released on GitHub, making this one of the most powerful open assets for the speech community in 2025/2026.
