How is data pre-processed and tokenized in Cantonese automatic speech recognition datasets?

Why does Cantonese ASR need special pre-processing and tokenization?

Cantonese is a low-resource language with limited labeled speech data, which makes standard ASR approaches less effective. The language has many homophone characters (same sound, different meaning) and rare words that data-hungry end-to-end models easily misrecognize. One study found that a homophone extension method, which links rare words to their common homophones during beam search decoding, combined with unified writing (merging variant characters like traditional and simplified forms) reduced character error rate (CER) by 5 percentage points (absolute) on in-domain tests and 18 percentage points on out-of-domain tests [2]. This suggests that pre-processing that incorporates linguistic knowledge, such as homophone lexicons, is particularly valuable for Cantonese ASR.
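The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method from [2]: the variant table, the Jyutping lexicon, and the function names (`unify_writing`, `extend_hypothesis`) are all hypothetical, and a real system would apply the extension inside beam search and rescore the candidates with a language model.

```python
# Toy tables for illustration only; they are NOT the lexicons used in [2].
# The direction of each mapping (which form counts as canonical) is an arbitrary choice here.
VARIANT_TO_CANONICAL = {"裡": "裏", "著": "着", "衛": "衞"}

# Jyutping syllable -> characters sharing that pronunciation (small toy sample).
HOMOPHONES = {"si1": ["師", "詩", "絲", "施"]}


def unify_writing(text):
    """Map every variant character to a single canonical written form."""
    return "".join(VARIANT_TO_CANONICAL.get(ch, ch) for ch in text)


def build_char_to_homophones(lexicon):
    """Invert the lexicon: character -> set of other characters with the same sound."""
    table = {}
    for chars in lexicon.values():
        for ch in chars:
            table.setdefault(ch, set()).update(c for c in chars if c != ch)
    return table


def extend_hypothesis(hyp, rare_chars, char_to_homophones):
    """Return the hypothesis plus copies where one character is swapped for a rare homophone."""
    candidates = [hyp]
    for i, ch in enumerate(hyp):
        for alt in sorted(char_to_homophones.get(ch, ())):
            if alt in rare_chars:
                candidates.append(hyp[:i] + alt + hyp[i + 1:])
    return candidates


table = build_char_to_homophones(HOMOPHONES)
print(unify_writing("屋裡"))
print(extend_hypothesis("老師", {"施"}, table))
```

Normalizing transcripts before training shrinks the output vocabulary, while the extension step gives a rare character a chance to compete with the frequent homophone the acoustic model prefers.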

Additionally, Hong Kong Cantonese speakers frequently mix Cantonese and English (code-switching), which creates modeling challenges. Researchers developed a Cantonese-English difference modeling unit to narrow the gap between the two languages and a language identification subtask to distinguish them, reporting accuracy improvements of 10-49% depending on the rescoring strategy [5]. Tokenization therefore has to account for bilingual switches, not just single-language characters.

How do researchers expand Cantonese datasets and pretrain models?

Because Cantonese speech data is scarce, data augmentation is a key pre-processing step. One approach uses large language models (LLMs) and text-to-speech (TTS) to generate synthetic Cantonese text and speech from Mandarin data, creating a Mandarin-Cantonese parallel database. A retrieval-augmented generation (RAG) method with a Cantonese knowledge base improved the diversity and accuracy of the generated text, and a data filter ensured quality. This augmented data was effective for fine-tuning Cantonese ASR and translation models [4].

Another strategy is unsupervised language-specific pretraining. Researchers collected 2,000 hours of unlabeled Cantonese audio and pretrained a wav2vec2.0 model on it. This Cantonese-specific model outperformed the multilingual XLSR-53 model (pretrained on about 56,000 hours of speech in 53 languages) by 6% relative, despite using 28 times less pretraining data. When fine-tuned with RNN-T and CTC loss functions, it achieved a CER of 15.57% on an open-source Cantonese test set, compared to 30.18% for a conformer end-to-end baseline [3]. This suggests that language-specific pretraining on even modest amounts of unlabeled Cantonese data can substantially boost performance.
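Because the results above mix absolute CER values with relative improvements, it helps to see how both are computed. The following is a minimal sketch of character error rate as Levenshtein distance divided by reference length; the function names are illustrative, and published evaluations usually normalize text (punctuation, spacing, variant characters) before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance over the number of reference characters."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


print(cer("我今日返工", "我今日放工"))
print(round(relative_reduction(30.18, 15.57), 3))
```

Applied to the figures reported in [3], going from 30.18% to 15.57% CER is an absolute drop of 14.61 points, or roughly a 48% relative reduction against the conformer baseline.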

What tokenization units and fine-tuning methods work best for Cantonese?

Tokenization in Cantonese ASR typically operates at the character level, but researchers have explored specialized units. For code-switched Hong Kong Cantonese, a Cantonese-English difference modeling unit was created to handle the phonetic and orthographic differences between the two languages [5]. This unit, combined with a multi-task rescoring strategy that jointly trains language identification and ASR, significantly improved accuracy.

For fine-tuning, parameter-efficient methods like LoRA (Low-Rank Adaptation) are effective for Cantonese. In one study, fine-tuning only 1.6% of Whisper-tiny's weights on the Common Voice zh-HK dataset reduced CER from 49.5% to 11.1%, nearly matching full fine-tuning (10.3%) while cutting training memory and computational cost by roughly a factor of 10. The model was then quantized to INT8 (60 MB) for fast inference on edge devices, achieving a real-time factor of 0.20 (about 5x faster than real time) on a MacBook Pro M1 Max CPU [1]. This suggests that parameter-efficient fine-tuning and quantization can make Cantonese ASR practical even on limited hardware.
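Two quantities in this paragraph are simple ratios, and a short sketch makes them concrete. The rank and matrix size below are assumptions chosen for illustration, not the configuration reported in [1]; the per-matrix fraction also differs from the 1.6% whole-model figure, which depends on which layers receive adapters.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Share of trainable weights when a d_out x d_in matrix W is frozen and
    only the low-rank factors B (d_out x rank) and A (rank x d_in) are trained."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return lora / full


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means the recognizer runs faster than the audio plays."""
    return processing_seconds / audio_seconds


# Hypothetical 384 x 384 projection adapted with rank 8.
print(round(lora_trainable_fraction(384, 384, 8), 4))

# Transcribing 10 s of audio in 2 s gives RTF 0.20, i.e. 5x faster than real time.
rtf = real_time_factor(2.0, 10.0)
print(rtf, 1 / rtf)
```

The fraction shrinks as the matrices grow, which is why low-rank adapters stay cheap even when the frozen base model is large.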

Sources used in this answer

[1] LoRA-INT8 Whisper: A Low-Cost Cantonese Speech Recognition Framework for Edge Devices

LoRA fine-tuning of Whisper-tiny on Cantonese Common Voice reduced CER from 49.5% to 11.1% using only 1.6% of weights, with INT8 quantization enabling inference about 5x faster than real time (real-time factor 0.20) on a MacBook CPU.

2025 · Lusheng Zhang, Shie Wu, Zhongxun Wang · Sensors (Basel, Switzerland)

[2] Improving Rare Words Recognition through Homophone Extension and Unified Writing for Low-resource Cantonese Speech Recognition

Homophone extension and unified writing reduced Cantonese ASR character error rate by an absolute 5% (in-domain) and 18% (out-of-domain) by integrating linguistic knowledge into decoding.

2022 · HoLam Chung, Junan Li, Pengfei Liu, Wai-Kim Leung, Xixin Wu, Helen Meng · ISCSLP

[3] Advances in Cantonese Speech Recognition: A Language-Specific Pretraining Model and RNN-T Loss

A wav2vec2.0 model pretrained on 2,000 hours of unlabeled Cantonese data outperformed XLSR-53 (56,000 hours) by 6% relative, achieving 15.57% CER vs. 30.18% for a conformer baseline.

2023 · Junyun Guan, Minqiang Xu, Xuan Xuan, Lei Fang, Yihao Chen, Liang He · 2023 5th International Academic Exchange Conference on Science and Technology Innovation (IAECST)

[4] Augment Mandarin to Cantonese Speech Databases via Retrieval-Augmented Generation and Speech Synthesis

Retrieval-augmented generation with LLMs and TTS augmented Mandarin-Cantonese parallel data, improving Cantonese ASR and translation model fine-tuning.

2025 · Fan Liu, Cheng Gong, Boyu Zhu, Ruihao Jing, Chunyu Qiang, Tianrui Wang, Xiao-Lei Zhang, Xuelong Li · INTERSPEECH

[5] HKSR: A Research of Code-Switching Hong Kong Cantonese Speech Recognition Based on Multi-task Rescoring Strategy

A Cantonese-English difference modeling unit and multi-task rescoring strategy improved Hong Kong Cantonese ASR accuracy by 10-49% for code-switched speech.

2022 · Yuting Huang, Bi Zeng, Zhentao Lin, Jia Cai · ICCT
