This research proposes replacing the CTC loss in group-robust speech recognition systems with sequence-level self-supervised representation losses derived from models such as wav2vec, HuBERT, and WavLM. The approach integrates these losses into a distributionally robust optimization (DRO) framework that explicitly minimizes worst-group error, where groups are defined by language, accent, or demographic subpopulation. Unlike CTC, the proposed losses are agnostic to utterance length and less prone to scaling artifacts, which addresses limitations identified in the CTC-DRO paper. When group metadata is unavailable, the method clusters robust accent and dialect representations to generate groupings automatically for fair optimization. The approach aims to improve robustness and fairness in ASR, especially for low-resource, accent-rich, and diverse speaker populations. It could reduce error disparities across multiple demographic axes and enable more equitable and efficient multilingual speech recognition.
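To make the worst-group objective concrete, the following is a minimal sketch of one group-weight update in the style of standard group DRO (an exponentiated-gradient step over group weights). It is not the proposal's actual implementation. The per-utterance loss values stand in for whatever sequence-level representation loss is used, and the function names, group labels, and `step_size` value are all hypothetical.

```python
import math
from collections import defaultdict


def group_mean_losses(losses, groups):
    """Average per-utterance losses within each group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for loss, group in zip(losses, groups):
        sums[group] += loss
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}


def update_group_weights(weights, mean_losses, step_size=0.1):
    """Exponentiated-gradient step: groups with higher loss gain weight."""
    scaled = {
        g: w * math.exp(step_size * mean_losses.get(g, 0.0))
        for g, w in weights.items()
    }
    total = sum(scaled.values())
    return {g: v / total for g, v in scaled.items()}


def robust_loss(weights, mean_losses):
    """Weighted sum of group losses; this is the quantity the model minimizes."""
    return sum(weights[g] * mean_losses[g] for g in mean_losses)


# Hypothetical per-utterance losses (placeholders for a length-agnostic
# representation loss) and illustrative group labels.
losses = [0.2, 0.4, 1.0, 1.4]
groups = ["accent_a", "accent_a", "accent_b", "accent_b"]
weights = {"accent_a": 0.5, "accent_b": 0.5}

means = group_mean_losses(losses, groups)
weights = update_group_weights(weights, means)
objective = robust_loss(weights, means)
```

In this sketch the group with the higher mean loss (`accent_b`) receives more weight after the update, so later gradient steps on `objective` emphasize the currently worst-performing group. When group metadata is unavailable, the `groups` list could instead hold cluster assignments computed from accent or dialect representations.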
References:
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-4.1-ctcfree-dro-leveraging-2025,
author = {GPT-4.1},
title = {CTC-Free DRO: Leveraging Self-Supervised Sequence Representation Losses for Robust Multilingual and Accent-Inclusive Speech Recognition},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/bRWXR9sJRBqSMXAV2Z2b}
}