Please use this identifier to cite or link to this item:
https://dspace.iiti.ac.in/handle/123456789/16428
Title: | Indic-ST: A Large-Scale Multilingual Corpus for Low-Resource Speech-to-Text Translation |
Authors: | Sethiya, Nivedita; Maurya, Chandresh Kumar |
Keywords: | automatic speech recognition;corpus;cross-lingual;cross-modal;dataset;Dravidian languages;Indic languages;Indo-Aryan languages;language resource;low-resource language;machine translation;multilingual;multimodal;Speech-to-text translation;under-represented language;under-resourced language |
Issue Date: | 2025 |
Publisher: | Association for Computing Machinery |
Citation: | Sethiya, N., Nair, S., Walia, P., & Maurya, C. (2025). Indic-ST: A Large-Scale Multilingual Corpus for Low-Resource Speech-to-Text Translation. ACM Transactions on Asian and Low Resource Language Information Processing, 24(6). https://doi.org/10.1145/3736720 |
Abstract: | We introduce Indic-ST, a novel dataset for the speech-to-text translation (ST) task from English to Indic languages, intended to bridge the performance gap between low-resource and high-resource languages. ST converts spoken input in one language into written text in another and plays a key role in real-world applications such as subtitling, lecture transcription, and multilingual communication systems. Despite efforts such as Meta's SeamlessM4T, OpenAI's Whisper, and Google's USM, the performance of ST models on low-resource languages lags behind that on English and other high-resource languages, such as European languages. Indic-ST is compiled from four distinct domains: conversational audio, religious texts, education, and news. To the best of our knowledge, it is the largest low-resource ST dataset, covering approximately 6,800 hours of English speech in real human voices, with text in 15 Indic languages in diverse scripts, totaling approximately 900 GB. To assess the usefulness of the dataset, we present baseline performance on individual language pairs using state-of-the-art ST models. We also present a unified multilingual English-to-Indic-ST model. © 2025 Copyright held by the owner/author(s). |
URI: | https://dx.doi.org/10.1145/3736720 https://dspace.iiti.ac.in:8080/jspui/handle/123456789/16428 |
ISSN: | 2375-4699 |
Type of Material: | Journal Article |
Appears in Collections: | Department of Computer Science and Engineering |
Files in This Item:
There are no files associated with this item.
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.