Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/16428
Full metadata record
DC Field | Value | Language
dc.contributor.author | Sethiya, Nivedita | en_US
dc.contributor.author | Maurya, Chandresh Kumar | en_US
dc.date.accessioned | 2025-07-09T13:48:02Z | -
dc.date.available | 2025-07-09T13:48:02Z | -
dc.date.issued | 2025 | -
dc.identifier.citation | Sethiya, N., Nair, S., Walia, P., & Maurya, C. (2025). Indic-ST: A Large-Scale Multilingual Corpus for Low-Resource Speech-to-Text Translation. ACM Transactions on Asian and Low-Resource Language Information Processing, 24(6). https://doi.org/10.1145/3736720 | en_US
dc.identifier.issn | 2375-4699 | -
dc.identifier.other | EID(2-s2.0-105009383968) | -
dc.identifier.uri | https://dx.doi.org/10.1145/3736720 | -
dc.identifier.uri | https://dspace.iiti.ac.in:8080/jspui/handle/123456789/16428 | -
dc.description.abstract | We introduce Indic-ST, a novel dataset for the speech-to-text translation (ST) task from English to Indic languages, intended to bridge the performance gap. ST involves converting spoken input in one language into written text in another, playing a key role in real-world applications like subtitling, lecture transcription, and multilingual communication systems. Despite several efforts like Meta's SeamlessM4T, OpenAI's Whisper, or Google's USM model, the performance of ST models on low-resource languages lags behind that of English (or high-resource languages such as European languages). Indic-ST is compiled from four distinct domains: conversational audio, religious texts, education, and news. To the best of our knowledge, this is the largest low-resource ST dataset, covering approximately 6,800 hours of English speech in real human voices and text in 15 Indic languages with diverse scripts, totaling approximately 900 GB in size. To assess the usefulness of the dataset, we present the baseline performance of individual language pairs using state-of-the-art ST models. We also present a unified multilingual English-to-Indic-ST model. © 2025 Copyright held by the owner/author(s). | en_US
dc.language.iso | en | en_US
dc.publisher | Association for Computing Machinery | en_US
dc.source | ACM Transactions on Asian and Low-Resource Language Information Processing | en_US
dc.subject | automatic speech recognition | en_US
dc.subject | corpus | en_US
dc.subject | cross-lingual | en_US
dc.subject | cross-modal | en_US
dc.subject | dataset | en_US
dc.subject | Dravidian languages | en_US
dc.subject | Indic languages | en_US
dc.subject | Indo-Aryan languages | en_US
dc.subject | language resource | en_US
dc.subject | low-resource language | en_US
dc.subject | machine translation | en_US
dc.subject | multilingual | en_US
dc.subject | multimodal | en_US
dc.subject | Speech-to-text translation | en_US
dc.subject | under-represented language | en_US
dc.subject | under-resourced language | en_US
dc.title | Indic-ST: A Large-Scale Multilingual Corpus for Low-Resource Speech-to-Text Translation | en_US
dc.type | Journal Article | en_US
Appears in Collections: Department of Computer Science and Engineering

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.