Please use this identifier to cite or link to this item:
https://dspace.iiti.ac.in/handle/123456789/16055
Title: | Benchmarking Hindi-to-English direct speech-to-speech translation with synthetic data |
Authors: | Maurya, Chandresh Kumar |
Keywords: | Low-resource languages;Semantic similarity;Speech-to-speech translation |
Issue Date: | 2025 |
Publisher: | Springer Science and Business Media B.V. |
Citation: | Gupta, M., Dutta, M., & Maurya, C. K. (2025). Benchmarking Hindi-to-English direct speech-to-speech translation with synthetic data. Language Resources and Evaluation. https://doi.org/10.1007/s10579-025-09827-2 |
Abstract: | Speech-to-speech translation (S2ST) aims to translate speech in one language into speech in another. Recent research focuses on direct S2ST models, which do not rely on an intermediate text representation, an approach that helps bridge the gap across multilingual communities. Toward this goal, creating parallel speech corpora is a challenging and expensive process, so datasets exist for only a limited set of languages, almost all of them high-resource. As a result, direct S2ST models have not been tested on low-resource languages. We therefore present an S2ST dataset for Hindi–English, a low-resource language pair, with raw speech and text sourced from the TED Talks platform. A cost-effective self-supervised pruning method, leveraging cross-lingual semantic similarity and word error rate (WER), is employed to enhance the quality of the developed dataset. Manual validation by human evaluators on sampled data further confirms the high quality of the dataset. Existing S2ST models are then evaluated through extensive experiments to establish a baseline for the Hindi–English language pair on the developed dataset. Pseudo-labeled data is also used for pre-training and data augmentation to improve the performance of the baseline models. The direct S2ST models are compared with cascade baseline S2ST models. The results indicate that the Transformer-based direct S2ST model achieves a translation accuracy of 15.86 BLEU after data augmentation, lagging the cascade model by 2.27 BLEU. The dataset will be open-sourced after acceptance of the paper. © The Author(s), under exclusive licence to Springer Nature B.V. 2025. |
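The abstract describes a pruning step that keeps a parallel pair only if the cross-lingual semantic similarity of its transcripts is high and the ASR word error rate is low. The sketch below illustrates that idea; it is not the authors' code, and the LaBSE embedder, the jiwer WER computation, and both thresholds are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of the described pruning filter (assumptions: LaBSE for
# cross-lingual similarity, jiwer for WER, made-up thresholds).
import numpy as np
from sentence_transformers import SentenceTransformer
from jiwer import wer

SIM_THRESHOLD = 0.75   # hypothetical cosine-similarity cutoff
WER_THRESHOLD = 0.30   # hypothetical WER cutoff

embedder = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(hindi_text: str, english_text: str,
              asr_hypothesis: str, reference: str) -> bool:
    """Return True if the Hindi-English pair passes both quality filters."""
    # Cross-lingual semantic similarity between source and target transcripts.
    hi_vec, en_vec = embedder.encode([hindi_text, english_text],
                                     convert_to_numpy=True)
    cos_sim = float(np.dot(hi_vec, en_vec) /
                    (np.linalg.norm(hi_vec) * np.linalg.norm(en_vec)))

    # Word error rate of the ASR hypothesis against the reference transcript,
    # used as a proxy for the quality of the speech/transcript alignment.
    error_rate = wer(reference, asr_hypothesis)

    return cos_sim >= SIM_THRESHOLD and error_rate <= WER_THRESHOLD
```

A dataset-building loop would then retain only the utterance pairs for which `keep_pair` returns True, discarding noisy or mistranslated segments before training.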
URI: | https://doi.org/10.1007/s10579-025-09827-2 |
| https://dspace.iiti.ac.in/handle/123456789/16055 |
ISSN: | 1574-020X |
Type of Material: | Journal Article |
Appears in Collections: | Department of Computer Science and Engineering |
Files in This Item:
There are no files associated with this item.