Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/17052
Full metadata record
DC Field: Value (Language)
dc.contributor.author: Maurya, Chandresh Kumar (en_US)
dc.date.accessioned: 2025-10-31T17:40:59Z
dc.date.available: 2025-10-31T17:40:59Z
dc.date.issued: 2025
dc.identifier.citation: Gupta, M., Dutta, M., & Maurya, C. K. (2025). Enhancing Hindi–English Direct Speech-to-Speech Translation with Clustering-Aided Cross-Contrastive Self-Supervised Speech Representation Learning (Vol. 9, Issue 1). https://doi.org/10.1007/s41314-025-00078-1 (en_US)
dc.identifier.other: EID(2-s2.0-105018680129)
dc.identifier.uri: https://dx.doi.org/10.1007/s41314-025-00078-1
dc.identifier.uri: https://dspace.iiti.ac.in:8080/jspui/handle/123456789/17052
dc.description.abstract: Direct speech-to-speech translation (S2ST) is an important tool for bridging communication gaps. Direct S2ST translates speech from one language to another without relying on intermediate text, making it particularly useful for languages that are primarily spoken rather than written. However, the performance of direct S2ST models on low-resource languages remains limited due to the scarcity or complete absence of the parallel speech data required for training. Pretraining and finetuning are widely used techniques for leveraging unsupervised speech data to improve model performance. In this work, we employ a cluster-aided, cross-contrastive self-supervised learning (SSL)-based speech representation model as the pretrained encoder, combined with a multilingual BART (mBART) decoder. The resulting finetuned model outperforms a baseline that uses a contrastive-loss-based SSL model as the encoder. The proposed models improve the BLEU score by 4.14% for Hindi→English and 8.2% for English→Hindi compared to their respective baseline models. To train the English→Hindi model, we trained a unit-vocoder on speech quantized using ensemble clustering instead of standard clustering. The resulting unit-vocoder outperformed the one trained on speech quantized using standard k-means on all evaluation metrics. (en_US)
dc.language.iso: en (en_US)
dc.publisher: Springer (en_US)
dc.subject: Cluster ensembling (en_US)
dc.subject: Cross-contrastive speech representation learning (en_US)
dc.subject: Direct speech-to-speech translation (en_US)
dc.title: Enhancing Hindi–English Direct Speech-to-Speech Translation with Clustering-Aided Cross-Contrastive Self-Supervised Speech Representation Learning (en_US)
dc.type: Journal Article (en_US)
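
The abstract above describes quantizing speech into discrete units with ensemble clustering rather than a single k-means run before training the unit-vocoder. The sketch below illustrates one common cluster-ensemble recipe (evidence accumulation: a co-association matrix built from several k-means runs, followed by a consensus partition). The function name, feature shapes, and the agglomerative consensus step are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of cluster-ensemble quantization of frame-level SSL speech
# features into discrete unit IDs. Assumptions: features are (n_frames, dim)
# encoder outputs; consensus is derived via co-association + agglomerative
# clustering. This is not the authors' exact method.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering


def ensemble_quantize(features, n_units=100, n_runs=5, seed=0):
    """Return one discrete unit ID per frame using an ensemble of k-means runs."""
    n_frames = features.shape[0]
    # Evidence accumulation: count how often each pair of frames shares a cluster.
    co_assoc = np.zeros((n_frames, n_frames))
    for r in range(n_runs):
        labels = KMeans(
            n_clusters=n_units, random_state=seed + r, n_init=10
        ).fit_predict(features)
        co_assoc += (labels[:, None] == labels[None, :]).astype(float)
    co_assoc /= n_runs
    # Consensus partition: cluster the co-association similarities
    # (converted to distances) into the final unit vocabulary.
    consensus = AgglomerativeClustering(
        n_clusters=n_units, metric="precomputed", linkage="average"
    ).fit_predict(1.0 - co_assoc)
    return consensus


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(500, 64))   # stand-in for SSL frame features
    units = ensemble_quantize(feats, n_units=10, n_runs=3)
    print(units[:20])                    # unit IDs that would feed a unit-vocoder
```

Note that the co-association matrix grows quadratically with the number of frames, so a real pipeline would subsample frames or use a cheaper consensus scheme; the sketch only shows the basic idea of combining multiple clusterings instead of a single k-means quantizer.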
Appears in Collections: Department of Computer Science and Engineering

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
