Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/17052
Full metadata record
DC Field: Value (Language)
dc.contributor.author: Maurya, Chandresh Kumar (en_US)
dc.date.accessioned: 2025-10-31T17:40:59Z
dc.date.available: 2025-10-31T17:40:59Z
dc.date.issued: 2025
dc.identifier.citation: Gupta, M., Dutta, M., & Maurya, C. K. (2025). Enhancing Hindi–English Direct Speech-to-Speech Translation with Clustering-Aided Cross-Contrastive Self-Supervised Speech Representation Learning (Vol. 9, Issue 1). https://doi.org/10.1007/s41314-025-00078-1 (en_US)
dc.identifier.other: EID(2-s2.0-105018680129)
dc.identifier.uri: https://dx.doi.org/10.1007/s41314-025-00078-1
dc.identifier.uri: https://dspace.iiti.ac.in:8080/jspui/handle/123456789/17052
dc.description.abstract: Direct speech-to-speech translation (S2ST) is an important tool for bridging communication gaps. Direct S2ST translates speech from one language to another without relying on intermediate text, making it particularly useful for languages that are primarily spoken rather than written. However, the performance of direct S2ST models on low-resource languages remains limited due to the scarcity or complete absence of the parallel speech data required for training. Pretraining and finetuning are widely used techniques for leveraging unsupervised speech data to improve model performance. In this work, we employ a cluster-aided, cross-contrastive self-supervised learning (SSL)-based speech representation model as the pretrained encoder, combined with a multilingual BART (mBART) decoder. The resulting finetuned model outperforms a baseline that uses a contrastive-loss-based SSL model as the encoder. The proposed models improve the BLEU score by 4.14% for Hindi→English and 8.2% for English→Hindi compared to their respective baseline models. To train the English→Hindi model, we trained a unit-vocoder on speech quantized using ensemble clustering instead of standard clustering. The resulting unit-vocoder outperformed the one trained on speech quantized using standard k-means on all evaluation metrics. (en_US)
dc.language.iso: en (en_US)
dc.publisher: Springer (en_US)
dc.subject: Cluster ensembling (en_US)
dc.subject: Cross-contrastive speech representation learning (en_US)
dc.subject: Direct speech-to-speech translation (en_US)
dc.title: Enhancing Hindi–English Direct Speech-to-Speech Translation with Clustering-Aided Cross-Contrastive Self-Supervised Speech Representation Learning (en_US)
dc.type: Journal Article (en_US)
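
The abstract above describes quantizing speech into discrete units with ensemble clustering rather than a single k-means run before training the unit-vocoder. The sketch below illustrates one common cluster-ensemble recipe (evidence accumulation: a co-association matrix built from several k-means runs, followed by a consensus partition). The function name, feature shapes, and the agglomerative consensus step are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of cluster-ensemble quantization of frame-level SSL speech
# features into discrete unit IDs. Assumptions: features are (n_frames, dim)
# encoder outputs; consensus is derived via co-association + agglomerative
# clustering. This is not the authors' exact method.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering


def ensemble_quantize(features, n_units=100, n_runs=5, seed=0):
    """Return one discrete unit ID per frame using an ensemble of k-means runs."""
    n_frames = features.shape[0]
    # Evidence accumulation: count how often each pair of frames shares a cluster.
    co_assoc = np.zeros((n_frames, n_frames))
    for r in range(n_runs):
        labels = KMeans(
            n_clusters=n_units, random_state=seed + r, n_init=10
        ).fit_predict(features)
        co_assoc += (labels[:, None] == labels[None, :]).astype(float)
    co_assoc /= n_runs
    # Consensus partition: cluster the co-association similarities
    # (converted to distances) into the final unit vocabulary.
    consensus = AgglomerativeClustering(
        n_clusters=n_units, metric="precomputed", linkage="average"
    ).fit_predict(1.0 - co_assoc)
    return consensus


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(500, 64))   # stand-in for SSL frame features
    units = ensemble_quantize(feats, n_units=10, n_runs=3)
    print(units[:20])                    # unit IDs that would feed a unit-vocoder
```

Note that the co-association matrix grows quadratically with the number of frames, so a real pipeline would subsample frames or use a cheaper consensus scheme; the sketch only shows the basic idea of combining multiple clusterings instead of a single k-means quantizer.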
Appears in Collections: Department of Computer Science and Engineering

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
