Scalable alignment-free feature extraction approach for genome data and their cluster analysis

Tripathi, Abhishek; Tiwari, Aruna; Chaudhari, Narendra S.; Ratnaparkhe, Milind Balkrishna; Dwivedi, Rajesh

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/16114

Full metadata record

DC Field	Value	Language
dc.contributor.author	Tripathi, Abhishek	en_US
dc.contributor.author	Tiwari, Aruna	en_US
dc.contributor.author	Chaudhari, Narendra S.	en_US
dc.contributor.author	Ratnaparkhe, Milind Balkrishna	en_US
dc.contributor.author	Dwivedi, Rajesh	en_US
dc.date.accessioned	2025-05-14T16:55:29Z	-
dc.date.available	2025-05-14T16:55:29Z	-
dc.date.issued	2025	-
dc.identifier.citation	Tripathi, A., Tiwari, A., Chaudhari, N. S., Ratnaparkhe, M., Bharill, N., Jha, P., & Dwivedi, R. (2025). Scalable alignment-free feature extraction approach for genome data and their cluster analysis. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-025-20864-5	en_US
dc.identifier.issn	1380-7501	-
dc.identifier.other	EID(2-s2.0-105003556879)	-
dc.identifier.uri	https://doi.org/10.1007/s11042-025-20864-5	-
dc.identifier.uri	https://dspace.iiti.ac.in/handle/123456789/16114	-
dc.description.abstract	Feature extraction is crucial in bioinformatics, as it converts genomic sequences into numerical feature vectors essential for machine learning algorithms, particularly in clustering, to identify the families of newly sequenced genomes. Traditional methods have relied on alignment-based techniques for clustering the genomic sequences. However, these methods are computationally intensive. In contrast, alignment-free methods are now more commonly used due to their reduced computational demands. Despite this, many alignment-free approaches may generate identical feature vectors for dissimilar sequences, as they focus solely on single nucleotide counts (1-gram) and their arrangement during feature extraction, often neglecting dinucleotide counts and their arrangement, which can degrade clustering performance. Furthermore, certain approaches include trinucleotide or higher-order compositions; they introduce high-dimensionality issues, resulting in inaccurate results. Additionally, some existing methods are not scalable and take substantial time to extract features from large genomic sequences. To address these issues, we proposed a novel 33-dimensional Scalable Alignment-Free Feature Vector (33d-SAFFV) approach to extract the significantly important features such as length of sequence, count of dinucleotides, and positional sum of dinucleotides, which produces a 33-dimensional feature vector. This approach leverages Apache Spark for scalability and efficient in-memory computations, making it suitable for large datasets. We evaluated the performance of our proposed method by applying the extracted 33-dimensional feature vectors to K-Means and Fuzzy C-Means (FCM) clustering algorithms. Performance is measured using the Silhouette Index (SI) and Calinski-Harabasz (CH) index. Experimental results on the gene sequences of four varieties of rice datasets and two varieties of soybean datasets show the effectiveness of the proposed 33d-SAFFV approach. In K-Means clustering with three clusters, the proposed method achieves an average increase of at least 8.66% in the SI value and 7.54% in the CH index. Similarly, in FCM clustering with three clusters, the proposed approach shows an average increase of at least 10.14% in the SI value and 9.88% in the CH index. These results clearly indicate that the proposed 33d-SAFFV approach outperforms other state-of-the-art methods. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.	en_US
dc.language.iso	en	en_US
dc.publisher	Springer	en_US
dc.source	Multimedia Tools and Applications	en_US
dc.subject	Apache Spark	en_US
dc.subject	Calinski-Harabasz index	en_US
dc.subject	Feature extraction	en_US
dc.subject	Feature vector	en_US
dc.subject	Fuzzy C-means	en_US
dc.subject	K-means	en_US
dc.subject	Silhouette index	en_US
dc.title	Scalable alignment-free feature extraction approach for genome data and their cluster analysis	en_US
dc.type	Journal Article	en_US
Appears in Collections:	Department of Computer Science and Engineering
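
The abstract describes a 33-dimensional feature vector built from the sequence length, the counts of the 16 dinucleotides, and the positional sums of the 16 dinucleotides (1 + 16 + 16 = 33). The following is a minimal single-machine sketch of that idea in plain Python, not the authors' Spark implementation. The function name feature_vector_33d, the use of overlapping windows, the 1-based positions, the lexicographic ordering of dinucleotides, and the skipping of windows containing non-ACGT symbols are all assumptions made for illustration; the record does not specify them.

```python
from itertools import product
from typing import List

# All 16 dinucleotides over the DNA alphabet, in lexicographic order
# (the ordering is an assumption; the record does not specify one).
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def feature_vector_33d(sequence: str) -> List[int]:
    """Return [length] + 16 dinucleotide counts + 16 positional sums.

    Assumptions (not taken from the paper): windows overlap, positions
    are 1-based start indices, and any window containing a symbol
    outside A, C, G, T is skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    position_sums = dict.fromkeys(DINUCLEOTIDES, 0)

    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            position_sums[pair] += i + 1  # 1-based start position

    return (
        [len(seq)]
        + [counts[d] for d in DINUCLEOTIDES]
        + [position_sums[d] for d in DINUCLEOTIDES]
    )


if __name__ == "__main__":
    # "ACGT" and "TGCA" have identical single-nucleotide counts, yet
    # their dinucleotide features differ.
    print(feature_vector_33d("ACGT"))
    print(feature_vector_33d("TGCA"))
```

The two example sequences contain one each of A, C, G, and T, so a feature vector based only on single-nucleotide counts could not separate them, whereas the dinucleotide counts and positional sums do. The resulting vectors could then be passed to a clustering algorithm such as K-Means.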

Files in This Item:

There are no files associated with this item.
