Scalable alignment-free feature extraction approach for genome data and their cluster analysis

Tripathi, Abhishek; Tiwari, Aruna; Chaudhari, Narendra S.; Ratnaparkhe, Milind Balkrishna; Dwivedi, Rajesh

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/16114

Title:	Scalable alignment-free feature extraction approach for genome data and their cluster analysis
Authors:	Tripathi, Abhishek; Tiwari, Aruna; Chaudhari, Narendra S.; Ratnaparkhe, Milind Balkrishna; Dwivedi, Rajesh
Keywords:	Apache Spark;Calinski-Harabasz index;Feature extraction;Feature vector;Fuzzy C-means;K-means;Silhouette index
Issue Date:	2025
Publisher:	Springer
Citation:	Tripathi, A., Tiwari, A., Chaudhari, N. S., Ratnaparkhe, M., Bharill, N., Jha, P., & Dwivedi, R. (2025). Scalable alignment-free feature extraction approach for genome data and their cluster analysis. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-025-20864-5
Abstract:	Feature extraction is crucial in bioinformatics because it converts genomic sequences into the numerical feature vectors that machine learning algorithms require, particularly clustering algorithms used to identify the families of newly sequenced genomes. Traditional methods have relied on alignment-based techniques for clustering genomic sequences, but these methods are computationally intensive. Alignment-free methods are now more commonly used because of their lower computational demands. However, many alignment-free approaches may generate identical feature vectors for dissimilar sequences, because they consider only single-nucleotide (1-gram) counts and their arrangement during feature extraction and often neglect dinucleotide counts and their arrangement, which can degrade clustering performance. Furthermore, when certain approaches include trinucleotide or higher-order compositions, they introduce high-dimensionality issues, resulting in inaccurate results. Additionally, some existing methods are not scalable and take substantial time to extract features from large genomic sequences. To address these issues, we propose a novel 33-dimensional Scalable Alignment-Free Feature Vector (33d-SAFFV) approach that extracts significant features such as the length of the sequence, the counts of dinucleotides, and the positional sums of dinucleotides, producing a 33-dimensional feature vector. The approach leverages Apache Spark for scalability and efficient in-memory computation, making it suitable for large datasets. We evaluated the proposed method by applying the extracted 33-dimensional feature vectors to the K-Means and Fuzzy C-Means (FCM) clustering algorithms. Performance is measured using the Silhouette Index (SI) and the Calinski-Harabasz (CH) index. Experimental results on gene sequences from datasets of four rice varieties and two soybean varieties show the effectiveness of the proposed 33d-SAFFV approach. In K-Means clustering with three clusters, the proposed method achieves an average increase of at least 8.66% in the SI value and 7.54% in the CH index. Similarly, in FCM clustering with three clusters, the proposed approach shows an average increase of at least 10.14% in the SI value and 9.88% in the CH index. These results indicate that the proposed 33d-SAFFV approach outperforms other state-of-the-art methods. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
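The abstract names three feature groups: sequence length, dinucleotide counts, and positional sums of dinucleotides. A minimal Python sketch of one possible reading follows; it is not the authors' implementation. The function name `extract_features`, the ordering of the 16 dinucleotides, the use of overlapping windows, the 1-based start positions, and the skipping of pairs that contain ambiguous bases are all assumptions made for illustration. Under that reading the vector has 1 + 16 + 16 = 33 entries.

```python
from itertools import product

# All 16 dinucleotides over the DNA alphabet, in a fixed (assumed) order.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def extract_features(sequence):
    """Return [length, 16 dinucleotide counts, 16 positional sums]."""
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    position_sums = dict.fromkeys(DINUCLEOTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping dinucleotides).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with ambiguous bases such as N
            counts[pair] += 1
            position_sums[pair] += i + 1  # assumption: 1-based start position
    return ([len(seq)]
            + [counts[d] for d in DINUCLEOTIDES]
            + [position_sums[d] for d in DINUCLEOTIDES])
```

For example, "AACC" and "ACAC" have identical single-nucleotide counts, yet `extract_features` returns different vectors for them because their dinucleotide counts differ. To process many sequences in Apache Spark, one option would be to map this function over an RDD of sequences with PySpark's `rdd.map(extract_features)`; how the authors actually distribute the work is not stated in the abstract.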
URI:	https://doi.org/10.1007/s11042-025-20864-5 https://dspace.iiti.ac.in/handle/123456789/16114
ISSN:	1380-7501
Type of Material:	Journal Article
Appears in Collections:	Department of Computer Science and Engineering
