Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/16114
Full metadata record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Tripathi, Abhishek | en_US |
| dc.contributor.author | Tiwari, Aruna | en_US |
| dc.contributor.author | Chaudhari, Narendra S. | en_US |
| dc.contributor.author | Ratnaparkhe, Milind Balkrishna | en_US |
| dc.contributor.author | Dwivedi, Rajesh | en_US |
| dc.date.accessioned | 2025-05-14T16:55:29Z | - |
| dc.date.available | 2025-05-14T16:55:29Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Tripathi, A., Tiwari, A., Chaudhari, N. S., Ratnaparkhe, M., Bharill, N., Jha, P., & Dwivedi, R. (2025). Scalable alignment-free feature extraction approach for genome data and their cluster analysis. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-025-20864-5 | en_US |
| dc.identifier.issn | 1380-7501 | - |
| dc.identifier.other | EID(2-s2.0-105003556879) | - |
| dc.identifier.uri | https://doi.org/10.1007/s11042-025-20864-5 | - |
| dc.identifier.uri | https://dspace.iiti.ac.in/handle/123456789/16114 | - |
| dc.description.abstract | Feature extraction is crucial in bioinformatics, as it converts genomic sequences into numerical feature vectors essential for machine learning algorithms, particularly in clustering, to identify the families of newly sequenced genomes. Traditional methods have relied on alignment-based techniques for clustering the genomic sequences. However, these methods are computationally intensive. In contrast, alignment-free methods are now more commonly used due to their reduced computational demands. Despite this, many alignment-free approaches may generate identical feature vectors for dissimilar sequences, as they focus solely on single nucleotide counts (1-gram) and their arrangement during feature extraction, often neglecting dinucleotide counts and their arrangement, which can degrade clustering performance. Furthermore, certain approaches include trinucleotide or higher-order compositions; they introduce high-dimensionality issues, resulting in inaccurate results. Additionally, some existing methods are not scalable and take substantial time to extract features from large genomic sequences. To address these issues, we proposed a novel 33-dimensional Scalable Alignment-Free Feature Vector (33d-SAFFV) approach to extract the significantly important features such as length of sequence, count of dinucleotides, and positional sum of dinucleotides, which produces a 33-dimensional feature vector. This approach leverages Apache Spark for scalability and efficient in-memory computations, making it suitable for large datasets. We evaluated the performance of our proposed method by applying the extracted 33-dimensional feature vectors to K-Means and Fuzzy C-Means (FCM) clustering algorithms. Performance is measured using the Silhouette Index (SI) and Calinski-Harabasz (CH) index. Experimental results on the gene sequences of four varieties of rice datasets and two varieties of soybean datasets show the effectiveness of the proposed 33d-SAFFV approach. In K-Means clustering with three clusters, the proposed method achieves an average increase of at least 8.66% in the SI value and 7.54% in the CH index. Similarly, in FCM clustering with three clusters, the proposed approach shows an average increase of at least 10.14% in the SI value and 9.88% in the CH index. These results clearly indicate that the proposed 33d-SAFFV approach outperforms other state-of-the-art methods. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Springer | en_US |
| dc.source | Multimedia Tools and Applications | en_US |
| dc.subject | Apache Spark | en_US |
| dc.subject | Calinski-Harabasz index | en_US |
| dc.subject | Feature extraction | en_US |
| dc.subject | Feature vector | en_US |
| dc.subject | Fuzzy C-means | en_US |
| dc.subject | K-means | en_US |
| dc.subject | Silhouette index | en_US |
| dc.title | Scalable alignment-free feature extraction approach for genome data and their cluster analysis | en_US |
| dc.type | Journal Article | en_US |
Appears in Collections: Department of Computer Science and Engineering
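
The abstract names three feature groups: sequence length, dinucleotide counts, and positional sums of dinucleotides. The sketch below shows one way these could add up to 33 dimensions, assuming a layout of 1 length value, 16 counts, and 16 positional sums. That layout, the 1-based positions, the handling of ambiguous bases, and the function name `extract_features` are illustrative assumptions and are not taken from the paper. The sketch runs on a single machine and does not reproduce the Apache Spark pipeline the abstract describes.

```python
from itertools import product

# All 16 dinucleotides over the DNA alphabet, in a fixed order (AA, AC, ..., TT).
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def extract_features(sequence):
    """Return [length] + 16 dinucleotide counts + 16 positional sums.

    Hypothetical layout: the paper's exact definition may differ.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    position_sums = dict.fromkeys(DINUCLEOTIDES, 0)
    # Slide a window of width 2 over the sequence.
    # Positions are counted from 1 (an assumption).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip windows with ambiguous bases such as N
            counts[pair] += 1
            position_sums[pair] += i + 1
    return (
        [len(seq)]
        + [counts[d] for d in DINUCLEOTIDES]
        + [position_sums[d] for d in DINUCLEOTIDES]
    )


features = extract_features("ACGTAC")
print(len(features))
```

Under these assumptions, two sequences with identical single-nucleotide counts, such as "ACGT" and "TGCA", still receive different vectors because their dinucleotides differ, which is the limitation of 1-gram methods that the abstract points out.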
