A scalable method for extracting features using a complex network from SNP sequences and clustering using the scalable Max of Min algorithm

Kansal, Achint Kumar; Tiwari, Aruna; Dwivedi, Rajesh

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/16160

Title:	A scalable method for extracting features using a complex network from SNP sequences and clustering using the scalable Max of Min algorithm
Authors:	Kansal, Achint Kumar; Tiwari, Aruna; Dwivedi, Rajesh
Keywords:	Apache spark;Feature extraction;Fuzzy c-means clustering;K-means clustering;S-MaxMin;Single nucleotide polymorphism
Issue Date:	2025
Publisher:	Springer Science and Business Media Deutschland GmbH
Citation:	Kansal, A. K., Tiwari, A., Ratnaparkhe, M., Dwivedi, R., & Jha, P. (2025). A scalable method for extracting features using a complex network from SNP sequences and clustering using the scalable Max of Min algorithm. Soft Computing. https://doi.org/10.1007/s00500-025-10622-y
Abstract:	Feature extraction is pivotal in bioinformatics because it converts variable-length genome sequences into fixed-length mathematical feature vectors, which serve as input to clustering algorithms that group similar sequences. One type of genome sequence is the Single Nucleotide Polymorphism (SNP), which categorises individuals into risk categories for plant diseases and predicts treatment outcomes more reliably. Extracting features from SNP sequences poses many challenges, including the extraction of similar features for distinct sequences and the lack of context-based features. Existing approaches also take an enormous amount of time to compute features for large numbers of SNP sequences. Therefore, a scalable feature extraction approach based on a complex network is proposed: it converts the genome sequence into a complex network and extracts the proposed relevant features from it. The time needed to extract these features is reduced drastically. The efficacy of the proposed approach is evaluated by applying the K-means and Fuzzy c-means algorithms to the proposed feature vector set; the results are promising when compared with other state-of-the-art alignment-free feature extraction approaches in terms of the Silhouette index and the Calinski–Harabasz index. Additionally, as most SNP datasets are unlabelled, determining the optimal number of clusters presents another significant challenge. A scalable algorithm called S-MaxMin, based on a distance metric, is proposed to find the optimal number of clusters. The S-MaxMin algorithm is tested on different datasets, including eight labelled benchmark datasets, for which it gives the same number of clusters as the actual number of classes. It is also tested on four unlabelled SNP datasets, for which it yields approximately the number of clusters that gives a high Silhouette index score.
The two proposed scalable approaches are integrated into a framework consisting of two modules. The first module is dedicated to feature extraction for SNP sequences, while the second module focuses on determining the optimal number of clusters. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
URI:	https://doi.org/10.1007/s00500-025-10622-y https://dspace.iiti.ac.in/handle/123456789/16160
ISSN:	1432-7643
Type of Material:	Journal Article
Appears in Collections:	Department of Computer Science and Engineering
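
The abstract above compares clusterings by their Silhouette index. As a rough illustration of what that index measures, the following minimal sketch computes the mean silhouette coefficient for a labelled set of fixed-length feature vectors. It is not the authors' implementation: the function name `silhouette_index`, the use of Euclidean distance, and the toy points are assumptions made for this example only.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance.

    Illustrative helper (hypothetical name), not code from the paper.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the cluster members other than i.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(               # separation: nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed for illustration): two tight groups far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(toy_points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_index(toy_points, [0, 1, 0, 1])   # groups cut across it
```

A labelling that matches the geometry scores close to 1, while one that cuts across the natural groups scores below 0, which is why a high Silhouette index is read as evidence of a well-chosen number of clusters.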

Files in This Item:

There are no files associated with this item.
