A Novel Feature Extraction Approach for the Clustering and Classification of Genome Sequences

Dwivedi, Rajesh; Tiwari, Aruna; Tripathi, Abhishek

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/13207

Title:	A Novel Feature Extraction Approach for the Clustering and Classification of Genome Sequences
Authors:	Dwivedi, Rajesh; Tiwari, Aruna; Tripathi, Abhishek
Keywords:	Classification;Clustering;Feature extraction;Genome sequences;Single nucleotide polymorphism
Issue Date:	2023
Publisher:	Institute of Electrical and Electronics Engineers Inc.
Citation:	Dwivedi, R., Tiwari, A., Bharill, N., Ratnaparkhe, M., Tripathi, A., & Jha, P. (2023). A Novel Feature Extraction Approach for the Clustering and Classification of Genome Sequences. 2023 IEEE Symposium Series on Computational Intelligence, SSCI 2023. Scopus. https://doi.org/10.1109/SSCI52147.2023.10372047
Abstract:	Feature extraction is essential in bioinformatics because it transforms genome sequences into the feature vectors required for data mining tasks such as classification and clustering. These tasks make it possible to assign a newly sequenced genome to a known family. A variety of feature extraction strategies are now available for genome data. However, several existing algorithms do not extract context-sensitive key properties, and some extract features that cannot distinguish between two dissimilar sequences. In addition, the efficacy of existing feature extraction techniques is evaluated on either supervised or unsupervised learning models, but not on both. An efficient technique that extracts highly relevant features from genome sequences is therefore required. This paper proposes a novel feature extraction method based on the length of the sequence, the frequency of nucleotide bases, the modified positional sum of nucleotide bases, the distribution of nucleotide bases, and the entropy of the sequence. Together these yield a 14-dimensional fixed-length numeric vector that describes each genome sequence uniquely. The method is assessed by applying the extracted features to both supervised and unsupervised machine learning approaches. The experimental results show that the proposed strategy is highly effective and efficient for clustering and classifying novel genome sequences into recognized genome classes, as confirmed by a comparison with the standard state-of-the-art method. © 2023 IEEE.
URI:	https://doi.org/10.1109/SSCI52147.2023.10372047; https://dspace.iiti.ac.in/handle/123456789/13207
ISBN:	978-1665430654
Type of Material:	Conference Paper
Appears in Collections:	Department of Computer Science and Engineering
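
Illustrative sketch (not part of the record): the abstract names five feature families but does not define them, so the Python below is only a guess at how a 14-dimensional vector of that kind might be assembled. The 1 + 4 + 4 + 4 + 1 split, the normalisations, the use of positional variance for "distribution", and the function name `extract_features` are all assumptions made for illustration. In particular, the paper's "modified positional sum" is not specified here and is replaced by a plain normalised positional sum.

```python
import math
from collections import Counter
from typing import List

BASES = "ACGT"


def extract_features(seq: str) -> List[float]:
    """Hypothetical 14-dimensional descriptor; NOT the authors' definition."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    counts = Counter(seq)

    # 1 feature: sequence length.
    features = [float(n)]

    # 4 features: relative frequency of each base.
    # Symbols outside ACGT (e.g. N) count toward the length only.
    freqs = [counts.get(b, 0) / n for b in BASES]
    features.extend(freqs)

    # Collect the 1-based positions at which each base occurs.
    positions = {b: [] for b in BASES}
    for i, ch in enumerate(seq, start=1):
        if ch in positions:
            positions[ch].append(i)

    # 4 features: positional sum per base, divided by 1 + 2 + ... + n
    # (assumed normalisation; the paper's "modified" variant is unknown).
    total_pos = n * (n + 1) / 2
    features.extend(sum(positions[b]) / total_pos for b in BASES)

    # 4 features: spread of each base along the sequence, taken here as
    # the variance of its positions scaled by n^2 (assumption).
    for b in BASES:
        p = positions[b]
        if p:
            mean = sum(p) / len(p)
            var = sum((x - mean) ** 2 for x in p) / len(p)
        else:
            var = 0.0
        features.append(var / (n * n))

    # 1 feature: Shannon entropy of the base frequencies, in bits.
    features.append(-sum(f * math.log2(f) for f in freqs if f > 0))

    return features


vector = extract_features("ACGTTGCAAC")
print(len(vector))
```

Because the vector has a fixed length regardless of sequence length, it could in principle be passed unchanged to both a classifier and a clustering algorithm, which is the dual evaluation the abstract describes.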
