A novel apache spark-based 14-dimensional scalable feature extraction approach for the clustering of genomics data

Dwivedi, Rajesh; Tiwari, Aruna; Ratnaparkhe, Milind Balkrishna; Mogre, Parul; Gadge, Pranjal; Jagadeesh, Kethavath

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/12614

Full metadata record

DC Field	Value	Language
dc.contributor.author	Dwivedi, Rajesh	en_US
dc.contributor.author	Tiwari, Aruna	en_US
dc.contributor.author	Ratnaparkhe, Milind Balkrishna	en_US
dc.contributor.author	Mogre, Parul	en_US
dc.contributor.author	Gadge, Pranjal	en_US
dc.contributor.author	Jagadeesh, Kethavath	en_US
dc.date.accessioned	2023-12-14T12:37:55Z	-
dc.date.available	2023-12-14T12:37:55Z	-
dc.date.issued	2023	-
dc.identifier.citation	Dwivedi, R., Tiwari, A., Bharill, N., Ratnaparkhe, M., Mogre, P., Gadge, P., & Jagadeesh, K. (2023). A novel apache spark-based 14-dimensional scalable feature extraction approach for the clustering of genomics data. Journal of Supercomputing. Scopus. https://doi.org/10.1007/s11227-023-05602-8	en_US
dc.identifier.issn	0920-8542	-
dc.identifier.other	EID(2-s2.0-85170058442)	-
dc.identifier.uri	https://doi.org/10.1007/s11227-023-05602-8	-
dc.identifier.uri	https://dspace.iiti.ac.in/handle/123456789/12614	-
dc.description.abstract	Feature extraction is essential in bioinformatics because it transforms genomics sequences into feature vectors, which are needed for clustering to discover the family of a newly sequenced genome. Most existing feature extraction methods extract similar features for dissimilar sequences, do not extract context-based features, and are unable to handle millions of genome sequences because they are not scalable. To tackle these challenges, we propose an efficient Apache Spark-based scalable feature extraction approach that extracts significantly important features from millions of genome sequences in less computational time. The proposed approach extracts features in five stages, i.e., based on the length of the sequence, the frequency of nucleotide bases, the pattern organization of nucleotide bases, the distribution of nucleotide bases, and the entropy of the sequence, to generate a fixed-length numeric vector of only 14 dimensions that describes each genome sequence uniquely. The proposed approach efficiently extracts context-based features in terms of pattern organization and distribution, and removes the drawback of extracting the same features for dissimilar sequences by using a novel power method. The features extracted with the proposed scalable feature extraction approach are applied to k-means and fuzzy c-means clustering techniques. The experimental results show that the proposed method is highly successful and efficient in terms of computing time in comparison to other state-of-the-art approaches. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.	en_US
dc.language.iso	en	en_US
dc.publisher	Springer	en_US
dc.source	Journal of Supercomputing	en_US
dc.subject	Apache Spark	en_US
dc.subject	Feature extraction	en_US
dc.subject	Fuzzy c-means	en_US
dc.subject	Genome sequences	en_US
dc.subject	k-means	en_US
dc.title	A novel apache spark-based 14-dimensional scalable feature extraction approach for the clustering of genomics data	en_US
dc.type	Journal Article	en_US
Appears in Collections:	Department of Computer Science and Engineering
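
Illustrative note: the abstract describes mapping each genome sequence to a fixed-length numeric vector built from its length, base frequencies, base distribution, and entropy. The following is a minimal Python sketch of that general idea only. It is not the authors' method: the function name `sketch_features`, the choice of mean base position as a stand-in for "distribution", and the resulting 10-dimensional layout are all assumptions made for illustration, and the paper's pattern-organization stage, its power method, and its actual 14 dimensions are not reproduced because the abstract does not specify them.

```python
import math
from collections import Counter

BASES = "ACGT"  # nucleotide alphabet assumed for this sketch


def sketch_features(seq: str) -> list:
    """Return an illustrative fixed-length vector for one sequence.

    Hypothetical layout (10 values, NOT the paper's 14 features):
    [length, count_A, count_C, count_G, count_T,
     mean_pos_A, mean_pos_C, mean_pos_G, mean_pos_T, entropy_bits]
    """
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)

    # Length of the sequence.
    features = [float(n)]

    # Frequency (raw count) of each nucleotide base.
    features += [float(counts.get(b, 0)) for b in BASES]

    # Assumed stand-in for "distribution": mean 1-based position of each base.
    for b in BASES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == b]
        features.append(sum(positions) / len(positions) if positions else 0.0)

    # Shannon entropy (in bits) of the base composition.
    entropy = 0.0
    for b in BASES:
        c = counts.get(b, 0)
        if c:
            p = c / n
            entropy -= p * math.log2(p)
    features.append(entropy)

    return features


if __name__ == "__main__":
    print(sketch_features("ACGTACGGTA"))
```

Because a function of this kind depends only on a single sequence, it could in principle be applied independently to each record of a distributed collection, which is the sort of per-record workload that Apache Spark is designed to parallelize. How the published approach actually partitions and processes the data is described in the article itself, not here.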

Files in This Item:

There are no files associated with this item.
