Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis

Jha, Preeti; Tiwari, Aruna; Bharill, Neha; Mounika, Mukkamalla

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/4819

Full metadata record

DC Field	Value	Language
dc.contributor.author	Jha, Preeti	en_US
dc.contributor.author	Tiwari, Aruna	en_US
dc.contributor.author	Bharill, Neha	en_US
dc.contributor.author	Mounika, Mukkamalla	en_US
dc.date.accessioned	2022-03-17T01:00:00Z	-
dc.date.accessioned	2022-03-17T15:35:37Z	-
dc.date.available	2022-03-17T01:00:00Z	-
dc.date.available	2022-03-17T15:35:37Z	-
dc.date.issued	2021	-
dc.identifier.citation	Jha, P., Tiwari, A., Bharill, N., Ratnaparkhe, M., Mounika, M., & Nagendra, N. (2021). Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis. Computational Biology and Chemistry, 92. doi:10.1016/j.compbiolchem.2021.107454	en_US
dc.identifier.issn	1476-9271	-
dc.identifier.other	EID(2-s2.0-85102073478)	-
dc.identifier.uri	https://doi.org/10.1016/j.compbiolchem.2021.107454	-
dc.identifier.uri	https://dspace.iiti.ac.in/handle/123456789/4819	-
dc.description.abstract	This paper introduces a kernel-based fuzzy clustering approach for non-linearly separable problems. It applies a Radial Basis Function (RBF) kernel, which maps the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomic data is extremely challenging for bioinformatics researchers performing genome analysis. To support investigations in bioinformatics, specifically genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering Single Nucleotide Polymorphism (SNP) sequences. The paper proposes Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which internally uses another proposed algorithm, Kernelized Scalable Literal Fuzzy c-Means (KSLFCM). Both approaches are fully adapted to the Apache Spark cluster framework through a localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we propose a preprocessing approach that generates numeric feature vectors for huge SNP sequences and is made scalable by executing it on an Apache Spark cluster. It is applied to real-world SNP datasets taken from open internet repositories for two plant species, soybean and rice. Comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows significant improvement of the proposed algorithm in terms of time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments are performed on various SNP datasets to show the effectiveness of the proposed KSRSIO-FCM in comparison with the proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM and SLFCM. © 2021 Elsevier Ltd	en_US
dc.language.iso	en	en_US
dc.publisher	Elsevier Ltd	en_US
dc.source	Computational Biology and Chemistry	en_US
dc.subject	Bioinformatics	en_US
dc.subject	Fuzzy clustering	en_US
dc.subject	Fuzzy systems	en_US
dc.subject	Iterative methods	en_US
dc.subject	Nucleotides	en_US
dc.subject	Polymorphism	en_US
dc.subject	High-dimensional feature space	en_US
dc.subject	Iterative Optimization	en_US
dc.subject	Kernelized fuzzy clustering	en_US
dc.subject	Preprocessing approaches	en_US
dc.subject	Radial Basis Function (RBF)	en_US
dc.subject	Resilient distributed dataset	en_US
dc.subject	Single-nucleotide polymorphisms	en_US
dc.subject	Time and space complexity	en_US
dc.subject	Clustering algorithms	en_US
dc.subject	algorithm	en_US
dc.subject	biology	en_US
dc.subject	cluster analysis	en_US
dc.subject	fuzzy logic	en_US
dc.subject	genetic database	en_US
dc.subject	genetics	en_US
dc.subject	human	en_US
dc.subject	single nucleotide polymorphism	en_US
dc.subject	Algorithms	en_US
dc.subject	Cluster Analysis	en_US
dc.subject	Computational Biology	en_US
dc.subject	Databases, Genetic	en_US
dc.subject	Fuzzy Logic	en_US
dc.subject	Humans	en_US
dc.subject	Polymorphism, Single Nucleotide	en_US
dc.title	Apache Spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis	en_US
dc.type	Journal Article	en_US
Appears in Collections:	Department of Computer Science and Engineering
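
Illustrative note on the abstract: the record does not include the paper's equations, so the following is only a minimal sketch of the generic, textbook idea behind RBF-kernelized fuzzy c-means, not the authors' KSLFCM or KSRSIO-FCM and not a Spark implementation. The function names, the bandwidth `sigma`, and the fuzzifier `m` are assumptions chosen for illustration. It relies on the standard identity that, for an RBF kernel K, the squared feature-space distance is K(x,x) + K(v,v) - 2K(x,v) = 2(1 - K(x,v)).

```python
import math

def rbf_kernel(x, v, sigma=1.0):
    """RBF (Gaussian) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_distance_sq(x, v, sigma=1.0):
    """Squared distance in the RBF-induced feature space: 2 * (1 - K(x, v))."""
    return 2.0 * (1.0 - rbf_kernel(x, v, sigma))

def fuzzy_memberships(x, centers, m=2.0, sigma=1.0, eps=1e-12):
    """Fuzzy membership of point x in each cluster (generic kernel FCM rule).

    m > 1 is the fuzzifier; eps guards against division by zero when x
    coincides with a center. Both defaults are illustrative assumptions.
    """
    dists = [max(kernel_distance_sq(x, v, sigma), eps) for v in centers]
    weights = [d ** (-1.0 / (m - 1.0)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical usage: a 2-D point and two made-up cluster centers.
memberships = fuzzy_memberships([0.1, 0.2], [[0.0, 0.0], [3.0, 3.0]])
```

The memberships of a point sum to one, and the nearer center receives the larger share. In a Spark setting, a per-point computation like this would typically run inside a map over RDD partitions, but how the paper distributes the work is not stated in this record.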
