Design of scalable fuzzy clustering algorithms and its application to huge genomics data

Jha, Preeti

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/3592

Title:	Design of scalable fuzzy clustering algorithms and its application to huge genomics data
Authors:	Jha, Preeti
Supervisors:	Tiwari, Aruna; Ratnaparkhe, Milind; Bharill, Neha
Keywords:	Computer Science and Engineering
Issue Date:	23-Dec-2021
Publisher:	Department of Computer Science and Engineering, IIT Indore
Series/Report no.:	TH418
Abstract:	Clustering is one of the most popular methods used for exploratory data analysis. The need for clustering arises in many real-life problems, such as gene analysis, image processing, text organization, community detection, disease diagnosis, and protein categorization. In the bioinformatics domain, an enormous number of new genome sequences is being produced at a great pace. Hence, the clustering of genome sequences has entered the era of Big Data in bioinformatics. Clustering genome sequences in real life is a major challenge because a sequence can belong to multiple clusters. So, there is a need for a clustering algorithm that can assign a data sample to more than one cluster. Fuzzy clustering is one of the most widely used methods for handling such real-life problems. The principal advantage of fuzzy clustering is that the membership degrees express how ambiguously a data sample belongs to a cluster. However, many aspects of the design of fuzzy clustering need to be addressed to improve its overall performance while preserving clustering quality on Big Data. This thesis mainly investigates the design and development of fuzzy-based scalable clustering algorithms and feature extraction techniques for handling huge genome data using Apache Spark. To handle Big Data, novel scalable fuzzy clustering approaches are designed. First, we have proposed scalable kernelized fuzzy clustering algorithms for handling Big Data. These scalable kernelized fuzzy clustering algorithms are designed to deal with non-linearly separable problems by applying a Radial Basis Function (RBF) kernel, which maps the input data space non-linearly into a high-dimensional feature space. The proposed scalable kernelized fuzzy clustering algorithms are implemented on an Apache Spark cluster, whose in-memory cluster computing technique enables efficient clustering of Big Data.
The proposed algorithms remove the need to load the entire data into memory all at once. This results in a significant reduction in run-time. To further improve the cluster quality, we have proposed a novel scalable incremental fuzzy consensus clustering algorithm, which aims to find a single partition of the data that agrees as much as possible with existing basic partitions/segments. The scalable incremental fuzzy consensus clustering aims to identify a soft consensus partition with overlapping clusters from a set of fuzzy partitions. It has been implemented on the Apache Spark cluster framework, a distributed data stream environment for handling Big Data, by treating the data as a set of subsets that are processed incrementally. The scalable incremental fuzzy consensus clustering facilitates efficient Big Data clustering by improving the quality of clusters, thereby optimizing storage space and significantly reducing time complexity. The scalable kernelized fuzzy clustering and scalable fuzzy consensus clustering are applied to huge genome data. Before clustering raw genome sequences, there is a need to develop a method that can extract significant features from huge genome sequences. To handle huge genome sequences, we have proposed novel scalable feature extraction techniques for preprocessing huge Single Nucleotide Polymorphism (SNP) and protein sequences that extract fixed-length numerical feature vectors. The extracted numerical feature vectors are then fed as input to the developed scalable fuzzy clustering algorithms to cluster huge SNP and protein datasets. Finally, we have investigated massive protein data of the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) using our developed scalable feature extraction approach and scalable fuzzy clustering algorithms. Therefore, the scalable algorithms presented in this thesis generalize to various genome datasets of any size (Big Data).
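The idea of fuzzy membership degrees computed under an RBF kernel, as described in the abstract, can be illustrated with a minimal single-machine sketch. This is not the thesis's algorithm: it assumes a common textbook form of kernel fuzzy c-means with cluster centers kept in the input space, it does not use Apache Spark, and the function names and the parameters `sigma`, `m`, and `n_iter` are hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(X, V, sigma):
    # Pairwise Gaussian (RBF) kernel between samples X (n, d) and centers V (c, d).
    sq_dist = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_fuzzy_memberships(X, V, sigma=1.0, m=2.0, eps=1e-12):
    # Kernel-induced distance 1 - K(x, v); clipped so a sample lying exactly
    # on a center does not cause a division by zero.
    dist = np.maximum(1.0 - rbf_kernel(X, V, sigma), eps)
    inv = dist ** (-1.0 / (m - 1.0))
    # Each row is normalised so a sample's memberships sum to 1.
    return inv / inv.sum(axis=1, keepdims=True)

def kernel_fcm(X, n_clusters, sigma=1.0, m=2.0, n_iter=50, seed=0, eps=1e-12):
    # Assumed variant: alternate membership and center updates for a fixed
    # number of iterations, starting from randomly chosen samples.
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        U = kernel_fuzzy_memberships(X, V, sigma, m)
        W = (U ** m) * rbf_kernel(X, V, sigma)          # weights, shape (n, c)
        V = (W.T @ X) / (W.sum(axis=0)[:, None] + eps)  # weighted mean per cluster
    return kernel_fuzzy_memberships(X, V, sigma, m), V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic blobs standing in for fixed-length feature vectors.
    X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
    U, V = kernel_fcm(X, n_clusters=2)
    print(U[:3].round(3))
```

Each row of `U` holds one sample's membership degrees across all clusters, so a sample near a cluster boundary receives comparable degrees in several clusters instead of a single hard label.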
URI:	https://dspace.iiti.ac.in/handle/123456789/3592
Type of Material:	Thesis_Ph.D
Appears in Collections:	Department of Computer Science and Engineering_ETD

Files in This Item:

File	Description	Size	Format
TH_418_Preeti_ Jha_1801201006.pdf		2.25 MB	Adobe PDF	View/Open
