Please use this identifier to cite or link to this item:
https://dspace.iiti.ac.in/handle/123456789/10392
Title: | GPU-accelerated scalable feature extraction techniques with scalable kernelized fuzzy clustering algorithms and its application to real-life genomics data for gene identification |
Authors: | Sreeharsh, Namani; Sawarkar, Saloni; Tiwari, Aruna [Guide] |
Keywords: | Computer Science and Engineering |
Issue Date: | 26-May-2022 |
Publisher: | Department of Computer Science and Engineering, IIT Indore |
Series/Report no.: | BTP585;CSE 2022 SRE |
Abstract: | Bioinformatics is the study of gaining knowledge from biological data. It encompasses data collection, storage, retrieval, manipulation, modelling, and prediction using algorithms and software. In genomics, the role of technology is primarily focused on the tremendous rise in genome sequencing, which is developing at a rate faster than that projected by Moore’s law. The size of the data sets is increasing exponentially, necessitating the use of massive data processing technology. Clustering is one of the most widely used data mining methods for investigating bioinformatics genome data. The surging volume of genome data has put colossal pressure on clustering algorithms to scale beyond a single machine due to both space and time bottlenecks. Scaling clustering algorithms to huge genome data requires Big Data handling systems, and numerous processing frameworks have recently been designed specifically for Big Data. This thesis designs and develops fuzzy-based scalable kernelized clustering algorithms and feature extraction techniques for handling huge soybean RNA and SNP data using an Apache Spark cluster on High Performance Supercomputing (HPC) infrastructure. To handle Big Data, we propose an Apache Spark cluster-based log kernelized clustering algorithm named Log Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (LKSRSIO-FCM). It is based on the Log Kernelized Scalable Literal Fuzzy c-Means (LKSLFCM) clustering algorithm, in which a log kernel function is used. Additionally, we propose an Apache Spark cluster-based Cauchy kernelized clustering algorithm named Cauchy Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (CKSRSIO-FCM).
This is based on the Cauchy Kernelized Scalable Literal Fuzzy c-Means (CKSLFCM) clustering algorithm, in which a Cauchy kernel function is used. The proposed work is inspired by the Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM) algorithm. The kernel function is applied to achieve a better mapping for non-linearly separable datasets. The proposed algorithms avoid loading the entire dataset into memory at once, which results in a significant reduction in run-time. The effectiveness of the proposed scalable kernelized fuzzy clustering algorithms is tested on large benchmark datasets. To handle huge real-life soybean SNP sequences, we have proposed novel scalable feature extraction techniques for preprocessing huge SNP/RNA data that extract fixed-length numerical feature vectors. The extracted numerical feature vectors are then fed as input to the proposed scalable kernelized fuzzy clustering algorithms to cluster huge real-life SNP datasets. The algorithms are intended for detecting disease by grouping samples (individuals) with comparable gene expression patterns, as well as identifying groups of genes with similar profiles across samples. However, few practical studies have been undertaken to test the efficacy of the suggested scalable kernelized fuzzy clustering algorithms on problems aimed at identifying new diseases through gene identification. For the SoySNP50K iSelect BeadChip, we have created a new version of the SNP dataset. The complete dataset for 20,087 G. max and G. soja accessions genotyped with 42,509 SNPs is generated for Wm82.a3. |
URI: | https://dspace.iiti.ac.in/handle/123456789/10392 |
Type of Material: | B.Tech Project |
Appears in Collections: | Department of Computer Science and Engineering_BTP |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
BTP_585_Namani_Sreeharsh_180001032_Saloni_Sawarkar_180001048.pdf | | 1.73 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.