Investigations in fuzzy based learning algorithms with application to big data classification

Bharill, Neha

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/1064

Title:	Investigations in fuzzy based learning algorithms with application to big data classification
Authors:	Bharill, Neha
Supervisors:	Tiwari, Aruna
Keywords:	Computer Science and Engineering
Issue Date:	15-Jan-2018
Publisher:	Department of Computer Science and Engineering, IIT Indore
Series/Report no.:	TH105
Abstract:	Clustering is one of the most widely used methods for exploratory data analysis. The need for clustering arises in many real-life problems, such as gene analysis, image processing, text organization, community detection, disease diagnosis, and protein categorization. Fuzzy clustering is one of the most widely used methods for handling such problems. Its principal advantage is that the membership degrees express how ambiguously a data point belongs to a cluster. However, many aspects of the design of fuzzy clustering need to be addressed to improve its overall performance while preserving clustering quality across different categories of data, such as Very Large (VL) and Big Data. The quality of clustering is affected by many parameters during the clustering process. One such parameter is the number of clusters (c), which can be determined during clustering. Two other parameters are the fuzzifier and the locations of the cluster centers, which need to be selected appropriately for better efficacy of fuzzy clustering. Furthermore, given the current need for clustering processes to handle VL and Big Data in real-life situations, incremental and scalable clustering algorithms are needed. The performance of fuzzy clustering is improved by proposing a novel cluster validity index, which incorporates intra-cluster compactness, inter-cluster separation, and inter-cluster overlap measures to find the optimal number of clusters in a dataset. Further, for the appropriate selection of the other parameters of fuzzy clustering, i.e., the fuzzifier and the locations of the cluster centers, we propose two variants of hybrid fuzzy clustering algorithms.
First, a quantum-inspired evolutionary fuzzy algorithm for data clustering is proposed, which uses quantum computing principles to explore a large search space for an appropriate selection of the fuzzifier parameter in fuzzy clustering. Second, an enhanced quantum-inspired evolutionary fuzzy clustering algorithm is proposed, which explores a large search space for the appropriate selection of all parameters of fuzzy clustering, i.e., the number of clusters, the locations of the cluster centers, and the fuzzifier. To handle different categories of data, such as VL and Big Data, other novel approaches for fuzzy clustering are designed. First, we propose an incremental clustering algorithm integrated with a supervised classification method for processing VL datasets. The proposed method processes the VL dataset in chunks and sequentially performs clustering of each chunk. It avoids loading the entire dataset into memory at once, reduces run-time, and shows significant improvement in classification accuracy. To handle Big Data, scalable clustering algorithms are proposed and implemented using the Apache Spark framework, which provides an in-memory computation capability for faster computations on Big Data. The proposed approach reduces run-time and space complexity. It also shows competitive results in terms of various performance measures. The proposed scalable clustering algorithm is applied to a real-life Big Data problem. For this, a massive protein database of a complex plant genome, 80 GB in size, has been collected from the Directorate of Soybean Research (DSR), Indore, under the Indian Council of Agricultural Research (ICAR). The collected protein database is used for categorizing protein sequences into their respective superfamilies. This categorization assists DSR scientists in enhancing the productivity of next-generation protein sequences of different plant species.
To do this, first, a rigorous study and analysis of the protein sequences of the complex plant genome is carried out. Then, a novel feature extraction approach is designed that extracts fixed-length numeric feature vectors of only six dimensions to represent a long chain of protein sequences. This feature extraction method is applied to the real-life protein database for preprocessing, and the preprocessed data is then passed as input to the scalable clustering algorithm for efficient categorization of protein sequences into superfamilies. The exhaustive results on this massive protein database are evaluated in terms of various performance measures.
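For readers unfamiliar with the parameters named in the abstract, the following is a minimal sketch of the standard (textbook) fuzzy c-means updates, not the algorithms proposed in the thesis. It shows how the fuzzifier `m` and the cluster centers together determine the membership degrees. The function names, the default `m=2.0`, and the iteration count are illustrative assumptions.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update.

    Returns u, where u[i][j] is the membership degree of point i in cluster j.
    m > 1 is the fuzzifier: larger values give softer (more ambiguous) memberships.
    """
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        # Euclidean distance from this point to every cluster center.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5 for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share full
            # membership among those centers to avoid division by zero.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        memberships.append(
            [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
        )
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Standard fuzzy c-means center update: membership-weighted means."""
    dim = len(points[0])
    n_clusters = len(memberships[0])
    centers = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, points)) / total for d in range(dim)]
        )
    return centers


def fuzzy_c_means(points, init_centers, m=2.0, n_iter=20):
    """Alternate the two updates for a fixed number of iterations."""
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        u = fcm_memberships(points, centers, m)
        centers = fcm_centers(points, u, m)
    return centers, fcm_memberships(points, centers, m)
```

In this sketch the number of clusters is fixed by the length of `init_centers`. The abstract describes selecting the number of clusters, the fuzzifier, and the center locations automatically, which this example does not attempt.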
URI:	https://dspace.iiti.ac.in/handle/123456789/1064
Type of Material:	Thesis_Ph.D
Appears in Collections:	Department of Computer Science and Engineering_ETD

Files in This Item:

File	Description	Size	Format
TH_105__Neha_Bharill_12120103.pdf		4.41 MB	Adobe PDF
