Predictions in non-stationary data streams using Fuzzy clustering-based adaptive regression

Ajay

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/3162

Title:	Predictions in non-stationary data streams using Fuzzy clustering-based adaptive regression
Authors:	Ajay
Supervisors:	Tiwari, Aruna
Keywords:	Computer Science and Engineering
Issue Date:	5-Nov-2021
Publisher:	Department of Computer Science and Engineering, IIT Indore
Series/Report no.:	MSR019
Abstract:	The massive amount of data generated in real time by IoT devices, sensors, and mobile applications is referred to as streaming data. Because the rate of data generation keeps increasing, it is difficult to collect, process, and analyze such large amounts of data in real time to generate predictions and extract hidden knowledge. Also, in data streams, the assumption that data is independent and identically distributed may be incorrect. Mining these data streams to extract valuable information is difficult due to their changing nature and concept drift, i.e., changes in data distribution in an unpredictable and unforeseen manner. This research aims to develop a concept drift adaptation method called Scalable K-means++ seeded Fuzzy Clustering induced Regression (SFC-R), which uses a fuzzy clustering-based approach to identify concepts or patterns. Subsequently, regression parameters are updated for each pattern to predict the target variable. The clustering process uses a Scalable K-means++ based algorithm to initialize the cluster centers. According to a literature review, the existing state-of-the-art approaches are sequential in nature and hence cannot efficiently handle the large amounts of data generated by data streams. To address this problem, we have also proposed Parallel and Scalable K-means++ seeded Fuzzy Clustering induced Regression (PSFC-R), a parallel and scalable variant of SFC-R implemented in PySpark using the Apache Spark in-memory cluster computing framework. The SFC-R method is validated with 13 stationary datasets and 8 non-stationary real-world data streams to confirm its ability to deal with the task of prediction in stationary and non-stationary data streams. It is observed that the proposed SFC-R method can effectively handle the concept drift problem in data streams, with better results for mixed drift (more than one type of drift) and reoccurring drift.
The proposed PSFC-R method is also evaluated on three real-world plant protein sequence datasets. It is observed that PSFC-R can handle big data streams efficiently in terms of computation time.
URI:	https://dspace.iiti.ac.in/handle/123456789/3162
Type of Material:	Thesis_MS Research
Appears in Collections:	Department of Computer Science and Engineering_ETD

Files in This Item:

File	Description	Size	Format
MSR019_Ajay_1904101003.pdf		3.22 MB	Adobe PDF
