Please use this identifier to cite or link to this item:
https://dspace.iiti.ac.in/handle/123456789/3162
Title: | Predictions in non-stationary data streams using Fuzzy clustering-based adaptive regression |
Authors: | Ajay |
Supervisors: | Tiwari, Aruna |
Keywords: | Computer Science and Engineering |
Issue Date: | 5-Nov-2021 |
Publisher: | Department of Computer Science and Engineering, IIT Indore |
Series/Report no.: | MSR019 |
Abstract: | The massive amount of data generated in real time by IoT devices, sensors, and mobile applications is referred to as streaming data. Because the rate of data generation keeps increasing, it is difficult to collect, process, and analyze such large amounts of data in real time to generate predictions and extract hidden knowledge. Moreover, the assumption that data is independent and identically distributed may not hold in data streams. Mining these streams for valuable information is difficult because of their changing nature and concept drift, i.e., unpredictable and unforeseen changes in the data distribution. This research aims to develop a concept drift adaptation method called Scalable K-means++ seeded Fuzzy Clustering induced Regression (SFC-R), which uses a fuzzy clustering-based approach to identify concepts or patterns. Regression parameters are then updated for each pattern to predict the target variable. The clustering process uses a Scalable K-means++ based algorithm to initialize the cluster centers. According to a literature review, the existing state-of-the-art approaches are sequential in nature and hence cannot efficiently handle the large amounts of data generated by data streams. To address this problem, we also propose Parallel and Scalable K-means++ seeded Fuzzy Clustering induced Regression (PSFC-R), a parallel and scalable variant of SFC-R implemented in PySpark using the Apache Spark in-memory cluster computing framework. The SFC-R method is validated on 13 stationary datasets and 8 non-stationary real-world data streams to confirm its ability to handle prediction in both stationary and non-stationary data streams. It is observed that the proposed SFC-R method can effectively handle the concept drift problem in data streams, with better results for mixed drift (more than one type of drift) and reoccurring drift.
The proposed PSFC-R method is also evaluated on three real-world plant protein sequence datasets. It is observed that PSFC-R can handle big data streams efficiently in terms of computation time. |
URI: | https://dspace.iiti.ac.in/handle/123456789/3162 |
Type of Material: | Thesis_MS Research |
Appears in Collections: | Department of Computer Science and Engineering_ETD |
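The general idea described in the abstract (fuzzy clustering to identify patterns, followed by a regression per pattern) can be illustrated with a minimal NumPy sketch. This is not the thesis's SFC-R or PSFC-R implementation. It assumes plain k-means++ seeding rather than the scalable variant, works on a fixed batch rather than a stream with drift adaptation, and makes two further illustrative choices that the abstract does not specify: each local regression is weighted by membership raised to the fuzzifier `m`, and predictions are blended by membership. All function names are hypothetical.

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """Plain k-means++ seeding (assumption: not the scalable variant)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        # Squared distance of every point to its nearest chosen center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def memberships(X, centers, m=2.0):
    """Fuzzy c-means membership of each row of X in each cluster."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_cmeans(X, k, m=2.0, iters=50, seed=0):
    """Alternate membership and center updates for a fixed number of steps."""
    rng = np.random.default_rng(seed)
    centers = kmeanspp_seed(X, k, rng)
    for _ in range(iters):
        W = memberships(X, centers, m) ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers

def fit_local_regressions(X, y, centers, m=2.0, ridge=1e-6):
    """One weighted least-squares line per cluster (weights: membership ** m)."""
    W = memberships(X, centers, m) ** m
    A = np.hstack([X, np.ones((len(X), 1))])  # append intercept column
    coefs = []
    for j in range(len(centers)):
        w = W[:, j]
        G = A.T @ (A * w[:, None]) + ridge * np.eye(A.shape[1])
        coefs.append(np.linalg.solve(G, A.T @ (w * y)))
    return np.array(coefs)

def predict(X, centers, coefs, m=2.0):
    """Blend the per-cluster predictions using the fuzzy memberships."""
    U = memberships(X, centers, m)
    A = np.hstack([X, np.ones((len(X), 1))])
    return (U * (A @ coefs.T)).sum(axis=1)

# Synthetic data with two patterns, each following a different linear rule.
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(-5, 0.5, (200, 1)), rng.normal(5, 0.5, (200, 1))])
y = np.where(X[:, 0] < 0, 2 * X[:, 0] + 1, -3 * X[:, 0] + 4)

centers = fuzzy_cmeans(X, k=2)
coefs = fit_local_regressions(X, y, centers)
mae = np.abs(predict(X, centers, coefs) - y).mean()
print("mean absolute error:", mae)
```

A single global linear model cannot fit both patterns in this toy data, while the per-cluster models can. Handling drift would additionally require updating the centers and coefficients as new chunks of the stream arrive, which this sketch does not attempt.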
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
MSR019_Ajay_1904101003.pdf | | 3.22 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.