Enhancing K-means Clustering Performance with a Two-Stage Hybrid Preprocessing Strategy

Tripathi, Abhishek; Tiwari, Aruna; Chaudhari, Narendra S.; Dwivedi, Rajesh

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/15496

Title:	Enhancing K-means Clustering Performance with a Two-Stage Hybrid Preprocessing Strategy
Authors:	Tripathi, Abhishek; Tiwari, Aruna; Chaudhari, Narendra S.; Dwivedi, Rajesh
Keywords:	Composite score;Density;Density-based inter-cluster distance;Hybrid distance metric;Laplacian score;Mean density-based intra-cluster distance
Issue Date:	2024
Publisher:	Springer Nature
Citation:	Tripathi, A., Tiwari, A., Chaudhari, N. S., Ratnaparkhe, M., & Dwivedi, R. (2024). Enhancing K-means Clustering Performance with a Two-Stage Hybrid Preprocessing Strategy. Arabian Journal for Science and Engineering. Scopus. https://doi.org/10.1007/s13369-024-09878-7
Abstract:	K-means clustering is a popular technique with broad utility across domains owing to its simplicity and consistent performance. However, it faces notable challenges that make it inefficient in certain scenarios. First, it struggles with non-spherical clusters that have nonlinear boundaries. Second, it depends heavily on the user-defined number of clusters (K). Third, it struggles with large, high-dimensional datasets, which complicates effective clustering. These challenges can substantially degrade the overall performance of K-means clustering. To address these limitations, we propose a novel two-stage hybrid preprocessing approach. In Stage 1, we propose a hybrid distance metric and, based on it, compute several novel measures for the data points (densities, density-based inter-cluster distances, mean density-based intra-cluster distances, and composite scores) to handle non-spherical cluster shapes and to help determine the optimal number of clusters (K). In Stage 2, the most relevant and non-redundant features of the high-dimensional dataset are selected using the Laplacian Score and a modified normalized Calinski–Harabasz index, utilizing the K obtained from Stage 1. Finally, the reduced dataset with the relevant features, together with its K, is supplied to the K-means clustering algorithm. The proposed approach is tested on benchmark and real-life plant genomics (RLPG) datasets. It outperforms existing state-of-the-art approaches in terms of the Silhouette index, Calinski–Harabasz index, and Davies–Bouldin index. © King Fahd University of Petroleum & Minerals 2024.
URI:	https://doi.org/10.1007/s13369-024-09878-7; https://dspace.iiti.ac.in/handle/123456789/15496
ISSN:	2193-567X
Type of Material:	Journal Article
Appears in Collections:	Department of Computer Science and Engineering
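
Illustrative sketch (not taken from the article): the abstract names the Laplacian Score as the feature-relevance measure in Stage 2. The Python below shows the standard Laplacian Score only, as a rough aid to reading the abstract. It assumes a fully connected heat-kernel similarity graph with a hand-picked bandwidth t, and the function name laplacian_scores and the toy data X are hypothetical. It does not reproduce the authors' hybrid distance metric, their composite scores, or their modified normalized Calinski–Harabasz index.

```python
import math


def laplacian_scores(X, t=1.0):
    """Return one standard Laplacian Score per feature column of X.

    Lower scores mark features that vary smoothly over the similarity
    graph, i.e. features that respect the local structure of the data.
    Assumption: fully connected graph, heat-kernel weights, bandwidth t.
    """
    n, d = len(X), len(X[0])

    # Similarity matrix S_ij = exp(-||x_i - x_j||^2 / t), no self-loops.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            S[i][j] = S[j][i] = math.exp(-sq_dist / t)

    degree = [sum(row) for row in S]  # diagonal of D
    total_degree = sum(degree)

    scores = []
    for r in range(d):
        f = [row[r] for row in X]
        # Remove the degree-weighted mean of the feature.
        mean = sum(w * v for w, v in zip(degree, f)) / total_degree
        fc = [v - mean for v in f]
        # f^T L f with L = D - S equals the sum over i < j of
        # S_ij * (f_i - f_j)^2 for a symmetric S.
        numerator = sum(
            S[i][j] * (fc[i] - fc[j]) ** 2
            for i in range(n)
            for j in range(i + 1, n)
        )
        # f^T D f, the degree-weighted variance of the feature.
        denominator = sum(w * v * v for w, v in zip(degree, fc))
        scores.append(numerator / denominator if denominator > 0 else float("inf"))
    return scores


# Toy data (hypothetical): feature 0 separates two groups,
# feature 1 is unrelated to the grouping.
X = [
    [0.0, 1.0], [0.1, 3.0], [0.2, 2.0],
    [5.0, 2.0], [5.1, 1.0], [5.2, 3.0],
]
scores = laplacian_scores(X)
# Feature indices ordered from most to least relevant (lowest score first).
ranking = sorted(range(len(scores)), key=scores.__getitem__)
```

On this toy data the group-separating feature receives the lower score and is ranked first. In a pipeline like the one the abstract outlines, the top-ranked columns would then be passed, together with K, to K-means.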
