Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis

Jha, Preeti; Tiwari, Aruna; Mounika, Mukkamalla; Nagendra, Neha

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/11353

Title:	Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis
Authors:	Jha, Preeti; Tiwari, Aruna; Mounika, Mukkamalla; Nagendra, Neha
Keywords:	Big data;Bioinformatics;Clustering algorithms;Extraction;Feature extraction;Iterative methods;Proteins;Apache spark cluster;Co-occurrence;Feature extraction methods;Features extraction;Fuzzy-c means;Huge protein sequence;Protein features;Protein sequences;Random sampling;Scalable algorithms;Fuzzy clustering
Issue Date:	2023
Publisher:	Springer Science and Business Media Deutschland GmbH
Citation:	Jha, P., Tiwari, A., Bharill, N., Ratnaparkhe, M., Patel, O. P., Harshith, N., . . . Nagendra, N. (2023). Apache spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis. International Journal of Data Science and Analytics, doi:10.1007/s41060-022-00381-6
Abstract:	Genome sequencing projects are rapidly contributing to the rise of high-dimensional protein sequence datasets. Extracting features from a high-dimensional protein sequence dataset poses many challenges. Although many feature extraction methods exist, extracting features from millions of protein sequences becomes impractical because these approaches are not scalable. Therefore, to design an efficient, scalable feature extraction approach, we propose two Apache Spark-based scalable feature extraction approaches that extract significant features, based on statistical properties, from huge protein sequence datasets. These are termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed protein sequences are used as input to four clustering algorithms: scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM). We conducted extensive experiments on various soybean protein datasets to demonstrate the effectiveness of the proposed feature extraction methods, 60d-SPF and 6d-SCPSF, and of existing feature extraction methods with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms. The reported results in terms of the Silhouette index and the Davies–Bouldin index show that the proposed 60d-SPF extraction method, used with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms, achieves significantly better results than the proposed 6d-SCPSF and existing feature extraction approaches.
© 2023, The Author(s), under exclusive licence to Springer Nature Switzerland AG.
URI:	https://doi.org/10.1007/s41060-022-00381-6 https://dspace.iiti.ac.in/handle/123456789/11353
ISSN:	2364-415X
Type of Material:	Journal Article
Appears in Collections:	Department of Computer Science and Engineering

