Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis

Jha, Preeti; Tiwari, Aruna; Mounika, Mukkamalla; Nagendra, Neha

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/11353

Full metadata record

DC Field	Value	Language
dc.contributor.author	Jha, Preeti	en_US
dc.contributor.author	Tiwari, Aruna	en_US
dc.contributor.author	Mounika, Mukkamalla	en_US
dc.contributor.author	Nagendra, Neha	en_US
dc.date.accessioned	2023-02-27T15:27:12Z	-
dc.date.available	2023-02-27T15:27:12Z	-
dc.date.issued	2023	-
dc.identifier.issn	2364415X	-
dc.identifier.other	EID(2-s2.0-85145504181)	-
dc.identifier.uri	https://doi.org/10.1007/s41060-022-00381-6	-
dc.identifier.uri	https://dspace.iiti.ac.in/handle/123456789/11353	-
dc.description.abstract	Genome sequencing projects are rapidly contributing to the rise of high-dimensional protein sequence datasets. Extracting features from a high-dimensional protein sequence dataset poses many challenges. Although many feature extraction methods exist, extracting features from millions of protein sequences is impractical because these approaches are not scalable. To address this, we propose two Apache Spark-based scalable feature extraction approaches that extract significant features from huge protein sequence datasets based on statistical properties: 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed protein sequences are then used as input to four clustering algorithms: scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM). We have conducted extensive experiments on various soybean protein datasets to evaluate the effectiveness of the proposed feature extraction methods, 60d-SPF and 6d-SCPSF, and of existing feature extraction methods with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms. The reported results in terms of the Silhouette index and the Davies–Bouldin index show that the proposed 60d-SPF extraction method with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms achieves significantly better results than the proposed 6d-SCPSF and existing feature extraction approaches.
© 2023, The Author(s), under exclusive licence to Springer Nature Switzerland AG.	en_US
dc.language.iso	en	en_US
dc.publisher	Springer Science and Business Media Deutschland GmbH	en_US
dc.source	International Journal of Data Science and Analytics	en_US
dc.subject	Big data	en_US
dc.subject	Bioinformatics	en_US
dc.subject	Clustering algorithms	en_US
dc.subject	Extraction	en_US
dc.subject	Feature extraction	en_US
dc.subject	Iterative methods	en_US
dc.subject	Proteins	en_US
dc.subject	Apache spark cluster	en_US
dc.subject	Co-occurrence	en_US
dc.subject	Feature extraction methods	en_US
dc.subject	Features extraction	en_US
dc.subject	Fuzzy-c means	en_US
dc.subject	Huge protein sequence	en_US
dc.subject	Protein features	en_US
dc.subject	Protein sequences	en_US
dc.subject	Random sampling	en_US
dc.subject	Scalable algorithms	en_US
dc.subject	Fuzzy clustering	en_US
dc.title	Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis	en_US
dc.type	Journal Article	en_US
Appears in Collections:	Department of Computer Science and Engineering

Files in This Item:

There are no files associated with this item.

