Please use this identifier to cite or link to this item:
https://dspace.iiti.ac.in/handle/123456789/11353
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Jha, Preeti | en_US |
dc.contributor.author | Tiwari, Aruna | en_US |
dc.contributor.author | Mounika, Mukkamalla | en_US |
dc.contributor.author | Nagendra, Neha | en_US |
dc.date.accessioned | 2023-02-27T15:27:12Z | - |
dc.date.available | 2023-02-27T15:27:12Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Jha, P., Tiwari, A., Bharill, N., Ratnaparkhe, M., Patel, O. P., Harshith, N., . . . Nagendra, N. (2023). Apache spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis. International Journal of Data Science and Analytics, doi:10.1007/s41060-022-00381-6 | en_US |
dc.identifier.issn | 2364415X | - |
dc.identifier.other | EID(2-s2.0-85145504181) | - |
dc.identifier.uri | https://doi.org/10.1007/s41060-022-00381-6 | - |
dc.identifier.uri | https://dspace.iiti.ac.in/handle/123456789/11353 | - |
dc.description.abstract | Genome sequencing projects are rapidly contributing to the rise of high-dimensional protein sequence datasets. Extracting features from a high-dimensional protein sequence dataset poses many challenges. Although many feature extraction methods exist, extracting features from millions of protein sequences is impractical because these methods are not scalable. To address this, we propose two Apache Spark-based scalable feature extraction approaches that extract significant features from huge protein sequence datasets based on statistical properties: 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed protein sequences are used as input to four clustering algorithms: scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM). We conducted extensive experiments on various soybean protein datasets to evaluate the proposed feature extraction methods, 60d-SPF and 6d-SCPSF, alongside existing feature extraction methods with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms. The results, reported in terms of the Silhouette index and the Davies–Bouldin index, show that the proposed 60d-SPF method achieves significantly better results with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms than the proposed 6d-SCPSF and existing feature extraction approaches. 
© 2023, The Author(s), under exclusive licence to Springer Nature Switzerland AG. | en_US |
dc.language.iso | en | en_US |
dc.publisher | Springer Science and Business Media Deutschland GmbH | en_US |
dc.source | International Journal of Data Science and Analytics | en_US |
dc.subject | Big data | en_US |
dc.subject | Bioinformatics | en_US |
dc.subject | Clustering algorithms | en_US |
dc.subject | Extraction | en_US |
dc.subject | Feature extraction | en_US |
dc.subject | Iterative methods | en_US |
dc.subject | Proteins | en_US |
dc.subject | Apache spark cluster | en_US |
dc.subject | Co-occurrence | en_US |
dc.subject | Feature extraction methods | en_US |
dc.subject | Features extraction | en_US |
dc.subject | Fuzzy-c means | en_US |
dc.subject | Huge protein sequence | en_US |
dc.subject | Protein features | en_US |
dc.subject | Protein sequences | en_US |
dc.subject | Random sampling | en_US |
dc.subject | Scalable algorithms | en_US |
dc.subject | Fuzzy clustering | en_US |
dc.title | Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis | en_US |
dc.type | Journal Article | en_US |
Appears in Collections: | Department of Computer Science and Engineering |
Files in This Item:
There are no files associated with this item.
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.