Novel statistical and probabilistic machine learning algorithms for genotype clustering and cancer classification

Shastri, Aditya Anand

Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/3013

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Ahuja, Kapil	-
dc.contributor.author	Shastri, Aditya Anand	-
dc.date.accessioned	2021-08-06T04:48:51Z	-
dc.date.available	2021-08-06T04:48:51Z	-
dc.date.issued	2021-07-28	-
dc.identifier.uri	https://dspace.iiti.ac.in/handle/123456789/3013	-
dc.description.abstract	Two critical problems facing the world today are declining agricultural productivity and deteriorating human health. Specifically, due to climate change, water scarcity and excessive heat reduce the productivity of crops. Thus, the first part of this dissertation focuses on developing efficient variants of the standard clustering algorithm to identify the species of crops that can be grown with less water and in high heat. Furthermore, cancer has emerged as an important cause of mortality after cardiac diseases. Hence, the second part focuses on developing image classification systems that accurately classify cancer images for early detection and prevention. To increase agricultural productivity, it is very important to study the genetic and phenotypic data associated with crops (henceforth referred to as plants). Genetic data is in the form of a Whole Genome Sequence (WGS), which is a sequence made from a combination of four nucleotides: A (Adenine), T (Thymine), G (Guanine), and C (Cytosine). Phenotypic data comprise all kinds of information regarding the physical characteristics of plants, such as Plant Height, 100 Seed Weight, Seed Yield Per Plant, Number of Branches Per Plant, Days to 50% Flowering, Days to Maturity, etc. We develop a Vector Quantized Spectral Clustering (VQSC) algorithm that combines Spectral Clustering (SC) and Vector Quantization (VQ) sampling for grouping genome sequences of plants. The novelty of our algorithm lies in constructing the crucial similarity matrix in SC as well as in the use of k-medoids in VQ. For genetic data of the Soybean plant, we compare VQSC with commonly used techniques such as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and Neighbor Joining (NJ).
Experimental results on the standard set of 31 Soybean sequences show that our VQSC outperforms both these techniques significantly in terms of cluster quality (average improvement of 21% over UPGMA and 24% over NJ) as well as time complexity (an order of magnitude faster than both UPGMA and NJ). Similarly, we develop a Probabilistically Sampled Spectral Clustering algorithm that combines SC and Pivotal Sampling for grouping phenotypic data. The novelty of our algorithm is again in constructing the crucial similarity matrix for the clustering algorithm and defining probabilities for the sampling technique. For phenotypic data of the Soybean plant, we compare our algorithm with the traditional Hierarchical Clustering (HC) algorithm. Experimental results on a commonly used set of 2400 Soybean genotypes show that we obtain up to 45% better quality clusters than HC in terms of Silhouette Value. Again, the complexity of our algorithm is more than an order of magnitude lower than that of HC. Two common cancers prevailing in the world are breast cancer and thyroid cancer. These cancers are becoming pervasive, and their early detection is a big step in saving the life of a patient. Traditional diagnostic techniques depend heavily on the personal knowledge and experience of the doctor, who diagnoses the presence of a cancerous tumor from images (X-ray, ultrasound, magnetic resonance, etc.). Hence, nowadays, automated imaging techniques are commonly used for diagnosing these cancers. The most important step here is the classification of the cancer images as benign or malignant. Mammography, which uses low-dose X-ray radiation, is the most effective and most commonly used tool for early detection of breast cancer. Similarly, ultrasound images (which use high-frequency sound waves) of the thyroid gland are mostly used for detecting thyroid cancer. The texture of the breast and thyroid in these images plays a significant role in classifying them as benign or malignant.
We propose a descriptor that combines Histogram of Oriented Gradients (HOG) and the Gabor filter, which exploits textural information. We term it Histogram of Oriented Texture (HOT). We also revisit the Pass Band - Discrete Cosine Transform (PB-DCT) descriptor, which captures texture information well. Not all features of the cancer images may be useful. Hence, we apply a feature selection technique called Discrimination Potentiality (DP). Our resulting descriptors, DP-HOT and DP-PB-DCT, are compared with the standard descriptors. Experimental results on breast and thyroid images show that we achieve an average accuracy of 92% and 96%, respectively, which is substantially higher than that of the existing standard descriptors.	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science and Engineering, IIT Indore	en_US
dc.relation.ispartofseries	TH358	-
dc.subject	Computer Science and Engineering	en_US
dc.title	Novel statistical and probabilistic machine learning algorithms for genotype clustering and cancer classification	en_US
dc.type	Thesis_Ph.D	en_US
Appears in Collections:	Department of Computer Science and Engineering_ETD
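
The abstract above reports cluster quality in terms of Silhouette Value. As a reading aid, the following is a minimal sketch of how a silhouette value is commonly computed. It is not the thesis implementation: the Euclidean distance, the function names (silhouette_values, mean_silhouette) and the toy points are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette_values(points, labels):
    """Return the silhouette value s(i) for every point.

    s(i) = (b - a) / max(a, b), where a is the mean distance from point i
    to the other members of its own cluster and b is the smallest mean
    distance from point i to the members of any other cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def mean_silhouette(points, labels):
    """Average silhouette value over all points (higher is better)."""
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print(mean_silhouette(toy_points, good_labels))
print(mean_silhouette(toy_points, bad_labels))
```

Values lie between -1 and 1. A labelling that matches the natural grouping scores close to 1, while one that mixes the groups scores below 0, which is the sense in which one clustering can be said to have better quality than another.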

Files in This Item:

File	Description	Size	Format
TH_358_Aditya_Anand_Shastri_1501201001.pdf		2.03 MB	Adobe PDF
