Title: Mutation-driven multi-model ensemble learning framework for glioma subtype classification
Abstract:
Gliomas are tumors that originate from glial cells, which support and protect neurons in the nervous system; they can become malignant and rapidly invade healthy brain tissue. Glioma subtypes grow at different rates and respond differently to treatment. Early identification of the subtype helps determine the treatment plan and can affect patient outcome. Some studies have differentiated between low-grade and high-grade gliomas, but few have closely examined the factors that distinguish subtypes. Further, while clinical and radiographic features are commonly used for diagnosis, purely genomic approaches remain understudied. This study therefore uses a mutation-driven machine learning classification pipeline to predict glioma subtypes from genomic data. Data were extracted from The Cancer Genome Atlas (TCGA) Project and Memorial Sloan Kettering Cancer Center (MSK), and a panel of the 20 most frequently mutated genes in gliomas was analyzed to predict three major glioma subtypes: oligodendroglioma, astrocytoma, and glioblastoma. Three base models (multinomial logistic regression, Random Forest, and XGBoost) were trained and evaluated independently. To further improve generalization and classification accuracy, out-of-fold predictions from the base models were used to train a stacked XGBoost meta-model. The stacked model achieved an overall accuracy of 77.9% and an AUC of 0.896, outperforming all of the base models. Notably, the meta-model classified glioblastomas, the deadliest subtype, with 92.2% accuracy, highlighting the model’s ability to identify high-grade tumors. Our findings illustrate the clinical potential of mutation-based classifiers and suggest that genetic features alone can hold strong predictive value for glioma subtype classification. This study also demonstrates how mutation-based machine learning frameworks can enable accurate classification, especially in settings where clinical data may not be readily available.
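The stacking step described above can be illustrated with a minimal sketch. It is not the study's actual pipeline. It assumes a binary gene-by-sample mutation matrix, uses synthetic data in place of the TCGA and MSK cohorts, and substitutes scikit-learn's GradientBoostingClassifier for XGBoost so that the example depends only on NumPy and scikit-learn. The helper names, fold count, and hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def make_synthetic_mutations(n_samples=300, n_genes=20, seed=0):
    """Synthetic stand-in data: binary mutation matrix and 3 subtype labels."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 3, size=n_samples)
    # Assumption: each subtype has its own per-gene mutation probability.
    probs = rng.uniform(0.05, 0.6, size=(3, n_genes))
    X = (rng.random((n_samples, n_genes)) < probs[y]).astype(int)
    return X, y


def out_of_fold_features(base_models, X, y, cv):
    """Concatenate each base model's out-of-fold class probabilities.

    Every row is predicted by a model that never saw that row during
    fitting, so the meta-model is not trained on leaked labels.
    """
    blocks = [
        cross_val_predict(model, X, y, cv=cv, method="predict_proba")
        for model in base_models
    ]
    return np.hstack(blocks)


X, y = make_synthetic_mutations()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for the XGBoost base model (assumption, see lead-in).
    GradientBoostingClassifier(random_state=0),
]

# 3 base models x 3 subtype probabilities = 9 meta-features per sample.
meta_X = out_of_fold_features(base_models, X, y, cv)

# Stand-in for the XGBoost meta-model, fit on the out-of-fold features.
meta_model = GradientBoostingClassifier(random_state=0).fit(meta_X, y)
```

To classify unseen samples under this sketch, each base model would be refit on the full training set and its predicted probabilities passed to `meta_model`. Performance should be estimated on held-out data, since the meta-model has already seen `meta_X`.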