Transcriptomics-Guided Multi-Cohort Machine Learning for Alzheimer’s Disease Diagnosis

Emine Güven
Ayfer Koyuncu
Sümeyya Arıkan Akgün
Khalid Saad Alharbi
Sattam Khulaif Alenezi
Tarik G Alsahli
Muhammad Afzal

Published on Jun 12, 2026

Abstract

Alzheimer’s disease (AD) is a slowly progressive neurodegenerative disease and a major cause of dementia. The identification of reliable diagnostic biomarkers and the early diagnosis of AD remain critical challenges in clinical neurology. Addressing these challenges requires robust computational pipelines capable of analyzing high-dimensional genomics data. In this study, four publicly available datasets (GSE125583, GSE125050, GSE153873, and GSE173955) were obtained from the NCBI Gene Expression Omnibus (GEO) repository. To train six classifiers for predicting AD status, we identified a signature of 30 differentially expressed genes (DEGs; |log₂ fold change| > 0.8 and adjusted p-value < 0.05), comprising the top 15 up-regulated and top 15 down-regulated brain tissue RNA biomarkers from the GSE125583 training dataset (N_train = 231). Multiple supervised machine learning classifiers, including Generalized Linear Models (GLMNET), Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM), were employed to predict AD status from transcriptomic profiles. All models performed competitively during training, with cross-validation AUC-ROC values ranging from 0.84 to 0.88. Among these, GLMNET and Partial Least Squares (PLS) exhibited the highest performance and stability. During internal testing, the GBM model achieved excellent discriminatory performance (AUC = 0.962, 95% CI 0.911–1.000), while GLMNET also demonstrated strong predictive ability (AUC = 0.940, 95% CI 0.883–0.997). Furthermore, a weighted ensemble model combining GBM and GLMNET matched the top internal test performance (AUC = 0.97, 95% CI 0.910–1.000). Across external validation cohorts, the ensemble maintained robust generalizability, achieving a mean AUC of 0.89, comparable to that of the GBM model. Feature importance analysis identified a key signature consisting of 20 predictor genes. Among these, COL27A1, LINC02937, and TEPSIN emerged as the highest-ranking indicators of dysregulation associated with AD. In the GBM model, ISYNA1, HMGB3, and KDM7A-DT retained high importance rankings. GLMNET coefficients further confirmed the directional effect of these features, revealing significant up-regulation of RIPOR3, SMTN, and SCG3 in AD samples. Conversely, strong down-regulation was observed for COL27A1, LINC02937, ETV4, and HSPB7. Overall, this study presents an interpretable machine learning-based classification pipeline for AD diagnosis and identifies promising genetic targets such as SLC6A12 and ISYNA1.
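
The two computational steps named in the abstract, threshold-based DEG selection and a weighted ensemble scored by AUC-ROC, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the ranking of genes by fold change, and the ensemble weight are all assumptions made for illustration; the abstract gives only the thresholds (|log₂ fold change| > 0.8, adjusted p < 0.05) and the 15 + 15 split.

```python
from typing import List, NamedTuple, Sequence, Tuple


class GeneStat(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    log2_fc: float
    padj: float


def select_signature(
    stats: Sequence[GeneStat],
    lfc_cut: float = 0.8,    # threshold from the abstract
    padj_cut: float = 0.05,  # threshold from the abstract
    n_each: int = 15,        # 15 up + 15 down, as in the abstract
) -> Tuple[List[str], List[str]]:
    """Filter by both thresholds, then keep the n_each strongest genes per direction.

    Assumption: "top" means largest absolute log2 fold change.
    """
    passed = [s for s in stats if abs(s.log2_fc) > lfc_cut and s.padj < padj_cut]
    up = sorted((s for s in passed if s.log2_fc > 0),
                key=lambda s: s.log2_fc, reverse=True)[:n_each]
    down = sorted((s for s in passed if s.log2_fc < 0),
                  key=lambda s: s.log2_fc)[:n_each]
    return [s.gene for s in up], [s.gene for s in down]


def weighted_ensemble(p_a: Sequence[float], p_b: Sequence[float],
                      w_a: float = 0.5) -> List[float]:
    """Blend two models' predicted probabilities; w_a is an assumed weight."""
    if len(p_a) != len(p_b):
        raise ValueError("prediction vectors must have the same length")
    return [w_a * a + (1.0 - w_a) * b for a, b in zip(p_a, p_b)]


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, `select_signature` would be applied to the differential expression table of the training cohort only, and `auc_roc(labels, weighted_ensemble(p_gbm, p_glmnet, w_a))` would score the blended predictions on a held-out cohort. The pairwise AUC is quadratic in sample size, which is acceptable for cohorts of a few hundred samples.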
