RNA foundation models enable generalizable endometriosis disease classification and stable gene-level interpretation

Abstract

Endometriosis is a chronic inflammatory condition that affects one in ten reproductive-age women worldwide and is subject to significant diagnostic delays. While machine learning (ML) models trained on transcriptomic data show promise for disease prediction, limited generalizability across independent patient cohorts has hindered clinical translation. Foundation models (FMs) pretrained on large-scale transcriptomic data may learn transferable, biologically meaningful representations that could support cross-cohort prediction. We assembled a 12-cohort bulk RNA-seq benchmark (334 samples) and developed a computationally efficient pipeline to test whether FMs improve endometriosis classification, an approach not previously applied to this disease. Using AutoXAI4Omics with cohort-aware validation, we compared embeddings derived from five state-of-the-art RNA FMs against transcripts-per-million (TPM) baselines. In cross-cohort prediction, FM embeddings significantly improved performance, achieving a weighted F1-score of 0.83 vs. 0.68 for the baseline. To allow gene-level interpretation of models trained on FM embeddings, we introduce classifier-aligned integrated gradients (CA-IG), an interpretability approach that aligns gene-level attributions to the downstream classifier without end-to-end finetuning. CA-IG revealed a set of predictive genes that was conserved across cohort-validation regimes for FM embeddings, in contrast to the unstable explanations obtained from the baseline, suggesting that FM embeddings prioritize transferable disease-related signal over cohort-specific effects. These genes include novel candidates that converge on biologically plausible pathways for endometriosis.
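
The cohort-aware evaluation described above can be illustrated with a minimal sketch. It assumes a leave-one-cohort-out scheme, which is one common form of cohort-aware validation and may differ from the regimes actually used in the study. The toy embeddings, cohort labels, the nearest-centroid classifier, and all function names are hypothetical stand-ins for illustration; none of this reflects the AutoXAI4Omics API or the study data.

```python
from collections import Counter, defaultdict


def leave_one_cohort_out(cohorts):
    """Yield (held_out, train_idx, test_idx), holding out each cohort once."""
    by_cohort = defaultdict(list)
    for i, cohort in enumerate(cohorts):
        by_cohort[cohort].append(i)
    for held_out in sorted(by_cohort):
        test_idx = by_cohort[held_out]
        train_idx = [i for i, c in enumerate(cohorts) if c != held_out]
        yield held_out, train_idx, test_idx


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # tp + fn equals the support n > 0, so the denominator is never zero.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)


def nearest_centroid_predict(X_train, y_train, X_test):
    """Toy classifier (assumption): assign each sample to the closest class mean."""
    counts = Counter(y_train)
    sums = {}
    for x, y in zip(X_train, y_train):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(sorted(centroids), key=lambda y: dist2(x, centroids[y]))
            for x in X_test]


def evaluate(X, y, cohorts):
    """Pool held-out predictions over all cohorts and score them once."""
    y_true, y_pred = [], []
    for _, train_idx, test_idx in leave_one_cohort_out(cohorts):
        preds = nearest_centroid_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        y_true.extend(y[i] for i in test_idx)
        y_pred.extend(preds)
    return weighted_f1(y_true, y_pred)


# Hypothetical toy data: 2-D "embeddings", binary labels, three cohorts.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.0, 0.3], [1.0, 0.7]]
y = [0, 1, 0, 1, 0, 1]
cohorts = ["A", "A", "B", "B", "C", "C"]

if __name__ == "__main__":
    print(f"pooled weighted F1: {evaluate(X, y, cohorts):.2f}")
```

Because every test sample comes from a cohort the classifier never saw during training, a score obtained this way reflects cross-cohort transfer rather than within-cohort fit, which is the distinction the reported 0.83 vs. 0.68 comparison rests on.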
