A Gene Set Foundation Model Pre-Trained on a Massive Collection of Diverse Gene Sets
Abstract
Trained on large datasets, foundation models can capture complex patterns within these datasets to create embeddings useful for a variety of downstream applications. Here we created a gene set foundation model (GSFM) trained on a massive collection of unlabeled gene sets from two databases: Rummagene and RummaGEO. Rummagene automatically extracts gene sets from the supplemental tables of publications, while RummaGEO contains gene sets automatically computed by comparing groups of samples from RNA-seq studies deposited into the Gene Expression Omnibus (GEO). Several foundation model architectures and training data sources were benchmarked on the task of predicting gene function. These predictions were also compared to those produced by other state-of-the-art gene function prediction methods and models. One of the GSFM architectures achieves superior performance compared to all other methods and models. This model was used to systematically predict gene functions for all human genes. These predictions are served on gene pages accessible from <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://gsfm.maayanlab.cloud">https://gsfm.maayanlab.cloud</ext-link>.