BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models

Abstract

Genomic variation underlies human phenotypic diversity, disease susceptibility, and evolutionary adaptation. Although large-scale genomic sequencing has transformed our ability to map genetic variation, accurately modeling and interpreting these data remains a central challenge. Genomic foundation models (GFMs) are a promising approach, but they suffer from fundamental limitations: current GFMs typically treat DNA as a nucleotide-only sequence and overlook the biological context that is central to genomic interpretation, such as genomic annotations, regulatory elements, and functional context. Here, we introduce BioToken, a modular and extendable tokenization framework designed to encode genomic variants and biologically relevant region annotations directly into genomic representations. By building these inductive biases into the input representation, BioToken facilitates meaningful representation learning and generalization across diverse molecular phenotypes, such as gene expression, alternative splicing, and variant pathogenicity prediction. Built on BioToken, our genomic foundation model, BioFM, achieves competitive or superior results relative to specialized models (e.g., Enformer, SpliceTransformer) and to GFMs of up to 7B parameters across a comprehensive suite of genomic benchmarks, including noncoding pathogenicity, expression modulation, sQTL prediction, and long-range genomic interactions. Notably, BioFM achieves state-of-the-art performance with significantly fewer parameters (265M), substantially reducing training costs and computational requirements. Our findings highlight the substantial advantages of integrating biologically informed inductive biases into genomic foundation modeling, providing a robust and accessible path forward in genomics. We provide our code and model checkpoints to support further research in this direction.
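
To make the idea of annotation-aware tokenization concrete, the following is a minimal illustrative sketch, not the BioToken implementation. The abstract does not specify the token vocabulary, so everything here is an assumption made for illustration: the `Region` and `Variant` types, the `annotate_tokens` function, the `<EXON>`/`</EXON>` boundary tokens, and the `[G>A]` variant token are all hypothetical. The sketch shows only the general technique of interleaving region-boundary and variant tokens with single-nucleotide tokens.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical annotated interval; not a type from the paper."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    label: str  # e.g. "EXON"; the label vocabulary is an assumption


@dataclass(frozen=True)
class Variant:
    """Hypothetical single-nucleotide variant."""
    pos: int  # 0-based position within the sequence
    ref: str  # reference base expected at pos
    alt: str  # alternate base


def annotate_tokens(
    seq: str,
    regions: List[Region],
    variant: Optional[Variant] = None,
) -> List[str]:
    """Interleave region-boundary and variant tokens with nucleotide tokens."""
    if variant is not None and seq[variant.pos] != variant.ref:
        raise ValueError("reference allele does not match the sequence")

    # Index region labels by the positions where they open and close.
    starts: Dict[int, List[str]] = {}
    ends: Dict[int, List[str]] = {}
    for region in regions:
        starts.setdefault(region.start, []).append(region.label)
        ends.setdefault(region.end, []).append(region.label)

    tokens: List[str] = []
    for i, base in enumerate(seq):
        # Close regions that end before this base, then open new ones.
        for label in ends.get(i, []):
            tokens.append(f"</{label}>")
        for label in starts.get(i, []):
            tokens.append(f"<{label}>")
        # Replace the reference base with a dedicated variant token.
        if variant is not None and i == variant.pos:
            tokens.append(f"[{variant.ref}>{variant.alt}]")
        else:
            tokens.append(base)

    # Close regions that extend to the end of the sequence.
    for label in ends.get(len(seq), []):
        tokens.append(f"</{label}>")
    return tokens


if __name__ == "__main__":
    example = annotate_tokens(
        "ACGTAC",
        [Region(1, 4, "EXON")],
        Variant(pos=2, ref="G", alt="A"),
    )
    print(example)
```

Under these assumptions, a model sees the exon boundaries and the substitution as explicit tokens rather than having to infer them from raw nucleotides, which is the kind of inductive bias the abstract describes. The actual BioToken vocabulary and encoding may differ substantially.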
