GENERator: A Long-Context Generative Genomic Foundation Model
Abstract
The rapid advancement of DNA sequencing has produced vast genomic datasets, yet the interpretation and rational engineering of sequence function remain fundamental challenges. Recent large language models (LLMs) have opened new avenues for genomic analysis; however, existing approaches are frequently constrained by limited training scope, restricted generative flexibility, or prohibitive computational cost. In this study, we introduce GENERator, a generative genomic foundation model designed for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. GENERator demonstrates strong intrinsic capabilities arising directly from pre-training. Unsupervised embedding analyses reveal latent organization consistent with phylogenetic relationships. Sequence recovery benchmarks show that GENERator achieves generative accuracy matching or exceeding state-of-the-art baselines with substantially improved computational efficiency. In a zero-shot setting, GENERator further provides competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. Beyond training-free evaluation, GENERator consistently delivers strong performance through task-specific fine-tuning on established genomic benchmarks. We further demonstrate practical generative applications enabled by the model. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic enhancers whose regulatory strength exceeds that of natural genomic sequences, as validated by high-throughput UMI-STARR-seq assays. 
Collectively, these results establish GENERator as an efficient and biologically grounded foundation for genomic interpretation and programmable sequence design across diverse genomic contexts. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERator.