Coalescence and Translation: A Language Model for Population Genetics

Abstract

Probabilistic models such as the sequentially Markovian coalescent (SMC) have long provided a powerful framework for population genetic inference, enabling reconstruction of demographic history and ancestral relationships from genomic data. However, these methods are inherently specialized: they rely on predefined assumptions, scale poorly, or both. Recent advances in simulation and deep learning provide an alternative approach: learning directly from synthetic genetic data to infer specific hidden evolutionary processes in a way that generalizes. Here we reframe the inference of coalescence times as a problem of translation between two biological languages: the sparse, observable patterns of mutation along the genome and the unobservable ancestral recombination graph (ARG) that gave rise to them. Inspired by large language models, we develop cxt, a decoder-only transformer that autoregressively predicts coalescent events conditioned on local mutational context. We show that cxt performs on par with state-of-the-art MCMC-based likelihood models across a broad range of demographic scenarios, in both in-distribution and out-of-distribution settings. Trained on simulations spanning the stdpopsim catalog, the model generalizes robustly and enables efficient inference at scale, producing over a million coalescence predictions in minutes. In addition, cxt produces a well-calibrated approximate posterior distribution over its predictions, enabling principled uncertainty quantification. We apply cxt to population genomic data from both humans and mosquitoes, highlighting the model’s ability to handle the complexities of empirical data.

Significance statement

cxt is a language model for population genetics which introduces next-coalescence prediction as translation from observed mutations to coalescence times by modeling the coalescent with recombination as a conditional stochastic process. It learns implicit priors from stdpopsim and generalizes across both known and novel demographies. cxt generates millions of TMRCA estimates in minutes and samples well-calibrated posteriors for uncertainty quantification. A simple post-hoc correction aligns predicted diversity with the species mutation rate, ensuring robustness to novel evolutionary scenarios.
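
The statement above does not spell out the post-hoc correction, so the following is only an illustrative sketch of one way such a rescaling could work, not the authors' procedure. It relies on the standard infinite-sites relationship between pairwise diversity and coalescence time, pi = 2 * mu * E[TMRCA], and it assumes equal-length genomic windows and TMRCAs measured in generations. The function name `rescale_tmrca` and its parameters are hypothetical and are not part of the cxt API.

```python
from statistics import fmean


def rescale_tmrca(pred_tmrca, pi_observed, mu):
    """Rescale predicted pairwise TMRCAs so implied diversity matches the data.

    Hypothetical helper, not taken from cxt. Assumes:
      - pred_tmrca: predicted TMRCAs in generations, one per equal-length window
      - pi_observed: observed per-site pairwise diversity for the same pair
      - mu: per-site, per-generation mutation rate of the species
    Uses the infinite-sites expectation pi = 2 * mu * E[TMRCA].
    """
    if not pred_tmrca:
        raise ValueError("pred_tmrca must not be empty")
    if mu <= 0:
        raise ValueError("mu must be positive")

    # Diversity implied by the raw predictions under the assumed mutation rate.
    implied_pi = 2.0 * mu * fmean(pred_tmrca)
    if implied_pi <= 0:
        raise ValueError("predicted TMRCAs must have a positive mean")

    # A single multiplicative factor shifts the overall time scale while
    # preserving the relative shape of the TMRCA profile along the genome.
    scale = pi_observed / implied_pi
    return [t * scale for t in pred_tmrca], scale


# Usage with made-up numbers, for illustration only.
corrected, factor = rescale_tmrca([10_000.0, 20_000.0, 30_000.0],
                                  pi_observed=8e-4, mu=1e-8)
```

A single global factor is the simplest choice here: it corrects a mismatch in overall time scale, such as a population whose effective size lies outside the training range, without altering where along the genome the model places older or younger coalescences.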
