ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design


Abstract

Protein language models have become essential tools for engineering novel functional proteins. The emerging paradigm of family-based language models uses homologous sequences to steer protein design and enhance zero-shot fitness prediction by imbuing models with the ability to reason explicitly over evolutionary context. To provide an open foundation for this modelling approach, we introduce ProFam-1, a 251M-parameter autoregressive protein family language model (pfLM) trained with next-token prediction on millions of protein families represented as concatenated, unaligned sets of sequences. ProFam-1 is competitive with state-of-the-art models on the ProteinGym zero-shot fitness prediction benchmark, achieving Spearman correlations of 0.47 for substitutions and 0.53 for indels. For homology-guided generation, ProFam-1 generates diverse sequences with predicted structural similarity, while preserving residue conservation and covariance patterns. All of ProFam's training and inference pipelines, together with our curated, large-scale training dataset ProFam Atlas, are released fully open source, lowering the barrier to future method development.
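To make the "concatenated, unaligned sets of sequences" training format concrete, the sketch below shows one plausible way a protein family could be serialised into a single token stream for next-token prediction. The vocabulary, separator tokens, and helper names here are illustrative assumptions, not ProFam-1's actual tokenisation scheme.

```python
# Minimal sketch: serialising a protein family into one autoregressive
# token stream. Special tokens (<sep>, <eos>) and the vocabulary are
# hypothetical; ProFam-1's real format may differ.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SEP, EOS = "<sep>", "<eos>"  # assumed separator / end-of-family tokens
VOCAB = {tok: i for i, tok in enumerate(list(AMINO_ACIDS) + [SEP, EOS])}

def encode_family(sequences: list[str]) -> list[int]:
    """Concatenate unaligned homologs into a single token-id stream,
    with a separator between family members and a terminal token."""
    tokens: list[str] = []
    for seq in sequences:
        tokens.extend(seq)   # one token per residue, no alignment needed
        tokens.append(SEP)
    tokens[-1] = EOS         # replace the last separator with end-of-family
    return [VOCAB[t] for t in tokens]

def shift_for_next_token(ids: list[int]) -> tuple[list[int], list[int]]:
    """Standard next-token-prediction pair: predict token t+1 from tokens <= t."""
    return ids[:-1], ids[1:]

# Toy family of unaligned homologs.
family = ["MKTAYIAK", "MKTLYVAK", "MRTAYIAR"]
inputs, targets = shift_for_next_token(encode_family(family))
print(inputs[:10], targets[:10])
```

Under this framing, later sequences in the stream are conditioned on earlier family members, which is what lets an autoregressive pfLM exploit evolutionary context both for scoring variants and for homology-guided generation.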
