ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein-protein interactions and gene essentiality across taxa
Abstract
Language models trained on biological sequences are advancing inference tasks from the scale of single proteins to that of genomic neighborhoods. Here, we introduce ProteomeLM, a transformer-based language model that uniquely operates on entire proteomes from species spanning the tree of life. ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context, yielding contextualized protein representations that reflect proteome-scale functional constraints. Notably, ProteomeLM’s attention coefficients encode protein-protein interactions (PPI), despite being trained without interaction labels. Furthermore, it enables interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino-acid coevolution-based methods. We further develop ProteomeLM-PPI, a supervised model that combines ProteomeLM embeddings and attention coefficients to achieve state-of-the-art PPI prediction across benchmarks and species. Finally, we introduce ProteomeLM-Ess, a supervised gene essentiality predictor that generalizes across diverse taxa. Our results demonstrate the potential of proteome-scale language models for addressing function and interactions at the organism level.
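As an illustration only, the masked-embedding objective described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the ProteomeLM implementation: the function names `mask_proteome` and `masked_reconstruction_loss`, the zero vector used as the mask token, the 50% masking fraction, and the mean squared error loss are all hypothetical choices, since the abstract does not specify them.

```python
import random


def mask_proteome(embeddings, mask_fraction=0.5, seed=0):
    """Hide a random subset of the protein embeddings of one proteome.

    embeddings: list of equal-length float lists, one per protein.
    Returns (masked copy, set of hidden indices).
    Assumption: hidden proteins are replaced by a zero vector.
    """
    rng = random.Random(seed)
    n = len(embeddings)
    n_hidden = max(1, round(mask_fraction * n))
    hidden = set(rng.sample(range(n), n_hidden))
    dim = len(embeddings[0])
    masked = [
        [0.0] * dim if i in hidden else list(row)
        for i, row in enumerate(embeddings)
    ]
    return masked, hidden


def masked_reconstruction_loss(predicted, target, hidden):
    """Mean squared error computed over the hidden proteins only.

    Assumption: MSE is an illustrative choice of reconstruction loss.
    """
    squared_errors = [
        (p - t) ** 2
        for i in hidden
        for p, t in zip(predicted[i], target[i])
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy proteome: 4 proteins with 3-dimensional embeddings.
proteome = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]]
masked, hidden = mask_proteome(proteome)
# A real model would predict the hidden rows from the visible ones;
# here the masked input stands in for an untrained prediction.
baseline_loss = masked_reconstruction_loss(masked, proteome, hidden)
```

In this sketch the loss is evaluated only on hidden proteins, so a model can reduce it only by using the visible proteins of the same proteome as context.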
Significance statement
Predicting protein interactions and functions is a key challenge in biology. Although deep learning-based language models are advancing the analysis of individual protein sequences and of genomic neighborhoods, they struggle to capture properties involving all the proteins expressed in a cell, such as protein–protein interactions (PPI) and gene essentiality. We present ProteomeLM, a language model that reasons on entire proteomes across diverse species. ProteomeLM captures PPI without supervision, and enables more accurate and faster screening of entire interactomes than current sequence-based approaches. ProteomeLM also delivers state-of-the-art supervised PPI prediction, and improves supervised prediction of gene essentiality compared to protein language models. These results demonstrate the potential of proteome-scale language models to reveal system-level organization and functional relationships.