Illuminating the functional landscape of the dark proteome across the Animal Tree of Life through natural language processing models
Abstract
Background
Understanding how coding genes and their functions evolve over time is a key aspect of evolutionary biology. Protein-coding genes that are poorly understood or characterized at the functional level may be related to important evolutionary innovations; leaving them uncharacterized can lead to incomplete or inaccurate models of evolutionary change and limits the ability to identify conserved or lineage-specific features. Homology-based methodologies often fail to transfer functional annotations to a large fraction of the coding gene repertoire in non-model organisms. This is particularly relevant in animals, where a large number of coding genes yield no functional annotation.
Results
Here, we leverage homology, deep learning, and protein language models to investigate functional annotation in the ‘dark proteome’ (defined as the unknown functional landscape) of ca. 1,000 gene repertoires from virtually all animal phyla, totaling ca. 23.2 million coding genes. We then explore the ‘dark proteome’ of all animal phyla, revealing an enrichment in functions related to immune response, viral infection, response to stimuli, development, and signaling, among others. Furthermore, we provide an open-source pipeline, FANTASIA, to implement and benchmark these methodologies in any dataset.
Conclusions
Our results uncover the putative functions of poorly understood protein-coding genes across the Animal Tree of Life that were previously inaccessible due to the limitations of homology inference. They contribute to a more comprehensive understanding of the molecular basis of animal evolution and provide a new tool for the functional annotation of protein-coding genes in newly generated genomes.