Scaling down for efficiency: Medium-sized protein language models perform well at transfer learning on realistic datasets


Abstract

Protein language models (pLMs) can offer deep insights into the evolutionary and structural properties of proteins. While larger models, such as the 15-billion-parameter ESM-2, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various pLMs across multiple biological datasets to assess the impact of model size on transfer learning. Surprisingly, we found that larger models do not necessarily outperform smaller ones, particularly when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, performed consistently well, falling only slightly behind their larger counterparts, ESM-2 15B and ESM C 6B, despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning and found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.
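
To make the embedding-compression step concrete, the sketch below shows one way to obtain mean embeddings from ESM-2 650M using the public fair-esm package and then fit a simple downstream model on the frozen embeddings. The example sequences, the Ridge regressor, and the target values are illustrative placeholders, not the paper's actual datasets or transfer-learning pipeline.

```python
import torch
import esm  # pip install fair-esm

# Load ESM-2 650M, one of the medium-sized pLMs evaluated in the paper
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Toy sequences; in practice these would come from the dataset of interest
data = [
    ("seq1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("seq2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
_, _, batch_tokens = batch_converter(data)
seq_lens = (batch_tokens != alphabet.padding_idx).sum(1)

# Extract per-residue embeddings from the final (33rd) transformer layer
with torch.no_grad():
    out = model(batch_tokens, repr_layers=[33], return_contacts=False)
token_reprs = out["representations"][33]

# "Mean embeddings": average over residue positions (dropping BOS/EOS and padding)
# to obtain one fixed-size 1280-dimensional vector per sequence
mean_embeddings = torch.stack(
    [token_reprs[i, 1 : seq_lens[i] - 1].mean(0) for i in range(len(data))]
)

# Transfer learning: fit a lightweight downstream model on the compressed embeddings
# (Ridge regression and the target values y are placeholders for a real labeled dataset)
from sklearn.linear_model import Ridge
y = [0.7, 1.3]
regressor = Ridge().fit(mean_embeddings.numpy(), y)
```

The same recipe applies to the other pLMs compared in the study; only the model loader, the layer index, and the embedding dimensionality change.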

Significance Statement

This work challenges the common belief that larger language models always yield better results, here in the context of protein biochemistry. By systematically comparing transformer models of different sizes on transfer learning tasks, we demonstrate that medium-sized models, such as ESM C 600M, frequently perform as well as or better than larger variants, especially when data is limited. These findings provide an efficient strategy for machine learning-based protein analysis. Smaller, more efficient models help democratize cutting-edge AI approaches, making them more accessible to researchers with limited computational resources.
