ProtSEC: Ultrafast Protein Sequence Embedding in Complex Space Using Fast Fourier Transform

Abstract

Among the various methods for representing protein sequences as vectors, protein language models (PLMs) learn embedded representations and have been shown empirically to improve accuracy. However, PLMs require substantial computational resources both for training and for generating an embedding of a new sequence, which poses significant challenges for downstream tasks involving protein sequences. To address these challenges, we propose ProtSEC (Protein Sequence Embedding in Complex Space), which first maps each amino acid to a unique complex number derived from the BLOSUM62 matrix to capture evolutionary information. The protein sequence is then treated as a complex signal, and the Fast Fourier Transform (FFT) is applied to generate an embedding in complex space. Unlike PLMs, ProtSEC does not require training on a large-scale protein sequence dataset; it ran 20,000-fold faster and was 85-fold more memory-efficient than popular PLMs such as esm2_3B, esm2_35M, prot_t5, and prot_bert. ProtSEC also showed 4% higher accuracy in sequence similarity search and improved accuracy in phylogenetic tree reconstruction compared to PLMs. ProtSEC offers fast and accurate protein sequence embeddings in complex numbers, enabling efficient integration into downstream bioinformatics analyses. ProtSEC is available at https://github.com/omics-lab/ProtSEC
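
The two steps described above (map residues to complex numbers, then apply the FFT) can be illustrated with a minimal Python sketch. It is not the ProtSEC implementation. The abstract does not specify how the complex values are derived from BLOSUM62, so the mapping below is a placeholder that spreads the 20 standard amino acids evenly on the unit circle. The function names, the fixed output length `dim`, and the similarity measure are likewise assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder mapping (an assumption): evenly spaced points on the unit circle.
# ProtSEC derives its values from BLOSUM62; that derivation is not reproduced here.
AA_TO_COMPLEX = {
    aa: np.exp(2j * np.pi * k / len(AMINO_ACIDS))
    for k, aa in enumerate(AMINO_ACIDS)
}


def embed_sequence(seq: str, dim: int = 64) -> np.ndarray:
    """Return a length-`dim` complex embedding of a protein sequence.

    Residues outside the 20 standard amino acids raise a KeyError.
    """
    signal = np.array([AA_TO_COMPLEX[aa] for aa in seq.upper()],
                      dtype=np.complex128)
    # With n=dim, np.fft.fft zero-pads shorter signals and crops longer ones,
    # so every sequence yields a vector of the same length (an assumption,
    # chosen here so that embeddings are directly comparable).
    return np.fft.fft(signal, n=dim)


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Magnitude of the normalized complex inner product of two embeddings."""
    # np.vdot conjugates its first argument, giving the Hermitian inner product.
    return float(np.abs(np.vdot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v)))


emb_a = embed_sequence("MKTAYIAKQRQISFVKSHFSRQ")
emb_b = embed_sequence("MKTAYIAKQRQISFVKSHFARQ")
print(emb_a.shape, emb_a.dtype)
print(cosine_similarity(emb_a, emb_b))
```

Because the sketch involves only a lookup and one FFT per sequence, no model weights or training data are needed, which is the property the abstract contrasts with PLMs. Cropping sequences longer than `dim` discards their tail, so a real pipeline would need a deliberate choice of output length.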
