ProtSEC: Ultrafast Protein Sequence Embedding in Complex Space Using Fast Fourier Transform
Abstract
Among the various methods for representing protein sequences as vectors, protein language models (PLMs) learn embedded representations and have been empirically shown to improve accuracy. However, PLMs require substantial computational resources both for training and for generating an embedding of a new sequence, which poses significant challenges for downstream tasks involving protein sequences. To address these challenges, we propose ProtSEC (Protein Sequence Embedding in Complex Space), which begins by mapping each amino acid to a unique complex number derived from the BLOSUM62 matrix to capture evolutionary information. The protein sequence is then treated as a complex signal, and the Fast Fourier Transform (FFT) is applied to generate an embedding in complex space. Unlike PLMs, ProtSEC does not require training on a large-scale protein sequence dataset, and it showed a 20,000-fold speedup in run time and an 85-fold improvement in memory efficiency compared to popular PLMs such as esm2_3B, esm2_35M, prot_t5, and prot_bert. ProtSEC showed 4% higher accuracy in sequence similarity search and improved accuracy in phylogenetic tree reconstruction compared to PLMs. ProtSEC offers fast and accurate protein sequence embeddings in complex numbers, enabling efficient integration into downstream bioinformatics analyses. ProtSEC is available here: https://github.com/omics-lab/ProtSEC
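The two steps described above (residue-to-complex mapping, then FFT) can be illustrated with a minimal sketch. It is not the ProtSEC implementation: the abstract does not give the BLOSUM62-derived values, so the mapping `AA_TO_COMPLEX` below is a hypothetical placeholder that puts each amino acid on the unit circle. The function names `fft` and `embed`, the `dim` parameter, the zero-padding and truncation to a fixed length, and the treatment of unknown residues are likewise assumptions made for illustration.

```python
import cmath

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: placeholder mapping, NOT the BLOSUM62-derived values of ProtSEC.
# Each of the 20 residues gets a distinct point on the unit circle.
AA_TO_COMPLEX = {
    aa: cmath.exp(2j * cmath.pi * k / len(AMINO_ACIDS))
    for k, aa in enumerate(AMINO_ACIDS)
}


def fft(signal):
    """Recursive radix-2 Cooley-Tukey FFT; len(signal) must be a power of two."""
    n = len(signal)
    if n == 1:
        return list(signal)
    even = fft(signal[0::2])  # transform of even-indexed samples
    odd = fft(signal[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def embed(sequence, dim=64):
    """Return a complex embedding of length `dim` for a protein sequence.

    ASSUMPTIONS: the sequence is truncated or zero-padded to `dim`, and
    residues outside the 20 standard amino acids are mapped to 0.
    """
    if dim < 1 or dim & (dim - 1):
        raise ValueError("dim must be a power of two")
    signal = [AA_TO_COMPLEX.get(aa, 0j) for aa in sequence.upper()[:dim]]
    signal += [0j] * (dim - len(signal))  # zero-pad short sequences
    return fft(signal)


if __name__ == "__main__":
    embedding = embed("MKTAYIAKQR", dim=16)
    print(len(embedding), embedding[0])
```

Because nothing is learned, the cost per sequence is a table lookup plus one FFT, which is consistent with the training-free, low-resource design the abstract describes. In practice a library routine such as `numpy.fft.fft` would replace the hand-written `fft` here.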