ProtSEC: Ultrafast Protein Sequence Embedding in Complex Space Using Fast Fourier Transform
Abstract
Among the various approaches for representing protein sequences as vectors, embeddings derived from protein language models (PLMs) have been empirically shown to enhance accuracy in downstream bioinformatics tasks. However, the substantial computational demands of PLMs, during both training and inference, pose significant challenges. We introduce ProtSEC (Protein Sequence Embedding in Complex Space), a novel approach that encodes each amino acid as a unique complex number derived from the BLOSUM62 substitution matrix. By modeling protein sequences as complex signals and applying the Fast Fourier Transform (FFT), ProtSEC generates embeddings in complex space. Unlike PLMs, ProtSEC requires no pre-training on large protein sequence datasets and operates independently of any pre-trained model. Our benchmarks demonstrate that ProtSEC achieves a 20,000-fold reduction in runtime and an 85-fold improvement in memory efficiency compared to popular PLMs (e.g., esm2_3B, esm2_35M, prot_t5, prot_bert). Depending on the task, ProtSEC achieves accuracy superior or comparable to that of PLMs in sequence similarity search, sequence classification, and phylogenetic tree reconstruction. ProtSEC thus provides fast and accurate complex-valued protein sequence embeddings that integrate efficiently into diverse downstream bioinformatics workflows. ProtSEC is available at https://github.com/omics-lab/ProtSEC.
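The encode-then-transform idea described in the abstract can be illustrated with a minimal Python sketch. It is not the ProtSEC implementation: the abstract does not specify how the complex numbers are derived from BLOSUM62, so the mapping below (`AA_TO_COMPLEX`, evenly spaced points on the unit circle) is a hypothetical placeholder. The function names `embed` and `cosine_similarity`, the default `dim` of 64, and the choice to zero-pad or truncate the signal to a fixed length are likewise assumptions made for illustration.

```python
import numpy as np

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: placeholder encoding. ProtSEC derives its complex numbers from
# BLOSUM62; here each residue is simply a distinct point on the unit circle.
AA_TO_COMPLEX = {
    aa: np.exp(2j * np.pi * k / len(AMINO_ACIDS))
    for k, aa in enumerate(AMINO_ACIDS)
}


def embed(sequence: str, dim: int = 64) -> np.ndarray:
    """Encode a protein sequence as a complex signal and take its FFT.

    ASSUMPTION: a fixed embedding length is obtained by letting
    numpy.fft.fft zero-pad (short sequences) or truncate (long sequences)
    the signal to `dim` samples.
    """
    signal = np.array(
        [AA_TO_COMPLEX[aa] for aa in sequence.upper()],
        dtype=np.complex128,
    )
    return np.fft.fft(signal, n=dim)


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Magnitude of the normalized Hermitian inner product of two embeddings."""
    return float(abs(np.vdot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Two short, made-up sequences that differ only in the last residue.
emb_a = embed("MKTAYIAKQR")
emb_b = embed("MKTAYIAKQK")
similarity = cosine_similarity(emb_a, emb_b)
```

Because the embedding is a single FFT over a lookup-encoded signal, no model weights are loaded and no training is involved, which is the property the abstract contrasts with PLMs. In this sketch, residues outside the 20 standard letters raise a `KeyError`; how non-standard residues should be handled is left open here.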