Direct construction of sparse suffix arrays with Libsais

Abstract

Pattern matching is a fundamental challenge in bioinformatics, especially in genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase. We present an alternative method that constructs the sparse suffix array directly, using a simple yet powerful text encoding in combination with the widely used Libsais library. This approach bypasses the need for constructing a full suffix array, reducing memory usage by 63% and construction time by 55% when building a sparse suffix array with sparseness factor 3 for the entire UniProt knowledgebase. The method is particularly effective for applications with small alphabets, such as nucleotide or amino acid alphabets. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.
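The abstract does not spell out the text encoding, so the following is only a minimal sketch of one plausible reading: pack every k consecutive characters (k being the sparseness factor) into a single integer symbol, build an ordinary suffix array over the packed text, and multiply the resulting indices by k. The function names are hypothetical, the packing scheme is an assumption rather than the authors' confirmed encoding, and a plain Python sort stands in for the Libsais call.

```python
def sparse_suffix_array(text: str, k: int) -> list[int]:
    """Sketch: sparse suffix array (every k-th suffix) via a packed text.

    Assumption: the packing below is illustrative only, not necessarily
    the encoding used in the article.
    """
    # Rank characters from 1 so that 0 can act as a padding symbol that
    # sorts before every real character.
    alphabet = sorted(set(text))
    rank = {c: i + 1 for i, c in enumerate(alphabet)}
    base = len(alphabet) + 1

    # Pad the rank sequence with zeros up to a multiple of k.
    codes = [rank[c] for c in text]
    codes += [0] * (-len(codes) % k)

    # Pack each block of k ranks into one integer (base-`base` digits),
    # so comparing two symbols compares two k-character blocks.
    packed = []
    for start in range(0, len(codes), k):
        symbol = 0
        for code in codes[start:start + k]:
            symbol = symbol * base + code
        packed.append(symbol)

    # Stand-in for a real suffix array construction library: a naive
    # sort of the suffixes of the packed text.
    order = sorted(range(len(packed)), key=lambda i: packed[i:])

    # Suffix i of the packed text starts at position i * k of the original.
    return [i * k for i in order]


def naive_sparse_suffix_array(text: str, k: int) -> list[int]:
    """Reference: sort every k-th suffix of the original text directly."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])
```

Under this packing, each symbol takes one of (alphabet size + 1)^k values, so the packed alphabet stays manageable only when the original alphabet is small, which is consistent with the abstract's remark about nucleotide and amino acid alphabets. The packed text has roughly n / k symbols, so the suffix array built over it has only that many entries.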
