Sequence Compression Benchmark (SCB) database — a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

Kirill Kryukov
Mahoko Takahashi Ueda
So Nakagawa
Tadashi Imanishi

Published on Dec 27, 2019


Abstract

Background

Nearly all molecular sequence databases currently use gzip for data compression. The ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing among them has been difficult because no comprehensive analysis of their comparative advantages for sequence compression was available.

Findings

We systematically benchmarked 410 settings of 44 compressors (including 26 specialized sequence compressors and 18 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 25 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/) that allows building custom visualizations for selected subsets of benchmark results.
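As a rough illustration of the kind of measurement described above, the following minimal sketch times compression and decompression and records the compressed size for three general-purpose compressors from the Python standard library. It is not the benchmark's actual harness: the `benchmark` function, the `COMPRESSORS` table, the toy `SAMPLE` record, and the definition of ratio as original size divided by compressed size are all assumptions made for this example, and memory use is not measured here.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data, compressors):
    """Measure size and wall-clock time for each (compress, decompress) pair.

    Hypothetical helper for illustration only; not the SCB benchmark code.
    """
    results = []
    for name, (compress, decompress) in compressors.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        unpacked = decompress(packed)
        t2 = time.perf_counter()

        # A compressor that loses data must never be reported as a winner.
        if unpacked != data:
            raise ValueError(f"{name} did not round-trip the input")

        results.append({
            "name": name,
            "compressed_size": len(packed),
            # Assumed definition: original size / compressed size.
            "ratio": len(data) / len(packed),
            "compress_seconds": t1 - t0,
            "decompress_seconds": t2 - t1,
        })
    return results


# Standard-library compressors at their default settings (an assumption;
# the benchmark itself covers many settings per compressor).
COMPRESSORS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

# Toy FASTA record: a header line followed by a highly repetitive sequence.
# Real genomes are far less repetitive, so real ratios will be much lower.
SAMPLE = b">toy_sequence\n" + b"ACGTTGCAAGGCTTAACCGGTTAGC" * 4000 + b"\n"

if __name__ == "__main__":
    rows = sorted(benchmark(SAMPLE, COMPRESSORS),
                  key=lambda r: r["compressed_size"])
    for row in rows:
        print(f'{row["name"]:6s} size={row["compressed_size"]:7d} '
              f'ratio={row["ratio"]:8.1f} '
              f'c={row["compress_seconds"]:.4f}s '
              f'd={row["decompress_seconds"]:.4f}s')
```

Single in-process timings like these are noisy. A more careful measurement would repeat each run, use realistic datasets, and track peak memory as well.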

Conclusion

We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows comparing compressors and their settings using a variety of performance measures, offering the opportunity to select the optimal compressor based on the data type and usage scenario specific to a particular application.
