mim: A lightweight auxiliary index to enable fast, parallel, gzipped FASTQ parsing

Abstract

The <monospace>FASTQ</monospace> file format is the lingua franca of primary data distribution and processing across most of bioinformatics. Over time, the compression, storage, transmission, and decompression of <monospace>gzip</monospace>-compressed <monospace>fastq.gz</monospace> files have become a substantial scalability bottleneck for modern, fast, and massively parallel genomics tools and algorithms.

In this work, we introduce <monospace>mim</monospace>: a lightweight, auxiliary index that enables fast, parallel, and highly scalable parsing of compressed <monospace>fastq.gz</monospace> files. Creating the <monospace>mim</monospace> index for a file is a one-time operation that takes time comparable to simply decompressing and parsing the file (index creation adds approximately 20% overhead) and requires minimal working memory. The <monospace>mim</monospace> index itself is very small, usually about <inline-formula> <inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="690271v1_inline1.gif"/> </inline-formula> th of the size of the original compressed file, and can easily be stored alongside the file or fetched from a remote location when it is needed. Further, the <monospace>mim</monospace> index is purely additive: it does not modify the original <monospace>gzipped FASTQ</monospace> file in any way, nor does it require that the file be recompressed or rewritten. It therefore does not require converting the massive back catalog of existing raw sequencing data.

To demonstrate the feasibility and utility of the <monospace>mim</monospace> index, we benchmark its construction on a variety of existing <monospace>gzipped FASTQ</monospace> data, and we measure the thread scaling of <monospace>mim</monospace> index-assisted parallel <monospace>FASTQ</monospace> parsing on a simple parsing/decompression-related task. We find that, for the one-time cost of index creation and a small fraction of extra storage space, the <monospace>mim</monospace> index can massively accelerate the ingestion and parsing of <monospace>gzipped FASTQ</monospace> data, exhibiting near-linear thread scaling in our experiments. <monospace>mim</monospace> is written in <monospace>C++17</monospace> and is available as open source software under a BSD 3-clause license at <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://github.com/COMBINE-lab/mim">https://github.com/COMBINE-lab/mim</ext-link>.
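
The general idea of a checkpoint index over an unmodified gzip stream can be illustrated with a small Python sketch. This is not the <monospace>mim</monospace> format or API, which the abstract does not describe. The function names (`build_checkpoints`, `count_bases`, `parallel_base_count`) and the `span` parameter are hypothetical. The sketch keeps in-memory snapshots of the decoder state rather than a serialized index file, and it assumes a single-member gzip stream, strict four-line records, and `\n` line endings. One sequential pass records, for every `span` compressed bytes, the decoder state and the number of newlines seen so far; workers then resume from those checkpoints independently and count sequence bases.

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor


def build_checkpoints(compressed: bytes, span: int):
    """Sequential pass: snapshot the decoder before each `span` compressed bytes.

    Each checkpoint is (compressed offset, decoder snapshot, newlines seen so far).
    The newline count tells a worker which FASTQ line its first byte falls in.
    """
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect gzip framing
    checkpoints = []
    newlines_seen = 0
    for offset in range(0, len(compressed), span):
        checkpoints.append((offset, decoder.copy(), newlines_seen))
        chunk = decoder.decompress(compressed[offset:offset + span])
        newlines_seen += chunk.count(b"\n")
    return checkpoints


def count_bases(compressed: bytes, span: int, checkpoint) -> int:
    """Decode one span from its checkpoint and count bytes on sequence lines."""
    offset, snapshot, first_line = checkpoint
    # Work on a copy so the stored checkpoint stays reusable.
    text = snapshot.copy().decompress(compressed[offset:offset + span])
    # Piece k lies on global line first_line + k; line index 1 (mod 4) is the sequence.
    pieces = text.split(b"\n")
    return sum(len(p) for k, p in enumerate(pieces) if (first_line + k) % 4 == 1)


def parallel_base_count(compressed: bytes, span: int = 1 << 16, workers: int = 4) -> int:
    """Build checkpoints once, then process all spans concurrently."""
    checkpoints = build_checkpoints(compressed, span)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda cp: count_bases(compressed, span, cp), checkpoints))


# Tiny usage example on an in-memory gzipped FASTQ with two records.
sample = gzip.compress(b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
print(parallel_base_count(sample, span=8, workers=2))
```

Because each worker knows the line number at which its span begins, it can classify partial lines at span boundaries without coordinating with its neighbours. A practical index would instead persist compact decoder state to disk, which this sketch does not attempt.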
