PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases
Abstract
Motivation
Species identification is a critical task in agriculture, food processing, and healthcare. The rapid growth of genomic databases, driven in part by the increasing investigation of bacterial genomes in clinical microbiology, has outpaced the capabilities of conventional tools such as BLAST for basic search and query tasks. A key bottleneck in microbiome studies lies in building indexes that allow rapid species identification and classification from assemblies while scaling efficiently to massive resources such as the AllTheBacteria database, so that large-scale analyses can be performed even on a common laptop.
Results
We introduce <monospace>PanSpace</monospace>, the first convolutional neural network–based approach that leverages dense vector (embedding) indexing, scalable to billions of embeddings, for indexing and querying massive bacterial genome databases. <monospace>PanSpace</monospace> is specifically designed to classify bacterial draft assemblies. Compared to the most recent competitive tool for this task, <monospace>PanSpace</monospace> requires only ~2 GB of disk space to index the AllTheBacteria database, an 8× reduction. Moreover, it delivers ultra-fast query performance, processing more than 1,000 assemblies in less than two and a half minutes, while matching the accuracy of state-of-the-art approaches.
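To illustrate the general idea of classifying a query by searching a dense vector index, the following is a minimal sketch and not the PanSpace implementation. It assumes that each assembly has already been mapped to a fixed-length embedding (in PanSpace this would be done by the convolutional network). The class name `ToyEmbeddingIndex`, the three-dimensional vectors, and the majority-vote rule are hypothetical choices made only for illustration. A real index at the scale described above would rely on a dedicated similarity-search library rather than the brute-force scan used here.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class ToyEmbeddingIndex:
    """Hypothetical brute-force index of labeled embeddings (illustration only)."""

    def __init__(self):
        self._vectors = []
        self._labels = []

    def add(self, embedding, label):
        # Store the unit-length embedding together with its species label.
        self._vectors.append(normalize(embedding))
        self._labels.append(label)

    def query(self, embedding, k=3):
        # Score every stored vector by cosine similarity and keep the top k.
        q = normalize(embedding)
        scored = [
            (sum(a * b for a, b in zip(q, v)), label)
            for v, label in zip(self._vectors, self._labels)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    def classify(self, embedding, k=3):
        # Assumed decision rule: majority vote among the k nearest neighbors.
        votes = Counter(label for _, label in self.query(embedding, k))
        return votes.most_common(1)[0][0]


# Made-up embeddings standing in for the output of an embedding model.
index = ToyEmbeddingIndex()
index.add([0.9, 0.1, 0.0], "Escherichia coli")
index.add([0.8, 0.2, 0.1], "Escherichia coli")
index.add([0.1, 0.9, 0.2], "Salmonella enterica")
index.add([0.0, 0.8, 0.3], "Salmonella enterica")

predicted = index.classify([0.85, 0.15, 0.05], k=3)
print(predicted)
```

Because only one small vector is stored per genome rather than its sequence content, an index of this kind can stay compact, which is consistent with the small disk footprint reported above.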
Availability
<monospace>PanSpace</monospace> is available at <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://github.com/pg-space/panspace">https://github.com/pg-space/panspace</ext-link>.