PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases

Jorge Avila Cartes
Simone Ciccolella
Luca Denti
Raghuram Dandinasivara
Gianluca Della Vedova
Paola Bonizzoni
Alexander Schönhuth

Published on Apr 5, 2025

Abstract

Motivation

Species identification is a crucial task in fields such as agriculture, food processing, and healthcare. The rapid expansion of genomic databases, driven in particular by the growing focus on investigating new bacterial genomes in clinical microbiology, has outpaced the capabilities of conventional tools such as BLAST for basic search and query procedures. A major bottleneck in microbiome studies is building indexes that enable rapid identification and classification of species from assemblies while scaling efficiently to the AllTheBacteria database, currently among the largest bacterial genome databases, so that large-scale analysis becomes feasible on a common laptop.

Results

We introduce PANSPACE, the first convolutional neural network-based approach that leverages dense vector (embedding) indexing, proven to scale up to 1 billion embeddings, to index and query very large bacterial genome databases. PANSPACE is designed to classify (draft) assemblies of bacteria. Compared to the most recent and competitive tool for this task, our index requires only ~2 GB of disk space for the AllTheBacteria database, more than 40× less. Additionally, PANSPACE is ultra-fast in genomic queries, processing over 1,000 queries in under two and a half minutes while maintaining high accuracy compared to the current state-of-the-art tool for the same task.
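
To make the idea of querying an embedding index concrete, the following is a minimal sketch of nearest-neighbour species assignment over dense vectors. It is not the PANSPACE implementation: the function names (`normalize`, `query_index`), the toy 4-dimensional embeddings, the cosine similarity measure, and the majority vote over `k` neighbours are all assumptions chosen for illustration, and a brute-force NumPy search stands in for whatever scalable vector index the tool actually uses.

```python
from collections import Counter

import numpy as np


def normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def query_index(index_vectors, index_labels, query_vectors, k=3):
    """Label each query by majority vote among its k most similar index entries.

    Hypothetical helper: a brute-force search, not a scalable index.
    """
    similarity = normalize(query_vectors) @ normalize(index_vectors).T
    # Column indices of the k highest similarities for every query row.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    predictions = []
    for neighbours in top_k:
        votes = Counter(index_labels[i] for i in neighbours)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy data (assumption): two species whose embeddings cluster around
# orthogonal centres, with small Gaussian noise added to each genome.
rng = np.random.default_rng(0)
centres = {
    "species_A": np.array([1.0, 0.0, 0.0, 0.0]),
    "species_B": np.array([0.0, 1.0, 0.0, 0.0]),
}
index_labels = ["species_A"] * 5 + ["species_B"] * 5
index_vectors = np.vstack(
    [centres[label] + 0.05 * rng.standard_normal(4) for label in index_labels]
)
query_vectors = np.vstack(
    [
        centres["species_A"] + 0.05 * rng.standard_normal(4),
        centres["species_B"] + 0.05 * rng.standard_normal(4),
    ]
)

predicted = query_index(index_vectors, index_labels, query_vectors, k=3)
print(predicted)
```

In this sketch the index stores one fixed-length vector per genome rather than the sequence itself, which is the general reason an embedding index can be far smaller than the data it summarizes. In practice the embeddings would come from the trained network, and the exhaustive search would be replaced by an approximate nearest-neighbour index.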

Availability

PANSPACE is available at https://github.com/pg-space/panspace.
