PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases

Jorge Avila Cartes
Simone Ciccolella
Luca Denti
Raghuram Dandinasivara
Gianluca Della Vedova
Paola Bonizzoni
Alexander Schönhuth

Published on Nov 5, 2025

Abstract

Motivation

Species identification is a critical task in agriculture, food processing, and healthcare. The rapid growth of genomic databases, driven in part by the increasing investigation of bacterial genomes in clinical microbiology, has outpaced the capabilities of conventional tools such as BLAST for basic search and query tasks. A key bottleneck in microbiome studies is building indexes that support rapid species identification and classification from assemblies while scaling efficiently to massive resources such as the AllTheBacteria database, so that large-scale analyses can be performed even on a common laptop.

Results

We introduce PanSpace, the first convolutional neural network-based approach that leverages dense vector (embedding) indexing, scalable to billions of embeddings, for indexing and querying massive bacterial genome databases. PanSpace is specifically designed to classify bacterial draft assemblies. Compared to the most recent and competitive tool for this task, PanSpace requires only ~2 GB of disk space to index the AllTheBacteria database, an 8× reduction relative to existing methods. Moreover, it delivers ultra-fast query performance, processing more than 1,000 assemblies in less than two and a half minutes, while preserving the accuracy of state-of-the-art approaches.
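
As a rough illustration of the general idea of embedding-based indexing and querying, the sketch below stores labelled vectors and classifies a query by majority vote among its nearest neighbours under cosine similarity. It is not PanSpace's implementation: the class `ToyEmbeddingIndex`, the function `classify`, the brute-force search, the voting rule, and the three-dimensional vectors are all assumptions made for this example. In practice the embeddings would come from the trained network, and an index meant to scale to billions of vectors would rely on a dedicated similarity-search library rather than a linear scan.

```python
from collections import Counter
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class ToyEmbeddingIndex:
    """Hypothetical brute-force index: unit vectors paired with species labels."""

    def __init__(self):
        self._vectors = []
        self._labels = []

    def add(self, label, vector):
        """Store one labelled embedding (normalized on insertion)."""
        self._vectors.append(normalize(vector))
        self._labels.append(label)

    def query(self, vector, k=1):
        """Return the k most similar entries as (similarity, label) pairs."""
        q = normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), label)
            for v, label in zip(self._vectors, self._labels)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def classify(index, vector, k=3):
    """Assign the majority label among the k nearest neighbours (assumed rule)."""
    hits = index.query(vector, k)
    if not hits:
        raise ValueError("index is empty")
    votes = Counter(label for _, label in hits)
    return votes.most_common(1)[0][0]


# Made-up 3-dimensional embeddings, purely for illustration.
index = ToyEmbeddingIndex()
index.add("Escherichia coli", [1.0, 0.1, 0.0])
index.add("Escherichia coli", [0.9, 0.2, 0.0])
index.add("Salmonella enterica", [0.0, 1.0, 0.1])

print(classify(index, [0.95, 0.15, 0.0]))
```

Because each genome is reduced to a single fixed-length vector, the index size grows with the number of genomes rather than with their sequence length, which is the intuition behind the small disk footprint reported above.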

Availability

PanSpace is available at https://github.com/pg-space/panspace.
