Pangenome databases provide superior host removal and mycobacteria classification from clinical metagenomic data
Abstract
Background
Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequence at the point of sequencing, typically involving use of resource-constrained devices. Existing benchmarks have largely focused on use of standardised databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.
Results
We benchmarked host removal pipelines on simulated Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification ofMycobacteriumreads using standard and custom databases. With a database representative of theMycobacteriumgenus, both tools obtained near-perfect precision and recall for classification ofMycobacterium tuberculosis. Computational efficiency of these custom databases was again superior to most standard approaches, allowing them to be executed on a laptop device.
Conclusions
Nanopore sequencing and a custom kraken human database with a diversity of genomes leads to superior host read removal from simulated metagenomic samples while being executable on a laptop. In addition, constructing a taxon-specific database provides excellent taxonomic read assignment while keeping runtime and memory low. We make all customised databases and pipelines freely available.
Related articles
Related articles are currently not available for this article.