Fast and accurate taxonomic domain assignment of short metagenomic reads using BBERT


Abstract

Metagenomes from complex environments such as soil contain vast biodiversity, yet most short reads cannot be taxonomically or functionally annotated because the organisms they derive from lack reference genomes, obscuring the true structure and function of microbial communities. We introduce BBERT, a nucleotide large language model that identifies bacterial sequence syntax without relying on reference databases, enabling accurate assignment of taxonomic domain, coding potential, and reading frame directly from reads as short as 100 bp. Applying BBERT to a global dataset of soil metagenomes reveals that the majority of previously unannotated “microbial dark matter” is non-bacterial, and that resolving this conflation reshapes functional inferences from global surveys, uncovering functional differences between temperate and boreal-arctic soils. BBERT also improves de novo metagenomic assembly, reducing mismatches and gaps while shortening runtime. By providing fast, reference-free classification of short reads, BBERT unlocks large metagenomic archives for more accurate ecological and evolutionary analyses.
