Billi: Provably Accurate and Scalable Bubble Detection in Pangenome Graphs

This article has 0 evaluations Published on
Read the full article Related papers
This article on Sciety

Abstract

A key application of pangenome graphs is the characterization of small and large genomic variants represented as bubbles within the graph. Although bubbles have been extensively studied in directed graphs in the context of genome assembly, there remains a need for a rigorous definition and systematic analysis of bubbles in bidirected graphs, which is the predominant data structure used to represent pangenomes. We show that existing bubble definitions for bidirected graphs do not fully meet the requirements for representing genetic sites and alleles; in particular, overlapping bubbles may not exhibit strict nesting. To address this, we introduce a new sub-graph abstraction called panbubble and prove that it satisfies the desired structural properties for variant characterization. We then present an exact algorithm with runtime 𝒪 (| V | 2 (| V | + | E |)) for detecting all panbubbles in a bidirected graph G = ( V, E ). In addition, we propose a heuristic algorithm that produces identical output as the exact algorithm in practice and scales to large graphs, including both the first and second releases of the Human Pangenome Reference Consortium (HPRC). We implemented our algorithms in the tool <monospace>Billi</monospace> ( <ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="http://github.com/at-cg/billi">github.com/at-cg/billi</ext-link> ). On our largest dataset, Billi is more than 15× faster and uses over 5× less memory than VG.

Related articles

Related articles are currently not available for this article.