vMUS-dBG: A Novel De Bruijn Graph Model for De Novo Genome Assembly Using Variable-Length Minimum Unique Substrings
Abstract
De novo genome assembly using de Bruijn graphs (DBGs) typically relies on fixed-length k-mers as the nodes of the graph. While this approach is effective, it presents a fundamental trade-off: smaller k values tend to collapse repeats, whereas larger k values can result in fragmentation, particularly in low-coverage regions. Although multi-k and variable-order methods help mitigate these issues, they still rely on fixed-length topology or heuristic parameter selection. In this work, we introduce a de Bruijn graph constructed from Minimum Unique Substrings (MUSs), substrings that appear exactly once within the genome. We refer to this graph as the variable-length MUS de Bruijn graph (vMUS-dBG). In the vMUS-dBG, the nodes are defined by read-extracted MUS anchors, and directed edges represent read-supported transitions between successive occurrences of MUSs. Each edge also carries instance-level metadata that preserves positional weights (repeats) and support counts. This design removes the need to select a global k, and it yields a concrete, repeat-aware graph construction that operates differently from the abstract manifold-style DBG model. Our experiments on real 24x E. coli K-12 HiFi data demonstrate that a prototype implementation of our approach achieves contiguity and accuracy comparable to those of a fixed-k method. These results establish MUS-based variable-length graph construction as a principled and biologically grounded alternative to fixed-k de Bruijn graph assembly that merits further exploration.
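To make the construction described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a known toy sequence from which anchors are computed by brute force (the paper extracts anchors from reads), and it treats an MUS as a substring that occurs exactly once and contains no shorter unique substring. The function names `minimal_unique_substrings`, `anchor_hits`, and `build_anchor_graph`, and the `support`/`offsets` edge fields, are hypothetical.

```python
from collections import Counter, defaultdict


def minimal_unique_substrings(seq):
    """Brute-force MUS finder for a short toy sequence (quadratic number of substrings)."""
    n = len(seq)
    # Count every occurrence of every substring, overlapping occurrences included.
    counts = Counter(seq[i:j] for i in range(n) for j in range(i + 1, n + 1))
    result = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            if counts[seq[i:j]] != 1:
                continue
            # seq[i:j] is the shortest unique substring starting at i, so its
            # prefix seq[i:j-1] is repeated. It is minimal if dropping the
            # first character also gives a repeated (or empty) substring.
            if j - i == 1 or counts[seq[i + 1:j]] > 1:
                result.append((i, seq[i:j]))
            break  # any longer substring starting at i contains a unique one
    return result


def anchor_hits(read, anchors):
    """Return (position, anchor) pairs for all anchor occurrences in a read, sorted by position."""
    hits = []
    for anchor in anchors:
        pos = read.find(anchor)
        while pos != -1:
            hits.append((pos, anchor))
            pos = read.find(anchor, pos + 1)
    return sorted(hits)


def build_anchor_graph(reads, anchors):
    """Link successive anchors within each read; edges record support and offsets (assumed metadata)."""
    edges = defaultdict(lambda: {"support": 0, "offsets": []})
    for read in reads:
        hits = anchor_hits(read, anchors)
        for (p1, a1), (p2, a2) in zip(hits, hits[1:]):
            edge = edges[(a1, a2)]
            edge["support"] += 1               # number of reads supporting the transition
            edge["offsets"].append(p2 - p1)    # distance between anchor start positions
    return dict(edges)


toy_genome = "ACACG"
mus = minimal_unique_substrings(toy_genome)
anchors = {s for _, s in mus}
graph = build_anchor_graph(["ACACG", "CACG"], anchors)
```

In this toy case the anchors have different lengths (`CA` and `G`), which illustrates why no global k needs to be chosen. A real assembler would need an index-based MUS search and handling of sequencing errors and reverse complements, none of which this sketch attempts.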