Natural variation in regulatory code revealed through Bayesian analysis of plant pan-genomes and pan-transcriptomes

Abstract

Understanding the genetic code of cis-regulatory elements (CREs) is essential for engineering gene expression and modulating agronomic traits in crops. In plants, CREs underlying rapid evolution of gene expression often overlap with structural variation in promoters, making them undetectable using single-reference genomes. Here, we develop K-PROB (K-mer-based in silico PROmoter Bashing), a computational tool that learns from intraspecies promoter sequence and gene expression variation in pan-genomes and pan-transcriptomes to identify CREs controlling gene expression. K-PROB deploys a k-mer-based Bayesian variable selection framework to prioritize the identification of causal variables. We demonstrate the effectiveness of our approach in maize and soybean, two staple crop species. Applying K-PROB to genes with the most highly variable promoter sequences and the most diverse patterns of expression, such as nucleotide-binding leucine-rich repeat receptors, we identified k-mers that are enriched for bona fide transcription factor binding sequences and that overlap with open chromatin regions and DAP-seq binding sites. Notably, multiple significant k-mers are located within presence/absence structural variants, highlighting structural variation in promoters as a key driver of transcriptional diversity in highly variable genes. We further validated the regulatory effects of identified k-mers on gene expression using luciferase reporter assays. Our results showcase a high-throughput, pangenomic approach for probing natural intraspecies cis-regulatory diversity, discovering new causative cis-elements, and facilitating future expression engineering across plant species.
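
To make the input to a k-mer-based analysis concrete, the following is a minimal sketch of how promoter sequences from several accessions could be encoded as a k-mer presence/absence matrix, keeping only k-mers that vary between accessions. It is an illustration under stated assumptions, not K-PROB's implementation: the function names, the toy sequences, the accession labels, and the choice of k = 4 are all hypothetical, and the Bayesian variable selection step itself is not shown.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers found in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_absence_matrix(promoters, k):
    """Encode promoters as a binary k-mer presence/absence matrix.

    promoters: dict mapping accession name -> promoter sequence
               (hypothetical input format for this sketch).
    Returns (accessions, variable_kmers, matrix), where matrix[i][j] is 1
    if variable_kmers[j] occurs in the promoter of accessions[i].
    K-mers present in all accessions are dropped, since they carry
    no information about expression differences between accessions.
    """
    accessions = sorted(promoters)
    per_acc = {a: kmers(promoters[a], k) for a in accessions}
    all_kmers = set().union(*per_acc.values())
    variable_kmers = sorted(
        km for km in all_kmers
        if 0 < sum(km in per_acc[a] for a in accessions) < len(accessions)
    )
    matrix = [[int(km in per_acc[a]) for km in variable_kmers]
              for a in accessions]
    return accessions, variable_kmers, matrix


# Toy promoters for the same gene in three hypothetical accessions.
toy_promoters = {
    "acc1": "ACGTACGT",
    "acc2": "ACGTTTGT",
    "acc3": "ACGTACGT",
}
accessions, variable_kmers, matrix = presence_absence_matrix(toy_promoters, k=4)
for name, row in zip(accessions, matrix):
    print(name, row)
```

In a setup like this, each column of the matrix would be a candidate predictor and the per-accession expression level of the gene would be the response; a variable selection method would then decide which columns to retain.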

Significance Statement

Understanding which DNA sequences control gene expression is essential for crop improvement. Current methods for identifying regulatory elements rely on expensive, specialized biochemical datasets typically limited to a single genotype. We developed a computational tool that links natural sequence variation and gene expression variation to identify functional regulatory sequences. Our tool employs a statistical framework that prioritizes causality over correlation, in contrast to most genome-wide association studies. Applying it to maize and soybean, two staple crops, we uncovered known and novel regulatory elements and validated them with molecular assays. Our approach is scalable, cost-effective, and efficiently utilizes natural variation from existing pangenomic datasets, opening new avenues for future crop engineering and studying gene regulation in diverse plant species.
