Mining conserved and divergent signals in 5’ splicing site sequences across fungi, metazoa and plants
Abstract
The main steps of the splicing process are similar across eukaryotes. However, differences in splicing factors, gene architecture and sequence divergence suggest clade-specific features of splicing and its regulation. In each organism, the ensemble of 5’ splicing sequences reflects the balance between natural nucleotide variability and the minimal molecular constraints needed to ensure splicing fidelity. This compromise shapes the underlying statistical patterns in donor sequence composition.
In this work we aimed to mine conserved and divergent signals in splicing donor sequences. As 5’ donor sequences are a major cue for the proper recognition of splicing sites, we reasoned that statistical regularities in their sequence composition might reflect the biological functionality and evolutionary history associated with splicing mechanisms.
We used a regularized maximum entropy modeling framework to mine non-trivial two-site correlations in the donor sequences of 30 eukaryotic organisms. Our approach allowed us to accommodate and extend, within a unified framework, many of the regularities observed in previous works, such as the negative epistatic effects between exonic and intronic consensus sites. In addition, for each analyzed organism, we could identify minimal sets of two-site coupling patterns that could generate, at a given regularization level, the observed one-site and two-site frequencies in donor sequences. Notably, by performing a systematic comparative analysis of 5’ss, we showed that lineage information could be traced from joint di-nucleotide probabilities. Specifically, we identified characteristic two-site coupling patterns for plants and for animals, and we argue that they could echo differences in splicing regulation previously reported between these groups.
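The one-site and two-site frequencies mentioned above are the observables that a pairwise maximum entropy model is fitted to reproduce. As a minimal sketch, the snippet below computes them, together with the connected correlations f_ij(a,b) - f_i(a) f_j(b), from a small set of aligned donor sequences. The function names and the four toy 9-mers are hypothetical illustrations and are not taken from the study. The regularized fitting of coupling parameters is not shown.

```python
from collections import Counter
from itertools import combinations


def site_frequencies(seqs):
    """One-site frequencies f_i(a) for each position i of aligned sequences."""
    n = len(seqs)
    length = len(seqs[0])
    return [
        {a: c / n for a, c in Counter(s[i] for s in seqs).items()}
        for i in range(length)
    ]


def pair_frequencies(seqs):
    """Two-site joint frequencies f_ij(a, b) for every position pair i < j."""
    n = len(seqs)
    length = len(seqs[0])
    return {
        (i, j): {
            ab: c / n for ab, c in Counter((s[i], s[j]) for s in seqs).items()
        }
        for i, j in combinations(range(length), 2)
    }


def connected_correlations(seqs):
    """c_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b) for nucleotides seen at i and j."""
    f1 = site_frequencies(seqs)
    f2 = pair_frequencies(seqs)
    return {
        (i, j): {
            (a, b): f2[(i, j)].get((a, b), 0.0) - f1[i][a] * f1[j][b]
            for a in f1[i]
            for b in f1[j]
        }
        for (i, j) in f2
    }


# Hypothetical toy alignment (made-up 9-mers, not data from the study).
toy_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TTGGTAAGT"]

f1 = site_frequencies(toy_donors)
c = connected_correlations(toy_donors)
```

A non-zero connected correlation at a position pair indicates that the joint di-nucleotide frequency departs from what independent sites would give, which is the kind of signal that two-site couplings are meant to capture.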
Author summary
The sequence composition of the 5’ splicing sites of eukaryotic organisms reflects a complex scenario. Nucleotide variability has to coexist with the need to correctly define exon/intron boundaries, and the fidelity of splicing seems to depend on a pattern of trade-offs between substitutions at different positions of the splicing site. A better understanding of these patterns may provide insight into the details underlying the splicing process and its evolution.
In this work we aimed to study conserved and divergent signatures embedded in the sequence composition of eukaryotic 5’ splicing sites. We developed generative probabilistic models that allowed us to analyze the sequence composition of donor sequences for several eukaryotic organisms. Our regularized models served to incrementally disentangle the minimal set of coupling parameters needed to accurately reproduce the observed one-site and two-site nucleotide frequencies. Focusing our study on di-nucleotide probabilities, we found that they carry phylogenetic signal. In particular, our comparative analysis allowed us to identify differential two-site coupling patterns for animals and plants that might be related to specific differences in splicing regulation.