IRCAS: a novel end-to-end approach to identify, rectify and classify comprehensive alternative splicing events in a transcriptome without genome reference
Abstract
Alternative splicing (AS) is a fundamental post-transcriptional mechanism that amplifies proteomic diversity and enables adaptive responses across eukaryotes. Current AS detection methods rely heavily on reference genomes, limiting their applicability to non-model organisms. Existing reference-free approaches suffer from inaccurate splice site prediction and treat detection and classification as separate processes, resulting in cascading errors. We present IRCAS, an integrated end-to-end framework for reference-free AS analysis, comprising three modules: identification, rectification, and classification. IRCAS employs colored de Bruijn graphs (cDBGs) for AS detection, an attention-based convolutional neural network (CNN) for splice site rectification, and a hybrid graph neural network combining graph attention network (GAT) and Transformer layers for classification. Evaluation across four species demonstrates substantial improvements: splice site accuracy reached 92-96% versus 50-55% for existing methods, and end-to-end accuracy reached 83.4% compared to 41.2% for the previous best method. IRCAS establishes a new benchmark for reference-free AS detection in non-model organisms.
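To make the bubble idea behind graph-based AS detection concrete, the following is a minimal sketch and not the IRCAS or MkcDBGAS implementation. It assumes a simplified node-colored view in which each k-mer is a node labeled with the sequences that contain it; the function names `colored_kmers` and `bubble_branches` and the toy sequences are hypothetical.

```python
def colored_kmers(seq_a, seq_b, k):
    """Map each k-mer to the set of sequence labels ("A", "B") that contain it."""
    colors = {}
    for label, seq in (("A", seq_a), ("B", seq_b)):
        for i in range(len(seq) - k + 1):
            colors.setdefault(seq[i:i + k], set()).add(label)
    return colors


def bubble_branches(seq_a, seq_b, k):
    """Return the k-mers private to each sequence, in order of first occurrence.

    Shared k-mers form the flanks of a bubble; private k-mers form its two branches.
    """
    colors = colored_kmers(seq_a, seq_b, k)
    branches = {}
    for label, seq in (("A", seq_a), ("B", seq_b)):
        seen, path = set(), []
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if colors[kmer] == {label} and kmer not in seen:
                seen.add(kmer)
                path.append(kmer)
        branches[label] = path
    return branches


# Toy isoform pair (hypothetical): isoform B skips the middle segment "GGTGG".
isoform_a = "AACCA" + "GGTGG" + "TTCTT"
isoform_b = "AACCA" + "TTCTT"
branches = bubble_branches(isoform_a, isoform_b, k=3)
```

In this toy case the shorter branch holds only the k-mers that span the new junction in isoform B, while the longer branch holds the k-mers that overlap the skipped segment. Unequal branch lengths of this kind are one topological cue that distinguishes an AS-induced bubble from an SNV-induced one.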
GRAPHICAL ABSTRACT
<fig id="fig1" position="float" orientation="portrait" fig-type="figure"> <label>Fig 1.</label> <caption>Workflow for the construction and application of IRCAS. IRCAS comprises three modules: identification, rectification, and classification. (A) Workflow for reference-free AS identification from raw transcriptomic data. Input transcripts are first screened with an all-versus-all BLAST alignment. The MkcDBGAS graph construction strategy is then applied: a colored de Bruijn graph (cDBG) is constructed from two sequences using a specified k-mer size. Based on their topology, bubbles are classified into five types: SNV-induced, a class induced by four AS types, MX-induced, AL-induced, and AF-induced. (B) Workflow for AS position offset rectification and reconstruction of the cDBG. Each input transcript pair is converted into a single sequence that includes two virtual nucleotides marking the splicing start and end sites. SUPPA, a reference-based method, is used to determine the true splicing sites. The sequence is one-hot encoded into an n×6 matrix. The offset between the true and predicted splicing sites is calculated and encoded as the ground truth. An attention-based convolutional neural network (CNN) rectification model is trained on these data to predict the offset, enabling reconstruction of the cDBG with corrected splicing positions. (C) Workflow for classification of four AS types. For each cDBG, node features, edge features, and global features are extracted. These features are integrated into distinct layers of a graph attention network (GAT)-Transformer hybrid model, which enables high-precision classification of four types of AS events. (D) Workflow for end-to-end application of IRCAS. Transcriptomic data from any species lacking a reference genome are processed by IRCAS, enabling high-accuracy classification of seven AS types.
</caption> <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="689457v1_fig1" position="float" orientation="portrait"/> </fig>
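The encoding step described in panel (B) of Fig 1 can be sketched as follows. This is an illustrative assumption, not the authors' code. The marker symbols `<` and `>` for the two virtual nucleotides, the choice to mark sites on a single isoform sequence, the half-open coordinate convention, the sign of the offset (true minus predicted), and all function names are hypothetical.

```python
# Six-symbol alphabet: four nucleotides plus two virtual nucleotides.
# '<' and '>' are hypothetical stand-ins for the splicing start and end markers.
ALPHABET = "ACGT<>"


def mark_sites(seq, start, end):
    """Insert the virtual nucleotides around the half-open interval seq[start:end]."""
    return seq[:start] + "<" + seq[start:end] + ">" + seq[end:]


def one_hot(seq):
    """Encode a marked sequence of length n as an n x 6 list of 0/1 rows."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    rows = []
    for ch in seq:
        row = [0] * len(ALPHABET)
        row[index[ch]] = 1
        rows.append(row)
    return rows


def site_offset(predicted, true):
    """Offset (true minus predicted) for the start and end sites; assumed sign convention."""
    return (true[0] - predicted[0], true[1] - predicted[1])


# Toy example: the graph-based prediction places the start one base too early.
sequence = "AACCAGGTGGTTCTT"
predicted_sites = (4, 10)
true_sites = (5, 10)  # in the workflow these would come from a reference-based tool

marked = mark_sites(sequence, *predicted_sites)
encoded = one_hot(marked)
target = site_offset(predicted_sites, true_sites)
```

Here `encoded` is the model input and `target` is the ground-truth label the rectification model would learn to predict. Adding `target` to the predicted coordinates recovers the corrected splicing positions used to rebuild the graph.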