Benchmarking the coding strategies of non-coding mutations on sequence-based downstream tasks with machine learning
Abstract
Non-coding single nucleotide polymorphisms (SNPs) are critical drivers of gene regulation and disease susceptibility, yet predicting their functional impact remains challenging. A variety of methods exist for encoding non-coding SNPs, such as direct base encoding or embeddings obtained from pre-trained models. However, comprehensive evaluation of these encoding strategies, and guidance on choosing among them for downstream prediction tasks involving non-coding SNPs, is lacking. To address this gap, we present a benchmark study that compares six distinct encoding strategies for non-coding SNPs, assessing them across six dimensions, including interpretability, encoding abundance, and computational efficiency. Using three Quantitative Trait Loci (QTL)-related downstream tasks involving non-coding SNPs, we test these encoding strategies in combination with nine machine learning and deep learning models. Our findings demonstrate that semantic embeddings show strong robustness, and that the encoding strategy and the model used for downstream prediction are both key variables influencing task performance. This benchmark provides actionable insights into the interplay between encoding strategies, models, and data properties, offering a framework for optimizing QTL prediction tasks and advancing the analysis of non-coding SNPs in genomic regulation.
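As an illustration of what "direct base encoding" can look like in practice, the following is a minimal sketch of one-hot encoding a sequence window around a SNP. It is not the benchmark's own implementation: the function names `one_hot` and `encode_snp`, the choice to concatenate reference and alternate encodings, and the all-zero treatment of ambiguous bases are all assumptions made for this example.

```python
# Hypothetical sketch of direct (one-hot) base encoding for a SNP.
# All names and design choices here are illustrative assumptions.

BASES = "ACGT"


def one_hot(seq):
    """Return one 4-element row per base; bases outside ACGT (e.g. N) give all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def encode_snp(window, snp_index, alt):
    """Encode the reference window and its alternate-allele version as one flat vector."""
    ref_seq = window.upper()
    # Substitute the alternate allele at the SNP position.
    alt_seq = ref_seq[:snp_index] + alt.upper() + ref_seq[snp_index + 1:]
    ref_flat = [v for row in one_hot(ref_seq) for v in row]
    alt_flat = [v for row in one_hot(alt_seq) for v in row]
    return ref_flat + alt_flat


# Example: a 5-base window with a G>T substitution at index 2.
features = encode_snp("ACGTN", snp_index=2, alt="T")
```

The resulting flat vector could be passed to a conventional classifier; an embedding-based strategy would instead replace `one_hot` with the output of a pre-trained sequence model.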