upsAI: A high-accuracy machine learning classifier for predicting Plasmodium falciparum var gene upstream groups
Abstract
Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), encoded by the hypervariable var gene family, is central to malaria pathogenesis, influencing both disease severity and immune evasion. Classifying var genes into upstream groups (upsA, upsB, upsC, upsE) is important for understanding parasite biology and clinical outcomes, but remains challenging, especially with partial sequences such as the DBLα tag or RNA-Seq assemblies.
We developed upsAI, a machine learning-based classifier trained on 2,530 curated var genes, to accurately assign upstream groups using sequence features from different partial gene regions. We compared seven different methods, including support vector machines, random forest, XGBoost and HMMER models. The best upsAI model achieves an overall accuracy of 83% for DBLα-tag sequences and 92% for full-length var genes, significantly outperforming existing tools. Further, we propose a new model that distinguishes between internal and subtelomeric var genes with high accuracy and scalability.
upsAI is available at https://github.com/sii-scRNA-Seq/upsAI, providing a robust and efficient resource for large-scale var gene analysis. It can classify var genes from 20 genomes in under one second.