Prediction of coronavirus 3C-like protease cleavage sites using machine-learning algorithms
Abstract
The coronavirus 3C-like (3CL) protease is a cysteine protease. It plays an important role in viral infection and immune escape, not only by cleaving the viral polyprotein ORF1ab at 11 sites but also by cleaving host proteins. However, effective tools for determining the cleavage sites of the 3CL protease are still lacking. This study systematically investigated the diversity of the cleavage sites of the coronavirus 3CL protease on the viral polyprotein and found that the cleavage motifs were highly conserved among viruses in the genera Alphacoronavirus, Betacoronavirus and Gammacoronavirus. Strong residue preferences were observed at the positions neighboring the cleavage sites. A random forest (RF) model was built to predict the cleavage sites of the coronavirus 3CL protease, with residues at the cleavage site and neighboring positions represented by amino acid indexes; the model achieved an AUC of 0.96 in cross-validation. The RF model was further tested on an independent test dataset composed of cleavage sites on host proteins, where it achieved an AUC of 0.88 and a prediction precision of 0.80 when the accessibility of the cleavage site was taken into account. The RF model then predicted 1,079 human proteins to be cleaved by the 3CL protease. These proteins were enriched in pathways related to neurodegenerative diseases and pathogen infection. Finally, a user-friendly online server named 3CLP was built to predict the cleavage sites of the coronavirus 3CL protease based on the RF model. Overall, the study not only provides an effective tool for identifying the cleavage sites of the 3CL protease but also offers insights into the molecular mechanism underlying the pathogenicity of coronaviruses.
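The feature representation described in the abstract (residues at and around a candidate cleavage site encoded by amino acid indexes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the window size of four residues on each side, the use of the Kyte-Doolittle hydropathy scale as the only index, the fallback value for unknown residues, and the function names `candidate_windows` and `encode_window` are all assumptions made for the example. The resulting numeric vectors are the kind of input a random forest classifier could be trained on.

```python
from typing import Dict, Iterator, List, Tuple

# Kyte-Doolittle hydropathy scale, used here as ONE example amino acid index.
# (Assumption: the abstract does not state which indexes the study used.)
KD_HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def candidate_windows(
    seq: str, left: int = 4, right: int = 4
) -> Iterator[Tuple[int, str]]:
    """Yield (cut, window) for every candidate cut with complete flanks.

    A cut at index `cut` lies between seq[cut - 1] and seq[cut]; the window
    holds `left` residues before the cut and `right` residues after it.
    The default flank sizes are hypothetical.
    """
    for cut in range(left, len(seq) - right + 1):
        yield cut, seq[cut - left:cut + right]


def encode_window(
    window: str, index: Dict[str, float] = KD_HYDROPATHY
) -> List[float]:
    """Map each residue to its index value (unknown residues become 0.0)."""
    return [index.get(aa, 0.0) for aa in window]


if __name__ == "__main__":
    toy_sequence = "TSAVLQSGFRKM"  # toy peptide for illustration only
    for cut, window in candidate_windows(toy_sequence):
        print(cut, window, encode_window(window))
```

In practice several indexes would likely be concatenated per position, and each encoded window would be paired with a label (cleaved or not cleaved) before model training.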