A multi-class gene classifier for SARS-CoV-2 variants based on convolutional neural network
Abstract
Surveillance of circulating variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is of great importance in controlling the coronavirus disease 2019 (COVID-19) pandemic. We propose an alignment-free in silico approach for classifying SARS-CoV-2 variants based on their genomic sequences. A deep learning model was constructed utilizing a stacked 1-D convolutional neural network and multilayer perceptron (MLP). The pre-processed genomic sequencing data of the four SARS-CoV-2 variants were first fed to three stacked convolution-pooling nets to extract local linkage patterns in the sequences. Then a 2-layer MLP was used to compute the correlations between the input and output. Finally, a logistic regression model transformed the output and returned the probability values. Learning curves and stratified 10-fold cross-validation showed that the proposed classifier enables robust variant classification. External validation of the classifier showed an accuracy of 0.9962, precision of 0.9963, recall of 0.9963 and F1 score of 0.9962, outperforming other machine learning methods, including logistic regression, K-nearest neighbor, support vector machine, and random forest. By comparing our model with an MLP model without the convolution-pooling network, we demonstrate the essential role of convolution in extracting viral variant features. Thus, our results indicate that the proposed convolution-based multi-class gene classifier is efficient for the variant classification of SARS-CoV-2.
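The pipeline described above (three convolution-pooling stages, a 2-layer MLP, then a logistic-regression output over the four variants) can be illustrated with a minimal forward-pass sketch in plain Python. Only that overall layout comes from the abstract. The one-hot encoding, filter counts, kernel width, pooling size, hidden-layer sizes, ReLU activations, and the random, untrained weights are assumptions made for illustration. All function names are hypothetical and this is not the authors' implementation.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a nucleotide string as rows of 4 values; other symbols (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(x, kernels, biases):
    """Valid 1-D convolution along the sequence, followed by ReLU.
    x: [length][in_channels]; kernels: [out_channels][width][in_channels]."""
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for offset in range(width):
                for c, weight in enumerate(kernel[offset]):
                    total += weight * x[start + offset][c]
            row.append(max(0.0, total))
        out.append(row)
    return out

def max_pool1d(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    channels = len(x[0])
    return [[max(x[i + j][c] for j in range(size)) for c in range(channels)]
            for i in range(0, len(x) - size + 1, size)]

def dense(v, weights, biases, relu=True):
    """Fully connected layer; ReLU is applied unless relu=False."""
    out = [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def softmax(z):
    """Multinomial logistic output: turn scores into class probabilities."""
    m = max(z)
    exps = [math.exp(value - m) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def build_model(rng, seq_len, n_classes=4, filters=(8, 8, 8), width=3, pool=2, hidden=(16, 8)):
    """Create random (untrained) weights; all sizes here are assumed, not taken from the paper."""
    conv_layers, in_ch, length = [], 4, seq_len
    for out_ch in filters:  # three convolution-pooling stages
        kernels = [rand_matrix(rng, width, in_ch) for _ in range(out_ch)]
        conv_layers.append((kernels, [0.0] * out_ch))
        length = (length - width + 1) // pool
        in_ch = out_ch
    sizes = [length * in_ch, *hidden, n_classes]  # 2 hidden layers + output layer
    dense_layers = [(rand_matrix(rng, n_out, n_in), [0.0] * n_out)
                    for n_in, n_out in zip(sizes, sizes[1:])]
    return conv_layers, dense_layers, pool

def predict_proba(model, seq):
    """Forward pass: conv-pool stages, flatten, MLP, softmax over variant classes."""
    conv_layers, dense_layers, pool = model
    x = one_hot(seq)
    for kernels, biases in conv_layers:
        x = max_pool1d(conv1d_relu(x, kernels, biases), pool)
    v = [value for row in x for value in row]  # flatten positions x channels
    for weights, biases in dense_layers[:-1]:
        v = dense(v, weights, biases)
    weights, biases = dense_layers[-1]
    return softmax(dense(v, weights, biases, relu=False))

if __name__ == "__main__":
    rng = random.Random(0)
    model = build_model(rng, seq_len=40)
    print(predict_proba(model, "ACGT" * 10))  # four probabilities summing to 1
```

The sketch uses a 40-base toy sequence so that it runs instantly. A real genome would be roughly 30,000 bases, and the weights would be learned by training rather than drawn at random.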