Using Automated-Machine Learning to Predict COVID-19 Patient Survival: Identify Influential Biomarkers
Abstract
Background
In a pandemic, clinicians must stratify patients and decide who receives limited medical resources. In this study, we used automated machine learning (autoML) to develop and compare multiple machine learning (ML) models that predict the probability of patient survival from COVID-19 infection, and we identified the best-performing model. In addition, we investigated which biomarkers are the most influential in generating an accurate model. We believe that such an ML model could be a useful tool for clinicians stratifying hospitalized SARS-CoV-2 patients.
Methods
Data were retrospectively collected from Clinical Looking Glass (CLG) on all patients who tested positive for COVID-19 by real-time RT-PCR of a nasopharyngeal specimen and were admitted to our institution between 3/1/2020 and 7/3/2020 (4376 patients). We collected 47 biomarkers from each patient within 36 hours before or after the index time (RT-PCR positivity) and tracked whether each patient survived for one month following this time. We used the autoML from H2O.ai, an open-source package for the R language. The autoML generated 20 ML models and ranked them by area under the precision-recall curve (AUCPR) on the test set. We selected the best model (model_var_47) and chose the threshold probability that maximized the F2 score to make a binary classifier: dead or alive. Subsequently, we ranked the relative importance of the variables that generated model_var_47 and chose the 10 most influential variables. Next, we reran the autoML with these 10 variables and likewise selected the model with the best AUCPR on the test set (model_var_10). Again, the threshold probability that maximized the F2 score for model_var_10 was chosen to make a binary classifier. We calculated and compared the sensitivity, specificity, and positive predictive value (PPV) of model_var_10 and model_var_47.
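The threshold-selection step described above can be illustrated with a minimal sketch. It is written in plain Python rather than with the H2O.ai R package used in the study, and the function names (`confusion_counts`, `f_beta`, `best_f2_threshold`, `summarize`) and the toy labels and probabilities are hypothetical, introduced only for illustration. Label 1 is assumed to denote the positive class (death).

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], probs: Sequence[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when predicting 1 for probs >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta = 2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_f2_threshold(y_true: Sequence[int],
                      probs: Sequence[float]) -> Tuple[float, float]:
    """Scan every observed probability as a candidate cut-off, keep the best F2."""
    best_t, best_score = 0.0, -1.0
    for t in sorted(set(probs)):
        tp, fp, _, fn = confusion_counts(y_true, probs, t)
        score = f_beta(tp, fp, fn)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


def summarize(y_true: Sequence[int], probs: Sequence[float],
              threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, and PPV of the binary classifier."""
    tp, fp, tn, fn = confusion_counts(y_true, probs, threshold)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy data (hypothetical, not from the study): 1 = died, 0 = survived.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_probs = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.6]

toy_threshold, toy_f2 = best_f2_threshold(toy_labels, toy_probs)
toy_metrics = summarize(toy_labels, toy_probs, toy_threshold)
```

Because F2 weights recall more heavily than precision, the selected threshold tends to favor sensitivity over PPV, which suits a triage setting where missing a high-risk patient is costlier than a false alarm.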
Results
The best model that autoML generated using all 47 variables was the stacked ensemble of all models (AUCPR = 0.836). The most influential variables were systolic and diastolic blood pressure, age, respiratory rate, pulse oximetry, blood urea nitrogen, lactate dehydrogenase, d-dimer, troponin, and glucose. When the autoML was retrained with these 10 most important variables, performance was not significantly affected (AUCPR = 0.828). For the binary classifiers, the sensitivity, specificity, and PPV of model_var_47 were 83.6%, 87.7%, and 69.8%, respectively, while for model_var_10 they were 90.9%, 71.1%, and 51.8%, respectively.
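As a rough consistency check on the reported figures, sensitivity, specificity, and PPV are linked through the outcome prevalence by Bayes' rule, so any two models evaluated on the same set should imply the same prevalence. The sketch below assumes the three metrics for each model were computed on the same evaluation set; the function name `implied_prevalence` is hypothetical, and the resulting prevalence is a back-of-the-envelope inference, not a value reported in the study.

```python
def implied_prevalence(sensitivity: float, specificity: float, ppv: float) -> float:
    """Solve PPV = se*p / (se*p + (1 - sp)*(1 - p)) for the prevalence p.

    Rearranging in odds form gives:
        p / (1 - p) = [ppv / (1 - ppv)] * [(1 - sp) / se]
    """
    prevalence_odds = (ppv / (1.0 - ppv)) * ((1.0 - specificity) / sensitivity)
    return prevalence_odds / (1.0 + prevalence_odds)


# Figures reported in the Results for the two binary classifiers.
prev_var_47 = implied_prevalence(0.836, 0.877, 0.698)
prev_var_10 = implied_prevalence(0.909, 0.711, 0.518)
```

Both sets of figures imply a positive-class prevalence of roughly 25%, so the two classifiers' reported metrics are mutually consistent under the stated assumption.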
Conclusions
By using autoML, we developed high-performing models that predict patient mortality from COVID-19 infection. In addition, we identified the most important biomarkers correlated with mortality. This ML model can be used as a decision-support tool for medical practitioners to efficiently triage COVID-19 infected patients. From our literature review, this is the largest COVID-19 patient cohort used to train ML models and the first study to utilize autoML. The COVID-19 survival calculator based on this study can be found at https://www.tsubomitech.com/.