Optimizing predictive models to prioritize viral discovery in zoonotic reservoirs
Abstract
Despite global investment in One Health disease surveillance, it remains difficult, and often very costly, to identify and monitor the wildlife reservoirs of novel zoonotic viruses. Statistical models can be used to guide sampling prioritization, but predictions from any given model may be highly uncertain; moreover, systematic model validation is rare, and the drivers of model performance are consequently under-documented. Here, we use bat hosts of betacoronaviruses as a case study for the data-driven process of comparing and validating predictive models of likely reservoir hosts. In the first quarter of 2020, we generated an ensemble of eight statistical models that predict host-virus associations and developed priority sampling recommendations for potential bat reservoirs and potential bridge hosts for SARS-CoV-2. Over more than a year, we tracked the discovery of 40 new bat hosts of betacoronaviruses, validated initial predictions, and dynamically updated our analytic pipeline. We find that ecological trait-based models perform extremely well at predicting these novel hosts, whereas network methods consistently perform roughly as well as or worse than expected at random. These findings illustrate the importance of ensembling as a buffer against variation in model quality and highlight the value of including host ecology in predictive models. Our revised models show improved performance and predict over 400 bat species globally that could be undetected hosts of betacoronaviruses. Although 20 species of horseshoe bats (Rhinolophus spp.) are known to be the primary reservoir of SARS-like viruses, we find that at least three-fourths of plausible betacoronavirus reservoirs in this bat genus might still be undetected.
Our study is the first to demonstrate through systematic validation that machine learning models can help optimize wildlife sampling for undiscovered viruses and illustrates how such approaches are best implemented through a dynamic process of prediction, data collection, validation, and updating.