Hierarchical machine learning predicts geographical origin of Salmonella within four minutes of sequencing


Abstract

Salmonella enterica serovar Enteritidis is one of the most frequent causes of salmonellosis globally and is commonly transmitted from animals to humans through the consumption of contaminated foodstuffs. Herein, we detail the development and application of a hierarchical machine learning model to rapidly identify and trace the geographical source of S. Enteritidis infections from whole genome sequencing data. A total of 2,313 S. Enteritidis genomes collected by the UKHSA between 2014 and 2019 were used to train a ‘local classifier per node’ hierarchical classifier to attribute isolates to 4 continents, 11 sub-regions and 38 countries (53 classes). The highest classification accuracy was achieved at the continental level, followed by the sub-regional and country levels (macro F1: 0.954, 0.718 and 0.661, respectively). Several countries commonly visited by UK travellers were predicted with high accuracy (hF1: >0.9). Longitudinal analysis and validation with publicly accessible international samples indicated that predictions were robust to prospective external datasets. The hierarchical machine learning framework provides granular geographical source prediction directly from sequencing reads in <4 minutes per sample, facilitating rapid outbreak resolution and real-time genomic epidemiology.
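
The hF1 score quoted above is a hierarchical F1. As a rough illustration only, the sketch below computes one common formulation, in which each label is expanded to itself plus its ancestors and precision and recall are taken over the overlap of those sets. The abstract does not say which formulation the authors used, so this may differ from theirs. The toy hierarchy, the `PARENT` mapping and both function names are hypothetical and are not taken from the article.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level continent).
# The article's real hierarchy has 4 continents, 11 sub-regions and 38 countries.
PARENT = {
    "Europe": None,
    "Asia": None,
    "Southern Europe": "Europe",
    "Eastern Europe": "Europe",
    "South-eastern Asia": "Asia",
    "Spain": "Southern Europe",
    "Italy": "Southern Europe",
    "Poland": "Eastern Europe",
    "Thailand": "South-eastern Asia",
}


def path_to_root(label):
    """Return the label together with all of its ancestors, as a set."""
    nodes = set()
    while label is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes


def hierarchical_f1(y_true, y_pred):
    """Hierarchical F1 over paired true and predicted labels.

    Precision and recall are computed on ancestor-augmented label sets,
    so a wrong country in the right sub-region still earns partial credit.
    """
    overlap = pred_total = true_total = 0
    for true_label, pred_label in zip(y_true, y_pred):
        true_nodes = path_to_root(true_label)
        pred_nodes = path_to_root(pred_label)
        overlap += len(true_nodes & pred_nodes)
        pred_total += len(pred_nodes)
        true_total += len(true_nodes)
    if overlap == 0:
        return 0.0
    h_precision = overlap / pred_total
    h_recall = overlap / true_total
    return 2 * h_precision * h_recall / (h_precision + h_recall)


if __name__ == "__main__":
    truth = ["Spain", "Spain", "Thailand"]
    predicted = ["Spain", "Italy", "Poland"]
    print(round(hierarchical_f1(truth, predicted), 3))
```

In this toy case the Spain/Italy confusion shares two of three nodes (sub-region and continent) while the Thailand/Poland confusion shares none, which is how a hierarchical score separates near misses from distant ones.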
