Phylogenetic analysis of SARS-CoV-2 data is difficult

Benoit Morel
Pierre Barbera
Lucas Czech
Ben Bettisworth
Lukas Hübner
Sarah Lutteropp
Dora Serdari
Evangelia-Georgia Kostaki
Ioannis Mamais
Alexey M Kozlov
Pavlos Pavlidis
Dimitrios Paraskevis
Alexandros Stamatakis

Published on Aug 6, 2020


Abstract

Numerous studies covering aspects of SARS-CoV-2 data analysis are being published daily, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies, using as an example a data snapshot comprising all virus sequences available on gisaid.org on May 5, 2020. We find that it is difficult to infer a reliable phylogeny on these data because of the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence, either via the bat and pangolin outgroups or by applying novel computational methods to the ingroup phylogeny, does not appear to be possible. Finally, an automatic classification of the current sequences into sub-classes based on statistical criteria is also not possible, as the sequences are too closely related. We conclude that, although applying phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, the results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be interpreted with extreme caution.
