Evaluating the applicability of replication success metrics in animal-to-human translation: A simulation study

Carolyne Jie Huang
Samuel Pawel
Kimberley Elaine Wever
Benjamin Victor Ineichen
Rachel Heyard

4 evaluations Published on Nov 9, 2025

Abstract

Translation failure, in which promising animal study results cannot be reproduced in human trials, is a challenge in biomedical research. Metrics for replication success are widely used to evaluate reproducibility, i.e., the extent to which the results of a study agree with those of replication studies. The relevance of these metrics in assessing animal-to-human translation success (or failure) is unclear. We conducted a simulation study to examine whether these metrics can quantify translation success and how their performance varies under different conditions. Using parameters from a meta-analysis on prenatal amino acid supplementation and maternal blood pressure, we simulated animal and human studies under 648 scenarios, varying effect sizes, heterogeneity, animal sample sizes and the number of pooled animal studies. Nine metrics were assessed, namely the two-trials rule, meta-analysis, the replication Bayes factor, unweighted and weighted Edgington’s methods, the golden sceptical p-value and three versions of the controlled sceptical p-value. Most metrics, except meta-analysis and the replication Bayes factor, controlled false positive rates under no heterogeneity, but became liberal as heterogeneity increased, particularly between human studies. Translation power (i.e., the probability of true positive translation success) was constrained by the weaker evidence of the two findings; e.g., small sample sizes in the animal studies resulted in lower translation power. The metric based on meta-analysis frequently indicated success when either of the species provided strong evidence, while sceptical p-values were more conservative. The sceptical p-value that controls the overall type I error and the weighted version of Edgington’s method performed relatively consistently across scenarios. No metric was uniformly optimal. Metrics developed for replication studies can inform assessments of translation, but their utility depends on the underlying evidence and assumptions. Using multiple metrics in combination, with attention to their strengths and limitations, is recommended for evaluating the translation of animal findings to human outcomes.
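
To make the idea of a translation metric more concrete, the following is a minimal Python sketch of two of the simpler metrics named in the abstract, together with a toy simulation of how heterogeneity affects the false positive rate. It is an illustration under stated assumptions, not the authors' simulation code: the function names, the one-sided significance level of 0.025, the normal model for effect estimates, and all parameter values are hypothetical choices made here, and only the unweighted two-study form of Edgington's method is shown.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def one_sided_p(estimate, se):
    """One-sided p-value for H0: effect <= 0, assuming a normal estimate."""
    return 1.0 - STD_NORMAL.cdf(estimate / se)


def two_trials_rule(p_animal, p_human, alpha=0.025):
    """Success only if BOTH one-sided p-values fall below alpha (assumed level)."""
    return p_animal < alpha and p_human < alpha


def edgington_p(p1, p2):
    """Unweighted Edgington combination of two p-values.

    The combined p-value is the probability that the sum of two independent
    uniform(0, 1) variables is at most p1 + p2.
    """
    s = p1 + p2
    return s * s / 2.0 if s <= 1.0 else 1.0 - (2.0 - s) ** 2 / 2.0


def simulate_success_rate(theta, tau, se_animal, se_human, n_sim=20000, seed=1):
    """Share of simulated animal/human pairs passing the two-trials rule.

    theta: average true effect (0 means no effect, so successes are false positives)
    tau:   heterogeneity (SD of study-specific true effects around theta)
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_sim):
        p_values = []
        for se in (se_animal, se_human):
            true_effect = rng.gauss(theta, tau)    # study-specific true effect
            estimate = rng.gauss(true_effect, se)  # observed effect estimate
            p_values.append(one_sided_p(estimate, se))
        successes += two_trials_rule(*p_values)
    return successes / n_sim


if __name__ == "__main__":
    # Null effect without and with heterogeneity (illustrative values only).
    print("no heterogeneity:  ", simulate_success_rate(0.0, 0.0, 1.0, 1.0))
    print("with heterogeneity:", simulate_success_rate(0.0, 1.0, 1.0, 1.0))
```

Under these assumptions, the rate without heterogeneity should sit near the nominal level of 0.025² = 0.000625, while adding heterogeneity inflates it, which mirrors the qualitative pattern reported in the abstract. The sketch does not reproduce the study's scenarios, its pooling of several animal studies, or the other seven metrics.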