Comment on Walsh et al. “The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index” J Clin Epidemiol 67: 622-628, 2014

Despite common misconceptions, a p-value does not tell us anything about truth (i.e., whether an observed finding in a sample is representative of the underlying population of interest); it only describes the probability that a difference at least as large as the observed one would have occurred by chance alone if in reality there were no difference. A p-value can no longer be interpreted at face value if the data being analyzed do not represent random samples, for instance because of unconscious bias in sampling, study execution, data analysis, or reporting. Even worse, it cannot be taken at face value if the investigators have actively violated the randomness principle by p-hacking (Motulsky, 2014). Even if none of this has happened, a relevant percentage of statistically significant findings may be false – a phenomenon largely driven by the a priori probability of an observation (Ioannidis, 2005). On top of these problems comes the issue of small sample sizes leading to fickle p-values, i.e., p-values that fluctuate widely when the same experiment is repeated (Halsey et al., 2015).
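As a minimal illustration of how fickle p-values become with small samples, the following simulation sketch (not taken from any of the cited papers; the effect size, group size, and choice of a two-sample t-test are assumptions made purely for illustration) repeatedly draws two small groups from populations with a genuine underlying difference and records the resulting p-values:

```python
# Illustrative sketch (hypothetical numbers, not from the cited papers):
# repeated small-sample experiments drawn from the same two populations
# show how widely the p-value fluctuates from sample to sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_difference = 0.5   # assumed standardized effect size
n_per_group = 15        # assumed small group size
n_experiments = 1000

p_values = []
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_difference, 1.0, n_per_group)
    p_values.append(stats.ttest_ind(control, treated).pvalue)

p_values = np.array(p_values)
print(f"median p = {np.median(p_values):.3f}, "
      f"IQR = {np.percentile(p_values, 25):.3f}-{np.percentile(p_values, 75):.3f}, "
      f"share of p < 0.05 = {np.mean(p_values < 0.05):.2f}")
```

Even though every repetition samples from the same populations with the same true difference, the p-value jumps erratically from one run of the "experiment" to the next.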

Canadian investigators have added an additional spin to this (Walsh et al., 2014): they performed modelling experiments based on 399 randomized controlled trials in which they added events to the control group in a stepwise fashion until the p-value exceeded 0.05; the number of added events required to do so was termed the Fragility Index. Interestingly, the Fragility Index was smaller than the number of patients lost to follow-up in 53% of the trials analyzed. These findings show that the statistical significance of results from randomized controlled trials often hinges on a small number of events. This underscores the general recommendation to focus reporting on effect sizes with confidence intervals rather than on p-values (Michel et al., 2020).
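A minimal sketch of such a Fragility Index calculation, assuming a two-arm trial summarized as a 2×2 table and Fisher's exact test as the significance test (the event counts used in the example are hypothetical, not taken from Walsh et al.):

```python
# Illustrative sketch of a Fragility Index calculation in the spirit of
# Walsh et al. (2014): convert non-events to events in one arm, one patient
# at a time, until Fisher's exact test is no longer significant.
from scipy.stats import fisher_exact

def fragility_index(events_a, total_a, events_b, total_b, alpha=0.05):
    """Smallest number of non-event-to-event conversions in arm A that
    pushes the two-sided Fisher's exact p-value to alpha or above."""
    index = 0
    while events_a <= total_a:
        table = [[events_a, total_a - events_a],
                 [events_b, total_b - events_b]]
        _, p = fisher_exact(table)
        if p >= alpha:
            return index
        events_a += 1   # one more patient counted as having the event
        index += 1
    return index

# Hypothetical trial: 10/100 events in one arm vs. 25/100 in the other,
# a clearly "significant" difference at baseline.
print(fragility_index(events_a=10, total_a=100, events_b=25, total_b=100))
```

In such an example, switching the outcome of only a handful of patients is enough to move the result across the 0.05 threshold, which is exactly the fragility the index is meant to expose.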