Variability in the analysis of a single neuroimaging dataset by many teams

To test the reproducibility and robustness of results obtained in the neuroimaging field, 70 independent teams of neuroimaging experts from across the globe were asked to analyze and interpret the same functional magnetic resonance imaging dataset.
The authors found that no two teams chose identical workflows to analyze the data – a consequence of the many degrees of freedom in selecting the best-suited analytical approach.
This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset. These findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging data. The results emphasize the importance of validating and sharing complex analysis workflows, and the need for experts in the field to come together and agree on minimum reporting standards.
The most straightforward way to combat such (unintentional) degrees of freedom is to have detailed data processing and analysis protocols as part of the study plans. As this example illustrates, such protocols need to be checked by independent scientists to make sure that they are complete and unequivocal. While the imaging field is complex and data analysis cannot be described in one sentence, the need to have sufficiently detailed study plans is also a message to pre-registration platforms that should not impose any restrictions on the amount of information being pre-registered.


Commentary March 2020

Comment on Walsh et al. “The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index” J Clin Epidemiol 67: 622-628, 2014

Despite common misconceptions, a p-value does not tell us anything about truth (i.e. that an observed finding in a sample is representative of the underlying population of interest); it only describes the probability that a difference at least as large as the one observed would occur by chance alone if in reality there were no difference. A p-value can no longer be interpreted at face value if the data being analyzed do not represent random samples, for instance because of unconscious bias in sampling, study execution, data analysis or reporting. Even worse, it can no longer be interpreted at face value if the investigators have actively violated the randomness principle by p-hacking (Motulsky, 2014). Even if none of this has happened, a relevant percentage of statistically significant findings may be false – a phenomenon largely driven by the a priori probability of an observation (Ioannidis, 2005). On top of these problems comes the issue of small sample sizes leading to fickle p-values (Halsey et al., 2015).
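The point made by Ioannidis (2005) can be made concrete with a short calculation: the share of statistically significant findings that reflect a true effect (the positive predictive value) depends on the a priori probability that a tested hypothesis is true. A minimal sketch in Python – the function name and the default power of 80% are illustrative assumptions, not taken from the papers cited above:

```python
def positive_predictive_value(prior, alpha=0.05, power=0.8):
    """Fraction of statistically significant findings that reflect a true
    effect, given the prior probability that a tested hypothesis is true."""
    true_positives = power * prior            # true effects correctly detected
    false_positives = alpha * (1.0 - prior)   # null effects wrongly declared significant
    return true_positives / (true_positives + false_positives)
```

For example, with a prior of 10% – not unusual for exploratory research – only 64% of "significant" findings are true even at 80% power and α = 0.05; the remaining 36% are false positives.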

Canadian investigators have added an additional twist to this (Walsh et al., 2014): they performed modelling experiments based on 399 randomized controlled trials in which they converted non-events to events in a step-wise fashion until the p-value exceeded 0.05; the number of added events required is called the Fragility Index. Interestingly, the Fragility Index was smaller than the number of patients lost to follow-up in 53% of the trials analyzed. These findings show that the statistical significance of results from randomized clinical trials often hinges on a small number of events. This supports the general recommendation to focus reporting on effect sizes with confidence intervals rather than on p-values (Michel et al., 2020).

To “p” or not to “p”

“A biologist and a statistician are in prison and, before being executed, are asked for their last wish. The statistician asks to give a course on statistics. The biologist asks to be executed first …”
This old joke reflects the difficult relationship between statistics and many biologists. Indeed, at least from the outside, in the eyes of professional biostatisticians, biologists are reluctant to learn and keep following wrong practices. Is that really so?
Our readers have most likely seen the recent proposal to abandon the use of p-values. Published in a top journal and signed by hundreds of experts worldwide, this proposal aimed to trigger a change.
Yet, before we had a chance to digest the message and start thinking about how to survive in a “p-free” world, another group of respected scientists challenged the proposal.
This interesting discussion is still ongoing, and we, the non-statisticians, hope to have clear guidance one day. How clear should that guidance be? Hopefully, as clear as what Harvey Motulsky published several years ago in pharmacology journals – which remains the main text on data analysis and statistics that we recommend to our peers.

Why we need to report more than ‘Data were Analyzed by t-tests or ANOVA’

In this article, Weissgerber et al. present findings suggesting that scientific publications often lack sufficient information about the statistical methods used – information that independent scientists require to replicate the published results.

The authors evaluated the quality of reporting of statistical tests (such as t-tests and ANOVA) in 328 research papers published in physiology journals in June 2017. They found that 84.5% of the papers used ANOVA, t-tests or both. Although there are different types of ANOVA, 95% of the articles that used ANOVA did not indicate which type was performed. Likewise, many papers did not specify which type of t-test was used. As a consequence, the lack of transparent statistical reporting does not allow others to judge whether the most appropriate test was selected, or to verify the reported results. The authors conclude that “the findings of the present study highlight the need for investigators, journal editors and reviewers to work together to improve the quality of statistical reporting in submitted manuscripts”.


Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results

In this article, 29 independent research teams used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Analytic approaches varied widely across the teams: 20 teams (69%) found a statistically significant positive effect, whereas 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. These findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions. Crowdsourcing data analysis, a strategy in which numerous research teams are recruited to simultaneously investigate the same research question, makes transparent how defensible, yet subjective, analytic choices influence research results.


Simple changes of individual studies can improve the reproducibility of the biomedical scientific process as a whole

Today, the probability of successfully publishing negative/null results in biomedical research is lower than that of positive results. Negative results often remain in lab books or drawers, or go unpublished because they are rejected by scientific journals. The published, seemingly conclusive positive results then lead to further studies that build on the supposedly proven effect. If, in contrast, all studies complying with good scientific practice were published irrespective of their results, a false result could be disproven more quickly.
The mathematical model presented in this paper by Steinfath et al. provides evidence that higher-powered experiments can save resources in the overall research process without generating excess false positives: a sufficiently high number of test animals for a single experiment increases the likelihood of achieving correct and reproducible results at the first attempt. In the long run, unnecessary follow-up tests with animals based on false assumptions can be avoided this way. Hence, the use of more test animals in a single experiment can reduce the total number of animals used – and speed up the development of new therapies.
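The core argument – that a better-powered single experiment is more likely to give the correct answer at the first attempt – can be illustrated with a small simulation. This is a toy two-sample z-test with known unit variance, not the model of Steinfath et al.; the function and parameter names are made up for this sketch:

```python
import random
from math import sqrt

def detection_rate(n_per_group, effect=0.5, z_crit=1.96, reps=2000, seed=1):
    """Monte Carlo estimate of the chance that a two-sided, two-sample
    z-test (known unit variance) detects a true standardized effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(effect, 1.0) for _ in range(n_per_group)) / n_per_group
        z = (mean_b - mean_a) / sqrt(2.0 / n_per_group)
        hits += abs(z) > z_crit  # two-sided test at the 5% level
    return hits / reps
```

With a true standardized effect of 0.5, groups of 16 detect it only about 30% of the time, whereas groups of 64 succeed roughly 80% of the time – so the smaller design will, on average, need several repeat experiments (and more animals in total) before it reaches the correct conclusion.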