Variability in the analysis of a single neuroimaging dataset by many teams

To test the reproducibility and robustness of results obtained in the neuroimaging field, 70 independent teams of neuroimaging experts from across the globe were asked to analyze and interpret the same functional magnetic resonance imaging dataset.
The authors found that no two teams chose identical workflows to analyze the data, a consequence of the many degrees of freedom and the flexibility in choosing the best-suited analytical approach.
This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset. These findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows and the need for experts in the field to come together and discuss what the minimum reporting standards should be.
The most straightforward way to combat such (unintentional) degrees of freedom is to include detailed data processing and analysis protocols in the study plans. As this example illustrates, such protocols need to be checked by independent scientists to make sure that they are complete and unequivocal. The imaging field is complex and data analysis cannot be described in one sentence, so the need for sufficiently detailed study plans is also a message to pre-registration platforms, which should not impose any restrictions on the amount of information that can be pre-registered.


Commentary March 2020

Comment on Walsh et al. “The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index” J Clin Epidemiol 67: 622-628, 2014

Despite common misconceptions, a p-value does not tell us anything about truth (i.e. whether a finding observed in a sample is representative of the underlying population of interest); it only describes the probability that a difference at least as large as the one observed would be found by chance alone if in reality there is no difference. A p-value can no longer be interpreted at face value if the data being analyzed do not represent random samples, for instance because of unconscious bias in sampling, study execution, data analysis or reporting. Even worse, it can no longer be interpreted at face value if the investigators have actively violated the randomness principle by p-hacking (Motulsky, 2014). Even if none of this has happened, a relevant percentage of statistically significant findings may be false, a phenomenon largely driven by the a priori probability that the tested effect is real (Ioannidis, 2005). On top of these problems, small sample sizes lead to fickle p-values (Halsey et al., 2015).
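
The role of the a priori probability can be made concrete with a little arithmetic. The following Python sketch rests on simplifying assumptions that are not taken from the papers cited above: no bias, a single fixed power and significance level, and a hypothetical function name chosen for illustration.

```python
def false_positive_share(prior, power=0.8, alpha=0.05):
    """Share of statistically significant findings that are false positives.

    prior: assumed a priori probability that a tested effect is real
    power: assumed probability of detecting a real effect
    alpha: significance threshold (false positive rate under the null)
    """
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.1, 0.01):
        share = false_positive_share(prior)
        print(f"prior={prior:.2f}: {share:.0%} of significant findings are false")
```

Under these assumptions, a prior of 10% means that about 36% of all significant findings would be false, even though every single test was run correctly at a threshold of 0.05.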

Canadian investigators have added a further twist to this (Walsh et al., 2014): using 399 randomized controlled trials, they performed modelling experiments in which they converted non-events to events, one patient at a time, in the group with the smaller number of events until the p-value reached or exceeded 0.05. They called the number of patients converted in this way the Fragility Index. Interestingly, the Fragility Index was smaller than the number of patients lost to follow-up in 53% of the trials analyzed. These findings show that the statistical significance of results from randomized clinical trials often hinges on a small number of events. This underlines the general recommendation to focus reporting on effect sizes with confidence intervals and not on p-values (Michel et al., 2020).
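
The step-wise procedure behind the Fragility Index can be sketched in a few lines of Python. This is a minimal illustration and not the code used by Walsh et al.: it assumes a two-sided Fisher's exact test, it assumes that non-events are converted to events in the arm with fewer events while arm sizes stay constant, and all function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as likely as the observed one
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_observed * (1 + 1e-7))


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of non-event -> event conversions needed to reach p >= alpha.

    Returns 0 if the result is not significant to begin with.
    """
    # Always modify the arm with fewer events (assumption, see lead-in)
    if events_a > events_b:
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    while fisher_exact_two_sided(events_a, n_a - events_a,
                                 events_b, n_b - events_b) < alpha:
        if events_a == n_a:
            raise ValueError("no non-events left to convert in this arm")
        events_a += 1  # one patient switches from non-event to event
        flips += 1
    return flips


if __name__ == "__main__":
    # Hypothetical trial: 0/20 events in one arm, 8/20 in the other
    print(fragility_index(0, 20, 8, 20))
```

For the hypothetical trial in the example (0 of 20 versus 8 of 20 events), the index is 2: the p-value rises from about 0.003 to about 0.02 after one conversion and to about 0.065 after the second. Two patients with a different outcome would be enough to lose statistical significance.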

Additional reads in March 2020

Preregistration of exploratory research: Learning from the golden age of discovery

What Research Institutions Can Do to Foster Research Integrity

When Science Needs Self-Correcting

Journal transparency index will be ‘alternative’ to impact scores

Universities ‘should bolster their research integrity policies’

Wozu Tierversuche? Medikamente gibt’s doch in der Apotheke (in German; roughly “Why animal experiments? You can get medicines at the pharmacy, after all”)

Scientists offered €1,000 to publish null results

Irreproducibility is not a sign of failure, but an inspiration for fresh ideas

How bad statistical practices drive away the good ones.

How a ‘no raw data, no science’ outlook can resolve the reproducibility crisis in science

The Data Must Be Accessible to All

In praise of replication studies and null results

Improving the trustworthiness, usefulness, and ethics of biomedical research through an innovative and comprehensive institutional initiative

Find a home for every imaging data set

A controlled trial for reproducibility

Challenges to the Reproducibility of Machine Learning Models in Health Care

The BJP expects authors to share data

Getting To The Root Of Poor ELISA Data Reproducibility

Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature

Preprint usage is growing rapidly in the life sciences; however, questions remain on the relative quality of preprints when compared to published articles. An objective dimension of quality that is readily measurable is completeness of reporting, as transparency can improve the reader’s ability to independently interpret data and reproduce findings. In this study, the authors compared random samples of preprints posted on bioRxiv and articles published in PubMed-indexed journals in 2016 using a quality of reporting questionnaire. They found that peer-reviewed articles had, on average, a higher quality of reporting than preprints, although the difference was small. On average, peer reviewers caught just one deficiency per manuscript across about 25 categories of reporting.
Although the sample size was small and only 56 bioRxiv preprints were analyzed, these results indicate that the quality of reporting in life science preprints is within a similar range as that of peer-reviewed articles, supporting the idea that preprints should be considered valid scientific contributions.


Research Culture: Framework for advancing rigorous research

There is a pressing need to increase the rigor of research in the life and biomedical sciences. To address this issue, the authors propose to establish communities of ‘rigor champions’, who campaign for reforms of the research culture that has led to shortcomings in rigor. These rigor champions would also assist in the development and adoption of a comprehensive educational platform that would teach the principles of rigorous science to researchers at all career stages.


Improving the trustworthiness, usefulness, and ethics of biomedical research through an innovative and comprehensive institutional initiative

The reproducibility crisis triggered worldwide initiatives to improve rigor, reproducibility, and transparency in biomedical research. There are many examples of scientists, journals, and funding agencies adopting responsible research practices. The QUEST (Quality-Ethics-Open Science-Translation) Center offers a unique opportunity to examine the role of institutions. The Berlin Institute of Health founded QUEST to increase the likelihood that research conducted at this large academic medical center would be trustworthy, useful for scientists and society, and ethical. QUEST researchers perform “science of science” studies to understand problems with standard practices and develop targeted solutions. The staff work with institutional leadership and local scientists to incentivize and support responsible practices in research, funding, and hiring. Some activities described in this paper focus on the institution, whereas others may benefit the national and international scientific community. The experiences, approaches, and recommendations of the QUEST Center will be informative for faculty leadership, administrators, and researchers interested in improving scientific practice.