Reproducibility in systems biology modelling

A recent report by Tiwari et al. investigated the reproducibility rate in systems biology modelling by attempting to reproduce the mathematical representation of 455 kinetic models. The authors tried to

1.) reproduce the published model (step 1), 

2.) if that failed, adjust the reproduction attempt based on their own experience (step 2),

3.) if that failed again, contact the authors of the original study for clarification and support (step 3).

When attempting to reproduce the selected models based solely on the information provided in the primary literature (step 1), only 51% of the models could be reproduced, meaning that the remaining 49% required additional effort (i.e. steps 2 and 3). Moreover, 37% of the models could not be reproduced by Tiwari and colleagues at all, even after adjusting the model system or asking the authors of the original study for support.

Notably, over 70% of the corresponding authors did not respond when contacted by Tiwari et al., and in half of the cases where authors did respond, the model still could not be reproduced even with their support.

This low reproducibility rate, in combination with the very low response rate of the original authors, makes rigorous reporting standards in the original study – and their verification by peer reviewers – absolutely necessary.

To improve the situation for systems biology, Tiwari and colleagues provided specific reporting guidelines in the form of an eight-point checklist to increase the reproducibility of systems biology modelling.

Risk‐of‐bias VISualization (robvis): An R package and Shiny web app for visualizing risk‐of‐bias assessments

There is currently no generic tool for producing figures to display and explore the risk-of-bias assessments that routinely take place as part of systematic reviews. The authors therefore present a new tool, robvis (Risk-Of-Bias VISualization), available as an R package and web app, which facilitates rapid production of publication-quality risk-of-bias assessment figures. A timeline of the tool’s development and its key functionality are also presented.
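
For readers who want to try it out, a minimal usage sketch is shown below. It is based on the package’s documented core plotting functions and the example dataset (data_rob2) that ships with robvis; argument defaults may differ slightly between package versions.

```r
# Minimal robvis sketch (assumes the example dataset data_rob2 bundled with the package)
# install.packages("robvis")
library(robvis)

# Weighted bar plot summarizing risk-of-bias judgements across studies
rob_summary(data = data_rob2, tool = "ROB2")

# "Traffic light" plot of domain-level judgements for each individual study
rob_traffic_light(data = data_rob2, tool = "ROB2")
```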

LINK

Best-dose analysis – A confirmatory research case

It is often observed and discussed that there are substantial inter-individual differences that can overshadow the effects of otherwise effective treatments. These differences can take the form of a varying pre-treatment “baseline”, which often motivates researchers to express post-treatment results not in absolute terms but relative to that baseline.
 
A less commonly discussed scenario involves differences in individual sensitivity to the experimental manipulation (e.g. a drug). In other words, inspection of the data may reveal that, while there is no overall treatment effect (left panel below), for each subject there is at least one drug dose that appears to be “effective” – i.e. the “best dose” for that subject. If one selectively plots only the best-dose data, null results suddenly turn into something much more appealing (right panel below).

Figure 1: Schematic, exaggerated representation of the best-dose analysis. A hypothetical experiment with 8 subjects tested under each of four treatment conditions (vehicle and three doses of a drug). For each subject, each possible response value (2, 4, 6 or 8) occurs once, under a different treatment condition. The conventional representation (left panel) shows that there are no differences between treatment conditions. For the best-dose analysis, one selects the highest response value across the drug doses for each subject (highlighted area on the left panel) and re-plots these values against the vehicle control values, as shown in the right panel.

There are a number of publications featuring this type of analysis. Is this a legitimate practice? As long as the analysis is viewed as exploratory and is followed by a confirmatory experiment in which the “best doses” are prespecified and their effects are confirmed, it can be a legitimate analytic technique. Without such a complementary confirmatory effort, best-dose analysis can be very misleading and may amount to p-hacking.
 
In the following example, we generated 400 normally distributed random numbers and randomly allocated them across 4 groups with n = 100 per group. We then repeatedly drew 8, 16, 32 or 64 values from each group and organized them in a table so that each row represented one hypothetical subject exposed to a vehicle control condition and to 3 doses of a drug.
 
Next, we ran an ordinary one-factor analysis of variance and saved the p-value. After that, the “best dose” (i.e., the highest value) was identified for each row/subject, and a conventional t-test was run comparing the control with the best dose (with the p-value saved). This was repeated for 500 iterations of random sampling and analysis.
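
A simplified R sketch of this procedure is shown below (the full script is linked further down). Since the description above does not specify whether the control vs best-dose comparison was paired or unpaired, the sketch assumes an unpaired t-test.

```r
set.seed(1)

pool <- matrix(rnorm(400), ncol = 4)   # 400 normally distributed numbers, 4 groups of n = 100

n_iter  <- 500
n_subj  <- 16                          # also run with 8, 32 and 64
p_anova <- numeric(n_iter)
p_best  <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  # Draw n_subj values from each group; each row = one hypothetical subject,
  # columns = vehicle control and three drug doses (no true effect anywhere)
  dat <- apply(pool, 2, sample, size = n_subj)
  colnames(dat) <- c("vehicle", "dose1", "dose2", "dose3")

  # All-dose analysis: ordinary one-factor ANOVA across the four conditions
  long <- data.frame(response  = as.vector(dat),
                     condition = factor(rep(colnames(dat), each = n_subj)))
  p_anova[i] <- summary(aov(response ~ condition, data = long))[[1]][["Pr(>F)"]][1]

  # Best-dose analysis: highest value across the three doses for each subject,
  # compared with vehicle (unpaired t-test assumed; see note above)
  best      <- apply(dat[, c("dose1", "dose2", "dose3")], 1, max)
  p_best[i] <- t.test(dat[, "vehicle"], best)$p.value
}

# How often does the best-dose t-test claim "p < 0.05" although the
# all-dose ANOVA shows no treatment effect?
mean(p_best < 0.05 & p_anova >= 0.05)
```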
 
As the plot below illustrates, the t-test for the best-dose analysis often yields “p<0.05” for datasets where the all-dose ANOVA does not reveal any main effect of treatment.
 
This pattern becomes more pronounced with increasing sample size – already at n=16, the best-dose analysis finds “effects” in nearly half of the cases. By n=64, it is rare that the best-dose analysis fails to find an “effect” in these random numbers.
 
Is all this too obvious?  Hopefully, it is, at least for our readers.  This example, nevertheless, may be useful for those who consider applying a best-dose analysis and for those who need to illustrate the appropriate and important role of confirmatory research.

Figure 2: p-values by best-dose analysis vs the corresponding all-dose analysis for each of four selected sample sizes.  The shaded zones show where the former analysis appears “statistically significant” but the latter does not.

By Anton Bespalov (PAASP) and David McArthur (UCLA)
 
The R script can be found HERE.
 
PS After the above analysis was designed and completed, we became aware of a paper by Paul Soto and colleagues that discussed the same issue of the best-dose analysis using different tools.

Federal judge invalidates patents – but on the basis of a common statistical mistake

In a recent case before the US District Court for the District of Nevada, the judge ruled six method-of-use patents invalid because the claims were obvious based on previously published data (“prior art”). Curfman et al. (2020) argue that this decision was incorrect because the judge misinterpreted the published statistical analysis.

Mori et al. (2000) measured LDL cholesterol before and after giving participants (17-19 per group) docosahexaenoic acid (DHA) or eicosapentaenoic acid (EPA). Compared to vehicle (olive oil), DHA increased LDL cholesterol by 8.0% (p = 0.019), while EPA increased LDL by 3.5% (p > 0.05). The judge held that these findings clearly demonstrated that DHA and EPA have different effects – “obvious” (a term with a specific legal meaning in patent cases) because the effect of DHA was statistically significant while the effect of EPA was not.

Curfman et al. disagree with this conclusion because the study did not compare EPA to DHA directly, but only each of them to vehicle. They performed a t-test comparing the effects (pre- vs post-treatment change) of EPA and DHA, showed that the resulting p-value must be high, and concluded that the null hypothesis that the two compounds have equal effects on LDL cannot be rejected. They did not report the p-value, but we calculate it to be 0.33 (without correcting for multiple comparisons). This example demonstrates that the difference between ‘significant’ and ‘non-significant’ may itself not be statistically significant (Gelman and Stern, 2006).
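
To illustrate how such a direct comparison can be made from summary statistics, the sketch below computes a Welch two-sample t-test on the change scores. The group means and sample sizes are taken from the numbers quoted above, but the standard deviations are hypothetical placeholders (the SDs from Mori et al. are not reproduced here), so the resulting p-value is illustrative only and should not be expected to match the 0.33 reported above.

```r
# Welch two-sample t-test from summary statistics of the pre/post change scores
welch_t_from_summary <- function(m1, sd1, n1, m2, sd2, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  t  <- (m1 - m2) / se
  df <- (sd1^2 / n1 + sd2^2 / n2)^2 /
        ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))
  p  <- 2 * pt(-abs(t), df)
  c(t = t, df = df, p = p)
}

# Means (% change) and group sizes from the text; SDs are HYPOTHETICAL placeholders
welch_t_from_summary(m1 = 8.0, sd1 = 10, n1 = 17,
                     m2 = 3.5, sd2 = 10, n2 = 19)
```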

Curfman et al. argue that conclusions based on flawed statistical analysis should not be accepted as prior art. The case has been accepted by the US Court of Appeals for the Federal Circuit and could have a major effect on the future of the biotechnology industry. In a broader sense, this case raises the question of whether flawed data or inaccurate science can be used to support patent decisions, and it further highlights the need for sufficient training of scientists to enable a sound understanding of statistical analysis.

Our study was conducted according to…

Over the past several years, more and more journals have revised their guides for authors and included specific instructions on information to be provided in the manuscripts – from animal welfare statements to various aspects of data analysis and study design.

A paper by Horowitz et al. is a good illustration of the change triggered by these new journal policies and the emerging challenges.

On the positive side, it is encouraging to read that a power analysis was applied to determine the sample size, that all data generated or analyzed in the study were included in the paper, that randomization was applied, and that blinding was used and maintained throughout the histological, biochemical and behavioral assessments, with treatment groups un-blinded only at the end of each experiment for statistical analysis.

Yet, we have previously expressed concerns (HERE and HERE) that, unless specific actions are taken and authors are appropriately trained and informed, changes in journal policies may not always achieve their objectives. More specifically, we are worried about normative responses, especially regarding topics that are not sufficiently clarified by the journals’ guides.

For example, when reading this particular paper, we could not understand why, in an experiment involving one control and one treatment group, the sample sizes are markedly unequal – 12 vs 19. We therefore turned to the methods description in an attempt to understand how the randomization process was conducted, but found no details. We also looked for more information on how the sample sizes were determined, but found hardly anything useful either (apart from the generic alpha = 0.05 and beta = 0.8 levels). We have also contacted the authors.
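
As a side note, alpha and (presumably) target power alone do not determine a sample size; an assumed effect size is needed as well, and that is precisely the detail that was missing. The hypothetical sketch below (using the pwr package, not anything taken from the Horowitz et al. paper) shows how strongly the required group size depends on the assumed effect size.

```r
# Hypothetical illustration: sample size per group for a two-sample t-test
# at alpha = 0.05 and power = 0.80, for several assumed effect sizes
# install.packages("pwr") if needed
library(pwr)

for (d in c(0.5, 0.8, 1.2)) {   # assumed standardized effect sizes (Cohen's d)
  n <- pwr.t.test(d = d, sig.level = 0.05, power = 0.8,
                  type = "two.sample")$n
  cat(sprintf("d = %.1f -> about %d subjects per group\n",
              d, as.integer(ceiling(n))))
}
```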

We do not blame the authors for not providing all this information in the paper. However, we increasingly believe that it is actually the journals that encourage this “tick-the-box” behavior, in which policy requirements are nominally addressed but without much value for the reader.

Additional reads in August 2020

Five better ways to assess science

$48 Billion Is Lost to Avoidable Experiment Expenditure Every Year

Opinion: Surgisphere Fiasco Highlights Need for Proper QA

Assuring research integrity during a pandemic

Reproducibility in Cancer Biology: Pseudogenes, RNAs and new reproducibility norms

The Problems With Science Journals Trying to Be Gatekeepers – and Some Solutions

Replications do not fail

Sex matters in neuroscience and neuropsychopharmacology

Journals endorse new checklist to clean up sloppy animal research

Paying it forward – publishing your research reproducibly

The best time to argue about what a replication means? Before you do it

How scientists can stop fooling themselves over statistics