Estimation vs Hypothesis-Testing

As some of our readers may know, the PAASP team was involved in the “Negative results Award” project that took place in 2018. One important lesson for us was that any discussion about negative (or null) results should start by deciding whether the results really are negative and how confident we are in that conclusion. For example, p-values above 0.05 do not necessarily indicate that the results are negative. Yet we have not found any reading or teaching material that easily explains to non-statisticians, and in particular to young scientists in biomedical research fields, the need to move away from a binary decision process. If any of our readers are aware of such tools, please share them with us and we will post the information in our Resource Center.
Therefore, we were pleased to see a recent opinion paper by Calin-Jageman and Cumming that provides basic information with examples that could serve as (self-)learning material. One particular example focussed on a study that found that “caffeine administration enhances memory consolidation in humans” (Borota et al., 2014). The same results were re-analyzed and visualized with a different (quantitative) question in mind: To what extent does caffeine improve memory? To answer this question, the difference between the group means was estimated and the uncertainty in this estimate due to expected sampling error was quantified. The results obtained can be summarized as “caffeine is estimated to improve memory relative to the placebo group by 31% with a 95% confidence interval of (0.2%, 62%)”.
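To make the estimation idea concrete, here is a minimal R sketch that reports a difference between two group means together with its 95% confidence interval. The data are simulated, and the group names, sample sizes, means and standard deviations are our assumptions for illustration only; they are not the values from Borota et al. (2014), and the sketch reports a raw difference rather than the percentage improvement quoted above.

```r
# Hypothetical memory scores, simulated for illustration only
# (NOT the data from Borota et al., 2014)
set.seed(1)
placebo  <- rnorm(20, mean = 10, sd = 4)
caffeine <- rnorm(20, mean = 13, sd = 4)

# Welch two-sample t-test; conf.int is the 95% confidence interval
# for mean(caffeine) - mean(placebo)
fit <- t.test(caffeine, placebo, conf.level = 0.95)

diff_means <- mean(caffeine) - mean(placebo)
ci <- fit$conf.int

# Report the estimate and its uncertainty instead of only "p < 0.05"
cat(sprintf("Estimated difference: %.2f, 95%% CI (%.2f, %.2f)\n",
            diff_means, ci[1], ci[2]))
```

The p-value is still available as fit$p.value, so this kind of reporting adds to the test result rather than replacing it.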
Why do we believe that this is a good example that can be used for educational purposes?
First, the confidence interval suggests considerable uncertainty about generalizing from the sample to the world at large. Does this make the results less valuable? Certainly not.
Second, the estimation approach described in the eNeuro opinion paper does not call for a revolution that is difficult to follow (e.g. stopping the use of p-values). Instead, it suggests that we should aim for more complete and informative reporting of the results.
And last but not least, prior to reading the article by Calin-Jageman and Cumming, most members of our team had not heard about the eNeuro journal. Previously, learning about a new journal would certainly have meant also inquiring about its “impact factor”. There is no need to know the impact factor of eNeuro – if the editors and reviewers of this journal manage to introduce and maintain high-quality reporting, this should put it on the wanted list of all (neuro)scientists, whether readers or authors!


A long-awaited revision of the ARRIVE guidelines has finally been published (The ARRIVE guidelines 2019).
What impact can we expect from ARRIVE 2.0? At the very least, we hope that the Nature journals will update their life sciences checklist, which will certainly have an impact. As an example, let’s look at the recently published paper that reported on the impact of the gut microbiome on motor function and survival in the ALS mouse model.
We picked this example because: i) it is recent, and ii) it uses the same SOD1 mouse model that has become a classical example of how adherence to higher research rigor standards turns “positive” data into “negative”.
We would not use this example if the only finding was about changes in motor performance of the SOD1 mice exposed to a long-term antibiotic cocktail – this would not be too surprising, as antibiotics may penetrate into the CNS and may have effects unrelated to their “class” effects. The survival data in the germ-free animals are also not the reason we focus on this paper, because there were obvious problems with the rederivation itself (page 2, left column) and only insufficient information on study design is given (e.g. no details on surgery; no information on whether the colonized animals were also obtained via rederivation).
The most striking are actually the data in Figure 3 “Akkermansia muciniphila colonization ameliorates motor degeneration and increases life-span in SOD1-Tg mice”.
How would ARRIVE 2.0 help the reader to gain confidence in these data sets? In the table below, we review the responses provided by the authors in the Life Sciences checklist and, stimulated by ARRIVE 2.0, indicate what information is missing to increase confidence in these study results:
| Published Life Sciences checklist | What we would like to see |
| --- | --- |
| All in vivo experiments were randomized and controlled for gender, weight and cage effects. | What methods were used for randomization and how can these methods explain highly unequal sample sizes within an experiment? |
| Sample sizes were determined based on previous studies and by the gold standards accepted in the field. | A reference to the gold standards would be very helpful. The only gold standard in this field we are aware of – Scott et al. 2008 – would certainly not recommend using n = 5. |
| In all in vivo experiments the researchers were blinded. | This statement is insufficient to know whether blinding was maintained until data analysis. |
| No data were excluded from the manuscript. | The experiment in Fig. 3 was repeated 6 (!) times with sample sizes between 5 and 26, resulting in pooled sample sizes of up to 62 mice per group. However, survival data are presented only for 4-8 mice per group. It would be interesting to see survival data from the main pool of animals unless the main experiment was stopped at day 140. |
| All attempts of replication were successful and individual repeats are presented in ED and SI sections. | Given that each of the “replication” experiments was severely underpowered, one may wonder whether these were indeed independent experiments or parts of a single study erroneously presented as “attempts of replication”. |
We realize that ARRIVE 2.0 may not be sufficient to obtain answers to all of the above questions, but it is certainly a major step forward that should be strongly endorsed and promoted.
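To give a feel for what “underpowered” means at n = 5 per group, the sketch below computes the power of a two-sided, two-sample t-test for a few standardized effect sizes. This is only a stand-in: the effect sizes are hypothetical, and the paper’s survival data would call for a different analysis than a t-test, so the numbers should not be read as the actual power of that study.

```r
# Power of a two-sided, two-sample t-test with n = 5 per group.
# Effect sizes are hypothetical standardized differences (difference / SD).
effect <- c(0.5, 1, 1.5, 2)

pow <- sapply(effect, function(d) {
  power.t.test(n = 5, delta = d, sd = 1, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})
names(pow) <- effect

# Power for each assumed effect size
print(round(pow, 2))
```

Even for a difference as large as one standard deviation, the power at n = 5 stays well below the conventional 80%.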

Power analysis for two-sided t-test

The example below does not require any additional R packages to be installed (i.e. runs with the basic R installation).
This example conducts a power analysis for an expected difference between means equal to 2 with a standard deviation of 1, over a range of alpha (significance level) and power (1 – beta) values. Any one of the four parameters (sample size, effect, alpha and power) can be estimated when the other three are provided.
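As a minimal sketch of this “leave one out” behaviour: power.t.test from base R solves for whichever argument is passed as NULL. The choices n = 8 and power = 0.9 below are ours, picked only for illustration.

```r
# Solve for power, given the sample size per group
res_power <- power.t.test(n = 8, delta = 2, sd = 1,
                          sig.level = 0.05, power = NULL)

# Solve for the sample size per group, given the desired power
res_n <- power.t.test(n = NULL, delta = 2, sd = 1,
                      sig.level = 0.05, power = 0.9)

print(res_power$power)    # achieved power with 8 animals per group
print(ceiling(res_n$n))   # animals per group, rounded up to a whole number
```

The defaults type = "two.sample" and alternative = "two.sided" are used here, which matches the settings written out explicitly in the script below.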
Results are generated as a table of the sample size per group required to detect the expected difference between means, with alpha in rows and power in columns,

or as a table of the difference between means that can be detected with a given sample size per group (rows) at alpha=0.05 and power=0.7-0.95 (columns).

This script also generates a plot of the required sample size against alpha, with one curve per power level.

Here is the script:

# calculate sample size (per group) required
# to detect expected difference between means
# at the following alpha (0.001, 0.01 and 0.05) and power (0.7-0.95)

a <- c(.001,.01,.05)
na <- length(a)
p <- c(.7,.8,.85,.9,.95)
np <- length(p)

# d is the expected difference between means
# s is the standard deviation

d <- 2
s <- 1

# rows = alpha, columns = power
samsize <- array(numeric(na*np), dim=c(na,np))
for (i in 1:np){
  for (j in 1:na){
    result <- power.t.test(n = NULL, delta = d, sd = s,
                           sig.level = a[j], power = p[i],
                           type = "two.sample", alternative = "two.sided")
    # round up to a whole number of subjects per group
    samsize[j,i] <- ceiling(result$n)
  }
}
colnames(samsize) <- p
rownames(samsize) <- a
print(samsize)

# calculate difference between means that can be detected
# with the following sample sizes per group (6 to 12)
# at alpha=0.05 and power=0.7-0.95

nu <- c(6:12)
nun <- length(nu)

# samsize2 holds detectable differences: rows = sample size, columns = power
samsize2 <- array(numeric(nun*np), dim=c(nun,np))
for (k in 1:np){
  for (m in 1:nun){
    result <- power.t.test(n = nu[m], delta = NULL, sd = s,
                           sig.level = 0.05, power = p[k],
                           type = "two.sample", alternative = "two.sided")
    samsize2[m,k] <- result$delta
  }
}
colnames(samsize2) <- p
rownames(samsize2) <- nu
print(samsize2)


# set up graph

xrange <- range(a)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, log="x", type="n",
     xlab="Significance level (alpha)",
     ylab="Sample Size (n)")

# add power curves

for (i in 1:np){
  lines(a, samsize[,i], type="l", lwd=2, col=colors[i])
}

# add annotation (grid lines, title, legend)

abline(h=0, v=a, lty=2, col="grey89")
title("Sample Size Estimation for two-tailed t-test \n")
legend("topright", title="Power", legend=as.character(p), fill=colors)
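As a sanity check on the calculated sample sizes, one can simulate many experiments and count how often the t-test comes out significant. The sketch below does this for one cell of the table (alpha = 0.05, power = 0.8); the number of simulations, the seed and the variable names are our choices, and normally distributed data with equal variances are assumed, as in power.t.test.

```r
set.seed(42)
d <- 2; s <- 1            # same difference and SD as in the script above
alpha <- 0.05
target <- 0.8

# sample size per group from the analytical calculation, rounded up
n <- ceiling(power.t.test(delta = d, sd = s, sig.level = alpha,
                          power = target)$n)

# simulate experiments and collect the p-values of a two-sided t-test
nsim <- 5000
pvals <- replicate(nsim, {
  x <- rnorm(n, mean = 0, sd = s)
  y <- rnorm(n, mean = d, sd = s)
  t.test(x, y, var.equal = TRUE)$p.value
})

# fraction of significant results = simulated power
sim_power <- mean(pvals < alpha)
print(sim_power)
```

Because n is rounded up to a whole number, the simulated power is expected to come out at or somewhat above the target rather than exactly on it.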