The consequences of collecting more data in pursuit of statistical significance
You have probably encountered this situation: you ran an experiment and observed a numerical effect of a size with presumed biological relevance, but the calculated p-value lies above your significance threshold (usually 0.05). The experiment is inconclusive because you can neither conclude that there is an effect nor that there is none – the worst possible outcome. In this situation, the temptation arises to increase the sample size, i.e., to add a few more samples. This procedure is often referred to as n-hacking, a form of p-hacking. Many statisticians have pointed out that the post-hoc addition of samples is bad practice, particularly in studies designed to test a statistical null hypothesis.

This recent essay by Pamela Reinagel from UCSD describes the outcomes of mathematical simulations designed to quantify the impact of n-hacking, which she more politely calls “sample augmentation”. Her simulations found that unlimited sample augmentation indeed dramatically increases the risk of false positive findings. Restricted sample augmentation based on preset rules (most importantly, the p-value range in which samples may be added and the maximum number of samples that may be added), on the other hand, inflates the false positive rate far less than often assumed and at the same time increases the positive predictive value. Based on these calculations, she makes the case that sample augmentation may be acceptable if (and that is a big if) it is transparently reported, including the sample size, effect size, and p-value before and after adding more samples.
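To make the core finding concrete, here is a minimal simulation sketch in Python – my own illustration, not Reinagel’s actual code. Two groups are drawn from the same distribution, so every “significant” result is a false positive; with unlimited augmentation, the experimenter re-tests after each added sample until significance is reached or a cap is hit. All parameters (starting n, cap, number of simulations) are arbitrary choices for illustration.

```python
# Minimal sketch (not Reinagel's code): how unlimited sample augmentation
# inflates the false positive rate when the null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_start, n_max, n_sims = 0.05, 10, 50, 2000  # illustrative values

def false_positive_rate(augment: bool) -> float:
    hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution, so any
        # "significant" result is a false positive by construction.
        a = list(rng.normal(size=n_start))
        b = list(rng.normal(size=n_start))
        p = stats.ttest_ind(a, b).pvalue
        # Unlimited augmentation: add one sample per group and re-test
        # until p < alpha or the cap is reached.
        while augment and p >= alpha and len(a) < n_max:
            a.append(rng.normal())
            b.append(rng.normal())
            p = stats.ttest_ind(a, b).pvalue
        hits += p < alpha
    return hits / n_sims

print(f"fixed n:                FPR ~ {false_positive_rate(False):.3f}")  # close to 0.05
print(f"unlimited augmentation: FPR ~ {false_positive_rate(True):.3f}")   # well above 0.05
```

Restricting the same procedure with a preset p-value window and a small cap (as in the rule sketched further below) brings this inflation down substantially, which is Reinagel’s central point.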

While the simulations presented by Pamela Reinagel are compelling, they are abstractions. Looking at real bench scenarios instead, I immediately think of three prototypical settings:

  • The study in question is a long-term, high-effort study. In this case, post-hoc addition of samples is typically infeasible.
  • The study in question is a simple lab experiment, e.g., on a cell line, and does not require much time, resources, or effort. In this case, the best alternative may be to treat the original study as a pilot experiment and to run an entirely new study with a larger sample size. Optionally, the outcomes of both studies can later be pooled in some form of meta-analysis (see the sketch after this list). This avoids any discussion of n-hacking.
  • These are the two easy scenarios. I feel that Pamela Reinagel’s essay becomes most important for the studies that fall in between, i.e., where adding samples is feasible but repeating the entire study with a larger sample size is not.
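As for the pooling mentioned in the second scenario, one lightweight option – sketched below with invented numbers – is to combine the p-values of the two independent studies, e.g., with Stouffer’s weighted Z-method as implemented in scipy.stats.combine_pvalues. A full meta-analysis would usually pool effect sizes with inverse-variance weights instead; this is only the simplest version.

```python
# Sketch of pooling an inconclusive pilot with an independent follow-up
# study by combining their p-values (Stouffer's weighted Z-method).
# All numbers are invented for illustration.
from math import sqrt
from scipy.stats import combine_pvalues

p_pilot, n_pilot = 0.08, 12        # hypothetical original experiment
p_followup, n_followup = 0.04, 24  # hypothetical independent replication

stat, p_pooled = combine_pvalues(
    [p_pilot, p_followup],
    method="stouffer",
    weights=[sqrt(n_pilot), sqrt(n_followup)],  # weight by sqrt(n)
)
print(f"pooled p ~ {p_pooled:.3f}")
```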

Pamela Reinagel states, although not very explicitly, that the rules for augmenting sample sizes also need to be prespecified. This is very important: otherwise, the decision to add samples will be highly biased and most likely not covered by the simulations reported in her essay.
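As a purely hypothetical example of what such a prespecified rule could look like (all thresholds below are my own illustrative choices, not values from the essay): augmentation is permitted at most once, only when the initial p-value falls inside a preregistered “promising” window, and only up to a preset maximum sample size.

```python
# Hypothetical prespecified augmentation rule, written down before the
# experiment. All thresholds are illustrative choices, not values from
# Reinagel's essay.
def may_augment(p_initial: float, n_current: int,
                already_augmented: bool = False,
                p_window: tuple = (0.05, 0.10), n_cap: int = 16) -> bool:
    """Return True only if the preset rule permits adding samples."""
    return (not already_augmented                       # augment at most once
            and p_window[0] < p_initial <= p_window[1]  # only "promising" p-values
            and n_current < n_cap)                      # never beyond the preset cap

print(may_augment(0.07, 12))  # True: inside the window, below the cap
print(may_augment(0.30, 12))  # False: p outside the preregistered window
```

Whether samples were added, and under which rule, would then be reported alongside the sample sizes, effect sizes, and p-values before and after augmentation, as the essay suggests.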