It has been proposed repeatedly that adding samples based on the results of initial experiments is a form of p-hacking (see e.g. the new Instructions to Authors of the journals of the American Society for Pharmacology and Experimental Therapeutics). While these recommendations were based on sound theoretical considerations, Pamela Reinagel from San Diego, in a manuscript that has not yet been peer reviewed, uses Monte-Carlo simulations to quantify the impact on false positives of dynamically adjusting sample size, a practice she calls n-hacking. Interestingly, her analysis shows that n-hacking increases false positives and that effect sizes and prior probability are key drivers of this.
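The basic mechanism can be illustrated with a small simulation. The sketch below is not taken from her manuscript; the decision rule (keep adding a fixed increment of samples per group whenever an interim p-value falls just above the significance threshold, then retest without any correction) and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_experiment(n_init=12, increment=6, n_max=48, alpha=0.05, promising=0.10):
    """One simulated two-sample experiment under the null (no true effect),
    with uncorrected sample augmentation whenever the interim p-value is
    'promising' (alpha <= p < promising)."""
    a = list(rng.normal(size=n_init))
    b = list(rng.normal(size=n_init))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            return True                       # declared significant
        if p >= promising or len(a) >= n_max:
            return False                      # give up
        a.extend(rng.normal(size=increment))  # n-hack: add more samples
        b.extend(rng.normal(size=increment))

n_sims = 20_000
false_positives = sum(run_experiment() for _ in range(n_sims))
print(f"empirical false-positive rate: {false_positives / n_sims:.3f} "
      f"(nominal alpha = 0.05)")
```

Because augmentation can only add rejections on top of those already obtained at the first look, the empirical false-positive rate printed here should come out above the nominal 5%; how far above depends on the parameters of the rule.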
However, her simulations also suggest that the positive predictive value (PPV) increases and is higher than that of non-incremental experiments. Apparently, the increase in false positives is more than offset by the increase in true positives. She proposes that post-hoc increases in sample size must be disclosed, but that they could confer previously unappreciated advantages in efficiency and positive predictive value. However, she also warns that adaptations of sample size require careful consideration of p-value corrections, because n-hacking essentially is a form of increasing the statistical alpha.
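The offset argument can be made explicit with the standard PPV formula (generic notation, not taken from the manuscript), where π is the prior probability that a tested hypothesis is true, 1 − β the power, and α′ the realized false-positive rate of the procedure:

\[
\mathrm{PPV} = \frac{\pi\,(1-\beta)}{\pi\,(1-\beta) + (1-\pi)\,\alpha'}
\]

Uncorrected augmentation raises α′ above the nominal α, but it also raises 1 − β, because real effects that narrowly miss the threshold at the first look get a further chance to be detected; whenever the relative gain in power exceeds the relative inflation of α′, the PPV goes up, which is the offset described above.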
Therefore, the proposal by Pamela Reinagel does not invalidate the argument that findings resulting from n-hacking/adapted sample sizes are no longer suitable for hypothesis-testing statistical analysis unless careful pre-specification is in place to adjust for the increased alpha.
1 Comment
First, a correction: the statement "n-hacking increases false positives and that effect sizes and prior probability are key drivers of this" is not an accurate summary of the paper. The increase in false positives analyzed in that paper was simulated under the assumption that the null hypothesis is true. The key drivers were shown to be the initial sample size and the sample increment. Effect size and prior probability played no role in any of the analyses of false positives. They were discussed, but in the discussion of PPV, where they were shown to be inconsequential rather than key drivers. Of course effect size and priors drive PPV, but they did not affect the consequences of N-hacking on PPV, which is the question at hand.
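To make that concrete, the dependence on initial sample size and increment can be tabulated with a self-contained sketch in the same spirit as the one above (again an illustrative decision rule with made-up parameters, not the paper's simulation code), here restricted to at most one round of augmentation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def fp_rate_one_look(n_init, increment, alpha=0.05, promising=0.10,
                     n_sims=10_000):
    """Empirical false-positive rate under the null when at most one
    augmentation is allowed: test at n_init per group and, if
    alpha <= p < promising, add `increment` samples per group and
    test once more without correction."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_init)
        b = rng.normal(size=n_init)
        p = stats.ttest_ind(a, b).pvalue
        if alpha <= p < promising:
            a = np.concatenate([a, rng.normal(size=increment)])
            b = np.concatenate([b, rng.normal(size=increment)])
            p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            hits += 1
    return hits / n_sims

for n_init in (6, 12, 24):
    for increment in (3, 6, 12):
        print(f"n_init={n_init:2d}  increment={increment:2d}  "
              f"FP rate={fp_rate_one_look(n_init, increment):.3f}")
```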
That may well be so, but I do not think it has yet been established whether pre-specification of an alpha-adjustment rule is superior to post-hoc application of one of the available corrections (or even to picking the most "favorable" of them), in terms of the accuracy of the reported p-value. The simulations in the cited manuscript represent what would happen if a population of researchers neither pre-specified nor corrected for post-hoc sample augmentation. The results can therefore be interpreted as empirical upper bounds on the effective alpha of non-pre-specified augmentation, which under some experimental designs was shown to be a rather modest increase in false positives (given the null), with an attendant increase in PPV (regardless of the prior or effect size).
Incidentally, for those who have this concern: I have since explored "p equals" simulations for a few example scenarios, and so far the results do not materially change relative to the "p-less-than" simulations shown in the paper.
I think it would be valuable to follow up on the cited study by comparing the different available corrections of the p-value (or alternative hypothesis-testing metrics), with or without pre-specification. This would allow one to verify and quantify the benefits of pre-specification in a case where the ground truth is known.
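As a rough illustration of what such a comparison could look like, the sketch below pits an uncorrected two-look rule against a pre-specified Bonferroni-style split of alpha across the two looks, in a simulated world where the ground truth (a mix of null and true effects) is known. The decision rule, the choice of Bonferroni as the stand-in correction, and all parameters are placeholder assumptions; more refined group-sequential boundaries (e.g. Pocock-type) could be swapped in.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def one_study(effect, n_init=12, increment=12, look_alphas=(0.05, 0.05),
              promising=0.10):
    """One two-sample study with at most one augmentation.
    `look_alphas` gives the significance threshold at each of the two looks;
    (0.05, 0.05) is the uncorrected rule, (0.025, 0.025) a Bonferroni-style
    pre-specified correction. Returns True if declared significant."""
    a = rng.normal(effect, 1.0, size=n_init)
    b = rng.normal(0.0, 1.0, size=n_init)
    p = stats.ttest_ind(a, b).pvalue
    if p < look_alphas[0]:
        return True
    if p < promising:                      # "promising": augment and retest
        a = np.concatenate([a, rng.normal(effect, 1.0, size=increment)])
        b = np.concatenate([b, rng.normal(0.0, 1.0, size=increment)])
        return stats.ttest_ind(a, b).pvalue < look_alphas[1]
    return False

def evaluate(look_alphas, prior=0.1, effect=1.0, n_sims=20_000):
    """Estimate the FP rate given the null, the power given a true effect of
    standardized size `effect`, and the PPV, when a fraction `prior` of the
    tested hypotheses are true."""
    tp = fp = n_true = n_null = 0
    for _ in range(n_sims):
        if rng.random() < prior:
            n_true += 1
            tp += one_study(effect, look_alphas=look_alphas)
        else:
            n_null += 1
            fp += one_study(0.0, look_alphas=look_alphas)
    return fp / n_null, tp / n_true, tp / (tp + fp)

for label, alphas in [("uncorrected", (0.05, 0.05)),
                      ("pre-specified Bonferroni", (0.025, 0.025))]:
    fpr, power, ppv = evaluate(alphas)
    print(f"{label:25s}  FP rate={fpr:.3f}  power={power:.2f}  PPV={ppv:.2f}")
```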