A P-value below 0.05 is commonly accepted as the borderline between a finding and the empty-handed end of a research project. However, there are problems with this convention. First, P-values around 0.05 are notoriously irreproducible – as they should be on theoretical grounds (Halsey et al., 2015). Second, P-values around 0.05 are associated with a false discovery rate that can easily exceed 30% (Ioannidis, 2005). Based on these considerations, David Colquhoun stated a few years ago that “a p∼0.05 means nothing more than ‘worth another look’. If you want to avoid making a fool of yourself very often, do not regard anything greater than p<0.001 as a demonstration that you have discovered something” (Colquhoun, 2014). While many thought that this had to be taken with a grain of salt, a currently circulating preprint further challenges the P<0.05 concept. It is a consensus statement by more than 70 leading statisticians, representing institutions such as Duke, Harvard and Stanford, and it proposes to move to a new standard of P<0.005. Reducing the statistical alpha to a tenth of its current value will certainly reduce false positives in biomedical research, but key questions arise.
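The arithmetic behind the false-discovery-rate claim is worth spelling out. The sketch below uses illustrative assumptions (a 10% prior probability that a tested hypothesis is true, and 80% power – these numbers are my own, not from the sources above) to show how a result at the p≈0.05 threshold can be a false positive more than 30% of the time:

```python
def false_discovery_rate(prior, power, alpha):
    """Fraction of 'significant' results that are false positives.

    prior: assumed probability that a tested hypothesis is actually true
    power: probability of detecting a true effect at the given alpha
    alpha: the significance threshold
    """
    true_positives = prior * power          # real effects correctly detected
    false_positives = (1 - prior) * alpha   # null effects crossing the threshold
    return false_positives / (true_positives + false_positives)

# With a 10% prior and 80% power, alpha = 0.05 gives an FDR of 0.36,
# while alpha = 0.005 brings it down to about 0.05.
print(false_discovery_rate(prior=0.1, power=0.8, alpha=0.05))
print(false_discovery_rate(prior=0.1, power=0.8, alpha=0.005))
```

Note that the result depends strongly on the assumed prior: the more speculative the hypotheses being tested, the worse the false discovery rate at any fixed alpha.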
First, sample sizes required to power an experiment for a statistical alpha of 0.005 will simply be unfeasible in many if not most experimental models. In other words, feasible n’s will in most cases lead to inconclusive results – or at least to results that carry considerable uncertainty. I wonder whether this would be such a bad thing if discussed transparently. Research always has to handle uncertainty, and researchers should not hide it but rather discuss it. Rather than increasing sample sizes to unfeasible numbers, we should consider alternative approaches such as within-study confirmatory experiments, perhaps with somewhat different designs for added robustness.
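To put rough numbers on the feasibility concern, a standard power calculation can be sketched. This is a minimal example using the normal approximation for a two-sided, two-sample comparison; the medium effect size (d = 0.5) and 80% power are assumptions chosen for illustration:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha, power=0.8):
    """Approximate sample size per group for a two-sided two-sample test
    (normal approximation; effect_size is Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Moving from alpha = 0.05 to alpha = 0.005 at d = 0.5 and 80% power
for alpha in (0.05, 0.005):
    print(alpha, n_per_group(effect_size=0.5, alpha=alpha))
```

Under these assumptions, the required n per group rises from about 63 at alpha = 0.05 to about 107 at alpha = 0.005 – roughly a 1.7-fold increase, which quickly becomes prohibitive in animal experiments or patient studies.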
Second, shifting from 0.05 to 0.005 may simply replace one quasi-mythical value with another. However you set the statistical alpha, you will always balance the chance of false positives against that of false negatives, and it is unlikely that one size fits all. If there is a big risk, for instance a deadly complication of a new drug or man-made climate change, I would rather err on the safe side and might take countermeasures at P<0.1. In other cases, however, I may be more concerned about false positives, e.g. in genome-wide association studies, where P-values are given on a log scale.
Third, a threshold P-value (the statistical alpha) turns a grey zone of probabilities into a binary decision on whether to reject the null hypothesis. Such binary decisions can be important, for instance whether to approve a new drug. In most biomedical research, however, we do not need such binary decisions but rather a careful weighing of the available data and an understanding of the associated uncertainties.
In conclusion, P<0.05 is inadequate in many ways. However, only in a few cases will a marked lowering of the threshold for statistical significance be the solution. What is required instead is a more critical interpretation of data and their uncertainty.