“When she arrived, I gave her a data set of a self-funded, failed study which had null results (it was a one month study in an all-you-can-eat Italian restaurant buffet where we had charged some people ½ as much as others). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). I told her what the analyses should be and what the tables should look like. I then asked her if she wanted to do them.
Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them. I outlined the first paper, and she wrote it up, and every day for a month I told her how to rewrite it and she did. This happened with a second paper, and then a third paper (which was one that was based on her own discovery while digging through the data).”
These two paragraphs are an extract from the (infamous) November 2016 blog post by the food behavior expert Brian Wansink, Professor at Cornell University. In this post, Wansink described how he encouraged a visiting graduate student to reanalyze the same data set again and again, and praised her for what amounts to HARKing (‘Hypothesizing After the Results are Known’) and cherry-picking (selectively emphasizing the results that support a given position while ignoring or dismissing those that do not) to make the data fit a hypothesis.
Now, about a year later, critics have found extensive irregularities in at least 50 of Wansink’s studies, leading to four retracted articles and at least eight corrections (either published or forthcoming in the next few weeks).
Selective reporting of studies and analyses is a reality that challenges the value of the scientific enterprise. How widespread practices like cherry-picking, HARKing or p-hacking (running analyses in different ways until a significant result appears) are in the scientific literature is, however, not entirely clear, because these practices are very difficult to detect. One way to detect them is based on ‘p-curves’, which describe the distribution of statistically significant p-values for a set of independent findings [LIT].
Although this analysis also has its limitations [for a more detailed discussion see here – LIT], p-curve tests for evidential value (whether or not the published evidence for a specific hypothesis suggests that the effect size is nonzero) and for p-hacking can readily be used to detect biases in meta-analysis datasets, provided that p-values are reported in all selected publications.
In cases where the null hypothesis is true (effect size is zero), the distribution of p-values from independent experiments is flat and uniform (Figure 1, black line), i.e. p<0.05 will occur 5% of the time. In contrast, when the true effect size is nonzero, the expected distribution of p-values is exponential and right-skewed (Figure 2, black line): for strong true effects it is more likely to obtain very low p-values (e.g., p<0.001) than moderately low p-values (e.g., p<0.01), and it is even less likely to obtain nonsignificant p-values (p>0.05).
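The two shapes described above can be reproduced with a small simulation. The following is a minimal sketch, not the method behind the figures: it assumes normally distributed data with a known standard deviation of 1, a two-sided z-test, 20 observations per experiment and a "strong" true mean of 0.8. All function names and parameter values are chosen for illustration only.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation."""
    z = fmean(sample) / (sigma / len(sample) ** 0.5)
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate_p_values(true_mean, n_experiments=5000, n_per_experiment=20, seed=1):
    """Run many independent experiments and collect one p-value from each."""
    rng = random.Random(seed)
    return [
        z_test_p_value([rng.gauss(true_mean, 1.0) for _ in range(n_per_experiment)])
        for _ in range(n_experiments)
    ]


def share_below(p_values, threshold):
    """Fraction of p-values that fall below the given threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


null_p = simulate_p_values(true_mean=0.0)    # no true effect (assumed scenario)
effect_p = simulate_p_values(true_mean=0.8)  # strong true effect (assumed scenario)

for label, p_values in (("no effect", null_p), ("true effect", effect_p)):
    shares = {t: round(share_below(p_values, t), 3) for t in (0.001, 0.01, 0.05)}
    print(label, shares)
```

With no true effect, the share of p-values below each threshold should be close to the threshold itself, which is what a flat distribution means. With the assumed strong effect, most p-values pile up at the very low end.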
If researchers now use p-hacking to change results when there is no true effect, the p-curve will be altered close to the perceived significance threshold (typically p=0.05) and will shift from being flat to left-skewed, with an overabundance of p-values just below 0.05 (Figure 1, red line). On the other hand, if p-hacking occurs when there is a true effect, the p-curve will still be exponential and right-skewed, but p-values just below 0.05 will be overrepresented in the tail of the distribution (Figure 2, red line).
Figures 1 and 2: The effect of p-hacking on the distribution of p-values in the range of significance. Figures adapted from Head et al. 2015.
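To make the left skew concrete, here is a minimal simulation sketch of one common form of p-hacking, optional stopping: the data are tested after every batch of observations and collection stops as soon as p<0.05. There is no true effect in the simulated data. The batch sizes, the z-test with known standard deviation, and the split of the significant range into two halves at p=0.025 are assumptions made for illustration; they are not the exact procedure or bins used in the p-curve literature.

```python
import random
from math import comb
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(total, n):
    """Two-sided z-test p-value for H0: mean = 0, given the sum of n N(0, 1) draws."""
    z = total / n ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def optional_stopping_p(rng, start_n=10, step=5, max_n=50):
    """Test after every batch and stop as soon as p < 0.05 (no true effect)."""
    total, n = 0.0, 0
    while n < max_n:
        batch = start_n if n == 0 else step
        total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
        n += batch
        p = two_sided_p(total, n)
        if p < 0.05:
            break
    return p


def binomial_upper_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), used here as a one-sided sign test."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


rng = random.Random(7)
p_values = [optional_stopping_p(rng) for _ in range(6000)]
significant = [p for p in p_values if p < 0.05]

lower_bin = sum(p < 0.025 for p in significant)  # 0 <= p < 0.025
upper_bin = len(significant) - lower_bin         # 0.025 <= p < 0.05
false_positive_rate = len(significant) / len(p_values)
skew_p = binomial_upper_tail(upper_bin, len(significant))

print(f"false positive rate: {false_positive_rate:.3f}")
print(f"lower bin: {lower_bin}, upper bin: {upper_bin}, sign test p: {skew_p:.2g}")
```

Without p-hacking, significant p-values under a true null would be split evenly between the two halves. In this sketch the false positive rate rises above the nominal 5%, and the upper half (closer to 0.05) holds more p-values than the lower half, which is the left skew described above.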
If, due to HARKing, a hypothesis is formulated after the results are known, the chance of falsely rejecting a null hypothesis increases. Furthermore, the literature will give a distorted picture of effect sizes, because the effects reported will tend to be larger than the true effect sizes. In addition, false positive results can inspire investment in fruitless research programs, wasting resources such as time and money, and can even discredit entire fields. Quite often, once false positive results enter the literature they become very persistent, and early positive studies often receive more attention than later negative ones.
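Both claims can be illustrated numerically. The sketch below makes two simplifying assumptions that are not taken from the text: first, that searching a data set for a hypothesis is equivalent to running several independent tests of true null hypotheses at a significance level of 0.05; second, that only statistically significant estimates get reported. The true effect of 0.2, the sample size and all names are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_estimates(true_effect, n_studies=5000, n_per_study=20, seed=3):
    """Each study estimates the effect as the mean of n_per_study noisy observations."""
    rng = random.Random(seed)
    return [
        fmean([rng.gauss(true_effect, 1.0) for _ in range(n_per_study)])
        for _ in range(n_studies)
    ]


TRUE_EFFECT = 0.2  # assumed small true effect, in units of the standard deviation
N_PER_STUDY = 20

# Smallest absolute sample mean that reaches p < 0.05 in a two-sided z-test (sigma = 1).
critical = NormalDist().inv_cdf(0.975) / N_PER_STUDY ** 0.5

estimates = simulate_estimates(TRUE_EFFECT, n_per_study=N_PER_STUDY)
significant = [m for m in estimates if abs(m) > critical]

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests: P(at least one false positive) = {familywise_error_rate(k):.2f}")
print(f"mean estimate, all studies:      {fmean(estimates):.2f}")
print(f"mean estimate, significant only: {fmean(significant):.2f}")
```

Averaged over all simulated studies, the estimate is close to the assumed true effect. Averaged over the significant ones only, it is clearly larger, because with a small sample only an overestimate can cross the significance threshold.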
Fortunately, there are some measures that help to prevent cherry-picking and HARKing:
- Replicating research makes it possible to find out whether any HARKing was involved.
- Pre-registering study hypotheses minimizes the possibility of changing a hypothesis depending on the data obtained. In addition, if pre-registered methods and protocols are accepted for publication independent of the outcome, this will further increase the acceptance of negative findings.
- Clearly labeling research as pre-specified (confirmatory) or exploratory (i.e., involving exploration of data that look intriguing, where the methods and analyses used are often based on post hoc decisions) allows readers to treat results with appropriate caution. Results from pre-specified studies offer far more convincing evidence than those from exploratory research.
- Performing data analysis in a blinded manner wherever possible makes it difficult to p-hack for specific results.
- If ignorance of the consequences is the reason for HARKing and cherry-picking, then educational and training courses are a good solution.
- Journals need to provide clear and detailed guidelines for the full reporting of data analyses and results.