At least partly driven by regulatory requirements, technical data quality in clinical studies has improved markedly over the past decades and may now exceed that of many non-clinical studies. Bias-reducing measures such as blinding, randomization, and pre-specified sample sizes and statistical analysis plans have become standard in clinical trials, particularly those performed for regulatory purposes and sponsored by the pharmaceutical industry. Therefore, the current debate on how to improve reproducibility in preclinical research has much to learn from clinical pharmacology.
However, clinical pharmacology is not perfect either. I would like to highlight an article by Shun-Shin & Francis (2013), which discusses an issue in clinical trials that may also be relevant for preclinical studies: how does a researcher handle unexpected values? While blinding may ameliorate the problem, the prior belief about whether two groups differ may affect how one handles outliers in the absence of a pre-specified policy. The options are A) keeping every value, no matter how implausible, B) reclassifying the sample, C) removing the sample, and D) remeasuring the sample. Each of these options has its own implications:
Keeping a sample in the data set by all means is the only unbiased option but may retain variability that makes no sense. For instance, a recorded height of 165 millimeters for an adult human is obviously not compatible with life and most probably represents a data entry error; keeping it in the data set adds nonsensical variability that may obscure true differences. Reclassification, i.e. moving a sample assigned to group X to group Y, rests on the assumption that the sample could only plausibly belong to the other group, based on the preconceived notion that group Y is different. Removing the sample makes the same assumption, although data sets may be less vulnerable to removal. Remeasuring the sample (where possible) sounds like the option with the least bias, but if only samples that do not fit preconceived notions are remeasured, this in itself introduces a bias, because regression to the mean is then applied in only one direction. Based on simulations, Shun-Shin & Francis show that the effect of options B-D decreases with larger overall sample sizes. Moreover, they report that, for a given sample size, data sets are most vulnerable to reclassification and least vulnerable to remeasurement, in terms of turning a neutral result into a statistically significant difference.
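To make the mechanism tangible, the following Python sketch runs many "null" experiments in the spirit of the Shun-Shin & Francis simulations. It is my own simplified reconstruction, not their code: the group sizes, the t-test, and the rule of handling only the single most inconvenient value are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_null_experiment(n, policy):
    # Both groups come from the SAME distribution, so any "difference" is spurious.
    # The (unblinded) experimenter expects group B to be lower than group A and
    # handles the single most inconvenient value: the highest value in B.
    a = rng.normal(0.0, 1.0, n).tolist()
    b = rng.normal(0.0, 1.0, n).tolist()
    i = int(np.argmax(b))
    if policy == "reclassify":        # option B: move the value to the other group
        a.append(b.pop(i))
    elif policy == "remove":          # option C: discard the value
        b.pop(i)
    elif policy == "remeasure":       # option D: redraw from the same distribution;
        b[i] = rng.normal(0.0, 1.0)   # regression to the mean pulls it down on average
    # policy == "keep" (option A): leave the data untouched
    return stats.ttest_ind(a, b).pvalue

def false_positive_rate(n, policy, n_sims=10_000):
    # Fraction of null experiments that come out "statistically significant"
    return np.mean([one_null_experiment(n, policy) < 0.05 for _ in range(n_sims)])

for n in (5, 10, 40):
    print(n, {p: false_positive_rate(n, p)
              for p in ("keep", "reclassify", "remove", "remeasure")})
```

In this toy setup, "keep" stays near the nominal 5% false-positive rate, while the other three policies inflate it, most strongly reclassification (which shifts both group means at once) and least remeasurement, and the inflation shrinks as group size grows, mirroring both findings summarized above.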
The theoretically best way to handle this would be to pre-specify methods of outlier handling. However, my own experience tells me that there are so many types of unexpected data, particularly in preclinical studies, that a comprehensive policy for each type of measurement in a study may require more effort than the study itself; there are just too many ways in which an experiment can yield unexpected values. So what can be done? Pre-specified policies on outlier handling are a good thing. Where they are not applicable, full transparency about what has actually been done, together with access to / sharing of the raw data, may be the best option.
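For completeness, here is one hypothetical form such a pre-specified policy could take: a group-blind rule, written down before unblinding, that flags values outside plausibility bounds or beyond a robust fence applied symmetrically in both directions. The bounds and the MAD-based fence below are illustrative choices of mine, not a recommendation from the literature.

```python
import numpy as np

def flag_outliers(values, lo=None, hi=None, k=3.0):
    # Hypothetical pre-specified rule: flag values outside the plausibility
    # bounds [lo, hi] or beyond k robust standard deviations from the median,
    # applied identically in both directions and to all groups before unblinding.
    v = np.asarray(values, dtype=float)
    mask = np.zeros(v.shape, dtype=bool)
    if lo is not None:
        mask |= v < lo
    if hi is not None:
        mask |= v > hi
    mad = 1.4826 * np.median(np.abs(v - np.median(v)))  # robust spread estimate
    if mad > 0:
        mask |= np.abs(v - np.median(v)) > k * mad
    return mask

# Adult human heights in cm; 16.5 is the 165-millimeter entry error from the text
heights_cm = [172.0, 165.0, 16.5, 180.0, 158.0]
print(flag_outliers(heights_cm, lo=120.0, hi=220.0))  # flags only the 16.5
```

The essential property is not the particular fence but that the rule is fixed before the data are seen and is blind to group assignment.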