Searching for significance

To examine associations between urinary chemical concentrations and adult health status, the US Centers for Disease Control and Prevention (CDC) measured the urine of 1455 study participants (representing the general US adult population) for the presence of 275 environmental chemicals, including Bisphenol A (BPA).
BPA is used in certain packaging materials such as polycarbonates for baby food bottles. It is also used in epoxy resins for internal protective linings for canned food and metal lids. However, evidence of adverse effects in animals has generated concerns over low-level chronic exposures in humans.
As part of the 2003-2004 National Health and Nutrition Examination Survey (NHANES), the same participants mentioned above were also questioned about 32 different clinical outcomes. Based on this NHANES analysis, it was found that higher urinary concentrations of BPA were associated with an increased prevalence of cardiovascular disease, diabetes, and liver-enzyme abnormalities (Lang et al., JAMA 2008; 300: 1303-10).
Importantly, however, the potential for false positives in this case is substantial when the complete CDC study design is examined: from the perspective of the full data set, there are 32 x 275 = 8800 different questions at issue. In addition, ten demographic variables (such as ethnicity, education and income) were also analyzed. With 32 possible health outcomes, potentially associated with any of the 275 chemicals, along with each demographic variable and different strategies for covariate adjustment, there could be as many as approximately 9 million statistical models and endpoints available to analyze the data (Young and Yu, JAMA 2009; 301, 720-721).

Given that the publication by Lang et al. focused on only one chemical and 16 health conditions, it is important to understand how many questions were at issue before the study was conducted. With this huge search space and all the possible modeling variations in the CDC study design, there is a real possibility that the findings reported by the authors are the result of chance rather than of real health concerns. When many questions are asked of the same data, some of them will come up as false positives purely by chance – a consequence known as the multiple testing problem.

The probability P of rejecting at least one true null hypothesis, when all n tests are independent of each other, can be calculated as follows:
P = 1 – (1 – α)^n
where (1 – α) is the probability of not rejecting a true null hypothesis in a single test, and (1 – α)^n is the probability of not rejecting any of the n true null hypotheses across all n tests.

If the conventional significance level of α = 0.05 is used for n = 20 tests, then there is a probability of around 64% that at least one true null hypothesis is rejected.
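The formula is easy to check numerically; a minimal Python sketch reproducing the 64% figure:

```python
def family_wise_error_rate(alpha: float, n: int) -> float:
    """Probability of rejecting at least one of n true null hypotheses
    when the n tests are independent: P = 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n

print(round(family_wise_error_rate(0.05, 1), 2))   # single test: 0.05
print(round(family_wise_error_rate(0.05, 20), 2))  # 20 tests: 0.64
```

Note how quickly the error rate inflates: already at five tests the probability of at least one false positive exceeds 20%.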

In ‘omics studies, such as genome-wide association studies (GWAS), it is not unusual to investigate 10,000 genes or more in simultaneous experiments. Without any appropriate correction for multiple testing (e.g. Bonferroni, step-down, and step-up techniques) and with α = 0.05, about 500 of 10,000 genes will, on average, be flagged as significant purely by chance, even when there is no real difference compared to a control group.
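The expected number of chance findings, and the simplest correction (Bonferroni; the step-down and step-up procedures mentioned above are refinements of the same idea), can be sketched as:

```python
n_tests = 10_000   # genes tested simultaneously
alpha = 0.05       # per-test significance level

# With all null hypotheses true, each test has a 5% false positive rate,
# so we expect alpha * n_tests spurious 'discoveries'
expected_false_positives = alpha * n_tests    # ~500

# Bonferroni correction: test each gene at alpha / n_tests, which keeps
# the probability of even one false positive at or below alpha
bonferroni_threshold = alpha / n_tests        # 5e-06

print(expected_false_positives, bonferroni_threshold)
```

The Bonferroni threshold of 5 × 10⁻⁶ for 10,000 tests illustrates why GWAS significance criteria (see below) ended up orders of magnitude stricter than the conventional 0.05.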
Over the last 10 years, however, the field of human genome epidemiology has responded to the above-mentioned issues, and a major transformational shift has occurred, with prominent changes in key parameters that influence the occurrence of false positive findings (Ioannidis et al., Epidemiology 2011, Vol 22, p 450-456):
  • Far more stringent criteria are required for ‘significant’ discoveries, with typical Type I error rates of α = 5 × 10⁻⁸ instead of 0.05.
  • Establishment of multi-team consortia for sharing data and performing GWA genotyping. These collaborations undertake systematic replication efforts for large sets of previously proposed nominally statistically significant candidate-gene associations.
  • The appraisal of all (negative and positive) results without selective reporting (publication bias, selective outcome, and analysis reporting bias).
Having implemented these new quality criteria, recent GWAS-based replication attempts of previously detected gene relationships found that only 13 gene locus-phenotype associations survived replication among the 1151 tested, a replication rate of roughly 1.1% (Ioannidis et al., Epidemiology 2011, Vol 22, p 450-456).

This shows that, sometimes, the right way is to step back and to consider issues associated with the study design of large studies and data sets, and to develop a (pre-specified) statistical analysis strategy that takes into account the large number of questions at issue.

Scientists believed a whiff of the bonding hormone Oxytocin could increase trust between humans. Then they went back and checked their work… 

Over the last two decades, the neuropeptide oxytocin (OT) has been studied extensively, and many articles have been published about its role in humans’ emotional and social lives, e.g. increasing trust and sensitivity to others’ feelings. There is even a TED talk (‘Trust, morality – and oxytocin?’) with over 1.4 million views.
The human trials conducted were based on early animal studies, where a critical manipulation of the OT system was translated into behavioral phenotypes affecting social cognition, bonding and individual recognition.
However, some recent publications question the sometimes bewildering evidence for the role of OT in influencing complex social processes in humans, and report failures to reproduce some of the most influential studies in the field. Furthermore, no elevated cerebrospinal fluid (CSF) OT levels could be detected 45 min after administration, the time window in which most behavioral tasks took place (Striepens et al., 2013). CSF OT concentrations were increased only after 75 minutes, indicating that OT pharmacokinetics is not fully understood. Moreover, it is still unclear whether the doses usually administered in the field (between 24 and 40 IU) can deliver enough OT to the brain to produce significant changes in individuals (Leng et al., 2016).
This ultimately leads to the following question: ‘If the published literature on the OT effects does not reflect the true state of the world, how has the vast behavioral OT literature accumulated (Lane et al., 2016)?’
Several possible scenarios and reasons are currently discussed and analyzed amongst OT researchers, demonstrating the crucial importance of implementing Good Research Practice standards, proper study design and a priori statistical power calculations:

Power analysis:
A meta-analysis of the effects of OT on human behavior found that the average OT study in healthy individuals has a statistical power of 16%, with a median sample size of 49 individuals. For clinical trials the statistical power was even lower (12%), given a median sample size of 26 individuals (Walum et al., 2016).
Hence, OT studies in humans are dangerously underpowered, as 80% is normally considered the minimal standard for adequate statistical power. Even for the studies with the largest effect and sample sizes (N = 112), the statistical power was below 70%. To achieve 80% power for the average reported effect size, a sample size of 352 healthy individuals would be needed (310 individuals for clinical trials).
Statistical power is the probability that a test will reject the null hypothesis when a true effect of a given size exists. In other words, given false negative rates of 84% (healthy individuals) and 88% (clinical trials), replication attempts of even true positive OT findings (with the same sample sizes) would fail 84% or 88% of the time, respectively. To further aggravate the problem, the effect size observed in underpowered studies is likely to be highly exaggerated, a phenomenon known as “the winner’s curse”.
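How power scales with sample size can be sketched with the standard normal approximation for a two-sample comparison (a simplification of the full t-test power calculation; the effect sizes below are illustrative, not those estimated by Walum et al.):

```python
from statistics import NormalDist

def two_sample_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test for a
    standardized effect size d, with n_per_group subjects per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)    # ~1.96 for alpha = 0.05
    shift = d * (n_per_group / 2.0) ** 0.5   # noncentrality of the test statistic
    return 1.0 - z.cdf(z_crit - shift)

# Textbook benchmark: a medium effect (d = 0.5) needs ~64 subjects
# per group for 80% power
print(round(two_sample_power(0.5, 64), 2))   # ~0.8
# A small effect with ~25 subjects per group is badly underpowered
print(round(two_sample_power(0.2, 25), 2))   # well below the 80% standard
```

The normal approximation slightly overstates power relative to an exact t-test calculation, but the qualitative message is the same: small samples chasing small effects rarely exceed 20% power.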
In addition, this meta-analysis also demonstrated that the positive predictive value (PPV) of those studies (calculated from the power, the pre-study odds and the alpha level) is low. It was therefore concluded that most of the reported positive findings in this field are likely to be false positives (Walum et al., 2016).

Publication bias:
Almost all of the studies (29 out of 33) investigated as part of the meta-analysis (Walum et al., 2016) reported at least one positive result (p-value below 0.05). This huge excess of statistically significant findings points towards a phenomenon referred to as the ‘file-drawer effect’, or publication bias, suggesting that a substantial amount of negative or inconclusive findings may remain unpublished.
In a laudable attempt to investigate whether there is a file-drawer problem in OT research, Anthony Lane at the Catholic University of Louvain analyzed all studies performed in his laboratory from 2009 until 2014, covering a total of 453 subjects (Lane et al., 2016). Indeed, he found a statistically significant effect of OT for only one out of 25 tasks. This large proportion of ‘unexpected’ null findings, which were never published, raised concerns about the validity of what is known about the influence of OT on human behavior and cognition. Lane therefore states that ‘our initial enthusiasm for OT has slowly faded away over the years and the studies have turned us from “believers” into “skeptics”’.
This publication bias is further supported by the current publication culture and the strong tendency of journals to favor results that confirm hypotheses and to neglect unconvincing data.

Study design:
In addition to publication bias, the excess of significant OT effects may also result from methodological, measurement or statistical artefacts: Lane’s laboratory also reported a massive use of ‘between-subject’ designs with relatively small sample sizes (around 30 individuals per study), which carries the risk of attributing effects to OT that are in fact generated by various unobservable factors, e.g. the personality of participants (Lane et al., 2016).
Furthermore, Lane et al. twice failed to replicate their own previous study (Lane et al., 2015), which had shown a powerful effect of OT in increasing the trusting behavior of study participants. Notably, in the original study, OT administration followed a single-blind procedure, in which the subject is blind to the treatment condition but the experimenter is not, introducing the risk that the experimenter might unconsciously act differently and thereby influence the subjects’ behavior to confirm the researcher’s hypothesis (unconscious behavioral priming). Both subsequent replication attempts were conducted in a double-blind manner.

Importantly, the statistical and methodological limitations discussed here are not specific to the OT field and also directly affect other areas of biomedical research. Nevertheless, a systematic change in research practices and in the OT publication process is required to increase the trustworthiness and integrity of the data and to reveal the true state of OT effects. The adherence to detailed Good Research Practices (e.g. a priori power calculations and accurate blinding procedures) and a transparent reporting of methods and findings should therefore be strongly encouraged.

GDF11 – the ‘new blood’ anti-aging protein

The seminal paper by John Ioannidis entitled “Why Most Published Research Findings Are False” (2005, PLoS Medicine 2: e124) contains some statements that are easy to understand and follow, e.g. that smaller sample sizes make research findings less likely to be true. However, there are others that, despite being very well presented and discussed in this highly cited paper, are rather difficult to follow and implement in research practice. For example, it has been argued for years that it is important to estimate how likely it is that a phenomenon is real given the general knowledge in the area prior to the study (the pre-study probability). This Bayesian thinking can be convincingly illustrated (Nuzzo (2014) Nature 506: 150) but, for most biomedical scientists and many research situations, it is difficult to implement, since the pre-study probability is hard to estimate.
Ioannidis (2005, Table 4) provides some rough examples of the ratios between true and non-true relationships for different study types, but this is not always helpful if a scientist wants to apply it to his/her particular research plans. Nevertheless, Table 4 can serve as a starting point and, by analyzing relevant examples, one may come up with a set of formal criteria that could help scientists estimate the pre-study odds for their own projects. We present this case study as an example to stimulate such discussion:

Aging is a slow process that likely involves multiple interconnected and very complex mechanisms. This is a statement that, for most people not having specific hypotheses about aging mechanisms, would sound rather reasonable to accept. Thus, how likely is it that a single protein given over a fairly short period of time will reverse the signs of aging?
Three papers published in highly respected journals (Loffredo et al (2013) Cell; Katsimpardi et al (2014) Science; Sinha et al (2014) Science) presented data suggesting that a four-week treatment with a protein called GDF11 makes the heart, skeletal muscle and brain of old mice look and perform like young ones. It is no surprise that other labs tried to follow up these publications and have come to conflicting evidence (summarized at: LINK). First, the quality of the research tools used (antibody specificity) has been questioned by a study arguing that GDF11 actually accumulates with age and inhibits muscle regeneration (Egerman et al (2015) Cell Metabolism). Second, the originator lab re-ran the study using a more appropriate design and found that GDF11 treatment affects heart muscle equally in old and young mice. Third, these are single-dose studies, although, technically speaking, they fall under the jurisdiction of pharmacology, which would demand a dose-effect analysis. As a consequence, discrepant results are attributed to the fact that one lab is apparently testing higher doses than the other, and that the “therapeutic window” for the desired effect seems to be too narrow (so that only the “lucky” lab is working with the right dose and therefore reporting positive effects).
Besides illustrating the importance of proper validation of research tools and investment into optimal study design, this case study supports the need to take all statements in the paper by Ioannidis (2005) seriously: it is of crucial importance to weigh the pre-study odds against the scientific excitement about the obtained study results.
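The weight of the pre-study odds can be made concrete with the positive predictive value from Ioannidis (2005): PPV = R(1 − β) / (R(1 − β) + α), where R is the pre-study odds of a true relationship, 1 − β the power and α the significance level. A short sketch with illustrative (guessed, not estimated) values of R:

```python
def ppv(pre_study_odds: float, power: float, alpha: float = 0.05) -> float:
    """Positive predictive value (Ioannidis 2005): the fraction of
    'significant' findings that reflect true relationships."""
    true_hits = pre_study_odds * power
    return true_hits / (true_hits + alpha)

# Well-motivated hypothesis (1:4 pre-study odds), adequately powered study
print(round(ppv(0.25, 0.80), 2))   # 0.8
# Long-shot hypothesis (1:100 odds) tested at 16% power: most
# 'positive' findings are then expected to be false
print(round(ppv(0.01, 0.16), 2))   # ~0.03
```

The point of the sketch is the asymmetry: when the pre-study odds are low, as for a single protein reversing aging, even a ‘significant’ result leaves the finding more likely false than true.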

Presented data do not always support conclusions made in a paper

We will continue presenting Case Study publications that have received a lot of interest among the scientific community (and sometimes even in mass media) and are particularly interesting from the GSP perspective. We hope that these cases will be useful in training programs and will help younger scientists to learn about the basics of study design, planning and analysis.
This month we chose: vom Berg et al. (2012) Inhibition of IL-12/IL-23 signaling reduces Alzheimer’s disease–like pathology and cognitive decline. Nature Med 18 (12): 1812-9.
There are many aspects of this report that are worth discussing. However, the most interesting part is the behavioural data. According to the conclusion stated in the article, intracerebroventricular delivery of a neutralizing p40-specific antibody (p40 is the common IL-12 and IL-23 subunit) reversed cognitive deficits in aged APPPS1 mice (“Alzheimer” mouse). This conclusion is based on the experiments using three cognitive tasks (discussed on pp. 1816-7): the contextual fear conditioning paradigm, novel object recognition and the Barnes maze. So, how strong is the evidence?
Contextual fear conditioning: Data is not shown – apparently because ‘…performance in the contextual fear conditioning test did not differ between p40-antibody-treated and isotype-treated APPPS1 mice’. In other words, no effects of treatment.
Novel object recognition: Data is presented in Fig. 5a, but it is difficult to see support either for the ‘deficits shown by APPPS1 mice treated with isotype control antibodies’ (vs. corresponding WT mice) or for the normalization by ‘p40-antibody treatment to the levels of age-matched WT mice’. While the study design is 2×2, results of a one-way ANOVA are presented, with no sign that the pairwise group comparisons needed to support the conclusions were conducted (Dunnett’s test is mentioned in the legend of Fig. 5a but, in any case, deals with comparing all groups to one). So, again, no effect of treatment is demonstrated.
Barnes maze: Data is presented in Fig. 5b, and it is claimed that ‘the significant deficit in short-term memory retention in APPPS1 mice in the Barnes maze test was substantially ameliorated by icv treatment with p40-specific antibodies’. Here the situation is reversed: results of pairwise group comparisons are indicated, but no ANOVA analysis. The conclusion on treatment effects thus rests on nothing but “common sense” (i.e. ‘the enemy of my enemy is my friend’). In sum, no clear evidence is presented that would support the claims made.
In addition, Supplementary Figure 9 may be worth a look as well: here, CNS p40 concentrations were analyzed in human samples to see whether the previous findings in mice could be translated to humans with Alzheimer’s disease. For this purpose, p40 protein concentrations in cerebrospinal fluid specimens from patients with Alzheimer’s disease (n = 39) were compared to those from individuals without the disease (n = 20). ‘A significant (p < 0.05) linear correlation of cognitive performance assessed by the mini-mental score evaluation (MMSE) with CSF p40 concentrations’ was found; however, instead of the expected 59 data points, Figure S9 shows only n = 6 (control) and n = 7 (AD) measurements, indicating that most individuals could not be included in this analysis – their p40 protein concentrations were simply below the detection limit.
Nevertheless, we would like to emphasize that we are discussing this paper solely from the data analysis and presentation point of view; by no means do we aim to challenge the scientific value of this paper, which has been well cited and has triggered good-quality follow-up research.

Sperm RNA carries marks of trauma

We will use this Case Study section to draw your attention to publications that have received a lot of attention in the scientific community and mass media and are particularly interesting from the GSP perspective.
Since 2011, there have been reports that sperm RNA can ‘remember’ traumatic experiences (Gapp et al., Nature Neuroscience 2014, 17, 667–669). Can environmental factors involving traumatic events and chronic stress in early life indeed be responsible for severe psychiatric disorders not only in exposed individuals but also in their progeny?
There is a very entertaining and equally instructive discussion of these findings, which can be found here: LINK