To examine associations between urinary chemical concentrations and adult health status, the US Centers for Disease Control and Prevention (CDC) analyzed urine samples from 1455 study participants (representing the general US adult population) for the presence of 275 environmental chemicals, including bisphenol A (BPA).
BPA is used in certain packaging materials such as polycarbonates for baby food bottles. It is also used in epoxy resins for internal protective linings for canned food and metal lids. However, evidence of adverse effects in animals has generated concerns over low-level chronic exposures in humans.
As part of the 2003-2004 National Health and Nutrition Examination Survey (NHANES), the same participants were also questioned about 32 different clinical outcomes. An analysis of these NHANES data found that higher urinary concentrations of BPA were associated with an increased prevalence of cardiovascular disease, diabetes, and liver-enzyme abnormalities (Lang et al., JAMA 2008; 300, 1303-10).
Importantly, however, the potential for false positives is substantial when the complete CDC study design is examined: across the full data set, there are 32 × 275 = 8800 possible outcome-chemical questions at issue. In addition, ten demographic variables (such as ethnicity, education and income) were analyzed. With 32 possible health outcomes, each potentially associated with any of the 275 chemicals, combined with the demographic variables and different strategies for covariate adjustment, there could be approximately 9 million statistical models and endpoints available to analyze the data (Young and Yu, JAMA 2009; 301, 720-721).

Given that the publication by Lang et al. focused only on one chemical and 16 health conditions, it is important to understand how many questions were at issue before conducting the study. With this huge search space and all possible modeling variations in the CDC study design, there is a real possibility that the findings reported by the authors could well be the result of chance rather than representing real health concerns. When many questions are asked of the same data, some of those questions will by chance come up as false positives – a consequence known as the multiple testing problem.

The probability P of rejecting at least one true null hypothesis for the case when all tests are independent of each other can be calculated as follows:
P = 1 – (1 – α)^n
(1 – α): probability of not rejecting a true null hypothesis in a single test
(1 – α)^n: probability of not rejecting any of n true null hypotheses (with n tests in total)

If the conventional significance level of α = 0.05 is used for n = 20 tests, then there is a probability of around 64% that at least one true null hypothesis is rejected.
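
The formula can be checked numerically. The following Python sketch assumes independent tests, as the formula does; the function name familywise_error_rate is chosen here for illustration only.

```python
import math


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n independent tests
    when every null hypothesis is true: P = 1 - (1 - alpha)^n."""
    # expm1/log1p keep the result accurate when alpha is very small
    return -math.expm1(n_tests * math.log1p(-alpha))


# Tabulate P for a few values of n at the conventional alpha = 0.05
for n in (1, 5, 20, 100):
    print(f"n = {n:>3}: P = {familywise_error_rate(0.05, n):.3f}")
```

For n = 20 this gives about 0.64, in line with the figure quoted above, and for n = 100 the probability exceeds 0.99.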

In ‘omics studies such as genome-wide association studies (GWAS), it is not unusual to investigate 10,000 genes or more simultaneously. With α = 0.05 and no correction for multiple testing (e.g. Bonferroni, step-down, or step-up procedures), about 500 of 10,000 genes (10,000 × 0.05) would be expected to appear significant even if none truly differed from the control group.
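
As a rough illustration of this arithmetic, the Python sketch below simulates 10,000 tests in which no real effect exists, so that each p-value is uniformly distributed between 0 and 1. This is an idealising assumption, as are the independence of the tests and the arbitrary random seed. It then counts the nominally significant results without correction and with two of the corrections named above: Bonferroni and Holm's step-down procedure. All function names are illustrative.

```python
import random


def count_uncorrected(p_values, alpha=0.05):
    """Number of p-values at or below the unadjusted threshold alpha."""
    return sum(p <= alpha for p in p_values)


def count_bonferroni(p_values, alpha=0.05):
    """Bonferroni: compare every p-value with alpha / n."""
    threshold = alpha / len(p_values)
    return sum(p <= threshold for p in p_values)


def count_holm(p_values, alpha=0.05):
    """Holm step-down: walk through the sorted p-values and compare the
    k-th smallest (k = 0, 1, ...) with alpha / (n - k); stop at the
    first p-value that fails."""
    n = len(p_values)
    rejected = 0
    for rank, p in enumerate(sorted(p_values)):
        if p > alpha / (n - rank):
            break
        rejected += 1
    return rejected


rng = random.Random(2011)  # fixed seed so the run is reproducible
null_p_values = [rng.random() for _ in range(10_000)]  # all nulls true

print("uncorrected:", count_uncorrected(null_p_values))
print("Bonferroni: ", count_bonferroni(null_p_values))
print("Holm:       ", count_holm(null_p_values))
```

The uncorrected count should land near 500, while both corrected counts should be zero or very close to it, since every "discovery" in this simulated data set is a false positive by construction.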
Over the last 10 years, however, the field of human genome epidemiology has responded to these issues, and a major shift has occurred in key parameters that influence the occurrence of false-positive findings (Ioannidis et al., Epidemiology 2011, Vol 22, p 450-456):

  • Far more stringent criteria are required for ‘significant’ discoveries, with typical Type I error rates of α = 5 × 10^-8 instead of 0.05.
  • Establishment of multi-team consortia for sharing data and performing GWA genotyping. These collaborations undertake systematic replication efforts for large sets of previously proposed nominally statistically significant candidate-gene associations.
  • Appraisal of all results, negative and positive, without selective reporting (i.e. avoiding publication bias as well as selective outcome and analysis reporting bias).
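
The stringent threshold in the first point can be related to the formula for P given earlier. A commonly cited rationale, which is an assumption here and is not stated in the text above, is that the genome-wide threshold corresponds to a Bonferroni correction of α = 0.05 for roughly one million independent tests. A minimal Python sketch under that assumption:

```python
import math

NOMINAL_ALPHA = 0.05
N_TESTS = 1_000_000  # assumed number of independent tests (illustrative)

# Bonferroni: divide the nominal alpha by the number of tests
per_test_alpha = NOMINAL_ALPHA / N_TESTS


def prob_any_false_positive(alpha: float, n_tests: int) -> float:
    """P = 1 - (1 - alpha)^n for n independent tests, all nulls true."""
    return -math.expm1(n_tests * math.log1p(-alpha))


print(f"per-test threshold:        {per_test_alpha:.1e}")
print(f"P at corrected threshold:  "
      f"{prob_any_false_positive(per_test_alpha, N_TESTS):.4f}")
print(f"P at uncorrected 0.05:     "
      f"{prob_any_false_positive(NOMINAL_ALPHA, N_TESTS):.4f}")
```

Under these assumptions the corrected threshold keeps the probability of any false positive just below 0.05 across all tests, whereas at an uncorrected 0.05 at least one false positive is practically certain.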

With these new quality criteria in place, recent GWAS-based attempts to replicate previously reported gene associations found that only 13 of the 1151 gene locus-phenotype associations tested survived replication, a replication rate of approximately 1.2% (Ioannidis et al., Epidemiology 2011, Vol 22, p 450-456).

This shows that it is sometimes necessary to step back, consider the issues raised by the design of large studies and data sets, and develop a pre-specified statistical analysis strategy that takes into account the large number of questions at issue.