Error bars can convey misleading information

by Martin C. Michel

The most common type of graphical data reporting is the bar graph depicting means with SEM error bars. Based on simulated data, Weissgerber et al. have argued convincingly that bar graphs are not showing but rather hiding data, as various patterns of underlying data can lead to the same mean value (Weissgerber et al., 2015). Thus, an apparent inter-group difference can reflect symmetric variability in both groups, as most readers would assume, but it can also be driven by outliers, by a bimodal distribution within each group, or by unequal sample sizes across groups. Each option may reach statistical significance, but the story behind the data may differ considerably. Weissgerber et al. have also shown that the choice of how variability is depicted affects, at least psychologically, how we perceive data. The SEM (SD divided by the square root of n) yields the smallest error bar, and a small error bar can make even a small group difference look large, even if the overlap between the two groups is considerable.

To look into this further, I have gone back to previously published real data from my lab (Frazier et al., 2006). That study explored possible differences between young and old rats in the relaxation of the urinary bladder by several β-adrenoceptor agonists. At the time, not knowing any better, we reported means with SEM error bars. In the figure below, I show a bar graph based on means with SEM error bars, as the data had been presented in the paper, along with other types of data representation. Looking at this panel only, it appears that there may be a fairly large difference between the young and old rats, i.e. old rats exhibiting only about half of the maximum relaxation. But if we look at the scatter plot, two problems with this interpretation appear. First, there was one rat among the old rats in which noradrenaline caused hardly any relaxation. It does not look like a major outlier but clearly had an impact on the overall mean. Second, there is considerable overlap in the noradrenaline effects between the two age groups: only 5 out of 9 measurements in old rats yielded values smaller than the lowest value in the young rats.

Thus, these real data confirm that means may hide existing variability in the data and suggest a certainty of conclusions that may not be warranted. As proposed by Weissgerber et al., the scatter plot conveys the real data much better than the bar graph and gives readers the chance to interpret the data as they are. Unless there is a very large number of data points, the scatter plot is clearly superior to the bar graph.
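To make the contrast concrete, here is a minimal plotting sketch that places a mean-with-SEM bar graph next to a scatter plot of the same numbers; matplotlib is assumed to be available, and all values are invented, not the published rat data.

```python
# A minimal sketch contrasting the two presentations; all values are invented.
import math
import statistics
import matplotlib.pyplot as plt

young = [85, 92, 78, 88, 95, 81, 90, 86]
old = [45, 88, 15, 72, 60, 80, 38, 55, 66]   # note the overlap with "young"

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 4))

# Left panel: the conventional bar graph with SEM error bars.
means = [statistics.mean(g) for g in (young, old)]
sems = [statistics.stdev(g) / math.sqrt(len(g)) for g in (young, old)]
ax_bar.bar(["young", "old"], means, yerr=sems, capsize=5)
ax_bar.set_ylabel("% relaxation")

# Right panel: a scatter plot showing every individual value.
ax_scatter.scatter([0] * len(young), young, label="young")
ax_scatter.scatter([1] * len(old), old, label="old")
ax_scatter.set_xticks([0, 1])
ax_scatter.set_xticklabels(["young", "old"])
ax_scatter.set_ylabel("% relaxation")

plt.tight_layout()
plt.show()
```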

However, when data are not shown in a figure but in the main text, not all data points can be presented and a summarizing number is required. If one looks at the four bar graphs (each showing the same data, only with a different type of error bar), they convey different messages. The graph with SEM error bars makes it look as if the difference between the two groups is quite robust, as the group difference is more than three times the size of the error bar. However, we have seen from the scatter plot that this is not what the data really say. The SD error bars, on the other hand, are by definition larger. For an approximately Gaussian distribution, about 95% of all data fall within two SDs of the mean. Looking at the SD error bars, it is quite clear that the two groups overlap. This is what the raw data say, but it is not the impression created by the SEM error bars.

There also is a conceptual difference between SD and SEM error bars: the SD describes the variability within the sample, whereas the SEM describes the precision with which the group mean has been estimated. An alternative way of presenting the precision of the parameter estimate is the 95% confidence interval. In this specific case, it conveys a similar message as the SD error bar, i.e. the two populations may differ but probably overlap. Of note, SEM and SD are only meaningful if the samples come from a population with a Gaussian distribution (or at least close to it). In biology, this often is not the case, or we at least do not have sufficient information for an informed decision. In such cases, it involves fewer assumptions to report medians. To express the variability of data depicted as medians, the interquartile range is a useful indicator. In this example, it conveys a similar message as the SD or confidence interval error bars.
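For illustration, the short sketch below computes these summary measures (SD, SEM, a t-based 95% confidence interval, and the median with interquartile range) for an invented sample; the values are hypothetical, not the data from Frazier et al. (2006).

```python
# A minimal sketch of the summary measures discussed above, for an invented sample.
import math
import statistics

values = [62, 58, 71, 15, 49, 55, 66, 44, 60]     # e.g. % relaxation, n = 9

mean = statistics.mean(values)
sd = statistics.stdev(values)                     # variability within the sample
sem = sd / math.sqrt(len(values))                 # precision of the mean estimate
ci95 = 2.306 * sem                                # t-value for df = 8, two-sided 95%
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4)     # interquartile range runs q1 to q3

print(f"mean {mean:.1f}, SD {sd:.1f}, SEM {sem:.1f}, 95% CI +/- {ci95:.1f}")
print(f"median {median:.1f}, IQR {q1:.1f}-{q3:.1f}")
```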

In summary, very different sets of data points may lead to similar bar graphs, and a different biology may be hiding behind the bars in each case. Therefore, the scatter plot (where possible) is clearly the preferred option for showing quantitative data. If means with error bars have to be shown, e.g. within the main text, the SD is the error bar of choice to depict variability and the confidence interval to depict the precision of the parameter estimate. For data from populations with a non-Gaussian distribution, medians with interquartile ranges are the preferred option when scatter plots are not possible.

References

Frazier EP, Schneider T, Michel MC (2006) Effects of gender, age and hypertension on β-adrenergic receptor function in rat urinary bladder. Naunyn-Schmiedeberg’s Arch Pharmacol 373: 300-309

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol 13: e1002128

What does the Net Present Value have to do with preclinical drug discovery research?

When looking for investors in the life science sector, entrepreneurial scientists and start-up companies have to deal with an unavoidable question: ‘What is actually the appropriate valuation of my idea or business?’ Venture capitalists may hesitate to invest in biotechnology if bioentrepreneurs fail to provide or accept realistic estimates of the value of their technologies. One of the underlying reasons is that there is often little intuition about what biotech companies are worth, and the numbers can sometimes seem very arbitrary. Furthermore, owing to the complexity and specificity of scientific knowledge, it can be challenging and time-consuming to evaluate the technological and scientific risks associated with an early-stage biotech company.

Traditionally, the typical instruments applied for valuation in the biomedical/biotech area are based on the Discounted Cash Flow (DCF) analysis and the Net Present Value (NPV) model. These approaches require revenue and growth projections as well as projections of potential market share. In addition, the net price of the future drug, the costs per clinical trial and market access are further parameters normally considered. In these calculations, the assumptions regarding price, peak market share and accessible market have the greatest impact on the venture valuation. By following this type of analysis, investors usually focus more on parameters relevant for the commercial phase of a product than on the R&D phase.

Importantly, additional risk adjustments can be applied to the NPV calculation by modifying future cash flows based on the probability of a drug progressing from one development stage to the next, resulting in a risk-adjusted NPV (rNPV). However, the reference data for determining these attrition risks are usually calculated from historical information on the success rate in each development phase for products of a similar category (e.g. type of disease) – without taking into account pre-clinical data quality and integrity.
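As a rough illustration of the principle, the sketch below risk-adjusts a stream of hypothetical cash flows with assumed survival probabilities; every figure (costs, revenues, probabilities, discount rate) is invented for the example, not an industry benchmark.

```python
# A minimal sketch of a risk-adjusted NPV; all figures are illustrative assumptions.

def rnpv(cash_flows, survival_probabilities, discount_rate):
    """Discount each year's probability-weighted cash flow to the present."""
    return sum(
        p * cf / (1 + discount_rate) ** year
        for year, (cf, p) in enumerate(zip(cash_flows, survival_probabilities), start=1)
    )

# Hypothetical project: development costs first, commercial revenues later
# (all in US$ million).
cash_flows = [-10, -25, -40, -60, 150, 200, 220]

# Cumulative probability that the project is still alive in each year,
# derived from assumed phase-transition success rates.
survival_probabilities = [1.0, 0.65, 0.40, 0.25, 0.15, 0.15, 0.15]

print(f"rNPV: {rnpv(cash_flows, survival_probabilities, 0.12):.1f} US$ million")
```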

At least for early-stage companies (i.e. before clinical Proof-of-Concept), this is quite surprising and risky, as only robust and high-quality pre-clinical data can build a solid foundation for the future success of drug R&D projects. Several steps within the R&D process are covered by GxP-based quality procedures (e.g. GLP, GCP, GMP) that aim to protect the integrity of research. However, these standards cannot be applied to the basic and pre-clinical areas of drug discovery, and, consequently, biotech companies can differ widely regarding the quality of the data sets they generate. Thus, the quality of pre-clinical research data is crucial and should be taken into account if the venture valuation is done before Proof-of-Concept in Phase II has been delivered. It is therefore highly recommended to analyze the likelihood that a given set of preclinical data is robust enough to support a successful clinical drug development project.

Uncertainties regarding data quality and robustness can be reflected by superimposing Monte Carlo (MC) simulations on the rNPV calculations, which returns a range of possible outcomes and, importantly, the probability of their occurrence – rather than providing only a single return-on-investment figure, as the rNPV does. In reality, only a small minority of drug development projects have positive cash flows (i.e. when a project reaches beyond the pre-registration phase), and most scenarios in fact have a negative rNPV (i.e. when one of the clinical trials yields a negative result). In contrast to the standard rNPV, more advanced models (e.g. risk-profiled MC valuations) indeed place the focus on clinical phase I/II failures as the most probable outcome. Hence, the costs and lengths of phase I/II trials become the most critical parameters with the highest impact on the valuation.

Furthermore, for projects where data robustness and the probability of reproducing the preclinical data are low, most of the rNPV distribution will shift towards a negative mean, providing a more accurate view of the risks involved in pharmaceutical R&D.

Monte Carlo (MC) simulation: The MC calculation is usually repeated hundreds of times, using different input values for each parameter. The rNPV (in US$K) is plotted against the probability for each rNPV value.
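A minimal sketch of such a simulation layer is shown below; the phase costs, success probabilities and sales figures are invented, and the point is only to show how a distribution of outcomes, rather than a single figure, emerges.

```python
# A minimal sketch of a Monte Carlo layer on top of an rNPV calculation.
# All parameter values are invented for illustration.
import random
import statistics

def npv(cash_flows, r):
    return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows, start=1))

def one_run(discount_rate=0.12):
    """One simulated project: pay phase costs until a failure occurs;
    only surviving projects earn the (uncertain) commercial cash flows."""
    phases = [(-10, 0.65), (-25, 0.40), (-60, 0.60)]   # cost (US$ million), P(success)
    cash_flows = []
    for cost, p_success in phases:
        cash_flows.append(cost)
        if random.random() > p_success:                # trial fails -> project stops
            return npv(cash_flows, discount_rate)
    peak = random.gauss(200, 50)                       # uncertain peak sales
    cash_flows += [0.5 * peak, peak, peak]
    return npv(cash_flows, discount_rate)

runs = [one_run() for _ in range(10_000)]
print(f"mean rNPV: {statistics.mean(runs):.1f} US$ million")
print(f"scenarios with negative NPV: {sum(v < 0 for v in runs) / len(runs):.0%}")
```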

Given the importance of preclinical data for the outcome of all subsequent clinical trials, only a thorough evaluation of the quality, robustness and integrity of all pre-clinical studies, ideally via a third-party assessment, can complete a due diligence process; it should therefore be a critical and valuable part of the decision-making procedure in modern portfolio management.

Number of citations as a measure of research quality – a dangerous approach

In the research article ‘How Many Is Too Many? On the Relationship between Research Productivity and Impact’, published in PLOS ONE (Larivière V, Costas R (2016) PLoS ONE 11(9): e0162709), V. Larivière and R. Costas analysed the publication and citation records of more than 28 million researchers who published at least one paper between 1980 and 2013. Using this database, the authors tried to understand the relationship between research productivity and scientific impact. They addressed the question of whether incentives for scientists to publish as many papers as possible lead to higher-quality work – or just to more publications. They found that, in general, an increasing number of scientific articles per author did not yield lower shares of highly cited publications, or, as Larivière and Costas put it: ‘the higher the number of papers a researcher publishes, the higher the proportion of these papers are amongst the most cited’.

There are two reasons why we find this paper very interesting and worth reading:

On the one hand, here at PAASP, we are very much interested in the reverse relationship – whether quality has an impact on productivity. Indeed, some colleagues are worried that introducing and maintaining higher quality standards in research could have a negative impact on the number of papers published, on the possibility of publishing in high-impact-factor journals, or on the duration of student projects (e.g. for PhD students).

On the other hand, this paper reminds us that using citation numbers as an index of quality is a dangerous approach. For example, we have used data generated by the Reproducibility Project Psychology (Open Science Framework, https://osf.io/ezum7/) to plot citations for papers whose findings were replicated versus papers whose findings were not replicated (an Excel table with the raw data is available upon request).

As the graph below illustrates, it does not matter how often the senior or first authors have been cited during their careers or how many times a particular paper has been cited: there are no differences between publications whose findings could or could not be replicated!

Reproducibility Project Psychology

The False Discovery Rate (FDR) – an important statistical concept

The p-value reports the probability of seeing a difference as large as the observed one, or larger, if the two samples came from populations with the same mean value. However, and in contrast to a common perception, the p-value does not determine the probability that an observed finding is true!

When conducting multiple comparisons (e.g. thousands of hypothesis tests are often conducted simultaneously when analyzing results from genome-wide studies), there is an increased probability of false positives. While there are a number of approaches to overcome problems due to multiple testing, most of them attempt to reduce the p-value threshold from 5% to a more stringent value.

In 1995, Benjamini and Hochberg introduced the concept of the False Discovery Rate (FDR) as a way to allow inference when many tests are being conducted. The FDR is the ratio of the number of false positive results to the total number of positive test results. Whereas a p-value threshold of 0.05 implies that 5% of all truly null tests will give (false) positive results, an FDR-adjusted p-value (also called q-value) of 0.05 indicates that 5% of the tests called significant will be false positives. In other words, an FDR of 5% means that, among all results called significant, only 5% are truly null.
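For readers who want to see the mechanics, here is a minimal sketch of the Benjamini-Hochberg procedure with made-up p-values (not data from any particular study).

```python
# A minimal sketch of the Benjamini-Hochberg procedure; the p-values are invented.

def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests called significant while controlling the
    expected false discovery rate at the chosen level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr; reject the k
    # smallest p-values.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            k_max = rank
    return sorted(order[:k_max])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.545]
print(benjamini_hochberg(p))   # indices of the discoveries at an FDR of 5%
```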

The importance of the FDR can be nicely demonstrated by analysing the following scientific publication:

In a study published in Nature Medicine in 2014 (Mapstone et al., 2014; Nature Medicine 20: 415-418), the authors described a biomarker panel of ten lipids from peripheral blood that predicted phenoconversion to Alzheimer’s disease within a 2-3 year timeframe. Importantly, the reported sensitivity and specificity of the proposed blood test were over 90%. In general, an accuracy of 90% is considered appropriate for a screening test in normal-risk individuals, and the reported results triggered a high degree of optimism – but is it justified?

This paper may well represent progress towards an AD blood test, but its usefulness depends on the rate of Alzheimer’s disease in the population being screened:

Given a general Alzheimer’s incidence rate of 1%, out of 10,000 people 100 will have the condition, and a test based on the described biomarker panel will reveal 90 true positive results (box at top right). However, what about the false positive results? Although 9,900 people do not have the condition, the test will show a false positive result for 990 of them (box at bottom right), which leads to a total of 1,080 positive results (990 false positives plus 90 true positives). Of these, 990/1,080 are false positives, resulting in a False Discovery Rate of 92%. That is, over 90% of positive screening results would be false!

False Discovery Rate

As a classic example of Bayes’ theorem, calculating the FDR clearly demonstrates that a test with a 90% (true positive) accuracy rate is going to misdiagnose (i.e. supply a false positive for) almost 92% of the people who test positive, if the actual disease incidence rate is 1%. These sorts of calculations are misunderstood even by people who should know better, e.g. physicians. As can be seen from the above example, a key driver of the FDR is the a priori probability of a hypothesis (in this case the known incidence of Alzheimer’s disease). If the prior probability is low, the FDR will be high for a given p-value; if the prior probability is high, the FDR tends to be lower.
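The arithmetic of the example can be condensed into a few lines; the 90% sensitivity and specificity follow the text, while the additional prevalence values are added purely to show how strongly the prior probability drives the FDR.

```python
# A minimal sketch of the worked example: the false discovery rate of a
# screening test follows from Bayes' theorem. 90% sensitivity/specificity
# come from the text; the extra prevalence values are illustrative.

def false_discovery_rate(prevalence, sensitivity, specificity):
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return false_positives / (true_positives + false_positives)

for prevalence in (0.01, 0.10, 0.50):
    fdr = false_discovery_rate(prevalence, sensitivity=0.90, specificity=0.90)
    print(f"prevalence {prevalence:.0%}: FDR = {fdr:.0%}")
# prevalence 1% reproduces the ~92% from the 10,000-person example above.
```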

Consequently, compared to the p-value, the FDR has some useful properties. Controlling the FDR is a way to identify as many significant tests as possible while incurring a relatively low proportion of false positives. Using the FDR allows scientists to decide how many false positives they are willing to accept among all the results that are called significant.

When deciding on a cut-off or threshold value, the focus should therefore be on how many false positive results the test will yield, rather than on arbitrarily picking a p-value of 0.05 and assuming that every comparison with a p-value below 0.05 is significant.

Blinding – does it really have an impact?

Zubin Mehta, conductor of the Los Angeles Symphony from 1964 to 1978 and of the New York Philharmonic from 1978 to 1990, is credited with saying, “I just don’t think women should be in an orchestra.” In 1970, the top five orchestras in the U.S. had fewer than 5% female musicians; this number gradually increased over the years, eventually reaching on average 25-30%. So, what was the source of this change?

Well, blinding seems to be one of the factors!

In the 1970s and 1980s, orchestras began using blind auditions: candidates and jury members were separated by a curtain so that they could not see each other. This blinding process was found to account for at least 30% of the increase in the proportion of female “new hires“ at major symphony orchestras in the US (see figure below, modified from Goldin & Rouse (2000) American Economic Rev 90: 715).

By the way, the first blinded auditions provided an astonishing result: men were still favoured over women!

It was later discovered that, although the screens kept juries from seeing the candidates move into position, the sound of the women’s heels as they entered the stage still unwittingly influenced the jury. Once the musicians removed their shoes, almost 50% of the women made it past the first audition.

Accurate design of in vitro experiments – why does it matter?

Good statistical design is a key aspect of meaningful research. Elements such as data robustness, randomization and blinding are widely recognized as being essential to producing valid results and reducing biased assessment. Although these practices are commonly used in in vivo animal studies and clinical trials, why do they seem to be so often overlooked in in vitro experiments?

In this thread we would like to stimulate a discussion about the importance of this issue, the various designs available for typical in vitro studies, and the need to carefully consider what is ‘n’ in cell culture experiments.

Let’s consider pseudoreplication, as it is a relatively serious error of experimental planning and analysis that hasn’t received much attention in the context of in vitro research.

The term pseudoreplication was defined by Hurlbert more than 30 years ago as “the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent” (Hurlbert SH, Ecol Monogr. 1984, 54: 187-211). In other words, pseudoreplication is the exaggeration of the statistical significance of a set of measurements because they are treated as independent observations when they are not.

Importantly, the independence of observations or samples is (in the vast majority of cases) an essential requirement on which most statistical methods rely. Analyzing pseudoreplicated observations ultimately results in confidence intervals that are too narrow and in inaccurate p-values, because the underlying experimental variability is underestimated and the degrees of freedom (the number of independent observations) are incorrect. Thus, the statistical significance can be greatly inflated, leading to a higher probability of a Type I error (falsely rejecting a true null hypothesis).
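A small simulation can make this concrete. In the sketch below (with invented parameter values, and assuming scipy is available), two dishes are drawn from the same population, so there is no true treatment effect; treating the technical replicates as independent observations nevertheless ‘detects’ a difference far more often than the nominal 5%.

```python
# A minimal simulation of pseudoreplication; all parameter values are invented.
# Two dishes come from the same population (no true treatment effect), but the
# three technical replicates per dish are analyzed as if they were independent.
import random
from scipy import stats

def one_experiment():
    dish_a = random.gauss(100, 15)    # dish-to-dish (biological) variability
    dish_b = random.gauss(100, 15)    # same population: no real effect
    reps_a = [random.gauss(dish_a, 2) for _ in range(3)]   # technical noise only
    reps_b = [random.gauss(dish_b, 2) for _ in range(3)]
    return stats.ttest_ind(reps_a, reps_b).pvalue

n_sim = 2_000
false_positive_rate = sum(one_experiment() < 0.05 for _ in range(n_sim)) / n_sim
print(f"false positive rate with pseudoreplication: {false_positive_rate:.0%}")
# Far above the nominal 5%, because the dish-to-dish variability is not
# captured by the technical replicates.
```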

To add to the confusion, the word ‘replication’ is often used in the literature to describe technical replicates or repeated measurements on the same sample unit, but can also be used to describe a true biological replicate, which is characterized as “the smallest experimental unit to which a treatment is independently applied” (link).

To understand pseudoreplication-related issues, it is therefore crucial to carefully define the term biological replicate (= data robustness) in this context and to distinguish it from a technical replicate (= pseudoreplicate). The critical difference (as proposed by M. Clemens: link) is whether or not the follow-up test should, in expectation, give exactly the same quantitative result as the original study. A technical replication re-analyses the same underlying data set as the original study, whereas a biological replicate estimates parameters drawn from different samples. Following this definition, performing pseudoreplication tests does not introduce independence into the experimental system and can mainly be applied to measure errors in sample handling, as the new findings should be quantitatively identical to the old results. In contrast, robustness tests represent true biological replicates, because independent raw materials (animals, cells, etc.) are used, and therefore they do not need to give the same results as obtained before. Only a robustness test can analyze whether a system operates correctly when its variables or conditions are exchanged.

In the following experiment, cells from a common stock are split into two culture dishes and either left untreated (control) or stimulated with a growth factor of interest. The number of cells per dish is then used as the main readout to examine the effect of the treatment. The process of data acquisition will have a decisive impact on the quality and reliability of the final result. Below are different options for how to conduct this experiment:

  1. After a certain period of time, three different cover slides are prepared from each dish to count cell numbers, resulting in six different values (three per condition).

Sample size equals one

Although there were two culture dishes and six glass slides, the correct sample size here is n=1, as the variability among cell counts reflects technical errors only, and the three values for each treatment condition do not represent robustness tests (= biological replicates) but technical replicates.

  2. A slightly better approach is to perform the same experiment on three different days, counting the cells only once per condition each day.
This experiment indeed yields an “n” equal to three.

 

This approach gives the same number of final values (six), yet independence is introduced (in the form of time) by repeating the experiment on three separate occasions, resulting in a sample size of n = 3. Here, the two glass slides from the same day should be analyzed as paired observations, and a paired-samples t-test could be used for statistical evaluation.
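As a minimal sketch of this analysis, with invented cell counts and assuming scipy is available:

```python
# One control and one treated count per day, on three independent days (n = 3).
from scipy import stats

control = [112, 98, 105]    # cells counted per slide, one value per day
treated = [150, 131, 142]   # the matching treated dish from the same day

t_stat, p_value = stats.ttest_rel(control, treated)   # paired-samples t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3f} (df = {len(control) - 1})")
```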

  3. To further increase confidence in the obtained results, the three single experiments should be performed as independently as possible, meaning that cell culture media should be prepared freshly for each experiment, and different frozen cell stocks, growth factor batches, etc. should be used.

It is reasonable to assume that most scientists who have performed in vitro cell-based assays will have gotten as far as to consider and apply these precautions. But now we must ask ourselves: do those measurements actually constitute real robustness tests? When working with cell-based assays, it is important to consider that, even if a new frozen cell stock was used for each replicate, ultimately all cells originated from the same starting material, and therefore no true biological replicates can be achieved.

This problem can only be solved by generating several independent cell lines from several different human/animal tissue or blood samples, which demonstrates that reality often places constraints on what is statistically optimal.

The key questions, thus, are: ‘How feasible is it to obtain true biological replicates and to satisfy all statistical criteria?’ or ‘How much pseudoreplication is still acceptable?’

We all know that cost and time considerations, as well as the availability of biological sample material, are important; and quite frequently these factors force scientists to make compromises regarding study design and statistical analysis. Nevertheless, as many medical advances are based on basic preclinical in vitro research, it is critical to conduct, analyze and report preclinical studies in the most rigorous way possible. As a minimum requirement, when reporting a study, the design of the experiment, the data collection and the statistical analysis should be described in sufficient detail, including a clear definition and understanding of the smallest experimental unit with respect to its independence. Scientists should also be open about the limitations of a research study, and it should be possible to consider and publish a study as preliminary or exploratory (using ‘pseudo-confidence intervals’ instead of ‘true’ confidence intervals where over-interpretation of results should be avoided) or to combine results with others to obtain more informative data sets.

As mentioned above, even if samples are easy to get or inexpensive, it can be dangerous to inflate the sample size by simply increasing the number of technical replicates, which may lead to spurious statistical significance. Ultimately, only a higher number of true biological replicates will increase the power of the analysis and result in quality research.

In this context, and to understand the extent of the problem, it would be quite informative to perform a detailed meta-analysis of published in vitro research studies to get an idea of the ratio of biological to technical (and unknown) replicates used to support scientific conclusions!