Good statistical design is a key aspect of meaningful research. Elements such as data robustness, randomization and blinding are widely recognized as essential to producing valid results and reducing biased assessment. Although these practices are standard in *in vivo* animal studies and clinical trials, why are they so often overlooked in *in vitro* experiments?

In this thread we would like to stimulate a discussion about the importance of this issue, the various designs available for typical *in vitro* studies, and the need to carefully consider what is ‘n’ in cell culture experiments.

Let’s consider **pseudoreplication**, as it is a serious error of experimental planning and analysis that has received little attention in the context of *in vitro* research.

The term pseudoreplication was defined by Hurlbert more than 30 years ago as “*the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent*” (Hurlbert SH, Ecol Monogr. 1984, 54: 187-211). In other words, the exaggeration of the statistical significance of a set of measurements because they are treated as independent observations when they are not.

Importantly, the **independence** of observations or samples is (in the vast majority of cases) an essential requirement on which most statistical methods rely. Analyzing pseudoreplicated observations ultimately yields confidence intervals that are too narrow and p-values that are too small, because the underlying experimental variability is underestimated and the degrees of freedom (the number of independent observations) are overstated. Thus, statistical significance can be greatly inflated, leading to a **higher probability of Type I error** (falsely rejecting a true null hypothesis).
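To see how severe this inflation can be, here is a minimal simulation (a sketch in Python; the scenario, function name and variance parameters are all made up for illustration). There is no true treatment effect, yet three technical replicates per culture dish are wrongly analyzed as three independent observations per condition:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def pseudoreplicated_type1_rate(n_sims=2000, dish_sd=1.0, tech_sd=0.2):
    """Fraction of 'significant' t-tests under a TRUE null hypothesis when
    three technical replicates per dish are treated as independent."""
    false_positives = 0
    for _ in range(n_sims):
        # One control dish and one treated dish; no real treatment effect.
        # Each dish contributes one shared dish-level error (dish_sd) plus
        # small measurement noise on each of its three technical replicates.
        control = rng.normal(0, dish_sd) + rng.normal(0, tech_sd, size=3)
        treated = rng.normal(0, dish_sd) + rng.normal(0, tech_sd, size=3)
        # WRONG: the three counts per dish are not independent observations
        if stats.ttest_ind(control, treated).pvalue < 0.05:
            false_positives += 1
    return false_positives / n_sims

print(pseudoreplicated_type1_rate())  # far above the nominal 5%
```

Because the t-test only sees the small replicate-to-replicate noise, almost any chance difference between the two dishes looks “significant”, even though the null hypothesis is true in every simulated experiment.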

To add to the confusion, the word ‘**replication**’ is often used in the literature to describe technical replicates or repeated measurements on the same sample unit, but can also be used to describe a true biological replicate, which is characterized as “*the smallest experimental unit to which a treatment is independently applied*” (Heffner et al., Ecology, 1996, 77 (8) 2558-2562).

To understand pseudoreplication-related issues, it is therefore crucial to carefully define the term **biological replicate** (= data robustness) in this context and to distinguish it from a **technical replicate** (= pseudoreplicate). The critical difference here (as proposed by M. Clemens: http://www.cgdev.org/publication/meaning-failed-replications-review-and-proposal-working-paper-399) is whether or not the follow-up test should give, in expectation, exactly the **same quantitative result** as the original study. A technical replicate re-analyses the same underlying material as the original study, whereas a biological replicate estimates parameters drawn from different samples. Following this definition, technical replication does not introduce independence into the experimental system and can mainly be used to measure errors in sample handling, since the new findings should be quantitatively identical to the old results. In contrast, robustness tests represent true biological ‘replicates’ because independent raw materials (animals, cells, etc.) are used, and they therefore need not give the same results as obtained before. Only a robustness test can analyze whether a system operates correctly while its variables or conditions are exchanged.

In the following experiment, cells from a common stock are split into two culture dishes and either left untreated (control) or stimulated with a growth factor of interest. The number of cells per dish is then used as the main readout to examine the effect of the treatment. The process of data acquisition will have a decisive impact on the quality and reliability of the final result. Here are different options for conducting this experiment:

- After a certain period of time, three different cover slides are prepared from each dish to count the cells, resulting in six values (three per condition).

Although there were two culture dishes and six glass slides, the correct **sample size here is n=1**, as the variability among cell counts reflects technical errors only, and the three values for each treatment condition do not represent robustness tests (= biological replicates) but technical replicates.

- A slightly better approach is to perform the same experiment on three different days, counting the cells only once per condition each day.

*(modified from http://labstats.net/articles/cell_culture_n.html)*

This approach gives the same number of final values (six), yet independence is introduced (in the form of time) by repeating the experiment on three separate occasions, resulting in a **sample size of n = 3**. Here, the two values from the same day should be treated as paired observations, and a paired-samples t-test could be used for statistical evaluation.
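As a sketch of that analysis (the cell counts below are purely hypothetical numbers, not data from the source), the six values can be evaluated by pairing the two conditions by day:

```python
from scipy import stats

# Hypothetical cell counts: one value per condition per day.
# The day is the independent experimental unit, so the pairs give n = 3.
control = [102, 118, 95]   # untreated, days 1-3
treated = [134, 151, 121]  # growth-factor stimulated, days 1-3

# Pairing by day removes the shared day-to-day variability from the test
result = stats.ttest_rel(treated, control)
print(result.statistic, result.pvalue)
```

Pairing is what makes the three days count as n = 3: each day contributes one independent difference between treated and control, and the test is performed on those differences.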

- To further increase confidence in the obtained results, the three single experiments should be performed as independently as possible, meaning that cell culture media should be freshly prepared for each experiment and that different frozen cell stocks, growth factor batches, etc. should be used.

It is reasonable to assume that most scientists who have performed *in vitro* cell-based assays will have considered and applied these precautions. But now we must ask ourselves: do those measurements actually constitute real robustness tests? When working with cell-based assays, it is important to consider that, even if a new frozen cell stock was used for each replicate, ultimately **all cells originated from the same starting material**, so no true biological replicates can be achieved.

This problem can only be solved by generating several independent cell lines from several different human/animal tissue or blood samples, which demonstrates that reality often places **constraints** on what is statistically optimal.

The key questions, thus, are: ‘How **feasible** is it to obtain true biological replicates and to satisfy all statistical criteria?’ or ‘How much pseudoreplication is still acceptable?’

We all know that cost and time considerations, as well as the availability of biological sample material, are important; quite frequently these factors force scientists to make compromises regarding study design and statistical analysis. Nevertheless, as many medical advances are based on preclinical basic *in vitro* research, it is critical to conduct, analyze and report preclinical studies as rigorously as possible. As a minimum requirement, when reporting a study, the experimental design, data collection and statistical analysis should be described in **sufficient detail**, including a clear definition and understanding of the smallest experimental unit with respect to its independence. Scientists should also be **open about the limitations** of a research study, and it should be possible to present and publish a study as preliminary or exploratory (reporting ‘pseudo-confidence intervals’ instead of ‘true’ confidence intervals where over-interpretation of results must be avoided) or to combine results with others to obtain more informative data sets.

As mentioned above, even if samples are easy to get or inexpensive, it can be dangerous to inflate the sample size by simply increasing the number of technical replicates, which may lead to spurious statistical significance. Ultimately, only a higher number of true biological replicates will increase the power of the analysis and result in quality research.
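A small simulation can illustrate this point (again a sketch with assumed variance parameters, not a definitive analysis). Here the data are analyzed *correctly*, averaging the technical replicates within each independent experiment first; adding technical replicates then barely changes the power, while adding biological replicates clearly increases it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n_bio, n_tech, effect=1.0, bio_sd=1.0, tech_sd=0.3, n_sims=1000):
    """Power of the CORRECT analysis: technical replicates are averaged
    within each experiment, giving n_bio independent values per condition."""
    hits = 0
    for _ in range(n_sims):
        # each list entry = one independent experiment (its own bio-level
        # error) summarized by the mean of its n_tech technical replicates
        ctrl = [rng.normal(0, bio_sd) + rng.normal(0, tech_sd, n_tech).mean()
                for _ in range(n_bio)]
        trt = [rng.normal(effect, bio_sd) + rng.normal(0, tech_sd, n_tech).mean()
               for _ in range(n_bio)]
        if stats.ttest_ind(trt, ctrl).pvalue < 0.05:
            hits += 1
    return hits / n_sims

# Tenfold more technical replicates: power essentially unchanged.
# Twice as many biological replicates: power clearly increases.
print(power(3, 3), power(3, 30), power(6, 3))
```

The intuition is that technical replicates only shrink the (already small) measurement noise, while the dominant experiment-to-experiment variability, and the degrees of freedom of the test, depend solely on the number of biological replicates.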

In this context, and to understand the extent of the problem, it would be quite informative to perform a detailed **meta-analysis** of articles on *in vitro* research studies to get an idea of the ratio of biological to technical (and unknown) replicates used to support the scientific conclusions!