Biological vs technical replicates: Now from a data analysis perspective

We have discussed this topic several times before (HERE and HERE). There seems to be a growing understanding that, when reporting an experiment’s results, one should state clearly what experimental units (biological replicates) are included, and, when applicable, distinguish them from technical replicates.

In discussing this topic with various colleagues, it became obvious to us that there is little clarity on best analytic practices or on how technical replicates should be brought into the analysis.

We approached David L McArthur (at the UCLA Department of Neurosurgery), an expert in study design and analysis, who has been helping us and the Preclinical Data Forum on projects related to data analysis and robust analytic practices.

A representative example that we wanted to discuss includes 3 treatment groups (labeled A, B, and C) with 6 mice per group and 4 samples processed for each mouse (e.g. one blood draw per mouse separated into four vials and subjected to the same measurement procedure) – i.e. a 3X6X4 dataset.

The text below is based on Dave’s feedback.  Note that Dave is using the term “facet” as an overarching label for anything that contributes to (or fails to contribute to) interpretable coherence beyond background noise in the dataset, and the term “measurement” as a label for the observed value obtained from each sample (rather than the phrase “dependent variable”  often used elsewhere).

Dave has drafted a thought experiment supported by a simulation.  With a simple spreadsheet using only elementary function commands, it’s easy to build a toy study in the form of a flat file representing that 3X6X4 system of data, with the outcome consisting of one measurement in each line of a “tall” datafile, i.e., 72 lines of data with each line having entries for group, subject, sample, and close-but-not-quite-identical measurement (LINK). But, for our purposes, we’ll insert not just measurement A but also measurement B on each line — where we’ve constructed measurement B to differ from measurement A in its variability but otherwise to have identical group means and subject means.  (As shown in Column E, this can be done easily: take each A value, jitter it by uniform application of some multiplier, then subtract out any per-subject mean difference to obtain B.)  With no loss of meaning, in this dataset measurement A has just a little variation from one measurement to the next within a given subject, but because of that multiplier, measurement B has a lot of variation from one measurement to the next within a given subject.
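For readers who prefer code to a spreadsheet, a minimal R sketch of such a 3X6X4 tall file is shown below. This is our own illustration rather than Dave's spreadsheet or script: the group means, the within-subject noise level and the multiplier used to build measurement B are arbitrary placeholders chosen to mimic the description above.

    ## A toy 3X6X4 "tall" file: 3 groups x 6 mice x 4 samples = 72 rows.
    ## Illustration only -- group means, noise level and multiplier are arbitrary.
    set.seed(1)

    dat <- expand.grid(sample = factor(1:4),
                       mouse  = factor(1:6),
                       group  = factor(c("a", "b", "c")))
    dat$subject <- factor(paste0(dat$group, dat$mouse))   # 18 unique mouse IDs

    ## Measurement A: clear group differences, small within-subject variation
    group_mean <- c(a = 0.85, b = 1.45, c = 2.05)
    dat$A <- unname(group_mean[as.character(dat$group)]) + rnorm(nrow(dat), sd = 0.25)

    ## Measurement B: same per-subject (and hence group) means as A, but the
    ## within-subject deviations are inflated by a multiplier (here 20)
    mult      <- 20
    subj_mean <- ave(dat$A, dat$subject)          # per-subject means of A
    dat$B     <- subj_mean + mult * (dat$A - subj_mean)

    head(dat)   # columns: sample, mouse, group, subject, A, B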

A 14-term descriptive summary shows that using all values of measurement A, across groups, results in:

                     a          b          c
    N              24.0000    24.0000    24.0000
    mean            0.8500     1.4500     2.0500
    SD              0.2874     0.2874     0.2874
    robust min      0.3000     0.9000     1.5000
    min             0.3000     0.9000     1.5000
    hdQ: 0.25       0.6380     1.2380     1.8380   (25th quantile, the lower box bound of a boxplot)
    median          0.8500     1.4500     2.0500
    hdQ: 0.75       1.0620     1.6620     2.2620   (75th quantile, the upper box bound of a boxplot)
    max             1.4000     2.0000     2.6000
    robust max      1.4000     2.0000     2.6000
    skew           -0.0000    -0.0000    -0.0000
    kurtosis       -0.5908    -0.5908    -0.5908
    Huber mu        0.8500     1.4500     2.0500
    Shapiro p       0.9703     0.9703     0.9703

while using all values of measurement B, across groups, results in:

                     a          b          c
    N              24.0000    24.0000    24.0000
    mean            0.8500     1.4500     2.0500   <- identical group means
    SD              5.7131     5.7131     5.7131   <- group standard deviations about 20 times larger
    robust min     -6.9000    -6.3000    -5.7000
    min            -6.9000    -6.3000    -5.7000
    hdQ: 0.25      -4.2657    -3.6657    -3.0657
    median          0.8500     1.4500     2.0500   <- identical group medians
    hdQ: 0.75       5.9657     6.5657     7.1657
    max             8.6000     9.2000     9.8000
    robust max      8.6000     9.2000     9.8000
    skew           -0.0000    -0.0000    -0.0000   <- identical group skews
    kurtosis       -1.3908    -1.3908    -1.3908   <- greater kurtoses, no surprise
    Huber mu        0.8500     1.4500     2.0500   <- identical Huber estimates of group centers
    Shapiro p       0.0078     0.0078     0.0078   <- suspiciously low p-values for test of normality, no surprise
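A per-group summary of this kind can be approximated in R along the following lines, applied to the simulated data frame dat from the sketch above. The exact definitions behind some of Dave's terms, for example the "robust" minimum and maximum, are not spelled out in the text, so those entries are omitted and the rest should be read as an approximation, not as his code.

    ## Per-group summary in the spirit of the 14-term table above (approximate).
    ## Needs the Hmisc, MASS and e1071 packages; 'dat' comes from the sketch above.
    library(Hmisc)   # hdquantile(): Harrell-Davis quantile estimator
    library(MASS)    # huber(): Huber estimate of location
    library(e1071)   # skewness(), kurtosis()

    describe_group <- function(x) {
      hd <- hdquantile(x, probs = c(0.25, 0.75))
      c(N           = length(x),
        mean        = mean(x),
        SD          = sd(x),
        min         = min(x),
        "hdQ: 0.25" = unname(hd[1]),
        median      = median(x),
        "hdQ: 0.75" = unname(hd[2]),
        max         = max(x),
        skew        = skewness(x),
        kurtosis    = kurtosis(x),
        "Huber mu"  = huber(x)$mu,
        "Shapiro p" = shapiro.test(x)$p.value)
    }

    round(sapply(split(dat$A, dat$group), describe_group), 4)   # measurement A
    round(sapply(split(dat$B, dat$group), describe_group), 4)   # measurement B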

The left panel in the image below results from simple arithmetical averaging of that dataset’s samples from each subject, with the working dataframe reduced by averaging from 72 lines to 18 lines.  It doesn’t matter here whether we now analyze measurement A or measurement B, as both measurements inside this artificial dataset generate the identical 18-line dataframe, with means of 0.8500, 1.4500, and 2.0500 for groups A, B and C respectively.  Importantly, the sample facet disappears altogether, though we still have group, mouse, measurement and noise.  The simple ANOVA solution for the mean measures shows “very highly significant” differences between the groups.  But wait.
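In R, and using the simulated data frame dat from the sketch above, the left-panel analysis amounts to something like this (a sketch, not the original analysis):

    ## Average the four samples within each mouse, then run a one-way ANOVA
    ## on the 18 subject means (identical for A and B by construction).
    avg <- aggregate(cbind(A, B) ~ group + subject, data = dat, FUN = mean)
    nrow(avg)                            # 18 rows, one per mouse
    summary(aov(A ~ group, data = avg))  # the "simple ANOVA solution"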

The center panel uses all 72 available datapoints from measurement A.  By definition that’s in the form of a repeated-measures structure, with four non-identical samples provided by each subject.  Mixed effects modeling accounts for all 5 facets here by treating them as fixed (group and sample) or random (subject), or as the object of the equation (measurement), or as residual (noise).  The mixed effects model analysis for measurement A results in “highly significant” differences between groups, though those p-values are not the same as those in the left panel.  But wait.
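With the same simulated data, a mixed effects model of this kind can be fitted with lme4/lmerTest, for example (again a sketch rather than Dave's script):

    ## Mixed effects model for measurement A: group and sample fixed,
    ## subject random; lmerTest adds p-values to the F-tests.
    library(lmerTest)

    m_A <- lmer(A ~ group + sample + (1 | subject), data = dat)
    anova(m_A)    # F-tests for the fixed effects, including group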

The right panel uses all 72 available datapoints from measurement B.  Again, it’s a repeated-measures structure, but while the means and medians remain the same, now the standard deviations are 20 times larger than those for measurement A, a feature of the noise facet being intentionally magnified and inserted into the artificial source datafile. The mixed effects model analysis for measurement B results in “not-at-all-close-to-significant” differences between groups; no real surprise.
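Continuing with the packages and data from the sketches above, the same model applied to measurement B illustrates the point:

    ## Same model, measurement B; with the within-subject noise inflated
    ## 20-fold, the group effect should be far from significant here.
    m_B <- lmer(B ~ group + sample + (1 | subject), data = dat)
    anova(m_B)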

What does this example teach us?

Averaging technical replicates (as in the left panel) and running statistical analyses on average values means losing potentially important information.  No facet should be dropped from analysis unless one is confident that it can have absolutely no effect on analyses.  A decision to ignore a facet (any facet), drop data and go for a simpler statistical test must in any case be justified and defended.

Further recommendations that are supported by this toy example or that the readers can illustrate for themselves (with the R script LINK) are:

  • There is no reason to use the antiquated method of repeated measures ANOVA; in contrast to RM ANOVA, mixed effects modeling makes no sphericity assumption and handles missing data well (a side-by-side sketch follows this list).
  • There is no reason to use nested ANOVA in this context:  nesting is applicable in situations when one or another constraint does not allow crossing every level of one factor with every level of another factor.  In such situations with a nested layout, fewer than all levels of one factor occur within each level of the other factor.  By this definition, the toy example here includes no nesting.
  • The expanded descriptive summary can be highly instructive (and is yours to use freely). 
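As a side-by-side illustration of the first bullet above, here is the classical repeated-measures ANOVA specification in base R next to the mixed model, again run on the simulated data with lmerTest loaded as above. This is only a sketch; the point is that the univariate aov() approach relies on a complete, balanced layout and a sphericity assumption, whereas lmer() relies on neither.

    ## Classical repeated-measures ANOVA vs the mixed model (simulated data):
    ## group is between-subject, sample is within-subject.
    summary(aov(A ~ group * sample + Error(subject/sample), data = dat))
    anova(lmer(A ~ group + sample + (1 | subject), data = dat))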

And last but not least, whatever method is used for the analysis, the main message should not be lost: one should be maximally transparent about how the data were collected, what the experimental units were, what the replicates were, and what analyses were used to examine the data.

Newsletter December 2019

News
        – New publication: Be positive about negatives – Read more
        – EQIPD Stakeholder Group Meeting – Read more
        – Kick-Off Meeting: Quality Standards for In Vitro Research – Read more
        – DRAFT NIH Policy for Data Management and Sharing – Read more
        – PAASP at the REWARD/EQUATOR Conference in Berlin – Read more
Recent publications related to Good Research Practice  
        – Publication rates in animal research. Extent and characteristics of published and non-published animal studies followed up at two German university medical centres – Read more 
        – A consensus-based transparency checklist – Read more
        – Additional reads in December 2019 – Read more
Commentary I: An ounce of prevention is worth a pound of cure – Read more
Commentary II: Instead of replicating studies with problems, let’s replicate the good studies – Read more
Case study: When p-hacking is not p-hacking – Read more
FUN Section – Read more

Off-target toxicity is a common mechanism of action of cancer drugs undergoing clinical trials

Use of high-quality tools is an essential element of Good Research Practice. As with many other aspects, “high quality” is a relative term, suggesting that the tools are fit for purpose according to the best information available today.
A recent paper by Lin and colleagues illustrates that, with the advancement of technologies, good knowledge may be replaced by better knowledge:
In this study, the authors used CRISPR gene editing techniques to examine the mechanisms of ten cancer drugs that target one of six proteins, which have been reported as important for the survival of cancer cells in over 180 publications. The drugs studied have been used in at least 29 different clinical trials involving a total of over 1,000 patients, and include prominent candidates such as citarinostat and ricolinostat, which are being tested against multiple myeloma.
Contrary to the previous reports based on RNA silencing, the drugs did not actually kill cancer cells by inhibiting their target proteins: they still worked when given to cells deficient in their target. Rather, the drugs induced cell death through off-target mechanisms.
Lin et al. argue that it will be necessary to adopt more rigorous validation approaches in preclinical studies to verify the causal link between a target and a disease and to confirm that future drug candidates work as intended.

Link

Toward Good In Vitro Reporting Standards

Many areas of biomedical science have developed reporting standards and checklists to support the adequate reporting of scientific efforts, but in vitro research still has no generally accepted criteria. In this article, the authors discuss ‘focus points’ of in vitro research, ensuring that the scientific community is able to take full advantage of animal-free methods and studies and that resources spent on conducting these experiments are not wasted: a first priority of reporting standards is to ensure the completeness and transparency of the provided information (data focus). A second tier is the quality of data display, which makes the information digestible and easy to grasp, compare, and analyse further (information focus).
This article summarizes a series of initiatives geared towards improving the quality of in vitro work and its reporting – with the ultimate aim to generate Good In Vitro Reporting Standards (GIVReSt).

Link

Reproducible and Transparent Research Practices in Published Neurology Research

The objective of this study by Rauh et al. was to evaluate the nature and extent of reproducible and transparent research practices in neurology research.
To this end, the authors conducted a PubMed search to retrieve articles published in neurology journals over a five-year period from 2014 to 2018 and determined whether publications provided access to items such as study materials, raw data, analysis scripts, and protocols. In addition, they analysed whether each publication was included in a replication study or systematic review, was preregistered, had a conflict-of-interest declaration, specified funding sources, and was open access.
Based on this analysis, the authors concluded that current research in the field of neurology does not consistently provide information needed to reproduce study results and that collaborative intervention by authors, peer reviewers, journals, and funding sources is needed to mitigate this problem.

Link