Best-dose analysis – A confirmatory research case

It is often observed that substantial inter-individual differences can overshadow the effects of otherwise effective treatments.  These differences may take the form of a varying pre-treatment “baseline”, which often motivates researchers to express post-treatment results not in absolute terms but relative to that baseline. 
 
A less commonly discussed scenario concerns differences in individual sensitivity to an experimental manipulation (e.g. a drug).  In other words, inspection of the data may reveal that, while there is no overall treatment effect (left panel below), for each subject there is at least one drug dose that appears to be “effective” – i.e. the “best dose” for that subject.  If one selectively plots only the best-dose data, null results suddenly turn into something much more appealing (right panel below).

Figure 1: Schematic, exaggerated representation of the best-dose analysis. A hypothetical experiment with 8 subjects tested under each of four treatment conditions (vehicle and three doses of a drug). For each subject, each of the possible response values (2, 4, 6 or 8) is recorded under a different treatment condition.  The conventional representation (left panel) illustrates that there are no differences between treatment conditions.  For the best-dose analysis, one selects the highest response value across the drug doses for each subject (highlighted area in the left panel) and re-plots these values against the vehicle control values, as shown in the right panel.  

A number of publications report this type of analysis.  Is this a legitimate practice?  As long as the analysis is viewed as exploratory and is followed by a confirmatory experiment in which the “best doses” are prespecified and their effects confirmed, it can be a legitimate analytic technique.  Without such a complementary confirmatory effort, best-dose analysis can be highly misleading and may amount to p-hacking.
 
In the following example, we generated 400 normally distributed random numbers and randomly allocated them across 4 groups with n=100 per group.  We then repeatedly drew 8, 16, 32 or 64 values from each group and organized them in a table so that each row represented one hypothetical subject exposed to a vehicle control condition and to 3 doses of a drug.
 
Next, we ran an ordinary one-factor analysis of variance and saved the p-value. The “best dose” (i.e. the highest value) was then identified for each row/subject, and a conventional t-test comparing the control with the best dose was run (with its p-value saved).  This random sampling and analysis was repeated for 500 iterations.
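For readers who want to try this themselves, here is a minimal R sketch of the procedure. It simplifies the sampling scheme described above (fresh random values are drawn on every iteration rather than from pre-generated pools of 100), uses our own variable names, and is not the original script linked at the end of this piece:

    # simulate one dataset with no true treatment effect and analyse it both ways
    simulate_once <- function(n_subjects) {
      # one row per subject: vehicle plus three drug doses, all pure noise
      dat <- matrix(rnorm(n_subjects * 4), nrow = n_subjects,
                    dimnames = list(NULL, c("vehicle", "dose1", "dose2", "dose3")))
      # all-dose analysis: one-factor ANOVA across the four conditions
      long <- data.frame(y = as.vector(dat),
                         condition = factor(rep(colnames(dat), each = n_subjects)))
      p_anova <- summary(aov(y ~ condition, data = long))[[1]][["Pr(>F)"]][1]
      # best-dose analysis: for each subject, take the highest of the three doses
      best <- apply(dat[, c("dose1", "dose2", "dose3")], 1, max)
      p_best <- t.test(dat[, "vehicle"], best)$p.value   # conventional (unpaired) t-test
      c(p_anova = p_anova, p_best = p_best)
    }

    set.seed(1)
    res <- replicate(500, simulate_once(16))   # repeat for n = 8, 16, 32, 64
    # proportion of iterations where the best-dose t-test is "significant"
    # although the all-dose ANOVA is not
    mean(res["p_best", ] < 0.05 & res["p_anova", ] >= 0.05)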
 
As the plot below illustrates, the t-test for the best-dose analysis often yields “p<0.05” for datasets where the all-dose ANOVA does not reveal any main effect of treatment.
 
This pattern becomes even more pronounced with increasing sample size: already at n=16, the best-dose analysis finds “effects” in nearly half of the cases, and by n=64 it is rare for the best-dose analysis to fail to find an “effect” in these random numbers.
 
Is all this too obvious?  Hopefully, it is, at least for our readers.  This example, nevertheless, may be useful for those who consider applying a best-dose analysis and for those who need to illustrate the appropriate and important role of confirmatory research.

Figure 2: p-values by best-dose analysis vs the corresponding all-dose analysis for each of four selected sample sizes.  The shaded zones show where the former analysis appears “statistically significant” but the latter does not.

By Anton Bespalov (PAASP) and David McArthur (UCLA)
 
The R script can be found HERE.
 
PS After the above analysis was designed and completed, we became aware of a paper by Paul Soto and colleagues that discussed the same issue of the best-dose analysis using different tools.

Beauty is not everything

Being able to trust experimental data is critical for experimental research. This certainly also applies to Western blotting, one of the most commonly used methods in cell biology laboratories owing to its high specificity and sensitivity.
Hence there is a critical need for comprehensive reporting of Western Blotting conditions to improve the accuracy and reproducibility of data based on this technique.
To address this need, Jennifer Gilda and colleagues have developed the Western blotting minimal reporting standard (WBMRS), which covers 10 critical points: 

  1. Information regarding the primary and secondary antibodies used.
  2. Molecular mass of band of interest should be shown on the blot.
  3. The amount of total protein loaded onto the gel.
  4. Type, amount and extent of use of blocking reagent.
  5. Washing solution used, how often and for how long.
  6. Amount and incubation time of primary antibody (incl. primary antibody buffer).
  7. Amount and incubation time of secondary antibodies (incl. secondary antibody buffer).
  8. How the image was collected.
  9. The reagent used for detection and membrane incubation time.
  10. The software used for signal strength quantification (if required).

These are certainly important criteria to help scientists reproduce reported experiments, but what is missing is clear guidance on how to present the results obtained:
Over the years, it has become common practice to crop representative blots to reduce their overall size and to save journal space. Removal of unnecessary clutter may indeed help focus the eye of the reader and to improve clarity and readability of the scientific article.
Quite often, however, a blot is cut into a narrow strip or cropped around individual bands to remove unsightly (“beautification”), unexplained or confusing areas.
Not only can such cropping mislead the editor, referee and, ultimately, the reader, it can also hide important information that may point to real biological insight, as in Figure 1 (e.g. “What is the origin of the band below 80 kDa, whose intensity appears inversely proportional to that of the band around 100 kDa?”).

Cropping may also create a false impression of antibody specificity. For Figure 2 in particular, it is hard to understand why the space available around the EZH2 signal was not used more fully, leaving the reader wondering whether important information has been removed.

In addition, the red arrow points to a visible vertical line between lanes 2 and 3, which typically indicates image splicing or manipulation and should be treated with caution. If splicing different gels together is unavoidable, a black line or gap should be left so that it is clear that the parts have separate origins.
 
When comparing different samples and signal intensities, it is important that the signals come from the same gel/blot, because identical blotting conditions cannot be guaranteed across separate gels. Mixing and matching bands from various gels (as shown in Figure 3) is therefore meaningless if the purpose of the experiment is to compare bands between different samples and treatment conditions.

These examples show that it is critical to think carefully about how best to present data generated by SDS-PAGE and Western blotting. For every adjustment and manipulation (e.g. cutting, splicing), it must be ensured that the resulting image remains an accurate representation of the original data!
As exemplified by Figure 3, Western blotting data can only be used effectively if the way the results are analysed is fit for purpose and able to support the conclusions presented in the paper.

Covid-19 and the PPV

There has been a lot in the news recently about using antibody tests to detect people who have had Covid-19 and who might therefore be immune to further infection (this remains to be proven). Despite the relatively good performance of many of the proposed antibody tests, with sensitivity and specificity above 95% in most cases, the low prevalence of infection means that the Positive Predictive Value (PPV) of these tests is low.
For example, a test with 98% sensitivity (ability to detect true positives) and 98% specificity (ability to avoid false positives) will still produce positive results of which nearly 30% are false if the underlying rate of infection is around 5% (i.e. a PPV of 72.1%). Thus, for every 1000 people tested there will be on average 50 people with antibodies to SARS-CoV-2, and the test will correctly detect 49 of those 50; this is the result of the test being 98% sensitive. However, among the 950 people who do not have antibodies there will be 19 false positives. Despite the high specificity of the test, the large number of non-infected people means that the 2% of them who give a false positive result make up a sizeable proportion of all positive tests: in this case 19 out of 68. Health authorities have quite rightly pointed out that this is insufficient to qualify people as ‘immune’.
However, there is a relatively simple solution. By using two independent antibody tests with similar specificity and sensitivity, the PPV can be increased to 99.2% if only those who are positive under both tests are considered positive. This is because 48 of the previous true positives will again test positive (98% of 49), but no more than 1 of the 19 false positives will give a second positive result (on average only 0.4 people, i.e. 2% of 19). The two tests would have to be truly independent, but as there are now numerous tests available (or in development), it should be possible to find two that can be combined to achieve a PPV useful for making public health decisions.
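The arithmetic is easy to reproduce. Below is a small R sketch (ours, not taken from any referenced source) that computes the PPV for a single test and for two independent tests applied in sequence:

    # PPV from sensitivity, specificity and the underlying rate (prevalence)
    ppv <- function(sens, spec, prev) {
      tp <- prev * sens               # true positives per unit of population
      fp <- (1 - prev) * (1 - spec)   # false positives per unit of population
      tp / (tp + fp)
    }

    ppv(0.98, 0.98, 0.05)                    # single test: ~0.721
    # for a second, independent test applied only to the first test's positives,
    # the effective "prevalence" is the PPV of the first test
    ppv(0.98, 0.98, ppv(0.98, 0.98, 0.05))   # positive under both tests: ~0.992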
What we have done here, in effect, is to increase the underlying rate of true positives for the second test (to 72%, 49 of the initially 68 positive tests). Under these conditions our antibody test meets our expectations of what a 98% test should do.
Too often we look at a test's performance without considering the underlying rate. Tests for low-incidence diseases must be extremely good to be useful as diagnostic tools, simply because of the large number of people without the condition. The same holds true for the analysis of most scientific experiments: we use an alpha of 0.05 and a beta of 0.2 to decide whether our studies are significant. This corresponds to 95% specificity (from alpha = 0.05) and 80% sensitivity (from a power of 1 - beta = 0.8). In the example above, that would yield only 40 true positives alongside roughly 48 false positives, i.e. more false than true discoveries!
We rarely know what the true underlying rate of true positives is in our experiments. We might suspect that for structured, confirmatory studies it is quite high, maybe above 50%. But for exploratory studies, screening of compound libraries etc. it might be less than 5%. Interpretation and analysis of these experiments needs to consider such differences if we want to reach robust conclusions. With the importance of underlying rate for calculating PPV now getting a more public airing, we can only hope that it will be considered more in data analysis.
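The ppv() sketch above illustrates how strongly the assumed rate of true effects matters under this hypothesis-testing analogy (80% “sensitivity” from a power of 0.8 and 95% “specificity” from an alpha of 0.05); the base rates below are, of course, only illustrative guesses:

    ppv(0.80, 0.95, 0.50)   # confirmatory setting, ~50% of tested effects real: ~0.94
    ppv(0.80, 0.95, 0.05)   # exploratory screening, ~5% of tested effects real: ~0.46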

Additional note:
While this piece was being written, a new FDA-approved test from Roche was announced, claiming 100% sensitivity and 99.8% specificity, which would bring the PPV up to 96.3%. This shows that the bar needs to be set very high for low-incidence events, but Roche seem to have succeeded.

Biological vs technical replicates: Now from a data analysis perspective

We have discussed this topic several times before (HERE and HERE). There seems to be a growing understanding that, when reporting an experiment’s results, one should state clearly what experimental units (biological replicates) are included, and, when applicable, distinguish them from technical replicates.

Discussions of this topic with various colleagues have made it obvious to us that there is no clarity on best analytic practices and, in particular, on how technical replicates should enter the analysis.

We have approached David L McArthur (at the UCLA Department of Neurosurgery), an expert in study design and analysis, who has been helping us and the Preclinical Data Forum on projects related to data analysis and robust data analysis practices.

A representative example that we wanted to discuss includes 3 treatment groups (labeled A, B, and C) with 6 mice per group and 4 samples processed for each mouse (e.g. one blood draw per mouse separated into four vials and subjected to the same measurement procedure) – i.e. a 3X6X4 dataset.

The text below is based on Dave’s feedback.  Note that Dave is using the term “facet” as an overarching label for anything that contributes to (or fails to contribute to) interpretable coherence beyond background noise in the dataset, and the term “measurement” as a label for the observed value obtained from each sample (rather than the phrase “dependent variable”  often used elsewhere).

Dave has drafted a thought experiment supported by a simulation.  With a simple spreadsheet using only elementary function commands, it is easy to build a toy study in the form of a flat file representing that 3 × 6 × 4 system of data, with the outcome consisting of one measurement in each line of a “tall” datafile, i.e. 72 lines of data, each with entries for group, subject, sample, and a close-but-not-quite-identical measurement (LINK).  For our purposes, however, we insert not just measurement A but also measurement B on each line, where measurement B is constructed to differ from measurement A in its variability but otherwise to have identical group means and subject means.  (As shown in Column E, this can be done easily: take each A value, jitter it by uniform application of some multiplier, then subtract out any per-subject mean difference to obtain B.)  As a result, measurement A shows just a little variation from one measurement to the next within a given subject, whereas, because of that multiplier, measurement B shows a lot of variation from one measurement to the next within a given subject.
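As an illustration only (this is our own R sketch, not Dave's spreadsheet, and it will not reproduce his exact values), such a tall 3 × 6 × 4 dataset with paired measurements A and B could be built roughly as follows:

    # 3 groups x 6 mice x 4 samples = 72 lines in "tall" form
    set.seed(42)
    dat <- expand.grid(sample = factor(1:4), mouse = 1:6, group = c("a", "b", "c"))
    dat$mouse <- interaction(dat$group, dat$mouse)   # 18 unique subject labels
    group_effect   <- c(a = 0.85, b = 1.45, c = 2.05)[as.character(dat$group)]
    subject_effect <- rnorm(18, sd = 0.10)[as.integer(dat$mouse)]
    noise          <- rnorm(72, sd = 0.25)
    # measurement A: modest variation from one sample to the next within a subject
    dat$A <- group_effect + subject_effect + noise
    # measurement B: multiply A, then subtract out per-subject mean differences, so
    # group and subject means stay identical but the within-subject spread is much larger
    subj_mean <- ave(dat$A, dat$mouse)
    dat$B <- subj_mean + 20 * (dat$A - subj_mean)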

A 14-term descriptive summary shows that using all values of measurement A, across groups, results in:

                    a          b          c
N              24.0000    24.0000    24.0000
mean            0.8500     1.4500     2.0500
SD              0.2874     0.2874     0.2874
robust min      0.3000     0.9000     1.5000
min             0.3000     0.9000     1.5000
hdQ: 0.25       0.6380     1.2380     1.8380    (25th quantile, the lower box bound of a boxplot)
median          0.8500     1.4500     2.0500
hdQ: 0.75       1.0620     1.6620     2.2620    (75th quantile, the upper box bound of a boxplot)
max             1.4000     2.0000     2.6000
robust max      1.4000     2.0000     2.6000
skew           -0.0000    -0.0000    -0.0000
kurtosis       -0.5908    -0.5908    -0.5908
Huber mu        0.8500     1.4500     2.0500
Shapiro p       0.9703     0.9703     0.9703

while, using all values of  measurement B, across groups, results in:

                    a          b          c
N              24.0000    24.0000    24.0000
mean            0.8500     1.4500     2.0500    <– identical group means
SD              5.7131     5.7131     5.7131    <– group standard deviations about 20 times larger
robust min     -6.9000    -6.3000    -5.7000
min            -6.9000    -6.3000    -5.7000
hdQ: 0.25      -4.2657    -3.6657    -3.0657
median          0.8500     1.4500     2.0500    <– identical group medians
hdQ: 0.75       5.9657     6.5657     7.1657
max             8.6000     9.2000     9.8000
robust max      8.6000     9.2000     9.8000
skew           -0.0000    -0.0000    -0.0000    <– identical group skews
kurtosis       -1.3908    -1.3908    -1.3908    <– greater kurtoses, no surprise
Huber mu        0.8500     1.4500     2.0500    <– identical Huber estimates of group centers
Shapiro p       0.0078     0.0078     0.0078    <– suspiciously low p-values for test of normality, no surprise

The left panel in the image below results from simple arithmetical averaging of that dataset’s samples from each subject, with the working dataframe reduced by averaging from 72 lines to 18 lines.  It doesn’t matter here whether we now analyze measurement A or measurement B, as both measurements inside this artificial dataset generate the identical 18-line dataframe, with means of 0.8500, 1.4500, and 2.0500 for groups A, B and C respectively.  Importantly, the sample facet disappears altogether, though we still have group, mouse, measurement and noise.  The simple ANOVA solution for the mean measures shows “very highly significant” differences between the groups.  But wait.

The center panel uses all 72 available datapoints from measurement A.  By definition that’s in the form of a repeated-measures structure, with four non-identical samples provided by each subject.  Mixed effects modeling accounts for all 5 facets here by treating them as fixed (group and sample) or random (subject), or as the object of the equation (measurement), or as residual (noise).  The mixed effects model analysis for measurement A results in “highly significant” differences between groups, though those p-values are not the same as those in the left panel.  But wait.

The right panel uses all 72 available datapoints from measurement B.  Again, it’s a repeated-measures structure, but while the means and medians remain the same, now the standard deviations are 20 times larger than those for measurement A, a feature of the noise facet being intentionally magnified and inserted into the artificial source datafile. The mixed effects model analysis for measurement B results in “not-at-all-close-to-significant” differences between groups; no real surprise.
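For orientation, the two analytic routes could look roughly like this in R (a sketch based on the toy dat built above and the lme4/lmerTest packages; Dave's actual script, linked below, is the authoritative version behind the figure):

    library(lmerTest)   # loads lme4 and adds p-values for the fixed effects

    # left panel: average the four samples per mouse, then a simple one-factor
    # ANOVA on the 18 per-mouse means
    avg <- aggregate(A ~ group + mouse, data = dat, FUN = mean)
    summary(aov(A ~ group, data = avg))

    # center and right panels: keep all 72 lines and fit mixed effects models with
    # group and sample as fixed effects and mouse as a random effect
    fit_A <- lmer(A ~ group + sample + (1 | mouse), data = dat)
    fit_B <- lmer(B ~ group + sample + (1 | mouse), data = dat)
    anova(fit_A)   # group differences are clear for the low-noise measurement A
    anova(fit_B)   # but typically not for the high-noise measurement B
                   # (fit_B may warn about a singular fit: the subject-level
                   # variance is swamped by the inflated within-subject noise)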

What does this example teach us?

Averaging technical replicates (as in the left panel) and running statistical analyses on average values means losing potentially important information.  No facet should be dropped from analysis unless one is confident that it can have absolutely no effect on analyses.  A decision to ignore a facet (any facet), drop data and go for a simpler statistical test must in any case be justified and defended.

Further recommendations that are supported by this toy example or that the readers can illustrate for themselves (with the R script LINK) are:

  • There is no reason to use the antiquated method of repeated measures ANOVA; in contrast to RM ANOVA, mixed effects modeling makes no sphericity assumption and handles missing data well.
  • There is no reason to use nested ANOVA in this context:  nesting is applicable in situations when one or another constraint does not allow crossing every level of one factor with every level of another factor.  In such situations with a nested layout, fewer than all levels of one factor occur within each level of the other factor.  By this definition, the toy example here includes no nesting.
  • The expanded descriptive summary can be highly instructive (and is yours to use freely). 

And last but not least, whatever method is used for the analysis, the main message should not be lost: one should be maximally transparent about how the data were collected, what the experimental units were, what the replicates were, and what analyses were used to examine the data.

Fecal transplants, obesity and the importance of randomization and normalization

In one of the recent Newsletter issues (LINK), we wrote about the uncertainty that sometimes exists when trying to find the best normalization approach for a given study. In this context, the need to be transparent about normalization and randomization steps and to provide sufficiently detailed information is nicely demonstrated by the following example:
One of the most influential papers on the role of the gut microbiota in obesity was published in Nature in 2006 by Turnbaugh and colleagues and has already been cited more than 8000 times. Its importance stems from it being the first study to show that transplanting the microbiota from obese mice into lean, germ-free mice made the recipients fatter than mice receiving transplants from lean donors.
The key figure is shown below, demonstrating that, in germ-free mice transplanted with gut microbiota from obese donors (n = 9 recipients), body fat increased by 47% compared to a 27% increase when donors were lean (n = 10 recipients).
This study outcome (Figure 1) looks indeed quite impressive and apparently merited a publication in Nature.
Unfortunately, however, the paper does not explain how these 19 recipient mice were initially randomized and allocated to the two groups, or whether body fat/weight was considered as a normalization denominator.

Why is this relevant?
It is important to note that this figure does not show body fat percentage but rather the percentage increase in fat in relation to the initial body fat.
Although the paper does not state the starting body fat of the mice, from the information provided it is possible to calculate back (nicely done by Matthew Dalby) and arrive at the following situation:
  • Mice getting transplants from obese mice had 2.76 grams of fat at the start.
  • Mice getting transplants from lean mice had 3.19 grams of fat at the start.
  • The final body fat of mice getting transplants from obese mice was 4.07 grams.
  • The final body fat of mice getting transplants from lean mice was 4.05 grams.
This is summarized in Figure 2, which certainly does not look as impressive as Figure 1.
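The arithmetic connecting these numbers is trivial to check (a sketch; starting fat values as reconstructed by Matthew Dalby, percentage increases as reported in Figure 1):

    start    <- c(obese_donor = 2.76, lean_donor = 3.19)   # body fat (g) at the start
    increase <- c(obese_donor = 0.47, lean_donor = 0.27)   # reported relative increase
    start * (1 + increase)                                 # final body fat, ~4.1 g in both groups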

Although the mice getting a fecal transplant from the obese mice gained a little more fat, they ended up with the same body fat at the end of the study as the mice getting transplants from lean mice. As mice at this age would typically weigh about 25 grams, at 4 grams of body fat both groups would be considered lean, and therefore neither group was obese after the transplantation.

Are we being too critical? We would certainly have had less reason to review this example if the publication had provided more information about the randomization process and whether the normalization was pre-specified.

When p-hacking is not p-hacking – or: How shall the data be reported?

Organ bath experiments are a key technology for assessing the contractility of smooth muscle. When comparing force of contraction, investigators typically normalize the measured force to the size of the tissue specimen, as a larger piece of tissue should develop greater force. However, there is no agreement on which parameter should be used as the denominator for normalization.
 
To analyze the effects of ageing on muscarinic receptor subtypes and function in the rat urinary bladder, Tim Schneider and colleagues pre-defined tissue specimen length in their study protocol as the denominator for normalizing the measured force of contraction.  Using this as the primary outcome parameter, the investigators showed that carbachol, a synthetic mimetic of the neurotransmitter acetylcholine that acts primarily by stimulating muscarinic receptors, contracted bladders from young and old rats with similar efficacy, and no difference could be detected between the two groups (old vs young).
 
Given the uncertainty around the best normalization approach, the investigators also explored normalization for strip weight (as secondary outcome parameter) and found that maximum force of contraction was significantly smaller in old than in young rats. It became clear that whether carbachol effects differed between groups depended on the type of data normalization.
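Schematically, the two normalizations amount to dividing the same raw force by a different specimen property before the group comparison. The sketch below is purely illustrative: the data frame strips, its column names, the random example values and the use of a simple two-group test are our own hypothetical choices, not taken from the original study:

    # purely hypothetical example data: one row per bladder strip
    set.seed(7)
    strips <- data.frame(
      age_group = rep(c("young", "old"), each = 6),
      force_mN  = rnorm(12, mean = 40, sd = 8),   # maximum carbachol-induced force
      length_mm = rnorm(12, mean = 10, sd = 1),
      weight_mg = rnorm(12, mean = 25, sd = 4)
    )
    # the same raw force, normalized two different ways
    strips$force_per_length <- strips$force_mN / strips$length_mm   # primary outcome
    strips$force_per_weight <- strips$force_mN / strips$weight_mg   # secondary outcome
    # the group comparison is then run separately on each normalized variable
    t.test(force_per_length ~ age_group, data = strips)
    t.test(force_per_weight ~ age_group, data = strips)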
 
Having obtained these different data sets based on the parameters pre-defined in the study protocol, the question arose of how best to report them.
Here, the investigators aimed for maximum transparency and decided not to select one analysis over the other but to report both negative and positive analysis outcomes in the same publication. In addition, they clearly stated which parameters were defined in the study protocol as primary and secondary outcomes.
Only several years later could a much larger study by the same group show that normalization for tissue specimen weight would have been the right choice and more informative.
 
This example shows that openness and transparency are a key basis for trustworthy research and that scientific articles should always leave sufficient room to report results in as much detail as necessary for the reader to understand the complexity of a study.