In vivo veritas

Most of our readers are biologists, but we are nevertheless occasionally exposed to the medicinal chemistry literature describing novel research tools, lead molecules or drug candidates. Medicinal chemistry papers often include evidence from animal research supporting certain qualities of novel chemical entities. Like us, you may have noticed that, while the chemistry methods are presented in great detail, the biology sections tend to be very short.

To test this subjective impression, we retrieved all papers published since the beginning of 2020 in three leading medicinal chemistry journals: two journals from the American Chemical Society (Journal of Medicinal Chemistry [JMC] and ACS Chemical Neuroscience [ACN]) and the European Journal of Medicinal Chemistry (EJMC). These journals have more or less the same readership and comparable conventional impact factors.

From a total of 1,413 papers, we have randomly selected 15 research papers from JMC and ACN and another 15 papers from EJMC (a total of 30 papers) that met the following criteria: i) reporting the discovery of a novel chemical entity, and ii) reporting at least one in vivo experiment in a mammalian species.

For EJMC, the Guide for Authors states: “All animal experiments should comply with the ARRIVE guidelines and should be carried out in accordance with the U.K. Animals (Scientific Procedures) Act, 1986 and associated guidelines, EU Directive 2010/63/EU for animal experiments, or the National Institutes of Health guide for the care and use of Laboratory animals (NIH Publications No. 8023, revised 1978) and the authors should clearly indicate in the manuscript that such guidelines have been followed.”

For JMC, the Guide for Authors states: “Research involving animals must be performed in accordance with institutional guidelines as defined by Institutional Animal Care and Use Committee for U.S. institutions or an equivalent regulatory committee in other countries.

A statement confirming that all animal experiments performed in the manuscript were conducted in compliance with these guidelines is required. In the experimental section, the source, age, sex, species, and strain of animals should be included.”

For ACN, the Guide for Authors states: “Papers reporting data from experiments on live animals must include a statement identifying the approving committee and certifying that such experiments were performed in accordance with all national or local guidelines and regulations.”

Here is an overview of the information we have retrieved from this sample of 30 papers:

| | Eur J Med Chem (n=15) | ACS journals (n=15) | All journals (N=30) |
| --- | --- | --- | --- |
| Statement about compliance with NIH guidelines, EU Directive, or national animal welfare law | 5 (33%) | 4 (27%) | 9 (30%) |
| Statement on approval of animal study protocol by any local body | 10 (67%) | 7 (47%) | 17 (57%) |
| Any information on animal housing and husbandry | 7 (47%) | 1 (7%) | 8 (27%) |
| Supplier or breeder of animals identified | 12 (80%) | 6 (40%) | 18 (60%) |
| Any information on any of the Landis 4 criteria for any of the experiments (randomization, blinding, sample size estimation, inclusion / exclusion criteria) | 9 (60%) (randomization only) | 5 (33%) (randomization in 4 papers and blinding in one) | 14 (47%) |

Our analysis is based on a very small sample, a clear limitation that prevents us from drawing firm conclusions.
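To illustrate how much uncertainty a sample of 30 papers leaves, here is a minimal sketch in base R that attaches exact (Clopper-Pearson) 95% confidence intervals to the pooled proportions from the table above. The counts are taken from the table; the variable names and the choice of the exact binomial interval are ours and only one of several reasonable options.

```r
# Counts taken from the overview table (pooled sample, N = 30)
counts <- c(compliance = 9, approval = 17, housing = 8,
            supplier = 18, landis = 14)
n_papers <- 30

# Exact (Clopper-Pearson) 95% confidence interval for each proportion
ci <- t(sapply(counts, function(x) binom.test(x, n_papers)$conf.int))
colnames(ci) <- c("lower", "upper")

# Combine the observed proportions with their interval limits
ci_table <- cbind(proportion = counts / n_papers, ci)
print(round(ci_table, 2))
```

The intervals are wide, which is exactly why we treat the percentages as rough indications rather than precise estimates.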

However, even with this limited sample, we are able to confirm that reporting of in vivo experiments in medchem journals needs to improve.

Further, the Guide for Authors does apparently have an impact on what is eventually reported by the authors. Hence, editors and publishers should be encouraged to revisit their author guides to enhance rigor and transparency in reporting the results of in vivo experiments.

Some of our chemist colleagues argue that a chemistry paper should focus primarily on chemistry and that it is therefore justified to present biology results in less detail than would be expected in pharmacology journals. We do not agree. Do you?

Estimation vs Hypothesis-Testing

As some of our readers may know, the PAASP team was involved in the “Negative results Award” project that took place in 2018. One important lesson for us was that any discussion about negative (or null) results should start by deciding whether the results are really negative and how confident we are in that conclusion. For example, p-values above 0.05 do not necessarily indicate that the results are negative. We have not found any reading or teaching material that easily explains to non-statisticians and, in particular, to young scientists in biomedical research fields the need to move away from a binary decision process. If any of our readers are aware of such tools, please share them with us and we will post the information in our Resource Center.
Therefore, we were pleased to see a recent opinion paper by Calin-Jageman and Cumming that provides basic information with examples that could serve as (self-)learning material. One particular example focussed on a study that found that “caffeine administration enhances memory consolidation in humans” (Borota et al., 2014). The same results were re-analyzed and visualized with a different (quantitative) question in mind: To what extent does caffeine improve memory? To answer this question, the difference between the group means was estimated and the uncertainty in this estimate due to expected sampling error was quantified. The results obtained can be summarized as “caffeine is estimated to improve memory relative to the placebo group by 31% with a 95% confidence interval of (0.2%, 62%)”.
Why do we believe that this is a good example that can be used for educational purposes?
First, the confidence interval suggests considerable uncertainty about generalizing from the sample to the world at large. Does this make such results less valuable? Certainly not.
Second, the estimation approach described in the eNeuro opinion paper does not call for a revolution that is difficult to follow (e.g. stopping the use of p-values). Instead, it suggests that we should rather go for a more complete and informative reporting of the results.
And last but not least, prior to reading the article by Calin-Jageman and Cumming, most members of our team had not heard about the eNeuro journal. Previously, learning about a new journal would certainly also have meant inquiring about its “impact factor”. There is no need to know the impact factor of eNeuro: if the editors and reviewers of this journal manage to introduce and maintain high-quality reporting, this should put it on the wish list of all (neuro)scientists, whether readers or authors!
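The estimation approach discussed above can be sketched in a few lines of base R. The data below are simulated for illustration only and are not the data from Borota et al. (2014); the group sizes, means and standard deviations are hypothetical choices of ours. The point is simply that the same `t.test` call that returns a p-value also returns the estimated difference and its confidence interval.

```r
# Hypothetical data: simulated for illustration only,
# NOT the data from Borota et al. (2014)
set.seed(1)
placebo  <- rnorm(20, mean = 10, sd = 4)
caffeine <- rnorm(20, mean = 13, sd = 4)

# Welch two-sample t-test (the default of t.test)
fit <- t.test(caffeine, placebo)

# Point estimate of the difference between group means and its 95% CI
diff_means <- unname(fit$estimate[1] - fit$estimate[2])
ci <- fit$conf.int

cat(sprintf("Estimated difference: %.2f (95%% CI %.2f to %.2f), p = %.3f\n",
            diff_means, ci[1], ci[2], fit$p.value))
```

Reporting the estimate and the interval alongside the p-value keeps the quantitative question ("how large is the effect, and how sure are we?") in view.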


A long-awaited revision of the ARRIVE guidelines has finally been published (The ARRIVE guidelines 2019). What impact can we expect from ARRIVE 2.0? At the very least, we hope that the Nature journals will update their life sciences checklist, which will certainly have an impact.
As an example, let’s look at the recently published paper that reported on the impact of the gut microbiome on motor function and survival in the ALS mouse model. We picked this example because: i) it is recent, and ii) it uses the same SOD1 mouse model that has become a classical example of how adherence to higher research rigor standards turns “positive” data into “negative”.
We would not use this example if the only finding were about changes in motor performance of the SOD1 mice exposed to a long-term antibiotic cocktail. This would not be too surprising, as antibiotics may penetrate into the CNS and may have effects unrelated to their “class” effects. Nor do the survival data in the germ-free animals make us focus on this paper, because there were obvious problems with the rederivation itself (page 2, left column) and insufficient information on study design is given (e.g. no details on surgery; no information on whether the colonized animals were also obtained via rederivation).
The most striking are actually the data in Figure 3, “Akkermansia muciniphila colonization ameliorates motor degeneration and increases life-span in SOD1-Tg mice”.
How would ARRIVE 2.0 help the reader to gain confidence in these data sets? In the table below, we review the responses provided by the authors in the Life Sciences checklist and, stimulated by ARRIVE 2.0, indicate what information is missing to increase confidence in these study results:
| Published Life Sciences checklist | What we would like to see |
| --- | --- |
| All in vivo experiments were randomized and controlled for gender, weight and cage effects. | What methods were used for randomization and how can these methods explain highly unequal sample sizes within an experiment? |
| Sample sizes were determined based on previous studies and by the gold standards accepted in the field. | A reference to the gold standards would be very helpful. The only gold standard in this field we are aware of (Scott et al. 2008) would certainly not recommend using n = 5. |
| In all in vivo experiments the researches were blinded. | This statement is insufficient to know whether blinding was maintained until data analysis. |
| No data were excluded from the manuscript. | The experiment in Fig. 3 was repeated 6 (!) times with sample sizes between 5 and 26 resulting in the pooled sample sizes of up to 62 mice per group. However, survival data are presented only for 4-8 mice per group. It would be interesting to see survival data from the main pool of animals unless the main experiment was stopped at day 140. |
| All attempts of replication were successful and individual repeats are presented in ED and SI sections. | Given that each of the “replication” experiments was severely underpowered, one may wonder whether these were indeed independent experiments or parts of a single study erroneously presented as “attempts of replication”. |
We realize that ARRIVE 2.0 may not be sufficient to obtain answers to all of the above questions but this is certainly a major step forward that should be rigorously endorsed and promoted.
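To give a rough sense of what n = 5 per group means in practice, here is a minimal sketch in base R. It rests on assumptions of ours that are not taken from the paper discussed above: a two-sample, two-sided t-test, alpha = 0.05, 80% power, and an effect expressed in standard deviation units. The actual endpoints (e.g. survival) would call for a different analysis, so this is only an order-of-magnitude illustration.

```r
# Assumptions (ours, for illustration): two-sample, two-sided t-test,
# alpha = 0.05, power = 0.8, effect expressed in SD units (sd = 1)
res <- power.t.test(n = 5, delta = NULL, sd = 1,
                    sig.level = 0.05, power = 0.8,
                    type = "two.sample", alternative = "two.sided")

# Smallest difference between means detectable under these assumptions
detectable <- res$delta
cat(sprintf("Detectable difference at n = 5 per group: %.2f SD\n", detectable))
```

Under these assumptions, only effects of roughly two standard deviations or more can be detected reliably with five animals per group.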

Power analysis for two-sided t-test

The example below does not require any additional R packages to be installed (i.e. runs with the basic R installation).
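This example conducts a power analysis for an expected difference between means of 2 with a standard deviation of 1, across a range of alpha (significance level) and power (1 - beta) values. Any one of the four quantities (sample size, difference between means for a given standard deviation, alpha and power) can be estimated when the other three are provided.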
The script reports its results in two tables. The first gives the sample size per group required to detect the expected difference between means, with alpha in rows and power in columns. The second gives the difference between means that can be detected with sample sizes of 6 to 12 per group (rows) at alpha = 0.05 and power = 0.7-0.95 (columns).

The script also generates a plot of the required sample size against alpha, with one curve per power level.

Here is the script:

# calculate sample size (per group) required
# to detect expected difference between means
# at the following alpha (0.001, 0.01 and 0.05) and power (0.7-0.95)

a <- c(.001, .01, .05)
na <- length(a)
p <- c(.7, .8, .85, .9, .95)
np <- length(p)

# d is the expected difference between means
# s is the standard deviation

d <- 2
s <- 1

# rows correspond to alpha, columns to power
samsize <- array(numeric(na*np), dim=c(na,np))
for (i in 1:np){
  for (j in 1:na){
    result <- power.t.test(n = NULL, delta = d, sd = s,
                           sig.level = a[j], power = p[i],
                           type = "two.sample", alternative = "two.sided")
    # round up to the next whole animal
    samsize[j,i] <- ceiling(result$n)
  }
}
colnames(samsize) <- p
rownames(samsize) <- a
print(samsize)

# calculate difference between means that can be detected
# with the following sample sizes per group (6 to 12)
# at alpha=0.05 and power=0.7-0.95

nu <- 6:12
nun <- length(nu)

# rows correspond to sample size, columns to power
samsize2 <- array(numeric(nun*np), dim=c(nun,np))
for (k in 1:np){
  for (m in 1:nun){
    result <- power.t.test(n = nu[m], delta = NULL, sd = s,
                           sig.level = 0.05, power = p[k],
                           type = "two.sample", alternative = "two.sided")
    samsize2[m,k] <- result$delta
  }
}
colnames(samsize2) <- p
rownames(samsize2) <- nu
print(round(samsize2, 2))


# set up graph (alpha on a log-scaled x axis)

xrange <- range(a)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, log="x", type="n",
     xlab="Significance level (alpha)",
     ylab="Sample Size (n)")

# add power curves (one line per power level)

for (i in 1:np){
  lines(a, samsize[,i], type="l", lwd=2, col=colors[i])
}

# add annotation (grid lines, title, legend)

abline(h=0, v=a, lty=2, col="grey89")
title("Sample Size Estimation for two-tailed t-test \n")
legend("topright", title="Power", legend=as.character(p), fill=colors)
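As noted earlier, `power.t.test` solves for whichever quantity is left as `NULL`. The sketch below leaves `power` unspecified to obtain the power at a fixed sample size, and then cross-checks the analytic value with a small simulation. The chosen sample size (6 per group), the number of simulated experiments and the variable names are our own illustrative choices; the simulation assumes normally distributed data with equal variances, as the t-test does.

```r
# Illustrative settings (our choice): 6 animals per group,
# difference between means of 2, SD of 1, alpha = 0.05
n_check <- 6
d_check <- 2
s_check <- 1
alpha_check <- 0.05

# Analytic power: power is left unspecified, so it is the value solved for
analytic <- power.t.test(n = n_check, delta = d_check, sd = s_check,
                         sig.level = alpha_check,
                         type = "two.sample",
                         alternative = "two.sided")$power

# Simulated power: fraction of simulated experiments with p < alpha
set.seed(42)
n_sim <- 2000
rejected <- replicate(n_sim, {
  x <- rnorm(n_check, mean = 0, sd = s_check)
  y <- rnorm(n_check, mean = d_check, sd = s_check)
  t.test(x, y, var.equal = TRUE)$p.value < alpha_check
})
simulated <- mean(rejected)

cat(sprintf("Analytic power: %.3f, simulated power: %.3f\n",
            analytic, simulated))
```

The two values should agree closely; a simulation of this kind is also a convenient starting point when the planned analysis is not a simple t-test and no closed-form power calculation is available.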