There is currently no generic tool for producing figures to display and explore the risk‐of‐bias assessments that routinely take place as part of systematic review.

In this article, the authors, therefore, present a new tool, robvis (Risk‐Of‐Bias VISualization), available as an R package and web app, which facilitates rapid production of publication‐quality risk‐of‐bias assessment figures. A timeline of the tool’s development and its key functionality is also presented.

## R script BestDose

library(scales)

library(ggplot2)

## sample size

n1 <- 8

n2 <- 16

n3 <- 32

n4 <- 64

## iterations

m <- 500

yfig <- vector()

xfig <- vector()

facetfig <- data.frame()

bda1 = data.frame(matrix(rnorm(400), nrow=100))

for (j in c(n1, n2, n3, n4)) {

for (i in 1:m) {

bda2 = bda1[sample(nrow(bda1), j), ]

pa <- oneway.test(values ~ ind, stack(bda2))$p.value

xfig[i] <- pa

bda3 <- subset(bda2, select = -X1)

bda3$colMax <- apply(bda3, 1, function(x) max(x))

bda3$X1 <- bda2$X1

pt <- t.test(bda3$X1, bda3$colMax, paired=TRUE)$p.value

yfig[i] <- pt

}

facetfig <- rbind(facetfig,cbind.data.frame(N=rep(j, m), xfig, yfig))

}

p <- ggplot (facetfig, aes(x=xfig, y=yfig)) + geom_point(shape=16, size=4)

p + facet_grid(. ~ N, labeller = label_both) +

theme(strip.text = element_text(face = "bold", size = 24), strip.background = element_rect(fill = "lightblue")) +

annotate("rect", ymin = 0, ymax = 0.05, xmin = 0.05, xmax = 1, alpha=.1, fill="blue") +

scale_x_log10(labels=trans_format("log10", math_format(10^.x))) +

scale_y_log10(labels=trans_format("log10", math_format(10^.x))) +

geom_hline(yintercept = .05) + geom_vline(xintercept = .05) +

labs(y="p value for best-dose t-test", x="p value for all-dose ANOVA") +

theme(axis.title = element_text(size = 24),

axis.line = element_line(colour = "black"), panel.grid.minor = element_blank(),

panel.grid.major = element_blank(), axis.text = element_text(size = 20))

## Biological vs technical replicates: Now from a data analysis perspective: R script

This is the R Script referenced in the blog post LINK

## "Effortlessly Read Any Rectangular Data"

library(readit)

## "Linear and Nonlinear Mixed Effects Models"

library(nlme)

## "Groupwise Statistics, LSmeans, Linear Contrasts, Utilities"

library(doBy)

## set graphic display

par(mfrow=c(1,3))

## acquire dataset

ab <- as.data.frame(readit("TOYEXAMPLE.xls"))

## generate mean data by subject

amean <- summaryBy(A~group+subject,FUN=mean,data=ab)

## detailed descriptives, function available on request

stats(amean$A.mean, by=amean$group)

## plot result 1

boxplot(amean$A.mean~amean$group,ylim=c(-8,10))

## do analysis of variance

summary(aov(A.mean~group,data=amean))

## detailed descriptives

stats(ab$A,by=ab$group)

stats(ab$B,by=ab$group)

## plot result 2

boxplot(A~group,data=ab,ylim=c(-8,10))

## plot result 3

boxplot(B~group,data=ab,ylim=c(-8,10))

## do mixed effects modeling 1

a.1 <- lme(A~group,random=~1|subject,data=ab)

## do mixed effects modeling 2

b.1 <- lme(B~group,random=~1|subject,data=ab)

summary(a.1)

summary(b.1)

## Biological vs technical replicates: Now from a data analysis perspective

We have discussed this topic several times before (HERE and HERE). There seems to be a growing understanding that, when reporting an experiment’s results, one should state clearly what experimental units (biological replicates) are included, and, when applicable, distinguish them from technical replicates.

In discussing this topic with various colleagues, it became obvious to us that there is no clarity on best analytic practices or on how to account for technical replicates in the analysis.

We have approached David L McArthur (at the UCLA Department of Neurosurgery), an expert in study design and analysis, who has been helping us and the Preclinical Data Forum on projects related to data analysis and robust data analysis practices.

A representative example that we wanted to discuss includes 3 treatment groups (labeled A, B, and C) with 6 mice per group and 4 samples processed for each mouse (e.g. one blood draw per mouse separated into four vials and subjected to the same measurement procedure) – i.e. a 3X6X4 dataset.

The text below is based on Dave’s feedback. Note that Dave is using the term “facet” as an overarching label for anything that contributes to (or fails to contribute to) interpretable coherence beyond background noise in the dataset, and the term “measurement” as a label for the observed value obtained from each sample (rather than the phrase “dependent variable” often used elsewhere).

Dave has drafted a thought experiment supported by a simulation. With a simple spreadsheet using only elementary function commands, it’s easy to build a toy study in the form of a flat file representing that 3X6X4 system of data, with the outcome consisting of one measurement in each line of a “tall” datafile, i.e., 72 lines of data with each line having entries for group, subject, sample, and close-but-not-quite-identical measurement (LINK). But, for our purposes, we’ll insert not just measurement A but also measurement B on each line — where we’ve constructed measurement B to differ from measurement A in its variability but otherwise to have identical group means and subject means. (As shown in Column E, this can be done easily: take each A value, jitter it by uniform application of some multiplier, then subtract out any per-subject mean difference to obtain B.) With no loss of meaning, in this dataset measurement A has just a little variation from one measurement to the next within a given subject, but because of that multiplier, measurement B has a lot of variation from one measurement to the next within a given subject.
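The construction described above can be sketched in a few lines of R rather than a spreadsheet. This is only an illustration under stated assumptions: the seed, the mouse-to-mouse spread, the within-mouse noise, and the multiplier of 20 are hypothetical choices, not the values in the spreadsheet behind the LINK.

```r
# Toy 3 x 6 x 4 dataset: 3 groups, 6 mice per group, 4 samples per mouse.
# All numeric settings below are illustrative assumptions.
set.seed(42)

toy <- expand.grid(sample = 1:4, mouse = 1:6, group = c("a", "b", "c"))
toy$subject <- factor(paste(toy$group, toy$mouse, sep = "_"))  # 18 unique mice

group_center <- c(a = 0.85, b = 1.45, c = 2.05)
mouse_effect <- rnorm(nlevels(toy$subject), sd = 0.15)
names(mouse_effect) <- levels(toy$subject)

# Measurement A: group center + mouse effect + modest sample-to-sample noise
toy$A <- unname(group_center[as.character(toy$group)] +
                mouse_effect[as.character(toy$subject)]) +
  rnorm(nrow(toy), sd = 0.25)

# Measurement B: multiply A, then subtract the per-subject mean difference
multiplier <- 20
B_raw <- toy$A * multiplier
toy$B <- B_raw - (ave(B_raw, toy$subject) - ave(toy$A, toy$subject))

# Per-subject means agree; within-subject spread differs by the multiplier
subject_means <- aggregate(cbind(A, B) ~ subject, data = toy, FUN = mean)
subject_sds   <- aggregate(cbind(A, B) ~ subject, data = toy, FUN = sd)
```

Comparing `subject_means$A` with `subject_means$B`, and `subject_sds$A` with `subject_sds$B`, shows the intended property: the same centers for every mouse, but a much wider scatter of samples around them for B.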

A 14-term descriptive summary shows that using all values of measurement A, across groups, results in:

| | a | b | c | |
|---|---|---|---|---|
| N | 24.0000 | 24.0000 | 24.0000 | |
| mean | 0.8500 | 1.4500 | 2.0500 | |
| SD | 0.2874 | 0.2874 | 0.2874 | |
| robust min | 0.3000 | 0.9000 | 1.5000 | |
| min | 0.3000 | 0.9000 | 1.5000 | |
| hdQ: 0.25 | 0.6380 | 1.2380 | 1.8380 | (25th quantile, the lower box bound of a boxplot) |
| median | 0.8500 | 1.4500 | 2.0500 | |
| hdQ: 0.75 | 1.0620 | 1.6620 | 2.2620 | (75th quantile, the upper box bound of a boxplot) |
| max | 1.4000 | 2.0000 | 2.6000 | |
| robust max | 1.4000 | 2.0000 | 2.6000 | |
| skew | -0.0000 | -0.0000 | -0.0000 | |
| kurtosis | -0.5908 | -0.5908 | -0.5908 | |
| Huber mu | 0.8500 | 1.4500 | 2.0500 | |
| Shapiro p | 0.9703 | 0.9703 | 0.9703 | |

while, using all values of measurement B, across groups, results in:

| | a | b | c | |
|---|---|---|---|---|
| N | 24.0000 | 24.0000 | 24.0000 | |
| mean | 0.8500 | 1.4500 | 2.0500 | <– identical group means |
| SD | 5.7131 | 5.7131 | 5.7131 | <– group standard deviations about 20 times larger |
| robust min | -6.9000 | -6.3000 | -5.7000 | |
| min | -6.9000 | -6.3000 | -5.7000 | |
| hdQ: 0.25 | -4.2657 | -3.6657 | -3.0657 | |
| median | 0.8500 | 1.4500 | 2.0500 | <– identical group medians |
| hdQ: 0.75 | 5.9657 | 6.5657 | 7.1657 | |
| max | 8.6000 | 9.2000 | 9.8000 | |
| robust max | 8.6000 | 9.2000 | 9.8000 | |
| skew | -0.0000 | -0.0000 | -0.0000 | <– identical group skews |
| kurtosis | -1.3908 | -1.3908 | -1.3908 | <– greater kurtoses, no surprise |
| Huber mu | 0.8500 | 1.4500 | 2.0500 | <– identical Huber estimates of group centers |
| Shapiro p | 0.0078 | 0.0078 | 0.0078 | <– suspiciously low p-values for test of normality, no surprise |

The left panel in the image below results from simple arithmetical averaging of that dataset’s samples from each subject, with the working dataframe reduced by averaging from 72 lines to 18 lines. It doesn’t matter here whether we now analyze measurement A or measurement B, as both measurements inside this artificial dataset generate the identical 18-line dataframe, with means of 0.8500, 1.4500, and 2.0500 for groups A, B and C respectively. Importantly, the sample facet disappears altogether, though we still have group, mouse, measurement and noise. The simple ANOVA solution for the mean measures shows “*very highly significant*” differences between the groups. But wait.
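The averaging step can be reproduced with base R alone. The sketch below builds its own small simulated 3×6×4 dataset (the seed, spreads, and multiplier are assumptions chosen for illustration, so the numbers will not match the tables above) and then collapses it to one line per mouse.

```r
# Simulated stand-in for the toy dataset; all settings are assumptions.
set.seed(7)
toy <- data.frame(
  group   = factor(rep(c("a", "b", "c"), each = 24)),
  subject = factor(rep(sprintf("m%02d", 1:18), each = 4))
)
mouse_level <- rep(c(0.85, 1.45, 2.05), each = 6) + rnorm(18, sd = 0.15)
toy$A <- rep(mouse_level, each = 4) + rnorm(72, sd = 0.25)

# B keeps each mouse's mean but stretches the within-mouse scatter 20-fold
subject_mean_A <- ave(toy$A, toy$subject)
toy$B <- subject_mean_A + 20 * (toy$A - subject_mean_A)

# Average the 4 samples per mouse: 72 lines become 18 lines
by_subject <- aggregate(cbind(A, B) ~ group + subject, data = toy, FUN = mean)
same_means <- isTRUE(all.equal(by_subject$A, by_subject$B))

# One-way ANOVA on the averaged values; A and B give the same table
anova_A <- summary(aov(A ~ group, data = by_subject))
anova_B <- summary(aov(B ~ group, data = by_subject))
print(anova_A)
```

Because `by_subject$A` and `by_subject$B` coincide, the ANOVA on averaged values cannot tell the two measurements apart, which is exactly the information that averaging discards.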

The center panel uses all 72 available datapoints from measurement A. By definition that’s in the form of a repeated-measures structure, with four non-identical samples provided by each subject. Mixed effects modeling accounts for all 5 facets here by treating them as fixed (group and sample) or random (subject), or as the object of the equation (measurement), or as residual (noise). The mixed effects model analysis for measurement A results in “*highly significant*” differences between groups, though those p-values are not the same as those in the left panel. But wait.

The right panel uses all 72 available datapoints from measurement B. Again, it’s a repeated-measures structure, but while the means and medians remain the same, now the standard deviations are 20 times larger than those for measurement A, a feature of the noise facet being intentionally magnified and inserted into the artificial source datafile. The mixed effects model analysis for measurement B results in “*not-at-all-close-to-significant*” differences between groups; no real surprise.
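The contrast between the center and right panels can be sketched with `lme()` from the nlme package, using the same random-intercept formula as the R script. The data are simulated under assumed settings (seed, spreads, multiplier of 20), so the p-values will differ from those behind the figure. Setting `returnObject = TRUE` in `lmeControl()` is a precaution added here so that a fit is still returned if the optimizer stops with a subject variance estimated near zero.

```r
library(nlme)

# Simulated stand-in for the toy dataset; all settings are assumptions.
set.seed(7)
toy <- data.frame(
  group   = factor(rep(c("a", "b", "c"), each = 24)),
  subject = factor(rep(sprintf("m%02d", 1:18), each = 4))
)
mouse_level <- rep(c(0.85, 1.45, 2.05), each = 6) + rnorm(18, sd = 0.15)
toy$A <- rep(mouse_level, each = 4) + rnorm(72, sd = 0.25)
subject_mean_A <- ave(toy$A, toy$subject)
toy$B <- subject_mean_A + 20 * (toy$A - subject_mean_A)

# Random intercept per mouse, group as fixed effect, all 72 datapoints
ctrl  <- lmeControl(returnObject = TRUE)
fit_A <- lme(A ~ group, random = ~ 1 | subject, data = toy, control = ctrl)
fit_B <- lme(B ~ group, random = ~ 1 | subject, data = toy, control = ctrl)

# p-values for the overall group effect
p_A <- anova(fit_A)["group", "p-value"]
p_B <- anova(fit_B)["group", "p-value"]
print(c(A = p_A, B = p_B))
```

Both fits see the same per-mouse means. Only the sample-to-sample scatter differs, and it is this scatter, retained by the mixed model but lost by averaging, that drives the difference in the group test.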

What does this example teach us?

Averaging technical replicates (as in the left panel) and running statistical analyses on average values means losing potentially important information. No facet should be dropped from analysis unless one is confident that it can have absolutely no effect on analyses. A decision to ignore a facet (any facet), drop data and go for a simpler statistical test must in any case be justified and defended.

Further recommendations that are supported by this toy example or that the readers can illustrate for themselves (with the R script LINK) are:

- There is no reason to use the antiquated method of repeated measures ANOVA; in contrast to RM ANOVA, mixed effects modeling makes no sphericity assumption and handles missing data well.
- There is no reason to use nested ANOVA in this context: nesting is applicable in situations when one or another constraint does not allow crossing every level of one factor with every level of another factor. In such situations with a nested layout, fewer than all levels of one factor occur within each level of the other factor. By this definition, the toy example here includes no nesting.
- The expanded descriptive summary can be highly instructive (and is yours to use freely).

And last but not least, whatever method is used for the analysis, the main message should not be lost: one should be maximally transparent about how the data were collected, what the experimental units were, what the replicates were, and what analyses were used to examine the data.

## MANILA – A web tool for designing reproducible and transparent preclinical intervention studies

*Teemu D. Laajala ^{1,2}*

*1: University of Turku, Turku (Finland), Department of Mathematics and Statistics*

*2: University of Colorado, Denver (CO, US), Anschutz Medical Campus, Department of Pharmacology*

*MANILA* (*MAtched ANimaL Analysis*) is a novel web-based tool that leverages predictive baseline variables and incorporates complex baseline characteristics for allocating treatment groups in preclinical intervention studies. The need for MANILA was motivated by the challenges in reproducibility and transparency reported in preclinical cancer research (Laajala TD, et al., Aittokallio T, et al.), and by an internal need to standardize protocols and provide a generalizable tool that non-bioinformaticians can also use. MANILA provides not only an interactive web-based interface, but also a more extensive underlying R-package, *hamlet* (*hierarchical optimal matching and machine learning toolbox*), which includes open-source R functionality that is most relevant for preclinical experimentation. Experts interested in the wider use of hamlet are encouraged to explore the R-package on CRAN (Comprehensive R Archive Network), where it is extensively documented and exemplified.

MANILA identifies animal subgroups based on a selected dissimilarity metric, so that the predictive baseline characteristics of the animals portray a similar prognosis. These subgroups – dubbed submatches – are evenly divided into blinded intervention arms in a stochastic manner, optimized by a genetic algorithm. One of MANILA’s strengths lies in the blinding of the study arms, similar to how clinical trials are conducted. All of the allocated intervention arms are asymptotically similar, so any group label can be used as the control or comparison group. MANILA provides a highly versatile range of options for adjusting various parameters in the matching procedure, including but not limited to distance or dissimilarity metrics, scaling, and genetic algorithm parameters.
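The idea of matching first and randomizing second can be illustrated with a deliberately simplified stand-in. The sketch below is not the hamlet algorithm: it uses a single hypothetical baseline variable (`tumor_volume`), forms submatches by sorting rather than by optimal matching on a dissimilarity matrix, and does not involve a genetic algorithm. It only shows the allocation principle of one animal per arm within each submatch, with neutral arm labels.

```r
# Simplified illustration only; not hamlet's matching or optimization.
set.seed(2024)
n_arms <- 3

# Hypothetical baseline data for 18 animals
animals <- data.frame(
  id = 1:18,
  tumor_volume = rlnorm(18, meanlog = 5, sdlog = 0.4)
)

# Form submatches: consecutive triplets after sorting by the baseline variable
ord <- order(animals$tumor_volume)
submatch <- integer(nrow(animals))
submatch[ord] <- rep(seq_len(nrow(animals) / n_arms), each = n_arms)
animals$submatch <- submatch

# Within each submatch, hand out the arms in random order
arm_index <- ave(seq_len(nrow(animals)), animals$submatch,
                 FUN = function(i) sample(n_arms))
animals$arm <- LETTERS[arm_index]  # neutral labels keep the arms blinded

print(table(animals$submatch, animals$arm))
print(aggregate(tumor_volume ~ arm, data = animals, FUN = median))
```

Every submatch contributes exactly one animal to each arm, so the arms start from comparable baseline distributions and any of the labels can later be designated as the control.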

Our web-tool incorporates tools that help non-expert users in inputting and modifying their raw data, through e.g. data transformations, inclusion and exclusion of variables or observations, and diagnostics. Such additional tools are complemented by a wide variety of visualization tools such as heatmaps, hierarchical clustering, multidimensional scatterplots, and boxplots. Furthermore, MANILA offers mixed-effects models for testing differences in treatment effects after the interventions have been conducted. The tool offers the possibility to use the original grouping of submatches for increased power in identifying differences between animals that had a similar prognosis based on the baseline variables. For downstream analyses, power and longitudinal regression curves can be generated.

Importantly, MANILA allows power calculations based on a representative simulated dataset or a pilot study. In contrast to providing rather straight-forward expert-curated expected effect sizes and variance, regression-model power calculations are often more complex and unintuitive, due to both experimental and modelling considerations. To this end, stratified bootstrapping (sampling with replacement) is offered for sampling groups of observations. As multiple longitudinal observations are nested within an individual, this scheme samples individuals correctly for mixed-effects modeling and also considers complex phenomena such as right-censoring due to moribund animals. As such effects are difficult or impossible to provide to power calculations as mere parameter estimates, this simulation approach leverages computational power to utilize existing or representative data for future studies with similar setup, and hence increases research reproducibility and reliability.
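The resampling scheme described above, drawing whole individuals with replacement within each arm so that their longitudinal observations stay together, can be sketched in base R. The pilot data, the column names, and the helper `bootstrap_animals()` are hypothetical and written for this illustration. They are not functions of MANILA or hamlet, and right-censoring is not modeled here.

```r
# Hypothetical pilot study: 2 arms, 5 animals per arm, 4 time points each
set.seed(99)
pilot <- expand.grid(day = c(0, 7, 14, 21), animal = 1:5,
                     arm = c("control", "treated"))
pilot$id <- paste(pilot$arm, pilot$animal, sep = "_")
growth <- ifelse(pilot$arm == "control", 10, 6)
pilot$volume <- 100 + growth * pilot$day + rnorm(nrow(pilot), sd = 15)

# Stratified bootstrap: resample animals (not single rows) within each arm
bootstrap_animals <- function(data, n_per_arm) {
  pieces <- lapply(split(data, data$arm), function(arm_data) {
    ids <- sample(unique(arm_data$id), n_per_arm, replace = TRUE)
    do.call(rbind, lapply(seq_along(ids), function(k) {
      rows <- arm_data[arm_data$id == ids[k], ]
      rows$boot_id <- paste(ids[k], k, sep = ".")  # repeated draws get distinct IDs
      rows
    }))
  })
  do.call(rbind, pieces)
}

# One simulated future study with 8 animals per arm
boot <- bootstrap_animals(pilot, n_per_arm = 8)
print(table(boot$arm, boot$day))
```

For a power estimate, one would repeat the draw many times, fit the planned mixed-effects model to each resampled dataset using `boot_id` as the grouping factor, and record the fraction of fits in which the treatment effect reaches the chosen significance level.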

The MANILA tool comes with a step-by-step user guide and the tool is hosted at the University of Turku: https://biomedportal.utu.fi/utu-apps/Rvivo/

Alternatively, the user may download the R Shiny web app and run it locally, for example, to increase the speed of the sampling-based power calculations. The underlying R-package hamlet is open source and freely available for expert users, and offers functionality that goes beyond the graphical interface.

## Collection of resources for the use of R in study design and analysis

In the recent report on “Digital tools and services to support research replicability and verifiability”, Caroline Skirrow and Marcus Munafò have argued that “publication of analysis scripts can help to improve transparency of data analytic methods, allowing greater replicability and verifiability of scientific results”.

We strongly support the transparency of analytical methods and the use of software packages that allow analysis scripts to be saved and shared. We have promoted and referred to such tools in previous issues of the Newsletter.

To support the use of script-based methods in study design and analysis, PAASP has established an online repository and invites all colleagues to share their examples, links to useful R packages and literature.

This repository will aim to present information not only on the methods commonly available using conventional statistical software packages (t-test, ANOVA) but also methods that are increasingly recognized as important but may be less known and accessible for preclinical biologists and pharmacologists (e.g. equivalence testing).