The Embassy of Good Science – An online platform fostering research integrity

By Iris Lechner
The field of research integrity is growing substantially. An increasing number of guidelines and initiatives to foster responsible research practices are being implemented worldwide.
Individual researchers, however, sometimes find it difficult to know which policies, codes and rules of good research practice apply to them in their specific context. In addition, surviving in academia is not easy, and shortcuts are still too often rewarded. To make research integrity information easily accessible to researchers, the European consortium ENTIRE developed the online platform The Embassy of Good Science. Here, a wide range of information, resources, and tools can be found.
Uniquely, the platform is both for researchers and made by researchers. The research community can easily search for and find relevant information to learn about good research practices relevant to their work. The Embassy contains short explanatory ‘theme pages’ that introduce specific research integrity topics in a way that is understandable for all researchers. A wide range of topics has already been covered, including plagiarism, authorship, the FAIR principles, p-hacking and conflicts of interest.
This unique approach is made possible by Semantic MediaWiki, which allows individual researchers to add and edit information. Theme pages are automatically linked to the relevant resources on the platform, including guidelines, cases and educational tools. For example, if you read a theme page on plagiarism, a plagiarism case and related online training are shown on the same page. In this way, researchers can access all the relevant information needed to navigate the complex web that research integrity has become. Online training modules on specific research integrity topics will also be made available shortly, aimed at increasing researchers’ knowledge of and commitment to research integrity in their everyday practice.
Interested in the platform? Check out this short video and explore The Embassy of Good Science.
The EQIPD project is also present on The Embassy of Good Science platform and summarized HERE.

Biological vs technical replicates: Now from a data analysis perspective: R script

This is the R script referenced in the blog post LINK.

## "Effortlessly Read Any Rectangular Data"
library(readit)

## "Linear and Nonlinear Mixed Effects Models"
library(nlme)

## "Groupwise Statistics, LSmeans, Linear Contrasts, Utilities"
library(doBy)

## set graphic display
## (reconstructed: three panels side by side, as in the figure)
par(mfrow=c(1,3))

## acquire dataset
ab <- data.frame(readit("TOYEXAMPLE.xls"))

## generate mean data by subject
amean <- summaryBy(A~group+subject,FUN=mean,data=ab)

## detailed descriptives, function available on request
stats(amean$A.mean, by=amean$group)

## plot result 1
## (reconstructed call; the original plotting code was not preserved)
boxplot(A.mean~group,data=amean)

## do analysis of variance
summary(aov(A.mean~group,data=amean))

## detailed descriptives
stats(ab$A, by=ab$group)
stats(ab$B, by=ab$group)

## plot result 2
## (reconstructed call; the original plotting code was not preserved)
boxplot(A~group,data=ab)

## plot result 3
## (reconstructed call; the original plotting code was not preserved)
boxplot(B~group,data=ab)

## do mixed effects modeling 1
a.1 <- lme(A~group,random=~1|subject,data=ab)
summary(a.1)

## do mixed effects modeling 2
b.1 <- lme(B~group,random=~1|subject,data=ab)
summary(b.1)



Biological vs technical replicates: Now from a data analysis perspective

We have discussed this topic several times before (HERE and HERE). There seems to be a growing understanding that, when reporting an experiment’s results, one should state clearly what experimental units (biological replicates) are included, and, when applicable, distinguish them from technical replicates.

In discussing this topic with various colleagues, it became obvious to us that there is no clarity about best analytic practices and how technical replicates should enter the analysis.

We have approached David L McArthur (at the UCLA Department of Neurosurgery), an expert in study design and analysis, who has been helping us and the Preclinical Data Forum on projects related to data analysis and robust data analysis practices.

A representative example that we wanted to discuss includes 3 treatment groups (labeled A, B, and C) with 6 mice per group and 4 samples processed for each mouse (e.g. one blood draw per mouse separated into four vials and subjected to the same measurement procedure) – i.e. a 3X6X4 dataset.

The text below is based on Dave’s feedback.  Note that Dave is using the term “facet” as an overarching label for anything that contributes to (or fails to contribute to) interpretable coherence beyond background noise in the dataset, and the term “measurement” as a label for the observed value obtained from each sample (rather than the phrase “dependent variable”  often used elsewhere).

Dave has drafted a thought experiment supported by a simulation.  With a simple spreadsheet using only elementary function commands, it’s easy to build a toy study in the form of a flat file representing that 3X6X4 system of data, with the outcome consisting of one measurement in each line of a “tall” datafile, i.e., 72 lines of data with each line having entries for group, subject, sample, and close-but-not-quite-identical measurement (LINK). But, for our purposes, we’ll insert not just measurement A but also measurement B on each line — where we’ve constructed measurement B to differ from measurement A in its variability but otherwise to have identical group means and subject means.  (As shown in Column E, this can be done easily: take each A value, jitter it by uniform application of some multiplier, then subtract out any per-subject mean difference to obtain B.)  With no loss of meaning, in this dataset measurement A has just a little variation from one measurement to the next within a given subject, but because of that multiplier, measurement B has a lot of variation from one measurement to the next within a given subject.
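A tall 3X6X4 datafile with paired measurements A and B can be generated programmatically as well as in a spreadsheet. The sketch below is in Python rather than the blog's R, and the group centers, jitter range and multiplier are assumed values chosen only to mimic the construction described above (B equals A scaled by a multiplier, with the per-subject mean shift removed):

```python
import random
import statistics

random.seed(1)
K = 20  # assumed variability multiplier for measurement B

# assumed group centers echoing the toy example's group means
centers = {"A": 0.85, "B": 1.45, "C": 2.05}

rows = []  # tall file: one line per (group, subject, sample, A, B)
for grp, mu in centers.items():
    for subj in range(1, 7):                                      # 6 mice per group
        a = [mu + random.uniform(-0.05, 0.05) for _ in range(4)]  # 4 samples each
        # B: scale A by K, then subtract the per-subject mean shift so that
        # subject means (and hence group means) stay identical to A's
        b = [K * x - (K - 1) * statistics.mean(a) for x in a]
        for samp, (av, bv) in enumerate(zip(a, b), start=1):
            rows.append((grp, f"{grp}{subj}", samp, av, bv))

print(len(rows))  # 72 lines: 3 groups x 6 subjects x 4 samples
```

Within each subject, B then has exactly K times the spread of A while sharing the subject mean, which is the pattern visible in the descriptive summaries that follow.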

A 14-term descriptive summary shows that using all values of measurement A, across groups, results in:

                group A    group B    group C
robust min       0.3000     0.9000     1.5000
hdQ: 0.25        0.6380     1.2380     1.8380   (25th quantile, the lower box bound of a boxplot)
hdQ: 0.75        1.0620     1.6620     2.2620   (75th quantile, the upper box bound of a boxplot)
robust max       1.4000     2.0000     2.6000
Huber mu         0.8500     1.4500     2.0500
Shapiro p        0.9703     0.9703     0.9703

while, using all values of measurement B, across groups, results in:

                group A    group B    group C
mean             0.8500     1.4500     2.0500   <- identical group means
SD               5.7131     5.7131     5.7131   <- group standard deviations about 20 times larger
robust min      -6.9000    -6.3000    -5.7000
hdQ: 0.25       -4.2657    -3.6657    -3.0657
median           0.8500     1.4500     2.0500   <- identical group medians
hdQ: 0.75        5.9657     6.5657     7.1657
robust max       8.6000     9.2000     9.8000
skew            -0.0000    -0.0000    -0.0000   <- identical group skews
kurtosis        -1.3908    -1.3908    -1.3908   <- greater kurtoses, no surprise
Huber mu         0.8500     1.4500     2.0500   <- identical Huber estimates of group centers
Shapiro p        0.0078     0.0078     0.0078   <- suspiciously low p-values for test of normality, no surprise
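The pattern in these two summaries is exactly what a positive affine transform predicts: scaling A by a multiplier and subtracting the per-subject mean shift leaves the mean (and, for symmetric data, the median) and the skew unchanged while multiplying the standard deviation by the multiplier. A quick Python check with made-up numbers (not the actual toy data):

```python
import statistics

def skew(xs):
    """Standardized third central moment (population version)."""
    m = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)

a = [0.3, 0.7, 0.9, 1.1, 1.5]   # made-up, symmetric values (mean == median)
k = 20                          # the toy example's variability multiplier
b = [k * x - (k - 1) * statistics.mean(a) for x in a]

assert abs(statistics.mean(b) - statistics.mean(a)) < 1e-9      # identical means
assert abs(statistics.median(b) - statistics.median(a)) < 1e-9  # identical medians (symmetry)
assert abs(statistics.stdev(b) - k * statistics.stdev(a)) < 1e-9  # SD scaled by k
assert abs(skew(b) - skew(a)) < 1e-9  # skew invariant under positive affine maps
```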

The left panel in the image below results from simple arithmetical averaging of that dataset’s samples from each subject, with the working dataframe reduced by averaging from 72 lines to 18 lines.  It doesn’t matter here whether we now analyze measurement A or measurement B, as both measurements inside this artificial dataset generate the identical 18-line dataframe, with means of 0.8500, 1.4500, and 2.0500 for groups A, B and C respectively.  Importantly, the sample facet disappears altogether, though we still have group, mouse, measurement and noise.  The simple ANOVA solution for the mean measures shows “very highly significant” differences between the groups.  But wait.
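The left-panel analysis on subject means reduces to a textbook one-way ANOVA on 18 values. As a sketch of that step, the following Python snippet computes the F statistic by hand from made-up subject means (the real ones come from the spreadsheet, so these numbers are illustrative only):

```python
import statistics

# 6 made-up subject means per group, centered near the toy example's group means
groups = {
    "A": [0.80, 0.83, 0.85, 0.86, 0.88, 0.88],
    "B": [1.40, 1.43, 1.45, 1.46, 1.48, 1.48],
    "C": [2.00, 2.03, 2.05, 2.06, 2.08, 2.08],
}

all_vals = [v for vals in groups.values() for v in vals]
grand = statistics.mean(all_vals)
k, n = len(groups), len(all_vals)

# between-group and within-group sums of squares
ss_between = sum(len(v) * (statistics.mean(v) - grand) ** 2 for v in groups.values())
ss_within = sum((x - statistics.mean(v)) ** 2 for v in groups.values() for x in v)

# F = mean square between / mean square within
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 1))
```

With group separation this large relative to the within-group spread, F is huge and the p-value correspondingly tiny, mirroring the "very highly significant" result in the left panel.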

The center panel uses all 72 available datapoints from measurement A.  By definition that’s in the form of a repeated-measures structure, with four non-identical samples provided by each subject.  Mixed effects modeling accounts for all 5 facets here by treating them as fixed (group and sample) or random (subject), or as the object of the equation (measurement), or as residual (noise).  The mixed effects model analysis for measurement A results in “highly significant” differences between groups, though those p-values are not the same as those in the left panel.  But wait.

The right panel uses all 72 available datapoints from measurement B.  Again, it’s a repeated-measures structure, but while the means and medians remain the same, now the standard deviations are 20 times larger than those for measurement A, a feature of the noise facet being intentionally magnified and inserted into the artificial source datafile. The mixed effects model analysis for measurement B results in “not-at-all-close-to-significant” differences between groups; no real surprise.

What does this example teach us?

Averaging technical replicates (as in the left panel) and running statistical analyses on average values means losing potentially important information.  No facet should be dropped from analysis unless one is confident that it can have absolutely no effect on analyses.  A decision to ignore a facet (any facet), drop data and go for a simpler statistical test must in any case be justified and defended.

Further recommendations that are supported by this toy example or that the readers can illustrate for themselves (with the R script LINK) are:

  • There is no reason to use the antiquated method of repeated measures ANOVA; in contrast to RM ANOVA, mixed effects modeling makes no sphericity assumption and handles missing data well.
  • There is no reason to use nested ANOVA in this context:  nesting is applicable in situations when one or another constraint does not allow crossing every level of one factor with every level of another factor.  In such situations with a nested layout, fewer than all levels of one factor occur within each level of the other factor.  By this definition, the toy example here includes no nesting.
  • The expanded descriptive summary can be highly instructive (and is yours to use freely). 

And last but not least, whatever method is used for the analysis, the main message should not be lost – one should be maximally transparent about how the data were collected, what the experimental units were, what the replicates were, and what analyses were used to examine the data.

Is N-Hacking Ever OK? A simulation-based study

After an experiment has been completed and analyzed, a trend may be observed that is “not quite significant”. Sometimes in this situation, researchers incrementally grow their sample size N in an effort to achieve statistical significance. This is especially tempting in situations when samples are very costly or time-consuming to collect, such that collecting an entirely new sample larger than N (the statistically sanctioned alternative) would be prohibitive. Such post-hoc sampling or “N-hacking” is condemned, however, because it leads to an excess of false positive results. Here Monte-Carlo simulations are used to show why and how incremental sampling causes false positives, but also to challenge the claim that it necessarily produces alarmingly high false positive rates. In a parameter regime that would be representative of practice in many research fields, simulations show that the inflation of the false positive rate is modest and easily bounded. But the effect on false positive rate is only half the story. What many researchers really want to know is the effect N-hacking would have on the likelihood that a positive result is a real effect that will be replicable. This question has not been considered in the reproducibility literature. The answer depends on the effect size and the prior probability of an effect. Although in practice these values are not known, simulations show that for a wide range of values, the positive predictive value (PPV) of results obtained by N-hacking is in fact higher than that of non-incremented experiments of the same sample size and statistical power. This is because the increase in false positives is more than offset by the increase in true positives. Therefore in many situations, adding a few samples to shore up a nearly-significant result is in fact statistically beneficial. It is true that uncorrected N-hacking elevates false positives, but in some common situations this does not reduce PPV, which has not been shown previously. 
In conclusion, if samples are added after an initial hypothesis test this should be disclosed, and if a false positive rate is stated it should be corrected. But, contrary to widespread belief, collecting additional samples to resolve a borderline P value is not invalid, and can confer previously unappreciated advantages for efficiency and positive predictive value.
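The incremental-sampling scenario is straightforward to simulate. The sketch below is not the paper's code: it uses a one-sample z-test with known variance (so null p-values come from the standard normal) and assumed sample sizes and thresholds, purely to show the mechanism by which uncorrected N-hacking inflates the false positive rate, and by how little in this regime:

```python
import random
from statistics import NormalDist, mean

random.seed(42)
Z = NormalDist()

def p_value(sample):
    """Two-sided one-sample z-test against mu = 0, sigma = 1 known."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - Z.cdf(abs(z)))

N0, ADD, ALPHA, WINDOW = 8, 4, 0.05, 0.10  # assumed design parameters
trials = 20_000
fixed_hits = hacked_hits = 0

for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(N0)]  # the null is true
    p = p_value(sample)
    if p < ALPHA:
        fixed_hits += 1
        hacked_hits += 1
    elif p < WINDOW:  # "not quite significant": add samples and retest, uncorrected
        sample += [random.gauss(0, 1) for _ in range(ADD)]
        if p_value(sample) < ALPHA:
            hacked_hits += 1

fixed_fpr = fixed_hits / trials
hacked_fpr = hacked_hits / trials
print(round(fixed_fpr, 3), round(hacked_fpr, 3))
```

Under these assumptions the fixed-N false positive rate sits near the nominal 0.05, while the N-hacked rate is only modestly higher, consistent with the "modest and easily bounded" inflation described above; the PPV comparison additionally requires simulating true effects with an assumed prior.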


Toward Good In Vitro Reporting Standards

Many areas of biomedical science have developed reporting standards and checklists to support the adequate reporting of scientific efforts, but in vitro research still has no generally accepted criteria. In this article, the authors discuss ‘focus points’ of in vitro research, ensuring that the scientific community is able to take full advantage of animal-free methods and studies and that resources spent on conducting these experiments are not wasted. A first priority of reporting standards is to ensure the completeness and transparency of the provided information (data focus). A second tier is the quality of data display, which makes information digestible and easy to grasp, compare, and analyse further (information focus).
This article summarizes a series of initiatives geared towards improving the quality of in vitro work and its reporting – with the ultimate aim to generate Good In Vitro Reporting Standards (GIVReSt).