Is N-Hacking Ever OK? A simulation-based study

After an experiment has been completed and analyzed, a trend may be observed that is “not quite significant”. In this situation, researchers sometimes incrementally grow their sample size N in an effort to reach statistical significance. This is especially tempting when samples are very costly or time-consuming to collect, so that collecting an entirely new, larger sample (the statistically sanctioned alternative) would be prohibitive. Such post hoc sampling, or “N-hacking”, is condemned because it leads to an excess of false positive results. Here, Monte Carlo simulations are used to show why and how incremental sampling causes false positives, but also to challenge the claim that it necessarily produces alarmingly high false positive rates. In a parameter regime representative of practice in many research fields, the simulations show that the inflation of the false positive rate is modest and easily bounded.

But the effect on the false positive rate is only half the story. What many researchers really want to know is the effect N-hacking would have on the likelihood that a positive result reflects a real, replicable effect. This question has not been considered in the reproducibility literature. The answer depends on the effect size and the prior probability of an effect. Although these values are not known in practice, simulations show that for a wide range of values the positive predictive value (PPV) of results obtained by N-hacking is in fact higher than that of non-incremented experiments of the same sample size and statistical power, because the increase in false positives is more than offset by the increase in true positives. In many situations, therefore, adding a few samples to shore up a nearly significant result is in fact statistically beneficial. Uncorrected N-hacking does elevate false positives, but in some common situations it does not reduce PPV, which had not been shown previously. In conclusion, if samples are added after an initial hypothesis test, this should be disclosed, and any stated false positive rate should be corrected. Contrary to widespread belief, however, collecting additional samples to resolve a borderline P value is not invalid, and can confer previously unappreciated advantages for efficiency and positive predictive value.
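To make the mechanism concrete, below is a minimal Monte Carlo sketch in the spirit of the article's simulations. The specific parameters (an initial N of 12 per group, a single increment of 4, and a “promising” window of 0.05 < P < 0.10) are illustrative assumptions, not the article's exact protocol. Under a true null hypothesis, retesting after incrementing yields a false positive rate somewhat above the nominal 0.05:

```python
# Minimal Monte Carlo sketch of N-hacking under the null hypothesis.
# All parameters (initial N, increment size, "promising" p-value window)
# are illustrative choices, not the exact values used in the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_experiments(n_sims=20_000, n_init=12, n_add=4, alpha=0.05, p_max=0.10):
    fixed_pos = 0    # false positives with the fixed-N rule
    hacked_pos = 0   # false positives when "promising" results are incremented
    for _ in range(n_sims):
        # Two groups drawn from the same distribution: the null is true.
        a = rng.normal(size=n_init)
        b = rng.normal(size=n_init)
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            fixed_pos += 1
            hacked_pos += 1
        elif p < p_max:
            # "Not quite significant": add a few samples and retest once.
            a = np.concatenate([a, rng.normal(size=n_add)])
            b = np.concatenate([b, rng.normal(size=n_add)])
            if stats.ttest_ind(a, b).pvalue < alpha:
                hacked_pos += 1
    return fixed_pos / n_sims, hacked_pos / n_sims

fp_fixed, fp_hacked = run_experiments()
print(f"false positive rate, fixed N:  {fp_fixed:.3f}")   # close to 0.05
print(f"false positive rate, N-hacked: {fp_hacked:.3f}")  # modestly above 0.05
```

Because only a narrow band of P values triggers a single small increment, the inflation stays bounded, consistent with the article's claim that the excess of false positives is modest in this regime.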

LINK

Why we need to report more than ‘Data were Analyzed by t-tests or ANOVA’

In this article, Weissgerber et al. present findings suggesting that scientific publications often lack sufficient information about the statistical methods used, information that independent scientists need in order to replicate the published results.

The authors evaluated the quality of reporting of statistical tests (such as t-tests and ANOVA) in 328 research papers published in physiology journals in June 2017. They found that 84.5% of the papers used ANOVA, t-tests, or both. Although there are different types of ANOVA, 95% of the articles that used ANOVA did not indicate which type was performed. Likewise, many papers did not specify which type of t-test was used. As a consequence, the lack of transparent statistical reporting does not allow others to judge whether the most appropriate test was selected or to verify the reported results. The authors conclude that “the findings of the present study highlight the need for investigators, journal editors and reviewers to work together to improve the quality of statistical reporting in submitted manuscripts”.
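The distinction is not pedantic. As a minimal illustration (with simulated data; the variants are chosen for the example, not taken from the surveyed papers), Student's and Welch's versions of the two-sample t-test can give noticeably different p-values on the same data when group variances differ:

```python
# Why the exact test matters: on the same data, Student's t-test (equal
# variances assumed) and Welch's t-test (unequal variances) can give
# noticeably different p-values. The data here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=10)
group_b = rng.normal(loc=1.0, scale=3.0, size=30)

student = stats.ttest_ind(group_a, group_b, equal_var=True)   # Student's t-test
welch = stats.ttest_ind(group_a, group_b, equal_var=False)    # Welch's t-test

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

A paper stating only that “data were analyzed by t-tests” leaves the reader unable to tell which of these results was reported, or whether the assumption of equal variances was justified.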

LINK

Visualization of Biomedical Data

The rapid increase in the volume and complexity of biomedical data requires changes in research, communication, and clinical practice. This includes learning how to effectively present and visualize data sets and research outcomes so that complex phenomena are expressed clearly. In this review, the authors summarize key principles and resources from data visualization research that help address this difficult challenge, and they discuss four common misconceptions: 1) “The goal of data visualization is to impress.” 2) “Data visualization is easy.” 3) “Studying data visualization is unnecessary.” 4) “Visualization is just a synonym for imaging.” The authors then survey how visualization is being used in a selection of emerging biomedical research areas and highlight common poor visualization practices. O’Donoghue and colleagues also outline ongoing initiatives aimed at improving visualization practices in biomedical research through better tools, peer-to-peer learning, and interdisciplinary collaboration with computer scientists, science communicators, and graphic designers. These changes are revolutionizing how scientists see and think about their data.

LINK

E-Learning tools and MOOCs II

E-learning platforms and Massive Open Online Courses (MOOCs) are rapidly gaining popularity and represent an emerging model for education and training that delivers videotaped lectures and other course materials over the Internet to students and scientists at many different levels. These novel educational technologies are not only tools for learning; they also have the potential to become a catalyst for changing the entire education system.
In this issue we introduce selected MOOCs that focus on the statistical analysis of experiments and data and introduce important tools and mathematical concepts:

Coursera: Mathematical Biostatistics Boot Camp 1
This class presents the fundamental probability and statistical concepts used in elementary data analysis. It is taught at an introductory level; a small amount of linear algebra and programming is useful for the class, but not required.
edX: Introduction to Applied Biostatistics: Statistics for Medical Research
This Applied Biostatistics course provides an introduction to important concepts and reasoning in medical statistics. Each topic is introduced with examples from published clinical research papers, and all homework assignments expose learners to hands-on data analysis using real-life datasets. The course also introduces basic epidemiological concepts, covering study designs and sample size computation. Open-source, easy-to-use software such as R Commander and the PS sample size software is used.
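As a flavor of the sample size computations the course covers, the sketch below answers a standard design question. The course itself uses R Commander and the PS program; this Python version with statsmodels is merely an equivalent illustration, with conventional example parameters (Cohen’s d = 0.5, 80% power, alpha = 0.05):

```python
# A typical sample size computation: how many subjects per group does a
# two-sample t-test need to detect a medium effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05? (Illustrative parameters only.)
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.1f}")  # about 64
```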
DataCamp: Introduction to R
In this introduction to R, the basics of this open-source language, including factors, lists, and data frames, are presented. With the knowledge gained in this course, learners will be able to undertake their own first data analysis. With over 2 million users worldwide, R is rapidly becoming the leading programming language in statistics and data science. Every year the number of R users grows by 40%, and an increasing number of organizations use it in their day-to-day activities.
edX: Data Analysis for Life Sciences 1: Statistics and R
This course covers the basics of statistical inference needed to understand and compute p-values and confidence intervals, all while analyzing data with R. R programming examples are provided in a way that helps make the connection between concepts and implementation. Problem sets requiring R programming are used to test understanding and the ability to implement basic data analyses. Visualization techniques are used to explore new data sets and determine the most appropriate approach. Robust statistical techniques are described as alternatives when data do not fit the assumptions required by the standard approaches. By using R scripts to analyze data, the basics of conducting reproducible research can be learned.
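To illustrate the connection between concepts and implementation that this course emphasizes, here is a minimal sketch of its core computation: a two-sample t-test p-value together with a 95% confidence interval for the difference in means, built directly from the t distribution. The course works in R; this Python version with simulated data is only an equivalent illustration:

```python
# A two-sample t-test p-value plus a 95% confidence interval for the
# difference in means, constructed from the t distribution. Data are
# simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(loc=5.0, scale=1.0, size=20)
treated = rng.normal(loc=5.8, scale=1.0, size=20)

t_stat, p_value = stats.ttest_ind(treated, control)

# 95% CI for the mean difference (equal sample sizes, so this standard
# error coincides with the pooled-variance standard error).
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
df = len(treated) + len(control) - 2
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"p = {p_value:.4f}, 95% CI for difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```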