A commentary based on the two articles by Jiménez et al. and Eglen et al. about the need to publish the software code of computer programs developed and used in research labs
The generation of non-reproducible research data is a huge scientific and social problem that wastes many billions of US dollars and other resources every year. The root cause is undoubtedly multifactorial, and the whole research community is needed to fix it: researchers, publishers, funding agencies, industry, and research and governmental institutions. The question remains: where to start? Two recently published articles (Eglen et al. 2017 and Jiménez et al. 2017) directly address researchers, funding agencies and journal editorial bodies with an almost neglected but nevertheless important piece of the reproducibility puzzle: the development and distribution of research software.
90% of researchers acknowledge that software is important for their research, and many researchers even develop their own software to solve a specific problem they are facing in the lab. This self-developed software then becomes a cornerstone for the generation and/or analysis of research data and thus also plays a crucial part in reproducing the scientific results. Quite often, however, these important software tools are not published alongside the data, which makes it impossible for other scientists to understand how specific data sets were generated and how they can be reproduced.
Eglen et al. 2017 and Jiménez et al. 2017 both emphasize how important it is for peers to understand the software code of these individual applications, and the need to make it available. This is already advocated in the computational sciences, as stated by Eglen et al. in their introduction: “[…] the scholarship is not the article, the scholarship is the complete software […]”.
Why is it so difficult to publish software code in the natural sciences?
Jiménez et al., the authors of the article “Four simple recommendations to encourage best practices in research software”, provide the following recommendations for developing open source software (OSS):
- The source code should be made accessible from day one in a publicly available, version-controlled repository. Only this allows for trust in the software and gives the community the opportunity to understand, judge and enhance it. This can easily be achieved with, for example, GitHub or Bitbucket.
- Provide software metadata so that the context of the software can be better understood. The authors consider it important to provide additional information together with the software code. This might include the source code location, contributors, licence, version, etc.
- Adopt a licence and comply with the licence of third-party dependencies. This adopted licence should clarify how the software can be used, modified and distributed by anybody else.
- Define clear and transparent contribution, governance and communication processes. In this context, it is up to the developers whether they want the community to contribute or not; in any case, these processes should be clarified upfront and made transparent.
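As an illustration of the metadata recommendation, the following is a minimal sketch of how a lab might record the fields named above (source code location, contributors, licence, version) in a machine-readable file. The function names, the field names and all values are hypothetical and are not taken from Jiménez et al.; the repository URL is a placeholder.

```python
import json


def build_metadata(name, version, licence, code_repository, contributors):
    """Collect basic descriptive fields for a piece of research software.

    The field names are illustrative assumptions, not a published standard.
    """
    metadata = {
        "name": name,
        "version": version,
        "licence": licence,
        "code_repository": code_repository,
        "contributors": list(contributors),
    }
    # Refuse to produce a record with empty fields, so that an incomplete
    # description is noticed before the software is shared.
    missing = [field for field, value in metadata.items() if not value]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return metadata


def to_json(metadata):
    """Serialise the metadata so it can be stored next to the source code."""
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical example values.
    record = build_metadata(
        name="spike-sorter",
        version="0.1.0",
        licence="MIT",
        code_repository="https://example.org/lab/spike-sorter",
        contributors=["A. Researcher", "B. Student"],
    )
    print(to_json(record))
```

A file like this could be committed to the repository together with the code, so that the description is versioned along with the software it describes.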
The authors conclude that these recommendations aim to encourage the adoption of best practices and to help create better software for better research. In contrast to most previously published recommendations, which target the software developers themselves, the authors target a wider audience: basically everybody who is producing, investing in or funding research software development. The recommendations were discussed at several workshops and meetings to get feedback from multiple stakeholders.
The article by Eglen and colleagues, published in Nature Neuroscience, is more specific in promoting standard practices for sharing computer code and programs in neuroscience. The authors also see a huge advantage in developers getting into the habit of sharing as much information as possible, not only to help others but also to help themselves. Two main reasons are given for this: A) the code will not be lost when a colleague leaves the lab and B) people tend to write higher-quality code when they know it will become publicly available. Some of the points made by Eglen et al. overlap with the four recommendations mentioned above: adopting a licence, using version control (for example via GitHub) and providing additional information about the software (in a README file) were also mentioned by Jiménez et al. However, Eglen et al. provide some further details. They recommend creating a stable URL (such as a DOI) and complying with Daniel Kahneman’s “reproducibility etiquette”. Furthermore, it would be favorable to publish all experimental data collected alongside the software. Finally, testing the software is a critical step that is often neglected by researchers. The authors therefore recommend including test suites to demonstrate that the software produces the correct results.
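The recommendation to include test suites can be made concrete with a small sketch. The analysis function `mean_firing_rate` and its tests below are hypothetical examples invented for illustration; they are not taken from Eglen et al. The tests are plain functions containing assertions, so they can be run directly or, as an assumption about the lab's tooling, collected by a test runner such as pytest.

```python
import math


def mean_firing_rate(spike_times, duration_s):
    """Return the mean firing rate in spikes per second.

    spike_times: sequence of spike time stamps in seconds (hypothetical input).
    duration_s: length of the recording in seconds; must be positive.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(spike_times) / duration_s


def test_known_rate():
    # Three spikes in two seconds give 1.5 spikes per second.
    assert math.isclose(mean_firing_rate([0.1, 0.5, 0.9], 2.0), 1.5)


def test_no_spikes():
    # An empty recording has a rate of zero.
    assert mean_firing_rate([], 10.0) == 0.0


def test_rejects_non_positive_duration():
    # Invalid durations should fail loudly rather than return a wrong number.
    for bad_duration in (0.0, -1.0):
        try:
            mean_firing_rate([0.1], bad_duration)
        except ValueError:
            continue
        raise AssertionError("expected ValueError for duration %r" % bad_duration)


if __name__ == "__main__":
    test_known_rate()
    test_no_spikes()
    test_rejects_non_positive_duration()
    print("all tests passed")
```

Shipping such tests with the code lets a reader check, on their own machine, that the published software still gives the results its authors expected.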
Without any doubt, the issue discussed here is a very important piece of the puzzle, and these recommendations are crucial steps towards more transparency and a better understanding of how research data were generated. The critical question, however, is how to implement these new recommendations. For researchers, computer software is quite often just a tool for their real aim: producing novel data. It might therefore not be easy to convince them to put more effort and resources into the development and distribution of these “side-tools” unless this is made obligatory or strongly incentivized by funders, publishers and/or policymakers.