-rw-r--r-- | paper.tex                          | 87
-rw-r--r-- | tex/src/figure-project-outline.tex |  2
-rw-r--r-- | tex/src/references.tex             | 10
3 files changed, 93 insertions, 6 deletions
@@ -676,6 +676,93 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
 \else
 \clearpage
 \appendices
+\section{Necessity for reproducible research}
+\label{sec:introduction}
+
+The increasing volume and complexity of data analysis have been highly productive, giving rise to the new branch of ``Big Data'' in many fields of the sciences and industry.
+However, given this inherent complexity, the results alone are barely useful.
+Questions such as these commonly follow any such result:
+What inputs were used?
+What operations were done on those inputs? How were the configurations or training data chosen?
+How were the quantitative results visualized in the final demonstration plots, figures, or narrative/qualitative interpretation?
+Could there be a bias in the visualization?
+See Figure \ref{fig:questions} for a more detailed visual representation of such questions at various stages of the workflow.
+
+In data science and database management, this type of metadata is commonly known as \emph{data provenance} or \emph{data lineage}.
+Data lineage is increasingly demanded for integrity checking in both the scientific and industrial/legal domains.
+Notable examples in each domain are, respectively, the ``reproducibility crisis'' in the sciences that was reported by the journal Nature after a large survey \citeappendix{baker16}, and the General Data Protection Regulation (GDPR) by the European Parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively.
+The former argues that reproducibility (as a test of sufficiently conveying the data lineage) is necessary for other scientists to study, check and build upon each other's work.
+The latter requires the data-intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage.
+Besides regulation and integrity checks, having robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation with alternative methods, and optimization of the workflow.
+
+In the sciences, the results of a project's analysis are published as scientific papers, which have also been the primary conveyor of the result's lineage: usually in narrative form, within the ``Methods'' section of the paper.
+From our own experience, this section is usually the most discussed during peer review and conference presentations, showing its importance.
+After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology.
+In industry, however, data governance is usually kept as a trade secret and is not publicly published or scrutinized.
+Therefore the main practical focus here will be on the scientific front, which has traditionally been more open to publishing methods and to anonymous peer scrutiny.
+
+\begin{figure*}[t]
+  \begin{center}
+    \includetikz{figure-project-outline}{width=\linewidth}
+  \end{center}
+  \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions at each step.
+    The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs.
+    The red boxes with dashed edges highlight the main questions at the respective stage.
+    The orange box surrounding the software download and build phases shows the various commonly recognized solutions to the questions in it; for more, see Appendix \ref{appendix:independentenvironment}.
+  }
+\end{figure*}
+
+The traditional format of a scientific paper has been very successful in conveying the method along with the result over the last centuries.
+However, the complexity mentioned above has made it impossible to describe all the analytical steps of most modern projects to a sufficient level of detail.
+Citing this difficulty, many authors settle for describing only the very high-level generalities of their analysis, even though the most basic calculations (like the mean of a distribution) can depend on the software implementation.
+
+Due to the complexity of modern scientific analysis, a small deviation in the final result can originate in any of many different steps, and may be significant.
+Publishing the precise codes of the analysis is the only guarantee of tracing its origin.
+For example, \citeappendix{smart18} describes how a 7-year-old conflict in theoretical condensed matter physics was only identified after the relevant codes were shared.
+Nature is already a black box that we are trying hard to unlock and understand.
+Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, that takes up much of researchers' energy.
+
+An example showing the importance of sharing code is \citeappendix{miller06}, who found a mistaken column flipping that led to the retraction of 5 papers in major journals, including Science.
+\citeappendix{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
+\citeappendix{herndon14} and \citeappendix{horvath15} also reported similar situations, and \citeappendix{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene-name conversions.
+Such integrity checks are a critical component of the scientific method, but are only possible with access to the data and codes, and \emph{cannot be resolved by the published paper alone}.
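The claim above, that even a mean can depend on the software implementation, can be made concrete with a minimal sketch (our own illustration, not part of the paper's diff): floating-point addition is not associative, so two implementations that merely accumulate terms in a different order can report different "means" for the same data.

```python
# Minimal illustration (hypothetical data): floating-point addition is not
# associative, so the result of a plain mean depends on the accumulation
# order an implementation happens to use.
import math

data = [0.1, 0.2, 0.3]

forward = sum(data)              # ((0.1 + 0.2) + 0.3) -> 0.6000000000000001
backward = sum(reversed(data))   # ((0.3 + 0.2) + 0.1) -> 0.6
accurate = math.fsum(data)       # error-compensated summation -> 0.6

# Same data, two unequal "sums", hence two unequal means.
print(forward == backward)       # False
print(forward / 3, backward / 3)
```

With only three values the discrepancy sits in the last digit; over the millions of terms in a real analysis, such implementation-dependent rounding can accumulate into the low-order digits of a published number.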
+
+The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (assumed to be available on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors?
+Several studies have attempted to answer this question at different levels of detail.
+For example, \citeappendix{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they used, while \cite{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved in medical journals over the last two decades.
+
+\citeappendix{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but only fully succeeded in 2 of them and partially in 6.
+\citeappendix{chang15} attempted to reproduce 67 papers in well-regarded economics journals with data and code: only 22 could be reproduced without contacting the authors, and more than half could not be replicated at all.
+\citeappendix{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
+Even though the authors were contacted, the success rate was $26\%$.
+Generally, this problem is unambiguously felt in the community: \citeappendix{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``reproducibility crisis''.
+
+This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citeappendix{gabriel11}.
+The proposed solutions were published in a special edition.
+Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then.
+Before that, \citeappendix{ioannidis05} argued that ``most claimed research findings are false''.
+In the 1990s, \cite{schwab2000}, \citeappendix{buckheit1995} and \cite{claerbout1992} described this same problem very eloquently and also provided some solutions that they used.
+While the situation has improved since the early 1990s, the problems mentioned in these papers will resonate strongly with the frustrations of today's scientists.
+Even earlier, through his famous quartet, \citeappendix{anscombe73} qualitatively showed how the distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
+One of the earliest such efforts we found was \citeappendix{roberts69}, who discussed conventions in FORTRAN programming and documentation to help in publishing research codes.
+
+From a practical point of view, for those who publish the data lineage, a major problem is the fast-evolving and diverse set of software technologies and methodologies that are used by different teams in different epochs.
+\citeappendix{zhao12} describe this as ``workflow decay'' and recommend preserving these auxiliary resources.
+But in the case of software, it is not as straightforward as for data: if preserved in binary form, software can only be run on certain hardware, and if kept as source code, its build dependencies and build configuration must also be preserved.
+\citeappendix{gronenschild12} specifically study the effect of software version and environment, and encourage researchers not to update their software environment.
+However, this is not a practical solution, because software updates are necessary, at least to fix bugs in the same research software.
+Generally, software is not an interchangeable component of a project, where one package can easily be swapped with another.
+Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citeappendix{dicosmo19}.
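Anscombe's point above can be sketched concretely (our own illustration using the published values of quartet datasets I and II; the helper function is ours, not from the paper): two datasets with visibly different structure, one roughly linear and one a parabola, share nearly identical summary statistics, so detached, statistics-only reporting can mislead.

```python
# Anscombe's quartet, datasets I and II (published 1973 values).
# Both share the same x values; y1 is roughly linear in x, y2 is a parabola.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

for y in (y1, y2):
    # Despite the very different shapes, both datasets report a mean
    # of ~7.50 and a correlation with x of ~0.82.
    print(round(sum(y) / len(y), 2), round(pearson(x, y), 2))
```

Only a plot, or access to the full analysis, reveals that the two datasets tell entirely different stories; the summary numbers alone cannot.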
+
+
+
+
+
+
+
+
+
+
 \section{Survey of existing tools for various phases}
 \label{appendix:existingtools}
diff --git a/tex/src/figure-project-outline.tex b/tex/src/figure-project-outline.tex
index 4cd933d..e807e5d 100644
--- a/tex/src/figure-project-outline.tex
+++ b/tex/src/figure-project-outline.tex
@@ -1,4 +1,4 @@
-% Copyright (C) 2018-2019 Mohammad Akhlaghi <mohammad@akhlaghi.org>
+% Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
 %
 % This LaTeX source is free software: you can redistribute it and/or
 % modify it under the terms of the GNU General Public License as
diff --git a/tex/src/references.tex b/tex/src/references.tex
index dc3b816..ad38508 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -251,12 +251,12 @@ archivePrefix = {arXiv},

 @ARTICLE{dicosmo19,
-  author = {Roberto {Di Cosmo} and Francois Pellegrini},
+  author = {M\'elanie Cl\'ement-Fontaine and Roberto Di Cosmo and Bastien Guerry and Patrick Moreau and Francois Pellegrini},
   title = {Encouraging a wider usage of software derived from research},
   year = {2019},
-  journal = {\doihref{https://www.ouvrirlascience.fr/wp-content/uploads/2020/02/Opportunity-Note\_software-derived-from-research\_EN.pdf}{Ouvrir la science}},
+  journal = {Ouvrir la science},
   volume = {},
-  pages = {},
+  pages = {\href{https://hal.archives-ouvertes.fr/hal-02545142}{hal-02545142}},
   doi = {},
 }

@@ -1842,7 +1842,7 @@ Reproducible Research in Image Processing},
   journal = {Large Installation System Administration Conference},
   year = {2004},
   volume = {18},
-  pages = {79. \url{https://www.usenix.org/legacy/events/lisa04/tech/full\_papers/dolstra/dolstra.pdf}},
+  pages = {79, PDF in \href{https://www.usenix.org/legacy/events/lisa04/tech/full\_papers/dolstra/dolstra.pdf}{LISA04 webpage}},
 }

@@ -1950,7 +1950,7 @@ Reproducible Research in Image Processing},
   title = {Cake: a fifth generation version of make},
   journal = {University of Melbourne},
   year = {1987},
-  pages = {1: \url{https://pdfs.semanticscholar.org/3e97/3b5c9af7763d70cdfaabdd1b96b3b75b5483.pdf}},
+  pages = {\href{https://pdfs.semanticscholar.org/3e97/3b5c9af7763d70cdfaabdd1b96b3b75b5483.pdf}{Corpus ID: 107669553}},
 }