Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  87
1 file changed, 87 insertions(+), 0 deletions(-)
diff --git a/paper.tex b/paper.tex
index 8f11870..070223e 100644
--- a/paper.tex
+++ b/paper.tex
@@ -676,6 +676,93 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
\else
\clearpage
\appendices
+\section{Necessity for reproducible research}
+\label{sec:introduction}
+
+The increasing volume and complexity of data analysis have been highly productive, giving rise to ``Big Data'' as a new field in many branches of the sciences and industry.
+However, given this inherent complexity, the bare results are of limited use on their own.
+Questions such as these commonly follow any such result:
+What inputs were used?
+What operations were done on those inputs? How were the configurations or training data chosen?
+How were the quantitative results visualized in the final demonstration plots and figures, or translated into the narrative/qualitative interpretation?
+Could there be a bias in the visualization?
+See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
+
+In data science and database management, this type of metadata is commonly known as \emph{data provenance} or \emph{data lineage}.
+Data lineage is increasingly being demanded for integrity checking in both the scientific and the industrial/legal domains.
+Notable examples in each domain are, respectively, the ``reproducibility crisis'' in the sciences, reported by the journal Nature after a large survey \citeappendix{baker16}, and the General Data Protection Regulation (GDPR) of the European Parliament together with the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively.
+The former argues that reproducibility (as a test of sufficiently conveying the data lineage) is necessary for other scientists to study, check and build upon each other's work.
+The latter requires the data-intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage.
+Besides regulation and integrity checks, having robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation with alternative methods, and optimization of the workflow.
+
+In the sciences, the results of a project's analysis are published as scientific papers, which have also been the primary conveyor of the result's lineage: usually in narrative form, within the ``Methods'' section of the paper.
+From our own experience, this section is usually the most discussed part during peer review and conference presentations, showing its importance.
+After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology.
+In industry, however, data governance is usually kept as a trade secret and is not published or publicly scrutinized.
+Therefore, the main practical focus here will be on the scientific front, which has traditionally been more open to publishing methods and to anonymous peer scrutiny.
+
+\begin{figure*}[t]
+ \begin{center}
+ \includetikz{figure-project-outline}{width=\linewidth}
+ \end{center}
+ \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step.
+ The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs.
+ The red boxes with dashed edges highlight the main questions on the respective stage.
+  The orange box surrounding the software download and build phases marks the various commonly recognized solutions to the questions within it; for more, see Appendix \ref{appendix:independentenvironment}.
+ }
+\end{figure*}
+
+The traditional format of a scientific paper has been very successful in conveying the method along with the result over the last few centuries.
+However, the complexity mentioned above has made it impossible to describe all the analytical steps of most modern projects to a sufficient level of detail in this format.
+Citing this difficulty, many authors limit themselves to describing the very high-level generalities of their analysis, while in practice even the most basic calculations (like the mean of a distribution) can depend on the software implementation.
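+As a minimal illustration of this last point (a hypothetical sketch, not taken from any of the cited works, and assuming Python with NumPy is available), the following snippet computes the mean of the same single-precision dataset with three different implementations:
+\begin{verbatim}
+import numpy as np
+
+# One million single-precision samples in [0, 1).
+rng = np.random.default_rng(seed=1)
+x = rng.random(1_000_000).astype(np.float32)
+
+# Naive sequential accumulation in single precision.
+naive = np.float32(0.0)
+for v in x:
+    naive += v
+print(naive / np.float32(x.size))
+
+# NumPy's pairwise summation, still in single precision.
+print(x.mean())
+
+# Accumulation in double precision.
+print(x.mean(dtype=np.float64))
+\end{verbatim}
+The three printed values typically differ in their trailing significant digits, even though each is a legitimate ``mean'' of identical input data; without knowing the exact implementation, such differences cannot be explained or reproduced.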
+
+Due to the complexity of modern scientific analysis, a small deviation in the final result can originate from any of many different steps, and such a deviation may be scientifically significant.
+Publishing the precise code of the analysis is the only way to guarantee that its cause can be found.
+For example, \citeappendix{smart18} describes how a 7-year-old conflict in theoretical condensed-matter physics was only identified after the relevant codes were shared.
+Nature is already a black box that we are trying hard to unlock and understand.
+Not being able to experiment on the methods of other researchers adds an artificial, self-imposed black box wrapped around the original one, and consumes much of researchers' energy.
+
+An example showing the importance of sharing code is \citeappendix{miller06}, which found that a column had been mistakenly flipped, leading to the retraction of 5 papers in major journals, including Science.
+\citeappendix{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
+\citeappendix{herndon14} and \citeappendix{horvath15} also reported similar situations, and \citeappendix{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene-name conversions.
+Such integrity checks are a critical component of the scientific method, but they are only possible with access to the data and code, and \emph{cannot be done with the published paper alone}.
+
+The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (for example, hosted on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors?
+Several studies have attempted to answer this with different levels of detail.
+For example, \citeappendix{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they have used, while \cite{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved in medical journals over the last two decades.
+
+\citeappendix{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but only fully succeeded in 2 of them and partially in 6.
+\citeappendix{chang15} attempted to reproduce 67 papers in well-regarded economics journals with data and code: only 22 could be reproduced without contacting the authors, and more than half could not be replicated at all.
+\citeappendix{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
+Even though the authors were contacted, the success rate was $26\%$.
+Generally, this problem is unambiguously felt in the community: \citeappendix{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``reproducibility crisis''.
+
+This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citeappendix{gabriel11}.
+The proposed solutions were published in a special issue.
+Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then.
+Before that, \citeappendix{ioannidis05} argued that ``most claimed research findings are false''.
+In the 1990s, \cite{schwab2000}, \citeappendix{buckheit1995} and \cite{claerbout1992} described this same problem very eloquently and also provided some solutions that they used.
+While the situation has improved since the early 1990s, the problems mentioned in these papers will resonate strongly with the frustrations of today's scientists.
+Even earlier, through his famous quartet, \citeappendix{anscombe73} qualitatively showed how distancing researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
+One of the earliest such efforts we found was \citeappendix{roberts69}, who discussed conventions in FORTRAN programming and documentation to help with publishing research codes.
+
+From a practical point of view, for those who publish the data lineage, a major problem is the fast-evolving and diverse set of software technologies and methodologies used by different teams in different epochs.
+\citeappendix{zhao12} describe it as ``workflow decay'' and recommend preserving these auxiliary resources.
+But in the case of software, this is not as straightforward as for data: if preserved in binary form, software can only be run on specific hardware, and if kept as source code, its build dependencies and build configuration must also be preserved.
+\citeappendix{gronenschild12} specifically study the effect of software version and environment, and encourage researchers not to update their software environment.
+However, this is not a practical solution because software updates are necessary, at least to fix bugs in the same research software.
+Generally, software is not an interchangeable component of a project, in which one package can easily be swapped with another.
+Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citeappendix{dicosmo19}.
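+As a minimal, hypothetical sketch of one mitigation (not a solution proposed in any of the cited works; it assumes a Python-based analysis and that the named packages are installed), a project can at least record the exact run-time versions of its dependencies next to every published result, so that a later deviation can be traced back to a change in the environment:
+\begin{verbatim}
+# Record the run-time software environment next to a result
+# (Python 3.8+; the package names below are only examples).
+import json
+import platform
+from importlib import metadata
+
+env = {
+    "python": platform.python_version(),
+    "packages": {pkg: metadata.version(pkg)
+                 for pkg in ("numpy", "scipy", "matplotlib")},
+}
+with open("result-environment.json", "w") as f:
+    json.dump(env, f, indent=2)
+\end{verbatim}
+Such a record does not preserve the build dependencies or build configuration discussed above, but it at least makes later ``workflow decay'' diagnosable.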
+
+
+
+
+
+
+
+
+
+
\section{Survey of existing tools for various phases}
\label{appendix:existingtools}