From 3f4aa48534a4cae4a1bd3264b2f019eefe87e9ea Mon Sep 17 00:00:00 2001 From: Pedram Ashofteh Ardakani Date: Thu, 9 Apr 2020 22:51:12 +0430 Subject: Fix spelling errors, suggest alternative words I tried to get all the words I knew. Some may be correct in different conventions. It definitely needs a second or third review for spell checking. Suggested some additional formatting, including but not limited to using the LaTeX \textsuperscript{} command for stating dates. Also, some unfamiliar rare words that finished with `-able` or `-ability` may need to be changed. Finding better alternatives to better simplify and ease the `readabiliy` ;-) of the paper - I see it's hard not to use them actually. It has got me wondering what better alternatives are available? We'll find out. --- paper.tex | 270 +++++++++++++++++++++++++++++++------------------------------- 1 file changed, 135 insertions(+), 135 deletions(-) diff --git a/paper.tex b/paper.tex index b041b4a..9384d7c 100644 --- a/paper.tex +++ b/paper.tex @@ -82,19 +82,19 @@ However, given its inherent complexity, the mere results are barely useful alone Questions such as these commonly follow any such result: What inputs were used? What operations were done on those inputs? How were the configurations or training data chosen? -How did the quantiative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation? +How did the quantitative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation? May there be a bias in the visualization? See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow. In data science and database management, this type of metadata are commonly known as \emph{data provenance} or \emph{data lineage}. Their definitions are elaborated with other basic concepts in Section \ref{sec:definitions}. Data lineage is being increasingly demanded for integrity checking from both the scientific and industrial/legal domains. -Notable examples in each domain are respectively the ``Reproducibility crisis'' in the sciences that was claimed by the Nature journal \citep{baker16}, and the General Data Protection Regulation (GDPR) by the European parliment and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively. +Notable examples in each domain are respectively the ``Reproducibility crisis'' in the sciences that was claimed by the Nature journal \citep{baker16}, and the General Data Protection Regulation (GDPR) by the European Parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively. The former argues that reproducibility (as a test on sufficiently conveying the data lineage) is necessary for other scientists to study, check and build-upon each other's work. The latter requires the data intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage. Besides regulation and integrity checks, having a robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation on alternative methods, or optimization of the workflow. -In the sciences, the results of a project's analysis are published as scientific papers which have also been the primary conveyer of the result's lineage: usually in narrative form, within the ``Methods'' section of the paper. 
+In the sciences, the results of a project's analysis are published as scientific papers which have also been the primary conveyor of the result's lineage: usually in narrative form, within the ``Methods'' section of the paper. From our own experiences, this section is usually most discussed during peer review and conference presentations, showing its importance. After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology. In the industry however, data governance is usually kept as a trade secret and isn't publicly published or scrutinized. @@ -122,14 +122,14 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical Nature is already a black box which we are trying hard to unlock, or understand. Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers. -\citet{miller06} found that a mistaken column flipping caused the retraction of 5 papers in major journals, including Science. +\citet{miller06} found that a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science. \citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''. \citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions. Such integrity checks tests are a critical component of the scientific method, but are only possible with access to the data and codes. The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (supposedly on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors? -Several studies have attempted to answer this question with differnet levels of detail. -For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved over the last two decades. +Several studies have attempted to answer this with different levels of detail. +For example \citet{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved in medical journals over the last two decades. \citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but, only fully succeeded in 2 of them and partially in 6. \citet{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all. 
@@ -144,11 +144,11 @@ Before that, \citet{ioannidis05} proved that ``most claimed research findings ar In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} describe this same problem very eloquently and also provided some solutions that they used. While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists. Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results. -One of the earliest such efforts we found was \citet{roberts69}, who discussed conventions in Fortran programming and documentation to help in publishing research codes. +One of the earliest such efforts we found was \citet{roberts69} who discussed conventions in FORTRAN programming and documentation to help in publishing research codes. From a practical point of view, for those who publish the data lineage, a major problem is the fast evolving and diverse software technologies and methodologies that are used by different teams in different epochs. -\citet{zhao12} describe it as ``workflow decay'' and recommend preserving these auxilary resources. -But in the case of software, its not as straightforward as data: if preserved in binary form, software can only be run on certain hardware and if kept as source-code, their build dependencies and build configuration must also be preserved. +\citet{zhao12} describe it as ``workflow decay'' and recommend preserving these auxiliary resources. +But in the case of software its not as straightforward as data: if preserved in binary form, software can only be run on certain hardware and if kept as source-code, their build dependencies and build configuration must also be preserved. \citet{gronenschild12} specifically study the effect of software version and environment and encourage researchers to not update their software environment. However, this is not a practical solution because software updates are necessary, at least to fix bugs in the same research software. Generally, software is not a secular component of projects, where one software can easily be swapped with another. @@ -157,7 +157,7 @@ Projects are built around specific software technologies, and research in softwa \tonote{add a short summary of the advantages of Maneage.} This paper introduces Maneage as a solution to these important issues. -Section \ref{sec:definitions} defines the necessay concepts and terminology used in this paper, leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}. +Section \ref{sec:definitions} defines the necessary concepts and terminology used in this paper leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}. Section \ref{sec:maneage} introduces the implementation of Maneage, going into lower-level details in some cases. Finally, in Section \ref{sec:discussion}, the future prospects of using systems like this template are discussed. After the main body, Appendix \ref{appendix:existingtools} reviews the most commonly used lower-level technologies used today. @@ -199,7 +199,7 @@ Any computer file that may be usable in more than one project. The inputs of a project include data, software source code, etc. (see \citet{hinsen16} on the fundamental similarity of data and source code). 
Inputs may be encoded in plain text (for example tables of comma-separated values, CSV, or processing scripts), custom binary formats (for example JPEG images), or domain-specific data formats \citep[e.g., FITS in astronomy, see][]{pence10}. -Inputs may have initially been created/written (e.g., software soure code) or collected (e.g., data) for one specific project. +Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project. However, they can, and most often will, be used in other/later projects also. Following the principle of modularity, it is therefore optimal to treat the inputs of any project as independent entities, not mixing them with how they are managed (how software is run on the data) within the project (see Section \ref{definition:project}). @@ -215,12 +215,12 @@ Otherwise, they can be published with the project, but as independent files, for \subsection{Definition: output} \label{definition:output} Any computer file that is published at the end of the project. -The output(s) can be datasets (terabyte-sized, small table(s) or image(s), a single number, a true/false (boolean) outcome), automatically generated software source code, or any other file. +The output(s) can be datasets (terabyte-sized, small table(s) or image(s), a single number, a true/false (Boolean) outcome), automatically generated software source code, or any other file. The raw output files are commonly supplemented with a paper/report that summarizes them in a human-friendly readable/printable/narrative format. The report commonly includes highlights of the input/output datasets (or intermediate datasets) as plots, figures, tables or simple numbers blended into the text. -The outputs can either be published independently on data servers which assign specific persistant identifers (PIDs) to be cited in the final report or published paper (in a journal for example). -Alternatively, the datasets can be published with the project source, see for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}. +The outputs can either be published independently on data servers which assign specific persistent identifiers (PIDs) to be cited in the final report or published paper (in a journal for example). +Alternatively, the datasets can be published with the project source, for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}. @@ -232,7 +232,7 @@ The most high-level series of operations that are done on input(s) to produce th Because the project's report is also defined as an output (see above), besides the high-level analysis, the project's source also includes scripts/commands to produce plots, figures or tables. With this definition, this concept of a ``project'' is similar to ``workflow''. -However, it is imporant to emphasize that the project's source code and inputs are distinct entities. +However, it is important to emphasize that the project's source code and inputs are distinct entities. For example the project may be written in the same programming language as one analysis step. Generally, the project source is defined as the most high-level source file that is unique to that individual project (its language is irrelevant). The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis by itself. 
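To make this division of responsibilities concrete, a minimal sketch in Make (the job orchestration tool adopted later in the paper) could look like the following; this is not taken from the template itself, and the file and program names (raw.dat, run-step1, run-step2) are purely hypothetical:

    # Each rule only connects an input to an output; the actual analysis is
    # done by separate (here hypothetical) programs called in the recipes.
    # Recipe lines must be indented with a TAB character.
    step1.txt: raw.dat
            run-step1 raw.dat > step1.txt

    step2.txt: step1.txt
            run-step2 step1.txt > step2.txt

The project source above only declares that the output of the first step is fed as the input of the second; it does no analysis of its own.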
@@ -263,7 +263,7 @@ Data provenance thus provides a high-level view of the data's genealogy. % Business lineage pictures data flows on a business-term level and it's provided by solutions like Collibra, Alation and many others. % Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level - actual tables, scripts and statements. % Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. " -Data lineage is commonly used interchangably with Data provenance \citep[for example][\tonote{among many others, just search ``data lineage'' in scholar.google.com}]{cheney09}. +Data lineage is commonly used interchangeably with Data provenance \citep[for example][\tonote{among many others, just search ``data lineage'' in scholar.google.com}]{cheney09}. However, for clarity, in this paper we refer to the term ``Data lineage'' as a low-level and fine-grained recording of the data's source, and operations that occur on it, down to the exact command that produced each intermediate step. This \emph{recording} does not necessarily have to be in a formal metadata model. But data lineage must be complete (see completeness principle in Section \ref{principle:complete}), and allow extraction of data provenance metadata, and thus higher-level operations like visualization of the workflow. @@ -283,23 +283,23 @@ We adopt the same definition of \citet{leek17,fineberg19}, among others: %% Reproducibility is a minimum necessary condition for a finding to be believable and informative.”(K. Bollen, J. T. Cacioppo, R. Kaplan, J. Krosnick, J. L. Olds, Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science (National Science Foundation, Arlington, VA, 2015)). \begin{itemize} -\item {\bf\small Reproducibility:} (same inputs $\rightarrow$ consistant result). +\item {\bf\small Reproducibility:} (same inputs $\rightarrow$ consistent result). Formally: ``obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis'' \citep{fineberg19}. This is thus synonymous with ``computational reproducibility''. \citet{fineberg19} allow non-bitwise or non-identical numeric outputs within their definition of reproducibility, but they also acknowledge that this flexibility can lead to complexities: what is an acceptable non-identical reproduction? - Exactly reproducbile outputs can be precisely and automatically verified without statistical interpretations, even in a very complex analysis (involving many CPU cores, and random operations), see Section \ref{principle:verify}. + Exactly reproducible outputs can be precisely and automatically verified without statistical interpretations, even in a very complex analysis (involving many CPU cores, and random operations), see Section \ref{principle:verify}. It also requires no expertise, as \citet{claerbout1992} put it: ``a clerk can do it''. \tonote{Raul: I don't know if this is true... at least it needs a bit of training and an extra time. Maybe remove last phrase?} In this paper, unless otherwise mentioned, we only consider bitwise/exact reproducibility. -\item {\bf\small Replicability:} (different inputs $\rightarrow$ consistant result). +\item {\bf\small Replicability:} (different inputs $\rightarrow$ consistent result). Formally: ``obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data'' \citep{fineberg19}. 
Generally, since replicability involves new data collection, it can be expensive. For example the ``Reproducibility Project: Cancer Biology'' initiative started in 2013 to replicate 50 high-impact papers in cancer biology\footnote{\url{https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-biology}}. Even with a funding of at least \$1.3 million, it later shrunk to 18 projects \citep{kaiser18} due to very high costs. -We also note that replicability does not have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}. +We also note that replicability doesn't have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}. \end{itemize} \tonote{Raul: put white line to separate next paragraph from the previous list?} @@ -312,7 +312,7 @@ For example, \citet{ioannidis2009} use ``repeatability'' to encompass both the t However, the ACM/VIM definition for repeatability is ``a researcher can reliably repeat her own computation''. Hence, in the ACM terminology, the only difference between replicability and repeatability is the ``team'' that is conducting the computation. In the context of this paper, inputs are precisely defined (Section \ref{definition:input}): files with specific/registered checksums (see Section \ref{principle:verify}). -Therefore, our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM. +Therefore our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM. @@ -345,7 +345,7 @@ Science is the only class that attempts to be as objective as possible through t This paper thus proposes a framework that is optimally designed for both designing and executing a project, \emph{as well as} publication of the (computational) methods along with the published paper/result. However, this paper is not the first attempted solution to this fundamental problem. Various solutions have been proposed since the early 1990s, see Appendix \ref{appendix:existingsolutions} for a review. -To better highlight the differences with those methods, and the foundations of this method (which help in understanding certain implementation choices), in the sub-sections below, the core principle above is expaneded by breaking it into logically independent sub-components. +To better highlight the differences with those methods, and the foundations of this method (which help in understanding certain implementation choices), in the sub-sections below, the core principle above is expanded by breaking it into logically independent sub-components. It is important to note that based on the definition of a project (Section \ref{definition:project}) and the first principle below (modularity, Section \ref{principle:modularity}) this paper is designed to be modular and thus agnostic to high-level choices. For example the choice of hardware (e.g., high performance computing facility or a personal computer), or high-level interfaces (for example a webpage or specialized graphic user interface). 
@@ -368,8 +368,8 @@ Generally, a project's source should include the whole project: access to the in This principle has several important consequences: \begin{itemize} -\item A complete project doesn't need any previlaged/root permissions for system-wide installation, or environment preparations. - Even when the user does have root previlages, interefering with the host operating system for a project, may lead to many conflicts with the host or other projects. +\item A complete project doesn't need any privileged/root permissions for system-wide installation, or environment preparations. + Even when the user does have root privileges, interfering with the host operating system for a project, may lead to many conflicts with the host or other projects. This principle thus allows a safe execution of the project, and will not cause any security problems. \item A complete project doesn't need an internet connection to build itself or to do its analysis and possibly make a report. @@ -384,12 +384,12 @@ This principle has several important consequences: Interactivity is also an inherently irreproducible operation, exposing the analysis to human error, and requiring expert knowledge. \end{itemize} -The first two components are particularly important for high performace computing (HPC) facilities: because of security reasons, HPC users commonly don't have previlaged permissions or internet access. +The first two components are particularly important for high performance computing (HPC) facilities: because of security reasons, HPC users commonly don't have privileged permissions or internet access. A complete project as defined here is much less exposed to ``workflow decay'' as defined by \citet{zhao12} (in particular under their missing execution environment tests). As recommended by \citet{zhao12}, a complete project automatically builds all its necessary third-party tools, it doesn't just assume their existence. Ultimately, the executability of a project will decay once the host Linux kernel inevitably evolves to such that the project's fixed version of the GNU C Library and GNU C Compiler can't be built. -This will happen on much longer time scales than the high-level software menioned in \citet{zhao12} and can be fixed by changing the project's (GNU) C library and (GNU) C Compiler to versions that are build-able with the host kernel. +This will happen on much longer time scales than the high-level software mentioned in \citet{zhao12} and can be fixed by changing the project's (GNU) C library and (GNU) C Compiler to versions that are build-able with the host kernel. These are very low-level components and any possible change in the output should be minimal. Ultimately after multiple decades, even that may not be possible, but even at that point, thanks to the plain-text principle (Section \ref{principle:text}), it can still be studied, without necessarily executing it. @@ -409,8 +409,8 @@ This principle doesn't just apply to the analysis, it also applies to the whole Within the analysis phase, this principle can be summarized best with the Unix philosophy, best described by \citet{mcilroy78} in the ``Style'' section. In particular ``Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new `features'''. -Independent parts of the analysis can be maintained as independent software (for example shell, Python, or R scripts, or programs written in C, C++ or Fortran, among others). 
-This core aspect of the Unix philosophy has been the cause of its continueed success (particulary through GNU and BSD) and development in the last half century. +Independent parts of the analysis can be maintained as independent software (for example shell, Python, or R scripts, or programs written in C, C++ or FORTRAN, among others). +This core aspect of the Unix philosophy has been the cause of its continued success (particularly through GNU and BSD) and development in the last half century. For the most high-level analysis/operations, the boundary between the ``analysis'' and ``project'' can become blurry. It is thus inevitable that some highly project-specific, and small, analysis steps are also kept within the project and not maintained as a separate software package (that is built before the project is run). @@ -421,7 +421,7 @@ One example of an existing system that doesn't follow this principle is Madagasc %\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository. % This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake. % Software developers already have decades of experience with the problems of mixing hand-written source files with the automatically generated (built) files. -% This principle has proved to be an exceptionally useful in this model, grealy +% This principle has proved to be an exceptionally useful in this model, greatly @@ -430,23 +430,23 @@ One example of an existing system that doesn't follow this principle is Madagasc A project's primarily stored/archived format should be plain text with human-readable encoding\footnote{Plain text format doesn't include document container formats like \inlinecode{.odf} or \inlinecode{.doc}, for software like LibreOffice or Microsoft Office.}, for example ASCII or Unicode (for the definition of a project, see Section \ref{definition:project}). The reason behind this principle is that opening, reading, or editing non-plain text (executable or binary) file formats needs specialized software. Binary formats will complicate various aspects of the project: its usage, archival, automatic parsing, or human readability. -This is a critical principle for long term preservation and portability: when the software to read binary format has been depreciated or become obsolete and isn't installable on the running system, the project will not be readable/usable any more. +This is a critical principle for long term preservation and portability: when the software to read binary format has been depreciated or become obsolete and isn't installable on the running system, the project will not be readable/usable any more. % should replace `installable`? A project that is solely in plain text format can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development: software source code is also in plain text. -After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18,dicosmo20}, enabling future projects to easily build ontop of old ones, or cite specific parts of a project. 
+After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18,dicosmo20}, enabling future projects to easily build on top of old ones, or cite specific parts of a project. Archiving a binary version of the project is like archiving a well cooked dish itself, which will be inedible with changes in hardware (temperature, humidity, and the natural world in general). But archiving the dish's recipe (which is also in plain text!): you can re-cook it any time. -When the environment is under perfect control (as in the proposed system), the binary/executable, or re-cooked, output will be verifiably identical. +When the environment is under perfect control (as in the proposed system), the binary/executable, or re-cooked, output will be verifiably identical. % should replace `verifiably` with another word? One illustrative example of the importance of source code is mentioned in \citet{smart18}: a seven-year old dispute between condensed matter scientists could only be solved when they shared the plain text source of their respective projects. -This principle doesn't conflict with having an executable or immediately-runnable project\footnote{In their recommendation 4-1 on reproducibility, \citet{fineberg19} mention: ``a detailed description of the study methods (ideally in executable form)''.}. +This principle doesn't conflict with having an executable or immediately-runnable project\footnote{In their recommendation 4-1 on reproducibility, \citet{fineberg19} mention: ``a detailed description of the study methods (ideally in executable form)''.}. % should replace `runnable`? Because it is trivial to build a text-based project within an executable container or virtual machine. For more on containers, please see Appendix \ref{appendix:independentenvironment}. To help contemporary researchers, this built/executable form of the project can be published as an output in respective servers like \url{http://hub.docker.com} (see Section \ref{definition:output}). Note that this principle applies to the whole project, not just the initial phase. -Therefore a project like Conda that currently includes a $+500$MB binary blob in a plain-text shell script (see Appendix \ref{appendix:conda}) is not acceptable for this principle. +Therefore a project like Conda that currently includes a $+500$MB binary blob in a plain-text shell script (see Appendix \ref{appendix:conda}) is not acceptable for this principle. % is it `Anaconda` or `Conda` project? This failure also applies to projects that build tools to read binary sources. In short, the full source of a project should be in plain text. @@ -454,31 +454,31 @@ In short, the full source of a project should be in plain text. -\subsection{Principle: Minimal complexity (i.e., maximal compatability)} +\subsection{Principle: Minimal complexity (i.e., maximal compatibility)} \label{principle:complexity} An important measure of the quality of a project is how much it avoids complexity. -In principle this is similar to Occum's rasor: ``Never posit pluralities without necessity'' \citep{schaffer15}, but extrapolated to project management. -In this context Occum's rasor can be interpretted like the following cases: +In principle this is similar to Occam's razor: ``Never posit pluralities without necessity'' \citep{schaffer15}, but extrapolated to project management. 
+In this context Occam's razor can be interpreted like the following cases: minimize the number of a project's dependency software (there are often multiple ways of doing something), -avoid complex relationtions between analysis steps (which is not unrelated to the principle of modularity in Section \ref{principle:modularity}), +avoid complex relations between analysis steps (which is not unrelated to the principle of modularity in Section \ref{principle:modularity}), or avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and take the project down with it, see Appendix \ref{appendix:highlevelinworkflow}). -This principle has several important concequences: +This principle has several important consequences: \begin{itemize} \item Easier learning curve. Scientists can't adopt new tools and methods as fast as software developers. They have to invest the majority of their time on their own research domain. -Because of this researchers usually continue their career with the language/tools they learnt when they started. +Because of this researchers usually continue their career with the language/tools they learned when they started. \item Future usage. Scientific projects require longevity: unlike software engineering, there is no end-of-life in science (e.g., Aristotle's work 2.5 millennia ago is still ``science''). Scientific projects that depend too much on an ever evolving, high-level software developing toolchain, will be harder to archive, run, or even study for their immediate and future peers. -One recent example is the Popper software implementation: it was originally designed in the HashiCorp configuration language (HCL) because it was the default for organizing operations in GitHub. -However, Github dropped HCL in October 2019, for more see Appendix \ref{appendix:popper}. +One recent example is the Popper software implementation: it was originally designed in the HashiCorp configuration language (HCL) because it was the default for organizing operations in GitHub. % should names like HashiCorp be formatted in italics? +However, GitHub dropped HCL in October 2019, for more see Appendix \ref{appendix:popper}. \item Compatible and extensible. A project that has minimal complexity, can easily adapt to any kind of data, programming language, host hardware or software and etc. It can also be easily extended for new inputs and environments. - For example when a project management system is designed only to manage Python functions (like CGAT-core, see Appendix \ref{appendix:jobmanagement}), it will be hard, inefficient and buggy for managing an analysis step that is written in R and another written in Fortran. + For example when a project management system is designed only to manage Python functions (like CGAT-core, see Appendix \ref{appendix:jobmanagement}), it will be hard, inefficient and buggy for managing an analysis step that is written in R and another written in FORTRAN. \end{itemize} @@ -517,7 +517,7 @@ After publication, the project's history can also be published on services like Taking this principle to a higher level, newer projects are built upon the shoulders of previous projects. A project management system should be able to provide this temporal connection between projects. 
-Quantifying how newer projects relate to older projects (for example through Git banches) will enable 1) scientists to simply use the relevant parts of an older project, 2) quantify the connections of various projects, which is primarily of interest for meta-research (research on research) or historical studies. +Quantifying how newer projects relate to older projects (for example through Git branches) will enable 1) scientists to simply use the relevant parts of an older project, 2) quantify the connections of various projects, which is primarily of interest for meta-research (research on research) or historical studies. In data science, ``provenance'' is used to track the analysis and original datasets that were used in producing a higher-level dataset. A system that uses this principle will also provide ``temporal provenance'', quantifying how a certain project grew/evolved in the time dimension. @@ -533,8 +533,8 @@ This is because software freedom as an important pillar for the sciences as show \begin{itemize} \item Based on the completeness principle (Section \ref{principle:complete}), it is possible to trace the output's provenance back to the exact source code lines within an analysis software. If the software's source code isn't available such important and useful provenance information is lost. -\item A non-free software may not be runnable on a given hardware. - Since free software is modifable, others can modify (or hire someone to modify) it and make it runnable on their particular platform. +\item A non-free software may not be runnable on a given hardware. % should use an alternative word for `runnable`? If yes, please consider replacing it in the whole document to keep consistency. + Since free software is modifiable, others can modify (or hire someone to modify) it and make it runnable on their particular platform. \item A non-free software cannot be distributed by the authors, making the whole community reliant only on the proprietary owner's server (even if the proprietary software doesn't ask for payments). A project that uses free software can also release the necessary tarballs of the software it uses. For example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19} or \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937} \citep{infante20}. @@ -567,9 +567,9 @@ The proposed solution is an implementation of the principles discussed in Sectio In practice it is a collection of plain-text files, that are distributed in pre-defined sub-directories by context, and are all under version-control (currently with Git). In its raw form (before customizing for different projects), it is a fully working skeleton of a project without much flesh: containing all the low-level infrastructure, with just a small demonstrative ``delete-me'' analysis. -To start a new project, users will \emph{clone}\footnote{In Git, ``clone''ing is the process of copying all the project's file and their history into the host system.} the core skeleton, create their own Git branch, and start customizing the core files (adding their high-level analysis steps, scripts to generate figure and narrative) within their custom branch. 
+To start a new project, users will \emph{clone}\footnote{In Git, ``clone''ing is the process of copying all the project's file and their history into the host system.} the core skeleton, create their own Git branch, and start customizing the core files (adding their high-level analysis steps, scripts to generate figure and narrative) within their custom branch. % should replace ``clone''ing with ``cloning''? -In this section we will review the current implementation of the reproducibile paper template. +In this section we will review the current implementation of the reproducible paper template. Generally, job orchestration is implemented in Make (a POSIX software), this choice is elaborated in Section \ref{sec:usingmake}. We continue with a general outline of the project's file structure in Section \ref{sec:generalimplementation}. As described there, we make a cosmetic distinction between ``configuration'' (or building of necessary software) and execution (or running the software on the data), these two phases are discussed in Sections \ref{sec:projectconfigure} \& \ref{sec:projectmake}. @@ -600,7 +600,7 @@ Most importantly, this enables full reproducibility from scratch with no changes This will allow robust results and let scientists do what they do best: experiment, and be critical to the methods/analysis without having to waste energy on the added complexity of experimentation in scripts. \item \textbf{\small Parallel processing:} Since the dependencies are clearly demarcated in Make, it can identify independent steps and run them in parallel. - This greatly speeds up the processing, with no cost in terms of complexy. + This greatly speeds up the processing, with no cost in terms of complexity. \item \textbf{\small Codifying data lineage and provenance:} In many systems data provenance has to be manually added. However, in Make, it is part of the design and no extra manual step is necessary to fully track (or back-track) the series of steps that generated the data. @@ -613,7 +613,7 @@ Make is also well known by many outside of the software developing communities. For example \citet{schwab2000} report how geophysics students have easily adopted it for the RED project management tool used in their lab at that time (see Appendix \ref{appendix:red} for more on RED). Because of its simplicity, we have also had very good feedback on using Make from the early adopters of this system during the last year, in particular graduate students and postdocs. -In summary Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project managment like a generational gap and ``dependency hell'', see Appendix \ref{appendix:highlevelinworkflow}. +In summary Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project management like a generational gap and ``dependency hell'', see Appendix \ref{appendix:highlevelinworkflow}. For more on Make and a discussion on some other job orchestration tools, see Appendices \ref{appendix:make} and \ref{appendix:jobmanagement} respectively. @@ -632,7 +632,7 @@ Most of the top project directory files are only intended for human readers (as \inlinecode{README-hacking.md} describes how to customize, or hack, the template for creators of new projects. 
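As a concrete illustration of the advantages of Make listed earlier in this section (explicit dependencies, parallel execution and automatic data lineage), consider the following hypothetical rules; none of these file or program names exist in the template:

    # table-a.txt and table-b.txt have no dependency on each other, so Make
    # can build them in parallel (e.g. with GNU Make's -j option).
    table-a.txt: raw.dat
            extract-a raw.dat > table-a.txt

    table-b.txt: raw.dat
            extract-b raw.dat > table-b.txt

    # values.tex is only re-made when one of its prerequisites is newer.
    values.tex: table-a.txt table-b.txt
            summarize table-a.txt table-b.txt > values.tex

Because every rule declares its inputs and outputs, the full lineage of values.tex can be traced without any extra bookkeeping, and a change affecting only one table triggers re-execution of only the final rule.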
In the top project directory, there are two non-narrative files: \inlinecode{project} (which should have been under \inlinecode{reproduce/}) and \inlinecode{paper.tex} (which should have been under \inlinecode{tex/}). -The former is nessary in the top project directory because it is the high-level user interface, with the \inlinecode{./project} command. +The former is necessary in the top project directory because it is the high-level user interface, with the \inlinecode{./project} command. The latter is necessary for many web-based automatic paper generating systems like arXiv, journals, or systems like Overleaf. \begin{figure}[t] @@ -651,7 +651,7 @@ The latter is necessary for many web-based automatic paper generating systems li } \end{figure} -\inlinecode{project} is a simple executable POSIX-compliant shell script, that is just a high-level wrapper script to call the project's Makefiles. +\inlinecode{project} is a simple executable POSIX-compliant shell script, that is just a high-level wrapper script to call the project's Makefiles. % should the `makefile` be in capitalized? Recall that the main job orchestrator in this system is Make, see Section \ref{sec:usingmake} for why Make was chosen. In the current implementation, the project's execution consists of the following two calls to the \inlinecode{project} script: @@ -661,7 +661,7 @@ In the current implementation, the project's execution consists of the following \end{lstlisting} The operations of both are managed by files under the top-level \inlinecode{reproduce/} directory. -When the first command is called, the contents of \inlinecode{reproduce\-/software} are used, and the latter calls files uner \inlinecode{reproduce\-/analysis}. +When the first command is called, the contents of \inlinecode{reproduce\-/software} are used, and the latter calls files under \inlinecode{reproduce\-/analysis}. This highlights the \emph{cosmetic} distinction we have adopted between the two main steps of a project: 1) building the project's full software environment and 2) doing the analysis (running the software). Technically there is no difference between the two and they could easily be merged under one directory. However, during a research project, researchers commonly just need to focus on their analysis steps and will rarely need to edit the software environment settings (maybe only once at the start of the project). @@ -706,7 +706,7 @@ The project will only look into them for the necessary software tarballs and inp If they are not found, the project will attempt to download any necessary file from the recoded URLs/PIDs within the project source. These directories are therefore primarily tailored to scenarios where the project must run offline (based on the completeness principle of Section \ref{principle:complete}). -After project configuration, a symbolic link is built the top project soure directory that points to the build directory. +After project configuration, a symbolic link is built the top project source directory that points to the build directory. The symbolic link is a hidden file named \inlinecode{.build}, see Figure \ref{fig:files}. With this symbolic link, its always very easy to access to built files, no matter where the build directory is actually located on the filesystem. 
@@ -721,7 +721,7 @@ A working C compiler is thus mandatory and the configure script will abort if a In particular, on GNU/Linux systems, the project builds its own version of the GNU Compiler Collection (GCC), therefore a static C library is necessary with the compiler. If not found, an informative error message will be printed and the project will abort. -The custom version of GCC is configured to also build Fortran, C++, objective-C and objective-C++ compilers. +The custom version of GCC is configured to also build FORTRAN, C++, objective-C and objective-C++ compilers. Python and R running environments are themselves written in C, therefore they are also automatically built afterwards if the project uses these languages. On macOS systems, we currently don't build a C compiler, but it is planned to do so in the future. @@ -737,7 +737,7 @@ Researchers using the template only have to specify the most high-level analysis Based on the completeness principle (Section \ref{principle:complete}), on GNU/Linux systems the dependency tree is automatically traced down to the GNU C Library and GNU Compiler Collection (GCC). Thus creating identical high-level analysis software on any system. When the C library and compiler can't be installed (for example on macOS systems), the users are forced to rely on the host's C compiler and library, and this may hamper the exact reproducibility of the final result: the project will abort if the final outputs have changed. -Because the project's main output is currently a \LaTeX{}-built PDF, the project also contains an internal installation of \TeX{}Live, providing all the necessary tools to build the PDF, indepedent of the host operating system's \LaTeX{} version and packages. +Because the project's main output is currently a \LaTeX{}-built PDF, the project also contains an internal installation of \TeX{}Live, providing all the necessary tools to build the PDF, independent of the host operating system's \LaTeX{} version and packages. To build the software from source, the project needs access to its source tarball or zip-file. If the tarballs are already present on the system, the user can specify the respective directory at the start of project configuration (Section \ref{sec:localdirs}). @@ -780,18 +780,18 @@ For more on software citation, see Section \ref{sec:softwarecitation}. \subsubsection{Software citation} \label{sec:softwarecitation} Based on the completeness principle (Section \ref{principle:complete}), the project contains the full list of installed software, their versions and their configuration options. -However, this information is burried deep into the project's source. +However, this information is buried deep into the project's source. A distilled fraction of this information must also be printed in the project's final report, blended into the narrative. Furthermore, when a published paper is associated with the used software, it is important to cite that paper, the citations help software authors gain more recognition and grants, encouraging them to further develop it. This is particularly important in the case for research software, where the researcher has invested significant time in building the software, and requires official citation to justify continued work on it. -One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: everytime it is run, it prints the citation information before it starts. 
+One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts. This doesn't cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs. Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros! This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''. In bug 905674\footnote{Debian bug on the citation notice of GNU Parallel: \url{https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905674}}, the Debian developers argued that because of this extra condition, GNU Parallel should not be considered as free software, and they are using a patch to remove that part of the code for its build under Debian-based operating systems. -Most other research software don't resort to such drastic measures, however, citation is imporant for them. -Given the increasing number of software used in scientific research, the only reliable solution is to automaticly cite the used software in the final paper. +Most other research software don't resort to such drastic measures, however, citation is important for them. +Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper. As mentioned above in Section \ref{sec:buildsoftware}, a plain-text file is built automatically at the end of a software's successful build and installation. This file contains the name, version and possible citation of that software. @@ -863,19 +863,19 @@ It is designed for any selection steps that may be necessary to optimize \inline It is mainly useful when the research targets are more focused than the raw input and may not be necessary in many scenarios. Its role is described here with an example. -Let's assume the raw input data (that the project recieved from a database) has 5000 rows (potential targets for doing the analysis on). +Let's assume the raw input data (that the project received from a database) has 5000 rows (potential targets for doing the analysis on). However, this particular project only needs to work on 100 of them, not the full 5000. If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to create a data lineage for all 5000 targets and project authors have to add checks in many places to ignore those that aren't necessary. This will add to the project's complexity and is prone to many bugs. Furthermore, if the filesystem isn't fast (for example a filesystem that exists over a network), checking all the intermediate and final files over the full lineage can be slow. In this scenario, the preparation phase finds the IDs of the 100 targets of interest and saves them as a Make variable in a file under \inlinecode{BDIR}. -Later, this file is loaded into the analysis phase, precisely identifing the project's targets-of-interest. +Later, this file is loaded into the analysis phase, precisely identifying the project's targets-of-interest. 
This selection phase can't be done within \inlinecode{top-make.mk} because the full data lineage (all input and output files) must be known to Make before it starts to execute the necessary operations. It is possible to for Make to call itself as another Makefile, but this practice is strongly discouraged here because it makes the flow very hard to read. However, if the project authors insist on calling Make within Make, it is certainly possible. -The ``preparation'' phase thus allows \inlinecode{top-make.mk} to optimially organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel). +The ``preparation'' phase thus allows \inlinecode{top-make.mk} to optimally organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel). It also greatly simplifies the coding for the project authors. Ideally \inlinecode{top-prepare.mk} is only for the ``preparation phase''. However, projects can be complex and ultimately, the choice of which parts of an analysis being a ``preparation'' can be highly subjective. @@ -931,7 +931,7 @@ We'll follow Make's paradigm (see Section \ref{sec:usingmake}) of starting form } \end{figure} -To aviod getting too abstract in the subsections below, where necessary, we'll do a basic analysis on the data of \citet[data were published as supplementary material on bioXriv]{menke20} and try to replicate some of their results. +To avoid getting too abstract in the subsections below, where necessary, we'll do a basic analysis on the data of \citet[data were published as supplementary material on bioXriv]{menke20} and try to replicate some of their results. Note that because we are not using the same software, this isn't a reproduction (see Section \ref{definition:reproduction}). We can't use the same software because they use Microsoft Excel for the analysis which violates several of our principles: 1) Completeness (as a graphic user interface program, it needs human interaction, Section \ref{principle:complete}), 2) Minimal complexity (even free software alternatives like LibreOffice involve many dependencies and are extremely hard to build, Section \ref{principle:complexity}) and 3) Free software (Section \ref{principle:freesoftware}). @@ -943,7 +943,7 @@ We can't use the same software because they use Microsoft Excel for the analysis It is possible to call a new instance of Make within an existing Make instance. This is also known as recursive Make\footnote{\url{https://www.gnu.org/software/make/manual/html_node/Recursion.html}}. -Recursive Make is infact used by many Make users, especially in the software development communities. +Recursive Make is in fact used by many Make users, especially in the software development communities. It is also possible within a project using the proposed template. However, recursive Make is discouraged in the template, and not used in it. @@ -997,7 +997,7 @@ Given the evolution of a scientific projects, this type of human error is very h Such values must also be automatically generated. To automatically generate and blend them in the text, we use \LaTeX{} macros. 
-In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} uses this templat to be reproducible, so its LaTeX source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} links is also available on arXiv.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. +In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} uses this template to be reproducible, so its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} links is also available on arXiv.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. The \LaTeX{} macro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically calculated and recorded during in the project and expands to the value ``\inlinecode{0.25}''. The automatically generated file \inlinecode{project.tex} stores all such inline output macros. Furthermore, Figure \ref{fig:datalineage} shows that it is a prerequisite of \inlinecode{paper.pdf} (as well as the manually written \LaTeX{} sources that are shown in green). @@ -1023,7 +1023,7 @@ The lineage ultimate ends in a \LaTeX{} macro file in \inlinecode{analysis3.tex} \subsubsection{Verification of outputs (\inlinecode{verify.mk})} \label{sec:outputverification} An important principle for this template is that outputs should be automatically verified, see Section \ref{principle:verify}. -However, simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: as mentioned in Section \ref{principle:verify}, many tools that produce datasets or PDFs write the creation date into the produced files. +However, simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: as mentioned in Section \ref{principle:verify}, many tools that produce datasets or PDFs write the creation date into the produced files. % should replace `PDFs` with `PDF files`? Therefore it is necessary to verify the project's outputs before the PDF is created. To facilitate output verification, the project has a \inlinecode{verify.mk} Makefile, see Figure \ref{fig:datalineage}. It is the only prerequisite of \inlinecode{project.tex} that was described in Section \ref{sec:paperpdf}. @@ -1049,11 +1049,11 @@ Nevertheless, project authors are strongly encouraged to study it and use all th \inlinecode{initial\-ize\-.mk} doesn't contain any analysis or major processing steps, it just initializes the system. For example it sets the necessary environment variables, internal Make variables and defines generic rules like \inlinecode{./project make clean} (to clean/delete all built products, not software) or \inlinecode{./project make dist} (to package the project into a tarball for distribution) among others. -It also adds one special \LaTeX{} macro in \inlinecode{initial\-ize\-.tex}: the current Git commit that is generated everytime the analysis is run. +It also adds one special \LaTeX{} macro in \inlinecode{initial\-ize\-.tex}: the current Git commit that is generated every time the analysis is run. 
It is stored in \inlinecode{{\footnotesize\textbackslash}projectversion} macro and can be used anywhere within the final report.
For this PDF it has a value of \inlinecode{\projectversion}.
One good place to put it is in the end of the abstract for any reader to be able to identify the exact point in history that the report was created.
-It also uses the \inlinecode{--dirty} feature of Git's \inlinecode{--describe} output: if any version-controlled file is not already commited, the value to this macro will have a \inlinecode{-dirty} suffix.
+It also uses the \inlinecode{--dirty} feature of Git's \inlinecode{describe} output: if any version-controlled file is not already committed, the value of this macro will have a \inlinecode{-dirty} suffix.
If its in a prominent place (like the abstract), it will always remind the author to commit their work.

@@ -1122,13 +1122,13 @@ Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf}, sho
If \inlinecode{menke20.xlsx} exists on the \emph{input} directory, it will just be validated and put it in the \emph{build} directory.
Otherwise, it will be downloaded from the given URL, validated, and put it in the build directory.
Recall that the input and build directories differ from system to system and are specified at project configuration time, see Section \ref{sec:localdirs}.
-In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{INPUTS.conf} to \inlinecode{menke20.xlsx} sympolizes this step.
+In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{INPUTS.conf} to \inlinecode{menke20.xlsx} symbolizes this step.
Note that in our notation, once an external dataset is imported, it is a \emph{built} product, it thus has a blue box in Figure \ref{fig:datalineage}.

It is sometimes necessary to report basic information about external datasets in the report/paper.
As described in Section \ref{sec:valuesintext}, here this is done with \LaTeX{} macros to avoid human error.
For example in Footnote \ref{footnote:dataurl}, we gave the full URL that this dataset was downloaded from.
-In the \LaTeX{} source of that footnote, this URL is stored as the \inlinecode{\textbackslash{}menketwentyurl} macro which is created with the simplied\footnote{This Make rule is simplified by removing the directory variable names to help in readability.} Make rule below (it is located at the bottom of \inlinecode{download.mk}).
+In the \LaTeX{} source of that footnote, this URL is stored as the \inlinecode{\textbackslash{}menketwentyurl} macro which is created with the simplified\footnote{This Make rule is simplified by removing the directory variable names to improve readability.} Make rule below (it is located at the bottom of \inlinecode{download.mk}).

In this rule, \inlinecode{download.tex} is the \emph{target} and \inlinecode{menke20.xlsx} is its \emph{prerequisite}.
The \emph{recipe} to build the target from the prerequisite is the \inlinecode{echo} shell command which writes the \LaTeX{} macro definition as a simple string (enclosed in double-quotes) into the \inlinecode{download.tex}.
@@ -1136,7 +1136,7 @@ The target is built after the prerequisite(s) are built, or when the prerequisit
Note that \inlinecode{\$(MK20URL)} is a call to the variable defined above in \inlinecode{INPUTS.conf}.
Also recall that in Make, \inlinecode{\$@} is an \emph{automatic variable}, which is expanded to the rule's target name (in this case, \inlinecode{download.tex}).
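Since the figure containing the actual rule is not reproduced in this hunk, the following is a simplified sketch of the rule just described (directory variables removed, as the footnote explains; GNU Bash is assumed as the recipe shell so that \inlinecode{echo} keeps the backslash literal):

    # Sketch of the described rule: write the dataset's URL as a LaTeX macro.
    # $(MK20URL) comes from INPUTS.conf; $@ expands to the target, download.tex.
    download.tex: menke20.xlsx
	echo "\newcommand{\menketwentyurl}{$(MK20URL)}" > $@
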
Therefore if the dataset is re-imported (possibly with a new URL), the URL in Footnote \ref{footnote:dataurl} will also be re-created automatically. -In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{menke20.xlsx} to \inlinecode{download.tex} sympolizes this step. +In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{menke20.xlsx} to \inlinecode{download.tex} symbolizes this step. @@ -1152,7 +1152,7 @@ Maneage is therefore designed to encourage and facilitate splitting the analysis For example in the data lineage graph of Figure \ref{fig:datalineage}, the analysis is broken into three subMakefiles: \inlinecode{format.mk}, \inlinecode{demo-plot.mk} and \inlinecode{analysis3.mk}. Theoretical discussion of this phase can be hard to follow, we will thus describe a demonstration project on data from \citet{menke20}. -In Section \ref{sec:download}, the process of importing this dataset into the proejct was described. +In Section \ref{sec:download}, the process of importing this dataset into the project was described. The first issue is that \inlinecode{menke20.xlsx} must be converted to a simple plain-text table which is generically usable by simple tools (see the principle of minimal complexity in Section \ref{principle:complexity}). For more on the problems with Microsoft Office and this format, see Section \ref{sec:lowlevelanalysis}. In \inlinecode{format.mk} (Figure \ref{fig:formatsrc}), we thus convert it to a simple white-space separated, plain-text table (\inlinecode{menke20-table-3.txt}) and do a basic calculation on it. @@ -1203,12 +1203,12 @@ Note that both the numbers of this sentence, and the first year of data mentione \end{figure} The operation of reproducing that figure is a contextually separate operation from the operations that were described above in \inlinecode{format.mk}. -Therfore we add a new subMakefile to the project called \inlinecode{demo-plot.mk}, which is shown in Figure \ref{fig:demoplotsrc}. +Therefore we add a new subMakefile to the project called \inlinecode{demo-plot.mk}, which is shown in Figure \ref{fig:demoplotsrc}. As before, in the first rule, we make the directory to host the data (\inlinecode{a2dir}). However, unlike before, this directory is placed under \inlinecode{texdir} which is the directory hosting all \LaTeX{} related files. This is because the plot of Figure \ref{fig:toolsperyear} is directly made within \LaTeX{}, using its PGFPlots package\footnote{PGFPLots package of \LaTeX: \url{https://ctan.org/pkg/pgfplots}. \inlinecode{texdir} has some special features when using \LaTeX{}, see Section \ref{sec:buildingpaper}. - PGFPlots uses the same graphics engine that is building the paper, producing a highquality figure that blends nicely in the paper.}. + PGFPlots uses the same graphics engine that is building the paper, producing a high quality figure that blends nicely in the paper.}. Note that this is just our personal choice, other methods of generating plots (for example with R, Gnuplot or Matplotlib) are also possible within this system. As with the input data files of PGFPlots, it is just necessary to put the files that are loaded into \LaTeX{} under the \inlinecode{\$(BDIR)/tex} directory, see Section \ref{sec:publishing}. @@ -1229,11 +1229,11 @@ Configuration files are discussed in more detain in Section \ref{sec:configfiles } \end{figure} -In a similar manner many more subMakefiles can be added in more complex analysis scenarios. 
+In a similar manner many more subMakefiles can be added in more complex analysis scenarios. % should subMakefiles have a special kind of formatting? Such as monospaced, italics, etc?
This is shown with the lower opacity files and dashed arrows of the data lineage in Figure \ref{fig:datalineage}.
Generally, the files created within one subMakefile don't necessarily have to be a prerequisite of its \LaTeX{} macro.
For example see \inlinecode{demo-out.dat} in Figure \ref{fig:datalineage}: it is managed in \inlinecode{demo-plot.mk}, however, it isn't a prerequisite of \inlinecode{demo-plot.tex}, it is a prerequisite of \inlinecode{out-3b.dat} (which is managed in \inlinecode{another-step.mk} and is a prerequisite of \inlinecode{another-step.tex}).
-Hence ultimately, through another file, it's decendants conclude in a \LaTeX{} macro.
+Hence ultimately, through another file, its descendants conclude in a \LaTeX{} macro.

The high-level \inlinecode{top-make.mk} file is designed to simplify the addition of new subMakefiles for the authors, and reading the source for readers (see Section \ref{sec:highlevelanalysis}).
As mentioned before, this high-level Makefile just defines the ultimate target (\inlinecode{paper.pdf}, see Section \ref{sec:paperpdf}) and imports all the subMakefiles in the specific order.
@@ -1267,7 +1267,7 @@ The configuration files greatly simplify project management from multiple perspe
\begin{itemize}
\item If an analysis parameter is used in multiple places within the project, simply changing the value in the configuration file will change it everywhere in the project.
-      This is cirtical in more complex projects and if not done like this can lead to significant human error.
+      This is critical in more complex projects and if not done like this can lead to significant human error.
\item Configuration files enable the logical separation between the low-level implementation and high-level running of a project.
      For example after writing the project, the authors don't need to remember where the number/parameter was used, they can just modify the configuration file.
      Other co-authors, or readers, of the project also benefit: they just need to know that there is a unified place for high-level project settings, parameters, or numbers without necessarily having to know the low-level implementation.
@@ -1312,7 +1312,7 @@ This file is updated on the \inlinecode{maneage} branch and will always be up-to
  The low-level infrastructure can always be updated (keeping the added high-level analysis intact), with a simple merge between branches.
  Two phases of a project's evolution shown here: in phase 1, a co-author has made two commits in parallel to the main project branch, which have later been merged.
  In phase 2, the project has finished: note the identical first project commit and the Maneage commits it branches from.
-  The dashed parts of Scenario 2 can be any arbitraty history after those shown in phase 1.
+  The dashed parts of Scenario 2 can be any arbitrary history after those shown in phase 1.
 A second team now wants to build upon that published work in a derivate branch, or project.
 The second team applies two commits and merges their branch with Maneage to improve the skeleton and continue their research.
 The Git commits are shown on their branches as colored ellipses, with their hash printed in them.
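As a concrete illustration of the branching scheme in this figure, the low-level infrastructure updates can be pulled into a derived project with ordinary Git commands (the remote name and URL below are placeholders, not part of the template; only the \inlinecode{maneage} branch name is taken from the text above):

    # Hypothetical commands: fetch the template's history and merge its
    # 'maneage' (infrastructure) branch into the current project branch.
    git remote add template <template-git-url>
    git fetch template
    git merge template/maneage
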
@@ -1345,7 +1345,7 @@ Modern version control systems provide many more capabilities that can be exploi
Because the project's source and build directories are separate, it is possible for different users to share a build directory, while working on their own separate project branches during a collaboration.
Similar to the parallel branch that is later merged in phase 1 of Figure \ref{fig:branching}.

-To give all users previlage, Maneage assumes that they are in the same (POSIX) user group of the system.
+To give all users the necessary permissions, Maneage assumes that they are in the same (POSIX) user group on the system.
All files built in the build directory are then automatically assigned to this user group, with read-write permissions for all group members (\inlinecode{-rwxrwx---}), through the \inlinecode{sg} and \inlinecode{umask} commands that are prepended to the call to Make.
The \inlinecode{./project} script has a special \inlinecode{--group} option which activates this mode in both configuration and analysis phases.
It takes the user group name as its argument and the built files will only be accessible by the group members, even when the shared location is accessible by people outside the project.
@@ -1357,7 +1357,7 @@ Other project members can also compare her results in this way and once it is me
The project already applies this strategy for part of the project that runs \LaTeX{} to build the final report.
This is because project members will usually also be editing their parts of the report/paper as they progress.
-To fix this, when the project is configured and built with \inlinecode{--group}, each project member's user-name will be appended to the \LaTeX{} build directury (which is under \inlinecode{\$(BDIR)/tex}).
+To fix this, when the project is configured and built with \inlinecode{--group}, each project member's user-name will be appended to the \LaTeX{} build directory (which is under \inlinecode{\$(BDIR)/tex}).
However, human error is inevitable, so when the project takes long in some phases, the user and group write-permission flags can be manually removed from the respective subdirectories under \inlinecode{\$(BDIR)} until the project is to be built from scratch; maybe for a test prior to submission.

@@ -1450,7 +1450,7 @@ Given the strong integrity checks in Maneage, we believe it has features to addr
  In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify newton's famous ``standing on the shoulders of giants'' quote.
  However, this is a long term goal and requires major changes to academic value systems.
\item The authors can be given a grace period where the journal, or some third authority, keeps the source and publishes it a certain time after publication.
-      Infact, journals can create specific policies for such scenarios, for example saying that all project sources will be available publicly, $N$ months/years after publication while allowing authors to opt-out of it if they like, so the source is published immediately with the paper.
+      In fact, journals can create specific policies for such scenarios, for example stating that all project sources will be made publicly available $N$ months/years after publication, while allowing authors to opt out of this embargo if they like, so that the source is published immediately with the paper.
     However, journals cannot expect exclusive copyright to distribute the project source, in the same manner they do with the final paper.
As discussed in the free software principle of Section \ref{principle:freesoftware}, it is critical that the project source be free for the community to use, modify and distribute.

@@ -1507,11 +1507,11 @@ Once the improvements become substantial, new paper(s) will be written to comple
\begin{itemize}
\item Science is defined by its method, not its results.
-      Just as papers are quality checked for a reasonable English (which is not necessary for conveying the final result), the necessities of modern science require a similar check on a reasonable review of the computation, which is easiest to check when the result is exactly reproducibile.
+      Just as papers are quality checked for a reasonable English (which is not necessary for conveying the final result), the necessities of modern science require a similar check on a reasonable review of the computation, which is easiest to check when the result is exactly reproducible.
\item Initiative such as \url{https://software.ac.uk} (UK) and \url{http://urssi.us} (USA) are good attempts at improving the quality of research software.
\item Hiring software engineers is not the solution: the language of science has changed.
      Could Galileo have created a telescope if he wasn't familiar with what a lens is?
-      Science is not independnet of its its tools.
+      Science is not independent of its tools.
\item The actual processing is archived in multiple places (with the paper on arXiv, with the data on Zenodo, on a Git repository, in future versions of the project).
\item As shown by the very common use of something like Conda, Software (even free software) is mainly seen in executable form, but this is wrong: even if the software source code is no longer compilable, it is still readable.
\item The software/workflow is not independent of the paper.
@@ -1524,10 +1524,10 @@ Once the improvements become substantial, new paper(s) will be written to comple
\item \citet{munafo19} discuss how collective action is necessary.
\item Research objects (Appendix \ref{appendix:researchobject}) can automatically be generated from the Makefiles, we can also apply special commenting conventions, to be included as annotations/descriptions in the research object metadata.
\item Provenance between projects: through Git, all projects based on this template are automatically connected, but also through inputs/outputs, the lineage of a project can be traced back to projects before it also.
-\item \citet{gibney20}: After code submission was encouraged by the Neural Information Processing Systems (NeurIPS), the frac
+\item \citet{gibney20}: After code submission was encouraged by the Neural Information Processing Systems (NeurIPS), the frac % incomplete
\item When the data are confidential, \citet{perignon19} suggest to have a third party familiar with coding to referee the code and give its approval.
      In this system, because of automatic verification of inputs and outputs, no technical knowledge is necessary for the verification.
-\item \citet{miksa19b} Machine-actionable data management plans (maDMPs) embeded in workflows, allowing
+\item \citet{miksa19b} Machine-actionable data management plans (maDMPs) embedded in workflows, allowing
\item \citet{miksa19a} RDA recommendation on maDMPs.
\item FAIR Principles \citep{wilkinson16}.
\item \citet{cheney09}: ``In both data warehouses and curated databases, tremendous (\emph{and often manual}) effort is usually expended in the construction'' @@ -1545,7 +1545,7 @@ Once the improvements become substantial, new paper(s) will be written to comple %% Acknowledgements -\section{Acknowledgements} +\section{Acknowledgments} Work on the reproducible paper template has been funded by the Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) scholarship and its Grant-in-Aid for Scientific Research (21244012, 24253003), the European Research Council (ERC) advanced grant 339659-MUSICOS, European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement No 721463 to the SUNDIAL ITN, and from the Spanish Ministry of Economy and Competitiveness (MINECO) under grant number AYA2016-76219-P. The reproducible paper template was also supported by European Union’s Horizon 2020 (H2020) research and innovation programme via the RDA EU 4.0 project (ref. GA no. 777388). @@ -1580,7 +1580,7 @@ The reproducible paper template was also supported by European Union’s Horizon Conducting a reproducible research project is a high-level process, which involves using various lower-level tools. In this section, a survey of the most commonly used lower-level tools for various aspects of a reproducible project is presented with an introduction as relates to reproducibility and the proposed template. -In particular, we focus on the tools used within the proposed template and also tools that are used by the existing reproducibile framework that is reviewed in Appendix \ref{appendix:existingsolutions}. +In particular, we focus on the tools used within the proposed template and also tools that are used by the existing reproducible framework that is reviewed in Appendix \ref{appendix:existingsolutions}. Some existing solution to for managing the different parts of a reproducible workflow are reviewed here. @@ -1609,14 +1609,14 @@ Users often choose an operating system for the container's independent operating Below we'll review some of the most common container solutions: Docker and Singularity. \begin{itemize} -\item {\bf\small Docker containers:} Docker is one of the most popular tools today for keeping an independnet analysis environment. - It is primarily driven by the need of software developers: they need to be able to reproduce a bug on the ``cloud'' (whic is just a remote VM), where they have root access. +\item {\bf\small Docker containers:} Docker is one of the most popular tools today for keeping an independent analysis environment. + It is primarily driven by the need of software developers: they need to be able to reproduce a bug on the ``cloud'' (which is just a remote VM), where they have root access. A Docker container is composed of independent Docker ``images'' that are built with Dockerfiles. It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build). To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the C library. Another important drawback of Docker for scientific applications is that it runs as a daemon (a program that is always running in the background) with root permissions. - This is a major security flaw that discourages many high performacen computing (HPC) facilities from installing it. 
+ This is a major security flaw that discourages many high performance computing (HPC) facilities from installing it.

\item {\bf\small Singularity:} Singularity is a single-image container (unlike Docker which is composed of modular/independent images).
  Although it needs root permissions to be installed on the system (once), it doesn't require root permissions every time it is run.
@@ -1661,7 +1661,7 @@ Note that we are not including package manager that are only limited to one lang
\subsubsection{Operating system's package manager}
The most commonly used package managers are those of the host operating system, for example \inlinecode{apt} or \inlinecode{yum} respectively on Debian-based, or RedHat-based GNU/Linux operating systems (among many others).
-These package managers are tighly intertwined with the operating system.
+These package managers are tightly intertwined with the operating system.
Therefore they require root access, and arbitrary control (for different projects) of the versions and configuration options of software within them is not trivial/possible: for example a special version of a software that may be necessary for a project, may conflict with an operating system component, or another project.
Furthermore, in many operating systems it is only possible to have one version of a software at any moment (no including Nix or GNU Guix which can also be independent of the operating system, described below).
Hence if two projects need different versions of a software, it is not possible to work on them at the same time.
@@ -1683,12 +1683,12 @@ Conda is able to maintain an approximately independent environment on an operati
Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
However, it is not possible to fix the versions of the dependencies through the YAML files alone.
This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
-In that discussion, the authors of \citet{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that atleast one of their their depenencies breaks shortly after this period.
+In that discussion, the authors of \citet{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citet{gruning18}.
However, as described in Appendix \ref{appendix:independentenvironment} containers just hide the reproducibility problem, they don't fix it: containers aren't static and need to evolve (i.e., re-built) with the project.
Given these limitations, \citet{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.

-Conda installs with a shell script that contains a binary-blob (+500 mega bytes, embeded in the shell script).
+Conda installs with a shell script that contains a binary-blob (+500 megabytes, embedded in the shell script).
This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
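For illustration, the YAML-based workflow described above boils down to commands like the following (the installation path and environment name are placeholders; the point being that unless every transitive dependency is pinned, re-creating the environment months later can resolve to different package versions):

    # Load Conda into the current shell (the "special shell script" mentioned above),
    # then build and activate an environment from a YAML file.
    source <conda-install-dir>/etc/profile.d/conda.sh
    conda env create --file environment.yml
    conda activate myproject        # name taken from the 'name:' field of the YAML file
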
However, the resulting environment is not fully independent of the host operating system as described below: @@ -1699,7 +1699,7 @@ However, the resulting environment is not fully independent of the host operatin Therefore, a user, or script may not notice that a software that is being used is actually coming from the operating system, not the controlled Conda installation. \item Generally, by default Conda relies heavily on the operating system and doesn't include core analysis components like \inlinecode{mkdir}, \inlinecode{ls} or \inlinecode{cp}. - Although they are generally the same between different Unix-like operatings sytems, they have their differences. + Although they are generally the same between different Unix-like operating systems, they have their differences. For example \inlinecode{mkdir -p} is a common way to build directories, but this option is only available with GNU Coreutils (default on GNU/Linux systems). Running the same command within a Conda environment on a macOS for example, will crash. Important packages like GNU Coreutils are available in channels like conda-forge, but they are not the default. @@ -1719,7 +1719,7 @@ However, the resulting environment is not fully independent of the host operatin \end{itemize} As reviewed above, the low-level dependence of Conda on the host operating system's components and build-time conditions, is the primary reason that it is very fast to install (thus making it an attractive tool to software developers who just need to reproduce a bug in a few minutes). -However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important. +However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important. % alternative to `archivability`? @@ -1735,7 +1735,7 @@ That hash is then prefixed to the software's installation directory. For example \citep[from][]{dolstra04} if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that are compiled with it (and thus need it to run) will link to this unique address. This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree. -As mentioned in \citet{courtes15}, one major caveat with using these package managers is that they require a daemon with root previlages. +As mentioned in \citet{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges. This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''. \tonote{While inspecting the Guix build instructions for some software, I noticed they don't actually mention the version names. This creates a similar issue withe Conda example above (how to regenerate the software with a given hash, given that its dependency versions aren't explicitly mentioned. 
Ask Ludo' about this.}

@@ -1754,7 +1754,7 @@ Spack is a package manager that is also influenced by Nix (similar to GNU Guix),

\subsection{Package management conclusion}
There are two common issues regarding generic package managers that hinders their usage for high-level scientific projects, as listed below:
\begin{itemize}
-\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivaties) only download the software in a binary (pre-compiled) format.
+\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivatives) only download the software in a binary (pre-compiled) format.
 This allows users to download it very fast and almost instantaneously be able to run it.
 However, to provide for this, servers need to keep binary files for each build of the software on different operating systems (for example Conda needs to keep binaries for Windows, macOS and GNU/Linux operating systems).
 It is also necessary for them to store binaries for each build, which includes different versions of its dependencies.

 For example Debian's Long Term Support is only valid for 5 years.
 Pre-built binaries of the ``Stable'' branch will only be kept during this period and this branch only gets updated once every two years.
-  However, scientific sofware commonly evolve on much faster rates.
+  However, scientific software commonly evolves at much faster rates.
 Therefore scientific projects using Debian often use the ``Testing'' branch which has more up to date features.
 The problem is that binaries on the Testing branch are immediately removed when no other package depends on it, and a newer version is available.
 This is not limited to operating systems, similar problems are also reported in Conda for example, see the discussion of Conda above for one real-world example.
@@ -1778,7 +1778,7 @@ They will thus manually install their high-level software in an uncontrolled, or
 This can result in not fully documenting the process that each package was built (for example the versions of the dependent libraries of a package).
 \end{itemize}

-Addressing these issues has been the basic raison d'\^etre of the proposed template's approach to package management strategy: instructions to download and build the packages are included within the actual science project (thus fully cusomizable) and no special/new syntax/language is used: software download, building and installation is done with the same langugage/syntax that researchers manage their research: using the shell (GNU Bash) and Make (GNU Make).
+Addressing these issues has been the basic raison d'\^etre of the proposed template's approach to package management strategy: instructions to download and build the packages are included within the actual science project (thus fully customizable) and no special/new syntax/language is used: software download, building and installation is done with the same language/syntax that researchers use to manage their research: using the shell (GNU Bash) and Make (GNU Make).

@@ -1788,7 +1788,7 @@ A scientific project is not written in a day.
It commonly takes more than a year (for example a PhD project is 3 or 4 years).
During this time, the project evolves significantly from its first starting date and components are added or updated constantly as it approaches completion.
Added with the complexity of modern projects, is not trivial to manually track this evolution, and its affect of on the final output: files produced in one stage of the project may be used at later stages (where the project has evolved).
-Furthermore, scientific projects do not progres linearly: earlier stages of the analysis are often modified after later stages are written.
+Furthermore, scientific projects do not progress linearly: earlier stages of the analysis are often modified after later stages are written.
This is a natural consequence of the scientific method; where progress is defined by experimentation and modification of hypotheses (earlier phases).

It is thus very important for the integrity of a scientific project that the state/version of its processing is recorded as the project evolves for example better methods are found or more data arrive.
@@ -1803,8 +1803,8 @@ However, currently, Git is by far the most commonly used in individual projects
\subsubsection{Git}
With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
-The commit involves copying the changed files into the repository and calculating a 40 character checksum/hash that is calculated from the files, an accompanying ``message'' (a narrarative description of the project's state), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other).
-For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifer in the Git history that this paper is being written in.
+The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the project's state), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other).
+For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history that this paper is being written in.
Commits are is commonly summarized by the checksum's first few characters, for example \inlinecode{f4953cc}.

With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
@@ -1888,7 +1888,7 @@ Going deeper into the syntax of Make is beyond the scope of this paper, but we r
\subsubsection{SCons}
Scons (\url{https://scons.org}) is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
In many aspects it is similar to Make, for example it is managed through a `SConstruct' file.
-Like a Makefile, SConstruct is also declerative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the the imperative paradigm which is common in languages like C, Python, or Fortran).
+Like a Makefile, SConstruct is also declarative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the imperative paradigm which is common in languages like C, Python, or FORTRAN).
However, unlike Make, SCons doesn't use the file modification date to decide if it should be remade.
SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents has changed. @@ -1897,7 +1897,7 @@ It also goes beyond raw job management and attempts to extract information from SCons is therefore more complex than Make: its manual is almost double that of GNU Make. Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive). -Finally, it has the same drawback as any other tool that uses hight-level languagues, see Section \ref{appendix:highlevelinworkflow}. +Finally, it has the same drawback as any other tool that uses high-level languages, see Section \ref{appendix:highlevelinworkflow}. We encountered such a problem while testing SCons: on the Debian-10 testing system, the \inlinecode{python} program pointed to Python 2. However, since Python 2 is now obsolete, SCons was built with Python 3 and our first run crashed. To fix it, we had to either manually change the core operating system path, or the SCons source hashbang. @@ -1918,7 +1918,7 @@ Hence in the GWL paradigm, software installation and usage doesn't have to be se GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together. As described above shell scripts and Make are a common and highly used system that have existed for several decades and many researchers are already familiar with them and have already used them. -The list of necessary software solutions for the various stages of a research project (listed in the subsections of Appendix \ref{appendix:existingtools}), is aleady very large, and each software has its own learning curve (which is a heavy burden for a natural or social scientist for example). +The list of necessary software solutions for the various stages of a research project (listed in the subsections of Appendix \ref{appendix:existingtools}), is already very large, and each software has its own learning curve (which is a heavy burden for a natural or social scientist for example). The other workflow management tools are too specific to a special paradigm, for example CGAT-core is written for Python, or GWL is intertwined with GNU Guix. Therefore their generalization into any kind of problem is not trivial. @@ -1932,7 +1932,7 @@ Therefore a robust solution would avoid designing their low-level processing ste \subsection{Editing steps and viewing results} \label{appendix:editors} In order to later reproduce a project, the analysis steps must be stored in files. -For example Shell, Python or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or Fortran. +For example Shell, Python or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or FORTRAN. Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files. Here we'll review some common methods that are currently used. @@ -1958,9 +1958,9 @@ Jupyter \citep[initially IPython,][]{kluyver16} is an implementation of Literate The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative. 
Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\url{https://nbformat.readthedocs.io/en/latest}}.
Jupyter's name is a combination of the three main languages it was designed for: Julia, Python and R.
-The \inlinecode{.ipynb} format, is a simple, human-readable (can be opened in a plain-text editor) file, formatted in Javascript Object Notation (JSON).
-It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visalizations (for example images/plots), that are all stored in one file.
-The cells can have any order, allowing the creation of a literal programing style graphical implementation, where narrative descriptions and executable patches of code can be intertwined.
+The \inlinecode{.ipynb} format is a simple, human-readable (can be opened in a plain-text editor) file, formatted in JavaScript Object Notation (JSON).
+It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visualizations (for example images/plots), that are all stored in one file.
+The cells can have any order, allowing the creation of a literate programming style graphical implementation, where narrative descriptions and executable patches of code can be intertwined.
For example to have a paragraph of text about a patch of code, and run that patch immediately in the same page.
The \inlinecode{.ipynb} format does theoretically allow dependency tracking between cells, see IPython mailing list (discussion started by Gabriel Becker from July 2013\footnote{\url{https://mail.python.org/pipermail/ipython-dev/2013-July/010725.html}}).
@@ -1972,13 +1972,13 @@ Integration of directional graph features (dependencies between the cells) into
The fact that the \inlinecode{.ipynb} format stores narrative text, code and multi-media visualization of the outputs in one file, is another major hurdle:
The files can easy become very large (in volume/bytes) and hard to read from source.
-Both are critical for scientific processing, especially the latter: when a web-browser with proper Javascript features isn't available (can happen in a few years).
+Both are critical for scientific processing, especially the latter: when a web-browser with proper JavaScript features isn't available (can happen in a few years).
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.

Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies!
\citet{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
However, the dependencies above are only on the server-side.
-Since Jupyter is a web-based system, it requires many dependencies on the viewing/running browser also (for example special Javascript or HTML5 features, which evolve very fast).
+Since Jupyter is a web-based system, it requires many dependencies on the viewing/running browser also (for example special JavaScript or HTML5 features, which evolve very fast).
As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility (as opposed to industrial/immediate reproducibility).
In summary, Jupyter is most useful in manual, interactive and graphical operations for temporary operations (for example educational tutorials). @@ -2002,7 +2002,7 @@ For example Conda or Spack (Appendix \ref{appendix:packagemanagement}), CGAT-cor The discussion below applies to both the actual analysis software and project management software. In this context, its more focused on the latter. -Because of their nature, higher-level languages evolve very fast, creating incompatabilities on the way. +Because of their nature, higher-level languages evolve very fast, creating incompatibilities on the way. The most prominent example is the transition from Python 2 (released in 2000) to Python 3 (released in 2008). Python 3 was incompatible with Python 2 and it was decided to abandon the former by 2015. However, due to community pressure, this was delayed to January 1st, 2020. @@ -2026,14 +2026,14 @@ Beyond technical, low-level, problems for the developers mentioned above, this c \subsubsection{Dependency hell} The evolution of high-level languages is extremely fast, even within one version. -For example packages thar are written in Python 3 often only work with a special interval of Python 3 versions (for example newer than Python 3.6). +For example packages that are written in Python 3 often only work with a special interval of Python 3 versions (for example newer than Python 3.6). This isn't just limited to the core language, much faster changes occur in their higher-level libraries. For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citep[see][]{hinsen15}. On the other hand, the dependency graph of tools written in high-level languages is often extremely complex. For example see Figure 1 of \citet{alliez19}, it shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python). -Acceptable dependency intervals between the dependencies will cause incompatibilities in a year or two, when a robust pakage manager is not used (see Appendix \ref{appendix:packagemanagement}). +Acceptable dependency intervals between the dependencies will cause incompatibilities in a year or two, when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}). Since a domain scientist doesn't always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team). This greatly increases the complexity of the project, even for the principal author. A good reproducible workflow can account for these different versions. @@ -2045,7 +2045,7 @@ As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 in It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph. This problem actually occurred while we were testing Sciunit: even though it installed, it couldn't run because of conflicts (its last commit was only 1.5 years old), for more see Appendix \ref{appendix:sciunit}. \citet{hinsen15} also report a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}). -Ofcourse, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}). 
Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).

@@ -2103,7 +2103,7 @@ Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or
Reproducible Electronic Documents (\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}) is the first attempt that we could find on doing reproducible research \citep{claerbout1992,schwab2000}.
It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
Their introductions on the importance of reproducibility, resonate a lot with today's environment in computational sciences.
-In particluar the heavy investment one has to make in order to re-do another scientist's work, even in the same team.
+In particular the heavy investment one has to make in order to re-do another scientist's work, even in the same team.
RED also influenced other early reproducible works, for example \citet{buckheit1995}.

To orchestrate the various figures/results of a project, from 1990, they used ``Cake'' \citep[]{somogyi87}, a dialect of Make, for more on Make, see Appendix \ref{appendix:jobmanagement}.
@@ -2114,7 +2114,7 @@ Several basic low-level Makefiles were included in the high-level/central Makefi
The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result dir), this is the directory where built files are kept.
Afterwards, the reader could set which figures/parts of the project to reproduce by manually adding its name in the central Makefile, and running Make.

-At the time, Make was already practiced by individual researchers and projects as a job orchestraion tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistant across projects.
+At the time, Make was already practiced by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistent across projects.
This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs).
However, RED only used the existing software of the host system, it had no means to control them.
Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing \citep{fomel09}.
@@ -2130,7 +2130,7 @@ Apache Taverna (\url{https://taverna.incubator.apache.org}) is a workflow manage
A workflow is defined as a directed graph, where nodes are called ``processors''.
Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, were each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
-The main user interface is graphical, where users place processers in a sheet and define links between their intputs outputs.
+The main user interface is graphical, where users place processors in a sheet and define links between their inputs and outputs.
\citet{zhao12} have studied the problem of workflow decays in Taverna.
In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails} [Since kepler is older, it may be better to bring the VisTrails features here.]
@@ -2142,7 +2142,7 @@ In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails} \label{appendix:madagascar} Madagascar (\url{http://ahay.org}) is a set of extensions to the SCons job management tool \citep{fomel13}. For more on SCons, see Appendix \ref{appendix:jobmanagement}. -Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was disucssed in Appendix \ref{appendix:red}. +Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}. Madagascar does include project management tools in the form of SCons extensions. However, it isn't just a reproducible project management tool, it is primarily a collection of analysis programs, tools to interact with RSF files, and plotting facilities. @@ -2209,7 +2209,7 @@ However, even though XML is in plain text, it is very hard to edit manually. VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:analysisworkflow}). Besides the fact that it is no longer maintained, the conceptual differences with the proposed template are substantial. The most important is that VisTrails doesn't control the software that is run, it only controls the sequence of steps that they are run in. -This template also defines dependencies and opertions based on the very standard and commonly known Make system, not a custom XML format. +This template also defines dependencies and operations based on the very standard and commonly known Make system, not a custom XML format. Scripts can easily be written to generate an XML-formatted output from Makefiles. @@ -2233,16 +2233,16 @@ Besides some small differences, this seems to be very similar to GenePattern (Ap The IPOL journal (\url{https://www.ipol.im}) attempts to publish the full implementation details of proposed image processing algorithm as a scientific paper \citep[first published article in July 2010]{limare11}. An IPOL paper is a traditional research paper, but with a focus on implementation. The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed). -The author's own implementation of the algorithm is also published with the paper (in C, C++ or Matlab), the code must be commented well enough and link each part of it with the relevant part of the paper. +The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB), the code must be commented well enough and link each part of it with the relevant part of the paper. The authors must also submit several example datasets/scenarios. The referee actually inspects the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper. After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input. -The IPOL model is indeed the single most robust model of peer review and publishing computaional research methods/implementations. +The IPOL model is indeed the single most robust model of peer review and publishing computational research methods/implementations. It has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone. We encourage the reader to visit its webpage and see some of its recent papers and their demos. 
It can be so thorough and complete because it has a very narrow scope (image processing), and the published algorithms are highly atomic, not needing significant dependencies (beyond input/output), allowing the referees to go deep into each implemented algorithm. -Infact high-level languages like Perl, Python or Java are not acceptable precisely because of the additional complexities/dependencies that they require. +In fact, high-level languages like Perl, Python or Java are not acceptable precisely because of the additional complexities/dependencies that they require. Ideally (if any referee/reader was inclined to do so), the proposed template of this paper allows for a similar level of scrutiny, but for much more complex research scenarios, involving hundreds of dependencies and complex processing on the data. @@ -2273,8 +2273,8 @@ When the Python module contains a component written in other languages (mostly C As mentioned in \citep{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to access it. Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required). -Installing it using the Debain or Arch Linux package managers also failed due to dependencies. -Furthermore, as a high-level data format HDF5 evolves very fast, for example HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team. +Installing it using the Debian or Arch Linux package managers also failed due to dependencies. +Furthermore, as a high-level data format HDF5 evolves very fast, for example HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team. % maybe replace with: February 29\textsuperscript{th}, 2020? While data and code are indeed fundamentally similar concepts technically \tonote{cite Konrad's paper on this}, they are used by humans differently. This becomes a burden when large datasets are used, this was also acknowledged in \citet{hinsen15}. @@ -2319,14 +2319,14 @@ A ``verifiable computational result'' (\url{http://vcr.stanford.edu}) is an outp It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citep{gabriel11}. A VRI is created using tags within the programming source that produced that output, also recording its version control or history. -This enables exact identificatication and citation of results. -The VRIs are automatically generated web-URLs that link to public VCR reposities containing the data, inputs and scripts, that may be re-executed. -According to \citet{gavish11}, the VRI generation routine has been implemented in Matlab, R and Python, although only the Matlab version was available during the writing of this paper. +This enables exact identification and citation of results. +The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs and scripts, that may be re-executed. +According to \citet{gavish11}, the VRI generation routine has been implemented in MATLAB, R and Python, although only the MATLAB version was available during the writing of this paper. VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF. Unfortunately most parts of the webpage are not complete at the time of this writing. 
The VCR webpage contains an example PDF\footnote{\url{http://vcr.stanford.edu/paper.pdf}} that is generated with this system, however, the linked VCR repository (\inlinecode{http://vcr-stat.stanford.edu}) does not exist at the time of this writing.
-Finally, the date of the files in the Matlab extension tarball are set to 2011, hinting that probably VCR has been abandoned soon after the the publication of \citet{gavish11}.
+Finally, the dates of the files in the MATLAB extension tarball are set to 2011, hinting that probably VCR has been abandoned soon after the publication of \citet{gavish11}.

@@ -2343,7 +2343,7 @@ SOLE also supports workflows as Galaxy tools \citep{goecks10}.
For reproducibility, \citet{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) will be very expensive.
-The manual/artificial defintion of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags the authors may not consider important, but may be useful later).
+The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags the authors may not consider important, but may be useful later).
The solution of the proposed template (where anything coming out of the analysis is directly linked to the paper's contents with \LaTeX{} elements avoids these problems.

@@ -2357,7 +2357,7 @@ The captured environment can be viewed in plain text, a web interface.
Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).

-The actual code that Sumatra wraps around, must itself be under version control, and it doesn't run if there is non-commited changes (although its not clear what happens if a commit is ammended).
+The actual code that Sumatra wraps around, must itself be under version control, and it doesn't run if there are uncommitted changes (although it's not clear what happens if a commit is amended).
Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
Therefore Sumatra makes no attempt at storing the environment of the analysis as in Sciunit (see Appendix \ref{appendix:sciunit}), but its information.
Sumatra thus needs to know the language of the running program.
@@ -2374,7 +2374,7 @@ It thus provides resources to link various workflow/analysis components (see App
\citet{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
The important thing is that the research object concept is not specific to any special workflow, it is just a metadata bundle which is only as robust in reproducing the result as the running workflow.
-For example, Apache Tavenra cannot guarantee exact reproducibility as described in Appendix \ref{appendix:taverna}.
+For example, Apache Taverna cannot guarantee exact reproducibility as described in Appendix \ref{appendix:taverna}. But when a translator is written to convert the proposed template into research objects, they can do this. @@ -2440,7 +2440,7 @@ For a discussion on the convention, please see Section \ref{sec:principles}, in The Popper team's own solution is through a command-line program called \inlinecode{popper}. The \inlinecode{popper} program itself is written in Python, but job management is with the HashiCorp configuration language (HCL). HCL is primarily aimed at running jobs on HashiCorp's ``infrastructure as a service'' (IaaS) products. -Until September 30th, 2019\footnote{\url{https://github.blog/changelog/2019-09-17-github-actions-will-stop-running-workflows-written-in-hcl}}, HCL was used by ``GitHub Actions'' to manage workflows. +Until September 30th, 2019\footnote{\url{https://github.blog/changelog/2019-09-17-github-actions-will-stop-running-workflows-written-in-hcl}}, HCL was used by ``GitHub Actions'' to manage workflows. % maybe use the \textsuperscript{th} with dates? To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run. The scaffold is very similar to the raw template of that is proposed in this paper. @@ -2460,7 +2460,7 @@ Whole Tale (\url{https://wholetale.org}) is a web-based platform for managing a It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}). The web-based nature of Whole Tale's approach, and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility. -For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notbook could not be executed because of a dependency problem. +For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem. This has been reported to the authors as issue 113\footnote{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}, but as all the second-order dependencies evolve, its not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale. Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems: 1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands. 2) It is not possible to see how the environment was built accurately (when the Dockerfile uses \inlinecode{apt}), for more on this, please see Appendix \ref{appendix:packagemanagement}. -- cgit v1.2.1