author:    Mohammad Akhlaghi <mohammad@akhlaghi.org>  2021-01-05 01:19:07 +0000
committer: Mohammad Akhlaghi <mohammad@akhlaghi.org>  2021-01-05 01:19:07 +0000
commit:    e4a5566861bb7b639624c50be45b2a04d0ce9197 (patch)
tree:      90e7d82bdb4b7a19323c5b348ceef68ca3d15533
parent:    962128c13207c2e247469ef81471c96828dea33b (diff)
Polished main paper and appendices after a full re-read
In preparation for the submission of the revised manuscript, I went through
the full paper and appendices one last time. The second appendix (reviewing
existing reproducible solutions) in particular needed some attention
because some of the tools weren't properly compared with the criteria.
In the paper, I was also able to remove about 30 words, and bring our own
count (which is an over-estimation already) to below 6250.
-rw-r--r--  paper.tex                                | 129
-rw-r--r--  tex/src/appendix-existing-solutions.tex  | 142
-rw-r--r--  tex/src/appendix-existing-tools.tex      | 196
-rw-r--r--  tex/src/figure-src-format.tex            |   2
-rw-r--r--  tex/src/supplement.tex                   |   4
5 files changed, 263 insertions, 210 deletions
@@ -75,9 +75,8 @@ A set of criteria is introduced to address this problem: %% METHOD Completeness (no \new{execution requirement} beyond \new{a minimal Unix-like operating system}, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software. - They have been tested in several research publications in various fields. %% RESULTS - As a proof of concept, ``Maneage'' is introduced, enabling cheap archiving, provenance extraction, and peer verification. + As a proof of concept, we introduce ``Maneage'' (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications. %% CONCLUSION We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders. @@ -156,20 +155,21 @@ A basic review of the longevity of commonly used tools is provided here \new{(fo ). } -To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, but discontinued in 2019). +To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019). However, containers (e.g., Docker or Singularity) are currently the most widely-used solution. We will focus on Docker here because it is currently the most common. \new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.} -Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} imports `\inlinecode{FROM ubuntu:16.04}'. +Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'. The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there. Hence, if the image is built in different months, it will contain different OS components. -In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early).} +In the year 2024, when this version's long-term support (LTS) expires \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early)}, the image will not be available at the expected URL. Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive. 
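The digest-based identification mentioned above can be sketched in a Dockerfile; this is a hedged illustration (the commented digest is a placeholder, not the real checksum of any Ubuntu image):

```dockerfile
# Mutable reference: `ubuntu:16.04' resolves to whichever monthly
# tarball is current at build time, so rebuilds differ over time.
FROM ubuntu:16.04

# Immutable reference (placeholder digest, for illustration only):
# always resolves to one exact image, while that image stays hosted.
# FROM ubuntu@sha256:<digest>
```

Even with a pinned digest, reproducibility still depends on the image remaining archived, which ties into the retention and longevity problems discussed in this section.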
-\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.} -Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel: older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).} +\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (more than 6 months old) will be deleted in free accounts from mid 2021.} +Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel. +Hence older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).} Once the host OS is ready, PMs are used to install the software or environment. Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs. @@ -177,14 +177,14 @@ The former has \new{the same longevity} as the OS, while some of the latter (suc Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; that of their supported CPU architectures}. However, they need root permissions and are primarily targeted at the Linux kernel. Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them. -Unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
+Unless precise version identifiers of \emph{every software package} are stored by project authors, a third-party PM will use the most recent version. Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity. With the software environment built, job management is the next component of a workflow. Visual/GUI workflow tools like Apache Taverna, GenePattern \new{(deprecated)}, Kepler or VisTrails \new{(deprecated)}, which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management. \new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated. These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.} -The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project. +The more recent solutions (mostly non-GUI, written in Python) leave this to the authors of the project. Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices. This includes automatic verification, which is possible in many solutions, but is rarely practiced. Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive). @@ -196,9 +196,9 @@ Furthermore, as with job management, computational notebooks do not actively enc \new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
By default, cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language. There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these. -However, since they are not part of the core, their longevity can be assumed to be shorter. -Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.} -In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}. +However, since they are not part of the core, a shorter longevity can be assumed. +The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial.} +Notebooks can therefore rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}. @@ -206,7 +206,7 @@ In summary, notebooks can rarely deliver their promised potential \cite{rule18} \section{Proposed criteria for longevity} \label{criteria} -The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}. +The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it just before publication \cite{austin17,fineberg19}. In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles). 
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable. We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers: @@ -214,14 +214,14 @@ We argue and propose that workflows satisfying the following criteria can not on \textbf{Criterion 1: Completeness.} A project that is complete (self-contained) has the following properties. (1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system. -Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longevity}.} +Fewer explicit execution requirements would mean larger \emph{execution possibility} and consequently longer \emph{longevity}.} (2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute. (3) No impact on the host OS libraries, programs, and \new{environment variables}. -(4) Does not require root privileges to run (during development or post-publication). +(4) No root privileges to run (during development or post-publication). (5) Builds its own controlled software \new{with independent environment variables}. (6) Can run locally (without an internet connection). (7) Contains the full project's analysis, visualization \emph{and} narrative: including instructions to automatically access/download raw inputs, build necessary software, do the analysis, produce final data products \emph{and} final published report with figures \emph{as output}, e.g., PDF or HTML. -(8) It can run automatically, \new{without} human interaction. +(8) It can run automatically, without human interaction. 
\textbf{Criterion 2: Modularity.} A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects. @@ -270,7 +270,7 @@ A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free sof When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware. \new{Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics). -In such cases, it is best to immediately convert the data to free formats upon collection, and archive (e.g., on Zenodo) or use the data in free formats.} +In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data as free formats.} @@ -283,13 +283,13 @@ In such cases, it is best to immediately convert the data to free formats upon c \section{Proof of concept: Maneage} -With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. +With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. \new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.} It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective. 
-The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. +It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. It was developed as a parallel research project over five years of publishing reproducible workflows of our research. -The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. +Its primordial implementation was used in \cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. Technically, the hardest criterion to implement was the first (completeness); in particular \new{restricting execution requirements to only a minimal Unix-like operating system}. One solution we considered was GNU Guix and Guix Workflow Language (GWL). @@ -303,11 +303,11 @@ It is thus mature, actively maintained, highly optimized, efficient in managing Researchers using free software have also already had some exposure to it \new{(most free research software are built with Make).} Linking the analysis and narrative (criterion 7) was historically our first design element. -To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection. +To avoid the problems with computational notebooks mentioned above, we adopt a more abstract linkage, providing a more direct and traceable connection. Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis. For example, \cite{akhlaghi19} writes `\emph{... 
detect the outer wings of M51 down to S/N of 0.25 ...}'. The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'. -The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value `\inlinecode{0.25}' when the PDF output is built. +The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is automatically generated after the analysis and expands to the value `\inlinecode{0.25}' upon creation of the PDF. Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables. These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command. @@ -325,29 +325,27 @@ These are combined at the end to generate precise software \new{acknowledgment} \fi% } for other examples, see \cite{akhlaghi19, infante20}. -\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited. +\new{Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected to report in the paper (they are reported for this paper in the acknowledgments). These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.} The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}). -All software dependencies are built down to precise versions of every tool, including the shell,\new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software. 
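The linkage quoted above can be sketched in plain LaTeX; only the macro name `\inlinecode{\textbackslash{}demosfoptimizedsn}' and its value `\inlinecode{0.25}' come from the example in the text, while the macro-file name \inlinecode{project.tex} is illustrative:

```latex
% Sketch of the analysis--narrative linkage (file name is illustrative).
% During the analysis, a Make recipe appends the macro definition to a
% macro file, e.g.:
%   \newcommand{\demosfoptimizedsn}{0.25}
% The paper's preamble then loads all such macros:
\input{project.tex}
% ... and the narrative uses the macro inline:
detect the outer wings of M51 down to S/N of $\demosfoptimizedsn$
```

Because the macro file is rebuilt whenever its recipe's inputs change, the number in the PDF can never silently drift away from the analysis that produced it.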
-\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}. +All software dependencies are built down to precise versions of every tool, including the shell, \new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software. +\new{The source code of all the free software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}. Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.} On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}). Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification. -\new{Finally, some software cannot be built on some CPU architectures, hence by default, the architecture is included in the final built paper automatically (see below).} \new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time). However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment. Hence the few hours of initial software building is negligible compared to a project's life span. To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM. -The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has instructions on building in Docker. 
+The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has thorough instructions on building in Docker. Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage. For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.} The analysis phase of the project however is naturally different from one project to another at a low-level. It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity. -This design is demonstrated with the example of Figure \ref{fig:datalineage} (left). -It is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}. +This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}. Figure \ref{fig:datalineage} (right) is the data lineage that produced it. \begin{figure*}[t] @@ -378,9 +376,15 @@ This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice). A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research. +All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}). +Project authors add their modular subMakefiles in between. 
+Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file at the bottom of each subMakefile in Figure \ref{fig:datalineage}). +Other built files (``targets'' in intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files. + \begin{lstlisting}[ label=code:topmake, - caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}} + caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}.\\ + \new{For full file, see \href{https://archive.softwareheritage.org/swh:1:cnt:d552dc18749fbb16249b642cd4f8107c1ce8ff68;origin=https://gitlab.com/makhlaghi/maneage-paper.git;visit=swh:1:snp:ee7cc3bb558c4af703e8de53dd590654c8967663;anchor=swh:1:rev:e4f61544facf8a3bd88c8466e7d3d847544c8228;path=/reproduce/analysis/make/top-make.mk}{SoftwareHeritage}}} ] # Default target/goal of project. all: paper.pdf @@ -401,12 +405,7 @@ include $(foreach s,$(makesrc), \ reproduce/analysis/make/$(s).mk) \end{lstlisting} -All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}). -Project authors add their modular subMakefiles in between. -Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}). -Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files. - -Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}). 
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not available in \cite{infante20}). All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility. Where exact reproducibility is not possible \new{(for example, due to parallelization)}, values can be verified by the project authors. \new{For example see \new{\href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh}} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.} @@ -420,9 +419,14 @@ Where exact reproducibility is not possible \new{(for example, due to paralleliz The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage. (b) A finished/published project can be revitalized for new technologies by merging with the core branch. Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch. - Briefly, Git is a version control system, allowing a structured backup of project files. - Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made. - The upward arrows at the branch-tops are therefore in the time direction. + Briefly, Git is a version control system, allowing a structured backup of project files, for more see + \ifdefined\separatesupplement% + supplementary appendices (section on version control)% + \else% + Appendix \ref{appendix:versioncontrol}% + \fi% + . Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
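The kind of checksum-based verification described for \inlinecode{verify.mk} can be sketched with standard GNU coreutils; the file names and contents below are illustrative only, not Maneage's actual deliverables:

```shell
# Minimal sketch of checksum verification (in the spirit of verify.mk).
printf '0.25\n' > demo-out.txt           # a deliverable built by the analysis
sha256sum demo-out.txt > verify.sha256   # checksum recorded once by the authors
# On any re-run, a changed deliverable makes this check fail loudly:
sha256sum --quiet -c verify.sha256 && echo "VERIFIED"
```

In Maneage the recorded checksums live in the project's version-controlled source, so a reader rebuilding the project gets an automatic pass/fail signal on exact reproducibility.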
+ The upward arrows at the branch-tops are therefore in the direction of time. } \end{figure*} @@ -438,18 +442,23 @@ Furthermore, the configuration files are a prerequisite of the targets that use If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products. This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency. -\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, are not stored together in a single file, and do not require a unique editor. -To satisfy the modularity criterion, the analysis steps and narrative are run in their own files (in different languages, thus maximally benefiting from the unique features of each) and the files can be viewed or manipulated with any text editor that the authors prefer. +\new{In contrast to notebooks like Jupyter, the analysis scripts, configuration parameters and paper's narrative are therefore not blended into a single file, and do not require a unique editor. +To satisfy the modularity criterion, the analysis steps and narrative are written and run in their own files (in different languages) and the files can be viewed or manipulated with any text editor that the authors prefer. The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better-formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.} To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
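The configuration-file-as-prerequisite pattern described above can be sketched as a single Make rule; all file names here are hypothetical, not Maneage's actual files:

```make
# Hypothetical sketch: because the configuration file is listed as a
# prerequisite, editing only menke-plot.conf makes Make re-run this one
# recipe (and everything that depends on its target); nothing else is
# rebuilt.  $@ expands to the target, tex/menke-plot.tex.
tex/menke-plot.tex: analysis/menke-plot.sh param/menke-plot.conf
	./analysis/menke-plot.sh param/menke-plot.conf > $@
```

Keeping parameters in such plain-text configuration files (rather than inside the recipe) is what makes this cheap re-execution, and the experimentation it encourages, possible.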
Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script). \new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).} Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}). -There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}). +There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}. -The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments). -These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgments)}. +The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown in the acknowledgments). +These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown in the acknowledgments)}. 
+ +Figure \ref{fig:branching} shows how projects can re-import Maneage at a later time (technically: \emph{merge}), thus improving their low-level infrastructure: in (a) authors do the merge during an ongoing project; +in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage. +\new{Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.} +Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication. \begin{lstlisting}[ label=code:branching, @@ -471,12 +480,8 @@ $ ./project make # Re-build to see effect. $ git add -u && git commit # Commit changes. \end{lstlisting} -The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project; -in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage. -\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.} -Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication. -Finally, the complete project source is usually $\sim100$ kilo-bytes. +Finally, a snapshot of the complete project source is usually $\sim100$ kilo-bytes. It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage. 
@@ -499,8 +504,8 @@ It can thus easily be published or archived in many servers, for example, it can %% Attempt to generalise the significance. %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally. -We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool. -Maneage user-base grew with the support of RDA, underscoring some difficulties for widespread adoption. +We have shown that it is possible to build workflows satisfying all the proposed criteria. +Here we comment on our experience in testing them through Maneage and its increasing user-base (thanks to the support of RDA). Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used. Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools. @@ -510,16 +515,17 @@ Indeed the fast-evolving tools are primarily targeted at software developers, wh Scientists, on the other hand, need to focus on their own research fields and need to consider longevity. Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template or bundle that works immediately out of the box by producing a paper with an example calculation that they just need to start customizing. Using mature and time-tested tools for blending version control, the research paper's narrative, the software management \emph{and} a robust data management strategy.
-We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
+We have noticed that providing a clear checklist of the initial customizations is much more effective in encouraging mastery of these core analysis tools than having abstract, isolated tutorials on each tool individually.

Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like OSs} (Maneage is actively tested on several GNU/Linux distributions and macOS, and is being ported to FreeBSD).
This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software on their own.
They later share their low-level commits on the core branch, thus propagating them to all derived projects.
-\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.}.
-However \new{our focus is on reproducing results (output of software), not the software itself.}
-Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
+\new{Thirdly, Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so our completeness condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.
+However, our focus is on reproducing results (output of software), not the software itself.}
+Well-written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU Portability Library, or Gnulib).
+
On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
However, glibc is not installable on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
@@ -533,12 +539,17 @@ Using continuous integration (CI) is one way to precisely identify breaking poin
%This is a long-term goal and would require major changes to academic value systems.
%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.

-Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows their many advantages.
+Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows many advantages.
For example, the publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
The completeness criterion implies that algorithms and data selection can be included in the optimizations.

Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis, the resulting narrative, \emph{and} the history of that analysis+narrative.
-Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
+Parsers can be written over projects for meta-research and provenance studies, e.g., to generate Research Objects
+\ifdefined\separatesupplement
+(see the supplement appendix).
+\else
+(see Appendix \ref{appendix:researchobject}).
+\fi
Likewise, when a bug is found in a scientific software package, affected projects can be detected and the scale of the effect can be measured.
Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
@@ -548,7 +559,7 @@ From the data repository perspective, these criteria can also be useful, e.g., t
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
(3) Software management: with these criteria, each project comes with its own unique and complete software management.
It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), hence enabling robust software management, preservation, publishing, and citation.
-For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we have exploited the free-software criterion to distribute the source code of all software used in each project as deliverables.
+For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we distribute the source code of all software used in each project in tarballs, as deliverables.
(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria.
@@ -725,7 +736,7 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
%% \item \citeappendix{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'', and that it is important to empower scientists.
%% \item \citeappendix{burrell18} open software (in particular Python) in heliophysics.
%% \item \citeappendix{allen18} show that many papers do not cite software.
-%% \item \citeappendix{zhang18} explicity say that they won't release their code: ``We opt not to make the code used for the chemical evo-lution modeling publicly available because it is an important asset of the re-searchers’ toolkits''
+%% \item \citeappendix{zhang18} explicitly say that they will not release their code: ``We opt not to make the code used for the chemical evo-lution modeling publicly available because it is an important asset of the re-searchers’ toolkits''
%% \item \citeappendix{jones19} make a genuine effort at reproducing every number in the paper (using Docker, Conda, and CGAT-core, and Binder), but they can ultimately only release scripts. They claim it is not possible to reach that level of reproducibility, but here we show it is.
%% \item LSST uses Kubernetes and docker for reproducibility \citeappendix{banek19}.
%% \item Interesting survey/paper on the importance of coding in science \citeappendix{merali10}.
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index e802644..d149c77 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -21,53 +21,55 @@
\section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
-As reviewed in the introduction, the problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
+The problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
-In this appendix, some of the solutions are reviewed.
+In this appendix, \emph{some} of the solutions are reviewed.
+We are not just reviewing solutions that can be used today.
+The main focus of this paper is longevity; therefore, we also spent considerable time finding and inspecting solutions that have been aborted, discontinued, or abandoned.
+
The solutions are based on an evolving software landscape; therefore, they are ordered by date: when the project has a web page, the year of its first release is used for the sorting.
Otherwise, the paper's publication year is used.
-
For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
Therefore, proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
Other studies have also attempted to review existing reproducible solutions, for example, see \citeappendix{konkol20}.
-
-
+We have tried our best to test and read through the documentation of almost all reviewed solutions to a sufficient level.
+However, due to time constraints, it is inevitable that we may have missed some aspects of the solutions, or incorrectly interpreted their behavior and outputs.
+In such cases, please let us know and we will correct it in the text on the paper's Git repository and publish the updated PDF on \href{https://doi.org/10.5281/zenodo.3872247}{zenodo.3872247} (the version-independent DOI, which always points to the most recent Zenodo upload).

\subsection{Suggested rules, checklists, or criteria}
-Before going into the various implementations, it is also useful to review existing suggested rules, checklists, or criteria for computationally reproducible research.
-
-All the cases below are primarily targeted to immediate reproducibility and do not consider longevity explicitly.
-Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).
+Before going into the various implementations, it is useful to review some existing suggested rules, checklists, or criteria for computationally reproducible research.

Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
Generally, these are very similar to the criteria proposed here and follow a similar spirit, but they do not provide any actual research papers applying all those points, nor do they provide a proof of concept.
-The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity (Popper itself is written in Python with many dependencies, and its core operating language has already changed once).
+The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity: Popper has already changed its core workflow language once and is written in Python with many dependencies that evolve fast, see Appendix \ref{appendix:highlevelinworkflow}.
For more on Popper, please see Section \ref{appendix:popper}.

-For improved reproducibility in Jupyter notebook users, \citeappendix{rule19} propose ten rules to improve reproducibility and also provide links to example implementations.
+For improved reproducibility of Jupyter notebooks, \citeappendix{rule19} propose ten rules and also provide links to example implementations.
These can be very useful for users of Jupyter but are not generic for non-Jupyter-based computational projects.
Some criteria (which are indeed very good in a more general context) do not directly relate to reproducibility, for example, their Rule 1: ``Tell a Story for an Audience''.
Generally, as reviewed in
-\ifdefined\separatesupplement
-the main body of this paper (section on the longevity of existing tools)
-\else
-Section \ref{sec:longevityofexisting}
+\ifdefined\separatesupplement%
+the main body of this paper (section on the longevity of existing tools)%
+\else%
+Section \ref{sec:longevityofexisting}%
\fi
and Section \ref{appendix:jupyter} (below), Jupyter itself has many issues regarding reproducibility.

To create Docker images, N\"ust et al. propose ``ten simple rules'' in \citeappendix{nust20}.
They recommend practices that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'' to separate the computational environment from the data.
However, the long-term reproducibility of the images is not included as a criterion by these authors.
For example, they recommend using base operating systems, with version identification limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which has serious longevity problems
-\ifdefined\separatesupplement
-(as discussed in the longevity of existing tools section of the main paper).
-\else
-(Section \ref{sec:longevityofexisting}).
-\fi
+\ifdefined\separatesupplement%
+(as discussed in the longevity of existing tools section of the main paper)%
+\else%
+(Section \ref{sec:longevityofexisting})%
+\fi.
Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}).

+Previous criteria are thus primarily targeted at immediate reproducibility and do not consider longevity.
+Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic, as mentioned in Appendix \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).
@@ -81,7 +83,7 @@ In particular, the heavy investment one has to make in order to re-do another sc
RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.

To orchestrate the various figures/results of a project, from 1990, they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
-As described in \cite{schwab2000}, in the latter half of that decade, they moved to GNU Make, which was much more commonly used, developed and came with a complete and up-to-date manual.
+As described in \cite{schwab2000}, in the latter half of that decade, they moved to GNU Make, which was much more commonly used, better maintained, and came with a complete and up-to-date manual.
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and to organize the steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
@@ -100,13 +102,13 @@ Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see A
\subsection{Apache Taverna (2003)}
\label{appendix:taverna}
-Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface which is still being developed.
+Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface which is still being used and developed.
A workflow is defined as a directed graph, where nodes are called ``processors''.
Each processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage like
\ifdefined\separatesupplement
-lineage figure of the main paper.
+the lineage figure of the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
@@ -145,12 +147,12 @@ Furthermore, the blending of the workflow component with the low-level analysis
\subsection{GenePattern (2004)}
\label{appendix:genepattern}
GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is a client-server software containing many common analysis functions/modules, primarily focused on gene studies.
-Although it is highly focused to a special research field, it is reviewed here because its concepts/methods are generic, and in the context of this paper.
+Although it is highly focused on a specialized research field, it is reviewed here because its concepts/methods are generic.
Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
The modules are used through a web interface; the modern implementation is GenePattern Notebook \citeappendix{reich17}.
It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis.
-However, the wrapper modules just call an existing tool on the host system.
+However, the wrapper modules just call an existing tool on the running system.
Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.

%% GenePattern shutdown announcement (although as of November 2020, it does not open any more): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019
@@ -159,7 +161,9 @@ However, it was shut down on November 15th 2019 due to the end of funding.
All processing with this server has stopped, and any archived data on it has been deleted.
Since GenePattern is free software, there are alternative public servers to use, so hopefully, work on it will continue.
However, funding is limited and those servers may face similar funding problems.
+
This is a very nice example of the fragility of solutions that depend on archiving and running the research codes with high-level research products (including data and binary/compiled codes that are expensive to keep in one place).
+The data and software may have backups in other places, but the high-level project-specific workflows, on which researchers spent most of their time, have been lost due to the deletion (unless the authors backed them up privately!).
@@ -169,11 +173,11 @@ This is a very nice example of the fragility of solutions that depend on archivi
Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}} \citeappendix{ludascher05} is a Java-based graphical user interface workflow management tool.
Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to
\ifdefined\separatesupplement
-the lineage figure shown in the main paper.
+the lineage figure shown in the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
-Each actor is connected to others through the Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
+Each actor is connected to others through Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
In many aspects, the usage of Kepler and its issues for long-term reproducibility are similar to those of Apache Taverna (see Section \ref{appendix:taverna}).
@@ -193,14 +197,14 @@ Since the main goal was visualization (as images), apparently its primary output
Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL); for more, see \citeappendix{scheidegger08}.
Since XML is a plain text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
-However, even though XML is in plain text, it is very hard to edit manually.
+However, even though XML is in plain text, it is very hard to read/edit without the VisTrails software (which is no longer maintained).
VisTrails, therefore, provides a graphical user interface with a visual representation of the project's inter-dependent steps (similar to
\ifdefined\separatesupplement
the data lineage figure of the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
-Besides the fact that it is no longer maintained, VisTrails does not control the software that is run, it only controls the sequence of steps that they are run in.
+Besides the fact that it is no longer maintained, VisTrails did not control the software that is run; it only controlled the sequence of steps in which the software is run.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose. Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too. -For example the very large cost of maintaining such a system and being based on a graphic environment. +For example the very large cost of maintaining such a system, being based on a graphic environment and blending hand-written code with automatically generated (large) files. @@ -225,17 +229,16 @@ The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{ An IPOL paper is a traditional research paper, but with a focus on implementation. The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed). The author's own implementation of the algorithm is also published with the paper (in C, C++, or MATLAB), the code must be commented well enough and link each part of it with the relevant part of the paper. -The authors must also submit several examples of datasets/scenarios. +The authors must also submit several example of datasets that show the applicability of their proposed algorithm. The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper. After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web-interface and even provide their own input. -IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone. +IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019. We encourage the reader to visit its web page and see some of its recent papers and their demos. 
The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
In fact, high-level languages like Perl, Python, or Java are not acceptable in IPOL precisely because of the additional complexities, such as the dependencies that they require.
-However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
+However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analyses, so while it is modular (a single module, doing a very specific thing), this solution is not scalable.

-IPOL thus fails on our Scalability criteria.
Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails criteria 6 and 7.
Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF.
A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similarly detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
@@ -267,13 +270,15 @@ In the Python version, all processing steps and input data (or references to the
When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.

As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
-Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required).
+Downloading the pre-built ``HDF View'' binaries (a GUI browser of HDF5 files that is provided by the HDF group) is not possible anonymously/automatically (login is required).
Installing it using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
Furthermore, as a high-level data format, HDF5 evolves very fast; for example, HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team.
% maybe replace with: February 29\textsuperscript{th}, 2020?

-While data and code are indeed fundamentally similar concepts technically\citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving Terabytes of data can be less than a megabyte.
-Hence, storing code and data together becomes a burden when large datasets are used, this was also acknowledged in \citeappendix{hinsen15}.
-Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published.
+While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently.
+The hand-written code of a large project involving terabytes of data can be as small as 100 kilobytes.
+When the two are bundled together, merely inspecting one line of the code requires downloading terabytes of data that are not needed; this was also acknowledged in \citeappendix{hinsen15}.
+It may also happen that the data are proprietary (for example, medical patient data).
+In such cases, the data must not be publicly released, but the methods that were applied to them can.
Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary files.
These files can later be removed as part of the analysis, but this makes the code more complicated and hard to read/maintain.
For example, the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.
@@ -290,11 +295,14 @@ Hence the extra volume for data and obscure HDF5 format that needs special tools
\label{appendix:collage}
The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment, which has a web-based graphical user interface.
-Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
-%\tonote{See how it containerizes the software environment}
-
-
+Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example, figures).
+In their Figure 3, they nicely visualize how the ``Executable Paper'' of Collage operates through two servers and a computing backend.
+Unfortunately, no web page is provided in the paper to follow up on the work and find its current status.
+A web search also only pointed us to its main paper (\citeappendix{nowakowski11}).
+In the paper, they do not discuss the major issue of software versioning and its verification to ensure that future updates to the backend do not affect the result; apparently, it just assumes the software exists on the ``Computing backend''.
+Since we could not access or test it, we rely on the descriptions in the paper: it seems to be very similar to the modern-day Jupyter notebook concept (see Appendix \ref{appendix:jupyter}), which had not yet been created in its current form in 2011.
+So we expect similar longevity issues with Collage.
\subsection{SHARE (2011)}
@@ -302,7 +310,7 @@ Through its web-based interface, viewers of a paper can actively experiment with
SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
SHARE was awarded second place in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
Simply put, SHARE was just a VM library that users could download or connect to, and run.
-The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible.
+The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements or standards on making the VM itself reproducible, or enforcing common internals for its supported projects.

As of January 2021, the top SHARE web page still works.
However, upon selecting any operation, a notice is printed that ``SHARE is offline'' since 2019 and the reason is not mentioned.
@@ -315,15 +323,23 @@ However, upon selecting any operation, a notice is printed that ``SHARE is offli
A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc.) that is associated with a ``verifiable result identifier'' (VRI), see \citeappendix{gavish11}.
It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
-A VRI is created using tags within the programming source that produced that output, also recording its version control or history.
+A VRI is a hash that is created using tags within the programming source that produced that output, also recording its version control or history.
This enables the exact identification and citation of results.
The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs, and scripts, that may be re-executed.
-According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available during the writing of this paper.
+According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available on the webpage in January 2021.
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
+In effect, this is very similar to what we have done at the end of the caption of
+\ifdefined\separatesupplement
+the first figure in the main body of the paper,
+\else
+Figure \ref{fig:datalineage},
+\fi
+where you can click on the given Zenodo link and be taken to the raw data that created the plot.
+However, instead of a long and hard-to-read hash, we simply point to the plotted file's source as a Zenodo DOI (which has long-term funding for longevity).
-Unfortunately, most parts of the web page are not complete at the time of this writing.
-The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
-Finally, the date of the files in the MATLAB extension tarball is set to 2011, hinting that probably VCR has been abandoned soon after the publication of \citeappendix{gavish11}.
+Unfortunately, most parts of the web page are not complete as of January 2021.
+The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} did not exist (again, as of January 2021).
+Finally, the date of the files in the MATLAB extension tarball is set to May 2011, hinting that VCR was probably abandoned soon after the publication of \citeappendix{gavish11}.
@@ -340,8 +356,9 @@ SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}.
For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
+
The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (the authors may not consider tags as important things, but they may be useful later).
-In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text through \LaTeX{} macros that are the backbone of the whole system (aren't artifical/extra features).
+In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text through \LaTeX{} macros that are the backbone of the whole system (they are not artificial/extra features).
@@ -372,6 +389,7 @@ It thus provides resources to link various workflow/analysis components (see App
\citeappendix{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
The important thing is that the research object concept is not specific to any special workflow, it is just a metadata bundle/standard which is only as robust in reproducing the result as the running workflow.
+Therefore, if implemented over a complete workflow like Maneage, it can be very useful in analysing/optimizing the workflow, finding common components between many Maneage'd workflows, or translating to other complete workflows.
@@ -386,7 +404,7 @@ Sciunit was originally written in Python 2 (which reached its end-of-life on Jan
Therefore Sciunit2 is a new implementation in Python 3.
The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built.
-This is a major problem for scientific projects: in principle (not knowing how the programs were built) and in practice (archiving a large volume sciunit for every step of the analysis requires a lot of storage space).
+This is a major problem for scientific projects: in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of the analysis requires a lot of storage space and archival cost).
@@ -423,9 +441,9 @@ Hence two projects that each use a 1-terabyte dataset will need a full copy of t
\subsection{Binder (2017)}
Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize already existing Jupyter-based processing steps.
Users simply add a set of Binder-recognized configuration files to their repository and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
-One good feature of Binder is that the imported Docker image must be tagged (something like a checksum).
-This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
+One good feature of Binder is that the imported Docker image must be tagged, although as mentioned in Appendix \ref{appendix:containers}, tags do not ensure reproducibility.
However, it does not make sure that the Dockerfile used by the imported Docker image follows a similar convention also.
+So users can simply use generic operating system names.
Binder is used by \citeappendix{jones19}.
@@ -436,6 +454,8 @@ Binder is used by \citeappendix{jones19}.
%% I took the date from their PyPI page, where the first version 0.1 was published in November 2016.
Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image.
Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
+The reproducibility issues with these tools have been thoroughly discussed in Appendix \ref{appendix:existingtools}.
+
Simply put, it is a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without their own client.
The file structure (which is under version control) includes codes, input data, and output data.
@@ -452,19 +472,19 @@ However, there is one directory that can be used to store files that must not be
Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}.
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
The \inlinecode{popper} program itself is written in Python.
-However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows.
-Moreover, from October 2019 GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
-This is an important issue when low-level choices are based on service providers.
+However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows at that time.
+But from October 2019, GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
+This is an important issue when low-level choices are based on service providers (see Appendix \ref{appendix:highlevelinworkflow}).
To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
-However, as of this writing, the scaffold is not complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
By default, Popper runs in a Docker image (so root permissions are necessary and reproducibility issues with Docker images have been discussed above), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
Popper does not comply with the completeness, minimal complexity, and including the narrative criteria.
Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control.
-Hence, tracking future changes in Popper and how they relate to the high-level projects that depend on it will be very hard.
-In Maneage, the same \inlinecode{maneage} git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage.
+Hence, tracking future low-level changes in Popper and how they relate to the high-level projects that depend on it through the scaffold will be very hard.
+In Maneage, users start their projects by branching off of the core \inlinecode{maneage} Git branch.
+Hence, any future change in the low-level features will be directly propagated to all derived projects (and will be clear as Git conflicts if the user has customized them).
@@ -477,10 +497,11 @@ It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:edito
The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
-This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
+This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}} and fixed.
+But as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands.
-2) It is not possible to accurately see how the environment was built (when the Dockerfile uses \inlinecode{apt}).
+2) It is not possible to accurately see how the environment was built (when the Dockerfile uses operating system package managers like \inlinecode{apt}).
This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in \citeappendix{oliveira18}; for more on this, please see Appendix \ref{appendix:packagemanagement}.
@@ -488,6 +509,7 @@ This issue with Whole Tale (and generally all other solutions that only rely on
\subsection{Occam (2018)}
+\label{appendix:occam}
Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution.
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey here that attempts to be complete in this aspect.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index d923d5f..1d0b383 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -47,11 +47,11 @@ VMs thus provide the ultimate control one can have over the run-time environment
However, the VM's kernel does not talk directly to the running hardware that is doing the analysis; it talks to a simulated hardware layer that is provided by the host's kernel.
Therefore, a process that is run inside a virtual machine can be much slower than one that is run on a native kernel.
An advantage of VMs is that they are a single file that can be copied from one computer to another, keeping the full environment within them if the format is recognized.
-VMs are used by cloud service providers, enabling fully independent operating systems on their large servers (where the customer can have root access).
+VMs are used by cloud service providers, enabling fully independent operating systems on their large servers where the customer can have root access.
VMs were used in solutions like SHARE \citeappendix{vangorp11} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 \citeappendix{gabriel11}), or in suggested reproducible papers like \citeappendix{dolfi14}. However, due to their very large size, these are expensive to maintain, thus leading SHARE to discontinue its services in 2019. -The URL to the VM file \texttt{provenance\_machine.ova} that is mentioned in \citeappendix{dolfi14} is not currently accessible (we suspect that this is due to size and archival costs). +The URL to the VM file \texttt{provenance\_machine.ova} that is mentioned in \citeappendix{dolfi14} is also not currently accessible (we suspect that this is due to size and archival costs). \subsubsection{Containers} \label{appendix:containers} @@ -65,7 +65,7 @@ We review some of the most common container solutions: Docker, Singularity, and \begin{itemize} \item {\bf\small Docker containers:} Docker is one of the most popular tools nowadays for keeping an independent analysis environment. - It is primarily driven by the need of software developers for reproducing a previous environment, where they have root access mostly on the ``cloud'' (which is just a remote VM). + It is primarily driven by the need of software developers for reproducing a previous environment, where they have root access mostly on the ``cloud'' (which is usually a remote VM). A Docker container is composed of independent Docker ``images'' that are built with a \inlinecode{Dockerfile}. It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build). To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the C library. 
@@ -91,16 +91,16 @@ Meng \& Thain \citeappendix{meng17} also give similar reasons on why Docker imag
On a more fundamental level, VMs or containers do not store \emph{how} the core environment was built.
This information is usually in a third-party repository, and not necessarily inside the container or VM file, making it hard (if not impossible) to track for future users.
-This is a major problem when considering reproducibility, which is also highlighted as a major issue in terms of long term reproducibility in \citeappendix{oliveira18}.
+This is a major problem in relation to the proposed completeness criteria and is also highlighted as an issue in terms of long-term reproducibility by \citeappendix{oliveira18}.
-The example of \cite{mesnard20} was previously mentioned in
+The example \inlinecode{Dockerfile} of \cite{mesnard20} was previously mentioned in
\ifdefined\separatesupplement
the main body of this paper, when discussing the criteria.
\else
in Section \ref{criteria}.
\fi
Another useful example is the \href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{\inlinecode{Dockerfile}} of \citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
-When we tried to build it (November 2020), the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
+When we tried to build it (November 2020), we noticed that the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
In principle, it is possible to investigate the difference between this new image and the old one that the authors used, but that would require a lot of effort and may not be possible when the changes are not available in a third-party public repository or not under version control.
In Docker, it is possible to retrieve the precise Docker image with its digest, for example, \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we have not seen this often done in existing examples of ``reproducible'' \inlinecode{Dockerfiles}.
@@ -123,7 +123,7 @@ The virtual machine and container solutions mentioned above, have their own inde
Another approach to having an isolated analysis environment is to use the same file system as the host, but installing the project's software in a non-standard, project-specific directory that does not interfere with the host.
Because the environment in this approach can be built in any custom location on the host, this solution generally does not require root permissions or extra low-level layers like containers or VMs.
However, ``moving'' the built product of such solutions from one computer to another is not generally as trivial as containers or VMs.
-Examples of such third-party package managers (that are detached from the host OS's package manager) include Nix, GNU Guix, Python's Virtualenv package, and Conda, among others.
+Examples of such third-party package managers (that are detached from the host OS's package manager) include (but are not limited to) Nix, GNU Guix, Python's Virtualenv package, and Conda.
Because it is highly intertwined with the way software is built and installed, third-party package managers are described in more detail as part of Section \ref{appendix:packagemanagement}.
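The project-specific-directory approach described above can be sketched with standard GNU-build-system commands (the paths and package name below are hypothetical, and the download/build steps are shown as comments, not as a definitive recipe):

```shell
# Minimal sketch (hypothetical paths): install all of a project's
# software under one project-specific prefix, so nothing touches the
# host's standard directories and no root permissions are needed.
prefix="$HOME/project/software"
mkdir -p "$prefix"

# A typical GNU-build-system installation into that prefix would be:
#   tar -xf pkg-1.0.tar.gz && cd pkg-1.0
#   ./configure --prefix="$prefix"
#   make && make install

# The host environment is only modified for this project's session:
export PATH="$prefix/bin:$PATH"
```

Removing the single `$prefix` directory then removes the whole environment, which is one reason this approach does not need container or VM layers.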
Maneage (the solution proposed in this paper) also follows a similar approach of building and installing its own software environment within the host's file system, but without depending on it beyond the kernel. @@ -140,7 +140,7 @@ Package management is the process of automating the build and installation of a A package manager thus contains the following information on each software package that can be run automatically: the URL of the software's tarball, the other software that it possibly depends on, and how to configure and build it. Package managers can be tied to specific operating systems at a very low level (like \inlinecode{apt} in Debian-based OSs). Alternatively, there are third-party package managers that can be installed on many OSs. -Both are discussed in more detail in what follows. +Both are discussed in more detail below. Package managers are the second component in any workflow that relies on containers or VMs for an independent environment, and the starting point in others that use the host's file system (as discussed above in Section \ref{appendix:independentenvironment}). In this section, some common package managers are reviewed, in particular those that are most used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions}. @@ -148,7 +148,7 @@ For a more comprehensive list of existing package managers, see \href{https://en Note that we are not including package managers that are specific to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX). \subsubsection{Operating system's package manager} -The most commonly used package managers are those of the host operating system, for example, \inlinecode{apt} or \inlinecode{yum} respectively on Debian-based, or RedHat-based GNU/Linux operating systems, \inlinecode{pkg} in FreeBSD, among many others in other OSes. 
+The most commonly used package managers are those of the host operating system, for example, \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pkg}, which are respectively used in Debian-based, Red Hat-based and FreeBSD-based OSs (among many other OSs).
These package managers are tightly intertwined with the operating system: they also include the building and updating of the core kernel and the C library.
Because they are part of the OS, they also commonly require root permissions.
@@ -163,38 +163,44 @@ Hence a fixed version of the dependencies must also be specified.
In robust package managers like Debian's \inlinecode{apt} it is possible to fully control (and later reproduce) the built environment of a high-level software.
Debian also archives all packaged high-level software in its Snapshot\footnote{\inlinecode{\url{https://snapshot.debian.org/}}} service since 2005, which can be used to build the higher-level software environment on an older OS \citeappendix{aissi20}.
-Therefore it is indeed theoretically possible to reproduce the software environment only using archived operating systems and their own package managers, but unfortunately, we have not seen it practiced in scientific papers/projects.
+Therefore, it is indeed theoretically possible to reproduce the software environment only using archived operating systems and their own package managers, but unfortunately, we have not seen it practiced in (reproducible) scientific papers/projects.
-In summary, the host OS package managers are primarily meant for the operating system components or very low-level components.
-Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) do not use the host's package manager, but an independent package manager, like the ones discussed below.
+In summary, the host OS package managers are primarily meant for the low-level operating system components.
+Hence, many robust reproducible analysis workflows (reviewed in Appendix \ref{appendix:existingsolutions}) do not use the host's package manager, but an independent package manager, like the ones discussed below.
-\subsubsection{Packaging with Linux containerization}
-Once a software is packaged as an AppImage\footnote{\inlinecode{\url{https://appimage.org}}}, Flatpak\footnote{\inlinecode{\url{https://flatpak.org}}} or Snap\footnote{\inlinecode{\url{https://snapcraft.io}}} the software's binary product and all its dependencies (not including the core C library) are packaged into one file.
-This makes it very easy to move that single software's built product to newer systems.
-However, because the C library is not included, it can fail on older systems.
-Moreover, these are designed for the Linux kernel (using its containerization features) and can thus only be run on GNU/Linux operating systems.
+\subsubsection{Blind packaging of already built software}
+An already-built software binary contains links to the system libraries it uses.
+Therefore, one way of packaging software is to look into the binary file for the libraries it uses and bring them into one file with the executable, so that the same set of dependencies is moved around with the desired software on different systems.
+Tools like AppImage\footnote{\inlinecode{\url{https://appimage.org}}}, Flatpak\footnote{\inlinecode{\url{https://flatpak.org}}} or Snap\footnote{\inlinecode{\url{https://snapcraft.io}}} are designed for this purpose: the software's binary product and all its dependencies (not including the core C library) are packaged into one file.
+This makes it very easy to move that single software's built product and already-built dependencies to different systems.
+However, because the C library is not included, it can fail on newer/older systems (depending on the system it was built on).
+We call this method ``blind'' packaging because it is agnostic to \emph{how} the software and its dependencies were built (which is important in a scientific context).
+Moreover, these types of packagers are designed for the Linux kernel (using its containerization and unique mounting features).
+They can therefore only be run on GNU/Linux operating systems.
\subsubsection{Nix or GNU Guix}
\label{appendix:nixguix}
-Nix \citeappendix{dolstra04} and GNU Guix \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
+Nix\footnote{\inlinecode{\url{https://nixos.org}}} \citeappendix{dolstra04} and GNU Guix\footnote{\inlinecode{\url{https://guix.gnu.org}}} \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
GNU Guix is based on the same principles as Nix but implemented differently, so we focus the review here on Nix.
The Nix approach to package management is unique in that it allows exact dependency tracking of all the dependencies, and allows for multiple versions of software; for more details see \citeappendix{dolstra04}.
-In summary, a unique hash is created from all the components that go into the building of the package.
+In summary, a unique hash is created from all the components that go into the building of the package (including the instructions on how to build the software).
That hash is then prefixed to the software's installation directory.
As an example from \citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that is compiled with it (and thus needs it to run) will link to this unique address.
This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
-As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
+As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges (failing our completeness criteria).
This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries and this is necessary for the security or development perspectives.
-However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bit-wise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary).
+However, in a non-computer-science analysis (for example in the natural sciences), the main aim is reproducible \emph{results}, which can also be created with the same software version even when the binaries are not bit-wise identical (for example, when they are installed in other locations, because the installation location is hard-coded in the software binary, or when they are built for a different CPU architecture).
-Finally, while Guix and Nix do allow precisely reproducible environments, it requires extra effort.
-For example, simply running \inlinecode{guix install gcc} will install the most recent version of GCC that can be different at different times.
+Finally, while Guix and Nix do allow precisely reproducible environments, they require extra effort on the user's side to ensure that the built environment is reproducible later.
+For example, simply running \inlinecode{guix install gcc} (the most common way to install new software) will install the most recent version of GCC, which can be different at different times.
Hence, similar to the discussion in host operating system package managers, it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future.
-Generally, this is a major limitation of projects that rely on detached package managers for building their software, including the other tools mentioned below.
+It is not a complex operation, but like the Docker digest codes mentioned in Appendix \ref{appendix:containers}, many users will probably not know about it, forget it, or ignore it.
+Generally, this is an issue with projects that rely on detached (third-party) package managers for building their software, including the other tools mentioned below.
+We solved this problem in Maneage by including the package manager and analysis steps in one project: it is simply not possible to forget to record the exact versions of the software used.
\subsubsection{Conda/Anaconda}
\label{appendix:conda}
@@ -206,12 +212,12 @@ However, it is not possible to fix the versions of the dependencies through the
This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
In that discussion, the authors of \citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citeappendix{gruning18}.
-However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., re-built) with the project.
+However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem; they do not fix it: containers are not static and need to evolve (i.e., get re-built) with the project.
Given these limitations, \citeappendix{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
Conda installs with a shell script that contains a binary-blob (+500 megabytes, embedded in the shell script).
This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
-After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
+After installing Conda in any location, users can easily activate that environment by loading a special shell script.
However, the resulting environment is not fully independent of the host operating system as described below:
\begin{itemize}
@@ -219,12 +225,12 @@ However, the resulting environment is not fully independent of the host operatin
However, the host operating system's directories are also appended afterward.
Therefore, a user or script may not notice that the software that is being used is actually coming from the operating system, and not from the controlled Conda installation.
-\item Generally, by default, Conda relies heavily on the operating system and does not include core analysis components like \inlinecode{mkdir}, \inlinecode{ls} or \inlinecode{cp}.
- Although they are generally the same between different Unix-like operating systems, they have their differences.
- For example, \inlinecode{mkdir -p} is a common way to build directories, but this option is only available with GNU Coreutils (default on GNU/Linux systems).
- Running the same command within a Conda environment on a macOS would crash.
+\item Generally, by default, Conda relies heavily on the operating system and does not include core commands like \inlinecode{mkdir} (to make a directory), \inlinecode{ls} (to list files) or \inlinecode{cp} (to copy).
+ Although a minimal functionality is defined for them in POSIX and they generally behave similarly for basic operations on different Unix-like operating systems, they have their differences.
+ For example, \inlinecode{mkdir -p} is a common way to build directories, but this option is only available with the \inlinecode{mkdir} of GNU Coreutils (default on GNU/Linux systems and installable in almost all Unix-like OSs).
+ Running the same command within a Conda environment that does not include GNU Coreutils on macOS would crash.
Important packages like GNU Coreutils are available in channels like conda-forge, but they are not the default.
- Therefore, many users may not recognize this, and failing to account for it, will cause unexpected crashes.
+ Therefore, many users may not recognize this, and failing to account for it will cause unexpected crashes when the project is run on a new system.
\item Many major Conda packaging ``channels'' (for example the core Anaconda channel, or very popular conda-forge channel) do not include the C library that a package was built with as a dependency.
They rely on the host operating system's C library.
@@ -235,7 +241,7 @@ However, the resulting environment is not fully independent of the host operatin
\item Conda does allow a package to depend on a special build of its prerequisites (specified by a checksum, fixing its version and the version of its dependencies).
However, this is rarely practiced in the main Git repositories of channels like Anaconda and conda-forge: only the names of the high-level prerequisite packages are listed in a package's \inlinecode{meta.yaml} file, which is version-controlled.
Therefore, two builds of the package from the same Git repository will result in different tarballs (depending on what prerequisites were present at build time).
-  In the Conda tarball (that contains the binaries and is not under version control) \inlinecode{meta.yaml} does include the exact versions of most build-time dependencies.
+  In Conda's downloaded tarball (that contains the built binaries and is not under version control) the exact versions of most build-time dependencies are listed.
However, because the different software of one project may have been built at different times, if they depend on different versions of a single software there will be a conflict and the tarball cannot be rebuilt, or the project cannot be run.
\end{itemize}

@@ -254,7 +260,7 @@ Because of such bootstrapping problems (for example how Spack needs Python to bu
In conclusion, there are two common issues regarding generic package managers that hinder their usage for high-level scientific projects:
\begin{itemize}
-\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivatives) only download the software in a binary (pre-compiled) format.
+\item {\bf\small Pre-compiled/binary downloads:} Most package managers primarily download the software in a binary (pre-compiled) format.
This allows users to download the software very quickly and run it almost instantaneously.
However, to provide for this, servers need to keep binary files for each build of the software on different operating systems (for example Conda needs to keep binaries for Windows, macOS and GNU/Linux operating systems).
It is also necessary for them to store binaries for each build of the software, including builds against different versions of its dependencies.
@@ -262,14 +268,15 @@ In conclusion for all package managers, there are two common issues regarding ge
\item {\bf\small Adding high-level software:} Packaging new software is not trivial and needs a good level of knowledge/experience with that package manager.
  For example, each one has its own special syntax/standards/languages, with pre-defined variables that must already be known before someone can package new software for them.
-  However, in many research projects, the most high-level analysis software is written by the team that is doing the research, and they are its primary users, even when the software is distributed with free licenses on open repositories.
-  Although active package manager members are commonly very supportive in helping to package new software, many teams may not be able to make that extra effort/time investment.
+  However, in many research projects, the most high-level analysis software is written by the team that is doing the research, and they are its primary/only users, even when the software is distributed with free licenses on open repositories.
+
+  Although active package manager members are commonly very supportive in helping to package new software, many teams may not be able to make that extra effort and time investment to package their most high-level (i.e., relevant) software in it.
  As a result, they manually install their high-level software in an uncontrolled or non-standard way, thus jeopardizing the reproducibility of the whole work.
  This is another consequence of the detachment of the package manager from the project doing the analysis.
\end{itemize}

Addressing these issues has been the basic reason behind the proposed solution: based on the completeness criteria, instructions to download and build the packages are included within the actual science project, and no special/new syntax/language is used.
-Software download, built and installation is done with the same language/syntax that researchers manage their research: using the shell (by default GNU Bash in Maneage) and Make (by default, GNU Make in Maneage).
+Software download, build and installation are done with the same language/syntax with which researchers manage their research: using the shell (by default GNU Bash in Maneage) for low-level steps and Make (by default, GNU Make in Maneage) for job management.

@@ -302,12 +309,12 @@ With Git, changes in a project's contents are accurately identified by comparing
When the user decides the changes are significant compared to the archived state, they can be ``committed'' into the history/repository.
The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the purpose/goals of the changes), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other like
\ifdefined\separatesupplement
-the figure on Git in the main body of the paper.
+the figure on Git in the main body of the paper).
\else
Figure \ref{fig:branching}).
\fi
For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
-Commits are is commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc}.
+Commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
The team can host the Git history on a web page and collaborate through that.
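The commit and branch mechanics described above can be sketched with a few Git commands in a throw-away repository (a minimal sketch; it assumes `git` is installed, and all file, branch, and identity names here are invented for illustration):

```shell
# Throw-away repository illustrating commit checksums and branching
# (assumes git is installed; names below are invented for this sketch).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.org"   # local identity, sketch only
git config user.name  "Demo"
echo "first analysis step" > analysis.sh
git add analysis.sh
git commit -q -m "Add first analysis step"
full=$(git rev-parse HEAD)           # full 40-character commit checksum
short=$(git rev-parse --short HEAD)  # abbreviated summary, like f4953cc
echo "$full (abbreviated: $short)"
git branch experiment                # a parallel branch for experimentation
git branch --list
```

The abbreviated checksum printed by `git rev-parse --short` is exactly the kind of short identifier (for example `f4953cc`) used in the text above to refer to a full commit.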
@@ -321,20 +328,21 @@ For example, it is first necessary to download a dataset and do some preparation
Each one of these is a logically independent step, which needs to be run before/after the others in a specific order.
Hence job management is a critical component of a research project.
-There are many tools for managing the sequence of jobs, below we review the most common ones that are also used the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+There are many tools for managing the sequence of jobs; below we review the most common ones that are also used in the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions} and in Maneage.

\subsubsection{Manual operation with narrative}
\label{appendix:manual}
-The most commonly used workflow system for many researchers is to run the commands, experiment on them, and keep the output when they are happy with it.
-As an improvement, some researchers also keep a narrative description of what they ran.
+The most commonly used workflow system for many researchers is to run the commands, experiment on them, and keep the output when they are happy with it (thereby losing the actual command that produced it).
+As an improvement, some researchers also keep a narrative description in a text file, along with a copy of the commands they ran.
At least in our personal experience with colleagues, this method is still being heavily practiced by many researchers.
-Given that many researchers do not get trained well in computational methods, this is not surprising and as discussed in
+Given that many researchers are not well trained in computational methods, this is not surprising.
+As discussed in
\ifdefined\separatesupplement
the discussion section of the main paper,
\else
Section \ref{discussion},
\fi
-we believe that improved literacy in computational methods is the single most important factor for the integrity/reproducibility of modern science.
+based on this observation, we believe that improved literacy in computational methods is the single most important factor for the integrity/reproducibility of modern science.

\subsubsection{Scripts}
\label{appendix:scripts}
@@ -342,19 +350,20 @@ Scripts (in any language, for example GNU Bash, or Python) are the most common w
They are primarily designed to execute each step sequentially (one after another), making them also very intuitive.
However, as the series of operations becomes complex and large, managing the workflow in a script will become highly complex.

-For example, if 90\% of a long project is already done and a researcher wants to add a followup step, a script will go through all the previous steps (which can take significant time).
+For example, if 90\% of a long project is already done and a researcher wants to add a followup step, a script will go through all the previous steps every time it is run (which can take significant time).
In other scenarios, when a small step in the middle of the analysis has to be changed, the full analysis needs to be re-run from the start.
-Scripts have no concept of dependencies, forcing authors to ``temporarily'' comment parts that they do not want to be re-run (forgetting to un-comment such parts are the most common cause of frustration for the authors and others attempting to reproduce the result).
+Scripts have no concept of dependencies, forcing authors to ``temporarily'' comment parts that they do not want to be re-run.
+Therefore, forgetting to un-comment them afterwards is the most common cause of frustration.

-Such factors discourage experimentation, which is a critical component of the scientific method.
+This discourages experimentation, which is a critical component of the scientific method.
+It is possible to add conditionals all over the script (thus manually defining dependencies, or only running certain steps at certain times), but these make the script harder to read, add logical complexity, and introduce many bugs themselves.
Parallelization is another drawback of using scripts.
While it is not impossible, because of the high-level nature of scripts, it is not trivial and parallelization can also be very inefficient or buggy.

\subsubsection{Make}
\label{appendix:make}
Make was originally designed to address the problems mentioned above for scripts \citeappendix{feldman79}.
-In particular, it addresses the context of managing the compilation of software programs that involve many source code files.
+In particular, it was designed in the context of managing the compilation of software source code that is distributed over many files.
With Make, the source files of a program that have not been changed are not recompiled.
Moreover, when two source files do not depend on each other, and both need to be rebuilt, they can be built in parallel.
This was found to greatly help in debugging software projects, and in speeding up test builds, giving Make a core place in software development over the last 40 years.

@@ -373,27 +382,30 @@ Therefore all three components in a rule must be files on the running filesystem
To decide which operation should be re-done when executed, Make compares the timestamp of the targets and prerequisites.
When any of the prerequisite(s) is newer than a target, the recipe is re-run to re-build the target.
When all the prerequisites are older than the target, that target does not need to be rebuilt.
-The recipe can contain any number of commands, they should just all start with a \inlinecode{TAB}.
-Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend interested readers to consult the GNU Make manual for a nice introduction\footnote{\inlinecode{\url{http://www.gnu.org/software/make/manual/make.pdf}}}.
+A recipe is just a bundle of shell commands that are executed when necessary.
+Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend that interested readers consult the GNU Make manual for a very good introduction\footnote{\inlinecode{\url{http://www.gnu.org/software/make/manual/make.pdf}}}.

\subsubsection{Snakemake}
-Snakemake is a Python-based workflow management system, inspired by GNU Make (which is the job organizer in Maneage), that is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
+\label{appendix:snakemake}
+Snakemake is a Python-based workflow management system, inspired by GNU Make (discussed above).
+It is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
It defines its own language to implement the ``rule'' concept of Make within Python.
-Currently, it requires Python 3.5 (released in September 2015) and above, while Snakemake was originally introduced in 2012.
-Hence it is not clear if older Snakemake source files can be executed today.
-As reviewed in many tools here, this is a major longevity problem when using high-level tools as the skeleton of the workflow.
Technically, calling command-line programs within Python is very slow, and using complex shell scripts in each step will involve a lot of quoting that makes the code hard to read.
+Currently, Snakemake requires Python 3.5 (released in September 2015) and above, while it was originally introduced in 2012.
+Hence it is not clear if older Snakemake source files can be executed today.
+As seen with many of the tools reviewed here, depending on high-level systems for low-level project components causes a major bootstrapping problem that reduces the longevity of a project.
+
\subsubsection{Bazel}
Bazel\footnote{\inlinecode{\url{https://bazel.build}}} is a high-level job organizer that depends on Java and Python and is primarily tailored to software developers (with features like facilitating linking of libraries through its high-level constructs).

\subsubsection{SCons}
\label{appendix:scons}
-Scons is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
+SCons\footnote{\inlinecode{\url{https://scons.org}}} is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
In many aspects it is similar to Make; for example, it is managed through a `SConstruct' file.
Like a Makefile, SConstruct is also declarative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the imperative paradigm which is common in languages like C, Python, or FORTRAN).
However, unlike Make, SCons does not use the file modification date to decide if a target should be remade.
-SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents have changed.
+SCons keeps the MD5 hash of all the files in a hidden binary file and checks them to see if a re-run is necessary.
SCons thus attempts to work on a declarative file with an imperative language (Python).
It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
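The Make rule structure discussed above (targets, prerequisites, TAB-indented recipes, and timestamp-based re-execution) can be sketched in a throw-away example; this is a minimal illustration assuming GNU Make is installed, and all file names are invented:

```shell
# Throw-away sketch of Make's timestamp-based dependency logic
# (assumes GNU Make is installed; file names are invented).
set -e
dir=$(mktemp -d)
cd "$dir"
echo "raw data" > input.txt
# One rule: target 'result.txt' depends on prerequisite 'input.txt';
# the recipe line must start with a TAB character.
printf 'result.txt: input.txt\n\tcp input.txt result.txt\n' > Makefile
make              # target missing: the recipe runs and builds result.txt
make              # target newer than prerequisite: nothing is re-done
touch input.txt   # update the prerequisite's timestamp
make              # prerequisite now newer than target: recipe re-runs
```

The second invocation doing nothing, and the fourth re-running only the affected rule, is exactly the behavior that distinguishes Make from a plain script, which would repeat every step on every run.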
@@ -410,7 +422,8 @@ This can also be problematic when a Python analysis library, may require a Pytho
\subsubsection{CGAT-core}
CGAT-Core is a Python package for managing workflows; see \citeappendix{cribbs19}.
It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
-It is used in papers like \citeappendix{jones19}, but as mentioned in \citeappendix{jones19} it is good for managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
+It is used in papers like \citeappendix{jones19}.
+However, as mentioned in \citeappendix{jones19}, it is best suited to managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial.
Another drawback with this workflow manager is that Python is a very high-level language where future versions of the language may no longer be compatible with Python 3, in which CGAT-core is implemented (similar to how Python 2 programs are not compatible with Python 3).

@@ -420,23 +433,24 @@ It is closely linked with GNU Guix and can even install the necessary software n
Hence in the GWL paradigm, software installation and usage do not have to be separated.
GWL has two high-level concepts called ``processes'' and ``workflows'', where the latter defines how multiple processes should be executed together.

-In conclusion, shell scripts and Make are very common and extensively used by users of Unix-based OSs (which are most commonly used for computations).
-They have also existed for several decades and are robust and mature.
-Many researchers are also already familiar with them and have already used them.
-As we see in this appendix, the list of necessary tools for the various stages of a research project (an independent environment, package managers, job organizers, analysis languages, writing formats, editors, etc) is already very large.
-Each software has its own learning curve, which is a heavy burden for a natural or social scientist for example.
-Most other workflow management tools are yet another language that have to be mastered.
-
-Furthermore, high-level and specific solutions will evolve very fast causing disruptions in the reproducible framework.
-A good example is Popper \citeappendix{jimenez17} which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
-However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language, see Appendix \ref{appendix:popper} for more on Popper.
-
\subsubsection{Nextflow (2013)}
Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}} \citeappendix{tommaso17} is a workflow language with a command-line interface that is written in Java.

\subsubsection{Generic workflow specifications (CWL and WDL)}
Due to the variety of custom workflows used in existing reproducibility solutions (like those of Appendix \ref{appendix:existingsolutions}), some attempts have been made to define common workflow standards like the Common Workflow Language (CWL\footnote{\inlinecode{\url{https://www.commonwl.org}}}, with roots in Make, formatted in YAML or JSON) and the Workflow Description Language (WDL\footnote{\inlinecode{\url{https://openwdl.org}}}, formatted in JSON).
-These are primarily specifications/standards rather than software, so ideally translators can be written between the various workflow systems to make them more interoperable.
+With these standards, ideally, translators can be written between the various workflow systems to make them more interoperable.
+
+In conclusion, shell scripts and Make are very common and extensively used by users of Unix-based OSs (which are most commonly used for computations).
+They have also existed for several decades and are robust and mature.
+Many researchers that use heavy computations are also already familiar with them and have used them (to different levels).
+As we demonstrated above in this appendix, the list of necessary tools for the various stages of a research project (an independent environment, package managers, job organizers, analysis languages, writing formats, editors, etc) is already very large.
+Each software/tool/paradigm has its own learning curve, which is a burden for a natural or social scientist, for example (who needs to put their primary focus on their own scientific domain).
+Most workflow management tools, and the reproducible workflow solutions that depend on them, are yet another language/paradigm that has to be mastered by researchers, and are thus a heavy burden.
+
+Furthermore, as shown above (and below), high-level tools evolve very fast, causing disruptions in the reproducible framework.
+A good example is Popper \citeappendix{jimenez17}, which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
+However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language; see Appendix \ref{appendix:popper} for more on Popper.

@@ -444,7 +458,7 @@ These are primarily specifications/standards rather than software, so ideally tr
\subsection{Editing steps and viewing results}
\label{appendix:editors}
-In order to later reproduce a project, the analysis steps must be stored in files.
+In order to reproduce a project, the analysis steps must be stored in files.
For example, Shell, Python, or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or FORTRAN.
Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files.
Here we review some common methods that are currently used.

@@ -456,13 +470,16 @@ To solve this problem there are advanced text editors like GNU Emacs that allow
However, editors that can execute or debug the source (like GNU Emacs) simply run external programs for these jobs (for example GNU GCC, or GNU GDB), as if those programs were called from outside the editor.

With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
-This is a very important feature that is not commonly present for other solutions mentioned below.
+This is a very important feature and corresponds to the modularity criterion of this paper.
+This type of modularity is not commonly present in the other solutions mentioned below (where the source can only be edited/run in a specific browser).

Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that do not provide a graphic user interface.
Also, the commonly used minimalistic containers do not include a graphic user interface.
+Hence, by default, all Maneage'd projects also build the simple GNU Nano plain-text editor as part of the project (to be able to edit the source directly within a minimal environment).
+Maneage can also optionally build GNU Emacs or Vim, but it is up to the user to build them (same as their high-level science software).
\subsubsection{Integrated Development Environments (IDEs)}
-To facilitate the development of source files, IDEs add software building and running environments as well as debugging tools to a plain text editor.
+To facilitate the development of source code in special programming languages, IDEs add software building and running environments as well as debugging tools to a plain text editor.
Many IDEs have their own compilers and debuggers; hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
Furthermore, they usually require a graphic user interface to run.
In summary, IDEs are generally very specialized tools for special projects, and are not a good solution when portability (the ability to run on different systems and at different times) is required.

\subsubsection{Jupyter}
\label{appendix:jupyter}
Jupyter (initially IPython) \citeappendix{kluyver16} is an implementation of Literate Programming \citeappendix{knuth84}.
+Jupyter's name is a combination of the three main languages it was designed for: Julia, Python, and R.
The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\inlinecode{\url{https://nbformat.readthedocs.io/en/latest}}}.
-Jupyter's name is a combination of the three main languages it was designed for: Julia, Python, and R.
-The \inlinecode{.ipynb} format, is a simple, human-readable (can be opened in a plain-text editor) file, formatted in JavaScript Object Notation (JSON).
+The \inlinecode{.ipynb} format is a simple, human-readable format (it can be opened in a plain-text editor) that is formatted in JavaScript Object Notation (JSON).
It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visualizations (for example images/plots), that are all stored in one file.
The cells can have any order, allowing the creation of a literate-programming-style graphical implementation, where narrative descriptions and executable patches of code can be intertwined.
For example, one can have a paragraph of text about a patch of code, and run that patch immediately on the same page.

@@ -486,7 +503,7 @@ It is possible to manually execute only one cell, but the previous/next cells th
Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\inlinecode{\url{https://github.com/jupyter/notebook/issues/1175}}}).

The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file is another major hurdle and against the modularity criteria proposed here.
-The files can easily become very large (in volume/bytes) and hard to read.
+The files can easily become very large (in volume/bytes) and hard to read when the Jupyter web-interface is not accessible.
Both are critical for scientific processing, especially the latter: when a web browser with proper JavaScript features is not available (which can happen within a few years).
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into a much less memory-efficient textual encoding.

Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10
\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
However, the dependencies above are only on the server-side.
Since Jupyter is a web-based system, it also requires many dependencies on the viewing/running browser (for example special JavaScript or HTML5 features, which evolve very fast).
-As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility (as opposed to industrial/immediate reproducibility).
+As discussed in Appendix \ref{appendix:highlevelinworkflow}, having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility.

In summary, Jupyter is most useful for manual, interactive, and graphical operations in temporary projects (for example educational tutorials).

@@ -507,7 +524,7 @@ Currently, the most popular high-level data analysis language is Python.
R is closely tracking it and has superseded Python in some fields, while Julia \citeappendix{bezanson17} is quickly gaining ground.
These languages have themselves superseded previously popular languages for data analysis of the previous decades, for example, Java, Perl, or C++.
All are part of the C-family programming languages.
-In many cases, this means that the tools to use that language are written in C, which is the language of modern operating systems.
+In many cases, this means that the language's execution environment is itself written in C, which is the language of modern operating systems.

Scientists, or data analysts, mostly use these higher-level languages.
Therefore they are naturally drawn to also apply the higher-level languages for lower-level project management, or for designing the various stages of their workflow.
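A tiny, concrete illustration of how fast high-level languages evolve incompatibly is the Python 2 to 3 transition: the Python 2 `print` statement is a syntax error in Python 3. A minimal shell check, assuming only that a `python3` interpreter is on the PATH:

```shell
# Python 2's 'print' statement is a syntax error in Python 3
# (assumes a 'python3' interpreter is available on the PATH).
python3 -c 'print("hello")'                # Python 3 syntax: runs fine
if python3 -c 'print "hello"' 2>/dev/null; then
    echo "Python 2 syntax accepted (unexpected on Python 3)"
else
    echo "Python 2 syntax rejected by Python 3"
fi
```

A project whose workflow skeleton is written in one such language version therefore inherits this shelf-life, independently of the longevity of its analysis software.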
-Therefore, developers of packages used by others had to maintain (for example fix bugs in) both versions in one package.
+Therefore, developers had to maintain (for example fix bugs in) both versions in one package.
This is not particular to Python; a similar evolution occurred in Perl: in 2000 it was decided to improve Perl 5, but the proposed Perl 6 was incompatible with it.
However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).

It is unreasonably optimistic to assume that high-level languages will not undergo similar incompatible evolutions in the (not too distant) future.
-For software developers, this is not a problem at all: non-scientific software, and the general population's usage of them, has a similarly fast evolution.
-Hence, it is rarely (if ever) necessary to look into codes that are more than a couple of years old.
-However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed in such high-level languages.
+For industrial software developers, this is not a major problem: non-scientific software, and the general population's usage of it, has a similarly fast evolution and shelf-life.
+Hence, it is rarely (if ever) necessary to look into industrial/business codes that are more than a couple of years old.
+However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions.

-In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself can not be ignored.
+In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies.
+However, the dependencies of the workflow itself should not be ignored.
Beyond the technical, low-level problems for the developers mentioned above, this causes major problems for scientific project management as listed below:

\subsubsection{Dependency hell}
The evolution of high-level languages is extremely fast, even within one version.
-For example, packages that are written in Python 3 often only work with a special interval of Python 3 versions (for example newer than Python 3.6).
+For example, packages that are written in Python 3 often only work with a special interval of Python 3 versions.
+For instance, Snakemake and Occam can only be run on Python versions 3.4 and 3.5 or newer, respectively; see Appendices \ref{appendix:snakemake} and \ref{appendix:occam}.
This is not just limited to the core language; much faster changes occur in their higher-level libraries.
For example, version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citeappendix{hinsen15}.

@@ -548,10 +567,11 @@ Acceptable version intervals between the dependencies will cause incompatibiliti
Since a domain scientist does not always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
This greatly increases the complexity of the project, even for the principal author.
-A good reproducible workflow can account for these different versions.
-However, when the actual workflow system (not the analysis software) is written in a high-level language this will cause a major problem.
+A well-designed reproducible workflow like Maneage, which has no dependencies beyond a C compiler on a Unix-like operating system, can account for this.
+However, when the actual workflow system (not the analysis software) is written in a high-level language like the examples above, this will cause a major problem.
-For example, merely installing the Python installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages), required 32 other packages as dependencies.
+Another relevant example of the dependency hell:
+merely installing the Python installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages) required 32 other packages as dependencies.
\inlinecode{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each required 17 and 26 Python modules as dependencies, respectively.
It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
@@ -563,8 +583,8 @@ Of course, this also applies to tools that these systems use, for example Conda
This occurs primarily for domain scientists (for example astronomers, biologists, or social scientists).
Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
The inertia of programming languages is very strong.
-This is natural because they have their own science field to focus on, and re-writing their high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages every few years requires too much investment and time.
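Readers can inspect such dependency graphs in their own environment with the standard library alone; the sketch below counts the requirement strings a distribution declares. The distribution name \inlinecode{pip} is only an example here, and the result (or whether the package is installed at all) varies between environments.

```python
# Minimal sketch: count the dependencies a distribution declares,
# using only the standard library (Python >= 3.8).
from importlib import metadata

def count_declared_deps(name):
    """Return the number of requirement strings a distribution declares,
    or None if it is not installed in this environment."""
    try:
        return len(metadata.requires(name) or [])
    except metadata.PackageNotFoundError:
        return None

print(count_declared_deps("pip"))
```

Note that this only shows the first level of the graph; each requirement in turn has its own requirements, which is how the full installation of a small tool can pull in dozens of packages.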
+This is natural because they have their own science field to focus on, and re-writing their high-level analysis toolkits (which they have curated over their career and are often only readable/usable by themselves) in newer languages every few years is not practically possible.
-When this investment is not possible, either the mentee has to use the mentor's old method (and miss out on all the new tools, which they need for the future job prospects), or the mentor has to avoid implementation details in discussions with the mentee because they do not share a common language.
+Without this investment, either the mentee has to use the mentor's old method (and miss out on all the newly fashionable tools that many are talking about), or the mentor has to avoid implementation details in discussions with the mentee because they do not share a common language.
The authors of this paper have personal experience with both mentor/mentee scenarios.
This failure to communicate the details is a very serious problem, leading to the loss of valuable inter-generational experience.
diff --git a/tex/src/figure-src-format.tex b/tex/src/figure-src-format.tex
index ba4458e..f860d54 100644
--- a/tex/src/figure-src-format.tex
+++ b/tex/src/figure-src-format.tex
@@ -17,7 +17,7 @@
\texttt{\recipecomment{Call XLSX I/O to convert all the spreadsheets into different CSV files.}}
- \texttt{\recipecomment{We only want the `table-3' spreadsheet, but XLSX I/O doesn't allow setting its}}
+ \texttt{\recipecomment{We only want the `table-3' spreadsheet, but XLSX I/O does not allow setting its}}
\texttt{\recipecomment{output filename. For simplicity, let's assume it is written in `table-3.csv'.}}
\begin{abstract}
- This supplement contains appendices to the main body of the published paper on CiSE (pre-print published on \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}).
- In the paper's main body we discussed criteria for longevity and introduced a proof of concept that implements them, called Maneage (\emph{Man}aging data lin\emph{eage}).
+ This supplement contains appendices to the main body of a paper submitted to CiSE (pre-print published on \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}).
+ In the paper's main body we introduced criteria for the longevity of reproducible workflow solutions and presented a proof of concept that implements them, called Maneage (\emph{Man}aging data lin\emph{eage}).
This supplement provides an in-depth literature review of previous methods, and compares them and their lower-level tools in detail with our criteria and with the proof of concept presented in this work.
Appendix \ref{appendix:existingtools} reviews the low-level tools that are used by many reproducible workflow solutions (including our proof of concept).
Appendix \ref{appendix:existingsolutions} reviews many solutions that attempt(ed) reproducible management of workflows (including solutions that have stopped development or, worse, are no longer available online).