Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  332
1 file changed, 249 insertions(+), 83 deletions(-)
diff --git a/paper.tex b/paper.tex
index 08b431b..b28e2af 100644
--- a/paper.tex
+++ b/paper.tex
@@ -32,7 +32,7 @@
%% The paper headers
\markboth{Computing in Science and Engineering, Vol. X, No. X, MM YYYY}%
-{Akhlaghi \MakeLowercase{\textit{et al.}}: Towards Long-term and Archivable Reproducibility}
+{Akhlaghi \MakeLowercase{\textit{et al.}}: \projecttitle}
@@ -53,27 +53,34 @@
% in the abstract or keywords.
\begin{abstract}
%% CONTEXT
- Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be sustainable in the long term.
+ Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term.
%% AIM
- We therefore aim to introduce a set of criteria to address this problem.
+ A set of criteria is introduced to address this problem.
%% METHOD
- These criteria have been tested in several research publications and have the following features: completeness (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+ The criteria are: completeness (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+ They have been tested in several research publications in various fields.
%% RESULTS
As a proof of concept, ``Maneage'' is introduced for storing projects in machine-actionable and human-readable
plain text, enabling cheap archiving, provenance extraction, and peer verification.
%% CONCLUSION
We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
- We then discuss the caveats (with proposed solutions) and conclude with the benefits for the various stakeholders.
+ The caveats (with proposed solutions) are then discussed, and we conclude with the benefits for the various stakeholders.
This paper is itself written with Maneage (project commit \projectversion).
- \vspace{3mm}
+ \vspace{2.5mm}
+ \emph{Appendix} ---
+ Two comprehensive appendices that review existing tools and solutions are available
+\ifdefined\noappendix
+in \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}.
+\else
+at the end (Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}).
+\fi
+
+ \vspace{2.5mm}
\emph{Reproducible supplement} ---
All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}},
Git history of source at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}},
which is also archived on \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{SoftwareHeritage}.
-\ifdefined\noappendix
- Appendices reviewing existing reproducible solutions available in \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}.
-\fi
\end{abstract}
% Note that keywords are not normally used for peer-review papers.
@@ -108,10 +115,9 @@ Reproducible research has been discussed in the sciences for at least 30 years \
Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day,
starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
-However, these technologies develop fast, e.g., Python 2 code often cannot run with Python 3.
+However, these technologies develop fast, e.g., code written in Python 2 \new{(which is no longer officially maintained)} often cannot run with Python 3.
The cost of staying up to date within this rapidly-evolving landscape is high.
-Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree
-they need to understand the technology of their tools because it determines their results and interpretations.
+Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools because it determines their results and interpretations.
Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape
creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
@@ -119,46 +125,64 @@ creates generational gaps in the scientific community, preventing previous gener
-\section{Commonly used tools and their longevity}
-Longevity is as important in science as in some fields of industry, but this ideal is not always necessary; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
-To highlight the necessity, a sample set of commonly-used tools is reviewed here in the following order:
-(1) environment isolators -- virtual machines (VMs) or containers;
-(2) package managers (PMs) -- Conda, Nix, or Spack;
-(3) job management -- shell scripts, Make, SCons, or CGAT-core;
-(4) notebooks -- Jupyter.
+\section{Longevity of existing tools}
+\label{sec:longevityofexisting}
+\new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
+ Longevity is defined as the length of time that a project remains usable.
+ Usability is defined by context: for machines (machine-actionable, or executable files) \emph{and} humans (readability of the source).
+ Many usage contexts don't involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, checking the version of the software used, or checking the source of the input data (extracting these from the outputs of execution is not always possible).}
+
+Longevity is as important in science as in some fields of industry, but not all; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
+To highlight the necessity, a short review of commonly-used tools is provided below:
+(1) environment isolators (virtual machines, VMs, or containers);
+(2) package managers (PMs, like Conda, Nix, or Spack);
+(3) job management (like shell scripts or Make);
+(4) notebooks (like Jupyter).
+\new{A much more comprehensive review of existing tools and solutions is available in the appendices.}
To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution.
We will thus focus on Docker here.
-Ideally, it is possible to precisely identify the Docker ``images'' that are imported with their checksums, but that is rarely practiced in most solutions that we have surveyed.
-Usually, images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
-The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly and only the most recent five are archived.
-Hence, if the Dockerfile is run in different months, its output image will contain different OS components.
+\new{It is theoretically possible to precisely identify the Docker ``images'' that are used, with their checksums (or ``digests''), to re-create an identical OS image later.
+ However, that is rarely practiced.}
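+As a minimal sketch of the distinction (the digest value below is purely illustrative):
+\begin{lstlisting}
+# Mutable tag: resolves to different images over time.
+FROM ubuntu:16.04
+# Immutable digest: always the exact same image (illustrative value).
+FROM ubuntu@sha256:b6b83d3c331794420340093eb706a6f152d9c1fa51b262d9bf34594887c2c7ac
+\end{lstlisting}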
+Usually images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}' \new{(more examples in the appendices)}.
+The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
+Hence, if the image is built in different months, it will contain different OS components.
In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL.
Generally, pre-built binary files (like Docker images) are large and expensive to maintain and archive.
-%% This URL: https://www.docker.com/blog/scaling-dockers-business-to-serve-millions-more-developers-storage/}
-This prompted DockerHub (an online service to host Docker images, including many reproducible workflows) to delete images that have not been used for over 6 months.
-Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable.
+%% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
+\new{Because of this, DockerHub (where many reproducible workflows are archived) announced that, from mid-2021, images in free accounts that have been inactive for over 6 months will be deleted.}
+Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable \new{(their longevity is determined by the host kernel, usually a decade)}.
Once the host OS is ready, PMs are used to install the software or environment.
Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
-The former suffers from the same longevity problem as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
-Nix and GNU Guix produce bit-wise identical programs, but they need root permissions and are primarily targeted at the Linux kernel.
-Generally, the exact version of each software's dependencies is not precisely identified in the PM build instructions (although this could be implemented).
-Therefore, unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
+The former has \new{the same longevity} as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation \new{with a usual longevity of a few years}.
+Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity (the same as that of the supported CPU architectures)}.
+However, they need root permissions and are primarily targeted at the Linux kernel.
+Generally, in all package managers, the exact version of each software package (and its dependencies) is not precisely identified by default, although an advanced user can indeed pin them.
+Unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
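+As a minimal sketch with Conda (the package and version numbers are illustrative):
+\begin{lstlisting}
+# Unpinned: installs whatever version is newest at run time.
+conda install numpy
+# Pinned: explicit versions for the package and the interpreter.
+conda install python=3.8.5 numpy=1.19.1
+\end{lstlisting}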
Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity.
With the software environment built, job management is the next component of a workflow.
-Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors of the project.
-Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
+Visual/GUI workflow tools like Apache Taverna, GenePattern (deprecated), Kepler or VisTrails (deprecated), which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management.
+\new{However, a GUI environment is tailored to specific applications and is hard to generalize, and it becomes hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
+Their data formats are also complex (designed for computers to read) and hard for humans to read without the GUI.}
+The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project.
+Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
This includes automatic verification, which is possible in many solutions, but is rarely practiced.
-Weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
+Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
-Finally, to add narrative, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
-However, because of their complex dependency trees, they are vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+Finally, to blend narrative into the workflow, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
+However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
-Further, as with job management, computational notebooks do not actively encourage good practices in programming or project management, hence they can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
+Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
+\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting which cell to run.
+The default cells don't support dependencies (to ensure that a cell is only re-run after certain others), parallel execution, or the use of more than one language.
+There are third-party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
+However, since they aren't part of the core and have their own dependencies, their longevity can be assumed to be shorter.
+Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
+In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage.
@@ -178,7 +202,7 @@ We argue and propose that workflows satisfying the following criteria can not on
\textbf{Criterion 1: Completeness.}
A project that is complete (self-contained) has the following properties.
-(1) No dependency beyond the Portable Operating System Interface: POSIX (a minimal Unix-like environment).
+(1) No dependency beyond the Portable Operating System Interface (POSIX, \new{a minimal Unix-like standard that is shared between many operating systems}).
POSIX has been developed by the Austin Group (which includes IEEE) since 1988 and many OSes have complied.
(2) Primarily stored as plain text, not needing specialized software to open, parse, or execute.
(3) No impact on the host OS libraries, programs or environment.
@@ -200,8 +224,8 @@ Explicit communication between various modules enables optimizations on many lev
\textbf{Criterion 3: Minimal complexity.}
Minimal complexity can be interpreted as:
(1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
-A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years.
-More stable/basic tools can be used with less long-term maintenance.
+A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is now a dead language and no longer supported)}.
+More stable/basic tools can be used with less long-term maintenance costs.
(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices.
\textbf{Criterion 4: Scalability.}
@@ -225,11 +249,13 @@ A narrative description is also a deliverable (defined as ``data article'' in \c
This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
\textbf{Criterion 8: Free and open source software:}
-Reproducibility (defined in \cite{fineberg19}) is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box.
-A project that is free software (as formally defined), allows others to learn from, modify, and build upon it.
+Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box: we don't need an artificial source of ambiguity wrapped over it.
+A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined), allows others to learn from, modify, and build upon it.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
In contrast, non-free tools typically cannot be distributed or modified by others, making it reliant on a single supplier (even without payments).
+\new{It may happen that proprietary software is necessary to convert proprietary data formats produced by special hardware (for example micro-arrays in genetics) into free data formats.
+ In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example on Zenodo).}
@@ -242,7 +268,8 @@ In contrast, non-free tools typically cannot be distributed or modified by other
\section{Proof of concept: Maneage}
With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
-It was in fact awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
+\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
+It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
@@ -256,8 +283,7 @@ Inspired by GWL+Guix, a single job management tool was implemented for both inst
Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell, or C).
Make is standardized in POSIX and is used in almost all core OS components.
It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software tools have also already had some exposure to it.
-%However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics.
+Researchers using free software tools have also already had some exposure to it \new{(almost all free software projects are built with Make)}.
Linking the analysis and narrative (criterion 7) was historically our first design element.
To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
@@ -268,23 +294,34 @@ The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated du
Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
-This allows accurate post-publication provenance \emph{and} automatic updates to the embedded numbers during a project.
-Through the latter, manual updates by authors are by-passed, which are prone to errors, thus discouraging improvements after writing the first draft.
+This allows automatic updates to the embedded numbers during the experimentation phase of a project \emph{and} accurate post-publication provenance.
+Through the former, manual updates by authors (which are prone to errors and discourage improvements or experimentation after writing the first draft) are bypassed.
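+As a minimal sketch of this linkage (the value shown is illustrative):
+\begin{lstlisting}
+% Written by the analysis into a macro file:
+\newcommand{\demosfoptimizedsn}{0.25}
+% Used in the narrative source, updated on every build:
+... the optimized signal-to-noise ratio is \demosfoptimizedsn{} ...
+\end{lstlisting}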
Acting as a link, the macro files build the core skeleton of Maneage.
-For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation..
+For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
-Furthermore, machine related specifications including hardware name and byte-order are also collected and cited, as a reference point if they were needed for \emph{root cause analysis} of observed differences/issues in the execution of the wokflow on different machines.
+\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
+ These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
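A minimal sketch of such a rule (the file and variable names are hypothetical):
\begin{lstlisting}
# The macro file is the target; the analysis output is its
# prerequisite, so the macro is rebuilt only when needed.
$(mtexdir)/demo-sn.tex: $(a2dir)/optimized-sn.txt
	printf '\\newcommand{\\demosfoptimizedsn}{%s}' \
	       "$$(cat $<)" > $@
\end{lstlisting}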
All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and of course, the high-level science software.
+\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which is rarely available on each software's webpage.}
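+\new{A hedged sketch of how such precise versions can be recorded in a build configuration (the file and variable names are hypothetical):}
+\begin{lstlisting}
+# In a software versions configuration file:
+gsl-version = 2.6
+\end{lstlisting}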
+
On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
-Temporary relocation of a built project, without building from source, can be done by building the project in a container or VM (\inlinecode{README.md} has recommendations on building a \inlinecode{Dockerfile}).
+\new{Finally, not all software can be built on all CPU architectures; hence the CPU architecture of the machine used is, by default, automatically reported in the final built paper (see below).}
+
+\new{Because everything is built from source, building the core Maneage environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of that time).
+When the analysis involves complex computations, this is negligible compared to the actual analysis.
+Also, due to the Git features blended into Maneage, it is best (from the perspective of provenance) to adopt Maneage at the very start of a project and keep the history of changes as the project matures.
+To avoid repeating the build on different systems, Maneage'd projects can be built in a container or VM.
+In fact the \inlinecode{README.md} \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{has instructions} on building a Maneage'd project in Docker.
+Through Docker (or VMs), users on Microsoft Windows can benefit from Maneage, and for Windows-native software that can be run in batch mode, technologies like the Windows Subsystem for Linux can be used.}
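+\new{A hedged sketch of such a \inlinecode{Dockerfile} (the real recipe in \inlinecode{README.md} has more steps and options than this illustration):}
+\begin{lstlisting}
+FROM debian:10-slim
+# Minimal build tools; Maneage builds everything else from source.
+RUN apt-get update && apt-get install -y build-essential git wget
+RUN git clone http://git.maneage.org/project.git /project
+WORKDIR /project
+RUN ./project configure && ./project make
+\end{lstlisting}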
The analysis phase of the project however is naturally different from one project to another at a low-level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
-We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
-Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
+We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (left).
+Figure \ref{fig:datalineage} (right) is the data lineage graph that produced it (including this complete paper).
\begin{figure*}[t]
\begin{center}
@@ -295,11 +332,12 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it
\caption{\label{fig:datalineage}
Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
It shows the ratio of the number of papers mentioning software tools (green line, left vertical axis) to the total number of papers studied in that year (light red bars, right vertical axis on a log scale).
- Right: Schematic representation of the data lineage, or workflow, to generate the plot above.
- Each colored box is a file in the project and the arrows show the dependencies between them.
+ Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
+ Each colored box is a file in the project and \new{each arrow shows the operation of a piece of software, indicating what inputs it takes and what outputs it produces}.
Green files/boxes are plain-text files that are under version control and in the project source directory.
Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
- For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
+ For example, \inlinecode{paper.pdf} \new{is created by running \LaTeX{} on} \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
+ \new{Other software is used in the other steps.}
The solid arrows and full-opacity built boxes correspond to this paper.
The dotted arrows and built boxes show the scalability by adding hypothetical steps to the project.
The underlying data of the top plot is available at
@@ -309,7 +347,7 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it
The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
-This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
+This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
@@ -372,12 +410,22 @@ Furthermore, the configuration files are a prerequisite of the targets that use
If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-Finally, to satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
+\new{To summarize, in contrast to notebooks like Jupyter, in a ``Maneage''d project the analysis scripts and configuration parameters are not blended into the running code (nor all stored in one file).
+ Based on the modularity criterion, the analysis steps are kept in their own files (each in its respective language, thus maximally benefiting from its unique features) and the narrative has its own file(s).
+ The analysis communicates with the narrative through intermediate files (the \LaTeX{} macros), enabling much better blending of analysis outputs into the narrative sentences than is possible with high-level notebooks, as well as direct provenance tracking.}
+
+To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
-Derived projects start by branching off and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}, there is customization checklist in \inlinecode{README-hacking.md}).
+\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (also archived on \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
+Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
+There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
+
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
+These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgments)}.
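+\new{As a minimal sketch of how such a macro can be captured (the variable name is hypothetical):}
+\begin{lstlisting}
+# In initialize.mk: record the running project's commit, later
+# written into a LaTeX macro file as \projectversion.
+commit-hash := $(shell git describe --dirty --always)
+\end{lstlisting}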
The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
+\new{Generally, any Git workflow (branching strategy) can be used by the high-level project authors or future readers.}
Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
Finally, the complete project source is usually $\sim100$ kilobytes.
@@ -431,7 +479,8 @@ Scientists are rarely trained sufficiently in data management or software develo
Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology.
Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
-Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, the software management \emph{and} a robust data carpentry.
+Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template that works immediately out of the box, producing a paper with an example calculation that authors just need to start customizing.
+It uses mature and time-tested tools to blend version control, the research paper's narrative, software management, \emph{and} robust data management strategies.
We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
@@ -446,6 +495,7 @@ On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chai
However, glibc is not install-able on some POSIX OSs (e.g., macOS).
All programs link with the C library, and this may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
With everything else under precise control, the effect of differing Kernel and C libraries on high-level science can now be systematically studied with Maneage in follow-up research.
+\new{Using continuous integration (CI) is one way to precisely identify breaking points caused by updated technologies on the available systems.}
% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.
@@ -481,6 +531,7 @@ For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937},
The authors wish to thank (sorted alphabetically)
Julia Aguilar-Cabello,
+Dylan A\"issi,
Marjan Akbari,
Alice Allen,
Pedram Ashofteh Ardakani,
@@ -506,6 +557,12 @@ Nadia Tonello,
Ignacio Trujillo and
the AMIGA team at the Instituto de Astrof\'isica de Andaluc\'ia
for their useful help, suggestions, and feedback on Maneage and this paper.
+\new{The five referees and the editors of CiSE (Lorena Barba and George Thiruvathukal) also provided many helpful comments that clarified the points made in this paper.}
+
+This project was developed in the reproducible framework of Maneage (\emph{Man}aging data lin\emph{eage})
+\new{on Commit \inlinecode{\projectversion} (in the project branch).
+The latest merged Maneage commit was \inlinecode{\maneageversion} (\maneagedate).
+This project was built on an \inlinecode{\machinearchitecture} machine with {\machinebyteorder} byte-order and address sizes {\machineaddresssizes}}.
Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to
@@ -597,7 +654,7 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
%% the appendix is built.
\ifdefined\noappendix
\else
-\newpage
+\clearpage
\appendices
\section{Survey of existing tools for various phases}
\label{appendix:existingtools}
@@ -626,6 +683,10 @@ Therefore, a process that is run inside a virtual machine can be much slower tha
An advantage of VMs is that they are a single file that can be copied from one computer to another, keeping the full environment within them if the format is recognized.
VMs are used by cloud service providers, enabling fully independent operating systems on their large servers (where the customer can have root access).
+VMs were used in solutions like SHARE \citeappendix{vangorp11} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 \citeappendix{gabriel11}), and in reproducible papers like \citeappendix{dolfi14}.
+However, due to their very large size, they are expensive to maintain, thus leading SHARE to discontinue its services in 2019.
+Also, the URL to the VM that is mentioned in \citeappendix{dolfi14} is no longer accessible (probably due to the same reason of size and archival costs).
+
\subsubsection{Containers}
\label{appendix:containers}
Containers also host a binary copy of a running environment, but don't have their own kernel.
@@ -646,19 +707,27 @@ Below we'll review some of the most common container solutions: Docker and Singu
An important drawback of Docker for high performance scientific needs is that it runs as a daemon (a program that is always running in the background) with root permissions.
This is a major security flaw that discourages many high performance computing (HPC) facilities from providing it.
-\item {\bf\small Singularity:} Singularity is a single-image container (unlike Docker which is composed of modular/independent images).
+\item {\bf\small Singularity:} Singularity \citeappendix{kurtzer17} is a single-image container (unlike Docker which is composed of modular/independent images).
Although it needs root permissions to be installed on the system (once), it doesn't require root permissions every time it is run.
Its main program is also not a daemon, but a normal program that can be stopped.
These features make it much easier for HPC administrators to install compared to Docker.
However, the fact that it requires root access for the initial install is still a hindrance for a random project: if it's not already present on the HPC, the project can't be run as a normal user.
+
+\item {\bf\small Podman:} Podman uses the Linux kernel containerization features to enable containers without a daemon, and without root permissions.
It has a command-line interface very similar to Docker's (see the short sketch after this list), but only works on GNU/Linux operating systems.
\end{itemize}
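+As a quick sketch of this similarity (the image tag is illustrative), a Docker-style command runs unchanged, without a root daemon:
+\begin{lstlisting}
+podman run --rm -it ubuntu:16.04 bash
+\end{lstlisting}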
-Generally, VMs or containers are good solutions to reproducibly run/repeating an analysis.
+Generally, VMs or containers are good solutions for reproducibly running/repeating an analysis in the short term (a couple of years).
However, their focus is to store the already-built (binary, non-human readable) software environment.
-Storing \emph{how} the core environment was built is up to the user, in a third repository (not necessarily inside container or VM file).
-This is a major problem when considering reproducibility.
-The example of \cite{mesnard20} was previously mentioned in Section \ref{criteria}.
+Because of this, they will be large (many gigabytes) and expensive to archive, download, or access.
+Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}; this is also valid for Docker images, as is clear from DockerHub's recent decision to delete images of free accounts that haven't been used for more than 6 months.
+Meng \& Thain \citeappendix{meng17} also give similar reasons on why Docker images were not suitable in their trials.
+
+On a more fundamental level, VMs or containers don't store \emph{how} the core environment was built.
+This information is usually in a third-party repository, and not necessarily inside the container or VM file, making it hard (if not impossible) for future users to track.
+This is a major problem for reproducibility, and it is also highlighted as an issue for long-term reproducibility by \citeappendix{oliveira18}.
+The example of \cite{mesnard20} was previously mentioned in Section \ref{criteria}.
Another useful example is the \href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{\inlinecode{Dockerfile}} of \citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
When we tried to build it (November 2020), the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
Theoretically it is possible to investigate the difference between this new image and the old one that the authors used, but that will require a lot of effort and may not be possible where the changes are not in a third public repository or not under version control.
@@ -669,6 +738,14 @@ A more generic/longterm approach to ensure identical core OS componets at a late
ISO files are pre-built binary files of hundreds of megabytes that do not contain their build instructions).
For example the archives of Debian\footnote{\inlinecode{\url{https://cdimage.debian.org/mirror/cdimage/archive/}}} or Ubuntu\footnote{\inlinecode{\url{http://old-releases.ubuntu.com/releases}}} provide older ISO files.
+The concept of containers (and the independent images that build them) can also be extended beyond just the software environment.
+For example \citeappendix{lofstead19} propose a ``data pallet'' concept to containerize access to data and thus allow tracing data backwards to the application that produced them.
+
+In summary, containers or VMs are just a built product themselves.
+If they are built properly (for example building a Maneage'd project inside a Docker container), they can be useful for immediate usage and for fast transfer of the project from one system to another.
+With robust building, the container or VM can also be exactly reproduced later.
+However, attempting to archive the actual binary container or VM files as a black box (not knowing the precise versions of the software in them) is expensive, and will not be able to answer the most fundamental question: how was the environment built?
+
\subsubsection{Independent build in host's file system}
\label{appendix:independentbuild}
The virtual machine and container solutions mentioned above, have their own independent file system.
@@ -722,6 +799,10 @@ Hence it is indeed theoretically possible to reproduce the software environment
In summary, the host OS package managers are primarily meant for the operating system components or very low-level components.
Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) don't use the host's package manager, but an independent package manager, like the ones discussed below.
+\subsubsection{Packaging with Linux containerization}
+Once a software package is bundled as an AppImage\footnote{\inlinecode{\url{https://appimage.org}}}, Flatpak\footnote{\inlinecode{\url{https://flatpak.org}}} or Snap\footnote{\inlinecode{\url{https://snapcraft.io}}}, the software's binary product and all its dependencies (not including the core C library) are packaged into one file.
+This makes it very easy to move that single software's built product to newer systems; however, because the C library is not included, it can fail on older systems.
+However, these are designed for the Linux kernel (using its containerization features) and can thus only be run on GNU/Linux operating systems.
\subsubsection{Nix or GNU Guix}
\label{appendix:nixguix}
@@ -916,6 +997,17 @@ When all the prerequisites are older than the target, that target doesn't need t
The recipe can contain any number of commands; they just all need to start with a \inlinecode{TAB}.
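For example, a minimal complete rule (the file names are illustrative):
\begin{lstlisting}
# 'stats.txt' (target) is rebuilt from 'data.txt' (prerequisite)
# by the TAB-indented recipe, only when 'data.txt' is newer.
stats.txt: data.txt
	wc -l data.txt > stats.txt
\end{lstlisting}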
Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend interested readers to consult the GNU Make manual for a nice introduction\footnote{\inlinecode{\url{http://www.gnu.org/software/make/manual/make.pdf}}}.
+\subsubsection{Snakemake}
+is a Python-based workflow management system, inspired by GNU Make (which is the job organizer in Maneage), that is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
+It defines its own language to implement the ``rule'' concept in Make within Python.
+Currently it requires Python 3.5 (released in September 2015) and above, while Snakemake was originally introduced in 2012.
+Hence it is not clear if older Snakemake source files can be executed today.
+As reviewed for many tools here, this is a major longevity problem when using high-level tools as the skeleton of the workflow.
+Technically, calling command-line programs from within Python is very slow, and using complex shell scripts in each step involves a lot of quoting that makes the code hard to read.
+
+\subsubsection{Bazel}
+Bazel\footnote{\inlinecode{\url{https://bazel.build}}} is a high-level job organizer that depends on Java and Python and is primarily tailored to software developers (with features like facilitating the linking of libraries through its high-level constructs).
+
\subsubsection{SCons}
\label{appendix:scons}
SCons is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
@@ -960,8 +1052,12 @@ Furthermore, high-level and specific solutions will evolve very fast causing dis
A good example is Popper \citeappendix{jimenez17} which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language, see Appendix \ref{appendix:popper} for more on Popper.
+\subsubsection{Nextflow (2013)}
+Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}} \citeappendix{tommaso17} is a workflow language with a command-line interface that is written in Java.
-
+\subsubsection{Generic workflow specifications (CWL and WDL)}
+Due to the variety of custom workflows used in existing reproducibility solutions (like those of Appendix \ref{appendix:existingsolutions}), some attempts have been made to define common workflow standards like the Common Workflow Language (CWL\footnote{\inlinecode{\url{https://www.commonwl.org}}}, with roots in Make, formatted in YAML or JSON) and the Workflow Description Language (WDL\footnote{\inlinecode{\url{https://openwdl.org}}}, formatted in JSON).
+These are primarily specifications/standards rather than software, so ideally translators can be written between the various workflow systems to make them more interoperable.
\subsection{Editing steps and viewing results}
@@ -990,6 +1086,7 @@ Furthermore, they usually require a graphic user interface to run.
In summary IDEs are generally very specialized tools, for special projects and are not a good solution when portability (the ability to run on different systems and at different times) is required.
\subsubsection{Jupyter}
+\label{appendix:jupyter}
Jupyter (initially IPython) \citeappendix{kluyver16} is an implementation of Literate Programming \citeappendix{knuth84}.
The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\url{https://nbformat.readthedocs.io/en/latest}}.
@@ -1119,6 +1216,7 @@ This failure to communicate in the details is a very serious problem, leading to
\label{appendix:existingsolutions}
As reviewed in the introduction, the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
+The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
In this appendix, some of the solutions are reviewed.
The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a webpage, the year of its first release is used for the sorting, otherwise their paper's publication year is used.
@@ -1127,9 +1225,27 @@ Freedom of the software/method is a core concept behind scientific reproducibili
Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
Other studies have also attempted to review existing reproducible solutions, for example \citeappendix{konkol20}.
+\subsection{Suggested rules, checklists, or criteria}
+Before going into the various implementations, it is also useful to review existing suggested rules, checklists or criteria for computationally reproducible research.
+
+All the cases below are primarily targeted at immediate reproducibility and don't consider longevity explicitly.
+Therefore, they lack a strong/clear completeness criterion: they mainly suggest recording versions, and their ultimate suggestion of storing the full binary OS in a VM or container is problematic (as mentioned in Appendix \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).
+Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
+Generally, they are very similar to the criteria proposed here and follow a similar spirit, but they don't provide any actual research papers applying all those points, or a proof of concept.
+The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, some of which are shared with the criteria here (for example automatic validation and, like Maneage, having a template for new users),
+but they don't include completeness or attention to longevity as mentioned above (Popper itself is written in Python with many dependencies, and its core operating language has already changed once).
+For more on Popper, please see Section \ref{appendix:popper}.
+For improved reproducibility of Jupyter notebooks, \citeappendix{rule19} propose ten rules and also provide links to example implementations.
+These can be very useful for users of Jupyter, but are not generic to any computational project.
+Some of their rules (which are indeed very good in a more general context) don't directly relate to reproducibility, for example their Rule 1: ``Tell a Story for an Audience''.
+Generally, as reviewed in Sections \ref{sec:longevityofexisting} and \ref{appendix:jupyter}, Jupyter itself has many issues regarding reproducibility.
+To create Docker images, N\"ust et al. propose ``ten simple rules'' in \citeappendix{nust20}.
+They do recommend some practices that can indeed help increase the quality of Docker images and their production/usage; for example, their Rule 7 to ``mount datasets at run time'', separating the computational environment from the data.
+However, as before, the long-term reproducibility of the images is not a concern; for example, they recommend identifying base operating systems only with a version tag like \inlinecode{ubuntu:18.04}, which was clearly shown to have longevity issues in Section \ref{sec:longevityofexisting}.
+Furthermore, in their proof of concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic (as shown in Section \ref{appendix:containers}).
\subsection{Reproducible Electronic Documents, RED (1992)}
\label{appendix:red}
@@ -1242,8 +1358,7 @@ Since XML is a plane text format, as the user inspects the data and makes change
.
However, even though XML is in plain text, it is very hard to edit manually.
VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:analysisworkflow}).
-Besides the fact that it is no longer maintained, the conceptual differences with the proposed template are substantial.
-The most important is that VisTrails doesn't control the software that is run, it only controls the sequence of steps that they are run in.
+Besides the fact that it is no longer maintained, VisTrails didn't control the software that is run; it only controlled the sequence of steps in which it was run.
@@ -1383,8 +1498,6 @@ In Maneage, instead of artificial/commented tags directly link the analysis inpu
-
-
\subsection{Sumatra (2012)}
Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}} \citeappendix{davison12} attempts to capture the environment information of a running project.
It is written in Python and is a command-line wrapper over the analysis script.
@@ -1420,14 +1533,39 @@ The important thing is that the research object concept is not specific to any s
Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it's possible to run it readily on other systems that have a similar CPU architecture.
+Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020).
+Therefore Sciunit2 is a new implementation in Python 3.
-In our tests, Sciunit installed successfully, however we couldn't run it because of a dependency problem with the \inlinecode{tempfile} package (in the standard Python library).
-Sciunit is written in Python 2 (which reached its end-of-life in January 1st, 2020) and its last Git commit in its main branch is from June 2018 (+1.5 years ago).
-Recent activity in a \inlinecode{python3} branch shows that others are attempting to translate the code into Python 3 (the main author has graduated and apparently not working on Sciunit anymore).
-
-Because we weren't able to run it, the following discussion will just be theoretical.
The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built.
-This is a major problem for scientific projects, in principle (not knowing how they programs were built) and practice (archiving a large volume sciunit for every step of an analysis requires a lot of space).
+This is a major problem for scientific projects, both in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of an analysis requires a lot of storage space).
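+For example, a session with its command-line interface looks approximately as follows (a sketch based on its documentation at the time of writing; the exact sub-commands and identifiers are assumptions that may differ between versions):
+\begin{verbatim}
+# Create a new, empty sciunit for the project.
+sciunit create my-analysis
+# Run the analysis while Sciunit captures the
+# executables and libraries that it uses.
+sciunit exec python analysis.py
+# Repeat a captured execution ('e1' is assumed to be
+# the identifier of the first captured execution).
+sciunit repeat e1
+\end{verbatim}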
+
+
+
+
+
+\subsection{Umbrella (2015)}
+Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
+The user specifies the necessary operating system, packages, and analysis steps in various JSON files.
+Umbrella will then study the host operating system and the various necessary inputs (including data and software), through a process similar to Sciunit mentioned above, to find the best environment isolator (for example, Linux containerization or virtual machines).
+We couldn't find a URL for Umbrella's source code (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now deprecated).
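+Based only on the published descriptions, such a specification might look like the following hypothetical JSON sketch (all key names and values here are illustrative, since we could not inspect Umbrella's actual schema):
+\begin{verbatim}
+{
+  "os":       {"name": "redhat", "version": "6.5"},
+  "software": {"python-2.6": {"action": "unpack"}},
+  "data":     {"input.dat": {"action": "none"}},
+  "cmd":      "python analysis.py input.dat"
+}
+\end{verbatim}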
+
+
+
+
+
+\subsection{ReproZip (2016)}
+ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package designed to automatically track and pack all the necessary data files, libraries, and environment variables of a project into a single bundle.
+The tracking is done at the kernel system-call level, so any file that is accessed while the project runs is identified.
+The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked on another system.
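+In practice this is a trace-then-pack workflow, approximately like the following (a minimal sketch; the file names are illustrative):
+\begin{verbatim}
+# Run the project under ReproZip's system-call tracer.
+reprozip trace ./analysis.sh
+# Pack every file that was accessed into one bundle.
+reprozip pack analysis.rpz
+# On another machine: unpack the bundle into a
+# directory and re-run it there.
+reprounzip directory setup analysis.rpz ./analysis
+reprounzip directory run ./analysis
+\end{verbatim}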
+
+ReproZip is therefore very good for taking a ``snapshot'' of the running environment and storing it in a single file.
+The bundle can become very large when large (or many) datasets are used, or when the software environment is complex (many dependencies).
+Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
+Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way of knowing how that software was built.
+As mentioned in this paper, and also in \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries is not necessarily useful.
+
+For the data, it is similarly not possible to extract which data server it came from.
+Hence two projects that each use the same 1-terabyte dataset will each need a full copy of that file in their bundle, making long-term preservation extremely expensive.
@@ -1459,7 +1597,6 @@ However, there is one directory which can be used to store files that must not b
-
\subsection{Popper (2017)}
\label{appendix:popper}
Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}.
@@ -1471,23 +1608,52 @@ This is an important issue when low-level choices are based on service providers
To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
However, as of this writing, the scaffold isn't complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
-By default Popper runs in a Docker image (so root permissions are necessary), but Singularity is also supported.
+By default Popper runs in a Docker image (so root permissions are necessary, and the reproducibility issues with Docker images discussed above also apply), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
+Ignoring its failure to comply with the completeness, minimal complexity, and linking-analysis-with-narrative criteria, the scaffold provided by Popper is an output of the program that is not directly under version control.
+Hence tracking future changes in Popper, and how they relate to the high-level projects that depend on it, will be very hard.
+In Maneage, the same \inlinecode{maneage} Git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges its branch with Maneage.
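+For example, a project that started from Maneage can typically pull in recent Maneage improvements with standard Git operations (a minimal sketch; the remote URL is shown for illustration):
+\begin{verbatim}
+# Register the core Maneage repository as a remote
+# (URL shown for illustration).
+git remote add maneage https://git.maneage.org/project.git
+
+# Fetch its history and merge the latest Maneage
+# commits into the project's own branch.
+git fetch maneage
+git merge maneage/maneage
+\end{verbatim}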
-
-
-\subsection{Whole Tale (2019)}
+\subsection{Whole Tale (2017)}
\label{appendix:wholetale}
-Whole Tale (\url{https://wholetale.org}) is a web-based platform for managing a project and organizing data provenance, see \citeappendix{brinckman19}
+Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance; see \citeappendix{brinckman17}.
It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).
The web-based nature of Whole Tale's approach, and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
-This has been reported to the authors as issue 113\footnote{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}, but as all the second-order dependencies evolve, its not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
-Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems: 1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands. 2) It is not possible to see how the environment was built accurately (when the Dockerfile uses \inlinecode{apt}), for more on this, please see Appendix \ref{appendix:packagemanagement}.
+This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
+Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
+1) It requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands.
+2) It is not possible to see accurately how the environment was built (when the Dockerfile uses \inlinecode{apt}; see the sketch below).
+This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in \citeappendix{oliveira18}; for more on this, please see Appendix \ref{appendix:packagemanagement}.
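+For instance, installing the environment through the distribution's package manager records only the command, not the package versions it actually resolved to on the day the image was built (a minimal sketch; the package names are illustrative):
+\begin{verbatim}
+# The same command, run months apart, can install
+# different versions: the Dockerfile alone does not
+# record what was actually installed.
+apt-get update && apt-get install -y python3 python3-numpy
+\end{verbatim}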
+
+
+
+
+
+\subsection{Occam (2018)}
+Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution.
+To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
+Apart from Nix and Guix (which are primarily package managers that can also do job management), Occam is the only solution in our survey here that attempts to be complete in this aspect.
+
+However, it is incomplete from the perspective of our criteria: it works within a Docker image (which requires root permissions) and currently only runs on Debian-based, Red Hat-based and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pacman} package managers.
+It is also itself written in Python (version 3.4 or above), so it is not clear how long it will remain usable as Python itself evolves (recall the Python 2 issues discussed above).
+
+Furthermore, it also violates our minimal-complexity criterion because the instructions to build the software, their versions, and so on, are not immediately viewable or modifiable by the user.
+Occam contains its own JSON database for this, which must be parsed by its own custom program.
+The analysis phase of Occam is also done through a web-based, drag-and-drop graphical user interface (similar to Taverna, Appendix \ref{appendix:taverna}).
+All the connections between the various phases of an analysis need to be pre-defined in a JSON file and manually linked in the GUI.
+Hence, for complex data analysis operations that involve thousands of steps, it is not scalable.
+
+
+
+
+
+
+