author    Mohammad Akhlaghi <mohammad@akhlaghi.org>  2021-04-09 17:13:54 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>  2021-04-09 17:31:39 +0100
commit    925091ef4efebd03ca519d780cf5a0c2ce9df18b (patch)
tree      3b37462f063e74f7b3f0a42853fbe07bfd55777c
parent    4ccf01420f36c705302fdcde9a94c72676256665 (diff)
Implemented EiC (Lorena Barba) comments, and added final review
The email notice of the final acceptance of this paper in CiSE has been included in the project, and the stylistic points raised by the editor in chief (EiC) have also been implemented. The most important points were:
 - Including citations within the text structure (as if they were footnotes), so phrases like "see \cite{...}" had to be changed.
 - Printing hyperlinks as footnotes (because the journal is actually printed).
Also, to avoid the second listing breaking between pages, it has been moved to after the next paragraph.
-rw-r--r--  paper.tex                                |  96
-rw-r--r--  tex/src/appendix-existing-solutions.tex  | 108
-rw-r--r--  tex/src/appendix-existing-tools.tex      |  73
3 files changed, 138 insertions, 139 deletions
diff --git a/paper.tex b/paper.tex
index 53caf4b..db8f3e3 100644
--- a/paper.tex
+++ b/paper.tex
@@ -74,9 +74,9 @@
%% AIM
A set of criteria is introduced to address this problem:
%% METHOD
- Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+ Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free and open source software.
%% RESULTS
- As a proof of concept, we introduce ``Maneage'' (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that been tested in several research publications.
+ As a proof of concept, we introduce ``Maneage'' (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications.
%% CONCLUSION
We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders.
@@ -126,7 +126,7 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
% by the rest of the first word in caps.
%\IEEEPARstart{F}{irst} word
-Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
+Reproducible research has been discussed in the sciences for at least 30 years\cite{claerbout1992, fineberg19}.
Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day,
starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
@@ -141,7 +141,7 @@ Decades later, scientists are still held accountable for their results and there
\section{Longevity of existing tools}
\label{sec:longevityofexisting}
-Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
+Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis''\cite{fineberg19}.
Longevity is defined as the length of time that a project remains \emph{functional} after its creation.
Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary).
Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data.
@@ -154,21 +154,21 @@ A basic review of the longevity of commonly used tools is provided here (for a m
\fi%
).
-To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019).
+To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019).
However, containers (e.g., Docker or Singularity) are currently the most widely-used solution.
We will focus on Docker here because it is currently the most common.
It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.
-Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
+Usually images are imported with operating system (OS) names; e.g., Mesnard \& Barba\cite{mesnard20} use `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
Hence, if the image is built in different months, it will contain different OS components.
-In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early), the image will not be available at the expected URL.
+In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8 which will terminate 8 years early\footnote{\inlinecode{\url{https://blog.centos.org/2020/12/future-is-centos-stream}}}), the image will not be available at the expected URL.
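To make the distinction above concrete, the following is a minimal sketch of pulling an image by mutable tag versus immutable digest (the digest value is a placeholder, not a real checksum):
\begin{lstlisting}
# A tag re-resolves over time:
$ docker pull ubuntu:16.04
# A digest identifies one exact image build (placeholder value):
$ docker pull ubuntu@sha256:<placeholder-digest>
\end{lstlisting}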
-Generally, pre-built binary files (like Docker images) are large and expensive to maintain and archive.
-Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (more than 6 months) will be deleted in free accounts from mid 2021.
+Generally, pre-built binary files (like Docker images) are large and expensive to maintain, distribute and archive.
+Because of this, in October 2020 Docker Hub (where many workflows are archived) announced\footnote{\inlinecode{\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{https://www.docker.com/blog/docker-hub-image-retention}\\\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{-policy-delayed-and-subscription-updates}}} a new consumption-based payment model.
Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel.
-Hence older Docker images may not be executable (their longevity is determined by the host kernel, typically a decade).
+Hence older Docker images may not be executable: their longevity is determined by OS kernels, typically a decade.
Once the host OS is ready, package managers (PMs) are used to install the software or environment.
Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
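As an illustrative sketch of these two layers (the package names and version below are arbitrary examples, not taken from any specific workflow):
\begin{lstlisting}
# OS-level package manager: system tools and libraries.
$ sudo apt install build-essential
# Generic higher-level package manager on top, with a
# pinned version (package/version are arbitrary examples):
$ pip install 'matplotlib==3.3.4'
\end{lstlisting}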
@@ -188,8 +188,8 @@ Designing a robust project needs to be encouraged and facilitated because scient
This includes automatic verification, which is possible in many solutions, but is rarely practiced.
Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
-Finally, to blend narrative and analysis, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
-However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+Finally, to blend narrative and analysis, computational notebooks\cite{rule18}, such as Jupyter, are currently gaining popularity.
+However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of Alliez et al.\cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
@@ -197,7 +197,7 @@ By default, cell dependencies are not included (e.g., automatically running some
There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these.
However, since they are not part of the core, a shorter longevity can be assumed.
The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial.
-Notebooks can therefore rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
+Notebooks can therefore rarely deliver their promised potential\cite{rule18} and may even hamper reproducibility\cite{pimentel19}.
@@ -205,8 +205,8 @@ Notebooks can therefore rarely deliver their promised potential \cite{rule18} an
\section{Proposed criteria for longevity}
\label{criteria}
-The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it just before publication \cite{austin17,fineberg19}.
-In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles\footnote{FAIR originally targeted data. Work is ongoing to adopt it for software through initiatives like FAIR4RS (FAIR for Research Software).}).
+The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is more effective, for researchers and the community, than imposing it just before publication\cite{austin17,fineberg19}.
+In this context, researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles\footnote{FAIR originally targeted data. Work is ongoing to adopt it for software through initiatives like FAIR4RS (FAIR for Research Software).}).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable.
We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers.
@@ -259,14 +259,14 @@ The derivation ``history'' of a result is thus not any the less valuable as itse
\textbf{Criterion 7: Including narrative that is linked to analysis.}
A project is not just its computational analysis.
A raw plot, figure, or table is hardly meaningful alone, even when accompanied by the code that generated it.
-A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, interpretations of the result, and the context in relation to other projects/papers.
+A narrative description is also a deliverable (defined as ``data article''\cite{austin17}): describing the purpose of the computations, interpretations of the result, and the context in relation to other projects/papers.
This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
-\textbf{Criterion 8: Free and open-source software:}
-Non-free or non-open-source software typically cannot be distributed, inspected, or modified by others.
-They are reliant on a single supplier (even without payments) and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}.
-A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, distribute, build upon (modify), and publish their modified versions.
-When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
+\textbf{Criterion 8: Free and open-source software (FOSS):}
+Non-FOSS software typically cannot be distributed, inspected, or modified by others.
+It is thus reliant on a single supplier (even without payments) and prone to \emph{proprietary obsolescence}\footnote{\inlinecode{\url{https://www.gnu.org/proprietary/proprietary-obsolescence.html}}}.
+A project that is \emph{free software} (as formally defined by GNU\footnote{\inlinecode{\url{https://www.gnu.org/philosophy/free-sw.en.html}}}) allows others to run, learn from, distribute, build upon (modify), and publish their modified versions.
+When the software used by the high-level project is also free, the lineage can be traced to the core algorithms, possibly enabling optimizations at that level, and it can be modified for future hardware.
Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data as free formats.
@@ -282,13 +282,13 @@ In such cases, it is best to immediately convert the data to free formats upon c
\section{Proof of concept: Maneage}
-With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
-Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.
-It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
+With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers\cite{akhlaghi19, infante20}.
+Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} (on galaxy evolution).
+It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researchers' perspective.
-It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
+It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \inlinecode{\url{https://maneage.org}}.
It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
-Its primordial implementation was used in \cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
+Its primordial implementation was used in Akhlaghi \& Ichikawa\cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
Technically, the hardest criterion to implement was the first (completeness); in particular restricting execution requirements to only a minimal Unix-like operating system.
One solution we considered was GNU Guix and Guix Workflow Language (GWL).
@@ -298,13 +298,13 @@ Inspired by GWL+Guix, a single job management tool was implemented for both inst
Make is not an analysis language; it is a job manager.
Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C).
Make has been available since 1977; it is still heavily used in almost all components of modern Unix-like OSs and is standardized in POSIX.
-It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software have also already had some exposure to it (most free research software are built with Make).
+It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000}.
+Moreover, researchers using FOSS have already had some exposure to Make (most FOSS are built with Make).
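As a minimal sketch of this role (the file names and command below are hypothetical, not from Maneage's actual subMakefiles), a Make rule only declares a target, its prerequisites, and a recipe that may invoke any language:
\begin{lstlisting}
# Hypothetical rule: 'result.txt' is rebuilt only when a
# prerequisite changes; the recipe can call any language.
result.txt: input.dat analyze.py
	python analyze.py input.dat > result.txt
\end{lstlisting}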
Linking the analysis and narrative (criterion 7) was historically our first design element.
To avoid the problems with computational notebooks mentioned above, we adopt a more abstract linkage, providing a more direct and traceable connection.
Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
-For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
+For example, Akhlaghi writes\cite{akhlaghi19} `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is automatically generated after the analysis and expands to the value `\inlinecode{0.25}' upon creation of the PDF.
Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
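For instance, the generated macro file could contain a definition like the following sketch (the \inlinecode{\textbackslash{}newcommand} form is an assumption for illustration; the macro name and its value are the ones quoted above):
\begin{lstlisting}
% Sketch of an automatically generated macro file
% (the \newcommand form is assumed for illustration):
\newcommand{\demosfoptimizedsn}{0.25}
\end{lstlisting}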
@@ -317,32 +317,32 @@ Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation.
These are combined at the end to generate precise software acknowledgment and citation that is shown in the
\ifdefined\separatesupplement%
-\href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices},%
+appendices,%
\else%
-appendices (\ref{appendix:software}),%
+appendices (\ref{appendix:software}), %
\fi%
-for other examples, see \cite{akhlaghi19, infante20}.
+with other examples available\cite{akhlaghi19, infante20}.
Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected and reported in the paper (for this paper, they are reported in the acknowledgments).
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.
-The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of Alliez et al.\cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell, important low-level application programs (e.g., GNU Coreutils) and of course, the high-level science software.
-The source code of all the free software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
-Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.
+The source code of all the FOSS used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+Zenodo promises long-term archival and also provides persistent identifiers for the files, which are sometimes unavailable at a software package's web page.
-On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
-Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
+On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added\footnote{\inlinecode{\url{http://savannah.nongnu.org/task/?15390}}}.
+Currently, {\TeX}Live is also being added\footnote{\inlinecode{\url{http://savannah.nongnu.org/task/?15267}}}, but that is only for building the final PDF, not affecting the analysis or verification.
Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment.
Hence the few hours of initial software building are negligible compared to a project's life span.
To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
-The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has thorough instructions on building in Docker.
+The \inlinecode{README.md}\footnote{\inlinecode{\label{maneageatswh}\href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{https://archive.softwareheritage.org/browse/origin/directory}\\\href{https://archive.softwareheritage.org/browse/origin/directory/?origin\_url=http://git.maneage.org/project.git}{?origin\_url=http://git.maneage.org/project.git}}} file has thorough instructions on building in Docker.
Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage.
For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.
The analysis phase of the project, however, naturally differs from one project to another at a low level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
-This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}.
+This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in Menke et al.\cite{menke20}.
Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
\begin{figure*}[t]
@@ -352,7 +352,7 @@ Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
\end{center}
\vspace{-3mm}
\caption{\label{fig:datalineage}
- Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
+ Left: an enhanced replica of Figure 1C in Menke et al.\cite{menke20}, shown here for demonstrating Maneage.
It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale).
Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
Each colored box is a file in the project and arrows show the operation of various software: linking input file(s) to the output file(s).
@@ -402,7 +402,7 @@ include $(foreach s,$(makesrc), \
reproduce/analysis/make/$(s).mk)
\end{lstlisting}
-Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not available in \cite{infante20}).
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria.
All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
Where exact reproducibility is not possible (for example, due to parallelization), values can be verified by the project authors.
For example see \href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.
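A minimal sketch of such a checksum verification is given below (the file name and the choice of \inlinecode{sha512sum} are assumptions for illustration, not Maneage's exact recipe):
\begin{lstlisting}
# Hypothetical verification step: compare each deliverable's
# checksum with its recorded value; a mismatch aborts.
$ sha512sum --check verified-checksums.txt
\end{lstlisting}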
@@ -431,7 +431,7 @@ To further minimize complexity, the low-level implementation can be further sepa
By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings, or parameters.
Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make.
For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, thereby enabling exact verification before usage.
-To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+To illustrate this, we report that Menke et al.\cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}.
Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without knowing the underlying low-level implementation.
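As a sketch of this convention (variable names and values here are hypothetical; only the file names come from the text above), a configuration file simply assigns Make variables:
\begin{lstlisting}
# Hypothetical contents of demo-year.conf:
menke-demo-year = 2010

# Hypothetical entry in INPUTS.conf: URL and checksum of
# an imported dataset, enabling verification before use.
menke-url = https://example.org/menke20.xlsx
menke-sha256 = <placeholder-checksum>
\end{lstlisting}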
@@ -445,7 +445,7 @@ The analysis can benefit from the powerful and portable job management features
To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
-The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).
+The core Maneage git repository is hosted at \inlinecode{\href{http://git.maneage.org/project.git}{git.maneage.org/project.git}} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).
Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
There is a thoroughly elaborated customization checklist in \inlinecode{README-hacking.md}.
@@ -458,6 +458,9 @@ in (b) readers do it after publication; e.g., the project remains reproducible b
Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.
Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
+Finally, a snapshot of the complete project source is usually $\sim100$ kilobytes.
+It can thus easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source\cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo, and archived in SoftwareHeritage.
+
\begin{lstlisting}[
label=code:branching,
caption={Starting a new project with Maneage, and building it},
@@ -479,11 +482,6 @@ $ git add -u && git commit # Commit changes.
\end{lstlisting}
-Finally, a snapshot of the complete project source is usually $\sim100$ kilo-bytes.
-It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
-
-
-
@@ -558,7 +556,7 @@ Likewise, when a bug is found in one science software, affected projects can be
Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
-From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in \cite{austin17}:
+From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in Austin et al.\cite{austin17}:
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, thereby improving sustainability.
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
(3) Software management: with these criteria, each project comes with its unique and complete software management.
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 8bbbe1c..62c98e6 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -23,7 +23,7 @@
\section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
The problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
-The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
+The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles\citeappendix{wilkinson16}.
In this appendix, \emph{some} of the solutions are reviewed.
We are not just reviewing solutions that can be used today.
The main focus of this paper is longevity; therefore, we also spent considerable time on finding and inspecting solutions that have been aborted, discontinued, or abandoned.
@@ -33,7 +33,7 @@ Otherwise their paper's publication year is used.
For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
-Other studies have also attempted to review existing reproducible solutions, for example, see \citeappendix{konkol20}.
+Other studies have also attempted to review existing reproducible solutions, for example, see Konkol et al.\citeappendix{konkol20}.
We have tried our best to test and read through the documentation of almost all reviewed solutions to a sufficient level.
However, due to time constraints, it is inevitable that we may have missed some aspects of the solutions, or incorrectly interpreted their behavior and outputs.
@@ -43,12 +43,12 @@ In this case, please let us know and we will correct it in the text on the paper
\subsection{Suggested rules, checklists, or criteria}
Before going into the various implementations, it is useful to review some existing suggested rules, checklists, or criteria for computationally reproducible research.
-Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
+Sandve et al.\citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
Generally, these are very similar to the criteria proposed here and follow a similar spirit, but they do not provide any actual research papers following up all those points, nor do they provide a proof of concept.
-The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity: Popper has already changed its core workflow language once and is written in Python with many dependencies that evolve fast, see \ref{appendix:highlevelinworkflow}.
+The Popper convention\citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity: Popper has already changed its core workflow language once and is written in Python with many dependencies that evolve fast, see \ref{appendix:highlevelinworkflow}.
For more on Popper, please see Section \ref{appendix:popper}.
-For improved reproducibility Jupyter notebooks, \citeappendix{rule19} propose ten rules and also provide links to example implementations.
+For improved reproducibility in Jupyter notebooks, Rule et al.\citeappendix{rule19} propose ten rules and also provide links to example implementations.
These can be very useful for users of Jupyter but are not generic for non-Jupyter-based computational projects.
Some criteria (which are indeed very good in a more general context) do not directly relate to reproducibility, for example their Rule 1: ``Tell a Story for an Audience''.
Generally, as reviewed in
@@ -58,7 +58,7 @@ the main body of this paper (section on the longevity of existing tools)%
Section \ref{sec:longevityofexisting}%
\fi
and Section \ref{appendix:jupyter} (below), Jupyter itself has many issues regarding reproducibility.
-To create Docker images, N\"ust et al. propose ``ten simple rules'' in \citeappendix{nust20}.
+To create Docker images, N\"ust et al. propose\citeappendix{nust20} ``ten simple rules''.
They recommend practices that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'' to separate the computational environment from the data.
However, the long-term reproducibility of the images is not included as a criterion by these authors.
For example, they recommend using base operating systems, with version identification limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which has a serious problem with longevity issues
@@ -70,21 +70,21 @@ For example, they recommend using base operating systems, with version identific
Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}).
Previous criteria are thus primarily targeted to immediate reproducibility and do not consider longevity.
-Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).
+Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions), and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in \ref{appendix:independentenvironment} and Oliveira et al.\citeappendix{oliveira18}).
\subsection{Reproducible Electronic Documents, RED (1992)}
\label{appendix:red}
-RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the first attempt that we could find on doing reproducible research, see \cite{claerbout1992,schwab2000}.
+RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the first attempt\cite{claerbout1992,schwab2000} that we could find at doing reproducible research.
It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
Their introduction on the importance of reproducibility resonates strongly with today's environment in computational sciences, in particular the heavy investment one has to make in order to re-do another scientist's work, even within the same team.
-RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.
+RED also influenced other early reproducible works, for example Buckheit \& Donoho\citeappendix{buckheit1995}.
-To orchestrate the various figures/results of a project, from 1990, they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make, for more on Make, see Appendix \ref{appendix:jobmanagement}.
-As described in \cite{schwab2000}, in the latter half of that decade, they moved to GNU Make, which was much more commonly used, better maintained, and came with a complete and up-to-date manual.
+To orchestrate the various figures/results of a project, from 1990, they used ``Cake''\citeappendix{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
+As described in Schwab et al.\cite{schwab2000}, in the latter half of that decade, they moved to GNU Make, which was much more commonly used, better maintained, and came with a complete and up-to-date manual.
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and to orchestrate the steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
@@ -94,7 +94,7 @@ The reader could later select which figures/parts of the project to reproduce by
At the time, Make was already practiced by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistent across projects.
This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs).
However, RED only used the existing software of the host system; it had no means to control it.
-Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing \citeappendix{fomel09}.
+Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing\citeappendix{fomel09}.
Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see Appendix \ref{appendix:madagascar}.
@@ -103,7 +103,7 @@ Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see A
\subsection{Taverna (2003)}
\label{appendix:taverna}
-Taverna\footnote{\inlinecode{\url{https://github.com/taverna}}} \citeappendix{oinn04} was a workflow management system written in Java with a graphical user interface.
+Taverna\footnote{\inlinecode{\url{https://github.com/taverna}}}\citeappendix{oinn04} was a workflow management system written in Java with a graphical user interface.
In 2014 it was sponsored by the Apache Incubator project and called ``Apache Taverna'', but its developers \href{https://lists.apache.org/thread.html/r559e0dd047103414fbf48a6ce1bac2e17e67504c546300f2751c067c\%40\%3Cdev.taverna.apache.org\%3E}{voted} to \emph{retire} it in 2020 because development had come to a standstill (as of April 2021, the latest public GitHub commit was from 2016).
In Taverna, a workflow is defined as a directed graph, where nodes are called ``processors''.
@@ -116,7 +116,7 @@ lineage figure of the main paper).
Figure \ref{fig:datalineage}).
\fi
Taverna is only a workflow manager and is not integrated with a package manager; hence, the versions of the used software can be different in different runs.
-Ref.\/~\citeappendix{zhao12} studied the problem of workflow decays in Taverna.
+Zhao et al.\citeappendix{zhao12} studied the problem of workflow decay in Taverna.
@@ -124,9 +124,9 @@ Ref.\/~\citeappendix{zhao12} studied the problem of workflow decays in Taverna.
\subsection{Madagascar (2003)}
\label{appendix:madagascar}
-Madagascar\footnote{\inlinecode{\url{http://ahay.org}}} \citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in \ref{appendix:scons}).
+Madagascar\footnote{\inlinecode{\url{http://ahay.org}}}\citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in \ref{appendix:scons}).
Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}.
-Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to \citeappendix{fomel13}.
+Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to Fomel et al.\citeappendix{fomel13}.
Madagascar does include project management tools in the form of SCons extensions.
However, it is not just a reproducible project management tool.
@@ -149,17 +149,17 @@ Furthermore, the blending of the workflow component with the low-level analysis
\subsection{GenePattern (2004)}
\label{appendix:genepattern}
-GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is a client-server software containing many common analysis functions/modules, primarily focused for Gene studies.
+GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}}\citeappendix{reich06} (first released in 2004) is client-server software containing many common analysis functions/modules, primarily focused on gene studies.
Although it is highly focused on a specialized research field, it is reviewed here because its concepts/methods are generic.
Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
-The modules are used through a web interface, the modern implementation is GenePattern Notebook \citeappendix{reich17}.
+The modules are used through a web interface; the modern implementation is GenePattern Notebook\citeappendix{reich17}.
It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis.
However, the wrapper modules just call an existing tool on the running system.
Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.
%% GenePattern shutdown announcement (although as of November 2020, it does not open any more): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019
-The primary GenePattern server was active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citeappendix{reich17}.
+The primary GenePattern server was active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week\citeappendix{reich17}.
However, it was shut down on November 15th 2019 due to the end of funding.
All processing with this server has stopped, and any archived data on it has been deleted.
Since GenePattern is free software, there are alternative public servers to use, so hopefully, work on it will continue.
@@ -173,14 +173,14 @@ The data and software may have backups in other places, but the high-level proje
\subsection{Kepler (2005)}
-Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}} \citeappendix{ludascher05} is a Java-based Graphic User Interface workflow management tool.
+Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}}\citeappendix{ludascher05} is a Java-based graphical user interface workflow management tool.
Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to
\ifdefined\separatesupplement
the lineage figure shown in the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
-Each actor is connected to others through Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
+Each actor is connected to others through Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}}\citeappendix{eker03}.
In many aspects, the usage of Kepler and its issues for long-term reproducibility are like those of Taverna (see Section \ref{appendix:taverna}).
@@ -189,7 +189,7 @@ In many aspects, the usage of Kepler and its issues for long-term reproducibilit
\subsection{VisTrails (2005)}
\label{appendix:vistrails}
-VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow managing system.
+VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}}\citeappendix{bavoil05} was a graphical workflow managing system.
According to its web page, VisTrails maintenance has been stopped since May 2016; its last Git commit, as of this writing, was in November 2017.
However, the fact that it was well maintained for over 10 years is an achievement.
@@ -197,7 +197,7 @@ VisTrails (or ``visualization trails'') was initially designed for managing visu
Each analysis step, or module, is recorded in an XML schema, which defines the operations and their dependencies.
The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command).
Since the main goal was visualization (as images), apparently its primary output is in the form of image spreadsheets.
-Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL), for more see \citeappendix{scheidegger08}.
+Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL); for more, see Scheidegger et al.\citeappendix{scheidegger08}.
Since XML is a plain text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
However, even though XML is in plain text, it is very hard to read/edit without the VisTrails software (which is no longer maintained).
@@ -215,7 +215,7 @@ Besides the fact that it is no longer maintained, VisTrails did not control the
\subsection{Galaxy (2010)}
\label{appendix:galaxy}
-Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}.
+Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench\citeappendix{goecks10}.
The main user interface is the ``Galaxy Pages'', which does not require any programming: users graphically manipulate abstract ``tools'' which are wrappers over command-line programs.
Therefore the actual running version of the program can be hard to control across different Galaxy servers.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
@@ -228,7 +228,7 @@ For example the very large cost of maintaining such a system, being based on a g
\subsection{Image Processing On Line journal, IPOL (2010)}
\label{appendix:ipol}
-The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the the full code of the proposed algorithm.
+The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}}\citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed).
The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB/Octave, and recently Python); the code can only have a very limited set of external dependencies (with pre-defined versions), must be commented well enough, and must link each part of it with the relevant part of the paper.
@@ -251,7 +251,7 @@ A paper written in Maneage (the proof-of-concept solution presented in this pape
\subsection{WINGS (2010)}
\label{appendix:wings}
-WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}} \citeappendix{gil10} is an automatic workflow generation algorithm.
+WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}}\citeappendix{gil10} is an automatic workflow generation algorithm.
It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images).
It allows users to define various workflow components (for example datasets, analysis components, etc), with high-level goals.
It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints.
@@ -264,15 +264,15 @@ It then uses selection and rejection algorithms to find the best components usin
\subsection{Active Papers (2011)}
\label{appendix:activepapers}
Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format).
-It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine \citeappendix{hinsen11}.
-However, Java is not a commonly used platform today, hence it was later implemented in Python \citeappendix{hinsen15}.
+It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine\citeappendix{hinsen11}.
+However, Java is not a commonly used platform today, hence it was later implemented in Python\citeappendix{hinsen15}.
Dependence on high-level platforms (Java or Python) is therefore a fundamental issue.
In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file.
%However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.
-As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
+As mentioned by Hinsen\citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
Downloading the pre-built ``HDF View'' binaries (a GUI browser of HDF5 files that is provided by the HDF group) is not possible anonymously/automatically: as of January 2021 login is required\footnote{\inlinecode{\url{https://www.hdfgroup.org/downloads/hdfview}}} (this was not the case when Active Papers moved to HDF5).
% From K. Hinsen in a private email to M. Akhlaghi: This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5.
Installing HDF View using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
@@ -280,7 +280,7 @@ Furthermore, like most high-level tools, the HDF5 library evolves very fast: on
While data and code are indeed fundamentally similar concepts technically\citeappendix{hinsen16}, they are used by humans differently.
The hand-written code of a large project involving terabytes of data can be 100 kilobytes.
-When the two are bundled together in one remote file, merely seeing one line of the code, requires downloading Terabytes volume that is not needed, this was also acknowledged in \citeappendix{hinsen15}.
+When the two are bundled together in one remote file, merely seeing one line of the code requires downloading terabytes of volume that are not needed; this was also acknowledged by Hinsen\citeappendix{hinsen15}.
It may also happen that the data are proprietary (for example medical patient data).
In such cases, the data must not be publicly released, but the methods that were applied to them can.
@@ -294,13 +294,13 @@ This is not a fundamental feature of the approach, but rather an effect of the i
\subsection{Collage Authoring Environment (2011)}
\label{appendix:collage}
-The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
+The Collage Authoring Environment\citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge\citeappendix{gabriel11}.
It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment, which has a web-based graphic user interface.
Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
In their Figure 3, they nicely visualize how the ``Executable Paper'' of Collage operates through two servers and a computing backend.
Unfortunately, no webpage has been provided in the paper to follow up on the work and find its current status.
-A web search also only pointed us to its main paper (\citeappendix{nowakowski11}).
+A web search also only pointed us to its main paper\citeappendix{nowakowski11}.
In the paper they do not discuss the major issue of software versioning and its verification to ensure that future updates to the backend do not affect the result; apparently the system just assumes that the software exists on the ``Computing backend''.
Since we could not access or test it, we rely on the descriptions in the paper: it seems to be very similar to the modern-day Jupyter notebook concept (see Appendix \ref{appendix:jupyter}), which had not yet been created in its current form in 2011.
So we expect similar longevity issues with Collage.
@@ -308,8 +308,8 @@ So we expect similar longevity issues with Collage.
\subsection{SHARE (2011)}
\label{appendix:SHARE}
-SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
-SHARE was recognized as the second position in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
+SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}}\citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
+SHARE was awarded the second prize in the Elsevier Executable Paper Grand Challenge\citeappendix{gabriel11}.
Simply put, SHARE was just a VM library that users could download or connect to, and run.
The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements or standards on making the VM itself reproducible, or enforcing common internals for its supported projects.
As of January 2021, the top SHARE web page still works.
@@ -321,13 +321,13 @@ However, upon selecting any operation, a notice is printed that ``SHARE is offli
\subsection{Verifiable Computational Result, VCR (2011)}
\label{appendix:verifiableidentifier}
-A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc) that is associated with a ``verifiable result identifier'' (VRI), see \citeappendix{gavish11}.
-It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
+A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc.) that is associated with a ``verifiable result identifier'' (VRI)\citeappendix{gavish11}.
+It was awarded the third prize in the Elsevier Executable Paper Grand Challenge\citeappendix{gabriel11}.
A VRI is a hash that is created using tags within the programming source that produced that output, while also recording its version-control history.
This enables the exact identification and citation of results.
The VRIs are automatically generated web URLs that link to public VCR repositories containing the data, inputs, and scripts, which may be re-executed.
-According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available on the webpage in January 2021.
+According to Gavish \& Donoho\citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available on the webpage in January 2021.
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
In effect, this is very similar to what we have done at the end of the caption of
\ifdefined\separatesupplement
@@ -340,7 +340,7 @@ However, instead of a long and hard to read hash, we simply point to the plotted
Unfortunately, most parts of the web page are not complete as of January 2021.
The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} did not exist (again, as of January 2021).
-Finally, the date of the files in the MATLAB extension tarball is set to May 2011, hinting that probably VCR has been abandoned soon after the publication of \citeappendix{gavish11}.
+Finally, the date of the files in the MATLAB extension tarball is set to May 2011, hinting that VCR was probably abandoned soon after the publication of Gavish \& Donoho\citeappendix{gavish11}.
@@ -348,13 +348,13 @@ Finally, the date of the files in the MATLAB extension tarball is set to May 201
\subsection{SOLE (2012)}
\label{appendix:sole}
-SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix{pham12,malik13}.
+SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper\citeappendix{pham12,malik13}.
An SO is any code/content that is wrapped in begin/end tags with an associated type and name.
For example, special commented lines in a Python, R, or C program.
The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI).
-SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}.
+SOLE also supports workflows as Galaxy tools\citeappendix{goecks10}.
-For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
+For reproducibility, Pham et al.\citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
Also, hosting a large virtual machine server with a fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
@@ -366,7 +366,7 @@ In Maneage, instead of using artificial/commented tags, the analysis inputs and
\subsection{Sumatra (2012)}
-Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}} \citeappendix{davison12} attempts to capture the environment information of a running project.
+Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}}\citeappendix{davison12} attempts to capture the environment information of a running project.
It is written in Python and is a command-line wrapper over the analysis script.
By controlling a project at run time, Sumatra is able to capture the environment it was run in.
The captured environment can be viewed in plain text or a web interface.
@@ -385,10 +385,10 @@ It just captures the environment, it does not store \emph{how} that environment
\subsection{Research Object (2013)}
\label{appendix:researchobject}
-The Research object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is collection of meta-data ontologies, to describe aggregation of resources, or workflows, see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
+The Research Object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of metadata ontologies to describe the aggregation of resources or workflows\citeappendix{bechhofer13,belhajjame15}.
It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.
-Ref.\/~\citeappendix{bechhofer13} describes how a workflow in Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
+Bechhofer et al.\citeappendix{bechhofer13} describe how a workflow in Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
Importantly, the research object concept is not specific to any special workflow: it is just a metadata bundle/standard, which is only as robust in reproducing the result as the running workflow.
Therefore, if implemented over a complete workflow like Maneage, it can be very useful in analyzing/optimizing the workflow, finding common components between many Maneage'd workflows, or translating to other complete workflows.
@@ -398,7 +398,7 @@ Therefore if implemented over a complete workflow like Maneage, it can be very u
\subsection{Sciunit (2015)}
\label{appendix:sciunit}
-Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
+Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}}\citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
It automatically parses all the executable files in the script and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020).
@@ -412,17 +412,17 @@ This is a major problem for scientific projects: in principle (not knowing how t
\subsection{Umbrella (2015)}
-Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis.
+Umbrella\citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis.
The user specifies the necessary operating system and the necessary packages for the analysis steps in various JSON files.
Umbrella will then study the host operating system and the various necessary inputs (including data and software), through a process similar to Sciunit mentioned above, to find the best environment isolator (for example, Linux containerization or VMs).
-We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now deprecated).
+We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in Meng \& Thain\citeappendix{meng17}, it is written in Python 2.6 (which is now deprecated).
\subsection{ReproZip (2016)}
-ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables into a single bundle.
+ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}}\citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables into a single bundle.
The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
@@ -430,7 +430,7 @@ ReproZip is therefore very good to take a ``snapshot'' of the running environmen
However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies).
Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way to know how the software was built.
-As mentioned in this paper, and also \citeappendix{oliveira18} the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries cannot necessarily be useful.
+As mentioned in this paper, and also by Oliveira et al.\citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries is not necessarily useful.
For the data, it is similarly not possible to extract which data server they came from.
Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
@@ -445,7 +445,7 @@ Users simply add a set of Binder-recognized configuration files to their reposit
One good feature of Binder is that the imported Docker image must be tagged, although as mentioned in Appendix \ref{appendix:containers}, tags do not ensure reproducibility.
However, it does not ensure that the Dockerfile used by the imported Docker image follows a similar convention.
So users can simply use generic operating system names.
-Binder is used by \citeappendix{jones19}.
+Binder is used by Jones et al.\citeappendix{jones19}.
@@ -470,7 +470,7 @@ However, there is one directory that can be used to store files that must not be
\subsection{Popper (2017)}
\label{appendix:popper}
-Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}.
+Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention\citeappendix{jimenez17}.
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
The \inlinecode{popper} program itself is written in Python.
However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows at that time.
@@ -493,7 +493,7 @@ Hence any future change in the low level features will directly propagated to al
\subsection{Whole Tale (2017)}
\label{appendix:wholetale}
-Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance, see \citeappendix{brinckman17}.
+Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance\citeappendix{brinckman17}.
It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).
The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
@@ -503,7 +503,7 @@ But as all the second-order dependencies evolve, it is not hard to envisage such
Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
1) It requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands.
2) It is not possible to accurately see how the environment was built (when the Dockerfile uses operating system package managers like \inlinecode{apt}).
-This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in \citeappendix{oliveira18}, for more on this, please see Appendix \ref{appendix:packagemanagement}.
+This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned by Oliveira et al.\citeappendix{oliveira18}; for more on this, see Appendix \ref{appendix:packagemanagement}.
@@ -511,7 +511,7 @@ This issue with Whole Tale (and generally all other solutions that only rely on
\subsection{Occam (2018)}
\label{appendix:occam}
-Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution.
+Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}}\citeappendix{oliveira18} is a web-based application to preserve software and its execution.
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey here that attempts to be complete in this aspect.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index a773322..8ad97ef 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -50,9 +50,9 @@ Therefore, a process that is run inside a virtual machine can be much slower tha
An advantage of VMs is that they are a single file that can be copied from one computer to another, keeping the full environment within them if the format is recognized.
VMs are used by cloud service providers, enabling fully independent operating systems on their large servers where the customer can have root access.
-VMs were used in solutions like SHARE \citeappendix{vangorp11} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 \citeappendix{gabriel11}), or in suggested reproducible papers like \citeappendix{dolfi14}.
+VMs were used in solutions like SHARE\citeappendix{vangorp11} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011\citeappendix{gabriel11}), or in some suggested reproducible papers\citeappendix{dolfi14}.
However, due to their very large size, these are expensive to maintain, thus leading SHARE to discontinue its services in 2019.
-The URL to the VM file \texttt{provenance\_machine.ova} that is mentioned in \citeappendix{dolfi14} is also not currently accessible (we suspect that this is due to size and archival costs).
+The URL to the VM file \texttt{provenance\_machine.ova} that is mentioned in Dolfi et al.\citeappendix{dolfi14} is also not currently accessible (we suspect that this is due to size and archival costs).
\subsubsection{Containers}
\label{appendix:containers}
@@ -74,7 +74,7 @@ We review some of the most common container solutions: Docker, Singularity, and
An important drawback of Docker for high-performance scientific needs is that it runs as a daemon (a program that is always running in the background) with root permissions.
This is a major security flaw that discourages many high-performance computing (HPC) facilities from providing it.
-\item {\bf\small Singularity:} Singularity \citeappendix{kurtzer17} is a single-image container (unlike Docker, which is composed of modular/independent images).
+\item {\bf\small Singularity:} Singularity\citeappendix{kurtzer17} is a single-image container (unlike Docker, which is composed of modular/independent images).
Although it needs root permissions to be installed on the system (once), it does not require root permissions every time it is run.
Its main program is also not a daemon, but a normal program that can be stopped.
These features make it much safer for HPC administrators to install compared to Docker.
@@ -87,20 +87,20 @@ We review some of the most common container solutions: Docker, Singularity, and
Generally, VMs or containers are good solutions for reproducibly running/repeating an analysis in the short term (a couple of years).
However, their focus is to store the already-built (binary, non-human readable) software environment.
Because of this, they will be large (many Gigabytes) and expensive to archive, download, or access.
-Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}. But this is also valid for Docker images, as is clear from Dockerhub's recent decision to delete images of free accounts that have not been used for more than 6 months.
-Meng \& Thain \citeappendix{meng17} also give similar reasons on why Docker images were not suitable in their trials.
+Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}. But this is also valid for Docker images, as is clear from Dockerhub's recent decision to move to a new consumption-based payment model.
+Meng \& Thain\citeappendix{meng17} also give similar reasons why Docker images were not suitable in their trials.
On a more fundamental level, VMs or containers do not store \emph{how} the core environment was built.
This information is usually in a third-party repository, and not necessarily inside the container or VM file, making it hard (if not impossible) to track for future users.
-This is a major problem in relation to the proposed completeness criteria and is also highlighted as an issue in terms of long term reproducibility by \citeappendix{oliveira18}.
+This is a major problem in relation to the proposed completeness criteria, and is also highlighted as an issue in terms of long-term reproducibility by Oliveira et al.\citeappendix{oliveira18}.
-The example of \inlinecode{Dockerfile} of \cite{mesnard20} was previously mentioned in
+The example \inlinecode{Dockerfile} of Mesnard \& Barba\cite{mesnard20} was previously mentioned in
\ifdefined\separatesupplement
the main body of this paper, when discussing the criteria.
\else
Section \ref{criteria}.
\fi
-Another useful example is the \href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{\inlinecode{Dockerfile}} of \citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
+Another useful example is the \inlinecode{Dockerfile}\footnote{\inlinecode{\href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{https://github.com/benmarwick/1989-excavation-report-}\\\href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{Madjedbebe/blob/master/Dockerfile}}} of Clarkson et al.\citeappendix{clarkso15} (published in June 2015), which starts with \inlinecode{FROM rocker/verse:3.3.2}.
When we tried to build it (November 2020), we noticed that the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
In principle, it is possible to investigate the difference between this new image and the old one that the authors used, but that would require a lot of effort and may not be possible when the changes are not available in a third-party public repository or not under version control.
In Docker, it is possible to retrieve the precise Docker image with its digest, for example, \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we have not seen this often done in existing examples of ``reproducible'' \inlinecode{Dockerfiles}.
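+For illustration, a hypothetical \inlinecode{Dockerfile} following this convention could start as below, where the digest value is a placeholder to be replaced by the real value (reported, for example, by \inlinecode{docker images --digests}):
+\begin{lstlisting}
+  # Hypothetical sketch: pin the base image by its immutable
+  # digest, not just its mutable tag.
+  FROM ubuntu:16.04@sha256:XXXXXXX
+  RUN apt-get update && apt-get install -y make
+\end{lstlisting}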
@@ -111,7 +111,7 @@ ISO files are pre-built binary files with volumes of hundreds of megabytes and n
For example, the archives of Debian\footnote{\inlinecode{\url{https://cdimage.debian.org/mirror/cdimage/archive/}}} or Ubuntu\footnote{\inlinecode{\url{http://old-releases.ubuntu.com/releases}}} provide older ISO files.
The concept of containers (and the independent images that build them) can also be extended beyond just the software environment.
-For example, \citeappendix{lofstead19} propose a ``data pallet'' concept to containerize access to data and thus allow tracing data back to the application that produced them.
+For example, Lofstead et al.\citeappendix{lofstead19} propose a ``data pallet'' concept to containerize access to data and thus allow tracing data back to the application that produced them.
In summary, containers or VMs are themselves just a built product.
If they are built properly (for example, building a Maneage'd project inside a Docker container), they can be useful for immediate usage and for quickly moving the project from one system to another.
@@ -145,7 +145,7 @@ Both are discussed in more detail below.
Package managers are the second component in any workflow that relies on containers or VMs for an independent environment, and the starting point in others that use the host's file system (as discussed above in Section \ref{appendix:independentenvironment}).
In this section, some common package managers are reviewed, in particular those that are most used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
-For a more comprehensive list of existing package managers, see \href{https://en.wikipedia.org/wiki/List_of_software_package_management_systems}{Wikipedia}.
+For a more comprehensive list of existing package managers, see Wikipedia\footnote{\inlinecode{\href{https://en.wikipedia.org/wiki/List\_of\_software\_package\_management\_systems}{https://en.wikipedia.org/wiki/List\_of\_software\_package\_}\\\href{https://en.wikipedia.org/wiki/List\_of\_software\_package\_management\_systems}{management\_systems}}}.
Note that we are not including package managers that are specific to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX).
\subsubsection{Operating system's package manager}
@@ -163,7 +163,7 @@ Requesting a special version of that special software does not fully address the
Hence a fixed version of the dependencies must also be specified.
In robust package managers like Debian's \inlinecode{apt}, it is possible to fully control (and later reproduce) the built environment of a high-level software package.
-Debian also archives all packaged high-level software in its Snapshot\footnote{\inlinecode{\url{https://snapshot.debian.org/}}} service since 2005 which can be used to build the higher-level software environment on an older OS \citeappendix{aissi20}.
+Debian has also archived all its packaged high-level software in its Snapshot\footnote{\inlinecode{\url{https://snapshot.debian.org/}}} service since 2005, which can be used to build the higher-level software environment on an older OS\citeappendix{aissi20}.
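+For example, a hypothetical \inlinecode{sources.list} entry (with an arbitrary timestamp, to be adapted to the desired date) could point \inlinecode{apt} to the archived state of the Debian repositories as below:
+\begin{lstlisting}
+  # Hypothetical /etc/apt/sources.list entry: fetch packages
+  # as archived at the given timestamp (YYYYMMDDTHHMMSSZ).
+  deb https://snapshot.debian.org/archive/debian/20200101T000000Z/ buster main
+\end{lstlisting}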
Therefore it is indeed theoretically possible to reproduce the software environment only using archived operating systems and their own package managers, but unfortunately, we have not seen it practiced in (reproducible) scientific papers/projects.
In summary, the host OS package managers are primarily meant for the low-level operating system components.
@@ -181,17 +181,17 @@ They can therefore only be run on GNU/Linux operating systems.
\subsubsection{Nix or GNU Guix}
\label{appendix:nixguix}
-Nix\footnote{\inlinecode{\url{https://nixos.org}}} \citeappendix{dolstra04} and GNU Guix\footnote{\inlinecode{\url{https://guix.gnu.org}}} \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
+Nix\footnote{\inlinecode{\url{https://nixos.org}}}\citeappendix{dolstra04} and GNU Guix\footnote{\inlinecode{\url{https://guix.gnu.org}}}\citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
GNU Guix is based on the same principles as Nix but implemented differently, so we focus the review here on Nix.
-The Nix approach to package management is unique in that it allows exact dependency tracking of all the dependencies, and allows for multiple versions of software, for more details see \citeappendix{dolstra04}.
+The Nix approach to package management is unique in that it allows exact tracking of all the dependencies, and allows for multiple versions of software; for more details, see Dolstra et al.\citeappendix{dolstra04}.
In summary, a unique hash is created from all the components that go into the building of the package (including the instructions on how to build the software).
That hash is then prefixed to the software's installation directory.
-As an example from \citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that is compiled with it (and thus need it to run) will link to this unique address.
+As an example from Dolstra et al.\citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that is compiled with it (and thus needs it to run) will link to this unique address.
This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
-As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges (failing our completeness criteria).
+As mentioned by Court{\'e}s \& Wurmus\citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges (failing our completeness criteria).
This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries, which is necessary from the security or software development perspectives.
However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bit-wise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary or for a different CPU architecture).
@@ -213,10 +213,10 @@ Conda is able to maintain an approximately independent environment on an operati
Conda tracks the dependencies of a package/environment through a YAML-formatted file, where the necessary software and their acceptable versions are listed.
However, it is not possible to fix the versions of the dependencies through the YAML files alone.
This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\inlinecode{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}}.
-In that discussion, the authors of \citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
-The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citeappendix{gruning18}.
+In that GitHub discussion, Uhse et al.\citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
+The main reply they got in the discussion is to build the Conda environment in a container, which is also the solution suggested by Gr\"uning et al.\citeappendix{gruning18}.
However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., get re-built) with the project.
-Given these limitations, \citeappendix{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
+Given these limitations, Uhse et al.\citeappendix{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
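+As a minimal (hypothetical) sketch of the issue discussed above, consider a YAML environment file like the following: the top-level packages can be pinned, but the versions of their own dependencies are resolved at build time and are not recorded in this file:
+\begin{lstlisting}
+  # Hypothetical environment.yml: top-level packages are
+  # pinned, but their dependencies are resolved (and can
+  # therefore change) each time the environment is built.
+  name: paper-env
+  channels:
+    - conda-forge
+  dependencies:
+    - python=3.6
+    - numpy=1.15
+\end{lstlisting}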
Conda is installed with a shell script that contains an embedded binary blob of over 500 megabytes.
This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
@@ -252,7 +252,7 @@ As reviewed above, the low-level dependence of Conda on the host operating syste
However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability, or usability are important. % alternative to `archivability`?
\subsubsection{Spack}
-Spack is a package manager that is also influenced by Nix (similar to GNU Guix), see \citeappendix{gamblin15}.
+Spack\citeappendix{gamblin15} is a package manager that is also influenced by Nix (similar to GNU Guix).
But unlike Nix or GNU Guix, it does not aim for full, bit-wise reproducibility and can be built without root access in any generic location.
It relies on the host operating system for the C library.
@@ -302,7 +302,7 @@ In this way, later processing stages can make sure that they can safely be used,
Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
There are many existing version control solutions, for example, CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
-However, currently, Git is by far the most commonly used in individual projects, such that Software Heritage \citeappendix{dicosmo18} (an archival system aiming for long term preservation of software) is also modeled on Git.
+However, currently, Git is by far the most commonly used in individual projects, such that Software Heritage\citeappendix{dicosmo18} (an archival system aiming for long-term preservation of software) is also modeled on Git.
Git is also the foundation upon which this paper's proof of concept (Maneage) is built.
Hence we will just review Git here, but the general concept of version control is the same in all implementations.
@@ -316,7 +316,7 @@ the figure on Git in the main body of the paper).
Figure \ref{fig:branching}).
\fi
For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
-Through the content-based storage concept, similar hash structures can be used to identify data \citeappendix{hinsen20}.
+Through the content-based storage concept, similar hash structures can be used to identify data\citeappendix{hinsen20}.
Git commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
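+For example (a sketch with hypothetical branch and commit names), a team member can work on a parallel branch and later integrate it into the main history:
+\begin{lstlisting}
+  # Sketch with hypothetical branch/commit names.
+  git checkout -b new-figure     # Start a parallel branch.
+  git commit -a -m "New figure"  # Commit changes on it.
+  git checkout master            # Return to the main branch.
+  git merge new-figure           # Integrate the parallel work.
+\end{lstlisting}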
@@ -365,14 +365,14 @@ While it is not impossible, because of the high-level nature of scripts, it is n
\subsubsection{Make}
\label{appendix:make}
-Make was originally designed to address the problems mentioned above for scripts \citeappendix{feldman79}.
+Make was originally designed to address the problems mentioned above for scripts\citeappendix{feldman79}.
In particular, it was designed in the context of managing the compilation of software source code that is distributed in many files.
With Make, the source files of a program that have not been changed are not recompiled.
Moreover, when two source files do not depend on each other, and both need to be rebuilt, they can be built in parallel.
This was found to greatly help in debugging software projects, and in speeding up test builds, giving Make a core place in software development over the last 40 years.
The most common implementation of Make, since the early 1990s, is GNU Make.
-Make was also the framework used in the first attempts at reproducible scientific papers \cite{claerbout1992,schwab2000}.
+Make was also the framework used in the first attempts at reproducible scientific papers\cite{claerbout1992,schwab2000}.
Our proof-of-concept (Maneage) also uses Make to organize its workflow.
Here, we complement that section with more technical details on Make.
@@ -391,7 +391,7 @@ Going deeper into the syntax of Make is beyond the scope of this paper, but we r
\subsubsection{Snakemake}
\label{appendix:snakemake}
Snakemake is a Python-based workflow management system, inspired by GNU Make (discussed above).
-It is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
+It is aimed at reproducible and scalable data analysis\citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
It defines its own language to implement the ``rule'' concept of Make within Python.
Technically, using complex shell scripts (to call software in other languages) in each step will involve a lot of quoting that makes the code hard to read and maintain.
It is therefore most useful for Python-based projects.
@@ -424,10 +424,10 @@ The former will conflict with other system tools that assume \inlinecode{python}
This can also be problematic when a Python analysis library may require a Python version that conflicts with the running SCons.
\subsubsection{CGAT-core}
-CGAT-Core is a Python package for managing workflows, see \citeappendix{cribbs19}.
+CGAT-Core\citeappendix{cribbs19} is a Python package for managing workflows.
It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
-It is used in papers like \citeappendix{jones19}.
-However, as mentioned in \citeappendix{jones19} it is good for managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
+It is used in papers like Jones et al.\citeappendix{jones19}.
+However, as mentioned there, it is primarily good for managing individual outputs (for example, separate figures/tables in the paper, when they are fully created within Python).
Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial.
Another drawback of this workflow manager is that Python is a very high-level language, and future versions of the language may no longer be compatible with Python 3, in which CGAT-core is implemented (similar to how Python 2 programs are not compatible with Python 3).
@@ -438,7 +438,7 @@ Hence in the GWL paradigm, software installation and usage does not have to be s
GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.
\subsubsection{Nextflow (2013)}
-Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}} \citeappendix{tommaso17} workflow language with a command-line interface that is written in Java.
+Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}} is a workflow language\citeappendix{tommaso17} with a command-line interface that is written in Java.
\subsubsection{Generic workflow specifications (CWL and WDL)}
\label{appendix:genericworkflows}
@@ -455,7 +455,7 @@ Each software/tool/paradigm has its own learning curve, which is not easy for a
Most workflow management tools, and the reproducible workflow solutions that depend on them, are yet another language/paradigm that has to be mastered by researchers, and are thus a heavy burden.
Furthermore, as shown above (and below), high-level tools evolve very fast, causing disruptions in the reproducible framework.
-A good example is Popper \citeappendix{jimenez17} which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
+A good example is Popper\citeappendix{jimenez17}, which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language, see Appendix \ref{appendix:popper} for more on Popper.
@@ -492,7 +492,7 @@ In summary, IDEs are generally very specialized tools, for special projects and
\subsubsection{Jupyter}
\label{appendix:jupyter}
-Jupyter (initially IPython) \citeappendix{kluyver16} is an implementation of Literate Programming \citeappendix{knuth84}.
+Jupyter\citeappendix{kluyver16} (initially IPython) is an implementation of Literate Programming\citeappendix{knuth84}.
Jupyter's name is a combination of the three main languages it was designed for: Julia, Python, and R.
The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
Jupyter uses the custom-built \inlinecode{.ipynb} format\footnote{\inlinecode{\url{https://nbformat.readthedocs.io/en/latest}}}.
@@ -514,7 +514,7 @@ Both are critical for scientific processing, especially the latter: when a web b
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into a much less memory-efficient textual encoding.
Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies.
-\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
+Hinsen\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
However, the dependencies above are only on the server side.
Since Jupyter is a web-based system, it also requires many dependencies on the viewing/running browser (for example special JavaScript or HTML5 features, which evolve very fast).
As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility.
@@ -527,7 +527,7 @@ In summary, Jupyter is most useful in manual, interactive, and graphical operati
\subsection{Project management in high-level languages}
\label{appendix:highlevelinworkflow}
Currently, the most popular high-level data analysis language is Python.
-R is closely tracking it and has superseded Python in some fields, while Julia \citeappendix{bezanson17} is quickly gaining ground.
+R is closely tracking it and has superseded Python in some fields, while Julia\citeappendix{bezanson17} is quickly gaining ground.
These languages have themselves superseded the previously popular data analysis languages of past decades, for example Java, Perl, or C++.
All are part of the C-family programming languages.
In many cases, this means that the languages' execution environments are themselves written in C, which is the language of modern operating systems.
@@ -542,7 +542,7 @@ Because of their nature, higher-level languages evolve very fast, creating incom
The most prominent example is the transition from Python 2 (released in 2000) to Python 3 (released in 2008).
Python 3 was incompatible with Python 2, and it was decided to abandon the latter by 2015.
However, due to community pressure, this was delayed to January 1st, 2020.
-The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated, for example, see \citeappendix{jenness17} or Appendix \ref{appendix:sciunit}.
+The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated; for example, see Jenness\citeappendix{jenness17} or Appendix \ref{appendix:sciunit}.
Some projects could not make this investment and their developers decided to stop maintaining them, for example VisTrails (see Appendix \ref{appendix:vistrails}).
The problems were not just limited to translation.
@@ -565,10 +565,11 @@ The evolution of high-level languages is extremely fast, even within one version
For example, packages that are written in Python 3 often only work with a special interval of Python 3 versions.
For instance, Snakemake and Occam can only be run on Python versions 3.4 and 3.5 or newer, respectively (see Appendices \ref{appendix:snakemake} and \ref{appendix:occam}).
This is not just limited to the core language; much faster changes occur in their higher-level libraries.
-For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citeappendix{hinsen15}.
+For example, version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users\citeappendix{hinsen15}.
On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
-For example, see Figure 1 of \cite{alliez19}, it shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
+For example, see Figure 1 of Alliez et al.\cite{alliez19}.
+It shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
Acceptable version intervals between the dependencies will cause incompatibilities in a year or two, when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
Since domain scientists do not always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example, just to use the work of a previous PhD student in the team).
@@ -582,7 +583,7 @@ merely installing the Python installer (\inlinecode{pip}) on a Debian system (wi
As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each required 17 and 26 Python modules as dependencies, respectively.
It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
This problem actually occurred while we were testing Sciunit: even though it was installed, it could not run because of conflicts (its last commit was only 1.5 years old); for more, see Appendix \ref{appendix:sciunit}.
-\citeappendix{hinsen15} also report a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
+Hinsen\citeappendix{hinsen15} also reports a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).
\subsubsection{Generational gap}