path: root/paper.tex
Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  1848
1 file changed, 1702 insertions(+), 146 deletions(-)
diff --git a/paper.tex b/paper.tex
index 63ba4fc..e8b7cc9 100644
--- a/paper.tex
+++ b/paper.tex
@@ -1,15 +1,17 @@
-%% Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%% See the end of the file for license conditions.
-\documentclass[10pt, twocolumn]{article}
-
-%% (OPTIONAL) CONVENIENCE VARIABLE: Only relevant when you use Maneage's
-%% '\includetikz' macro to build plots/figures within LaTeX using TikZ or
-%% PGFPlots. If so, when the Figure files (PDFs) are already built, you can
-%% avoid TikZ or PGFPlots completely by commenting/removing the definition
-%% of '\makepdf' below. This is useful when you don't want to slow-down a
-%% LaTeX-only build of the project (for example this happens when you run
-%% './project make dist'). See the definition of '\includetikz' in
-%% `tex/preamble-pgfplots.tex' for more.
+%% Main LaTeX source of project's paper, license is printed in the end.
+%
+%% Copyright (C) 2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
+%% Copyright (C) 2020 Raúl Infante-Sainz <infantesainz@gmail.com>
+%% Copyright (C) 2020 Boudewijn F. Roukema <boud@astro.uni.torun.pl>
+%% Copyright (C) 2020 David Valls-Gabaud <david.valls-gabaud@obspm.fr>
+%% Copyright (C) 2020 Roberto Baena-Gallé <roberto.baena@gmail.com>
+\documentclass[journal]{IEEEtran}
+
+%% This is a convenience variable if you are using PGFPlots to build plots
+%% within LaTeX. If you want to import PDF files for figures directly, you
+%% can use the standard `\includegraphics' command. See the definition of
+%% `\includetikz' in `tex/preamble-pgfplots.tex' for where the files are
+%% assumed to be if you use `\includetikz' when `\makepdf' is not defined.
\newcommand{\makepdf}{}
%% VALUES FROM ANALYSIS (NUMBERS AND STRINGS): this file is automatically
@@ -17,50 +19,259 @@
%% (defined with '\newcommand') for various processing outputs to be used
%% within the paper.
\input{tex/build/macros/project.tex}
-
-%% MANEAGE-ONLY PREAMBLE: this file contains LaTeX constructs that are
-%% provided by Maneage (for example enabling or disabling of highlighting
-%% from the './project' script). They are not style-related.
\input{tex/src/preamble-maneage.tex}
-%% PROJECT-SPECIFIC PREAMBLE: This is where you can include any LaTeX
-%% setting for customizing your project.
+%% Import the other necessary TeX files for this particular project.
\input{tex/src/preamble-project.tex}
+%% Title and author names.
+\title{\projecttitle}
+\author{
+ Mohammad~Akhlaghi,
+ Ra\'ul Infante-Sainz,
+ Boudewijn F. Roukema,
+ David Valls-Gabaud,
+ Roberto Baena-Gall\'e
+ \thanks{Manuscript received MM DD, YYYY; revised MM DD, YYYY.}
+}
+
+%% The paper headers
+\markboth{Computing in Science and Engineering, Vol. X, No. X, MM YYYY}%
+{Akhlaghi \MakeLowercase{\textit{et al.}}: \projecttitle}
+
+
+
-%% PROJECT TITLE: The project title should also be printed as metadata in
-%% all output files. To avoid inconsistancy caused by manually typing it,
-%% the title is defined with other core project metadata in
-%% 'reproduce/analysis/config/metadata.conf'. That value is then written in
-%% the '\projectitle' LaTeX macro which is available in 'project.tex' (that
-%% was loaded above).
+
+
+
+%% Start the paper.
+\begin{document}
+
+% make the title area
+\maketitle
+
+% As a general rule, do not put math, special symbols or citations
+% in the abstract or keywords.
+\begin{abstract}
+ %% CONTEXT
+ Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term.
+ %% AIM
  A set of criteria is introduced to address this problem.
+ %% METHOD
+ Completeness (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+ They have been tested in several research publications in various fields.
+ %% RESULTS
+ As a proof of concept, ``Maneage'' is introduced for storing projects in machine-actionable and human-readable plain text, enabling cheap archiving, provenance extraction, and peer verification.
+ %% CONCLUSION
+ We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
+ The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders.
+ This paper is itself written with Maneage (project commit \projectversion).
+
+ \vspace{2.5mm}
+ \emph{Appendices} ---
+  Two comprehensive appendices that review existing tools and solutions are available
+\ifdefined\noappendix
+at \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}.
+\else
+at the end (Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}).
+\fi
+
+ \vspace{2.5mm}
+ \emph{Reproducible supplement} ---
+ All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}},
+ Git history of source at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}},
+ which is also archived in \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{SoftwareHeritage}.
+\end{abstract}
+
+% Note that keywords are not normally used for peer-review papers.
+\begin{IEEEkeywords}
+Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
+\end{IEEEkeywords}
+
+
+
+
+
+
+
+% For peer review papers, you can put extra information on the cover
+% page as needed:
+% \ifCLASSOPTIONpeerreview
+% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
+% \fi
%
-%% Please set your project's title in 'metadata.conf' (ideally with other
-%% basic project information) and re-run the project to have your new
-%% title. If you later use a different LaTeX style, please use the same
-%% '\projectitle' in it (after importing 'tex/build/macros/project.tex'
-%% like above), don't type it by hand.
-\title{\large \uppercase{\projecttitle}}
+% For peerreview papers, this IEEEtran command inserts a page break and
+% creates the second title. It will be ignored for other modes.
+\IEEEpeerreviewmaketitle
-%% AUTHOR INFORMATION: For a more fine-grained control of the headers
-%% including author name, or paper info, see
-%% `tex/src/preamble-header.tex'. Note that if you plan to use a journal's
-%% LaTeX style file, you will probably set the authors in a different way,
-%% feel free to change them here, this is just basic style and varies from
-%% project to project.
-\author[1]{Your name}
-\author[2]{Coauthor one}
-\author[1,3]{Coauthor two}
-\affil[1]{The first affiliation in the list.; \url{your@email.address}}
-\affil[2]{Another affilation can be put here.}
-\affil[3]{And generally as many affiliations as you like.
-\par \emph{Received YYYY MM DD; accepted YYYY MM DD; published YYYY MM DD}}
-\date{}
+\section{Introduction}
+% The very first letter is a 2 line initial drop letter followed
+% by the rest of the first word in caps.
+%\IEEEPARstart{F}{irst} word
+
+Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
+Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day,
+starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
+
+However, these technologies develop fast, e.g., code written in Python 2 \new{(which is no longer officially maintained)} often cannot run with Python 3.
+The cost of staying up to date within this rapidly-evolving landscape is high.
+Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools because it determines their results and interpretations.
+Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
+
+
+
+
+
+\section{Longevity of existing tools}
+\label{sec:longevityofexisting}
+\new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
+Longevity is defined as the length of time that a project remains \emph{functional} after its creation.
+Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary).
+Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} it in another project, checking the version of the software that was used, or checking the source of the input data.
+Extracting these from execution outputs is not always possible.}
+
+Longevity is as important in science as in some fields of industry, but not all; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
+To highlight the necessity, a short review of commonly-used tools is provided below:
+(1) environment isolators (virtual machines, VMs, or containers);
+(2) package managers (PMs, like Conda, Nix, or Spack);
+(3) job management (like shell scripts or Make);
+(4) notebooks (like Jupyter).
+\new{A comprehensive review of existing tools and solutions is available in the
+ \ifdefined\noappendix
+ \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices}.%
+ \else%
+ appendices (\ref{appendix:existingsolutions}).%
+ \fi%
+}
+
+To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
+However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution.
+We will thus focus on Docker here.
+
+\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
+However, that is rarely done.}
+Usually images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'
+ \ifdefined\noappendix
+ \new{(more examples in the \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices})}.%
+ \else%
+ \new{(more examples: see the appendices (\ref{appendix:existingtools})).}%
+ \fi%
+The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
+Hence, if the image is built in different months, it will contain different OS components.
+In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL.
+Generally, pre-built binary files (like Docker images) are large and expensive to maintain and archive.
+%% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
+\new{Because of this, DockerHub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
+Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade)}.
+
+Once the host OS is ready, PMs are used to install the software or environment.
+Usually, the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first, and higher-level software is built with generic PMs.
+The former has \new{the same longevity} as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation \new{with a typical longevity of a few years}.
+Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; that of their supported CPU architectures}.
+However, they need root permissions and are primarily targeted at the Linux kernel.
+Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them.
+Unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
+Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity.
+
+With the software environment built, job management is the next component of a workflow.
+Visual/GUI workflow tools like Apache Taverna, GenePattern \new{(deprecated)}, Kepler or VisTrails \new{(deprecated)}, which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management.
+\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
+These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.}
+The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project.
+Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
+This includes automatic verification, which is possible in many solutions, but is rarely practiced.
+Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
+
+Finally, to blend narrative into the workflow, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
+However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
+Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
+\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting which cell to run.
+The default cells do not include dependencies (requiring some cells to be run only after certain others are re-done), parallel execution, or usage of more than one language.
+There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
+However, since they are not part of the core and have their own dependencies, their longevity can be assumed to be shorter.
+Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
+In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
+
+An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
+Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage.
+This is possible owing to the focus on low-level algorithms with no dependencies beyond an ISO C compiler.
+However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
+
+
+
+
+
+\section{Proposed criteria for longevity}
+\label{criteria}
+The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}.
+In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
+Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintenance by repository staff is often either practically infeasible or unscalable.
+We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
+
+\textbf{Criterion 1: Completeness.}
+A project that is complete (self-contained) has the following properties.
+(1) No dependency beyond the Portable Operating System Interface (POSIX, \new{a minimal Unix-like standard that is shared between many operating systems}).
+POSIX has been developed by the Austin Group (which includes IEEE) since 1988 and many OSes have complied.
+(2) Primarily stored as plain text, not needing specialized software to open, parse, or execute.
+(3) No impact on the host OS libraries, programs or environment.
+(4) Does not require root privileges to run (during development or post-publication).
+(5) Builds its own controlled software for an independent environment.
+(6) Can run locally (without an internet connection).
+(7) Contains the full project's analysis, visualization \emph{and} narrative: including instructions to automatically access/download raw inputs, build necessary software, do the analysis, produce final data products \emph{and} final published report with figures \emph{as output}, e.g., PDF or HTML.
+(8) It can run automatically, with no human interaction.
+
+\textbf{Criterion 2: Modularity.}
+A modular project enables and encourages the analysis to be broken into independent modules with well-defined inputs/outputs and minimal side effects.
+Explicit communication between various modules enables optimizations on many levels:
+(1) Execution in parallel and avoiding redundancies (when a dependency of a module has not changed, it will not be re-run).
+(2) Usage in other projects.
+(3) Easy debugging and improvements.
+(4) Modular citation of specific parts.
+(5) Provenance extraction.
+
+\textbf{Criterion 3: Minimal complexity.}
+Minimal complexity can be interpreted as:
+(1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
+A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is now a dead language and no longer supported)}.
+More stable/basic tools can be used with less long-term maintenance costs.
+(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices.
+
+\textbf{Criterion 4: Scalability.}
+A scalable project can easily be used in arbitrarily large and/or complex projects.
+On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable.
+
+\textbf{Criterion 5: Verifiable inputs and outputs.}
+The project should automatically verify its inputs (software source code and data) \emph{and} outputs, not needing any expert knowledge.
+
+\textbf{Criterion 6: Recorded history.}
+No exploratory research is done in a single, first attempt.
+Projects evolve as they are being completed.
+It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
+Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
+The derivation ``history'' of a result is thus no less valuable than the result itself.
+
+\textbf{Criterion 7: Including narrative, linked to analysis.}
+A project is not just its computational analysis.
+A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
+A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, the interpretation of the results, and the context in relation to other projects/papers.
+This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
+
+\textbf{Criterion 8: Free and open source software:}
+Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box and we do not need an artificial source of ambiguity wrapped over it.
+A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined), allows others to learn from, modify, and build upon it.
+When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
+In contrast, non-free tools typically cannot be distributed or modified by others, making the project reliant on a single supplier (even without payments).
+
+\new{It may happen that proprietary software is necessary to convert proprietary data formats produced by special hardware (for example micro-arrays in genetics) into free data formats.
+In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example, on Zenodo).}
+
@@ -69,146 +280,1494 @@
-%% Start creating the paper.
-\begin{document}
-%% Project abstract and keywords.
-\includeabstract{
- Welcome to Maneage (\emph{Man}aging data lin\emph{eage}) and reproducible papers/projects, for a review of the basics of this system, please see \citet{maneage}.
- You are now ready to configure Maneage and implement your own research in this framework.
- Maneage contains almost all the elements that you will need in a research project, and adding any missing parts is very easy once you become familiar with it.
- For example it already has steps to downloading of raw data and necessary software (while verifying them with their checksums), building the software, and processing the data with the software in a highly-controlled environment.
- But Maneage is not just for the analysis of your project, you will also write your paper in it (by replacing this text in \texttt{paper.tex}): including this abstract, figures and bibliography.
- If you design your project with Maneage's infra-structure, don't forget to add a notice and clearly let the readers know that your work is reproducible, we should spread the word and show the world how useful reproducible research is for the sciences, also don't forget to cite and acknowledge it so we can continue developing it.
- This PDF was made with Maneage, commit \projectversion{}.
+\section{Proof of concept: Maneage}
- \vspace{0.25cm}
+With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
+\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} (which illustrates statistical reproducibility for parallelised code).}
+It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
- \textsl{Keywords}: Add some keywords for your research here.
+The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
+It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
+The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
- \textsl{Reproducible paper}: All quantitave results (numbers and plots)
- in this paper are exactly reproducible with Maneage
- (\url{https://maneage.org}). }
+Technically, the hardest criterion to implement was the first (completeness) and, in particular, avoiding non-POSIX dependencies.
+One solution we considered was GNU Guix and Guix Workflow Language (GWL).
+However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
+Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make.
-%% To add the first page's headers.
-\thispagestyle{firststyle}
+Make is not an analysis language; it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell, or C).
+Make is standardized in POSIX and is used in almost all core OS components.
+It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
+Researchers using free software tools have also already had some exposure to it \new{(almost all free software projects are built with Make).}
+Linking the analysis and narrative (criterion 7) was historically our first design element.
+To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
+Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
+For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
+The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
+The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value `\inlinecode{0.25}' when the PDF output is built.
+Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
+These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
+This allows automatic updates to the embedded numbers during the experimentation phase of a project \emph{and} accurate post-publication provenance.
+Through the former, manual updates by authors (which are prone to errors and discourage improvements or experimentation after writing the first draft) are bypassed.
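+As a minimal sketch of this mechanism (the file names and the \inlinecode{awk} command below are illustrative simplifications, not Maneage's actual implementation), a Make rule can compute a value during the analysis and store it as a \LaTeX{} macro:
+\begin{lstlisting}[
+    label=code:macrosketch,
+    caption={Illustrative sketch of writing a \LaTeX{} macro during the analysis}
+  ]
+# Hypothetical rule: derive one number from a data file and
+# store it as a LaTeX macro (recipe lines start with a TAB).
+out-demo.tex: input-catalog.txt
+        sn=$$(awk '{s+=$$3} END{printf "%.2f", s/NR}' input-catalog.txt); \
+        printf '\\newcommand{\\demosfoptimizedsn}{%s}\n' "$$sn" > $@
+\end{lstlisting}
+Re-running the analysis re-builds the macro file, so the number quoted in the narrative always corresponds to the data and commands that produced it.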
+Acting as a link, the macro files build the core skeleton of Maneage.
+For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
+These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
+\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
+These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
+The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and of course, the high-level science software.
+\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's webpage.}
+On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
+Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
+\new{Finally, some software cannot be built on some CPU architectures, hence by default, the architecture is included in the final built paper automatically (see below).}
-%% Start of main body.
-\section{Congratulations!}
-Congratulations on running the raw template project! You can now follow the ``Customization checklist'' in the \texttt{README-hacking.md} file, customize this template and start your exciting research project over it.
-You can always merge Maneage back into your project to improve its infra-structure and leaving your own project intact.
-If you haven't already read \citet{maneage}, please do so before continuing, it isn't long (just 7 pages).
+\new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
+However, this is only necessary once on each computer; the analysis phase (which usually takes months to write for a normal project) will later use the same environment.
+To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
+The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has instructions on building in Docker.
+Through Docker (or VMs), users on Microsoft Windows can benefit from Maneage, and for Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.}
-While you are writing your paper, just don't forget to \emph{not} use numbers or fixed strings (for example database urls like \url{\wfpctwourl}) directly within your \LaTeX{} source.
-Put them in configuration files and after using them in the analysis, pass them into the \LaTeX{} source through macros in the same subMakefile that used them.
-For some already published examples, please see \citet{maneage}\footnote{\url{https://gitlab.com/makhlaghi/maneage-paper}}, \citet{infantesainz20}\footnote{\url{https://gitlab.com/infantesainz/sdss-extended-psfs-paper}} and \citet{akhlaghi19}\footnote{\url{https://gitlab.com/makhlaghi/iau-symposium-355}}.
-Working in this way, will let you focus clearly on your science and not have to worry about fixing this or that number/name in the text.
+The analysis phase of the project, however, is naturally different from one project to another at a low level.
+It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
+We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (left).
+Figure \ref{fig:datalineage} (right) is the data lineage graph that produced it (including this complete paper).
-Once your project is ready for publication, there is also a ``Publication checklist'' in \texttt{README-hacking.md} that will guide you in the steps to do for making your project as FAIR as possible (Findable, Accessibile, Interoperable, and Reusable).
+\begin{figure*}[t]
+ \begin{center}
+ \includetikz{figure-tools-per-year}{width=0.95\linewidth}
+% \includetikz{figure-data-lineage}{width=0.85\linewidth}
+ \end{center}
+ \vspace{-3mm}
+ \caption{\label{fig:datalineage}
+ Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
+ It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale).
+ Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
+ Each colored box is a file in the project and \new{arrows show the operation of various software: linking input file(s) to output file(s)}.
+ Green files/boxes are plain-text files that are under version control and in the project source directory.
+ Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
+ For example, \inlinecode{paper.pdf} \new{is created by running \LaTeX{} on} \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
+ \new{Other software is used in other steps.}
+ The solid arrows and full-opacity built boxes correspond to the lineage of this paper.
+ The dotted arrows and built boxes show the scalability of Maneage (ease of adding hypothetical steps to the project as it evolves).
+ The underlying data of the left plot is available at
+ \href{https://zenodo.org/record/\projectzenodoid/files/tools-per-year.txt}{zenodo.\projectzenodoid/tools-per-year.txt}.
+ }
+\end{figure*}
-The default \LaTeX{} structure within Maneage also has two \LaTeX{} macros for easy marking of text within your document as \emph{new} and \emph{notes}.
-To activate them, please use the \texttt{--highlight-new} or \texttt{--highlight-notes} options with \texttt{./project make}.
+The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
+It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
+This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
+A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
+A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
-For example if you run \texttt{./project make --highlight-new}, then \new{this text (that has been marked as \texttt{new}) will show up as green in the final PDF}.
-If you run \texttt{./project make --highlight-notes} then you will see a note following this sentence that is written in red and has square brackets around it (it is invisible without this option).
-\tonote{This text is written within a \texttt{tonote} and is invisible without \texttt{--highlight-notes}.}
-You can also use these two options together to both highlight the new parts and add notes within the text.
+\begin{lstlisting}[
+ label=code:topmake,
+ caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}}
+ ]
+# Default target/goal of project.
+all: paper.pdf
-Another thing you may notice from the \LaTeX{} source of this default paper is there is one line per sentence (and one sentence in a line).
-Of course, as with everything else in Maneage, you are free to use any format that you are most comfortable with.
-The reason behind this choice is that this source is under Git version control and that Git also primarily works on lines.
-In this way, when a change in a setence is made, git will only highlight/color that line/sentence we have found that this helps a lot in viewing the changes.
-Also, this format helps in reminding the author when the sentence is getting too long!
-Here is a tip when looking at the changes of narrative document in Git: use the \texttt{--word-diff} option (for example \texttt{git diff --word-diff}, you can also use it with \texttt{git log}).
+# Define subMakefiles to load in order.
+makesrc = initialize \ # General
+ download \ # General
+ format \ # Project-specific
+ demo-plot \ # Project-specific
+ verify \ # General
+ paper # General
-Figure \ref{squared} shows a simple plot as a demonstration of creating plots within \LaTeX{} (using the {\small PGFP}lots package).
-The minimum value in this distribution is $\deletememin$, and $\deletememax$ is the maximum.
-Take a look into the \LaTeX{} source and you'll see these numbers are actually macros that were calculated from the same dataset (they will change if the dataset, or function that produced it, changes).
+# Load all the configuration files.
+include reproduce/analysis/config/*.conf
-The individual {\small PDF} file of Figure \ref{squared} is available under the \texttt{tex/tikz/} directory of your build directory.
-You can use this PDF file in other contexts (for example in slides showing your progress or after publishing the work).
-If you want to directly use the {\small PDF} file in the figure without having to let {\small T}i{\small KZ} decide if it should be remade or not, you can also comment the \texttt{makepdf} macro at the top of this \LaTeX{} source file.
+# Load the subMakefiles in the defined order
+include $(foreach s,$(makesrc), \
+ reproduce/analysis/make/$(s).mk)
+\end{lstlisting}
-\begin{figure}[t]
- \includetikz{delete-me-squared}{width=\linewidth}
+All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
+Project authors add their modular subMakefiles in between.
+Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
+Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
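+As an illustration of this structure, a heavily simplified, hypothetical subMakefile could look like the following (\inlinecode{BDIR} denotes the build directory; the file and macro names here are illustrative, not Maneage's exact ones):
+\begin{lstlisting}[
+    label=code:submakesketch,
+    caption={Illustrative sketch of a modular subMakefile ending in its macro file}
+  ]
+# Hypothetical intermediate data product of this module.
+$(BDIR)/tools-per-year.txt: $(BDIR)/menke20-table.txt
+        awk 'NR>1 {print $$1, 100*$$3/$$2}' $< > $@
+
+# The module's final target is its LaTeX macro file; values
+# used in the narrative cascade down into a file like this.
+$(BDIR)/demo-plot.tex: $(BDIR)/tools-per-year.txt
+        printf '\\newcommand{\\demoplotnumyears}{%s}\n' \
+               "$$(wc -l < $<)" > $@
+\end{lstlisting}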
- \captionof{figure}{\label{squared} A very basic $X^2$ plot for
- demonstration.}
-\end{figure}
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}).
+All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
+Where exact reproducibility is not possible \new{(for example due to parallelization)}, values can be verified by any statistical means, specified by the project authors.
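+A minimal sketch of such a verification rule (the \inlinecode{<sha256>} placeholder and the macro name below are illustrative) is a recipe that aborts the build when a deliverable no longer matches the checksum recorded by the authors:
+\begin{lstlisting}[
+    label=code:verifysketch,
+    caption={Illustrative sketch of checksum-based verification}
+  ]
+# Abort if a deliverable differs from its recorded checksum;
+# replace <sha256> with the value recorded by the authors.
+$(BDIR)/verify.tex: $(BDIR)/tools-per-year.txt
+        echo "<sha256>  $<" | sha256sum --check --status || \
+          { echo "Verification of $< failed!" >&2; exit 1; }
+        printf '\\newcommand{\\verifysucceeded}{yes}\n' > $@
+\end{lstlisting}
+Since this rule sits between the analysis and \inlinecode{paper.pdf}, the final PDF cannot be built from deliverables that fail verification.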
-Figure \ref{image-histogram} is another demonstration of showing images (datasets) using PGFPlots.
-It shows a small crop of an image from the Wide-Field Planetary Camera 2 (that was installed on the Hubble Space Telescope from 1993 to 2009).
-As another more realistic demonstration of reporting results with Maneage, here we report that the mean pixel value in that image is $\deletemewfpctwomean$ and the median is $\deletemewfpctwomedian$.
-The skewness in the histogram of Figure \ref{image-histogram}(b) explains this difference between the mean and median.
-The dataset is visualized here as a black and white image using the \textsf{Convert\-Type} program of GNU Astronomy Utilities (Gnuastro).
-The histogram and basic statstics were generated with Gnuastro's \textsf{Statistics} program.
+\begin{figure*}[t]
+ \begin{center} \includetikz{figure-branching}{scale=1}\end{center}
+ \vspace{-3mm}
+ \caption{\label{fig:branching} Maneage is a Git branch.
+ Projects using Maneage are branched off it and apply their customizations.
+ (a) A hypothetical project's history prior to publication.
+ The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage.
+ (b) A finished/published project can be revitalized for new technologies by merging with the core branch.
+ Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
+ Briefly, Git is a version control system, allowing a structured backup of project files.
+ Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
+    The upward arrows at the branch-tops are therefore in the time direction.
+ }
+\end{figure*}
-{\small PGFP}lots\footnote{\url{https://ctan.org/pkg/pgfplots}} is a great tool to build the plots within \LaTeX{} and removes the necessity to add further dependencies, just to create the plots.
-There are high-level libraries like Matplotlib which also generate plots.
-However, the problem is that they require \emph{many} dependencies, for example see Figure 1 of \citet{alliez19}.
-Installing these dependencies from source, is not easy and will harm the reproducibility of your paper in the future.
+To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
+By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings, or parameters.
+Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make.
+For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, thereby enabling exact verification before usage.
+To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}.
+Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
+An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without knowing the underlying low-level implementation.
+Furthermore, the configuration files are a prerequisite of the targets that use them.
+If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
+This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
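+As a hedged sketch of this convention (variable names and the value below are illustrative, not the project's actual settings), a configuration file and a rule that consumes it could look like this:
+\begin{lstlisting}[
+    label=code:confsketch,
+    caption={Illustrative sketch of a configuration file and a rule using it}
+  ]
+# demo-year.conf (the value shown is only illustrative):
+demo-year = 2000
+
+# In the subMakefile: the .conf file is a prerequisite, so
+# editing it re-runs only this recipe and its descendants.
+$(BDIR)/demo-count.txt: $(BDIR)/tools-per-year.txt demo-year.conf
+        awk -v y=$(demo-year) '$$1 == y {print $$2}' $< > $@
+\end{lstlisting}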
-Furthermore, since {\small PGFP}lots builds the plots within \LaTeX{}, it respects all the properties of your text (for example line width and fonts and etc).
-Therefore the final plot blends in your paper much more nicely.
-It also has a wonderful manual\footnote{\url{http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pgfplots.pdf}}.
+\new{To summarize, in contrast to notebooks like Jupyter, in a ``Maneage''d project the analysis scripts and configuration parameters are not blended into the running code (nor stored together in a single file).
+To satisfy the modularity criterion, the analysis steps are run in their own files (for their own respective language, thus maximally benefiting from its unique features) and the narrative has its own file(s).
+The analysis communicates with the narrative through intermediate files (the \LaTeX{} macros), enabling much better blending of analysis outputs in the narrative sentences than is possible with the high-level notebooks and enabling direct provenance tracking.}
-\begin{figure}[t]
- \includetikz{delete-me-image-histogram}{width=\linewidth}
+To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
+Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
+\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
+Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
+There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
- \captionof{figure}{\label{image-histogram} (a) An example image of the Wide-Field Planetary Camera 2, on board the Hubble Space Telescope from 1993 to 2009.
- This is one of the sample images from the FITS standard webpage, kept as examples for this file format.
- (b) Histogram of pixel values in (a).}
-\end{figure}
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
+These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgements)}.
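+As a simplified sketch (not Maneage's actual recipe), producing such a macro only requires querying Git during initialization and writing the result into a macro file:
+\begin{lstlisting}[
+    label=code:initsketch,
+    caption={Illustrative sketch of storing the commit hash as a \LaTeX{} macro}
+  ]
+# Simplified sketch of an initialization rule: record the hash
+# of the currently checked-out commit for use in the narrative.
+tex/build/macros/initialize.tex:
+        mkdir -p $(dir $@)
+        printf '\\newcommand{\\projectversion}{%s}\n' \
+               "$$(git describe --dirty --always)" > $@
+\end{lstlisting}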
+The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
+in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
+\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
+Finally, the complete project source is usually $\sim100$ kilobytes.
+It can thus easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo, and archived in SoftwareHeritage.
-\section{Notice and citations}
-To encourage other scientists to publish similarly reproducible papers, please add a notice close to the start of your paper or in the end of the abstract clearly mentioning that your work is fully reproducible.
-One convention we have adopted until now is to put the Git checkum of the project as the last word of the abstract, for example see \citet{akhlaghi19}, \citet{infantesainz20} and \citet{maneage}
+\begin{lstlisting}[
+ label=code:branching,
+ caption={Starting a new project with Maneage, and building it},
+ ]
+# Cloning main Maneage branch and branching off it.
+$ git clone https://git.maneage.org/project.git
+$ cd project
+$ git remote rename origin origin-maneage
+$ git checkout -b master
-Finally, don't forget to cite \citet{maneage} and acknowledge the funders mentioned below.
-Otherwise we won't be able to continue working on Maneage.
-Also, just as another reminder, before publication, don't forget to follow the ``Publication checklist'' of \texttt{README-hacking.md}.
-%% End of main body.
+# Build the raw Maneage skeleton in two phases.
+$ ./project configure # Build software environment.
+$ ./project make # Do analysis, build PDF paper.
+# Start editing, test-building and committing
+$ emacs paper.tex # Set your name as author.
+$ ./project make # Re-build to see effect.
+$ git add -u && git commit # Commit changes.
+\end{lstlisting}
-\section{Acknowledgments}
-\new{Please include the following paragraph in the Acknowledgement section of your paper.
- In order to get more funding to continue working on Maneage, we need to to cite it and its funding institutions in your papers.
- Also note that at the start, it includes version and date information for the most recent Maneage commit you merged with (which can be very helpful for others) as well as very basic information about your CPU architecture (which was extracted during configuration).
- This CPU information is very important for reproducibility because some software may not be buildable on other CPU architectures, so it is necessary to publish CPU information with the results and software versions.}
-This project was developed in the reproducible framework of Maneage \citep[\emph{Man}aging data lin\emph{eage},][latest Maneage commit \maneageversion{}, from \maneagedate]{maneage}.
-The project was built on an {\machinearchitecture} machine with {\machinebyteorder} byte-order, see Appendix \ref{appendix:software} for the used software and their versions.
-Maneage has been funded partially by the following grants: Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhlaghi and its Grant-in-Aid for Scientific Research (21244012, 24253003).
+
+
+
+
+
+\section{Discussion}
+\label{discussion}
+%% It should provide some insight or 'lessons learned', where 'lessons learned' is jargon for 'informal hypotheses motivated by experience', reworded to make the hypotheses sound more scientific (if it's a 'lesson', then it sounds like knowledge, when in fact it's really only a hypothesis).
+%% What is the message we should take from the experience?
+%% Are there clear demonstrated design principles that can be reapplied elsewhere?
+%% Are there roadblocks or bottlenecks that others might avoid?
+%% Are there suggested community or work practices that can make things smoother?
+%% Attempt to generalise the significance.
+%% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
+
+We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool.
+The Maneage user base grew with the support of the RDA, underscoring some difficulties for widespread adoption.
+
+Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
+Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
+Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
+Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology.
+
+Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
+Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template that functions immediately out of the box, producing a paper with an example calculation that researchers just need to start customizing.
+The template blends version control, the research paper's narrative, software management, \emph{and} a robust data management strategy using mature and time-tested tools.
+We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
+
+Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
+This requires maintenance by our core team and consumes time and energy.
+However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding or fixing their required software on their own.
+They later share these low-level commits on the core branch, thus propagating the improvements to all derived projects.
+
+A related caveat is that POSIX is a fuzzy standard, not guaranteeing the bit-wise reproducibility of programs.
+It has been chosen here, however, as the underlying platform \new{because our focus is on reproducing the results (output of software), not the software itself.}
+POSIX is ubiquitous and low-level software (e.g., core GNU tools) is installable on most POSIX-compliant OSs.
+Well written software internally corrects for differences in OS or hardware that may affect its functionality (through tools like the GNU portability library).
+On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
+However, glibc is not installable on some POSIX OSs (e.g., macOS), and all programs link with the C library.
+This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
+With everything else under precise control in Maneage, the effect of differing kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
+Using continuous integration (CI) is one way to precisely identify breaking points on multiple systems.}
+
+% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
+%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.
+%We propose these solutions:
+%1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
+%This is a long-term goal and would require major changes to academic value systems.
+%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
+
+Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows their many advantages.
+For example, publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
+The completeness criterion implies that algorithms and data selection can be included in the optimizations.
+
+Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis and the resulting narrative, \emph{and} the history of that analysis+narrative.
+Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
+Likewise, when a bug is found in one science software, affected projects can be detected and the scale of the effect can be measured.
+Combined with Software Heritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
+Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
+
+From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in \cite{austin17}:
+(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, thereby improving sustainability.
+(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
+(3) Software management: with these criteria, each project comes with its unique and complete software management.
+It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), hence enabling robust software management, preservation, publishing, and citation.
+For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we have exploited the free-software criterion to distribute the source code of all software used in each project as deliverables.
+(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria.
+
+
+
+
+
+% use section* for acknowledgment
+\section*{Acknowledgment}
+
+The authors wish to thank (sorted alphabetically)
+Julia Aguilar-Cabello,
+Dylan A\"issi,
+Marjan Akbari,
+Alice Allen,
+Pedram Ashofteh Ardakani,
+Roland Bacon,
+Michael R. Crusoe,
+Antonio D\'iaz D\'iaz,
+Surena Fatemi,
+Fabrizio Gagliardi,
+Konrad Hinsen,
+Marios Karouzos,
+Mohammad-reza Khellat,
+Johan Knapen,
+Tamara Kovazh,
+Terry Mahoney,
+Ryan O'Connor,
+Mervyn O'Luing,
+Simon Portegies Zwart,
+Idafen Santana-P\'erez,
+Elham Saremi,
+Yahya Sefidbakht,
+Zahra Sharbaf,
+Nadia Tonello,
+Ignacio Trujillo and
+the AMIGA team at the Instituto de Astrof\'isica de Andaluc\'ia
+for their useful help, suggestions, and feedback on Maneage and this paper.
+\new{The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) provided many points that greatly helped to clarify this paper.}
+
+This project was developed in the reproducible framework of Maneage (\emph{Man}aging data lin\emph{eage})
+\new{on Commit \inlinecode{\projectversion} (in the project branch).
+The latest merged Maneage commit was \inlinecode{\maneageversion} (\maneagedate).
+This project was built on an \inlinecode{\machinearchitecture} machine with {\machinebyteorder} byte-order and address sizes {\machineaddresssizes}}.
+
+Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
+The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to
+M. Akhlaghi and its Grant-in-Aid for Scientific Research (21244012, 24253003).
The European Research Council (ERC) advanced grant 339659-MUSICOS.
-The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
-The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
+The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie
+Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
+The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European
+Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
+The ``A next-generation worldwide quantum sensor network with optical atomic clocks'' project of the TEAM IV programme of the
+Foundation for Polish Science co-financed by the EU under ERDF.
+The Polish MNiSW grant DIR/WK/2018/12.
+The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314.
+
+
+
+
+
+
+
+
+
+%% Bibliography of main body
+\bibliographystyle{IEEEtran_openaccess}
+\bibliography{IEEEabrv,references}
+
+%% Biography
+\begin{IEEEbiographynophoto}{Mohammad Akhlaghi}
+ is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias (IAC), Spain.
+ He received his PhD from Tohoku University (Japan) and was previously a CNRS postdoc in Lyon (France).
+ Email: mohammad-at-akhlaghi.org; Website: \url{https://akhlaghi.org}.
+\end{IEEEbiographynophoto}
+
+\begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
+ is a doctoral student at IAC, Spain.
+  He received his M.Sc. from the University of Granada (Spain).
+ Email: infantesainz-at-gmail.com; Website: \url{https://infantesainz.org}.
+\end{IEEEbiographynophoto}
+
+\begin{IEEEbiographynophoto}{Boudewijn F. Roukema}
+ is a professor of cosmology at the Institute of Astronomy, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University in Toru\'n, Grudziadzka 5, Poland (PhD, Australian National University; boud-at-astro.uni.torun.pl).
+\end{IEEEbiographynophoto}
+
+\begin{IEEEbiographynophoto}{David Valls-Gabaud}
+ is a CNRS Research Director at LERMA, Observatoire de Paris, France.
+ Educated at the universities of Madrid, Paris, and Cambridge, he obtained his PhD in 1991.
+ Email: david.valls-gabaud-at-obspm.fr.
+\end{IEEEbiographynophoto}
+
+\begin{IEEEbiographynophoto}{Roberto Baena-Gall\'e}
+  held a postdoc position at the IAC and obtained a degree in Telecommunication and Electronics from the University of Seville, with a PhD from the University of Barcelona.
+ Email: rbaena-at-iac.es
+\end{IEEEbiographynophoto}
+\vfill
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+%% Appendix (only build if 'noappendix' has not been given). So in default,
+%% the appendix is built.
+\ifdefined\noappendix
+\else
+\clearpage
+\appendices
+\section{Survey of existing tools for various phases}
+\label{appendix:existingtools}
+
+Data analysis workflows (including those that aim for reproducibility) are commonly high-level frameworks which employ various lower-level components.
+To help in reviewing existing reproducible workflow solutions in light of the proposed criteria in Appendix \ref{appendix:existingsolutions}, we first need to survey the most commonly employed lower-level tools.
+
+\subsection{Independent environment}
+\label{appendix:independentenvironment}
+
+The lowest-level challenge of any reproducible solution is to avoid, to a desirable/certain level, the differences between various run-time environments: for example, different hardware, operating systems, or versions of existing dependencies.
+Therefore, any reasonable attempt at providing a reproducible workflow starts with isolating its running environment from the host environment.
+There are three general technologies that are used for this purpose and reviewed below:
+1) Virtual machines,
+2) Containers,
+3) Independent build in host's file system.
+
+\subsubsection{Virtual machines}
+\label{appendix:virtualmachines}
+Virtual machines (VMs) host a binary copy of a full operating system that can be run on other operating systems.
+This includes the lowest-level operating system component or the kernel.
+VMs thus provide the ultimate control one can have over the run-time environment of an analysis.
+However, the VM's kernel does not talk directly to the hardware that is running the analysis; it talks to a simulated hardware layer that is provided by the host's kernel.
+Therefore, a process that is run inside a virtual machine can be much slower than one that is run on a native kernel.
+An advantage of VMs is that they are stored as a single file that can be copied from one computer to another, keeping the full environment within them if the format is recognized.
+VMs are used by cloud service providers, enabling fully independent operating systems on their large servers (where the customer can have root access).
+
+VMs were used in solutions like SHARE \citeappendix{vangorp11} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 \citeappendix{gabriel11}), or in suggested reproducible papers like \citeappendix{dolfi14}.
+However, due to their very large size, these are expensive to maintain, thus leading SHARE to discontinue its services in 2019.
+The URL to the VM file \texttt{provenance\_machine.ova} that is mentioned in \citeappendix{dolfi14} is not currently accessible (we suspect that this is due to size and archival costs).
+
+\subsubsection{Containers}
+\label{appendix:containers}
+Containers also host a binary copy of a running environment, but do not have their own kernel.
+Through a thin layer of low-level system libraries, programs running within a container talk directly with the host operating system kernel.
+Otherwise, containers have their own independent software for everything else.
+Therefore, they have much less overhead in hardware/CPU access.
+As with VMs, users often choose an operating system for the container's independent environment (most commonly a GNU/Linux distribution, which is free software).
+
+Below, we review some of the most common container solutions: Docker, Singularity, and Podman.
+
+\begin{itemize}
+\item {\bf\small Docker containers:} Docker is one of the most popular tools today for keeping an independent analysis environment.
+  It is primarily driven by the need of software developers to reproduce a previous environment, where they mostly have root access on the ``cloud'' (which is just a remote VM).
+ A Docker container is composed of independent Docker ``images'' that are built with a \inlinecode{Dockerfile}.
+ It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build).
+ To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the C library.
+
+ An important drawback of Docker for high performance scientific needs is that it runs as a daemon (a program that is always running in the background) with root permissions.
+ This is a major security flaw that discourages many high performance computing (HPC) facilities from providing it.
+
+\item {\bf\small Singularity:} Singularity \citeappendix{kurtzer17} is a single-image container (unlike Docker which is composed of modular/independent images).
+ Although it needs root permissions to be installed on the system (once), it does not require root permissions every time it is run.
+ Its main program is also not a daemon, but a normal program that can be stopped.
+ These features make it much safer for HPC administrators to install compared to Docker.
+ However, the fact that it requires root access for the initial install is still a hindrance for a typical project: if Singularity is not already present on the HPC, the user's science project cannot be run by a non-root user.
+
+\item {\bf\small Podman:} Podman uses the Linux kernel containerization features to enable containers without a daemon, and without root permissions.
+ It has a command-line interface very similar to Docker, but only works on GNU/Linux operating systems.
+\end{itemize}
+
+Generally, VMs or containers are good solutions for reproducibly running/repeating an analysis in the short term (a couple of years).
+However, their focus is to store the already-built (binary, non-human readable) software environment.
+Because of this, they will be large (many gigabytes) and expensive to archive, download, or access.
+Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}; the same holds for Docker images, as is clear from Docker Hub's recent decision to delete images of free accounts that have not been used for more than 6 months.
+Meng \& Thain \citeappendix{meng17} also give similar reasons on why Docker images were not suitable in their trials.
+
+On a more fundamental level, VMs or containers do not store \emph{how} the core environment was built.
+This information is usually in a third-party repository, and not necessarily inside the container or VM file, making it hard (if not impossible) to track for future users.
+This is a major problem for reproducibility, and it is also highlighted as a major issue for long-term reproducibility in \citeappendix{oliveira18}.
+
+The example of \cite{mesnard20} was previously mentioned in Section \ref{criteria}.
+Another useful example is the \href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{\inlinecode{Dockerfile}} of \citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
+When we tried to build it (November 2020), the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
+In principle, it is possible to investigate the difference between this new image and the old one that the authors used, but that would require a lot of effort and may not be possible when the changes are not available in a third-party public repository or are not under version control.
+In Docker, it is possible to retrieve the precise Docker image with its digest, for example \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we have rarely seen this done in existing examples of ``reproducible'' \inlinecode{Dockerfiles}.
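+As a purely hypothetical sketch (the image tag is taken from the example above, but the digest is a placeholder, not a value from any real registry), such pinning in a \inlinecode{Dockerfile} would look like this:
+\begin{verbatim}
+# Pin the base image by its content digest,
+# not only by its tag ('<digest>' stands for
+# the full 64-character sha256 checksum).
+FROM ubuntu:16.04@sha256:<digest>
+\end{verbatim}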
+
+The ``digest'' is specific to Docker repositories.
+A more generic/long-term approach to ensure identical core OS components at a later time is to construct the containers or VMs with fixed/archived versions of the operating system's ISO files.
+ISO files are pre-built binary files of hundreds of megabytes that do not contain their build instructions.
+For example the archives of Debian\footnote{\inlinecode{\url{https://cdimage.debian.org/mirror/cdimage/archive/}}} or Ubuntu\footnote{\inlinecode{\url{http://old-releases.ubuntu.com/releases}}} provide older ISO files.
+
+The concept of containers (and the independent images that build them) can also be extended beyond just the software environment.
+For example, \citeappendix{lofstead19} propose a ``data pallet'' concept to containerize access to data and thus allow tracing data backwards to the application that produced them.
+
+In summary, containers or VMs are just a built product themselves.
+If they are built properly (for example building a Maneage'd project inside a Docker container), they can be useful for immediate usage and fast moving of the project from one system to another.
+With robust building, the container or VM can also be exactly reproduced later.
+However, attempting to archive the actual binary container or VM files as a black box (not knowing the precise versions of the software in them, and \emph{how} they were built) is expensive, and will not be able to answer the most fundamental questions.
+
+\subsubsection{Independent build in host's file system}
+\label{appendix:independentbuild}
+The virtual machine and container solutions mentioned above have their own independent file system.
+Another approach to having an isolated analysis environment is to use the same file system as the host, but to install the project's software in a non-standard, project-specific directory that does not interfere with the host.
+Because the environment in this approach can be built in any custom location on the host, this solution generally does not require root permissions or extra low-level layers like containers or VMs.
+However, ``moving'' the built product of such solutions from one computer to another is generally not as trivial as with containers or VMs.
+Examples of such third-party package managers (that are detached from the host OS's package manager) include Nix, GNU Guix, Python's Virtualenv package and Conda, among others.
+Because they are highly intertwined with the way software is built and installed, third-party package managers are described in more detail as part of Section \ref{appendix:packagemanagement}.
+
+Maneage (the solution proposed in this paper) also follows a similar approach of building and installing its own software environment within the host's file system, but without depending on it beyond the kernel.
+However, unlike the third-party package managers mentioned above, and following the completeness criterion, Maneage's package management is not detached from the specific research/analysis project: the instructions to build the full isolated software environment are maintained with the high-level analysis steps of the project and with the narrative paper/report of the project.
+
+
+
+
+
+\subsection{Package management}
+\label{appendix:packagemanagement}
+
+Package management is the process of automating the build and installation of a software environment.
+A package manager thus records, for each software package, the information needed to build it automatically: the URL of the software's tarball, the other software that it possibly depends on, and how to configure and build it.
+Package managers can be tied to specific operating systems at a very low level (like \inlinecode{apt} in Debian-based OSs).
+Alternatively, there are third-party package managers that can be installed on many OSs.
+Both are discussed in more detail below.
+
+Package managers are the second component in any workflow that relies on containers or VMs for an independent environment, and the starting point in others that use the host's file system (as discussed above in Section \ref{appendix:independentenvironment}).
+In this section, some common package managers are reviewed, in particular those that are most used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+For a more comprehensive list of existing package managers, see \href{https://en.wikipedia.org/wiki/List_of_software_package_management_systems}{Wikipedia}.
+Note that we are not including package managers that are specific to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX).
+
+
+
+\subsubsection{Operating system's package manager}
+The most commonly used package managers are those of the host operating system, for example \inlinecode{apt} on Debian-based and \inlinecode{yum} on RedHat-based GNU/Linux operating systems, or \inlinecode{pkg} in FreeBSD, among many others in other OSs.
+
+These package managers are tightly intertwined with the operating system: they also include the building and updating of the core kernel and the C library.
+Because they are part of the OS, they also commonly require root permissions.
+Also, it is usually only possible to have one version/configuration of a software package at any moment, and downgrading versions for one project may conflict with other projects, or even cause problems in the OS.
+Hence, if two projects need different versions of a software package, it is not possible to work on both at the same time in the same OS.
+
+When a container or virtual machine (see Appendix \ref{appendix:independentenvironment}) is used for each project, it is common for projects to use the containerized operating system's package manager.
+However, it is important to remember that operating system package managers are not static: the software on their servers is updated.
+Hence, simply running \inlinecode{apt install gcc} will install different versions of the GNU Compiler Collection (GCC) depending on the version of the OS and on when it is run.
+Requesting a specific version of the software does not fully address the problem, because the package manager also downloads and installs its dependencies.
+Hence a fixed version of the dependencies must also be specified.
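+For illustration, a hedged sketch of such pinning with APT is shown below; the package names are real, but the version strings are hypothetical and differ between OS releases:
+\begin{verbatim}
+# Request exact versions of the package
+# and of its main dependencies.
+apt-get install gcc=4:8.3.0-1 \
+                cpp=4:8.3.0-1 \
+                libc6-dev=2.28-10
+\end{verbatim}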
+
+In robust package managers like Debian's \inlinecode{apt} it is possible to fully control (and later reproduce) the build environment of a high-level software.
+Debian also archives all packaged high-level software in its Snapshot\footnote{\inlinecode{\url{https://snapshot.debian.org/}}} service since 2005 which can be used to build the higher-level software environment on an older OS \citeappendix{aissi20}.
+Hence it is indeed theoretically possible to reproduce the software environment only using archived operating systems and their own package managers, but unfortunately we have not seen it practiced in scientific papers/projects.
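+For example, a minimal sketch of using such an archived snapshot with APT (the timestamp below is only illustrative) is to add a line like the following to \inlinecode{/etc/apt/sources.list} and then install packages as usual:
+\begin{verbatim}
+# A single line in /etc/apt/sources.list
+# (wrapped here only for display):
+deb https://snapshot.debian.org/archive/
+    debian/20200101T000000Z/ buster main
+\end{verbatim}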
+
+In summary, the host OS package managers are primarily meant for the operating system components or very low-level components.
+Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) do not use the host's package manager, but an independent package manager, like the ones discussed below.
+
+\subsubsection{Packaging with Linux containerization}
+Once a software package is packaged as an AppImage\footnote{\inlinecode{\url{https://appimage.org}}}, Flatpak\footnote{\inlinecode{\url{https://flatpak.org}}} or Snap\footnote{\inlinecode{\url{https://snapcraft.io}}}, its binary product and all its dependencies (not including the core C library) are bundled into one file.
+This makes it very easy to move that single software's built product to newer systems.
+However, because the C library is not included, it can fail on older systems.
+Moreover, these are designed for the Linux kernel (using its containerization features) and can thus only be run on GNU/Linux operating systems.
+
+\subsubsection{Nix or GNU Guix}
+\label{appendix:nixguix}
+Nix \citeappendix{dolstra04} and GNU Guix \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
+Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
+GNU Guix is based on the same principles as Nix but is implemented differently, so we focus the review here on Nix.
+
+The Nix approach to package management is unique in that it allows exact tracking of all dependencies and allows multiple versions of a software package to co-exist; for more details see \citeappendix{dolstra04}.
+In summary, a unique hash is created from all the components that go into the building of the package.
+That hash is then prefixed to the software's installation directory.
+As an example from \citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2}, and all software that is compiled with it (and thus needs it to run) will link to this unique address.
+This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
+
+As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
+This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
+This is because the focus of Nix or Guix is to create bit-wise reproducible software binaries, which is necessary from the security or software-development perspectives.
+However, in a non-computer-science analysis (for example, in the natural sciences), the main aim is reproducible \emph{results}, which can also be created with the same software version even when the binaries are not bit-wise identical (for example, when they are installed in other locations, because the installation location is hard-coded in the software binary).
+
+Finally, while Guix and Nix do allow precisely reproducible environments, this requires extra effort.
+For example, simply running \inlinecode{guix install gcc} will install the most recent version of GCC, which can differ at different times.
+Hence, as in the discussion of host operating system package managers above, it is up to the user to ensure that the created environment is recorded properly for future reproducibility.
+Generally, this is a major limitation of projects that rely on detached package managers for building their software, including the other tools mentioned below.
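+For instance, with Guix, one way to record the environment properly is through its \inlinecode{describe} and \inlinecode{time-machine} sub-commands; the minimal sketch below assumes a recent Guix that provides both:
+\begin{verbatim}
+# Record the exact Guix revision used today.
+guix describe --format=channels > channels.scm
+
+# Later, install the same GCC from that
+# recorded revision of Guix.
+guix time-machine --channels=channels.scm \
+     -- install gcc-toolchain
+\end{verbatim}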
+
+\subsubsection{Conda/Anaconda}
+\label{appendix:conda}
+Conda is an independent package manager that can be used on GNU/Linux, macOS, or Windows operating systems, although not all software packages are available for all operating systems.
+Conda is able to maintain an approximately independent environment on an operating system without requiring root access.
+
+Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
+However, it is not possible to fix the versions of the dependencies through the YAML files alone.
+This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
+In that discussion, the authors of \citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
+The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citeappendix{gruning18}.
+However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., be re-built) with the project.
+Given these limitations, \citeappendix{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
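+For reference, a minimal sketch of such a Conda YAML environment file is shown below (the project name and version numbers are hypothetical); note that only the top-level packages are pinned, not their own dependencies:
+\begin{verbatim}
+name: myproject          # hypothetical name
+channels:
+  - conda-forge
+dependencies:
+  - python=3.7.3         # pinned directly,
+  - numpy=1.16.4         # but dependencies
+  - astropy=4.0          # are not pinned.
+\end{verbatim}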
+
+Conda is installed with a shell script that contains a binary blob (more than 500 megabytes, embedded in the shell script).
+This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
+After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
+However, the resulting environment is not fully independent of the host operating system as described below:
+
+\begin{itemize}
+\item The Conda installation directory is present at the start of environment variables like \inlinecode{PATH} (which is used to find programs to run) and other such environment variables.
+ However, the host operating system's directories are also appended afterwards.
+  Therefore, a user or script may not notice that a software package being used is actually coming from the operating system, and not from the controlled Conda installation.
+
+\item Generally, by default Conda relies heavily on the operating system and does not include core analysis components like \inlinecode{mkdir}, \inlinecode{ls} or \inlinecode{cp}.
+ Although they are generally the same between different Unix-like operating systems, they have their differences.
+ For example \inlinecode{mkdir -p} is a common way to build directories, but this option is only available with GNU Coreutils (default on GNU/Linux systems).
+  Running the same command within a Conda environment on macOS, for example, will crash.
+ Important packages like GNU Coreutils are available in channels like conda-forge, but they are not the default.
+  Therefore, many users may not recognize this, and failing to account for it will cause unexpected crashes.
+
+\item Many major Conda packaging ``channels'' (for example, the core Anaconda channel or the very popular conda-forge channel) do not include the C library that a package was built with as a dependency.
+ They rely on the host operating system's C library.
+ C is the core language of modern operating systems and even higher-level languages like Python or R are written in it, and need it to run.
+ Therefore if the host operating system's C library is different from the C library that a package was built with, a Conda-packaged program will crash and the project will not be executable.
+ Theoretically, it is possible to define a new Conda ``channel'' which includes the C library as a dependency of its software packages, but it will take too much time for any individual team to practically implement all their necessary packages, up to their high-level science software.
+
+\item Conda does allow a package to depend on a special build of its prerequisites (specified by a checksum, fixing its version and the version of its dependencies).
+  However, this is rarely practiced in the main Git repositories of channels like Anaconda and conda-forge: only the names of the high-level prerequisite packages are listed in a package's \inlinecode{meta.yaml} file, which is version-controlled.
+ Therefore two builds of the package from the same Git repository will result in different tarballs (depending on what prerequisites were present at build time).
+  In the Conda tarball (which contains the binaries and is not under version control), \inlinecode{meta.yaml} does include the exact versions of most build-time dependencies.
+  However, because the different software packages of one project may have been built at different times, if they depend on different versions of a single package there will be a conflict, and the tarball cannot be rebuilt or the project cannot be run.
+\end{itemize}
+
+As reviewed above, the low-level dependence of Conda on the host operating system's components and on build-time conditions is the primary reason that it is very fast to install (thus making it an attractive tool for software developers who just need to reproduce a bug in a few minutes).
+However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important. % alternative to `archivability`?
+
+
+\subsubsection{Spack}
+Spack is a package manager that is also influenced by Nix (similar to GNU Guix); see \citeappendix{gamblin15}.
+But unlike Nix or GNU Guix, it does not aim for full, bit-wise reproducibility, and it can be built without root access in any generic location.
+It relies on the host operating system for the C library.
+
+Spack is fully written in Python: each software package is an instance of a class, which defines how it should be downloaded, configured, built, and installed.
+Therefore, if the proper version of Python is not present, Spack cannot be used, and when incompatibilities arise in future versions of Python (similar to how Python 3 is not compatible with Python 2), the software building recipes, or the whole system, have to be upgraded.
+Because of such bootstrapping problems (for example, how Spack needs Python to build Python and other software), it is generally good practice to use simpler, lower-level languages/systems for a low-level operation like package management.
+
+
+In conclusion, two common issues regarding generic package managers hinder their use for high-level scientific projects, as listed below:
+\begin{itemize}
+\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivatives) only download the software in a binary (pre-compiled) format.
+ This allows users to download it very fast and almost instantaneously be able to run it.
+ However, to provide for this, servers need to keep binary files for each build of the software on different operating systems (for example Conda needs to keep binaries for Windows, macOS and GNU/Linux operating systems).
+ It is also necessary for them to store binaries for each build, which includes different versions of its dependencies.
+  Maintaining such a large binary library is expensive; therefore, once the shelf-life of a binary has expired, it will be removed, causing problems for projects that depend on it.
+
+\item {\bf\small Adding high-level software:} Packaging new software is not trivial and needs a good level of knowledge/experience with the package manager.
+For example, each has its own special syntax/standards/languages, with pre-defined variables that must already be known before someone can package new software for it.
+
+However, in many research projects, the highest-level analysis software is written by the team that is doing the research, and they are its primary users, even when the software is distributed with free licenses on open repositories.
+Although active members of package-manager communities are commonly very supportive in helping to package new software, many teams may not be able to make that extra effort/time investment.
+As a result, they manually install their high-level software in an uncontrolled, or non-standard way, thus jeopardizing the reproducibility of the whole work.
+This is another consequence of detachment of the package manager from the project doing the analysis.
+\end{itemize}
+
+Addressing these issues has been the basic raison d'\^etre of the proposed criteria: based on the completeness criterion, the instructions to download and build the packages are included within the actual science project, and no special/new syntax/language is used.
+Software downloading, building, and installation are done with the same language/syntax with which researchers manage their research: the shell (by default GNU Bash in Maneage) and Make (by default GNU Make in Maneage).
+
+
+
+\subsection{Version control}
+\label{appendix:versioncontrol}
+A scientific project is not written in a day; it usually takes more than a year.
+During this time, the project evolves significantly from its starting point, and components are constantly added or updated as it approaches completion.
+Combined with the complexity of modern computational projects, it is not trivial to manually track this evolution and its effect on the final output: files produced in one stage of the project can mistakenly be used by an evolved analysis environment in later stages.
+
+Furthermore, scientific projects do not progress linearly: earlier stages of the analysis are often modified after later stages are written.
+This is a natural consequence of the scientific method, where progress is defined by experimentation and modification of hypotheses (results from earlier phases).
+
+It is thus very important for the integrity of a scientific project that the state/version of its processing is recorded as the project evolves.
+For example better methods are found or more data arrive.
+Any intermediate dataset that is produced should also be tagged with the version of the project at the time it was created.
+In this way, later processing stages can make sure that such datasets can safely be used, i.e., that no change has been made in the steps that produced them.
+
+Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
+Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
+There are many existing version control solutions, for example CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
+However, currently, Git is by far the most commonly used in individual projects.
+Git is also the foundation upon which this paper's proof of concept (Maneage) is built.
+Archival systems aiming for long term preservation of software like Software Heritage \citeappendix{dicosmo18} are also modeled on Git.
+Hence we will just review Git here, but the general concept of version control is the same in all implementations.
+
+\subsubsection{Git}
+With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
+When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
+The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the purpose/goals of the changes), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other, as in Figure \ref{fig:branching}).
+For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
+Commits are commonly summarized by the checksum's first few characters, for example \inlinecode{f4953cc}.
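+For example, a minimal sketch of this workflow on the command-line (the file name and commit message are hypothetical) is:
+\begin{verbatim}
+# Stage the changed file and commit it
+# with a narrative message.
+git add paper.tex
+git commit -m "Describe new fitting step"
+
+# Show the abbreviated checksum and message
+# of the most recent commit.
+git log --oneline -1
+\end{verbatim}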
+
+With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
+The team can host the Git history on a webpage and collaborate through that.
+There are several Git hosting services for example \href{http://codeberg.org}{codeberg.org}, \href{http://gitlab.com}{gitlab.com}, \href{http://bitbucket.org}{bitbucket.org} or \href{http://github.com}{github.com} (among many others).
+Storing changes in binary files is also possible in Git; however, it is most useful for human-readable plain-text sources.
+
+
+
+\subsection{Job management}
+\label{appendix:jobmanagement}
+Any analysis will involve more than one logical step.
+For example it is first necessary to download a dataset and do some preparations on it before applying the research software on it, and finally to make visualizations/tables that can be imported into the final report.
+Each one of these is a logically independent step, which needs to be run before/after the others in a specific order.
+
+Hence job management is a critical component of a research project.
+There are many tools for managing the sequence of jobs; below, we review the most common ones, which are also used by the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+
+\subsubsection{Manual operation with narrative}
+\label{appendix:manual}
+The most commonly used workflow system for many researchers is to run the commands, experiment on them and keep the output when they are happy with it.
+As an improvement, some also keep a narrative description of what they ran.
+At least in our personal experience with colleagues, this method is still heavily practiced by many researchers.
+Given that many researchers are not well trained in computational methods, this is not surprising; as discussed in Section \ref{discussion}, we believe that improved literacy in computational methods is the single most important factor for the integrity/reproducibility of modern science.
+
+\subsubsection{Scripts}
+\label{appendix:scripts}
+Scripts (in any language, for example GNU Bash, or Python) are the most common ways of organizing a series of steps.
+They are primarily designed to execute each step sequentially (one after another), making them also very intuitive.
+However, as the series of operations become complex and large, managing the workflow in a script will become highly complex.
+
+For example, if 90\% of a long project is already done and a researcher wants to add a follow-up step, a script will go through all the previous steps (which can take significant time).
+In other scenarios, when a small step in the middle of an analysis has to be changed, the full analysis needs to be re-run from the start.
+Scripts have no concept of dependencies, forcing authors to ``temporarily'' comment out the parts that they do not want to be re-run (forgetting to un-comment such parts is a common cause of frustration for the authors and for others attempting to reproduce the result).
+
+Such factors discourage experimentation, which is a critical component of the scientific method.
+It is possible to manually add conditionals all over the script to add dependencies or only run certain steps at certain times, but they just make it harder to read, and introduce many bugs themselves.
+Parallelization is another drawback of using scripts.
+While it is not impossible, because of the high-level nature of scripts it is not trivial, and parallelization can also be very inefficient or buggy.
+
+
+\subsubsection{Make}
+\label{appendix:make}
+Make was originally designed to address the problems mentioned above for scripts \citeappendix{feldman79}, in particular in the context of managing the compilation of software programs that involve many source files.
+With Make, the source files of a program that have not been changed are not recompiled.
+Moreover, when two source files do not depend on each other and both need to be rebuilt, they can be built in parallel.
+This greatly helps in debugging software projects and in speeding up test builds, giving Make a core place in software development over the last 40 years.
+
+The most common implementation of Make, since the early 1990s, is GNU Make.
+Make was also the framework used in the first attempts at reproducible scientific papers \citeappendix{claerbout1992,schwab2000}.
+Our proof-of-concept (Maneage) also uses Make to organize its workflow.
+Here, we complement that discussion with more technical details on Make.
+
+Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
+Each stage/step in the analysis is defined through a \emph{rule}.
+Rules define \emph{recipes} to build \emph{targets} from \emph{pre-requisites}.
+In POSIX operating systems (Unix-like), everything is a file, even directories and devices.
+Therefore all three components in a rule must be files on the running filesystem.
+
+To decide which operation should be re-done when executed, Make compares the time stamp of the targets and prerequisites.
+When any of the prerequisite(s) is newer than a target, the recipe is re-run to re-build the target.
+When all the prerequisites are older than the target, that target does not need to be rebuilt.
+The recipe can contain any number of commands; they should just all start with a \inlinecode{TAB}.
+Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend that interested readers consult the GNU Make manual for a nice introduction\footnote{\inlinecode{\url{http://www.gnu.org/software/make/manual/make.pdf}}}.
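+As a minimal, hypothetical illustration (the file names and command are not from any real project), a rule looks like this:
+\begin{verbatim}
+# 'paper.pdf' is the target; the two files
+# after the colon are its prerequisites; the
+# recipe line below starts with a TAB.
+paper.pdf: paper.tex results.tex
+	pdflatex paper.tex
+\end{verbatim}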
+
+\subsubsection{Snakemake}
+is a Python-based workflow management system, inspired by GNU Make (which is the job organizer in Maneage), that is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
+It defines its own language to implement the ``rule'' concept in Make within Python.
+Currently it requires Python 3.5 (released in September 2015) or above, while Snakemake was originally introduced in 2012.
+Hence, it is not clear if older Snakemake source files can be executed today.
+As reviewed for many tools here, this is a major longevity problem when using high-level tools as the skeleton of the workflow.
+Technically, calling command-line programs from within Python is slow, and using complex shell scripts in each step involves a lot of quoting that makes the code hard to read.
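+For illustration, a minimal hypothetical Snakemake rule (the file names and script are placeholders) looks like this:
+\begin{verbatim}
+rule plot:
+    input:  "data/table.txt"
+    output: "figures/result.pdf"
+    shell:  "python plot.py {input} {output}"
+\end{verbatim}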
+
+\subsubsection{Bazel}
+Bazel\footnote{\inlinecode{\url{https://bazel.build}}} is a high-level job organizer that depends on Java and Python and is primarily tailored to software developers (with features like facilitating linking of libraries through its high level constructs).
+
+\subsubsection{SCons}
+\label{appendix:scons}
+SCons is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
+In many aspects it is similar to Make; for example, it is managed through an `SConstruct' file.
+Like a Makefile, SConstruct is also declarative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the imperative paradigm that is common in languages like C, Python, or FORTRAN).
+However, unlike Make, SCons does not use the file modification date to decide if a target should be remade.
+SCons keeps the MD5 hash of all the files (in a hidden binary file) to check whether the contents have changed.
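+For illustration, the equivalent of a simple Make rule in a hypothetical \inlinecode{SConstruct} file (the file names are placeholders) could look like this:
+\begin{verbatim}
+# Target, source, and the command that
+# builds the target from the source.
+Command('figures/result.pdf',
+        'data/table.txt',
+        'python plot.py $SOURCE $TARGET')
+\end{verbatim}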
+
+SCons thus attempts to work on a declarative file with an imperative language (Python).
+It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
+SCons is therefore more complex than Make and its manual is almost double that of GNU Make.
+Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive).
+
+Finally, it has the same drawback as any other tool that uses high-level languages, see Section \ref{appendix:highlevelinworkflow}.
+We encountered such a problem while testing SCons: on the Debian-10 testing system, the \inlinecode{python} program pointed to Python 2.
+However, since Python 2 is now obsolete, SCons was built with Python 3, and our first run crashed.
+To fix it, we had to either manually change the core operating system path or the SCons source hashbang.
+The former would conflict with other system tools that assume \inlinecode{python} points to Python 2; the latter may need root permissions on some systems.
+This can also be problematic when a Python analysis library requires a Python version that conflicts with the running SCons.
+
+\subsubsection{CGAT-core}
+CGAT-Core is a Python package for managing workflows, see \citeappendix{cribbs19}.
+It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
+It is used in papers like \citeappendix{jones19}, but as mentioned there, it is mainly good for managing individual outputs (for example, separate figures/tables in the paper, when they are fully created within Python).
+Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial in it.
+Another drawback of this workflow manager is that Python is a very high-level language, and future versions of the language may no longer be compatible with Python 3, in which CGAT-core is implemented (similar to how Python 2 programs are not compatible with Python 3).
+
+\subsubsection{Guix Workflow Language (GWL)}
+GWL is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
+It is closely linked with GNU Guix and can even install the necessary software needed for each individual process.
+Hence in the GWL paradigm, software installation and usage does not have to be separated.
+GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.
+
+In conclusion, shell scripts and Make are very common and extensively used by users of Unix-based OSs (which are most commonly used for computations).
+They have also existed for several decades and are robust and mature.
+Many researchers are also already familiar with them and have already used them.
+As we see in this appendix, the list of necessary tools for the various stages of a research project (an independent environment, package managers, job organizers, analysis languages, writing formats, editors, etc.) is already very large.
+Each tool has its own learning curve, which is a heavy burden for a natural or social scientist, for example.
+Most other workflow management tools introduce yet another language that has to be mastered.
+
+Furthermore, high-level and specific solutions will evolve very fast causing disruptions in the reproducible framework.
+A good example is Popper \citeappendix{jimenez17} which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
+However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language, see Appendix \ref{appendix:popper} for more on Popper.
+
+\subsubsection{Nextflow (2013)}
+Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}} \citeappendix{tommaso17} is a workflow language with a command-line interface that is written in Java.
+
+\subsubsection{Generic workflow specifications (CWL and WDL)}
+Due to the variety of custom workflows used in existing reproducibility solutions (like those of Appendix \ref{appendix:existingsolutions}), some attempts have been made to define common workflow standards like the Common Workflow Language (CWL\footnote{\inlinecode{\url{https://www.commonwl.org}}}, with roots in Make, formatted in YAML or JSON) and the Workflow Description Language (WDL\footnote{\inlinecode{\url{https://openwdl.org}}}, formatted in JSON).
+These are primarily specifications/standards rather than software, so ideally translators can be written between the various workflow systems to make them more interoperable.
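+As a rough sketch (a hypothetical, heavily abridged tool description, not taken from any published workflow), a CWL file is a YAML document like this:
+\begin{verbatim}
+cwlVersion: v1.0
+class: CommandLineTool
+baseCommand: echo
+inputs:
+  message:
+    type: string
+    inputBinding: {position: 1}
+outputs: []
+\end{verbatim}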
+
+
+\subsection{Editing steps and viewing results}
+\label{appendix:editors}
+In order to later reproduce a project, the analysis steps must be stored in files.
+For example Shell, Python or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or FORTRAN.
+Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files.
+Here we'll review some common methods that are currently used.
+
+\subsubsection{Text editors}
+The most basic way to edit text files is through simple text editors which just allow viewing and editing such files, for example \inlinecode{gedit} on the GNOME graphic user interface.
+However, working with simple plain-text editors like \inlinecode{gedit} can be very frustrating, since it is necessary to save the file, then go to a terminal emulator and execute the source files.
+To solve this problem there are advanced text editors like GNU Emacs that allow direct execution of the script, or access to a terminal within the text editor.
+However, editors that can execute or debug the source (like GNU Emacs) just run external programs for these jobs (for example GNU GCC or GNU GDB), as if those programs were called from outside the editor.
+
+With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
+This is a very important feature that is not commonly present for other solutions mentioned below.
+Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
+This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that do not provide a graphic user interface.
+Also, the commonly used minimalistic containers do not include a graphic user interface.
+
+\subsubsection{Integrated Development Environments (IDEs)}
+To facilitate the development of source files, IDEs add software building and running environments as well as debugging tools to a plain text editor.
+Many IDEs have their own compilers and debuggers, hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
+Furthermore, they usually require a graphic user interface to run.
+In summary, IDEs are generally very specialized tools for special projects and are not a good solution when portability (the ability to run on different systems and at different times) is required.
+
+\subsubsection{Jupyter}
+\label{appendix:jupyter}
+Jupyter (initially IPython) \citeappendix{kluyver16} is an implementation of Literate Programming \citeappendix{knuth84}.
+The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
+Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\inlinecode{\url{https://nbformat.readthedocs.io/en/latest}}}.
+Jupyter's name is a combination of the three main languages it was designed for: Julia, Python and R.
+The \inlinecode{.ipynb} format is a simple, human-readable file (it can be opened in a plain-text editor), formatted in JavaScript Object Notation (JSON).
+It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visualizations (for example images/plots), that are all stored in one file.
+The cells can have any order, allowing a graphical implementation of the literate programming style, where narrative descriptions and executable snippets of code can be intertwined.
+For example, a paragraph of text about a snippet of code can be followed by that snippet, which can be run immediately on the same page.
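+A heavily abridged, hypothetical sketch of this JSON structure (real notebooks contain more metadata) looks like this:
+\begin{verbatim}
+{ "cells": [
+   { "cell_type": "markdown",
+     "metadata": {},
+     "source": ["Some narrative text."] },
+   { "cell_type": "code",
+     "metadata": {},
+     "execution_count": null,
+     "outputs": [],
+     "source": ["print(2 + 2)"] } ],
+  "metadata": {},
+  "nbformat": 4,
+  "nbformat_minor": 4 }
+\end{verbatim}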
+
+The \inlinecode{.ipynb} format does theoretically allow dependency tracking between cells, see IPython mailing list (discussion started by Gabriel Becker from July 2013\footnote{\url{https://mail.python.org/pipermail/ipython-dev/2013-July/010725.html}}).
+Defining dependencies between the cells can allow non-linear execution which is critical for large scale (thousands of files) and complex (many dependencies between the cells) operations.
+It allows automation, run-time optimization (deciding not to run a cell if it is not necessary) and parallelization.
+However, Jupyter currently only supports a linear run of the cells: always from the start to the end.
+It is possible to manually execute only one cell, but the previous/next cells that may depend on it, also have to be manually run (a common source of human error, and frustration for complex operations).
+Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\inlinecode{\url{https://github.com/jupyter/notebook/issues/1175}}}).
+
+The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file is another major hurdle:
+the files can easily become very large (in volume/bytes) and hard to read.
+Both issues are critical for scientific processing, especially the latter: a web browser with the proper JavaScript features may not be available in a few years.
+This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.
+
+Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies!
+\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
+However, the dependencies above are only on the server-side.
+Since Jupyter is a web-based system, it requires many dependencies on the viewing/running browser also (for example special JavaScript or HTML5 features, which evolve very fast).
+As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility (as opposed to industrial/immediate reproducibility).
+In summary, Jupyter is most useful for manual, interactive, and graphical operations on a temporary basis (for example, educational tutorials).
+
+
+
+
+
+
+\subsection{Project management in high-level languages}
+\label{appendix:highlevelinworkflow}
+
+Currently the most popular high-level data analysis language is Python.
+R is closely tracking it, and has superseded Python in some fields, while Julia \citeappendix{bezanson17} is quickly gaining ground.
+These have themselves superseded the popular data analysis languages of previous decades, for example Java, Perl or C++.
+All are part of the C-family of programming languages.
+In many cases, this means that the tools needed to run these languages are themselves written in C, which is the language of modern operating systems.
+
+Scientists, or data analysts, mostly use these higher-level languages.
+Therefore they are naturally drawn to also apply these higher-level languages to lower-level project management, or to designing the various stages of their workflow.
+For example Conda or Spack (Appendix \ref{appendix:packagemanagement}), CGAT-core (Appendix \ref{appendix:jobmanagement}), Jupyter (Appendix \ref{appendix:editors}) or Popper (Appendix \ref{appendix:popper}) are written in Python.
+The discussion below applies to both the actual analysis software and project management software.
+In this context, the focus is more on the latter.
-%% Tell BibLaTeX to put the bibliography list here.
-\printbibliography
+Because of their nature, higher-level languages evolve very fast, creating incompatibilities on the way.
+The most prominent example is the transition from Python 2 (released in 2000) to Python 3 (released in 2008).
+Python 3 was incompatible with Python 2, and it was decided to abandon Python 2 by 2015.
+However, due to community pressure, this was delayed to January 1st, 2020.
+The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated, for example see \citeappendix{jenness17} or Appendix \ref{appendix:sciunit}.
+Some projects could not make this investment and their developers decided to stop maintaining them, for example VisTrails (see Appendix \ref{appendix:vistrails}).
-%% Start appendix.
-\appendix
+The problems weren't just limited to translation.
+Python 2 was still being actively used during the transition period (and is still used by some, even after its end-of-life).
+Therefore, developers of packages used by others had to maintain (for example fix bugs in) both versions in one package.
+This is not particular to Python; a similar evolution occurred in Perl: in 2000 it was decided to improve Perl 5, but the proposed Perl 6 was incompatible with it.
+However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).
-%% Mention all used software in an appendix.
-\section{Software acknowledgement}
-\label{appendix:software}
-\input{tex/build/macros/dependencies.tex}
+It is unreasonably optimistic to assume that high-level languages won't undergo similar incompatible evolutions in the (not too distant) future.
+For software developers, this is not a problem at all: non-scientific software, and the general population's usage of it, evolves similarly fast.
+Hence, it is rarely (if ever) necessary to look into code that is more than a couple of years old.
+However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed in such high-level languages.
-%% Finish LaTeX
+In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself cannot be ignored.
+Beyond the technical, low-level problems for developers mentioned above, this causes major problems for scientific project management, as listed below:
+
+\subsubsection{Dependency hell}
+The evolution of high-level languages is extremely fast, even within one version.
+For example packages that are written in Python 3 often only work with a specific interval of Python 3 versions (for example newer than Python 3.6).
+This is not limited to the core language; even faster changes occur in its higher-level libraries.
+For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citeappendix{hinsen15}.
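+As a purely hypothetical illustration of such version intervals (the package name and version numbers below are invented for demonstration), a Python package's build configuration may pin its environment as follows:
+\begin{verbatim}
+from setuptools import setup
+
+# Hypothetical packaging metadata: the package only claims to work
+# within narrow intervals of Python 3 and of its dependencies.
+setup(
+    name="hypothetical-analysis",
+    version="0.1.0",
+    python_requires=">=3.6,<3.9",
+    install_requires=[
+        "numpy>=1.16,<1.19",
+        "scipy>=1.2,<1.5",
+    ],
+)
+\end{verbatim}
+Once any dependency releases a version outside these intervals, the package can no longer be installed in a fresh environment without manual intervention.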
+
+On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
+For example, see Figure 1 of \citeappendix{alliez19}, which shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
+The acceptable version intervals between the dependencies will cause incompatibilities in a year or two if a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
+
+Since a domain scientist does not always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
+This greatly increases the complexity of the project, even for the principal author.
+A good reproducible workflow can account for these different versions.
+However, when the actual workflow system (not the analysis software) is written in a high-level language, this will cause a major problem.
+
+For example, merely installing the Python package installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages) required 32 other packages as dependencies.
+\inlinecode{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
+As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each respectively required 17 and 26 Python modules as dependencies.
+It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
+This problem actually occurred while we were testing Sciunit: even though it installed, it could not run because of conflicts (its last commit was only 1.5 years old); for more, see Appendix \ref{appendix:sciunit}.
+\citeappendix{hinsen15} also reported a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
+Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).
+
+
+
+
+
+\subsubsection{Generational gap}
+This occurs primarily for domain scientists (for example astronomers, biologists or social scientists).
+Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
+The inertia of programming languages is very strong.
+This is natural, because they have their own science field to focus on, and re-writing their high-level analysis toolkits (which they have curated over their careers and which are often only readable/usable by themselves) in newer languages every few years requires too much investment and time.
+
+When this investment is not possible, either the mentee has to use the mentor's old method (and miss out on all the new tools, which they will need for future job prospects), or the mentor has to avoid implementation details in discussions with the mentee, because they do not share a common language.
+The authors of this paper have personal experience of both sides of this mentor/mentee scenario.
+This failure to communicate in the details is a very serious problem, leading to the loss of valuable inter-generational experience.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+\section{Survey of common existing reproducible workflows}
+\label{appendix:existingsolutions}
+
+As reviewed in the introduction, the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
+The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
+In this appendix, some of the solutions are reviewed.
+The solutions are based on an evolving software landscape; they are therefore ordered by date: when the project has a webpage, the year of its first release is used for sorting, otherwise the paper's publication year is used.
+
+For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
+Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
+Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
+Other studies have also attempted to review existing reproducible solutions, for example \citeappendix{konkol20}.
+
+\subsection{Suggested rules, checklists, or criteria}
+Before going into the various implementations, it is also useful to review existing suggested rules, checklists or criteria for computationally reproducible research.
+
+All the cases below are primarily targeted at immediate reproducibility and do not consider longevity explicitly.
+Therefore, they lack a strong/clear completeness criterion: they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in Appendix \ref{appendix:independentenvironment} and in \citeappendix{oliveira18}).
+
+Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
+Generally, these are very similar to the criteria proposed here and follow a similar spirit, but they do not provide any actual research papers following up all those points, nor do they provide a proof of concept.
+The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users),
+but the authors do not include completeness as a criterion nor pay attention to longevity (Popper itself is written in Python with many dependencies, and its core operating language has already changed once).
+For more on Popper, please see Appendix \ref{appendix:popper}.
+
+For Jupyter notebook users, \citeappendix{rule19} propose ten rules to improve reproducibility and also provide links to example implementations.
+These can be very useful for users of Jupyter, but are not generic for non-Jupyter-based computational projects.
+Some criteria (which are indeed very good in a more general context) do not directly relate to reproducibility, for example their Rule 1: ``Tell a Story for an Audience''.
+Generally, as reviewed in Sections \ref{sec:longevityofexisting} and \ref{appendix:jupyter}, Jupyter itself has many issues regarding reproducibility.
+
+To create Docker images, N\"ust et al. propose ``ten simple rules'' in \citeappendix{nust20}.
+They recommend some practices that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'' to separate the computational environment from the data.
+However, long-term reproducibility of the images is not included as a criterion by these authors.
+For example, they recommend using base operating systems, with version identification limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which has serious longevity issues (Section \ref{sec:longevityofexisting}).
+Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}).
+
+\subsection{Reproducible Electronic Documents, RED (1992)}
+\label{appendix:red}
+
+RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the first attempt that we could find at doing reproducible research; see \citeappendix{claerbout1992,schwab2000}.
+It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
+Their introduction on the importance of reproducibility resonates a lot with today's environment in computational sciences, in particular regarding the heavy investment one has to make in order to re-do another scientist's work, even in the same team.
+RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.
+RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.
+
+To orchestrate the various figures/results of a project, from 1990 they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
+As described in \citeappendix{schwab2000}, in the latter half of that decade they moved to GNU Make, which was much more commonly used and actively developed, and which came with a complete and up-to-date manual.
+The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and organizing the steps through a Makefile.
+This enabled all the results to be re-executed with a single command.
+Several basic low-level Makefiles were included in the high-level/central Makefile.
+The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory), the directory where built files are kept.
+Afterwards, the reader could choose which figures/parts of the project to reproduce by manually adding their names to the central Makefile and running Make.
+
+At the time, Make was already used by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and to define conventions for the Makefiles to be consistent across projects.
+This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs).
+However, RED only used the existing software of the host system; it had no means to control the software versions.
+Therefore, with wider adoption, they confronted a ``versioning problem'': the host's analysis software had different versions on different hosts, creating different results, or crashing \citeappendix{fomel09}.
+Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see Appendix \ref{appendix:madagascar}.
+
+
+
+
+
+\subsection{Apache Taverna (2003)}
+\label{appendix:taverna}
+Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java, with a graphical user interface, and is still being developed.
+A workflow is defined as a directed graph, where nodes are called ``processors''.
+Each processor transforms a set of inputs into a set of outputs; processors are defined in the Scufl language (an XML-based language, where each step is an atomic task).
+Other components of the workflow are ``Data links'' and ``Coordination constraints''.
+The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage like Figure \ref{fig:datalineage}).
+Taverna is only a workflow manager and is not integrated with a package manager; hence the versions of the used software can differ between runs.
+\citeappendix{zhao12} have studied the problem of workflow decay in Taverna.
+
+
+
+
+
+\subsection{Madagascar (2003)}
+\label{appendix:madagascar}
+Madagascar\footnote{\inlinecode{\url{http://ahay.org}}} \citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in Appendix \ref{appendix:scons}).
+Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}.
+Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to \citeappendix{fomel13}.
+
+Madagascar does include project management tools in the form of SCons extensions.
+However, it isn't just a reproducible project management tool.
+It is primarily a collection of analysis programs and tools to interact with RSF files, and plotting facilities.
+The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the filesystem and acts as the intermediary between Madagascar's analysis programs.
+For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
+The analysis programs mostly target geophysical data analysis, including various project-specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory, which includes the names of Madagascar users.
+
+Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format.
+Since RSF contains program options also, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output.
+
+In terms of completeness, as long as the user only uses Madagascar's own analysis programs, it is fairly complete at a high level (not lower-level OS libraries).
+However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build).
+Also, the linking between the analysis programs (of a certain user at a certain time) and future versions of that program (that is updated in time) is not immediately obvious.
+Madagascar could have been more useful to a larger community if the workflow components had been maintained as a separate project from the analysis components.
+
+\subsection{GenePattern (2004)}
+\label{appendix:genepattern}
+GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is a client-server system containing many common analysis functions/modules, primarily focused on gene studies.
+Although it is highly focused on a special research field, it is reviewed here because its concepts/methods are generic and relevant to the context of this paper.
+
+Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
+The modules are used through a web interface; the modern implementation is GenePattern Notebook \citeappendix{reich17}.
+It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis.
+However, the wrapper modules just call an existing tool on the host system.
+Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.
+
+%% GenePattern shutdown announcement (although as of November 2020, it does not open any more!): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019
+The primary GenePattern server had been active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citeappendix{reich17}.
+However, it was shut down on November 15th 2019 due to the end of its funding.
+All processing with this server has stopped, and any archived data on it has been deleted.
+Since GenePattern is free software, there are alternative public servers to use, so hopefully work on it will continue.
+However, funding is limited and those servers may face similar funding problems.
+This is a very clear example of the fragility of solutions that depend on archiving and running research codes together with high-level research products (including data and binary/compiled code, which are expensive to keep) in one place.
+
+
+
+
+
+\subsection{Kepler (2005)}
+Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}} \citeappendix{ludascher05} is a Java-based graphical user interface workflow management tool.
+Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to Figure \ref{fig:datalineage}).
+Each actor is connected to the others through the Ptolemy II framework\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
+In many aspects, the usage of Kepler and its issues for long-term reproducibility are similar to those of Apache Taverna (see Appendix \ref{appendix:taverna}).
+
+
+
+
+
+\subsection{VisTrails (2005)}
+\label{appendix:vistrails}
+
+VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow management system.
+According to its webpage, VisTrails maintenance has stopped since May 2016; its last Git commit, as of this writing, was in November 2017.
+Nevertheless, the fact that it was well maintained for over 10 years is an achievement.
+
+VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features.
+Each analysis step, or module, is recorded in an XML schema, which defines the operations and their dependencies.
+The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command).
+Since the main goal was visualization (as images), apparently its primary output was in the form of image spreadsheets.
+Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL); for more, see \citeappendix{scheidegger08}.
+Since XML is a plain-text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository, which operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
+However, even though XML is plain text, it is very hard to edit manually.
+VisTrails therefore provides a graphical user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:datalineage}).
+Besides the fact that it is no longer maintained, VisTrails did not control the software that is run; it only controlled the sequence of steps that the software was run in.
+
+
+
+
+
+\subsection{Galaxy (2010)}
+\label{appendix:galaxy}
+
+Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}.
+The main user interface is ``Galaxy Pages'', which does not require any programming: users simply use abstract ``tools'', which are wrappers over command-line programs.
+Therefore the actual running version of the program can be hard to control across different Galaxy servers.
+Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
+Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system).
+
+
+
+
+
+\subsection{Image Processing On Line journal, IPOL (2010)}
+The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms, as well as the full code of the proposed algorithm.
+An IPOL paper is a traditional research paper, but with a focus on implementation.
+The published narrative description of the algorithm must be extremely detailed, to a level that any specialist can implement it in their own programming language.
+The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB); the code must be well commented, linking each part of it to the relevant part of the paper.
+The authors must also submit several example datasets/scenarios.
+The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
+After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input.
+
+The IPOL model is the single most robust model of peer review and publishing computational research methods/implementations that we have seen in this survey.
+It has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
+We encourage the reader to visit its webpage and see some of its recent papers and their demos.
+The reason it can be so thorough and complete is its very narrow scope (image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output), allowing the referees and readers to go deeply into each implemented algorithm.
+In fact, high-level languages like Perl, Python or Java are not acceptable in IPOL precisely because of the additional complexities, such as dependencies, that they require.
+If any referee or reader were inclined to do so, a paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinised at a similar detailed level, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
+
+
+
+
+
+
+\subsection{WINGS (2010)}
+\label{appendix:wings}
+
+WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}} \citeappendix{gil10} is an automatic workflow generation algorithm.
+It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images).
+It allows users to define various workflow components (for example datasets or analysis components), with high-level goals.
+It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints.
+\tonote{Read more about this}
+
+
+
+
+
+\subsection{Active Papers (2011)}
+\label{appendix:activepapers}
+Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format).
+It was initially written in Java because its compiled bytecode output can be run by the JVM on any machine \citeappendix{hinsen11}.
+However, Java is not a commonly used platform today; hence it was later implemented in Python \citeappendix{hinsen15}.
+
+In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file.
+However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
+When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.
+
+As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
+Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required).
+Installing it using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
+Furthermore, as a high-level data format, HDF5 evolves very fast; for example HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team. % maybe replace with: February 29\textsuperscript{th}, 2020?
+
+While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving terabytes of data can be less than a megabyte.
+Hence, storing code and data together becomes a burden when large datasets are used; this was also acknowledged in \citeappendix{hinsen15}.
+Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published.
+Furthermore, since all reading and writing is done in the HDF5 file, temporary/reproducible files can easily bloat the file to very large sizes, and it is necessary to remove/dummify them, thus complicating the code and making it hard to read.
+For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.
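+As a rough illustration of this single-file model (a minimal sketch using the generic \inlinecode{h5py} interface, not the Active Papers API itself, with invented dataset names), storing even a short script next to a modest dataset already produces a file that cannot be read without HDF5-aware tools:
+\begin{verbatim}
+import numpy as np
+import h5py
+
+# Hypothetical example (not the Active Papers API): a short script and
+# a large dummy dataset stored side by side in one HDF5 file.
+script = "subtracted = image - image.mean()\n"
+image = np.zeros((4096, 4096))          # about 128 MB of pixel data
+
+with h5py.File("project.h5", "w") as f:
+    f.create_dataset("code/analysis.py", data=script)
+    f.create_dataset("data/image", data=image)
+\end{verbatim}
+The file now mixes a few bytes of human-readable code with roughly $10^8$ bytes of data, so merely reading the method requires opening the HDF5 file.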
+
+In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (for example one or two lines, just to see the option values of one step).
+They do not necessarily need to run it, or to obtain the datasets.
+Hence the extra volume for data, and the obscure HDF5 format that needs special tools just to read plain-text code, are a major hindrance.
+
+
+
+
+
+\subsection{Collage Authoring Environment (2011)}
+\label{appendix:collage}
+The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
+It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment\tonote{find citation}, which has a web-based graphical user interface.
+Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
+\tonote{See how it containerizes the software environment}
+
+
+
+
+
+\subsection{SHARE (2011)}
+\label{appendix:SHARE}
+SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
+The top project webpage above is still active; however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintenance cost of the VMs.
+
+SHARE was awarded second place in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
+Simply put, SHARE is just a VM that users can download and run.
+The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible.
+
+
+
+
+
+\subsection{Verifiable Computational Result, VCR (2011)}
+\label{appendix:verifiableidentifier}
+A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc.) that is associated with a ``verifiable result identifier'' (VRI); see \citeappendix{gavish11}.
+It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
+
+A VRI is created using tags within the programming source that produced that output, also recording its version control or history.
+This enables exact identification and citation of results.
+The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs and scripts, that may be re-executed.
+According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R and Python, although only the MATLAB version was available during the writing of this paper.
+VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
+
+Unfortunately most parts of the webpage are not complete at the time of this writing.
+The VCR webpage contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
+Finally, the dates of the files in the MATLAB extension tarball are set to 2011, hinting that VCR was probably abandoned soon after the publication of \citeappendix{gavish11}.
+
+
+
+
+
+\subsection{SOLE (2012)}
+\label{appendix:sole}
+SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix{pham12,malik13}.
+An SO is any code/content that is wrapped in begin/end tags with an associated type and name.
+For example special commented lines in a Python, R or C program.
+The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI).
+SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}.
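+As a rough sketch of this idea (the tag syntax below is invented for illustration and is not SOLE's actual syntax), such a tagged region in a Python script could look like the following, with a SOLE-like parser extracting everything between the begin/end comments as one named science object:
+\begin{verbatim}
+import numpy as np
+from scipy.ndimage import median_filter
+
+data = np.random.default_rng(1).random((256, 256))
+
+# BEGIN-SO name="background-subtraction" type="code"
+# Hypothetical begin/end tags: a SOLE-like parser would extract the
+# lines between these comments as one named science object.
+background = median_filter(data, size=64)
+subtracted = data - background
+# END-SO
+\end{verbatim}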
+
+For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
+However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
+Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
+The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags the authors may not consider important, but may be useful later).
+In Maneage, instead of such artificial/commented tags, the analysis inputs and outputs are automatically and directly linked into the paper's text.
+
+
+
+\subsection{Sumatra (2012)}
+Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}} \citeappendix{davison12} attempts to capture the environment information of a running project.
+It is written in Python and is a command-line wrapper over the analysis script.
+By controlling a project at running-time, Sumatra is able to capture the environment it was run in.
+The captured environment can be viewed in plain text or a web interface.
+Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
+This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).
+
+The actual code that Sumatra wraps around must itself be under version control, and it does not run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
+Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
+Therefore, Sumatra makes no attempt at storing the environment of the analysis itself, as Sciunit does (see Appendix \ref{appendix:sciunit}); it only stores information about that environment.
+Sumatra thus needs to know the language of the running program and is not generic.
+It just captures the environment; it does not store \emph{how} that environment was built.
+
+
+
+
+
+\subsection{Research Object (2013)}
+\label{appendix:researchobject}
+
+The Research Object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of metadata ontologies to describe the aggregation of resources, or workflows; see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
+It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.
+
+\citeappendix{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
+The important thing is that the research object concept is not specific to any special workflow; it is just a metadata bundle/standard, which is only as robust in reproducing the result as the running workflow.
+
+
+
+
+\subsection{Sciunit (2015)}
+\label{appendix:sciunit}
+Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunits'' that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
+It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
+Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
+Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020).
+Therefore Sciunit2 is a new implementation in Python 3.
+
+The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built.
+This is a major problem for scientific projects, both in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of an analysis requires a lot of storage space).
+
+
+
+
+
+\subsection{Umbrella (2015)}
+Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
+The user specifies the necessary operating system, packages and analysis steps in various JSON files.
+Umbrella will then study the host operating system and the various necessary inputs (including data and software), through a process similar to the Sciunits mentioned above, to find the best environment isolator (for example Linux containerization, containers or VMs).
+We could not find a URL to the source code of Umbrella (no source code repository is mentioned in the papers we reviewed), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
+
+
+
+
+
+\subsection{ReproZip (2016)}
+ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries and environment variables into a single bundle.
+The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
+The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
+
+ReproZip is therefore very good for taking a ``snapshot'' of the running environment into a single file.
+However, the bundle can become very large if many, or large, datasets are used, or if the software environment is complex (many dependencies).
+Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
+Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way of knowing how that software was built.
+As mentioned in this paper, and also in \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results; simply having the binaries is not necessarily useful.
+
+For the data, it is similarly not possible to extract which data server it came from.
+Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
+
+
+
+
+
+\subsection{Binder (2017)}
+Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize already existing Jupyter based processing steps.
+Users simply add a set of Binder-recognized configuration files to their repository, and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
+One good feature of Binder is that the imported Docker image must be tagged (with something like a checksum).
+This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
+However, it does not make sure that the Dockerfile used by the imported Docker image also follows a similar convention.
+Binder is used by \citeappendix{jones19}.
+
+
+
+
+
+\subsection{Gigantum (2017)}
+%% I took the date from their PiPy page, where the first version 0.1 was published in November 2016.
+Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image.
+Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
+Simply put, it is a high-level wrapper for combining these components.
+Internally, a Gigantum project is organized as files in a directory that can be opened without its own client.
+The file structure (which is under version control) includes code, input data and output data.
+As acknowledged on their own webpage, this greatly reduces the speed of Git operations and of transmitting or archiving the project.
+Therefore there are limits on the dataset/code sizes.
+However, there is one directory which can be used to store files that must not be tracked.
+
+
+
+
+\subsection{Popper (2017)}
+\label{appendix:popper}
+Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}.
+The Popper team's own solution is through a command-line program called \inlinecode{popper}.
+The \inlinecode{popper} program itself is written in Python.
+However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows.
+Moreover, from October 2019 GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
+This is an important issue when low-level choices are based on service providers.
+
+To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
+However, as of this writing, the scaffold isn't complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
+By default Popper runs in a Docker image (so root permissions are necessary, and the reproducibility issues with Docker images discussed above apply), but Singularity is also supported.
+See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
+
+Popper does not comply with the completeness, minimal complexity and including-narrative criteria.
+Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control.
+Hence, tracking future changes in Popper and how they relate to the high-level projects that depend on it will be very hard.
+In Maneage, the same \inlinecode{maneage} git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage.
+
+
+\subsection{Whole Tale (2017)}
+\label{appendix:wholetale}
+
+Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance; see \citeappendix{brinckman17}.
+It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).
+
+The web-based nature of Whole Tale's approach, and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
+For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
+This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
+Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
+1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands.
+2) It is not possible to see how the environment was built accurately (when the Dockerfile uses \inlinecode{apt}).
+This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in \citeappendix{oliveira18}, for more on this, please see Appendix \ref{appendix:packagemanagement}.
+
+
+
+
+
+\subsection{Occam (2018)}
+Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution.
+To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
+Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey that attempts to be complete in this aspect.
+
+However, it is incomplete from the perspective of requirements: it works within a Docker image (which requires root permissions) and currently only runs on Debian-based, Red Hat-based and Arch-based GNU/Linux operating systems, which respectively use the \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pacman} package managers.
+It is also itself written in Python (version 3.4 or above), hence it is not clear how long it will remain usable as Python itself evolves (see Appendix \ref{appendix:highlevelinworkflow}).
+
+Furthermore, it also violates our complexity criterion because the instructions to build the software, their versions, etc, are not immediately viewable or modifiable by the user.
+Occam contains its own JSON database for this, which must be parsed with its own custom program.
+The analysis phase of Occam is also done through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}) that is a web-based graphical user interface.
+All the connections between the various phases of an analysis need to be pre-defined in a JSON file and manually linked in the GUI.
+Hence it is not scalable for complex data analysis operations that involve thousands of steps.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+%%\newpage
+%%\section{Things remaining to add}
+%%\begin{itemize}
+%%\item \url{https://sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_180684.pdf}, does the following classification of tools:
+%% \begin{itemize}
+%% \item Research environments: \href{http://vcr.stanford.edu}{Verifiable computational research} (discussed above), \href{http://www.sciencedirect.com/science/article/pii/S1877050911001207}{SHARE} (a Virtual Machine), \href{http://www.codeocean.com}{Code Ocean} (discussed above), \href{http://jupyter.org}{Jupyter} (discussed above), \href{https://yihui.name/knitr}{knitR} (based on Sweave, dynamic report generation with R), \href{https://cran.r-project.org}{Sweave} (Function in R, for putting R code within \LaTeX), \href{http://www.cyverse.org}{Cyverse} (proprietary web tool with servers for bioinformatics), \href{https://nanohub.org}{NanoHUB} (collection of Simulation Programs for nanoscale phenomena that run in the cloud), \href{https://www.elsevier.com/about/press-releases/research-and-journals/special-issue-computers-and-graphics-incorporates-executable-paper-grand-challenge-winner-collage-authoring-environment}{Collage Authoring Environment} (discussed above), \href{https://osf.io/ns2m3}{SOLE} (discussed above), \href{https://osf.io}{Open Science framework} (a hosting webpage), \href{https://www.vistrails.org}{VisTrails} (discussed above), \href{https://pypi.python.org/pypi/Sumatra}{Sumatra} (discussed above), \href{http://software.broadinstitute.org/cancer/software/genepattern}{GenePattern} (reviewed above), Image Processing On Line (\href{http://www.ipol.im}{IPOL}) journal (publishes full analysis scripts, but does not deal with dependencies), \href{https://github.com/systemslab/popper}{Popper} (reviewed above), \href{https://galaxyproject.org}{Galaxy} (reviewed above), \href{http://torch.ch}{Torch.ch} (finished project for neural networks on images), \href{http://wholetale.org/}{Whole Tale} (discussed above).
+%% \item Workflow systems: \href{http://www.taverna.org.uk}{Taverna}, \href{http://www.wings-workflows.org}{Wings}, \href{https://pegasus.isi.edu}{Pegasus}, \href{http://www.pgbovine.net/cde.html}{CDE}, \href{http://binder.org}{Binder}, \href{http://wiki.datakurator.org/wiki}{Kurator}, \href{https://kepler-project.org}{Kepler}, \href{https://github.com/everware}{Everware}, \href{http://cds.nyu.edu/projects/reprozip}{Reprozip}.
+%% \item Dissemination platforms: \href{http://researchcompendia.org}{ResearchCompendia}, \href{https://datacenterhub.org/about}{DataCenterHub}, \href{http://runmycode.org}, \href{https://www.chameleoncloud.org}{ChameleonCloud}, \href{https://occam.cs.pitt.edu}{Occam}, \href{http://rcloud.social/index.html}{RCloud}, \href{http://thedatahub.org}{TheDataHub}, \href{http://www.ahay.org/wiki/Package_overview}{Madagascar}.
+%% \end{itemize}
+%%\item Special volume on ``Reproducible research'' in the Computing in Science Engineering \citeappendix{fomel09}.
+%%\item ``I’ve learned that interactive programs are slavery (unless they include the ability to arrive in any previous state by means of a script).'' \citeappendix{fomel09}.
+%%\item \citeappendix{fomel09} discuss the ``versioning problem'': on different systems, programs have different versions.
+%%\item \citeappendix{fomel09}: a C program written 20 years ago was still usable.
+%%\item \citeappendix{fomel09}: ``in an attempt to increase the size of the community, Matthias Schwab and I submitted a paper to Computers in Physics, one of CiSE’s forerunners. It was rejected. The editors said if everyone used Microsoft computers, everything would be easily reproducible. They also predicted the imminent demise of Fortran''.
+%%\item \citeappendix{alliez19}: Software citation, with a nice dependency plot for matplotlib.
+%% \item SC \href{https://sc19.supercomputing.org/submit/reproducibility-initiative}{Reproducibility Initiative} for mandatory Artifact Description (AD).
+%% \item \href{https://www.acm.org/publications/policies/artifact-review-badging}{Artifact review badging} by the Association of computing machinery (ACM).
+%% \item eLife journal \href{https://elifesciences.org/labs/b521cf4d/reproducible-document-stack-towards-a-scalable-solution-for-reproducible-articles}{announcement} on reproducible papers. \citeappendix{lewis18} is their first reproducible paper.
+%% \item The \href{https://www.scientificpaperofthefuture.org}{Scientific paper of the future initiative} encourages geoscientists to include associate metadata with scientific papers \citeappendix{gil16}.
+%% \item Digital objects: \url{http://doi.org/10.23728/b2share.b605d85809ca45679b110719b6c6cb11} and \url{http://doi.org/10.23728/b2share.4e8ac36c0dd343da81fd9e83e72805a0}
+%% \item \citeappendix{mesirov10}, \citeappendix{casadevall10}, \citeappendix{peng11}: Importance of reproducible research.
+%% \item \citeappendix{sandve13} is an editorial recommendation to publish reproducible results.
+%% \item \citeappendix{easterbrook14} Free/open software for open science.
+%% \item \citeappendix{peng15}: Importance of better statistical education.
+%% \item \citeappendix{topalidou16}: Failed attempt to reproduce a result.
+%% \item \citeappendix{hutton16} reproducibility in hydrology, criticized in \citeappendix{melson17}.
+%% \item \citeappendix{fomel09}: Editorial on reproducible research.
+%% \item \citeappendix{munafo17}: Reproducibility in social sciences.
+%% \item \citeappendix{stodden18}: Effectiveness of journal policy on computational reproducibility.
+%% \item \citeappendix{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'', and that its important to empower scientists.
+%% \item \citeappendix{burrell18} open software (in particular Python) in heliophysics.
+%% \item \citeappendix{allen18} show that many papers do not cite software.
+%% \item \citeappendix{zhang18} explicity say that they won't release their code: ``We opt not to make the code used for the chemical evo-lution modeling publicly available because it is an important asset of the re-searchers’ toolkits''
+%% \item \citeappendix{jones19} make genuine effort at reproducing every number in the paper (using Docker, Conda, and CGAT-core, and Binder), but they can ultimately only release scripts. They claim its not possible to reproduce that level of reproducibility, but here we show it is.
+%% \item LSST uses Kubernetes and docker for reproducibility \citeappendix{banek19}.
+%% \item Interesting survey/paper on the importance of coding in science \citeappendix{merali10}.
+%% \item Discuss the Provenance challenge \citeappendix{moreau08}, showing the importance of meta data and provenance tracking.
+%% Especially that it is organized by teh medical scientists.
+%% Its webpage (for latest challenge) has a nice intro: \url{https://www.cccinnovationcenter.com/challenges/provenance-challenge}.
+%% \item In discussion: The XML provenance system is very interesting, scripts can be written to parse the Makefiles within this template to generate such XML outputs for easy standard metadata parsing.
+%% The XML that contains a log of the outputs is also interesting.
+%% \item \citeappendix{becker17} Discuss reproducibility methods in R.
+%% \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citeappendix{gabriel11}.
+%% \item \citeappendix{menke20} show how software identifability has seen the best improvement, so there is hope!
+%% \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}.
+%% \item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials}
+%% \item Jupyter Notebooks and problems with reproducibility: \citeappendix{rule18} and \citeappendix{pimentel19}.
+%% \item Reproducibility certification \url{https://www.cascad.tech}.
+%% \item \url{https://plato.stanford.edu/entries/scientific-reproducibility}.
+%% \item
+%%Modern analysis tools are almost entirely implemented as software packages.
+%%This has lead many scientists to adopt solutions that software developers use for reproducing software (for example to fix bugs, or avoid security issues).
+%%These tools and how they are used are thorougly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}.
+%%However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering.
+%%This difference can be broken up into the following categories, which are described more fully below:
+%%1) Reading vs. executing, 2) Archiving how software is used and 3) Citation of the software/methods used for scientific credit.
+%%
+%%The first difference is because in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient).
+%%For a scientist it is more important to read/study a method of a paper that is 1, 10, or 100 years old.
+%%The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another random scientist to execute.
+%%Another scientist just needs to be assured that the commands they are reading is exactly what was (and can potentially be) executed.
+%%
+%%On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because they are done increasingly by pre-written software programs and libraries.
+%%Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard gained experience.
+%%Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citeappendix{dicosmo18} do a wonderful job at long term preservation and archival of the software source code.
+%%However, preservation of the software's raw code is only part of the process, it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, in conjunction with which other software tools and etc.
+%%
+%%The third major difference was scientific credit, which is measured in units of citations, not dollars.
+%%As described above, scientific software are playing an increasingly important role in modern science.
+%%Because of the domain-specific knowledge necessary to produce such software, they are mostly written by scientists for scientists.
+%%Therefore a significant amount of effort and research funding has gone into producing scientific software.
+%%Atleast for the software that do have an accompanying paper, it is thus important that those papers be cited when they are used.
+%%\end{itemize}
+
+
+
+
+%% Bibliography of appendix
+\bibliographystyleappendix{IEEEtran_openaccess}
+\bibliographyappendix{IEEEabrv,references}
+\fi
\end{document}
-%% This file is part of Maneage (https://maneage.org).
-%
%% This file is free software: you can redistribute it and/or modify it
%% under the terms of the GNU General Public License as published by the
%% Free Software Foundation, either version 3 of the License, or (at your
@@ -217,7 +1776,4 @@ The IAC project P/300724, financed by the MCIU, through the Canary Islands Depar
%% This file is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
-%% for more details.
-%
-%% You should have received a copy of the GNU General Public License along
-%% with this file. If not, see <http://www.gnu.org/licenses/>.
+%% for more details. See <http://www.gnu.org/licenses/>.