Diffstat (limited to 'tex/src/appendix-existing-solutions.tex')
-rw-r--r-- | tex/src/appendix-existing-solutions.tex | 502 |
1 files changed, 502 insertions, 0 deletions
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex new file mode 100644 index 0000000..e802644 --- /dev/null +++ b/tex/src/appendix-existing-solutions.tex @@ -0,0 +1,502 @@ +%% Appendix on reviewing existing reproducible workflow solutions. This +%% file is loaded by the project's 'paper.tex' or 'tex/src/supplement.tex', +%% it should not be run independently. +% +%% Copyright (C) 2020-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org> +%% Copyright (C) 2021 Raúl Infante-Sainz <infantesainz@gmail.com> +% +%% This file is free software: you can redistribute it and/or modify it +%% under the terms of the GNU General Public License as published by the +%% Free Software Foundation, either version 3 of the License, or (at your +%% option) any later version. +% +%% This file is distributed in the hope that it will be useful, but WITHOUT +%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or +%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +%% for more details. See <http://www.gnu.org/licenses/>. + + + + + +\section{Survey of common existing reproducible workflows} +\label{appendix:existingsolutions} +As reviewed in the introduction, the problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed. +The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}. +In this appendix, some of the solutions are reviewed. +The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a web page, the year of its first release is used for the sorting. +Otherwise their paper's publication year is used. + +For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper. 
+Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable. +Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here. +Other studies have also attempted to review existing reproducible solutions, for example, see \citeappendix{konkol20}. + + + + + +\subsection{Suggested rules, checklists, or criteria} +Before going into the various implementations, it is also useful to review existing suggested rules, checklists, or criteria for computationally reproducible research. + +All the cases below are primarily targeted at immediate reproducibility and do not consider longevity explicitly. +Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions), and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in Appendix \ref{appendix:independentenvironment} and \citeappendix{oliveira18}). + +Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project. +Generally, these are very similar to the criteria proposed here and follow a similar spirit, but they do not provide any actual research papers following all those points, nor do they provide a proof of concept. 
+The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity (Popper itself is written in Python with many dependencies, and its core operating language has already changed once). +For more on Popper, please see Section \ref{appendix:popper}. + +For improved reproducibility in Jupyter notebook users, \citeappendix{rule19} propose ten rules to improve reproducibility and also provide links to example implementations. +These can be very useful for users of Jupyter but are not generic for non-Jupyter-based computational projects. +Some criteria (which are indeed very good in a more general context) do not directly relate to reproducibility, for example their Rule 1: ``Tell a Story for an Audience''. +Generally, as reviewed in +\ifdefined\separatesupplement +the main body of this paper (section on the longevity of existing tools) +\else +Section \ref{sec:longevityofexisting} +\fi +and Section \ref{appendix:jupyter} (below), Jupyter itself has many issues regarding reproducibility. +To create Docker images, N\"ust et al. propose ``ten simple rules'' in \citeappendix{nust20}. +They recommend some issues that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'' to separate the computational environment from the data. +However, the long-term reproducibility of the images is not included as a criterion by these authors. 
+For example, they recommend using base operating systems, with version identification limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which has a serious problem with longevity issues +\ifdefined\separatesupplement +(as discussed in the longevity of existing tools section of the main paper). +\else +(Section \ref{sec:longevityofexisting}). +\fi +Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}). + + + + + +\subsection{Reproducible Electronic Documents, RED (1992)} +\label{appendix:red} +RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the first attempt that we could find on doing reproducible research, see \cite{claerbout1992,schwab2000}. +It was developed within the Stanford Exploration Project (SEP) for Geophysics publications. +Their introductions on the importance of reproducibility, resonate a lot with today's environment in computational sciences. +In particular, the heavy investment one has to make in order to re-do another scientist's work, even in the same team. +RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}. + +To orchestrate the various figures/results of a project, from 1990, they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make, for more on Make, see Appendix \ref{appendix:jobmanagement}. +As described in \cite{schwab2000}, in the latter half of that decade, they moved to GNU Make, which was much more commonly used, developed and came with a complete and up-to-date manual. +The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and organizing the steps through a Makefile. +This enabled all the results to be re-executed with a single command. 
+Several basic low-level Makefiles were included in the high-level/central Makefile. +The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory), this is the directory where built files are kept. +The reader could later select which figures/parts of the project to reproduce by manually adding its name in the central Makefile, and running Make. + +At the time, Make was already practiced by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistent across projects. +This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs). +However, RED only used the existing software of the host system, it had no means to control them. +Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing \citeappendix{fomel09}. +Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see Appendix \ref{appendix:madagascar}. + + + + + +\subsection{Apache Taverna (2003)} +\label{appendix:taverna} +Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface which is still being developed. +A workflow is defined as a directed graph, where nodes are called ``processors''. +Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task). +Other components of the workflow are ``Data links'' and ``Coordination constraints''. 
+The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage like +\ifdefined\separatesupplement +the lineage figure of the main paper). +\else +Figure \ref{fig:datalineage}). +\fi +Taverna is only a workflow manager and is not integrated with a package manager, hence the versions of the used software can be different in different runs. +\citeappendix{zhao12} have studied the problem of workflow decay in Taverna. + + + + + +\subsection{Madagascar (2003)} +\label{appendix:madagascar} +Madagascar\footnote{\inlinecode{\url{http://ahay.org}}} \citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in Appendix \ref{appendix:scons}). +Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}. +Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to \citeappendix{fomel13}. + +Madagascar does include project management tools in the form of SCons extensions. +However, it is not just a reproducible project management tool. +The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the file system and acts as the intermediary between Madagascar's analysis programs. +Therefore, Madagascar is primarily a collection of analysis programs and tools to interact with RSF files and plotting facilities. +For example, in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}). 
+The analysis programs mostly target geophysical data analysis, including various project-specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users. + +Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format. +Since RSF also contains program options, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output. + +In terms of completeness, as long as the user only uses Madagascar's own analysis programs, it is fairly complete at a high level (not lower-level OS libraries). +However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build), thus adding complexity. +Also, the link between the analysis programs (of a certain user at a certain time) and future versions of those programs (which are updated over time) is not immediately obvious. +Furthermore, the blending of the workflow component with the low-level analysis components fails the modularity criterion. + + + + + +\subsection{GenePattern (2004)} +\label{appendix:genepattern} +GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is client-server software containing many common analysis functions/modules, primarily focused on gene studies. +Although it is highly focused on a specific research field, it is reviewed here because its concepts/methods are generic and relevant to the context of this paper. + +Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules. +The modules are used through a web interface; the modern implementation is GenePattern Notebook \citeappendix{reich17}. 
+It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis. +However, the wrapper modules just call an existing tool on the host system. +Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility. + +%% GenePattern shutdown announcement (although as of November 2020, it does not open any more): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019 +The primary GenePattern server had been active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citeappendix{reich17}. +However, it was shut down on November 15th 2019 due to the end of funding. +All processing with this server has stopped, and any archived data on it has been deleted. +Since GenePattern is free software, there are alternative public servers to use, so hopefully, work on it will continue. +However, funding is limited and those servers may face similar funding problems. +This is a very nice example of the fragility of solutions that depend on archiving and running the research codes with high-level research products (including data and binary/compiled codes that are expensive to keep in one place). + + + + + +\subsection{Kepler (2005)} +Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}} \citeappendix{ludascher05} is a Java-based graphical user interface workflow management tool. +Users drag and drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to +\ifdefined\separatesupplement +the lineage figure shown in the main paper). +\else +Figure \ref{fig:datalineage}). +\fi +Each actor is connected to others through Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}. 
+In many aspects, the usage of Kepler and its issues for long-term reproducibility are like those of Apache Taverna (see Section \ref{appendix:taverna}). + + + + + +\subsection{VisTrails (2005)} +\label{appendix:vistrails} +VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow management system. +According to its web page, VisTrails maintenance stopped in May 2016, and its last Git commit, as of this writing, was in November 2017. +However, the fact that it was well maintained for over 10 years is an achievement. + +VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features. +Each analysis step, or module, is recorded in an XML schema, which defines the operations and their dependencies. +The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command). +Since the main goal was visualization (as images), apparently its primary output is in the form of image spreadsheets. +Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL), for more see \citeappendix{scheidegger08}. +Since XML is a plain text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}). +However, even though XML is in plain text, it is very hard to edit manually. +VisTrails, therefore, provides a graphical user interface with a visual representation of the project's inter-dependent steps (similar to +\ifdefined\separatesupplement +the data lineage figure of the main paper). +\else +Figure \ref{fig:datalineage}). 
+\fi +Besides the fact that it is no longer maintained, VisTrails does not control the software that is run; it only controls the sequence in which the steps are run. + + + + + +\subsection{Galaxy (2010)} +\label{appendix:galaxy} +Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}. +The main user interface is the ``Galaxy Pages'', which does not require any programming: users graphically manipulate abstract ``tools'' which are wrappers over command-line programs. +Therefore the actual running version of the program can be hard to control across different Galaxy servers. +Besides the automatically generated metadata of a project (which includes version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose. +Apart from some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the points raised there apply here too. +For example, both have a very large maintenance cost and are based on a graphic environment. + + + + + +\subsection{Image Processing On Line journal, IPOL (2010)} +\label{appendix:ipol} +The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm. +An IPOL paper is a traditional research paper, but with a focus on implementation. +The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed). +The author's own implementation of the algorithm is also published with the paper (in C, C++, or MATLAB); the code must be commented well enough to link each part of it with the relevant part of the paper. +The authors must also submit several examples of datasets/scenarios. 
+The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper. +After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web-interface and even provide their own input. + +IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone. +We encourage the reader to visit its web page and see some of its recent papers and their demos. +The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm. +In fact, high-level languages like Perl, Python, or Java are not acceptable in IPOL precisely because of the additional complexities, such as the dependencies that they require. +However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable. + +IPOL thus fails on our Scalability criteria. +Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails criteria 6 and 7. +Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF. +A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data. + + + + + +\subsection{WINGS (2010)} +\label{appendix:wings} +WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}} \citeappendix{gil10} is an automatic workflow generation algorithm. 
+It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images). +It allows users to define various workflow components (for example datasets, analysis components, etc), with high-level goals. +It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints. +%\tonote{Read more about this} + + + + + +\subsection{Active Papers (2011)} +\label{appendix:activepapers} +Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format). +It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine \citeappendix{hinsen11}. +However, Java is not a commonly used platform today, hence it was later implemented in Python \citeappendix{hinsen15}. + +In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file. +%However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}. +When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper. + +As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it. +Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required). +Installing it using the Debian or Arch Linux package managers also failed due to dependencies in our trials. +Furthermore, as a high-level data format HDF5 evolves very fast, for example HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team. % maybe replace with: February 29\textsuperscript{th}, 2020? 
+ +While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving terabytes of data can be less than a megabyte. +Hence, storing code and data together becomes a burden when large datasets are used; this was also acknowledged in \citeappendix{hinsen15}. +Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published. +Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary files. +These files can later be removed as part of the analysis, but this makes the code more complicated and hard to read/maintain. +For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes. + +In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (one or two lines, just to see the option values to one step for example). +They do not necessarily need to run it or obtain the output datasets (which may be published elsewhere). +Hence the extra volume for data and the obscure HDF5 format, which needs special tools for reading its plain-text internals, are an issue. + + + + + +\subsection{Collage Authoring Environment (2011)} +\label{appendix:collage} +The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}. +It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment, which has a web-based graphic user interface. +Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures). 
+%\tonote{See how it containerizes the software environment} + + + + + +\subsection{SHARE (2011)} +\label{appendix:SHARE} +SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project. +SHARE was recognized as the second position in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}. +Simply put, SHARE was just a VM library that users could download or connect to, and run. +The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible. +As of January 2021, the top SHARE web page still works. +However, upon selecting any operation, a notice is printed that ``SHARE is offline'' since 2019 and the reason is not mentioned. + + + + + +\subsection{Verifiable Computational Result, VCR (2011)} +\label{appendix:verifiableidentifier} +A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc) that is associated with a ``verifiable result identifier'' (VRI), see \citeappendix{gavish11}. +It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}. + +A VRI is created using tags within the programming source that produced that output, also recording its version control or history. +This enables the exact identification and citation of results. +The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs, and scripts, that may be re-executed. +According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available during the writing of this paper. +VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF. 
+ +Unfortunately, most parts of the web page are not complete at the time of this writing. +The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing. +Finally, the date of the files in the MATLAB extension tarball is set to 2011, hinting that probably VCR has been abandoned soon after the publication of \citeappendix{gavish11}. + + + + + +\subsection{SOLE (2012)} +\label{appendix:sole} +SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix{pham12,malik13}. +An SO is any code/content that is wrapped in begin/end tags with an associated type and name. +For example, special commented lines in a Python, R, or C program. +The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI). +SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}. + +For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI. +However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible). +Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive. 
+The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (the authors may not consider tags as important things, but they may be useful later). +In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text through \LaTeX{} macros that are the backbone of the whole system (they are not artificial/extra features). + + + + + +\subsection{Sumatra (2012)} +Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}} \citeappendix{davison12} attempts to capture the environment information of a running project. +It is written in Python and is a command-line wrapper over the analysis script. +By controlling a project at run-time, Sumatra is able to capture the environment it was run in. +The captured environment can be viewed in plain text or a web interface. +Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database. +This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed). + +The actual code that Sumatra wraps around must itself be under version control, and it does not run if there are non-committed changes (although it is not clear what happens if a commit is amended). +Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project. +Therefore Sumatra makes no attempt at storing the environment of the analysis itself (as in Sciunit, see Appendix \ref{appendix:sciunit}), only information about it. +Sumatra thus needs to know the language of the running program and is not generic. +It just captures the environment; it does not store \emph{how} that environment was built. 
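To make the distinction between recording an environment and storing it more concrete, the following is a minimal Python sketch of run-time environment capture. It is not Sumatra's code or API: the function names capture_environment and environment_changed, and the choice of recorded fields, are assumptions made here purely for illustration.

```python
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return a dictionary describing the currently running environment.

    Hypothetical helper for illustration only; it is not part of Sumatra.
    """
    # Names and versions of the Python distributions visible at run time.
    packages = sorted(
        (str(dist.metadata.get("Name", "unknown")), str(dist.version))
        for dist in metadata.distributions()
    )
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": [{"name": n, "version": v} for n, v in packages],
    }


def environment_changed(previous, current):
    """Report whether two captured records differ in any field."""
    return previous != current


if __name__ == "__main__":
    record = capture_environment()
    print(record["python_version"], len(record["packages"]), "packages")
```

Such a record only lists names and versions: it can reveal that the environment differs between two runs, but it carries no information on how any of the listed components were built, which is the limitation described above.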
+ + + + + +\subsection{Research Object (2013)} +\label{appendix:researchobject} +The Research Object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of meta-data ontologies to describe aggregations of resources or workflows, see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}. +It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow. + +\citeappendix{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects. +The important thing is that the research object concept is not specific to any particular workflow; it is just a metadata bundle/standard which is only as robust in reproducing the result as the running workflow. + + + + + +\subsection{Sciunit (2015)} +\label{appendix:sciunit} +Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands. +It automatically parses all the executable files in the script and copies them, and their dependency libraries (down to the C library), into the sciunit. +Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture. +Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020). +Therefore Sciunit2 is a new implementation in Python 3. + +The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built. +This is a major problem for scientific projects: in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of the analysis requires a lot of storage space). 
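The general idea of bundling executables together with their shared libraries can be sketched as follows. This is not Sciunit's implementation: we assume here, for illustration only, that the dependency list is available as text in the format printed by the GNU/Linux ldd program, and the function names parse_ldd_output and copy_into_bundle are hypothetical.

```python
import shutil
from pathlib import Path


def parse_ldd_output(text):
    """Extract absolute library paths from ldd-style output.

    Lines look like 'libc.so.6 => /lib/libc.so.6 (0x...)' or
    '/lib64/ld-linux-x86-64.so.2 (0x...)'. Entries without a path on
    disk (for example 'linux-vdso.so.1' or 'not found') are skipped.
    """
    paths = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Keep the part after '=>' when present, else the whole line.
        target = line.split("=>", 1)[1].strip() if "=>" in line else line
        if not target:
            continue
        candidate = target.split()[0]
        if candidate.startswith("/") and candidate not in paths:
            paths.append(candidate)
    return paths


def copy_into_bundle(files, bundle_dir):
    """Copy each file under bundle_dir, preserving its original path."""
    bundle_dir = Path(bundle_dir)
    copied = []
    for name in files:
        source = Path(name)
        # Drop the filesystem root so the path can be nested in the bundle.
        destination = bundle_dir / source.relative_to(source.anchor)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, destination)
        copied.append(destination)
    return copied
```

The result of such a procedure is a directory of compiled files: nothing in it records the source code, compiler, or build options that produced them, which is why the copied binaries are described above as black boxes.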
+ + + + + +\subsection{Umbrella (2015)} +Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis. +The user specifies the necessary operating system, and necessary packages for the analysis steps in various JSON files. +Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunit mentioned above to find the best environment isolator (maybe using Linux containerization, containers, or VMs). +We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}). + + + + + +\subsection{ReproZip (2016)} +ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables into a single bundle. +The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified. +The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked on another system. + +ReproZip is therefore very good at taking a ``snapshot'' of the running environment into a single file. +However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies). +Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original. +Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way to know how the software was built.
+As mentioned in this paper, and also in \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries is not necessarily useful. + +For the data, it is similarly not possible to extract which data server they came from. +Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive. + + + + + +\subsection{Binder (2017)} +Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize already existing Jupyter-based processing steps. +Users simply add a set of Binder-recognized configuration files to their repository and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda). +One good feature of Binder is that the imported Docker image must be tagged (something like a checksum). +This will ensure that future/latest updates of the imported Docker image are not mistakenly used. +However, it does not make sure that the Dockerfile used by the imported Docker image also follows a similar convention. +Binder is used by \citeappendix{jones19}. + + + + + +\subsection{Gigantum (2017)} +%% I took the date from their PyPI page, where the first version 0.1 was published in November 2016. +Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image. +Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history. +Simply put, it is a high-level wrapper for combining these components. +Internally, a Gigantum project is organized as files in a directory that can be opened without their own client.
+The file structure (which is under version control) includes code, input data, and output data. +As acknowledged on their own web page, this greatly reduces the speed of Git operations, transmitting, or archiving the project. +Therefore there are limits on the dataset/code sizes. +However, there is one directory that can be used to store files that must not be tracked. + + + + + +\subsection{Popper (2017)} +\label{appendix:popper} +Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}. +The Popper team's own solution is through a command-line program called \inlinecode{popper}. +The \inlinecode{popper} program itself is written in Python. +However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows. +Then, from October 2019, GitHub changed to a custom YAML-based language, so Popper also deprecated HCL. +This is an important issue when low-level choices are based on service providers. + +To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run. +However, as of this writing, the scaffold is not complete: it lacks a manuscript and validation of outputs (as mentioned in the convention). +By default, Popper runs in a Docker image (so root permissions are necessary and reproducibility issues with Docker images have been discussed above), but Singularity is also supported. +See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow. + +Popper does not comply with the completeness, minimal complexity, and including the narrative criteria. +Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control.
+Hence, tracking future changes in Popper and how they relate to the high-level projects that depend on it will be very hard. +In Maneage, the same \inlinecode{maneage} Git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage. + + + + + +\subsection{Whole Tale (2017)} +\label{appendix:wholetale} +Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance, see \citeappendix{brinckman17}. +It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}). + +The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility. +For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem. +This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale. +Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems: +1) It requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands. +2) It is not possible to see accurately how the environment was built (when the Dockerfile uses \inlinecode{apt}).
+This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in \citeappendix{oliveira18}; for more on this, see Appendix \ref{appendix:packagemanagement}. + + + + + +\subsection{Occam (2018)} +Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution. +To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage. +Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey here that attempts to be complete in this aspect. + +However, it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Red Hat-based, and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pacman} package managers. +It is also itself written in Python (version 3.4 or above). + +Furthermore, it does not account for the minimal complexity criterion because the instructions to build the software and their versions are not immediately viewable or modifiable by the user. +Occam contains its own JSON database that should be parsed with its own custom program. +The analysis phase of Occam is also through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}) that is a web-based graphical user interface. +All the connections between various phases of the analysis need to be pre-defined in a JSON file and manually linked in the GUI. +Hence for complex data analysis operations that involve thousands of steps, it is not scalable.