-rw-r--r-- | tex/src/appendix-existing-solutions.tex | 191 |
1 files changed, 104 insertions, 87 deletions
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex index 25b2920..ccef295 100644 --- a/tex/src/appendix-existing-solutions.tex +++ b/tex/src/appendix-existing-solutions.tex @@ -3,6 +3,7 @@ %% it should not be run independently. % %% Copyright (C) 2020-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org> +%% Copyright (C) 2021 Raúl Infante-Sainz <infantesainz@gmail.com> % %% This file is free software: you can redistribute it and/or modify it %% under the terms of the GNU General Public License as published by the @@ -13,44 +14,52 @@ %% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or %% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License %% for more details. See <http://www.gnu.org/licenses/>. + + + + + \section{Survey of common existing reproducible workflows} \label{appendix:existingsolutions} - As reviewed in the introduction, the problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed. The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}. In this appendix, some of the solutions are reviewed. -The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a web page, the year of its first release is used for the sorting, otherwise their paper's publication year is used. +The solutions are based on an evolving software landscape; therefore, they are ordered by date: when the project has a web page, the year of its first release is used for the sorting. +Otherwise, their paper's publication year is used. For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility, where a black box is acceptable/desirable. Therefore, proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here. -Other studies have also attempted to review existing reproducible solutions, for example \citeappendix{konkol20}. +Other studies have also attempted to review existing reproducible solutions, for example, see \citeappendix{konkol20}. + + + + \subsection{Suggested rules, checklists, or criteria} -Before going into the various implementations, it is also useful to review existing suggested rules, checklists or criteria for computationally reproducible research. +Before going into the various implementations, it is also useful to review existing suggested rules, checklists, or criteria for computationally reproducible research. All the cases below are primarily targeted at immediate reproducibility and do not consider longevity explicitly. Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions), and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in \ref{appendix:independentenvironment} and \citeappendix{oliveira18}). Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project. Generally, these are very similar to the criteria proposed here and follow a similar spirit, but they do not provide any actual research papers following up all those points, nor do they provide a proof of concept.
-The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), -but the authors do not include completeness as a criterion nor pay attention to longevity (Popper itself is written in Python with many dependencies, and its core operating language has already changed once). +The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity (Popper itself is written in Python with many dependencies, and its core operating language has already changed once). For more on Popper, please see Section \ref{appendix:popper}. For improved reproducibility in Jupyter notebook users, \citeappendix{rule19} propose ten rules to improve reproducibility and also provide links to example implementations. -These can be very useful for users of Jupyter, but are not generic for non-Jupyter-based computational projects. +These can be very useful for users of Jupyter but are not generic for non-Jupyter-based computational projects. Some criteria (which are indeed very good in a more general context) do not directly relate to reproducibility, for example their Rule 1: ``Tell a Story for an Audience''. Generally, as reviewed in \ifdefined\separatesupplement -the main body of this paper (section on longevity of existing tools) +the main body of this paper (section on the longevity of existing tools) \else Section \ref{sec:longevityofexisting} \fi and Section \ref{appendix:jupyter} (below), Jupyter itself has many issues regarding reproducibility. To create Docker images, N\"ust et al. 
propose ``ten simple rules'' in \citeappendix{nust20}. They recommend some issues that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'' to separate the computational environment from the data. -However, long-term reproducibility of the images is not included as a criterion by these authors. +However, the long-term reproducibility of the images is not included as a criterion by these authors. For example, they recommend using base operating systems, with version identification limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which has serious longevity problems \ifdefined\separatesupplement (as discussed in the longevity of existing tools section of the main paper). @@ -59,13 +68,16 @@ For example, they recommend using base operating systems, with version identific \fi Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}). + + + + \subsection{Reproducible Electronic Documents, RED (1992)} \label{appendix:red} - RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the first attempt at doing reproducible research that we could find; see \cite{claerbout1992,schwab2000}. It was developed within the Stanford Exploration Project (SEP) for Geophysics publications. Their introduction on the importance of reproducibility resonates strongly with today's environment in computational sciences. -In particular the heavy investment one has to make in order to re-do another scientist's work, even in the same team. +In particular, the heavy investment one has to make in order to re-do another scientist's work, even in the same team. RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.
To orchestrate the various figures/results of a project, from 1990, they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make; for more on Make, see Appendix \ref{appendix:jobmanagement}. @@ -74,7 +86,7 @@ The basic idea behind RED's solution was to organize the analysis as independent This enabled all the results to be re-executed with a single command. Several basic low-level Makefiles were included in the high-level/central Makefile. The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory); this is the directory where the built files are kept. -Afterwards, the reader could set which figures/parts of the project to reproduce by manually adding its name in the central Makefile, and running Make. +The reader could later select which figures/parts of the project to reproduce by manually adding their names to the central Makefile and running Make. At the time, Make was already used by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy and to define conventions for the Makefiles to be consistent across projects. This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs). @@ -90,9 +102,9 @@ Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see A \label{appendix:taverna} Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface which is still being developed. A workflow is defined as a directed graph, where nodes are called ``processors''. -Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, were each step is an atomic task).
+Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task). Other components of the workflow are ``Data links'' and ``Coordination constraints''. -The main user interface is graphical, where users move processors in the given space and define links between their inputs outputs (manually constructing a lineage like +The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage like \ifdefined\separatesupplement lineage figure of the main paper. \else @@ -113,10 +125,10 @@ Madagascar has been used in the production of hundreds of research papers or boo Madagascar does include project management tools in the form of SCons extensions. However, it is not just a reproducible project management tool. -It is primarily a collection of analysis programs and tools to interact with RSF files, and plotting facilities. The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the file system and acts as the intermediary between Madagascar's analysis programs. +Therefore, Madagascar is primarily a collection of analysis programs and tools to interact with RSF files and plotting facilities. For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}). -The analysis programs mostly target geophysical data analysis, including various project specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users. 
+The analysis programs mostly target geophysical data analysis, including various project-specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users. Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format. Since RSF also contains the program options, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output. @@ -126,10 +138,14 @@ However, this comes at the expense of a large amount of bloatware (programs that Also, the linking between the analysis programs (of a certain user at a certain time) and future versions of that program (that is updated in time) is not immediately obvious. Madagascar could have been more useful to a larger community if the workflow components were maintained as a project separate from the analysis components. + + + + \subsection{GenePattern (2004)} \label{appendix:genepattern} GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is a client-server software containing many common analysis functions/modules, primarily focused on gene studies. -Although its highly focused to a special research field, it is reviewed here because its concepts/methods are generic, and in the context of this paper. +Although it is highly focused on a special research field, it is reviewed here because its concepts/methods are generic in the context of this paper. Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules. The modules are used through a web interface; the modern implementation is GenePattern Notebook \citeappendix{reich17}.
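As a concrete illustration of the RSF plain-text header described above for Madagascar, the following sketch parses its name/value pairs. This is not Madagascar's own code: the field names (\inlinecode{n1}, \inlinecode{d1}, \inlinecode{in}, \inlinecode{data\_format}) follow the public RSF guide, but the values, the file path, and the helper function are invented for illustration.

```python
# Sketch of reading the name/value pairs of a Madagascar-style RSF header.
# The header is plain text; the "in" key points to the separate binary file.
import shlex

header = 'n1=512 d1=0.004 o1=0 esize=4 in="/data/project/shot1.rsf@" data_format="native_float"'

def parse_rsf_header(text):
    """Return a dict of name -> value strings from an RSF-style header."""
    pairs = {}
    for token in shlex.split(text):  # shlex also strips the quoting around values
        name, _, value = token.partition("=")
        pairs[name] = value
    return pairs

params = parse_rsf_header(header)
print(params["n1"])   # number of samples along the first axis -> "512"
print(params["in"])   # location of the actual binary data on the file system
```

Because the header is plain text, such options can be generated and inspected with standard tools, which is part of what lets Madagascar's programs communicate through standard input and output.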
@@ -137,13 +153,13 @@ It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}) However, the wrapper modules just call an existing tool on the host system. Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility. -%% GenePattern shutdown announcement (although as of November 2020, it does not open any more!): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019 +%% GenePattern shutdown announcement (although as of November 2020, it does not open any more): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019 The primary GenePattern server had been active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citeappendix{reich17}. -However, it was shut down on November 15th 2019 due to end of funding. +However, it was shut down on November 15th 2019 due to the end of funding. All processing with this server has stopped, and any archived data on it has been deleted. -Since GenePattern is free software, there are alternative public servers to use, so hopefully work on it will continue. +Since GenePattern is free software, there are alternative public servers to use, so hopefully, work on it will continue. However, funding is limited and those servers may face similar funding problems. -This is a very nice example of the fragility of solutions that depend on archiving and running the research codes with high-level research products (including data, binary/compiled code which are expensive to keep) in one place. +This is a very nice example of the fragility of solutions that depend on archiving and running the research codes with high-level research products (including data and binary/compiled codes that are expensive to keep in one place).
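The wrapper-module pattern described above for GenePattern can be sketched as follows. All names here are invented, but the key point is taken from the text: the wrapper only knows an abstract tool name and resolves whatever binary the host server happens to provide, so the tool's version is outside the project's control.

```python
# Minimal sketch of a GenePattern-style tool wrapper (names are hypothetical).
# Two servers with different installed versions of the same tool can silently
# produce different results, because resolution happens on the host at run time.
import shutil
import subprocess

def run_tool(name, *args):
    """Resolve an abstract tool name on the host system and run it."""
    binary = shutil.which(name)  # host-dependent resolution from PATH
    if binary is None:
        raise RuntimeError(f"tool {name!r} is not installed on this server")
    result = subprocess.run([binary, *args], capture_output=True,
                            text=True, check=True)
    return result.stdout

# The wrapper hides which binary (and therefore which version) actually ran:
print(run_tool("echo", "hello from the host's echo"))
```

The same limitation applies to Galaxy's ``tools'' (reviewed later in this appendix), which are likewise wrappers over command-line programs installed on the server.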
@@ -166,7 +182,6 @@ In many aspects, the usage of Kepler and its issues for long-term reproducibilit \subsection{VisTrails (2005)} \label{appendix:vistrails} - VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow managing system. According to its web page, VisTrails maintenance stopped in May 2016; its last Git commit, as of this writing, was in November 2017. However, the fact that it was well maintained for over 10 years is an achievement. @@ -176,10 +191,10 @@ Each analysis step, or module, is recorded in an XML schema, which defines the o The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command). Since the main goal was visualization (as images), its primary output is apparently in the form of image spreadsheets. Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL); for more, see \citeappendix{scheidegger08}. -Since XML is a plane text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}). +Since XML is a plain text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}). However, even though XML is in plain text, it is very hard to edit manually.
-VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to +VisTrails, therefore, provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to \ifdefined\separatesupplement the data lineage figure of the main paper). \else @@ -193,12 +208,11 @@ Besides the fact that it is no longer maintained, VisTrails does not control the \subsection{Galaxy (2010)} \label{appendix:galaxy} - Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}. -The main user interface are ``Galaxy Pages'', which does not require any programming: users simply use abstract ``tools'' which are a wrappers over command-line programs. +The main user interfaces are ``Galaxy Pages'', which do not require any programming: users simply use abstract ``tools'' which are wrappers over command-line programs. Therefore the actual running version of the program can be hard to control across different Galaxy servers. Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose. -Besides some small differences Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system). +Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system). 
@@ -209,15 +223,15 @@ Besides some small differences Galaxy seems very similar to GenePattern (Appendi The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm. An IPOL paper is a traditional research paper, but with a focus on implementation. The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed). -The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB), the code must be commented well enough and link each part of it with the relevant part of the paper. -The authors must also submit several example datasets/scenarios. +The author's own implementation of the algorithm is also published with the paper (in C, C++, or MATLAB); the code must be well commented, linking each part of it with the relevant part of the paper. +The authors must also submit several example datasets/scenarios. The referee is expected to inspect the code and narrative, confirming that they match each other and the stated conclusions of the published paper. After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web interface and even provide their own input. IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone. We encourage the reader to visit its web page and see some of its recent papers and their demos. The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
-In fact, high-level languages like Perl, Python or Java are not acceptable in IPOL precisely because of the additional complexities, such as dependencies, that they require. +In fact, high-level languages like Perl, Python, or Java are not acceptable in IPOL precisely because of the additional complexities, such as the dependencies that they require. However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable. IPOL thus fails on our Scalability criteria. @@ -229,15 +243,13 @@ A paper written in Maneage (the proof-of-concept solution presented in this pape - \subsection{WINGS (2010)} \label{appendix:wings} - WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}} \citeappendix{gil10} is an automatic workflow generation algorithm. It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images). -It allows users to define various workflow components (for example datasets, analysis components and etc), with high-level goals. +It allows users to define various workflow components (for example datasets, analysis components, etc), with high-level goals. It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints. -\tonote{Read more about this} +%\tonote{Read more about this} @@ -249,8 +261,8 @@ Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine \citeappendix{hinsen11}. However, Java is not a commonly used platform today, hence it was later implemented in Python \citeappendix{hinsen15}. -In the Python version, all processing steps and input data (or references to them) are stored in a HDF5 file. 
-However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}. +In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file. +%However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}. When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper. As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it. @@ -261,12 +273,13 @@ Furthermore, as a high-level data format HDF5 evolves very fast, for example HDF While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving Terabytes of data can be less than a megabyte. Hence, storing code and data together becomes a burden when large datasets are used; this was also acknowledged in \citeappendix{hinsen15}. Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published. -Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary/reproducible files, and its necessary to remove/dummify them, thus complicating the code, making it hard to read. +Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary/reproducible files. +These files must later be removed, which complicates the code and makes it hard to read. For example, the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.
In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (one or two lines, for example, just to see the option values given to one step). -They do not necessarily need to run it, or obtaining the datasets. -Hence the extra volume for data, and obscure HDF5 format that needs special tools for reading plain text code is a major hindrance. +They do not necessarily need to run it or obtain the datasets. +Hence the extra volume of the data, and an obscure HDF5 format that needs special tools merely to read plain-text code, is a major hindrance. @@ -275,9 +288,9 @@ Hence the extra volume for data and obscure HDF5 format that needs special tools \subsection{Collage Authoring Environment (2011)} \label{appendix:collage} The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}. -It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment\tonote{find citation}, which has a web-based graphic user interface. +It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment, which has a web-based graphic user interface. Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures). -\tonote{See how it containerizes the software environment} +%\tonote{See how it containerizes the software environment}
-The top project web page above is still active, however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintenance cost of the VMs. - -SHARE was recognized as second position in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}. -Simply put, SHARE is just a VM that users can download and run. +SHARE was awarded second place in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}. +Simply put, SHARE was just a VM library that users could download, or connect to, and run. The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible. +While the top project web page still works, selecting any operation prints a notice that ``SHARE is offline'' since 2019; the reason is not mentioned.
-The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs and scripts, that may be re-executed. -According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R and Python, although only the MATLAB version was available during the writing of this paper. +This enables the exact identification and citation of results. +The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs, and scripts, that may be re-executed. +According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available during the writing of this paper. VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF. -Unfortunately most parts of the web page are not complete at the time of this writing. -The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing. -Finally, the date of the files in the MATLAB extension tarball are set to 2011, hinting that probably VCR has been abandoned soon after the publication of \citeappendix{gavish11}. +Unfortunately, most parts of the web page are not complete at the time of this writing. +The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing. +Finally, the date of the files in the MATLAB extension tarball is set to 2011, hinting that probably VCR has been abandoned soon after the publication of \citeappendix{gavish11}. 
@@ -319,15 +331,17 @@ Finally, the date of the files in the MATLAB extension tarball are set to 2011, \label{appendix:sole} SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix{pham12,malik13}. An SO is any code/content that is wrapped in begin/end tags with an associated type and name. -For example special commented lines in a Python, R or C program. +For example, special commented lines in a Python, R, or C program. The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI). SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}. For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI. -However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible). +However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible). Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive. -The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags the authors may not consider important, but may be useful later). -In Maneage, instead of artificial/commented tags directly link the analysis input and outputs to the paper's text automatically. 
+The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags that the authors did not consider important may turn out to be useful later).
+In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text.
+
+
@@ -339,7 +353,7 @@ The captured environment can be viewed in plain text or a web interface.
Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).
-The actual code that Sumatra wraps around, must itself be under version control, and it does not run if there is non-committed changes (although its not clear what happens if a commit is amended).
+The actual code that Sumatra wraps around must itself be under version control, and it does not run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
Therefore Sumatra, unlike Sciunit (see Appendix \ref{appendix:sciunit}), does not attempt to store the environment of the analysis itself, but only information about it.
Sumatra thus needs to know the language of the running program and is not generic.
@@ -351,7 +365,6 @@ It just captures the environment, it does not store \emph{how} that environment
\subsection{Research Object (2013)}
\label{appendix:researchobject}
-
The Research object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of meta-data ontologies to describe the aggregation of resources or workflows; see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow. @@ -361,25 +374,26 @@ The important thing is that the research object concept is not specific to any s + \subsection{Sciunit (2015)} \label{appendix:sciunit} Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands. -It automatically parses all the executable files in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit. -Because the sciunit contains all the programs and necessary libraries, its possible to run it readily on other systems that have a similar CPU architecture. -Sciunit was originally written in Python 2 (which reached its end-of-life in January 1st, 2020). +It automatically parses all the executable files in the script and copies them, and their dependency libraries (down to the C library), into the sciunit. +Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture. +Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020). Therefore Sciunit2 is a new implementation in Python 3. The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built. -This is a major problem for scientific projects: in principle (not knowing how they programs were built) and in practice (archiving a large volume sciunit for every step of an analysis requires a lot of storage space). 
+This is a major problem for scientific projects: in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of the analysis requires a lot of storage space).
\subsection{Umbrella (2015)}
-Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
-The user specifies the necessary operating system, and necessary packages and analysis steps in various JSON files.
-Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunits mentioned above to find the best environment isolator (maybe using Linux containerization, containers or VMs).
+Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis.
+The user specifies the necessary operating system and the packages required for the analysis steps in various JSON files.
+Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunit's (mentioned above) to find the best environment isolator (possibly Linux containerization, containers, or VMs).
We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
@@ -387,18 +401,18 @@ We could not find a URL to the source software of Umbrella (no source code repos
\subsection{ReproZip (2016)}
-ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries and environment variables into a single bundle.
+ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables into a single bundle.
The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
ReproZip is therefore very good for taking a ``snapshot'' of the running environment into a single file.
-The bundle can become very large if large, or many datasets, are used or if the software environment is complex (many dependencies).
+However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies).
Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
-Furthermore, ReproZip just copies the binary/compiled files used in a project, it has no way to know how those software were built.
+Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way to know how the software was built.
+As mentioned in this paper, and also in \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries is not necessarily useful.
For the data, it is similarly not possible to extract which server they came from.
-Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long term preservation extremely expensive.
+Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
@@ -420,12 +434,13 @@ Binder is used by \citeappendix{jones19}.
%% I took the date from their PyPI page, where the first version 0.1 was published in November 2016.
Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image.
Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
-Simply put, its a high-level wrapper for combining these components.
+Simply put, it is a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without their own client.
-The file structure (which is under version control) includes codes, input data and output data.
+The file structure (which is under version control) includes code, input data, and output data.
As acknowledged on their own web page, this greatly reduces the speed of Git operations and of transmitting or archiving the project.
Therefore there are limits on the dataset/code sizes.
-However, there is one directory which can be used to store files that must not be tracked.
+However, there is one directory that can be used to store files that must not be tracked.
+
@@ -441,24 +456,26 @@ This is an important issue when low-level choices are based on service providers
To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
However, as of this writing, the scaffold is not complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
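The ``validation of outputs'' that the scaffold lacks can be as simple as comparing checksums of the produced files against recorded values. The following is our own minimal sketch of that idea (it is not part of Popper; the names and the example checksum source are hypothetical):

```python
import hashlib

def sha256_of(content: bytes) -> str:
    """Checksum of a produced output, as raw bytes."""
    return hashlib.sha256(content).hexdigest()

# Recorded checksums of previously verified outputs (a real project
# would store these values under version control).
expected = {"table.csv": sha256_of(b"x,y\n1,2\n")}

def validate(outputs: dict) -> bool:
    """Return True only if every produced output matches its
    recorded checksum."""
    return all(sha256_of(content) == expected[name]
               for name, content in outputs.items())

assert validate({"table.csv": b"x,y\n1,2\n"})      # reproduced exactly
assert not validate({"table.csv": b"x,y\n1,9\n"})  # result has changed
```

Keeping such recorded checksums inside the version-controlled project is what lets a later reader confirm that a re-run actually reproduced the published results.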
-By default Popper runs in a Docker image (so root permissions are necessary and reproducible issues with Docker images have been discussed above), but Singularity is also supported.
+By default, Popper runs in a Docker image (so root permissions are necessary, and reproducibility issues with Docker images have been discussed above), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
-Popper does not comply with the completeness, minimal complexity and including-narrative criteria.
+Popper does not comply with the completeness, minimal complexity, and narrative-inclusion criteria.
Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control.
Hence, tracking future changes in Popper and how they relate to the high-level projects that depend on it will be very hard.
In Maneage, the same \inlinecode{maneage} git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage.
+
+
+
\subsection{Whole Tale (2017)}
\label{appendix:wholetale}
-
-Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance, see \citeappendix{brinckman17}
+Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance; see \citeappendix{brinckman17}.
It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).
-The web-based nature of Whole Tale's approach, and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
+The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
-This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, its not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
+This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
1) It requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands.
2) It is not possible to accurately see how the environment was built (when the Dockerfile uses \inlinecode{apt}).
@@ -471,13 +488,13 @@ This issue with Whole Tale (and generally all other solutions that only rely on
\subsection{Occam (2018)}
Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution.
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
-Besides Nix or Guix (which are primarily a package manager that can also do job management), Occum has been the only solution in our survey here that attempts to be complete in this aspect.
+Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey here that attempts to be complete in this aspect.
-However it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Red Hat based and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{pacman} or \inlinecode{yum} package managers.
-It is also itself written in Python (version 3.4 or above), hence it is not clear
+However, it is incomplete from the perspective of our requirements: it works within a Docker image (which requires root permissions) and currently only runs on Debian-based, Red Hat-based, and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{yum}, and \inlinecode{pacman} package managers.
+It is also itself written in Python (version 3.4 or above).
-Furthermore, it also violates our complexity criteria because the instructions to build the software, their versions and etc are not immediately viewable or modifiable by the user.
-Occam contains its own JSON database for this that should be parsed with its own custom program.
-The analysis phase of Occum is also through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}) that is a web-based graphic user interface.
-All the connections between various phases of an analysis need to be pre-defined in a JSON file and manually linked in the GUI.
-Hence for complex data analysis operations with involve thousands of steps, it is not scalable.
+Furthermore, it does not account for the minimal complexity criterion because the instructions to build the software and their versions are not immediately viewable or modifiable by the user.
+Occam contains its own JSON database that should be parsed with its own custom program.
+The analysis phase of Occam is also done through a web-based, drag-and-drop graphical user interface (similar to Taverna, Appendix \ref{appendix:taverna}).
+All the connections between various phases of the analysis need to be pre-defined in a JSON file and manually linked in the GUI.
+Hence, for complex data analysis operations that involve thousands of steps, it is not scalable.
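The scalability problem of pre-defined connections can be made concrete with a generic sketch (a hypothetical schema of our own, not Occam's actual JSON format): every edge between two phases is one entry that must be written, or clicked, by hand, so the number of manual links grows with the number of steps.

```python
import json

# Hypothetical illustration of phase connections declared in JSON
# (not Occam's real schema).
workflow_json = """
{
  "phases": ["download", "calibrate", "analyze"],
  "connections": [
    {"from": "download", "to": "calibrate"},
    {"from": "calibrate", "to": "analyze"}
  ]
}
"""

workflow = json.loads(workflow_json)

# Each connection entry corresponds to one manual link in the GUI;
# a linear pipeline of N phases already needs N-1 such entries.
print(len(workflow["connections"]))
```

With thousands of steps the connection list, and the manual GUI linking it mirrors, becomes the bottleneck — which is the scalability criticism made above.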