Diffstat (limited to 'tex/src')
-rw-r--r-- | tex/src/appendix-existing-solutions.tex | 50
-rw-r--r-- | tex/src/appendix-existing-tools.tex | 8
2 files changed, 29 insertions, 29 deletions
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index d7888ad..25b2920 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -16,20 +16,20 @@ \section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
-As reviewed in the introduction, the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
+As reviewed in the introduction, the problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
In this appendix, some of the solutions are reviewed.
-The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a webpage, the year of its first release is used for the sorting, otherwise their paper's publication year is used.
+The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a web page, the year of its first release is used for the sorting, otherwise their paper's publication year is used.
For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility, where a black box is acceptable/desirable.
Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
-Other studies have also attempted to review existing reproducible solutions, foro example \citeappendix{konkol20}.
+Other studies have also attempted to review existing reproducible solutions, for example \citeappendix{konkol20}.
\subsection{Suggested rules, checklists, or criteria}
Before going into the various implementations, it is also useful to review existing suggested rules, checklists, or criteria for computationally reproducible research.
-All the cases below are primarily targetted to immediate reproducibility and do not consider longevity explicitly.
+All the cases below are primarily targeted to immediate reproducibility and do not consider longevity explicitly.
Therefore, they lack a strong/clear completeness criterion (they mainly only suggest, rather than require, the recording of versions), and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).
Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
@@ -73,7 +73,7 @@ As described in \cite{schwab2000}, in the latter half of that decade, they moved
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and to organize the steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
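To make this Make-based orchestration more concrete, the following is a minimal, hypothetical sketch of such a central Makefile; it is not taken from RED/SEP's actual Makefiles, and every file, variable, and script name is invented for illustration. A result directory (like RED's RESDIR variable, detailed in the lines that follow) collects all built files, each analysis step is an independent rule, and a single invocation of make rebuilds only the steps whose prerequisites have changed.

    # Hypothetical central Makefile in the RED/SEP style described above.
    # All names are illustrative; recipe lines must be indented with a tab.
    RESDIR = results

    # In a real project, several low-level Makefiles would be included here,
    # for example:  include lib/filter.mk lib/plot.mk
    # To keep this sketch self-contained, the steps are written out below.

    # The figure(s) the reader asks for; rebuilt with a single `make' call.
    all: $(RESDIR)/figure1.txt

    # Step 1: stand-in for the raw input data (generated so the sketch runs).
    $(RESDIR)/raw.dat: | $(RESDIR)
            seq 1 10 > $@

    # Step 2: an independent analysis step producing a derived table.
    $(RESDIR)/filtered.dat: $(RESDIR)/raw.dat
            awk '$$1 % 2 == 0' $< > $@

    # Step 3: the "figure" (a plain-text stand-in), built only from Step 2's
    # output, so changing this recipe does not re-run the earlier steps.
    $(RESDIR)/figure1.txt: $(RESDIR)/filtered.dat
            printf 'Figure 1: %s even numbers\n' "$$(wc -l < $<)" > $@

    # Create the result directory when needed (order-only prerequisite).
    $(RESDIR):
            mkdir -p $@

In a real project of this kind, the individual rules would live in the included low-level Makefiles, with the central Makefile only listing the requested outputs; that convention is what SEP standardized as an internal policy.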
-The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result dir), this is the directory where built files are kept.
+The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory); this is the directory where built files are kept.
Afterwards, the reader could set which figures/parts of the project to reproduce by manually adding their names in the central Makefile, and running Make.
At the time, Make was already practiced by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistent across projects.
@@ -114,7 +114,7 @@ Madagascar has been used in the production of hundreds of research papers or boo
Madagascar does include project management tools in the form of SCons extensions.
However, it is not just a reproducible project management tool.
It is primarily a collection of analysis programs and tools to interact with RSF files, and plotting facilities.
-The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the filesystem and acts as the intermediary between Madagascar's analysis programs.
+The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the file system and acts as the intermediary between Madagascar's analysis programs.
For example, in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
The analysis programs mostly target geophysical data analysis, including various project-specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory, which includes names of Madagascar users.
@@ -168,7 +168,7 @@ In many aspects, the usage of Kepler and its issues for long-term reproducibilit
\label{appendix:vistrails}
VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow managing system.
-According to its webpage, VisTrails maintainance has stopped since May 2016, its last Git commit, as of this writing, was in November 2017.
+According to its web page, VisTrails maintenance has stopped since May 2016; its last Git commit, as of this writing, was in November 2017.
However, the fact that it was well maintained for over 10 years is an achievement.
VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features.
@@ -198,7 +198,7 @@ Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Gen
The main user interface is ``Galaxy Pages'', which does not require any programming: users simply use abstract ``tools'', which are wrappers over command-line programs.
Therefore the actual running version of the program can be hard to control across different Galaxy servers.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
-Besides some small differences Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintining such a system).
+Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system).
@@ -212,18 +212,18 @@ The published narrative description of the algorithm must be detailed to a level
The author's own implementation of the algorithm is also published with the paper (in C, C++, or MATLAB); the code must be commented well enough to link each part of it with the relevant part of the paper.
The authors must also submit several example datasets/scenarios.
The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
-After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input.
+After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web interface and even provide their own input.
IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
-We encourage the reader to visit its webpage and see some of its recent papers and their demos.
+We encourage the reader to visit its web page and see some of its recent papers and their demos.
The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
In fact, high-level languages like Perl, Python or Java are not acceptable in IPOL precisely because of the additional complexities, such as dependencies, that they require.
However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
IPOL thus fails on our Scalability criterion.
-Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails Criterias 6 and 7.
-Note that on the webpage, it is possible to change parameters, but that will not affect the produced PDF.
-A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinised at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
+Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails criteria 6 and 7.
+Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF.
+A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similarly detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
@@ -266,7 +266,7 @@ For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.or
In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (one or two lines, for example just to see the option values given to one step).
They do not necessarily need to run it, or obtain the datasets.
-Hence the extra volume for data, and obscure HDF5 format that needs special tools for reading plain text code is a major hinderance.
+Hence the extra volume for data, and the obscure HDF5 format that needs special tools for reading plain-text code, are major hindrances.
@@ -286,7 +286,7 @@ Through its web-based interface, viewers of a paper can actively experiment with
\subsection{SHARE (2011)}
\label{appendix:SHARE}
SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
-The top project webpage above is still active, however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintainence cost of the VMs.
+The top project web page above is still active; however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintenance cost of the VMs.
SHARE was awarded second place in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
Simply put, SHARE is just a VM that users can download and run.
@@ -307,8 +307,8 @@ The VRIs are automatically generated web-URLs that link to public VCR repositori
According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R and Python, although only the MATLAB version was available during the writing of this paper.
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
-Unfortunately most parts of the webpage are not complete at the time of this writing.
-The VCR webpage contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
+Unfortunately, most parts of the web page are not complete at the time of this writing.
+The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system; however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
Finally, the dates of the files in the MATLAB extension tarball are set to 2011, hinting that VCR was probably abandoned soon after the publication of \citeappendix{gavish11}.
@@ -364,7 +364,7 @@ The important thing is that the research object concept is not specific to any s
\subsection{Sciunit (2015)}
\label{appendix:sciunit}
Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
-It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
+It automatically parses all the executable files in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020).
Therefore Sciunit2 is a new implementation in Python 3.
@@ -378,7 +378,7 @@ This is a major problem for scientific projects: in principle (not knowing how t
\subsection{Umbrella (2015)}
Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
-The user specifies the necessary operating system, and necessary packages and analysis steps in variuos JSON files.
+The user specifies the necessary operating system, and necessary packages and analysis steps in various JSON files.
Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunit (mentioned above) to find the best environment isolator (maybe using Linux containerization, containers or VMs).
We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
@@ -392,7 +392,7 @@ The tracking is done at the kernel system-call level, so any file that is access
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
ReproZip is therefore very good for taking a ``snapshot'' of the running environment into a single file.
-The bundle can become very large if large, or many datasets, are used or if the software evironment is complex (many dependencies).
+The bundle can become very large if large or many datasets are used, or if the software environment is complex (many dependencies).
Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way to know how those software were built.
As mentioned in this paper, and also in \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries is not necessarily useful.
@@ -409,7 +409,7 @@ Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize
Users simply add a set of Binder-recognized configuration files to their repository and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
One good feature of Binder is that the imported Docker image must be tagged (something like a checksum).
This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
-However, it does not make sure that the dockerfile used by the imported Docker image follows a similar convention also.
+However, it does not make sure that the Dockerfile used by the imported Docker image also follows a similar convention.
Binder is used by \citeappendix{jones19}.
@@ -423,7 +423,7 @@ Gigantum uses Docker containers for an independent environment, Conda (or Pip) t
Simply put, it is a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without Gigantum's own client.
The file structure (which is under version control) includes code, input data, and output data.
-As acknowledged on their own webpage, this greatly reduces the speed of Git operations, transmitting, or archiving the project.
+As acknowledged on their own web page, this greatly reduces the speed of Git operations and of transmitting or archiving the project.
Therefore there are limits on the dataset/code sizes.
However, there is one directory which can be used to store files that must not be tracked.
@@ -436,7 +436,7 @@ Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software impleme
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
The \inlinecode{popper} program itself is written in Python.
However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows.
-Moreover, from October 2019 Github changed to a custom YAML-based languguage, so Popper also deprecated HCL.
+Moreover, from October 2019 GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
This is an important issue when low-level choices are based on service providers.
To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
@@ -473,10 +473,10 @@ Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{olive
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey here that attempts to be complete in this aspect.
-However it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Redhat-based and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{pacman} or \inlinecode{yum} package managers.
+However it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Red Hat-based and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pacman} package managers.
It is also itself written in Python (version 3.4 or above), hence it is not clear
-Furthermore, it also violates our complexity criteria because the instructions to build the software, their versions and etc are not immediately viewable or modifable by the user.
+Furthermore, it also violates our complexity criteria because the instructions to build the software, their versions, etc., are not immediately viewable or modifiable by the user.
Occam contains its own JSON database for this, which has to be parsed with its own custom program.
The analysis phase of Occam is also done through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}), which is a web-based graphical user interface.
All the connections between various phases of an analysis need to be pre-defined in a JSON file and manually linked in the GUI.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index 5920fbd..d923d5f 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -120,7 +120,7 @@ However, attempting to archive the actual binary container or VM files as a blac
\subsubsection{Independent build in host's file system}
\label{appendix:independentbuild}
The virtual machine and container solutions mentioned above have their own independent file system.
-Another approach to having an isolated analysis environment is to use the same filesystem as the host, but installing the project's software in a non-standard, project-specific directory that does not interfere with the host.
+Another approach to having an isolated analysis environment is to use the same file system as the host, but to install the project's software in a non-standard, project-specific directory that does not interfere with the host.
Because the environment in this approach can be built in any custom location on the host, this solution generally does not require root permissions or extra low-level layers like containers or VMs.
However, ``moving'' the built product of such solutions from one computer to another is not generally as trivial as it is with containers or VMs.
Examples of such third-party package managers (that are detached from the host OS's package manager) include Nix, GNU Guix, Python's Virtualenv package, and Conda, among others.
@@ -189,7 +189,7 @@ This allows for multiple versions of the software to co-exist on the system, whi
As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries and this is necessary from the security or development perspectives.
-However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bitwise identical (for example when they are installed in other locations, because the installation location is hardcoded in the software binary).
+However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bit-wise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary).
Finally, while Guix and Nix do allow precisely reproducible environments, doing so requires extra effort.
For example, simply running \inlinecode{guix install gcc} will install the most recent version of GCC, which can be different at different times.
@@ -310,7 +310,7 @@ For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a
Commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc}.
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
-The team can host the Git history on a webpage and collaborate through that.
+The team can host the Git history on a web page and collaborate through that.
There are several Git hosting services, for example \href{http://codeberg.org}{codeberg.org}, \href{http://gitlab.com}{gitlab.com}, \href{http://bitbucket.org}{bitbucket.org} or \href{http://github.com}{github.com} (among many others).
Storing the changes in binary files is also possible in Git; however, it is most useful for human-readable plain-text sources.
@@ -488,7 +488,7 @@ Integration of directional graph features (dependencies between the cells) into
The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file is another major hurdle and goes against the modularity criterion proposed here.
The files can easily become very large (in volume/bytes) and hard to read.
Both are critical for scientific processing, especially the latter: when a web browser with proper JavaScript features is not available (which can happen in a few years).
-This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.
+This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into a much less memory-efficient textual encoding.
Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies.
\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).