-rw-r--r--   paper.tex                                 | 38
-rw-r--r--   tex/src/appendix-existing-solutions.tex   | 50
-rw-r--r--   tex/src/appendix-existing-tools.tex       |  8
3 files changed, 44 insertions, 52 deletions
@@ -87,7 +87,7 @@ \emph{Appendices} ---
Two comprehensive appendices that review the longevity of existing solutions; available
\ifdefined\separatesupplement
-as supplementary ``Web extras'' on the journal webpage.
+as supplementary ``Web extras'' on the journal web page.
\else
after main body of paper (Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}).
\fi
@@ -162,21 +162,13 @@ We will focus on Docker here because it is currently the most common.
\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.}
-Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20}
-\ifdefined\separatesupplement
-\new{(more examples in the \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices})}%
-\else%
-\new{(more examples: see the appendices (\ref{appendix:existingtools}))}%
-\fi%
-{ }imports `\inlinecode{FROM ubuntu:16.04}'.
+Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} imports `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
Hence, if the image is built in different months, it will contain different OS components.
-% CentOS announcement: https://blog.centos.org/2020/12/future-is-centos-stream
-In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not aborted earlier, like CentOS 8 which will be terminated 8 years early).}
+In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early).}
Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
-%% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
-\new{Because of this, DockerHub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
+\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel: older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).}

Once the host OS is ready, PMs are used to install the software or environment.
@@ -203,7 +195,7 @@ It is important to remember that the longevity of a project is determined by its
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
By default, cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language.
-There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
+There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
However, since they are not part of the core, their longevity can be assumed to be shorter.
Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
@@ -222,7 +214,7 @@ We argue and propose that workflows satisfying the following criteria can not on
\textbf{Criterion 1: Completeness.}
A project that is complete (self-contained) has the following properties.
(1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
-Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longetivity}.}
+Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longevity}.}
(2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
(3) No impact on the host OS libraries, programs, and \new{environment variables}.
(4) Does not require root privileges to run (during development or post-publication).
@@ -277,7 +269,7 @@ They are reliant on a single supplier (even without payments) \new{and prone to
A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
-\new{Propietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
+\new{Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
In such cases, it is best to immediately convert the data to free formats upon collection, and archive (e.g., on Zenodo) or use the data in free formats.}
@@ -324,21 +316,21 @@ Through the former, manual updates by authors (which are prone to errors and dis
Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation.
-These are combined at the end to generate precise software \new{acknowledgement} and citation that is shown in the
+These are combined at the end to generate precise software \new{acknowledgment} and citation that is shown in the
\new{
\ifdefined\separatesupplement
-  \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices}.%
+  \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices},%
\else%
-  appendices (\ref{appendix:software}).%
+  appendices (\ref{appendix:software}),%
\fi%
}
-(for other examples, see \cite{akhlaghi19, infante20})
+for other examples, see \cite{akhlaghi19, infante20}.
\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell, \new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
-Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's webpage.}
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.}
On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
@@ -482,7 +474,7 @@ $ git add -u && git commit # Commit changes.
The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project; in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
-Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
Finally, the complete project source is usually $\sim100$ kilo-bytes.
It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
@@ -525,10 +517,10 @@ This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
They later share their low-level commits on the core branch, thus propagating it to all derived projects.
-\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.}.
+\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.}
However \new{our focus is on reproducing results (output of software), not the software itself.}
Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
-On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain.
+On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
However, glibc is not install-able on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
With everything else under precise control in Maneage, the effect of differing hardware, Kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index d7888ad..25b2920 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -16,20 +16,20 @@ \section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
-As reviewed in the introduction, the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
+As reviewed in the introduction, the problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
In this appendix, some of the solutions are reviewed.
-The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a webpage, the year of its first release is used for the sorting, otherwise their paper's publication year is used.
+The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a web page, the year of its first release is used for the sorting, otherwise their paper's publication year is used.
For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
-Other studies have also attempted to review existing reproducible solutions, foro example \citeappendix{konkol20}.
+Other studies have also attempted to review existing reproducible solutions, for example \citeappendix{konkol20}.

\subsection{Suggested rules, checklists, or criteria}
Before going into the various implementations, it is also useful to review existing suggested rules, checklists or criteria for computationally reproducible research.
-All the cases below are primarily targetted to immediate reproducibility and do not consider longevity explicitly.
+All the cases below are primarily targeted to immediate reproducibility and do not consider longevity explicitly.
Therefore, they lack a strong/clear completeness criterion: they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).

Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
@@ -73,7 +73,7 @@ As described in \cite{schwab2000}, in the latter half of that decade, they moved
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and organizing the steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
-The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result dir), this is the directory where built files are kept.
+The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory), this is the directory where built files are kept.
Afterwards, the reader could set which figures/parts of the project to reproduce by manually adding its name in the central Makefile, and running Make.

At the time, Make was already practiced by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistent across projects.
@@ -114,7 +114,7 @@ Madagascar has been used in the production of hundreds of research papers or boo
Madagascar does include project management tools in the form of SCons extensions.
However, it is not just a reproducible project management tool.
It is primarily a collection of analysis programs and tools to interact with RSF files, and plotting facilities.
-The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the filesystem and acts as the intermediary between Madagascar's analysis programs.
+The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the file system and acts as the intermediary between Madagascar's analysis programs.
For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
The analysis programs mostly target geophysical data analysis, including various project specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users.
@@ -168,7 +168,7 @@ In many aspects, the usage of Kepler and its issues for long-term reproducibilit
\label{appendix:vistrails}
VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow managing system.
-According to its webpage, VisTrails maintainance has stopped since May 2016, its last Git commit, as of this writing, was in November 2017.
+According to its web page, VisTrails maintenance has stopped since May 2016, its last Git commit, as of this writing, was in November 2017.
However, the fact that it was well maintained for over 10 years is an achievement.
VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features.
@@ -198,7 +198,7 @@ Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Gen
The main user interface is ``Galaxy Pages'', which does not require any programming: users simply use abstract ``tools'', which are wrappers over command-line programs.
Therefore the actual running version of the program can be hard to control across different Galaxy servers.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
-Besides some small differences Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintining such a system).
+Besides some small differences Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system).
@@ -212,18 +212,18 @@ The published narrative description of the algorithm must be detailed to a level
The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB), the code must be commented well enough and link each part of it with the relevant part of the paper.
The authors must also submit several example datasets/scenarios.
The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
-After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input.
+After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web-interface and even provide their own input.
IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
-We encourage the reader to visit its webpage and see some of its recent papers and their demos.
+We encourage the reader to visit its web page and see some of its recent papers and their demos.
The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
In fact, high-level languages like Perl, Python or Java are not acceptable in IPOL precisely because of the additional complexities, such as dependencies, that they require.
However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
IPOL thus fails on our Scalability criteria.
-Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails Criterias 6 and 7.
-Note that on the webpage, it is possible to change parameters, but that will not affect the produced PDF.
-A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinised at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
+Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails criteria 6 and 7.
+Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF.
+A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
@@ -266,7 +266,7 @@ For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.or
In many scenarios, peers just want to inspect the processing by reading the code and checking a very special part of it (one or two lines, just to see the option values to one step for example).
They do not necessarily need to run it, or obtain the datasets.
-Hence the extra volume for data, and obscure HDF5 format that needs special tools for reading plain text code is a major hinderance.
+Hence the extra volume for data, and obscure HDF5 format that needs special tools for reading plain text code is a major hindrance.
@@ -286,7 +286,7 @@ Through its web-based interface, viewers of a paper can actively experiment with
\subsection{SHARE (2011)}
\label{appendix:SHARE}
SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
-The top project webpage above is still active, however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintainence cost of the VMs.
+The top project web page above is still active, however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintenance cost of the VMs.
SHARE was recognized as second position in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.

Simply put, SHARE is just a VM that users can download and run.
@@ -307,8 +307,8 @@ The VRIs are automatically generated web-URLs that link to public VCR repositori
According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R and Python, although only the MATLAB version was available during the writing of this paper.
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
-Unfortunately most parts of the webpage are not complete at the time of this writing.
-The VCR webpage contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
+Unfortunately most parts of the web page are not complete at the time of this writing.
+The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
Finally, the dates of the files in the MATLAB extension tarball are set to 2011, hinting that VCR was probably abandoned soon after the publication of \citeappendix{gavish11}.
@@ -364,7 +364,7 @@ The important thing is that the research object concept is not specific to any s
\subsection{Sciunit (2015)}
\label{appendix:sciunit}
Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
-It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
+It automatically parses all the executable files in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it's possible to run it readily on other systems that have a similar CPU architecture.

Sciunit was originally written in Python 2 (which reached its end-of-life in January 1st, 2020).
Therefore Sciunit2 is a new implementation in Python 3.
@@ -378,7 +378,7 @@ This is a major problem for scientific projects: in principle (not knowing how t
\subsection{Umbrella (2015)}
Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
-The user specifies the necessary operating system, and necessary packages and analysis steps in variuos JSON files.
+The user specifies the necessary operating system, and necessary packages and analysis steps in various JSON files.
Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunits mentioned above to find the best environment isolator (maybe using Linux containerization, containers or VMs).
We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
@@ -392,7 +392,7 @@ The tracking is done at the kernel system-call level, so any file that is access
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
ReproZip is therefore very good to take a ``snapshot'' of the running environment into a single file.
-The bundle can become very large if large, or many datasets, are used or if the software evironment is complex (many dependencies).
+The bundle can become very large if large, or many datasets, are used or if the software environment is complex (many dependencies).
Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
Furthermore, ReproZip just copies the binary/compiled files used in a project, it has no way to know how those software were built.
As mentioned in this paper, and also \citeappendix{oliveira18} the question of ``how'' the environment was built is critical for understanding the results and simply having the binaries cannot necessarily be useful.
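The tag-versus-digest distinction made for Docker images in the paper.tex hunk above (and touched on again in the Binder hunk below, with its tagged image imports) comes down to how the image is referenced when it is pulled or named in a \inlinecode{FROM} line. The following two commands are only an illustrative sketch, and the digest is a placeholder, not a real checksum:

$ docker pull ubuntu:16.04            # Mutable tag: whatever the registry currently serves under this name.
$ docker pull ubuntu@sha256:<digest>  # Immutable digest: always resolves to the same image content.

The same two forms are accepted in a Dockerfile's \inlinecode{FROM} line, which is why pinning by digest (when it is actually done) keeps the re-created OS image independent of later updates to the tag.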
@@ -409,7 +409,7 @@ Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize
Users simply add a set of Binder-recognized configuration files to their repository and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
One good feature of Binder is that the imported Docker image must be tagged (something like a checksum).
This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
-However, it does not make sure that the dockerfile used by the imported Docker image follows a similar convention also.
+However, it does not make sure that the Dockerfile used by the imported Docker image follows a similar convention also.
Binder is used by \citeappendix{jones19}.
@@ -423,7 +423,7 @@ Gigantum uses Docker containers for an independent environment, Conda (or Pip) t
Simply put, it's a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without their own client.
The file structure (which is under version control) includes codes, input data and output data.
-As acknowledged on their own webpage, this greatly reduces the speed of Git operations, transmitting, or archiving the project.
+As acknowledged on their own web page, this greatly reduces the speed of Git operations, transmitting, or archiving the project.
Therefore there are size limits on the dataset/code sizes.
However, there is one directory which can be used to store files that must not be tracked.
@@ -436,7 +436,7 @@ Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software impleme
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
The \inlinecode{popper} program itself is written in Python.
However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows.
-Moreover, from October 2019 Github changed to a custom YAML-based languguage, so Popper also deprecated HCL.
+Moreover, from October 2019 GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
This is an important issue when low-level choices are based on service providers.

To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
@@ -473,10 +473,10 @@ Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{olive
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
Besides Nix or Guix (which are primarily a package manager that can also do job management), Occam has been the only solution in our survey here that attempts to be complete in this aspect.

-However it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Redhat-based and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{pacman} or \inlinecode{yum} package managers.
+However it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Red Hat-based and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{pacman} or \inlinecode{yum} package managers.
It is also itself written in Python (version 3.4 or above), hence it is not clear
-Furthermore, it also violates our complexity criteria because the instructions to build the software, their versions and etc are not immediately viewable or modifable by the user.
+Furthermore, it also violates our complexity criteria because the instructions to build the software, their versions, etc., are not immediately viewable or modifiable by the user.
Occam contains its own JSON database for this that should be parsed with its own custom program.
The analysis phase of Occam is also through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}) that is a web-based graphic user interface.
All the connections between various phases of an analysis need to be pre-defined in a JSON file and manually linked in the GUI.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index 5920fbd..d923d5f 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -120,7 +120,7 @@ However, attempting to archive the actual binary container or VM files as a blac
\subsubsection{Independent build in host's file system}
\label{appendix:independentbuild}
The virtual machine and container solutions mentioned above have their own independent file system.
-Another approach to having an isolated analysis environment is to use the same filesystem as the host, but installing the project's software in a non-standard, project-specific directory that does not interfere with the host.
+Another approach to having an isolated analysis environment is to use the same file system as the host, but installing the project's software in a non-standard, project-specific directory that does not interfere with the host.
Because the environment in this approach can be built in any custom location on the host, this solution generally does not require root permissions or extra low-level layers like containers or VMs.
However, ``moving'' the built product of such solutions from one computer to another is not generally as trivial as containers or VMs.
Examples of such third-party package managers (that are detached from the host OS's package manager) include Nix, GNU Guix, Python's Virtualenv package, and Conda, among others.
@@ -189,7 +189,7 @@ This allows for multiple versions of the software to co-exist on the system, whi
As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries and this is necessary for the security or development perspectives.
-However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bitwise identical (for example when they are installed in other locations, because the installation location is hardcoded in the software binary).
+However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bit-wise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary).

Finally, while Guix and Nix do allow precisely reproducible environments, it requires extra effort.
For example, simply running \inlinecode{guix install gcc} will install the most recent version of GCC that can be different at different times.
@@ -310,7 +310,7 @@ For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a
Commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc}.

With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
-The team can host the Git history on a webpage and collaborate through that.
+The team can host the Git history on a web page and collaborate through that.
There are several Git hosting services for example \href{http://codeberg.org}{codeberg.org}, \href{http://gitlab.com}{gitlab.com}, \href{http://bitbucket.org}{bitbucket.org} or \href{http://github.com}{github.com} (among many others).
Storing the changes in binary files is also possible in Git, however it is most useful for human-readable plain-text sources.
@@ -488,7 +488,7 @@ Integration of directional graph features (dependencies between the cells) into
The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file, is another major hurdle and against the modularity criteria proposed here.
The files can easily become very large (in volume/bytes) and hard to read.
Both are critical for scientific processing, especially the latter: when a web browser with proper JavaScript features is not available (can happen in a few years).
-This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.
+This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into a much less memory-efficient textual encoding.
Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies.
\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
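The branch-based re-import of Maneage described in the paper.tex hunk above, together with the short commit checksums mentioned in the Git hunk here, can be sketched with ordinary Git commands; the remote and branch names below are illustrative assumptions rather than a fixed convention:

$ git checkout master               # Return to the project's own history.
$ git fetch origin-maneage          # Fetch the core Maneage branch (hypothetical remote name).
$ git merge origin-maneage/maneage  # Re-import low-level infrastructure improvements.
$ git log --oneline -1              # Commits are summarized by short checksums such as f4953cc.

This flow is what lets low-level improvements propagate to derived projects, both during a project and after publication.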