-rw-r--r-- | paper.tex | 29 |
1 files changed, 14 insertions, 15 deletions
@@ -62,7 +62,7 @@ %% AIM
 A set of criteria are introduced to address this problem.
 %% METHOD
-Completeness (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+Completeness (no \new{execution requirement} beyond \new{a minimal Unix-like operating system}, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
 They have been tested in several research publications in various fields.
 %% RESULTS
 As a proof of concept, ``Maneage'' is introduced for storing projects in machine-actionable and human-readable plain text, enabling cheap archiving, provenance extraction, and peer verification.
@@ -165,7 +165,7 @@ Usually images are imported with generic operating system (OS) names; e.g., \cit
 The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
 Hence, if the image is built in different months, it will contain different OS components.
 In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL.
-Generally, Pre-built binary files (like Docker images) are large and expensive to maintain and archive.
+Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
 %% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
 \new{Because of this, Docker Hub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
 Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade)}.
@@ -264,7 +264,7 @@ A narrative description is also a deliverable (defined as ``data article'' in \c
 This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
 \textbf{Criterion 8: Free and open source software:}
-Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box, we do not need an artificial source of ambiguity wraped over it.
+Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box; we do not need an artificial source of ambiguity \new{wrapped} over it.
 A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined) allows others to learn from, modify, and build upon it.
 When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level, and it can be modified for future hardware.
 In contrast, non-free tools typically cannot be distributed or modified by others, making the project reliant on a single supplier (even without payments).
@@ -291,13 +291,13 @@ The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending
 It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
 The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
-Technically, the hardest criterion to implement was the first (completeness) and, in particular, avoiding non-POSIX dependencies).
+Technically, the hardest criterion to implement was the first (completeness) and, in particular, \new{restricting execution requirements to only a minimal Unix-like operating system}.
 One solution we considered was GNU Guix and Guix Workflow Language (GWL).
 However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
 Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make.
 Make is not an analysis language; it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell, or C).
-Make is standardized in POSIX and is used in almost all core OS components.
+Make is standardized in \new{Unix-like operating systems}.
 It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
 Researchers using free software tools have also already had some exposure to it \new{(almost all free software projects are built with Make)}.
@@ -315,11 +315,11 @@ Through the former, manual updates by authors (which are prone to errors and dis
 Acting as a link, the macro files build the core skeleton of Maneage.
 For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
-These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
+These are combined at the end to generate precise software \new{acknowledgement} and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
 \new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
 These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
 The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and of course, the high-level science software.
+All software dependencies are built down to precise versions of every tool, including the shell, \new{important low-level application programs} (e.g., GNU Coreutils) and, of course, the high-level science software.
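To illustrate how a package's version can be recorded as a \LaTeX{} macro, a minimal sketch in Make follows (the file and macro names here are hypothetical, not Maneage's actual rules):

  # Minimal sketch: record the version of the GNU Awk actually in use
  # as a LaTeX macro that the narrative can cite automatically.
  # (Recipe lines must begin with a TAB character.)
  versions.tex:
          printf '\\newcommand{\\gawkversion}{%s}\n' \
                 "$$(gawk --version | sed 's/^GNU Awk //; s/,.*//; q')" > $@

The narrative can then use \inlinecode{\gawkversion} wherever the version is mentioned, so it never has to be updated by hand.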
 \new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
 Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's webpage.}
@@ -409,7 +409,7 @@ Where exact reproducibility is not possible \new{(for example due to paralleliza
 Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
 Briefly, Git is a version control system, allowing a structured backup of project files.
 Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
- The upward arrows at the branch-tops are therefore in the timee direction.
+ The upward arrows at the branch-tops are therefore in the \new{time} direction.
 }
 \end{figure*}
@@ -435,7 +435,7 @@ Maneage is a Git branch that contains the shared components (infrastructure) of
 Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
 There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
-The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the \new{acknowledgements}).
 These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgements)}.
 The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
@@ -498,17 +498,16 @@ Hence, arguably the most important feature of these criteria (as implemented in
 Using mature and time-tested tools for blending version control, the research paper's narrative, the software management \emph{and} robust data management strategies.
 We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
-Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
+Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like operating systems} (Maneage is actively tested on different GNU/Linux distributions and macOS, and is being ported to FreeBSD).
 This requires maintenance by our core team and consumes time and energy.
 However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
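For example, such a macro can be produced by a rule along these lines (an illustrative sketch only; the target and macro names are not necessarily those used by \inlinecode{initialize.mk}):

  # Illustrative sketch: store the project's current Git commit hash
  # as a LaTeX macro, for use at the end of the abstract or in the
  # acknowledgements.
  # (Recipe lines must begin with a TAB character.)
  project-version.tex:
          printf '\\newcommand{\\projectversion}{%s}\n' \
                 "$$(git describe --dirty --always)" > $@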
 They later share their low-level commits on the core branch, thus propagating them to all derived projects.
-A related caveat is that, POSIX is a fuzzy standard, not guaranteeing the bit-wise reproducibility of programs.
-It has been chosen here, however, as the underlying platform \new{because our focus is on reproducing the results (output of software), not the software itself.}
-POSIX is ubiquitous and low-level software (e.g., core GNU tools) are install-able on most.
+\new{We have deliberately not been strict about the condition of executability on a minimal Unix-like operating system, because in the end it should not matter how minimal the operating system is.}
+\new{The operating system only needs enough components to be able to install low-level software (e.g., core GNU tools).}
 Well-written software internally corrects for differences in OS or hardware that may affect its functionality (through tools like the GNU portability library).
 On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
-However, glibc is not install-able on some POSIX OSs (e.g., macOS) and all programs link with the C library.
+However, glibc is not installable on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
 This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
 With everything else under precise control in Maneage, the effect of differing kernels and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
 Using continuous integration (CI) is one way to precisely identify breaking points on multiple systems.}
@@ -993,7 +992,7 @@ Here, we'll complement that section with more technical details on Make.
 Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
 Each stage/step in the analysis is defined through a \emph{rule}.
 Rules define \emph{recipes} to build \emph{targets} from \emph{prerequisites}.
-In POSIX operating systems (Unix-like), everything is a file, even directories and devices.
+In \new{Unix-like operating systems}, everything is a file, even directories and devices.
 Therefore all three components in a rule must be files on the running filesystem.
 To decide which operation should be re-done when executed, Make compares the timestamps of the targets and prerequisites.
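For example, a minimal rule (all file names here are hypothetical) looks like this:

  # Build the target 'table.txt' from two prerequisites; the recipe
  # (the indented shell command) re-runs only when a prerequisite is
  # newer than the target.
  # (Recipe lines must begin with a TAB character.)
  table.txt: input.txt filter.awk
          awk -f filter.awk input.txt > $@

Running \inlinecode{make} twice in a row demonstrates the timestamp comparison: the second invocation finds the target newer than both prerequisites, reports that it is up to date, and executes nothing.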