| -rw-r--r-- | paper.tex | 68 | 
1 files changed, 37 insertions, 31 deletions
@@ -67,10 +67,10 @@
   This paper is itself written with Maneage (project commit \projectversion).
   \vspace{2.5mm}
-  \emph{Appendix} ---
-  Two comprehensive appendices that review existing solutions available
+  \emph{Appendices} ---
+  Two comprehensive appendices that review existing solutions; available
 \ifdefined\noappendix
-in \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}.
+at \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}.
 \else
 at the end (Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}).
 \fi
@@ -126,9 +126,9 @@ Decades later, scientists are still held accountable for their results and there
 \section{Longevity of existing tools}
 \label{sec:longevityofexisting}
 \new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
-Longevity is defined as the time that a project can be usable.
+Longevity is defined as the time during which a project remains usable.
 Usability is defined by context: for machines (machine-actionable, or executable files) \emph{and} humans (readability of the source).
-Because many usage contexts do not involve execution; for example checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or source of the input data (extracting these from the outputs of execution is not always possible).}
+Many usage contexts do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data (extracting these from the outputs of execution is not always possible).}
 Longevity is as important in science as in some fields of industry, but not all; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
 To highlight the necessity, a short review of commonly-used tools is provided below:
@@ -136,36 +136,42 @@ To highlight the necessity, a short review of commonly-used tools is provided be
 (2) package managers (PMs, like Conda, Nix, or Spack);
 (3) job management (like shell scripts or Make);
 (4) notebooks (like Jupyter).
-\new{For a much more comprehensive review of existing tools and solutions is available in the appendices.}
+\new{A comprehensive review of existing tools and solutions is available in the
+  \ifdefined\noappendix
+  \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices}.%
+  \else%
+  appendices (\ref{appendix:existingsolutions}).%
+  \fi%
+}
 To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
 However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution.
 We will thus focus on Docker here.
-\new{It is theoretically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
-However, that is rarely practiced.}
+\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
+However, that is rarely done.}
 Usually images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}' \new{(more examples in the appendices)}.
 The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
 Hence, if the image is built in different months, its output image will contain different OS components.
 In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL.
 Generally, Pre-built binary files (like Docker images) are large and expensive to maintain and archive.
 %% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
-\new{Because of this DockerHub (where many reproducible workflows are archived) announced that inactive images (for over 6 months) will be deleted in free accounts from mid 2021.}
-Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable \new{(their longevity is determined by the host kernel, usually a decade)}.
+\new{Because of this DockerHub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
+Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade)}.
 Once the host OS is ready, PMs are used to install the software or environment.
 Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
-The former has \new{the same longevity} as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation \new{with a usual longevity of a few years}.
-Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; same as supported CPU architectures}.
+The former has \new{the same longevity} as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation \new{with a typical longevity of a few years}.
+Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; that of their supported CPU architectures}.
 However, they need root permissions and are primarily targeted at the Linux kernel.
 Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them.
 Unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
 Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity.
 With the software environment built, job management is the next component of a workflow.
-Visual/GUI workflow tools like Apache Taverna, GenePattern (depreciated), Kepler or VisTrails (depreciated), which were mostly introduced in the 2000s and used Java or Python 2 encourage modularity and robust job management.
-\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machines (JVM) is depreciated.
-Their data formats are also complex (designed for computers to read) and hard to read by humans without the GUI.}
+Visual/GUI workflow tools like Apache Taverna, GenePattern \new{(deprecated)}, Kepler or VisTrails \new{(deprecated)}, which were mostly introduced in the 2000s and used Java or Python 2 encourage modularity and robust job management.
+\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
+These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.}
 The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project.
 Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
 This includes automatic verification, which is possible in many solutions, but is rarely practiced.
@@ -176,7 +182,7 @@ However, because of their complex dependency trees, their build is vulnerable to
 It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
 Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
 \new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting which cell to run.
-The default cells do not include dependencies (so some cells run only after certain others are re-done), parallel execution, or usage of more than one language.
+The default cells do not include dependencies (requiring some cells to be run only after certain others are re-done), parallel execution, or usage of more than one language.
 There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
 However, since they are not part of the core and have their own dependencies, their longevity can be assumed to be shorter.
 Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
@@ -253,7 +259,7 @@ When the software used by the project is itself also free, the lineage can be tr
 In contrast, non-free tools typically cannot be distributed or modified by others, making it reliant on a single supplier (even without payments).
 \new{It may happen that proprietary software is necessary to convert proprietary data formats produced by special hardware (for example micro-arrays in genetics) into free data formats.
-In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example on Zenodo).}
+In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example, on Zenodo).}
@@ -299,22 +305,22 @@ Acting as a link, the macro files build the core skeleton of Maneage.
 For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
 These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
 \new{Furthermore, the machine related specifications of the running system (including hardware name and byte-order) are also collected and cited.
-These can help in \emph{root cause analysis} of observed differences/issues in the execution of the wokflow on different machines.}
+These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
 The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
 All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and of course, the high-level science software.
 \new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
-Zenodo promises long-term archival and also provides a persistant identifier for the files, which is rarely available in each software's webpage.}
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's webpage.}
 On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
 Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
-\new{Finally, all software cannot be built on all CPU architectures, hence by default it is included in the final built paper automatically, see below.}
+\new{Finally, some software cannot be built on some CPU architectures, hence by default, the architecture is included in the final built paper automatically (see below).}
-\new{Because, everything is built from source, building the core Maneage environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
+\new{Because everything is built from source, building the core Maneage environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
 When the analysis involves complex computations, this is negligible compared to the actual analysis.
-Also, due to the Git features blended into Maneage, it is best (from the perspective of provenance) to start a Maneage'd project with the start of a project and keep the history of changes, as the project matures.
+Also, due to the Git features blended into Maneage, it is best (from the perspective of provenance) to start a project immediately within Maneage, thereby recording the history of changes as the project matures.
 To avoid repeating the build on different systems, Maneage'd projects can be built in a container or VM.
-In fact the \inlinecode{README.md} \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{has instructions} on building a Maneage'd project in Docker.
-Through Docker (or VMs) users on Microsoft Windows can benefit from Maneage, and for Windows-native software that can be run in batch-mode, technologies like Windows Subsystem for Linux can be used.}
+The \inlinecode{README.md} file \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{has instructions} on building a Maneage'd project in Docker.
+Through Docker (or VMs), users on Microsoft Windows can benefit from Maneage, and for Windows-native software that can be run in batch-mode, technologies like Windows Subsystem for Linux can be used.}
 The analysis phase of the project however is naturally different from one project to another at a low-level.
 It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
@@ -335,7 +341,7 @@ Figure \ref{fig:datalineage} (right) is the data lineage graph that produced it
     Green files/boxes are plain-text files that are under version control and in the project source directory.
     Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
     For example, \inlinecode{paper.pdf} \new{is created by running \LaTeX{} on} \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
-    \new{Other software are used in other steps.}
+    \new{Other software is used in other steps.}
     The solid arrows and full-opacity built boxes correspond to the lineage of this paper.
     The dotted arrows and built boxes show the scalability of Maneage (ease of adding hypothetical steps to the project as it evolves).
     The underlying data of the left plot is available at
@@ -408,13 +414,13 @@ Furthermore, the configuration files are a prerequisite of the targets that use
 If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
 This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-\new{To summarize, in contrast to notebooks like Jupyter, in a ``Maneage''d project the analysis scripts and configuration parameters are not blended into the running code (and all stored in one file).
-Based on the modularity criteria, the analysis steps are run in their own files (for their own respective language, thus maximally benefiting from its unique features) and the narrative has its own file(s).
+\new{To summarize, in contrast to notebooks like Jupyter, in a ``Maneage''d project the analysis scripts and configuration parameters are not blended into the running code (nor stored together in a single file).
+To satisfy the modularity criterion, the analysis steps are run in their own files (for their own respective language, thus maximally benefiting from its unique features) and the narrative has its own file(s).
 The analysis communicates with the narrative through intermediate files (the \LaTeX{} macros), enabling much better blending of analysis outputs in the narrative sentences than is possible with the high-level notebooks and enabling direct provenance tracking.}
 To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
 Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
-\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (also archived on \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
+\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
 Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
 There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}).
@@ -423,7 +429,7 @@ These macros are created in \inlinecode{initialize.mk}, with \new{other basic in
 The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
 in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
-\new{Generally, any git flow (branching strategies) can be used by the high-level project authors or future readers.}
+\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
 Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
 Finally, the complete project source is usually $\sim100$ kilo-bytes.
@@ -555,7 +561,7 @@ Nadia Tonello,
 Ignacio Trujillo and
 the AMIGA team at the Instituto de Astrof\'isica de Andaluc\'ia
 for their useful help, suggestions, and feedback on Maneage and this paper.
-\new{The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) also provided many very helpful points to clarify the points made in this paper.}
+\new{The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) provided many points that greatly helped to clarify this paper.}
 This project was developed in the reproducible framework of Maneage (\emph{Man}aging data lin\emph{eage})
 \new{on Commit \inlinecode{\projectversion} (in the project branch).
@@ -1544,7 +1550,7 @@ This is a major problem for scientific projects: in principle (not knowing how t
 Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
 The user specifies the necessary operating system, and necessary packages and analysis steps in variuos JSON files.
 Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunits mentioned above to find the best environment isolator (maybe using Linux containerization, containers or VMs).
-We couldn't find a URL to the source software of Umbrella (no source code repository has been mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now depreciated).
+We couldn't find a URL to the source software of Umbrella (no source code repository has been mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
