-rw-r--r--   paper.tex | 18
1 file changed, 9 insertions, 9 deletions
@@ -1,7 +1,7 @@
 %% Main LaTeX source of project's paper, license is printed in the end.
 %
 %% Copyright (C) 2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%% Copyright (C) 2020 Raúl Infante-Saiz <infantesainz@gmail.com>
+%% Copyright (C) 2020 Raúl Infante-Sainz <infantesainz@gmail.com>
 %% Copyright (C) 2020 Boudewijn F. Roukema <boud@astro.uni.torun.pl>
 %% Copyright (C) 2020 David Valls-Gabaud <david.valls-gabaud@obspm.fr>
 %% Copyright (C) 2020 Roberto Baena-Gallé <roberto.baena@gmail.com>
@@ -67,7 +67,7 @@ The criteria have been tested in several research publications and can be summarized as:
 completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; linking analysis with narrative; temporal provenance; scalability; and free-and-open-source software.
 %% RESULTS
 Through an implementation, called ``Maneage'', we find that storing the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
-Furthermore, we find that these criteria are not limited to long-term reproducibility, but also provide immediate benefits for short-term reproducibility.
+Furthermore, these criteria are not limited to long-term reproducibility, but also provide immediate benefits for short-term reproducibility.
 %% CONCLUSION
 We conclude that requiring longevity of a reproducible workflow solution is realistic.
 We discuss the benefits of these criteria for scientific progress.
@@ -117,7 +117,7 @@ As a solution to this problem, here we introduce a set of criteria that can guar
 \section{Commonly used tools and their longevity}
 To highlight the necessity of longevity, some of the most commonly used tools are reviewed here, from the perspective of long-term usability.
-While longevity is important in science and some fields of industry, this isn't always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects.
+While longevity is important in science and some fields of industry, this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects.
 Most existing reproducible workflows use a common set of third-party tools that can be categorized as:
 (1) environment isolators -- virtual machines (VMs) or containers;
 (2) PMs -- Conda, Nix, or Spack;
@@ -129,9 +129,9 @@ However, containers (in particular, Docker, and to a lesser degree, Singularity)
 Ideally, it is possible to precisely identify the images that are imported into a Docker container by their checksums.
 But that is rarely practiced in most solutions that we have studied.
-Usually, images are imported with generic operating system names e.g. \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
+Usually, images are imported with generic operating system (OS) names e.g. \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
 The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the most recent five images.
-If the Dockerfile is run in different months, it will contain different core operating system components.
+If the Dockerfile is run in different months, it will contain different core OS components.
 In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL.
 This is similar for other OSes: pre-built binary files are large and expensive to maintain and archive.
 Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable.
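The checksum-based identification mentioned in this hunk can be sketched as a minimal Dockerfile fragment; the digest shown is a placeholder for illustration, not the checksum of any real published Ubuntu image:

```dockerfile
# Tag-based import: `ubuntu:16.04` is re-published roughly monthly,
# so the same Dockerfile builds different images over time.
#FROM ubuntu:16.04

# Digest-based import: an immutable reference to one exact image.
# (Placeholder digest -- substitute the checksum of the image actually used.)
FROM ubuntu@sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
```

Note that even a pinned digest only fixes *which* image is fetched; it does not stop the registry from eventually dropping that image, which is the archival concern raised in the surrounding text.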
@@ -146,18 +146,18 @@ Therefore, unless precise version identifiers of \emph{every software package} a
 Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity.

 With the software environment built, job management is the next component of a workflow.
-Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to project authors.
+Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the project authors.
 Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in data management) will rarely apply best practices in project management and data carpentry.
 This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).

 Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used in many solutions.
 However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
 The longevity of a project is determined by its shortest-lived dependency.
-Furthermore, as with job management, computational notebooks don't actively encourage good practices in programming or project management.
+Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
 Hence they can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}.

 An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
-Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used operating system) with example images/data that can also be executed on their webpage.
+Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage.
 This is possible due to the focus on low-level algorithms that do not need any dependencies beyond an ISO C compiler.
 Many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
@@ -174,7 +174,7 @@ In this paper we argue that workflows satisfying the criteria below can improve
 \textbf{Criterion 1: Completeness.}
 A project that is complete (self-contained) has the following properties.
-(1) It has no dependency beyond the Portable Operating System (OS) Interface: POSIX.
+(1) It has no dependency beyond the Portable Operating System Interface: POSIX.
 IEEE defined POSIX (a minimal Unix-like environment) and many OSes have complied.
 It is a reliable foundation for longevity in software execution.
 (2) ``No dependency'' requires that the project itself must be primarily stored in plain text, not needing specialized software to open, parse or execute.