author    | Boud Roukema <boud@cosmo.torun.pl>       | 2020-05-23 18:41:32 +0200
committer | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-05-23 23:23:44 +0100
commit    | fc32ee43908a7473bbac8cf26da3a7cb02703bae
tree      | b63fa6c30a1e61a2a24b2c361e6f15c6964d27fe
parent    | b1c69a400a677c2595bc6738ab4d6c9b28aedc71
Section II edits + definition of solutions
This commit implements quite a few minor changes in section II.
The aim of most is to clarify the meaning and remove ambiguity.
A few changes rely on the fact that the reader will normally
assume that successive sentences in a paragraph are closely
related in terms of logical flow. It is superfluous - and
considered excessive in (at least) modern astronomy style - to
use too many "Therefore"s and "Hence"s; these should be reserved
for cases where there is a strong chain of reasoning.
One change is made in the Introduction: if we're going to use
"solution(s)" throughout to mean "reproducible workflow
solution(s)", then we have to clearly define this as jargon for
this particular paper. This is probably preferable to an acronym
such as RWS (reproducible workflow solution) or RWI (reproducible
workflow implementation). But we can't simply keep saying
"solution", because that word has many different meanings in a
scientific context.
PDF word count = 5880
-rw-r--r-- | paper.tex | 46
1 file changed, 23 insertions, 23 deletions
@@ -101,7 +101,7 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
 %\IEEEPARstart{F}{irst} word
 Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
-Many solutions have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, Java in the 2000s and mostly shifting to Python during the last decade.
+Many reproducible workflow solutions (hereafter, ``solution(s)'') have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, Java in the 2000s and mostly shifting to Python during the last decade.
 Recently, controlling the environment has been facilitated through generic package managers (PMs) and containers.
 However, because of their high-level nature, such third-party tools for the workflow (not the analysis) develop very fast, e.g., Python 2 code often cannot run with Python 3, interrupting many projects.
@@ -120,49 +120,49 @@ As a solution to this problem, here we introduce a set of criteria that can guar
 \section{Commonly used tools and their longevity}
 To highlight the necessity of longevity, some of the most commonly used tools are reviewed here, from the perspective of long-term usability.
-We recall that while longevity is important in some fields (like in science and some industries), it isn't the case in others (e.g., short-term commercial projects), hence the usage of fast-evolving tools.
+While longevity is important in some fields of science and industry, this isn't always the case: fast-evolving tools can be appropriate in short-term commercial projects.
 Most existing reproducible workflows use a common set of third-party tools that can be categorized as:
-(1) Environment isolators like virtual machines or containers;
-(2) PMs like Conda, Nix, or Spack;
-(3) Job management like scripts, Make, SCons, and CGAT-core;
-(4) Notebooks like Jupyter.
+(1) environment isolators -- virtual machines or containers;
+(2) PMs -- Conda, Nix, or Spack;
+(3) job management -- shell scripts, Make, SCons, or CGAT-core;
+(4) notebooks -- such as Jupyter.
-To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded 2nd prize in the Elsevier Executable Paper Grand Challenge of 2011 but discontinued in 2019).
-However, containers (in particular Docker and to a lesser degree, Singularity) are by far the most widely used solution today, so we will focus on Docker here.
+To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but discontinued in 2019).
+However, containers (in particular, Docker, and to a lesser degree, Singularity) are by far the most widely used solution today, so we will focus on Docker here.
-Ideally, it is possible to precisely version/tag the images that are imported into a Docker container.
+Ideally, it is possible to precisely identify the images that are imported into a Docker container by a version number or tag.
 But that is rarely practiced in most solutions that we have studied.
 Usually, images are imported with generic operating system names e.g. \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
-The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the last 5 images.
-Hence if the Dockerfile is run in different months, it will contain different core operating system components.
-In the year 2024, when the long-term support for this version of Ubuntu expires, it will be totally removed.
-This is similar in other OSs: pre-built binary files are large and expensive to maintain and archive.
-Furthermore, Docker requires root permissions, and only supports recent (in ``long-term-support'') versions of the host kernel, hence older Docker images may not be executable.
+The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the most recent five images.
+If the Dockerfile is run in different months, it will contain different core operating system components.
+In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL.
+This is similar for other OSes: pre-built binary files are large and expensive to maintain and archive.
+Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable.
 Once the host OS is ready, PMs are used to install the software, or environment.
 Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with more generic PMs like Conda, Nix, GNU Guix or Spack.
 The OS PM suffers from the same longevity problem as the OS.
 Some third-party tools like Conda and Spack are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
 Nix and GNU Guix do not have any dependencies and produce bit-wise identical programs, but they need root permissions.
-Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although it is possible).
-Therefore, unless precise versions of \emph{every software} are stored, they will use the most recent version.
-Furthermore, because each third party PM introduces its own language and framework, they increase the project's complexity.
+Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although that could be implemented).
+Therefore, unless precise version identifiers of \emph{every software package} are stored, a PM will use the most recent version.
+Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity.
 With the software environment built, job management is the next component of a workflow.
-Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) do encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to project authors.
+Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to project authors.
 Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in data management) will rarely apply best practices in project management and data carpentry.
 This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).
 Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used in many solutions.
-However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib; one of the more simple Jupyter dependencies.
-The longevity of a project is thus determined by its shortest-lived dependency.
+However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+The longevity of a project is determined by its shortest-lived dependency.
 Furthermore, as with job management, computational notebooks don't actively encourage good practices in programming or project management.
-Therefore, notebooks can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
+Notebooks can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
 An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
-Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is build-able on all operating systems) with example images/data that can also be executed on their webpage.
+Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used operating system) with example images/data that can also be executed on their webpage.
 This is possible due to the focus on low-level algorithms that do not need any dependencies beyond an ISO C compiler.
-Many data-intensive projects, commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, hence this solution is not scalable.
+Many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
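
The second hunk tightens the wording about precisely identifying the images imported into a Docker container. As an illustration only (not part of the patch or of paper.tex), the difference between a mutable tag and a content-addressed digest can be sketched with the Docker command line; the digest value below is a placeholder, not a real identifier:

    # Pulling by tag: 'ubuntu:16.04' is rebuilt regularly and eventually
    # removed, so the same command can yield different images over time.
    docker pull ubuntu:16.04

    # Pulling by digest: names exactly one image, for as long as that
    # image remains archived (placeholder digest shown).
    docker pull ubuntu@sha256:<digest-of-the-exact-image>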
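
Likewise, the revised sentence about storing precise version identifiers of every software package can be made concrete with a package-manager sketch. This is an assumption-laden illustration (Conda as the PM, an arbitrary environment name, example version numbers), not a procedure taken from the paper:

    # Unpinned: the PM resolves whatever versions are current on the day
    # the command is run, so the environment drifts over time.
    conda create --name example-env numpy scipy

    # Pinned: explicit version identifiers are recorded and requested on
    # every run (they must still be served by the package archive).
    conda create --name example-env numpy=1.18.1 scipy=1.4.1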