-rw-r--r--  paper.tex | 64
1 file changed, 33 insertions, 31 deletions
@@ -60,16 +60,17 @@ % in the abstract or keywords.
 \begin{abstract}
   %% CONTEXT
-  Reproducible workflow solutions commonly use the high-level technologies that were popular when they were created, providing an immediate solution that are unlikely to be sustainable in the long term.
+  Reproducible workflow solutions commonly use the high-level technologies that were popular when they were created, providing an immediate solution that is unlikely to be sustainable in the long term.
   %% AIM
   We aim to introduce a set of criteria to address this problem and to demonstrate their practicality.
   %% METHOD
-  The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain-text); modular design; linking analysis with narrative; temporal provenance; scalability; and free-and-open-source software.
+  The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; linking analysis with narrative; temporal provenance; scalability; and free-and-open-source software.
   %% RESULTS
-  Through an implementation, called ``Maneage'', we find that storing the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-review-able verification.
-  Furthermore, we find that these criteria are not limited to long-term reproducibility but also provide immediate, fast short-term reproducibility benefits.
+  Through an implementation, called ``Maneage'', we find that storing the project in machine-actionable and human-readable plain text enables version control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
+  Furthermore, we find that these criteria are not limited to long-term reproducibility, but also provide immediate benefits for short-term reproducibility.
   %% CONCLUSION
-  We conclude that requiring longevity from solutions is realistic and discuss the benefits of these criteria for scientific progress.
+  We conclude that requiring longevity of a reproducible workflow solution is realistic.
+  We discuss the benefits of these criteria for scientific progress.
 \end{abstract}
 
 % Note that keywords are not normally used for peerreview papers.
@@ -99,19 +100,19 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
 % by the rest of the first word in caps.
 %\IEEEPARstart{F}{irst} word
-Reproducible research has been discussed in the sciences for about 30 years \cite{claerbout1992, fineberg19}.
-Many solutions have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, to Java in the 2000s and in the last decade they are mostly based on Python.
-Recently controlling the environment has been facilitated through generic package managers (PMs) and containers.
+Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
+Many solutions have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
+Recently, controlling the environment has been facilitated through generic package managers (PMs) and containers.
 
-However, because of their high-level nature, such third party tools for the workflow (not the analysis) grow very fast, e.g., Python 2 code cannot run with Python 3, interrupting many projects.
-Furthermore, containers (in custom binary formats) are also being heavily used recently, but are large (Gigabytes) and expensive to archive.
-Also, once the binary format is obsolete, reading or parsing the project is not possible.
+However, because of their high-level nature, such third-party tools for the workflow (not the analysis) develop very fast, e.g., Python 2 code often cannot run with Python 3, interrupting many projects.
+Containers (in custom binary formats) are also being heavily used, but are large (gigabytes) and expensive to archive.
+Moreover, once the binary format is obsolete, reading or parsing the project becomes impossible.
 
-The cost of staying up to date with this evolving landscape is high.
-Scientific projects in particular suffer the most: scientists have to focus on their own research domain, but they also need to understand the used technology to a certain level, because it determines their results and interpretations.
-Decades later, they are also still held accountable for their results.
-Hence, the evolving technology creates generational gaps in the scientific community, not allowing the previous generations to share valuable lessons which are too low-level to be published in a traditional scientific paper.
-As a solution to this problem, here we introduce a criteria that can guarantee the longevity of a project based on our experiences with existing solutions.
+The cost of staying up to date within this rapidly evolving landscape is high.
+Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations.
+Decades later, scientists are still held accountable for their results.
+The evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons that are too low-level to be published in a traditional scientific paper.
+As a solution to this problem, here we introduce a set of criteria that can guarantee the longevity of a project, based on our experience with existing solutions.
@@ -119,23 +120,22 @@ As a solution to this problem, here we introduce a criteria that can guarantee t
 \section{Commonly used tools and their longevity}
 To highlight the necessity of longevity, some of the most commonly used tools are reviewed here, from the perspective of long-term usability.
-We recall that while longevity is important in some fields (like the sciences and some industries), it isn't the case in others (e.g., short term commercial projects), hence the usage of fast-evolving tools.
+We recall that while longevity is important in some fields (like science and some industries), it is not the case in others (e.g., short-term commercial projects), hence the usage of fast-evolving tools.
 Most existing reproducible workflows use a common set of third-party tools that can be categorized as:
-(1) Environment isolators like virtual machines, containers and etc.
-(2) PMs like Conda, Nix, or Spack,
-(3) Job management like scripts, Make, SCons, and CGAT-core,
+(1) Environment isolators like virtual machines or containers;
+(2) PMs like Conda, Nix, or Spack;
+(3) Job management like scripts, Make, SCons, and CGAT-core;
 (4) Notebooks like Jupyter.
 
-To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded 2nd prize in the Elseiver Executable Paper Grand Challenge of 2011 and discontinued in 2019).
-However, containers (in particular Docker and lesser, Singularity) are by far the most used solution today, we will focus on Docker here.
+To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded 2nd prize in the Elsevier Executable Paper Grand Challenge of 2011 but discontinued in 2019).
+However, containers (in particular Docker and, to a lesser degree, Singularity) are by far the most widely used solution today, so we will focus on Docker here.
 
-%% Note that L. Barba (second author of this paper) is the editor in chief of CiSE.
 Ideally, it is possible to precisely version/tag the images that are imported into a Docker container.
 But that is rarely practiced in most solutions that we have studied.
 Usually, images are imported with generic operating system names e.g. \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
 The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the last 5 images.
 Hence if the Dockerfile is run in different months, it will contain different core operating system components.
-Furthermore, in the year 2024, when the long-term support for this version of Ubuntu expires, it will be totally removed.
+In the year 2024, when the long-term support for this version of Ubuntu expires, it will be totally removed.
 This is similar in other OSs: pre-built binary files are large and expensive to maintain and archive.
 Furthermore, Docker requires root permissions, and only supports recent (in ``long-term-support'') versions of the host kernel, hence older Docker images may not be executable.
@@ -143,7 +143,7 @@ Once the host OS is ready, PMs are used to install the software, or environment.
 Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with more generic PMs like Conda, Nix, GNU Guix or Spack.
 The OS PM suffers from the same longevity problem as the OS.
 Some third-party tools like Conda and Spack are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
-Nix and GNU Guix do not have any dependencies and produce bit-wise identical programs, however, they need root permissions.
+Nix and GNU Guix do not have any dependencies and produce bit-wise identical programs, but they need root permissions.
 Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although it is possible).
 Therefore, unless precise versions of \emph{every software} are stored, they will use the most recent version.
 Furthermore, because each third party PM introduces its own language and framework, they increase the project's complexity.
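Both hunks above turn on imprecise version pinning: the floating `\inlinecode{FROM ubuntu:16.04}' tag and PM build instructions that silently take the most recent version. A minimal shell sketch of what precise pinning looks like follows; it is illustrative only, not from the paper: the printed digest is a placeholder and `proj' is a hypothetical environment name. As the hunks themselves note, pinning only helps for as long as the registry or channel keeps the pinned artifacts available.

    # Resolve the floating tag to an immutable content digest
    # (the sha256 value printed here is a placeholder):
    docker pull ubuntu:16.04
    docker inspect --format '{{index .RepoDigests 0}}' ubuntu:16.04
    # In the Dockerfile, pin to the exact image bytes:
    #   FROM ubuntu@sha256:<digest-printed-above>

    # Similarly for Conda: store the precise version and build of
    # every package in the (hypothetical) 'proj' environment ...
    conda list --name proj --explicit > spec-file.txt
    # ... and rebuild the identical environment later, provided the
    # channel still hosts those exact builds:
    conda create --name proj-rebuilt --file spec-file.txt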
@@ -151,14 +151,13 @@ Furthermore, because each third party PM introduces its own language and framewo
 With the software environment built, job management is the next component of a workflow.
 Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) do encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to project authors.
 Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in data management) will rarely apply best practices in project management and data carpentry.
-This includes automatic verification: while it is possible in many solutions, it is rarely practiced.
-This leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).
+This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
 
 Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used in many solutions.
 However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib; one of the more simple Jupyter dependencies.
-The longevity of a project is determined by its shortest-lived dependency.
-Furthermore, similar to the point above on job management, they don't actively encourage good practices in programming or project management.
-Therefore, notebooks can rarely deliver their promised potential\cite{rule18} or can even hamper reproducibility \cite{pimentel19}.
+The longevity of a project is thus determined by its shortest-lived dependency.
+Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
+Therefore, notebooks can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
 
 An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
 Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is build-able on all operating systems) with example images/data that can also be executed on their webpage.
@@ -512,7 +511,10 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
 \end{IEEEbiographynophoto}
 
 \begin{IEEEbiographynophoto}{David Valls-Gabaud}
-  is a professor at the Observatoire de Paris, France.
+  Observatoire de Paris
+
+  David Valls-Gabaud is a CNRS Research Director at the Observatoire de Paris, France.
+  His research interests span from cosmology and galaxy evolution to stellar physics and instrumentation.
   Contact him at david.valls-gabaud@obspm.fr.
 \end{IEEEbiographynophoto}
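The job-management hunk above observes that automatic verification, while possible in many solutions, is rarely practiced. It can be very small; the following POSIX shell sketch (file and directory names are invented, not from the paper) is one hypothetical way to end every full run of a project's workflow:

    #!/bin/sh
    # Checksums were recorded once, when the outputs were first
    # validated by the authors:
    #   sha256sum out/*.csv > verify.sha256
    # Every subsequent full run ends by re-checking them, so any
    # change in the analysis outputs is caught immediately:
    sha256sum --check verify.sha256 || {
        echo "verification failed: outputs differ from validated results" >&2
        exit 1
    }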