From 9363210b37d6399acdc1d990cb9826e64c38ef5a Mon Sep 17 00:00:00 2001 From: David Valls-Gabaud Date: Mon, 1 Jun 2020 21:43:50 +0100 Subject: Edits by David These are some corrections that David sent to me by email and I am committing here. --- paper.tex | 210 ++++++++++++++++++++++++++++++++------------------------------ 1 file changed, 110 insertions(+), 100 deletions(-) diff --git a/paper.tex b/paper.tex index 7ef7b53..801b380 100644 --- a/paper.tex +++ b/paper.tex @@ -60,17 +60,17 @@ % in the abstract or keywords. \begin{abstract} %% CONTEXT - Reproducible workflow solutions commonly use the high-level technologies that were popular when they were created, providing an immediate solution that is unlikely to be sustainable in the long term. + Reproducible workflow solutions commonly use high-level technologies that were popular when they were created, providing an immediate solution which is however unlikely to be sustainable in the long term. %% AIM - We aim to introduce a set of criteria to address this problem and to demonstrate their practicality. + We therefore introduce a set of criteria to address this problem and demonstrate their practicality and implementation. %% METHOD - The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; linking analysis with narrative; temporal provenance; scalability; and free-and-open-source software. + The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; minimal + complexity; scalability; verifiable inputs and outputs; temporal provenance; linking analysis with narrative; and free-and-open-source software. 
These criteria are not limited to long-term reproducibility, but also provide immediate benefits for short-term reproducibility. %% RESULTS - Through an implementation, called ``Maneage'', we find that storing the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification. - Furthermore, these criteria are not limited to long-term reproducibility, but also provide immediate benefits for short-term reproducibility. + They are implemented in a tool, called ``Maneage'', which stores the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification. %% CONCLUSION - We conclude that requiring longevity of a reproducible workflow solution is realistic. - We discuss the benefits of these criteria for scientific progress. + We show that requiring longevity of a reproducible workflow solution is realistic, and + discuss the benefits of these criteria for scientific progress. \end{abstract} % Note that keywords are not normally used for peerreview papers. @@ -103,61 +103,63 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}. Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, to Java in the 2000s and mostly shifting to Python during the last decade. -However, these technologies develop very fast, e.g., Python 2 code often cannot run with Python 3, interrupting many projects in the last decade. -The cost of staying up to date within this rapidly evolving landscape is high. 
+However, these technologies develop very fast, e.g., Python 2 code often cannot run with Python 3, + % interrupting many projects in the last decade. +% DVG: I would refrain from saying this unless we can cite examples which have shown that going from 2 to 3 has prevented them. +and the cost of staying up to date within this rapidly-evolving landscape is high. Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations. -Decades later, scientists are still held accountable for their results. -Hence, the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons which are too hands-on to be published in a traditional scientific paper. +Decades later, scientists are still held accountable for their results and therefore + the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons which are too hands-on to be published in a traditional scientific paper. \section{Commonly used tools and their longevity} -Longevity is important in science and some fields of industry, but this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects. -To highlight the necessity of longevity, some of the most commonly used tools are reviewed here from this perspective. +Longevity is as important in science as in some fields of industry, but this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects. +To highlight the necessity of longevity, some of the most commonly-used tools are reviewed here from this perspective. 
A common set of third-party tools that are used by most solutions can be categorized as: (1) environment isolators -- virtual machines (VMs) or containers; (2) package managers (PMs) -- Conda, Nix, or Spack; (3) job management -- shell scripts, Make, SCons, or CGAT-core; (4) notebooks -- such as Jupyter. -To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but discontinued in 2019). -However, containers (in particular, Docker, and to a lesser degree, Singularity) are by far the most widely used solution today, we will thus focus on Docker here. +To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019). +However, since containers (in particular, Docker, and to a lesser degree, Singularity) are by far the most widely-used solution today, we will focus on Docker here. -Ideally, it is possible to precisely identify the images that are imported into a Docker container by their checksums. -But that is rarely practiced in most solutions that we have studied. +Ideally, it is possible to precisely identify the images that are imported into a Docker container by their checksums, +but that is rarely practiced in most solutions that we have surveyed. Usually, images are imported with generic operating system (OS) names e.g. \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'. -The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the most recent five images. -If the Dockerfile is run in different months, it will contain different core OS components. 
-In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL. -This is similar for other OSes: pre-built binary files are large and expensive to maintain and archive. +The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly with different software versions and only archives the most recent five images. +Hence, if the Dockerfile is run in different months, it will contain different core OS components. +In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL. +The situation is similar for other OSes: pre-built binary files are large and expensive to maintain and archive. Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable. Once the host OS is ready, PMs are used to install the software, or environment. Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs. -The former suffers from the same longevity problem as the OS. -Some of the latter (like Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation. +The former suffers from the same longevity problem as the OS, while +some of the latter (like Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation. Nix and GNU Guix produce bit-wise identical programs, but they need root permissions. -Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although that could be implemented). +Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although this could be implemented). 
Therefore, unless precise version identifiers of \emph{every software package} are stored, a PM will use the most recent version. Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity. With the software environment built, job management is the next component of a workflow. -Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the project authors. +Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors of the project. Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices. This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive). -Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used. +Finally, to add narrative, computational notebooks \cite{rule18}, like Jupyter, are being increasingly used. However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies. -The longevity of a project is determined by its shortest-lived dependency. -Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management. 
-Hence they can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}. +It is important to remember that the longevity of a project is determined by its shortest-lived dependency. +Further, as with job management, computational notebooks do not actively encourage good practices in programming or project management. +Hence they can rarely deliver their promised potential \cite{rule18} and can even hamper reproducibility \cite{pimentel19}. An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}). Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage. This is possible due to the focus on low-level algorithms that do not need any dependencies beyond an ISO C compiler. -Many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable. +Unfortunately, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, and hence this solution is not scalable. @@ -165,9 +167,9 @@ Many data-intensive projects commonly involve dozens of high-level dependencies, \section{Proposed criteria for longevity} The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}. -Researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles). +In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles). 
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically infeasible or unscalable. -We argue that workflows satisfying the criteria below can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers. +We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers: \textbf{Criterion 1: Completeness.} A project that is complete (self-contained) has the following properties. @@ -178,7 +180,8 @@ IEEE defined POSIX (a minimal Unix-like environment) and many OSes have complied (4) It does not require root or administrator privileges. (5) It builds its own controlled software for an independent environment. (6) It can run locally (without an internet connection). -(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures, e.g., PDF or HTML. +(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures, e.g., PDF or HTML. +% DVG: but the PDF standard is owned by Adobe(TM), and there are many versions of HTML ... so the long-term validity is jeopardised. (8) It can run automatically, with no human interaction. \textbf{Criterion 2: Modularity.} @@ -199,29 +202,29 @@ More stable/basic tools can be used with less long-term maintenance. \textbf{Criterion 4: Scalability.} A scalable project can easily be used in arbitrarily large and/or complex projects. 
-On a small scale, the criteria here are trivial to implement, but can become unsustainable very soon. +On a small scale, the criteria here are trivial to implement, but can become unsustainable very rapidly. \textbf{Criterion 5: Verifiable inputs and outputs.} The project should verify its inputs (software source code and data) \emph{and} outputs. Reproduction should be straightforward enough such that ``\emph{a clerk can do it}''\cite{claerbout1992} (with no expert knowledge). \textbf{Criterion 6: History and temporal provenance.} -No exploratory research project is done in a single/first attempt. +No exploratory research project is done in a single, first attempt. Projects evolve as they are being completed. It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed. -These types of research papers often report this with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''. -The ``history'' is thus as valuable as the final/published version. +Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''. +The ``history'' is thus as valuable as the final, published version. \textbf{Criterion 7: Including narrative, linked to analysis.} A project is not just its computational analysis. A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it. -A narrative description is also part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers. 
+A narrative description must also be part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers. This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may get separated from its accompanying published paper. \textbf{Criterion 8: Free and open source software:} Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box). -This criterion is necessary to complement that definition (nature is already a black box!). -If a project is free software (as formally defined), then others can learn from, modify, and build on it. +This criterion is necessary to complement that definition (nature is already a black box!) because +if a project is free software (as formally defined), then others can learn from, modify, and build on it. When the software used by the project is itself also free: (1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level. (2) The source can be modified to work on future hardware. @@ -238,48 +241,50 @@ In contrast, a non-free software package typically cannot be distributed by othe \section{Proof of concept: Maneage} -With the longevity problems of existing tools outlined above, a proof of concept is presented via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. -It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective. 
+With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. +It was in fact awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective. -The proof-of-concept is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. +The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. It was developed as a parallel research project over five years of publishing reproducible workflows of our research. The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. -Technically, the hardest criterion to implement was the completeness criterion (and, in particular, avoiding non-POSIX dependencies). -Minimizing complexity was also difficult. -One proposed solution was the Guix Workflow Language (GWL), which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM). -However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion. +Technically, the hardest criterion to implement was the first one (completeness and, in particular, avoiding non-POSIX dependencies). +Minimizing complexity (criterion 3) was also difficult. 
+A proposed solution was the Guix Workflow Language (GWL), written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM), but because + Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion. -Inspired by GWL+Guix, a single job management tool was used for both installing of software \emph{and} the analysis workflow: Make. -Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C). +Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make. +Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell or C). Make is standardized in POSIX and is used in almost all core OS components. -It is thus mature, actively maintained and highly optimized (and efficient in managing exact provenance). -Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had some exposure to it (when building research software). +It is thus mature, actively maintained and highly optimized (and efficient in managing exact provenance), and + was recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000} and many researchers have already had some exposure to it. % DVG: I think this parenthesis is not needed: (when building research software). %However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics. -Linking the analysis and narrative was another major design choice. +Linking the analysis and narrative (criterion 7) was another major design choice. 
Literate programming, implemented as Computational Notebooks like Jupyter, is currently popular. -However, due to the problems above, our implementation follows a more abstract linkage, providing a more direct and precise, but modular connection (modularized into specialised files). +However, due to the problems above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection (modularized into specialised files). -Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically created \LaTeX{} macros (during the analysis). -For example, in \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. +Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is +through automatically-created \LaTeX{} macros, during the analysis. +For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'. The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built. -Since values like this depend on the analysis, they should also be reproducible, along with figures and tables. +Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables. These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command. -This allows accurate provenance post-publication \emph{and} automatic updates to the text prior to publication. 
-Manually updating these in the narrative is prone to errors and discourages improvements after writing the first draft. +This allows accurate post-publication provenance \emph{and} automatic updates to the text prior to publication, thus by-passing +the manual update in the narrative which is prone to errors and discourages improvements after writing the first draft. Acting as a link, these macro files build the core skeleton of Maneage. For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation. -These are combined in the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; excluded here due to the strict word limit). +These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which +are excluded here due to the strict word limit. The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}). -Software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment. +All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment. On GNU/Linux operating systems, the GNU Compiler Collection (GCC) is also built from source and the GNU C library is being added (task 15390). -Fast relocation of a project (without building from source) can be done by building the project in a container or VM. +The fast relocation of a project (without building from source) can be done by building the project in a container or VM. 
-In building software, normally the only difference between projects is choice of which software to build. +When building software, the only difference between projects is usually the choice of the software. However, the analysis will naturally be different from one project to another at a low-level. It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity. We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top). @@ -329,7 +334,7 @@ include $(foreach s,$(makesrc), \ The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}). It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself. -This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them). +This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them). A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice). A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows. @@ -338,13 +343,13 @@ Project authors add their modular subMakefiles in between. 
Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same basename (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}). Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files. -Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy verification criteria. -All project deliverables (macro files, plot or table data and other datasets) are verified with their checksums here to automatically ensure exact reproducibility. -Where exact reproducibility is not possible, values can be verified by any statistical means (specified by the project authors). -This step was not yet implemented in \cite{akhlaghi19, infante20}. +Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criteria. +All project deliverables (macro files, plot or table data and other datasets) are thus verified at this stage, with their checksums, to automatically ensure exact reproducibility. +Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors +(this step was not implemented in \cite{akhlaghi19, infante20}). To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files. -By convention in Maneage, the subMakefiles, and the programs they call for number crunching, do not contain any fixed numbers, settings or parameters. +By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings or parameters. 
Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make. For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage. To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot). @@ -370,7 +375,7 @@ This fast and cheap testing encourages experimentation (without necessarily know Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git), plays a crucial role in Maneage, as shown in Figure \ref{fig:branching}. In practice, Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script). -Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for its particular analysis), see Listing \ref{code:branching}. +Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for its particular analysis, see Listing \ref{code:branching}). \begin{lstlisting}[ label=code:branching, @@ -392,11 +397,11 @@ $ ./project make # Re-build to see effect. 
$ git add -u && git commit # Commit changes \end{lstlisting} -As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import (technically: \emph{merge}) Maneage into a project and improve the low-level infrastructure: +Thanks to this architecture (Figure \ref{fig:branching}), it is always possible to import (technically: \emph{merge}) Maneage into a project and improve the low-level infrastructure: in (a) the authors merge Maneage during an ongoing project; in (b) readers do it after the paper's publication, e.g., when the project remains reproducible but the infrastructure is outdated, or a bug is found in Maneage. -Low-level improvements in Maneage can thus easily propagate to all projects. -This greatly reduces the cost of curation and maintenance of each individual project, before \emph{and} after publication. +In this way, low-level improvements in Maneage can easily propagate to all projects, greatly reducing + the cost of curation and maintenance of each individual project, before \emph{and} after publication. @@ -413,56 +418,58 @@ This greatly reduces the cost of curation and maintenance of each individual pro %% Attempt to generalise the significance. %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally. -We have shown that it is possible to build workflows satisfying the proposed criteria. -Here, we comment on our experience in testing them through the proof of concept. -We will discuss the design principles, and how they may be generalized and usable in other projects. -In particular, with the support of RDA, the user base grew phenomenally, highlighting some difficulties for wide-spread adoption. 
+We have shown that it is possible to build workflows satisfying all the proposed criteria, and +we comment here on our experience in testing them through this proof-of-concept tool, which, +%We will discuss the design principles, and how they may be generalized and usable in other projects. + with the support of RDA, enabled its user base to grow phenomenally, underscoring some difficulties for widespread adoption. Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used. Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools. Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them. -Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving on to the next technology. +Indeed, the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology. Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity. -Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} robust data carpentry. -We have seen that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
+Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, the software management \emph{and} a robust data carpentry. +We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually. -Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several different GNU/Linux distributions and on macOS). -This requires maintenance by our core team and consumes time and energy. -However, the PM and analysis share the same job manager (Make), design principles and conventions. -We have thus found that more than once, advanced users add, or fix, their required software alone and share their low-level commits on the core branch, thus propagating it to all derived projects. +Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems +(Maneage was tested on several different GNU/Linux distributions and on macOS). +This requires maintenance by our core team and consumes time and energy, but + the PM and analysis share the same job manager (Make), design principles and conventions. +We have found that, more than once, advanced users add, or fix, their required software alone and share their low-level commits on the core branch, thus propagating it to all derived projects. -On a related note, POSIX is a fuzzy standard that does not guarantee bit-wise reproducibility of programs. -However, it has been chosen as the underlying platform here because the results (data) are our focus, not the compiled software. 
+On a related note, POSIX is a fuzzy standard that does not guarantee the bit-wise reproducibility of programs. +It has been chosen here, however, as the underlying platform because we focus on the results (data), not on the compiled software. POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on most POSIX systems; each internally corrects for differences affecting its functionality (partly as part of the GNU portability library). -On GNU/Linux hosts, Maneage builds precise versions of the GNU Compiler Collection (GCC), GNU Binutils and GNU C library (glibc). -However, glibc is not installable on some POSIX OSs (e.g., macOS). -The C library is linked with all programs. -This dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far. +On GNU/Linux hosts, Maneage builds precise versions of the GNU Compiler Collection (GCC), GNU Binutils and GNU C library (glibc), but + glibc is not installable on some POSIX OSs (e.g., macOS). +The C library is linked with all programs, and +this dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far. With everything else under precise control, the effect of differing Kernel and C libraries on high-level science results can now be systematically studied with Maneage. +% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal. %Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors. 
%We propose these solutions: %1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote. %This is a long-term goal and would require major changes to academic value systems. %2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it. -Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above. -However, the proof of concept already shows many advantages in adopting the criteria. -For example, publication of projects with these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning). -Because of the completeness criteria, algorithms and data selection can be similarly optimized. -Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative. +Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but +this proof of concept has shown many advantages in adopting the proposed criteria. +For example, the publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning). 
+The completeness criterion implies that algorithms and data selection can be similarly optimized and +furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative. Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''. -As another example, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured. +Likewise, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured. Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point). Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders. -From the data repository perspective, these criteria can also be very useful, e.g., with regard to the challenges mentioned in \cite{austin17}: +From the data repository perspective, these criteria can also be very useful with regard to the challenges mentioned in \cite{austin17}: (1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, improving sustainability. (2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control. (3) Software management. -With these criteria, each project's unique and complete software management is included: it is not a third-party PM that needs to be maintained by the data center employees. -These criteria enable easy management, preservation, publishing and citation of the software used.
+With these criteria, we ensure that each project's unique and complete software management is included. It is not a third-party PM that needs to be maintained by the data center, and they + enable the easy management, preservation, publishing and citation of the software used. For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to distribute the tarballs of all the software used with the project's source and deliverables. (4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarises the whole purpose of these criteria. @@ -478,7 +485,7 @@ Julia Aguilar-Cabello, Alice Allen, Pedram Ashofteh Ardakani, Roland Bacon, -Antonio Diaz Diaz, +Antonio D\'iaz D\'iaz, Surena Fatemi, Fabrizio Gagliardi, Konrad Hinsen, @@ -498,7 +505,8 @@ for their useful help, suggestions and feedback on Maneage and this paper. Work on Maneage, and this paper, has been partially funded/supported by the following institutions: The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhlaghi and its Grant-in-Aid for Scientific Research (21244012, 24253003). The European Research Council (ERC) advanced grant 339659-MUSICOS. -The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN. +The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie +Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN. 
The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P. The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment. The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project ``Using machine-learning techniques to drag galaxies from the noise in deep imaging''. @@ -544,6 +552,8 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314 \begin{IEEEbiographynophoto}{David Valls-Gabaud} is a CNRS Research Director at the Observatoire de Paris, France. His research interests span from cosmology and galaxy evolution to stellar physics and instrumentation. + He is adamant about ensuring scientific results are fully reproducible. Educated at the universities of + Madrid (Complutense), Paris and Cambridge, he obtained his PhD in astrophysics in 1991. Contact him at david.valls-gabaud@obspm.fr. \end{IEEEbiographynophoto} -- cgit v1.2.1