author     Mohammad Akhlaghi <mohammad@akhlaghi.org>    2021-01-07 17:36:18 +0000
committer  Mohammad Akhlaghi <mohammad@akhlaghi.org>    2021-01-07 17:36:18 +0000
commit     e3f4be66020538e3ab641f91405b8c07582e5862 (patch)
tree       dce8b52d8d4bf5e66cc5601febb1794f3f62377d
parent     e52cbf57ccbe72c8f9a32aaeb927c194c7e485a1 (diff)
Removed all \new highlights after submission of review
With the submission of the revision (in which all the parts relevant to the
referees' points were highlighted in the submitted PDF), it is no longer
necessary to highlight these parts.
If we get another revision request, we can add new '\new' parts for
highlighting.
-rw-r--r--  paper.tex                                | 139
-rw-r--r--  tex/src/appendix-existing-solutions.tex  |   2
-rw-r--r--  tex/src/appendix-existing-tools.tex      |   2

3 files changed, 70 insertions(+), 73 deletions(-)
diff --git a/paper.tex b/paper.tex
--- a/paper.tex
+++ b/paper.tex
@@ -74,7 +74,7 @@
 %% AIM
 A set of criteria is introduced to address this problem:
 %% METHOD
-Completeness (no \new{execution requirement} beyond \new{a minimal Unix-like operating system}, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
 %% RESULTS
 As a proof of concept, we introduce ``Maneage'' (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications.
 %% CONCLUSION
@@ -130,7 +130,7 @@ Reproducible research has been discussed in the sciences for at least 30 years \
 Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day, starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
-However, these technologies develop fast, e.g., code written in Python 2 \new{(which is no longer officially maintained)} often cannot run with Python 3.
+However, these technologies develop fast, e.g., code written in Python 2 (which is no longer officially maintained) often cannot run with Python 3.
 The cost of staying up to date within this rapidly-evolving landscape is high.
 Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree, they need to understand the technology of their tools because it determines their results and interpretations.
 Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
@@ -141,49 +141,48 @@ Decades later, scientists are still held accountable for their results and there
 \section{Longevity of existing tools}
 \label{sec:longevityofexisting}
-\new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
+Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
 Longevity is defined as the length of time that a project remains \emph{functional} after its creation.
 Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary).
 Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data.
-Extracting these from execution outputs is not always possible.}
-A basic review of the longevity of commonly used tools is provided here \new{(for a more comprehensive review, please see
+Extracting these from execution outputs is not always possible.
+A basic review of the longevity of commonly used tools is provided here (for a more comprehensive review, please see
 \ifdefined\separatesupplement
 the supplementary appendices%
 \else%
 appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}%
 \fi%
 ).
-}
 To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019).
 However, containers (e.g., Docker or Singularity) are currently the most widely-used solution.
 We will focus on Docker here because it is currently the most common.
-\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
-However, that is rarely done.}
+It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
+However, that is rarely done.
 Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
 The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
 Hence, if the image is built in different months, it will contain different OS components.
-In the year 2024, when this version's long-term support (LTS) expires \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early)}, the image will not be available at the expected URL.
+In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early), the image will not be available at the expected URL.

-Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
-\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (more than 6 months) will be deleted in free accounts from mid 2021.}
+Generally, pre-built binary files (like Docker images) are large and expensive to maintain and archive.
+Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (more than 6 months) will be deleted in free accounts from mid 2021.
 Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel.
-Hence older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).}
+Hence older Docker images may not be executable (their longevity is determined by the host kernel, typically a decade).

 Once the host OS is ready, PMs are used to install the software or environment.
 Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
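The digest-pinning mentioned in the hunk above is a one-line difference in practice. A minimal sketch in shell (the digest value is a placeholder, not a real Ubuntu digest):

    # Pull by mutable tag: contents change as the image is rebuilt upstream.
    docker pull ubuntu:16.04
    # Pull by immutable digest: the bit-identical image, while it stays hosted.
    docker pull ubuntu@sha256:<64-hex-digit-digest>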
-The former has \new{the same longevity} as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation \new{with a typical longevity of a few years}.
-Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; that of their supported CPU architectures}.
+The former has the same longevity as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation with a typical longevity of a few years.
+Nix and GNU Guix produce bit-wise identical programs with considerably better longevity; that of their supported CPU architectures.
 However, they need root permissions and are primarily targeted at the Linux kernel.
 Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them.
 Unless precise version identifiers of \emph{every software package} are stored by project authors, a third-party PM will use the most recent version.
 Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity.

 With the software environment built, job management is the next component of a workflow.
-Visual/GUI workflow tools like Apache Taverna, GenePattern \new{(deprecated)}, Kepler or VisTrails \new{(deprecated)}, which were mostly introduced in the 2000s and used Java or Python 2 encourage modularity and robust job management.
-\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
-These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.}
+Visual/GUI workflow tools like Apache Taverna, GenePattern (deprecated), Kepler or VisTrails (deprecated), which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management.
+However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
+These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.
 The more recent solutions (mostly non-GUI, written in Python) leave this to the authors of the project.
 Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
 This includes automatic verification, which is possible in many solutions, but is rarely practiced.
@@ -193,11 +192,11 @@ Finally, to blend narrative and analysis, computational notebooks \cite{rule18},
 However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
 It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
 Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
-\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
+The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
 By default, cell dependencies are not included (e.g., automatically running some cells only after certain others), nor are parallel execution or the usage of more than one language.
 There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these.
 However, since they are not part of the core, a shorter longevity can be assumed.
-The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial.}
+The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial.
 Notebooks can therefore rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.

@@ -213,21 +212,21 @@ We argue and propose that workflows satisfying the following criteria can not on
 \textbf{Criterion 1: Completeness.}
 A project that is complete (self-contained) has the following properties.
-(1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
-Fewer explicit execution requirements would mean larger \emph{execution possibility} and consequently longer \emph{longevity}.}
-(2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
-(3) No impact on the host OS libraries, programs, and \new{environment variables}.
+(1) No \emph{execution requirements} apart from a minimal Unix-like operating system.
+Fewer explicit execution requirements would mean larger \emph{execution possibility} and consequently longer \emph{longevity}.
+(2) Primarily stored as plain text (encoded in ASCII/Unicode), not needing specialized software to open, parse, or execute.
+(3) No impact on the host OS libraries, programs, and environment variables.
 (4) No root privileges to run (during development or post-publication).
-(5) Builds its own controlled software \new{with independent environment variables}.
+(5) Builds its own controlled software with independent environment variables.
 (6) Can run locally (without an internet connection).
 (7) Contains the full project's analysis, visualization \emph{and} narrative: including instructions to automatically access/download raw inputs, build necessary software, do the analysis, produce final data products \emph{and} final published report with figures \emph{as output}, e.g., PDF or HTML.
 (8) It can run automatically, without human interaction.

 \textbf{Criterion 2: Modularity.}
 A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
-\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., software binaries or figures), software source code, or data should be included.
+In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., software binaries or figures), software source code, or data should be included.
 The latter two (developing low-level software, collecting data, or the publishing and archival of both) are separate projects in themselves because they can be used in other independent projects.
-This optimizes the storage, archival/mirroring, and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\sim100$ kilobytes, and the version-controlled history may become a few megabytes.}
+This optimizes the storage, archival/mirroring, and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\sim100$ kilobytes, and the version-controlled history may become a few megabytes.

 In terms of the analysis workflow, explicit communication between various modules enables optimizations on many levels:
 (1) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run).
@@ -239,7 +238,7 @@ In terms of the analysis workflow, explicit communication between various module
 \textbf{Criterion 3: Minimal complexity.}
 Minimal complexity can be interpreted as:
 (1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
-A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is no longer supported)}.
+A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years (for example Python 2, which is no longer supported).
 More stable/basic tools can be used with lower long-term maintenance costs.
 (2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to maintain in the long term.

@@ -265,12 +264,12 @@ This is related to longevity, because if a workflow contains only the steps to d
 \textbf{Criterion 8: Free and open-source software:}
 Non-free or non-open-source software typically cannot be distributed, inspected, or modified by others.
-They are reliant on a single supplier (even without payments) \new{and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}}.
-A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
+They are reliant on a single supplier (even without payments) and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}.
+A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU) allows others to run, learn from, distribute, build upon (modify), and publish their modified versions.
 When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
-\new{Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
-In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data as free formats.}
+Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
+In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data as free formats.

@@ -284,23 +283,23 @@ In such cases, it is best to immediately convert the data to free formats upon c
 \section{Proof of concept: Maneage}
 With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
-\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
+Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.
 It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.

 It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
 It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
 Its primordial implementation was used in \cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.

-Technically, the hardest criterion to implement was the first (completeness); in particular \new{restricting execution requirements to only a minimal Unix-like operating system}.
+Technically, the hardest criterion to implement was the first (completeness); in particular restricting execution requirements to only a minimal Unix-like operating system.
 One solution we considered was GNU Guix and Guix Workflow Language (GWL).
 However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.

 Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make.
 Make is not an analysis language, it is a job manager.
 Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C).
-Make \new{has been available since 1977, it is still heavily used in almost all components of modern Unix-like OSs} and is standardized in POSIX.
+Make has been available since 1977, is still heavily used in almost all components of modern Unix-like OSs, and is standardized in POSIX.
 It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software have also already had some exposure to it \new{(most free research software are built with Make).}
+Researchers using free software have also already had some exposure to it (most free research software are built with Make).
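To make the job-management idiom described in the hunk above concrete, a minimal sketch of a Make rule follows (file and script names are hypothetical, not from Maneage itself); Make only decides when the recipe runs, while the recipe may invoke a program in any language:

    # 'result.txt' is the target, the script and the input data are its
    # prerequisites. The recipe (which must start with a TAB) is re-run
    # only when a prerequisite is newer than the target.
    result.txt: analyze.py input.dat
    	python analyze.py input.dat > result.txt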
 Linking the analysis and narrative (criterion 7) was historically our first design element.
 To avoid the problems with computational notebooks mentioned above, we adopt a more abstract linkage, providing a more direct and traceable connection.
@@ -316,32 +315,30 @@ Through the former, manual updates by authors (which are prone to errors and dis
 Acting as a link, the macro files build the core skeleton of Maneage.
 For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation.
-These are combined at the end to generate precise software \new{acknowledgment} and citation that is shown in the
-\new{
-  \ifdefined\separatesupplement
-  \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices},%
-  \else%
-  appendices (\ref{appendix:software}),%
-  \fi%
-}
+These are combined at the end to generate precise software acknowledgment and citation that is shown in the
+\ifdefined\separatesupplement%
+\href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices},%
+\else%
+appendices (\ref{appendix:software}),%
+\fi%
 for other examples, see \cite{akhlaghi19, infante20}.
-\new{Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected to report in the paper (they are reported for this paper in the acknowledgments).
-These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
+Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected to report in the paper (they are reported for this paper in the acknowledgments).
+These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.

 The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-All software dependencies are built down to precise versions of every tool, including the shell, \new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
-\new{The source code of all the free software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
-Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.}
+All software dependencies are built down to precise versions of every tool, including the shell, important low-level application programs (e.g., GNU Coreutils) and, of course, the high-level science software.
+The source code of all the free software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.
 On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
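A minimal sketch of the macro-file linkage in the hunk above (all file, script, and macro names are hypothetical, not Maneage's actual layout): an analysis rule writes a \LaTeX{} macro definition, and the paper is a Make target that depends on that macro file, so a changed analysis output automatically rebuilds the PDF with the new value.

    # The analysis step writes its result as a LaTeX macro definition,
    # e.g. a single line like: \newcommand{\meanvalue}{42.7}
    mean-value.tex: compute-mean.sh input.dat
    	./compute-mean.sh input.dat > mean-value.tex

    # The PDF is a target whose prerequisites include the macro file,
    # so the narrative is re-built whenever the number changes.
    paper.pdf: paper.tex mean-value.tex
    	pdflatex paper.tex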
 Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
-\new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
+Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
 However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment.
 Hence the few hours of initial software building are negligible compared to a project's life span.

 To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
 The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has thorough instructions on building in Docker.
 Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage.
-For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.}
+For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.

 The analysis phase of the project, however, is naturally different from one project to another at a low level.
 It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
@@ -358,11 +355,11 @@ Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
   Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
   It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale).
   Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
-  Each colored box is a file in the project and \new{arrows show the operation of various software: linking input file(s) to the output file(s)}.
+  Each colored box is a file in the project and arrows show the operation of various software: linking input file(s) to the output file(s).
   Green files/boxes are plain-text files that are under version control and in the project source directory.
   Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
-  For example, \inlinecode{paper.pdf} \new{is created by running \LaTeX{} on} \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
-  \new{Other software is used in other steps.}
+  For example, \inlinecode{paper.pdf} is created by running \LaTeX{} on \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
+  Other software is used in other steps.
   The solid arrows and full-opacity built boxes correspond to the lineage of this paper.
   The dotted arrows and built boxes show the scalability of Maneage (ease of adding hypothetical steps to the project as it evolves).
 The underlying data of the left plot is available at
@@ -384,7 +381,7 @@ Other built files (``targets'' in intermediate analysis steps) cascade down in t
 \begin{lstlisting}[
   label=code:topmake,
   caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}.\\
-    \new{For full file, see \href{https://archive.softwareheritage.org/swh:1:cnt:d552dc18749fbb16249b642cd4f8107c1ce8ff68;origin=https://gitlab.com/makhlaghi/maneage-paper.git;visit=swh:1:snp:ee7cc3bb558c4af703e8de53dd590654c8967663;anchor=swh:1:rev:e4f61544facf8a3bd88c8466e7d3d847544c8228;path=/reproduce/analysis/make/top-make.mk}{SoftwareHeritage}}}
+    For the full file, see \href{https://archive.softwareheritage.org/swh:1:cnt:d552dc18749fbb16249b642cd4f8107c1ce8ff68;origin=https://gitlab.com/makhlaghi/maneage-paper.git;visit=swh:1:snp:ee7cc3bb558c4af703e8de53dd590654c8967663;anchor=swh:1:rev:e4f61544facf8a3bd88c8466e7d3d847544c8228;path=/reproduce/analysis/make/top-make.mk}{SoftwareHeritage}}
   ]
 # Default target/goal of project.
 all: paper.pdf
@@ -407,8 +404,8 @@ include $(foreach s,$(makesrc), \
 Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not available in \cite{infante20}).
 All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
-Where exact reproducibility is not possible \new{(for example, due to parallelization)}, values can be verified by the project authors.
-\new{For example see \new{\href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh}} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
+Where exact reproducibility is not possible (for example, due to parallelization), values can be verified by the project authors.
+For example, see \href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.

 \begin{figure*}[t]
   \begin{center}
     \includetikz{figure-branching}{scale=1}\end{center}
@@ -442,22 +439,22 @@ Furthermore, the configuration files are a prerequisite of the targets that use
 If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
 This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-\new{In contrast to notebooks like Jupyter, the analysis scripts, configuration parameters and paper's narrative are therefore not blended into in a single file, and do not require a unique editor.
+In contrast to notebooks like Jupyter, the analysis scripts, configuration parameters and paper's narrative are therefore not blended into a single file, and do not require a unique editor.
 To satisfy the modularity criterion, the analysis steps and narrative are written and run in their own files (in different languages) and the files can be viewed or manipulated with any text editor that the authors prefer.
-The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better-formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.}
+The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better-formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.

 To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
 Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
-\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
+The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).
 Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis; see Listing \ref{code:branching}).
-There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
+There is a thoroughly elaborated customization checklist in \inlinecode{README-hacking.md}.

 The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown in the acknowledgments).
-These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown in the acknowledgments)}.
+These macros are created in \inlinecode{initialize.mk}, with other basic information from the running system like the CPU architecture, byte order or address sizes (shown in the acknowledgments).
 Figure \ref{fig:branching} shows how projects can re-import Maneage at a later time (technically: \emph{merge}), thus improving their low-level infrastructure: in (a) authors do the merge during an ongoing project; in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
-\new{Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.}
+Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.
 Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
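Listing \ref{code:branching} itself is outside this diff's context; the start-up it describes has roughly the following shape (the clone URL is the real one given above; the project and branch names here are hypothetical):

    git clone http://git.maneage.org/project.git   # Obtain the Maneage branch.
    mv project my-project && cd my-project         # Hypothetical project name.
    git remote rename origin origin-maneage        # Keep Maneage as a remote.
    git checkout -b main                           # Hypothetical project branch.
    # ...customize and commit; later merges from the Maneage branch bring in
    # low-level infrastructure improvements.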
 \begin{lstlisting}[
@@ -517,20 +514,20 @@ Hence, arguably the most important feature of these criteria (as implemented in
 Using mature and time-tested tools for blending version control, the research paper's narrative, the software management \emph{and} robust data management strategies.
 We have noticed that providing a clear checklist of the initial customizations is much more effective in encouraging mastery of these core analysis tools than having abstract, isolated tutorials on each tool individually.
-Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like OSs} (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
+Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various Unix-like OSs (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
 This requires maintenance by our core team and consumes time and energy.
 However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
 They later share their low-level commits on the core branch, thus propagating it to all derived projects.

-\new{Thirdly, Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so our completeness condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.
-However our focus is on reproducing results (output of software), not the software itself.}
+Thirdly, Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so our completeness condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.
+However, our focus is on reproducing results (the output of the software), not the software itself.
 Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU Portability Library, or Gnulib).

 On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
-However, glibc is not install-able on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
+However, glibc is not installable on some Unix-like OSs (e.g., macOS) and all programs link with the C library.
 This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
-With everything else under precise control in Maneage, the effect of differing hardware, Kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
-Using continuous integration (CI) is one way to precisely identify breaking points on multiple systems.}
+With everything else under precise control in Maneage, the effect of differing hardware, kernel, and C libraries on high-level science can now be systematically studied in follow-up research (including floating-point arithmetic or optimization differences).
+Using continuous integration (CI) is one way to precisely identify breaking points on multiple systems.

 % DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
 %Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.

@@ -595,11 +592,11 @@ Zahra Sharbaf, Nadia Tonello, Ignacio Trujillo and the AMIGA team at the
 Instituto de Astrof\'isica de Andaluc\'ia for their useful help, suggestions, and feedback on Maneage and this paper.
-\new{The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) provided many points that greatly helped to clarify this paper.}
+The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) provided many points that greatly helped to clarify this paper.

 This project (commit \inlinecode{\projectversion}) is maintained in Maneage (\emph{Man}aging data lin\emph{eage}).
-\new{The latest merged Maneage branch commit was \inlinecode{\maneageversion} (\maneagedate).
-This project was built on an \inlinecode{\machinearchitecture} machine with {\machinebyteorder} byte-order and address sizes {\machineaddresssizes}}.
+The latest merged Maneage branch commit was \inlinecode{\maneageversion} (\maneagedate).
+This project was built on an \inlinecode{\machinearchitecture} machine with {\machinebyteorder} byte-order and address sizes {\machineaddresssizes}.
 Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
 The Japanese MEXT PhD scholarship to M.A and its Grant-in-Aid for Scientific Research (21244012, 24253003).

diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 09a8667..56056be 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -415,7 +415,7 @@ This is a major problem for scientific projects: in principle (not knowing how t
 Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis.
 The user specifies the necessary operating system and the necessary packages for the analysis steps in various JSON files.
 Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunits mentioned above to find the best environment isolator (maybe using Linux containerization, containers, or VMs).
-We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
+We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now deprecated).

diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index 7efb7cb..f23e2d1 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -377,7 +377,7 @@ Here, we complement that section with more technical details on Make.
 Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
 Each stage/step in the analysis is defined through a \emph{rule}.
 Rules define \emph{recipes} to build \emph{targets} from \emph{pre-requisites}.
-In \new{Unix-like operating systems}, everything is a file, even directories and devices.
+In Unix-like operating systems, everything is a file, even directories and devices.
 Therefore, all three components in a rule must be files on the running filesystem.
 To decide which operation should be re-done when executed, Make compares the timestamps of the targets and prerequisites.
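The timestamp comparison in the last context line can be illustrated with a short hypothetical session (rule and file names invented for this sketch):

    $ make plot.pdf    # Prerequisites newer than target: recipe runs.
    $ make plot.pdf    # Target up to date: nothing is re-done.
    $ touch data.csv   # Update one prerequisite's timestamp.
    $ make plot.pdf    # Only the rules depending on data.csv re-execute.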