Diffstat (limited to 'paper.tex')
-rw-r--r-- | paper.tex | 240 |
1 file changed, 135 insertions, 105 deletions
@@ -44,11 +44,11 @@ Mohammadreza Khellat,\\ David Valls-Gabaud, Roberto Baena-Gall\'e\\ - \footnotesize{Manuscript received June 5th, 2020; accepted April 7th, 2021; first published online April 13th, 2021} + \footnotesize{Manuscript received June 5th, 2020; accepted April 7th, 2021; first published by CiSE April 13th, 2021} } %% The paper headers -\markboth{Computing in Science and Engineering, Vol. 23, No. X, MM 2021: \href{https://doi.org/10.1109/MCSE.2021.3072860}{DOI:10.1109/MCSE.2021.3072860}, \href{https://arxiv.org/abs/2006.03018}{arXiv:2006.03018}, \href{https://doi.org/10.5281/zenodo.3872247}{zenodo.3872247}}% +\markboth{Computing in Science and Engineering, Vol. 23, No. X, MM 2021: \href{https://doi.org/10.1109/MCSE.2021.3072860}{DOI:10.1109/MCSE.2021.3072860}, arXiv:2006.03018, \href{https://doi.org/10.5281/zenodo.3872247}{zenodo.3872247}}% {Akhlaghi \MakeLowercase{\textit{et al.}}: \projecttitle} @@ -80,7 +80,7 @@ %% CONCLUSION We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders. - This paper is itself written with Maneage (project commit \projectversion). + This article is itself a \emph{Maneage'd} project (project commit \projectversion). 
\vspace{2.5mm} \emph{Appendices} --- @@ -93,9 +93,9 @@ after main body of paper (Appendices \ref{appendix:existingtools} and \ref{appen \vspace{2.5mm} \emph{Reproducibility} --- - All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}} and - Git history of this paper at \href{http://git.maneage.org/paper-concept.git}{\texttt{git.maneage.org/paper-concept.git}}, - also archived in Software Heritage\footnote{\inlinecode{\href{https://archive.softwareheritage.org/swh:1:dir:45a9e282a86145fe9babef529c8fce52ffe8d717;origin=http://git.maneage.org/paper-concept.git/;visit=swh:1:snp:33d24ae2107e25c734067d704cdad9d33013588a;anchor=swh:1:rev:b858c601613d620f5cf4501816e161a2f8f2e100}{swh:1:dir:45a9e282a86145fe9babef529c8fce52ffe8d717}}\\Software Heritage identifiers (SWHIDs) can be used with resolvers like \inlinecode{http://n2t.net/} (e.g., \inlinecode{http://n2t.net/swh:1:...}). Clicking on the SWHIDs in the digital format will provide more ``context'' for same content.}. + Products available in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}. + Git history of this paper is at \href{http://git.maneage.org/paper-concept.git}{\texttt{git.maneage.org/paper-concept.git}}, + which is also archived in Software Heritage\footnote{\inlinecode{\href{https://archive.softwareheritage.org/swh:1:dir:45a9e282a86145fe9babef529c8fce52ffe8d717;origin=http://git.maneage.org/paper-concept.git/;visit=swh:1:snp:33d24ae2107e25c734067d704cdad9d33013588a;anchor=swh:1:rev:b858c601613d620f5cf4501816e161a2f8f2e100}{swh:1:dir:45a9e282a86145fe9babef529c8fce52ffe8d717}}\\Software Heritage identifiers (SWHIDs) can be used with resolvers like \inlinecode{http://n2t.net/} (e.g., \inlinecode{http://n2t.net/swh:1:...}). Clicking on the SWHIDs in the digital format will provide more ``context'' for same content.}. \end{abstract} % Note that keywords are not normally used for peer-review papers. 
@@ -127,8 +127,7 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows %\IEEEPARstart{F}{irst} word Reproducible research has been discussed in the sciences for at least 30 years\cite{claerbout1992, fineberg19}. -Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day, -starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade. +Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed which mostly rely on the common technology of the day, starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade. However, these technologies develop fast, e.g., code written in Python 2 (which is no longer officially maintained) often cannot run with Python 3. The cost of staying up to date within this rapidly-evolving landscape is high. @@ -144,11 +143,11 @@ Decades later, scientists are still held accountable for their results and there Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis''\cite{fineberg19}. Longevity is defined as the length of time that a project remains \emph{functional} after its creation. Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary). -Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data. 
+Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to \emph{reuse} in another project, or checking the version of used software, or the source of the input data. Extracting these from execution outputs is not always possible. -A basic review of the longevity of commonly used tools is provided here (for a more comprehensive review, please see +A basic review of the longevity of commonly used tools is provided here (for a more comprehensive review, see \ifdefined\separatesupplement - the supplementary appendices% + the supplementary appendices, available online% \else% appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}% \fi% @@ -158,46 +157,47 @@ To isolate the environment, virtual machines (VMs) have sometimes been used, e.g However, containers (e.g., Docker or Singularity) are currently more widely used. We will focus on Docker here because it is currently the most common. -It is possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later. +It is possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to recreate an identical operating system (OS) image later. However, that is rarely done. -Usually images are imported with operating system (OS) names; e.g., Mesnard \& Barba\cite{mesnard20} use `\inlinecode{FROM ubuntu:16.04}'. +Usually images are imported with OS names; e.g., Mesnard \& Barba\cite{mesnard20} use ``\inlinecode{FROM ubuntu:16.04}''. The extracted tarball URL\footnote{\inlinecode{\url{https://partner-images.canonical.com/core/xenial}}} is updated almost monthly, and only the most recent five are archived. Hence, if the image is built in different months, it will contain different OS components. 
-In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8 which will terminate 8 years early\footnote{\inlinecode{\url{https://blog.centos.org/2020/12/future-is-centos-stream}}}), the image will not be available at the expected URL. +In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8, which will terminate 8 years early\footnote{\inlinecode{\url{https://blog.centos.org/2020/12/future-is-centos-stream}}}), the image will not be available at the expected URL. -Generally, pre-built binary files (like Docker images) are large and expensive to maintain, distribute and archive. -Because of this, in October 2020 Docker Hub (where many workflows are archived) announced\footnote{\inlinecode{\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{https://www.docker.com/blog/docker-hub-image-retention}\\\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{-policy-delayed-and-subscription-updates}}} a new consumpiton-based payment model. +Generally, prebuilt binary files (like Docker images) are large and expensive to maintain, distribute, and archive. +Because of this, in October 2020, Docker Hub (where many workflows are archived) announced\footnote{\inlinecode{\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{https://www.docker.com/blog/docker-hub-image-retention}\\\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{-policy-delayed-and-subscription-updates}}} a new consumption-based payment model. Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel. Hence older Docker images may not be executable: their longevity is determined by OS kernels, typically a decade. 
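As a sketch of the digest point above, the same base image can be imported by moving tag or by content digest; the all-zero digest below is a placeholder for illustration only (a real one is printed by `docker pull ubuntu:16.04` or `docker images --digests`):

```shell
# Write a Dockerfile that pins its base image by content, not by name.
# The all-zero digest is a placeholder, not a real Ubuntu digest.
cat > Dockerfile.pinned <<'EOF'
# Moving pointer: resolves to whatever the tag points at this month.
# FROM ubuntu:16.04
# Content address: always the same bits, for as long as they are archived.
FROM ubuntu@sha256:0000000000000000000000000000000000000000000000000000000000000000
EOF
grep '^FROM' Dockerfile.pinned
```

Note that a digest only fixes *which* bits are fetched; it does not keep the registry serving them, which is the archival problem discussed above.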
Once the host OS is ready, package managers (PMs) are used to install the software or environment. -Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs. -The former has the same longevity as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation with a typical longevity of a few years. -Nix and GNU Guix produce bit-wise identical programs with considerably better longevity; that of their supported CPU architectures. +Usually the PM of the OS, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs. +The former has the same longevity as the OS while some of the latter (such as Conda and Spack) are written in high-level languages like Python; so, the PM itself depends on the host's Python installation with a typical longevity of a few years. +Nix and GNU Guix produce bitwise identical programs with considerably better longevity; that of their supported CPU architectures. However, they need root permissions and are primarily targeted at the Linux kernel. -Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them. +Generally, in all the PMs, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can, indeed, fix them. Unless precise version identifiers of \emph{every software package} are stored by project authors, a third-party PM will use the most recent version. Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity. 
With the software environment built, job management is the next component of a workflow. -Visual/GUI workflow tools like Apache Taverna, GenePattern (deprecated), Kepler or VisTrails (deprecated), which were mostly introduced in the 2000s and used Java or Python 2 encourage modularity and robust job management. -However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated. +Visual/GUI tools (written in Java or Python 2) such as Taverna (deprecated), GenePattern (deprecated), Kepler, or VisTrails (deprecated), which were mostly introduced in the 2000s, encourage modularity and robust job management. +However, a GUI environment is tailored to specific applications and is hard to generalize while being hard to reproduce once the required Java VM (JVM) is deprecated. These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI. -The more recent solutions (mostly non-GUI, written in Python) leave this to the authors of the project. +The more recent solutions (mostly non-GUI, written in Python) leave this to the project authors. + Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices. This includes automatic verification, which is possible in many solutions, but is rarely practiced. Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive). -Finally, to blend narrative and analysis, computational notebooks\cite{rule18}, such as Jupyter, are currently gaining popularity. 
-However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of Alliez et al.\cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies. -It is important to remember that the longevity of a project is determined by its shortest-lived dependency. -Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management. +Finally, to blend narrative and analysis, computational notebooks (CNs) \cite{rule18}, such as Jupyter, are currently gaining popularity. +However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 in the work of Alliez et al.\cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies. +It is important to remember that the longevity of a project is determined by its shortest-lived dependency. +Furthermore, as with job management, CNs do not actively encourage good practices in programming or project management. The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run. By default, cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language. There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these. However, since they are not part of the core, a shorter longevity can be assumed. The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial. -Notebooks can therefore rarely deliver their promised potential\cite{rule18} and may even hamper reproducibility\cite{pimentel19}. 
+Notebooks can, therefore, rarely deliver their promised potential\cite{rule18} and may even hamper reproducibility\cite{pimentel19}. @@ -213,10 +213,10 @@ We argue and propose that workflows satisfying the following criteria can not on \textbf{Criterion 1: Completeness.} A project that is complete (self-contained) has the following properties. (1) No \emph{execution requirements} apart from a minimal Unix-like operating system. -Fewer explicit execution requirements would mean larger \emph{execution possibility} and consequently longer \emph{longevity}. +Fewer explicit execution requirements would mean larger \emph{execution possibility} and consequently better \emph{longevity}. (2) Primarily stored as plain text (encoded in ASCII/Unicode), not needing specialized software to open, parse, or execute. (3) No impact on the host OS libraries, programs, and environment variables. -(4) No root privileges to run (during development or post-publication). +(4) No root privileges to run (during development or postpublication). (5) Builds its own controlled software with independent environment variables. (6) Can run locally (without an internet connection). (7) Contains the full project's analysis, visualization \emph{and} narrative: including instructions to automatically access/download raw inputs, build necessary software, do the analysis, produce final data products \emph{and} final published report with figures \emph{as output}, e.g., PDF or HTML. @@ -226,10 +226,10 @@ Fewer explicit execution requirements would mean larger \emph{execution possibil A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects. In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., software binaries or figures), software source code, or data should be included. 
The latter two (developing low-level software, collecting data, or the publishing and archival of both) are separate projects in themselves because they can be used in other independent projects. -This optimizes the storage, archival/mirroring, and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\times100$ kilobytes, and the version-controlled history may become a few megabytes. +This optimizes the storage, archival/mirroring, and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\sim100$ kilobytes, and the version controlled history may become a few megabytes. In terms of the analysis workflow, explicit communication between various modules enables optimizations on many levels: -(1) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run). +(1) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, the latter will not be rerun). (2) Usage in other projects. (3) Debugging and adding improvements (possibly by future researchers). (4) Citation of specific parts. @@ -238,7 +238,7 @@ In terms of the analysis workflow, explicit communication between various module \textbf{Criterion 3: Minimal complexity.} Minimal complexity can be interpreted as: (1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis). -A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years (for example Python 2, which is no longer supported). +A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years (for example, Python 2, which is no longer supported). 
More stable/basic tools can be used with less long-term maintenance costs. (2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to maintain in the long term. @@ -254,7 +254,7 @@ No exploratory research is done in a single, first attempt. Projects evolve as they are being completed. Naturally, earlier phases of a project are redesigned/optimized only after later phases have been completed. Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''. -The derivation ``history'' of a result is thus not any the less valuable as itself. +The derivation ``history'' of a result is, thus, no less valuable than the result itself. \textbf{Criterion 7: Including narrative that is linked to analysis.} A project is not just its computational analysis. @@ -264,12 +264,12 @@ This is related to longevity, because if a workflow contains only the steps to d \textbf{Criterion 8: Free and open-source software (FOSS):} Non-FOSS software typically cannot be distributed, inspected, or modified by others. -They are thus reliant on a single supplier (even without payments) and prone to \emph{proprietary obsolescence}\footnote{\inlinecode{\url{https://www.gnu.org/proprietary/proprietary-obsolescence.html}}}. +They are, thus, reliant on a single supplier (even without payments) and prone to \emph{proprietary obsolescence}\footnote{\inlinecode{\url{https://www.gnu.org/proprietary/proprietary-obsolescence.html}}}. A project that is \emph{free software} (as formally defined by GNU\footnote{\inlinecode{\url{https://www.gnu.org/philosophy/free-sw.en.html}}}), allows others to run, learn from, distribute, build upon (modify), and publish their modified versions. 
When the software used by the high-level project is also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware. -Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics). -In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data as free formats. +Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example, microarrays in genetics). +In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data in free formats. @@ -282,52 +282,53 @@ In such cases, it is best to immediately convert the data to free formats upon c \section{Proof of concept: Maneage} -With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers\cite{akhlaghi19, infante20}. -Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} (on galaxy evolution). +With the longevity problems of existing tools outlined earlier, a proof-of-concept solution is presented here via an implementation that has been tested in published papers\cite{akhlaghi19, infante20}. +Since the initial submission of this article, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} (on galaxy evolution). 
It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researchers' perspective. It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \inlinecode{\url{https://maneage.org}}. It was developed as a parallel research project over five years of publishing reproducible workflows of our research. -Its primordial implementation was used in Akhlaghi \& Ichikawa\cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. +Its primordial implementation was used in Akhlaghi and Ichikawa\cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. -Technically, the hardest criterion to implement was the first (completeness); in particular restricting execution requirements to only a minimal Unix-like operating system. +Technically, the hardest criterion to implement was the first (completeness); in particular, restricting execution requirements to only a minimal Unix-like operating system. One solution we considered was GNU Guix and Guix Workflow Language (GWL). However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion. Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make. Make is not an analysis language, it is a job manager. -Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C). +Make decides when and how to call analysis steps/programs (in any language such as Python, R, Julia, Shell, or C). 
Make has been available since 1977, it is still heavily used in almost all components of modern Unix-like OSs and is standardized in POSIX. It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000}. Moreover, researchers using FOSS have already had some exposure to Make (most FOSS are built with Make). Linking the analysis and narrative (criterion 7) was historically our first design element. -To avoid the problems with computational notebooks mentioned above, we adopt a more abstract linkage, providing a more direct and traceable connection. -Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis. -For example, Akhlaghi writes\cite{akhlaghi19} `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. -The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'. -The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is automatically generated after the analysis and expands to the value `\inlinecode{0.25}' upon creation of the PDF. +To avoid the problems with computational notebooks mentioned before, we adopt a more abstract linkage, providing a more direct and traceable connection. +Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically created \LaTeX{} macros, during the analysis. +For example, Akhlaghi writes\cite{akhlaghi19} ``\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}''. +The \LaTeX{} source of the quote above is: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. 
+The macro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically generated after the analysis and expands to the value ``\inlinecode{0.25}'' upon creation of the PDF. Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables. These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command. -This allows automatic updates to the embedded numbers during the experimentation phase of a project \emph{and} accurate post-publication provenance. +This allows automatic updates to the embedded numbers during the experimentation phase of a project \emph{and} accurate postpublication provenance. Through the former, manual updates by authors (which are prone to errors and discourage improvements or experimentation after writing the first draft) are by-passed. Acting as a link, the macro files build the core skeleton of Maneage. For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation. These are combined at the end to generate precise software acknowledgment and citation that is shown in the \ifdefined\separatesupplement% -appendices,% +appendices, available online, % \else% appendices (\ref{appendix:software}), % \fi% -other examples are also available\cite{akhlaghi19, infante20}. -Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected to report in the paper (they are reported for this paper in the acknowledgments). +other examples have also been published\cite{akhlaghi19, infante20}. +Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected to report in the paper (they are reported for this article in the section ``Acknowledgments''). 
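As a minimal sketch of this macro mechanism (the macro name follows the S/N example above; `macros.tex` and the hard-coded value are illustrative stand-ins, not Maneage's actual recipe), an analysis step records its result as a LaTeX macro that the narrative then expands:

```shell
# The analysis step computes a value (hard-coded here as a stand-in)...
sn=0.25
# ...and records it as a LaTeX macro instead of pasting it into the text.
printf '\\newcommand{\\demosfoptimizedsn}{%s}\n' "$sn" > macros.tex
# The narrative only ever writes: ... down to S/N of $\demosfoptimizedsn$ ...
cat macros.tex
```

If the analysis later changes the value, the next build updates the PDF automatically; nothing is retyped by hand.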
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines. -The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of Alliez et al.\cite{alliez19}). + +The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 in the work by Alliez et al.\cite{alliez19}). All software dependencies are built down to precise versions of every tool, including the shell, important low-level application programs (e.g., GNU Coreutils) and of course, the high-level science software. The source code of all the FOSS software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}. -Zenodo promises long-term archival and also provides a persistent identifiers for the files, which are sometimes unavailable at a software package's web page. +Zenodo promises long-term archival and also provides persistent identifiers for the files, which are sometimes unavailable at a software package's web page. On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added\footnote{\inlinecode{\url{http://savannah.nongnu.org/task/?15390}}}. Currently, {\TeX}Live is also being added\footnote{\inlinecode{\url{http://savannah.nongnu.org/task/?15267}}}, but that is only for building the final PDF, not affecting the analysis or verification. 
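The target/prerequisite pattern can be sketched as Make rules; the file names and the `./analysis-step` command here are hypothetical illustrations, not Maneage's actual rules:

```make
# Hypothetical sketch: the macro file depends on an analysis output,
# which depends on a configuration file and the raw input.
# Editing demo.conf re-runs only this chain, nothing else.
out-a.dat: demo.conf input.dat
	./analysis-step demo.conf input.dat > out-a.dat
analysis-a.tex: out-a.dat
	printf '\\newcommand{\\demoval}{%s}\n' "$$(cat out-a.dat)" > analysis-a.tex
```

Because each macro file is itself a target, Make can rebuild exactly the affected words of the paper when one configuration value changes.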
@@ -340,10 +341,10 @@ The \inlinecode{README.md}\footnote{\inlinecode{\label{maneageatswh}\href{https: Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage. For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable. -The analysis phase of the project however is naturally different from one project to another at a low-level. -It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity. -This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in Menke et al.\cite{menke20}. -Figure \ref{fig:datalineage} (right) is the data lineage that produced it. +The analysis phase of the project, however, is naturally different from one project to another at a low level. +It was, thus, necessary to design a generic framework to comfortably host any project while still satisfying the criteria of modularity, scalability, and minimal complexity. +This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in the work by Menke et al.\cite{menke20}. +Figure \ref{fig:datalineage} (right) shows the data lineage that produced it. \begin{figure*}[t] \begin{center} @@ -352,7 +353,7 @@ Figure \ref{fig:datalineage} (right) is the data lineage that produced it. \end{center} \vspace{-3mm} \caption{\label{fig:datalineage} - Left: an enhanced replica of Figure 1C in Menke et al.\cite{menke20}, shown here for demonstrating Maneage. + Left: an enhanced replica of Figure 1C in the work by Menke et al.\cite{menke20}, shown here for demonstrating Maneage. 
It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale). Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left. Each colored box is a file in the project and arrows show the operation of various software: linking input file(s) to the output file(s). @@ -373,7 +374,7 @@ This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice). A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research. -All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}). +All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (see Listing \ref{code:topmake}). Project authors add their modular subMakefiles in between. Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file at the bottom of each subMakefile in Figure \ref{fig:datalineage}). Other built files (``targets'' in intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files. @@ -397,7 +398,7 @@ makesrc = initialize \ # General # Load all the configuration files. include reproduce/analysis/config/*.conf -# Load the subMakefiles in the defined order +# Load the subMakefiles in the defined order. 
include $(foreach s,$(makesrc), \ reproduce/analysis/make/$(s).mk) \end{lstlisting} @@ -405,25 +406,25 @@ include $(foreach s,$(makesrc), \ Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria. All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility. Where exact reproducibility is not possible (for example, due to parallelization), values can be verified by the project authors. -For example see \inlinecode{\small verify-parameter-statistically.sh}\footnote{\inlinecode{\href{https://archive.softwareheritage.org/swh:1:cnt:dae4e6de5399a061ab4df01ea51f4757fd7e293a;origin=https://codeberg.org/boud/elaphrocentre.git;visit=swh:1:snp:54f00113661ea30c800b406eee55ea7a7ea35279;anchor=swh:1:rev:a029edd32d5cd41dbdac145189d9b1a08421114e;path=/reproduce/analysis/bash/verify-parameter-statistically.sh}{swh:1:cnt:dae4e6de5399a061ab4df01ea51f4757fd7e293a}}} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}. +For example, see \inlinecode{\small verify-parameter-statistically.sh}\footnote{\inlinecode{\href{https://archive.softwareheritage.org/swh:1:cnt:dae4e6de5399a061ab4df01ea51f4757fd7e293a;origin=https://codeberg.org/boud/elaphrocentre.git;visit=swh:1:snp:54f00113661ea30c800b406eee55ea7a7ea35279;anchor=swh:1:rev:a029edd32d5cd41dbdac145189d9b1a08421114e;path=/reproduce/analysis/bash/verify-parameter-statistically.sh}{swh:1:cnt:dae4e6de5399a061ab4df01ea51f4757fd7e293a}}} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}. \begin{figure*}[t] \begin{center} \includetikz{figure-branching}{scale=1}\end{center} \vspace{-3mm} \caption{\label{fig:branching} Maneage is a Git branch. Projects using Maneage are branched off it and apply their customizations. - (a) A hypothetical project's history before publication. 
+  (a) Hypothetical project's history before publication.
  The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage.
-  (b) A finished/published project can be revitalized for new technologies by merging with the core branch.
+  (b) Finished/published project can be revitalized for new technologies by merging with the core branch.
  Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
  Briefly, Git is a version control system, allowing a structured backup of project files; for more details, see
  \ifdefined\separatesupplement%
-  supplementary appendices (section on version control)%
+  supplementary appendices available online (section on version control)%
  \else%
  Appendix \ref{appendix:versioncontrol}%
  \fi%
  .
  Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
-  The upward arrows at the branch-tops are therefore in the direction of time.
+  The upward arrows at the branch-tops are, therefore, in the direction of time.
}
\end{figure*}

@@ -439,7 +440,7 @@ Furthermore, the configuration files are a prerequisite of the targets that use
If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-In contrast to notebooks like Jupyter, the analysis scripts, configuration parameters and paper's narrative are therefore not blended into in a single file, and do not require a unique editor.
+In contrast to notebooks like Jupyter, the analysis scripts, configuration parameters, and paper's narrative are, therefore, not blended into a single file, and do not require a unique editor.
To satisfy the modularity criterion, the analysis steps and narrative are written and run in their own files (in different languages) and the files can be viewed or manipulated with any text editor that the authors prefer.
The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better-formatted output that blends analysis outputs into the narrative sentences and enables direct provenance tracking.

@@ -449,17 +450,17 @@ The core Maneage git repository is hosted at \inlinecode{\href{http://git.maneag
Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
There is a thoroughly elaborated customization checklist in \inlinecode{README-hacking.md}.

-The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here in the abstract and acknowledgments), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
-These macros are created in \inlinecode{initialize.mk}, with other basic information from the running system like the CPU details (shown in the acknowledgments).
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here in the sections ``Abstract'' and ``Acknowledgments''), as well as the Git hash of the last commit in the Maneage branch (shown here in the section ``Acknowledgments'').
+These macros are created in \inlinecode{initialize.mk}, along with other basic information from the running system, like the CPU details (shown in the section ``Acknowledgments'').
As opposed to Git ``tag''s, the hash is a core concept in the Git paradigm and is immutable and always present in a given history, which is why it is the recommended version identifier.
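The macro-generation step can be sketched in shell as follows. This is a minimal, hypothetical illustration, not Maneage's actual \inlinecode{initialize.mk} recipe: the file name \inlinecode{project-version.tex} is made up here, while the macro name follows this paper's \inlinecode{\textbackslash{}projectversion} convention.

```shell
# Hypothetical sketch: export the current Git commit hash as a LaTeX
# macro (Maneage's real recipe lives in initialize.mk; the file name
# "project-version.tex" is illustrative only).
commit=$(git rev-parse --short HEAD 2>/dev/null || echo no-git)
printf '\\newcommand{\\projectversion}{%s}\n' "$commit" > project-version.tex
cat project-version.tex
```

The generated file can then be loaded by the \LaTeX{} source, so the typeset PDF always reports the exact commit it was built from.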
-Figure \ref{fig:branching} shows how projects can re-import Maneage at a later time (technically: \emph{merge}), thus improving their low-level infrastructure: in (a) authors do the merge during an ongoing project;
-in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
+Figure \ref{fig:branching} shows how projects can reimport Maneage at a later time (technically: \emph{merge}), thus improving their low-level infrastructure: in (a), authors do the merge during an ongoing project;
+in (b), readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.
-Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
+Low-level improvements in Maneage can, thus, propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.

-Finally, a snapshot of the complete project source is usually $\sim100$ kilo-bytes.
-It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source\cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in Software Heritage.
+Finally, a snapshot of the complete project source is usually $\sim100$ kilobytes.
+It can, thus, easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source\cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo, and archived in Software Heritage.

\begin{lstlisting}[
    label=code:branching,
@@ -475,9 +476,9 @@ $ git checkout -b main
$ ./project configure    # Build software environment.
$ ./project make         # Do analysis, build PDF paper.
-# Start editing, test-building and committing
+# Start editing, test-building and committing.
$ emacs paper.tex        # Set your name as author.
-$ ./project make         # Re-build to see effect.
+$ ./project make         # Rebuild to see effect.
$ git add -u && git commit  # Commit changes.
\end{lstlisting}

@@ -503,24 +504,24 @@ $ git add -u && git commit # Commit changes.
We have shown that it is possible to build workflows satisfying all the proposed criteria.
Here we comment on our experience in testing them through Maneage and its increasing user base (thanks to the support of RDA).

-Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
+First, while most researchers are generally familiar with the necessary low-level tools (e.g., Git, \LaTeX, the command line, and Make), these tools are not widely used.
Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
-Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology.
+Indeed, the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology.
Scientists, on the other hand, need to focus on their own research fields and need to consider longevity.
Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template or bundle that works immediately out of the box by producing a paper with an example calculation that they just need to start customizing.
-Using mature and time-tested tools, for blending version control, the research paper's narrative, the software management \emph{and} a robust data management strategies.
+It uses mature and time-tested tools to blend version control, the research paper's narrative, software management, \emph{and} a robust data management strategy.
We have noticed that providing a clear checklist of the initial customizations is much more effective in encouraging mastery of these core analysis tools than having abstract, isolated tutorials on each tool individually.

-Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various Unix-like OSs (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
+Second, to satisfy the completeness criterion, all the required software of the project must be built on various Unix-like OSs (Maneage is actively tested on different GNU/Linux distributions and macOS, and is also being ported to FreeBSD).
This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
They later share their low-level commits on the core branch, thus propagating them to all derived projects.

-Thirdly, Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so our completeness condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.
-However our focus is on reproducing results (output of software), not the software itself.
-Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU Portability Library, or Gnulib).
+Third, Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so our completeness condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.
+However, our focus is on reproducing results (output of software), not the software itself.
+Well-written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU Portability Library, or Gnulib).

On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain.
However, glibc is not installable on some Unix-like OSs (e.g., macOS) and all programs link with the C library.

@@ -540,28 +541,29 @@ For example, the publication of projects meeting these criteria on a wide scale
The completeness criterion implies that algorithms and data selection can be included in the optimizations.
Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection of an analysis with the resulting narrative \emph{and} the history of that analysis+narrative.
-Parsers can be written over projects for meta-research and provenance studies, e.g., to generate Research Objects
+Parsers can be written over projects for metaresearch and provenance studies, e.g., to generate Research Objects
\ifdefined\separatesupplement
-(see supplement appendix B).
+(see supplement appendix B, available online)
\else
-(see Appendix \ref{appendix:researchobject}).
+(see Appendix \ref{appendix:researchobject})
\fi
or allow interoperability with Common Workflow Language (CWL) or higher-level concepts like Canonical Workflow Framework for Research, or CWFR
\ifdefined\separatesupplement
-(see supplement appendix A).
+(see supplement appendix A, available online).
\else
(see Appendix \ref{appendix:genericworkflows}).
\fi
+
Likewise, when a bug is found in a scientific software package, affected projects can be detected and the scale of the effect can be measured.
-Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
+Combined with Software Heritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.

-From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in Austin et al.\cite{austin17}:
+From the data repository perspective, these criteria can also be useful, e.g., in addressing the challenges mentioned in the work by Austin et al.\cite{austin17}:
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just database curators, thereby improving sustainability.
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
(3) Software management: with these criteria, each project comes with its unique and complete software management.
It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), hence enabling robust software management, preservation, publishing, and citation.
-For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we distribute the source code of all software used in each project in a tarball, as deliverables. +For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}, where we distribute the source code of all (FOSS) software used in each project, as deliverables. (4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria. @@ -634,38 +636,66 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314 %% Biography \begin{IEEEbiographynophoto}{Mohammad Akhlaghi} - is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias (IAC), Spain. - He has a PhD from Tohoku University (Japan) and was previously a CNRS postdoc in Lyon (France). - Email: mohammad-AT-akhlaghi.org; Website: \url{https://akhlaghi.org}. + is currently a Postdoctoral Researcher with the Instituto de Astrof\'isica de Canarias (IAC), Santa Cruz de Tenerife, Spain. + Prior to this, he was a CNRS postdoc in Lyon, France. + He received the Ph.D. degree from Tohoku University, Sendai, Japan. + He is the corresponding author of this article. + His ORCID ID is \href{https://orcid.org/0000-0003-1710-6613}{0000-0003-1710-6613}. + For this article he is affiliated with: + 1) Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200. La Laguna, Tenerife, Spain. 
+  2) Facultad de F\'isica, Universidad de La Laguna, Avda. Astrof\'isico Fco. S\'anchez s/n, 38200. La Laguna, Tenerife, Spain.
+  3) Univ Lyon, Ens de Lyon, Univ Lyon1, CNRS, Centre de Recherche Astrophysique de Lyon UMR5574, F-69007, Lyon, France.
+  For more details, visit \url{https://akhlaghi.org}.
+  Contact him at mohammad@akhlaghi.org.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
-  is a doctoral student at IAC, Spain.
-  He has an M.Sc from the University of Granada (Spain).
-  Email: infantesainz-AT-gmail.com; Website: \url{https://infantesainz.org}.
+  is currently a doctoral student at IAC, Spain.
+  He received the M.Sc. degree from the University of Granada, Granada, Spain.
+  His ORCID ID is \href{https://orcid.org/0000-0002-6220-7133}{0000-0002-6220-7133}.
+  For this article he is affiliated with:
+  1) Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200. La Laguna, Tenerife, Spain.
+  2) Facultad de F\'isica, Universidad de La Laguna, Avda. Astrof\'isico Fco. S\'anchez s/n, 38200. La Laguna, Tenerife, Spain.
+  For more details, visit \url{https://infantesainz.org}.
+  Contact him at infantesainz@gmail.com.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Boudewijn F. Roukema}
-  is a professor of cosmology at the Institute of Astronomy, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University in Toru\'n, Grudziadzka 5, Poland.
-  He has a PhD from Australian National University. Email: boud-AT-astro.uni.torun.pl.
+  is a professor of cosmology with the Institute of Astronomy, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Toru\'n, Poland.
+  He received the Ph.D. degree from the Australian National University, Canberra, ACT, Australia.
+  His ORCID ID is \href{https://orcid.org/0000-0002-3772-0250}{0000-0002-3772-0250}.
+  For this article he is affiliated with:
+  1) Institute of Astronomy, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100 Toru\'n, Poland.
+  2) Univ Lyon, Ens de Lyon, Univ Lyon1, CNRS, Centre de Recherche Astrophysique de Lyon UMR5574, F-69007, Lyon, France.
+  Contact him at boud@astro.uni.torun.pl.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Mohammadreza Khellat}
-  is the backend technical services manager at Ideal-Information, Oman.
-  He has an M.Sc in theoretical particle physics from Yazd University (Iran).
-  Email: mkhellat-AT-ideal-information.com.
+  is currently the Backend Technical Services Manager at Ideal-Information, Muscat, Oman.
+  He received the M.Sc. degree in theoretical particle physics from Yazd University, Yazd, Iran.
+  His ORCID ID is \href{https://orcid.org/0000-0002-8236-809X}{0000-0002-8236-809X}.
+  For this article he is affiliated with:
+  1) Ideal-Information, PC 133 Al Khuwair, PO Box 1886, Muscat, Oman.
+  Contact him at mkhellat@ideal-information.com.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{David Valls-Gabaud}
  is a CNRS Research Director at LERMA, Observatoire de Paris, France.
-  Educated at the universities of Madrid, Paris, and Cambridge, he obtained his PhD in 1991.
-  Email: david.valls-gabaud-AT-obspm.fr.
+  He was educated at the Universities of Madrid, Paris, and Cambridge, and received the Ph.D. degree in 1991.
+  His ORCID ID is \href{https://orcid.org/0000-0002-9821-2911}{0000-0002-9821-2911}.
+  For this article, he is affiliated with:
+  1) Paris Observatory, 26914 Paris, \^Ile-de-France, France.
+  Contact him at david.valls-gabaud@observatoiredeparis.psl.eu.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Roberto Baena-Gall\'e}
-  is a professor at the Universidad Internacional de La Rioja.
-  He previously held a postdoc position at IAC and obtained a degree at the University of Seville, with a PhD at the University of Barcelona.
-  Email: roberto.baena-AT-unir.net.
+  is a professor at the Universidad Internacional de La Rioja, La Rioja, Spain.
+  He was a postdoc with the Instituto de Astrof\'isica de Canarias (IAC), Spain.
+  He received a degree from the University of Seville, Seville, Spain, and a Ph.D. degree from the University of Barcelona, Barcelona, Spain.
+  His ORCID ID is \href{https://orcid.org/0000-0001-5214-7408}{0000-0001-5214-7408}.
+  For this article, he is affiliated with:
+  Universidad Internacional de La Rioja (UNIR), Gran V\'ia Rey Juan Carlos I, 41. 26002 Logro\~no, La Rioja, Spain.
+  Contact him at roberto.baena@unir.net.
\end{IEEEbiographynophoto}

\vfill