Diffstat (limited to 'paper.tex')
-rw-r--r-- | paper.tex | 357 |
1 files changed, 116 insertions, 241 deletions
@@ -65,17 +65,14 @@ %% AIM We therefore aim to introduce a set of criteria to address this problem. %% METHOD - These criteria have been tested in several research publications and have the following features: completeness - (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); - modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis - with narrative; and free software. + These criteria have been tested in several research publications and have the following features: completeness (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software. %% RESULTS As a proof of concept, ``Maneage'' is introduced for storing projects in machine-actionable and human-readable plain text, enabling cheap archiving, provenance extraction, and peer verification. %% CONCLUSION - We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. We then - discuss the caveats (with proposed solutions) and conclude with the benefits for the various stakeholders. This paper - is itself written with Maneage (project commit \projectversion). + We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. + We then discuss the caveats (with proposed solutions) and conclude with the benefits for the various stakeholders. + This paper is itself written with Maneage (project commit \projectversion). 
\vspace{3mm} \emph{Reproducible supplement} --- All products in \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{zenodo.3872248}}, @@ -128,8 +125,7 @@ creates generational gaps in the scientific community, preventing previous gener \section{Commonly used tools and their longevity} -Longevity is as important in science as in some fields of industry, but this ideal is not always achieved; e.g., -fast-evolving tools can be appropriate in short-term commercial projects. +Longevity is as important in science as in some fields of industry, but this ideal is not always necessary; e.g., fast-evolving tools can be appropriate in short-term commercial projects. To highlight the necessity, some of the most commonly-used tools are reviewed here from this perspective. A set of third-party tools that are commonly used in solutions are reviewed here. They can be categorized as: @@ -138,8 +134,7 @@ They can be categorized as: (3) job management -- shell scripts, Make, SCons, or CGAT-core; (4) notebooks -- Jupyter. -To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} -(which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019). +To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019). However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution. We will thus focus on Docker here. @@ -150,79 +145,57 @@ recent five are archived. Hence, if the Dockerfile is run in different months, its output image will contain different OS components. In the year 2024, when long-term support for this version of Ubuntu expires, the image will be unavailable at the expected URL. 
Other OSes have similar issues because pre-built binary files are large and expensive to maintain and archive. -Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, -so older Docker images may not be executable. +Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable. Once the host OS is ready, PMs are used to install the software or environment. -Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are -built with generic PMs. -The former suffers from the same longevity problem as the OS, while some of the latter (such as Conda and Spack) -are written in high-level languages like Python, so the PM itself depends on the host's Python installation. +Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs. +The former suffers from the same longevity problem as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation. Nix and GNU Guix produce bit-wise identical programs, but they need root permissions and are primarily targeted at the Linux kernel. -Generally, the exact version of each software's dependencies is not precisely identified in the PM build -instructions (although this could be implemented). -Therefore, unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use -the most recent version. -Furthermore, because each third-party PM introduces its own language, framework, and version history (the PM itself may -evolve), they increase a project's complexity. 
+Generally, the exact version of each software's dependencies is not precisely identified in the PM build instructions (although this could be implemented). +Therefore, unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version. +Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity. With the software environment built, job management is the next component of a workflow. -Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) -encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors -of the project. -Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or -data management) will rarely apply best practices. +Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors of the project. +Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices. This includes automatic verification, which is possible in many solutions, but is rarely practiced. -Weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, - or validating will be expensive). +Weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive). 
Finally, to add narrative, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity. -However, because of their complex dependency trees, they are vulnerable to the passage of time; -e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies. +However, because of their complex dependency trees, they are vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies. It is important to remember that the longevity of a project is determined by its shortest-lived dependency. -Further, as with job management, computational notebooks do not actively encourage good practices in programming -or project management, -hence they can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}. +Further, as with job management, computational notebooks do not actively encourage good practices in programming or project management, hence they can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}. An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}). -Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) -with example images/data that can also be executed on their webpage. +Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage. This is possible owing to the focus on low-level algorithms with no dependencies beyond an ISO C compiler. -However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex -data formats and analysis, and hence this solution is not scalable. 
+However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, and hence this solution is not scalable. \section{Proposed criteria for longevity} -The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more -effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}. +The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}. In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles). -Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, -and maintaining it by repository staff is often either practically unfeasible or unscalable. -We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a -research project, but can also increase the FAIRness of the deliverables for future researchers: +Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable. +We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers: \textbf{Criterion 1: Completeness.} A project that is complete (self-contained) has the following properties. (1) It has no dependency beyond the Portable Operating System Interface: POSIX (a minimal Unix-like environment). 
POSIX has been developed by the Austin Group (which includes IEEE) since 1988 and many OSes have complied. -(2) ``No dependency'' requires that the project itself must be primarily stored in plain text, not needing -specialized software to open, parse, or execute. +(2) ``No dependency'' requires that the project itself must be primarily stored in plain text, not needing specialized software to open, parse, or execute. (3) It does not affect the host OS (its libraries, programs, or environment). (4) It does not require root or administrator privileges. (5) It builds its own controlled software for an independent environment. (6) It can run locally (without an internet connection). -(7) It contains the full project's analysis, visualization \emph{and} narrative: -from access to raw inputs to doing the analysis, producing final data products \emph{and} -its final published report with figures \emph{as output}, e.g., PDF or HTML. +(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures \emph{as output}, e.g., PDF or HTML. (8) It can run automatically, with no human interaction. \textbf{Criterion 2: Modularity.} -A modular project enables and encourages the analysis to be broken into independent modules with well-defined -inputs/outputs and minimal side effects. +A modular project enables and encourages the analysis to be broken into independent modules with well-defined inputs/outputs and minimal side effects. Explicit communication between various modules enables optimizations on many levels: (1) Execution in parallel and avoiding redundancies (when a dependency of a module has not changed, it will not be re-run). (2) Usage in other projects. 
@@ -235,8 +208,7 @@ Minimal complexity can be interpreted as: (1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis). A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years. More stable/basic tools can be used with less long-term maintenance. -(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated -in the same framework, it becomes easier to adopt and encourages good practices. +(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices. \textbf{Criterion 4: Scalability.} A scalable project can easily be used in arbitrarily large and/or complex projects. @@ -250,27 +222,21 @@ Reproduction should be straightforward enough so that ``\emph{a clerk can do it} No exploratory research is done in a single, first attempt. Projects evolve as they are being completed. It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed. -Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used -here because it gave lower random error}''. +Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''. The derivation ``history'' of a result is thus not any the less valuable as itself. \textbf{Criterion 7: Including narrative, linked to analysis.} A project is not just its computational analysis. A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it. 
-A narrative description must also be part of the deliverables (defined as ``data article'' in \cite{austin17}): -describing the purpose of the computations, and interpretations of the result, and the context in relation to +A narrative description must also be part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers. -This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, -in time it may get separated from its accompanying published paper. +This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper. \textbf{Criterion 8: Free and open source software:} -Reproducibility (defined in \cite{fineberg19}) can be achieved with a black box (non-free or non-open-source software); -this criterion is therefore necessary because nature is already a black box. +Reproducibility (defined in \cite{fineberg19}) can be achieved with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box. A project that is free software (as formally defined), allows others to learn from, modify, and build upon it. -When the software used by the project is itself also free, the lineage can be traced to the core algorithms, -possibly enabling optimizations on that level and it can be modified for future hardware. -In contrast, non-free tools typically cannot be distributed or modified by others, making it reliant on a single supplier -(even without payments). +When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware. 
+In contrast, non-free tools typically cannot be distributed or modified by others, making it reliant on a single supplier (even without payments). @@ -283,66 +249,47 @@ In contrast, non-free tools typically cannot be distributed or modified by other \section{Proof of concept: Maneage} -With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation -that has been tested in published papers \cite{akhlaghi19, infante20}. -It was in fact awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA -and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective. +With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. +It was in fact awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective. -The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at -\url{https://maneage.org}. +The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. It was developed as a parallel research project over five years of publishing reproducible workflows of our research. -The original implementation was published in \cite{akhlaghi15}, and evolved in -\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. 
+The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. Technically, the hardest criterion to implement was the first (completeness) and, in particular, avoiding non-POSIX dependencies). One solution we considered was GNU Guix and Guix Workflow Language (GWL). However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion. -Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} -the analysis workflow: Make. +Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make. -Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language -like Python, R, Julia, Shell, or C). +Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell, or C). Make is standardized in POSIX and is used in almost all core OS components. -It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the -pioneers of reproducible research \cite{claerbout1992,schwab2000}. +It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}. Researchers using free software tools have also already had some exposure to it. %However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics. 
Linking the analysis and narrative (criterion 7) was historically our first design element. -To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing -a more direct and precise, yet modular, connection. -Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is -through automatically-created \LaTeX{} macros, during the analysis. +To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection. +Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis. For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. -The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of -\$\textbackslash{}demo\-sf\-optimized\-sn\$}'. -The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value -`\inlinecode{0.25}' when the PDF output is built. +The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'. +The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value `\inlinecode{0.25}' when the PDF output is built. Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables. -These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and -a particular analysis command. 
+These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command. This allows accurate post-publication provenance \emph{and} automatic updates to the embedded numbers during a project. -Through the latter, manual updates by authors are by-passed, which are prone to errors, thus discouraging improvements after -writing the first draft. +Through the latter, manual updates by authors are by-passed, which are prone to errors, thus discouraging improvements after writing the first draft. Acting as a link, the macro files build the core skeleton of Maneage. -For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official -name, version and possible citation. -These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), -which are excluded here due tobecause of the strict word limit. -The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized -execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; -see Figure~1 of \cite{alliez19}). +For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation. +These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit. +The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}). 
All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment. -On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C -library (glibc) is being added (task 15390). +On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task 15390). Temporary relocation of a project, without building from source, can be done by building the project in a container or VM. The analysis phase of the project however is naturally different from one project to another at a low-level. -It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria -of modularity, scalability, and minimal complexity. +It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity. We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top). Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper). @@ -354,15 +301,12 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it \vspace{-3mm} \caption{\label{fig:datalineage} Top: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage. - It shows the ratio of the number of papers mentioning software tools (green line, left vertical axis) to the total number of - papers studied in that year (light red bars, right vertical axis on a log scale). + It shows the ratio of the number of papers mentioning software tools (green line, left vertical axis) to the total number of papers studied in that year (light red bars, right vertical axis on a log scale). 
Bottom: Schematic representation of the data lineage, or workflow, to generate the plot above. Each colored box is a file in the project and the arrows show the dependencies between them. Green files/boxes are plain-text files that are under version control and in the project source directory. - Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined - as a \emph{target}. - For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory; generated automatically) - and \inlinecode{paper.tex} (in the source directory; written manually). + Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}. + For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually). The solid arrows and full-opacity built boxes correspond to this paper. The dashed arrows and low-opacity built boxes show the scalability by adding hypothetical steps to the project. The underlying data of the top plot is available at @@ -370,15 +314,10 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it } \end{figure*} -The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see -Listing \ref{code:topmake}). -It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, -without doing any analysis itself. -This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over -\inlinecode{top-make.mk} (they are produced by the subMakefiles under them). 
-A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project ( -irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good -practice). +The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}). +It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself. +This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them). +A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice). A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research. \begin{lstlisting}[ @@ -404,18 +343,13 @@ include $(foreach s,$(makesrc), \ reproduce/analysis/make/$(s).mk) \end{lstlisting} -All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and -\inlinecode{paper.mk} (Listing \ref{code:topmake}). +All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}). Project authors add their modular subMakefiles in between. -Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file -with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}). 
-Other built files (intermediate analysis steps) cascade down in the lineage to -one of these macro files, possibly through other files. - -Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} -to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}). -All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, -to automatically ensure exact reproducibility. +Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}). +Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files. + +Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}). +All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility. Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors. \begin{figure*}[t] @@ -426,41 +360,28 @@ Where exact reproducibility is not possible, values can be verified by any stati (a) A hypothetical project's history prior to publication. The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage. (b) A finished/published project can be revitalized for new technologies by merging with the core branch. 
- Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the - team that is/was working on the branch. + Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch. Briefly, Git is a version control system, allowing a structured backup of project files. Each Git ``commit'' effectively contains a copy all the project's files at the moment it was made. The upward arrows at the branch-tops are therefore in the direction of time. } \end{figure*} -To further minimize complexity, the low-level implementation can be further separated from the high-level execution through -configuration files. -By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, -settings, or parameters. -Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective -program by Make. -For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported -datasets, thereby enabling exact verification before usage. -To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which -is not in their original plot). -The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result -(\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}. +To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files. +By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings, or parameters. 
+Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make. +For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, thereby enabling exact verification before usage. +To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot). +The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}. Both numbers are expanded as \LaTeX{} macros when creating this PDF file. -An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without -knowing the underlying low-level implementation. +An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without knowing the underlying low-level implementation. Furthermore, the configuration files are a prerequisite of the targets that use them. -If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's -source or other built products. -This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; -e.g., by co-authors or future readers), and ensures self-consistency. - -Finally, to satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage -(see Figure \ref{fig:branching}). -Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build -recipes, common subMakefiles, and interface script). 
-Derived projects start by branching off and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its -particular analysis, see Listing \ref{code:branching}, there is customization checklist in \inlinecode{README-hacking.md}). +If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products. +This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency. + +Finally, to satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}). +Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script). +Derived projects start by branching off and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for their particular analysis; see Listing \ref{code:branching}, and the customization checklist in \inlinecode{README-hacking.md}). \begin{lstlisting}[ label=code:branching, @@ -482,18 +403,12 @@ $ ./project make # Re-build to see effect. $ git add -u && git commit # Commit changes. \end{lstlisting} -The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: -\emph{merge}), thus improving its low-level infrastructure: -in (a) authors do the merge during an ongoing project; -in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug -is fixed in Maneage. -Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of -each individual project, before \emph{and} after publication.
+The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project; +in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage. +Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication. Finally, the complete project source is usually $\sim100$ kilo-bytes. -It can thus easily be published or archived in many servers, for example it can be uploaded to arXiv (with the -\LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in -SoftwareHeritage. +It can thus easily be published or archived in many servers, for example it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage. @@ -508,46 +423,30 @@ SoftwareHeritage. %% Attempt to generalise the significance. %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally. -We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience -in testing them through this proof-of-concept tool. +We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool. Maneage user-base grew with the support of RDA, underscoring some difficulties for a widespread adoption. 
-Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the -command-line and Make) are not widely used. -Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career -researchers, have started mastering these tools. -Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that -change every few years discourages them. -Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for -short-term projects before moving on to the next technology. +Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used. +Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools. +Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them. +Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology. Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity. -Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working -template, using mature and time-tested tools, for blending version control, the research paper's narrative, the software management -\emph{and} a robust data carpentry. 
-We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more -effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool -individually. - -Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various -POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to -FreeBSD also). +Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} robust data carpentry. +We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually. + +Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions and macOS, and is also being ported to FreeBSD). This requires maintenance by our core team and consumes time and energy. -However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed -some early users adding, or fixing, their required software alone. +However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software on their own. They later share their low-level commits on the core branch, thus propagating it to all derived projects.
A related caveat is that, POSIX is a fuzzy standard, not guaranteeing the bit-wise reproducibility of programs. -It has been chosen here, however, as the underlying platform because our focus is on reproducing the results (data), -which does not necessarily need bit-wise reproducible software. -POSIX is ubiquitous and low-level software (e.g., core GNU tools) are install-able on most; each internally corrects for -differences affecting its functionality (partly as part of the GNU portability library). +It has been chosen here, however, as the underlying platform because our focus is on reproducing the results (data), which does not necessarily need bit-wise reproducible software. +POSIX is ubiquitous and low-level software (e.g., core GNU tools) are install-able on most; each internally corrects for differences affecting its functionality (partly as part of the GNU portability library). On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain. However, glibc is not install-able on some POSIX OSs (e.g., macOS). -All programs link with the C library, and this may hypothetically hinder the exact reproducibility \emph{of results} on -non-GNU/Linux systems, but we have not encountered this in our research so far. -With everything else under precise control, the effect of differing Kernel and C libraries on high-level science can now be -systematically studied with Maneage in follow-up research. +All programs link with the C library, and this may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far. +With everything else under precise control, the effect of differing Kernel and C libraries on high-level science can now be systematically studied with Maneage in follow-up research. % DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal. 
%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors. @@ -556,37 +455,23 @@ systematically studied with Maneage in follow-up research. %This is a long-term goal and would require major changes to academic value systems. %2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it. -Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but this proof of -concept already shows their many advantages. -For example, publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized -for desired characteristics of the results (e.g., via machine learning). +Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but this proof of concept already shows their many advantages. +For example, publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning). The completeness criterion implies that algorithms and data selection can be included in the optimizations. -Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the -connection between an analysis with the resulting narrative \emph{and} the history of that analysis+narrative. +Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection of an analysis with the resulting narrative \emph{and} the history of that analysis+narrative. Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
-Likewise, when a bug is found in one science software, affected projects can be detected and the scale of the effect can be -measured. -Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited -(e.g., even failed/abandoned tests at any historical point). -Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, -useful for project PIs and grant funders. +Likewise, when a bug is found in one science software, affected projects can be detected and the scale of the effect can be measured. +Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point). +Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders. From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in \cite{austin17}: -(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by -database curators, thereby improving sustainability. -(2) Automated and persistent bidirectional linking of data and publication can be established through the published -\emph{and complete} data lineage that is under version control. -(3) Software management: -with these criteria, each project comes with its unique and complete software management. -It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), -hence enabling robust software management, preservation, publishing, and citation. 
-For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, -\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, -\href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to -distribute the tarballs of all the software used with each project's source as deliverables. -(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', -which effectively summarizes the whole purpose of these criteria. +(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, thereby improving sustainability. +(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control. +(3) Software management: with these criteria, each project comes with its unique and complete software management. +It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), hence enabling robust software management, preservation, publishing, and citation. +For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to distribute the tarballs of all the software used with each project's source as deliverables. +(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria. @@ -610,6 +495,7 @@ Marios Karouzos, Mohammad-reza Khellat, Johan Knapen, Tamara Kovazh, +Terry Mahoney, Ryan O'Connor, Simon Portegies Zwart, Idafen Santana-P\'erez, @@ -629,8 +515,7 @@ Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN. 
The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P. The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment. -The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project ``Using machine-learning -techniques to drag galaxies from the noise in deep imaging''. +The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project ``Using machine-learning techniques to drag galaxies from the noise in deep imaging''. The ``A next-generation worldwide quantum sensor network with optical atomic clocks'' project of the TEAM IV programme of the Foundation for Polish Science co-financed by the EU under ERDF. The Polish MNiSW grant DIR/WK/2018/12. @@ -651,26 +536,20 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314 %% Biography \begin{IEEEbiographynophoto}{Mohammad Akhlaghi} is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. - His main scientific interest is in early galaxy evolution, but to extract information from the modern complex datasets, he has - been involved in image processing and reproducible workflow management, where he has founded GNU Astronomy Utilities (Gnuastro) - and Maneage (introduced here). - He received his PhD in astronomy from Tohoku University, Sendai Japan, and before coming to Tenerife, held a CNRS postdoc - position at the Centre de Recherche Astrophysique de Lyon (CRAL). + His main scientific interest is in early galaxy evolution, but to extract information from the modern complex datasets, he has been involved in image processing and reproducible workflow management, where he has founded GNU Astronomy Utilities (Gnuastro) and Maneage (introduced here). 
+ He received his PhD in astronomy from Tohoku University, Sendai, Japan, and before coming to Tenerife, held a CNRS postdoc position at the Centre de Recherche Astrophysique de Lyon (CRAL), France. Contact him at mohammad@akhlaghi.org and find his website at \url{https://akhlaghi.org}. \end{IEEEbiographynophoto} \begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz} is a doctoral student at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. - He has been concerned about the ability of reproducing scientific results since the start of his research and has thus been - actively involved in development and testing of Maneage. - His main scientific interests are galaxy formation and evolution, studying the low-surface-brightness Universe through - reproducible methods. + He has been concerned about the ability to reproduce scientific results since the start of his research and has thus been actively involved in the development and testing of Maneage. + His main scientific interests are galaxy formation and evolution, studying the low-surface-brightness Universe through reproducible methods. Contact him at infantesainz@gmail.com and find his website at \url{https://infantesainz.org}. \end{IEEEbiographynophoto} \begin{IEEEbiographynophoto}{Boudewijn F. Roukema} - is a professor at the Institute of Astronomy in the Faculty of Physics, Astronomy and Informatics at Nicolaus Copernicus - University in Toru\'n, Grudziadzka 5, Poland. + is a professor at the Institute of Astronomy in the Faculty of Physics, Astronomy and Informatics at Nicolaus Copernicus University in Toru\'n, Grudziadzka 5, Poland. His research includes galaxy formation, large scale structure of the Universe, cosmic topology and inhomogeneous cosmology. He is involved in experimental research aimed at improving standards in research reproducibility. Roukema obtained his PhD in astronomy and astrophysics at the Australian National University.
@@ -688,12 +567,8 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314 \begin{IEEEbiographynophoto}{Roberto Baena-Gall\'e} is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias, Spain. He previously worked at University of Barcelona, and ONERA-The French Aerospace Lab. - His research interests are image processing and resolution of inverse problems, with applications to AO corrected FOVs, - satellite identification and retina images. - He is currently involved in projects related with PSF estimation of large astronomical surveys and Machine Learning following - reproducibility standards. - Baena-Gall\'e has both MS in Telecommunication and Electronic Engineering from University of Seville (Spain) - and received a PhD in astronomy from University of Barcelona (Spain). + His research interests are image processing and resolution of inverse problems, following reproducibility standards. + Baena-Gall\'e has both MS in Telecommunication and Electronic Engineering from University of Seville (Spain) and received a PhD in astronomy from University of Barcelona (Spain). Contact him at rbaena@iac.es. \end{IEEEbiographynophoto} \vfill