From a0b5dd69b1fddfc54a9da0422539dff09e2e114d Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Sat, 30 May 2020 05:24:46 +0100 Subject: Discussion on issues with POSIX and minor edits to shorten paper Konrad raised some very interesting points in particular about the limitations of POSIX as a fuzzy standard that does not guarantee reproducibility. A relatively long paragraph was thus added in the discussion to address this important point. In order to fit it in, the paragraph on "unwanted competition" was removed since the POSIX issue was much more relevant for a curious reader. Throughout the text, some other parts were edited to decrease the length of the paper while making it easier to read. --- paper.tex | 86 ++++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 47 insertions(+), 39 deletions(-) diff --git a/paper.tex b/paper.tex index b8b502b..1f1a977 100644 --- a/paper.tex +++ b/paper.tex @@ -114,9 +114,9 @@ Hence, the evolving technology landscape creates generational gaps in the scient \section{Commonly used tools and their longevity} -While longevity is important in science and some fields of industry, this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects. -To highlight the necessity of longevity in reproducible research, some of the most commonly used tools are reviewed here from this perspective. -Most existing solutions use a common set of third-party tools that can be categorized as: +Longevity is important in science and some fields of industry, but this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects. +To highlight the necessity of longevity, some of the most commonly used tools are reviewed here from this perspective. 
+Most solutions use a common set of third-party tools that can be categorized as: (1) environment isolators -- virtual machines (VMs) or containers; (2) package managers (PMs) -- Conda, Nix, or Spack; (3) job management -- shell scripts, Make, SCons, or CGAT-core; @@ -241,19 +241,19 @@ In contrast, a non-free software package typically cannot be distributed by othe With the longevity problems of existing tools outlined above, a proof of concept is presented via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective. -The proof-of-concept is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''). -It was developed as a parallel research project over 5 years of publishing reproducible workflows to supplement our research. +The proof-of-concept is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. +It was developed as a parallel research project over 5 years of publishing reproducible workflows of our research. The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. Technically, the hardest criterion to implement was the completeness criterion (and, in particular, avoiding non-POSIX dependencies). Minimizing complexity was also difficult. One proposed solution was the Guix Workflow Language (GWL), which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM). -The fact that Guix requires root access to install, and only works with the Linux kernel were problematic. 
+However, because Guix requires root access to install and only works with the Linux kernel, it failed the completeness criterion. Inspired by GWL+Guix, a single job management tool was used for both installing of software \emph{and} the analysis workflow: Make. Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C). Make is standardized in POSIX and is used in almost all core OS components. -It is thus mature, actively maintained and highly optimized. +It is thus mature, actively maintained and highly optimized (in a functional-like paradigm, enabling exact provenance). Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (at least when building research software). %However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics. @@ -261,8 +261,8 @@ Linking the analysis and narrative was another major design choice. Literate programming, implemented as Computational Notebooks like Jupyter, is currently popular. However, due to the problems above, our implementation follows a more abstract linkage, providing a more direct and precise, but modular connection (modularized into specialised files). -Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through \LaTeX{} macros, which are automatically defined during the analysis. -For example, in the abstract of \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. 
+Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically created \LaTeX{} macros (during the analysis). +For example, in \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'. The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built. Since values like this depend on the analysis, they should also be reproducible, along with figures and tables. @@ -270,20 +270,18 @@ These macros act as a quantifiable link between the narrative and analysis, with This allows accurate provenance post-publication \emph{and} automatic updates to the text prior to publication. Manually updating these in the narrative is prone to errors and discourages improvements after writing the first draft. -The ultimate aim of any project is to produce a report accompanying a dataset, providing visualizations, or a research article in a journal. -Let's call this \inlinecode{paper.pdf}. -Acting as a link, the macro files described above therefore build the core skeleton of Maneage. +Acting as a link, these macro files therefore build the core skeleton of Maneage. For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation. These are combined in the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; excluded here due to the strict word limit). 
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}). -Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live) for an \emph{almost} exact reproducible environment on POSIX-compatible systems. +Software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment. On GNU/Linux operating systems, the GNU Compiler Collection (GCC) is also built from source and the GNU C library is being added (task 15390). Fast relocation of a project (without building from source) can be done by building the project in a container or VM. -In building software, normally only the high-level choice of which software to build differs between projects. +In building software, normally the only difference between projects is choice of which software to build. However, the analysis will naturally be different from one project to another at a low-level. -It was thus necessary to design a generic system to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity. +It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity. We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top). Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper). 
@@ -329,10 +327,10 @@ include $(foreach s,$(makesrc), \ reproduce/analysis/make/$(s).mk) \end{lstlisting} -Analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile). +The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile, see Listing \ref{code:topmake}). It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself. This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them). -As shown in Listing \ref{code:topmake}, a visual inspection of this file allows a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details); provided that the subMakefile names are descriptive (thus encouraging good practice). +A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details); provided that the subMakefile names are descriptive (thus encouraging good practice). A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows. All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}). @@ -347,12 +345,12 @@ This step was not yet implemented in \cite{akhlaghi19, infante20}. To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files. By convention in Maneage, the subMakefiles, and the programs they call for number crunching, do not contain any fixed numbers, settings or parameters. 
-Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program. -For example, in Figure \ref{fig:datalineage}, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage. +Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make. +For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage. To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot). The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}. Both are expanded as \LaTeX{} macros when creating this PDF file. -This enables a random reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing the underlying low-level implementation. +A random reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without necessarily knowing the underlying low-level implementation. Furthermore, the configuration files are a prerequisite of the targets that use them. If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products. This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency. 
@@ -384,15 +382,20 @@ $ cd project $ git remote rename origin origin-maneage $ git checkout -b master -# Build the project in two phases: +# Build the raw Maneage skeleton in two phases. $ ./project configure # Build software environment. $ ./project make # Do analysis, build PDF paper. + +# Start editing, test-building and committing +$ emacs paper.tex # e.g., add project title. +$ ./project make # Re-build to see effect. +$ git add -u && git commit # Commit changes \end{lstlisting} -As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import or merge Maneage into a project to improve the low-level infrastructure: +As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import (technically: \emph{merge}) Maneage into a project and improve the low-level infrastructure: in (a) the authors merge Maneage during an ongoing project; -in (b) readers can do it after the paper's publication, e.g., when the project remains reproducible but the infrastructure is outdated, or a bug is found in Maneage. -Low-level improvements in Maneage are thus easily propagated to all projects. +in (b) readers do it after the paper's publication, e.g., when the project remains reproducible but the infrastructure is outdated, or a bug is found in Maneage. +Low-level improvements in Maneage can thus easily propagate to all projects. This greatly reduces the cost of curation and maintenance of each individual project, before \emph{and} after publication. @@ -415,40 +418,45 @@ Here, we comment on our experience in testing them through the proof of concept. We will discuss the design principles, and how they may be generalized and usable in other projects. In particular, with the support of RDA, the user base grew phenomenally, highlighting some difficulties for wide-spread adoption. 
-Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used. +Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used by many. Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools. Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them. Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving on to the next technology. Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity. -Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} robust data carpentry. +Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} robust data carpentry. We have seen that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually. 
Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several different GNU/Linux distributions and on macOS). This requires maintenance by our core team and consumes time and energy. However, the PM and analysis share the same job manager (Make), design principles and conventions. -Our experience so far has shown that users' experience in the analysis empowers some of them to add or fix their required software on their own systems. -Later, they share them as commits on the core branch, thus propagating it to all derived projects. -This has already occurred multiple times. - -Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors. -We propose these solutions: -1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote. -This is a long-term goal and would require major changes to academic value systems. -2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it. +We have thus found that, more than once, advanced users have added or fixed their required software on their own and shared their low-level commits on the core branch, thus propagating them to all derived projects. + +On a related note, POSIX is a fuzzy standard and it doesn't guarantee the bit-wise reproducibility of programs. +However, it has been chosen as the underlying platform here because the results (data) are our focus, not the compiled software. 
+POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on the majority of POSIX-compatible systems; each correcting for differences affecting their functionality. +On GNU/Linux hosts, Maneage even builds a precise version of the GNU Compiler Collection (GCC), GNU Binutils and GNU C library (glibc). +However, glibc is not installable on some POSIX OSs (e.g., macOS). +The C library is linked with all programs, thus theoretically this can hinder exact reproducibility \emph{of results}, but we have not encountered any such case so far. +When present, the non-reproducibility of high-level science results due to differing C libraries can be identified with respect to known sources of error in the analysis (like measurement errors), but this is beyond the scope here. + +%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors. +%We propose these solutions: +%1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote. +%This is a long-term goal and would require major changes to academic value systems. +%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it. Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above. However, the proof of concept already shows many advantages in adopting the criteria. -For example, publication of projects with these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (for example, via machine learning). 
+For example, publication of projects with these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning). Because of the completeness criteria, algorithms and data selection can be similarly optimized. Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative. -Parsers can be written over projects for meta-research and provenance studies, for example, to generate ``research objects''. +Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''. As another example, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured. Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point). Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders. -From the data repository perspective, these criteria can also be very useful. -For example, with regard to the challenges mentioned in \cite{austin17}: +From the data repository perspective, these criteria can also be very useful, e.g., with regard to the challenges mentioned in \cite{austin17}: (1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, improving sustainability. (2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control. (3) Software management. -- cgit v1.2.1