-rw-r--r-- | paper.tex | 35
1 file changed, 18 insertions, 17 deletions
@@ -103,7 +103,7 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, to Java in the 2000s and mostly shifting to Python during the last decade.
-However, technologies develop very fast, e.g., Python 2 code often cannot run with Python 3, interrupting many projects in the last decade.
+However, these technologies develop very fast, e.g., Python 2 code often cannot run with Python 3, interrupting many projects in the last decade.
The cost of staying up to date within this rapidly evolving landscape is high.
Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations.
Decades later, scientists are still held accountable for their results.
@@ -116,7 +116,7 @@ Hence, the evolving technology landscape creates generational gaps in the scient
\section{Commonly used tools and their longevity}
Longevity is important in science and some fields of industry, but this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects.
To highlight the necessity of longevity, some of the most commonly used tools are reviewed here from this perspective.
-A common set of third-party tools are used by most solutions that can be categorized as:
+A common set of third-party tools that are used by most solutions can be categorized as:
(1) environment isolators -- virtual machines (VMs) or containers;
(2) package managers (PMs) -- Conda, Nix, or Spack;
(3) job management -- shell scripts, Make, SCons, or CGAT-core;
@@ -242,19 +242,19 @@ With the longevity problems of existing tools outlined above, a proof of concept
It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective.
The proof-of-concept is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
-It was developed as a parallel research project over 5 years of publishing reproducible workflows of our research.
+It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
Technically, the hardest criterion to implement was the completeness criterion (and, in particular, avoiding non-POSIX dependencies).
Minimizing complexity was also difficult.
One proposed solution was the Guix Workflow Language (GWL), which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM).
-However because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
+However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
Inspired by GWL+Guix, a single job management tool was used for both installing of software \emph{and} the analysis workflow: Make.
Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C).
Make is standardized in POSIX and is used in almost all core OS components.
-It is thus mature, actively maintained and highly optimized (in a functional-like paradigm, enabling exact provenance).
-Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (at least when building research software).
+It is thus mature, actively maintained and highly optimized (and efficient in managing exact provenance).
+Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had some exposure to it (when building research software).
%However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics.
Linking the analysis and narrative was another major design choice.
@@ -270,7 +270,7 @@ These macros act as a quantifiable link between the narrative and analysis, with
This allows accurate provenance post-publication \emph{and} automatic updates to the text prior to publication.
Manually updating these in the narrative is prone to errors and discourages improvements after writing the first draft.
-Acting as a link, these macro files therefore build the core skeleton of Maneage.
+Acting as a link, these macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
These are combined in the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; excluded here due to the strict word limit).
@@ -327,10 +327,10 @@ include $(foreach s,$(makesrc), \
          reproduce/analysis/make/$(s).mk)
\end{lstlisting}
-The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile, see Listing \ref{code:topmake}).
+The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
-A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details); provided that the subMakefile names are descriptive (thus encouraging good practice).
+A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
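For readers unfamiliar with Make, the following sketch illustrates what the subMakefile list loaded by \inlinecode{top-make.mk} might look like. The variable name \inlinecode{makesrc} and the \inlinecode{include} construct are taken from Listing \ref{code:topmake}; the step names \inlinecode{format} and \inlinecode{demo-plot} are hypothetical placeholders for project-specific analysis steps, and the ordering follows the fixed entry and exit points (\inlinecode{initialize.mk}, \inlinecode{download.mk}, \inlinecode{verify.mk}, \inlinecode{paper.mk}) noted just below.

\begin{lstlisting}
# Illustrative sketch, not verbatim project code: 'format' and
# 'demo-plot' are hypothetical analysis steps; the other names are
# the fixed first and last steps of every project.
makesrc = initialize \
          download \
          format \
          demo-plot \
          verify \
          paper

# Load each subMakefile, in the desired order (same construct as
# in Listing ref{code:topmake}).
include $(foreach s,$(makesrc), \
          reproduce/analysis/make/$(s).mk)
\end{lstlisting}

With descriptive names like these, reading \inlinecode{top-make.mk} alone conveys the high-level steps of the project, which is the point argued above.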
All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
@@ -350,7 +350,7 @@ For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf}
To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
Both are expanded as \LaTeX{} macros when creating this PDF file.
-A random reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without necessarily knowing the underlying low-level implementation.
+A user can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without necessarily knowing the underlying low-level implementation.
Furthermore, the configuration files are a prerequisite of the targets that use them.
If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
@@ -418,7 +418,7 @@ Here, we comment on our experience in testing them through the proof of concept.
We will discuss the design principles, and how they may be generalized and usable in other projects.
In particular, with the support of RDA, the user base grew phenomenally, highlighting some difficulties for wide-spread adoption.
-Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used by many.
+Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving on to the next technology.
@@ -432,13 +432,14 @@ This requires maintenance by our core team and consumes time and energy.
However, the PM and analysis share the same job manager (Make), design principles and conventions.
We have thus found that more than once, advanced users add, or fix, their required software alone and share their low-level commits on the core branch, thus propagating it to all derived projects.
-On a related note, POSIX is a fuzzy standard and it doesn't guarantee the bit-wise reproducibility of programs.
+On a related note, POSIX is a fuzzy standard that does not guarantee bit-wise reproducibility of programs.
However, it has been chosen as the underlying platform here because the results (data) are our focus, not the compiled software.
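To make the configuration-file mechanism above concrete, the following sketch shows how a value in \inlinecode{demo-year.conf} might propagate into the \LaTeX{} macro used in the narrative. Only \inlinecode{demo-year.conf}, \inlinecode{columns.txt} and the macro name \inlinecode{\menkenumpapersdemocount} come from the text; the target name, the directory layout, the Make variable \inlinecode{menke-demo-year} and the counting recipe are illustrative assumptions.

\begin{lstlisting}
# Sketch only: assumes demo-year.conf defines the Make variable
# 'menke-demo-year', and that columns.txt lists one paper per line
# with the publication year in its first column (recipe lines must
# start with a TAB).
include reproduce/analysis/config/demo-year.conf

# The macro file is rebuilt only when a prerequisite changes.
demo-count.tex: columns.txt reproduce/analysis/config/demo-year.conf
	n=$$(awk '$$1=='$(menke-demo-year)' {c++} END {print c+0}' columns.txt); \
	printf '\\newcommand{\\menkenumpapersdemocount}{%s}\n' "$$n" > $@
\end{lstlisting}

Because \inlinecode{demo-year.conf} is an explicit prerequisite, editing the year invalidates only this target and its descendants; Make rebuilds exactly that chain and the updated number appears in the PDF on the next build.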
-POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on the majority of them; each internally correcting for differences affecting their functionality.
+POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on most POSIX systems; each internally corrects for differences affecting its functionality.
On GNU/Linux hosts, Maneage builds precise versions of the GNU Compiler Collection (GCC), GNU Binutils and GNU C library (glibc).
However, glibc is not installable on some POSIX OSs (e.g., macOS).
-The C library is linked with all programs, this can theoretically hinder exact reproducibility \emph{of results}, but we have not encountered any until now.
-When present, the non-reproducibility of high-level science results due to differing C libraries can be identified with respect to known sources of error in the analysis (like measurement errors), but studying this is another research project.
+The C library is linked with all programs.
+This dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far.
+When present, the non-reproducibility of high-level science results due to differing C libraries has so far been traced to known sources of error in the analysis (like measurement errors); further study would be useful.
%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.
%We propose these solutions:
@@ -451,12 +452,12 @@ However, the proof of concept already shows many advantages in adopting the crit
For example, publication of projects with these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
Because of the completeness criteria, algorithms and data selection can be similarly optimized.
Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative.
-Parsers can be written over projects for meta-research and provenance studies, e.g.,, to generate ``research objects''.
+Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
As another example, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured.
Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders.
-From the data repository perspective, these criteria can also be very useful, e.g.,, with regard to the challenges mentioned in \cite{austin17}:
+From the data repository perspective, these criteria can also be very useful, e.g., with regard to the challenges mentioned in \cite{austin17}:
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, improving sustainability.
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control.
(3) Software management.