| -rw-r--r-- | paper.tex | 116 | 
1 files changed, 54 insertions, 62 deletions
| @@ -109,16 +109,15 @@ The cost of staying up to date within this rapidly evolving landscape is high.
 Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations.
 Decades later, scientists are still held accountable for their results.
 Hence, the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons which are too hands-on to be published in a traditional scientific paper.
-As a solution to this problem, here we introduce a set of criteria that can guarantee the longevity of a project based on our experience with existing solutions.
 \section{Commonly used tools and their longevity}
-To highlight the necessity of longevity, some of the most commonly used tools are reviewed here, from the perspective of long-term usability.
 While longevity is important in science and some fields of industry, this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects.
-Most existing reproducible workflows use a common set of third-party tools that can be categorized as:
+To highlight the necessity of longevity in reproducible research, some of the most commonly used tools are reviewed here from this perspective.
+Most existing solutions use a common set of third-party tools that can be categorized as:
 (1) environment isolators -- virtual machines (VMs) or containers;
 (2) PMs -- Conda, Nix, or Spack;
 (3) job management -- shell scripts, Make, SCons, or CGAT-core;
@@ -137,20 +136,20 @@ This is similar for other OSes: pre-built binary files are large and expensive t
 Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable.
 Once the host OS is ready, PMs are used to install the software, or environment.
-Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with more generic PMs like Conda, Nix, GNU Guix or Spack.
-The OS PM suffers from the same longevity problem as the OS.
-Some third-party tools like Conda and Spack are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
-Nix and GNU Guix do not have any dependencies and produce bit-wise identical programs, but they need root permissions.
+Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
+The former suffers from the same longevity problem as the OS.
+Some of the latter (like Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
+Nix and GNU Guix produce bit-wise identical programs, but they need root permissions.
 Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although that could be implemented).
 Therefore, unless precise version identifiers of \emph{every software package} are stored, a PM will use the most recent version.
 Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity.
 With the software environment built, job management is the next component of a workflow.
 Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the project authors.
-Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in data management) will rarely apply best practices in project management and data carpentry.
+Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
 This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).
-Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used in many solutions.
+Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used.
 However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
 The longevity of a project is determined by its shortest-lived dependency.
 Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
@@ -169,13 +168,12 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
 The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
 Researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
 Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically infeasible or unscalable.
-In this paper we argue that workflows satisfying the criteria below can improve researcher workflows during the project, reduce the cost of curation for repositories after publication, while maximizing the FAIRness of the deliverables for future researchers.
+We argue that workflows satisfying the criteria below can not just improve researcher flexibility during a research project, but can increase the FAIRness of the deliverables for future researchers.
 \textbf{Criterion 1: Completeness.}
 A project that is complete (self-contained) has the following properties.
 (1) It has no dependency beyond the Portable Operating System Interface: POSIX.
 IEEE defined POSIX (a minimal Unix-like environment) and many OSes have complied.
-It is a reliable foundation for longevity in software execution.
 (2) ``No dependency'' requires that the project itself must be primarily stored in plain text, not needing specialized software to open, parse or execute.
 (3) It does not affect the host OS (its libraries, programs, or environment).
 (4) It does not require root or administrator privileges.
@@ -202,11 +200,11 @@ More stable/basic tools can be used with less long-term maintenance.
 \textbf{Criterion 4: Scalability.}
 A scalable project can easily be used in arbitrarily large and/or complex projects.
-On a small scale, the criteria here are trivial to implement, but as the projects get more complex, an implementation can become unsustainable.
+On a small scale, the criteria here are trivial to implement, but can become unsustainable very soon.
 \textbf{Criterion 5: Verifiable inputs and outputs.}
 The project should verify its inputs (software source code and data) \emph{and} outputs.
-Reproduction should be straightforward enough such that ``\emph{a clerk can do it}''\cite{claerbout1992}, without requiring expert knowledge.
+Reproduction should be straightforward enough such that ``\emph{a clerk can do it}''\cite{claerbout1992} (with no expert knowledge).
 \textbf{Criterion 6: History and temporal provenance.}
 No exploratory research project is done in a single/first attempt.
@@ -219,11 +217,11 @@ The ``history'' is thus as valuable as the final/published version.
 A project is not just its computational analysis.
 A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
 A narrative description is also part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
-This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may become separated from its accompanying published paper in time due to the different hosts.
+This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may get separated from its accompanying published paper.
 \textbf{Criterion 8: Free and open source software:}
 Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box).
-This criterion is necessary to complement that definition (nature is already a black box).
+This criterion is necessary to complement that definition (nature is already a black box!).
 If a project is free software (as formally defined), then others can learn from, modify, and build on it.
 When the software used by the project is itself also free:
 (1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level.
@@ -241,53 +239,52 @@ In contrast, a non-free software package typically cannot be distributed by othe
 \section{Proof of concept: Maneage}
-Given that existing tools do not satisfy the full set of criteria outlined above, we present a proof of concept via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
-It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective to ensure longevity.
+With the longevity problems of existing tools outlined above, a proof of concept is presented via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
+It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective.
-The proof-of-concept implementation is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage'').
-It was developed along with the criteria, as a parallel research project over 5 years of publishing reproducible workflows to supplement our research.
+The proof-of-concept is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage'').
+It was developed as a parallel research project over 5 years of publishing reproducible workflows to supplement our research.
 The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
 Technically, the hardest criterion to implement was the completeness criterion (and, in particular, avoiding non-POSIX dependencies).
 Minimizing complexity was also difficult.
 One proposed solution was the Guix Workflow Language (GWL), which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM).
-However, as natural scientists (astronomers), our background was with languages like Shell, Python, C and Fortran.
-Our lack of exposure to Lisp/Scheme and their fundamentally different style made it hard for us to adopt GWL.
-Furthermore, the desired solution had to be easily usable by fellow scientists, who generally have not had exposure to Lisp/Scheme.
+The fact that Guix requires root access to install, and only works with the Linux kernel were problematic.
 Inspired by GWL+Guix, a single job management tool was used for both installing of software \emph{and} the analysis workflow: Make.
 Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C).
 Make is standardized in POSIX and is used in almost all core OS components.
 It is thus mature, actively maintained and highly optimized.
-Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (when building research software).
+Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (atleast when building research software).
 %However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics.
 Linking the analysis and narrative was another major design choice.
 Literate programming, implemented as Computational Notebooks like Jupyter, is currently popular.
-However, due to the problems above, our implementation follows a more abstract design that provides a more direct and precise, but modular connection (modularized into specialised files).
+However, due to the problems above, our implementation follows a more abstract linkage, providing a more direct and precise, but modular connection (modularized into specialised files).
 Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through \LaTeX{} macros, which are automatically defined during the analysis.
 For example, in the abstract of \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
 The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
 The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built.
-Since values like this depend on the analysis, they should be reproducible, along with figures and tables.
+Since values like this depend on the analysis, they should also be reproducible, along with figures and tables.
 These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
-This allows accurate provenance \emph{and} automatic updates to the text when necessary.
-Manually typing such numbers in the narrative is prone to errors and discourages improvements after writing the first draft.
+This allows accurate provenance post-publication \emph{and} automatic updates to the text pre-publication.
+Manually updating them in the narrative is prone to errors and discourages improvements after writing the first draft.
 The ultimate aim of any project is to produce a report accompanying a dataset, providing visualizations, or a research article in a journal.
 Let's call this \inlinecode{paper.pdf}.
-The files hosting the macros of each analysis step (which produce numbers, tables, figures included in the report) build the core structure (skeleton) of Maneage.
+Acting as a link, the macro filess of each analysis step (which produce numbers, tables, figures included in the report) thus build the core structure (skeleton) of Maneage.
 For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
-These are combined for generating precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; these software acknowledgments are excluded here due to the strict word limit).
-These files act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live and etc for an \emph{almost} exact reproducible environment.
-On GNU/Linux operating systems, the C compiler is also built from source and the C library is being added (task 15390) for exact reproducibility.
+These are combined in the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; excluded here due to the strict word limit).
+
+The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live and etc for an \emph{almost} exact reproducible environment on POSIX-compatible systems.
+On GNU/Linux operating systems, the GNU Compiler Collection (GCC) is also built from source and the GNU C library is being added (task 15390).
 Fast relocation of a project (without building from source) can be done by building the project in a container or VM.
-In building software, normally only the very high-level choice of which software to build differs between projects.
+In building software, normally only the high-level choice of which software to build differs between projects.
 However, the analysis will naturally be different from one project to another at a low-level.
-It was necessary for the design of this system to be generic enough to host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
+It was thus necessary to design generic system to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
 We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
 Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
@@ -310,17 +307,6 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it
   }
 \end{figure*}
-Analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile).
-It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
-This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles).
-As shown in Listing \ref{code:topmake}, a visual inspection of this file allows a non-expert to understand the high-level logic and order of the project (irrespective of the low-level implementation details); provided that the subMakefile names are descriptive (thus encouraging good practice).
-A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
-
-Listing \ref{code:topmake} shows that all projects, first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk}.
-Project authors add their modular subMakefiles in between.
-Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a \LaTeX{} macro file with the same basename (a \inlinecode{.tex} file for each subMakefile of Figure \ref{fig:datalineage}).
-Other built files (outputs of intermediate analysis) cascade down in the lineage ,possibly through other files, to one of these macro files.
-
 \begin{lstlisting}[
     label=code:topmake,
     caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}}
@@ -344,8 +330,19 @@ include $(foreach s,$(makesrc), \
             reproduce/analysis/make/$(s).mk)
 \end{lstlisting}
+Analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile).
+It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
+This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
+As shown in Listing \ref{code:topmake}, a visual inspection of this file allows a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details); provided that the subMakefile names are descriptive (thus encouraging good practice).
+A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
+
+All projects, first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
+Project authors add their modular subMakefiles in between.
+Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same basename (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
+Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
+
 Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy verification criteria.
-All the macro files, plot information and published datasets of the project are verified with their checksums here to automatically ensure exact reproducibility.
+All project deliverables (macro files, plot or table data and other datasets) are verified with their checksums here to automatically ensure exact reproducibility.
 Where exact reproducibility is not possible, values can be verified by any statistical means (specified by the project authors).
 This step was not yet implemented in \cite{akhlaghi19, infante20}.
@@ -353,9 +350,8 @@ To further minimize complexity, the low-level implementation can be further sepa
 By convention in Maneage, the subMakefiles, and the programs they call for number crunching, do not contain any fixed numbers, settings or parameters.
 Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program.
 For example, in Figure \ref{fig:datalineage}, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
-To illustrate this again, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
-The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf}.
-As the lineage shows, the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
+To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
 Both are expanded as \LaTeX{} macros when creating this PDF file.
 This enables a random reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing the underlying low-level implementation.
 Furthermore, the configuration files are a prerequisite of the targets that use them.
@@ -376,8 +372,8 @@ This fast and cheap testing encourages experimentation (without necessarily know
 \end{figure*}
 Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git), plays a crucial role in Maneage, as shown in Figure \ref{fig:branching}.
-In practice, Maneage is a Git branch that contains the shared components (the infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
-Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for the particular analysis), see Listing \ref{code:branching}.
+In practice, Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
+Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for its particular analysis), see Listing \ref{code:branching}.
 \begin{lstlisting}[
     label=code:branching,
@@ -416,9 +412,9 @@ This greatly reduces the cost of curation and maintenance of each individual pro
 %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
 We have shown that it is possible to build workflows satisfying the proposed criteria.
-Here, we comment on our experience in building this system and implementing the RDA/WDS recommendations.
+Here, we comment on our experience in testing them through the proof of concept.
 We will discuss the design principles, and how they may be generalized and usable in other projects.
-In particular, with the support of RDA, the user base, the development of the criteria and of Maneage grew phenomenally, highlighting some difficulties for the wide-spread adoption of these criteria.
+In particular, with the support of RDA, the user base grew phenomenally, highlighting some difficulties for the wide-spread adoption.
 Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
 Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
@@ -426,8 +422,8 @@ Scientists are rarely trained sufficiently in data management or software develo
 Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving on to the next technology.
 Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
-Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} a modular lineage for analysis.
-We have seen that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these essential tools for modern science than having abstract, isolated tutorials on each tool individually.
+Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} robust data carpentry.
+We have seen that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
 Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several different GNU/Linux distributions and on macOS).
 This requires maintenance by our core team and consumes time and energy.
@@ -523,9 +519,8 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
 \begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
   is a doctoral student at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain.
-  Since he started studying physics, he was always concern about being able to reproduce scientific results.
-  His main scientific interests are the galaxy formation and evolution.
-  He is currently doing his PhD. thesis and studying the low-surface-brightness Universe.
+  He has been concerned about the ability of reproducing scientific results from the start of his research and has thus been actively involved in development and testing of Maneage.
+  His main scientific interests are the galaxy formation and evolution, studying the low-surface-brightness Universe through reproducible methods.
   Contact him at infantesainz@gmail.com and find his website at \url{https://infantesainz.org}.
 \end{IEEEbiographynophoto}
@@ -538,9 +533,7 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
 \end{IEEEbiographynophoto}
 \begin{IEEEbiographynophoto}{David Valls-Gabaud}
-  Observatoire de Paris
-
-  David Valls-Gabaud is a CNRS Research Director at the Observatoire de Paris, France.
+  is a CNRS Research Director at the Observatoire de Paris, France.
   His research interests span from cosmology and galaxy evolution to stellar physics and instrumentation.
   Contact him at david.valls-gabaud@obspm.fr.
 \end{IEEEbiographynophoto}
@@ -553,7 +546,6 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
   Baena-Gall\'e has both MS in Telecommunication and Electronic Engineering from University of Seville (Spain), and received a PhD in astronomy from University of Barcelona (Spain).
   Contact him at rbaena@iac.es.
 \end{IEEEbiographynophoto}
-
 \end{document}
 %% This file is free software: you can redistribute it and/or modify it |
