-rw-r--r-- paper.tex 123
1 files changed, 63 insertions, 60 deletions
diff --git a/paper.tex b/paper.tex
index c19bb63..ab3d9d1 100644
--- a/paper.tex
+++ b/paper.tex
@@ -165,7 +165,7 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
-\section{Proposed criteria for longevity}
+\section{Proposed criteria for longevity} \label{s-criteria}
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
Researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
@@ -242,18 +242,19 @@ In contrast, a non-free software package typically cannot be distributed by othe
\section{Proof of concept: Maneage}
-Given the limitations of existing tools with the proposed criteria, it is necessary to show a proof of concept.
-The proof presented here has already been tested in previously published papers \cite{akhlaghi19, infante20} and was recently awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective to ensure longevity.
+Given that existing tools do not satisfy the full set of criteria outlined in \S\ref{s-criteria}, we present a proof of concept via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
+It was awarded a Research Data Alliance (RDA) adoption grant for implementing, from the researcher's perspective and with a view to ensuring longevity, the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}.
-The proof of concept is called Maneage, for \emph{Man}aging data Lin\emph{eage} (ending is pronounced like ``Lineage'').
-It was developed along with the criteria, as a parallel research project in 5 years for publishing our reproducible workflows to supplement our research.
-Its primordial form was implemented in \cite{akhlaghi15} and later evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
+The proof-of-concept implementation is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage'').
+It was developed along with the criteria, as a parallel research project over 5 years of publishing reproducible workflows to supplement our research.
+The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
-Technically, the hardest criteria to implement was the completeness criteria (and in particular no dependency beyond POSIX), blended with minimal complexity.
-One proposed solution was the Guix Workflow Language (GWL) which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM).
-But as natural scientists (astronomers), our background was with languages like Shell, Python, C or Fortran.
-Not having any exposure to Lisp/Scheme and their fundamentally different style, made it very hard for us to adopt GWL.
-Furthermore, the desired solution was meant to be easily understandable/usable by fellow scientists, who generally have not had exposure to Lisp/Scheme.
+Technically, the hardest criterion to implement was the completeness criterion (and, in particular, avoiding non-POSIX dependencies).
+Minimizing complexity was also difficult.
+One proposed solution was the Guix Workflow Language (GWL), which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM).
+However, as natural scientists (astronomers), our background was with languages like Shell, Python, C and Fortran.
+Our lack of exposure to Lisp/Scheme and their fundamentally different style made it hard for us to adopt GWL.
+Furthermore, the desired solution had to be easily usable by fellow scientists, who generally have not had exposure to Lisp/Scheme.
Inspired by GWL+Guix, a single job management tool was used for both installing of software \emph{and} the analysis workflow: Make.
Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C).
@@ -263,31 +264,32 @@ Make was recommended by the pioneers of reproducible research\cite{claerbout1992
%However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics.
Linking the analysis and narrative was another major design choice.
-Literate programming, implemented as Computational Notebooks like Jupyter, is a common solution these days.
-However, due to the problems above, our implementation follows a more abstract design: providing a more direct and precise, but modular (not in the same file) connection.
+Literate programming, implemented as Computational Notebooks like Jupyter, is currently popular.
+However, due to the problems above, our implementation follows a more abstract design that provides a more direct and precise, yet modular, connection (split across specialized files).
-Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through \LaTeX{} macros, that are automatically defined during the analysis.
+Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through \LaTeX{} macros, which are automatically defined during the analysis.
For example, in the abstract of \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is set during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built.
-Such values also depend on the analysis, hence just as plots, figures or tables they should also be reproduced.
-As a side-effect, these macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and exact analysis command.
+Since values like this depend on the analysis, they should be reproducible, along with figures and tables.
+These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
This allows accurate provenance \emph{and} automatic updates to the text when necessary.
-Manually typing such numbers in the narrative is prone to errors and discourages experimentation after writing the first draft.
-
-The ultimate aim of any project is to produce a report accompanying a dataset with some visualizations, or a research article in a journal.
-Let's call it \inlinecode{paper.pdf}.
-Hence the files hosting the macros (that go into the report) of each analysis step, build the core structure (skeleton) of Maneage.
-During the software building (``configuration'') phase, each software is identified by a \LaTeX{} file, containing its official name, version and possible citation.
-In the end, they are combined for precise software acknowledgment and citation (see the appendices of \cite{akhlaghi19, infante20}, not included here due to the strict word-limit).
-Simultaneously, these files act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (parallel, no redundancies), for any complexity (e.g., Maneage also builds Matplotlib if requested, see Figure 1 of \cite{alliez19}).
+Manually typing such numbers in the narrative is prone to errors and discourages improvements after writing the first draft.
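As a minimal sketch of the macro mechanism described above, the self-contained \LaTeX{} file below defines the macro by hand and then uses it in the quoted sentence. In Maneage the definition would be written automatically by an analysis step into a generated macro file. Defining it inline here, and using \newcommand for it, are assumptions made only for illustration.

```latex
\documentclass{article}

% Assumption for illustration: in a real project this definition would be
% written by the analysis into a generated macro file, not typed by hand.
\newcommand{\demosfoptimizedsn}{0.25}

\begin{document}
% The narrative refers only to the macro, never to the literal number.
... detect the outer wings of M51 down to S/N of $\demosfoptimizedsn$ ...
\end{document}
```

If a later run of the analysis produced a different value, only the definition would change, and the sentence would pick up the new number the next time the PDF is built.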
+
+The ultimate aim of any project is to produce a report that accompanies a dataset and provides visualizations, or a research article in a journal.
+Let's call this \inlinecode{paper.pdf}.
+The files hosting the macros of each analysis step (which produce the numbers, tables and figures included in the report) build the core structure (skeleton) of Maneage.
+During the software building (``configuration'') phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
+These are combined for precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; these software acknowledgments are excluded here due to the strict word limit).
+These files act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live, C compiler, and the C library (task 15390) for an exactly reproducible environment.
-For fast relocation of the project (without building from source) it is possible to build it in the popular container, or VM, technology of the day.
+For fast relocation of a project (without building from source), building it in a popular container or VM is possible.
-In building of software, only the very high-level choice of which software to built differs between projects and the build recipes of each software do not generally change.
+In building software, normally only the very high-level choice of which software to build differs between projects.
+The build recipes of any particular package generally do not change.
However, the analysis will naturally be different from one project to another.
-Therefore, a design was necessary to satisfy the modularity, scalability and minimal complexity criteria while being generic enough to host any project.
-To avoid getting too abstract, we will demonstrate it by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
+It was necessary for the design of this system to be generic enough to host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
+We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
\begin{figure*}[t]
@@ -297,69 +299,70 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it
\end{center}
\vspace{-3mm}
\caption{\label{fig:datalineage}
- Top: an enhanced replica of figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
- It shows the ratio of papers mentioning software tools (green line, left vertical axis) to total number of papers studied in that year (light red bars, right vertical axis in log-scale).
+ Top: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
+ It shows the ratio of the number of papers mentioning software tools (green line, left vertical axis) to the total number of papers studied in that year (light red bars, right vertical axis on a log scale).
Bottom: Schematic representation of the data lineage, or workflow, to generate the plot above.
Each colored box is a file in the project and the arrows show the dependencies between them.
Green files/boxes are plain-text files that are under version control and in the project source directory.
- Blue files/boxes are output files in the build-directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
+ Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
- The solid arrows and full-opacity built boxes are included with this paper's source.
+ The solid arrows and full-opacity built boxes correspond to this paper.
The dashed arrows and low-opacity built boxes show the scalability by adding hypothetical steps to the project.
}
\end{figure*}
Analysis is orchestrated in a single point of entry (\inlinecode{top-make.mk}, which is a Makefile).
-It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, not doing any analysis itself.
+It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
This is shown in Figure \ref{fig:datalineage} (bottom) where all the built/blue files are placed over subMakefiles.
-A random reader will be able to understand the high-level logic of the project (irrespective of the low-level implementation details) with simple visual inspection of this file, provided that the subMakefile names are descriptive.
-A human-friendly design (that is also optimized for execution) is a critical component of publishing reproducible workflows.
+A non-expert is expected to be able to understand the high-level logic of the project (irrespective of the low-level implementation details) by visual inspection of this file, provided that the subMakefile names are descriptive.
+A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
-In all projects \inlinecode{top-make.mk} first loads \inlinecode{initialize.mk} and \inlinecode{download.mk} and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk}.
-Project authors add their modular subMakefiles in between (after \inlinecode{download.mk} and before \inlinecode{verify.mk}), in Figure \ref{fig:datalineage} (bottom), the project-specific subMakefiles are \inlinecode{format.mk} \& \inlinecode{demo-plot.mk}.
-Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}) all subMakefiles build a \LaTeX{} macro file with the same base-name (a \inlinecode{.tex} in each subMakefile of Figure \ref{fig:datalineage}).
-Other built files ultimately cascade down in the lineage (through other files) to one of these macro files.
+In all projects, \inlinecode{top-make.mk} first loads \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finishes with \inlinecode{verify.mk} and \inlinecode{paper.mk}.
+Project authors add their modular subMakefiles in between (after \inlinecode{download.mk} and before \inlinecode{verify.mk}).
+In Figure \ref{fig:datalineage} (bottom), the project-specific subMakefiles are \inlinecode{format.mk} and \inlinecode{demo-plot.mk}.
+Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a \LaTeX{} macro file with the same basename (a \inlinecode{.tex} file for each subMakefile of Figure \ref{fig:datalineage}).
+Other built files cascade down in the lineage (through other files) to one of these macro files.
-Irrespective of the number of subMakefiles, just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criteria.
+Irrespective of the number of subMakefiles, just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criterion.
All the macro files, plot information and published datasets of the project are verified with their checksums here to automatically ensure exact reproducibility.
Where exact reproducibility is not possible, values can be verified by any statistical means (specified by the project authors).
-We note that this step was not yet implemented in \cite{akhlaghi19, infante20}.
+This step was not yet implemented in \cite{akhlaghi19, infante20}.
\begin{figure*}[t]
\begin{center} \includetikz{figure-branching}\end{center}
\vspace{-3mm}
- \caption{\label{fig:branching} Maneage is a Git branch, projects using Maneage are branched-off of it and apply their customizations.
- (a) shows a hypothetical project's history prior to publication.
+ \caption{\label{fig:branching} Maneage is a Git branch. Projects using Maneage are branched off it and apply their customizations.
+ (a) A hypothetical project's history prior to publication.
The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage.
- (b) shows how a finished/published project can be revitalized for new technologies simply by merging with the core branch.
- Each Git ``commit'' is shown on its branch as a colored ellipse, with their hash printed in them.
- The commits are colored based on their branch.
+ (b) A finished/published project can be revitalized for new technologies by merging with the core branch.
+ Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown.
+ The commits are colored to identify their branch.
The collaboration and two paper icons are respectively made by `mynamepong' and `iconixar' from \url{www.flaticon.com}.
}
\end{figure*}
To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
-By convention in Maneage, the subMakefiles, and the programs they call for number-crunching, do not contain any fixed numbers, settings or parameters.
+By convention in Maneage, the subMakefiles, and the programs they call for number crunching, do not contain any fixed numbers, settings or parameters.
Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program.
For example, in Figure \ref{fig:datalineage}, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
-As another demo, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+To illustrate this again, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf}.
As the lineage shows, the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
-Both are expanded in this PDF as \LaTeX{} macros.
+Both are expanded as \LaTeX{} macros when creating this PDF file.
This enables the reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing how it was generated.
Furthermore, the configuration files are a prerequisite of the targets that use them.
-Hence if changed, Make will \emph{only} re-execute the dependent recipe and all its descendants with no modification to the project's source or other built products.
-This fast/cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
+If a configuration file is changed, Make will \emph{only} re-execute the recipes that depend on it and all their descendants, with no modification to the project's source or other built products.
+This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
-Finally, to satisfy the temporal provenance criteria, version control (currently implemented in Git), plays a defining role in Maneage as shown in Figure \ref{fig:branching}.
-In practice, Maneage is a Git branch that contains the shared components, or infrastructure of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
-Every project starts by branching-off the Maneage branch and customizing it (e.g., adding their own title, input data links, writing their narrative, and subMakefiles for their analysis), see Listing \ref{code:branching}.
+Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git) plays a crucial role in Maneage, as shown in Figure \ref{fig:branching}.
+In practice, Maneage is a Git branch that contains the shared components (the infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
+Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for the particular analysis); see Listing \ref{code:branching}.
\begin{lstlisting}[
label=code:branching,
caption={Starting a new project with Maneage, and building it},
]
-# Cloning main Maneage branch and branching-off of it.
+# Cloning main Maneage branch and branching off it.
$ git clone https://git.maneage.org/project.git
$ cd project
$ git remote rename origin origin-maneage
@@ -370,11 +373,11 @@ $ ./project configure # Build software environment.
$ ./project make # Do analysis, build PDF paper.
\end{lstlisting}
-As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import or merge Maneage into the project to improve the low-level infrastructure:
-in (a) the authors merge into Maneage during an ongoing project,
-in (b) readers can do it after the paper's publication, e.g., when the project's infrastructure is outdated, or does not build, and authors cannot be accessed.
-Low-level improvements in Maneage are thus automatically propagated to all projects.
-This greatly reduces the cost of curation, or maintenance, of each individual project, before \emph{and} after publication.
+As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import or merge Maneage into a project to improve the low-level infrastructure:
+in (a) the authors merge Maneage during an ongoing project;
+in (b) readers can do it after the paper's publication, e.g., when the project remains reproducible but the infrastructure is outdated, or a bug is found in Maneage.
+Low-level improvements in Maneage are thus easily propagated to all projects.
+This greatly reduces the cost of curation and maintenance of each individual project, before \emph{and} after publication.