author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-05-29 01:04:27 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-05-29 01:04:27 +0100
commit 78e8632f58e9dc014f5d06ae3e9d77575a993861
tree 43a5b62834264789d954b67d387316e4c2b6558b /paper.tex
parent 6634ca012913223ce7064587e2335a7a1ad28260
Added top-make.mk as a listing for demonstration, minor edits
To help show the simplicity of 'top-make.mk', it was included as a listing. I also went over some of Boud's corrections and made small edits. In particular:

- The '\label' and '\ref' to a section were removed. I did this after inspecting some of their recent papers and noticing that they generally have a simple flow, without such redirections.

- In the part about the RDA adoption grant, I moved the "from the researcher perspective" to the end, because Austin+2017 is mainly focused on data-center management, not the researcher's perspective. They do touch upon researcher solutions that can help data-base managers, but not directly the researchers. In effect, with this grant they acknowledged that our researcher-focused solution conforms with their criteria for data-base management.
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 84
1 files changed, 53 insertions, 31 deletions
diff --git a/paper.tex b/paper.tex
index ab3d9d1..d16882c 100644
--- a/paper.tex
+++ b/paper.tex
@@ -165,7 +165,7 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
-\section{Proposed criteria for longevity} \label{s-criteria}
+\section{Proposed criteria for longevity}
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
Researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
@@ -242,8 +242,8 @@ In contrast, a non-free software package typically cannot be distributed by othe
\section{Proof of concept: Maneage}
-Given that existing tools do not satisfy the full set of criteria outlined in \S\ref{s-criteria}, we present a proof of concept via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
-It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations concerning researcher perspective and ensuring longevity of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}.
+Given that existing tools do not satisfy the full set of criteria outlined above, we present a proof of concept via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
+It was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective to ensure longevity.
The proof-of-concept implementation is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage'').
It was developed along with the criteria, as a parallel research project over 5 years of publishing reproducible workflows to supplement our research.
@@ -279,15 +279,15 @@ Manually typing such numbers in the narrative is prone to errors and discourages
The ultimate aim of any project is to produce a report accompanying a dataset, providing visualizations, or a research article in a journal.
Let's call this \inlinecode{paper.pdf}.
The files hosting the macros of each analysis step (which produce numbers, tables, figures included in the report) build the core structure (skeleton) of Maneage.
-During the software building (``configuration'') phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
-These are combined for precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; these software acknowledgments are excluded here due to the strict word limit).
-These files act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live, C compiler, and the C library (task 15390) for an exactly reproducible environment.
-For fast relocation of a project (without building from source), building it in a popular container or VM is possible.
+For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
+These are combined for generating precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; these software acknowledgments are excluded here due to the strict word limit).
+These files act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live, etc., for an \emph{almost} exactly reproducible environment.
+On GNU/Linux operating systems, the C compiler is also built from source and the C library is being added (task 15390) for exact reproducibility.
+Fast relocation of a project (without building from source) can be done by building the project in a container or VM.
In building software, normally only the very high-level choice of which software to build differs between projects.
-The build recipes of any particular package generally does not change.
-However, the analysis will naturally be different from one project to another.
+However, the analysis will naturally be different from one project to another at a low level.
It was necessary for the design of this system to be generic enough to host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
@@ -311,23 +311,58 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it
}
\end{figure*}
-Analysis is orchestrated in a single point of entry (\inlinecode{top-make.mk}, which is a Makefile).
+Analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile).
It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
-This is shown in Figure \ref{fig:datalineage} (bottom) where all the built/blue files are placed over subMakefiles.
-A non-expert is expected to be able to understand the high-level logic of the project (irrespective of the low-level implementation details) by visual inspection of this file, provided that the subMakefile names are descriptive.
+This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles).
+As shown in Listing \ref{code:topmake}, a visual inspection of this file allows a non-expert to understand the high-level logic and order of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
-In all projects, \inlinecode{top-make.mk} first loads \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finishes with \inlinecode{verify.mk} and \inlinecode{paper.mk}.
-Project authors add their modular subMakefiles in between (after \inlinecode{download.mk} and before \inlinecode{verify.mk}).
-In Figure \ref{fig:datalineage} (bottom), the project-specific subMakefiles are \inlinecode{format.mk} and \inlinecode{demo-plot.mk}.
+Listing \ref{code:topmake} shows that all projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk}.
+Project authors add their modular subMakefiles in between.
Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a \LaTeX{} macro file with the same basename (a \inlinecode{.tex} file for each subMakefile of Figure \ref{fig:datalineage}).
-Other built files cascade down in the lineage (through other files) to one of these macro files.
+Other built files (outputs of intermediate analysis) cascade down in the lineage, possibly through other files, to one of these macro files.
-Irrespective of the number of subMakefiles, just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy verification criteria.
+\begin{lstlisting}[
+ label=code:topmake,
+ caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}}
+ ]
+# Default target/goal of project.
+all: paper.pdf
+
+# Define subMakefiles to load in order.
+makesrc = initialize \ # General
+ download \ # General
+ format \ # Project-specific
+ demo-plot \ # Project-specific
+ verify \ # General
+ paper # General
+
+# Load all the configuration files.
+include reproduce/analysis/config/*.conf
+
+# Load the subMakefiles in the defined order
+include $(foreach s,$(makesrc), \
+ reproduce/analysis/make/$(s).mk)
+\end{lstlisting}
+
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy verification criteria.
All the macro files, plot information and published datasets of the project are verified with their checksums here to automatically ensure exact reproducibility.
Where exact reproducibility is not possible, values can be verified by any statistical means (specified by the project authors).
This step was not yet implemented in \cite{akhlaghi19, infante20}.
+To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
+By convention in Maneage, the subMakefiles, and the programs they call for number crunching, do not contain any fixed numbers, settings or parameters.
+Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program.
+For example, in Figure \ref{fig:datalineage}, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
+To illustrate this again, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf}.
+As the lineage shows, the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
+Both are expanded as \LaTeX{} macros when creating this PDF file.
+This enables a random reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing the underlying low-level implementation.
+Furthermore, the configuration files are a prerequisite of the targets that use them.
+If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
+This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
+
\begin{figure*}[t]
\begin{center} \includetikz{figure-branching}\end{center}
\vspace{-3mm}
@@ -341,19 +376,6 @@ This step was not yet implemented in \cite{akhlaghi19, infante20}.
}
\end{figure*}
-To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
-By convention in Maneage, the subMakefiles, and the programs they call for number crunching, do not contain any fixed numbers, settings or parameters.
-Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program.
-For example, in Figure \ref{fig:datalineage}, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
-To illustrate this again, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
-The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf}.
-As the lineage shows, the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
-Both are expanded as \LaTeX{} macros when creating this PDF file.
-This enables the reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing how it was generated.
-Furthermore, the configuration files are a prerequisite of the targets that use them.
-If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
-This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
-
Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git), plays a crucial role in Maneage, as shown in Figure \ref{fig:branching}.
In practice, Maneage is a Git branch that contains the shared components (the infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for the particular analysis), see Listing \ref{code:branching}.
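
Note on the configuration-file paragraph in the diff above: it says that configuration files are prerequisites of the targets that use them, so changing a value only re-executes the dependent recipe and its descendants. The fragment below is a minimal sketch of that pattern in GNU Make, not Maneage's actual source. The variable name `demo-year`, the target `demo-count.txt`, and the two-column layout of `columns.txt` (year, then count) are assumptions made for illustration only.

```make
# Minimal sketch; all names here are hypothetical, not Maneage's own.
# demo-year.conf is assumed to contain one line such as:
#     demo-year = 1996
include demo-year.conf

# Default goal of this sketch.
all: demo-count.txt

# The configuration file is listed as a prerequisite, so editing it
# makes this target out of date. Targets that do not depend on it
# (directly or indirectly) are not rebuilt.
# Assumption: columns.txt has the year in column 1, the count in column 2.
# The recipe line below must start with a TAB character.
demo-count.txt: columns.txt demo-year.conf
	awk -v year=$(demo-year) '$$1 == year { print $$2 }' columns.txt > $@
```

With both input files present, running `make` a second time without touching `demo-year.conf` or `columns.txt` should do nothing, while editing the year in `demo-year.conf` should re-run only the `awk` recipe.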