author     Mohammad Akhlaghi <mohammad@akhlaghi.org>  2020-12-02 01:39:40 +0000
committer  Mohammad Akhlaghi <mohammad@akhlaghi.org>  2020-12-02 01:45:08 +0000
commit     074d2b251e2d35c7f26932a7dfb8cbe18fe7a289 (patch)
tree       d249fd9b85df004324e60e2e289c74fc8500d603 /paper.tex
parent     0d81a56bac1f866680acd979229cdd0f4b56618e (diff)
Modularity in file structure discussed with other minor edits
While going through Mohammad-reza's recent two commits, I noticed that we had missed an important discussion on modularity in this version of the paper (discussing how file management should also be modular, resulting in cheaper archival and thus better longevity), so a few sentences were added under criterion 2 (Modularity). Mohammad-reza's edits were also generally very good and helped clarify many points. I only reset the part where we discuss the problems with POSIX and the inability to produce bitwise-reproducible software (which systems like Guix work very hard at, and thus need root permissions). I felt the edit missed the main point here: while bitwise reproducibility of the software is good, it is not always necessary.
Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  76
1 file changed, 50 insertions(+), 26 deletions(-)
diff --git a/paper.tex b/paper.tex
index 7c54fbd..93f8094 100644
--- a/paper.tex
+++ b/paper.tex
@@ -188,14 +188,14 @@ Designing a robust project needs to be encouraged and facilitated because scient
This includes automatic verification, which is possible in many solutions, but is rarely practiced.
Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
-Finally, to blend narrative into the workflow, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
+Finally, to blend narrative and analysis, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
-\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting which cell to run.
-The default cells do not include dependencies (requiring some cells to be run only after certain others are re-done), parallel execution, or usage of more than one language.
+\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
+By default, cells do not support dependencies (e.g., automatically re-running some cells only after certain others), parallel execution, or the use of more than one language.
There are third-party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
-However, since they are not part of the core and have their own dependencies, their longevity can be assumed to be shorter.
+However, since they are not part of the core, their longevity can be assumed to be shorter.
Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
@@ -219,8 +219,8 @@ We argue and propose that workflows satisfying the following criteria can not on
A project that is complete (self-contained) has the following properties.
(1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longevity}.}
-(2) Primarily stored as plain text \new{(ASCII encoded)}, not needing specialized software to open, parse, or execute.
-(3) No impact on the host OS libraries, programs and \new{the existing environment variables}.
+(2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
+(3) No impact on the host OS libraries, programs, and \new{environment variables}.
(4) Does not require root privileges to run (during development or post-publication).
(5) Builds its own controlled software \new{with independent environment variables}.
(6) Can run locally (without an internet connection).
@@ -228,18 +228,21 @@ Fewer explicit execution requirements would mean higher \emph{execution possibil
(8) It can run automatically, \new{without} human interaction.
\textbf{Criterion 2: Modularity.}
-A modular project enables and encourages the analysis to be broken into independent modules with well-defined inputs/outputs and minimal side effects.
+A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
+\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., built binaries), software source code (maintained separately), or data (archived separately) should be included.
+The latter two (developing low-level software or collecting data) are separate projects in themselves and can be used in other high-level projects.}
Explicit communication between various modules enables optimizations on many levels:
-(1) Execution in parallel and avoiding redundancies (when a dependency of a module has not changed, it will not be re-run).
-(2) Usage in other projects.
-(3) Easy debugging and improvements.
-(4) Modular citation of specific parts.
-(5) Provenance extraction.
+(1) Lower storage and archival costs (no duplicate software or data files): a snapshot of a project will typically be far smaller than a megabyte.
+(2) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run; see the sketch after this list).
+(3) Usage in other projects.
+(4) Easy debugging and improvements.
+(5) Modular citation of specific parts.
+(6) Provenance extraction.
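
As a minimal sketch of such a module (all file names here are hypothetical, not Maneage's own), a Make rule declares its inputs and outputs explicitly, so Make skips the step entirely when nothing it depends on has changed:

    # Hypothetical module: 'stats.txt' is rebuilt only when the script
    # or the input data change; otherwise this step is never re-run.
    stats.txt: analyze.py sample.dat
            python3 analyze.py sample.dat > stats.txt

Because every module states its inputs and outputs this way, independent rules can also run in parallel (e.g., with 'make -j4').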
\textbf{Criterion 3: Minimal complexity.}
Minimal complexity can be interpreted as:
(1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
-A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is now a dead language and no longer supported)}.
+A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is no longer supported)}.
More stable/basic tools can be used with lower long-term maintenance costs.
(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices.
@@ -265,7 +268,7 @@ This is related to longevity, because if a workflow contains only the steps to d
\textbf{Criterion 8: Free and open source software:}
Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box: we do not need an artificial source of ambiguity \new{wrapped} over it.
-A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined), allows others to learn from, modify, and build upon it.
+A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU) allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations at that level, and it can be modified for future hardware.
In contrast, non-free tools typically cannot be distributed or modified by others, making the project reliant on a single supplier (even when the tool is gratis).
@@ -291,15 +294,16 @@ The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending
It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
-Technically, the hardest criterion to implement was the first (completeness) and, in particular, \new{restricting execution requirements to only a minimal Unix-like operating system}).
+Technically, the hardest criterion to implement was the first (completeness); in particular, \new{restricting execution requirements to only a minimal Unix-like operating system}.
One solution we considered was GNU Guix and Guix Workflow Language (GWL).
However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make.
-Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell, or C).
-Make is standardized in \new{Unix-like operating systems}.
+Make is not an analysis language; it is a job manager.
+Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C).
+Make \new{has been available since PWB/Unix 1.0 (released in 1977), is still used in almost all components of modern Unix-like OSs,} and is standardized in POSIX.
It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software tools have also already had some exposure to it \new{(almost all free software projects are built with Make).}
+Researchers using free software tools have also already had some exposure to it \new{(almost all free software research projects are built with Make).}
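
A minimal sketch of this separation of roles (script and file names hypothetical): Make only orchestrates, each step can be written in a different language, and Make derives the execution order from the stated dependencies:

    # Hypothetical two-step pipeline: the analysis itself is in R and
    # Python; Make only decides when and in what order to run each step.
    clean.dat: clean.R raw.csv
            Rscript clean.R raw.csv > clean.dat
    fit.txt: fit.py clean.dat
            python3 fit.py clean.dat > fit.txt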
Linking the analysis and narrative (criterion 7) was historically our first design element.
To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
@@ -315,7 +319,15 @@ Through the former, manual updates by authors (which are prone to errors and dis
Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
-These are combined at the end to generate precise software \new{acknowledgement} and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
+These are combined at the end to generate the precise software \new{acknowledgement} and citation that are shown in the
+\new{
+ \ifdefined\noappendix
+ \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices}.%
+ \else%
+ appendices (\ref{appendix:software}).%
+ \fi%
+}
+(for other examples, see \cite{akhlaghi19, infante20}).
\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
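
A hedged sketch of this macro linkage (the rule, file, and macro names are hypothetical, not Maneage's actual files): an analysis step writes a LaTeX macro file that is simultaneously a Make target (of the analysis) and a prerequisite (of the paper), so the narrative can never cite a stale number:

    # Hypothetical: the analysis writes a macro definition ...
    tex/macros/fit.tex: fit.txt
            printf '\\newcommand{\\fitslope}{%s}\n' "$$(cat fit.txt)" > $@
    # ... and the paper is only rebuilt once the macro is up to date.
    paper.pdf: paper.tex tex/macros/fit.tex
            pdflatex paper.tex

In the narrative, the author then writes '\fitslope' instead of a hard-coded value, eliminating manual copy-paste errors.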
@@ -409,7 +421,7 @@ Where exact reproducibility is not possible \new{(for example due to paralleliza
Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
Briefly, Git is a version control system, allowing a structured backup of project files.
Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
- The upward arrows at the branch-tops are therefore in the \new{time} direction.
+ The upward arrows at the branch-tops are therefore in the time direction.
}
\end{figure*}
@@ -435,7 +447,7 @@ Maneage is a Git branch that contains the shared components (infrastructure) of
Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis; see Listing \ref{code:branching}).
There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
-The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the \new{acknowledgements}).
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgements).
These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgements)}.
The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
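
A minimal sketch of such a re-import (the branch name 'maneage' and the remote setup are assumptions, following the figure's naming):

    # Hypothetical merge of updated core infrastructure into a project:
    git checkout maneage     # switch to the branch tracking core Maneage
    git pull                 # bring in the latest infrastructure commits
    git checkout master      # return to the project's own branch
    git merge maneage        # merge the updated infrastructure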
@@ -498,18 +510,18 @@ Hence, arguably the most important feature of these criteria (as implemented in
Using mature and time-tested tools for blending version control, the research paper's narrative, software management, \emph{and} a robust data management strategy.
We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
-Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like operating systems} (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
+Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like OSs} (Maneage is actively tested on different GNU/Linux distributions and macOS, and is also being ported to FreeBSD).
This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding or fixing their required software on their own.
They later share their low-level commits on the core branch, thus propagating them to all derived projects.
-\new{We have chosen not to be sharp about the condition of executability on a minimal Unix-like operating system because at the end it should not matter how minimal the operating system is.}
-\new{The operating system should possess enough components to be able to install low-level software (e.g., core GNU tools).}
-Well written software internally corrects for differences in OS or hardware that may affect its functionality (through tools like the GNU portability library).
+\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.}
+However, \new{our focus is on reproducing results (the output of the software), not the software itself.}
+Well-written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
However, glibc is not installable on some \new{Unix-like} OSs (e.g., macOS), and all programs link with the C library.
This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
-With everything else under precise control in Maneage, the effect of differing Kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
+With everything else under precise control in Maneage, the effect of differing hardware, kernel, and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
Using continuous integration (CI) is one way to precisely identify breaking points on multiple systems.}
% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
@@ -1664,6 +1676,18 @@ Hence for complex data analysis operations which involve thousands of steps, it i
+%% Mention all used software in an appendix.
+\section{Software acknowledgement}
+\label{appendix:software}
+\input{tex/build/macros/dependencies.tex}
+
+
+
+
+
+
+
+