author    Mohammad Akhlaghi <mohammad@akhlaghi.org>  2020-12-27 19:04:45 +0000
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>  2020-12-27 19:49:46 +0000
commit    5a2a4c31f8ee252b778e2fea2da00de8906792c3 (patch)
tree      fedbbb849e38ba916746669ca7d5f4dc457debaf /paper.tex
parent    afc7c57b3b8240aa84e7682272bf528615530ba2 (diff)
Edits to snapshot size argument, minor edits here and there
Following Boud's point in the previous commit, I tried to clarify in the text that we are only talking about hand-written source files: in short, in this part of the paper, we are not talking about the version/snapshot for arXiv, which needs figures and many extra automatically built files. We are just talking about the raw, hand-written files. Trying to convince people how good it is to keep the raw files separate from automatically generated files ;-).

Also, while looking around in other parts of the main body of the paper, I tried to edit/clarify a few points and summarize/shorten others.
Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  47
1 file changed, 25 insertions(+), 22 deletions(-)
diff --git a/paper.tex b/paper.tex
index 8c74578..8f11870 100644
--- a/paper.tex
+++ b/paper.tex
@@ -230,26 +230,27 @@ Fewer explicit execution requirements would mean higher \emph{execution possibil
\textbf{Criterion 2: Modularity.}
A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
-\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., built binaries), software source code (maintained separately), or data (archived separately) should be included.
-The latter two (developing low-level software or collecting data) are separate projects in themselves and can be used in other high-level projects.}
-Explicit communication between various modules enables optimizations on many levels:
-(1) Storage and archival cost (no duplicate software or data files): a snapshot of a project should be less than a megabyte.
-(2) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run).
-(3) Usage in other projects.
-(4) Easy debugging and improvements.
-(5) Modular citation of specific parts.
-(6) Provenance extraction.
+\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., software binaries or figures), software source code, or data should be included.
+The latter two (developing low-level software, collecting data, or the publishing and archival of both) are separate projects in themselves because they can be used in other independent projects.
+This optimizes the storage, archival/mirroring, and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\times100$ kilobytes, and the version-controlled history may become a few megabytes.}
+
+In terms of the analysis workflow, explicit communication between various modules enables optimizations on many levels:
+(1) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run).
+(2) Usage in other projects.
+(3) Debugging and adding improvements (possibly by future researchers).
+(4) Citation of specific parts.
+(5) Provenance extraction.
\textbf{Criterion 3: Minimal complexity.}
Minimal complexity can be interpreted as:
(1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is no longer supported)}.
More stable/basic tools can be used with lower long-term maintenance costs.
-(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices.
+(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to maintain in the long term.
\textbf{Criterion 4: Scalability.}
A scalable project can easily be used in arbitrarily large and/or complex projects.
-On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable.
+On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable (see the IPOL example above).
\textbf{Criterion 5: Verifiable inputs and outputs.}
The project should automatically verify its inputs (software source code and data) \emph{and} outputs, not needing any expert knowledge.
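As a minimal sketch of what such automatic verification can look like in practice (the URL, file name, and checksum placeholder below are hypothetical and not taken from Maneage), a job-manager rule can refuse to proceed when an input's checksum does not match the recorded value:

# Hypothetical sketch of input verification (recipe lines must be
# indented with a tab; replace the placeholder with the real SHA256).
CATALOG_SHA256 = 0000000000000000000000000000000000000000000000000000000000000000

input/catalog.fits:
	mkdir -p input
	wget -O $@ https://example.org/catalog.fits
	echo "$(CATALOG_SHA256)  $@" | sha256sum --check -

If the downloaded file's checksum differs, sha256sum exits with an error and the rest of the analysis never runs on a corrupted or altered input.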
@@ -261,20 +262,20 @@ It is natural that earlier phases of a project are redesigned/optimized only aft
Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
The derivation ``history'' of a result is thus not any less valuable than the result itself.
-\textbf{Criterion 7: Including narrative, linked to analysis.}
+\textbf{Criterion 7: Including narrative that is linked to analysis.}
A project is not just its computational analysis.
A raw plot, figure, or table is hardly meaningful alone, even when accompanied by the code that generated it.
A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, the interpretations of the result, and the context in relation to other projects/papers.
This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
\textbf{Criterion 8: Free and open source software:}
-Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box, we do not need an artificial source of ambiguity \new{wrapped} over it.
+Non-free or non-open-source software typically cannot be distributed, inspected, or modified by others.
+Such software is reliant on a single supplier (even without payments) \new{and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}}.
A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU) allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level, and it can be modified for future hardware.
-In contrast, non-free tools typically cannot be distributed or modified by others, making it reliant on a single supplier (even without payments)\new{, and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}}.
-\new{It may happen that proprietary software is necessary to convert proprietary data formats produced by special hardware (for example micro-arrays in genetics) into free data formats.
-In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example, on Zenodo).}
+\new{It may happen that proprietary software is necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
+In such cases, it is best to immediately convert the data to free formats upon collection, and to archive (e.g., on Zenodo) or use the data only in those free formats.}
@@ -302,9 +303,9 @@ Inspired by GWL+Guix, a single job management tool was implemented for both inst
Make is not an analysis language; it is a job manager.
Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C).
-Make \new{has been available since PWB/Unix 1.0 (released in 1977), it is still used in almost all components of modern Unix-like OSs} and is standardized in POSIX.
-It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software tools have also already had some exposure to it \new{(almost all free software research projects are built with Make).}
+Make \new{has been available since 1977; it is still heavily used in almost all components of modern Unix-like OSs} and is standardized in POSIX.
+It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
+Researchers using free software have also already had some exposure to it \new{(most free research software is built with Make).}
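For readers unfamiliar with Make, the hypothetical rules below (script and file names are illustrative, not Maneage's own) show the idea: each rule lists its output, its inputs, and the command, in any language, that builds the output; a rule is re-run only when an input is newer than its output, and independent rules can run in parallel with 'make -j'.

# Hypothetical sketch: Make only orchestrates; the analysis itself is
# done by programs in any language (here a Python and an R script).
out/filtered.csv: src/demo-filter.py input/catalog.fits
	mkdir -p out
	python3 src/demo-filter.py input/catalog.fits > $@

out/figure.pdf: src/demo-plot.R out/filtered.csv
	Rscript src/demo-plot.R out/filtered.csv $@

If only demo-plot.R changes, only the figure is rebuilt; filtered.csv is left untouched because none of its prerequisites changed.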
Linking the analysis and narrative (criterion 7) was historically our first design element.
To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
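A concrete (and again hypothetical) sketch of such a linkage, continuing the rules above rather than reproducing Maneage's exact mechanism: the analysis writes its final numbers into a LaTeX macro file, and the narrative uses the macros instead of hard-coded values, so the text cannot drift from the computation that produced it.

# Hypothetical sketch: write an analysis result as a LaTeX macro.
out/macros.tex: out/filtered.csv
	n=$$(awk 'END {print NR-1}' out/filtered.csv); \
	printf '\\newcommand{\\numrows}{%s}\n' "$$n" > $@

The paper's LaTeX source would then contain \input{out/macros.tex} and use \numrows in the text; the macro file is rebuilt automatically whenever the table it summarizes changes.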
@@ -341,10 +342,12 @@ Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org
\new{Finally, some software cannot be built on some CPU architectures; hence, by default, the architecture is automatically included in the final built paper (see below).}
\new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
-However, this is only necessary once for every computer, the analysis phase (which usually takes months to write for a normal project) will use the same environment later.
+However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the already-built environment.
+Hence, the few hours of initial software building are negligible compared to a project's life span.
To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has instructions on building in Docker.
-Through Docker (or VMs), users on Microsoft Windows can benefit from Maneage, and for Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.}
+Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage.
+For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.}
The analysis phase of the project, however, is naturally different from one project to another at a low level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
@@ -375,7 +378,7 @@ Figure \ref{fig:datalineage} (right) is the data lineage graph that produced it
The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
-This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
+This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) file is placed directly over \inlinecode{top-make.mk}.
A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
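As a rough sketch of this pattern (the subMakefile names and directory below are hypothetical; the real entry point is the one shown in Listing \ref{code:topmake}), the top-level Makefile does nothing but state the order of the analysis phases and include them:

# Hypothetical sketch of a top-level entry point: no analysis here,
# only the ordered list of subMakefiles that hold the actual rules.
makesrc = initialize \
          download \
          format-inputs \
          analysis \
          paper

include $(foreach s, $(makesrc), subMakefiles/$(s).mk)

Reading these few lines is enough to see the project's high-level structure, while each phase's details stay in its own subMakefile.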