author Raul Infante-Sainz <infantesainz@gmail.com> 2020-11-23 12:00:22 +0000
committer Raul Infante-Sainz <infantesainz@gmail.com> 2020-11-23 12:00:22 +0000
commit 49a6067514d48da65e5fcc8089d171d07c186311
tree 2b423dbb4d3700c02de7c3dede4a0a8de76534da /paper.tex
parent d382f1b610e05096b45055826b8f823b6ca796c3
Minor corrections to the final paper document
With this commit, I make several minor changes to the text of the final paper. None of them is substantive: they are small edits such as avoiding contractions (don't -> do not, and so on).
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 37
1 files changed, 17 insertions, 20 deletions
diff --git a/paper.tex b/paper.tex
index b28e2af..16658a4 100644
--- a/paper.tex
+++ b/paper.tex
@@ -60,8 +60,7 @@
Completeness (no dependency beyond POSIX, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
They have been tested in several research publications in various fields.
%% RESULTS
- As a proof of concept, ``Maneage'' is introduced for storing projects in machine-actionable and human-readable
- plain text, enabling cheap archiving, provenance extraction, and peer verification.
+ As a proof of concept, ``Maneage'' is introduced for storing projects in machine-actionable and human-readable plain text, enabling cheap archiving, provenance extraction, and peer verification.
%% CONCLUSION
We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders.
@@ -118,8 +117,7 @@ starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mos
However, these technologies develop fast, e.g., code written in Python 2 \new{(which is no longer officially maintained)} often cannot run with Python 3.
The cost of staying up to date within this rapidly-evolving landscape is high.
Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools because it determines their results and interpretations.
-Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape
-creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
+Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
@@ -128,9 +126,9 @@ creates generational gaps in the scientific community, preventing previous gener
\section{Longevity of existing tools}
\label{sec:longevityofexisting}
\new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
- Longevity is defined as the time that a project can be usable.
- Usability is defined by context: for machines (machine-actionable, or executable files) \emph{and} humans (readability of the source).
- Because many usage contexts don't involve execution; for example checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or source of the input data (extracting these from the outputs of execution is not always possible).}
+Longevity is defined as the time that a project can be usable.
+Usability is defined by context: for machines (machine-actionable, or executable files) \emph{and} humans (readability of the source).
+This matters because many usage contexts do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of the software used, or the source of the input data (extracting these from the outputs of execution is not always possible).}
Longevity is as important in science as in some fields of industry, but not all; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
To highlight the necessity, a short review of commonly-used tools is provided below:
@@ -145,7 +143,7 @@ However, containers (in particular, Docker, and to a lesser degree, Singularity)
We will thus focus on Docker here.
\new{It is theoretically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
- However, that is rarely practiced.}
+However, that is rarely practiced.}
Usually images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}' \new{(more examples in the appendices)}.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
Hence, if the image is built in different months, its output image will contain different OS components.
@@ -166,7 +164,7 @@ Furthermore, because third-party PMs introduce their own language, framework, an
With the software environment built, job management is the next component of a workflow.
Visual/GUI workflow tools like Apache Taverna, GenePattern (deprecated), Kepler or VisTrails (deprecated), which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management.
-\new{However, a GUI environment is tailored to specific applications and is hard to genralize, while being hard to reproduce once the required Java Virtual Machines (JVM) is depreciated.
+\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
Their data formats are also complex (designed for computers to read) and hard to read by humans without the GUI.}
The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project.
Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
@@ -178,9 +176,9 @@ However, because of their complex dependency trees, their build is vulnerable to
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting which cell to run.
-The default cells don't include dependencies (so some cells run only after certain others are re-done), parallel execution, or usage of more than one language.
-There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
-However, since they aren't part of the core and have their own dependencies, their longevity can be assumed to be shorter.
+The default cells do not include dependencies (so some cells run only after certain others are re-done), parallel execution, or usage of more than one language.
+There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
+However, since they are not part of the core and have their own dependencies, their longevity can be assumed to be shorter.
Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
@@ -249,13 +247,13 @@ A narrative description is also a deliverable (defined as ``data article'' in \c
This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
\textbf{Criterion 8: Free and open source software:}
-Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box, we don't need an artificial source of ambiguity wraped over it.
+Reproducibility is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary: nature is already a black box, and we do not need an artificial source of ambiguity wrapped around it.
A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined), allows others to learn from, modify, and build upon it.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
In contrast, non-free tools typically cannot be distributed or modified by others, making the project reliant on a single supplier (even without payments).
\new{It may happen that proprietary software is necessary to convert proprietary data formats produced by special hardware (for example micro-arrays in genetics) into free data formats.
- In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example on Zenodo).}
+In such cases, it is best to immediately convert the data upon collection, and archive the data in free formats (for example on Zenodo).}
@@ -283,12 +281,12 @@ Inspired by GWL+Guix, a single job management tool was implemented for both inst
Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell, or C).
Make is standardized in POSIX and is used in almost all core OS components.
It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software tools have also already had some exposure to it \new{(almost all free software projects are built with Make)}
+Researchers using free software tools have also already had some exposure to it \new{(almost all free software projects are built with Make).}
Linking the analysis and narrative (criterion 7) was historically our first design element.
To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
-For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
+For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value `\inlinecode{0.25}' when the PDF output is built.
Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
@@ -301,7 +299,7 @@ Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
\new{Furthermore, the machine related specifications of the running system (including hardware name and byte-order) are also collected and cited.
- These can help in \emph{root cause analysis} of observed differences/issues in the execution of the wokflow on different machines.}
+These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and of course, the high-level science software.
\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
@@ -411,8 +409,8 @@ If changed, Make will \emph{only} re-execute the dependent recipe and all its de
This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
\new{To summarize, in contrast to notebooks like Jupyter, in a ``Maneage''d project the analysis scripts and configuration parameters are not blended into the running code (and all stored in one file).
- Based on the modularity criteria, the analysis steps are run in their own files (for their own respective language, thus maximally benefiting from its unique features) and the narrative has its own file(s).
- The analysis communicates with the narrative through intermediate files (the \LaTeX{} macros), enabling much better blending of analysis outputs in the narrative sentences than is possible with the high-level notebooks and enabling direct provenance tracking.}
+Based on the modularity criteria, the analysis steps are run in their own files (for their own respective language, thus maximally benefiting from its unique features) and the narrative has its own file(s).
+The analysis communicates with the narrative through intermediate files (the \LaTeX{} macros), enabling much better blending of analysis outputs in the narrative sentences than is possible with the high-level notebooks and enabling direct provenance tracking.}
To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
@@ -573,7 +571,6 @@ Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European
Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
-The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project ``Using machine-learning techniques to drag galaxies from the noise in deep imaging''.
The ``A next-generation worldwide quantum sensor network with optical atomic clocks'' project of the TEAM IV programme of the
Foundation for Polish Science co-financed by the EU under ERDF.
The Polish MNiSW grant DIR/WK/2018/12.