author     Mohammad Akhlaghi <mohammad@akhlaghi.org>  2020-05-23 23:59:12 +0100
committer  Mohammad Akhlaghi <mohammad@akhlaghi.org>  2020-05-23 23:59:12 +0100
commit     03c96cd2952ce6b6e04025f570214a814019317c (patch)
tree       d21d78e41b00ce7c29ed956ae379bac322eae3e1 /paper.tex
parent     b167e2363da31fce9a38c51a52db9c32586d4811 (diff)
Some minor edits on Boud's recent corrections
Generally they were great, but after looking through them I thought a handful of them slightly changed my original idea, so I am correcting them here. Boud, if you feel the changes aren't good, let's talk about it and find the best way forward ;-). They are mostly clear from a '--word-diff'; just some notes on the ones that have changed the meaning:

* On the "a clerk can do it" quotation: since it's so short, I think it's better to keep its original form; otherwise a reader may think there were paragraphs instead of the "[to]" and that we have changed the authors' intention.

* In the part where we say that the workflow can get "separated" from the paper, I mostly meant to highlight that the data centers and journals (the hosts) may diverge over the decades, or one of them may go bankrupt, etc., hence losing the connection. The issue of the workflow evolving can in theory be addressed through version control, so I think this is a more fundamental problem.

* In the part about free software, in the list, the original point was about the free software used by the project, not the project itself (after all, the project itself falls under the "Open Science" title that is very fashionable these days, but my point here is aimed at those people who claim to do "Open Science" with closed software, like Microsoft Excel!).
Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  24
1 file changed, 12 insertions, 12 deletions
diff --git a/paper.tex b/paper.tex
index 18e2fce..b4c949a 100644
--- a/paper.tex
+++ b/paper.tex
@@ -111,7 +111,7 @@ Moreover, once the binary format is obsolete, reading or parsing the project bec
The cost of staying up to date within this rapidly evolving landscape is high.
Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations.
Decades later, scientists are still held accountable for their results.
-The evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons which are too low-level to be published in a traditional scientific paper.
+Hence, the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons which are too low-level to be published in a traditional scientific paper.
As a solution to this problem, here we introduce a set of criteria that can guarantee the longevity of a project, based on our experience with existing solutions.
@@ -120,17 +120,17 @@ As a solution to this problem, here we introduce a set of criteria that can guar
\section{Commonly used tools and their longevity}
To highlight the necessity of longevity, some of the most commonly used tools are reviewed here, from the perspective of long-term usability.
-While longevity is important in some fields of science and industry, this isn't always the case: fast-evolving tools can be appropriate in short-term commercial projects.
+While longevity is important in science and some fields of industry, this is not always the case; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
Most existing reproducible workflows use a common set of third-party tools that can be categorized as:
-(1) environment isolators -- virtual machines or containers;
+(1) environment isolators -- virtual machines (VMs) or containers;
(2) PMs -- Conda, Nix, or Spack;
(3) job management -- shell scripts, Make, SCons, or CGAT-core;
(4) notebooks -- such as Jupyter.
-To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but discontinued in 2019).
-However, containers (in particular, Docker, and to a lesser degree, Singularity) are by far the most widely used solution today, so we will focus on Docker here.
+To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but discontinued in 2019).
+However, containers (in particular, Docker, and to a lesser degree, Singularity) are by far the most widely used solution today; we will thus focus on Docker here.
-Ideally, it is possible to precisely identify the images that are imported into a Docker container by a version number or tag.
+Ideally, it is possible to precisely identify the images that are imported into a Docker container by their checksums.
But this is rarely practiced in the solutions that we have studied.
Usually, images are imported with generic operating system names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the most recent five images.
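As a minimal sketch of the checksum-based alternative (the digest below is a placeholder, not the hash of any real image), an image can be pulled by its immutable content digest instead of a mutable tag:

    # Pull an image by its immutable content digest rather than a mutable
    # tag; the digest value here is a placeholder for illustration only.
    docker pull ubuntu@sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    # The equivalent Dockerfile line would be:
    #   FROM ubuntu@sha256:<same-digest>

Unlike a tag such as `\inlinecode{ubuntu:16.04}', a digest identifies exactly one image build, so a later rebuild either fetches the identical image or fails loudly (provided the registry still hosts it).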
@@ -157,7 +157,7 @@ Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, a
However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
The longevity of a project is determined by its shortest-lived dependency.
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
-Notebooks can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
+Hence they can rarely deliver their promised potential\cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used operating system) with example images/data that can also be executed on their webpage.
@@ -172,7 +172,7 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}.
Researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
-Archiving the workflow of a project in a repository after the project is finished is, on its own, insufficient, and often either practically infeasible or unscalable.
+Simply archiving a project's workflow in a repository after the project is finished is, on its own, insufficient, and maintenance by repository staff is often either practically infeasible or unscalable.
In this paper we argue that workflows satisfying the criteria below can improve the researchers' workflow during the project and reduce the cost of curation for repositories after publication, while maximizing the FAIRness of the deliverables for future researchers.
\textbf{Criterion 1: Completeness.}
@@ -186,7 +186,7 @@ It is a reliable foundation for longevity in software execution.
(5) It builds its own controlled software for an independent environment.
(6) It can run locally (without an internet connection).
(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures, e.g., PDF or HTML.
-(8) It an run automatically, with no human interaction.
+(8) It can run automatically, with no human interaction.
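A minimal sketch of such a single, automatic entry point, in shell, may help clarify; all file and target names here are hypothetical, not the actual implementation of any project:

    #!/bin/sh
    # Hypothetical top-level script: one command takes the project from
    # raw inputs to the final report, with no human interaction.
    set -e              # abort on the first failure
    ./configure         # build the project's own software environment
    make inputs         # obtain the raw input data
    make analysis       # run the full analysis and visualization
    make paper.pdf      # produce the final published report with figures

Executing the script once, on a machine with no prior setup, should then satisfy the points above without any manual intervention.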
\textbf{Criterion 2: Modularity.}
A modular project enables and encourages the analysis to be broken into independent modules with well-defined inputs/outputs and minimal side effects.
@@ -210,7 +210,7 @@ On a small scale, the criteria here are trivial to implement, but as the project
\textbf{Criterion 5: Verifiable inputs and outputs.}
The project should verify its inputs (software source code and data) \emph{and} outputs.
-Expert knowledge should not be required to confirm a reproduction; it should be possible for ``\emph{a clerk [to] do it}''\cite{claerbout1992}.
+Expert knowledge should not be required to confirm a reproduction, such that ``\emph{a clerk can do it}''\cite{claerbout1992}.
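As a minimal sketch of such automatic verification in shell (the checksum-list file names are hypothetical):

    # Check every input and output against recorded SHA-256 checksums;
    # any mismatch makes the command exit with an error, so confirming a
    # reproduction needs no expert judgment.
    sha256sum --check inputs.sha256
    sha256sum --check outputs.sha256

Each list file simply pairs an expected checksum with a file name, so the check itself is mechanical.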
\textbf{Criterion 6: History and temporal provenance.}
No exploratory research project is done in a single/first attempt.
@@ -223,13 +223,13 @@ The ``history'' is thus as valuable as the final/published version.
A project is not just its computational analysis.
A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
A narrative description is also part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, the interpretation of the results, and the context in relation to other projects/papers.
-This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may evolve to become separated from its accompanying published paper.
+This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may in time become separated from its accompanying published paper: the two are stored by different hosts (e.g., a data center and a journal) that may diverge or disappear over the decades.
\textbf{Criterion 8: Free and open source software:}
Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box).
This criterion is necessary to complement that definition (nature is already a black box).
If a project is free software (as formally defined), then others can learn from, modify, and build on it.
-When the software is free:
+When the software used by the project is also free:
(1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level.
(2) The source can be modified to work on future hardware.
In contrast, a non-free software package typically cannot be distributed by others, making it reliant on a single server (even when no payment is required).