author: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-01-03 23:44:48 +0000
committer: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-01-03 23:45:27 +0000
commit: 87b510bc2b4ba56a537e701ffb1de0c74c6942e2
tree: 63712be0840b0132fc6c4d957da6420e8ee30818 /paper.tex
parent: 2ddfb42f4655633c5bdac6b32c6fb1f52181b031
Spell check on main body and appendices
I ran a simple Emacs spell check over the main body and the two appendices. All discovered typos have been fixed.
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 38
1 file changed, 15 insertions, 23 deletions
diff --git a/paper.tex b/paper.tex
index 1de260f..f902b7c 100644
--- a/paper.tex
+++ b/paper.tex
@@ -87,7 +87,7 @@
\emph{Appendices} ---
Two comprehensive appendices that review the longevity of existing solutions; available
\ifdefined\separatesupplement
-as supplementary ``Web extras'' on the journal webpage.
+as supplementary ``Web extras'' on the journal web page.
\else
after main body of paper (Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}).
\fi
@@ -162,21 +162,13 @@ We will focus on Docker here because it is currently the most common.
\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.}
-Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20}
-\ifdefined\separatesupplement
-\new{(more examples in the \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices})}%
-\else%
-\new{(more examples: see the appendices (\ref{appendix:existingtools}))}%
-\fi%
-{ }imports `\inlinecode{FROM ubuntu:16.04}'.
+Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} imports `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
Hence, if the image is built in different months, it will contain different OS components.
-% CentOS announcement: https://blog.centos.org/2020/12/future-is-centos-stream
-In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not aborted earlier, like CentOS 8 which will be terminated 8 years early).}
+In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early).}
Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
-%% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
-\new{Because of this, DockerHub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
+\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel: older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).}
Once the host OS is ready, PMs are used to install the software or environment.
@@ -203,7 +195,7 @@ It is important to remember that the longevity of a project is determined by its
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
By default, cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language.
-There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
+There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these.
However, since they are not part of the core, their longevity can be assumed to be shorter.
Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
@@ -222,7 +214,7 @@ We argue and propose that workflows satisfying the following criteria can not on
\textbf{Criterion 1: Completeness.}
A project that is complete (self-contained) has the following properties.
(1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
-Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longetivity}.}
+Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longevity}.}
(2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
(3) No impact on the host OS libraries, programs, and \new{environment variables}.
(4) Does not require root privileges to run (during development or post-publication).
@@ -277,7 +269,7 @@ They are reliant on a single supplier (even without payments) \new{and prone to
A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
-\new{Propietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
+\new{Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
In such cases, it is best to immediately convert the data to free formats upon collection, and archive (e.g., on Zenodo) or use the data in free formats.}
@@ -324,21 +316,21 @@ Through the former, manual updates by authors (which are prone to errors and dis
Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation.
-These are combined at the end to generate precise software \new{acknowledgement} and citation that is shown in the
+These are combined at the end to generate precise software \new{acknowledgment} and citation that is shown in the
\new{
\ifdefined\separatesupplement
- \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices}.%
+ \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices},%
\else%
- appendices (\ref{appendix:software}).%
+ appendices (\ref{appendix:software}),%
\fi%
}
-(for other examples, see \cite{akhlaghi19, infante20})
+for other examples, see \cite{akhlaghi19, infante20}.
\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell,\new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
-Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's webpage.}
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.}
On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
@@ -482,7 +474,7 @@ $ git add -u && git commit # Commit changes.
The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
-Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
Finally, the complete project source is usually $\sim100$ kilo-bytes.
It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
@@ -525,10 +517,10 @@ This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
They later share their low-level commits on the core branch, thus propagating it to all derived projects.
-\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.}.
+\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.}.
However \new{our focus is on reproducing results (output of software), not the software itself.}
Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
-On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain.
+On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
However, glibc is not install-able on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
With everything else under precise control in Maneage, the effect of differing hardware, Kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).