Diffstat (limited to 'tex/src/appendix-existing-tools.tex')
-rw-r--r-- tex/src/appendix-existing-tools.tex 27
1 files changed, 13 insertions, 14 deletions
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index 43a0ef9..99a4284 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -129,7 +129,7 @@ Because it is highly intertwined with the way software is built and installed, t
Maneage (the solution proposed in this paper) also follows a similar approach of building and installing its own software environment within the host's file system, but without depending on it beyond the kernel.
However, unlike the third-party package managers mentioned above, Maneage'd software management is not detached from the specific research/analysis project: the instructions to build the full isolated software environment are maintained with the high-level analysis steps of the project, and the narrative paper/report of the project.
-This is fundamental to achieve the Completeness criteria.
+This is fundamental to achieve the completeness criterion.
@@ -191,7 +191,7 @@ That hash is then prefixed to the software's installation directory.
As an example from Dolstra et al.\citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that is compiled with it (and thus needs it to run) will link to this unique address.
This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
-As mentioned in Court{\'e}s \& Wurmus\citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges (failing our completeness criteria).
+As mentioned in Court{\'e}s \& Wurmus\citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges (failing our completeness criterion).
This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
This is because the focus in Nix or Guix is to create bitwise reproducible software binaries and this is necessary for the security or development perspectives.
However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bitwise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary or for a different CPU architecture).
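The hash-prefixed installation directory described in this hunk can be sketched as follows. This is a minimal illustration only: the function name `store_path`, the choice of SHA-256, the 12-character truncation and the build inputs are all assumptions made for the example, and this is not the hashing scheme that Nix or Guix actually use.

```python
import hashlib


def store_path(name, version, build_inputs, store="/nix/store"):
    """Return a hash-prefixed install directory (illustrative only)."""
    # Hash every input that influences the build; sort for a stable order.
    joined = "\n".join(sorted(build_inputs))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    # Truncate to 12 hex characters to mimic the length of the example hash.
    return f"{store}/{digest[:12]}-{name}-{version}"


# Two hypothetical builds of the same version with different inputs.
a = store_path("glibc", "2.3.2", ["gcc-3.3", "CFLAGS=-O2"])
b = store_path("glibc", "2.3.2", ["gcc-3.3", "CFLAGS=-O3"])
print(a)
print(b)
```

Because the two builds differ in one input, they get different directories and can co-exist, which is the property the text attributes to this design.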
@@ -278,7 +278,7 @@ In conclusion for all package managers, there are two common issues regarding ge
This is another consequence of the detachment of the package manager from the project doing the analysis.
\end{itemize}
-Addressing these issues has been the basic reason behind the proposed solution: based on the completeness criteria, instructions to download and build the packages are included within the actual science project, and no special/new syntax/language is used.
+Addressing these issues has been the basic reason behind the proposed solution: based on the completeness criterion, instructions to download and build the packages are included within the actual science project, and no special/new syntax/language is used.
Software download, build and installation are done with the same language/syntax that researchers use to manage their research: using the shell (by default GNU Bash in Maneage) for low-level steps and Make (by default, GNU Make in Maneage) for job management.
@@ -541,7 +541,7 @@ To solve this problem there are advanced text editors like GNU Emacs that allow
However, editors that can execute or debug the source (like GNU Emacs), just run external programs for these jobs (for example GNU GCC, or GNU GDB), just as if those programs were called from outside the editor.
With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
-This is a very important feature and corresponds to the modularity criteria of this paper.
+This is a very important feature and corresponds to the modularity criterion of this paper.
This type of modularity is not commonly present for other solutions mentioned below (the source can only be edited/run in a specific browser).
Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that do not provide a graphic user interface.
@@ -573,7 +573,7 @@ However, Jupyter currently only supports a linear run of the cells: always from
It is possible to manually execute only one cell, but the previous/next cells that may depend on it, also have to be manually run (a common source of human error, and frustration for complex operations).
Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\inlinecode{\url{https://github.com/jupyter/notebook/issues/1175}}}).
-The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file, is another major hurdle and against the modularity criteria proposed here.
+The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file, is another major hurdle and against the modularity criterion proposed here.
The files can easily become very large (in volume/bytes) and hard to read when the Jupyter web-interface is not accessible.
Both are critical for scientific processing, especially the latter: when a web browser with proper JavaScript features is not available (can happen in a few years).
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into a much less memory-efficient textual encoding.
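The size cost of storing binary data as text in JSON can be estimated with a short sketch. It assumes base64 as the textual encoding and uses stand-in bytes rather than a real image; the `"image/png"` key is used here only as an illustrative label.

```python
import base64
import json

# Stand-in for binary image data: 3072 arbitrary bytes (not a real image).
raw = bytes(range(256)) * 12

# JSON cannot hold raw bytes, so they are first encoded as ASCII text.
encoded = base64.b64encode(raw).decode("ascii")
cell_output = json.dumps({"image/png": encoded})

# base64 maps every 3 input bytes to 4 output characters.
ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), round(ratio, 3))
```

Under these assumptions the encoded text is about one third larger than the binary input, before counting the surrounding JSON syntax.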
@@ -606,7 +606,7 @@ In this context, it is more focused on the latter.
Because of their nature, higher-level languages evolve very fast, creating incompatibilities on the way.
The most prominent example is the transition from Python 2 (released in 2000) to Python 3 (released in 2008).
Python 3 was incompatible with Python 2 and it was decided to abandon the latter by 2015.
-However, due to community pressure, this was delayed to January 1st, 2020.
+However, due to community pressure, this was delayed to 1 January 2020.
The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated, for example, see Jenness\citeappendix{jenness17} or Appendix \ref{appendix:sciunit}.
Some projects could not make this investment and their developers decided to stop maintaining them, for example VisTrails (see Appendix \ref{appendix:vistrails}).
@@ -617,7 +617,7 @@ This is not particular to Python, a similar evolution occurred in Perl: in 2000
However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).
It is unreasonably optimistic to assume that high-level languages will not undergo similar incompatible evolutions in the (not too distant) future.
-For industial software developers, this is not a major problem: non-scientific software, and the general population's usage of them, has a similarly fast evolution and shelf-life.
+For industrial software developers, this is not a major problem: non-scientific software, and the general population's usage of them, has a similarly fast evolution and shelf-life.
Hence, it is rarely (if ever) necessary to look into industrial/business codes that are more than a couple of years old.
However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions.
@@ -627,10 +627,10 @@ Beyond technical, low-level, problems for the developers mentioned above, this c
\subsubsection{Dependency hell}
The evolution of high-level languages is extremely fast, even within one version.
-For example, packages that are written in Python 3 often only work with a special interval of Python 3 versions.
-For example Snakemake and Occam which can only be run on Python versions 3.4 and 3.5 or newer respectively, see Appendices \ref{appendix:snakemake} and \ref{appendix:occam}.
-This is not just limited to the core language, much faster changes occur in their higher-level libraries.
-For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users\citeappendix{hinsen15}.
+For example, packages that are written in Python 3 often only work with a specific interval of Python 3 versions.
+For example, Snakemake and Occam can only be run on Python versions 3.4 and 3.5 or newer respectively (see Appendices \ref{appendix:snakemake} and \ref{appendix:occam}).
+This is not just limited to the core language; much faster changes occur in their higher-level libraries.
+For example, version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users\citeappendix{hinsen15}.
On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
For example, see Figure 1 of Alliez et al.\cite{alliez19}.
@@ -640,10 +640,9 @@ Acceptable version intervals between the dependencies will cause incompatibiliti
Since a domain scientist does not always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
This greatly increases the complexity of the project, even for the principal author.
A well-designed reproducible workflow like Maneage that has no dependencies beyond a C compiler in a Unix-like operating system can account for this.
-However, when the actual workflow system (not the analysis software) is written in a high-level language like the examples above.
+However, when the actual workflow system (not the analysis software) is written in a high-level language like the examples above, this will cause problems.
-Another relevant example of the dependency hell is mentioned here:
-merely installing the Python installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages), required 32 other packages as dependencies.
+Another relevant example of the dependency hell is the following: installing the Python installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages) required 32 other packages as dependencies.
\inlinecode{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each required 17 and 26 Python modules as dependencies, respectively.
It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
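How a single conflict makes a whole dependency set unusable can be sketched by intersecting the interpreter version intervals that each package accepts. The package names and intervals below are hypothetical and chosen only for illustration; real resolvers such as the one in \inlinecode{pip} handle far richer constraints.

```python
def common_interval(requirements):
    """Intersect closed version intervals given as ((major, minor), (major, minor))."""
    # The shared interval starts at the largest lower bound...
    low = max(lower for lower, _ in requirements.values())
    # ...and ends at the smallest upper bound.
    high = min(upper for _, upper in requirements.values())
    # An empty intersection means no single interpreter satisfies everything.
    return (low, high) if low <= high else None


# Hypothetical packages and their accepted Python versions.
compatible = {"tool_a": ((3, 4), (3, 8)), "tool_b": ((3, 5), (3, 9))}
conflicting = {"tool_a": ((3, 4), (3, 6)), "legacy_tool": ((2, 6), (2, 7))}

print(common_interval(compatible))
print(common_interval(conflicting))
```

In the second case there is no common interval, which corresponds to the situation where a researcher has to maintain separate environments and pass data between them.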