path: root/tex/src/appendix-existing-tools.tex
author: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-04-09 01:08:31 +0100
committer: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-04-09 02:00:18 +0100
commit: a63900bc5a83052081e6ca6bcc0a2bb4ee5a860e
tree: 15c7dcdff040b4c60110547de71d08ff0f5fadd0 /tex/src/appendix-existing-tools.tex
parent: 55d6570aecc5f442399262b7faa441d16ccd4556
Comments by Konrad Hinsen implemented

Konrad had kindly gone through the paper and the appendices with very good feedback that is now being addressed in the paper (thanks a lot Konrad!):

 - IPOL recently also allows Python code. So the respective parts of the description of IPOL have been updated. To address the dependency issue, I also added a sentence that only certain dependencies (with certain versions) are acceptable.

 - On Active Papers (AP: which is written by Konrad) corrections were made based on the following parts of his comments:

    - "The fundamental issue with ActivePapers is its platform dependence on either Java or Python, neither of which is attractive."

    - "The one point which is overemphasized, in my opinion, is the necessity to download large data files if some analysis script refers to it. That is true in the current implementation (which I consider a research prototype), but not a fundamental feature of the approach. Implementing an on-demand download strategy is not particularly complicated, it just needs to be done, and it wasn't a priority for my own use cases."

    - "A historical anecdote: you mention that HDF View requires registering for download. This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5. What happened is that the HDF Group, formerly part of NCSA and thus a public research infrastructure, was turned into a semi-commercial entity. They have committed to keeping the core HDF5 library Open Source, but not any of the tooling around it. Many users have moved away from HDF5 as a consequence. The larger lesson is that Richard Stallman was right: if software isn't GPLed, then you never know what will happen to it in the future."

 - On Guix, some further clarification was added to address Konrad's quote below (with a link to the blog-post mentioned there). In short, I clarified that I mean storing the Guix commit hash with any respective high-level analysis change is the extra step.

    - "I also looked at the discussion of Nix and Guix, which is what I am mainly using today. It is mostly correct as well, the one exception being the claim that 'it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future'. The environment is *recorded* in all detail, automatically. What requires some effort is extracting a human-readable description of that environment. For Guix, I have described how to do this in a blog post (https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix/), and in less detail in a recent CiSE paper (https://hal.archives-ouvertes.fr/hal-02877319). There should definitely be a better user interface for this, but it's no more than a user interface issue. What is pretty nice in Guix by now is the user interface for re-creating an environment, using the "guix time-machine" subcommand."

 - The sentence on Software Heritage being based on Git was reworded to fit this comment of Konrad: "The plural sounds quite optimistic. As far as I know, SWH is the only archive of its kind, and in view of the enormous resources and long-time commitments it requires, I don't expect to see a second one."

 - When introducing hashes, Konrad suggested the following useful paper that shows how they are used in content-based storage: DOI:10.1109/MCSE.2019.2949441

 - On Snakemake, Konrad had the following comment: "[A system call in Python is] No slower than from bash, or even from any C code. Meaning no slower than Make. It's the creation of a new process that takes most of the time." So the point was just shifted to the many quotations necessary for calling external programs and how it is best suited for a Python-based project.

In addition some minor typos that I found during the process are also fixed.
Diffstat (limited to 'tex/src/appendix-existing-tools.tex')
-rw-r--r-- tex/src/appendix-existing-tools.tex 25
1 files changed, 14 insertions, 11 deletions
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index f23e2d1..0c9a1c2 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -196,12 +196,14 @@ This is necessary ``to use the Linux kernel container facilities that allow it t
This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries and this is necessary for the security or development perspectives.
However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bit-wise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary or for a different CPU architecture).
-Finally, while Guix and Nix do allow precisely reproducible environments, it requires extra effort on the user's side to ensure that the built environment is reproducible later.
-For example, simply running \inlinecode{guix install gcc} (the most common way to install a new software) will install the most recent version of GCC, that can be different at different times.
-Hence, similar to the discussion in host operating system package managers, it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future.
-It is not a complex operation, but like the Docker digest codes mentioned in Appendix \ref{appendix:containers}, many will probably not know, forget or ignore it.
-Generally, this is an issue with projects that rely on detached (third party) package managers for building their software, including the other tools mentioned below.
-We solved this problem in Maneage by including the package manager and analysis steps into one project: it is simply not possible to forget to record the exact versions of the software used.
+Finally, while Guix and Nix do allow precisely reproducible environments, the inherent detachment from the high-level computational project (that uses the environment) requires extra effort to keep track of the changes in dependencies as the project evolves.
+For example, if users simply run \inlinecode{guix install gcc} (the most common way to install a new software) the most recent version of GCC will be installed.
+But this will be different at different dates on a different system with no record of previous runs.
+It is therefore up to the user to store the used Guix commit in their high level computation and ensure ``Reproducing a reproducible computation''\footnote{A guide/tutorial on storing the Guix environment:\\\inlinecode{\url{https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix}}}.
+Similar to the Docker digest codes mentioned in Appendix \ref{appendix:containers}, many may not know about, forget, or ignore it.
+
+Generally, this is a common issue with relying on detached (third party) package managers for building a high-level computational project's software (including other tools mentioned below).
+We solved this problem in Maneage by including the low-level package manager and highlevel computation into a single project with a single version controlled history: it is simply not possible to forget to record the exact versions of the software used (or how they change as the project evolves).
\subsubsection{Conda/Anaconda}
\label{appendix:conda}
@@ -210,7 +212,7 @@ Conda is able to maintain an approximately independent environment on an operati
Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
However, it is not possible to fix the versions of the dependencies through the YAML files alone.
-This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
+This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\inlinecode{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}}.
In that discussion, the authors of \citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citeappendix{gruning18}.
However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., get re-built) with the project.
@@ -300,9 +302,8 @@ In this way, later processing stages can make sure that they can safely be used,
Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
There are many existing version control solutions, for example, CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
-However, currently, Git is by far the most commonly used in individual projects.
+However, currently, Git is by far the most commonly used in individual projects, such that Software Heritage \citeappendix{dicosmo18} (an archival system aiming for long term preservation of software) is also modeled on Git.
Git is also the foundation upon which this paper's proof of concept (Maneage) is built.
-Archival systems aiming for long term preservation of software like Software Heritage \citeappendix{dicosmo18} are also modeled on Git.
Hence we will just review Git here, but the general concept of version control is the same in all implementations.
\subsubsection{Git}
@@ -315,7 +316,8 @@ the figure on Git in the main body of the paper).
Figure \ref{fig:branching}).
\fi
For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
-Commits are is commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
+Through the content-based storage concept, similar hash structures can be used to identify data \citeappendix{hinsen20}.
+Git commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
The team can host the Git history on a web page and collaborate through that.
@@ -391,7 +393,8 @@ Going deeper into the syntax of Make is beyond the scope of this paper, but we r
Snakemake is a Python-based workflow management system, inspired by GNU Make (discussed above).
It is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
It defines its own language to implement the ``rule'' concept of Make within Python.
-Technically, calling command-line programs within Python is very slow, and using complex shell scripts in each step will involve a lot of quotations that make the code hard to read.
+Technically, using complex shell scripts (to call software in other languages) in each step will involve a lot of quotations that make the code hard to read and maintain.
+It is therefore most useful for Python-based projects.
Currently, Snakemake requires Python 3.5 (released in September 2015) and above, while Snakemake was originally introduced in 2012.
Hence it is not clear if older Snakemake source files can be executed today.