author     Mohammad Akhlaghi <mohammad@akhlaghi.org>    2021-04-09 01:08:31 +0100
committer  Mohammad Akhlaghi <mohammad@akhlaghi.org>    2021-04-09 02:00:18 +0100
commit     a63900bc5a83052081e6ca6bcc0a2bb4ee5a860e (patch)
tree       15c7dcdff040b4c60110547de71d08ff0f5fadd0 /tex
parent     55d6570aecc5f442399262b7faa441d16ccd4556 (diff)
Comments by Konrad Hinsen implemented
Konrad had kindly gone through the paper and the appendices with very good feedback that is now being addressed in the paper (thanks a lot Konrad!):

- IPOL recently also allows Python code. So the respective parts of the description of IPOL have been updated. To address the dependency issue, I also added a sentence that only certain dependencies (with certain versions) are acceptable.

- On Active Papers (AP: which is written by Konrad) corrections were made based on the following parts of his comments:
  - "The fundamental issue with ActivePapers is its platform dependence on either Java or Python, neither of which is attractive."
  - "The one point which is overemphasized, in my opinion, is the necessity to download large data files if some analysis script refers to it. That is true in the current implementation (which I consider a research prototype), but not a fundamental feature of the approach. Implementing an on-demand download strategy is not particularly complicated, it just needs to be done, and it wasn't a priority for my own use cases."
  - "A historical anecdote: you mention that HDF View requires registering for download. This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5. What happened is that the HDF Group, formerly part of NCSA and thus a public research infrastructure, was turned into a semi-commercial entity. They have committed to keeping the core HDF5 library Open Source, but not any of the tooling around it. Many users have moved away from HDF5 as a consequence. The larger lesson is that Richard Stallman was right: if software isn't GPLed, then you never know what will happen to it in the future."

- On Guix, some further clarification was added to address Konrad's quote below (with a link to the blog-post mentioned there). In short, I clarified that I mean storing the Guix commit hash with any respective high-level analysis change is the extra step.
  - "I also looked at the discussion of Nix and Guix, which is what I am mainly using today. It is mostly correct as well, the one exception being the claim that 'it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future'. The environment is *recorded* in all detail, automatically. What requires some effort is extracting a human-readable description of that environment. For Guix, I have described how to do this in a blog post (https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix/), and in less detail in a recent CiSE paper (https://hal.archives-ouvertes.fr/hal-02877319). There should definitely be a better user interface for this, but it's no more than a user interface issue. What is pretty nice in Guix by now is the user interface for re-creating an environment, using the "guix time-machine" subcommand."

- The sentence on Software Heritage being based on Git was reworded to fit this comment of Konrad: "The plural sounds quite optimistic. As far as I know, SWH is the only archive of its kind, and in view of the enormous resources and long-time commitments it requires, I don't expect to see a second one."

- When introducing hashes, Konrad suggested the following useful paper that shows how they are used in content-based storage: DOI:10.1109/MCSE.2019.2949441

- On Snakemake, Konrad had the following comment: "[A system call in Python is] No slower than from bash, or even from any C code. Meaning no slower than Make. It's the creation of a new process that takes most of the time."
  So the point was just shifted to the many quotations necessary for calling external programs, and to Snakemake being best suited for a Python-based project.

In addition, some minor typos that I found during the process are also fixed.
Diffstat (limited to 'tex')
-rw-r--r--  tex/src/appendix-existing-solutions.tex  26
-rw-r--r--  tex/src/appendix-existing-tools.tex      25
-rw-r--r--  tex/src/references.tex                   18
3 files changed, 44 insertions(+), 25 deletions(-)
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 56056be..5166703 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -229,18 +229,17 @@ For example the very large cost of maintaining such a system, being based on a g
The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed).
-The author's own implementation of the algorithm is also published with the paper (in C, C++, or MATLAB), the code must be commented well enough and link each part of it with the relevant part of the paper.
-The authors must also submit several example of datasets that show the applicability of their proposed algorithm.
+The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB/Octave, and recently Python); the code can only have a very limited set of external dependencies (with pre-defined versions), must be commented well enough, and must link each part of it with the relevant part of the paper.
+The authors must also submit several example datasets that show the applicability of their proposed algorithm.
The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web-interface and even provide their own input.
IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019.
We encourage the reader to visit its web page and see some of its recent papers and their demos.
The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
-In fact, high-level languages like Perl, Python, or Java are not acceptable in IPOL precisely because of the additional complexities, such as the dependencies that they require.
However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so while it is modular (a single module, doing a very specific thing) this solution is not scalable.
-Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails criteria 6 and 7.
+Furthermore, by not publishing/archiving each paper's version-controlled history or directly linking the analysis and the produced paper, it fails criteria 6 and 7.
Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF.
A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
@@ -265,29 +264,28 @@ It then uses selection and rejection algorithms to find the best components usin
Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format).
It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine \citeappendix{hinsen11}.
However, Java is not a commonly used platform today, hence it was later implemented in Python \citeappendix{hinsen15}.
+Dependence on high-level platforms (Java or Python) is therefore a fundamental issue.
In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file.
%However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.
As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
-Downloading the pre-built ``HDF View'' binaries (a GUI browser of HDF5 files that is provided by the HDF group) is not possible anonymously/automatically (login is required).
-Installing it using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
-Furthermore, as a high-level data format HDF5 evolves very fast, for example HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team. % maybe replace with: February 29\textsuperscript{th}, 2020?
+Downloading the pre-built ``HDF View'' binaries (a GUI browser of HDF5 files that is provided by the HDF group) is not possible anonymously/automatically: as of January 2021 login is required\footnote{\inlinecode{\url{https://www.hdfgroup.org/downloads/hdfview}}} (this was not the case when Active Papers moved to HDF5).
+% From K. Hinsen in a private email to M. Akhlaghi: This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5.
+Installing HDF View using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
+Furthermore, like most high-level tools, the HDF5 library evolves very fast: on its webpage (from April 2021), it says ``Applications that were created with earlier HDF5 releases may not compile with 1.12 by default''.
While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently.
The hand-written code of a large project involving terabytes of data can be 100 kilobytes.
-When the two are bundled together, merely seeing one line of the code, requires downloading Terabytes volume that is not needed, this was also acknowledged in \citeappendix{hinsen15}.
+When the two are bundled together in one remote file, merely seeing one line of the code requires downloading terabytes of data that are not needed; this was also acknowledged in \citeappendix{hinsen15}.
It may also happen that the data are proprietary (for example medical patient data).
In such cases, the data must not be publicly released, but the methods that were applied to them can.
-Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary files.
+
+Furthermore, since all reading and writing are currently done in the HDF5 file, temporary files can easily bloat it to very large sizes.
These files can later be removed as part of the analysis, but this makes the code more complicated and hard to read/maintain.
For example, the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.
-
-In many scenarios, peers just want to inspect the processing by reading the code and checking a very special part of it (one or two lines, just to see the option values to one step for example).
-They do not necessarily need to run it or obtain the output the datasets (which may be published elsewhere).
-Hence the extra volume for data and obscure HDF5 format that needs special tools for reading its plain-text internals is an issue.
-
+This is not a fundamental feature of the approach, but simply how this prototype was implemented; it can be improved in the future.
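To make the bundling concept above concrete, the following is a minimal sketch in Python with the h5py library (not the Active Papers API; the file and dataset names are only examples) of how a small script and a large dataset can live in one HDF5 file, and why inspecting even one line of that script requires fetching the whole file:

    # Minimal sketch with h5py (not the Active Papers API); names are examples.
    import h5py
    import numpy as np

    # Bundle a (potentially huge) dataset and a (tiny) analysis script together.
    with h5py.File("paper.h5", "w") as f:
        f.create_dataset("data/measurements", data=np.random.rand(1000))
        f["code/analysis.py"] = "print('one line of analysis code')"

    # Reading even a single line of the code still means the reader must first
    # obtain 'paper.h5' in its entirety.
    with h5py.File("paper.h5", "r") as f:
        code = f["code/analysis.py"][()]
        print(code.decode() if isinstance(code, bytes) else code)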
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index f23e2d1..0c9a1c2 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -196,12 +196,14 @@ This is necessary ``to use the Linux kernel container facilities that allow it t
This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries and this is necessary for the security or development perspectives.
However, in a non-computer-science analysis (for example natural sciences), the main aim is reproducible \emph{results} that can also be created with the same software version that may not be bit-wise identical (for example when they are installed in other locations, because the installation location is hard-coded in the software binary or for a different CPU architecture).
-Finally, while Guix and Nix do allow precisely reproducible environments, it requires extra effort on the user's side to ensure that the built environment is reproducible later.
-For example, simply running \inlinecode{guix install gcc} (the most common way to install a new software) will install the most recent version of GCC, that can be different at different times.
-Hence, similar to the discussion in host operating system package managers, it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future.
-It is not a complex operation, but like the Docker digest codes mentioned in Appendix \ref{appendix:containers}, many will probably not know, forget or ignore it.
-Generally, this is an issue with projects that rely on detached (third party) package managers for building their software, including the other tools mentioned below.
-We solved this problem in Maneage by including the package manager and analysis steps into one project: it is simply not possible to forget to record the exact versions of the software used.
+Finally, while Guix and Nix do allow precisely reproducible environments, the inherent detachment from the high-level computational project (that uses the environment) requires extra effort to keep track of the changes in dependencies as the project evolves.
+For example, if users simply run \inlinecode{guix install gcc} (the most common way to install new software), the most recent version of GCC will be installed.
+But this version will be different at different dates and on different systems, with no record of previous runs.
+It is therefore up to the user to store the used Guix commit with their high-level computation and thus ensure that they are ``Reproducing a reproducible computation''\footnote{A guide/tutorial on storing the Guix environment:\\\inlinecode{\url{https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix}}}.
+Similar to the Docker digest codes mentioned in Appendix \ref{appendix:containers}, many may not know about, forget, or ignore it.
+
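As a hedged illustration of the record/replay step described above, using subcommands that Guix itself provides (\inlinecode{guix describe} and \inlinecode{guix time-machine}; the file name \inlinecode{channels.scm} and the package \inlinecode{gcc-toolchain} are only examples), the recording and later re-creation could look roughly like this:

    # Record the exact Guix revision (and channels) used by the project, and
    # keep this file under the project's own version control:
    guix describe --format=channels > channels.scm
    # Later, re-create an environment from exactly that recorded revision:
    guix time-machine --channels=channels.scm -- environment --ad-hoc gcc-toolchain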
+Generally, this is a common issue with relying on detached (third party) package managers for building a high-level computational project's software (including other tools mentioned below).
+We solved this problem in Maneage by including the low-level package manager and high-level computation into a single project with a single version-controlled history: it is simply not possible to forget to record the exact versions of the software used (or how they change as the project evolves).
\subsubsection{Conda/Anaconda}
\label{appendix:conda}
@@ -210,7 +212,7 @@ Conda is able to maintain an approximately independent environment on an operati
Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
However, it is not possible to fix the versions of the dependencies through the YAML files alone.
-This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
+This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\inlinecode{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}}.
In that discussion, the authors of \citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citeappendix{gruning18}.
However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., get re-built) with the project.
@@ -300,9 +302,8 @@ In this way, later processing stages can make sure that they can safely be used,
Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
There are many existing version control solutions, for example, CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
-However, currently, Git is by far the most commonly used in individual projects.
+However, currently, Git is by far the most commonly used in individual projects, such that Software Heritage \citeappendix{dicosmo18} (an archival system aiming for long-term preservation of software) is also modeled on Git.
Git is also the foundation upon which this paper's proof of concept (Maneage) is built.
-Archival systems aiming for long term preservation of software like Software Heritage \citeappendix{dicosmo18} are also modeled on Git.
Hence we will just review Git here, but the general concept of version control is the same in all implementations.
\subsubsection{Git}
@@ -315,7 +316,8 @@ the figure on Git in the main body of the paper).
Figure \ref{fig:branching}).
\fi
For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
-Commits are is commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
+Through the content-based storage concept, similar hash structures can be used to identify data \citeappendix{hinsen20}.
+Git commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
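As an aside, the content-based naming mentioned above can be sketched in a few lines of Python (an illustration only, not Git's or Maneage's own code): Git names a file's content by hashing a short header plus the bytes, so the identifier depends only on the content itself, and the same principle can be used to identify data.

    # Illustrative sketch of content-addressable naming (not Git or Maneage code).
    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git's blob identifier is the SHA-1 of "blob <size>\0" + content,
        # so identical bytes always receive the identical name.
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    print(git_blob_id(b"hello\n"))  # matches `git hash-object` for this content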
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
The team can host the Git history on a web page and collaborate through that.
@@ -391,7 +393,8 @@ Going deeper into the syntax of Make is beyond the scope of this paper, but we r
Snakemake is a Python-based workflow management system, inspired by GNU Make (discussed above).
It is aimed at reproducible and scalable data analysis \citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
It defines its own language to implement the ``rule'' concept of Make within Python.
-Technically, calling command-line programs within Python is very slow, and using complex shell scripts in each step will involve a lot of quotations that make the code hard to read.
+Technically, using complex shell scripts (to call software in other languages) in each step involves a lot of quotation marks that make the code hard to read and maintain.
+It is therefore most useful for Python-based projects.
Currently, Snakemake requires Python 3.5 (released in September 2015) and above, while Snakemake was originally introduced in 2012.
Hence it is not clear if older Snakemake source files can be executed today.
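For readers unfamiliar with Snakemake, a minimal rule in its Python-based syntax is sketched below (the rule and file names are hypothetical); it shows the Make-like rule concept and the quoting needed when the \inlinecode{shell} directive calls an external, non-Python program:

    # Hypothetical Snakemake rule (Python-based syntax); file names are examples.
    # Literal AWK braces must be doubled because {input} and {output} are
    # Snakemake substitution fields; note also the single quotes protecting the
    # AWK program from the shell.
    rule select_bright:
        input:
            "results/catalog.txt"
        output:
            "results/bright.txt"
        shell:
            "awk '$3 < 20 {{print $1, $3}}' {input} > {output}"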
diff --git a/tex/src/references.tex b/tex/src/references.tex
index e2f5c89..6b768ee 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -65,6 +65,10 @@ archivePrefix = {arXiv},
}
+
+
+
+
@ARTICLE{mesnard20,
author = {Olivier Mesnard and Lorena A. Barba},
title = {Reproducible Workflow on a Public Cloud for Computational Fluid Dynamics},
@@ -99,6 +103,20 @@ archivePrefix = {arXiv},
+@ARTICLE{hinsen20,
+ author = {Konrad Hinsen},
+ title = {The Magic of Content-Addressable Storage},
+ year = {2020},
+ journal = {Computing in Science \& Engineering},
+ volume = {22},
+ number = {03},
+ pages = {113-119},
+ doi = {10.1109/MCSE.2019.2949441},
+}
+
+
+
+
@ARTICLE{menke20,
author = {Joe Menke and Martijn Roelandse and Burak Ozyurt and Maryann Martone and Anita Bandrowski},
title = {Rigor and Transparency Index, a new metric of quality for assessing biological and medical science methods},