From 6941df07ec6c50ec4addfdb59050cde9cc046a32 Mon Sep 17 00:00:00 2001
From: Boud Roukema
Date: Sun, 6 Dec 2020 04:21:08 +0100
Subject: Copyediting, based on the not contraction

This commit provides a little bit of minor copyediting, mainly in the
appendices, based on and around changing the casual 'isn't', 'don't' and
other contractions of 'not' to a less casual style of language.
A few of the changes aim to improve the meaning in tiny ways.
---
 paper.tex | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/paper.tex b/paper.tex
index 14c3410..0a47096 100644
--- a/paper.tex
+++ b/paper.tex
@@ -439,7 +439,7 @@ Furthermore, the configuration files are a prerequisite of the targets that use
If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, are not stored together in a single file and don't require a unique editor.
+\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, are not stored together in a single file, and do not require a unique editor.
To satisfy the modularity criterion, the analysis steps and narrative are run in their own files (in different languages, thus maximally benefiting from the unique features of each) and the files can be viewed or manipulated with any text editor that the authors prefer.
The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.}

@@ -517,7 +517,7 @@ This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
They later share their low-level commits on the core branch, thus propagating them to all derived projects.
-\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition doesn't guarantee bitwise reproducibility of the software, even when built on the same hardware.}.
+\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.}
However, \new{our focus is on reproducing results (output of software), not the software itself.}
Well-written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.

@@ -737,7 +737,7 @@ Below we'll review some of the most common container solutions: Docker and Singu
Generally, VMs or containers are good solutions to reproducibly run/repeat an analysis in the short term (a couple of years).
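For instance, such a short-term container-based re-run might look like the following sketch (the image name, tag, and paths here are hypothetical, not taken from any project discussed in this paper):

    # Fetch a pre-built analysis environment (a binary image) and re-run
    # the pipeline inside it; the host's ./project directory is made
    # available as /project inside the container.
    docker pull example/analysis-env:2020.12
    docker run --rm -v "$PWD/project":/project \
        example/analysis-env:2020.12 make -C /project

This only works for as long as the registry still serves that image and the image still runs on current hosts.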
However, their focus is to store the already-built (binary, non-human-readable) software environment.
Because of this, they will be large (many gigabytes) and expensive to archive, download, or access.
-Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}. But this is also valid for Docker images, as is clear from Dockerhub's recent decision to delete images of free accounts that haven't been used for more than 6 months.
+Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}. But this is also valid for Docker images, as is clear from Docker Hub's recent decision to delete images of free accounts that have not been used for more than 6 months.
Meng \& Thain \citeappendix{meng17} also give similar reasons why Docker images were not suitable in their trials.

On a more fundamental level, VMs or containers do not store \emph{how} the core environment was built.
The example of \cite{mesnard20} was previously mentioned in Section \ref{criteria}.
Another useful example is the \href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{\inlinecode{Dockerfile}} of \citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
When we tried to build it (November 2020), the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
In principle, it is possible to investigate the difference between this new image and the old one that the authors used, but that would require a lot of effort and may not be possible where the changes are not available in a third public repository or not under version control.
-In Docker, it is possible to retrieve the precise Docker image with its digest for example \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we haven't seen this often done in existing examples of ``reproducible'' \inlinecode{Dockerfiles}.
+In Docker, it is possible to retrieve the precise Docker image with its digest, for example \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we have not often seen this done in existing examples of ``reproducible'' \inlinecode{Dockerfiles}.
The ``digest'' is specific to Docker repositories.
A more generic/long-term approach to ensure identical core OS components at a later time is to construct the containers or VMs with fixed/archived versions of the operating system ISO files.

@@ -854,7 +854,7 @@ However, it is not possible to fix the versions of the dependencies through the
This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
In that discussion, the authors of \citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citeappendix{gruning18}.
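To make this fragility concrete, here is a minimal, purely illustrative sketch of re-creating such an environment from the shell (the environment name, package pins, and file contents are hypothetical, not those of \citeappendix{uhse19}):

    # Contents of a loosely pinned environment.yml (shown as comments):
    #   name: paper-env
    #   dependencies:
    #     - python=3.6   # major.minor pinned; patch level still floats
    #     - numpy        # completely unpinned: any available version
    conda env create -f environment.yml

Re-running \inlinecode{conda env create} months later can resolve to different builds, or fail to resolve at all; even fully pinned versions only help while the binary packages remain available on the hosting channel.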
-However, as described in Appendix \ref{appendix:independentenvironment} containers just hide the reproducibility problem, they do not fix it: containers aren't static and need to evolve (i.e., re-built) with the project.
+However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., be rebuilt) with the project.
Given these limitations, \citeappendix{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.

Conda installs with a shell script that contains a binary blob (over 500 megabytes, embedded in the shell script).

@@ -993,10 +993,10 @@ While its not impossible, because of the high-level nature of scripts, it is not
\subsubsection{Make}
\label{appendix:make}
Make was originally designed to address the problems mentioned above for scripts \citeappendix{feldman79}.
-In particular in the context of managing the compilation of software programs that involve many source code files.
-With Make, the various source files of a program that hadn't been changed, wouldn't be recompiled.
-Also, when two source files didn't depend on each other, and both needed to be rebuilt, they could be built in parallel.
-This greatly helped in debugging of software projects, and speeding up test builds, giving Make a core place in software development in the last 40 years.
+In particular, it was designed for managing the compilation of software programs that involve many source code files.
+With Make, the source files of a program that have not been changed are not recompiled.
+Moreover, when two source files do not depend on each other, and both need to be rebuilt, they can be built in parallel.
+This was found to greatly help in debugging software projects, and in speeding up test builds, giving Make a core place in software development over the last 40 years.
The most common implementation of Make, since the early 1990s, is GNU Make.
Make was also the framework used in the first attempts at reproducible scientific papers \citeappendix{claerbout1992,schwab2000}.

@@ -1123,7 +1123,7 @@ Integration of directional graph features (dependencies between the cells) into
The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file is another major hurdle:
the files can easily become very large (in volume/bytes) and hard to read.
-Both are critical for scientific processing, especially the latter: when a web-browser with proper JavaScript features isn't available (can happen in a few years).
+Both are critical for scientific processing, especially the latter: when a web browser with proper JavaScript features is not available (which can happen within a few years).
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.

Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies!

@@ -1158,16 +1158,16 @@ The most prominent example is the transition from Python 2 (released in 2000) to
Python 3 was incompatible with Python 2 and it was decided to abandon the former by 2015.
However, due to community pressure, this was delayed to January 1st, 2020.
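As a minimal illustration of the incompatibility, the same one-line program behaves differently, or fails outright, under the two major versions (assuming both interpreters are installed; the exact error wording varies between releases):

    python2 -c 'print 10/4'     # Python 2: prints 2 (integer division)
    python3 -c 'print 10/4'     # Python 3: SyntaxError (print is a function)
    python3 -c 'print(10/4)'    # Python 3: prints 2.5 (true division)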
The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated; for example, see \citeappendix{jenness17} or Appendix \ref{appendix:sciunit}.
-Some projects couldn't make this investment and their developers decided to stop maintaining it, for example VisTrails (see Appendix \ref{appendix:vistrails}).
+Some projects could not make this investment and their developers decided to stop maintaining them, for example VisTrails (see Appendix \ref{appendix:vistrails}).

-The problems weren't just limited to translation.
+The problems were not just limited to translation.
Python 2 was still being actively used during the transition period (and is still used by some, even after its end-of-life).
Therefore, developers of packages used by others had to maintain (for example, fix bugs in) both versions in one package.
-This isn't particular to Python, a similar evolution occurred in Perl: in 2000 it was decided to improve Perl 5, but the proposed Perl 6 was incompatible with it.
+This is not particular to Python; a similar evolution occurred in Perl: in 2000 it was decided to improve Perl 5, but the proposed Perl 6 was incompatible with it.
However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).

-It is unreasonably optimistic to assume that high-level languages won't undergo similar incompatible evolutions in the (not too distant) future.
-For software developers, this isn't a problem at all: non-scientific software, and the general population's usage of them, has a similarly fast evolvution.
+It is unreasonably optimistic to assume that high-level languages will not undergo similar incompatible evolutions in the (not too distant) future.
+For software developers, this is not a problem at all: non-scientific software, and the general population's usage of it, has a similarly fast evolution.
Hence, it is rarely (if ever) necessary to look into codes that are more than a couple of years old.
However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed in such high-level languages.

@@ -1177,7 +1177,7 @@ Beyond technical, low-level, problems for the developers mentioned above, this c
\subsubsection{Dependency hell}
The evolution of high-level languages is extremely fast, even within one version.
For example, packages that are written in Python 3 often only work with a specific range of Python 3 versions (for example, newer than Python 3.6).
-This isn't just limited to the core language, much faster changes occur in their higher-level libraries.
+This is not just limited to the core language; much faster changes occur in their higher-level libraries.
For example, version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citeappendix{hinsen15}.

On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.

@@ -1193,7 +1193,7 @@ For example, merely installing the Python installer (\inlinecode{pip}) on a Debi
\inlinecode{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
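One rough, hypothetical way to gauge the size of such a dependency tree is to let \inlinecode{pip} download a package together with everything it pulls in, and count the results (the directory name is arbitrary; the count depends on the package, Python version, and platform at the time of running):

    # Download 'popper' and all of its dependencies into one directory,
    # then count the downloaded files.
    pip3 download popper -d /tmp/popper-deps
    ls /tmp/popper-deps | wc -l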
As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each required 17 and 26 Python modules as dependencies, respectively.
It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
-This problem actually occurred while we were testing Sciunit: even though it installed, it couldn't run because of conflicts (its last commit was only 1.5 years old), for more see Appendix \ref{appendix:sciunit}.
+This problem actually occurred while we were testing Sciunit: even though it installed, it could not run because of conflicts (its last commit was only 1.5 years old); for more, see Appendix \ref{appendix:sciunit}.
\citeappendix{hinsen15} also report a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python; see Appendix \ref{appendix:packagemanagement}).

@@ -1300,7 +1300,7 @@ A workflow is defined as a directed graph, where nodes are called ``processors''
Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage like Figure \ref{fig:datalineage}).
-Taverna is only a workflow manager and isn't integrated with a package manager, hence the versions of the used software can be different in different runs.
+Taverna is only a workflow manager and is not integrated with a package manager; hence, the versions of the software used can differ between runs.
\citeappendix{zhao12} have studied the problem of workflow decay in Taverna.

@@ -1314,7 +1314,7 @@ Madagascar is a continuation of the Reproducible Electronic Documents (RED) proj
Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to \citeappendix{fomel13}.

Madagascar does include project management tools in the form of SCons extensions.
-However, it isn't just a reproducible project management tool.
+However, it is not just a reproducible project management tool.
It is primarily a collection of analysis programs, tools to interact with RSF files, and plotting facilities.
The Regularly Sampled File (RSF) format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text format that points to the location of the actual data files on the filesystem and acts as the intermediary between Madagascar's analysis programs.
For example, in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).

@@ -1377,7 +1377,7 @@ Since XML is a plane text format, as the user inspects the data and makes change
.
However, even though XML is in plain text, it is very hard to edit manually.
VisTrails therefore provides a graphical user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:datalineage}).
-Besides the fact that it is no longer maintained, VisTrails didn't control the software that is run, it only controls the sequence of steps that they are run in.
+Besides the fact that it is no longer maintained, VisTrails does not control the software that is run; it only controls the sequence of steps in which the software is run.

@@ -1529,7 +1529,7 @@ This enables researchers to use a fixed version of a project's figures in the pa
The actual code that Sumatra wraps around must itself be under version control, and it does not run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
Therefore, Sumatra makes no attempt at storing the environment of the analysis as is done in Sciunit (see Appendix \ref{appendix:sciunit}), but only information about it.
-Sumatra thus needs to know the language of the running program and isn't generic.
+Sumatra thus needs to know the language of the running program and is not generic.
It just captures information about the environment; it does not store \emph{how} that environment was built.

@@ -1567,7 +1567,7 @@ This is a major problem for scientific projects: in principle (not knowing how t
Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of an analysis.
The user specifies the necessary operating system, packages, and analysis steps in various JSON files.
Umbrella will then study the host operating system and the various necessary inputs (including data and software), through a process similar to that of Sciunit mentioned above, to find the best environment isolator (possibly Linux containerization, containers, or VMs).
-We couldn't find a URL to the source software of Umbrella (no source code repository has been mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).
+We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).

@@ -1627,7 +1627,7 @@ Moreover, from October 2019 Github changed to a custom YAML-based languguage, so
This is an important issue when low-level choices are based on service providers.

To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
-However, as of this writing, the scaffold isn't complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
+However, as of this writing, the scaffold is not complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
By default, Popper runs in a Docker image (so root permissions are necessary, and the reproducibility issues with Docker images discussed above apply), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
--
cgit v1.2.1