author    Mohammad Akhlaghi <mohammad@akhlaghi.org>    2020-11-15 15:58:13 +0000
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>    2020-11-15 15:58:13 +0000
commit    51ef2929b404f344745c3a3738de01ade5fb8c4f (patch)
tree      ec694b336285639ac8e2cf2aff445777a6c77153
parent    08516255b1cf366069770026503986f12d59bcc1 (diff)
First edits on the newly added appendices in new form
With the optional appendices added recently to the paper, it was important to go through them and make them fit better into the paper.
-rw-r--r--  paper.tex                               642
-rw-r--r--  reproduce/analysis/make/initialize.mk     5
-rw-r--r--  tex/src/references.tex                   23
3 files changed, 350 insertions, 320 deletions
diff --git a/paper.tex b/paper.tex
index 42fd646..d588e3d 100644
--- a/paper.tex
+++ b/paper.tex
@@ -74,9 +74,13 @@
This paper is itself written with Maneage (project commit \projectversion).
\vspace{3mm}
- \emph{Reproducible supplement} --- All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}},
+ \emph{Reproducible supplement} ---
+ All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}},
Git history of source at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}},
which is also archived on \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{SoftwareHeritage}.
+\ifdefined\noappendix
+ Appendices reviewing existing reproducible solutions available in \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}.
+\fi
\end{abstract}
% Note that keywords are not normally used for peer-review papers.
@@ -173,6 +177,7 @@ However, many data-intensive projects commonly involve dozens of high-level depe
\section{Proposed criteria for longevity}
+\label{criteria}
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}.
In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable.
@@ -415,7 +420,7 @@ $ git add -u && git commit # Commit changes.
\section{Discussion}
-
+\label{discussion}
%% It should provide some insight or 'lessons learned', where 'lessons learned' is jargon for 'informal hypotheses motivated by experience', reworded to make the hypotheses sound more scientific (if it's a 'lesson', then it sounds like knowledge, when in fact it's really only a hypothesis).
%% What is the message we should take from the experience?
%% Are there clear demonstrated design principles that can be reapplied elsewhere?
@@ -604,69 +609,84 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
\section{Survey of existing tools for various phases}
\label{appendix:existingtools}
-Computational workflows are commonly high-level tools which employ various lower-level components to acheive their goal.
-To help in analysing existing reproducible workflow solutions in the next appendix, the most commonly employed lower-level tools are surveyed with a focus on reproducibility and the proposed criteria.
+Data analysis workflows (including those that aim for reproducibility) are commonly high-level frameworks which employ various lower-level components.
+To help in reviewing existing reproducible workflow solutions in light of the proposed criteria in Appendix \ref{appendix:existingsolutions}, we first need to survey the most commonly employed lower-level tools.
\subsection{Independent environment}
\label{appendix:independentenvironment}
-The lowest-level challenge of any reproducible solution is to avoid the differences between systems that are running it.
-For example a differing operating system, different versions of installed components, and etc.
-Any reasonable attempt at providing a reproducible workflow therefore has to star with a way to isolate its running envionment from the host.
-There are three general technologies that are used by workflow solutions: 1) Virtual machines, 2) Containers, 3) Controlled build and environment.
-Below, a short description of each solution is provided.
+The lowest-level challenge of any reproducible solution is to control the differences between various run-time environments to the desired level.
+Examples include different hardware, operating systems, or versions of existing dependencies.
+Therefore any reasonable attempt at providing a reproducible workflow starts with isolating its running environment from the host environment.
+There are three general technologies that are used for this purpose and are reviewed below:
+1) Virtual machines,
+2) Containers,
+3) Independent build in host's file system.
\subsubsection{Virtual machines}
\label{appendix:virtualmachines}
-Virtual machines (VMs) keep a copy of a full operating system that can be run on other operating systems.
-This includes the lowest-level kernel which connects to the hardware.
+Virtual machines (VMs) host a binary copy of a full operating system that can be run on other operating systems.
+This includes the lowest-level operating system component, i.e., the kernel.
VMs thus provide the ultimate control one can have over the run-time environment of an analysis.
-However, the VM's kernel does not talk directly to the hardware that is doing the analysis, it talks to a simulated hardware that is provided by the operating system's kernel.
+However, the VM's kernel does not talk directly to the running hardware that is doing the analysis; it talks to a simulated hardware layer that is provided by the host's kernel.
Therefore, a process that is run inside a virtual machine can be much slower than one that is run on a native kernel.
-VMs are used by cloud providers, enabling them to sell fully independent operating systems on their large servers, to their customers (where the customer can have root access).
-But because of all the overhead, they aren't used often used for reproducing individual processes.
+An advantage of VMs is that they are stored as a single file which can be copied from one computer to another, keeping the full environment within them as long as the format is recognized.
+VMs are used by cloud service providers, enabling fully independent operating systems on their large servers (where the customer can have root access).
\subsubsection{Containers}
-Containers are higher-level constructs that don't have their own kernel, they talk directly with the host operating system kernel, but have their own independent software for everything else.
-Therefore, they have much less overhead in storage, and hardware/CPU access.
-Users often choose an operating system for the container's independent operating system (most commonly GNU/Linux distributions which are free software).
+\label{appendix:containers}
+Containers also host a binary copy of a running environment, but don't have their own kernel.
+Through a thin layer of low-level system libraries, programs running within a container talk directly with the host operating system kernel.
+Otherwise, containers have their own independent software for everything else.
+Therefore, they have much less overhead in hardware/CPU access.
+Like VMs, users often choose an operating system for the container's independent environment (most commonly GNU/Linux distributions, which are free software).
Below we'll review some of the most common container solutions: Docker and Singularity.
\begin{itemize}
\item {\bf\small Docker containers:} Docker is one of the most popular tools today for keeping an independent analysis environment.
- It is primarily driven by the need of software developers: they need to be able to reproduce a bug on the ``cloud'' (which is just a remote VM), where they have root access.
- A Docker container is composed of independent Docker ``images'' that are built with Dockerfiles.
+ It is primarily driven by the needs of software developers for reproducing a previous environment, where they have root access, mostly on the ``cloud'' (which is just a remote VM).
+ A Docker container is composed of independent Docker ``images'' that are built with a \inlinecode{Dockerfile}.
It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build).
To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the C library.
- Another important drawback of Docker for scientific applications is that it runs as a daemon (a program that is always running in the background) with root permissions.
- This is a major security flaw that discourages many high performance computing (HPC) facilities from installing it.
+ An important drawback of Docker for high performance scientific needs is that it runs as a daemon (a program that is always running in the background) with root permissions.
+ This is a major security flaw that discourages many high performance computing (HPC) facilities from providing it.
\item {\bf\small Singularity:} Singularity is a single-image container (unlike Docker which is composed of modular/independent images).
Although it needs root permissions to be installed on the system (once), it doesn't require root permissions every time it is run.
Its main program is also not a daemon, but a normal program that can be stopped.
- These features make it much easier for HPC administrators to install Docker.
- However, the fact that it requires root access for initial install is still a hindrance for a random project: if its not present on the HPC, the project can't be run as a normal user.
-
-\item {\bf\small Virtualenv:} \tonote{Discuss it later.}
+ These features make it much easier for HPC administrators to install than Docker.
+ However, the fact that it requires root access for the initial install is still a hindrance for a random project: if it is not already present on the HPC, the project can't be run as a normal user.
\end{itemize}
-When the installed software within VMs or containers is precisely under control, they are good solutions to reproducibly ``running''/repeating an analysis.
-However, because they store the already-built software environment, they are not good for ``studying'' the analysis (how the environment was built).
-Currently, the most common practice to install software within containers is to use the package manager of the operating system within the image, usually a minimal Debian-based GNU/Linux operating system.
-For example the Dockerfile\footnote{\url{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}} in the reproducible scripts of \citeappendix{clarkso15}, which uses \inlinecode{sudo apt-get install r-cran-rjags -y} to install the R interface to the JAGS Bayesian statistics (rjags).
-However, the operating system package managers aren't static.
-Therefore the versions of the downloaded and used tools within the Docker image will change depending when it was built.
-At the time \citeappendix{clarkso15} was published (June 2015), the \inlinecode{apt} command above would download and install rjags 3-15, but today (January 2020), it will install rjags 4-10.
-Such problems can be corrected with robust/reproducible package managers like Nix or GNU Guix within the docker image (see Appendix \ref{appendix:packagemanagement}), but this is rarely practiced today.
+Generally, VMs or containers are good solutions for reproducibly running/repeating an analysis.
+However, their focus is to store the already-built (binary, non-human readable) software environment.
+Storing \emph{how} the core environment was built is left to the user, in a separate repository (not necessarily inside the container or VM file).
+This is a major problem when considering reproducibility.
+The example of \cite{mesnard20} was previously mentioned in Section \ref{criteria}.
+
+Another useful example is the \href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{\inlinecode{Dockerfile}} of \citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
+When we tried to build it (November 2020), the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
+Theoretically it is possible to investigate the difference between this new image and the old one that the authors used, but that would require a lot of effort and may not be possible where the changes are not in a public repository or not under version control.
+In Docker, it is possible to retrieve the precise Docker image with its digest, for example \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we haven't seen this practiced often in ``reproducible'' \inlinecode{Dockerfiles}.
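+
+For illustration, a minimal \inlinecode{Dockerfile} fragment pinning the core image by its content digest could look like the following sketch (the digest placeholder and installed package are illustrative, not taken from any real project):
+
+\begin{verbatim}
+# Sketch: pin the base image by digest so that a future build pulls
+# the same core layers (replace the placeholder with a real digest).
+FROM ubuntu:16.04@sha256:<full-64-character-digest>
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends r-base
+\end{verbatim}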
-\subsubsection{Package managers}
-\label{appendix:packagemanagersinenv}
-The virtual machine and container solutions mentioned above, install software in standard Unix locations (for example \inlinecode{/usr/bin}), but in their own independent operating systems.
-But if software are built in, and used from, a non-standard, project specific directory, we can have an independent build and run-time environment without needing root access, or the extra layers of the container or VM.
-This leads us to the final method of having an independent environment: a controlled build of the software and its run-time environment.
-Because this is highly intertwined with the way software are installed, we'll describe it in more detail in Section \ref{appendix:packagemanagement} where package managers are reviewed.
+The ``digest'' is specific to Docker repositories.
+A more generic/long-term approach to ensure identical core OS components at a later time is to construct the containers or VMs with fixed/archived versions of the operating system ISO files.
+ISO files are pre-built binary files (with volumes of hundreds of megabytes) that do not contain their build instructions.
+For example the archives of Debian\footnote{\inlinecode{\url{https://cdimage.debian.org/mirror/cdimage/archive/}}} or Ubuntu\footnote{\inlinecode{\url{http://old-releases.ubuntu.com/releases}}} provide older ISO files.
+
+\subsubsection{Independent build in host's file system}
+\label{appendix:independentbuild}
+The virtual machine and container solutions mentioned above have their own independent file system.
+Another approach to having an isolated analysis environment is to use the same file system as the host, but to install the project's software in a non-standard, project-specific directory that doesn't interfere with the host.
+Because the environment in this approach can be built in any custom location on the host, this solution generally doesn't require root permissions or extra low-level layers like containers or VMs.
+However, ``moving'' the built product of such solutions from one computer to another is generally not as trivial as with containers or VMs.
+Examples of such third-party package managers (that are detached from the host OS's package manager) include Nix, GNU Guix, Python's Virtualenv package and Conda, among others.
+Because they are highly intertwined with the way software are built and installed, third-party package managers are described in more detail as part of Section \ref{appendix:packagemanagement}.
+
+Maneage (the solution proposed in this paper) also follows a similar approach of building and installing its own software environment within the host's file system, but without depending on it beyond the kernel.
+However, unlike the third-party package managers mentioned above, and following the completeness criterion, Maneage's package management is not detached from the specific research/analysis project: the instructions to build the full isolated software environment are maintained with the high-level analysis steps of the project and the narrative paper/report of the project.
@@ -675,30 +695,62 @@ Because this is highly intertwined with the way software are installed, we'll de
\subsection{Package management}
\label{appendix:packagemanagement}
-Package management is the process of automating the installation of software.
+Package management is the process of automating the build and installation of a software environment.
A package manager thus contains the following information on each software package that can be run automatically: the URL of the software's tarball, the other software that it possibly depends on, and how to configure and build it.
+Package managers can be tied to specific operating systems at a very low level (like \inlinecode{apt} in Debian-based OSs).
+Alternatively, there are third-party package managers which can be installed on many OSs.
+Both are discussed in more detail below.
-Here some of package management solutions that are used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions} are reviewed\footnote{For a list of existing package managers, please see \url{https://en.wikipedia.org/wiki/List_of_software_package_management_systems}}.
-Note that we are not including package manager that are only limited to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX).
+Package managers are the second component in any workflow that relies on containers or VMs for an independent environment, and the starting point in others that use the host's file system (as discussed above in Section \ref{appendix:independentenvironment}).
+In this section, some common package managers are reviewed, in particular those that are most used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+For a more comprehensive list of existing package managers, see \href{https://en.wikipedia.org/wiki/List_of_software_package_management_systems}{Wikipedia}.
+Note that we are not including package managers that are specific to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX).
\subsubsection{Operating system's package manager}
-The most commonly used package managers are those of the host operating system, for example \inlinecode{apt} or \inlinecode{yum} respectively on Debian-based, or RedHat-based GNU/Linux operating systems (among many others).
+The most commonly used package managers are those of the host operating system, for example \inlinecode{apt} or \inlinecode{yum} on Debian-based or RedHat-based GNU/Linux operating systems respectively, or \inlinecode{pkg} in FreeBSD, among many others.
-These package managers are tightly intertwined with the operating system.
-Therefore they require root access, and arbitrary control (for different projects) of the versions and configuration options of software within them is not trivial/possible: for example a special version of a software that may be necessary for a project, may conflict with an operating system component, or another project.
-Furthermore, in many operating systems it is only possible to have one version of a software at any moment (no including Nix or GNU Guix which can also be independent of the operating system, described below).
-Hence if two projects need different versions of a software, it is not possible to work on them at the same time.
+These package managers are tightly intertwined with the operating system: they also include the building and updating of the core kernel and the C library.
+Because they are part of the OS, they also commonly require root permissions.
+Also, it is usually only possible to have one version/configuration of a software at any moment, and downgrading versions for one project may conflict with other projects, or even cause problems in the OS.
+Hence if two projects need different versions of a software, it is not possible to work on them at the same time in the OS.
-When a full container or virtual machine (see Appendix \ref{appendix:independentenvironment}) is used for each project, it is common for projects to use the containerized operating system's package manager.
+When a container or virtual machine (see Appendix \ref{appendix:independentenvironment}) is used for each project, it is common for projects to use the containerized operating system's package manager.
However, it is important to remember that operating system package managers are not static: software are updated on their servers.
-For example, simply adding \inlinecode{apt install gcc} to a \inlinecode{Dockerfile} will install different versions of GCC based on when the Docker image is created.
-Requesting a special version also doesn't fully address the problem because the package managers also download and install its dependencies.
-Hence a fixed version of the dependencies must also be included.
+Hence, simply running \inlinecode{apt install gcc} will install different versions of the GNU Compiler Collection (GCC) based on the version of the OS and when it is run.
+Requesting a specific version of that software doesn't fully address the problem, because the package manager also downloads and installs its dependencies.
+Hence a fixed version of the dependencies must also be specified.
+
+In robust package managers like Debian's \inlinecode{apt} it is possible to fully control (and later reproduce) the build environment of a high-level software.
+Debian has also archived all its packaged high-level software in its Snapshot\footnote{\inlinecode{\url{https://snapshot.debian.org/}}} service since 2005, which can be used to build a higher-level software environment on an older OS \citeappendix{aissi20}.
+Hence it is indeed theoretically possible to reproduce the software environment only using archived operating systems and their own package managers, but unfortunately we have not seen it practiced in scientific papers/projects.
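+In principle, such an archived build could be requested with something like the following sketch (the snapshot date and version string are only placeholders):
+
+\begin{verbatim}
+# Sketch: point APT at a fixed snapshot date and request an exact
+# package version (date and version below are placeholders).
+echo "deb http://snapshot.debian.org/archive/debian/20200101T000000Z/ \
+      buster main" > /etc/apt/sources.list
+apt-get -o Acquire::Check-Valid-Until=false update
+apt-get install gcc-8=8.3.0-6
+\end{verbatim}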
+
+In summary, the host OS package managers are primarily meant for the operating system components and other very low-level components.
+Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) don't use the host's package manager, but an independent package manager, like the ones discussed below.
-In summary, these package managers are primarily meant for the operating system components.
-Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) don't use the host's package manager, but an independent package manager, like the ones below.
+
+\subsubsection{Nix or GNU Guix}
+\label{appendix:nixguix}
+Nix \citeappendix{dolstra04} and GNU Guix \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
+Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
+GNU Guix is based on the same principles as Nix but implemented differently, so we'll focus the review here on Nix.
+
+The Nix approach to package management is unique in that it allows exact tracking of all dependencies, and allows for multiple versions of a software; for more details see \citeappendix{dolstra04}.
+In summary, a unique hash is created from all the components that go into the building of the package.
+That hash is then prefixed to the software's installation directory.
+As an example from \citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that are compiled with it (and thus need it to run) will link to this unique address.
+This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
+
+As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
+This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
+This is because the focus in Nix or Guix is to create bit-wise reproducible software binaries, which is important from the security or software development perspectives.
+However, in a non-computer-science analysis (for example in the natural sciences), the main aim is reproducible \emph{results}, which can also be achieved with the same software versions even when the binaries are not bitwise identical (for example when they are installed in other locations, because the installation location is hardcoded in the software binary).
+
+Finally, while Guix and Nix do allow precisely reproducible environments, this requires extra effort.
+For example simply running \inlinecode{guix install gcc} will install the most recent version of GCC, which can differ at different times.
+Hence, similar to the discussion of host operating system package managers above, it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future.
+Generally, this is a major limitation of projects that rely on detached package managers for building their software, including the other tools mentioned below.
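+One partial mitigation (sketched below with commands we believe are available in recent Guix versions, and an illustrative package name) is to record the exact Guix commit that was used and replay it later:
+
+\begin{verbatim}
+# Sketch: record the exact Guix state ("channels") used today, and
+# later rebuild the same GCC from that recorded state.
+guix describe -f channels > channels.scm
+guix time-machine -C channels.scm -- install gcc-toolchain
+\end{verbatim}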
\subsubsection{Conda/Anaconda}
\label{appendix:conda}
@@ -732,7 +784,7 @@ However, the resulting environment is not fully independent of the host operatin
\item Many major Conda packaging ``channels'' (for example the core Anaconda channel, or very popular conda-forge channel) don't include the C library, that a package was built with, as a dependency.
They rely on the host operating system's C library.
- C is the core language of most modern operating systems and even higher-level languages like Python or R are written in it, and need it to run.
+ C is the core language of modern operating systems and even higher-level languages like Python or R are written in it, and need it to run.
Therefore if the host operating system's C library is different from the C library that a package was built with, a Conda-packaged program will crash and the project will not be executable.
Theoretically, it is possible to define a new Conda ``channel'' which includes the C library as a dependency of its software packages, but it will take too much time for any individual team to practically implement all their necessary packages, up to their high-level science software.
@@ -747,25 +799,6 @@ As reviewed above, the low-level dependence of Conda on the host operating syste
However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important. % alternative to `archivability`?
-
-\subsubsection{Nix or GNU Guix}
-\label{appendix:nixguix}
-Nix \citeappendix{dolstra04} and GNU Guix \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
-Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
-GNU Guix is based on Nix, so we'll focus the review here on Nix.
-
-The Nix approach to package management is unique in that it allows exact dependency tracking of all the dependencies, and allows for multiple versions of a software, for more details see \citeappendix{dolstra04}.
-In summary, a unique hash is created from all the components that go into the building of the package.
-That hash is then prefixed to the software's installation directory.
-For example \citeappendix[from][]{dolstra04} if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that are compiled with it (and thus need it to run) will link to this unique address.
-This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
-
-As mentioned in \citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
-This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
-
-\tonote{While inspecting the Guix build instructions for some software, I noticed they don't actually mention the version names. This creates a similar issue withe Conda example above (how to regenerate the software with a given hash, given that its dependency versions aren't explicitly mentioned. Ask Ludo' about this.}
-
-
\subsubsection{Spack}
Spack is a package manager that is also influenced by Nix (similar to GNU Guix), see \citeappendix{gamblin15}.
But unlike Nix or GNU Guix, it doesn't aim for full, bit-wise reproducibility and can be built without root access in any generic location.
@@ -776,99 +809,106 @@ Spack is a package manager that is also influenced by Nix (similar to GNU Guix),
Because of such bootstrapping problems (for example how Spack needs Python to build Python and other software), it is generally a good practice to use simpler, lower-level languages/systems for a low-level operation like package management.
-\subsection{Package management conclusion}
-There are two common issues regarding generic package managers that hinders their usage for high-level scientific projects, as listed below:
+In conclusion, there are two common issues regarding generic package managers that hinder their usage for high-level scientific projects, as listed below:
\begin{itemize}
\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivatives) only download the software in a binary (pre-compiled) format.
This allows users to download it very fast and almost instantaneously be able to run it.
However, to provide for this, servers need to keep binary files for each build of the software on different operating systems (for example Conda needs to keep binaries for Windows, macOS and GNU/Linux operating systems).
It is also necessary for them to store binaries for each build, which includes different versions of its dependencies.
- This will take major space on the servers, therefore once the shelf-life of a binary has expired, it will not be easy to reproduce a project that depends on it .
-
- For example Debian's Long Term Support is only valid for 5 years.
- Pre-built binaries of the ``Stable'' branch will only be kept during this period and this branch only gets updated once every two years.
- However, scientific software commonly evolve on much faster rates.
- Therefore scientific projects using Debian often use the ``Testing'' branch which has more up to date features.
- The problem is that binaries on the Testing branch are immediately removed when no other package depends on it, and a newer version is available.
- This is not limited to operating systems, similar problems are also reported in Conda for example, see the discussion of Conda above for one real-world example.
-
+ Maintaining such a large binary library is expensive; therefore once the shelf-life of a binary has expired, it will be removed, causing problems for projects that depend on it.
\item {\bf\small Adding high-level software:} Packaging new software is not trivial and needs a good level of knowledge/experience with that package manager.
-For example each has its own special syntax/standards/languages, with pre-defined variables that must already be known to someone packaging new software.
-However, in many scenarios, the most high-level software of a research project are written and used only by the team that is doing the research, even when they are distributed with free licenses on open repositories.
-Although active package manager members are commonly very supportive in helping to package new software, many teams may not take that extra effort/time.
-They will thus manually install their high-level software in an uncontrolled, or non-standard way, thus jeopardizing the reproducibility of the whole work.
+For example each has its own special syntax/standards/languages, with pre-defined variables that must already be known before someone can package new software for them.
-\item {\bf\small Built for a generic scenario} All the package managers above are built for one full system, that can possibly be run by multiple projects.
- This can result in not fully documenting the process that each package was built (for example the versions of the dependent libraries of a package).
+However, in many research projects, the most high-level analysis software are written by the team that is doing the research, and they are its primary users, even when the software are distributed with free licenses on open repositories.
+Although active members of package manager communities are commonly very supportive in helping to package new software, many teams may not be able to make that extra effort/time investment.
+As a result, they manually install their high-level software in an uncontrolled, or non-standard way, thus jeopardizing the reproducibility of the whole work.
+This is another consequence of detachment of the package manager from the project doing the analysis.
\end{itemize}
-Addressing these issues has been the basic reason d'\^etre of the proposed template's approach to package management strategy: instructions to download and build the packages are included within the actual science project (thus fully customizable) and no special/new syntax/language is used: software download, building and installation is done with the same language/syntax that researchers manage their research: using the shell (GNU Bash) and Make (GNU Make).
+Addressing these issues has been the basic raison d'\^etre of the proposed criteria: based on the completeness criterion, instructions to download and build the packages are included within the actual science project and no special/new syntax/language is used.
+Software download, building and installation are done with the same language/syntax that researchers use to manage their research: the shell (by default GNU Bash in Maneage) and Make (by default GNU Make in Maneage).
\subsection{Version control}
\label{appendix:versioncontrol}
-A scientific project is not written in a day.
-It commonly takes more than a year (for example a PhD project is 3 or 4 years).
+A scientific project is not written in a day; it usually takes more than a year.
During this time, the project evolves significantly from its first starting date and components are added or updated constantly as it approaches completion.
-Added with the complexity of modern projects, is not trivial to manually track this evolution, and its affect of on the final output: files produced in one stage of the project may be used at later stages (where the project has evolved).
+Combined with the complexity of modern computational projects, it is not trivial to manually track this evolution and its effect on the final output: files produced in one stage of the project can mistakenly be used by an evolved analysis environment in later stages.
+
Furthermore, scientific projects do not progress linearly: earlier stages of the analysis are often modified after later stages are written.
-This is a natural consequence of the scientific method; where progress is defined by experimentation and modification of hypotheses (earlier phases).
+This is a natural consequence of the scientific method, where progress is defined by experimentation and modification of hypotheses (results from earlier phases).
-It is thus very important for the integrity of a scientific project that the state/version of its processing is recorded as the project evolves for example better methods are found or more data arrive.
+It is thus very important for the integrity of a scientific project that the state/version of its processing is recorded as the project evolves.
+For example, better methods are found or more data arrive.
Any intermediate dataset that is produced should also be tagged with the version of the project at the time it was created.
In this way, later processing stages can make sure that they can safely be used, i.e., no change has been made in their processing steps.
Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
There are many existing version control solutions, for example CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
-However, currently, Git is by far the most commonly used in individual projects and long term archival systems like Software Heritage \citeappendix{dicosmo18}, it is also the system that is used in the proposed template, so we'll only review it here.
+However, currently, Git is by far the most commonly used in individual projects.
+Git is also the foundation upon which this paper's proof of concept (Maneage) is built.
+Archival systems aiming for the long-term preservation of software, like Software Heritage \citeappendix{dicosmo18}, are also modeled on Git.
+Hence we will just review Git here, but the general concept of version control is the same in all implementations.
\subsubsection{Git}
With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
-The commit involves copying the changed files into the repository and calculating a 40 character checksum/hash that is calculated from the files, an accompanying ``message'' (a narrative description of the project's state), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other).
-For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history that this paper is being written in.
+The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the purpose/goals of the changes), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other, as in Figure \ref{fig:branching}).
+For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
Commits are commonly summarized by the checksum's first few characters, for example \inlinecode{f4953cc}.
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
The team can host the Git history on a webpage and collaborate through that.
-There are several Git hosting services for example \href{http://github.com}{github.com}, \href{http://gitlab.com}{gitlab.com}, or \href{http://bitbucket.org}{bitbucket.org} (among many others).
-
-
+There are several Git hosting services for example \href{http://codeberg.org}{codeberg.org}, \href{http://gitlab.com}{gitlab.com}, \href{http://bitbucket.org}{bitbucket.org} or \href{http://github.com}{github.com} (among many others).
+Storing changes in binary files is also possible in Git; however, it is most useful for human-readable plain-text sources.
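+For example, the basic cycle described above can be run with the following raw Git commands (the file name and message are illustrative):
+
+\begin{verbatim}
+git add paper.tex                     # Stage the changed file.
+git commit -m "Describe the change"   # Record it in the history.
+git log --oneline -1                  # Show the abbreviated hash.
+\end{verbatim}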
\subsection{Job management}
\label{appendix:jobmanagement}
Any analysis will involve more than one logical step.
-For example it is first necessary to download a dataset, then to do some preparations on it, then to actually use it, and finally to make visualizations/tables that can be imported into the final report.
-Each one of these is a logically independent step which needs to be run before/after the others in a specific order.
-There are many tools for managing the sequence of jobs, below we'll review the most common ones that are also used in the proposed template, or the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+For example it is first necessary to download a dataset and do some preparations on it before applying the research software to it, and finally to make visualizations/tables that can be imported into the final report.
+Each one of these is a logically independent step, which needs to be run before/after the others in a specific order.
+
+Hence job management is a critical component of a research project.
+There are many tools for managing the sequence of jobs; below we'll review the most common ones that are also used by the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+
+\subsubsection{Manual operation with narrative}
+\label{appendix:manual}
+The most commonly used workflow system for many researchers is to run the commands manually, experiment with them, and keep the output when they are happy with it.
+As an improvement, some also keep a narrative description of what they ran.
+At least in our personal experience with colleagues, this method is still being heavily practiced by many researchers.
+Given that many researchers don't get trained well in computational methods, this is not surprising; as discussed in Section \ref{discussion}, we believe that improved literacy in computational methods is the single most important factor for the integrity/reproducibility of modern science.
\subsubsection{Scripts}
\label{appendix:scripts}
Scripts (in any language, for example GNU Bash, or Python) are the most common ways of organizing a series of steps.
-They are primarily designed execute each step sequentially (one after another), making them also very intuitive.
+They are primarily designed to execute each step sequentially (one after another), making them also very intuitive.
However, as the series of operations become complex and large, managing the workflow in a script will become highly complex.
+
For example if 90\% of a long project is already done and a researcher wants to add a follow-up step, a script will go through all the previous steps (which can take significant time).
-Also, if a small step in the middle of an analysis has to be changed, the full analysis needs to be re-run: scripts have no concept of dependencies (so only the steps that are affected by that change are run).
+In other scenarios, when a small step in the middle of an analysis has to be changed, the full analysis needs to be re-run from the start.
+Scripts have no concept of dependencies, forcing authors to ``temporarily'' comment out parts that they don't want to be re-run (forgetting to un-comment such parts is a common cause of frustration for the authors and others attempting to reproduce the result).
+
Such factors discourage experimentation, which is a critical component of the scientific method.
-It is possible to manually add conditionals all over the script to add dependencies, but they just make it harder to read, and introduce many bugs themselves.
+It is possible to manually add conditionals all over the script to add dependencies, or to only run certain steps at certain times, but these just make it harder to read and introduce many bugs themselves.
Parallelization is another drawback of using scripts.
While it is not impossible, because of the high-level nature of scripts it is not trivial, and parallelization can also be very inefficient or buggy.
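
As a sketch of the manual-conditional workaround mentioned above (the file and script names are purely illustrative), such ad-hoc dependency handling in a shell script typically looks like this:

\begin{verbatim}
# Sketch: ad-hoc "dependency" handling inside a shell script.
if ! [ -f input/raw.dat ]; then
    ./download.sh              # Only download once.
fi
if [ prepared.dat -ot input/raw.dat ]; then
    ./prepare.sh               # Re-run only when the input is newer.
fi
\end{verbatim}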
\subsubsection{Make}
\label{appendix:make}
-Make was originally designed to address the problems mentioned in Appendix \ref{appendix:scripts} for scripts \citeappendix{feldman79}.
-In particular this motivation arose from management issues related to program compilation with many source code files.
-With Make, the various source files of a program that haven't been changed, wouldn't be recompiled.
+Make was originally designed to address the problems mentioned above for scripts \citeappendix{feldman79}.
+In particular, it arose in the context of managing the compilation of software programs that involve many source code files.
+With Make, the source files of a program that haven't been changed aren't recompiled.
Also, when two source files didn't depend on each other, and both needed to be rebuilt, they could be built in parallel.
-This greatly helped in debugging of software projects, and speeding up test builds, giving Make a core place in software building tools since then.
-The most common implementation of Make, since the early 1990s, is GNU Make \citeappendix[\url{http://www.gnu.org/s/make}]{stallman88}.
-The proposed solution uses Make to organize its workflow, see Section \ref{sec:usingmake}.
+This greatly helped in debugging software projects and speeding up test builds, giving Make a core place in software development over the last 40 years.
+
+The most common implementation of Make, since the early 1990s, is GNU Make.
+Make was also the framework used in the first attempts at reproducible scientific papers \citeappendix{claerbout1992,schwab2000}.
+Our proof-of-concept (Maneage) also uses Make to organize its workflow.
Here, we'll complement that description with more technical details on Make.
Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
@@ -876,24 +916,24 @@ Each stage/step in the analysis is defined through a \emph{rule}.
Rules define \emph{recipes} to build \emph{targets} from \emph{pre-requisites}.
In POSIX operating systems (Unix-like), everything is a file, even directories and devices.
Therefore all three components in a rule must be files on the running filesystem.
-Figure \ref{fig:makeexample} demonstrates a hypothetical Makefile with the targets, prerequisites and recipes highlighted.
To decide which operation should be re-done when executed, Make compares the time stamp of the targets and prerequisites.
When any of the prerequisite(s) is newer than a target, the recipe is re-run to re-build the target.
When all the prerequisites are older than the target, that target doesn't need to be rebuilt.
The recipe can contain any number of commands; they should just all start with a \inlinecode{TAB}.
-Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend interested readers to consult the GNU Make manual\footnote{\url{http://www.gnu.org/software/make/manual/make.pdf}}.
+Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend interested readers to consult the GNU Make manual for a nice introduction\footnote{\inlinecode{\url{http://www.gnu.org/software/make/manual/make.pdf}}}.
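+
+As a minimal illustration of this syntax (with hypothetical file names), a rule that rebuilds a plot only when its script or input table have changed could be written as:
+
+\begin{verbatim}
+# 'plot.pdf' is the target; 'plot.sh' and 'table.txt' are its
+# prerequisites. The recipe line must start with a TAB and is only
+# re-run when a prerequisite is newer than the target.
+plot.pdf: plot.sh table.txt
+	./plot.sh table.txt plot.pdf
+\end{verbatim}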
\subsubsection{SCons}
-Scons (\url{https://scons.org}) is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
+\label{appendix:scons}
+SCons is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
In many aspects it is similar to Make, for example it is managed through a `SConstruct' file.
Like a Makefile, SConstruct is also declarative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the imperative paradigm which is common in languages like C, Python, or FORTRAN).
However, unlike Make, SCons doesn't use the file modification date to decide if it should be remade.
-SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents has changed.
+SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents have changed.
SCons thus attempts to work on a declarative file with an imperative language (Python).
It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
-SCons is therefore more complex than Make: its manual is almost double that of GNU Make.
+SCons is therefore more complex than Make and its manual is almost twice the length of GNU Make's.
Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive).
Finally, it has the same drawback as any other tool that uses high-level languages, see Section \ref{appendix:highlevelinworkflow}.
@@ -904,27 +944,30 @@ The former will conflict with other system tools that assume \inlinecode{python}
This can also be problematic when a Python analysis library, may require a Python version that conflicts with the running SCons.
\subsubsection{CGAT-core}
-CGAT-Core (\url{https://cgat-core.readthedocs.io/en/latest}) is a Python package for managing workflows, see \citeappendix{cribbs19}.
+CGAT-Core is a Python package for managing workflows, see \citeappendix{cribbs19}.
It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
It is used in papers like \citeappendix{jones19}, but as mentioned there, it is mainly good for managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial in it.
Another drawback of this workflow manager is that Python is a very high-level language: future versions of the language may no longer be compatible with Python 3, in which CGAT-core is implemented (similar to how Python 2 programs are not compatible with Python 3).
\subsubsection{Guix Workflow Language (GWL)}
-GWL (\url{https://www.guixwl.org}) GWL is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
+GWL is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
It is closely linked with GNU Guix and can even install the necessary software needed for each individual process.
Hence in the GWL paradigm, software installation and usage doesn't have to be separated.
GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.
-As described above shell scripts and Make are a common and highly used system that have existed for several decades and many researchers are already familiar with them and have already used them.
-The list of necessary software solutions for the various stages of a research project (listed in the subsections of Appendix \ref{appendix:existingtools}), is already very large, and each software has its own learning curve (which is a heavy burden for a natural or social scientist for example).
-The other workflow management tools are too specific to a special paradigm, for example CGAT-core is written for Python, or GWL is intertwined with GNU Guix.
-Therefore their generalization into any kind of problem is not trivial.
+In conclusion, shell scripts and Make are very common and extensively used by users of Unix-based OSs (which are most commonly used for computations).
+They have also existed for several decades and are robust and mature.
+Many researchers are also already familiar with them and have used them before.
+As we see in this appendix, the list of necessary tools for the various stages of a research project (an independent environment, package managers, job organizers, analysis languages, writing formats, editors, etc.) is already very large.
+Each software has its own learning curve, which is a heavy burden for a natural or social scientist for example.
+Most other workflow management tools are yet another language that has to be mastered.
+
+Furthermore, high-level and specific solutions will evolve very fast, causing disruptions in the reproducible framework.
+A good example is Popper \citeappendix{jimenez17} which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
+However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language, see Appendix \ref{appendix:popper} for more on Popper.
+
-Also, high-level and specific solutions will evolve very fast, for example the Popper solution to reproducible research (see Appendix \ref{appendix:popper}) organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
-However, in September 2019, GitHub dropped HCL as its default configuration language and is now using its own custom YAML-based language.
-Using such high-level, or provider-specific solutions also has the problem that it makes them hard, or impossible, to use in any generic system.
-Therefore a robust solution would avoid designing their low-level processing steps in these languages and only use them for the highest-level layer of their project, depending on which provider they want to run their project on.
@@ -945,15 +988,16 @@ With text editors, the final edited file is independent of the actual editor and
This is a very important feature that is not commonly present for other solutions mentioned below.
Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that don't provide a graphic user interface.
+Also, the commonly used minimalistic containers don't include a graphic user interface.
\subsubsection{Integrated Development Environments (IDEs)}
To facilitate the development of source files, IDEs add software building and running environments as well as debugging tools to a plain text editor.
Many IDEs have their own compilers and debuggers, hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
Furthermore, they usually require a graphic user interface to run.
-In summary IDEs are generally very specialized tools, for special projects and are not a good solution when portability (the ability to run on different systems) is required.
+In summary, IDEs are generally very specialized tools for special projects, and are not a good solution when portability (the ability to run on different systems and at different times) is required.
\subsubsection{Jupyter}
-Jupyter \citeappendix[initially IPython,][]{kluyver16} is an implementation of Literate Programming \citeappendix{knuth84}.
+Jupyter (initially IPython) \citeappendix{kluyver16} is an implementation of Literate Programming \citeappendix{knuth84}.
The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\url{https://nbformat.readthedocs.io/en/latest}}.
Jupyter's name is a combination of the three main languages it was designed for: Julia, Python and R.
@@ -970,7 +1014,7 @@ It is possible to manually execute only one cell, but the previous/next cells th
Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\url{https://github.com/jupyter/notebook/issues/1175}}).
The fact that the \inlinecode{.ipynb} format stores narrative text, code and multi-media visualization of the outputs in one file, is another major hurdle:
-The files can easy become very large (in volume/bytes) and hard to read from source.
+The files can easily become very large (in volume/bytes) and hard to read.
Both are critical for scientific processing, especially the latter: a web-browser with the proper JavaScript features may not be available in a few years.
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.
@@ -990,10 +1034,10 @@ In summary, Jupyter is most useful in manual, interactive and graphical operatio
\label{appendix:highlevelinworkflow}
Currently the most popular high-level data analysis language is Python.
-R is closely tracking it, and has superseded Python in some fields, while Julia \citeappendix[with its much better performance compared to R and Python, in a high-level structure, see][]{bezanson17} is quickly gaining ground.
+R is closely tracking it, and has superseded Python in some fields, while Julia \citeappendix{bezanson17} is quickly gaining ground.
These languages have themselves superseded previously popular languages for data analysis of the previous decades, for example Java, Perl or C++.
All are part of the C-family programming languages.
-In many cases, this means that the tools to use that language are written in C, which is the language of the operating system.
+In many cases, this means that the tools to use that language are written in C, which is the language of modern operating systems.
Scientists, or data analysts, mostly use these higher-level languages.
Therefore they are naturally drawn to also apply the higher-level languages for lower-level project management, or designing the various stages of their workflow.
@@ -1015,24 +1059,23 @@ This isn't particular to Python, a similar evolution occurred in Perl: in 2000 i
However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).
It is unreasonably optimistic to assume that high-level languages won't undergo similar incompatible evolutions in the (not too distant) future.
-For software developers, this isn't a problem at all: non-scientific software, and the general population's usage of them, evolves extremely fast and it is rarely (if ever) necessary to look into codes that are more than a couple of years old.
-However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed.
+For software developers, this isn't a problem at all: non-scientific software, and the general population's usage of it, evolves similarly fast.
+Hence, it is rarely (if ever) necessary to look into codes that are more than a couple of years old.
+However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed in such high-level languages.
In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself cannot be ignored.
-The most robust way to address this problem is with a workflow management system that ideally doesn't need any major dependencies: tools that are already part of the operating system.
-
Beyond the technical, low-level problems for developers mentioned above, this causes major problems for scientific project management, as listed below:
\subsubsection{Dependency hell}
The evolution of high-level languages is extremely fast, even within one version.
For example, packages that are written in Python 3 often only work with a specific interval of Python 3 versions (for example newer than Python 3.6).
This isn't just limited to the core language; much faster changes occur in its higher-level libraries.
-For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citeappendix[see][]{hinsen15}.
+For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citeappendix{hinsen15}.
On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
For example, see Figure 1 of \citeappendix{alliez19}, which shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
+Acceptable version intervals between the dependencies will cause incompatibilities in a year or two, when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
-Acceptable dependency intervals between the dependencies will cause incompatibilities in a year or two, when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
Since domain scientists don't always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
This greatly increases the complexity of the project, even for the principal author.
A good reproducible workflow can account for these different versions.
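+As a minimal sketch of this situation (all environment names, file names and version pins below are hypothetical, chosen only for illustration), a researcher may end up keeping two separate Python environments for a single project and passing data between them through intermediate files:
+\begin{verbatim}
+# Hypothetical example: one project, two Python environments.
+$ python3 -m venv env-old && env-old/bin/pip install 'numpy==1.16.*'
+$ python3 -m venv env-new && env-new/bin/pip install 'numpy>=1.20'
+
+# The inherited script runs in one environment, the new analysis in
+# the other; data is passed between them through an intermediate file.
+$ env-old/bin/python legacy_step.py  > intermediate.txt
+$ env-new/bin/python current_step.py < intermediate.txt
+\end{verbatim}
+Such manual juggling is fragile and quickly becomes hard to document or reproduce.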
@@ -1054,7 +1097,7 @@ Of course, this also applies to tools that these systems use, for example Conda
This occurs primarily for domain scientists (for example astronomers, biologists or social scientists).
Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
The inertia of programming languages is very strong.
-This is natural, because they have their own science field to focus on, and re-writing their very high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages requires too much investment and time.
+This is natural, because they have their own science field to focus on, and re-writing their high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages every few years requires too much investment and time.
When this investment is not possible, either the mentee has to use the mentor's old method (and miss out on all the new tools, which they need for the future job prospects), or the mentor has to avoid implementation details in discussions with the mentee, because they don't share a common language.
The authors of this paper have personal experiences in both mentor/mentee relational scenarios.
@@ -1082,16 +1125,15 @@ This failure to communicate in the details is a very serious problem, leading to
\section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
-As reviewed in the introduction (Section \ref{sec:introduction}), the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
+As reviewed in the introduction, the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
In this appendix, some of the solutions are reviewed.
-The solutions are based on an evolving software landscape, therefore they are ordered by date\footnote{When the project has a webpage, the year of its first release is used, otherwise their paper's publication year is used.}.
-For each solution, we summarize its methodology and discuss how it relates to the principles in Section \ref{sec:principles}.
+The solutions are based on an evolving software landscape; therefore they are ordered by date: when the project has a webpage, the year of its first release is used for sorting, otherwise the paper's publication year is used.
+
+For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
-Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or Nextjournal (\url{https://nextjournal.com}) will not be reviewed here.
+Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
+Other studies have also attempted to review existing reproducible solutions, for example \citeappendix{konkol20}.
-\begin{itemize}
-\item \citeappendix{konkol20} have also done a review of some tools from various points of view.
-\end{itemize}
@@ -1099,14 +1141,14 @@ Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or
\subsection{Reproducible Electronic Documents, RED (1992)}
\label{appendix:red}
-Reproducible Electronic Documents (\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}) is the first attempt that we could find on doing reproducible research \citeappendix{claerbout1992,schwab2000}.
+RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the earliest attempt at doing reproducible research that we could find; see \citeappendix{claerbout1992,schwab2000}.
It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
Their introduction to the importance of reproducibility resonates strongly with today's environment in computational sciences, in particular the heavy investment one has to make in order to re-do another scientist's work, even in the same team.
RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.
-To orchestrate the various figures/results of a project, from 1990, they used ``Cake'' \citeappendix[]{somogyi87}, a dialect of Make, for more on Make, see Appendix \ref{appendix:jobmanagement}.
-As described in \citeappendix{schwab2000}, in the latter half of that decade, moved to GNU Make \citeappendix{stallman88}, which was much more commonly used, developed and came with a complete and up-to-date manual.
+To orchestrate the various figures/results of a project, from 1990 they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
+As described in \citeappendix{schwab2000}, in the latter half of that decade they moved to GNU Make, which was much more commonly used, more actively developed, and came with a complete and up-to-date manual.
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and to organize those steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
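+As a minimal sketch of this design (the file and target names below are hypothetical, not taken from RED or SEP), a central Makefile can include several low-level Makefiles and tie every step to the final paper, so that a single call to Make re-executes any out-of-date step:
+\begin{verbatim}
+$ cat Makefile              # hypothetical top-level Makefile
+include fig1.mk fig2.mk     # low-level rules, one file per figure
+paper.pdf: fig1.eps fig2.eps paper.tex
+        pdflatex paper.tex  # recipe lines must start with a TAB
+$ make                      # one command re-creates all results
+\end{verbatim}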
@@ -1125,13 +1167,13 @@ Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see A
\subsection{Apache Taverna (2003)}
\label{appendix:taverna}
-Apache Taverna (\url{https://taverna.incubator.apache.org}) is a workflow management system written in Java with a graphical user interface, see \citeappendix[still being actively developed]{oinn04}.
+Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface, and is still being actively developed.
A workflow is defined as a directed graph, where nodes are called ``processors''.
Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
-The main user interface is graphical, where users place processors in a sheet and define links between their inputs outputs.
+The main user interface is graphical, where users place processors in the given space and define links between their inputs and outputs (manually constructing a lineage like Figure \ref{fig:datalineage}).
+Taverna is only a workflow manager and isn't integrated with a package manager, hence the versions of the software used can differ between runs.
\citeappendix{zhao12} have studied the problem of workflow decays in Taverna.
-In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails} [Since kepler is older, it may be better to bring the VisTrails features here.]
@@ -1139,26 +1181,28 @@ In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails}
\subsection{Madagascar (2003)}
\label{appendix:madagascar}
-Madagascar (\url{http://ahay.org}) is a set of extensions to the SCons job management tool \citeappendix{fomel13}.
-For more on SCons, see Appendix \ref{appendix:jobmanagement}.
+Madagascar\footnote{\inlinecode{\url{http://ahay.org}}} \citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in Appendix \ref{appendix:scons}).
Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}.
+Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\url{http://www.ahay.org/wiki/Reproducible_Documents}}, 120 prior to \citeappendix{fomel13}.
Madagascar does include project management tools in the form of SCons extensions.
-However, it isn't just a reproducible project management tool, it is primarily a collection of analysis programs, tools to interact with RSF files, and plotting facilities.
+However, it isn't just a reproducible project management tool.
+It is primarily a collection of analysis programs, tools to interact with RSF files, and plotting facilities.
+The Regularly Sampled File (RSF) format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text format that points to the location of the actual data files on the filesystem and acts as the intermediary between Madagascar's analysis programs.
For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
The analysis programs mostly target geophysical data analysis, including various project specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users.
-Following the Unix spirit of modularized programs that communicating through text-based pipes, Madagascar's core is the custom Regularly Sampled File (RSF) format\footnote{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}.
-RSF is a plain-text file that points to the location of the actual data files on the filesystem, but it can also keep the raw binary dataset within same plain-text file.
Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format.
Since RSF contains program options also, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output.
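+For illustration, the following schematic (the keywords and program name are simplified assumptions, not the exact RSF specification) shows how a small plain-text header can point to the raw data and carry name/value pairs, with the \inlinecode{sf*} programs communicating through standard input/output:
+\begin{verbatim}
+$ cat velocity.rsf                 # schematic RSF-style header only
+in="/data/project/velocity.dat"    # location of the actual binary data
+n1=512  d1=0.004                   # name/value pairs, usable as options
+$ sfsomestep < velocity.rsf > smoothed.rsf   # hypothetical sf* program
+\end{verbatim}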
-Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\url{http://www.ahay.org/wiki/Reproducible_Documents}} \citeappendix[120 prior to][]{fomel13}.
-
+In terms of completeness, as long as the user only uses Madagascar's own analysis programs, it is fairly complete at a high level (but not at the level of lower-level OS libraries).
+However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build).
+Also, the linking between the analysis programs used by a certain user at a certain time and future (updated) versions of those programs is not immediately obvious.
+Madagascar could have been more useful to a larger community if the workflow components were maintained as a separate project from the analysis components.
\subsection{GenePattern (2004)}
\label{appendix:genepattern}
-GenePattern (\url{https://www.genepattern.org}) is a client-server software containing many common analysis functions/modules, primarily focused for Gene studies \citeappendix[first released in 2004]{reich06}.
+GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is a client-server software package containing many common analysis functions/modules, primarily focused on gene studies.
Although it is highly focused on a special research field, it is reviewed here because its concepts/methods are generic and relevant in the context of this paper.
Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
@@ -1167,23 +1211,23 @@ It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors})
However, the wrapper modules just call an existing tool on the host system.
Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.
+%% GenePattern shutdown announcement (although as of November 2020, it doesn't open any more!): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019
The primary GenePattern server was active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citeappendix{reich17}.
-However, it was shut down on November 15th 2019 due to end of funding\footnote{\url{https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019}}.
+However, it was shut down on November 15th 2019 due to the end of its funding.
All processing with this server has stopped, and any archived data on it has been deleted.
Since GenePattern is free software, there are alternative public servers to use, so hopefully work on it will continue.
However, funding is limited and those servers may face similar funding problems.
-This is a very nice example of the fragility of solutions that depend on archiving and running high-level research products (including data, binary/compiled code).
+This is a very clear example of the fragility of solutions that depend on archiving and running research code together with high-level research products (including data and binary/compiled code, which are expensive to keep) in one place.
\subsection{Kepler (2005)}
-Kepler (\url{https://kepler-project.org}) is a Java-based Graphic User Interface workflow management tool \citeappendix{ludascher05}.
-Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to Figure \ref{fig:analysisworkflow}).
-Each actor is connected to others through the Ptolemy approach \citeappendix{eker03}.
-In many aspects Kepler is like VisTrails, see Appendix \ref{appendix:vistrails}.
-\tonote{Since kepler is older, it may be better to bring the VisTrails features here.}
+Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}} \citeappendix{ludascher05} is a Java-based graphic user interface workflow management tool.
+Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to Figure \ref{fig:datalineage}).
+Each actor is connected to others following the Ptolemy II approach\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
+In many aspects, the usage of Kepler and its issues for long-term reproducibility are similar to those of Apache Taverna (see Appendix \ref{appendix:taverna}).
@@ -1192,7 +1236,7 @@ In many aspects Kepler is like VisTrails, see Appendix \ref{appendix:vistrails}.
\subsection{VisTrails (2005)}
\label{appendix:vistrails}
-VisTrails (\url{https://www.vistrails.org}) was a graphical workflow managing system that is described in \citeappendix{bavoil05}.
+VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow management system.
According to its webpage, VisTrails maintenance has stopped since May 2016; its last Git commit, as of this writing, was in November 2017.
However, the fact that it was well maintained for over 10 years is an achievement.
@@ -1202,14 +1246,11 @@ The XML attributes of each module can be used in any XML query language to find
Since the main goal was visualization (as images), apparently its primary output is in the form of image spreadsheets.
Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL); for more, see \citeappendix{scheidegger08}.
Since XML is a plain-text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
-
-With respect to keeping the history/provenance of the final dataset, VisTrails is very much like the template introduced in this paper.
However, even though XML is in plain text, it is very hard to edit manually.
VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:analysisworkflow}).
Besides the fact that it is no longer maintained, the conceptual differences with the proposed template are substantial.
The most important is that VisTrails doesn't control the software that is run; it only controls the sequence of steps that they are run in.
-This template also defines dependencies and operations based on the very standard and commonly known Make system, not a custom XML format.
-Scripts can easily be written to generate an XML-formatted output from Makefiles.
@@ -1218,18 +1259,18 @@ Scripts can easily be written to generate an XML-formatted output from Makefiles
\subsection{Galaxy (2010)}
\label{appendix:galaxy}
-Galaxy (\url{https://galaxyproject.org}) is a web-based Genomics workbench \citeappendix{goecks10}.
+Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}.
The main user interface is ``Galaxy Pages'', which does not require any programming: users simply use abstract ``tools'', which are wrappers over command-line programs.
-Therefore the actual running version of the program can be hard to control across different Galaxy servers \tonote{confirm this}.
+Therefore the actual running version of the program can be hard to control across different Galaxy servers.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
-Besides some small differences, this seems to be very similar to GenePattern (Appendix \ref{appendix:genepattern}).
+Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system).
\subsection{Image Processing On Line journal, IPOL (2010)}
-The IPOL journal (\url{https://www.ipol.im}) attempts to publish the full implementation details of proposed image processing algorithm as a scientific paper \citeappendix[first published article in July 2010]{limare11}.
+The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed).
The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB); the code must be well commented, and each part of it must be linked with the relevant part of the paper.
@@ -1237,20 +1278,21 @@ The authors must also submit several example datasets/scenarios.
The referee actually inspects the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input.
-The IPOL model is indeed the single most robust model of peer review and publishing computational research methods/implementations.
+The IPOL model is the single most robust model of peer review and publishing computational research methods/implementations that we have seen in this survey.
It has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
We encourage the reader to visit its webpage and see some of its recent papers and their demos.
-It can be so thorough and complete because it has a very narrow scope (image processing), and the published algorithms are highly atomic, not needing significant dependencies (beyond input/output), allowing the referees to go deep into each implemented algorithm.
-In fact, high-level languages like Perl, Python or Java are not acceptable precisely because of the additional complexities/dependencies that they require.
+The reason it can be so thorough and complete is its very narrow scope (image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output), allowing the referees/readers to go deep into each implemented algorithm.
+In fact, high-level languages like Perl, Python or Java are not acceptable in IPOL precisely because of the additional complexities/dependencies that they require.
+If any referee/reader were inclined to do so, a paper written in Maneage (the proof-of-concept solution presented in this paper) allows for a similar level of scrutiny, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
+
-Ideally (if any referee/reader was inclined to do so), the proposed template of this paper allows for a similar level of scrutiny, but for much more complex research scenarios, involving hundreds of dependencies and complex processing on the data.
\subsection{WINGS (2010)}
\label{appendix:wings}
-WINGS (\url{https://wings-workflows.org}) is an automatic workflow generation algorithm \citeappendix{gil10}.
+WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}} \citeappendix{gil10} is an automatic workflow generation algorithm.
It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images).
It allows users to define various workflow components (for example datasets or analysis components), with high-level goals.
It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints.
@@ -1262,27 +1304,28 @@ It then uses selection and rejection algorithms to find the best components usin
\subsection{Active Papers (2011)}
\label{appendix:activepapers}
-Active Papers (\url{http://www.activepapers.org}) attempts to package the code and data of a project into one file (in HDF5 format).
-It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine \citeappendix[see][]{hinsen11}.
+Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format).
+It was initially written in Java because its compiled bytecode can be run portably on any machine with a JVM \citeappendix{hinsen11}.
However, Java is not a commonly used platform today, hence it was later implemented in Python \citeappendix{hinsen15}.
In the Python version, all processing steps and input data (or references to them) are stored in a HDF5 file.
However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.
-As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to access it.
+As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required).
-Installing it using the Debian or Arch Linux package managers also failed due to dependencies.
+Installing it using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
Furthermore, as a high-level data format, HDF5 evolves very fast; for example HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team. % maybe replace with: February 29\textsuperscript{th}, 2020?
-While data and code are indeed fundamentally similar concepts technically \tonote{cite Konrad's paper on this}, they are used by humans differently.
-This becomes a burden when large datasets are used, this was also acknowledged in \citeappendix{hinsen15}.
-If the data are proprietary (for example medical patient data), the data must not be released, but the methods they were produced can.
+While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving terabytes of data can be less than a megabyte.
+Hence, storing code and data together becomes a burden when large datasets are used; this was also acknowledged in \citeappendix{hinsen15}.
+Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published.
Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary/reproducible files, and it is necessary to remove/dummify them, thus complicating the code and making it hard to read.
For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 giga-bytes.
-In many scenarios, peers just want to inspect the processing by reading the code and checking a very special part of it (one or two lines), not necessarily needing to run it, or obtaining the datasets.
-Hence the extra volume for data, and obscure HDF5 format that needs special tools for reading plain text code is a major burden.
+In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (one or two lines, for example to check the option values given to one step).
+They do not necessarily need to run it, or to obtain the datasets.
+Hence the extra volume for data, and the obscure HDF5 format that needs special tools just to read the plain-text code, are a major hindrance.
@@ -1291,7 +1334,7 @@ Hence the extra volume for data, and obscure HDF5 format that needs special tool
\subsection{Collage Authoring Environment (2011)}
\label{appendix:collage}
The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
-It is based on the GridSpace2\footnote{\url{http://dice.cyfronet.pl}} distributed computing environment\tonote{find citation}, which has a web-based graphic user interface.
+It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment\tonote{find citation}, which has a web-based graphic user interface.
Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
\tonote{See how it containerizes the software environment}
@@ -1301,8 +1344,8 @@ Through its web-based interface, viewers of a paper can actively experiment with
\subsection{SHARE (2011)}
\label{appendix:SHARE}
-SHARE (\url{https://is.ieis.tue.nl/staff/pvgorp/share}) is a web portal that hosts virtual machines (VMs) for storing the environment of a research project, for more, see \citeappendix{vangorp11}.
-The top project webpage above is still active, however, the virtual machines and SHARE system have been removed since 2019.
+SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
+The top project webpage above is still active; however, the virtual machines and SHARE system have been removed since 2019, probably due to the large volume and high maintenance cost of the VMs.
SHARE was recognized as second position in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
Simply put, SHARE is just a VM that users can download and run.
@@ -1314,7 +1357,7 @@ The limitations of VMs for reproducibility were discussed in Appendix \ref{appen
\subsection{Verifiable Computational Result, VCR (2011)}
\label{appendix:verifiableidentifier}
-A ``verifiable computational result'' (\url{http://vcr.stanford.edu}) is an output (table, figure, or etc) that is associated with a ``verifiable result identifier'' (VRI), see \citeappendix{gavish11}.
+A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (for example a table or figure) that is associated with a ``verifiable result identifier'' (VRI); see \citeappendix{gavish11}.
It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
A VRI is created using tags within the programming source that produced that output, also recording its version control or history.
@@ -1324,7 +1367,7 @@ According to \citeappendix{gavish11}, the VRI generation routine has been implem
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
Unfortunately most parts of the webpage are not complete at the time of this writing.
-The VCR webpage contains an example PDF\footnote{\url{http://vcr.stanford.edu/paper.pdf}} that is generated with this system, however, the linked VCR repository (\inlinecode{http://vcr-stat.stanford.edu}) does not exist at the time of this writing.
+The VCR webpage contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system; however, the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
Finally, the dates of the files in the MATLAB extension tarball are set to 2011, hinting that VCR was probably abandoned soon after the publication of \citeappendix{gavish11}.
@@ -1333,7 +1376,7 @@ Finally, the date of the files in the MATLAB extension tarball are set to 2011,
\subsection{SOLE (2012)}
\label{appendix:sole}
-SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix[for more, see ][]{pham12,malik13}.
+SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix{pham12,malik13}.
An SO is any code/content that is wrapped in begin/end tags with an associated type and name, for example special commented lines in a Python, R or C program.
The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI).
@@ -1341,25 +1384,27 @@ SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}.
For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
-Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) will be very expensive.
+Also, hosting a large virtual machine server with a fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags the authors may not consider important, but may be useful later).
-The solution of the proposed template (where anything coming out of the analysis is directly linked to the paper's contents with \LaTeX{} elements avoids these problems.
+In Maneage, instead of artificial/commented tags, the analysis inputs and outputs are automatically and directly linked to the paper's text, avoiding these problems.
\subsection{Sumatra (2012)}
-Sumatra (\url{http://neuralensemble.org/sumatra}) attempts to capture the environment information of a running project \citeappendix{davison12}.
-It is written in Python and is a command-line wrapper over the analysis script, by controlling its running, its able to capture the environment it was run in.
-The captured environment can be viewed in plain text, a web interface.
+Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}} \citeappendix{davison12} attempts to capture the environment information of a running project.
+It is written in Python and is a command-line wrapper over the analysis script.
+By controlling a project at running-time, Sumatra is able to capture the environment it was run in.
+The captured environment can be viewed in plain text or a web interface.
Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).
The actual code that Sumatra wraps around must itself be under version control, and it doesn't run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
Therefore Sumatra makes no attempt at storing the environment of the analysis as in Sciunit (see Appendix \ref{appendix:sciunit}); it only stores information about it.
-Sumatra thus needs to know the language of the running program.
+Sumatra thus needs to know the language of the running program and isn't generic.
+It just captures the environment; it doesn't store \emph{how} that environment was built.
@@ -1368,32 +1413,27 @@ Sumatra thus needs to know the language of the running program.
\subsection{Research Object (2013)}
\label{appendix:researchobject}
-The Research object (\url{http://www.researchobject.org}) is collection of meta-data ontologies, to describe aggregation of resources, or workflows, see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
+The Research Object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of meta-data ontologies to describe the aggregation of resources or workflows; see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.
\citeappendix{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
-The important thing is that the research object concept is not specific to any special workflow, it is just a metadata bundle which is only as robust in reproducing the result as the running workflow.
-For example, Apache Taverna cannot guarantee exact reproducibility as described in Appendix \ref{appendix:taverna}.
-But when a translator is written to convert the proposed template into research objects, they can do this.
-
+The important thing is that the research object concept is not specific to any special workflow; it is just a metadata bundle/standard which is only as robust in reproducing the result as the running workflow.
\subsection{Sciunit (2015)}
\label{appendix:sciunit}
-Sciunit (\url{https://sciunit.run}) defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
+Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
-For more, please see \citeappendix{meng15}.
In our tests, Sciunit installed successfully; however, we couldn't run it because of a dependency problem with the \inlinecode{tempfile} package (in the standard Python library).
Sciunit is written in Python 2 (which reached its end of life on January 1st, 2020) and its last Git commit in its main branch is from June 2018 (more than 1.5 years ago).
Recent activity in a \inlinecode{python3} branch shows that others are attempting to translate the code into Python 3 (the main author has graduated and is apparently not working on Sciunit anymore).
Because we weren't able to run it, the following discussion will just be theoretical.
-The main issue with Sciunit's approach is that the copied binaries are just black boxes.
-Therefore, its not possible to see how the used binaries from the initial system were built, or possibly if they have security problems.
+The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built.
This is a major problem for scientific projects, in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of an analysis requires a lot of space).
@@ -1401,9 +1441,8 @@ This is a major problem for scientific projects, in principle (not knowing how t
\subsection{Binder (2017)}
-Binder (\url{https://mybinder.org}) is a tool to containerize already existing Jupyter based processing steps.
-Users simply add a set of Binder-recognized configuration files to their repository.
-Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
+Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize already existing Jupyter-based processing steps.
+Users simply add a set of Binder-recognized configuration files to their repository, and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
One good feature of Binder is that the imported Docker image must be tagged (something like a checksum).
This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
However, it does not ensure that the Dockerfile used by the imported Docker image also follows a similar convention.
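+For illustration, a minimal Binder-ready repository could look like the following sketch (the repository layout and package list are assumptions for this example; a Conda \inlinecode{environment.yml} is one of the configuration files that Binder recognizes):
+\begin{verbatim}
+$ ls my-binder-repo/               # hypothetical repository contents
+analysis.ipynb  environment.yml
+$ cat my-binder-repo/environment.yml
+dependencies:                      # packages are installed with Conda
+  - python=3.8
+  - numpy=1.20
+\end{verbatim}
+Note that while package versions can be pinned in such a file, the lower-level build of those packages and of the underlying image remains outside the project's own control.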
@@ -1414,10 +1453,8 @@ Binder is used by \citeappendix{jones19}.
\subsection{Gigantum (2017)}
-Gigantum (\url{https://gigantum.com}) is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image and is free software (MIT License).
-\tonote{I couldn't find the license to the server software yet, but it says that 20GB is provided for ``free'', so it is a little confusing if anyone can actually run the server.}
-\tonote{I took the date from their PiPy page, where the first version 0.1 was published in November 2016.}
-
+%% I took the date from their PyPI page, where the first version 0.1 was published in November 2016.
+Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image.
Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
Simply put, it is a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without Gigantum's own client.
@@ -1432,19 +1469,15 @@ However, there is one directory which can be used to store files that must not b
\subsection{Popper (2017)}
\label{appendix:popper}
-Popper (\url{https://falsifiable.us}) is a software implementation of the Popper Convention \citeappendix{jimenez17}.
-The Convention is a set of very generic conditions that are also applicable to the template proposed in this paper.
-For a discussion on the convention, please see Section \ref{sec:principles}, in this section we'll review their software implementation.
-
+Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}.
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
-The \inlinecode{popper} program itself is written in Python, but job management is with the HashiCorp configuration language (HCL).
-HCL is primarily aimed at running jobs on HashiCorp's ``infrastructure as a service'' (IaaS) products.
-Until September 30th, 2019\footnote{\url{https://github.blog/changelog/2019-09-17-github-actions-will-stop-running-workflows-written-in-hcl}}, HCL was used by ``GitHub Actions'' to manage workflows. % maybe use the \textsuperscript{th} with dates?
+The \inlinecode{popper} program itself is written in Python.
+However, job management was initially based on the HashiCorp configuration language (HCL), because HCL was used by ``GitHub Actions'' to manage workflows.
+From October 2019, GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
+This illustrates the problem of basing low-level design choices on a particular service provider.
To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
-The scaffold is very similar to the raw template of that is proposed in this paper.
-However, as of this writing, the scaffold isn't complete.
-It lacks a manuscript and validation of outputs (as mentioned in the convention).
+However, as of this writing, the scaffold isn't complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
By default Popper runs in a Docker image (so root permissions are necessary), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
@@ -1467,14 +1500,6 @@ Furthermore, the fact that a Tale is stored as a binary Docker container causes
-\subsection{Things to add}
-\url{https://sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_180684.pdf}, does the following classification of tools:
- \begin{itemize}
- \item Research environments: \href{http://vcr.stanford.edu}{Verifiable computational research} (discussed above), \href{http://www.sciencedirect.com/science/article/pii/S1877050911001207}{SHARE} (a Virtual Machine), \href{http://www.codeocean.com}{Code Ocean} (discussed above), \href{http://jupyter.org}{Jupyter} (discussed above), \href{https://yihui.name/knitr}{knitR} (based on Sweave, dynamic report generation with R), \href{https://cran.r-project.org}{Sweave} (Function in R, for putting R code within \LaTeX), \href{http://www.cyverse.org}{Cyverse} (proprietary web tool with servers for bioinformatics), \href{https://nanohub.org}{NanoHUB} (collection of Simulation Programs for nanoscale phenomena that run in the cloud), \href{https://www.elsevier.com/about/press-releases/research-and-journals/special-issue-computers-and-graphics-incorporates-executable-paper-grand-challenge-winner-collage-authoring-environment}{Collage Authoring Environment} (discussed above), \href{https://osf.io/ns2m3}{SOLE} (discussed above), \href{https://osf.io}{Open Science framework} (a hosting webpage), \href{https://www.vistrails.org}{VisTrails} (discussed above), \href{https://pypi.python.org/pypi/Sumatra}{Sumatra} (discussed above), \href{http://software.broadinstitute.org/cancer/software/genepattern}{GenePattern} (reviewed above), Image Processing On Line (\href{http://www.ipol.im}{IPOL}) journal (publishes full analysis scripts, but doesn't deal with dependencies), \href{https://github.com/systemslab/popper}{Popper} (reviewed above), \href{https://galaxyproject.org}{Galaxy} (reviewed above), \href{http://torch.ch}{Torch.ch} (finished project for neural networks on images), \href{http://wholetale.org/}{Whole Tale} (discussed above).
- \item Workflow systems: \href{http://www.taverna.org.uk}{Taverna}, \href{http://www.wings-workflows.org}{Wings}, \href{https://pegasus.isi.edu}{Pegasus}, \href{http://www.pgbovine.net/cde.html}{CDE}, \href{http://binder.org}{Binder}, \href{http://wiki.datakurator.org/wiki}{Kurator}, \href{https://kepler-project.org}{Kepler}, \href{https://github.com/everware}{Everware}, \href{http://cds.nyu.edu/projects/reprozip}{Reprozip}.
- \item Dissemination platforms: \href{http://researchcompendia.org}{ResearchCompendia}, \href{https://datacenterhub.org/about}{DataCenterHub}, \href{http://runmycode.org}, \href{https://www.chameleoncloud.org}{ChameleonCloud}, \href{https://occam.cs.pitt.edu}{Occam}, \href{http://rcloud.social/index.html}{RCloud}, \href{http://thedatahub.org}{TheDataHub}, \href{http://www.ahay.org/wiki/Package_overview}{Madagascar}.
- \end{itemize}
-
@@ -1493,74 +1518,79 @@ Furthermore, the fact that a Tale is stored as a binary Docker container causes
-
-\newpage
-\section{Things remaining to add}
-\begin{itemize}
-\item Special volume on ``Reproducible research'' in the Computing in Science Engineering \citeappendix{fomel09}.
-\item ``I’ve learned that interactive programs are slavery (unless they include the ability to arrive in any previous state by means of a script).'' \citeappendix{fomel09}.
-\item \citeappendix{fomel09} discuss the ``versioning problem'': on different systems, programs have different versions.
-\item \citeappendix{fomel09}: a C program written 20 years ago was still usable.
-\item \citeappendix{fomel09}: ``in an attempt to increase the size of the community, Matthias Schwab and I submitted a paper to Computers in Physics, one of CiSE’s forerunners. It was rejected. The editors said if everyone used Microsoft computers, everything would be easily reproducible. They also predicted the imminent demise of Fortran''.
-\item \citeappendix{alliez19}: Software citation, with a nice dependency plot for matplotlib.
- \item SC \href{https://sc19.supercomputing.org/submit/reproducibility-initiative}{Reproducibility Initiative} for mandatory Artifact Description (AD).
- \item \href{https://www.acm.org/publications/policies/artifact-review-badging}{Artifact review badging} by the Association of computing machinery (ACM).
- \item eLife journal \href{https://elifesciences.org/labs/b521cf4d/reproducible-document-stack-towards-a-scalable-solution-for-reproducible-articles}{announcement} on reproducible papers. \citeappendix{lewis18} is their first reproducible paper.
- \item The \href{https://www.scientificpaperofthefuture.org}{Scientific paper of the future initiative} encourages geoscientists to include associate metadata with scientific papers \citeappendix{gil16}.
- \item Digital objects: \url{http://doi.org/10.23728/b2share.b605d85809ca45679b110719b6c6cb11} and \url{http://doi.org/10.23728/b2share.4e8ac36c0dd343da81fd9e83e72805a0}
- \item \citeappendix{mesirov10}, \citeappendix{casadevall10}, \citeappendix{peng11}: Importance of reproducible research.
- \item \citeappendix{sandve13} is an editorial recommendation to publish reproducible results.
- \item \citeappendix{easterbrook14} Free/open software for open science.
- \item \citeappendix{peng15}: Importance of better statistical education.
- \item \citeappendix{topalidou16}: Failed attempt to reproduce a result.
- \item \citeappendix{hutton16} reproducibility in hydrology, criticized in \citeappendix{melson17}.
- \item \citeappendix{fomel09}: Editorial on reproducible research.
- \item \citeappendix{munafo17}: Reproducibility in social sciences.
- \item \citeappendix{stodden18}: Effectiveness of journal policy on computational reproducibility.
- \item \citeappendix{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'', and that its important to empower scientists.
- \item \citeappendix{burrell18} open software (in particular Python) in heliophysics.
- \item \citeappendix{allen18} show that many papers don't cite software.
- \item \citeappendix{zhang18} explicity say that they won't release their code: ``We opt not to make the code used for the chemical evo-lution modeling publicly available because it is an important asset of the re-searchers’ toolkits''
- \item \citeappendix{jones19} make genuine effort at reproducing every number in the paper (using Docker, Conda, and CGAT-core, and Binder), but they can ultimately only release scripts. They claim its not possible to reproduce that level of reproducibility, but here we show it is.
- \item LSST uses Kubernetes and docker for reproducibility \citeappendix{banek19}.
- \item Interesting survey/paper on the importance of coding in science \citeappendix{merali10}.
- \item Discuss the Provenance challenge \citeappendix{moreau08}, showing the importance of meta data and provenance tracking.
- Especially that it is organized by teh medical scientists.
- Its webpage (for latest challenge) has a nice intro: \url{https://www.cccinnovationcenter.com/challenges/provenance-challenge}.
- \item In discussion: The XML provenance system is very interesting, scripts can be written to parse the Makefiles within this template to generate such XML outputs for easy standard metadata parsing.
- The XML that contains a log of the outputs is also interesting.
- \item \citeappendix{becker17} Discuss reproducibility methods in R.
- \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citeappendix{gabriel11}.
- \item \citeappendix{menke20} show how software identifability has seen the best improvement, so there is hope!
- \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}.
- \item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials}
- \item Jupyter Notebooks and problems with reproducibility: \citeappendix{rule18} and \citeappendix{pimentel19}.
- \item Reproducibility certification \url{https://www.cascad.tech}.
- \item \url{https://plato.stanford.edu/entries/scientific-reproducibility}.
- \item
-Modern analysis tools are almost entirely implemented as software packages.
-This has lead many scientists to adopt solutions that software developers use for reproducing software (for example to fix bugs, or avoid security issues).
-These tools and how they are used are thorougly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}.
-However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering.
-This difference can be broken up into the following categories, which are described more fully below:
-1) Reading vs. executing, 2) Archiving how software is used and 3) Citation of the software/methods used for scientific credit.
-
-The first difference is that, in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient).
-For a scientist it is more important to read/study the method of a paper that is 1, 10, or 100 years old.
-The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another scientist to execute.
-Another scientist just needs to be assured that the commands they are reading are exactly what was (and can potentially be) executed.
-
-On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because these are increasingly carried out by pre-written software programs and libraries.
-Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard-gained experience.
-Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citeappendix{dicosmo18} do a wonderful job at long-term preservation and archival of the software source code.
-However, preservation of the software's raw code is only part of the process; it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, and in conjunction with which other software tools.
-
-The third major difference is scientific credit, which is measured in units of citations, not dollars.
-As described above, scientific software is playing an increasingly important role in modern science.
-Because of the domain-specific knowledge necessary to produce such software, it is mostly written by scientists for scientists.
-Therefore, a significant amount of effort and research funding has gone into producing scientific software.
-At least for the software that does have an accompanying paper, it is thus important that those papers be cited when the software is used.
-\end{itemize}
+%%\newpage
+%%\section{Things remaining to add}
+%%\begin{itemize}
+%%\item \url{https://sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_180684.pdf}, does the following classification of tools:
+%% \begin{itemize}
+%% \item Research environments: \href{http://vcr.stanford.edu}{Verifiable computational research} (discussed above), \href{http://www.sciencedirect.com/science/article/pii/S1877050911001207}{SHARE} (a Virtual Machine), \href{http://www.codeocean.com}{Code Ocean} (discussed above), \href{http://jupyter.org}{Jupyter} (discussed above), \href{https://yihui.name/knitr}{knitR} (based on Sweave, dynamic report generation with R), \href{https://cran.r-project.org}{Sweave} (Function in R, for putting R code within \LaTeX), \href{http://www.cyverse.org}{Cyverse} (proprietary web tool with servers for bioinformatics), \href{https://nanohub.org}{NanoHUB} (collection of Simulation Programs for nanoscale phenomena that run in the cloud), \href{https://www.elsevier.com/about/press-releases/research-and-journals/special-issue-computers-and-graphics-incorporates-executable-paper-grand-challenge-winner-collage-authoring-environment}{Collage Authoring Environment} (discussed above), \href{https://osf.io/ns2m3}{SOLE} (discussed above), \href{https://osf.io}{Open Science framework} (a hosting webpage), \href{https://www.vistrails.org}{VisTrails} (discussed above), \href{https://pypi.python.org/pypi/Sumatra}{Sumatra} (discussed above), \href{http://software.broadinstitute.org/cancer/software/genepattern}{GenePattern} (reviewed above), Image Processing On Line (\href{http://www.ipol.im}{IPOL}) journal (publishes full analysis scripts, but doesn't deal with dependencies), \href{https://github.com/systemslab/popper}{Popper} (reviewed above), \href{https://galaxyproject.org}{Galaxy} (reviewed above), \href{http://torch.ch}{Torch.ch} (finished project for neural networks on images), \href{http://wholetale.org/}{Whole Tale} (discussed above).
+%% \item Workflow systems: \href{http://www.taverna.org.uk}{Taverna}, \href{http://www.wings-workflows.org}{Wings}, \href{https://pegasus.isi.edu}{Pegasus}, \href{http://www.pgbovine.net/cde.html}{CDE}, \href{http://binder.org}{Binder}, \href{http://wiki.datakurator.org/wiki}{Kurator}, \href{https://kepler-project.org}{Kepler}, \href{https://github.com/everware}{Everware}, \href{http://cds.nyu.edu/projects/reprozip}{Reprozip}.
+%% \item Dissemination platforms: \href{http://researchcompendia.org}{ResearchCompendia}, \href{https://datacenterhub.org/about}{DataCenterHub}, \href{http://runmycode.org}{RunMyCode}, \href{https://www.chameleoncloud.org}{ChameleonCloud}, \href{https://occam.cs.pitt.edu}{Occam}, \href{http://rcloud.social/index.html}{RCloud}, \href{http://thedatahub.org}{TheDataHub}, \href{http://www.ahay.org/wiki/Package_overview}{Madagascar}.
+%% \end{itemize}
+%%\item Special volume on ``Reproducible research'' in Computing in Science \& Engineering \citeappendix{fomel09}.
+%%\item ``I’ve learned that interactive programs are slavery (unless they include the ability to arrive in any previous state by means of a script).'' \citeappendix{fomel09}.
+%%\item \citeappendix{fomel09} discuss the ``versioning problem'': on different systems, programs have different versions.
+%%\item \citeappendix{fomel09}: a C program written 20 years ago was still usable.
+%%\item \citeappendix{fomel09}: ``in an attempt to increase the size of the community, Matthias Schwab and I submitted a paper to Computers in Physics, one of CiSE’s forerunners. It was rejected. The editors said if everyone used Microsoft computers, everything would be easily reproducible. They also predicted the imminent demise of Fortran''.
+%%\item \citeappendix{alliez19}: Software citation, with a nice dependency plot for matplotlib.
+%% \item SC \href{https://sc19.supercomputing.org/submit/reproducibility-initiative}{Reproducibility Initiative} for mandatory Artifact Description (AD).
+%% \item \href{https://www.acm.org/publications/policies/artifact-review-badging}{Artifact review badging} by the Association for Computing Machinery (ACM).
+%% \item eLife journal \href{https://elifesciences.org/labs/b521cf4d/reproducible-document-stack-towards-a-scalable-solution-for-reproducible-articles}{announcement} on reproducible papers. \citeappendix{lewis18} is their first reproducible paper.
+%% \item The \href{https://www.scientificpaperofthefuture.org}{Scientific paper of the future initiative} encourages geoscientists to include associated metadata with scientific papers \citeappendix{gil16}.
+%% \item Digital objects: \url{http://doi.org/10.23728/b2share.b605d85809ca45679b110719b6c6cb11} and \url{http://doi.org/10.23728/b2share.4e8ac36c0dd343da81fd9e83e72805a0}
+%% \item \citeappendix{mesirov10}, \citeappendix{casadevall10}, \citeappendix{peng11}: Importance of reproducible research.
+%% \item \citeappendix{sandve13} is an editorial recommendation to publish reproducible results.
+%% \item \citeappendix{easterbrook14} Free/open software for open science.
+%% \item \citeappendix{peng15}: Importance of better statistical education.
+%% \item \citeappendix{topalidou16}: Failed attempt to reproduce a result.
+%% \item \citeappendix{hutton16} reproducibility in hydrology, criticized in \citeappendix{melson17}.
+%% \item \citeappendix{fomel09}: Editorial on reproducible research.
+%% \item \citeappendix{munafo17}: Reproducibility in social sciences.
+%% \item \citeappendix{stodden18}: Effectiveness of journal policy on computational reproducibility.
+%% \item \citeappendix{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'' and argues that it is important to empower scientists.
+%% \item \citeappendix{burrell18}: open software (in particular Python) in heliophysics.
+%% \item \citeappendix{allen18} show that many papers don't cite software.
+%% \item \citeappendix{zhang18} explicitly say that they won't release their code: ``We opt not to make the code used for the chemical evolution modeling publicly available because it is an important asset of the researchers’ toolkits''
+%% \item \citeappendix{jones19} make a genuine effort at reproducing every number in the paper (using Docker, Conda, CGAT-core, and Binder), but they can ultimately only release scripts. They claim that a higher level of reproducibility is not possible, but here we show that it is.
+%% \item LSST uses Kubernetes and Docker for reproducibility \citeappendix{banek19}.
+%% \item Interesting survey/paper on the importance of coding in science \citeappendix{merali10}.
+%% \item Discuss the Provenance Challenge \citeappendix{moreau08}, showing the importance of metadata and provenance tracking.
+%% Especially noteworthy is that it is organized by medical scientists.
+%% Its webpage (for latest challenge) has a nice intro: \url{https://www.cccinnovationcenter.com/challenges/provenance-challenge}.
+%% \item In discussion: The XML provenance system is very interesting; scripts can be written to parse the Makefiles within this template to generate such XML outputs for easy standard metadata parsing.
+%% The XML that contains a log of the outputs is also interesting.
+%% \item \citeappendix{becker17} discuss reproducibility methods in R.
+%% \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citeappendix{gabriel11}.
+%% \item \citeappendix{menke20} show how software identifiability has seen the best improvement, so there is hope!
+%% \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}.
+%% \item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials}
+%% \item Jupyter Notebooks and problems with reproducibility: \citeappendix{rule18} and \citeappendix{pimentel19}.
+%% \item Reproducibility certification \url{https://www.cascad.tech}.
+%% \item \url{https://plato.stanford.edu/entries/scientific-reproducibility}.
+%% \item
+%%Modern analysis tools are almost entirely implemented as software packages.
+%%This has led many scientists to adopt solutions that software developers use for reproducing software (for example, to fix bugs or avoid security issues).
+%%These tools and how they are used are thoroughly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}.
+%%However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering.
+%%This difference can be broken up into the following categories, which are described more fully below:
+%%1) Reading vs. executing, 2) Archiving how software is used, and 3) Citation of the software/methods used for scientific credit.
+%%
+%%The first difference is that, in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient).
+%%For a scientist it is more important to read/study the method of a paper that is 1, 10, or 100 years old.
+%%The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another scientist to execute.
+%%Another scientist just needs to be assured that the commands they are reading are exactly what was (and can potentially be) executed.
+%%
+%%On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because these are increasingly carried out by pre-written software programs and libraries.
+%%Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard-gained experience.
+%%Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citeappendix{dicosmo18} do a wonderful job at long-term preservation and archival of the software source code.
+%%However, preservation of the software's raw code is only part of the process; it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, and in conjunction with which other software tools.
+%%
+%%The third major difference is scientific credit, which is measured in units of citations, not dollars.
+%%As described above, scientific software is playing an increasingly important role in modern science.
+%%Because of the domain-specific knowledge necessary to produce such software, it is mostly written by scientists for scientists.
+%%Therefore, a significant amount of effort and research funding has gone into producing scientific software.
+%%At least for the software that does have an accompanying paper, it is thus important that those papers be cited when the software is used.
+%%\end{itemize}
diff --git a/reproduce/analysis/make/initialize.mk b/reproduce/analysis/make/initialize.mk
index fbd110f..9ce157d 100644
--- a/reproduce/analysis/make/initialize.mk
+++ b/reproduce/analysis/make/initialize.mk
@@ -488,10 +488,11 @@ print-copyright = \
$(mtexdir)/initialize.tex: | $(mtexdir)
# Version and title of project.
- echo "\newcommand{\projecttitle}{$(metadata-title)}" > $@
+ @echo "\newcommand{\projecttitle}{$(metadata-title)}" > $@
echo "\newcommand{\projectversion}{$(project-commit-hash)}" >> $@
- # Zenodo identifier (necessary for download link):
+ # arXiv/Zenodo identifier (necessary for download link):
+ echo "\newcommand{\projectarxivid}{$(metadata-arxiv)}" >> $@
v=$$(echo $(metadata-doi-zenodo) | sed -e's/\./ /g' | awk '{print $$NF}')
echo "\newcommand{\projectzenodoid}{$$v}" >> $@
diff --git a/tex/src/references.tex b/tex/src/references.tex
index 992f40f..1ef3d7d 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -10,6 +10,17 @@
%% notice and this notice are preserved. This file is offered as-is,
%% without any warranty.
+@ARTICLE{aissi20,
+ author = {Dylan A\"issi},
+ title = {Review for Towards Long-term and Archivable Reproducibility},
+ year = {2020},
+ journal = {Authorea},
+ volume = {n/a},
+ pages = {n/a},
+ doi = {10.22541/au.159724632.29528907},
+}
+
+
@ARTICLE{mesnard20,
author = {Olivier Mesnard and Lorena A. Barba},
title = {Reproducible Workflow on a Public Cloud for Computational Fluid Dynamics},
@@ -1721,18 +1732,6 @@ Reproducible Research in Image Processing},
-@ARTICLE{stallman88,
- author = {Richard M. Stallman and Roland McGrath and Paul D. Smith},
- title = {GNU Make: a program for directing recompilation},
- journal = {Free Software Foundation},
- year = {1988},
- pages = {ISBN:1-882114-83-3. \url{https://www.gnu.org/s/make/manual/make.pdf}},
-}
-
-
-
-
-
@ARTICLE{somogyi87,
author = {Zoltan Somogyi},
title = {Cake: a fifth generation version of make},