author    Mohammad Akhlaghi <mohammad@akhlaghi.org>    2020-01-26 02:55:43 +0000
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>    2020-01-26 02:55:43 +0000
commit    f62596ea8b97727ab8366965faf6f316d463ebf7 (patch)
tree      9e9e0690a69e7fc08752d4ac277fbc0a2c2bcef1 /paper.tex
parent    2ed0c82eff6af8aecd34b3e697b0e392ab129cd3 (diff)
General project structure and configuration described
In the last few days I have been writing these two sections in the middle of other work. But I am making this commit because it has already grown quite large! I am now moving on to the description of `./project make'.
Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  794
1 files changed, 517 insertions, 277 deletions
diff --git a/paper.tex b/paper.tex
index f03bd5e..a115a4c 100644
--- a/paper.tex
+++ b/paper.tex
@@ -29,7 +29,7 @@
{
\footnotesize\mplight
\textsuperscript{1} Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200 La Laguna, Tenerife, ES.\\
- \textsuperscript{2} Facultad de F\'isica, Universidad de La Laguna, Avda. Astrof\'isico Fco. S\'anchez s/n, 38200La Laguna, Tenerife, ES.\\
+ \textsuperscript{2} Facultad de F\'isica, Universidad de La Laguna, Avda. Astrof\'isico Fco. S\'anchez s/n, 38200, La Laguna, Tenerife, ES.\\
Corresponding author: Mohammad Akhlaghi
(\href{mailto:mohammad@akhlaghi.org}{\textcolor{black}{mohammad@akhlaghi.org}})
}}
@@ -75,7 +75,7 @@
In the last two decades, several technological advancements have profoundly affected how science is being done: much improved processing power, storage capacity and internet connections, combined with larger public datasets and more robust free and open source software solutions.
Given these powerful tools, scientists are using ever more complex processing steps and datasets, often combining many different software components in a single project.
-For example see the almost complete list of the necessary software in Appendix A of the following two research papers: \citet{akhlaghi19} and \citet{infante19}.
+For example see the almost complete list of the necessary software in Appendix A of the following two research papers: \citet{akhlaghi19} and \citet{infante20}.
The increased complexity of scientific analysis has made it impossible to describe all the analytical steps of a project, to a sufficient level of detail, in the traditional format of a published paper.
Citing this difficulty, many authors only describe the very high-level generalities of their analysis.
@@ -159,6 +159,19 @@ It later inspired \citet{buckheit1995} to publish a reproducible paper (in Matla
\tonote{Find some other historical examples.}
In this paper, a solution is introduced that attempts to address the problems above and has already been used in scientific papers.
+The primordial implementation of this system was in \citet{akhlaghi15}, which described a new detection algorithm in astronomical image processing.
+The detection algorithm was developed as the paper (initially a small report!) was being written.
+An automated sequence of commands to build the figures and update the paper/report was a practical necessity as the algorithm evolved.
+In particular, it didn't just reproduce figures; it also used \LaTeX{} macros to update numbers printed within the text.
+Finally, since the full analysis pipeline was in plain text and roughly 100 kilobytes (much less than a single figure), it was uploaded to arXiv with the paper's \LaTeX{} source, under a \inlinecode{reproduce/} directory, see \href{https://arxiv.org/abs/1505.01664}{arXiv:1505.01664}\footnote{
+  To download the \LaTeX{} source of any arXiv paper, click on the ``Other formats'' link, which contains the necessary instructions and links.}.
+
+The system later evolved in \citet{bacon17}, in particular the two sections of that paper that were done by M.A. (first author of this paper): \citet[\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}]{akhlaghi18a} and \citet[\href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}]{akhlaghi18b}.
+With these projects, the core skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects.
+The template later matured by including the installation of all necessary software from source, and was used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
+The short historical review above highlights how this template was created by practicing scientists and has evolved based on the needs of real scientific projects and working scenarios.
+
+
In Section \ref{sec:definitions}, the problem that is addressed by this template is clearly defined and Section \ref{appendix:existingsolutions} reviews some existing solutions and their pros and cons with respect to reproducibility in a scientific framework.
Section \ref{sec:template} introduces the template, and its approach to the problem, with full design details.
Finally in Section \ref{sec:discussion} the future prospects of using systems like this template are discussed.
@@ -322,6 +335,34 @@ In terms of interfaces, wrappers can be written over this core skeleton for vari
+\subsection{Principle: Complete/Self-contained}
+\label{principle:complete}
+A project should be self-contained, needing no particular features from the host operating system and not affecting it.
+At build-time (when the project is building its necessary tools), the project shouldn't need anything beyond a minimal POSIX environment on the host, which is available in Unix-like operating systems like GNU/Linux, BSD-based systems or macOS.
+At run-time (once the environment/software have been built), it should not use or affect any host operating system programs or libraries.
+
+Generally, a project's source should include the whole project: access to the inputs and validating them (see Section \ref{sec:definitions}), building the necessary software (access to tarballs and instructions on configuring, building and installing that software), doing the analysis (running the software on the data) and creating the final narrative report/paper.
+This principle has several important consequences:
+
+\begin{itemize}
+\item A complete project doesn't need an internet connection to build itself, do its analysis, or make a report.
+  Of course this only holds when the analysis itself doesn't require internet access, for example a live data feed.
+
+\item A complete project doesn't need any privileged/root permissions for system-wide installation or environment preparations.
+  Even when the user does have root privileges, interfering with the host operating system for a project may lead to many conflicts with the host or other projects.
+  This allows safe execution of the project and will not cause any security problems.
+
+\item A complete project inherently includes the complete data lineage and provenance: automatically enabling a full backtrace of the output datasets or narrative to the raw inputs: data or software source code lines (for the definition of inputs, please see Section \ref{definition:input}).
+ This is very important because existing data provenance solutions require manual tagging within the data workflow or connecting the data with the paper's text (Appendix \ref{appendix:existingsolutions}).
+ Manual tagging can be highly subjective, prone to many errors, and incomplete.
+\end{itemize}
+
+The first two points above are particularly important for high performance computing (HPC) facilities.
+For security reasons, HPC users commonly don't have privileged permissions or internet access.
+
+
+
+
\subsection{Principle: Modularity}
\label{principle:modularity}
@@ -338,13 +379,16 @@ This isn't a problem, because inputs are defined as files that are \emph{usable}
Such highly project-specific software can later spin off into a separate software package if necessary.
%One nice example of an existing system that doesn't follow this principle is Madagascar, see Appendix \ref{appendix:madagascar}.
-
+%\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository.
+% This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake.
+% Software developers already have decades of experience with the problems of mixing hand-written source files with the automatically generated (built) files.
+% This principle has proved to be an exceptionally useful in this model, grealy
\subsection{Principle: Plain text}
\label{principle:text}
-A project's primarily stored/archived format should be plain text with human-readable encoding\footnote{Plain text format doesn't include document container formats like `\texttt{.odf}' or `\texttt{.doc}', for software like LibreOffice or Microsoft Office.}, for example ASCII or Unicode (for the definition of a project, see Section \ref{definition:project}).
+A project's primary stored/archived format should be plain text with human-readable encoding\footnote{Plain text format doesn't include document container formats like \inlinecode{.odf} or \inlinecode{.doc}, for software like LibreOffice or Microsoft Office.}, for example ASCII or Unicode (for the definition of a project, see Section \ref{definition:project}).
The reason behind this principle is that opening, reading, or editing non-plain text (executable or binary) file formats needs specialized software.
Binary formats will complicate various aspects of the project: its usage, archival, automatic parsing, or human readability.
This is a critical principle for long term preservation and portability: when the software to read a binary format has been deprecated or become obsolete and isn't installable on the running system, the project will not be readable/usable any more.
@@ -355,6 +399,7 @@ After publication, independent modules of a plain-text project can be used and c
Archiving a binary version of the project is like archiving a well cooked dish itself, which will be inedible with changes in hardware (temperature, humidity, and the natural world in general).
But archiving a project's plain text source is like archiving the dish's recipe (which is also in plain text!): you can re-cook it any time.
When the environment is under perfect control (as in the proposed system), the binary/executable, or re-cooked, output will be identical.
+\citet{smart18} describe a nice example of how a seven-year-old dispute between condensed matter scientists could only be solved when they shared the plain text source of their respective projects.
This principle doesn't conflict with having an executable or immediately-runnable project\footnote{In their recommendation 4-1 on reproducibility, \citet{fineberg19} mention: ``a detailed description of the study methods (ideally in executable form)''.}.
This is because it is trivial to build a text-based project within an executable container or virtual machine.
@@ -364,36 +409,6 @@ A plain-text project's built/executable form can be published as an output of th
-
-
-\subsection{Principle: Complete/Self-contained}
-\label{principle:complete}
-A project should be self-contained, needing no particular features from the host operating system and not affecting it.
-At build-time (when the project is building its necessary tools), the project shouldn't need anything beyond a minimal POSIX environment on the host which is available in Unix-like operating system like GNU/Linux, BSD-based or macOS.
-At run-time (when environment/software are built), it should not use or affect any host operating system programs or libraries.
-
-Generally, a project's source should include the whole project: access to the inputs and validating them (see Section \ref{sec:definitions}), building necessary software (access to tarballs and instructions on configuring, building and installing those software), doing the analysis (run the software on the data) and creating the final narrative report/paper.
-This principle has several important consequences:
-
-\begin{itemize}
-\item A complete project doesn't need an internet connection to build itself or to do its analysis and possibly make a report.
- Of course this only holds when the analysis doesn't require internet, for example needing a live data feed.
-
-\item A Complete project doesn't need any previlaged/root permissions for system-wide installation or environment preparations.
- Even when the user does have root previlages, interefering with the host operating system for a project, may lead to many conflicts with the host or other projects.
- This allows a safe execution of the project, and will not cause any security problems.
-
-\item A complete project inherently includes the complete data lineage and provenance: automatically enabling a full backtrace of the output datasets or narrative, to raw inputs: data or software source code lines (for the definition of inputs, please see \ref{definition:input}).
- This is very important because existing data provenance solutions require manual tagging within the data workflow or connecting the data with the paper's text (Appendix \ref{appendix:existingsolutions}).
- Manual tagging can be highly subjective, prone to many errors, and incomplete.
-\end{itemize}
-
-The first two components are particularly important for high performace computing (HPC) facilities.
-Because of security reasons, HPC users commonly don't have previlaged permissions or internet access.
-
-
-
-
\subsection{Principle: Minimal complexity (i.e., maximal compatibility)}
\label{principle:complexity}
An important measure of the quality of a project is how much it avoids complexity.
@@ -424,18 +439,25 @@ However, Github dropped HCL in October 2019, for more see Appendix \ref{appendix
+\subsection{Principle: non-interactive processing}
+\label{principle:batch}
+A reproducible project should run without any manual interaction.
+Manual interaction is an inherently irreproducible operation, exposing the analysis to human error and requiring expert knowledge.
+
-\subsection{Principle: automatic verification}
+
+
+\subsection{Principle: Verifiable outputs}
\label{principle:verify}
-The project should automatically verify its outputs, such that expert knowledge won't be necessary to make sure a re-run was correct.
-This follows from the definition of exact reproducibility (Section \ref{definition:reproduction}).
+The project should contain automatic verification checks on its outputs.
+Combined with the principle of non-interactive processing (Section \ref{principle:batch}), expert knowledge won't be necessary to confirm a correct reproduction.
It is just important to emphasize that in practice, exact or bit-wise reproduction is very hard to implement at the level of a file.
This is because many specialized scientific software commonly print the running date on their output files (which is very useful in its own context).
-For example in plain text tables, such meta-data are commonly printed as commented lines (usually starting with `\texttt{\#}').
+For example in plain text tables, such meta-data are commonly printed as commented lines (usually starting with \inlinecode{\#}).
Therefore when verifying a plain text table, the checksum that is used to validate the data can be computed after removing all commented lines.
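+A minimal sketch of such a comment-agnostic checksum on a plain-text table (the file name and exact commands here are only illustrative assumptions, not the template's actual verification code) could be:
+\begin{lstlisting}[language=bash]
+  # Checksum of a plain-text table, ignoring commented metadata lines.
+  sed -e '/^#/d' table.txt | sha512sum
+\end{lstlisting}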
Fortunately, the tools to operate on specialized data formats also usually have ways to remove requested metadata (like creation date), or ignore them.
-For example the FITS standard in astronomy \citep{pence10} defines a special \texttt{DATASUM} keyword which is a checksum calculated only from the raw data, ignoring all metadata.
+For example the FITS standard in astronomy \citep{pence10} defines a special \inlinecode{DATASUM} keyword which is a checksum calculated only from the raw data, ignoring all metadata.
@@ -463,6 +485,8 @@ A system that uses this principle will also provide ``temporal provenance'', qua
+
+
\subsection{Principle: Free and open source software}
\label{principle:freesoftware}
Technically, as defined in Section \ref{definition:reproduction}, reproducibility is also possible with a non-free and non-open-source software (a black box).
@@ -474,7 +498,7 @@ However, as shown below, software freedom as an important pillar for the science
Since free software is modifiable, others can modify (or hire someone to modify) it and make it runnable on their particular platform.
\item A non-free software cannot be distributed by the authors, making the whole community reliant only on the proprietary owner's server (even if the proprietary software doesn't ask for payments).
A project that uses free software can also release the necessary tarballs of the software it uses.
- For example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19} or \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937} \citep{infante19}.
+ For example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19} or \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937} \citep{infante20}.
\item A core component of reproducibility is that anonymous peers should be able to confirm the result from the same datasets with minimal effort, and this includes financial cost beyond hardware.
\end{itemize}
@@ -486,68 +510,277 @@ However, as shown below, software freedom as an important pillar for the science
-%\section{Reproducible paper template (OLD)}
-%This template is based on a simple principle: the output's full lineage should be stored in a human-readable, plain text format, that can also be automatically run on a computer.
-%The primary components of the research output's lineage can be summarized as:
-%1) URLs/PIDs and checksums of external inputs.
-%These external inputs can be datasets, if the project needs any (not simulations), or software source code that must be downloaded and built.
-%2) Software building scripts.
-%3) Full series of steps to run the software on the data, i.e., to do the data analysis.
-%4) Building the narrative description of the project that describes the output (published paper).
-%where persistent identifiers, or URLs of input data and software source code, as well as instructions/scripts on how to build the software and run run them on the data to do the analysis
-%It started with \citet{akhlaghi15} which was a paper describing a new detection algorithm.
-%and further evolving in \citet{bacon17}, in particular the following two sections of that paper: \citet{akhlaghi18a} and \citet{akhlaghi18b}.
-%\tonote{Mention the \citet{smart18} paper on how a binary version is not sufficient, low-level, human-readable plain text source is mandatory.}
-%\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository.
-% This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake.
-% Software developers already have decades of experience with the problems of mixing hand-written source files with the automatically generated (built) files.
-% This principle has proved to be an exceptionally useful in this model, grealy
+
+
+
+
+
\section{Reproducible paper template}
\label{sec:template}
-The proposed solution is an implementation of the defintions and principles of Sections \ref{sec:definitions} and \ref{sec:principles}.
-In short, its a version-controlled directory with many plain-text files, distributed in various subdirectories, based on context.
+The proposed solution is an implementation of the principles discussed in Section \ref{sec:principles}: it is complete (Section \ref{principle:complete}), modular (Section \ref{principle:modularity}), fully in plain text (Section \ref{principle:text}), has minimal complexity (e.g., no dependencies beyond a minimal POSIX environment, see Section \ref{principle:complexity}), is runnable without any human interaction (Section \ref{principle:batch}), has verifiable outputs (Section \ref{principle:verify}), preserves temporal provenance, or project evolution (Section \ref{principle:history}), and finally, it is free software (Section \ref{principle:freesoftware}).
+
+In practice it is a collection of plain-text files that are distributed in sub-directories by context and are all under version control (currently with Git).
+In its raw form (before customizing for different projects), it is just a skeleton of a project without much flesh: containing all the low-level infrastructure, but without any real analysis.
+To start a new project, users will clone the core template, create their own Git branch, and start customizing it by adding their high-level analysis steps, figure creation sources and narrative.
+Because of this, we also refer to the proposed system as a ``template''.
+
+In this section we'll review its current implementation.
+Section \ref{sec:usingmake} describes the reasons behind using Make for job orchestration.
+It is followed by a general outline of the project's file structure in Section \ref{sec:generalimplementation}.
+As described there, we make a cosmetic distinction between ``configuration'' (building the necessary software) and ``execution'' (running the software on the data); these two phases are discussed in Sections \ref{sec:projectconfigure} \& \ref{sec:projectmake}.
+
+Finally, it is important to note that as with any software, the low-level implementation of this solution will inevitably evolve after the publication of this paper.
+We already have roughly 30 tasks that are left for the future and will affect various phases of the project as described here.
+However, we need to focus on more important issues at this stage and can't implement them before this paper's publication.
+Therefore, a list of the notable changes after the publication of this paper will be kept in the project's \inlinecode{README-hacking.md} file, and once it becomes substantial, new papers will be written.
+
+
+\subsection{Job orchestration with Make}
+\label{sec:usingmake}
+When non-interactive, or batch, processing is needed (see Section \ref{principle:batch}), shell scripts are usually the first solution that comes to mind (see Appendix \ref{appendix:scripts}).
+However, the inherent complexity and non-linearity of progress in a scientific project (where experimentation is key) makes it hard and inefficient to manage the script(s) as the project evolves.
+For example, a script will start from the top every time it is run.
+Therefore, if $90\%$ of a research project is done and only the newly added, final $10\%$ must be executed, it is necessary to run the whole script from the start.
+
+It is possible to manually ignore (through conditionals), or comment out, parts of a script to only run a specific part.
+However, such conditionals/comments will only add to the complexity and will discourage experimentation on an already completed part of the project.
+They are also prone to very serious bugs in the end, when trying to reproduce the project from scratch.
+Such bugs are very hard to notice during the work and frustrating to find in the end.
+Similar problems motivated the creation of Make in the early Unix operating system (see Appendix \ref{appendix:make}).
+
+In the Make paradigm, process execution starts from the end: the final \emph{target}.
+Through its syntax, the user specifies the \emph{prerequisite}(s) of each target and a \emph{recipe} (a small shell script) to create the target from the prerequisites.
+Make is thus able to build a dependency tree internally and find where it should start executing the recipes, each time the project is run.
+This has many advantages:
+\begin{itemize}
+\item \textbf{\small Only executing necessary steps:} in the scenario above, a researcher that has just added the final $10\%$ of her research will only have to run those extra steps, without any modification to the previous parts.
+  With Make, it is also trivial to change the processing of any intermediate (already written) \emph{rule} (or step) in the middle of an analysis: the next time Make is run, only rules that are affected by the changes/additions will be re-run, not the whole analysis/project.
+
+Most importantly, this enables full reproducibility from scratch with no changes in the project code that was working during the research.
+This will allow robust results and let the scientists get to what they do best: experiment and be critical of the methods/analysis without having to waste energy on the added complexity of experimentation in scripts.
+
+\item \textbf{\small Parallel processing:} Since the dependencies are clearly demarcated in Make, it can identify independent steps and run them in parallel.
+  This greatly speeds up the processing, with no cost in terms of complexity.
+
+\item \textbf{\small Codifying data lineage and provenance:} In many systems data provenance has to be manually added.
+ However, in Make, it is part of the design and no extra manual step is necessary to fully track (or back-track) the series of steps that generated the data.
+\end{itemize}
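+As a minimal sketch of the target/prerequisite/recipe syntax described above (the rule below is a hypothetical simplification, not one of the template's actual rules), \inlinecode{paper.pdf} is a target with two prerequisites, and the recipe only runs when a prerequisite is newer than the target:
+\begin{lstlisting}[language=make]
+  # Hypothetical rule: rebuild the paper only when its inputs change.
+  paper.pdf: paper.tex results.tex
+          pdflatex paper.tex
+\end{lstlisting}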
+
+Make has been a fixed component of POSIX (or Unix-like operating systems including Unix, GNU/Linux, BSD and macOS, among others) from the very early days of Unix, almost 40 years ago.
+It is therefore, by far, the most reliable, commonly used, well-known and well-maintained workflow manager today.
+Because the core operating system components are built with it, Make is expected to keep this unique position into the foreseeable future.
+
+Make is also well known by many outside of the software development community.
+For example \citet{schwab2000} report how geophysics students have easily adopted it for the RED project management tool used in their lab at that time (see Appendix \ref{appendix:red} for more on RED).
+Because of its simplicity, we have also had very good feedback on using Make from the early adopters of this system during the last year, in particular graduate students and postdocs.
+In summary, Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project management, such as creating a generational gap between researchers and the ``dependency hell'', see Appendix \ref{appendix:highlevelinworkflow}.
+For more on Make and a discussion on some other job orchestration tools, see Appendices \ref{appendix:make} and \ref{appendix:jobmanagement} respectively.
+
+
+
+\subsection{General implementation structure}
+\label{sec:generalimplementation}
+
+As described above, a project using this template is a combination of plain-text files that are organized in various directories by context.
+Figure \ref{fig:files} shows this directory structure and some representative files in each directory.
+The top-level source only has two main directories: \inlinecode{tex/} (containing \LaTeX{} files) and \inlinecode{reproduce/} (containing all other parts of the project) as well as several high-level files.
+Most of the top project directory files are only intended for human readers (as narrative text, not scripts or programming sources):
+\inlinecode{COPYING} is the project's high-level copyright license,
+\inlinecode{README.md} is a basic introduction to the specific project, and
+\inlinecode{README-hacking.md} describes how to customize, or hack, the template for creators of new projects.
+
+In the top project directory, there are two non-narrative files: \inlinecode{project} (which should have been under \inlinecode{reproduce/}) and \inlinecode{paper.tex} (which should have been under \inlinecode{tex/}).
+The former is necessary in the top project directory because it is the high-level user interface, with the \inlinecode{./project} command.
+The latter is necessary for many web-based automatic paper-generating systems like arXiv, journals, or systems like Overleaf.
+
\begin{figure}[t]
\begin{center}
\includetikz{figure-file-architecture}
\end{center}
\vspace{-5mm}
\caption{\label{fig:files}
- Directory and (plain-text) file structure in a hypothetical project using this solution.
- Files are shown with thin, darker boxes that have a suffix in their names (for example `\texttt{*.mk}' or `\texttt{*.conf}').
- Directories are shown as large, brighter boxes, where the name ends in a slash (\texttt{/}).
- Directories with dashed lines are symbolic links that are created after building the project, pointing to commonly needed built directories.
+ Directory and file structure in a hypothetical project using this solution.
+ Files are shown with small, green boxes that have a suffix in their names (for example \inlinecode{analysis-1.mk} or \inlinecode{param-2.conf}).
+ Directories (containing multiple files) are shown as large, brown boxes, where the name ends in a slash (\inlinecode{/}).
+ Directories with dashed lines and no files (just a description) are symbolic links that are created after building the project, pointing to commonly needed built directories.
+ Symbolic links and their contents are not considered part of the source and are not under version control.
Files and directories are shown within their parent directory.
- For example the full address of \texttt{analysis-1.mk} from the top project directory is \texttt{reproduce/analysis/make/analysis-1.mk}.
+ For example the full address of \inlinecode{analysis-1.mk} from the top project directory is \inlinecode{reproduce/analysis/make/analysis-1.mk}.
}
\end{figure}
+\inlinecode{project} is a simple executable POSIX-compliant shell script: a high-level wrapper to call the project's Makefiles.
+Recall that the main job orchestrator in this system is Make (see Section \ref{sec:usingmake} for a discussion on the benefits of Make).
+In the current implementation, the project's execution consists of the following two calls to \inlinecode{project}:
+\begin{lstlisting}[language=bash]
+ ./project configure # Build software from source (takes around 2 hours for full build).
+ ./project make # Do the analysis (download data, run software on data, build PDF).
+\end{lstlisting}
-\subsection{A project is a directory containing plain-text files}
-\label{sec:projectdir}
-Based on the plain-text principle (Section \ref{principle:text}), in the proposed solution, a project is a top-level directory that is ultimately filled with plain-text files (many of the files are under subdirectories).
+The operations of both are managed by files under the top-level \inlinecode{reproduce/} directory.
+When the first command is called, the contents of \inlinecode{reproduce\-/software} are used, and the latter calls files under \inlinecode{reproduce\-/analysis}.
+This highlights the \emph{cosmetic} distinction we have adopted between the two main steps of a project: 1) building the project's full software environment and 2) doing the analysis (running the software).
+
+Technically there is no difference between the two and they could easily be merged because both steps manage their operations through Makefiles.
+However, in a project, the analysis will involve many files, e.g., Python, R or shell scripts, C or Fortran program sources, or Makefiles.
+On the other hand, software building and project configuration also include many files.
+Mixing these two conceptually different (for humans!) sets of files under one directory can cause confusion for the people building or running the project.
+During a project, researchers will mostly be working on their analysis; they will rarely want to modify the software-building steps.
+
+In summary, the same structure governs both aspects of a project: software building and analysis.
+This is an important and unique feature in this template.
+A researcher that has become familiar with Makefiles for orchestrating their analysis will also easily be able to modify the Makefiles for the software that is built in their project, and thus feel free to customize their project's software as well.
+Most other systems use third-party package managers for their project's software, thus discouraging project-specific customization of software; for a full review of third-party package managers, see Appendix \ref{appendix:packagemanagement}.
+
+
+
+
+
+\subsection{Project configuration}
+\label{sec:projectconfigure}
+
+A critical component of any project is the set of software used to do the analysis.
+However, verifying an already built software environment, which is critical to reproducing the research result, is very hard.
+This has forced most projects to resort to moving around the whole \emph{built} software environment (a black box) as virtual machines or containers, see Appendix \ref{appendix:independentenvironment}.
+But since these black boxes are almost impossible to reproduce themselves, they need to be archived, even though they can take gigabytes of space.
+Package managers like Nix or GNU Guix do provide a verifiable, i.e., reproducible, software building environment, but because they aim to be generic package managers, they have their own limitations on a project-specific level, see Appendix \ref{appendix:nixguix}.
+
+Based on the principles of completeness and minimal complexity (Sections \ref{principle:complete} \& \ref{principle:complexity}), a project that uses this solution also contains the full instructions to build its necessary software.
+As described in Section \ref{sec:generalimplementation}, this is managed by the files under \inlinecode{reproduce/software}.
+Project configuration involves three high-level steps which are discussed in the subsections below: setting the local directories (Section \ref{sec:localdirs}), checking for a working C compiler (Section \ref{sec:ccompiler}), and downloading, building and installing the software from source (Section \ref{sec:buildsoftware}).
+
+\subsubsection{Setting local directories}
+\label{sec:localdirs}
+The ``build directory'' (\inlinecode{BDIR}) is a directory on the host filesystem.
+All files built by the project will be under it, and no other location on the running operating system will be affected.
+Following the modularity principle (Section \ref{principle:modularity}), this directory should be separate from the source directory, and the project will not allow the build directory to be anywhere under its top source directory.
+Therefore, at configuration time, the first thing to specify is the build directory on the running system.
+The build directory can be specified in two ways: 1) on the command-line with the \inlinecode{--build-dir} option, or 2) interactively after running the configuration: it will stop with a prompt and some explanation.
+
+Two other local directories can optionally be used by the project for its inputs (Section \ref{definition:input}): 1) software tarball directory and 2) input data directory.
+The contents of these directories just need read permissions by the user running the project.
+If given, nothing will be written inside them: the project will only look into them for the necessary software tarballs and input data.
+If the necessary files are not found there, the project will attempt to download them from the URLs/PIDs provided within the project source.
+These directories are therefore primarily tailored to scenarios where the project must run offline (based on the completeness principle of Section \ref{principle:complete}).
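+For example, a sketch of a configuration run may look like the following (the directory path is hypothetical; only the \inlinecode{--build-dir} option is taken from the description above):
+\begin{lstlisting}[language=bash]
+  # Configure the project, writing all built files under /scratch/myproject.
+  ./project configure --build-dir=/scratch/myproject
+\end{lstlisting}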
+
+After project configuration, a symbolic link is built in the top project source directory that points to the build directory.
+The symbolic link is a hidden file named \inlinecode{.build}, see Figure \ref{fig:files}.
+With this symbolic link, it is always very easy to access the built files, no matter where the build directory is actually located on the filesystem.
+
+
+\subsubsection{Checking for a C compiler}
+\label{sec:ccompiler}
+This template builds all its necessary software internally to avoid dependency issues with various software versions on different hosts.
+A working C compiler is thus a necessary prerequisite and the configure script will abort if a working C compiler isn't found.
+In particular, on GNU/Linux systems, the project builds its own version of the GNU Compiler Collection (GCC), therefore a static C library is necessary.
+
+The custom version of GCC is configured to also build Fortran, C++, Objective-C and Objective-C++ compilers.
+The Python and R running environments are themselves written in C; therefore they are also automatically built afterwards when necessary.
+On macOS systems, we currently don't build a C compiler, but it is planned to do so in the future.
+
+
+
+
+
+\subsubsection{Verifying and building necessary software from source}
+\label{sec:buildsoftware}
+
+All necessary software for the project, and their dependencies, are installed from source.
+Based on the completeness principle (Section \ref{principle:complete}), the dependency tree is tracked down to the GNU C Library and GNU Compiler Collection on GNU/Linux systems.
+When these two can't be installed (for example on macOS systems), the users are forced to rely on the host's C compiler and library, and this may hamper the exact reproducibility of the final result.
+Because the project's main output is a \LaTeX{}-built PDF, the project also contains an internal installation of \TeX{}Live, providing all the necessary tools to build the PDF, independent of the host operating system's \LaTeX{} version and packages.
+
+To build the software, the project needs access to the software source code.
+If the tarballs are already present on the system, the user can specify their directory at the start of the configuration process (Section \ref{sec:localdirs}); if not, the software tarballs will be downloaded from pre-defined servers.
+Ultimately the origin of the tarballs is irrelevant; what matters is their contents, which are checked through the SHA-512 checksum \citep[part of the SHA-2 algorithms, see][]{romine15}.
+If the SHA-512 checksum of the tarball is different from the checksum stored for it in the project's source, the project will complain and abort.
+Because the server is irrelevant, one planned improvement is to let users identify the most convenient server themselves, for example to improve download speed.
+
+Software tarball access, building and installation is managed through Makefiles, see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation}.
+The project's software are classified into two classes: 1) basic and 2) high-level.
+The former contains meta-software: software to build other software, for example GNU Gzip, GNU Tar, GNU Make, GNU Bash, GNU Coreutils, GNU SED, Git, GNU Binutils and the GNU Compiler Collection (GCC)\footnote{Note that almost all of these GNU software are also installable on non-GNU/Linux operating systems like BSD or macOS; exceptions include GNU Binutils.}.
+The basic software are built with the host operating system's tools and are installed within all projects.
+The high-level software are those that are used in analysis and can differ from project to project.
+However, because the basic software have already been built by the project, the higher-level software are built with them and are independent of the host operating system.
+
+Software building is managed by two top-level Makefiles that follow the same classification.
+Both are under the \inlinecode{reproduce\-/softwar\-e/make/} directory (Figure \ref{fig:files}): \inlinecode{basic.mk} and \inlinecode{high-level.mk}.
+Because \inlinecode{basic.mk} can't assume anything about the host, it is written to comply with POSIX Make and POSIX shell, which are very limited compared to GNU Make and GNU Bash.
+However, after it is finished, a specific version of GNU Make (among other basic software) is present, enabling us to assume the far more advanced features of GNU Make in \inlinecode{high-level.mk}.
+
+The project's software are installed under \inlinecode{BDIR/software/installed} (containing subdirectories like \inlinecode{bin/}, \inlinecode{lib/} and \inlinecode{include/}).
+For example the custom-built GNU Make executable is placed under \inlinecode{BDIR\-/software\-/installed\-/bin\-/make}.
+The symbolic link \inlinecode{.local} in the top project source directory points to it (see Figure \ref{fig:files}), so \inlinecode{.local/bin/make} is identical to the long path above.
+
+To orchestrate software building with Make, the building of each software has to conclude in a file, which should be used as a target or prerequisite in the Makefiles.
+Initially we tried using the software's actual built files (executable programs, libraries and so on), but managing all the different types of installed files was prone to many bugs and confusion.
+Therefore, once a software package is built, a simple plain-text file is created in \inlinecode{.local\-/version-info} and Make uses this file to refer to the software and arrange the order of software execution.
+The contents of this plain-text file are directly imported into the paper, for more see Section \ref{sec:softwarecitation} on software citation.
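+As a rough sketch of this convention (the variable names, package name, version and recipe below are hypothetical simplifications, not the template's actual rules), the build of a software package can be expressed as a rule whose target is its version-info file:
+\begin{lstlisting}[language=make]
+  # Hypothetical sketch: the target is a small text file marking a successful build.
+  $(version-info-dir)/foo-1.0: $(tarball-dir)/foo-1.0.tar.gz
+          tar xf $< && cd foo-1.0 && ./configure && make && make install
+          echo "foo 1.0" > $@
+\end{lstlisting}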
+
+
+
+\subsubsection{Software citation}
+\label{sec:softwarecitation}
+The template's Makefiles contain the full instructions to automatically build all the software.
+They thus contain the full list of installed software, their versions and their configuration options.
+However, this information is buried deep in the project's source; a distilled fraction of it must also be printed in the project's final report, blended into the narrative.
+Furthermore, when a published paper is associated with the used software, it is important to cite that paper: the citations help software authors gain more recognition and grants.
+This is particularly important in the case of research software, where the researcher has invested significant time in building the software, and requires official citation to continue working on it.
+
+One notable example is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before its actual outputs.
+This doesn't cause any problem in automatic scripts, but can be annoying when debugging the outputs.
+Users can disable the notice with the \inlinecode{--citation} option, by agreeing to cite its paper, or by supporting it directly with a payment of $10000$ euros.
+This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''.
+In bug 905674\footnote{Debian bug on the citation notice of GNU Parallel: \url{https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905674}}, the Debian developers argued that because of this extra condition, GNU Parallel should not be considered as free software, and they are using a patch to remove that part of the code for its build under Debian-based operating systems.
+Most other research software don't resort to such drastic measures; however, citation is important for them as well.
+Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper.
+
+As mentioned above in Section \ref{sec:buildsoftware}, a plain-text file is built automatically at the end of a software's successful build and installation.
+This file contains the software's name, version and possible citation information.
+At the end of the configuration phase, all these plain-text files are merged into one \LaTeX{} macro that can be imported directly into the final paper or report.
+In this paper, this macro's value is shown in Appendix \ref{appendix:softwareacknowledge}.
+The paragraph produced by this macro won't be too large, but it will greatly help in the recording of the used software environment and will automatically cite the software where necessary.
+
+In the current version of this template, it is assumed that the published report of a project is built with \LaTeX{}.
+Therefore, every software that has an associated paper, has a Bib\TeX{} file under the \inlinecode{reproduce\-/software\-/bibtex} directory.
+If the software is built for the project, its Bib\TeX{} entry (or entries) is copied to the build directory, and the command to cite that Bib\TeX{} record is included in the \LaTeX{} macro with the name and version of the software, as shown in Appendix \ref{appendix:softwareacknowledge}.
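+As a rough sketch of the final product (the macro name, software versions and citation key here are hypothetical, not the template's exact output), the merged macro may look like the following:
+\begin{lstlisting}[language=TeX]
+  % Hypothetical sketch of the auto-generated software acknowledgement macro.
+  \newcommand{\projectsoftwareacknowledge}{This project was done with
+    GNU Bash 5.0, GNU Make 4.2.90 and Gnuastro 0.11 \citep{gnuastro}.}
+\end{lstlisting}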
+
+For a review of the necessity and basic elements in software citation, see \citet{katz14} and \citet{smith16}.
+There are ongoing projects specifically tailored to software citation, including CodeMeta (\url{https://codemeta.github.io}) and the Citation File Format (CFF: \url{https://citation-file-format.github.io}).
+Both are based on schema.org, but are implemented in JSON-LD and YAML respectively.
+Another robust approach is provided by Software Heritage \citep{dicosmo18}.
+The advantage of the Software Heritage approach is that a published paper isn't necessary and it won't populate a paper's bibliography list.
+However, this also makes it hard to count as academic credit.
+We are considering using these tools, and exporting Bib\TeX{} entries when necessary.
+
+
+
+
+
+
+
+\subsection{Running the analysis}
+\label{sec:projectmake}
+
+Once a project is configured (Section \ref{sec:projectconfigure}), all the necessary software, with precise versions and configurations, have been built.
+The project is now ready to do the analysis, or in other words, to run the built software on the data.
-\subsection{Job orchestration with Make}
-\label{sec:make}
-Job orchestration is done with Make (see Appendix \ref{appendix:jobmanagement}).
-To start with, any POSIX-compliant Make version is enough.
-However, the project will build its own necessary software, and that includes custom version of GNU Make \citep{stallman88}.
-Several other job orchestration tools are reviewed in Appendix \ref{appendix:jobmanagement}, but we chose Make because of some unique features: 1) it is present on any POSIX compliant system, satisfying the principle of
-\tonote{\citet{schwab2000} discuss the ease of students learning Make.}
\begin{figure}[t]
\begin{center}
@@ -557,21 +790,16 @@ Several other job orchestration tools are reviewed in Appendix \ref{appendix:job
\caption{\label{fig:analysisworkflow}Schematic representation of built file dependencies in a hypothetical project/pipeline using the reproducible paper template.
Each colored box is a file in the project and the arrows show the dependencies between them.
Green files/boxes are plain text files that are under version control and in the source-directory.
- Blue files/boxes are output files of various steps in the build-directory, located within the Makefile (\texttt{*.mk}) that generates them.
- For example \texttt{paper.pdf} depends on \texttt{project.tex} (in the build directory and generated automatically) and \texttt{paper.tex} (in the source directory and written by hand).
- In turn, \texttt{project.tex} depends on all the \texttt{*.tex} files at the bottom of the Makefiles above it.
+ Blue files/boxes are output files of various steps in the build-directory, located within the Makefile (\inlinecode{*.mk}) that generates them.
+ For example \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory and generated automatically) and \inlinecode{paper.tex} (in the source directory and written by hand).
+ In turn, \inlinecode{project.tex} depends on all the \inlinecode{*.tex} files at the bottom of the Makefiles above it.
}
\end{figure}
-\subsection{Software citation}
-\label{sec:softwarecitation}
-\begin{itemize}
-\item \citet{smith16}: principles of software citation.
-\end{itemize}
@@ -696,14 +924,15 @@ Below we'll review some of the most common container solutions: Docker and Singu
When the installed software within VMs or containers is precisely under control, they are good solutions to reproducibly ``running''/repeating an analysis.
However, because they store the already-built software environment, they are not good for ``studying'' the analysis (how the environment was built).
Currently, the most common practice to install software within containers is to use the package manager of the operating system within the image, usually a minimal Debian-based GNU/Linux operating system.
-For example the Dockerfile\footnote{\url{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}} in the reproducible scripts of \citet{clarkso15}, which uses `\texttt{sudo apt-get install r-cran-rjags -y}' to install the R interface to the JAGS Bayesian statistics (rjags).
+For example the Dockerfile\footnote{\url{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}} in the reproducible scripts of \citet{clarkso15}, which uses \inlinecode{sudo apt-get install r-cran-rjags -y} to install the R interface to the JAGS Bayesian statistics (rjags).
However, the operating system package managers aren't static.
Therefore the versions of the downloaded and used tools within the Docker image will change depending on when it was built.
-At the time \citet{clarkso15} was published (June 2015), the \texttt{apt} command above would download and install rjags 3-15, but today (January 2020), it will install rjags 4-10.
+At the time \citet{clarkso15} was published (June 2015), the \inlinecode{apt} command above would download and install rjags 3-15, but today (January 2020), it will install rjags 4-10.
Such problems can be corrected with robust/reproducible package managers like Nix or GNU Guix within the docker image (see Appendix \ref{appendix:packagemanagement}), but this is rarely practiced today.
\subsubsection{Package managers}
-The virtual machine and container solutions mentioned above, install software in standard Unix locations (for example \texttt{/usr/bin}), but in their own independent operating systems.
+\label{appendix:packagemanagersinenv}
+The virtual machine and container solutions mentioned above, install software in standard Unix locations (for example \inlinecode{/usr/bin}), but in their own independent operating systems.
But if software are built in, and used from, a non-standard, project specific directory, we can have an independent build and run-time environment without needing root access, or the extra layers of the container or VM.
This leads us to the final method of having an independent environment: a controlled build of the software and its run-time environment.
Because this is highly intertwined with the way software are installed, we'll describe it in more detail in Section \ref{appendix:packagemanagement} where package managers are reviewed.
@@ -719,11 +948,12 @@ Package management is the process of automating the installation of software.
A package manager thus contains the following information (that can be used automatically) on each software package: the URL of the software's tarball, the other software that it possibly depends on, and how to configure and build it.
Here, some of the package management solutions that are used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions} are reviewed\footnote{For a list of existing package managers, please see \url{https://en.wikipedia.org/wiki/List_of_software_package_management_systems}}.
-Note that we are not including package manager that are only limited to one language, for example \texttt{pip} (for Python) or \texttt{tlmgr} (for \LaTeX).
+Note that we are not including package managers that are limited to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX).
-\begin{itemize}
-\item {\bf\small Operating system's package manager:}
-The most commonly used package managers are those of the host operating system, for example `\texttt{apt}' or `\texttt{yum}' respectively on Debian-based, or RedHat-based GNU/Linux operating systems (among many others).
+
+
+\subsubsection{Operating system's package manager}
+The most commonly used package managers are those of the host operating system, for example \inlinecode{apt} or \inlinecode{yum} respectively on Debian-based, or RedHat-based GNU/Linux operating systems (among many others).
These package managers are tightly intertwined with the operating system.
Therefore they require root access, and arbitrary control (for different projects) of the versions and configuration options of software within them is not trivial/possible: for example a special version of a software that may be necessary for a project, may conflict with an operating system component, or another project.
@@ -732,84 +962,89 @@ Hence if two projects need different versions of a software, it is not possible
When a full container or virtual machine (see Appendix \ref{appendix:independentenvironment}) is used for each project, it is common for projects to use the containerized operating system's package manager.
However, it is important to remember that operating system package managers are not static: software are updated on their servers.
-For example, simply adding `\texttt{apt install gcc}' to a \texttt{Dockerfile} will install different versions of GCC based on when the Docker image is created.
+For example, simply adding \inlinecode{apt install gcc} to a \inlinecode{Dockerfile} will install different versions of GCC based on when the Docker image is created.
Requesting a special version also doesn't fully address the problem because the package managers also download and install its dependencies.
Hence a fixed version of the dependencies must also be included.
In summary, these package managers are primarily meant for the operating system components.
Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) don't use the host's package manager, but an independent package manager, like the ones below.
-\item {\bf\small Conda/Anaconda:} Conda is an independent package manager that can be used on GNU/Linux, macOS, or Windows operating systems, although all software packages are not available in all operating systems.
- Conda is able to maintain an approximately independent environment on an operating system without requiring root access.
-
- Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
- However, it is not possible to fix the versions of the dependencies through the YAML files alone.
- This is thoroughly discussed under issue 787 of \texttt{conda-forge.github.io}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}, May 2019.
- In that discussion, the authors of \citet{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that atleast one of their their depenencies breaks shortly after this period.
- The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citet{gruning18}.
- However, as described in Appendix \ref{appendix:independentenvironment} containers just hide the reproducibility problem, they don't fix it: containers aren't static and need to evolve (i.e., re-built) with the project.
- Given these limitations, \citet{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
-
- Conda installs with a shell script that contains a binary-blob (+500 mega bytes, embeded in the shell script).
- This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
- After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
- However, the resulting environment is not fully independent of the host operating system as described below:
-
- \begin{itemize}
- \item The Conda installation directory is present at the start of environment variables like \texttt{PATH} (which is used to find programs to run) and other such environment variables.
- However, the host operating system's directories are also appended afterwards.
- Therefore, a user, or script may not notice that a software that is being used is actually coming from the operating system, not the controlled Conda installation.
+\subsubsection{Conda/Anaconda}
+Conda is an independent package manager that can be used on GNU/Linux, macOS, or Windows operating systems, although not all software packages are available on all operating systems.
+Conda is able to maintain an approximately independent environment on an operating system without requiring root access.
- \item Generally, by default Conda relies heavily on the operating system and doesn't include core analysis components like \texttt{mkdir}, \texttt{ls} or \texttt{cp}.
- Although they are generally the same between different Unix-like operatings sytems, they have their differences.
- For example `\texttt{mkdir -p}' is a common way to build directories, but this option is only available with GNU Coreutils (default on GNU/Linux systems).
- Running the same command within a Conda environment on a macOS for example, will crash.
- Important packages like GNU Coreutils are available in channels like conda-forge, but they are not the default.
- Therefore, many users may not recognize this, and failing to account for it, will cause unexpected crashes.
+Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
+However, it is not possible to fix the versions of the dependencies through the YAML files alone.
+This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}.
+In that discussion, the authors of \citet{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
+The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citet{gruning18}.
+However, as described in Appendix \ref{appendix:independentenvironment}, containers merely hide the reproducibility problem, they don't fix it: containers aren't static and need to evolve (i.e., be re-built) with the project.
+Given these limitations, \citet{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
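+To illustrate the point about the YAML files, a hypothetical environment file (all package names and versions below are illustrative) may pin the top-level packages while saying nothing about the versions of their own dependencies:
+\begin{verbatim}
+# environment.yml (hypothetical example)
+name: paper-env
+channels:
+  - conda-forge
+dependencies:
+  - python=3.7     # top-level versions are fixed here,
+  - numpy=1.16     # but the versions of their own dependencies
+  - astropy=3.2    # are only resolved at installation time.
+\end{verbatim}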
- \item Many major Conda packaging ``channels'' (for example the core Anaconda channel, or very popular conda-forge channel) don't include the C library, that a package was built with, as a dependency.
- They rely on the host operating system's C library.
- C is the core language of most modern operating systems and even higher-level languages like Python or R are written in it, and need it to run.
- Therefore if the host operating system's C library is different from the C library that a package was built with, a Conda-packaged program will crash and the project will not be executable.
- Theoretically, it is possible to define a new Conda ``channel'' which includes the C library as a dependency of its software packages, but it will take too much time for any individual team to practically implement all their necessary packages, up to their high-level science software.
+Conda installs with a shell script that contains a binary blob (over 500 megabytes, embedded in the shell script).
+This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
+After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
+However, the resulting environment is not fully independent of the host operating system as described below:
- \item Conda does allow a package to depend on a special build of its prerequisites (specified by a checksum, fixing its version and the version of its dependencies).
- However, this is rarely practiced in the main Git repositories of channels like Anaconda and conda-forge: only the name of the high-level prerequisite packages is listed in a package's \texttt{meta.yaml} file, which is version-controlled.
- Therefore two builds of the package from the same Git repository will result in different tarballs (depending on what prerequisites were present at build time).
- In the Conda tarball (that contains the binaries and is not under version control) \texttt{meta.yaml} does include the exact versions of most build-time dependencies.
- However, because the different software of one project may have been built at different times, if they depend on different versions of a single software there will be a conflict and the tarball can't be rebuilt, or the project can't be run.
- \end{itemize}
+\begin{itemize}
+\item The Conda installation directory is present at the start of the \inlinecode{PATH} environment variable (which is used to find programs to run) and other such environment variables.
+  However, the host operating system's directories are also appended afterwards.
+  Therefore, a user or script may not notice that a piece of software that is being used actually comes from the operating system, not the controlled Conda installation.
+
+\item Generally, by default Conda relies heavily on the operating system and doesn't include core analysis components like \inlinecode{mkdir}, \inlinecode{ls} or \inlinecode{cp}.
+  Although they are generally the same between different Unix-like operating systems, they have their differences.
+  For example \inlinecode{mkdir -p} is a common way to build directories, but this option is only available with GNU Coreutils (the default on GNU/Linux systems).
+  Running the same command within a Conda environment on macOS, for example, will crash.
+  Important packages like GNU Coreutils are available in channels like conda-forge, but they are not the default.
+  Therefore many users may not recognize this, and failing to account for it will cause unexpected crashes.
+
+\item Many major Conda packaging ``channels'' (for example the core Anaconda channel, or the very popular conda-forge channel) don't include the C library that a package was built with as a dependency.
+  They rely on the host operating system's C library.
+  C is the core language of most modern operating systems; even higher-level languages like Python or R are written in it and need it to run.
+  Therefore, if the host operating system's C library is different from the C library that a package was built with, a Conda-packaged program will crash and the project will not be executable.
+  Theoretically, it is possible to define a new Conda ``channel'' which includes the C library as a dependency of its software packages, but it would take too much time for any individual team to practically implement all their necessary packages, up to their high-level science software.
+
+\item Conda does allow a package to depend on a special build of its prerequisites (specified by a checksum, fixing its version and the version of its dependencies).
+ However, this is rarely practiced in the main Git repositories of channels like Anaconda and conda-forge: only the name of the high-level prerequisite packages is listed in a package's \inlinecode{meta.yaml} file, which is version-controlled.
+ Therefore two builds of the package from the same Git repository will result in different tarballs (depending on what prerequisites were present at build time).
+ In the Conda tarball (that contains the binaries and is not under version control) \inlinecode{meta.yaml} does include the exact versions of most build-time dependencies.
+ However, because the different software of one project may have been built at different times, if they depend on different versions of a single software there will be a conflict and the tarball can't be rebuilt, or the project can't be run.
+\end{itemize}
- As reviewed above, the low-level dependence of Conda on the host operating system's components and build-time conditions, is the primary reason that it is very fast to install (thus making it an attractive tool to software developers who just need to reproduce a bug in a few minutes).
- However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important.
+As reviewed above, the low-level dependence of Conda on the host operating system's components and build-time conditions is the primary reason that it is very fast to install (thus making it an attractive tool for software developers who just need to reproduce a bug in a few minutes).
+However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important.
-\item {\bf\small Nix or GNU Guix:} Nix \citep{dolstra04} and GNU Guix \citep{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
- Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
- GNU Guix is based on Nix, so we'll focus the review here on Nix.
+\subsubsection{Nix or GNU Guix}
+\label{appendix:nixguix}
+Nix \citep{dolstra04} and GNU Guix \citep{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
+Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
+GNU Guix is based on Nix, so we'll focus the review here on Nix.
- The Nix approach to package management is unique in that it allows exact dependency tracking of all the dependencies, and allows for multiple versions of a software, for more details see \citep{dolstra04}.
- In summary, a unique hash is created from all the components that go into the building of the package.
- That hash is then prefixed to the software's installation directory.
- For example \citep[from][]{dolstra04} if a certain build of GNU C Library 2.3.2 has a hash of \texttt{8d013ea878d0}, then it is installed under \texttt{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that are compiled with it (and thus need it to run) will link to this unique address.
- This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
+The Nix approach to package management is unique in that it allows exact tracking of all the dependencies and allows multiple versions of a piece of software to co-exist; for more details see \citet{dolstra04}.
+In summary, a unique hash is created from all the components that go into the building of the package.
+That hash is then prefixed to the software's installation directory.
+For example \citep[from][]{dolstra04}, if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2}, and all software that is compiled with it (and thus needs it to run) will link to this unique address.
+This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
- As mentioned in \citet{courtes15}, one major caveat with using these package managers is that they require a daemon with root previlages.
- This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
+As mentioned in \citet{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
+This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
- \tonote{While inspecting the Guix build instructions for some software, I noticed they don't actually mention the version names. This creates a similar issue withe Conda example above (how to regenerate the software with a given hash, given that its dependency versions aren't explicitly mentioned. Ask Ludo' about this.}
+\tonote{While inspecting the Guix build instructions for some software, I noticed they don't actually mention the version names. This creates a similar issue to the Conda example above (how to regenerate the software with a given hash, given that its dependency versions aren't explicitly mentioned). Ask Ludo' about this.}
-\item {\bf\small Spack:} is a package manager that is also influenced by Nix (similar to GNU Guix), see \citet{gamblin15}.
+\subsubsection{Spack}
+Spack is a package manager that is also influenced by Nix (similar to GNU Guix), see \citet{gamblin15}.
But unlike Nix or GNU Guix, it doesn't aim for full, bit-wise reproducibility and can be built without root access in any generic location.
It relies on the host operating system for the C library.
Spack is fully written in Python, where each software package is an instance of a class, which defines how it should be downloaded, configured, built and installed.
Therefore if the proper version of Python is not present, Spack cannot be used and when incompatibilities arise in future versions of Python (similar to how Python 3 is not compatible with Python 2), software building recipes, or the whole system, have to be upgraded.
Because of such bootstrapping problems (for example how Spack needs Python to build Python and other software), it is generally a good practice to use simpler, lower-level languages/systems for a low-level operation like package management.
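For reference, a Spack recipe is a Python class roughly along the following lines (the package name, URL, checksum and dependency below are hypothetical, and Spack's exact recipe interface has evolved between versions):
\begin{verbatim}
from spack import *

class Mysoft(Package):
    """Hypothetical analysis software (illustrative only)."""
    homepage = "https://example.org/mysoft"
    url      = "https://example.org/mysoft-1.0.tar.gz"

    # Hypothetical release and checksum.
    version('1.0', sha256='0000000000000000000000000000000000'
                          '000000000000000000000000000000')
    depends_on('zlib')

    def install(self, spec, prefix):
        # Standard configure/make/make-install sequence.
        configure('--prefix={0}'.format(prefix))
        make()
        make('install')
\end{verbatim}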
-\end{itemize}
+
+\subsection{Package management conclusion}
There are two common issues regarding generic package managers that hinder their usage for high-level scientific projects, as listed below:
\begin{itemize}
\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivatives) only download the software in a binary (pre-compiled) format.
@@ -858,18 +1093,16 @@ Today the distributed model of ``version control'' is the most common, where the
There are many existing version control solutions, for example CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
However, currently, Git is by far the most commonly used in individual projects and long-term archival systems like Software Heritage \citep{dicosmo18}. It is also the system that is used in the proposed template, so we'll only review it here.
-\begin{itemize}
-\item {\bf\small Git:} With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
- When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
- The commit involves copying the changed files into the repository and calculating a 40 character checksum/hash that is calculated from the files, an accompanying ``message'' (a narrarative description of the project's state), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other).
- For example `\texttt{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b}' is a commit identifer in the Git history that this paper is being written in.
- Commits are is commonly summarized by the checksum's first few characters, for example `\texttt{f4953cc}'.
-
- With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
- The team can host the Git history on a webpage and collaborate through that.
- There are several Git hosting services for example \href{http://github.com}{github.com}, \href{http://gitlab.com}{gitlab.com}, or \href{http://bitbucket.org}{bitbucket.org} (among many others).
+\subsubsection{Git}
+With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
+When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
+The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the project's state), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other).
+For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history that this paper is being written in.
+Commits are commonly summarized by the checksum's first few characters, for example \inlinecode{f4953cc}.
-\end{itemize}
+With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
+The team can host the Git history on a webpage and collaborate through that.
+There are several Git hosting services for example \href{http://github.com}{github.com}, \href{http://gitlab.com}{gitlab.com}, or \href{http://bitbucket.org}{bitbucket.org} (among many others).
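+As a simple illustration of this workflow (the file name, commit message and branch name below are hypothetical), committing a change and starting a parallel branch only takes a few commands:
+\begin{verbatim}
+$ git add paper.tex                   # stage the changed file
+$ git commit -m "Describe new step"   # record it in the history
+$ git checkout -b revised-figures     # start a parallel branch
+$ git log --oneline -1                # summary of the new commit
+\end{verbatim}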
@@ -882,68 +1115,68 @@ For example it is first necessary to download a dataset, then to do some prepara
Each one of these is a logically independent step which needs to be run before/after the others in a specific order.
There are many tools for managing the sequence of jobs; below we'll review the most common ones that are also used in the proposed template, or in the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
-\begin{itemize}
-\item {\bf\small Script:} Scripts (in any language, for example GNU Bash, or Python) are the most common ways of organizing a series of steps.
- They are primarily designed execute each step sequentially (one after another), making them also very intuitive.
- However, as the series of operations become complex and large, managing the workflow in a script will become highly complex.
- For example if 90\% of a long project is already done and a researcher wants to add a followup step, a script will go through all the previous steps (which can take significant time).
- Also, if a small step in the middle of an analysis has to be changed, the full analysis needs to be re-run: scripts have no concept of dependencies (so only the steps that are affected by that change are run).
- Such factors discourage experimentation, which is a critical component of the scientific method.
- It is possible to manually add conditionals all over the script to add dependencies, but they just make it harder to read, and introduce many bugs themselves.
- Parallelization is another drawback of using scripts.
- While its not impossible, because of the high-level nature of scripts, it is not trivial and parallelization can also be very inefficient or buggy.
-
-
-\item {\bf\small Make:} (\url{https://www.gnu.org/s/make}) Make was originally designed to address the problems mentioned above for scripts \citep{feldman79}.
- The most common implementation is GNU Make \citep{stallman88}.
- In particular to manage the compilation of programs with many source codes files.
- With it, the various source codes of a program that haven't been changed, wouldn't be recompiled.
- Also, when two source code files didn't depend on each other, and both needed to be rebuilt, they could be built in parallel.
- This greatly helped in debugging of software projects, and speeding up test builds.
-
- Because it has been a fixed component of the Unix systems and culture from very early days, it is by far the most used workflow manager today.
- It is already installed and used when building almost all components of Unix-like operating systems (including GNU/Linux, BSD, and macOS, among others).
- It is also well known (to different levels) by many people, even outside of software engineers (for example even astronomers have to run the \texttt{make} command when they want to install their analysis software).
- However, even though Make can be used to manage any series of steps that involve the creation of files (including data analysis), its usage has predominantly remained in the software-building sphere and it has yet to penetrate higher-level workflows.
-
- The proposed template uses Make to organize its workflow (as described in more detail above \tonote{add section reference later}).
- We'll thus do a short review here.
- A file containing Make instructions is known as a `Makefile'.
+\subsubsection{Scripts}
+\label{appendix:scripts}
+Scripts (in any language, for example GNU Bash, or Python) are the most common ways of organizing a series of steps.
+They are primarily designed to execute each step sequentially (one after another), making them also very intuitive.
+However, as the series of operations becomes large and complex, managing the workflow in a script becomes highly cumbersome.
+For example if 90\% of a long project is already done and a researcher wants to add a followup step, a script will go through all the previous steps (which can take significant time).
+Also, if a small step in the middle of an analysis has to be changed, the full analysis needs to be re-run: scripts have no concept of dependencies (which would allow re-running only the steps affected by that change).
+Such factors discourage experimentation, which is a critical component of the scientific method.
+It is possible to manually add conditionals all over the script to add dependencies, but they just make it harder to read, and introduce many bugs themselves.
+Parallelization is another drawback of using scripts.
+While it's not impossible, because of the high-level nature of scripts, it is not trivial and parallelization can also be very inefficient or buggy.
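+For example, a hypothetical linear analysis script (all file names below are illustrative) has to repeat every step on every run, even when only the last step has changed:
+\begin{verbatim}
+#!/bin/sh
+# Hypothetical linear pipeline: each run repeats all the steps.
+./download.sh      # fetch the input data
+./preprocess.sh    # clean/calibrate the inputs
+./fit-model.sh     # the actual analysis
+./make-plots.sh    # figures for the paper
+\end{verbatim}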
+
+
+\subsubsection{Make}
+\label{appendix:make}
+Make was originally designed to address the problems mentioned in Appendix \ref{appendix:scripts} for scripts \citep{feldman79}, in particular to manage the compilation of programs with many source code files.
+With it, the source files of a program that hadn't been changed wouldn't be recompiled.
+Also, when two source files didn't depend on each other and both needed to be rebuilt, they could be built in parallel.
+This greatly helped in debugging software projects and in speeding up test builds, giving Make a core place among software building tools.
+The most common implementation of Make, since the early 1990s, is GNU Make \citep[\url{http://www.gnu.org/s/make}]{stallman88}.
+
+The proposed template uses Make to organize its workflow (as described in more detail above \tonote{add section reference later}).
+We'll thus do a short review here.
+A file containing Make instructions is known as a `Makefile'.
Make manages `rules', which are composed of three components: `targets', `pre-requisites' and `recipes'.
All three components must be files on the running system (note that in Unix-like operating systems, everything is a file, even directories and devices).
To manage operations and decide which operation should be re-done, Make compares the time stamp of the files.
A rule's `recipe' contains instructions (most commonly shell scripts) to produce the `target' file when any of the `prerequisite' files are newer than the target.
When all the prerequisites are older than the target, in Make's paradigm, that target doesn't need to be re-built.
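For example, a minimal Makefile sketch (with hypothetical file names; recipe lines must begin with a TAB character) that builds a paper from an analysis result could look like this:
\begin{verbatim}
# 'paper.pdf' is the target of the first rule; 'paper.tex' and
# 'results.tex' are its prerequisites.
paper.pdf: paper.tex results.tex
	pdflatex paper.tex

# 'results.tex' is only re-built when the script or data are newer.
results.tex: analysis.sh input.dat
	./analysis.sh input.dat > results.tex
\end{verbatim}
With such a Makefile, editing only \inlinecode{paper.tex} and running Make will just re-run \LaTeX{}; the analysis is not repeated because \inlinecode{results.tex} is still newer than its prerequisites.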
-\item {\bf\small SCons:} (\url{https://scons.org}) is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
- In many aspects it is similar to Make, for example it is managed through a `SConstruct' file.
- Like a Makefile, SConstruct is also declerative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the the imperative paradigm which is common in languages like C, Python, or Fortran).
- However, unlike Make, SCons doesn't use the file modification date to decide if it should be remade.
- SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents has changed.
-
- SCons thus attempts to work on a declarative file with an imperative language (Python).
- It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
- SCons is therefore more complex than Make: its manual is almost double that of GNU Make.
- Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive).
-
- Finally, it has the same drawback as any other tool that uses hight-level languagues, see Section \ref{appendix:highlevelinworkflow}.
- We encountered such a problem while testing SCons: on the Debian-10 testing system, the `\texttt{python}' program pointed to Python 2.
- However, since Python 2 is now obsolete, SCons was built with Python 3 and our first run crashed.
- To fix it, we had to either manually change the core operating system path, or the SCons source hashbang.
- The former will conflict with other system tools that assume `\texttt{python}' points to Python-2, the latter may need root permissions for some systems.
- This can also be problematic when a Python analysis library, may require a Python version that conflicts with the running SCons.
-
-\item {\bf\small CGAT-core:} (\url{https://cgat-core.readthedocs.io/en/latest}) is a Python package for managing workflows, see \citet{cribbs19}.
- It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
- It is used papers like \citet{jones19}, but as mentioned in \citet{jones19} it is good for managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
- Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial in it.
- Another drawback with this workflow manager is that Python is a very high-level language where future versions of the language may no longer be compatible with Python 3, that CGAT-core is implemented in (similar to how Python 2 programs are not compatible with Python 3).
-
-\item {\bf\small Guix Workflow Language (GWL):} (\url{https://www.guixwl.org}) GWL is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
- It is closely linked with GNU Guix and can even install the necessary software needed for each individual process.
- Hence in the GWL paradigm, software installation and usage doesn't have to be separated.
- GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.
-\end{itemize}
+\subsubsection{SCons}
+SCons (\url{https://scons.org}) is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
+In many aspects it is similar to Make, for example it is managed through a `SConstruct' file.
+Like a Makefile, SConstruct is also declarative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the imperative paradigm which is common in languages like C, Python, or Fortran).
+However, unlike Make, SCons doesn't use the file modification date to decide if it should be remade.
+SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents have changed.
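+A minimal SConstruct sketch (with hypothetical file and script names) could be:
+\begin{verbatim}
+# Hypothetical SConstruct: build 'results.txt' from 'input.dat'.
+env = Environment()
+env.Command('results.txt', 'input.dat',
+            './analysis.sh $SOURCE > $TARGET')
+\end{verbatim}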
+
+SCons thus attempts to work on a declarative file with an imperative language (Python).
+It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
+SCons is therefore more complex than Make: its manual is almost double that of GNU Make.
+Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive).
+
+Finally, it has the same drawback as any other tool that uses high-level languages, see Section \ref{appendix:highlevelinworkflow}.
+We encountered such a problem while testing SCons: on the Debian-10 testing system, the \inlinecode{python} program pointed to Python 2.
+However, since Python 2 is now obsolete, SCons was built with Python 3 and our first run crashed.
+To fix it, we had to either manually change the core operating system path, or the SCons source hashbang.
+The former will conflict with other system tools that assume \inlinecode{python} points to Python 2, and the latter may need root permissions on some systems.
+This can also be problematic when a Python analysis library requires a Python version that conflicts with the running SCons.
+
+\subsubsection{CGAT-core}
+CGAT-Core (\url{https://cgat-core.readthedocs.io/en/latest}) is a Python package for managing workflows, see \citet{cribbs19}.
+It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
+It is used in papers like \citet{jones19}; however, as mentioned there, it is best suited to managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
+Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial with it.
+Another drawback of this workflow manager is that Python is a very high-level language: future versions of the language may no longer be compatible with Python 3, which CGAT-core is implemented in (similar to how Python 2 programs are not compatible with Python 3).
+
+\subsubsection{Guix Workflow Language (GWL)}
+GWL (\url{https://www.guixwl.org}) is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
+It is closely linked with GNU Guix and can even install the necessary software needed for each individual process.
+Hence in the GWL paradigm, software installation and usage doesn't have to be separated.
+GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.
As described above, shell scripts and Make are common, widely used systems that have existed for several decades, and many researchers are already familiar with them and have already used them.
The list of necessary software solutions for the various stages of a research project (listed in the subsections of Appendix \ref{appendix:existingtools}) is already very large, and each software has its own learning curve (which is a heavy burden for a natural or social scientist for example).
@@ -964,39 +1197,41 @@ For example Shell, Python or R scripts, Makefiles, Dockerfiles, or even the sour
Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files.
Here we'll review some common methods that are currently used.
-\begin{itemize}
-\item {\bf\small Text editor:} The most basic way to edit text files is through simple text editors which just allow viewing and editing such files, for example \texttt{gedit} on the GNOME graphic user interface.
- However, working with simple plain text editors like \texttt{gedit} can be very frustrating since its necessary to save the file, then go to a terminal emulator and execute the source files.
- To solve this problem there are advanced text editors like GNU Emacs that allow direct execution of the script, or access to a terminal within the text editor.
- However, editors that can execute or debug the source (like GNU Emacs), just run external programs for these jobs (for example GNU GCC, or GNU GDB), just as if those programs was called from outside the editor.
-
- With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
- This is a very important feature that is not commonly present for other solutions mentioned below.
- Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
- This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that don't provide a graphic user interface.
-
-\item {\bf\small Integrated Development Environments (IDEs):} To facilitate the development of source files, IDEs add software building and running environments as well as debugging tools to a plain text editor.
- Many IDEs have their own compilers and debuggers, hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
- Furthermore, they usually require a graphic user interface to run.
- In summary IDEs are generally very specialized tools, for special projects and are not a good solution when portability (the ability to run on different systems) is required.
-
-\item {\bf\small Jupyter:} Jupyter \citep[initially IPython,][]{kluyver16} is an implementation of Literate Programming \citep{knuth84}.
- The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
- Jupyter uses the custom built \texttt{.ipynb} format\footnote{\url{https://nbformat.readthedocs.io/en/latest}}.
+\subsubsection{Text editors}
+The most basic way to edit text files is through simple text editors which just allow viewing and editing such files, for example \inlinecode{gedit} on the GNOME graphic user interface.
+However, working with simple plain text editors like \inlinecode{gedit} can be very frustrating since it's necessary to save the file, then go to a terminal emulator and execute the source files.
+To solve this problem there are advanced text editors like GNU Emacs that allow direct execution of the script, or access to a terminal within the text editor.
+However, editors that can execute or debug the source (like GNU Emacs) just run external programs for these jobs (for example GNU GCC, or GNU GDB), just as if those programs were called from outside the editor.
+
+With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
+This is a very important feature that is not commonly present for other solutions mentioned below.
+Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
+This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that don't provide a graphic user interface.
+
+\subsubsection{Integrated Development Environments (IDEs)}
+To facilitate the development of source files, IDEs add software building and running environments as well as debugging tools to a plain text editor.
+Many IDEs have their own compilers and debuggers, hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
+Furthermore, they usually require a graphic user interface to run.
+In summary, IDEs are generally very specialized tools for special projects, and are not a good solution when portability (the ability to run on different systems) is required.
+
+\subsubsection{Jupyter}
+Jupyter \citep[initially IPython,][]{kluyver16} is an implementation of Literate Programming \citep{knuth84}.
+The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
+Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\url{https://nbformat.readthedocs.io/en/latest}}.
Jupyter's name is a combination of the three main languages it was designed for: Julia, Python and R.
-The \texttt{.ipynb} format, is a simple, human-readable (can be opened in a plain-text editor) file, formatted in Javascript Object Notation (JSON).
+The \inlinecode{.ipynb} format is a simple, human-readable (it can be opened in a plain-text editor) file, formatted in JavaScript Object Notation (JSON).
It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visualizations (for example images/plots), that are all stored in one file.
The cells can have any order, allowing the creation of a literate-programming-style graphical implementation, where narrative descriptions and executable patches of code can be intertwined.
For example, a paragraph of text about a patch of code can be followed by that patch, which is run immediately on the same page.
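As a rough, abridged sketch of this structure (a real notebook carries additional metadata fields), a two-cell notebook is stored roughly as:
\begin{verbatim}
{ "nbformat": 4, "nbformat_minor": 2, "metadata": {},
  "cells": [
    { "cell_type": "markdown", "metadata": {},
      "source": ["Narrative description of the step ..."] },
    { "cell_type": "code", "metadata": {},
      "execution_count": 1, "outputs": [],
      "source": ["print(2 + 2)"] }
  ] }
\end{verbatim}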
-The \texttt{.ipynb} format does theoretically allow dependency tracking between cells, see IPython mailing list (discussion started by Gabriel Becker from July 2013\footnote{\url{https://mail.python.org/pipermail/ipython-dev/2013-July/010725.html}}).
+The \inlinecode{.ipynb} format does theoretically allow dependency tracking between cells, see IPython mailing list (discussion started by Gabriel Becker from July 2013\footnote{\url{https://mail.python.org/pipermail/ipython-dev/2013-July/010725.html}}).
Defining dependencies between the cells can allow non-linear execution which is critical for large scale (thousands of files) and complex (many dependencies between the cells) operations.
+It allows automation, run-time optimization (deciding not to run a cell if it's not necessary) and parallelization.
However, Jupyter currently only supports a linear run of the cells: always from the start to the end.
It is possible to manually execute only one cell, but the previous/next cells that may depend on it, also have to be manually run (a common source of human error, and frustration for complex operations).
Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\url{https://github.com/jupyter/notebook/issues/1175}}).
-The fact that the \texttt{.ipynb} format stores narrative text, code and multi-media visualization of the outputs in one file, is another major hurdle:
+The fact that the \inlinecode{.ipynb} format stores narrative text, code and multi-media visualization of the outputs in one file is another major hurdle:
the files can easily become very large (in volume/bytes) and hard to read from source.
Both are critical issues for scientific processing, especially the latter: a web browser with the proper JavaScript features may not be available in a few years.
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.
@@ -1007,7 +1242,7 @@ However, the dependencies above are only on the server-side.
Since Jupyter is a web-based system, it requires many dependencies on the viewing/running browser also (for example special Javascript or HTML5 features, which evolve very fast).
As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility (as opposed to industrial/immediate reproducibility).
In summary, Jupyter is most useful for manual, interactive and graphical operations that are temporary (for example educational tutorials).
-\end{itemize}
+
@@ -1045,43 +1280,47 @@ It is unreasonably optimistic to assume that high-level languages won't undergo
For software developers, this isn't a problem at all: non-scientific software, and the general population's usage of them, evolves extremely fast and it is rarely (if ever) necessary to look into codes that are more than a couple of years old.
However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed.
+In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself cannot be ignored.
+The most robust way to address this problem is with a workflow management system that ideally doesn't need any major dependencies: tools that are already part of the operating system.
+
Beyond the technical, low-level problems for developers mentioned above, this causes major problems for scientific project management, as listed below:
-\begin{itemize}
-\item {\bf\small Dependency hell:} The evolution of high-level languages is extremely fast, even within one version.
- For example packages thar are written in Python 3 often only work with a special interval of Python 3 versions (for example newer than Python 3.6).
- This isn't just limited to the core language, much faster changes occur in their higher-level libraries.
- For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citep[see][]{hinsen15}.
-
- On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
- For example see Figure 1 of \citet{alliez19}, it shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
-
- Acceptable dependency intervals between the dependencies will cause incompatibilities in a year or two, when a robust pakage manager is not used (see Appendix \ref{appendix:packagemanagement}).
- Since a domain scientist doesn't always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
- This greatly increases the complexity of the project, even for the principal author.
- A good reproducible workflow can account for these different versions.
- However, when the actual workflow system (not the analysis software) is written in a high-level language this will cause a major problem.
-
- For example, merely installing the Python installer (\texttt{pip}) on a Debian system (with `\texttt{apt install pip2}' for Python 2 packages), required 32 other packages as dependencies.
- \texttt{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
- As of this writing, the `\texttt{pip3 install popper}' and `\texttt{pip2 install sciunit2}' commands for installing each, required 17 and 26 Python modules as dependencies.
- It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
- This problem actually occurred while we were testing Sciunit: even though it installed, it couldn't run because of conflicts (its last commit was only 1.5 years old), for more see Appendix \ref{appendix:sciunit}.
- \citet{hinsen15} also report a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
- Ofcourse, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).
-
-\item {\bf\small Generational gap:} This occurs primarily for domain scientists (for example astronomers, biologists or social sciences).
- Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
- The inertia of programming languages is very strong.
- This is natural, because they have their own science field to focus on, and re-writing their very high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages requires too much investment and time.
-
- When this investment is not possible, either the mentee has to use the mentor's old method (and miss out on all the new tools, which they need for the future job prospects), or the mentor has to avoid implementation details in discussions with the mentee, because they don't share a common language.
- The authors of this paper have personal experiences in both mentor/mentee relational scenarios.
- This failure to communicate in the details is a very serious problem, leading to the loss of valuable inter-generational experience.
-\end{itemize}
+\subsubsection{Dependency hell}
+The evolution of high-level languages is extremely fast, even within one version.
+For example, packages that are written in Python 3 often only work with a specific interval of Python 3 versions (for example newer than Python 3.6).
+This isn't just limited to the core language; much faster changes occur in their higher-level libraries.
+For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citep[see][]{hinsen15}.
-In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself cannot not be ignored.
-The most robust way to address this problem is with a workflow management system that ideally doesn't need any major dependencies: tools that are already part of the operating system.
+On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
+For example, see Figure 1 of \citet{alliez19}, which shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
+
+The acceptable version intervals between the dependencies will cause incompatibilities in a year or two, when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
+Since a domain scientist doesn't always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
+This greatly increases the complexity of the project, even for the principal author.
+A good reproducible workflow can account for these different versions.
+However, when the actual workflow system (not the analysis software) is written in a high-level language this will cause a major problem.
+
+For example, merely installing the Python installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages), required 32 other packages as dependencies.
+\inlinecode{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
+As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each, required 17 and 26 Python modules as dependencies.
+It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
+This problem actually occurred while we were testing Sciunit: even though it installed, it couldn't run because of conflicts (its last commit was only 1.5 years old), for more see Appendix \ref{appendix:sciunit}.
+\citet{hinsen15} also report a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
+Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).
+
+
+
+
+
+\subsubsection{Generational gap}
+This occurs primarily for domain scientists (for example astronomers, biologists or social scientists).
+Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
+The inertia of programming languages is very strong.
+This is natural, because they have their own science field to focus on, and re-writing their very high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages requires too much investment and time.
+
+When this investment is not possible, either the mentee has to use the mentor's old method (and miss out on all the new tools, which they need for their future job prospects), or the mentor has to avoid implementation details in discussions with the mentee, because they don't share a common language.
+The authors of this paper have personal experiences in both mentor/mentee relational scenarios.
+This failure to communicate in the details is a very serious problem, leading to the loss of valuable inter-generational experience.
@@ -1118,7 +1357,7 @@ Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or
\subsection{Reproducible Electronic Documents, RED (1992)}
-\label{appendix:electronicdocs}
+\label{appendix:red}
Reproducible Electronic Documents (\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}) is the first attempt that we could find on doing reproducible research \citep{claerbout1992,schwab2000}.
It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
@@ -1131,7 +1370,7 @@ As described in \citep{schwab2000}, in the latter half of that decade, moved to
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and to manage those steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
-The reader/user of a project had to manually edit the central Makefile and set the variable \texttt{RESDIR} (result dir), this is the directory where built files are kept.
+The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory): the directory where built files are kept.
Afterwards, the reader could set which figures/parts of the project to reproduce by manually adding their names to the central Makefile, and running Make.
At the time, Make was already practiced by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and to define conventions for the Makefiles to be consistent across projects.
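A hypothetical sketch of this convention (not taken from SEP's actual files; the variable value and directory names are only illustrative) could look like:
\begin{verbatim}
# Central Makefile, edited manually by the reader.
RESDIR = /scratch/myresults      # where built files are kept

# Only the figures listed here are reproduced.
include fig1/Makefile
include fig3/Makefile
\end{verbatim}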
@@ -1162,12 +1401,12 @@ In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails}
\label{appendix:madagascar}
Madagascar (\url{http://ahay.org}) is a set of extensions to the SCons job management tool \citep{fomel13}.
For more on SCons, see Appendix \ref{appendix:jobmanagement}.
-Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was disucssed in Appendix \ref{appendix:electronicdocs}.
+Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was disucssed in Appendix \ref{appendix:red}.
Madagascar does include project management tools in the form of SCons extensions.
However, it isn't just a reproducible project management tool; it is primarily a collection of analysis programs, tools to interact with RSF files, and plotting facilities.
-For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (`\texttt{PREFIX/bin/sf*}').
-The analysis programs mostly target geophysical data analysis, including various project specific tools: more than half of the total built tools are under the `\texttt{build/user}' directory which includes names of Madagascar users.
+For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
+The analysis programs mostly target geophysical data analysis, including various project specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users.
Following the Unix spirit of modularized programs that communicate through text-based pipes, Madagascar's core is the custom Regularly Sampled File (RSF) format\footnote{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}.
RSF is a plain-text file that points to the location of the actual data files on the filesystem, but it can also keep the raw binary dataset within the same plain-text file.
@@ -1344,7 +1583,7 @@ According to \citet{gavish11}, the VRI generation routine has been implemented i
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
Unfortunately most parts of the webpage are not complete at the time of this writing.
-The VCR webpage contains an example PDF\footnote{\url{http://vcr.stanford.edu/paper.pdf}} that is generated with this system, however, the linked VCR repository (\texttt{http://vcr-stat.stanford.edu}) does not exist at the time of this writing.
+The VCR webpage contains an example PDF\footnote{\url{http://vcr.stanford.edu/paper.pdf}} that is generated with this system, however, the linked VCR repository (\inlinecode{http://vcr-stat.stanford.edu}) does not exist at the time of this writing.
Finally, the dates of the files in the Matlab extension tarball are set to 2011, hinting that VCR was probably abandoned soon after the publication of \citet{gavish11}.
@@ -1407,9 +1646,9 @@ It automatically parses all the executables in the script, and copies them, and
Because the sciunit contains all the programs and necessary libraries, it's possible to run it readily on other systems that have a similar CPU architecture.
For more, please see \citet{meng15}.
-In our tests, Sciunit installed successfully, however we couldn't run it because of a dependency problem with the \texttt{tempfile} package (in the standard Python library).
+In our tests, Sciunit installed successfully; however, we couldn't run it because of a dependency problem with the \inlinecode{tempfile} package (in the standard Python library).
Sciunit is written in Python 2 (which reached its end-of-life on January 1st, 2020) and its last Git commit on its main branch is from June 2018 (+1.5 years ago).
-Recent activity in a \texttt{python3} branch shows that others are attempting to translate the code into Python 3 (the main author has graduated and apparently not working on Sciunit anymore).
+Recent activity in a \inlinecode{python3} branch shows that others are attempting to translate the code into Python 3 (the main author has graduated and is apparently not working on Sciunit anymore).
Because we weren't able to run it, the following discussion will just be theoretical.
The main issue with Sciunit's approach is that the copied binaries are just black boxes.
@@ -1456,12 +1695,12 @@ Popper (\url{https://falsifiable.us}) is a software implementation of the Popper
The Convention is a set of very generic conditions that are also applicable to the template proposed in this paper.
For a discussion of the convention, please see Section \ref{sec:principles}; in this section we'll review their software implementation.
-The Popper team's own solution is through a command-line program called \texttt{popper}.
-The \texttt{popper} program itself is written in Python, but job management is with the HashiCorp configuration language (HCL).
+The Popper team's own solution is through a command-line program called \inlinecode{popper}.
+The \inlinecode{popper} program itself is written in Python, but job management is with the HashiCorp configuration language (HCL).
HCL is primarily aimed at running jobs on HashiCorp's ``infrastructure as a service'' (IaaS) products.
Until September 30th, 2019\footnote{\url{https://github.blog/changelog/2019-09-17-github-actions-will-stop-running-workflows-written-in-hcl}}, HCL was used by ``GitHub Actions'' to manage workflows.
-To start a project, the \texttt{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
+To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
The scaffold is very similar to the raw template that is proposed in this paper.
However, as of this writing, the scaffold isn't complete.
It lacks a manuscript and validation of outputs (as mentioned in the convention).
@@ -1481,7 +1720,7 @@ It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:edito
The web-based nature of Whole Tale's approach, and its dependency on many tools (which have many dependencies themselves), is a major limitation for future reproducibility.
For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
This has been reported to the authors as issue 113\footnote{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}, but as all the second-order dependencies evolve, it's not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
-Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems: 1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands. 2) It is not possible to see how the environment was built accurately (when the Dockerfile uses \texttt{apt}), for more on this, please see Appendix \ref{appendix:packagemanagement}.
+Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems: 1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands; 2) it is not possible to accurately see how the environment was built (when the Dockerfile uses \inlinecode{apt}); for more on this, please see Appendix \ref{appendix:packagemanagement}.
@@ -1557,6 +1796,7 @@ Furthermore, the fact that a Tale is stored as a binary Docker container causes
%% Mention all used software in an appendix.
\section{Software acknowledgement}
+\label{appendix:softwareacknowledge}
\input{tex/build/macros/dependencies.tex}
%% Finish LaTeX