 README.md          |   4
 paper.tex          |  47
 tex/img/codata.pdf | Bin 8798 -> 0 bytes
 3 files changed, 27 insertions(+), 24 deletions(-)
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
-Reproducible source for paper introducing Maneage (MANaging data linEAGE)
--------------------------------------------------------------------------
+Reproducible source for Akhlaghi et al. (2020, arXiv:2006.03018)
+----------------------------------------------------------------
 Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>\
 See the end of the file for license conditions.
diff --git a/paper.tex b/paper.tex
@@ -71,7 +71,7 @@
 We show that longevity is a realistic requirement that doesn't sacrifice immediate or short-term reproducibility, discuss the caveats (with proposed solutions) and conclude with the benefits for the various stakeholders.
 This paper is itself written with Maneage (project commit \projectversion).
 \vspace{3mm}
- \emph{Reproducible supplement} --- Necessary software, workflow and output data are published in \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{zenodo.3872248}} with Git repository at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}}.
+ \emph{Reproducible supplement} --- All products are in \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{zenodo.3872248}}; the Git history of the source is at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}}, which is also archived on \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{Software Heritage}.
 \end{abstract}

 % Note that keywords are not normally used for peer-review papers.
@@ -117,12 +117,13 @@ Decades later, scientists are still held accountable for their results and there
 \section{Commonly used tools and their longevity}
 Longevity is as important in science as in some fields of industry, but this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects.
-To highlight the necessity of longevity, some of the most commonly-used tools are reviewed here from this perspective.
-A common set of third-party tools that are commonly used can be categorized as:
+To highlight the necessity, some of the third-party tools commonly used in such solutions are reviewed here from this perspective.
+They can be categorized as:
 (1) environment isolators -- virtual machines (VMs) or containers;
 (2) package managers (PMs) -- Conda, Nix, or Spack;
 (3) job management -- shell scripts, Make, SCons, or CGAT-core;
-(4) notebooks -- such as Jupyter.
+(4) notebooks -- Jupyter.

 To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
 However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution; we will thus focus on Docker here.
@@ -141,12 +142,13 @@ The former suffers from the same longevity problem as the OS, while some of the
 Nix and GNU Guix produce bit-wise identical programs, but they need root permissions and are primarily targeted at the Linux kernel.
 Generally, the exact version of each software's dependencies is not precisely identified in the PM build instructions (although this could be implemented).
 Therefore, unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
-Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity.
+Furthermore, because each third-party PM introduces its own language, framework and version history (the PM itself may evolve), it increases a project's complexity.

 With the software environment built, job management is the next component of a workflow.
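The point above about storing precise version identifiers can be illustrated with a hypothetical Python requirements file; the package names and version numbers below are illustrative stand-ins, not taken from the paper:

```text
# requirements.txt -- hypothetical illustration.
#
# Unpinned: a rebuild years later silently installs whatever is
# newest, so the environment drifts away from the original.
#   numpy
#   matplotlib
#
# Pinned to exact versions: a later rebuild at least *requests* the
# original environment (transitive dependencies still float unless a
# full lock, e.g., the output of `pip freeze`, is stored as well).
numpy==1.18.1
matplotlib==3.1.3
```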
 Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors of the project.
 Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
-This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).
+This includes automatic verification: while it is possible in many solutions, it is rarely practiced.
+Weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).

 Finally, to add narrative, computational notebooks \cite{rule18}, like Jupyter, are currently gaining popularity.
 However, due to their complex dependency trees, they are vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
@@ -271,7 +273,7 @@ These are combined at the end to generate precise software acknowledgment and ci
 The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
 All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment.
-On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library is being added (task 15390).
+On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task 15390).
 Temporary relocation of a project, without building from source, can be done by building the project in a container or VM.

 The analysis phase of the project however is naturally different from one project to another at a low-level.
@@ -337,18 +339,6 @@ Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage r
 All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
 Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors.

-To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
-By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings or parameters.
-Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make.
-For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
-To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
-The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
-Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
-An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without necessarily knowing the underlying low-level implementation.
-Furthermore, the configuration files are a prerequisite of the targets that use them.
-If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
-This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
-
 \begin{figure*}[t]
   \begin{center} \includetikz{figure-branching}{scale=1}\end{center}
   \vspace{-3mm}
@@ -362,6 +352,18 @@ This fast and cheap testing encourages experimentation (without necessarily know
 } \end{figure*}

+To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
+By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings or parameters.
+Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make.
+For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
+To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}.
+Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
+An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without knowing the underlying low-level implementation.
+Furthermore, the configuration files are a prerequisite of the targets that use them.
+If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
+This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency.
+
 Finally, to satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
 Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
 Derived projects start by branching off and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for their particular analysis; see Listing \ref{code:branching}, and the customization checklist in \inlinecode{README-hacking.md}).
@@ -425,10 +427,11 @@ However, because the PM and analysis components share the same job manager (Make
 They later share their low-level commits on the core branch, thus propagating it to all derived projects.

 A related caveat is that POSIX is a fuzzy standard, not guaranteeing the bit-wise reproducibility of programs.
-It has been chosen here, however, as the underlying platform because our focus on reproducing the results (data) which doesn't always need that bit-wise identical software.
POSIX is ubiquitous and low-level software (e.g., core GNU tools) are install-able on most; each internally corrects for differences affecting its functionality (partly as part of the GNU portability library). -On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain, but glibc is not install-able on some POSIX OSs (e.g., macOS). -The C library is linked with all programs, and this dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far. +On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain. +However, glibc is not install-able on some POSIX OSs (e.g., macOS). +All programs link with the C library, and this may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far. With everything else under precise control, the effect of differing Kernel and C libraries on high-level science can now be systematically studied with Maneage in followup research. % DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal. diff --git a/tex/img/codata.pdf b/tex/img/codata.pdf Binary files differdeleted file mode 100644 index e00f2ca..0000000 --- a/tex/img/codata.pdf +++ /dev/null |