Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 129
1 files changed, 70 insertions, 59 deletions
diff --git a/paper.tex b/paper.tex
index f902b7c..8421c55 100644
--- a/paper.tex
+++ b/paper.tex
@@ -75,9 +75,8 @@
A set of criteria is introduced to address this problem:
%% METHOD
Completeness (no \new{execution requirement} beyond \new{a minimal Unix-like operating system}, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
- They have been tested in several research publications in various fields.
%% RESULTS
- As a proof of concept, ``Maneage'' is introduced, enabling cheap archiving, provenance extraction, and peer verification.
+ As a proof of concept, we introduce ``Maneage'' (Managing data lineage), which enables cheap archiving, provenance extraction, and peer verification, and has been tested in several research publications.
%% CONCLUSION
We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders.
@@ -156,20 +155,21 @@ A basic review of the longevity of commonly used tools is provided here \new{(fo
).
}
-To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, but discontinued in 2019).
+To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019).
However, containers (e.g., Docker or Singularity) are currently the most widely-used solution.
We will focus on Docker here because it is currently the most common.
\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.}
-Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} imports `\inlinecode{FROM ubuntu:16.04}'.
+Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
Hence, if the image is built in different months, it will contain different OS components.
-In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early).}
+In the year 2024, when this version's long-term support (LTS) expires \new{(if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early)}, the image will not be available at the expected URL.
Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
-\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
-Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel: older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).}
+\new{Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that images inactive for more than 6 months will be deleted from free accounts from mid-2021.}
+Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel.
+Hence, older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).}
Once the host OS is ready, PMs are used to install the software or environment.
Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software is built with generic PMs.
@@ -177,14 +177,14 @@ The former has \new{the same longevity} as the OS, while some of the latter (suc
Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; that of their supported CPU architectures}.
However, they need root permissions and are primarily targeted at the Linux kernel.
Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them.
-Unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
+Unless precise version identifiers of \emph{every software package} are stored by project authors, a third-party PM will use the most recent version.
Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity.
With the software environment built, job management is the next component of a workflow.
Visual/GUI workflow tools like Apache Taverna, GenePattern \new{(deprecated)}, Kepler or VisTrails \new{(deprecated)}, which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management.
\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.}
-The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project.
+The more recent solutions (mostly non-GUI, written in Python) leave this to the authors of the project.
Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
This includes automatic verification, which is possible in many solutions, but is rarely practiced.
Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
@@ -196,9 +196,9 @@ Furthermore, as with job management, computational notebooks do not actively enc
\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
By default, cell dependencies (e.g., automatically running some cells only after certain others), parallel execution, and usage of more than one language are not supported.
There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these.
-However, since they are not part of the core, their longevity can be assumed to be shorter.
-Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
-In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
+However, since they are not part of the core, a shorter longevity can be assumed.
+The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial.}
+Notebooks can therefore rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
@@ -206,7 +206,7 @@ In summary, notebooks can rarely deliver their promised potential \cite{rule18}
\section{Proposed criteria for longevity}
\label{criteria}
-The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}.
+The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it just before publication \cite{austin17,fineberg19}.
In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable.
We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
@@ -214,14 +214,14 @@ We argue and propose that workflows satisfying the following criteria can not on
\textbf{Criterion 1: Completeness.}
A project that is complete (self-contained) has the following properties.
(1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
-Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longevity}.}
+Fewer explicit execution requirements would mean larger \emph{execution possibility} and consequently greater \emph{longevity}.}
(2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
(3) No impact on the host OS libraries, programs, and \new{environment variables}.
-(4) Does not require root privileges to run (during development or post-publication).
+(4) No root privileges required to run (during development or post-publication).
(5) Builds its own controlled software \new{with independent environment variables}.
(6) Can run locally (without an internet connection).
(7) Contains the full project's analysis, visualization \emph{and} narrative: including instructions to automatically access/download raw inputs, build necessary software, do the analysis, produce final data products \emph{and} final published report with figures \emph{as output}, e.g., PDF or HTML.
-(8) It can run automatically, \new{without} human interaction.
+(8) It can run automatically, without human interaction.
\textbf{Criterion 2: Modularity.}
A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
@@ -270,7 +270,7 @@ A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free sof
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
\new{Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
-In such cases, it is best to immediately convert the data to free formats upon collection, and archive (e.g., on Zenodo) or use the data in free formats.}
+In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data in free formats.}
@@ -283,13 +283,13 @@ In such cases, it is best to immediately convert the data to free formats upon c
\section{Proof of concept: Maneage}
-With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
+With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
-The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
+It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
-The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
+Its earliest implementation was used in \cite{akhlaghi15}, and later evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
Technically, the hardest criterion to implement was the first (completeness); in particular \new{restricting execution requirements to only a minimal Unix-like operating system}.
One solution we considered was GNU Guix and Guix Workflow Language (GWL).
@@ -303,11 +303,11 @@ It is thus mature, actively maintained, highly optimized, efficient in managing
Researchers using free software have also already had some exposure to it \new{(most free research software is built with Make).}
Linking the analysis and narrative (criterion 7) was historically our first design element.
-To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
+To avoid the problems with computational notebooks mentioned above, we adopt a more abstract linkage, providing a more direct and traceable connection.
Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
-The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value `\inlinecode{0.25}' when the PDF output is built.
+The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is automatically generated after the analysis and expands to the value `\inlinecode{0.25}' upon creation of the PDF.
Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
@@ -325,29 +325,27 @@ These are combined at the end to generate precise software \new{acknowledgment}
\fi%
}
for other examples, see \cite{akhlaghi19, infante20}.
-\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
+\new{Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected and reported in the paper (for this paper, in the acknowledgments).
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-All software dependencies are built down to precise versions of every tool, including the shell,\new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
-\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+All software dependencies are built down to precise versions of every tool, including the shell, \new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
+\new{The source code of all the free software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.}
On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
-\new{Finally, some software cannot be built on some CPU architectures, hence by default, the architecture is included in the final built paper automatically (see below).}
\new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment.
Hence the few hours of initial software building is negligible compared to a project's life span.
To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
-The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has instructions on building in Docker.
+The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has thorough instructions on building in Docker.
Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage.
For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.}
The analysis phase of the project however is naturally different from one project to another at a low-level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
-This design is demonstrated with the example of Figure \ref{fig:datalineage} (left).
-It is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}.
+This design is demonstrated with the example of Figure \ref{fig:datalineage} (left), which is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}.
Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
\begin{figure*}[t]
@@ -378,9 +376,15 @@ This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue)
A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
+All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
+Project authors add their modular subMakefiles in between.
+Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file at the bottom of each subMakefile in Figure \ref{fig:datalineage}).
+Other built files (``targets'' in intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
+
\begin{lstlisting}[
label=code:topmake,
- caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}}
+ caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}.\\
+ \new{For the full file, see \href{https://archive.softwareheritage.org/swh:1:cnt:d552dc18749fbb16249b642cd4f8107c1ce8ff68;origin=https://gitlab.com/makhlaghi/maneage-paper.git;visit=swh:1:snp:ee7cc3bb558c4af703e8de53dd590654c8967663;anchor=swh:1:rev:e4f61544facf8a3bd88c8466e7d3d847544c8228;path=/reproduce/analysis/make/top-make.mk}{SoftwareHeritage}}}
]
# Default target/goal of project.
all: paper.pdf
@@ -401,12 +405,7 @@ include $(foreach s,$(makesrc), \
reproduce/analysis/make/$(s).mk)
\end{lstlisting}
-All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
-Project authors add their modular subMakefiles in between.
-Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
-Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
-
-Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}).
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not available in \cite{infante20}).
All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
Where exact reproducibility is not possible \new{(for example, due to parallelization)}, values can be verified by the project authors.
\new{For example see \new{\href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh}} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
@@ -420,9 +419,14 @@ Where exact reproducibility is not possible \new{(for example, due to paralleliz
The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage.
(b) A finished/published project can be revitalized for new technologies by merging with the core branch.
Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
- Briefly, Git is a version control system, allowing a structured backup of project files.
- Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
- The upward arrows at the branch-tops are therefore in the time direction.
+ Briefly, Git is a version control system, allowing a structured backup of project files; for more, see
+ \ifdefined\separatesupplement%
+ supplementary appendices (section on version control)%
+ \else%
+ Appendix \ref{appendix:versioncontrol}%
+ \fi%
+ . Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
+ The upward arrows at the branch-tops are therefore in the direction of time.
}
\end{figure*}
@@ -438,18 +442,23 @@ Furthermore, the configuration files are a prerequisite of the targets that use
If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, are not stored together in a single file, and do not require a unique editor.
-To satisfy the modularity criterion, the analysis steps and narrative are run in their own files (in different languages, thus maximally benefiting from the unique features of each) and the files can be viewed or manipulated with any text editor that the authors prefer.
+\new{In contrast to notebooks like Jupyter, the analysis scripts, configuration parameters and the paper's narrative are therefore not blended into a single file, and do not require a unique editor.
+To satisfy the modularity criterion, the analysis steps and narrative are written and run in their own files (in different languages) and the files can be viewed or manipulated with any text editor that the authors prefer.
The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better-formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.}
To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
-There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}).
+There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
-The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
-These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgments)}.
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown in the acknowledgments).
+These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown in the acknowledgments)}.
+
+Figure \ref{fig:branching} shows how projects can re-import Maneage at a later time (technically: \emph{merge}), thus improving their low-level infrastructure: in (a) authors do the merge during an ongoing project;
+in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
+\new{Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.}
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
\begin{lstlisting}[
label=code:branching,
@@ -471,12 +480,8 @@ $ ./project make # Re-build to see effect.
$ git add -u && git commit # Commit changes.
\end{lstlisting}
-The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
-in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
-\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
-Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
-Finally, the complete project source is usually $\sim100$ kilo-bytes.
+Finally, a snapshot of the complete project source is usually $\sim100$ kilo-bytes.
It can thus easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
@@ -499,8 +504,8 @@ It can thus easily be published or archived in many servers, for example, it can
%% Attempt to generalise the significance.
%% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
-We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool.
-Maneage user-base grew with the support of RDA, underscoring some difficulties for widespread adoption.
+We have shown that it is possible to build workflows satisfying all the proposed criteria.
+Here we comment on our experience in testing them through Maneage and its increasing user-base (thanks to the support of RDA).
Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
@@ -510,16 +515,17 @@ Indeed the fast-evolving tools are primarily targeted at software developers, wh
Scientists, on the other hand, need to focus on their own research fields and need to consider longevity.
Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template or bundle that works immediately out of the box by producing a paper with an example calculation that they just need to start customizing.
It uses mature and time-tested tools to blend version control, the research paper's narrative, software management \emph{and} a robust data management strategy.
-We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
+We have noticed that providing a clear checklist of the initial customizations is much more effective in encouraging mastery of these core analysis tools than having abstract, isolated tutorials on each tool individually.
Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like OSs} (Maneage is actively tested on different GNU/Linux distributions and macOS, and is also being ported to FreeBSD).
This requires maintenance by our core team and consumes time and energy.
However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
They later share their low-level commits on the core branch, thus propagating them to all derived projects.
-\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.}.
-However \new{our focus is on reproducing results (output of software), not the software itself.}
-Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
+\new{Thirdly, Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so our completeness condition does not guarantee bit-wise reproducibility of the software, even when built on the same hardware.
+However, our focus is on reproducing results (output of software), not the software itself.}
+Well-written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU Portability Library, or Gnulib).
+
On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
However, glibc is not installable on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
@@ -533,12 +539,17 @@ Using continuous integration (CI) is one way to precisely identify breaking poin
%This is a long-term goal and would require major changes to academic value systems.
%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
-Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows their many advantages.
+Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows many advantages.
For example, the publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
The completeness criterion implies that algorithms and data selection can be included in the optimizations.
Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis and the resulting narrative \emph{and} the history of that analysis+narrative.
-Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
+Parsers can be written over projects for meta-research and provenance studies, e.g., to generate Research Objects
+\ifdefined\separatesupplement
+(see the supplement appendix).
+\else
+(see Appendix \ref{appendix:researchobject}).
+\fi
Likewise, when a bug is found in a piece of scientific software, affected projects can be detected and the scale of the effect can be measured.
Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
@@ -548,7 +559,7 @@ From the data repository perspective, these criteria can also be useful, e.g., t
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
(3) Software management: with these criteria, each project comes with its unique and complete software management.
It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), hence enabling robust software management, preservation, publishing, and citation.
-For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we have exploited the free-software criterion to distribute the source code of all software used in each project as deliverables.
+For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we distribute the source code of all software used in each project in a tarball, as deliverables.
(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria.
@@ -725,7 +736,7 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
%% \item \citeappendix{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'', and that it is important to empower scientists.
%% \item \citeappendix{burrell18} open software (in particular Python) in heliophysics.
%% \item \citeappendix{allen18} show that many papers do not cite software.
-%% \item \citeappendix{zhang18} explicity say that they won't release their code: ``We opt not to make the code used for the chemical evo-lution modeling publicly available because it is an important asset of the re-searchers’ toolkits''
+%% \item \citeappendix{zhang18} explicitly say that they will not release their code: ``We opt not to make the code used for the chemical evolution modeling publicly available because it is an important asset of the researchers’ toolkits''
%% \item \citeappendix{jones19} make genuine effort at reproducing every number in the paper (using Docker, Conda, and CGAT-core, and Binder), but they can ultimately only release scripts. They claim it is not possible to achieve that level of reproducibility, but here we show it is.
%% \item LSST uses Kubernetes and docker for reproducibility \citeappendix{banek19}.
%% \item Interesting survey/paper on the importance of coding in science \citeappendix{merali10}.