aboutsummaryrefslogtreecommitdiff
path: root/paper.tex
diff options
context:
space:
mode:
authorMohammad Akhlaghi <mohammad@akhlaghi.org>2021-04-09 17:13:54 +0100
committerMohammad Akhlaghi <mohammad@akhlaghi.org>2021-04-09 17:31:39 +0100
commit925091ef4efebd03ca519d780cf5a0c2ce9df18b (patch)
tree3b37462f063e74f7b3f0a42853fbe07bfd55777c /paper.tex
parent4ccf01420f36c705302fdcde9a94c72676256665 (diff)
Implemented EiC (Lorena Barba) comments, and added final review
The email notice of the final acceptance of this paper in CiSE has been included in the project and the stylistic points that were raised by the editor in chief (EiC) have also been implemented. The most important points were: - Including citations within the text structure (as if they would be footnotes), so things like "see \cite{...}" should have been changed. - Hyperlinks should be printed as footnotes (because the journal gets actually printed). Also, to avoid the second listing breaking between pages, it has been moved to after the next paragraph.
Diffstat (limited to 'paper.tex')
-rw-r--r--paper.tex96
1 files changed, 47 insertions, 49 deletions
diff --git a/paper.tex b/paper.tex
index 53caf4b..db8f3e3 100644
--- a/paper.tex
+++ b/paper.tex
@@ -74,9 +74,9 @@
%% AIM
A set of criteria is introduced to address this problem:
%% METHOD
- Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+ Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free and open source software.
%% RESULTS
- As a proof of concept, we introduce ``Maneage'' (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that been tested in several research publications.
+ As a proof of concept, we introduce ``Maneage'' (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications.
%% CONCLUSION
We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders.
@@ -126,7 +126,7 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
% by the rest of the first word in caps.
%\IEEEPARstart{F}{irst} word
-Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
+Reproducible research has been discussed in the sciences for at least 30 years\cite{claerbout1992, fineberg19}.
Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day,
starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
@@ -141,7 +141,7 @@ Decades later, scientists are still held accountable for their results and there
\section{Longevity of existing tools}
\label{sec:longevityofexisting}
-Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
+Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis''\cite{fineberg19}.
Longevity is defined as the length of time that a project remains \emph{functional} after its creation.
Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary).
Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data.
@@ -154,21 +154,21 @@ A basic review of the longevity of commonly used tools is provided here (for a m
\fi%
).
-To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019).
+To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, discontinued in 2019).
However, containers (e.g., Docker or Singularity) are currently the most widely-used solution.
We will focus on Docker here because it is currently the most common.
It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.
-Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
+Usually images are imported with operating system (OS) names; e.g., Mesnard \& Barba\cite{mesnard20} use `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
Hence, if the image is built in different months, it will contain different OS components.
-In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8 which \href{https://blog.centos.org/2020/12/future-is-centos-stream}{will terminate} 8 years early), the image will not be available at the expected URL.
+In the year 2024, when this version's long-term support (LTS) expires (if not earlier, like CentOS 8 which will terminate 8 years early\footnote{\inlinecode{\url{https://blog.centos.org/2020/12/future-is-centos-stream}}}), the image will not be available at the expected URL.
-Generally, pre-built binary files (like Docker images) are large and expensive to maintain and archive.
-Because of this, in October 2020 Docker Hub (where many workflows are archived) \href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{announced} that inactive images (more than 6 months) will be deleted in free accounts from mid 2021.
+Generally, pre-built binary files (like Docker images) are large and expensive to maintain, distribute and archive.
+Because of this, in October 2020 Docker Hub (where many workflows are archived) announced\footnote{\inlinecode{\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{https://www.docker.com/blog/docker-hub-image-retention}\\\href{https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}{-policy-delayed-and-subscription-updates}}} a new consumpiton-based payment model.
Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel.
-Hence older Docker images may not be executable (their longevity is determined by the host kernel, typically a decade).
+Hence older Docker images may not be executable: their longevity is determined by OS kernels, typically a decade.
Once the host OS is ready, package managers (PMs) are used to install the software or environment.
Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
@@ -188,8 +188,8 @@ Designing a robust project needs to be encouraged and facilitated because scient
This includes automatic verification, which is possible in many solutions, but is rarely practiced.
Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
-Finally, to blend narrative and analysis, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
-However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+Finally, to blend narrative and analysis, computational notebooks\cite{rule18}, such as Jupyter, are currently gaining popularity.
+However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of Alliez et al.\cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
@@ -197,7 +197,7 @@ By default, cell dependencies are not included (e.g., automatically running some
There are third party add-ons like \inlinecode{sos} or \inlinecode{extension's} (both written in Python) for some of these.
However, since they are not part of the core, a shorter longevity can be assumed.
The core Jupyter framework has few options for project management, especially as the project grows beyond a small test or tutorial.
-Notebooks can therefore rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
+Notebooks can therefore rarely deliver their promised potential\cite{rule18} and may even hamper reproducibility\cite{pimentel19}.
@@ -205,8 +205,8 @@ Notebooks can therefore rarely deliver their promised potential \cite{rule18} an
\section{Proposed criteria for longevity}
\label{criteria}
-The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it just before publication \cite{austin17,fineberg19}.
-In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles\footnote{FAIR originally targeted data. Work is ongoing to adopt it for software through initiatives like FAIR4RS (FAIR for Research Software).}).
+The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is more effective, for researchers and the community, than imposing it just before publication\cite{austin17,fineberg19}.
+In this context, researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles\footnote{FAIR originally targeted data. Work is ongoing to adopt it for software through initiatives like FAIR4RS (FAIR for Research Software).}).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable.
We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers.
@@ -259,14 +259,14 @@ The derivation ``history'' of a result is thus not any the less valuable as itse
\textbf{Criterion 7: Including narrative that is linked to analysis.}
A project is not just its computational analysis.
A raw plot, figure, or table is hardly meaningful alone, even when accompanied by the code that generated it.
-A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, interpretations of the result, and the context in relation to other projects/papers.
+A narrative description is also a deliverable (defined as ``data article''\cite{austin17}): describing the purpose of the computations, interpretations of the result, and the context in relation to other projects/papers.
This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
-\textbf{Criterion 8: Free and open-source software:}
-Non-free or non-open-source software typically cannot be distributed, inspected, or modified by others.
-They are reliant on a single supplier (even without payments) and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}.
-A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, distribute, build upon (modify), and publish their modified versions.
-When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
+\textbf{Criterion 8: Free and open-source software (FOSS):}
+Non-FOSS software typically cannot be distributed, inspected, or modified by others.
+They are thus reliant on a single supplier (even without payments) and prone to \emph{proprietary obsolescence}\footnote{\inlinecode{\url{https://www.gnu.org/proprietary/proprietary-obsolescence.html}}}.
+A project that is \emph{free software} (as formally defined by GNU\footnote{\inlinecode{\url{https://www.gnu.org/philosophy/free-sw.en.html}}}), allows others to run, learn from, distribute, build upon (modify), and publish their modified versions.
+When the software used by the high-level project is also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
Proprietary software may be necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
In such cases, it is best to immediately convert the data to free formats upon collection and safely use or archive the data as free formats.
@@ -282,13 +282,13 @@ In such cases, it is best to immediately convert the data to free formats upon c
\section{Proof of concept: Maneage}
-With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
-Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.
-It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
+With the longevity problems of existing tools outlined above, a proof-of-concept solution is presented here via an implementation that has been tested in published papers\cite{akhlaghi19, infante20}.
+Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} (on galaxy evolution).
+It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researchers' perspective.
-It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
+It is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \inlinecode{\url{https://maneage.org}}.
It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
-Its primordial implementation was used in \cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
+Its primordial implementation was used in Akhlaghi \& Ichikawa\cite{akhlaghi15}, which evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
Technically, the hardest criterion to implement was the first (completeness); in particular restricting execution requirements to only a minimal Unix-like operating system.
One solution we considered was GNU Guix and Guix Workflow Language (GWL).
@@ -298,13 +298,13 @@ Inspired by GWL+Guix, a single job management tool was implemented for both inst
Make is not an analysis language, it is a job manager.
Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C).
Make has been available since 1977, it is still heavily used in almost all components of modern Unix-like OSs and is standardized in POSIX.
-It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
-Researchers using free software have also already had some exposure to it (most free research software are built with Make).
+It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000}.
+Moreover, researchers using FOSS have already had some exposure to Make (most FOSS are built with Make).
Linking the analysis and narrative (criterion 7) was historically our first design element.
To avoid the problems with computational notebooks mentioned above, we adopt a more abstract linkage, providing a more direct and traceable connection.
Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
-For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
+For example, Akhlaghi writes\cite{akhlaghi19} `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is automatically generated after the analysis and expands to the value `\inlinecode{0.25}' upon creation of the PDF.
Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
@@ -317,32 +317,32 @@ Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation.
These are combined at the end to generate precise software acknowledgment and citation that is shown in the
\ifdefined\separatesupplement%
-\href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices},%
+appendices,%
\else%
-appendices (\ref{appendix:software}),%
+appendices (\ref{appendix:software}), %
\fi%
-for other examples, see \cite{akhlaghi19, infante20}.
+other examples are also available\cite{akhlaghi19, infante20}.
Furthermore, the machine-related specifications of the running system (including CPU architecture and byte-order) are also collected to report in the paper (they are reported for this paper in the acknowledgments).
These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.
-The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of Alliez et al.\cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell, important low-level application programs (e.g., GNU Coreutils) and of course, the high-level science software.
-The source code of all the free software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
-Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's web page.
+The source code of all the FOSS software used in Maneage is archived in, and downloaded from, \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+Zenodo promises long-term archival and also provides a persistent identifiers for the files, which are sometimes unavailable at a software package's web page.
-On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
-Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
+On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added\footnote{\inlinecode{\url{http://savannah.nongnu.org/task/?15390}}}.
+Currently, {\TeX}Live is also being added\footnote{\inlinecode{\url{http://savannah.nongnu.org/task/?15267}}}, but that is only for building the final PDF, not affecting the analysis or verification.
Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment.
Hence the few hours of initial software building is negligible compared to a project's life span.
To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
-The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has thorough instructions on building in Docker.
+The \inlinecode{README.md}\footnote{\inlinecode{\label{maneageatswh}\href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{https://archive.softwareheritage.org/browse/origin/directory}\\\href{https://archive.softwareheritage.org/browse/origin/directory/?origin\_url=http://git.maneage.org/project.git}{?origin\_url=http://git.maneage.org/project.git}}} file has thorough instructions on building in Docker.
Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage.
For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.
The analysis phase of the project however is naturally different from one project to another at a low-level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
-This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}.
+This design is demonstrated with the example of Figure \ref{fig:datalineage} (left) which is an enhanced replication of the ``tool'' curve of Figure 1C in Menke et al.\cite{menke20}.
Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
\begin{figure*}[t]
@@ -352,7 +352,7 @@ Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
\end{center}
\vspace{-3mm}
\caption{\label{fig:datalineage}
- Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
+ Left: an enhanced replica of Figure 1C in Menke et al.\cite{menke20}, shown here for demonstrating Maneage.
It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale).
Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
Each colored box is a file in the project and arrows show the operation of various software: linking input file(s) to the output file(s).
@@ -402,7 +402,7 @@ include $(foreach s,$(makesrc), \
reproduce/analysis/make/$(s).mk)
\end{lstlisting}
-Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not available in \cite{infante20}).
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria.
All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
Where exact reproducibility is not possible (for example, due to parallelization), values can be verified by the project authors.
For example see \href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.
@@ -431,7 +431,7 @@ To further minimize complexity, the low-level implementation can be further sepa
By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings, or parameters.
Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make.
For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, thereby enabling exact verification before usage.
-To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+To illustrate this, we report that Menke et al.\cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}.
Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without knowing the underlying low-level implementation.
@@ -445,7 +445,7 @@ The analysis can benefit from the powerful and portable job management features
To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
-The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).
+The core Maneage git repository is hosted at \inlinecode{\href{http://git.maneage.org/project.git}{git.maneage.org/project.git}} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).
Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
There is a thoroughly elaborated customization checklist in \inlinecode{README-hacking.md}.
@@ -458,6 +458,9 @@ in (b) readers do it after publication; e.g., the project remains reproducible b
Generally, any Git flow (branching strategy) can be used by the high-level project authors or future readers.
Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of project curation and maintenance, before \emph{and} after publication.
+Finally, a snapshot of the complete project source is usually $\sim100$ kilo-bytes.
+It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source\cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
+
\begin{lstlisting}[
label=code:branching,
caption={Starting a new project with Maneage, and building it},
@@ -479,11 +482,6 @@ $ git add -u && git commit # Commit changes.
\end{lstlisting}
-Finally, a snapshot of the complete project source is usually $\sim100$ kilo-bytes.
-It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
-
-
-
@@ -558,7 +556,7 @@ Likewise, when a bug is found in one science software, affected projects can be
Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
-From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in \cite{austin17}:
+From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in Austin et al.\cite{austin17}:
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, thereby improving sustainability.
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
(3) Software management: with these criteria, each project comes with its unique and complete software management.