-rw-r--r-- paper.tex 26
1 files changed, 12 insertions, 14 deletions
diff --git a/paper.tex b/paper.tex
index 9456617..1644a20 100644
--- a/paper.tex
+++ b/paper.tex
@@ -124,9 +124,7 @@ creates generational gaps in the scientific community, preventing previous gener
\section{Commonly used tools and their longevity}
Longevity is as important in science as in some fields of industry, but this ideal is not always necessary; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
-To highlight the necessity, some of the most commonly-used tools are reviewed here from this perspective.
-A set of third-party tools that are commonly used in solutions are reviewed here.
-They can be categorized as:
+To highlight the necessity, a sample set of commonly-used tools is reviewed here in the following order:
(1) environment isolators -- virtual machines (VMs) or containers;
(2) package managers (PMs) -- Conda, Nix, or Spack;
(3) job management -- shell scripts, Make, SCons, or CGAT-core;
@@ -176,8 +174,7 @@ However, many data-intensive projects commonly involve dozens of high-level depe
\section{Proposed criteria for longevity}
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}.
-In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible,
-Interoperable, and Reusable (the FAIR principles).
+In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable.
We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
@@ -186,7 +183,7 @@ A project that is complete (self-contained) has the following properties.
(1) No dependency beyond the Portable Operating System Interface: POSIX (a minimal Unix-like environment).
POSIX has been developed by the Austin Group (which includes IEEE) since 1988 and many OSes have complied.
(2) Primarily stored as plain text, not needing specialized software to open, parse, or execute.
-(3) No affect on the host OS libraries, programs or environment.
+(3) No impact on the host OS libraries, programs or environment.
(4) Does not require root privileges to run (during development or post-publication).
(5) Builds its own controlled software for an independent environment.
(6) Can run locally (without an internet connection).
@@ -226,11 +223,11 @@ The derivation ``history'' of a result is thus not any the less valuable as itse
\textbf{Criterion 7: Including narrative, linked to analysis.}
A project is not just its computational analysis.
A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
-A narrative description must is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
-This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
+A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
+This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
\textbf{Criterion 8: Free and open source software:}
-Reproducibility (defined in \cite{fineberg19}) is possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box.
+Reproducibility (defined in \cite{fineberg19}) is not possible with a black box (non-free or non-open-source software); this criterion is therefore necessary because nature is already a black box.
A project that is free software (as formally defined), allows others to learn from, modify, and build upon it.
When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
In contrast, non-free tools typically cannot be distributed or modified by others, making it reliant on a single supplier (even without payments).
@@ -277,12 +274,13 @@ This allows accurate post-publication provenance \emph{and} automatic updates to
Through the latter, manual updates by authors are by-passed, which are prone to errors, thus discouraging improvements after writing the first draft.
Acting as a link, the macro files build the core skeleton of Maneage.
-For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
+For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here because of the strict word limit.
+Furthermore, machine-related specifications including hardware name and byte-order are also collected and cited, as a reference point if they were needed for \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and ofcourse, the high-level science software.
+All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) and of course, the high-level science software.
On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
-Currently {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
+Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
Temporary relocation of a built project, without building from source, can be done by building the project in a container or VM (\inlinecode{README.md} has recommendations on building a \inlinecode{Dockerfile}).
The analysis phase of the project however is naturally different from one project to another at a low-level.
@@ -359,8 +357,8 @@ Where exact reproducibility is not possible, values can be verified by any stati
(b) A finished/published project can be revitalized for new technologies by merging with the core branch.
Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
Briefly, Git is a version control system, allowing a structured backup of project files.
- Each Git ``commit'' effectively contains a copy all the project's files at the moment it was made.
- The upward arrows at the branch-tops are therefore in the direction of time.
+ Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
+ The upward arrows at the branch-tops are therefore in the time direction.
}
\end{figure*}