Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 64
1 files changed, 31 insertions, 33 deletions
diff --git a/paper.tex b/paper.tex
index 90efacd..dfb79e8 100644
--- a/paper.tex
+++ b/paper.tex
@@ -220,7 +220,7 @@ To help in the comparison, the founding principles of Maneage are listed below.
\begin{enumerate}[label={\bf P\arabic*}]
\item \label{principle:complete}\textbf{Completeness:}
A project that is complete, or self-contained,
- (P1.1) has no dependency beyond the Port\-able operating system Interface (POSIX).
+ (P1.1) has no dependency beyond the Port\-able Operating System (OS) Interface, or POSIX.
(P1.2) does not affect the host,
(P1.3) does not require root, or administrator, privileges,
(P1.4) builds its software for an independent environment,
@@ -318,9 +318,10 @@ The main Maneage Branch is a fully working skeleton of a project without much fl
Maneage contains a file called \inlinecode{README-hacking.md} (the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself) that has a complete checklist of steps to start a new project and remove demonstration parts.
There are also hands-on tutorials to help new users.
-To start a new project, the authors \emph{clone} Maneage, create a branch, and start their project by customizing it. Thus, the start their project with a good data management strategy rather than try to impose it later, as recommended by \citet{fineberg19}.
-Customization is done by adding the names of the necessary software, references to input data, analysis and visualization commands and a narrative description.
-This will usually be done in multiple commits during the project, preserving the project's history: the descriptions of and motivations for changes or test failures and successes, and the authors and timestamps of each change.
+To start a new project, the authors \emph{clone} Maneage, create a branch, and start their project by customizing it.
+Thus, projects start with a good data management strategy rather than imposing it at the end, as recommended by \citet{fineberg19}.
+Customization is done by adding the names of the necessary software, references to input data, analysis and visualization commands and writing a narrative description.
+This will be done in multiple commits during the project (perhaps years), preserving the project's history: the descriptions of, and motivations for, changes or test failures and successes, as well as the authors and timestamps of each change.
\begin{lstlisting}[language=bash]
git clone https://gitlab.com/maneage/project.git # Clone Maneage, default branch `maneage'.
@@ -329,10 +330,6 @@ This will usually be done in multiple commits during the project, preserving the
git checkout -b master # Make new `master' branch, start customizing.
\end{lstlisting}
-Figure \ref{fig:files} shows the directory structure of the cloned project and typical files.
-The top-level source has only very high-level components: the \inlinecode{project} shell script (POSIX-compliant) that is the main interface to the project, as well as the paper's \LaTeX{} source, documentation and a copyright statement.
-Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} files) and \inlinecode{reproduce/} (containing all other parts of the project).
-
\begin{figure}[t]
\begin{center}
\includetikz{figure-file-architecture}
@@ -349,29 +346,29 @@ Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} fil
}
\end{figure}
-The \inlinecode{project} script is a high-level wrapper.
-It has two main phases: (1) configuration, where the necessary software is built and the environment is set up, and (2) analysis, where data are accessed and the software is run on these to create visualizations and the final report.
-These steps are run with two commands:
+Figure \ref{fig:files} shows the directory structure of the cloned project and typical files.
+The top-level source has only very high-level components: the \inlinecode{project} shell script (POSIX-compliant) that is the main interface to the project, as well as the paper's \LaTeX{} source, documentation and a copyright statement.
+Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} files) and \inlinecode{reproduce/} (containing all other parts of the project).
+Maneage has two main phases: (1) configuration, where the necessary software is built and the environment is set up, and (2) analysis, where data are accessed and the software is run to create the final visualizations and report:
\begin{lstlisting}[language=bash]
./project configure # Build all necessary software from source.
./project make # Do the analysis (download data, run software on data, build PDF).
\end{lstlisting}
-The implementation of Maneage has the following aspects.
Section \ref{sec:usingmake} elaborates why Make was chosen as the main job manager.
-Sections \ref{sec:projectconfigure} \& \ref{sec:projectanalysis} discuss the operations done during the configuration and analysis phase.
+Sections \ref{sec:projectconfigure} \& \ref{sec:projectanalysis} are on the operations done during the configuration and analysis phases.
The benefit from version control is described in Section \ref{sec:projectgit}.
-Section \ref{sec:collaborating} discusses the sharing of a built environment, and in Section \ref{sec:publishing} the publication/archival of Maneage projects is discussed.
+Section \ref{sec:collaborating} discusses the sharing of a built environment, and finally, Section \ref{sec:publishing} is about the publication, or archival, of Maneage projects.
\subsection{Job orchestration with Make}
\label{sec:usingmake}
-Scripts (e.g. shell, Python, Perl) are an obvious solution for non-interactive (batch) processing.
+Scripts (e.g. shell, Perl, or Python) are an obvious solution for non-interactive (batch) processing.
However, the inherent complexity and non-linearity of progress as a project evolves make it hard to manage scripts.
For example, if $90\%$ of a research project is done and only the final $10\%$ must be executed, a script will re-do the whole project.
-Completed parts can be manually ignored (with conditionals). However, this adds to the complexity and discourages experimentation on already completed parts of the project.
-These problems motivated the creation of Make in the early Unix operating system \citep{feldman79}.
+Completed parts can be manually ignored (with conditionals), but this adds to the complexity and discourages experimentation on already completed parts.
+These problems motivated the creation of Make in the early Unix OS \citep{feldman79}.
Make continues to be a core component of modern OSs, is actively maintained, and has withstood the test of time.
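
As a rough illustration of the bookkeeping that Make automates, the following shell sketch decides whether a target must be rebuilt by comparing file modification times.
The function name \inlinecode{needs\_rebuild} and the file names in the usage comment are hypothetical and are not part of Make or Maneage; Make does this comparison internally, for every target in the dependency graph.

```shell
#!/bin/sh
# Hypothetical helper (not part of Make or Maneage): succeed (return 0) when
# the target is missing or is older than at least one of its prerequisites.
needs_rebuild() {
    target=$1
    shift
    # A target that does not exist must always be built.
    [ -e "$target" ] || return 0
    for prereq in "$@"; do
        # '-nt' (newer than) is supported by Bash, Dash and most other shells.
        if [ "$prereq" -nt "$target" ]; then
            return 0
        fi
    done
    # The target is at least as new as every prerequisite: nothing to do.
    return 1
}

# Illustrative usage with hypothetical file names and a placeholder command:
#   if needs_rebuild result.txt input.dat analysis.sh; then
#       sh analysis.sh input.dat > result.txt
#   fi
```

Writing such conditionals by hand for every step of a project is exactly the added complexity described above, which is why a dedicated job manager is preferable.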
The Make paradigm starts from the end: the final \emph{target}.
@@ -387,7 +384,7 @@ Furthermore, Make first examines the full lineage before starting the execution
Make is well known by many outside of software development communities.
For example, geophysics students have easily adopted it for the RED project management tool \citep{schwab2000}.
-Very good feedback on the simplicity of using Make has been received from early adopters of Maneage, especially graduate students and postdocs.
+We also received very good feedback on the simplicity of using Make from early adopters of Maneage, especially graduate students and postdocs.
@@ -406,41 +403,42 @@ Project configuration (building the software environment) is managed by the file
At the start of project configuration, Maneage needs a top-level directory to build itself on the host (software and analysis).
We call this the ``build directory'' and it must not be located inside the source directory (see \ref{principle:modularity}).
No other location on the running OS will be affected by the project, including the source directory.
-Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally: 1) a software tarball directory and 2) an input data directory.
+Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally: 1) software tarball directory and 2) input data directory.
Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} detail the building of the required software and the important issue of software citation.
-A Maneage project can be configured in a container or virtual machine to facilitate moving the project without rebuilding everything from source.
-However, such binary blobs are optional. They are not Maneage's primary storage/archival format.
-
\subsubsection{Verifying and building necessary software from source}
\label{sec:buildsoftware}
To compile the necessary software from source, Maneage currently needs the host to have a C and C++ compiler (available on any POSIX-compliant OS).
-These will be used by Maneage to build and install (in the build directory) all necessary software and their dependencies, all with fixed version identities.
-The dependency tree continues down to core operating system components including GNU Bash, GNU AWK, GNU Coreutils on all supported operating systems (including macOS, not just GNU/Linux).
+Maneage will build and install (in the build directory) all necessary software and their dependencies, all with fixed versions and configurations.
+The dependency tree continues down to core OS components including GNU Bash, GNU AWK, GNU Coreutils on all supported OSs (including macOS, not just GNU/Linux).
On GNU/Linux OSs, a fixed version of the GNU Binutils and GNU C Compiler (GCC) is also built from source, and a fixed version of the GNU C Library will soon be added to be fully independent of the host on such systems (task 15390).
-Except for the Kernel, Maneage builds all other necessary components of the OS.
-For example, see the Acknowledgments section for all the software that were built for this paper.
+
+Except for the Kernel, Maneage thus builds all other OS components necessary for the project.
+A Maneage project can be configured in a container or virtual machine to facilitate moving the project without rebuilding everything from source.
+However, such binary blobs are not Maneage's primary storage/archival format.
Before building the software, their source codes are validated by their SHA-512 checksum (stored in the project).
-Maneage includes a large collection of scientific software (and its dependencies), much of which is superfluous.
+Maneage includes a large collection of scientific software (and its dependencies), much of which is superfluous for one project.
Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory (Fig.~\ref{fig:files}).
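
The checksum validation mentioned above can be sketched in a few lines of shell.
This is only a minimal illustration, assuming that the GNU Coreutils program \inlinecode{sha512sum} is available; the function name \inlinecode{verify\_sha512} and the file name in the usage comment are hypothetical, and Maneage's own implementation may differ.

```shell
#!/bin/sh
# Hypothetical helper (not Maneage's implementation): succeed only when the
# SHA-512 checksum of a file matches the expected hexadecimal string.
verify_sha512() {
    file=$1
    expected=$2
    # Reading from standard input makes 'sha512sum' print "<checksum>  -",
    # so the first field is the checksum regardless of the file's name.
    actual=$(sha512sum < "$file" | awk '{print $1}')
    # An empty result means the file could not be read: treat it as a failure.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Illustrative usage with a hypothetical tarball name and stored checksum:
#   verify_sha512 software-1.0.tar.gz "$stored_checksum" || exit 1
```

Because the expected checksum is kept with the project, a tarball that was corrupted or modified after the project was written is rejected before anything is built.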
\subsubsection{Software citation}
\label{sec:softwarecitation}
-Maneage contains the full list of software that is built, with versions and configuration options, but this information is buried deep into the source.
+Maneage contains the full list of software that was built for the project, but this information is buried deep in the source.
Maneage prints a simplified description of this information in the project's final report, blended into the narrative, as in the Acknowledgments of this paper.
Furthermore, when the software is associated with a published paper, that paper's Bib\TeX{} entry is added to the final report and is duly cited with the software's name and version.
This paper uses basic software without associated scientific papers. For software citation examples, see \citet{akhlaghi19} and \citet{infante20}.
This is particularly important for research software, where citation is critical to justify continued development.
-GNU Parallel \citep{tange18} prints citation information each time it is run, proposing to either cite the paper or provide substantial financial support, and provides a \inlinecode{--citation} option to disable the notice.
+A notable example is GNU Parallel \citep{tange18}, which prints citation information each time it is run, proposing to either cite the paper or support it with 10000 euros.
+It provides a \inlinecode{--citation} option to disable the notice.
+This is justified by an uncomfortably true statement: ``\emph{history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''.
Most software does not resort to such drastic measures. However, proper citation is not only useful practically, it is also an ethical imperative.
Given the increasing role of software in research \citep{clement19}, automatic citation (as presented here) is a step forward.
-The necessity and basic elements of software citation are reviewed by (e.g.) \citet{katz14} and \citet{smith16}. The CodeMeta and Citation file format (CFF) aim to expand software citation beyond Bib\TeX.
-A very robust approach that also includes archival, is Software Heritage \citep{dicosmo18}.
+For a review of the necessity and basic elements of software citation, see \citet{katz14} and \citet{smith16}.
+The CodeMeta and Citation File Format (CFF) aim to expand software citation beyond Bib\TeX, while Software Heritage \citep{dicosmo18} also includes archival and citation abilities.
These will be tested and enabled in Maneage.
@@ -670,7 +668,7 @@ Modern version control systems provide many more capabilities that can be levera
Because the project's source and build directories are separate, it is possible for different users to share a build directory, while working on their own separate project branches during a collaboration.
This is similar to the parallel branch that is later merged in Figure \ref{fig:branching}(a).
-To enable this mode, \inlinecode{./project} script has a special \inlinecode{--group} option which takes the name of a (POSIX) user group in the host operating system.
+To enable this mode, the \inlinecode{./project} script has a special \inlinecode{--group} option which takes the name of a (POSIX) user group in the host OS.
All files built in the build directory are then automatically assigned to this user group, with read and write permissions.
Of course, avoiding conflicts in the build directory while members are working on different branches is up to the team.
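
The effect of group ownership on a shared build directory can be approximated with standard POSIX tools.
The sketch below is only an illustration of the resulting permissions, not how Maneage implements the \inlinecode{--group} option; the function name \inlinecode{share\_with\_group} and the names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not Maneage's implementation): give a POSIX user group
# ownership of a directory tree, with read and write access for its members.
share_with_group() {
    dir=$1
    group=$2
    # Assign every file and sub-directory to the group, then add group read
    # and write permission. The capital 'X' adds execute (search) permission
    # only to directories and to files that are already executable.
    chgrp -R "$group" "$dir" &&
        chmod -R g+rwX "$dir"
}

# Illustrative usage with a hypothetical directory and group name:
#   share_with_group /path/to/build-directory research-team
```

The calling user must be a member of the given group (or have the privileges to assign it) for \inlinecode{chgrp} to succeed.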
@@ -723,7 +721,7 @@ The first is that Maneage uses very low-level tools that are not widely used by
We have discovered that this is primarily because of a lack of exposure.
Many (in particular early career researchers) have started mastering them as they adopt Maneage once they witness their advantages, but it does take time.
-A second caveat is the fact that Maneage is effectively an almost complete GNU operating system, tailored to each project.
+A second caveat is the fact that Maneage is effectively an almost complete GNU OS, tailored to each project.
Maintaining the various packages is time consuming for us (Maneage maintainers).
However, because software installation is also in Make, some users are already adding their necessary software to the core Maneage branch, thus propagating the improvements to all projects using Maneage.
Another caveat that has been raised is that publishing the project's reproducible data lineage immediately after publication may hamper the authors' ability to continue with follow-up papers because others may do it before them.