author: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-04-18 00:51:20 +0100
committer: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-04-18 00:51:20 +0100
commit: c003a2dfc2fdea38d39a5555a8ee16bff7cb6b62
tree: b6ba75e98d5bfddefb4a79a3cd2d04adc3862153
parent: 9ed594ec27405e38548d6efd6fe28a4dabf0fb41
Edits in the text to make it shorter and fix a few mistakes
A few minor issues were found and fixed in the text. I also tried to shorten it a little further.
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 34
1 files changed, 14 insertions, 20 deletions
diff --git a/paper.tex b/paper.tex
index fe141f2..240f1e8 100644
--- a/paper.tex
+++ b/paper.tex
@@ -219,7 +219,7 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora
A complete project can
(i) automatically access the inputs (see definition \ref{definition:input}),
- (ii) build its necessary software (instructions on configuring, building and installing those software in a fixed environment),
+ (ii) build its necessary software,
(iii) do the analysis (run the software on the data) and
(iv) create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML).
No manual/human interaction is required to run a complete project, as \citet{claerbout1992} put it: ``\emph{a clerk can do it}''.
@@ -227,7 +227,7 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora
Lastly, the plain-text format is particularly important because any other storage format will require specialized software \emph{before} the project can be opened.
\emph{Comparison with existing:} Except for IPOL, none of the tools above are complete as they all have many dependencies far beyond POSIX.
- For example, in the more recent ones are written in Python (the project/workflow, not the analysis), or rely on Jupyter notebooks.
+ For example, most of the recent ones use Python (the project/workflow, not the analysis), or rely on Jupyter notebooks.
Such high-level tools have very short lifespans and evolve very fast (e.g., Python 2 code cannot run with Python 3).
They also have very complex dependency trees, making them extremely vulnerable and hard to maintain. For example, see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies).
It is important to remember that the longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency.
@@ -237,7 +237,6 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora
Both are highly dependent on the time they are executed: precise versions are rarely stored, and the servers remove old binaries.
Docker containers are a good example of their short lifespan: Docker only runs on long-term support OSs, not older.
In GNU/Linux systems, this corresponds to Linux kernel 3.2.x (initially released in 2012) and above.
- The current Docker images made today may not be usable in a similar time frame in the future.
As plain-text, besides being extremely low volume ($\sim100$ kilobytes), the project is still human-readable and parseable by any machine, even when it can't be executed.
\item \label{principle:modularity}\textbf{Modularity:}
@@ -299,14 +298,13 @@ IPOL, which uniquely stands out in other principles, fails here: only the final
\section{Maneage}
\label{sec:maneage}
-Maneage is an implementation of the principles of Section \ref{sec:principles}: it is complete (\ref{principle:complete}), modular (\ref{principle:modularity}), has minimal complexity (\ref{principle:complexity}), verifies its inputs \& outputs (\ref{principle:verify}), preserves temporal provenance (\ref{principle:history}) and finally, it is free software (\ref{principle:freesoftware}).
+Maneage is an implementation of the principles of Section \ref{sec:principles}.
In practice, Maneage is a collection of plain-text files that are distributed in pre-defined sub-directories by context (a modular source), and are all under version-control, currently with Git.
The main Maneage Branch is a fully-working skeleton of a project without much flesh: it contains all the low-level infrastructure, but without any actual high-level analysis operations.
Maneage contains a file called \inlinecode{README-hacking.md} (the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself) that has a complete checklist of steps to start a new project and remove demonstration parts.
There are also hands-on tutorials to help new users.
-To start a new project, the authors \emph{clone} Maneage, create a branch, and start their project by customizing that branch as shown below.
-In Git, ``cloning'' imports the project's files and history from a repository to local system.
+To start a new project, the authors \emph{clone} Maneage, create a branch, and start their project by customizing it, following good practice from the start as opposed to forcing a good data management strategy in the end; \citet{fineberg19} also note the importance of this.
Customization is done by adding the names of the necessary software, references to input data, analysis and visualization commands and a narrative description.
This will usually be done in multiple commits in the project's duration (perhaps multiple years), thus preserving the project's history: the causes of all choices, the authors and times of each change, failed tests, etc.
@@ -371,7 +369,7 @@ It is a higher-level structure enabling modular/atomic scripts (in any language)
The formal connection of targets with prerequisites it provides enables the creation of an optimized workflow.
Besides formalizing data lineage, Make also greatly encourages experimentation in a project because a recipe is executed only when at least one prerequisite is more recent than its target.
-Therefore, when only $5\%$ of a project's targets are affected by a change, only they will be recreated, the other $95\%$ remaining dormant.
+Therefore, when only $5\%$ of a project's targets are affected by a change, the other $95\%$ remain dormant.
Furthermore, Make first examines the full lineage before starting the execution of recipes, and
it can thus execute independent rules in parallel, further improving the speed and encouraging experimentation.
@@ -407,9 +405,9 @@ However, such binary blobs are optional outputs of Maneage, they are not its pri
To compile the necessary software from source Maneage currently needs the host to have a C and C++ compiler (available on any POSIX-compliant OS).
They will be used by Maneage to build and install (in the build directory) all necessary software and their dependencies with fixed versions.
The dependency tree goes all the way down to core operating system components like GNU Bash, GNU AWK, GNU Coreutils on all supported operating systems (including macOS, not just GNU/Linux).
-For example, the full list of installed software for this paper is automatically available in the Acknowledgments section of this paper.
On GNU/Linux OSs, a fixed version of the GNU Binutils and GNU C Compiler (GCC) is also built from source, and a fixed version of the GNU C Library will soon be added to be fully independent of the host on such systems (task 15390).
In effect, except for the Kernel, Maneage builds all other necessary components of the OS.
+For example, see this paper's Acknowledgments section for all the software that were built for this paper.
Before building the software, their source codes will be validated by their SHA-512 checksum (which is already stored in the project).
Maneage includes a large collection of scientific software (and their dependencies) that are usually not necessary in all projects.
@@ -699,6 +697,8 @@ The primordial implementation was written for \citet{akhlaghi15}.
It later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi: \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects.
That template later matured into Maneage by including the installation of all necessary software from source and it was used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
+Bugs will still be found and Maneage will continue to evolve after this paper is published.
+A list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file.
As a Git repository, a Maneage project can benefit from the wonderful archival and citation features of Software Heritage \citep{dicosmo18}, enabling easy citation of precise parts of other projects, at various points in their history.
Once Maneage is adopted on a wide scale in a special topic, it will be possible to feed the resulting projects into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result.
@@ -706,20 +706,14 @@ Because Maneage is complete and also includes the project's history, even inputs
Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies.
Maneage was awarded a Research Data Alliance (RDA) adoption grant adhering to the recommendations of the publishing data workflows working group \citep{austin17}.
-Its user base (and thus its development) grew phenomenally afterwards.
-However, bugs will still be found and its core architecture will continue to evolve after the publication of this paper.
-A list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file.
-
-Based on feedback from early adopters, we have seen the following caveats for Maneage.
-The first caveat is regarding its widespread adoption: by principle, Maneage uses very low-level tools that are not commonly used by scientists like Git, \LaTeX, Make and the command-line.
-We have discovered that this is because they have mainly not been exposed to them as useful components in their research.
-Once the usage of these tools was witnessed in practice, these tools were adopted to follow best practices.
-\citet{fineberg19} also note the importance of projects starting by following good practice, not to force it in the end.
+Its user base, and thus its development, grew phenomenally afterwards and highlighted some caveats.
+The first is that Maneage uses very low-level tools that are (unfortunately) not widely used by scientists, e.g., Git, \LaTeX, Make and the command-line.
+We have discovered that this is primarily because of a lack of exposure.
+Many (in particular early-career researchers) have started mastering these tools as they adopt Maneage, once they witness the tools' advantages for their project.
A second caveat is the fact that Maneage is effectively an almost complete GNU operating system, tailored to each project.
-Maintaining the various packages can consume time for its core developers.
-In Maneage, package management (Section \ref{sec:projectconfigure}) is in the same language as the analysis, therefore some users are already adding their necessary software in it and submitted them to the core Maneage branch, thus propagating the improvement to all projects using Maneage.
-With a larger user-base we look forward to an increasing number of such contributors, hence decreasing the burden on our core team.
+Maintaining the various packages is time-consuming for the Maneage maintainers, not for derived projects.
+However, because software installation is also in Make, some users are already adding their necessary software to the core Maneage branch, thus propagating the improvement to all projects using Maneage.
Another caveat that has been raised is that publishing the project's reproducible data lineage immediately after publication may hamper the authors' ability to continue with follow-up papers because others may do it before them.
Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways: