author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-04-18 18:02:53 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-04-18 18:14:09 +0100
commit c7969da4091512f945dddc374eb37f8a210a9246 (patch)
tree 3ba0a1cac8ac0c617c8ebe793291f42f8a1af4c7
parent 063b74c653771735a971f98f69075ca7e8237342 (diff)
Added Scalability as a principle, minor edits/clippings
Someone reading the principles section until now would think that IPOL is an almost perfect solution, and for its use case it certainly is. However, this is only because of the nature of its work: it only focuses on algorithms, not usage/analysis, which cannot be done in raw ISO C. So with this commit, I added a new principle on Scalability and discussed this limitation of IPOL there. To add this new principle without simply lengthening the text, I had to remove/summarize some parts that seemed redundant. In the process, I also removed some of the existing tools (at the start of the principles section) that had several others in the same time frame; I have already mentioned (through the "and many more") that this list is not complete. Also, the list of people to thank in the acknowledgments is now written one name per line to be more easily maintainable: Boud and Mohammad-reza were added, and given that I have sent the paper to several other people for feedback, I expect the list to get longer.
-rw-r--r-- paper.tex 58
1 files changed, 38 insertions, 20 deletions
diff --git a/paper.tex b/paper.tex
index 299ecd7..ae180f1 100644
--- a/paper.tex
+++ b/paper.tex
@@ -51,7 +51,7 @@
{\noindent\mpregular
The era of big data has ushered in an era of big responsibility.
In the absence of reproducibility, as a test on understanding data lineage, the result can be the subject of perpetual debate.
- To address this problem, we introduce Maneage (management + lineage) which is founded on the principles of completeness (e.g., no dependency beyond a POSIX-compatible operating system, no administrator privileges, and no network connection), modular and straightforward design, temporal lineage and free software.
+ To address this problem, we introduce Maneage (management + lineage) which is founded on the principles of completeness (e.g., no dependency beyond a POSIX-compatible operating system, no administrator privileges, and no network connection), modular and straightforward design, temporal provenance, scalability, and free software.
A project using Maneage is fully stored in machine\--action\-able, and human\--read\-able plain-text format, facilitating version-control, publication, archival, and automatic parsing to extract data provenance.
The provided lineage is not limited to high-level processing, but also includes building the necessary software from source with fixed versions and build configurations.
Additionally, a project's final visualizations and narrative report are also included, establishing direct links between the analysis and the narrative or visualizations, to the precision of a word within a sentence or a point in a plot.
@@ -200,10 +200,8 @@ As a consequence, before starting with the technical details it is important to
\label{sec:principles}
The core principle of Maneage is simple: science is defined by its method, not its result.
-\citet{buckheit1995} summarize this nicely by noting that modern scientific papers (narrative combined with plots, tables and figures) are merely advertisements of a scholarship, the actual scholarship is the scripts and software usage that went into doing the analysis.
-
-Maneage is not the first attempted solution to this fundamental problem.
-Various solutions have been proposed since the early 1990s, for example RED (1992), Apache Taverna (2003), Madagascar (2003), GenePattern (2004), Kepler (2005), VisTrails (2005), Galaxy (2010), WINGS (2010), Image Processing On Line journal (IPOL, 2011), Active papers (2011), Collage Authoring Environment (2011), SHARE (2011), Verifiable Computational Result (2011), SOLE (2012), Sumatra (2012), Sciunit (2015), Binder (2017), Popper (2017), WholeTale (2019), and many more.
+\citet{buckheit1995} summarize this nicely by noting that modern scientific papers are merely advertisements of scholarship; the actual scholarship is the coding behind the analysis that ultimately generated the plots/results.
+Various solutions have been proposed for this since the early 1990s, for example RED (1992), Apache Taverna (2003), GenePattern (2004), Galaxy (2010), WINGS (2010), Image Processing On Line journal (IPOL, 2011), Active papers (2011), SHARE (2011), Verifiable Computational Result (2011), SOLE (2012), Sciunit (2015), Binder (2017), Popper (2017), WholeTale (2019), and many more.
To highlight the uniqueness of Maneage in this plethora of tools, a more elaborate list of principles is required:
\begin{enumerate}[label={\bf P\arabic*}]
@@ -254,13 +252,13 @@ However, designing a modular project needs to be encouraged and facilitated, oth
2) avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and significant resources are required to translate or rewrite it every few years (to stay in vogue).
The same job can be done with more stable/basic tools, and less effort in the long run.
- \emph{Comparison with existing:} Most of the existing tools use the language/framework that is most popular at the epoch at the time of the tools' created. For example, as we approach the present, successively larger fractions of tools are written in Python.
+ \emph{Comparison with existing:} Most of the existing solutions above use tools that are most popular at their creation epoch. For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
The project should contain automatic verification checks on its inputs (software source code and data) and outputs.
When applied, expert knowledge will not be necessary to confirm the correct reproduction.
-\emph{Comparison with existing:} Such verification is usually possible in most systems, but maintenance is the responsibility of the user alone.
+\emph{Comparison with existing:} Such verification is possible in most systems, but adding it is usually the responsibility of the user alone.
Automatic verification of inputs is commonly implemented, but the outputs are much more rarely verified.
\item \label{principle:history}\textbf{History and temporal provenance:}
@@ -271,18 +269,26 @@ This is often seen in scientific papers, with statements like ``\emph{we [first]
A project's ``history'' is thus as scientifically relevant as the final, or published version.
\emph{Comparison with existing:} The systems above that are implemented with version control usually support this principle.
-However, because the systems as a whole are rarely complete (as discussed in principle \ref{principle:complete}), their histories are rarely complete.
-IPOL, which uniquely stands out by satisfying several of the other principles, fails here: only the final snapshot is published.
+However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete.
+IPOL fails here because only the final snapshot is published.
+
+\item \label{principle:scalable}\textbf{Scalable:}
+A project should be scalable to arbitrarily large projects.
+
+\emph{Comparison with existing:}
+Most of the more recent solutions above are scalable.
+However, IPOL, which uniquely stands out in satisfying most principles, also fails here: IPOL is devoted to low-level image processing algorithms that \emph{can be} done with no dependencies beyond an ISO C compiler (even available on Microsoft Windows).
+Its solution is thus not scalable to large projects, which commonly involve tens of high-level dependencies, with complex data formats and analysis.
\item \label{principle:freesoftware}\textbf{Free and open source software:}
Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and to industry:
1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source.
- 2) A free software package that may not execute on a particular piece of future hardware can be modified to work on that hardware.
- 3) A non-free software project typically cannot be distributed by the project, making the whole community reliant only on the proprietary package owner's server (even if the proprietary software does not ask for payments).
+ 2) A free software package that may not execute on a particular piece of hardware can be modified to work on it.
+ 3) A non-free software project typically cannot be distributed by others, making the whole community reliant only on the owner's server (even if the proprietary software does not ask for payments).
\emph{Comparison with existing:} The existing solutions listed above are all free software.
- While there are non-free lineage or workflow solutions, we do not consider them here precisely because of this principle.
+ There are non-free solutions, but we do not consider them here because of this principle.
\end{enumerate}
@@ -292,8 +298,6 @@ IPOL, which uniquely stands out by satisfying several of the other principles, f
-
-
\section{Maneage}
\label{sec:maneage}
Maneage is an implementation of the principles of Section \ref{sec:principles}.
@@ -354,19 +358,18 @@ Section \ref{sec:collaborating} discusses the sharing of a built environment, an
Scripts (in Shell, Python, or any other high-level language) are usually the first solution that comes to mind when non-interactive, or batch, processing is needed.
However, the inherent complexity and non-linearity of progress, as a project evolves, makes it hard to manage scripts.
For example, if $90\%$ of a research project is done and only the newly-added final $10\%$ must be executed, a script will re-do the whole project every time.
-It is possible to manually ignore (with some conditionals) already completed parts, however this only adds to the complexity and will discourage experimentation on an already-completed part of the project.
+It is possible to manually ignore completed parts (with conditionals); however, this only adds to the complexity and will discourage experimentation on an already-completed part of the project.
These problems motivated the creation of Make in the early Unix operating system \citep{feldman79}.
Make continues to be a core component of modern OSs, is actively maintained, and has withstood the test of time.
The Make paradigm starts from the end: the final \emph{target}.
-In Make's syntax, the process is broken into atomic \emph{rules} where each rule has a single \emph{target} file which can depend on any number of \emph{prerequisite} files.
+In Make, the project is broken into atomic \emph{rules} where each rule has a single \emph{target} file which can depend on any number of \emph{prerequisite} files.
To build the target from the prerequisites, each rule also has a \emph{recipe} (an atomic script).
-The plain-text files containing Make rules and their components are called Makefiles.
+The plain-text files containing Make source code are called Makefiles.
Note that Make does not replace scripting languages like the shell, Python or R.
It is a higher-level structure enabling modular/atomic scripts (in any language) to be put into a workflow.
-The formal connection of targets with prerequisites it provides enables the creation of an optimized workflow.
-Besides formalizing data lineage, Make also greatly encourages experimentation in a project because a recipe is executed only when at least one prerequisite is more recent than its target.
+Besides formalizing a project's data lineage, Make also greatly encourages experimentation in a project because a recipe is executed only when at least one prerequisite is more recent than its target.
Therefore, when only $5\%$ of a project's targets are affected by a change, the other $95\%$ remain dormant.
Furthermore, Make first examines the full lineage before starting the execution of recipes, and
it can thus execute independent rules in parallel, further improving the speed and encouraging experimentation.
@@ -755,7 +758,22 @@ With a larger user-base and wider application in scientific (and hopefully indus
%% Acknowledgements
\section*{Acknowledgments}
-The authors wish to thank Alice Allen, Johan Knapen, Ignacio Trujillo, Roland Bacon, Konrad Hinsen, Yahya Sefidbakht, Simon Portegies Zwart, Ryan O'Connor, Pedram Ashofteh Ardakani, Elham Saremi, Zahra Sharbaf and Surena Fatemi for their useful suggestions and feedback on Maneage and this paper.
+The authors wish to thank (sorted alphabetically)
+Alice Allen,
+Pedram Ashofteh Ardakani,
+Roland Bacon,
+Surena Fatemi,
+Konrad Hinsen,
+Mohammad-reza Khellat,
+Johan Knapen,
+Ryan O'Connor,
+Simon Portegies Zwart,
+Boud Roukema,
+Elham Saremi,
+Yahya Sefidbakht,
+Zahra Sharbaf,
+and Ignacio Trujillo
+for their useful suggestions and feedback on Maneage and this paper.
We also thank Julia Aguilar-Cabello for designing the Maneage logo.
During its development, Maneage has been partially funded (in historical order) by the following institutions:
The Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) PhD scholarship to M.A. and its Grant-in-Aid for Scientific Research (21244012, 24253003).