Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  129
1 files changed, 64 insertions, 65 deletions
diff --git a/paper.tex b/paper.tex
index 9e27dbf..aeb4434 100644
--- a/paper.tex
+++ b/paper.tex
@@ -199,106 +199,105 @@ As a consequence, before starting with the technical details it is important to
\section{Principles}
\label{sec:principles}
-The core principle of Maneage is simple: science is defined by its method, not its result.
-\citet{buckheit1995} summarize this nicely by noting that modern scientific papers are merely advertisements of a scholarship, the actual scholarship is the coding behind the analysis that ultimately generated the plots/results.
+The core principle of Maneage is simple: science is defined primarily by its method, not its result.
+As \citet{buckheit1995} describe it, modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the plots/results.
Many solutions have been proposed in the last decades, including (but not limited to)
-1992: \href{https://sep.stanford.edu/doku.php?id=sep:research:reproducible}{RED};
-2003: \href{https://taverna.incubator.apache.org}{Apache Taverna};
-2004: \href{https://www.genepattern.org}{GenePattern};
-2010: \href{https://wings-workflows.org}{WINGS};
+1992: \href{https://sep.stanford.edu/doku.php?id=sep:research:reproducible}{RED},
+2003: \href{https://taverna.incubator.apache.org}{Apache Taverna},
+2004: \href{https://www.genepattern.org}{GenePattern},
+2010: \href{https://wings-workflows.org}{WINGS},
2011: \href{https://www.ipol.im}{Image Processing On Line journal} (IPOL),
\href{https://www.activepapers.org}{Active papers},
\href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE},
-2015: \href{https://sciunit.run}{Sciunit};
-2017: \href{https://falsifiable.us}{Popper};
+2015: \href{https://sciunit.run}{Sciunit},
+2017: \href{https://falsifiable.us}{Popper},
2019: \href{https://wholetale.org}{WholeTale}.
-To highlight the uniqueness of Maneage in this plethora of tools, a more elaborate list of principles are required:
+To facilitate comparison with these tools, the founding principles of Maneage are listed below.
\begin{enumerate}[label={\bf P\arabic*}]
-\item \label{principle:complete}\textbf{Complete:}
+\item \label{principle:complete}\textbf{Completeness:}
A project that is complete, or self-contained,
- (i) does not depend on anything beyond the Portable operating system Interface (POSIX),
- (ii) does not affect the host system,
- (iii) does not require root/administrator privileges,
- (iv) does not need an internet connection (when its inputs are on the file-system), and
- (v) is stored in a format that does not require any software beyond POSIX tools to open, parse or execute.
-
- A complete project can
- (i) automatically access the inputs (see definition \ref{definition:input}),
- (ii) build its necessary software,
- (iii) do the analysis (run the software on the data) and
- (iv) create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML).
- No manual/human interaction is required to run a complete project, as \citet{claerbout1992} put it: ``\emph{a clerk can do it}''.
- Generally, manual intervention in any of the steps above, or an interactive interface, constitutes an incompleteness.
- Lastly, the plain-text format is particularly important because any other storage format will require specialized software \emph{before} the project can be opened.
-
- \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete as they all have many dependencies far beyond POSIX.
- For example, most of the recent ones use Python (the project/workflow, not the analysis), or rely on Jupyter notebooks.
- Such high-level tools have very short lifespans and evolve very fast (e.g., Python 2 code cannot run with Python 3).
- They also have a very complex dependency trees, making them extremely vulnerable and hard to maintain. For example, see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies).
- It is important to remember that the longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency.
-
- Many existing tools therefore do not attempt to store the project as plain text, but pre-built binary blobs (containers or virtual machines) that can rarely be recreated and also have a short lifespan.
- Their recreation is hard because most are built with the package manager of the blob's OS, or Conda.
- Both are highly dependent on the time they are executed: precise versions are rarely stored, and the servers remove old binaries.
- Docker containers are a good example of their short lifespan: Docker only runs on long-term support OSs, not older.
- In GNU/Linux systems, this corresponds to Linux kernel 3.2.x (initially released in 2012) and above.
- As plain-text, besides being extremely low volume ($\sim100$ kilobytes), the project is still human-readable and parsable by any machine, even when it can't be executed.
+ (P1.1) has no dependency beyond the Port\-able Operating System Interface (POSIX),
+ (P1.2) does not affect the host,
+ (P1.3) does not require root (administrator) privileges,
+ (P1.4) builds its software for an independent environment,
+ (P1.5) can be run locally (without an internet connection),
+ (P1.6) contains the full project's analysis, visualization \emph{and} narrative, from access to the raw inputs to the production of the final published format (e.g., PDF or HTML), and
+ (P1.7) requires no manual/human interaction and can run automatically \citep[according to][``\emph{a clerk can do it}'']{claerbout1992}.
+ A consequence of P1.1 is that the project itself must be stored in plain text, requiring no specialized software to open, parse or execute it (a minimal illustrative sketch of such an entry point is given after this list).
+
+ \emph{Comparison with existing:} Except for IPOL, none of the tools above is complete: they all have many dependencies beyond POSIX.
+ For example, most of the recent solutions need Python (for the workflow, not the analysis) or rely on Jupyter notebooks.
+ High-level tools have short lifespans and evolve fast (e.g., Python 2 code cannot run with Python 3).
+ They also have complex dependency trees, making them hard to maintain.
+ For example, see the dependency tree of Matplotlib in \citet[][Figure 1]{alliez19}; it is one of the simpler Jupyter dependencies.
+ The longevity of a workflow is determined by its shortest-lived dependency.
+
+ As a result, the primary storage format of most recent solutions is pre-built binary blobs such as containers or virtual machines.
+ These are large (gigabytes) and expensive to archive; furthermore, their environments are usually set up with generic package managers (e.g., Conda) or the OS's own (e.g., \inlinecode{apt} or \inlinecode{yum}).
+ Because the exact versions of \emph{all} the software are rarely recorded, and the servers remove old binaries, recreating a blob is very hard.
+ Blobs also have a short lifespan (e.g., Docker containers only run on long-term-support OSs; on GNU/Linux systems this corresponds to Linux kernel 3.2.x, released in 2012, or newer).
+ A plain-text project occupies less than one megabyte and remains human-readable and parsable by any machine, even when it cannot be executed.
\item \label{principle:modularity}\textbf{Modularity:}
-A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs having no side-effects.
-In a modular project, communication between the independent modules is explicit, providing optimizations on multiple levels:
-1) Execution: independent modules can run in parallel, or modules that do not need to be run (because their dependencies have not changed) will not be re-done.
-2) Data provenance extraction (recording any dataset's origins).
-3) Citation: allowing others to credit specific parts of a project.
-4) Usage in other projects.
-
-\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools leave such design choices to the experience of project authors.
-However, designing a modular project needs to be encouraged and facilitated, otherwise scientists (who are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy.
+A project should be compartmentalized into independent modules with well-defined inputs/outputs and no side effects.
+Communication between the independent modules should be explicit, providing several optimizations:
+(1) Execution: independent modules can run in parallel, and modules whose dependencies have not changed are not re-run (a minimal Make-based sketch is given after this list).
+(2) Data provenance extraction (recording any dataset's origins).
+(3) Citation: others can credit specific parts of a project.
+(4) Usage in other projects.
+(5) Most importantly, modules are easy to debug and improve.
+
+\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice to project authors.
+However, designing a modular project needs to be encouraged and facilitated.
+Otherwise, scientists, who are not usually trained in data management, will rarely design modular projects, leading to great inefficiencies in project cost and/or scientific accuracy (testing and validation become expensive).
\item \label{principle:complexity}\textbf{Minimal complexity:}
- This principle is essentially Ockham's razor: ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, but extrapolated to project management:
- 1) avoid complex relations between analysis steps (which is related to the principle of modularity in \ref{principle:modularity}).
- 2) avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and significant resources are required to translate or rewrite it every few years (to stay in vogue).
- The same job can be done with more stable/basic tools, and less effort in the long run.
+ This is Ockham's razor, ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, extrapolated to project management:
+ 1) avoid complex relations between analysis steps (related to \ref{principle:modularity});
+ 2) avoid the programming language that is currently in vogue, because it will soon fall out of fashion and require significant resources to translate or rewrite every few years (just to stay fashionable).
+ The same job can be done with more stable/basic tools, requiring less long-term effort.
- \emph{Comparison with existing:} Most of the existing solutions above use tools that are most popular at their creation epoch. For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
+ \emph{Comparison with existing:} IPOL stands out here too (requiring only ISO C); however, most existing solutions use the tools that were most popular at their creation date.
+ For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
-The project should contain automatic verification checks on its inputs (software source code and data) and outputs.
-When applied, expert knowledge will not be necessary to confirm the correct reproduction.
+The project should automatically verify its inputs (software source code and data) \emph{and} outputs.
+No expert knowledge should then be needed to confirm a correct reproduction (a minimal checksum-based sketch is given after this list).
-\emph{Comparison with existing:} Such verification is usually possible in most systems, but adding this is usually the responsibility of the user alone.
-Automatic verification of inputs is commonly implemented, but the outputs are much more rarely verified.
+\emph{Comparison with existing:} Such verification is possible in most systems, but is usually left to the project authors.
+As with \ref{principle:modularity}, due to lack of training, this must be actively encouraged and facilitated; otherwise, most scientists will not implement it.
\item \label{principle:history}\textbf{History and temporal provenance:}
No project is done in a single/first attempt.
Projects evolve as they are being completed.
It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
-This is often seen in scientific papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it showed to have better precision [or less bias, or etc]}''.
+This is often seen in exploratory research papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
A project's ``history'' is thus as scientifically relevant as the final, or published version.
-\emph{Comparison with existing:} The systems above that are implemented with version control usually support this principle.
+\emph{Comparison with existing:} The solutions above that implement version control usually support this principle.
However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete.
-IPOL fails here because only the final snapshot is published.
+IPOL fails here, because only the final snapshot is published.
-\item \label{principle:scalable}\textbf{Scalable:}
+\item \label{principle:scalable}\textbf{Scalability:}
A project should be scalable to arbitrarily large and/or complex projects.
\emph{Comparison with existing:}
Most of the more recent solutions above are scalable.
-However, IPOL, which uniquely stands out in satisfying most principles also fails here: IPOL is devoted to low-level image processing algorithms that \emph{can be} done with no dependencies beyond an ISO C compiler (even available on Microsoft Windows).
-Its solution is thus not scalable to large projects which commonly involve tens of high-level dependencies, with complex data formats and analysis.
+However, IPOL, which uniquely stands out in satisfying most principles, fails here: IPOL is devoted to low-level image processing algorithms that \emph{can be} implemented with no dependencies beyond an ISO C compiler.
+IPOL is thus not scalable to large projects, which commonly involve dozens of high-level dependencies, with complex data formats and analysis.
\item \label{principle:freesoftware}\textbf{Free and open source software:}
- Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
- This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and to industry:
- 1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source.
- 2) A free software package that may not execute on a particular piece hardware can be modified to work on it.
- 3) A non-free software project typically cannot be distributed by others, making the whole community reliant only on the owner's server (even if the proprietary software does not ask for payments).
+ Technically, reproducibility (see \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
+ This principle is thus necessary to complement the definition of reproducibility, for the following reasons (critical to both science and industry):
+ (1) When the project itself is free software, others can learn from it and build upon it.
+ (2) The lineage can be traced down to the algorithms implemented in the software's source, also enabling optimizations at that level.
+ (3) A free software package that does not execute on a particular piece of hardware can be modified to work on it.
+ (4) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payment).
\emph{Comparison with existing:} The existing solutions listed above are all free software.
- There are non-free solutions, but we do not consider them here because of this principle.
+ Based on this principle, we do not consider non-free solutions.
\end{enumerate}
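+
+To make the completeness principle (\ref{principle:complete}) more concrete, the following is a purely illustrative sketch, \emph{not} Maneage's actual implementation: a plain-text, top-level Makefile whose single \inlinecode{make} call drives the whole project, with hypothetical file names (\inlinecode{reproduce/build-software.sh}, \inlinecode{reproduce/analysis.sh}).
+
+\begin{verbatim}
+# Hypothetical top-level Makefile (illustrative only): one `make'
+# call runs everything, from the software build to the final PDF,
+# with no human interaction ("a clerk can do it").
+# Recipe lines below must begin with a literal tab.
+.POSIX:
+all: paper.pdf
+
+# Build the project's own software environment (P1.2-P1.4).
+.build/software.stamp:
+	mkdir -p .build && sh reproduce/build-software.sh && touch $@
+
+# Run the analysis on locally stored inputs (P1.5).
+.build/result.txt: .build/software.stamp reproduce/analysis.sh
+	sh reproduce/analysis.sh > $@
+
+# Create the final narrative in its published format (P1.6).
+paper.pdf: paper.tex .build/result.txt
+	pdflatex paper.tex
+\end{verbatim}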
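+
+The modularity principle (\ref{principle:modularity}) maps naturally onto Make's declarative dependencies.
+The following hypothetical two-module sketch (the data, script and output names are again illustrative, not Maneage's) shows how explicit inputs/outputs let Make re-run only what has changed, and run independent modules in parallel (e.g., with \inlinecode{make -j2}).
+
+\begin{verbatim}
+# Two independent modules with explicit inputs/outputs and no side
+# effects: each rule only reads its prerequisites, writes its target.
+.POSIX:
+all: out/table.txt out/histogram.txt
+
+# Module A: re-run only when its input data or script changes.
+out/table.txt: data/raw.csv scripts/make-table.sh
+	mkdir -p out && sh scripts/make-table.sh data/raw.csv > $@
+
+# Module B: independent of module A, so both can run in parallel.
+out/histogram.txt: data/raw.csv scripts/histogram.sh
+	mkdir -p out && sh scripts/histogram.sh data/raw.csv > $@
+\end{verbatim}
+
+If only \inlinecode{scripts/histogram.sh} is later edited, a new \inlinecode{make} call rebuilds \inlinecode{out/histogram.txt} alone; this is also what makes debugging, crediting or re-using an individual module straightforward.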
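+
+Finally, for the verification principle (\ref{principle:verify}), a minimal sketch (again hypothetical; Maneage's actual mechanism may differ) is to record the expected checksums of inputs and outputs in a plain-text file kept under version control, and to let a rule compare them automatically, for example with GNU \inlinecode{sha256sum} (POSIX itself only provides the weaker \inlinecode{cksum}).
+
+\begin{verbatim}
+# Hypothetical verification rule: `make verify' fails loudly if any
+# listed input or output differs from its recorded checksum, so no
+# expert judgement is needed to confirm a reproduction.
+verify: out/table.txt out/histogram.txt
+	sha256sum -c reproduce/checksums.txt
+\end{verbatim}
+
+Here \inlinecode{reproduce/checksums.txt} simply lists one expected SHA-256 digest and file name per line, so a reproduction is confirmed by the exit status of \inlinecode{make verify} alone.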