path: root/paper.tex
author: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-04-19 20:48:20 +0100
committer: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-04-19 20:48:20 +0100
commit: bf6e876a1f8edcc7f7a58712a53161f8d53aa570
tree: 5c067a1f8a1c17e97e419f410e11fbb0c6a6b611 /paper.tex
parent: 22f380a646c23a13d1a7443633bb35e39ce8f111
Further summarized the principles section
Following Boud's great corrections, I was able to further summarize this section, removing roughly 150 more words from it.
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 99
1 files changed, 49 insertions, 50 deletions
diff --git a/paper.tex b/paper.tex
index 0ef9ae9..27b71c8 100644
--- a/paper.tex
+++ b/paper.tex
@@ -200,66 +200,64 @@ As a consequence, before starting with the technical details it is important to
\label{sec:principles}
The core principle of Maneage is simple: science is defined primarily by its method, not its result.
-\citet{buckheit1995} argue that modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the analysis that generated the plots/results.
+As \citet{buckheit1995} describe it, modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the plots/results.
Many solutions have been proposed for this since the early 1990s, including: 1992: \href{https://sep.stanford.edu/doku.php?id=sep:research:reproducible}{RED}; 2003: \href{https://taverna.incubator.apache.org}{Apache Taverna}; 2004: \href{https://www.genepattern.org}{GenePattern}; 2010: \href{https://galaxyproject.org}{Galaxy}, \href{https://wings-workflows.org}{WINGS}; 2011: \href{https://www.ipol.im}{Image Processing On Line journal} (IPOL), \href{https://www.activepapers.org}{Active papers}, \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE}, \href{https://vcr.stanford.edu}{Verifiable Computational Result}; 2012: \href{https://osf.io/ns2m3}{SOLE}; 2015: \href{https://sciunit.run}{Sciunit}; 2017: \href{https://mybinder.org}{Binder}, \href{https://falsifiable.us}{Popper}; 2019: \href{https://wholetale.org}{WholeTale}.
-A detailed list of principles shows how Maneage is unique compared to these other tools:
+To help in the comparison, the founding principles of Maneage are listed below.
\begin{enumerate}[label={\bf P\arabic*}]
\item \label{principle:complete}\textbf{Completeness:}
A project that is complete, or self-contained,
- (i) does not depend on anything beyond the Portable operating system Interface (POSIX),
- (ii) does not affect the host system,
- (iii) does not require root/administrator privileges,
- (iv) does not need an internet connection (its inputs can be stored on the local file system), and
- (v) is stored in a format that only needs POSIX tools to open, parse or execute.
-
- A complete project can
- (i) automatically access the inputs (see definition \ref{definition:input}),
- (ii) build the software it needs,
- (iii) do the analysis (run the software on the data) and
- (iv) create the final narrative report/paper and its visualizations in their final format (e.g., PDF/HTML).
- No manual/human interaction is required to run a complete project (``\emph{a clerk can do it}''; \citet{claerbout1992}).
- A need for manual intervention in any of the steps above, or an interactive interface, constitutes incompleteness.
- Plain-text format is vital because any other storage format will require specialized software \emph{before} the project can be opened.
-
- \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete, as they all have many dependencies far beyond POSIX.
- For example, most recent projects use Python (for project/workflow, not analysis), or rely on Jupyter notebooks.
- Such high-level tools have short lifespans and evolve fast.
- They also have complex dependency trees, making them vulnerable and hard to maintain. For example, see the dependency tree of Matlplotlib (one of the smaller Jupyter dependencies; \citet[][Fig.~1]{alliez19}).
- The longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency.
-
- Many existing tools do not store the project as plain text, but instead provide pre-built binary blobs (containers or virtual machines) that can rarely be re-created; these have a short lifespan.
- Their re-creation is difficult because most are built with the package manager of the blob's OS, or Conda.
- Both are highly dependent on the date of execution: precise versions are rarely stored, and the servers remove old binaries.
- Docker containers are a good example of the short lifespan problem: Docker only runs on long-term support OSs, not older ones.
- In GNU/Linux systems, this corresponds to Linux kernel 3.2.x (initially released in 2012) and above.
- A plain-text project, besides being extremely low volume ($\sim100$ kilobytes), is human-readable and parsable by any machine, even if it can't be executed.
+ (P1.1) has no dependency beyond the Port\-able Operating System Interface (POSIX),
+ (P1.2) does not affect the host,
+ (P1.3) does not require root, or administrator, privileges,
+ (P1.4) builds its software for an independent environment,
+ (P1.5) can be run locally (without internet connection),
+ (P1.6) contains the full project's analysis, visualization \emph{and} narrative, from access to raw inputs to producing final published format (e.g., PDF or HTML),
+ (P1.7) requires no manual/human interaction and can run automatically \citep[according to][``\emph{a clerk can do it}'']{claerbout1992}.
+ A consequence of P1.1 is that the project itself must be stored in plain text, and must not need any specialized software to open, parse or execute.
+
+ \emph{Comparison with existing:} except for IPOL, none of the tools above are complete, because they have many dependencies beyond POSIX.
+ For example, most recent solutions need Python (for the workflow, not the analysis), or rely on Jupyter notebooks.
+ High-level tools have short lifespans and evolve fast (e.g., Python 2 code cannot run with Python 3).
+ They also have complex dependency trees, making them hard to maintain.
+ For example, see the dependency tree of Matplotlib in \citet[][Figure 1]{alliez19}; it is one of the simpler Jupyter dependencies.
+ The longevity of a workflow is determined by its shortest-lived dependency.
+
+ As a result, the primary storage format of most recent solutions is pre-built binary blobs like containers or virtual machines.
+ They are large (gigabytes) and expensive to archive; furthermore, generic package managers (e.g., Conda) or those of the OS (e.g., \inlinecode{apt} or \inlinecode{yum}) are used to set up their environment.
+ Because exact versions of \emph{every software package} are rarely included, and the servers remove old binaries, recreating them is very hard.
+ Blobs also have a short lifespan (e.g., Docker containers only run on long-term support OSs; in GNU/Linux systems, this corresponds to Linux 3.2.x, released in 2012, and above).
+ A plain-text project takes up less than one megabyte, and is human-readable and parsable by any machine, even if it can't be executed.
\item \label{principle:modularity}\textbf{Modularity:}
A project should be compartmentalized into independent modules with well-defined inputs/outputs having no side effects.
Communication between the independent modules should be explicit, providing several optimizations:
-1) Execution: independent modules can run in parallel. Modules that do not need to be run (because their dependencies have not changed) will not be re-run.
-2) Data provenance extraction (recording any dataset's origins).
-3) Citation: others can credit specific parts of a project.
-4) Usage in other projects.
+(1) Independent modules can run in parallel.
+Modules that do not need to be run (because their dependencies have not changed) will not be re-run.
+(2) Data provenance extraction (recording any dataset's origins).
+(3) Citation: others can credit specific parts of a project.
+(4) Usage in other projects.
+(5) Most importantly: independent modules are easy to debug and improve.
-\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice as the responsibility of project authors.
-However, designing a modular project needs to be encouraged and facilitated. Otherwise, scientists, who are not usually trained in data management, will rarely design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy.
+\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice to project authors.
+However, designing a modular project needs to be encouraged and facilitated.
+Otherwise, scientists, who are not usually trained in data management, will rarely design a modular project, leading to great inefficiencies in terms of project cost and/or scientific accuracy (testing/validating will be expensive).
\item \label{principle:complexity}\textbf{Minimal complexity:}
- This principle is Ockham's razor, ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, extrapolated to project management:
- 1) avoid complex relations between analysis steps (this is related to modularity: \ref{principle:modularity}).
+ This is Ockham's razor extrapolated to project management \citep[``\emph{Never posit pluralities without necessity}''][]{schaffer15}:
+ 1) avoid complex relations between analysis steps (related to \ref{principle:modularity}).
2) avoid the programming language that is currently in vogue, because it is going to fall out of fashion soon and require significant resources to translate or rewrite it every few years (to stay fashionable).
- The same job can be done with more stable/basic tools, and less long-term effort.
+ The same job can be done with more stable/basic tools, requiring less long-term effort.
- \emph{Comparison with existing:} Most of the solutions above were created using tools that were at the time the most popular. For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
+ \emph{Comparison with existing:} IPOL stands out here too (requiring only ISO C); however, most existing solutions use tools that were the most popular at their creation date.
+ For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
-The project should contain automatic verification checks on its inputs (software source code and data) and outputs.
-When applied, expert knowledge will not be necessary to confirm the correct reproduction.
+The project should automatically verify its inputs (software source code and data) \emph{and} outputs.
+Expert knowledge is then not needed to confirm a correct reproduction.
-\emph{Comparison with existing:} Such verification is usually possible in most systems, but this is usually the responsibility of the user alone.
-Automatic verification of inputs is commonly implemented, but the outputs are much more rarely verified.
+\emph{Comparison with existing:} Such verification is possible in most systems, but is usually left as the responsibility of the project authors.
+As with \ref{principle:modularity}, because most authors lack the necessary training, this must be actively encouraged and facilitated; otherwise most will not be able to implement it.
\item \label{principle:history}\textbf{History and temporal provenance:}
No project is done in a single/first attempt.
@@ -268,7 +266,7 @@ It is natural that earlier phases of a project are redesigned/optimized only aft
This is often seen in exploratory research papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
A project's ``history'' is thus as scientifically relevant as the final, or published version.
-\emph{Comparison with existing:} The solutions above that are implemented with version control usually support this principle.
+\emph{Comparison with existing:} The solutions above that implement version control usually support this principle.
However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete.
IPOL fails here, because only the final snapshot is published.
@@ -281,14 +279,15 @@ However, IPOL, which uniquely stands out in satisfying most principles, fails he
IPOL is thus not scalable to large projects, which commonly involve dozens of high-level dependencies, with complex data formats and analysis.
\item \label{principle:freesoftware}\textbf{Free and open source software:}
- Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
- This principle is thus necessary to characterize the many advantages that are critical to the sciences and to industry:
- 1) The lineage, and its optimization, can be traced to the internal algorithm in the software's source.
- 2) A free-software package that does not execute on particular hardware can be modified to work on it.
- 3) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments).
+ Technically, reproducibility (see \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
+ This principle is thus necessary to complement that definition with the following points, which are critical to the sciences and to industry:
+ (1) When the project itself is free software, others can learn from and build upon it.
+ (2) The lineage can be traced to the algorithm implemented in a free software package's source, also enabling optimizations at that level.
+ (3) A free-software package that does not execute on particular hardware can be modified to work on it.
+ (4) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments).
\emph{Comparison with existing:} The existing solutions listed above are all free software.
- Based on this principle, we do not consider non-free solutions here.
+ Based on this principle, we do not consider non-free solutions.
\end{enumerate}