From 49cdb178c5ba2712bf922ab178d0379949381c5f Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 16:17:04 +0200 Subject: Principles intro Word-length reduction (8 words) of the first part of 3 Principles. Change in meaning: we can argue that *results* are not part of science, but science needs aims as well as methods; hypotheses are needed too, but these overlap between the aims and methods. So I put "primarily". --- paper.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/paper.tex b/paper.tex index 13c0aad..b0b3242 100644 --- a/paper.tex +++ b/paper.tex @@ -199,10 +199,10 @@ As a consequence, before starting with the technical details it is important to \section{Principles} \label{sec:principles} -The core principle of Maneage is simple: science is defined by its method, not its result. -\citet{buckheit1995} summarize this nicely by noting that modern scientific papers are merely advertisements of a scholarship, the actual scholarship is the coding behind the analysis that ultimately generated the plots/results. +The core principle of Maneage is simple: science is defined primarily by its method, not its result. +\citet{buckheit1995} argue that modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the analysis that generated the plots/results. 
Many solutions have been proposed for this since the early 1990s, including: 1992: \href{https://sep.stanford.edu/doku.php?id=sep:research:reproducible}{RED}; 2003: \href{https://taverna.incubator.apache.org}{Apache Taverna}; 2004: \href{https://www.genepattern.org}{GenePattern}; 2010: \href{https://galaxyproject.org}{Galaxy}, \href{https://wings-workflows.org}{WINGS}; 2011: \href{https://www.ipol.im}{Image Processing On Line journal} (IPOL), \href{https://www.activepapers.org}{Active papers}, \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE}, \href{https://vcr.stanford.edu}{Verifiable Computational Result}; 2012: \href{https://osf.io/ns2m3}{SOLE}; 2015: \href{https://sciunit.run}{Sciunit}; 2017: \href{https://mybinder.org}{Binder}, \href{https://falsifiable.us}{Popper}; 2019: \href{https://wholetale.org}{WholeTale}. -To highlight the uniqueness of Maneage in this plethora of tools, a more elaborate list of principles are required: +A detailed list of principles shows how Maneage is unique compared to these other tools: \begin{enumerate}[label={\bf P\arabic*}] \item \label{principle:complete}\textbf{Complete:} -- cgit v1.2.1 From 6e97fdde49333084d9cc9185f29c38a43d0b460f Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 16:55:51 +0200 Subject: Principles - P1 - Complete Compression by about 40 words. Updating python2 to python3 is often nothing more than modifying print statements, so removing this doesn't weaken the text by much. Re-creation helps avoid thinking of watching movies, going to the beach, reading a novel, when seeing the word "recreation": https://en.wiktionary.org/wiki/recreation#Usage_notes The matplotlib sentence was not so clear: now it's a bit shorter and hopefully clearer. 
--- paper.tex | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/paper.tex b/paper.tex index b0b3242..7cf43bc 100644 --- a/paper.tex +++ b/paper.tex @@ -210,30 +210,30 @@ A detailed list of principles shows how Maneage is unique compared to these othe (i) does not depend on anything beyond the Portable operating system Interface (POSIX), (ii) does not affect the host system, (iii) does not require root/administrator privileges, - (iv) does not need an internet connection (when its inputs are on the file-system), and - (v) is stored in a format that does not require any software beyond POSIX tools to open, parse or execute. + (iv) does not need an internet connection (its inputs can be stored on the local file system), and + (v) is stored in a format that only needs POSIX tools to open, parse or execute. A complete project can (i) automatically access the inputs (see definition \ref{definition:input}), - (ii) build its necessary software, + (ii) build the software it needs, (iii) do the analysis (run the software on the data) and - (iv) create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML). - No manual/human interaction is required to run a complete project, as \citet{claerbout1992} put it: ``\emph{a clerk can do it}''. - Generally, manual intervention in any of the steps above, or an interactive interface, constitutes an incompleteness. - Lastly, the plain-text format is particularly important because any other storage format will require specialized software \emph{before} the project can be opened. - - \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete as they all have many dependencies far beyond POSIX. - For example, most of the recent ones use Python (the project/workflow, not the analysis), or rely on Jupyter notebooks. 
- Such high-level tools have very short lifespans and evolve very fast (e.g., Python 2 code cannot run with Python 3). - They also have a very complex dependency trees, making them extremely vulnerable and hard to maintain. For example, see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies). - It is important to remember that the longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency. - - Many existing tools therefore do not attempt to store the project as plain text, but pre-built binary blobs (containers or virtual machines) that can rarely be recreated and also have a short lifespan. - Their recreation is hard because most are built with the package manager of the blob's OS, or Conda. - Both are highly dependent on the time they are executed: precise versions are rarely stored, and the servers remove old binaries. - Docker containers are a good example of their short lifespan: Docker only runs on long-term support OSs, not older. + (iv) create the final narrative report/paper and its visualizations in their final format (e.g., PDF/HTML). + No manual/human interaction is required to run a complete project (``\emph{a clerk can do it}''; \citet{claerbout1992}). + A need for manual intervention in any of the steps above, or an interactive interface, constitutes incompleteness. + Plain-text format is vital because any other storage format will require specialized software \emph{before} the project can be opened. + + \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete, as they all have many dependencies far beyond POSIX. + For example, most recent projects use Python (for project/workflow, not analysis), or rely on Jupyter notebooks. + Such high-level tools have short lifespans and evolve fast. + They also have complex dependency trees, making them vulnerable and hard to maintain. 
For example, see the dependency tree of Matlplotlib (one of the smaller Jupyter dependencies; \citet[][Fig.~1]{alliez19}). + The longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency. + + Many existing tools do not store the project as plain text, but instead provide pre-built binary blobs (containers or virtual machines) that can rarely be re-created; these have a short lifespan. + Their re-creation is difficult because most are built with the package manager of the blob's OS, or Conda. + Both are highly dependent on the date of execution: precise versions are rarely stored, and the servers remove old binaries. + Docker containers are a good example of the short lifespan problem: Docker only runs on long-term support OSs, not older ones. In GNU/Linux systems, this corresponds to Linux kernel 3.2.x (initially released in 2012) and above. - As plain-text, besides being extremely low volume ($\sim100$ kilobytes), the project is still human-readable and parsable by any machine, even when it can't be executed. + A plain-text project, besides being extremely low volume ($\sim100$ kilobytes), is human-readable and parsable by any machine, even if it can't be executed. \item \label{principle:modularity}\textbf{Modularity:} A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs having no side-effects. -- cgit v1.2.1 From a1339189dae70488a52b128fa5bd9a61934e199c Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:03:53 +0200 Subject: principles: all nouns For consistency, the principles should either all be nouns, or all be adjectives. Most are nouns, so this commit switches the adjectives to nouns. 
--- paper.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/paper.tex b/paper.tex index 7cf43bc..aeb7404 100644 --- a/paper.tex +++ b/paper.tex @@ -205,7 +205,7 @@ Many solutions have been proposed for this since the early 1990s, including: 199 A detailed list of principles shows how Maneage is unique compared to these other tools: \begin{enumerate}[label={\bf P\arabic*}] -\item \label{principle:complete}\textbf{Complete:} +\item \label{principle:complete}\textbf{Completeness:} A project that is complete, or self-contained, (i) does not depend on anything beyond the Portable operating system Interface (POSIX), (ii) does not affect the host system, @@ -272,7 +272,7 @@ A project's ``history'' is thus as scientifically relevant as the final, or publ However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete. IPOL fails here because only the final snapshot is published. -\item \label{principle:scalable}\textbf{Scalable:} +\item \label{principle:scalable}\textbf{Scalability:} A project should be scalable to arbitrarily large and/or complex projects. \emph{Comparison with existing:} -- cgit v1.2.1 From 13d0a688a100e973f05e2991334903b22f233a01 Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:14:18 +0200 Subject: principles - P2 modularity Minor wording improvements; reduction by 10 words. --- paper.tex | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/paper.tex b/paper.tex index aeb7404..d43aa8c 100644 --- a/paper.tex +++ b/paper.tex @@ -236,15 +236,15 @@ A detailed list of principles shows how Maneage is unique compared to these othe A plain-text project, besides being extremely low volume ($\sim100$ kilobytes), is human-readable and parsable by any machine, even if it can't be executed. 
\item \label{principle:modularity}\textbf{Modularity:} -A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs having no side-effects. -In a modular project, communication between the independent modules is explicit, providing optimizations on multiple levels: -1) Execution: independent modules can run in parallel, or modules that do not need to be run (because their dependencies have not changed) will not be re-done. +A project should be compartmentalized into independent modules with well-defined inputs/outputs having no side effects. +Communication between the independent modules should be explicit, providing several optimizations: +1) Execution: independent modules can run in parallel. Modules that do not need to be run (because their dependencies have not changed) will not be re-run. 2) Data provenance extraction (recording any dataset's origins). -3) Citation: allowing others to credit specific parts of a project. +3) Citation: others can credit specific parts of a project. 4) Usage in other projects. -\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools leave such design choices to the experience of project authors. -However, designing a modular project needs to be encouraged and facilitated, otherwise scientists (who are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy. +\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice as the responsibility of project authors. +However, designing a modular project needs to be encouraged and facilitated. 
Otherwise, scientists, who are not usually trained in data management, will rarely design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy. \item \label{principle:complexity}\textbf{Minimal complexity:} This principle is essentially Ockham's razor: ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, but extrapolated to project management: -- cgit v1.2.1 From 4e9e145afe79faac1cca38e86d5d7ffecdc1c91f Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:20:46 +0200 Subject: Principles - P3 minimal complexity Minor wording changes - reduction by 10 words. --- paper.tex | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/paper.tex b/paper.tex index d43aa8c..b8201b9 100644 --- a/paper.tex +++ b/paper.tex @@ -247,12 +247,12 @@ Communication between the independent modules should be explicit, providing seve However, designing a modular project needs to be encouraged and facilitated. Otherwise, scientists, who are not usually trained in data management, will rarely design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy. \item \label{principle:complexity}\textbf{Minimal complexity:} - This principle is essentially Ockham's razor: ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, but extrapolated to project management: - 1) avoid complex relations between analysis steps (which is related to the principle of modularity in \ref{principle:modularity}). - 2) avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and significant resources are required to translate or rewrite it every few years (to stay in vogue). - The same job can be done with more stable/basic tools, and less effort in the long run. 
+ This principle is Ockham's razor, ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, extrapolated to project management: + 1) avoid complex relations between analysis steps (this is related to modularity: \ref{principle:modularity}). + 2) avoid the programming language that is currently in vogue, because it is going to fall out of fashion soon and require significant resources to translate or rewrite it every few years (to stay fashionable). + The same job can be done with more stable/basic tools, and less long-term effort. - \emph{Comparison with existing:} Most of the existing solutions above use tools that are most popular at their creation epoch. For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}). + \emph{Comparison with existing:} Most of the solutions above were created using tools that were at the time the most popular. For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}). \item \label{principle:verify}\textbf{Verifiable inputs and outputs:} The project should contain automatic verification checks on its inputs (software source code and data) and outputs. -- cgit v1.2.1 From e8f5b6aed32fc7cf41c00216d0f18ef6c8e3a4d6 Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:22:59 +0200 Subject: Principles - P4 verifiable inputs and outputs One superfluous word was removed. --- paper.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/paper.tex b/paper.tex index b8201b9..41a856c 100644 --- a/paper.tex +++ b/paper.tex @@ -258,7 +258,7 @@ However, designing a modular project needs to be encouraged and facilitated. Oth The project should contain automatic verification checks on its inputs (software source code and data) and outputs. 
When applied, expert knowledge will not be necessary to confirm the correct reproduction. -\emph{Comparison with existing:} Such verification is usually possible in most systems, but adding this is usually the responsibility of the user alone. +\emph{Comparison with existing:} Such verification is usually possible in most systems, but this is usually the responsibility of the user alone. Automatic verification of inputs is commonly implemented, but the outputs are much more rarely verified. \item \label{principle:history}\textbf{History and temporal provenance:} -- cgit v1.2.1 From 1d281bffd44fbe3ff43439b3ab3357953f523728 Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:30:14 +0200 Subject: Principles - P5 History and temporal provenance Reduction by 5 words. The term "exploratory research" is intended in the specific sense listed at en.Wikipedia: https://en.wikipedia.org/wiki/Exploratory_research to distinguish it from hypothesis testing. The final phases of clinical (medical) research, for example, to test whether a candidate SARS-CoV-2 vaccine is (i) effective and (ii) safe in homo sapiens, cannot accept the exploratory methods that are acceptable in astronomy, or in other exploratory research (which is acceptable in the early stages of medical research). Clinical trial registration is aimed at *preventing* scientists from modifying their methods in a given project: https://en.wikipedia.org/wiki/Clinical_trial_registration --- paper.tex | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/paper.tex b/paper.tex index 41a856c..b1ebb9d 100644 --- a/paper.tex +++ b/paper.tex @@ -265,12 +265,12 @@ Automatic verification of inputs is commonly implemented, but the outputs are mu No project is done in a single/first attempt. Projects evolve as they are being completed. It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed. 
-This is often seen in scientific papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it showed to have better precision [or less bias, or etc]}''. +This is often seen in exploratory research papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''. A project's ``history'' is thus as scientifically relevant as the final, or published version. -\emph{Comparison with existing:} The systems above that are implemented with version control usually support this principle. +\emph{Comparison with existing:} The solutions above that are implemented with version control usually support this principle. However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete. -IPOL fails here because only the final snapshot is published. +IPOL fails here, because only the final snapshot is published. \item \label{principle:scalable}\textbf{Scalability:} A project should be scalable to arbitrarily large and/or complex projects. -- cgit v1.2.1 From e8eef373e3b96cdd41f6fd03edf8b0b58bfa6ee2 Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:40:37 +0200 Subject: Principles - P6 Scalability Reduction by 7 words. For a regular GNU/Linux or other unix-like system user, the bit about ISO C compilers even existing for Microsoft systems more or less says "despite there being no point ever trying to do science on a Microsoft system, you *could* hypothetically compile and run any ISO C program on it". Interesting, but not directly of interest to this user, who is unlikely to actually want to do it. A Microsoft user who thinks that s/he can do science on a Microsoft system will typically think "Microsoft is good, so of course I can run anything I want on it". 
So the message here could more likely be seen as provocative rather than useful, since this user is unaware of the fundamental problems of Microsoft as an authoritarian, manipulative, centralised organisation providing bad software. So either way, the parenthesis about Microsoft can be safely removed given the space constraints. --- paper.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/paper.tex b/paper.tex index b1ebb9d..bf8565d 100644 --- a/paper.tex +++ b/paper.tex @@ -277,8 +277,8 @@ A project should be scalable to arbitrarily large and/or complex projects. \emph{Comparison with existing:} Most of the more recent solutions above are scalable. -However, IPOL, which uniquely stands out in satisfying most principles also fails here: IPOL is devoted to low-level image processing algorithms that \emph{can be} done with no dependencies beyond an ISO C compiler (even available on Microsoft Windows). -Its solution is thus not scalable to large projects which commonly involve tens of high-level dependencies, with complex data formats and analysis. +However, IPOL, which uniquely stands out in satisfying most principles, fails here: IPOL is devoted to low-level image processing algorithms that \emph{can be} done with no dependencies beyond an ISO C compiler. +IPOL is thus not scalable to large projects, which commonly involve dozens of high-level dependencies, with complex data formats and analysis. \item \label{principle:freesoftware}\textbf{Free and open source software:} Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box). -- cgit v1.2.1 From 22f380a646c23a13d1a7443633bb35e39ce8f111 Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Sun, 19 Apr 2020 17:52:41 +0200 Subject: Principles - P7 FOSS Reduction by 15 words. 
--- paper.tex | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/paper.tex b/paper.tex index bf8565d..0ef9ae9 100644 --- a/paper.tex +++ b/paper.tex @@ -282,13 +282,13 @@ IPOL is thus not scalable to large projects, which commonly involve dozens of hi \item \label{principle:freesoftware}\textbf{Free and open source software:} Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box). - This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and to industry: - 1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source. - 2) A free software package that may not execute on a particular piece hardware can be modified to work on it. - 3) A non-free software project typically cannot be distributed by others, making the whole community reliant only on the owner's server (even if the proprietary software does not ask for payments). + This principle is thus necessary to characterize the many advantages that are critical to the sciences and to industry: + 1) The lineage, and its optimization, can be traced to the internal algorithm in the software's source. + 2) A free-software package that does not execute on particular hardware can be modified to work on it. + 3) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments). \emph{Comparison with existing:} The existing solutions listed above are all free software. - There are non-free solutions, but we do not consider them here because of this principle. + Based on this principle, we do not consider non-free solutions here. 
\end{enumerate} -- cgit v1.2.1 From bf6e876a1f8edcc7f7a58712a53161f8d53aa570 Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Sun, 19 Apr 2020 20:48:20 +0100 Subject: Further summarized the principles section Following Boud's great corrections, I was able to further summarize this section, removing roughly 150 more words from it. --- paper.tex | 99 +++++++++++++++++++++++++++++++-------------------------------- 1 file changed, 49 insertions(+), 50 deletions(-) diff --git a/paper.tex b/paper.tex index 0ef9ae9..27b71c8 100644 --- a/paper.tex +++ b/paper.tex @@ -200,66 +200,64 @@ As a consequence, before starting with the technical details it is important to \label{sec:principles} The core principle of Maneage is simple: science is defined primarily by its method, not its result. -\citet{buckheit1995} argue that modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the analysis that generated the plots/results. +As \citet{buckheit1995} describe it, modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the plots/results. 
Many solutions have been proposed for this since the early 1990s, including: 1992: \href{https://sep.stanford.edu/doku.php?id=sep:research:reproducible}{RED}; 2003: \href{https://taverna.incubator.apache.org}{Apache Taverna}; 2004: \href{https://www.genepattern.org}{GenePattern}; 2010: \href{https://galaxyproject.org}{Galaxy}, \href{https://wings-workflows.org}{WINGS}; 2011: \href{https://www.ipol.im}{Image Processing On Line journal} (IPOL), \href{https://www.activepapers.org}{Active papers}, \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE}, \href{https://vcr.stanford.edu}{Verifiable Computational Result}; 2012: \href{https://osf.io/ns2m3}{SOLE}; 2015: \href{https://sciunit.run}{Sciunit}; 2017: \href{https://mybinder.org}{Binder}, \href{https://falsifiable.us}{Popper}; 2019: \href{https://wholetale.org}{WholeTale}. -A detailed list of principles shows how Maneage is unique compared to these other tools: +To help in the comparison, the founding principles of Maneage are listed below. \begin{enumerate}[label={\bf P\arabic*}] \item \label{principle:complete}\textbf{Completeness:} A project that is complete, or self-contained, - (i) does not depend on anything beyond the Portable operating system Interface (POSIX), - (ii) does not affect the host system, - (iii) does not require root/administrator privileges, - (iv) does not need an internet connection (its inputs can be stored on the local file system), and - (v) is stored in a format that only needs POSIX tools to open, parse or execute. - - A complete project can - (i) automatically access the inputs (see definition \ref{definition:input}), - (ii) build the software it needs, - (iii) do the analysis (run the software on the data) and - (iv) create the final narrative report/paper and its visualizations in their final format (e.g., PDF/HTML). - No manual/human interaction is required to run a complete project (``\emph{a clerk can do it}''; \citet{claerbout1992}). 
- A need for manual intervention in any of the steps above, or an interactive interface, constitutes incompleteness. - Plain-text format is vital because any other storage format will require specialized software \emph{before} the project can be opened. - - \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete, as they all have many dependencies far beyond POSIX. - For example, most recent projects use Python (for project/workflow, not analysis), or rely on Jupyter notebooks. - Such high-level tools have short lifespans and evolve fast. - They also have complex dependency trees, making them vulnerable and hard to maintain. For example, see the dependency tree of Matlplotlib (one of the smaller Jupyter dependencies; \citet[][Fig.~1]{alliez19}). - The longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency. - - Many existing tools do not store the project as plain text, but instead provide pre-built binary blobs (containers or virtual machines) that can rarely be re-created; these have a short lifespan. - Their re-creation is difficult because most are built with the package manager of the blob's OS, or Conda. - Both are highly dependent on the date of execution: precise versions are rarely stored, and the servers remove old binaries. - Docker containers are a good example of the short lifespan problem: Docker only runs on long-term support OSs, not older ones. - In GNU/Linux systems, this corresponds to Linux kernel 3.2.x (initially released in 2012) and above. - A plain-text project, besides being extremely low volume ($\sim100$ kilobytes), is human-readable and parsable by any machine, even if it can't be executed. + (P1.1) has no dependency beyond the Port\-able Operating System Interface (POSIX), 
+ (P1.2) does not affect the host, + (P1.3) does not require root, or administrator, privileges, + (P1.4) builds its software for an independent environment, + (P1.5) can be run locally (without an internet connection), + (P1.6) contains the full project's analysis, visualization \emph{and} narrative, from access to raw inputs to producing the final published format (e.g., PDF or HTML), + (P1.7) requires no manual/human interaction and can run automatically \citep[according to][``\emph{a clerk can do it}'']{claerbout1992}. + A consequence of P1.1 is that the project itself must be stored in plain text, and must not need any specialized software to open, parse or execute. + + \emph{Comparison with existing:} except for IPOL, none of the tools above are complete, because they have many dependencies beyond POSIX. + For example, most recent solutions need Python (for the workflow, not the analysis), or rely on Jupyter notebooks. + High-level tools have short lifespans and evolve fast (e.g., Python 2 code cannot run with Python 3). + They also have complex dependency trees, making them hard to maintain. + For example, see the dependency tree of Matplotlib in \citet[][Figure 1]{alliez19}, which is one of the simpler Jupyter dependencies. + The longevity of a workflow is determined by its shortest-lived dependency. + + As a result, the primary storage format of most recent solutions is pre-built binary blobs like containers or virtual machines. + They are large (gigabytes) and expensive to archive; furthermore, generic package managers (e.g., Conda), or those of the OS (e.g., \inlinecode{apt} or \inlinecode{yum}), are used to set up their environments. + Because exact versions of \emph{every software} are rarely included, and the servers remove old binaries, recreating them is very hard. + Blobs also have a short lifespan (e.g., Docker containers only run on long-term support OSs; in GNU/Linux systems, this corresponds to Linux 3.2.x, released in 2012). 
+ A plain-text project consumes below one megabyte, is human-readable and parsable by any machine, even if it can't be executed. \item \label{principle:modularity}\textbf{Modularity:} A project should be compartmentalized into independent modules with well-defined inputs/outputs having no side effects. Communication between the independent modules should be explicit, providing several optimizations: -1) Execution: independent modules can run in parallel. Modules that do not need to be run (because their dependencies have not changed) will not be re-run. -2) Data provenance extraction (recording any dataset's origins). -3) Citation: others can credit specific parts of a project. -4) Usage in other projects. +(1) independent modules can run in parallel. +Modules that do not need to be run (because their dependencies have not changed) will not be re-run. +(2) Data provenance extraction (recording any dataset's origins). +(3) Citation: others can credit specific parts of a project. +(4) Usage in other projects. +(5) Most importantly: they are easy to debug and improve. -\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice as the responsibility of project authors. -However, designing a modular project needs to be encouraged and facilitated. Otherwise, scientists, who are not usually trained in data management, will rarely design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy. +\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice to project authors. +However, designing a modular project needs to be encouraged and facilitated. 
+Otherwise, scientists, who are not usually trained in data management, will rarely design a modular project, leading to great inefficiencies in terms of project cost and/or scientific accuracy (testing/validating will be expensive). \item \label{principle:complexity}\textbf{Minimal complexity:} - This principle is Ockham's razor, ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, extrapolated to project management: - 1) avoid complex relations between analysis steps (this is related to modularity: \ref{principle:modularity}). + This is Ockham's razor extrapolated to project management \citep[``\emph{Never posit pluralities without necessity}''][]{schaffer15}: + 1) avoid complex relations between analysis steps (related to \ref{principle:modularity}). 2) avoid the programming language that is currently in vogue, because it is going to fall out of fashion soon and require significant resources to translate or rewrite it every few years (to stay fashionable). - The same job can be done with more stable/basic tools, and less long-term effort. + The same job can be done with more stable/basic tools, requiring less long-term effort. - \emph{Comparison with existing:} Most of the solutions above were created using tools that were at the time the most popular. For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}). + \emph{Comparison with existing:} IPOL stands out here too (requiring only ISO C); however, most existing solutions use the tools that were most popular when they were created. + For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}). \item \label{principle:verify}\textbf{Verifiable inputs and outputs:} -The project should contain automatic verification checks on its inputs (software source code and data) and outputs.
-When applied, expert knowledge will not be necessary to confirm the correct reproduction. +The project should automatically verify its inputs (software source code and data) \emph{and} outputs. +Expert knowledge is thus not needed to confirm a reproduction. -\emph{Comparison with existing:} Such verification is usually possible in most systems, but this is usually the responsibility of the user alone. -Automatic verification of inputs is commonly implemented, but the outputs are much more rarely verified. +\emph{Comparison with existing:} Such verification is possible in most systems, but is usually left to the project authors. +As with \ref{principle:modularity}, due to lack of training, this must be actively encouraged and facilitated; otherwise most will not be able to implement it. \item \label{principle:history}\textbf{History and temporal provenance:} No project is done in a single/first attempt. @@ -268,7 +266,7 @@ It is natural that earlier phases of a project are redesigned/optimized only aft This is often seen in exploratory research papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''. A project's ``history'' is thus as scientifically relevant as the final, or published version. -\emph{Comparison with existing:} The solutions above that are implemented with version control usually support this principle. +\emph{Comparison with existing:} The solutions above that implement version control usually support this principle. However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete. IPOL fails here, because only the final snapshot is published. @@ -281,14 +279,15 @@ However, IPOL, which uniquely stands out in satisfying most principles, fails he IPOL is thus not scalable to large projects, which commonly involve dozens of high-level dependencies, with complex data formats and analysis.
\item \label{principle:freesoftware}\textbf{Free and open source software:} - Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box). - This principle is thus necessary to characterize the many advantages that are critical to the sciences and to industry: - 1) The lineage, and its optimization, can be traced to the internal algorithm in the software's source. - 2) A free-software package that does not execute on particular hardware can be modified to work on it. - 3) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments). + Technically, reproducibility (see \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box). + This principle is thus necessary to complement it with the following points, which are critical to the sciences and to industry: + (1) When the project itself is free software, others can learn from and build upon it. + (2) The lineage can be traced to the algorithm implemented in a free-software package, also enabling optimizations at that level. + (3) A free-software package that does not execute on particular hardware can be modified to work on it. + (4) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments). \emph{Comparison with existing:} The existing solutions listed above are all free software. - Based on this principle, we do not consider non-free solutions here. + Based on this principle, we do not consider non-free solutions. \end{enumerate} -- cgit v1.2.1