Diffstat (limited to 'paper.tex')
-rw-r--r--   paper.tex   68
1 file changed, 36 insertions(+), 32 deletions(-)
diff --git a/paper.tex b/paper.tex
index 070223e..086e620 100644
--- a/paper.tex
+++ b/paper.tex
@@ -150,22 +150,24 @@ To highlight the necessity, a short review of commonly-used tools is provided be
\fi%
}
-To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
-However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution.
-We will thus focus on Docker here.
+To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, but discontinued in 2019).
+However, containers (e.g., Docker or Singularity) are currently the most widely-used solution.
+We therefore focus on Docker here, as the most common container technology.
\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
However, that is rarely done.}
-Usually images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'
- \ifdefined\noappendix
- \new{(more examples in the \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices})}%
- \else%
- \new{(more examples: see the appendices (\ref{appendix:existingtools}))}%
- \fi%
-. The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
- Hence, if the image is built in different months, its output image will contain different OS components.
+Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20}
+\ifdefined\noappendix
+\new{(more examples in the \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices})}%
+\else%
+\new{(more examples: see the appendices (\ref{appendix:existingtools}))}%
+\fi%
+{ }imports `\inlinecode{FROM ubuntu:16.04}'.
+The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
+Hence, if the image is built in different months, it will contain different OS components.
% CentOS announcement: https://blog.centos.org/2020/12/future-is-centos-stream
-In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not abruptly aborted earlier, like CentOS 8 which will be terminated 8 years early).}
+In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not aborted earlier, like CentOS 8 which will be terminated 8 years early).}
+
Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
%% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
\new{Because of this, DockerHub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
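The tag-versus-digest distinction above can be sketched with plain POSIX tools, no Docker needed: a digest is the SHA-256 of the image content, so it pins exactly one image, whereas a tag is a mutable pointer that a registry can re-assign at every monthly rebuild. The byte strings below are illustrative stand-ins for real image tarballs.

```shell
# Two monthly rebuilds of the "same" tag yield different content,
# hence different digests (stand-in byte streams, not real images).
a=$(printf 'ubuntu:16.04 rootfs, snapshot A' | sha256sum | cut -d' ' -f1)
b=$(printf 'ubuntu:16.04 rootfs, snapshot B' | sha256sum | cut -d' ' -f1)
echo "$a"
echo "$b"
# A Dockerfile line such as 'FROM ubuntu@sha256:<digest>' therefore
# re-creates one exact OS image; 'FROM ubuntu:16.04' does not.
[ "$a" != "$b" ] && echo 'digests differ'
```

This is why recording the digest (rarely done in practice, as noted above) is the only way a tag-based import can be made reproducible.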
@@ -200,11 +202,6 @@ However, since they are not part of the core, their longevity can be assumed to
Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
-An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
-Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage.
-This is possible owing to the focus on low-level algorithms with no dependencies beyond an ISO C compiler.
-However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
-
@@ -250,7 +247,7 @@ More stable/basic tools can be used with less long-term maintenance costs.
\textbf{Criterion 4: Scalability.}
A scalable project can easily be used in arbitrarily large and/or complex projects.
-On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable (see IPOL example above).
+On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable.
\textbf{Criterion 5: Verifiable inputs and outputs.}
The project should automatically verify its inputs (software source code and data) \emph{and} outputs, not needing any expert knowledge.
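A minimal shell sketch of this criterion (the file name and toy content are illustrative, not Maneage's actual verification rules): each input is checked against a checksum recorded in the version-controlled project source, and the build aborts on any mismatch, so no expert inspection is needed.

```shell
set -e
# Stand-in for a downloaded input file.
printf 'survey catalog rows\n' > input.dat
# In a real project, this hash is recorded in the project source,
# not computed on the fly as done here for the toy example.
expected=$(sha256sum input.dat | cut -d' ' -f1)
# Verification step: exits non-zero if the file does not match.
echo "$expected  input.dat" | sha256sum --check -
```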
@@ -351,8 +348,9 @@ For Windows-native software that can be run in batch-mode, evolving technologies
The analysis phase of the project, however, naturally differs from one project to another at a low level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
-We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (left).
-Figure \ref{fig:datalineage} (right) is the data lineage graph that produced it (including this complete paper).
+This design is demonstrated with the example of Figure \ref{fig:datalineage} (left).
+It is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}.
+Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
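A data lineage of this kind is ultimately a dependency graph of the sort Make resolves: each file is a node, each rule an edge, and only the steps whose prerequisites changed are re-run. The toy sketch below illustrates the idea (target and file names are illustrative, not Maneage's actual rules; `.RECIPEPREFIX` needs GNU Make 3.82 or later).

```shell
set -e
cd "$(mktemp -d)"
# Stand-in for the input data behind the 'tool' curve of the figure.
printf '1990 10\n2000 40\n' > menke20-input.txt
cat > Makefile <<'EOF'
.RECIPEPREFIX = >
# The paper depends on the extracted column, which depends on the raw input.
paper.txt: tool-fraction.txt
> cp tool-fraction.txt paper.txt
tool-fraction.txt: menke20-input.txt
> cut -d' ' -f2 menke20-input.txt > tool-fraction.txt
EOF
make -s paper.txt
cat paper.txt
```

Touching `menke20-input.txt` and re-running `make` would rebuild both downstream files; touching nothing rebuilds nothing.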
\begin{figure*}[t]
\begin{center}
@@ -455,14 +453,6 @@ There is a \new{thoroughly elaborated} customization checklist in \inlinecode{RE
The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgements).
These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgements)}.
-The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
-in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
-\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
-Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
-
-Finally, the complete project source is usually $\sim100$ kilo-bytes.
-It can thus easily be published or archived in many servers, for example it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
-
\begin{lstlisting}[
label=code:branching,
caption={Starting a new project with Maneage, and building it},
@@ -483,6 +473,15 @@ $ ./project make # Re-build to see effect.
$ git add -u && git commit # Commit changes.
\end{lstlisting}
+The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
+in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
+\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
+
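The merge in (a)/(b) can be sketched as a self-contained toy repository (branch and file names follow the figure's convention; `git init -b` needs Git 2.28 or later): the project branch carries the analysis, while re-merging the `maneage` branch pulls in low-level fixes.

```shell
set -e
cd "$(mktemp -d)"
git init -q -b maneage .
git config user.email demo@example.org
git config user.name Demo
echo 'infrastructure v1' > initialize.mk        # Maneage's low level
git add -A && git commit -qm 'Maneage infrastructure'
git checkout -qb project                        # derive the project branch
echo 'my analysis' > top-make.mk
git add -A && git commit -qm 'Project analysis'
git checkout -q maneage                         # upstream fixes a bug...
echo 'infrastructure v2' > initialize.mk
git commit -qam 'Fix low-level bug'
git checkout -q project
git merge -q --no-edit maneage                  # ...and the project re-imports it
cat initialize.mk                               # -> infrastructure v2
```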
+Finally, the complete project source is usually $\sim100$ kilobytes.
+It can thus easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source; see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo, and archived in SoftwareHeritage.
+
+
@@ -1487,6 +1486,7 @@ Besides some small differences Galaxy seems very similar to GenePattern (Appendi
\subsection{Image Processing On Line journal, IPOL (2010)}
+\label{appendix:ipol}
The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be detailed enough that any specialist can implement it in their own programming language.
@@ -1495,12 +1495,16 @@ The authors must also submit several example datasets/scenarios.
The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input.
-The IPOL model is the single most robust model of peer review and publishing computational research methods/implementations that we have seen in this survey.
-It has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
+IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
We encourage the reader to visit its webpage and see some of its recent papers and their demos.
-The reason it can be so thorough and complete is its very narrow scope (image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output), allowing the referees and readers to go deeply into each implemented algorithm.
+The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
In fact, high-level languages like Perl, Python or Java are not acceptable in IPOL precisely because of the additional complexities, such as dependencies, that they require.
-If any referee or reader were inclined to do so, a paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinised at a similar detailed level, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
+However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.
+
+IPOL thus fails on our Scalability criterion (Criterion 4).
+Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and the produced paper, it fails Criteria 6 and 7.
+Note that on the webpage, it is possible to change parameters, but that will not affect the produced PDF.
+A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinised at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.