%% Appendix on reviewing existing reproducible workflow solutions. This
%% file is loaded by the project's 'paper.tex' or 'tex/src/supplement.tex',
%% it should not be run independently.
%
%% Copyright (C) 2020-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org>
%% Copyright (C) 2021 Raúl Infante-Sainz <infantesainz@gmail.com>
%
%% This file is free software: you can redistribute it and/or modify it
%% under the terms of the GNU General Public License as published by the
%% Free Software Foundation, either version 3 of the License, or (at your
%% option) any later version.
%
%% This file is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details. See <http://www.gnu.org/licenses/>.





\section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
As reviewed in the introduction, the problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles \citeappendix{wilkinson16}.
In this appendix, some of the solutions are reviewed.
The solutions surveyed here were developed within an evolving software landscape and are therefore ordered by date: when a project has a web page, the year of its first release is used for sorting; otherwise, the publication year of its paper is used.

For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
Other studies have also attempted to review existing reproducible solutions, for example, see \citeappendix{konkol20}.





\subsection{Suggested rules, checklists, or criteria}
Before going into the various implementations, it is also useful to review existing suggested rules, checklists, or criteria for computationally reproducible research.

All the cases below are primarily targeted at immediate reproducibility and do not explicitly consider longevity.
They therefore lack a strong/clear completeness criterion: they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a VM or container is problematic (as mentioned in Appendix \ref{appendix:independentenvironment} and \citeappendix{oliveira18}).

Sandve et al. \citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
Generally, these are very similar to the criteria proposed here and follow a similar spirit, but the authors do not provide any actual research paper that applies all of those rules, nor a proof-of-concept implementation.
The Popper convention \citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion nor pay attention to longevity (Popper itself is written in Python with many dependencies, and its core operating language has already changed once).
For more on Popper, please see Section \ref{appendix:popper}.

To improve reproducibility for Jupyter notebook users, \citeappendix{rule19} propose ten rules and also provide links to example implementations.
These can be very useful for users of Jupyter but are not generic for non-Jupyter-based computational projects.
Some of the rules (which are indeed very good in a more general context) do not directly relate to reproducibility; for example, their rule 1: ``Tell a Story for an Audience''.
Generally, as reviewed in
\ifdefined\separatesupplement
the main body of this paper (section on the longevity of existing tools)
\else
Section \ref{sec:longevityofexisting}
\fi
and Section \ref{appendix:jupyter} (below), Jupyter itself has many issues regarding reproducibility.
To create Docker images, N\"ust et al. propose ``ten simple rules'' in \citeappendix{nust20}.
They recommend practices that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'', separating the computational environment from the data.
However, the long-term reproducibility of the images is not included as a criterion by these authors.
For example, they recommend using base operating systems whose version identification is limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which raises serious longevity issues
\ifdefined\separatesupplement
(as discussed in the longevity of existing tools section of the main paper).
\else
(Section \ref{sec:longevityofexisting}).
\fi
Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}).
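To illustrate the difference (as a minimal sketch, not taken from \citeappendix{nust20}; the digest below is only a placeholder), pulling an image by tag can silently resolve to different contents over time, while pulling by an immutable content digest always returns the same bytes:
\begin{verbatim}
# A tag is a mutable pointer: what it resolves to can change over time.
docker pull ubuntu:18.04

# A digest identifies the exact image contents; '<64-hex-digest>' is a
# placeholder for the sha256 checksum of one specific image.
docker pull ubuntu@sha256:<64-hex-digest>
\end{verbatim}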





\subsection{Reproducible Electronic Documents, RED (1992)}
\label{appendix:red}
RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the earliest attempt at reproducible research that we could find, see \cite{claerbout1992,schwab2000}.
It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
Their introduction on the importance of reproducibility resonates strongly with today's environment in computational sciences; in particular, the heavy investment one has to make in order to re-do another scientist's work, even within the same team.
RED also influenced other early reproducible works, for example \citeappendix{buckheit1995}.

To orchestrate the various figures/results of a project, from 1990 they used ``Cake'' \citeappendix{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
As described in \cite{schwab2000}, in the latter half of that decade they moved to GNU Make, which was much more commonly used, more actively developed, and came with a complete and up-to-date manual.
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and organizing the steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory), the directory where built files are kept.
The reader could later select which figures/parts of the project to reproduce by manually adding their names to the central Makefile and running Make.
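As a hypothetical sketch (the target and directory names below are illustrative, not taken from SEP's actual Makefiles), the day-to-day usage pattern would have looked like this:
\begin{verbatim}
# 1) Edit the central Makefile and set the result directory, e.g.
#        RESDIR = /scratch/myproject/results
# 2) Add the desired figure's name to the central Makefile and ask Make
#    to build it; only the steps it depends on are (re-)executed:
make fig3
# 3) Or rebuild every figure/result listed in the central Makefile:
make
\end{verbatim}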

At the time, Make was already used by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy and to define conventions for the Makefiles so that they would be consistent across projects.
This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs).
However, RED only used the existing software of the host system; it had no means of controlling the software versions.
Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing \citeappendix{fomel09}.
Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see Appendix \ref{appendix:madagascar}.





\subsection{Apache Taverna (2003)}
\label{appendix:taverna}
Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface which is still being developed.
A workflow is defined as a directed graph, where nodes are called ``processors''.
Each processor transforms a set of inputs into a set of outputs; processors are defined in the Scufl language (an XML-based language where each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage like
\ifdefined\separatesupplement
the lineage figure of the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
Taverna is only a workflow manager and is not integrated with a package manager, hence the versions of the used software can be different in different runs.
\citeappendix{zhao12} have studied the problem of workflow decay in Taverna.





\subsection{Madagascar (2003)}
\label{appendix:madagascar}
Madagascar\footnote{\inlinecode{\url{http://ahay.org}}} \citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in Appendix \ref{appendix:scons}).
Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}.
Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to \citeappendix{fomel13}.

Madagascar does include project management tools in the form of SCons extensions.
However, it is not just a reproducible project management tool.
The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the file system and acts as the intermediary between Madagascar's analysis programs.
Therefore, Madagascar is primarily a collection of analysis programs and tools to interact with RSF files and plotting facilities.
For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
The analysis programs mostly target geophysical data analysis, including various project-specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users.

Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format.
Since RSF also carries program options, Madagascar's analysis programs read their inputs from standard input and write their outputs to standard output.
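For example (a simplified sketch based on Madagascar's introductory documentation; the program options are illustrative), two analysis programs can be chained through a pipe, and the resulting RSF header is a small plain-text file pointing to the binary data:
\begin{verbatim}
# Generate a synthetic trace and band-pass filter it; both programs read
# and write RSF on standard input/output, so they can be piped directly.
sfspike n1=1000 | sfbandpass fhi=2 > filtered.rsf

# The output header is plain text (key=value pairs plus a pointer to a
# separate binary data file), so it can be inspected with ordinary tools.
cat filtered.rsf
\end{verbatim}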

In terms of completeness, as long as the user only uses Madagascar's own analysis programs, it is fairly complete at a high level (not lower-level OS libraries).
However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build), thus adding complexity.
Also, the linking between the analysis programs (of a certain user at a certain time) and future versions of that program (that is updated in time) is not immediately obvious.
Furthermore, the blending of the workflow component with the low-level analysis components fails the modularity criterion.





\subsection{GenePattern (2004)}
\label{appendix:genepattern}
GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}} \citeappendix{reich06} (first released in 2004) is a client-server software package containing many common analysis functions/modules, primarily focused on gene studies.
Although it is highly focused on a special research field, it is reviewed here because its concepts/methods are generic and relevant in the context of this paper.

Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
The modules are used through a web interface; the modern implementation is GenePattern Notebook \citeappendix{reich17}.
It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis.
However, the wrapper modules just call an existing tool on the host system.
Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.

%% GenePattern shutdown announcement (although as of November 2020, it does not open any more): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019
The primary GenePattern server had been active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citeappendix{reich17}.
However, it was shut down on November 15th, 2019 due to the end of funding.
All processing with this server has stopped, and any archived data on it has been deleted.
Since GenePattern is free software, there are alternative public servers to use, so hopefully, work on it will continue.
However, funding is limited and those servers may face similar funding problems.
This is a very nice example of the fragility of solutions that depend on archiving and running the research codes with high-level research products (including data and binary/compiled codes that are expensive to keep in one place).





\subsection{Kepler (2005)}
Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}} \citeappendix{ludascher05} is a Java-based graphical user interface workflow management tool.
Users drag-and-drop analysis components, called ``actors'', into a visual, directed graph, which is the workflow (similar to
\ifdefined\separatesupplement
the lineage figure shown in the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
Each actor is connected to others through the Ptolemy II framework\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
In many aspects, the usage of Kepler and its issues for long-term reproducibility are similar to those of Apache Taverna (see Section \ref{appendix:taverna}).





\subsection{VisTrails (2005)}
\label{appendix:vistrails}
VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}} \citeappendix{bavoil05} was a graphical workflow managing system.
According to its web page, VisTrails maintenance has stopped since May 2016; its last Git commit, as of this writing, was in November 2017.
However, the fact that it was well maintained for over 10 years is an achievement.

VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features.
Each analysis step, or module, is recorded in an XML schema, which defines the operations and their dependencies.
The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command).
Since the main goal was visualization (as images), apparently its primary output is in the form of image spreadsheets.
Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL), for more see \citeappendix{scheidegger08}.
Since XML is a plain text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
However, even though XML is in plain text, it is very hard to edit manually.
VisTrails, therefore, provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to
\ifdefined\separatesupplement
the data lineage figure of the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
Besides the fact that it is no longer maintained, VisTrails does not control the software that is run; it only controls the sequence of steps in which they are run.





\subsection{Galaxy (2010)}
\label{appendix:galaxy}
Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}.
The main user interface is the ``Galaxy Pages'', which does not require any programming: users graphically manipulate abstract ``tools'' which are wrappers over command-line programs.
Therefore the actual running version of the program can be hard to control across different Galaxy servers.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the points made there apply here too; for example, the very large cost of maintaining such a system and its reliance on a graphical environment.





\subsection{Image Processing On Line journal, IPOL (2010)}
\label{appendix:ipol}
The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}} \citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be so detailed that any specialist can implement it in their own programming language.
The author's own implementation of the algorithm is also published with the paper (in C, C++, or MATLAB); the code must be well commented, and each part of it must be linked to the relevant part of the paper.
The authors must also submit several examples of datasets/scenarios.
The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web-interface and even provide their own input.

IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
We encourage the reader to visit its web page and see some of its recent papers and their demos.
The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
In fact, high-level languages like Perl, Python, or Java are not acceptable in IPOL precisely because of the additional complexities, such as the dependencies that they require.
However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so this solution is not scalable.

IPOL thus fails on our scalability criterion.
Furthermore, by not publishing/archiving each paper's version control history or directly linking the analysis and produced paper, it fails criteria 6 and 7.
Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF.
A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.





\subsection{WINGS (2010)}
\label{appendix:wings}
WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}} \citeappendix{gil10} is an automatic workflow generation algorithm.
It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images).
It allows users to define various workflow components (for example datasets, analysis components, etc), with high-level goals.
It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints.
%\tonote{Read more about this}





\subsection{Active Papers (2011)}
\label{appendix:activepapers}
Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format).
It was initially written in Java because its compiled bytecode can be executed on any machine with a JVM \citeappendix{hinsen11}.
However, Java is not a commonly used platform today; hence, it was later implemented in Python \citeappendix{hinsen15}.

In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file.
%However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
When a Python module contains a component written in another language (mostly C or C++), it needs to be an external dependency to the Active Paper.

As mentioned in \citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary merely to open it.
Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required).
Installing it using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
Furthermore, as a high-level data format, HDF5 evolves very fast; for example, HDF5 1.12.0 (February 29\textsuperscript{th}, 2020) is not usable with older libraries provided by the HDF5 team.
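For example, even listing the contents of an Active Paper requires HDF5-specific tools (the file name below is a placeholder):
\begin{verbatim}
h5ls -r paper.h5     # recursively list the groups and datasets
h5dump -H paper.h5   # print only the headers/metadata, without the data
\end{verbatim}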

While data and code are indeed fundamentally similar concepts technically \citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving terabytes of data can be less than a megabyte.
Hence, storing code and data together becomes a burden when large datasets are used; this was also acknowledged in \citeappendix{hinsen15}.
Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published.
Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary files.
These files can later be removed as part of the analysis, but this makes the code more complicated and hard to read/maintain.
For example, the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.

In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (one or two lines, for example, just to see the option values of one step).
They do not necessarily need to run it or obtain the output datasets (which may be published elsewhere).
Hence the extra volume of the data, and the obscure HDF5 format that requires special tools even to read its plain-text internals, are an issue.





\subsection{Collage Authoring Environment (2011)}
\label{appendix:collage}
The Collage Authoring Environment \citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment, which has a web-based graphic user interface.
Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
%\tonote{See how it containerizes the software environment}





\subsection{SHARE (2011)}
\label{appendix:SHARE}
SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
SHARE was awarded second prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
Simply put, SHARE was just a VM library that users could download or connect to, and run.
The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible.
As of January 2021, the top SHARE web page still works.
However, upon selecting any operation, a notice is printed that ``SHARE is offline'' since 2019; the reason is not mentioned.





\subsection{Verifiable Computational Result, VCR (2011)}
\label{appendix:verifiableidentifier}
A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc) that is associated with a ``verifiable result identifier'' (VRI), see \citeappendix{gavish11}.
It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.

A VRI is created using tags within the programming source that produced that output, also recording its version control or history.
This enables the exact identification and citation of results.
The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs, and scripts, that may be re-executed.
According to \citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available during the writing of this paper.
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.

Unfortunately, most parts of the web page are not complete at the time of this writing.
The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} does not exist at the time of this writing.
Finally, the date of the files in the MATLAB extension tarball is set to 2011, hinting that VCR was probably abandoned soon after the publication of \citeappendix{gavish11}.





\subsection{SOLE (2012)}
\label{appendix:sole}
SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citeappendix{pham12,malik13}.
An SO is any code/content that is wrapped in begin/end tags with an associated type and name.
For example, they can be specially commented lines in a Python, R, or C program.
The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI).
SOLE also supports workflows as Galaxy tools \citeappendix{goecks10}.

For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (the authors may not consider tags as important things, but they may be useful later).
In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text through \LaTeX{} macros that are the backbone of the whole system (they are not artificial/extra features).





\subsection{Sumatra (2012)}
Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}} \citeappendix{davison12} attempts to capture the environment information of a running project.
It is written in Python and is a command-line wrapper over the analysis script.
By controlling a project at run time, Sumatra is able to capture the environment it was run in.
The captured environment can be viewed in plain text or a web interface.
Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).
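A rough sketch of its command-line usage is shown below (the \inlinecode{smt} sub-commands are taken from Sumatra's documentation at the time of writing, with some configuration steps omitted; details may differ between versions):
\begin{verbatim}
smt init MyProject   # create the project's Sumatra record store
smt run main.py      # run the (version-controlled) script and capture
                     # the environment it ran in
smt list             # list the captured runs and their metadata
\end{verbatim}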

The actual code that Sumatra wraps around must itself be under version control, and it does not run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
Therefore Sumatra makes no attempt at storing the environment of the analysis as Sciunit does (see Appendix \ref{appendix:sciunit}); it only stores information about it.
Sumatra thus needs to know the language of the running program and is not generic.
It only captures information about the environment; it does not store \emph{how} that environment was built.





\subsection{Research Object (2013)}
\label{appendix:researchobject}
The Research Object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of metadata ontologies to describe the aggregation of resources or workflows, see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.

\citeappendix{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
The important point is that the research object concept is not specific to any particular workflow; it is just a metadata bundle/standard, which is only as robust in reproducing the result as the running workflow.





\subsection{Sciunit (2015)}
\label{appendix:sciunit}
Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}} \citeappendix{meng15} defines ``sciunits'' that keep the executed commands of an analysis and all the necessary programs and libraries that are used in those commands.
It automatically parses all the executable files in the script and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020).
Therefore Sciunit2 is a new implementation in Python 3.
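A rough sketch of its command-line usage (sub-command names as documented for Sciunit2 at the time of writing; they may change) is the following:
\begin{verbatim}
sciunit create my-analysis         # open a new, empty sciunit
sciunit exec python3 analysis.py   # run the script, copying the programs
                                   # and libraries it uses into the sciunit
sciunit repeat e1                  # re-run the captured execution 'e1'
\end{verbatim}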

The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built.
This is a major problem for scientific projects, both in principle (it is not known how the programs were built) and in practice (archiving a large-volume sciunit for every step of the analysis requires a lot of storage space).





\subsection{Umbrella (2015)}
Umbrella \citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis.
The user specifies the necessary operating system, and necessary packages for the analysis steps in various JSON files.
Umbrella will then study the host operating system and the various necessary inputs (including data and software), through a process similar to Sciunit (mentioned above), to find the best environment isolator (for example Linux containerization, containers, or VMs).
We could not find a URL to the source code of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions in \citeappendix{meng17}, it is written in Python 2.6 (which is now \new{deprecated}).





\subsection{ReproZip (2016)}
ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}} \citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables into a single bundle.
The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
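The basic workflow is sketched below (file names are placeholders); the companion \inlinecode{reprounzip} tool provides several ``unpackers'' (for example into a plain directory, a chroot, or a Docker image):
\begin{verbatim}
reprozip trace ./run_analysis.sh   # trace system calls, recording every
                                   # file the analysis accesses
reprozip pack analysis.rpz         # bundle the traced files into one .rpz

# On another machine, unpack and re-run (with the 'directory' unpacker):
reprounzip directory setup analysis.rpz ./unpacked
reprounzip directory run ./unpacked
\end{verbatim}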

ReproZip is therefore very good for taking a ``snapshot'' of the running environment into a single file.
However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies).
Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
Furthermore, ReproZip just copies the binary/compiled files used in a project; it has no way of knowing how the software was built.
As mentioned in this paper, and also in \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical for understanding the results; simply having the binaries is not necessarily useful.

For the data, it is similarly not possible to extract which data server they came from.
Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.





\subsection{Binder (2017)}
Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize already existing Jupyter-based processing steps.
Users simply add a set of Binder-recognized configuration files to their repository and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
One good feature of Binder is that the imported Docker image must be tagged (something like a checksum).
This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
However, it does not make sure that the Dockerfile used by the imported Docker image also follows a similar convention.
Binder is used by \citeappendix{jones19}.
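As an illustration of the configuration-file approach described above, a minimal Binder-ready repository only needs the notebook(s) plus one or more of the configuration files that Binder recognizes (the repository layout below is illustrative; the file contents are project-specific):
\begin{verbatim}
my-binder-repo/
|-- analysis.ipynb    # the Jupyter notebook(s) to run
|-- environment.yml   # Conda packages (ideally with pinned versions)
|-- apt.txt           # optional: extra Debian packages for the image
`-- postBuild         # optional: script run once after the image is built
\end{verbatim}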





\subsection{Gigantum (2017)}
%% I took the date from their PiPy page, where the first version 0.1 was published in November 2016.
Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image.
Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
Simply put, it is a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without its own client.
The file structure (which is under version control) includes codes, input data, and output data.
As acknowledged on their own web page, this greatly reduces the speed of Git operations, transmitting, or archiving the project.
Therefore there are size limits on the dataset/code sizes.
However, there is one directory that can be used to store files that must not be tracked.





\subsection{Popper (2017)}
\label{appendix:popper}
Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}.
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
The \inlinecode{popper} program itself is written in Python.
However, job management was initially based on the HashiCorp configuration language (HCL), because HCL was used by ``GitHub Actions'' to manage workflows at the time.
But in October 2019 GitHub changed to a custom YAML-based language, so Popper deprecated HCL as well.
This is an important issue when low-level choices are based on service providers.

To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
However, as of this writing, the scaffold is not complete: it lacks a manuscript and validation of outputs (as mentioned in the convention).
By default, Popper runs in a Docker image (so root permissions are necessary, and the reproducibility issues with Docker images discussed above also apply), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
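A sketch of the \inlinecode{popper} program's basic usage (sub-command names as documented at the time of writing) is the following:
\begin{verbatim}
popper scaffold   # generate the minimal template ("scaffold") files
popper run        # execute the workflow, by default inside Docker
\end{verbatim}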

Popper does not comply with the completeness, minimal complexity, or inclusion-of-the-narrative criteria.
Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control.
Hence, tracking future changes in Popper and how they relate to the high-level projects that depend on it will be very hard.
In Maneage, the same \inlinecode{maneage} git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage.





\subsection{Whole Tale (2017)}
\label{appendix:wholetale}
Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance, see \citeappendix{brinckman17}.
It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).

The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands.
2) It is not possible to see accurately how the environment was built (when the Dockerfile uses \inlinecode{apt}), as illustrated below.
This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in \citeappendix{oliveira18}, for more on this, please see Appendix \ref{appendix:packagemanagement}.
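As a simple illustration of the second point, a Dockerfile instruction that boils down to the command below installs whatever version the package archive serves on the day the image is built, so rebuilding the same instruction later can produce a different environment:
\begin{verbatim}
apt-get install -y python3-numpy   # the version installed depends on the
                                   # archive's state at build time
\end{verbatim}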





\subsection{Occam (2018)}
Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}} \citeappendix{oliveira18} is a web-based application to preserve software and its execution.
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and their dependencies) to be in full control of the software build instructions, similar to Maneage.
Besides Nix or Guix (which are primarily package managers that can also do job management), Occam has been the only solution in our survey that attempts to be complete in this aspect.

However, it is incomplete from the perspective of requirements: it works within a Docker image (which requires root permissions) and currently only runs on Debian-based, Red Hat-based, and Arch-based GNU/Linux operating systems, which respectively use the \inlinecode{apt}, \inlinecode{yum}, or \inlinecode{pacman} package managers.
It is also itself written in Python (version 3.4 or above).

Furthermore, it does not account for the minimal complexity criterion, because the instructions to build the software and their versions are not immediately viewable or modifiable by the user.
Occam contains its own JSON database that should be parsed with its own custom program.
The analysis phase of Occam is also carried out through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}) that is a web-based graphic user interface.
All the connections between various phases of the analysis need to be pre-defined in a JSON file and manually linked in the GUI.
Hence for complex data analysis operations that involve thousands of steps, it is not scalable.