%% Appendix on reviewing existing reproducible workflow solutions. This
%% file is loaded by the project's 'paper.tex' or 'tex/src/supplement.tex',
%% it should not be run independently.
%
%% Copyright (C) 2020-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org>
%% Copyright (C) 2021 Raúl Infante-Sainz <infantesainz@gmail.com>
%% Copyright (C) 2021 Boudewijn F. Roukema <boud@astro.uni.torun.pl>
%
%% This file is free software: you can redistribute it and/or modify it
%% under the terms of the GNU General Public License as published by the
%% Free Software Foundation, either version 3 of the License, or (at your
%% option) any later version.
%
%% This file is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
%% for more details. See <http://www.gnu.org/licenses/>.
\section{Survey of common existing reproducible workflows}
\label{appendix:existingsolutions}
The problem of reproducibility has received considerable attention over the last three decades and various solutions have already been proposed.
The core principles that many of the existing solutions (including Maneage) aim to achieve are nicely summarized by the FAIR principles\citeappendix{wilkinson16}.
In this appendix, \emph{some} of the solutions are reviewed.
We are not just reviewing solutions that can be used today.
Because the main focus of this paper is longevity, we also spent considerable time on finding and inspecting solutions that have been aborted, discontinued or abandoned.
The solutions are based on an evolving software landscape, therefore they are ordered by date: when the project has a web page, the year of its first release is used for the sorting.
Otherwise their paper's publication year is used.
For each solution, we summarize its methodology and discuss how it relates to the criteria proposed in this paper.
Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
Therefore proprietary solutions like Code Ocean\footnote{\inlinecode{\url{https://codeocean.com}}} or Nextjournal\footnote{\inlinecode{\url{https://nextjournal.com}}} will not be reviewed here.
Other studies have also attempted to review existing reproducible solutions, for example, see Konkol et al.\citeappendix{konkol20}.
We have tried our best to test and read through the documentation of almost all reviewed solutions to a sufficient level.
However, due to time constraints, it is inevitable that we may have missed some aspects of the solutions, or incorrectly interpreted their behavior and outputs.
In such cases, please let us know and we will correct the text in the paper's Git repository and publish the updated (postprint) PDF on \href{https://doi.org/10.5281/zenodo.3872247}{zenodo.3872247} (this is the version-independent DOI, which always points to the most recent Zenodo upload).
\subsection{Suggested rules, checklists, or criteria}
Before going into the various implementations, it is useful to review some existing suggested rules, checklists, or criteria for computationally reproducible research.
Sandve et al.\citeappendix{sandve13} propose ``ten simple rules for reproducible computational research'' that can be applied in any project.
Generally, these are very similar to the criteria proposed here and follow a similar spirit, but they do not provide any actual research papers following up all those points, nor do they provide a proof of concept.
The Popper convention\citeappendix{jimenez17} also provides a set of principles that are indeed generally useful, among which some are common to the criteria here (for example, automatic validation, and, as in Maneage, the authors suggest providing a template for new users), but the authors do not include completeness as a criterion, nor do they pay attention to longevity: Popper has already changed its core workflow language once and is written in Python with many dependencies that evolve fast (see Appendix \ref{appendix:highlevelinworkflow}).
For more on Popper, please see Section \ref{appendix:popper}.
To improve the reproducibility of Jupyter notebooks, Rule et al.\citeappendix{rule19} propose ten rules and also provide links to example implementations.
These can be very useful for users of Jupyter but are not generic for non-Jupyter-based computational projects.
Some criteria (which are indeed very good in a more general context) do not directly relate to reproducibility, for example their Rule 1: ``Tell a Story for an Audience''.
Generally, as reviewed in
\ifdefined\separatesupplement%
the main body of this paper (section on the longevity of existing tools)%
\else%
Section \ref{sec:longevityofexisting}%
\fi
and Section \ref{appendix:jupyter} (below), Jupyter itself has many issues regarding reproducibility.
For creating Docker images, N\"ust et al.\citeappendix{nust20} propose ``ten simple rules''.
They make some recommendations that can indeed help increase the quality of Docker images and their production/usage, such as their rule 7 to ``mount datasets [only] at run time'' to separate the computational environment from the data.
However, the long-term reproducibility of the images is not included as a criterion by these authors.
For example, they recommend using base operating systems, with version identification limited to a single brief identifier such as \inlinecode{ubuntu:18.04}, which has serious longevity problems
\ifdefined\separatesupplement%
(as discussed in the longevity of existing tools section of the main paper)%
\else%
(Section \ref{sec:longevityofexisting})%
\fi.
Furthermore, in their proof-of-concept Dockerfile (listing 1), \inlinecode{rocker} is used with a tag (not a digest), which can be problematic due to the high risk of ambiguity (as discussed in Section \ref{appendix:containers}).
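The distinction between a tag and a digest can be illustrated with a minimal Python sketch.
It assumes that an image reference is only unambiguous when it ends with a SHA-256 content digest; the function name \inlinecode{is\_digest\_pinned} and the all-zero digest are hypothetical and only serve the illustration.

```python
import re

# Assumption for illustration: a reference counts as "pinned" only when it
# ends with an explicit SHA-256 content digest (name@sha256:<64 hex digits>).
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends with a SHA-256 digest."""
    return DIGEST_SUFFIX.search(image_ref) is not None


# A tag is a mutable label: it can later point to a different image.
tag_ref = "ubuntu:18.04"
# Placeholder digest (64 zeros), not the digest of any real image.
digest_ref = "ubuntu@sha256:" + "0" * 64

print(is_digest_pinned(tag_ref))     # reference with a tag only
print(is_digest_pinned(digest_ref))  # reference with a digest
```

A check of this kind only detects an ambiguous reference; it does not by itself guarantee that the referenced image will remain available in the long term.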
Previous criteria are thus primarily targeted to immediate reproducibility and do not consider longevity.
Therefore, they lack a strong/clear completeness criterion: they mainly only suggest, rather than require, the recording of versions, and their ultimate suggestion of storing the full binary OS in a binary VM or container is problematic (as mentioned in Appendix \ref{appendix:independentenvironment} and Oliveira et al.\citeappendix{oliveira18}).
\subsection{Reproducible Electronic Documents, RED (1992)}
\label{appendix:red}
RED\footnote{\inlinecode{\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}}} is the first attempt\cite{claerbout1992,schwab2000} that we could find at doing reproducible research.
It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
Their introductions on the importance of reproducibility resonate a lot with today's environment in computational sciences.
In particular, the authors highlight the heavy investment one has to make in order to re-do another scientist's work, even in the same team.
RED also influenced other early reproducible works, for example Buckheit \& Donoho\citeappendix{buckheit1995}.
To orchestrate the various figures/results of a project, from 1990, they used ``Cake''\citeappendix{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
As described in Schwab et al.\cite{schwab2000}, in the latter half of that decade, they moved to GNU Make, which was much more commonly used, better maintained, and came with a complete and up-to-date manual.
The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and to orchestrate the steps through a Makefile.
This enabled all the results to be re-executed with a single command.
Several basic low-level Makefiles were included in the high-level/central Makefile.
The reader/user of a project had to manually edit the central Makefile and set the variable \inlinecode{RESDIR} (result directory), the directory where built files are kept.
The reader could later select which figures/parts of the project to reproduce by manually adding their names to the central Makefile, and running Make.
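The structure described above can be sketched in a few lines of Python.
This is not RED's code: the rules, the file names and the \inlinecode{reproduce} function are hypothetical.
Unlike Make, the sketch ignores file time stamps and does not run any command; it only shows how selecting figures determines which steps are needed and in what order.

```python
# Hypothetical rules: target -> (prerequisites, description of the recipe).
RULES = {
    "data.txt": ([], "prepare input data"),
    "filtered.txt": (["data.txt"], "filter the data"),
    "fig1.pdf": (["filtered.txt"], "plot figure 1"),
    "fig2.pdf": (["data.txt"], "plot figure 2"),
}


def build(target, done, log):
    """Build the prerequisites of `target` first, then `target`, each once."""
    if target in done:
        return
    prerequisites, recipe = RULES[target]
    for prerequisite in prerequisites:
        build(prerequisite, done, log)
    log.append(f"{target}: {recipe}")
    done.add(target)


def reproduce(selected_figures):
    """Return the ordered list of steps needed for the selected figures."""
    done, log = set(), []
    for figure in selected_figures:
        build(figure, done, log)
    return log


for step in reproduce(["fig1.pdf", "fig2.pdf"]):
    print(step)
```

Selecting only \inlinecode{fig2.pdf} in this sketch skips the filtering step, which mirrors how a reader of a RED project could reproduce only part of the results.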
At the time, Make was already used by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and define conventions for the Makefiles to be consistent across projects.
This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs).
However, RED only used the existing software of the host system, with no means to control that software.
Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing\citeappendix{fomel09}.
Hence, in 2006, SEP moved to a new Python-based framework called Madagascar; see Appendix \ref{appendix:madagascar}.
\subsection{Taverna (2003)}
\label{appendix:taverna}
Taverna\footnote{\inlinecode{\url{https://github.com/taverna}}}\citeappendix{oinn04} was a workflow management system written in Java with a graphical user interface.
In 2014 it was sponsored by the Apache Incubator project and called ``Apache Taverna'', but its developers \href{https://lists.apache.org/thread.html/r559e0dd047103414fbf48a6ce1bac2e17e67504c546300f2751c067c\%40\%3Cdev.taverna.apache.org\%3E}{voted} to \emph{retire} it in 2020 because development had come to a standstill (as of April 2021, the latest public GitHub commit was in 2016).
In Taverna, a workflow is defined as a directed graph, where nodes are called ``processors''.
Each processor transforms a set of inputs into a set of outputs, and processors are defined in the Scufl language (an XML-based language, where each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage, as in the
\ifdefined\separatesupplement
lineage figure of the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
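A workflow of this kind can be sketched in Python as a directed graph whose nodes are run only after the nodes they depend on.
The sketch does not use Scufl or any part of Taverna: the processor names and the two dictionaries are hypothetical, and the ordering relies on \inlinecode{graphlib} from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical processors: each maps a dict of named inputs to an output.
PROCESSORS = {
    "load": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs["load"]),
    "total": lambda inputs: sum(inputs["load"]),
    "report": lambda inputs: {"sorted": inputs["sort"], "sum": inputs["total"]},
}

# Data links: processor -> the processors whose outputs it consumes.
DATA_LINKS = {
    "load": set(),
    "sort": {"load"},
    "total": {"load"},
    "report": {"sort", "total"},
}


def run_workflow(processors, data_links):
    """Run every processor after all of the processors it depends on."""
    outputs = {}
    for name in TopologicalSorter(data_links).static_order():
        inputs = {dep: outputs[dep] for dep in data_links[name]}
        outputs[name] = processors[name](inputs)
    return outputs


print(run_workflow(PROCESSORS, DATA_LINKS)["report"])
```

The graph only fixes the order of the steps; nothing in it records which version of the underlying software is run at each node.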
Taverna is only a workflow manager and is not integrated with a package manager, hence the versions of the used software can be different in different runs.
Zhao et al.\citeappendix{zhao12} studied the problem of workflow decay in Taverna.
\subsection{Madagascar (2003)}
\label{appendix:madagascar}
Madagascar\footnote{\inlinecode{\url{http://ahay.org}}}\citeappendix{fomel13} is a set of extensions to the SCons job management tool (reviewed in Appendix \ref{appendix:scons}).
Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:red}.
Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Reproducible_Documents}}}, 120 prior to Fomel et al.\citeappendix{fomel13}.
Madagascar does include project management tools in the form of SCons extensions.
However, it is not just a reproducible project management tool.
The Regularly Sampled File (RSF) file format\footnote{\inlinecode{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}} is a custom plain-text file that points to the location of the actual data files on the file system and acts as the intermediary between Madagascar's analysis programs.
Therefore, Madagascar is primarily a collection of analysis programs and tools to interact with RSF files and plotting facilities.
For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (\inlinecode{PREFIX/bin/sf*}).
The analysis programs mostly target geophysical data analysis, including various project-specific tools: more than half of the total built tools are under the \inlinecode{build/user} directory which includes names of Madagascar users.
Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format.
Since RSF contains program options also, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output.
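The role of such a header can be illustrated with a minimal Python sketch that reads name/value pairs from plain text and separates the data location from the remaining options.
It is a simplification and not Madagascar's own parser: the parameter names (\inlinecode{n1}, \inlinecode{d1}, \inlinecode{in}), the path and the function \inlinecode{parse\_header} are assumptions made for this example.

```python
import shlex


def parse_header(text):
    """Parse whitespace-separated name=value pairs into a dictionary."""
    parameters = {}
    for token in shlex.split(text):
        if "=" in token:
            name, value = token.split("=", 1)
            parameters[name] = value  # a repeated name keeps its last value
    return parameters


# Hypothetical header: the parameter names and the path are assumptions.
header = 'n1=100 d1=0.004\nin="/tmp/example.rsf@"'

parameters = parse_header(header)
data_location = parameters.pop("in")  # where the actual data are stored
print(data_location)
print(parameters)
```

Because the header is plain text, it can be passed from one program to the next through standard input and output, while the (possibly large) data stay in the file that it points to.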
In terms of completeness, as long as the user only uses Madagascar's own analysis programs, it is fairly complete at a high level (not lower-level OS libraries).
However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build), thus adding complexity.
Also, the linking between the analysis programs (of a certain user at a certain time) and future versions of that program (that is updated in time) is not immediately obvious.
Furthermore, the blending of the workflow component with the low-level analysis components fails the modularity criterion.
\subsection{GenePattern (2004)}
\label{appendix:genepattern}
GenePattern\footnote{\inlinecode{\url{https://www.genepattern.org}}}\citeappendix{reich06} (first released in 2004) is client-server software containing many common analysis functions/modules, primarily focused on gene studies.
Although it is highly focused on a specific research field, it is reviewed here because its concepts/methods are generic.
Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
The modules are used through a web interface; the modern implementation is GenePattern Notebook\citeappendix{reich17}.
It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis.
However, the wrapper modules just call an existing tool on the running system.
Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.
%% GenePattern shutdown announcement (although as of November 2020, it does not open any more): https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019
The primary GenePattern server was active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week\citeappendix{reich17}.
However, it was shut down on November 15th 2019 due to the end of funding.
All processing with this server has stopped, and any archived data on it has been deleted.
Since GenePattern is free software, there are alternative public servers to use, so hopefully, work on it will continue.
However, funding is limited and those servers may face similar funding problems.
This is a very nice example of the fragility of solutions that depend on archiving and running the research codes with high-level research products (including data and binary/compiled codes that are expensive to keep in one place).
The data and software may have backups in other places, but the high-level project-specific workflows that researchers spent most time on, have been lost due to the deletion (unless they were backed up privately by the authors!).
\subsection{Kepler (2005)}
Kepler\footnote{\inlinecode{\url{https://kepler-project.org}}}\citeappendix{ludascher05} is a Java-based workflow management tool with a graphical user interface.
Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to
\ifdefined\separatesupplement
the lineage figure shown in the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
Each actor is connected to others through Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}}\citeappendix{eker03}.
In many aspects, the usage of Kepler and its issues for long-term reproducibility are similar to those of Taverna (see Section \ref{appendix:taverna}).
\subsection{VisTrails (2005)}
\label{appendix:vistrails}
VisTrails\footnote{\inlinecode{\url{https://www.vistrails.org}}}\citeappendix{bavoil05} was a graphical workflow managing system.
According to its web page, VisTrails maintenance stopped in May 2016, and its last Git commit, as of this writing, was in November 2017.
However, the fact that it was well maintained for over 10 years is an achievement.
VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features.
Each analysis step, or module, is recorded in an XML schema, which defines the operations and their dependencies.
The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command).
Since the main goal was visualization (as images), apparently its primary output is in the form of image spreadsheets.
Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL), for more see Scheidegger et al.\citeappendix{scheidegger08}.
Since XML is a plain text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
However, even though XML is in plain text, it is very hard to read/edit without the VisTrails software (which is no longer maintained).
VisTrails, therefore, provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to
\ifdefined\separatesupplement
the data lineage figure of the main paper).
\else
Figure \ref{fig:datalineage}).
\fi
Besides the fact that it is no longer maintained, VisTrails did not control the software that is run; it only controlled the sequence of steps in which the software is run.
\subsection{Galaxy (2010)}
\label{appendix:galaxy}
Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench\citeappendix{goecks10}.
The main user interface is the ``Galaxy Pages'', which does not require any programming: users graphically manipulate abstract ``tools'' which are wrappers over command-line programs.
Therefore the actual running version of the program can be hard to control across different Galaxy servers.
Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too.
Examples include the very large cost of maintaining such a system, its dependence on a graphic environment, and the blending of hand-written code with automatically generated (large) files.
\subsection{Image Processing On Line journal, IPOL (2010)}
\label{appendix:ipol}
The IPOL journal\footnote{\inlinecode{\url{https://www.ipol.im}}}\citeappendix{limare11} (first published article in July 2010) publishes papers on image processing algorithms as well as the full code of the proposed algorithm.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed).
The author's own implementation of the algorithm is also published with the paper (in C, C++ or MATLAB/Octave and recently Python).
The code can only have a very limited set of external dependencies (with pre-defined versions), must be commented well enough, and must link each of its parts with the relevant part of the paper.
The authors must also submit several example datasets that show the applicability of their proposed algorithm.
The referee is expected to inspect the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
After publication, each paper also has a ``demo'' button on its web page, allowing readers to try the algorithm on a web-interface and even provide their own input.
IPOL has grown steadily over the last 10 years, publishing 23 research articles in 2019.
We encourage the reader to visit its web page and see some of its recent papers and their demos.
The reason it can be so thorough and complete is its very narrow scope (low-level image processing algorithms), where the published algorithms are highly atomic, not needing significant dependencies (beyond input/output of well-known formats), allowing the referees and readers to go deeply into each implemented algorithm.
However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, so while it is modular (a single module, doing a very specific thing) this solution is not scalable.
Furthermore, by not publishing/archiving each paper's version controlled history or directly linking the analysis and produced paper, it fails criteria 6 and 7.
Note that on the web page, it is possible to change parameters, but that will not affect the produced PDF.
A paper written in Maneage (the proof-of-concept solution presented in this paper) could be scrutinized at a similar detailed level to IPOL, but for much more complex research scenarios, involving hundreds of dependencies and complex processing of the data.
\subsection{WINGS (2010)}
\label{appendix:wings}
WINGS\footnote{\inlinecode{\url{https://wings-workflows.org}}}\citeappendix{gil10} is an automatic workflow generation algorithm.
It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images).
It allows users to define various workflow components (for example datasets, analysis components, etc), with high-level goals.
It then uses selection and rejection algorithms to find the best components using a pool of analysis components that can satisfy the requested high-level constraints.
%\tonote{Read more about this}
\subsection{Active Papers (2011)}
\label{appendix:activepapers}
Active Papers\footnote{\inlinecode{\url{http://www.activepapers.org}}} attempts to package the code and data of a project into one file (in HDF5 format).
It was initially written in Java because its compiled byte-code outputs in JVM are portable on any machine\citeappendix{hinsen11}.
However, Java is not a commonly used platform today, hence it was later implemented in Python\citeappendix{hinsen15}.
Dependence on high-level platforms (Java or Python) is therefore a fundamental issue.
In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file.
%However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.
As mentioned in Hinsen\citeappendix{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to merely open it.
Downloading the pre-built ``HDF View'' binaries (a GUI browser of HDF5 files that is provided by the HDF group) is not possible anonymously/automatically: as of January 2021 login is required\footnote{\inlinecode{\url{https://www.hdfgroup.org/downloads/hdfview}}} (this was not the case when Active Papers moved to HDF5).
% From K. Hinsen in a private email to M. Akhlaghi: This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5.
Installing HDF View using the Debian or Arch Linux package managers also failed due to dependencies in our trials.
Furthermore, like most high-level tools, the HDF5 library evolves very fast: on its webpage (from April 2021), it says ``Applications that were created with earlier HDF5 releases may not compile with 1.12 by default''.
While data and code are indeed fundamentally similar concepts technically\citeappendix{hinsen16}, they are used by humans differently.
The hand-written code of a large project involving terabytes of data can be 100 kilobytes.
When the two are bundled together in one remote file, merely seeing one line of the code requires downloading terabytes of data that are not needed; this was also acknowledged in Hinsen\citeappendix{hinsen15}.
It may also happen that the data are proprietary (for example medical patient data).
In such cases, the data must not be publicly released, but the methods that were applied to them can.
Furthermore, since all reading and writing is currently done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary files.
These files can later be removed as part of the analysis, but this makes the code more complicated and hard to read/maintain.
For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.
This is not a fundamental feature of the approach, but rather an effect of the initial implementation; future improvements are possible.
\subsection{Collage Authoring Environment (2011)}
\label{appendix:collage}
The Collage Authoring Environment\citeappendix{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge\citeappendix{gabriel11}.
It is based on the GridSpace2\footnote{\inlinecode{\url{http://dice.cyfronet.pl}}} distributed computing environment, which has a web-based graphic user interface.
Through this interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
In their Figure 3, they nicely visualize how the ``Executable Paper'' of Collage operates through two servers and a computing backend.
Unfortunately in the paper no webpage has been provided to follow up on the work and find its current status.
A web search only pointed us to its main paper\citeappendix{nowakowski11}.
In the paper, the authors do not discuss the major issue of software versioning and its verification to ensure that future updates to the backend do not affect the result; apparently it just assumes that the software exists on the ``Computing backend''.
We could not access or test it, but from the descriptions in the paper, it seems to be very similar to the modern day Jupyter notebook concept (see Appendix \ref{appendix:jupyter}), which had not yet been created in its current form in 2011.
So we expect similar longevity issues with Collage.
\subsection{SHARE (2011)}
\label{appendix:SHARE}
SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}}\citeappendix{vangorp11} is a web portal that hosts virtual machines (VMs) for storing the environment of a research project.
SHARE was awarded second place in the Elsevier Executable Paper Grand Challenge\citeappendix{gabriel11}.
Simply put, SHARE was just a VM library that users could download or connect to, and run.
The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements or standards on making the VM itself reproducible, or enforcing common internals for its supported projects.
As of January 2021, the top SHARE web page still works.
However, upon selecting any operation, a notice is printed that ``SHARE is offline'' since 2019 and the reason is not mentioned.
\subsection{Verifiable Computational Result, VCR (2011)}
\label{appendix:verifiableidentifier}
A ``verifiable computational result''\footnote{\inlinecode{\url{http://vcr.stanford.edu}}} is an output (table, figure, etc) that is associated with a ``verifiable result identifier'' (VRI), see\citeappendix{gavish11}.
It was awarded the third prize in the Elsevier Executable Paper Grand Challenge\citeappendix{gabriel11}.
A VRI is a hash that is created using tags within the programming source that produced that output, also recording its version control or history.
This enables the exact identification and citation of results.
The VRIs are automatically generated web-URLs that link to public VCR repositories containing the data, inputs, and scripts, that may be re-executed.
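The idea of deriving an identifier from the tagged source and its history can be sketched in Python as follows.
This is only an illustration and not the VCR implementation: the choice of SHA-256, the way the revision label is combined with the source, the repository address and both function names are assumptions.

```python
import hashlib

# Hypothetical repository address, used only for this illustration.
REPOSITORY_URL = "https://repository.example.org/vri/"


def result_identifier(tagged_source, revision):
    """Hash the tagged source code together with its revision label."""
    digest = hashlib.sha256()
    digest.update(revision.encode("utf-8"))
    digest.update(b"\0")  # separator between the two fields
    digest.update(tagged_source.encode("utf-8"))
    return digest.hexdigest()


def result_url(tagged_source, revision):
    """Build a URL from the identifier (the URL layout is an assumption)."""
    return REPOSITORY_URL + result_identifier(tagged_source, revision)


source = "plot(table)  # code between the tags of one result"
print(result_url(source, "revision-1"))
```

In this sketch, any change in the tagged code or in its revision label produces a different identifier, which is what allows a specific result to be cited exactly.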
According to Gavish \& Donoho\citeappendix{gavish11}, the VRI generation routine has been implemented in MATLAB, R, and Python, although only the MATLAB version was available on the webpage in January 2021.
VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
In effect this is very similar to what we have done at the end of the caption of
\ifdefined\separatesupplement
the first figure in the main body of the paper,
\else
Figure \ref{fig:datalineage},
\fi
where you can click on the given Zenodo link and be taken to the raw data that created the plot.
However, instead of a long and hard to read hash, we point to the plotted file's source as a Zenodo DOI (which has long-term funding for longevity).
Unfortunately, most parts of the web page are not complete as of January 2021.
The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} did not exist (again, as of January 2021).
Finally, the date of the files in the MATLAB extension tarball is set to May 2011, hinting that VCR was probably abandoned soon after the publication of Gavish \& Donoho\citeappendix{gavish11}.
\subsection{SOLE (2012)}
\label{appendix:sole}
SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper\citeappendix{pham12,malik13}.
An SO is any code/content that is wrapped in begin/end tags with an associated type and name.
For example, special commented lines in a Python, R, or C program.
The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI).
SOLE also supports workflows as Galaxy tools\citeappendix{goecks10}.
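
The tagging idea above can be sketched in Python as follows.
The comment syntax (\inlinecode{@begin} and \inlinecode{@end} markers followed by a type and a name) and the \inlinecode{so://} URI scheme are hypothetical and chosen only for this illustration; they are not SOLE's actual syntax.

```python
import re

# Hypothetical tag syntax: "# @begin <type> <name>" ... "# @end".
BEGIN = re.compile(r"^\s*#\s*@begin\s+(\w+)\s+(\w+)\s*$")
END = re.compile(r"^\s*#\s*@end\s*$")


def parse_science_objects(text, filename):
    """Collect every tagged block with its type, name, body, and a URI.

    A block that is opened but never closed is silently dropped.
    """
    objects = []
    current = None
    for line in text.splitlines():
        begin = BEGIN.match(line)
        if begin and current is None:
            so_type, name = begin.groups()
            current = {"type": so_type, "name": name, "lines": []}
        elif END.match(line) and current is not None:
            objects.append({
                "type": current["type"],
                "name": current["name"],
                "body": "\n".join(current["lines"]),
                # Illustrative URI only.
                "uri": f"so://{filename}/{current['type']}/{current['name']}",
            })
            current = None
        elif current is not None:
            current["lines"].append(line)
    return objects
```

Only the code that the author has explicitly wrapped in tags is recorded; everything else in the file is invisible to the parser.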
For reproducibility, Pham et al. \citeappendix{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat, because it is prone to human error and incompleteness (the authors may not tag components that they do not consider important, although those components may be useful later).
In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text through \LaTeX{} macros that are the backbone of the whole system (they are not artificial/extra features).
\subsection{Sumatra (2012)}
Sumatra\footnote{\inlinecode{\url{http://neuralensemble.org/sumatra}}}\citeappendix{davison12} attempts to capture the environment information of a running project.
It is written in Python and is a command-line wrapper over the analysis script.
By controlling a project at run-time, Sumatra is able to capture the environment it was run in.
The captured environment can be viewed in plain text or a web interface.
Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).
The actual code that Sumatra wraps around must itself be under version control, and Sumatra does not run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
Therefore Sumatra makes no attempt at storing the environment of the analysis itself, as Sciunit does (see Appendix \ref{appendix:sciunit}); it only stores information about that environment.
Sumatra thus needs to know the language of the running program and is not generic.
It just captures the environment; it does not store \emph{how} that environment was built.
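
A minimal Python sketch of this kind of environment capture and change detection is shown below.
It is not Sumatra's implementation: the recorded fields and the function names are assumptions made for illustration, and only the Python standard library is used.

```python
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Record interpreter, platform, and versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "python": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
        "packages": versions,
    }


def diff_environments(old, new):
    """Return the sorted top-level keys whose recorded values differ."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

Comparing the record of a previous run with that of the current run reveals that something has changed, but the record contains no instructions for rebuilding either environment.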
\subsection{Research Object (2013)}
\label{appendix:researchobject}
The Research Object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is a collection of metadata ontologies to describe the aggregation of resources, or workflows\citeappendix{bechhofer13,belhajjame15}.
It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.
Bechhofer et al. \citeappendix{bechhofer13} describe how a workflow in Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
The important point is that the research object concept is not specific to any particular workflow: it is just a metadata bundle/standard, which is only as robust in reproducing the result as the running workflow.
Therefore if implemented over a complete workflow like Maneage, it can be very useful in analysing/optimizing the workflow, finding common components between many Maneage'd workflows, or translating to other complete workflows.
\subsection{Sciunit (2015)}
\label{appendix:sciunit}
Sciunit\footnote{\inlinecode{\url{https://sciunit.run}}}\citeappendix{meng15} defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
It automatically parses all the executable files in the script and copies them, and their dependency libraries (down to the C library), into the sciunit.
Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
Sciunit was originally written in Python 2 (which reached its end-of-life on January 1st, 2020).
Therefore Sciunit2 is a new implementation in Python 3.
The main issue with Sciunit's approach is that the copied binaries are just black boxes: it is not possible to see how the used binaries from the initial system were built.
This is a major problem for scientific projects: in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of the analysis requires a lot of storage space and archival cost).
\subsection{Umbrella (2015)}
Umbrella\citeappendix{meng15b} is a high-level wrapper script for isolating the environment of the analysis.
The user specifies the necessary operating system, and necessary packages for the analysis steps in various JSON files.
Umbrella will then study the host operating system and the various necessary inputs (including data and software) through a process similar to Sciunit's (mentioned above) to find the best environment isolator (maybe using Linux containerization, containers, or VMs).
We could not find a URL to the source software of Umbrella (no source code repository is mentioned in the papers we reviewed above), but from the descriptions\citeappendix{meng17}, it is written in Python 2.6 (which is now deprecated).
\subsection{ReproZip (2016)}
ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}}\citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables of a process into a single bundle.
The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
ReproZip is therefore very good for storing a ``snapshot'' of the running environment, at a single moment, into a single file.
However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies).
Furthermore, since the binary software libraries are directly copied, it can only be re-run on a system with a compatible CPU architecture.
Another problem is that ReproZip copies all files used in a project, without (by default) a way of knowing how the software was built (its provenance).
As mentioned in this paper, and also Oliveira et al. \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical to understanding the results; having only the binaries is not useful in many contexts.
It is possible to include the build instructions of the software used within the project to be ReproZip'd, but this risks bloating the bundle with the many temporary files that are created during the build of the software, adding complexity and slowing down the project's running time.
For the data, it is similarly not possible to extract which data server they came from.
Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
Such files can be excluded from the bundle through modifications in the configuration file.
However, this will add complexity: a higher-level script will be necessary with the ReproZip bundle, to make sure that the data and bundle are used together, or to check the integrity of the data (in case they have changed).
Finally, because it is only a snapshot of one moment in a project's history, preserving the connection between the ReproZip'd bundles of various points in a project's history is likely to be difficult (for example, when software or data are updated, or when analysis methods are modified).
In other words, a ReproZip user will have to personally define an archival method to preserve the various black boxes of the project as it evolves, and tracking what has changed between the versions is not trivial.
\subsection{Binder (2017)}
Binder\footnote{\inlinecode{\url{https://mybinder.org}}} is used to containerize existing Jupyter-based processing steps.
Users simply add a set of Binder-recognized configuration files to their repository and Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
One good feature of Binder is that the imported Docker image must be tagged, although as mentioned in Appendix \ref{appendix:containers}, tags do not ensure reproducibility.
However, it does not ensure that the Dockerfile used by the imported Docker image also follows a similar convention.
So users can simply use generic operating system names.
Binder is used by Jones et al.\citeappendix{jones19}.
\subsection{Gigantum (2017)}
%% I took the date from their PyPI page, where the first version 0.1 was published in November 2016.
Gigantum\footnote{\inlinecode{\url{https://gigantum.com}}} is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image.
Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
The reproducibility issues with these tools have been thoroughly discussed in Appendix \ref{appendix:existingtools}.
Simply put, it is a high-level wrapper for combining these components.
Internally, a Gigantum project is organized as files in a directory that can be opened without their own client.
The file structure (which is under version control) includes codes, input data, and output data.
As acknowledged on their own web page, this greatly reduces the speed of Git operations, transmitting, or archiving the project.
Therefore there are limits on the dataset/code sizes.
However, there is one directory that can be used to store files that must not be tracked.
\subsection{Popper (2017)}
\label{appendix:popper}
Popper\footnote{\inlinecode{\url{https://getpopper.io}}} is a software implementation of the Popper Convention\citeappendix{jimenez17}.
The Popper team's own solution is through a command-line program called \inlinecode{popper}.
The \inlinecode{popper} program itself is written in Python.
However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows at that time.
From October 2019, GitHub changed to a custom YAML-based language, so Popper also deprecated HCL.
This is an important issue when low-level choices are based on service providers (see Appendix \ref{appendix:highlevelinworkflow}).
To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
By default, Popper runs in a Docker image (so root permissions are necessary and reproducibility issues with Docker images have been discussed above), but Singularity is also supported.
See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
Popper does not comply with the completeness, minimal complexity, and including-the-narrative criteria.
Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control.
Hence, tracking future low-level changes in Popper and how they relate to the high-level projects that depend on it through the scaffold will be very hard.
In Maneage, users start their projects by branching off the core \inlinecode{maneage} git branch.
Hence any future change in the low level features will be directly propagated to all derived projects (and will appear prominently as Git conflicts if the user has customized them).
\subsection{Whole Tale (2017)}
\label{appendix:wholetale}
Whole Tale\footnote{\inlinecode{\url{https://wholetale.org}}} is a web-based platform for managing a project and organizing data provenance\citeappendix{brinckman17}.
It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).
The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) is a major limitation for future reproducibility.
For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
This was reported to the authors as issue 113\footnote{\inlinecode{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}} and fixed.
But as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems:
1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands;
2) it is not possible to see accurately how the environment was built (when the Dockerfile uses operating system package managers like \inlinecode{apt}).
This issue with Whole Tale (and generally all other solutions that only rely on preserving a container/VM) was also mentioned in Oliveira et al.\citeappendix{oliveira18}; for more on this, please see Appendix \ref{appendix:packagemanagement}.
\subsection{Occam (2018)}
\label{appendix:occam}
Occam\footnote{\inlinecode{\url{https://occam.cs.pitt.edu}}}\citeappendix{oliveira18} is a web-based application to preserve software and its execution.
To achieve long-term reproducibility, Occam includes its own package manager (instructions to build software and its dependencies) in order to be in full control of the software build instructions, similarly to Maneage.
Besides Nix or Guix (which are primarily package managers that can also do job management), Occam is the only solution in our survey that attempts to be complete in this aspect.
However, it is incomplete from the perspective of requirements: it works within a Docker image (that requires root permissions) and currently only runs on Debian-based, Red Hat-based, and Arch-based GNU/Linux operating systems that respectively use the \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pacman} package managers.
It is also itself written in Python (version 3.4 or above).
Furthermore, it does not satisfy the minimal complexity criterion, because the instructions to build the software packages and their versions are not immediately viewable or modifiable by the user.
Occam contains its own JSON database that should be parsed by Occam's own custom program.
The analysis phase of Occam is through a drag-and-drop interface (similar to Taverna, Appendix \ref{appendix:taverna}), which is provided as a web-based graphic user interface.
All the connections between the various phases of the analysis need to be pre-defined in a JSON file and manually linked in the GUI.
Hence, for complex data analysis operations that involve thousands of steps, this is not scalable.