%% Appendix on reviewing existing low-level tools that are used in
%% high-level reproducible workflow solutions. This file is loaded by the
%% project's 'paper.tex' or 'tex/src/supplement.tex', it should not be run
%% independently.
%
%% Copyright (C) 2020-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org>
%% Copyright (C) 2021 Raúl Infante-Sainz <infantesainz@gmail.com>
%% Copyright (C) 2021 Boudewijn F. Roukema <boud@astro.uni.torun.pl>
%
%% This file is free software: you can redistribute it and/or modify it
%% under the terms of the GNU General Public License as published by the
%% Free Software Foundation, either version 3 of the License, or (at your
%% option) any later version.
%
%% This file is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details. See <http://www.gnu.org/licenses/>.





\section{Survey of existing tools for various phases}
\label{appendix:existingtools}
Data analysis workflows (including those that aim for reproducibility) are commonly high-level frameworks that employ various lower-level components.
To help in reviewing existing reproducible workflow solutions in light of the proposed criteria in Appendix \ref{appendix:existingsolutions}, we first need to survey the most commonly employed lower-level tools.





\subsection{Independent environment}
\label{appendix:independentenvironment}
The lowest-level challenge of any reproducible solution is to avoid, to a desired level, the differences between various run-time environments; for example, different hardware, operating systems, or versions of existing dependencies.
Therefore, any reasonable attempt at providing a reproducible workflow starts with isolating its running environment from the host environment.
Three general technologies are used for this purpose and reviewed below:
1) Virtual machines,
2) Containers,
3) Independent build in the host's file system.

\subsubsection{Virtual machines}
\label{appendix:virtualmachines}
Virtual machines (VMs) host a binary copy of a full operating system that can be run on other operating systems.
This includes the lowest-level operating system component or the kernel.
VMs thus provide the ultimate control one can have over the run-time environment of the analysis.
However, the VM's kernel does not talk directly to the hardware that is running the analysis; it talks to a simulated hardware layer that is provided by the host's kernel.
Therefore, a process that is run inside a virtual machine can be much slower than one that is run on a native kernel.
An advantage of VMs is that they are stored as a single file that can be copied from one computer to another, carrying the full environment within them, provided the format is recognized.
VMs are used by cloud service providers, enabling fully independent operating systems on their large servers where the customer can have root access.

VMs were used in solutions like SHARE\citeappendix{vangorp11} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011\citeappendix{gabriel11}), or in some suggested reproducible papers\citeappendix{dolfi14}.
However, due to their very large size, VMs are expensive to maintain, which led SHARE to discontinue its services in 2019.
The URL to the VM file \texttt{provenance\_machine.ova} that is mentioned in Dolfi et al.\citeappendix{dolfi14} is also not currently accessible (we suspect that this is due to size and archival costs).

\subsubsection{Containers}
\label{appendix:containers}
Containers also host a binary copy of a running environment but do not have their own kernel.
Through a thin layer of low-level system libraries, programs running within a container talk directly with the host operating system kernel.
Otherwise, containers have their own independent software for everything else.
Therefore, they have much less overhead in hardware/CPU access.
As with VMs, users usually choose an operating system for the container's independent environment (most commonly a GNU/Linux distribution, which is free software).

We review some of the most common container solutions: Docker, Singularity, and Podman.

\begin{itemize}
\item {\bf\small Docker containers:} Docker is one of the most popular tools nowadays for keeping an independent analysis environment.
  It is primarily driven by the need of software developers to reproduce a previous environment, where they have root access, mostly on the ``cloud'' (which is usually a remote VM).
  A Docker container is composed of independent Docker ``images'' that are built with a \inlinecode{Dockerfile}.
  It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build).
  For a Docker image to be reproducible, it must be ensured that all the imported Docker images have their dependency tags fixed, down to the initial image which contains the C library.

  An important drawback of Docker for high-performance scientific needs is that it runs as a daemon (a program that is always running in the background) with root permissions.
  This is a major security flaw that discourages many high-performance computing (HPC) facilities from providing it.

\item {\bf\small Singularity:} Singularity\citeappendix{kurtzer17} is a single-image container (unlike Docker, which is composed of modular/independent images).
  Although it needs root permissions to be installed on the system (once), it does not require root permissions every time it is run.
  Its main program is also not a daemon, but a normal program that can be stopped.
  These features make it much safer for HPC administrators to install compared to Docker.
  However, the fact that it requires root access for the initial install is still a hindrance for a typical project: if Singularity is not already present on the HPC, the user's science project cannot be run by a non-root user.

\item {\bf\small Podman:} Podman uses the Linux kernel containerization features to enable containers without a daemon, and without root permissions.
  It has a command-line interface very similar to Docker, but only works on GNU/Linux operating systems.
\end{itemize}

Generally, VMs or containers are good solutions for reproducibly running/repeating an analysis in the short term (a couple of years).
However, their focus is to store the already-built (binary, non-human readable) software environment.
Because of this, they will be large (many gigabytes) and expensive to archive, download, or access.
Recall the two examples above for VMs in Section \ref{appendix:virtualmachines}; the same holds for Docker images, as is clear from Docker Hub's recent decision to move to a consumption-based payment model.
Meng \& Thain\citeappendix{meng17} also give similar reasons why Docker images were not suitable in their trials.

On a more fundamental level, VMs or containers do not store \emph{how} the core environment was built.
This information is usually in a third-party repository, and not necessarily inside the container or VM file, making it hard (if not impossible) to track for future users.
This is a major problem in relation to the proposed completeness criteria and is also highlighted as an issue in terms of long term reproducibility by Oliveira et al.\citeappendix{oliveira18}.

The \inlinecode{Dockerfile} example of Mesnard \& Barba\cite{mesnard20} was previously mentioned in
\ifdefined\separatesupplement
the main body of this paper, when discussing the criteria.
\else
Section \ref{criteria}.
\fi
Another useful example is the \inlinecode{Dockerfile}\footnote{\inlinecode{\href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{https://github.com/benmarwick/1989-excavation-report-}\\\href{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}{Madjedbebe/blob/master/Dockerfile}}} of Clarkson et al.\citeappendix{clarkso15} (published in June 2015) which starts with \inlinecode{FROM rocker/verse:3.3.2}.
When we tried to build it (November 2020), we noticed that the core downloaded image (\inlinecode{rocker/verse:3.3.2}, with image ``digest'' \inlinecode{sha256:c136fb0dbab...}) was created in October 2018 (long after the publication of that paper).
In principle, it is possible to investigate the difference between this new image and the old one that the authors used, but that would require a lot of effort and may not be possible when the changes are not available in a third-party public repository or are not under version control.
In Docker, it is possible to retrieve the precise Docker image with its digest, for example, \inlinecode{FROM ubuntu:16.04@sha256:XXXXXXX} (where \inlinecode{XXXXXXX} is the digest, uniquely identifying the core image to be used), but we have not seen this often done in existing examples of ``reproducible'' \inlinecode{Dockerfiles}.
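As a minimal sketch of this practice (assuming an \inlinecode{ubuntu:16.04} image has already been pulled locally; the commands and output format may differ between Docker versions, and the digest shown is purely illustrative), the digest can be queried with standard Docker commands and then copied into the \inlinecode{FROM} line:

\begin{verbatim}
# Query the content-addressed digest of a locally available image
# (the image/tag is only an example).
docker inspect --format='{{index .RepoDigests 0}}' ubuntu:16.04

# Pin the base image in the Dockerfile with that digest, e.g.:
#   FROM ubuntu:16.04@sha256:<digest-printed-above>
\end{verbatim}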

The ``digest'' is specific to Docker repositories.
A more generic/long-term approach to ensure identical core OS components at a later time is to construct the containers or VMs with fixed/archived versions of the operating system ISO files.
ISO files are pre-built binary files, hundreds of megabytes in volume, that do not contain their own build instructions.
For example, the archives of Debian\footnote{\inlinecode{\url{https://cdimage.debian.org/mirror/cdimage/archive/}}} or Ubuntu\footnote{\inlinecode{\url{http://old-releases.ubuntu.com/releases}}} provide older ISO files.

The concept of containers (and the independent images that build them) can also be extended beyond just the software environment.
For example, Lofstead et al.\citeappendix{lofstead19} propose a ``data pallet'' concept to containerize access to data and thus allow tracing data back to the application that produced them.

In summary, containers or VMs are just a built product themselves.
If they are built properly (for example, building a Maneage'd project inside a Docker container), they can be useful for immediate usage and for quickly moving the project from one system to another.
With a robust build process, the container or VM can also be exactly reproduced later.
However, attempting to archive the actual binary container or VM files as a black box (not knowing the precise versions of the software in them, and \emph{how} they were built) is expensive, and will not be able to answer the most fundamental questions.

\subsubsection{Independent build in host's file system}
\label{appendix:independentbuild}
The virtual machine and container solutions mentioned above have their own independent file system.
Another approach to having an isolated analysis environment is to use the same file system as the host, but installing the project's software in a non-standard, project-specific directory that does not interfere with the host.
Because the environment in this approach can be built in any custom location on the host, this solution generally does not require root permissions or extra low-level layers like containers or VMs.
However, ``moving'' the built product of such solutions from one computer to another is generally not as trivial as it is with containers or VMs.
Examples of such third-party package managers (that are detached from the host OS's package manager) include (but are not limited to) Nix, GNU Guix, Python's Virtualenv package, and Conda.
Because they are highly intertwined with the way software is built and installed, third-party package managers are described in more detail as part of Section \ref{appendix:packagemanagement}.

Maneage (the solution proposed in this paper) also follows a similar approach of building and installing its own software environment within the host's file system, but without depending on it beyond the kernel.
However, unlike the third-party package managers mentioned above, Maneage'd software management is not detached from the specific research/analysis project: the instructions to build the full isolated software environment are maintained with the high-level analysis steps of the project and with the narrative paper/report of the project.
This is fundamental to achieving the completeness criterion.





\subsection{Package management}
\label{appendix:packagemanagement}
Package management is the process of automating the build and installation of a software environment.
A package manager thus contains the following information on each software package, which it can act upon automatically: the URL of the software's tarball, the other software that it possibly depends on, and how to configure and build it.
Package managers can be tied to specific operating systems at a very low level (like \inlinecode{apt} in Debian-based OSs).
Alternatively, there are third-party package managers that can be installed on many OSs.
Both are discussed in more detail below.

Package managers are the second component in any workflow that relies on containers or VMs for an independent environment, and the starting point in others that use the host's file system (as discussed above in Section \ref{appendix:independentenvironment}).
In this section, some common package managers are reviewed, in particular those that are most used by the reviewed reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
For a more comprehensive list of existing package managers, see Wikipedia\footnote{\inlinecode{\href{https://en.wikipedia.org/wiki/List\_of\_software\_package\_management\_systems}{https://en.wikipedia.org/wiki/List\_of\_software\_package\_}\\\href{https://en.wikipedia.org/wiki/List\_of\_software\_package\_management\_systems}{management\_systems}}}.
Note that we are not including package managers that are specific to one language, for example \inlinecode{pip} (for Python) or \inlinecode{tlmgr} (for \LaTeX).

\subsubsection{Operating system's package manager}
The most commonly used package managers are those of the host operating system, for example, \inlinecode{apt}, \inlinecode{yum} or \inlinecode{pkg} which are respectively used in Debian-based, Red Hat-based and FreeBSD-based OSs (among many other OSs).

These package managers are tightly intertwined with the operating system: they also include the building and updating of the core kernel and the C library.
Because they are part of the OS, they also commonly require root permissions.
Also, it is usually only possible to have one version/configuration of a given software at any moment, and downgrading a version for one project may conflict with other projects, or even cause problems in the OS itself.
Hence, if two projects need different versions of a software package, it is not possible to work on both at the same time in the same OS.

When a container or virtual machine (see Appendix \ref{appendix:independentenvironment}) is used for each project, it is common for projects to use the containerized operating system's package manager.
However, it is important to remember that operating system package managers are not static: software is updated on their servers.
Hence, simply running \inlinecode{apt install gcc} will install different versions of the GNU Compiler Collection (GCC), depending on the version of the OS and on when the command is run.
Requesting a specific version of that software does not fully address the problem, because the package manager also downloads and installs its dependencies.
Hence a fixed version of the dependencies must also be specified.
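For example (a hedged sketch; the version string below is hypothetical and will differ between Debian/Ubuntu releases), APT does allow requesting an exact version of a package, but the dependencies are still resolved against whatever the repositories serve at that moment:

\begin{verbatim}
# Show the candidate versions currently offered by the repositories.
apt-cache policy gcc

# Request one specific version (version string is illustrative);
# dependencies are still pulled from the current repository state.
sudo apt-get install gcc=4:10.2.1-1
\end{verbatim}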

In robust package managers like Debian's \inlinecode{apt}, it is possible to fully control (and later reproduce) the built environment of high-level software.
Debian also archives all packaged high-level software in its Snapshot\footnote{\inlinecode{\url{https://snapshot.debian.org/}}} service (going back to 2005), which can be used to build a higher-level software environment on an older OS\citeappendix{aissi20}.
Therefore it is indeed theoretically possible to reproduce the software environment using only archived operating systems and their own package managers; unfortunately, however, we have not seen this practiced in (reproducible) scientific papers/projects.
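As a rough sketch of how this could be done (the snapshot timestamp and suite name below are only illustrative, and the exact APT options may differ between versions), APT can be pointed at a fixed snapshot of the Debian archive so that the same package versions are served at a later date:

\begin{verbatim}
# Add a fixed snapshot of the Debian archive as a package source
# (timestamp and suite are illustrative).
echo "deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20210101T000000Z buster main" \
  | sudo tee /etc/apt/sources.list.d/snapshot.list
sudo apt-get update
sudo apt-get install gcc
\end{verbatim}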

In summary, the host OS package managers are primarily meant for the low-level operating system components.
Hence, many robust reproducible analysis workflows (reviewed in Appendix \ref{appendix:existingsolutions}) do not use the host's package manager, but an independent package manager, like the ones discussed below.

\subsubsection{Blind packaging of already built software}
An already-built software binary contains links to the system libraries it uses.
Therefore, one way of packaging software is to inspect the binary for the libraries it uses and to bundle them into a single file together with the executable, so that the same set of dependencies is moved around with the desired software on different systems.
Tools like AppImage\footnote{\inlinecode{\url{https://appimage.org}}}, Flatpak\footnote{\inlinecode{\url{https://flatpak.org}}} or Snap\footnote{\inlinecode{\url{https://snapcraft.io}}} are designed for this purpose: the software's binary product and all its dependencies (not including the core C library) are packaged into one file.
This makes it very easy to move that single software's built product and already built dependencies to different systems.
However, because the C library is not included, it can fail on newer/older systems (depending on the system it was built on).
We call this method ``blind'' packaging because it is agnostic to \emph{how} the software and its dependencies were built (which is important in a scientific context).
Moreover, these types of packagers are designed for the Linux kernel (using its containerization and unique mounting features).
They can therefore only be run on GNU/Linux operating systems.

\subsubsection{Nix or GNU Guix}
\label{appendix:nixguix}
Nix\footnote{\inlinecode{\url{https://nixos.org}}}\citeappendix{dolstra04} and GNU Guix\footnote{\inlinecode{\url{https://guix.gnu.org}}}\citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
GNU Guix is based on the same principles as Nix but is implemented differently, so we focus the review here on Nix.

The Nix approach to package management is unique in that it allows exact tracking of all the dependencies, and allows multiple versions of a software to co-exist; for more details, see Dolstra et al.\citeappendix{dolstra04}.
In summary, a unique hash is created from all the components that go into the building of the package (including the instructions on how to build the software).
That hash is then prefixed to the software's installation directory.
As an example from Dolstra et al.\citeappendix{dolstra04}: if a certain build of GNU C Library 2.3.2 has a hash of \inlinecode{8d013ea878d0}, then it is installed under \inlinecode{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that is compiled with it (and thus need it to run) will link to this unique address.
This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
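As a minimal sketch (the exact commands may differ between Nix versions and setups, and the store hash is illustrative), installing a package and resolving the path of its binary shows the hash-prefixed store directory:

\begin{verbatim}
# Install GNU Hello from the nixpkgs channel into the user profile.
nix-env -iA nixpkgs.hello

# Resolving the installed binary shows the hash-prefixed store path,
# e.g. /nix/store/<hash>-hello-2.10/bin/hello (hash is illustrative).
readlink -f "$(command -v hello)"
\end{verbatim}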

As mentioned in Court{\'e}s \& Wurmus\citeappendix{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges (failing our completeness criterion).
This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
This is because the focus of Nix or Guix is to create bit-wise reproducible software binaries, which is necessary from the security or development perspectives.
However, in a non-computer-science analysis (for example, in the natural sciences), the main aim is reproducible \emph{results}, which can also be obtained with the same software version even when the binaries are not bit-wise identical (for example, when they are installed in other locations, because the installation location is hard-coded in the software binary, or when they are built for a different CPU architecture).

Finally, while Guix and Nix do allow precisely reproducible environments, the inherent detachment from the high-level computational project (that uses the environment) requires extra effort to keep track of the changes in dependencies as the project evolves.
For example, if users simply run \inlinecode{guix install gcc} (the most common way to install new software), the most recent version of GCC will be installed.
But this will differ at different dates and on different systems, with no record of previous runs.
It is therefore up to the user to store the used Guix commit in their high-level computation and ensure ``Reproducing a reproducible computation''\footnote{A guide/tutorial on storing the Guix environment:\\\inlinecode{\url{https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix}}}.
Similar to the Docker digest codes mentioned in Appendix \ref{appendix:containers}, many users may not know about this, may forget it, or may simply ignore it.
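As a hedged sketch of this practice (package and file names are illustrative), the exact Guix state can be recorded with \inlinecode{guix describe} and later replayed with \inlinecode{guix time-machine}:

\begin{verbatim}
# Record the exact Guix commit/channels used by the project.
guix describe --format=channels > channels.scm

# Later (or on another machine), install software with exactly
# that Guix state, instead of whatever is current.
guix time-machine -C channels.scm -- install gcc-toolchain
\end{verbatim}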

Generally, this is a common issue with relying on detached (third party) package managers for building a high-level computational project's software (including other tools mentioned below).
We solved this problem in Maneage by including the low-level package manager and the high-level computation in a single project with a single version-controlled history: it is simply not possible to forget to record the exact versions of the software used (or how they change as the project evolves).

\subsubsection{Conda/Anaconda}
\label{appendix:conda}
Conda is an independent package manager that can be used on GNU/Linux, macOS, or Windows operating systems, although not all software packages are available for all operating systems.
Conda is able to maintain an approximately independent environment on an operating system without requiring root access.

Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
However, it is not possible to fix the versions of the dependencies through the YAML files alone.
This is thoroughly discussed under issue 787 (in May 2019) of \inlinecode{conda-forge}\footnote{\inlinecode{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}}.
In that GitHub discussion, the authors of Uhse et al.\citeappendix{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by Gr\"uning et al.\citeappendix{gruning18}.
However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem; they do not fix it: containers are not static and need to evolve (i.e., get re-built) with the project.
Given these limitations, Uhse et al.\citeappendix{uhse19} were forced to host their Conda-packaged software as tarballs on a separate repository.
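A partial mitigation (a sketch, assuming Conda is already installed; the channel contents may still disappear, as discussed above) is to export the exact versions and build strings of an environment that is known to work, instead of relying only on the loose YAML specification:

\begin{verbatim}
# Export the environment with exact versions and build strings.
conda env export > environment-exact.yml

# Attempt to re-create it later; this still fails if any of the
# listed builds has been removed from the channels.
conda env create -f environment-exact.yml
\end{verbatim}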

Conda is installed with a shell script that contains a binary blob (over 500 megabytes, embedded in the shell script).
This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
After installing Conda in any location, users can easily activate that environment by loading a special shell script.
However, the resulting environment is not fully independent of the host operating system as described below:

\begin{itemize}
\item The Conda installation directory is present at the start of environment variables like \inlinecode{PATH} (which is used to find programs to run) and other such environment variables.
  However, the host operating system's directories are also appended afterward.
  Therefore, a user or script may not notice that the software being used is actually coming from the operating system, and not from the controlled Conda installation (see the sketch after this list).

\item Generally, by default, Conda relies heavily on the operating system and does not include core commands like \inlinecode{mkdir} (to make a directory), \inlinecode{ls} (to list files) or \inlinecode{cp} (to copy).
  Although a minimal functionality is defined for them in POSIX, and they generally behave similarly for basic operations on different Unix-like operating systems, they do have their differences.
  For example, \inlinecode{mkdir -p} is a common way to build directories, but this option is only available with the \inlinecode{mkdir} of GNU Coreutils (default on GNU/Linux systems and installable in almost all Unix-like OSs).
  Running the same command within a Conda environment that does not include GNU Coreutils on macOS would crash.
  Important packages like GNU Coreutils are available in channels like conda-forge, but they are not installed by default.
  Therefore, many users may not recognize this, and failing to account for it will cause unexpected crashes when the project is run on a new system.

\item Many major Conda packaging ``channels'' (for example, the core Anaconda channel, or the very popular conda-forge channel) do not include the C library that a package was built with as a dependency.
  They rely on the host operating system's C library.
  C is the core language of modern operating systems and even higher-level languages like Python or R are written in it, and need it to run.
  Therefore if the host operating system's C library is different from the C library that a package was built with, a Conda-packaged program will crash and the project will not be executable.
  Theoretically, it is possible to define a new Conda ``channel'' which includes the C library as a dependency of its software packages, but in practice it would take too much time for any individual team to implement all their necessary packages, up to their high-level science software.

\item Conda does allow a package to depend on a special build of its prerequisites (specified by a checksum, fixing its version and the version of its dependencies).
  However, this is rarely practiced in the main Git repositories of channels like Anaconda and conda-forge: only the names of the high-level prerequisite packages are listed in a package's \inlinecode{meta.yaml} file, which is version-controlled.
  Therefore two builds of the package from the same Git repository will result in different tarballs (depending on what prerequisites were present at build time).
  In Conda's downloaded tarball (that contains the built binaries and is not under version control) the exact versions of most build-time dependencies are listed.
  However, because the different software of one project may have been built at different times, if they depend on different versions of a single software there will be a conflict and the tarball cannot be rebuilt, or the project cannot be run.
\end{itemize}
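As a minimal illustration of the first two points above (a sketch; outputs and paths differ between systems), a user can check which directory a given command actually comes from after activating a Conda environment:

\begin{verbatim}
# After 'conda activate', PATH starts with the Conda prefix, but the
# host directories are still appended afterwards.
echo "$PATH"

# 'command -v' reveals whether a program comes from the Conda
# environment or silently falls back to the host OS (for example,
# /usr/bin/mkdir instead of $CONDA_PREFIX/bin/mkdir).
command -v mkdir
command -v python
\end{verbatim}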

As reviewed above, the low-level dependence of Conda on the host operating system's components and on build-time conditions is the primary reason that it is very fast to install (thus making it an attractive tool for software developers who just need to reproduce a bug in a few minutes).
However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability, or usability are important. % alternative to `archivability`?

\subsubsection{Spack}
Spack\citeappendix{gamblin15} is a package manager that is also influenced by Nix (similar to GNU Guix).
But unlike Nix or GNU Guix, it does not aim for full, bitwise reproducibility and can be built without root access in any generic location.
It relies on the host operating system for the C library.

Spack is fully written in Python, where each software package is an instance of a class, which defines how it should be downloaded, configured, built, and installed.
Therefore, if the proper version of Python is not present, Spack cannot be used, and when incompatibilities arise in future versions of Python (similar to how Python 3 is not compatible with Python 2), the software building recipes, or the whole system, have to be upgraded.
Because of such bootstrapping problems (for example how Spack needs Python to build Python and other software), it is generally a good practice to use simpler, lower-level languages/systems for a low-level operation like package management.

In conclusion, there are two common issues regarding generic package managers that hinder their usage for high-level scientific projects:

\begin{itemize}
\item {\bf\small Pre-compiled/binary downloads:} Most package managers primarily download the software in a binary (pre-compiled) format.
  This allows users to download it very fast and almost instantaneously be able to run it.
  However, to provide for this, servers need to keep binary files for each build of the software on different operating systems (for example Conda needs to keep binaries for Windows, macOS and GNU/Linux operating systems).
  It is also necessary for them to store binaries for each build, which includes different versions of its dependencies.
  Maintaining such a large binary library is expensive, therefore once the shelf-life of a binary has expired, it will be removed, causing problems for projects that depend on them.

\item {\bf\small Adding high-level software:} Packaging new software is not trivial and needs a good level of knowledge/experience with that package manager.
  For example, each one has its own special syntax/standards/languages, with pre-defined variables that must already be known before someone can package new software for them.
  However, in many research projects, the most high-level analysis software is written by the team that is doing the research, and they are its primary/only users, even when the software is distributed with free licenses on open repositories.

  Although active package manager members are commonly very supportive in helping to package new software, many teams may not be able to make that extra effort and time investment to package their most high-level (i.e., relevant) software in it.
  As a result, they manually install their high-level software in an uncontrolled, or non-standard way, thus jeopardizing the reproducibility of the whole work.
  This is another consequence of the detachment of the package manager from the project doing the analysis.
\end{itemize}

Addressing these issues has been the basic reason behind Maneage: based on the completeness criterion, instructions to download and build the packages are included within the actual science project, and no special/new syntax/language is used.
Software download, build, and installation are done with the same language/syntax with which researchers manage their research: using the shell (by default GNU Bash in Maneage) for low-level steps and Make (by default GNU Make in Maneage) for job management.





\subsection{Version control}
\label{appendix:versioncontrol}
A scientific project is not written in a day; it usually takes more than a year.
During this time, the project evolves significantly from its starting point, and components are added or updated constantly as it approaches completion.
Combined with the complexity of modern computational projects, it is not trivial to manually track this evolution and its effect on the final output: files produced in one stage of the project can mistakenly be used by an evolved analysis environment in later stages.

Furthermore, scientific projects do not progress linearly: earlier stages of the analysis are often modified after later stages are written.
This is a natural consequence of the scientific method, where progress is defined by experimentation and the modification of hypotheses (based on the results of earlier phases).

It is thus very important for the integrity of a scientific project that the state/version of its processing is recorded as the project evolves (for example, as better methods are found or more data arrive).
Any intermediate dataset that is produced should also be tagged with the version of the project at the time it was created.
In this way, later processing stages can make sure that the dataset can safely be used, i.e., that no change has been made in its processing steps.

Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
There are many existing version control solutions, for example, CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
However, currently, Git is by far the most commonly used in individual projects, such that Software Heritage\citeappendix{dicosmo18} (an archival system aiming for long term preservation of software) is also modeled on Git.
Git is also the foundation upon which this paper's proof of concept (Maneage) is built.
Hence we will just review Git here, but the general concept of version control is the same in all implementations.

\subsubsection{Git}
With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the purpose/goals of the changes), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other, as in
\ifdefined\separatesupplement
the figure on Git in the main body of the paper).
\else
Figure \ref{fig:branching}).
\fi
For example \inlinecode{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b} is a commit identifier in the Git history of this project.
Through the content-based storage concept, similar hash structures can be used to identify data\citeappendix{hinsen20}.
Git commits are commonly summarized by the checksum's first few characters, for example, \inlinecode{f4953cc} of the example above.
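As a minimal sketch of this workflow (file names and the printed hashes are illustrative), a change is committed and later identified by its checksum:

\begin{verbatim}
# Stage the changed file(s) and record them in the history with a
# descriptive message; Git computes the commit checksum itself.
git add paper.tex
git commit -m "Describe the change and why it was made"

# Show the abbreviated checksums and messages of recent commits.
git log --oneline -3
\end{verbatim}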

With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
The team can host the Git history on a web page and collaborate through that.
There are several Git hosting services, for example, \href{https://codeberg.org}{codeberg.org}, \href{https://notabug.org}{notabug.org}, \href{https://gitlab.com}{gitlab.com}, \href{https://bitbucket.com}{bitbucket.com} or \href{https://github.com}{github.com} (among many others).
Storing changes in binary files is also possible in Git; however, it is most useful for human-readable plain-text sources.








\subsection{Archiving}
\label{appendix:archiving}

Long-term, bytewise, checksummed archiving of software research projects is necessary for a project to be reproducible by a broad community (in both time and space).
Generally, archival includes either binary or plain-text source code files.
In some cases, specific tools have their own archival systems, such as Docker Hub\footnote{\inlinecode{\url{https://hub.docker.com}}} for Docker containers (which were discussed above in Appendix \ref{appendix:containers} and so are not reviewed here).
We will focus on generic archival tools in this section.

The Wayback Machine (part of the Internet Archive)\footnote{\inlinecode{\url{https://archive.org}}} and similar services such as Archive Today\footnote{\inlinecode{\url{https://archive.today}}} provide on-demand long-term archiving of web pages, which is a critically important service for preserving the history of the World Wide Web.
However, because these services are heavily tailored to the web format, they have many limitations for scientific source code or data.
For example, the only way to archive the source code of a computational project is through its tarball\footnote{For example \inlinecode{\url{https://archive.org/details/gnuastro}}}.

Through public research repositories such as Zenodo\footnote{\inlinecode{\url{https://zenodo.org}}} or Figshare\footnote{\inlinecode{\url{https://figshare.com}}} academic files (in any format and of any type of content: data, hand-written narrative or code) can be archived for the long term.
Since they are tailored to academic files, these services mint a DOI for each package of files, and provide convenient maintenance of metadata by the uploading user, while verifying the files with MD5 checksums.
Since these services allow large files, they are mostly useful for data (for example Zenodo currently allows a total size, for all files, of 50 GB in each upload).
Universities now regularly provide their own repositories,\footnote{For example \inlinecode{\url{https://repozytorium.umk.pl}}} many of which are registered with the \emph{Open Archives Initiative} that aims at repository interoperability\footnote{\inlinecode{\url{https://www.openarchives.org/Register/BrowseSites}}}.

However, a computational research project's source code (including instructions on how to do the research analysis, how to build the plots, blended with narrative, how to access the data, and how to prepare the software environment) is different from the data to be analysed (which are usually just a sequence of values resulting from experiments and whose volume can be very large).
Even though both source code and data are ultimately just sequences of bits in a file, their creation and usage are fundamentally different within a project, from both the philosophy-of-science point of view and from a practical point of view.
Source code is often written by humans, for machines to execute \emph{and also} for humans to read/modify; it is often composed of many files and thousands of lines of (modular) code.
Often, the fine details of the history of the changes in those lines are preserved through version control, as mentioned in Appendix \ref{appendix:versioncontrol}.

Due to this fundamental difference, some services focus only on archiving the source code of a project.
A prominent example is arXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research preprints.
ArXiv uses the {\LaTeX} source of a paper (and its plots) to build the paper internally and provide users with in-house Postscript or PDF outputs: having access to the {\LaTeX} source allows it to extract metadata or contextual information, among other benefits\footnote{\inlinecode{\url{https://arxiv.org/help/faq/whytex}}}.
However, along with the {\LaTeX} source, authors can also submit any type of plain-text file, including Shell or Python scripts for example (as long as the total volume of the upload doesn't exceed a certain limit).
This feature of arXiv is heavily used by Maneage'd papers.
For example this paper is available at \href{https://arxiv.org/abs/2006.03018}{arXiv:2006.03018}; by clicking on ``Other formats'', and then ``Download source'', the full source file that we uploaded is available to any interested reader.
The file includes a full snapshot of this Maneage'd project, at the point the paper was submitted there, including all data and software download and build instructions, analysis commands and narrative source code.
In fact the \inlinecode{./project make dist} command in Maneage will automatically create the arXiv-ready tarball to help authors upload their project to arXiv.
ArXiv provides long-term stable URIs, giving unique identifiers for each publication\footnote{\inlinecode{\url{https://arxiv.org/help/arxiv_identifier}}} and is mirrored on many servers across the globe.

The granularity offered by the archival systems above is a file (which is usually a compressed package of many files in the case of source code).
It is thus not possible to be more precise when preserving or linking to the contents of a file, or to preserve the history of changes in the file (both of which are very important in hand-written source code).
Commonly used Git repositories (like Codeberg, Notabug, Gitlab or Github) do provide one way to access the fine details of the source files in a project.
However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may become inactive (for association-run sites like Codeberg or Notabug) or go bankrupt or be sold to another company (for commercial sites like Gitlab or Github), thus changing the URL or conditions of access.
Such repositories are thus not reliable sources in view of longevity.

For preserving, and providing direct access to the fine details of a source-code file (with the granularity of a line within the file), Software Heritage is especially useful\citeappendix{abramatic18,dicosmo18}.
Through Software Heritage, users can anonymously nominate the version-controlled repository of any publicly accessible project and request that it be archived.
The Software Heritage scripts (themselves free-licensed) download the repository (including its full history) and preserve it.
This allows the repository as a whole, individual files, or even certain lines within a file, to be accessed using a standard Software Heritage ID (SWHID); for more details, see \citeappendix{dicosmo18}.
In the main body of \emph{this} paper, we use this feature several times.
Software Heritage is mirrored on international servers and is supported by major international institutions like UNESCO.
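For illustration, an SWHID for a single file has the following general shape (the hash and qualifiers below are only schematic, not a real archived object); the \inlinecode{lines} qualifier is what provides the line-level granularity mentioned above:

\begin{verbatim}
# Schematic SWHID for a content (file) object; the hash is illustrative.
swh:1:cnt:<40-hexadecimal-sha1>;origin=https://example.org/repo.git;lines=9-15
\end{verbatim}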

An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether or how to preserve ``scholarly ephemera''.
This refers to discussions about the project, such as bug reports or proposals for adding new features, which are usually referred to as ``issues'' or ``pull requests'' (also called ``merge requests'').
These ephemera are not part of the Git commit history of a software project, but add wider context and understanding beyond the commit history itself, and provide a record that could be used to allocate intellectual credit.
For these reasons, the \emph{Investigating \& Archiving the Scholarly Git Experience} (IASGE) project proposes that the ephemera should be archived along with the Git repositories themselves\footnote{\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{https://investigating-archiving-git.gitlab.io/updates/}}\\\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{define-scholarly-ephemera}}}.
While Github is controversial for practical and ethical reasons\footnote{\inlinecode{\href{https://web.archive.org/web/20210613150610/https://git.sdf.org/humanacollaborator/humanacollabora/src/branch/master/github.md}{https://web.archive.org/web/20210613150610/https://git.sdf}\\\inlinecode{\href{https://web.archive.org/web/20210613150610/https://git.sdf.org/humanacollaborator/humanacollabora/src/branch/master/github.md}{.org/humanacollaborator/humanacollabora/src/branch/master/}}\\\inlinecode{\href{https://web.archive.org/web/20210613150610/https://git.sdf.org/humanacollaborator/humanacollabora/src/branch/master/github.md}{github.md}}}}, it is currently in wide use, and appears to be the first git repository hoster for which the ephemera are being preserved, by the GHTorrent project\footnote{\inlinecode{\url{https://ghtorrent.org}}}.
The GHTorrent project tracks the public Github ``event timeline'', downloads all ``contents and their dependencies, exhaustively'', and provides database files of all the material.
A particular complication that will need to be dealt with by projects such as GHTorrent is the copyright of the git hoster on the particular format and creative choices in style in which the ephemera are provided for downloading.






\subsection{Job management}
\label{appendix:jobmanagement}
Any analysis will involve more than one logical step.
For example, it is first necessary to download a dataset and do some preparations on it before applying the research software to it, and finally to make visualizations/tables that can be imported into the final report.
Each one of these is a logically independent step, which needs to be run before/after the others in a specific order.

Hence job management is a critical component of a research project.
There are many tools for managing the sequence of jobs; below, we review the most common ones, which are also used in the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions} and in Maneage.

\subsubsection{Manual operation with narrative}
\label{appendix:manual}
The most commonly used workflow system for many researchers is to run the commands, experiment with them, and keep the output when they are happy with it (thereby losing the actual command that produced it).
As an improvement, some researchers also keep a narrative description in a text file, and keep a copy of the command they ran.
At least in our personal experience with colleagues, this method is still being heavily practiced by many researchers.
Given that many researchers do not get trained well in computational methods, this is not surprising.
As discussed in
\ifdefined\separatesupplement
the discussion section of the main paper,
\else
Section \ref{discussion},
\fi
based on this observation we believe that improved literacy in computational methods is the single most important factor for the integrity/reproducibility of modern science.

\subsubsection{Scripts}
\label{appendix:scripts}
Scripts (in any language, for example GNU Bash, or Python) are the most common ways of organizing a series of steps.
They are primarily designed to execute each step sequentially (one after another), making them also very intuitive.
However, as the series of operations becomes complex and large, managing the workflow in a script becomes highly complex.

For example, if 90\% of a long project is already done and a researcher wants to add a follow-up step, a script will go through all the previous steps every time it is run (which can take significant time).
In other scenarios, when a small step in the middle of the analysis has to be changed, the full analysis needs to be re-run from the start.
Scripts have no concept of dependencies, forcing authors to ``temporarily'' comment out parts that they do not want to be re-run.
Forgetting to un-comment them afterwards is therefore a common cause of frustration.

This discourages experimentation, which is a critical component of the scientific method.
It is possible to manually add conditionals all over the script (thus manually defining dependencies), or to only run certain steps at certain times, but this just makes the script harder to read, adds logical complexity, and introduces many bugs of its own.
Parallelization is another drawback of using scripts.
While it is not impossible, because of the high-level nature of scripts it is not trivial, and parallelization can also be very inefficient or buggy.
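A minimal sketch of the anti-pattern described above (all script names are hypothetical):

\begin{verbatim}
# analysis.sh -- steps are "temporarily" commented out by hand to
# avoid re-running the slow early stages on every invocation.
./download-data.sh            # step 1: slow, already done
#./preprocess.sh  input.dat   # step 2: commented out after first run
./fit-model.sh    prep.dat    # step 3: the step actually being tested
\end{verbatim}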

\subsubsection{Make}
\label{appendix:make}
Make was originally designed to address the problems mentioned above for scripts\citeappendix{feldman79}.
In particular, it was originally designed in the context of managing the compilation of software source code that is distributed across many files.
With Make, the source files of a program that have not been changed are not recompiled.
Moreover, when two source files do not depend on each other, and both need to be rebuilt, they can be built in parallel.
This was found to greatly help in debugging software projects, and in speeding up test builds, giving Make a core place in software development over the last 40 years.

The most common implementation of Make, since the early 1990s, is GNU Make.
Make was also the framework used in the first attempts at reproducible scientific papers\cite{claerbout1992,schwab2000}.
Our proof-of-concept (Maneage) also uses Make to organize its workflow.
Here, we complement that section with more technical details on Make.

Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
Each stage/step in the analysis is defined through a \emph{rule}.
Rules define \emph{recipes} to build \emph{targets} from \emph{pre-requisites}.
In Unix-like operating systems, everything is a file, even directories and devices.
Therefore all three components in a rule must be files on the running filesystem.

To decide which operations should be re-done when it is executed, Make compares the timestamps of the targets and prerequisites.
When any of the prerequisite(s) is newer than a target, the recipe is re-run to re-build the target.
When all the prerequisites are older than the target, that target does not need to be rebuilt.
A recipe is just a bundle of shell commands that are executed if necessary.
Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend interested readers to consult the GNU Make manual for a very good introduction\footnote{\inlinecode{\url{http://www.gnu.org/software/make/manual/make.pdf}}}.
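As a minimal sketch (file names are hypothetical; recipe lines must start with a TAB character), a rule declares a target, its prerequisites, and the recipe that builds it; Make then only re-runs what is out of date:

\begin{verbatim}
# 'result.txt' (target) is only rebuilt when 'analysis.sh' or
# 'input.dat' (prerequisites) is newer than it.
result.txt: analysis.sh input.dat
	./analysis.sh input.dat > result.txt

# 'paper.pdf' depends on the result, so changing 'input.dat'
# rebuilds both, while editing only 'paper.tex' rebuilds just the PDF.
paper.pdf: paper.tex result.txt
	pdflatex paper.tex
\end{verbatim}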

\subsubsection{Snakemake}
\label{appendix:snakemake}
Snakemake is a Python-based workflow management system, inspired by GNU Make (discussed above).
It is aimed at reproducible and scalable data analysis\citeappendix{koster12}\footnote{\inlinecode{\url{https://snakemake.readthedocs.io/en/stable}}}.
It defines its own language to implement the ``rule'' concept of Make within Python.
Technically, using complex shell scripts (to call software in other languages) in each step will involve a lot of quotations that make the code hard to read and maintain.
It is therefore most useful for Python-based projects.

Currently, Snakemake requires Python 3.5 (released in September 2015) and above, while Snakemake was originally introduced in 2012.
Hence it is not clear if older Snakemake source files can be executed today.
As reviewed in many tools here, depending on high-level systems for low-level project components causes a major bootstrapping problem that reduces the longevity of a project.

\subsubsection{Bazel}
Bazel\footnote{\inlinecode{\url{https://bazel.build}}} is a high-level job organizer that depends on Java and Python and is primarily tailored to software developers (with features like facilitating linking of libraries through its high-level constructs).

\subsubsection{SCons}
\label{appendix:scons}
Scons\footnote{\inlinecode{\url{https://scons.org}}} is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
In many aspects it is similar to Make, for example, it is managed through a `SConstruct' file.
Like a Makefile, SConstruct is also declarative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the imperative paradigm which is common in languages like C, Python, or FORTRAN).
However, unlike Make, SCons does not use the file modification date to decide if a target should be remade.
Instead, SCons keeps the MD5 hash of all the files in a hidden binary file and checks them to see if it is necessary to re-run a step.

SCons thus attempts to work on a declarative file with an imperative language (Python).
It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
SCons is, therefore, more complex than Make, and its manual is almost double the length of GNU Make's.
Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive).

Finally, it has the same drawback as any other tool that uses high-level languages, see Section \ref{appendix:highlevelinworkflow}.
We encountered such a problem while testing SCons: on the Debian-10 testing system, the \inlinecode{python} program pointed to Python 2.
However, since Python 2 is now obsolete, SCons was built with Python 3 and our first run crashed.
To fix it, we had to either manually change the core operating system path, or change the SCons source hashbang.
The former would conflict with other system tools that assume \inlinecode{python} points to Python 2; the latter may need root permissions on some systems.
This can also be problematic when a Python analysis library requires a Python version that conflicts with the running SCons.

\subsubsection{CGAT-core}
CGAT-Core\citeappendix{cribbs19} is a Python package for managing workflows.
It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
It is used in papers like Jones et al.\citeappendix{jones19}.
However, as mentioned there, it is primarily good for managing individual outputs (for example, separate figures/tables in the paper, when they are fully created within Python).
Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial.
Another drawback of this workflow manager is that Python is a very high-level language, and future versions of the language may no longer be compatible with Python 3, in which CGAT-core is implemented (similar to how Python 2 programs are not compatible with Python 3).

\subsubsection{Guix Workflow Language (GWL)}
GWL is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
It is closely linked with GNU Guix and can even install the necessary software needed for each individual process.
Hence in the GWL paradigm, software installation and usage does not have to be separated.
GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.

\subsubsection{Nextflow (2013)}
Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}}\citeappendix{tommaso17} is a workflow language with a command-line interface that is written in Java.

\subsubsection{Generic workflow specifications (CWL and WDL)}
\label{appendix:genericworkflows}
Due to the variety of custom workflows used in existing reproducibility solutions (like those of Appendix \ref{appendix:existingsolutions}), some attempts have been made to define common workflow standards like the Common Workflow Language (CWL\footnote{\inlinecode{\url{https://www.commonwl.org}}}, with roots in Make, formatted in YAML or JSON) and the Workflow Description Language (WDL\footnote{\inlinecode{\url{https://openwdl.org}}}, formatted in JSON).
These are primarily specifications/standards rather than software.
At an even higher level solutions like Canonical Workflow Frameworks for Research (CWFR) are being proposed\footnote{\inlinecode{\href{https://codata.org/wp-content/uploads/2021/01/CWFR-position-paper-v3.pdf}{https://codata.org/wp-content/uploads/2021/01/}}\\\inlinecode{\href{https://codata.org/wp-content/uploads/2021/01/CWFR-position-paper-v3.pdf}{CWFR-position-paper-v3.pdf}}}.
With these standards, ideally, translators can be written between the various workflow systems to make them more interoperable.

In conclusion, shell scripts and Make are very common and extensively used by users of Unix-based OSs (which are most commonly used for computations).
They have also existed for several decades and are robust and mature.
Many researchers that rely on heavy computation are also already familiar with them and have used them (to varying degrees).
As we demonstrated above in this appendix, the list of necessary tools for the various stages of a research project (an independent environment, package managers, job organizers, analysis languages, writing formats, editors, etc.) is already very large.
Each software/tool/paradigm has its own learning curve, which is not easy for, say, a natural or social scientist (who needs to keep their primary focus on their own scientific domain).
Most workflow management tools, and the reproducible workflow solutions that depend on them, are yet another language/paradigm that researchers have to master, and are thus a heavy burden.

Furthermore, as shown above (and below), high-level tools evolve very fast, causing disruptions in the reproducible framework.
A good example is Popper\citeappendix{jimenez17} which initially organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
However, in September 2019, GitHub dropped HCL as its default configuration language, so Popper is now using its own custom YAML-based workflow language, see Appendix \ref{appendix:popper} for more on Popper.





\subsection{Editing steps and viewing results}
\label{appendix:editors}
In order to reproduce a project, the analysis steps must be stored in files: for example, shell, Python, or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or FORTRAN.
Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files.
Here we review some common methods that are currently used.

\subsubsection{Text editors}
The most basic way to edit text files is through simple text editors which just allow viewing and editing such files, for example, \inlinecode{gedit} on the GNOME graphic user interface.
However, working with simple plain text editors like \inlinecode{gedit} can be very frustrating since it is necessary to save the file, then go to a terminal emulator and execute the source files.
To solve this problem there are advanced text editors like GNU Emacs that allow direct execution of the script, or access to a terminal within the text editor.
However, editors that can execute or debug the source (like GNU Emacs), just run external programs for these jobs (for example GNU GCC, or GNU GDB), just as if those programs were called from outside the editor.

With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
This is a very important feature and corresponds to the modularity criterion of this paper.
This type of modularity is not commonly present for other solutions mentioned below (the source can only be edited/run in a specific browser).
Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that do not provide a graphic user interface.
Also, the commonly used minimalistic containers do not include a graphic user interface.
Hence by default all Maneage'd projects also build the simple GNU Nano plain-text editor as part of the project (to be able to edit the source directly within a minimal environment).
Maneage can also optionally build GNU Emacs or Vim, but it is up to the user to build them (in the same way as their high-level science software).

\subsubsection{Integrated Development Environments (IDEs)}
To facilitate the development of source code in special programming languages, IDEs add software building and running environments as well as debugging tools to a plain text editor.
Many IDEs have their own compilers and debuggers, hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
Furthermore, they usually require a graphic user interface to run.
In summary, IDEs are generally very specialized tools, for special projects and are not a good solution when portability (the ability to run on different systems and at different times) is required.

\subsubsection{Jupyter}
\label{appendix:jupyter}
Jupyter\citeappendix{kluyver16} (initially IPython) is an implementation of Literate Programming \citeappendix{knuth84}.
Jupyter's name is a combination of the three main languages it was designed for: Julia, Python, and R.
The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
Jupyter uses the custom built \inlinecode{.ipynb} format\footnote{\inlinecode{\url{https://nbformat.readthedocs.io/en/latest}}}.
The \inlinecode{.ipynb} format is a simple, human-readable format (it can be opened in a plain-text editor) and is formatted in JavaScript Object Notation (JSON).
It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visualizations (for example images/plots), that are all stored in one file.
The cells can have any order, allowing the creation of a graphical implementation of the literate programming style, where narrative descriptions and executable patches of code are intertwined.
For example, a paragraph of text can describe a patch of code, and that patch can be run immediately on the same page.
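
To illustrate the format in a schematic and highly simplified form (real notebooks contain additional metadata), the following Python snippet writes a minimal notebook with one narrative cell and one code cell.

\begin{verbatim}
import json

# Schematic, minimal '.ipynb' content: a JSON object whose "cells"
# list mixes narrative (markdown) and executable (code) blobs.
notebook = {
    "nbformat": 4, "nbformat_minor": 4, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# A narrative description of the next step"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],   # outputs (even images) are stored here too
         "source": ["print(2 + 2)"]},
    ],
}

with open("minimal.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)
\end{verbatim}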

The \inlinecode{.ipynb} format does theoretically allow dependency tracking between cells; see the IPython mailing list discussion started by Gabriel Becker in July 2013\footnote{\url{https://mail.python.org/pipermail/ipython-dev/2013-July/010725.html}}.
Defining dependencies between the cells can allow non-linear execution which is critical for large scale (thousands of files) and complex (many dependencies between the cells) operations.
It allows automation, run-time optimization (deciding not to run a cell if it is not necessary), and parallelization.
However, Jupyter currently only supports a linear run of the cells: always from the start to the end.
It is possible to manually execute only one cell, but the previous/next cells that may depend on it, also have to be manually run (a common source of human error, and frustration for complex operations).
Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\inlinecode{\url{https://github.com/jupyter/notebook/issues/1175}}}).

The fact that the \inlinecode{.ipynb} format stores narrative text, code, and multi-media visualization of the outputs in one file is another major hurdle, and goes against the modularity criterion proposed here.
The files can easily become very large (in volume/bytes) and hard to read when the Jupyter web interface is not accessible.
Both are critical for scientific processing, especially the latter: a web browser with the proper JavaScript features may simply not be available in a few years.
This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into a much less memory-efficient textual encoding (typically Base64).
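
As a rough demonstration of this overhead, the snippet below (assuming the common case where binary output is embedded as Base64 text) shows that the textual form is already about one third larger than the raw bytes, before any JSON escaping or the loss of random access is taken into account.

\begin{verbatim}
import base64, os

raw = os.urandom(300_000)     # stand-in for ~300 kB of binary image data
txt = base64.b64encode(raw)   # the usual text-safe embedding for JSON
print(len(txt) / len(raw))    # prints 1.333...: a ~33% size increase
\end{verbatim}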

Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies.
Hinsen\citeappendix{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
However, the dependencies above are only on the server-side.
Since Jupyter is a web-based system, it requires many dependencies on the viewing/running browser also (for example special JavaScript or HTML5 features, which evolve very fast).
As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility.
In summary, Jupyter is most useful for manual, interactive, and graphical operations of a temporary nature (for example educational tutorials).





\subsection{Project management in high-level languages}
\label{appendix:highlevelinworkflow}
Currently, the most popular high-level data analysis language is Python.
R is closely tracking it and has superseded Python in some fields, while Julia\citeappendix{bezanson17} is quickly gaining ground.
These languages have themselves superseded previously popular languages for data analysis of the previous decades, for example, Java, Perl, or C++.
All are part of the C-family programming languages.
In many cases, this means that their execution environments are themselves written in C, which is the language of modern operating systems.

Scientists, or data analysts, mostly use these higher-level languages.
Therefore they are naturally drawn to also apply the higher-level languages for lower-level project management, or designing the various stages of their workflow.
For example Conda or Spack (Appendix \ref{appendix:packagemanagement}), CGAT-core (Appendix \ref{appendix:jobmanagement}), Jupyter (Appendix \ref{appendix:editors}) or Popper (Appendix \ref{appendix:popper}) are written in Python.
The discussion below applies to both the actual analysis software and project management software.
In this context, it is more focused on the latter.

Because of their nature, higher-level languages evolve very fast, creating incompatibilities on the way.
The most prominent example is the transition from Python 2 (released in 2000) to Python 3 (released in 2008).
Python 3 was incompatible with Python 2, and it was decided to abandon Python 2 by 2015.
However, due to community pressure, this was delayed to 1 January 2020.
The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated, for example, see Jenness\citeappendix{jenness17} or Appendix \ref{appendix:sciunit}.
Some projects could not make this investment and their developers decided to stop maintaining them, for example VisTrails (see Appendix \ref{appendix:vistrails}).

The problems were not just limited to translation.
Python 2 was still being actively used during the transition period (and is still being used by some, after its end-of-life).
Therefore, developers had to maintain (for example fix bugs in) both versions in one package.
This is not particular to Python; a similar evolution occurred in Perl: in 2000 it was decided to improve Perl 5, but the proposed Perl 6 was incompatible with it.
However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).

It is unreasonably optimistic to assume that high-level languages will not undergo similar incompatible evolutions in the (not too distant) future.
For industrial software developers, this is not a major problem: non-scientific software, and the general population's usage of it, have a similarly fast evolution and short shelf-life.
Hence, it is rarely (if ever) necessary to look into industrial/business codes that are more than a couple of years old.
However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions.

In summary, in this section we are discussing the bootstrapping problem as it applies to scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself should not be ignored.
Beyond the technical, low-level problems for developers mentioned above, this causes major problems for scientific project management, as listed below:

\subsubsection{Dependency hell}
The evolution of high-level languages is extremely fast, even within one version.
For example, packages that are written in Python 3 often only work with a specific interval of Python 3 versions; for instance, Snakemake and Occam can only be run on Python versions 3.4 and 3.5 or newer, respectively (see Appendices \ref{appendix:snakemake} and \ref{appendix:occam}).
This is not just limited to the core language; much faster changes occur in their higher-level libraries.
For example, version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users\citeappendix{hinsen15}.

On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
For example, see Figure 1 of Alliez et al.\cite{alliez19}.
It shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
The acceptable version intervals of these dependencies will lead to incompatibilities within a year or two when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
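
These declared version intervals can be inspected directly; for example, the following snippet (assuming Matplotlib is installed on a Python 3.8 or later system; the exact output depends on the installed version) lists its requirement strings, each with its acceptable version range.

\begin{verbatim}
# Print the declared dependencies (with their version intervals)
# of an installed Python package, using only the standard library.
from importlib.metadata import requires

for req in requires("matplotlib") or []:
    print(req)   # for example: "numpy>=1.17", "pillow>=6.2.0", ...
\end{verbatim}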

Since a domain scientist does not always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments, with different versions of Python (sometimes on different computers), and pass the data between them (for example just to use the work of a previous PhD student in the team).
This greatly increases the complexity/cost of the project, even for the principal author.
A well-designed reproducible workflow like Maneage that has no dependencies beyond a C compiler in a Unix-like operating system can account for this.
However, when the actual workflow system (not the analysis software) is written in a high-level language like the examples above, the complex dependencies of the workflow itself will inevitably cause bootstrapping problems in the future.

Another relevant example of the dependency hell is the following: installing the Python installer (\inlinecode{pip}) on a Debian system (with \inlinecode{apt install pip2} for Python 2 packages) required 32 other packages as dependencies.
\inlinecode{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
As of this writing, the \inlinecode{pip3 install popper} and \inlinecode{pip2 install sciunit2} commands for installing each required 17 and 26 Python modules as dependencies, respectively.
It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
This problem actually occurred while we were testing Sciunit: even though it was installed, it could not run because of conflicts (its last commit was only 1.5 years old), for more see Appendix \ref{appendix:sciunit}.
Hinsen\citeappendix{hinsen15} also reports a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).

\subsubsection{Generational gap}
This occurs primarily for scientists in a given domain (for example, astronomers, biologists, or social scientists).
Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
The inertia of programming languages is very strong.
This is natural because scientists have their own science field to focus on, and re-writing their high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages every few years is impractical.

When this investment (in re-writing or learning the newer tools) is not possible, either the mentee has to use the mentor's old method (and miss out on all the newly fashionable tools that many are talking about), or the mentor has to avoid implementation details in discussions with the mentee because they do not share a common language.
The authors of this paper have personal experiences in both mentor/mentee relational scenarios.
This failure to communicate in the details is a very serious problem, leading to the loss of valuable inter-generational experience.