Dear CiSE editors,

Thank you very much for the very complete and useful referee reports. Their
recommendations have been fully implemented in this submission and have
significantly improved the quality and clarity of the paper.

Below, all the points raised by the Editor in Chief (EiC), Associate Editor,
and the 5 referees (in the same order as the review process report) are
addressed individually as a numbered list.

Sincerely yours,
Dr. Mohammad Akhlaghi [on behalf of the co-authors]
Instituto de Astrofísica de Canarias, Tenerife, Spain.

------------------------------


1.  [EiC] Some reviewers request additions, and overview of other
    tools.

ANSWER: Indeed, there is already a large body of previous work in this
field, and we had learnt a lot from them during the creation of the
criteria and the proof of concept tool (Maneage). Before submitting the
paper, we had already done a very comprehensive review of the tools (as you
may notice from the Git repository[1], where most of the tools were run and
practically tested). However, the CiSE Author Information explicitly
states: "The introduction should provide a modicum of background in one or
two paragraphs, but should not attempt to give a literature review". This
is the usual practice in previously published papers at CiSE and is in line
with the maximum 6250 word-count and maximum of 12 references to be used in
bibliography.

We already discussed this point privately with you and we agreed upon the
following solution: the extended reviews will be submitted as supplementary
material, to accompany the paper as "Web extras". These appendices are also
mentioned in the submitted paper so that any interested CiSE reader can
easily learn of their existence from the paper and access them.

Appendix A is focused on the low-level "tools" that are commonly used in
the reproducible workflow solutions (including Maneage). In Appendix B, we
touch upon more than 25 reproducible solutions and compare them directly with our
criteria. In particular, we also review tools that have been abandoned or
discontinued and use the criteria to justify why this happened.

[1] https://gitlab.com/makhlaghi/maneage-paper/-/blob/master/tex/src/paper-long.tex#L1579
[2] https://arxiv.org/abs/2006.03018
[3] https://doi.org/10.5281/zenodo.3872247

------------------------------


2.  [Associate Editor] There are general concerns about the paper
    lacking focus

ANSWER: Thanks to all the corrections/clarifications that have been made in
this revision, the paper is much more focused and direct to the point. We are
very grateful for the thorough listing of points by the referees, which
helped us identify what needed to be improved.

------------------------------


3.  [Associate Editor] Some terminology is not well-defined
    (e.g. longevity).

ANSWER: In this revision, "Reproducibility", "Longevity" and "Usage" have
been explicitly defined in the first paragraph of Section II. With these
definitions, the main argument of the paper has become much clearer.
Thank you (and the referees) for highlighting this.

------------------------------


4.  [Associate Editor] The discussion of tools could benefit from some
    categorization to characterize their longevity.

ANSWER: The approximate longevity of the various tools reviewed in Section
II is now mentioned immediately after each and highlighted in green. For
example we have added this after containers "(their longevity is determined
by the host kernel, typically a decade)".

------------------------------


5.  [Associate Editor] Background and related efforts need significant
    improvement. (See below.)

ANSWER: This has been done, as mentioned in (1.) above.

------------------------------


6.  [Associate Editor] There is consistency among the reviews that
    related work is particularly lacking.

ANSWER: This has been done, as mentioned in (1.) above.

------------------------------


7.  [Associate Editor] The current work needs to do a better job of
    explaining how it deals with the nagging problem of running on CPU
    vs. different architectures.

ANSWER: The CPU architecture of the running system is now precisely
reported in the "Acknowledgments" section (highlighted in green). A
description of the dependency on hardware architecture, and how Maneage
reports it, has also been added in the "Proof of concept: Maneage" Section.

------------------------------


8.  [Associate Editor] At least one review commented on the need to
    include a discussion of continuous integration (CI) and its
    potential to help identify problems running on different
    architectures. Is CI employed in any way in the work presented in
    this article?

ANSWER: CI has been added in the "Discussion" section as one solution to
find breaking points in operating system updates and new/different
architectures. For the core Maneage branch, we have defined task #15741 [1]
to add CI on many architectures in the near future.

[1] http://savannah.nongnu.org/task/?15741

------------------------------


9.  [Associate Editor] The presentation of the Maneage tool is both
    lacking in clarity and consistency with the public
    information/documentation about the tool. While our review focus
    is on the article, it is important that readers not be confused
    when they visit your site to use your tools.

ANSWER: Thank you for raising this important point. We have broken down the
very long "About" page into multiple pages to help in readability:

https://maneage.org/about.html

Generally, the webpage will soon undergo major improvements to become even
clearer (as part of our RDA grant for Maneage, we have promised a clear and
friendly webpage after the paper). The website is developed in a
public git repository (https://git.maneage.org/webpage.git), so any
specific proposals for improvements can be handled efficiently and
transparently, and we welcome any feedback in this respect.

------------------------------


10. [Associate Editor] A significant question raised by one review is
    how this work compares to "executable" papers and Jupyter
    notebooks.  Does this work embody similar/same design principles
    or expand upon the established alternatives? In any event, a
    discussion of this should be included in background/motivation and
    related work to help readers understand the clear need for a new
    approach, if this is being presented as new/novel.

ANSWER: Thank you for highlighting this important point. We saw that it is
necessary to compare and contrast our Maneage proof-of-concept
demonstration more directly against the Jupyter notebook type of
approach. Two paragraphs have been added in Sections II and IV to clarify
this (our criteria require and build in more modularity and longevity than
Jupyter). A much more extensive comparison and review is now also available
in Appendix A.


------------------------------


11. [Reviewer 1] Adding an explicit list of contributions would make
    it easier to the reader to appreciate these. These are not
    mentioned/cited and are highly relevant to this paper (in no
    particular order):
     1.  Git flows, both in general and in particular for research.
     2.  Provenance work, in general and with git in particular
     3.  Reprozip: https://www.reprozip.org/
     4.  OCCAM: https://occam.cs.pitt.edu/
     5.  Popper: http://getpopper.io/
     6.  Whole Tale: https://wholetale.org/
     7.  Snakemake: https://github.com/snakemake/snakemake
     8.  CWL https://www.commonwl.org/ and WDL https://openwdl.org/
     9.  Nextflow: https://www.nextflow.io/
     10. Sumatra: https://pythonhosted.org/Sumatra/
     11. Podman: https://podman.io
     12. AppImage (https://appimage.org/)
     13. Flatpack (https://flatpak.org/)
     14. Snap (https://snapcraft.io/)
     15. nbdev https://github.com/fastai/nbdev and jupytext
     16. Bazel: https://bazel.build/
     17. Debian reproducible builds: https://wiki.debian.org/ReproducibleBuilds

ANSWER:

1.  In Section IV, we have added that "Generally, any git flow (branching
    strategies) can be used by the high-level project authors or future
    readers."
2.  We have mentioned research objects as one mode of provenance tracking.
    The body of related provenance work that has already been done, and
    that can be exploited using these criteria and our proof of concept, is
    indeed very large. However, the 6250 word-count limit is very tight and
    if we add more on it in this length, we would have to remove points of
    higher priority. Hopefully this can be the subject of a follow-up paper.
3.  A review of ReproZip is in Appendix B.
4.  A review of Occam is in Appendix B.
5.  A review of Popper is in Appendix B.
6.  A review of Whole Tale is in Appendix B.
7.  A review of Snakemake is in Appendix A.
8.  CWL and WDL are described in Appendix A (Job management).
9.  Nextflow is described in Appendix A (Job management).
10. Sumatra is described in Appendix B.
11. Podman is mentioned in Appendix A (Containers).
12. AppImage is mentioned in Appendix A (Package management).
13. Flatpak is mentioned in Appendix A (Package management).
14. Snap is mentioned in Appendix A (Package management).
15. nbdev and jupytext are high-level tools to generate documentation and
    package custom code in Conda or PyPI. High-level package managers
    like Conda and PyPI have already been thoroughly reviewed in Appendix A
    for their longevity issues, so we feel that there is no need to
    include these.
16. Bazel is mentioned in Appendix A (job management).
17. Debian's reproducible builds are only designed for ensuring that software
    packaged for Debian is bitwise reproducible. As mentioned in the
    discussion section of this paper, the bitwise reproducibility of software is
    not an issue in the context discussed here; the reproducibility of the
    relevant output data of the software is the main issue.

------------------------------


12. [Reviewer 1] Existing guidelines similar to the proposed "Criteria
    for longevity". Many articles of these in the form "10 simple
    rules for X", for example (not exhaustive list):
     * https://doi.org/10.1371/journal.pcbi.1003285
     * https://arxiv.org/abs/1810.08055
     * https://osf.io/fsd7t/
     * A model project for reproducible papers: https://arxiv.org/abs/1401.2000
     * Executable/reproducible paper articles and original concepts

ANSWER: Thank you for highlighting these points. Appendix B starts with a
subsection titled "suggested rules, checklists or criteria". In this
section, we review the existing sets of criteria. This subsection includes
the sources proposed by the reviewer [Sandve et al; Rule et al; Nust et al]
(and others).

arXiv:1401.2000 has been added in Appendix A as an example paper using
virtual machines. We thank the referee for bringing up this paper, because
the link to the VM provided in the paper no longer works (the URL
http://archive.comp-phys.org/provenance_challenge/provenance_machine.ova
redirects to
https://share.phys.ethz.ch//~alpsprovenance_challenge/provenance_machine.ova
which gives a 'Not Found' HTML response). Together with SHARE, this very
nicely highlights our main issue with binary containers or VMs: their lack
of longevity due to the high cost of long term storage of large files.

------------------------------


13. [Reviewer 1] Several claims in the manuscript are not properly
    justified, neither in the text nor via citation. Examples (not
    exhaustive list):
     1. "it is possible to precisely identify the Docker “images” that
        are imported with their checksums, but that is rarely practiced
        in most solutions that we have surveyed [which ones?]"
     2. "Other OSes [which ones?] have similar issues because pre-built
        binary files are large and expensive to maintain and archive."
     3. "Researchers using free software tools have also already had
        some exposure to it"
     4. "A popular framework typically falls out of fashion and
        requires significant resources to translate or rewrite every
        few years."

ANSWER: These points have been clarified in the highlighted parts of the text:

1. Many examples have been given throughout the newly added
   appendices. To avoid confusion in the main body of the paper, we
   have removed the "we have surveyed" part. It is already mentioned
   above this point in the text that a large survey of existing
   methods/solutions is given in the appendices.

2. Due to the thorough discussion of this issue in the appendices with
   precise examples, this line has been removed to allow space for the
   other points raised by the referees. The main point (high cost of
   keeping binaries) is already abundantly clear.

   On a similar topic, Dockerhub's recent announcement that inactive images
   (for over 6 months) will be deleted has also been added. The announcement
   URL is a hyperlink in the text (it was too long to print directly; if
   IEEE has a special short-url format, we can add it).

   Another interesting news item in relation to longevity has also been added
   here: the decision by CentOS to abandon CentOS 8 next year. Again, the
   URL is within a hyperlink in the text. Many scientific and industrial
   projects have relied on CentOS for longevity over the last two decades,
   but that didn't stop its creators from abandoning it 8 years early and
   completely switching its release paradigm.

3. A small statement has been added, reminding the readers that almost all
   free software projects are built with Make (CMake is also used
   sometimes, but CMake is just a high-level wrapper over Make: it finally
   produces a 'Makefile'; practical usage of CMake generally obliges the
   user to understand Make).

4. The example of Python 2 has been added to clarify this point.


------------------------------


14. [Reviewer 1] As mentioned in the discussion by the authors, not
    even Bash, Git or Make is reproducible, thus not even Maneage can
    address the longevity requirements. One possible alternative is
    the use of CI to ensure that papers are re-executable (several
    papers have been written on this topic). Note that CI is
    well-established technology (e.g. Jenkins is almost 10 years old).

ANSWER: Thank you for raising these issues. We had initially planned to
discuss CI, but as with many discussion points, we were forced to remove it
before the first submission due to the very tight word-count limit. We have
now added a sentence on CI in the discussion.

On the issue of Bash/Git/Make, indeed, the executable built files of Bash,
Git and Make binaries are not bitwise reproducible/identical on different
systems. However, as mentioned in the discussion, we are concerned with the
_output_ of the software's executable file. We are not interested in the
executable file itself (which should be different for different OSs or CPU
architectures).

The reproducibility of a binary file only becomes important for security
purposes where binaries are downloaded. In Maneage, we download the
software source code tarball, confirm the tarball's SHA512 checksum with
the checksum that is recorded in Maneage [1], and build the software with
precisely defined build environment and dependencies.
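
As a minimal illustration of this kind of check (a Python sketch only, not
Maneage's actual implementation; the function names 'sha512_of' and
'verify_tarball' are hypothetical), a downloaded tarball can be hashed and
compared with a recorded value before anything is built:

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA512 digest of the file at 'path'."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_hex):
    """Raise ValueError if the file's SHA512 differs from 'expected_hex'."""
    actual = sha512_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

The point of the sketch is that the check is applied to the source
tarball, so no trust in a pre-built binary is needed.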

In summary, even though the compiled binary files of specific versions of
Git, Bash or Make will not be bitwise reproducible/identical on different
systems, their scientific outputs are exactly reproducible: 'git describe'
or Bash's 'for' loop will have the same output on GNU/Linux, macOS/Darwin
or FreeBSD (despite having bitwise different executables).

[1] http://git.maneage.org/project.git/tree/reproduce/software/config/checksums.conf

------------------------------


15. [Reviewer 1] Criterion has been proposed previously. Maneage itself
    provides little novelty (see comments below).

ANSWER: The previously suggested sets of criteria that were listed by
Reviewer 1 are reviewed by us in the newly added Appendix B, and the
novelty and advantages of our proposed criteria are contrasted there
with the earlier sets of criteria.

------------------------------


16. [Reviewer 2] Authors should add indication that using good practices it
    is possible to use Docker or VM to obtain identical OS usable for
    reproducible research.

ANSWER: In the submitted version we had stated that "Ideally, it is
possible to precisely identify the Docker images that are imported with
their checksums ...". But to be clearer and go directly to the point, it
has been edited to explicitly say "... to recreate an identical OS image
later".

------------------------------


17. [Reviewer 2] The CPU architecture of the platform used to run the
    workflow is not discussed in the manuscript. Authors should probably
    take into account the architecture used in their workflow or at least
    report it.

ANSWER: Thank you very much for raising this important point. We hadn't
seen other reproducibility papers mention this important point and thus
missed it. In the acknowledgments (where we also mention the commit hashes)
we now explicitly mention the exact CPU architecture used to build this
paper: "This project was built on an x86_64 machine with Little Endian
byte-order and address sizes 39 bits physical, 48 bits virtual.". This is
because we have already seen cases where the architecture is the same, but
programs fail because of the byte order.

Generally, Maneage will now extract this information from the running
system during its configuration phase, and provide the users with three
different LaTeX macros that contain this information. Users can use these
LaTeX macros anywhere in their paper.
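
To give an idea of what such an extraction can look like, here is a
minimal Python sketch. It is not the code used in Maneage: the macro names
(\hwmachine, \hwbyteorder, \hwaddresssizes) and both function names are
hypothetical, and reading /proc/cpuinfo is an assumption that only holds
on Linux hosts.

```python
import platform
import re
import sys


def address_sizes(cpuinfo_text):
    """Extract the 'address sizes' value from /proc/cpuinfo-style text.

    Returns None when the field is absent (for example on non-Linux hosts).
    """
    match = re.search(r"^address sizes\s*:\s*(.+)$", cpuinfo_text,
                      re.MULTILINE)
    return match.group(1).strip() if match else None


def hardware_macros(machine, byteorder, sizes):
    """Format the three values as LaTeX macros (hypothetical macro names)."""
    endian = "Little Endian" if byteorder == "little" else "Big Endian"
    return "\n".join([
        # Underscores (as in x86_64) must be escaped in LaTeX text.
        r"\newcommand{\hwmachine}{%s}" % machine.replace("_", r"\_"),
        r"\newcommand{\hwbyteorder}{%s}" % endian,
        r"\newcommand{\hwaddresssizes}{%s}" % (sizes or "unknown"),
    ])


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:  # Linux-specific source
            cpuinfo = handle.read()
    except OSError:
        cpuinfo = ""
    print(hardware_macros(platform.machine(), sys.byteorder,
                          address_sizes(cpuinfo)))
```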

------------------------------


18. [Reviewer 2] I don’t understand the "no dependency beyond
    POSIX". Authors should more explained what they mean by this sentence.

ANSWER: This has been clarified with the short extra statement "a minimal
Unix-like standard that is shared between many operating systems". Also in
the appendix we now say "no execution requirement beyond a minimal
Unix-like operating system".

We would have liked to explain this more, but the word limit is very
constraining. It is clearer in the appendices, and we will put clearer
explanations on the web page.

------------------------------


19. [Reviewer 2] Unfortunately, sometime we need proprietary or specialized
    software to read raw data... For example in genetics, micro-array raw
    data are stored in binary proprietary formats. To convert this data
    into a plain text format, we need the proprietary software provided
    with the measurement tool.

ANSWER: Thank you very much for this good point. A description of a
possible solution to this has been added after criterion 8.

------------------------------


20. [Reviewer 2] I was not able to properly set up a project with
    Maneage. The configuration step failed during the download of tools
    used in the workflow. This is probably due to a firewall/antivirus
    restriction out of my control. How frequent this failure happen to
    users?

ANSWER: Thank you for mentioning this. This has been fixed by archiving all
Maneage'd software on Zenodo (https://doi.org/10.5281/zenodo.3883409) and
downloading it from there with the highest precedence.

Until recently we would directly access each software's own webpage to
download the source files, and this caused frequent problems of the type
you mentioned (different servers in different ISPs/states/countries can
behave differently). In other cases, we were very frustrated when a
software's webpage would temporarily be unavailable (e.g., for maintenance
reasons); this was a major hindrance in building new projects.

Since all the software is free-licensed, we are legally allowed to
re-distribute it (within the license conditions, such as not removing
copyright notices), and Zenodo is designed for long-term archival of academic
digital objects, so we decided that a software source code repository on Zenodo
would be the most reliable solution. At configure time, Maneage now
accesses Zenodo's DOI and resolves the most recent URL to automatically
download any necessary software source code that the project needs from
there.

Generally, we also keep all software in a Git repository on our own
webpage: http://git.maneage.org/tarballs-software.git/tree. Also, Maneage
users can specify their own custom URLs for downloading software, which
will be given higher priority than Zenodo (useful for situations when
custom software is downloaded and built in a project branch, not the core
'maneage' branch).
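
The precedence described above can be sketched as follows. This is a
minimal Python illustration and not Maneage's own download logic: the
function names, the flat "base URL + file name" layout, and the placement
of all other locations after Zenodo are assumptions made for the example.

```python
def candidate_urls(filename, user_urls, zenodo_base, fallback_urls):
    """List download locations for 'filename' in order of precedence."""
    # User-specified locations come first, then Zenodo, then the rest.
    ordered = [base.rstrip("/") + "/" + filename for base in user_urls]
    ordered.append(zenodo_base.rstrip("/") + "/" + filename)
    ordered += [base.rstrip("/") + "/" + filename for base in fallback_urls]
    return ordered


def download_first(urls, fetch):
    """Return (url, data) from the first URL that 'fetch' can retrieve.

    'fetch' is any callable taking a URL and returning bytes, raising
    OSError on failure (urllib.error.URLError is a subclass of OSError).
    """
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as error:
            # Remember the failure and fall through to the next location.
            errors.append(f"{url}: {error}")
    raise OSError("all sources failed:\n" + "\n".join(errors))
```

With this structure, a server blocked by a firewall only costs one failed
attempt before the next location is tried.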

------------------------------


21. [Reviewer 2] The time to configure a new project is quite long because
    everything needs to be compiled. Authors should compare the time
    required to set up a project Maneage versus time used by other
    workflows to give an indication to the readers.

ANSWER: Thank you for raising this point. It takes about 1.5 hours to
configure the default Maneage branch on an 8-core CPU (more than half of
this time is devoted to GCC on GNU/Linux operating systems, and the
building of GCC can optionally be disabled with the '--host-cc' option to
significantly speed up the build when the host's GCC is
similar). Furthermore, Maneage can be built within a Docker container.

A paragraph has been added in Section IV on this issue (the
build time and building within a Docker container). We have also defined
task #15818 [1] to have our own core Docker image that is ready to build a
Maneaged project and will be adding it shortly.

[1] https://savannah.nongnu.org/task/index.php?15818

------------------------------


22. [Reviewer 3] Authors should define their use of the term [Replicability
    or Reproducibility] briefly for their readers.

ANSWER: "Reproducibility" has been defined along with "Longevity" and
"Usage" at the start of Section II.

------------------------------


23. [Reviewer 3] The introduction is consistent with the proposal of the
    article, but deals with the tools separately, many of which can be used
    together to minimize some of the problems presented. The use of
    Ansible, Helm, among others, also helps in minimizing problems.

ANSWER: That is correct. In the new appendices we have touched upon this,
especially in Appendix B where we discuss the technologies used by various
reproducible workflow solutions.

About Ansible and Helm: they are primarily designed for distributed
computing. For example, Helm is just a high-level package manager for a
Kubernetes cluster that is based on containers. A review of them could be
added to the Appendices, but we feel that this would distract somewhat from
the main points of our current paper.

------------------------------


24. [Reviewer 3] When the authors use the Python example, I believe it is
    interesting to point out that today version 2 has been discontinued by
    the maintaining community, which creates another problem within the
    perspective of the article.

ANSWER: Thank you very much for highlighting this point. We had excluded
this point for the sake of article length, but we have restored it in
the introduction of the revised version.

------------------------------


25. [Reviewer 3] Regarding the use of VM's and containers, I believe that
    the discussion presented by THAIN et al., 2015 is interesting to
    increase essential points of the current work.

ANSWER: Thank you very much for pointing out the works by Thain. We
couldn't find any first-author papers in 2015, but found Meng & Thain
(https://doi.org/10.1016/j.procs.2017.05.116) which had a relevant
discussion of why they didn't use Docker containers in their work. That
paper is now cited in the discussion of Containers in Appendix A.

------------------------------


26. [Reviewer 3] About the Singularity, the description article was missing
    (Kurtzer GM, Sochat V, Bauer MW, 2017).

ANSWER: Thank you for the reference. We are restricted in the main
body of the paper due to the strict bibliography limit of 12
references; we have included Kurtzer et al 2017 in Appendix A (where
we discuss Singularity).

------------------------------


27. [Reviewer 3] I also believe that a reference to FAIR is interesting
    (WILKINSON et al., 2016).

ANSWER: The FAIR principles have been mentioned in the main body of the
paper, but unfortunately we had to remove its citation in the main paper (like
many others) to keep to the maximum of 12 references. We have cited it in
Appendix B.

------------------------------


28. [Reviewer 3] In my opinion, the paragraph on IPOL seems to be out of
    context with the previous ones. This issue of end-to-end
    reproducibility of a publication could be better explored, which would
    further enrich the tool presented.


ANSWER: We agree and have removed the IPOL example from that section.  We
have included an in-depth discussion of IPOL in Appendix B and we comment
on how Maneage'd projects offer a similar level of peer-review control.

------------------------------


29. [Reviewer 3] On the project website, I suggest that the information
    contained in README-hacking be presented on the same page as the
    Tutorial. A topic breakdown is interesting, as the markdown reading may
    be too long to find information.

ANSWER: Thank you very much for this good suggestion, it has been
implemented: https://maneage.org/about.html . The webpage will continuously
be improved and such feedback is always very welcome.

------------------------------


31. [Reviewer 3] The tool is suitable for Unix users, keeping users away
    from Microsoft environments.

ANSWER: The issue of building on Windows has been discussed in Section IV,
either using Docker (or VMs) or using the Windows Subsystem for Linux.

------------------------------


32. [Reviewer 3] Important references are missing; more references are
    needed

ANSWER: Two comprehensive Appendices have been added to address this issue.

------------------------------


33. [Reviewer 4] Revisit the criteria, show how you have come to decide on
    them, give some examples of why they are important, and address
    potential missing criteria.

ANSWER: In the new Appendix B, we have added a new section, reviewing some
existing criteria. We would be very interested to discuss them even further
in the main body, but within the constraints of space (the limit is 6250
words), it is almost impossible to discuss the history of each in detail or
add more anecdotal examples of their relevance.

------------------------------


34. [Reviewer 4] Clarify the discussion of challenges to adoption and make
    it clearer which tradeoffs are important to practitioners.

ANSWER: We discuss many of these challenges and caveats in the Discussion
Section (V), within the existing word limit.

------------------------------


35. [Reviewer 4] Be clearer about which sorts of research workflow are best
    suited to this approach.

ANSWER: Maneage is flexible enough to enable a wide range of workflows to
be implemented. This is done by leveraging the highly modular and flexible
nature of Makefiles run via 'Make'.

GUI-based operations (that involve human interaction and cannot be run in
batch-mode) are one type of workflow that our proof-of-concept will not
support. But as discussed in the completeness criteria, human interaction
is an incompleteness, dramatically reducing the reproducibility of a
result.

------------------------------


36. [Reviewer 4] There is also the challenge of mathematical
    reproducibility, particularly of the handling of floating point number,
    which might occur because of the way the code is written, and the
    hardware architecture (including if code is optimised / parallelised).

ANSWER: Floating point errors and optimizations have been mentioned in the
discussion (Section V). The issue with parallelization has also been
discussed in Section IV, in the part on verification ("Where exact
reproducibility is not possible (for example due to parallelization),
values can be verified by a statistical method specified by the project
authors."). We have linked keywords in the latter sentence to a Software
Heritage URI [1] with the specific file in a Maneage'd paper that
illustrates an example of how statistical verification of parallelised code
can work in practice (Peper & Roukema 2020; zenodo.4062460).
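
For readers unfamiliar with the idea, a generic form of such a check is
sketched below in Python. This is an assumption-laden illustration, not
the method of the script linked at [1]: the function name, the use of the
sample mean and standard deviation of earlier runs, and the 3-sigma
default threshold are all choices made for this example.

```python
import statistics


def verify_statistically(new_value, reference_values, n_sigma=3.0):
    """Return True if 'new_value' lies within 'n_sigma' sample standard
    deviations of the mean of 'reference_values'."""
    mean = statistics.mean(reference_values)
    sigma = statistics.stdev(reference_values)  # needs at least two values
    return abs(new_value - mean) <= n_sigma * sigma
```

Here the project authors, not the framework, decide which statistic and
which tolerance are scientifically meaningful for their result.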

We would be interested to hear if any other papers already exist that use
automatic statistical verification of parallelised code as has been done in
this Maneage'd paper.

[1] https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh

------------------------------


37. [Reviewer 4] ... the handling of floating point number
    [reproducibility] ...  will come with a tradeoff against performance,
    which is never mentioned.

ANSWER: The criteria we propose and the proof-of-concept with Maneage do
not force the choice of a tradeoff between exact bitwise floating point
reproducibility versus performance (e.g. speed). The specific concepts of
"verification" and "reproducibility" will vary between domains of
scientific computation, but we expect that the criteria allow this wide
range.

Performance is indeed an important issue for _immediate_ reproducibility
and we would have liked to discuss it. But due to the strict word-count, we
feel that adding it to the discussion points, without having adequate space
to elaborate, can confuse the readers away from the focus of this paper (on
long term usability). It has therefore not been added.

------------------------------


38. [Reviewer 4] Tradeoff, which might affect Criterion 3 is time to result,
    people use popular frameworks because it is easier to use them.

ANSWER: That is true. In section IV, we have given the time it takes to
build Maneage (only once on each computer) to be around 1.5 hours on an
8-core CPU (a typical machine that may be used for data analysis). We
therefore conclude that when the analysis is complex (and thus taking many
hours, or even days to complete), this time is negligible.

But if the project's full analysis takes 10 minutes or less (like the
extremely simple analysis done in this paper), the 1.5 hour building time
is indeed significant. In those cases, as discussed in the main
body, the project can be built once in a Docker image and easily moved to
other computers.

Generally, it is true that the initial configuration time (only once on
each computer) of a Maneage install may discourage some scientists; but a
serious scientific research project is never started and completed on a
time scale of a few hours.

------------------------------


39. [Reviewer 4] I would liked to have seen explanation of how these
    challenges to adoption were identified: was this anecdotal, through
    surveys? participant observation?

ANSWER: The results mentioned here are anecdotal: based on private
discussions after holding multiple seminars and Webinars with RDA's
support, and also a workshop that was planned for non-astronomers. We
invited (funded) early career researchers to come to the workshop with the
RDA funding. However, that workshop was cancelled due to the COVID-19
pandemic and we had private communications instead.

We would very much like to elaborate on this experience of training new
researchers with these tools. However, as with many of the cases above, the
very strict word-limit doesn't allow us to elaborate beyond what we have
already written. Hopefully in a couple of years and with the wider usage of
Maneage or these criteria in research papers, we will be able to write a
paper that is directly focused on this.

------------------------------


40. [Reviewer 4] Potentially an interesting sidebar to investigate how
    LaTeX/TeX has ensured its longevity!

ANSWER: That is indeed a very interesting subject to study (an obvious link
is that LaTeX/TeX is very strongly based on plain text files). We have been
in touch with Karl Berry (one of the core people behind TeX Live, who also
plays a prominent role in GNU) and have witnessed the TeX Live community's
efforts to become more and more portable and longer-lived.

However, as the reviewer states, this would be a sidebar, and we are
constrained for space, so we couldn't find a place to highlight this. But
it is indeed a subject worthy of a full paper (that can be very useful for
many software projects).

------------------------------


41. [Reviewer 4] The title is not specific enough - it should refer to the
    reproducibility of workflows/projects.

ANSWER: A problem here is that "workflow" and "project" taken in isolation
risk being vague for wider audiences. Also, we aim at covering a wider
range of aspects of a project than just the workflow alone; in the
other direction, the word "project" could be seen as too broad, including
the funding, principal investigator, and team coordination.

A specific title that might be appropriate could be, for example, "Towards
long-term and archivable reproducibility of scientific computational
research projects". Using a term proposed by one of our reviewers, "Towards
long-term and archivable end-to-end reproducibility of scientific
computational research projects" might also be appropriate.

Nevertheless, we feel that in the context of an article published in CiSE,
our current short title is sufficient.

------------------------------


42. [Reviewer 4] Whilst the thesis stated is valid, it may not be useful to
    practitioners of computation science and engineering as it stands.

ANSWER: This point appears to refer to floating point bitwise
reproducibility and possibly to the conciseness of our paper. The former is
fully allowed for, as stated above, though not obligatory, using the
"verify.mk" rule file to (typically, but not obligatorily) force bitwise
reproducibility. The latter is constrained by the 6250-word limit of
CiSE. The addition of supplementary appendices in the extended version
helps respond to the latter point.
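
As an illustration of what forcing bitwise reproducibility means in
practice, the following is a minimal Python sketch. It is not the content
of "verify.mk": the function names and the choice of SHA-256 are
assumptions made only for this example. The idea is simply that an output
file passes verification only if its checksum matches a previously
recorded one.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of the raw bytes of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so that large outputs do not need to fit in memory.
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """True only if the file is bitwise identical to the recorded version.

    'expected' is the hex digest stored when the result was first produced
    (hypothetical bookkeeping for this sketch).
    """
    return sha256_of(path) == expected
```

A project that does not require bitwise identical results could replace
the comparison in verify() with a tolerance-based check on the parsed
numbers.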

------------------------------


43. [Reviewer 4] Longevity is not defined.

ANSWER: This has been defined now at the start of Section II.

------------------------------


44. [Reviewer 4] Whilst various tools are discussed and discarded, no
    attempt is made to categorise the magnitude of longevity for which they
    are relevant. For instance, environment isolators are regarded by the
    software preservation community as adequate for timescale of the order
    of years, but may not be suitable for the timescale of decades where
    porting and emulation are used.

ANSWER: Statements on quantifying the longevity of specific tools have been
added in Section II and are highlighted in green. For example, in the case
of Docker images: "their longevity is determined by the host kernel,
usually a decade"; for Python packages: "Python installation with a usual
longevity of a few years"; and for Nix/Guix: "with considerably better
longevity; same as supported CPU architectures".

------------------------------


45. [Reviewer 4] The title of this section "Commonly used tools and their
    longevity" is confusing - do you mean the longevity of the tools or the
    longevity of the workflows that can be produced using these tools?
    What happens if you use a combination of all four categories of tools?

ANSWER: We have changed the section title to "Longevity of existing tools"
to clarify that we refer to longevity of the tools.

If the four categories of tools were combined, then the overall longevity
would be the intersection of the time spans over which the individual tools
remained viable; it would thus be limited by the shortest-lived tool.
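
The intersection argument can be written down as a short Python sketch.
The function name and the years below are purely hypothetical and only
illustrate the arithmetic; they are not measured longevities of any
particular tool.

```python
def combined_viability(spans):
    """Intersect (first_year, last_year) spans of several tools.

    Returns the common (first_year, last_year) window, or None when the
    tools were never all viable at the same time.
    """
    start = max(first for first, _ in spans)
    end = min(last for _, last in spans)
    return (start, end) if start <= end else None


# Hypothetical viability spans for four tools, one per category.
spans = {
    "tool_a": (2015, 2025),
    "tool_b": (2016, 2020),
    "tool_c": (2010, 2040),
    "tool_d": (2014, 2019),
}
window = combined_viability(list(spans.values()))
```

With these illustrative numbers, the combined window opens with the tool
that becomes available last and closes with the tool that expires first.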

------------------------------


46. [Reviewer 4] It wasn't clear to me if code was being run to generate
    the results and figures in a LaTeX paper that is part of a project in
    Maneage. It appears to be suggested this is the case, but Figure 1
    doesn't show how this works - it just has the LaTeX files, the data
    files and the Makefiles. Is it being suggested that LaTeX itself is the
    programming language, using its macro functionality?

ANSWER: Thank you for highlighting this point of confusion. The caption of
Figure 1 has been edited to hopefully clarify the point. In short, the
arrows represent the operation of software and boxes represent files. In
the case of generating 'paper.pdf' from its three dependencies
('references.tex', 'paper.tex' and 'project.tex'), yes, LaTeX is used. But
in other steps, other tools are used (depending on the analysis). For
example, as you can see in [1], the main step of the arrow connecting
'table-3.txt' to 'tools-per-year.txt' is an AWK command (there are also a
few 'echo' commands for metadata and copyright in the output plain-text
file [2]).

[1] https://gitlab.com/makhlaghi/maneage-paper/-/blob/master/reproduce/analysis/make/demo-plot.mk#L51
[2] https://zenodo.org/record/3911395/files/tools-per-year.txt

------------------------------


47. [Reviewer 4] I was a bit confused on how collaboration is handled as
    well - this appears to be using the Git branching model, and the
    suggestion that Maneage is keeping track of all components from all
    projects - but what happens if you are working with collaborators that
    are using their own Maneage instance?

ANSWER: Indeed, Maneage operates based on the Git branching model. As
mentioned in the text, Maneage is itself a Git branch. Researchers spin off
their own branch for a specific project from the 'maneage' branch and start
customizing it for their particular project in their own particular
repository. They can also use all types of Git-based collaboration models
to work together on their branch.

Figure 2 in fact explicitly shows such a case: the main project leader is
committing on the "project" branch. But a collaborator creates a separate
branch over commit '01dd812' and makes a couple of commits ('f69e1f4' and
'716b56b'), and finally asks the project leader to merge them into the
project. This can be generalized to any Git-based collaboration model.

One of us [Roukema] recently found that a merge of a Maneage-based
cosmology simulation project (now zenodo.4062460), after separate evolution
of about 30-40 commits on maneage and possibly 100 on the project, needed
about one day of straightforward effort, without any major difficulties. So
it is easy to update the low-level infrastructure.

------------------------------


48. [Reviewer 4] I would also [have] liked to have seen a comparison
    between this approach and other "executable" paper approaches
    e.g. Jupyter notebooks, compared on completeness, time taken to
    write a "paper", ease of depositing in a repository, and ease of
    use by another researcher.

ANSWER: This type of sociological survey will make sense once the number of
projects run with Maneage is sufficiently high and comparable to Jupyter
for example. The time taken to write a paper is measurable
automatically from the Git history. The other parameters suggested would
require cooperation from the scientists in responding to the survey, or
will have to be collected anecdotally in the short term. This is a good
subject for a follow-up paper in a few years.
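
As a rough indication of how such a measurement could be automated, the
following Python sketch computes the elapsed time between the first and
last commits of a project. The function names are hypothetical, and the
default revision range is an assumption: for a project branched from
'maneage', a range such as 'maneage..HEAD' would probably be needed to
exclude Maneage's own history. The elapsed time is also only an upper
bound on the time actually spent writing.

```python
import subprocess


def parse_epochs(git_log_output):
    """Parse the output of 'git log --format=%at' (one UNIX time per line)."""
    return [int(line) for line in git_log_output.split()]


def elapsed_days(epochs):
    """Days between the earliest and latest commit times (in seconds)."""
    if not epochs:
        raise ValueError("no commits in the given range")
    return (max(epochs) - min(epochs)) / 86400


def project_elapsed_days(repo, revision_range="HEAD"):
    """Elapsed days of a project, read from the Git history of 'repo'."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--format=%at", revision_range],
        capture_output=True, text=True, check=True,
    )
    return elapsed_days(parse_epochs(result.stdout))
```

For example, project_elapsed_days(".", "maneage..HEAD") would be called
from the top directory of a project, assuming a local branch called
'maneage' exists there.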

------------------------------


49. [Reviewer 4] The weakest aspect is the assumption that research can be
    easily compartmentalized into simple and complete packages. Given that
    so much of research involves collaboration and interaction, this is not
    sufficiently addressed. In particular, the challenge of
    interdisciplinary work, where there may not be common languages to
    describe concepts and there may be different common workflow practices
    will be a barrier to wider adoption of the primary thesis and criteria.

ANSWER: Maneage was precisely defined to address the problem of
publishing/collaborating on complete workflows by many people (in this
paper itself, we are already 6 people who have been collaborating to
complete it and you can see this in the Git history). Git has been
exceptionally powerful in enabling collaboration on huge projects with
thousands of contributors, like the Linux kernel. Exactly the same
collaboration style as that of the Linux kernel can be implemented in
Maneage for large scientific projects.

Hopefully with the clarification to point 47 above, this should also become
clear.

------------------------------


50. [Reviewer 5] Major figures currently working in this exact field do not
    have their work acknowledged in this work.

ANSWER: This was due to the strict word limit and the CiSE publication
policy (to not include a literature review because there is a limit of only
12 citations). But we had indeed already done a comprehensive literature
review and the editors kindly agreed that we submit that review as
supplementary appendices.

------------------------------


51. [Reviewer 5] Jimenez I et al ... 2017 "The popper convention: Making
    reproducible systems evaluation practical ..." and the later
    revision that uses GitHub Actions, is largely the same as this
    work.

ANSWER: This work and the proposed criteria are very different from
Popper. A detailed review of Popper, in particular, is given in Appendix B.

------------------------------


52. [Reviewer 5] The lack of attention to virtual machines and containers
    is highly problematic. While a reader cannot rely on DockerHub or a
    generic OS version label for a VM or container, these are some of the
    most promising tools for offering true reproducibility.

ANSWER: Containers and VMs have been more thoroughly discussed in the main
body and also extensively discussed in Appendix A. As discussed (with many
cited examples), containers and VMs are only appropriate when they are
themselves reproducible (for example, if running the Dockerfile this year
and next year gives the same internal environment). However, we show that
this is not the case in most solutions (a more comprehensive review would
require its own paper).

Moreover, with complete, robust environment builders like Maneage, Nix or
GNU Guix, the analysis environment within a container can be exactly
reproduced later. But even so, due to their binary nature and large storage
volume, they are not trustworthy sources for the long term (it is expensive
to archive them). We show several examples in the paper and appendices of
how projects that relied on VMs in 2011 and 2014 are no longer active, and
how even Docker Hub will be deleting containers that are not used for more
than 6 months in free accounts (due to the high storage costs).

Furthermore, as a unique new feature, Maneage has the criterion of "Minimal
complexity". This means that even if, for any reason, the project is not
able to be run in the future, the content, analysis scripts, etc. are
accessible for the interested reader as plain text (only the development
history - the Git history - is stored in Git's binary format). Unlike Nix
or Guix, our approach doesn't need a third-party package manager:
the instructions for building all the software of a project are directly in
the same project as the high-level analysis software. The full end-to-end
process is transparent and archived in Maneage, and the interested
scientist can follow the analysis and study the different decisions of each
step (why and how the analysis was done). They can also modify it to work
on future hardware that we don't know about today (this is not possible
with binary files like VMs or containers).

------------------------------


53. [Reviewer 5] On the data side, containers have the promise to manage
    data sets and workflows completely [Lofstead J, Baker J, Younge A. Data
    pallets: containerizing storage for reproducibility and
    traceability. In International Conference on High Performance Computing
    2019 Jun 16 (pp. 36-45). Springer, Cham.] Taufer has picked up this
    work and has graduated a MS student working on this topic with a
    published thesis. See also Jimenez's P-RECS workshop at HPDC for
    additional work highly relevant to this paper.

ANSWER: Thank you for the interesting paper by Lofstead+2019 on Data
pallets. We have cited it in Appendix A as an example of how generic the
concept of containers is.

The topic of linking data to analysis is also a core result of the criteria
presented here, and is also discussed briefly in our paper.  There are
indeed many very interesting works on this topic. But the format of CiSE is
very short (a maximum of 6250 words with 12 references), so we don't have
the space to go into this any further. But this is indeed a very
interesting aspect for follow-up studies, especially as usage of
Maneage grows, and we have more example workflows by users to study the
linkage of data and analysis.

------------------------------


54. [Reviewer 5] Some other systems that do similar things include:
    reprozip, occam, whole tale, snakemake.

ANSWER: All these tools have been reviewed in the newly added appendices.

------------------------------


55. [Reviewer 5] the paper needs to include the context of the current
    community development level to be a complete research paper. A revision
    that includes evaluation of (using the criteria) and comparison with
    the suggested systems and a related work section that seriously
    evaluates the work of the recommended authors, among others, would make
    this paper worthy for publication.

ANSWER: A thorough review of current low-level tools and high-level
reproducible workflow management systems has been added in the extended
appendices.

------------------------------


56. [Reviewer 5] Yet another example of a reproducible workflows project.

ANSWER: As the newly added thorough comparisons with existing systems
show, this set of criteria and the proof-of-concept offer uniquely new
features. As another referee summarized: "This manuscript describes a new
reproducible workflow _which doesn't require another new trendy high-level
software_. The proposed workflow is only based on low-level tools already
widely known."

Interestingly, the fact that we don't define yet another workflow language
and framework is itself what makes our proof-of-concept unique. Another
unique feature of Maneage is that it is based on time-tested solutions
(the youngest tool we use is Git, which is already 15 years old) in a
framework that costs only ~100 kB to archive (in contrast to multi-GB
containers or VMs).

------------------------------


57. [Reviewer 5] There are numerous examples, mostly domain specific, and
    this one is not the most advanced general solution.

ANSWER: As the comparisons in the appendices and clarifications above show,
there are many features in the proposed criteria and proof of concept that
are new and not satisfied by the domain-specific solutions known to us.

------------------------------


58. [Reviewer 5] Lack of context in the field missing very relevant work
    that eliminates much, if not all, of the novelty of this work.

ANSWER: The newly added appendices thoroughly describe the context and
previous work that has been done in this field.

------------------------------