All questions have now been responded to

This commit is intended to be of submittable quality. Point 56 was removed, and the later points renumbered, because it was a point in which Reviewer 5 described what we have done; it was not a criticism to respond to. :) The current word count (without abstract and references) is 6091.
author: Boud Roukema <boud@cosmo.torun.pl> 2020-11-26 05:39:50 +0100
committer: Boud Roukema <boud@cosmo.torun.pl> 2020-11-26 05:39:50 +0100
commit: 90596115b4a454c70232b2610fbca2aff913ceb6 (patch)
tree: 6405ccd44a808bd9cadf48d2e3f718d57609a2e4 /peer-review
parent: eb984bd431af209dbdc8bad8ee52435ccb89f5d0 (diff)
1 files changed, 170 insertions, 104 deletions
diff --git a/peer-review/1-answer.txt b/peer-review/1-answer.txt
index 5e612f8..5c27866 100644
--- a/peer-review/1-answer.txt
+++ b/peer-review/1-answer.txt
@@ -32,9 +32,10 @@ reader can easily access them.
 2.  [Associate Editor] There are general concerns about the paper
     lacking focus
 
-###########################
-ANSWER:
-###########################
+ANSWER: We believe that by responding to the specific concerns raised
+by the reviewers, as detailed below, we have tightened the focus of
+the paper.
+
 
 ------------------------------
 
@@ -132,10 +133,14 @@ future.
     is on the article, it is important that readers not be confused
     when they visit your site to use your tools.
 
-###########################
-ANSWER [NOT COMPLETE]: We should separate the various sections of the
-README-hacking.md webpage into smaller pages that can be entered.
-###########################
+ANSWER: Improving the consistency between this research paper and
+the Maneage website is a useful recommendation. We have listed
+this together with point 29 below at
+https://savannah.nongnu.org/task/index.php?15823
+on the Maneage development task list. As indicated there, the
+website is developed on a public git repository, so any specific
+proposals for improvements can be handled efficiently and
+transparently.
 
 ------------------------------
 
@@ -597,9 +602,9 @@ level of peer-review control.
     Tutorial. A topic breakdown is interesting, as the markdown reading may
     be too long to find information.
 
-#####################################
-ANSWER:
-#####################################
+ANSWER: Thank you for the very useful suggestion. We have listed this as
+a task at  https://savannah.nongnu.org/task/index.php?15823 .
+
 
 ------------------------------
 
@@ -691,9 +696,18 @@ highly modular and flexible nature of Makefiles run via 'Make'.
     which might occur because of the way the code is written, and the
     hardware architecture (including if code is optimised / parallelised).
 
-################################
-ANSWER:
-################################
+
+ANSWER: The authors of particular projects have to choose the level
+of floating point reproducibility that they judge viable. In section IV,
+within the 6500-word limit, this is briefly described in the discussion
+of the "verify.mk" rule file. The main paragraph is "Just before reaching ...
+All project deliverables ... are verified ... with their checksums, to
+automatically ensure exact reproducibility. .... [or] by any statistical
+means, specified by the project authors."
+
+We have added a brief reference to zenodo.3951151, pointing out that
+it illustrates an approach for statistical verifiability of
+parallelised code using Maneage.
 
 ------------------------------
 
@@ -701,20 +715,30 @@ ANSWER:
 
 
 
-37. [Reviewer 4] Performance ... is never mentioned
+37. [Reviewer 4] ... the handling of floating point number
+[reproducibility] ...  will come with a tradeoff against
+performance, which is never mentioned.
+
+ANSWER: The criteria we propose and the proof-of-concept with
+Maneage do not force the choice of a tradeoff between exact bitwise
+floating point reproducibility versus performance (e.g. speed). The
+specific concepts of "verification" and "reproducibility" will vary
+between domains of scientific computation, but we expect that the
+criteria allow this wide range. We did not add text on this point.
 
-################################
-ANSWER:
-################################
 
 ------------------------------
 
 38. [Reviewer 4] Tradeoff, which might affect Criterion 3 is time to result,
     people use popular frameworks because it is easier to use them.
 
-################################
-ANSWER:
-################################
+ANSWER: Section IV includes some quantified examples of timing
+involved in the Maneage implementation of the criteria of our
+paper. It is true that the initial build time of a Maneage install
+may discourage some scientists; but a serious scientific research
+project is never started and completed on a time scale of a few
+hours.
+
 
 ------------------------------
 
@@ -726,17 +750,18 @@ ANSWER:
     challenges to adoption were identified: was this anecdotal, through
     surveys? participant observation?
 
-ANSWER: The results mentioned here are based on private discussions after
-holding multiple seminars and Webinars with RDA's support, and also a
-workshop that was planned for non-astronomers. We even invited (funded)
-early career researchers to come to the workshop with the RDA funding,
-however, that workshop was cancelled due to the pandemic and we had private
-communications after.
+ANSWER: The results mentioned here are anecdotal: based on private
+discussions after holding multiple seminars and Webinars with RDA's
+support, and also a workshop that was planned for
+non-astronomers. We invited (funded) early career researchers to
+come to the workshop with the RDA funding.  However, that workshop
+was cancelled due to the pandemic and we had private communications
+instead.
 
 We would very much like to elaborate on this experience of training new
 researchers with these tools. However, as with many of the cases above, the
-very strict word-limit doesn't allow us to elaborate beyond what is already
-there.
+very strict word-limit doesn't allow us to elaborate beyond what we have
+already written.
 
 ------------------------------
 
@@ -747,9 +772,14 @@ there.
 40. [Reviewer 4] Potentially an interesting sidebar to investigate how
     LaTeX/TeX has ensured its longevity!
 
-##############################
-ANSWER:
-##############################
+
+ANSWER: We agree that this would be interesting; an obvious link is
+that LaTeX/TeX is very strongly based on plain text files, making user
+hacking easy, provided that the user is willing to experiment and
+search and read through the source files. However, as the reviewer states,
+this would be a sidebar, and we are constrained for space.
+
+
 
 ------------------------------
 
@@ -760,9 +790,22 @@ ANSWER:
 41. [Reviewer 4] The title is not specific enough - it should refer to the
     reproducibility of workflows/projects.
 
-##############################
-ANSWER:
-##############################
+ANSWER: A problem here is that "workflow" and "project" taken in
+isolation risk being vague for wider audiences. Also, we aim at
+covering a wider range of aspects of a project than just the
+workflow alone; in the other direction, the word "project" could be
+seen as too broad, including the funding, principal investigator,
+and team coordination.
+
+A specific title that might be appropriate could be, for example,
+"Towards long-term and archivable reproducibility of scientific
+computational research projects". Using a term proposed by one of
+our reviewers, "Towards long-term and archivable end-to-end
+reproducibility of scientific computational research projects"
+might also be appropriate.
+
+Nevertheless, we feel that in the context of an article published in CiSE,
+our current short title is sufficient.
 
 ------------------------------
 
@@ -773,12 +816,20 @@ ANSWER:
 42. [Reviewer 4] Whilst the thesis stated is valid, it may not be useful to
     practitioners of computation science and engineering as it stands.
 
-ANSWER: We would appreciate if you could clarify this point a little
-more. We have shown how it has already been used in many research projects
-(also outside of observational astronomy which is the first author's main
-background). It is precisely defined for computational science and
-engineering problems where _publication_ of the human-readable workflow
-source is also important.
+ANSWER: This point appears to refer to floating point bitwise reproducibility
+and possibly to the conciseness of our paper. The former is fully allowed
+for, as stated above, though not obligatory, using the "verify.mk" rule
+file to (typically, but not obligatorily) force bitwise reproducibility.
+The latter is constrained by the 6500-word limit. The addition of appendices
+in the extended version may help respond to the latter point.
+
+The current small number of existing research projects using
+Maneage, as indicated in the revised version of our paper, includes
+papers outside of observational astronomy (which is the first
+author's main background). The fact that the approach is precisely
+defined for computational science and engineering problems where
+_publication_ of the human-readable workflow source is also
+important may partly respond to this issue.
 
 ------------------------------
 
@@ -788,7 +839,7 @@ source is also important.
 
 43. [Reviewer 4] Longevity is not defined.
 
-ANSWER: It has been defined now at the start of Section II.
+ANSWER: This has been defined now at the start of Section II.
 
 ------------------------------
 
@@ -803,11 +854,12 @@ ANSWER: It has been defined now at the start of Section II.
     of years, but may not be suitable for the timescale of decades where
     porting and emulation are used.
 
-ANSWER: Statements on quantifying their longevity have been added in
-Section II. For example in the case of Docker images: "their longevity is
-determined by the host kernel, usually a decade", for Python packages:
-"Python installation with a usual longevity of a few years", for Nix/Guix:
-"with considerably better longevity; same as supported CPU architectures."
+ANSWER: Statements on quantifying the longevity of specific tools
+have been added in Section II. For example in the case of Docker
+images: "their longevity is determined by the host kernel, usually a
+decade", for Python packages: "Python installation with a usual
+longevity of a few years", for Nix/Guix: "with considerably better
+longevity; same as supported CPU architectures."
 
 ------------------------------
 
@@ -820,9 +872,13 @@ determined by the host kernel, usually a decade", for Python packages:
     longevity of the workflows that can be produced using these tools?
     What happens if you use a combination of all four categories of tools?
 
-##########################
-ANSWER:
-##########################
+
+ANSWER: We have changed the section title to "Longevity of existing tools"
+to clarify that we refer to longevity of the tools.
+
+If the four categories of tools were combined, then the overall
+longevity would be that of the intersection of the time spans over
+which all the tools remained viable (set by the shortest-lived tool).
 
 ------------------------------
 
@@ -876,21 +932,32 @@ branch over commit '01dd812' and makes a couple of commits ('f69e1f4' and
 '716b56b'), and finally asks the project leader to merge them into the
 project. This can be generalized to any Git based collaboration model.
 
+Recent experience by one of us [Roukema] found that a merge of a
+Maneage-based cosmology simulation project (now zenodo.4062460),
+after separate evolution of about 30-40 commits on maneage and
+possibly 100 on the project, needed about one day of straightforward
+effort, without any major difficulties.
+
 ------------------------------
 
 
 
 
 
-48. [Reviewer 4] I would also liked to have seen a comparison between this
-    approach and other "executable" paper approaches e.g. Jupyter
-    notebooks, compared on completeness, time taken to write a "paper",
-    ease of depositing in a repository, and ease of use by another
-    researcher.
+48. [Reviewer 4] I would also [have] liked to have seen a comparison
+    between this approach and other "executable" paper approaches
+    e.g. Jupyter notebooks, compared on completeness, time taken to
+    write a "paper", ease of depositing in a repository, and ease of
+    use by another researcher.
 
-#######################
-ANSWER:
-#######################
+
+ANSWER: This type of sociological survey will make sense once the
+   number of projects run with Maneage is sufficiently high. The
+   time taken to write a paper should be measurable automatically,
+   from the git history. The other parameters suggested would
+   require cooperation from the scientists in responding to
+   the survey, or will have to be collected anecdotally in the
+   short term.
 
 ------------------------------
 
@@ -919,11 +986,12 @@ clarification to point 47 above, this should also become clear.
 50. [Reviewer 5] Major figures currently working in this exact field do not
     have their work acknowledged in this work.
 
-ANSWER: This was due to the strict word limit and the CiSE publication
-policy (to not include a literature review because there is a limit of only
-12 citations). But we had indeed done a comprehensive literature review and
-the editors kindly agreed that we publish that review as appendices to the
-main paper on arXiv and Zenodo.
+ANSWER: This was due to the strict word limit and the CiSE
+publication policy (to not include a literature review because there
+is a limit of only 12 citations). But we had indeed already done a
+comprehensive literature review and the editors kindly agreed that
+we publish that review as appendices to the main paper on arXiv and
+Zenodo.
 
 ------------------------------
 
@@ -931,12 +999,16 @@ main paper on arXiv and Zenodo.
 
 
 
-51. [Reviewer 5] The popper convention: Making reproducible systems
-    evaluation practical ... and the later revision that uses GitHub
-    Actions, is largely the same as this work.
+51. [Reviewer 5] Jimenez I et al ... 2017 "The popper convention: Making
+    reproducible systems evaluation practical ..." and the later
+    revision that uses GitHub Actions, is largely the same as this
+    work.
 
 ANSWER: This work and the proposed criteria are very different from
-Popper. A review of Popper has been given in Appendix B.
+Popper. We agree that VMs and containers are an important component
+of this field, and the appendices add depth to our discussion of this.
+However, these do not appear to satisfy all our proposed criteria.
+A detailed review of Popper, in particular, is given in Appendix B.
 
 ------------------------------
 
@@ -949,33 +1021,37 @@ Popper. A review of Popper has been given in Appendix B.
     generic OS version label for a VM or container, these are some of the
     most promising tools for offering true reproducibility.
 
-ANSWER: Containers and VMs have been more thoroughly discussed in the main
-body and also extensively discussed in appendix A (that are now available
-in the arXiv and Zenodo versions of this paper). As discussed (with many
-cited examples), Contains and VMs are only good when they are themselves
-reproducible (for example running the Dockerfile this year and next year
-gives the same internal environment). However we show that this is not the
-case in most solutions (a more comprehensive review would require its own
+ANSWER: Containers and VMs have been more thoroughly discussed in
+the main body and also extensively in Appendix A (the appendices are
+now available in the arXiv and Zenodo versions of this paper). As
+discussed (with many cited examples), Containers and VMs are only
+appropriate when they are themselves reproducible (for example, if
+running the Dockerfile this year and next year gives the same
+internal environment). However, we show that this is not the case in
+most solutions (a more comprehensive review would require its own
 paper).
 
-However with complete/robust environment builders like Maneage, Nix or GNU
+Moreover, with complete, robust environment builders like Maneage, Nix or GNU
 Guix, the analysis environment within a container can be exactly reproduced
 later. But even so, due to their binary nature and large storage volume,
 they are not trustworthy sources for the long term (it is expensive to archive
 them). We show several examples in the paper of how projects that relied on
 VMs in 2011 and 2014 are no longer active, and how even Dockerhub will be
 deleting containers that are not used for more than 6 months in free
-accounts (due to the large storage costs).
-
-Furthermore, As a unique new feature, Maneage has the criterion of "Minimal
-complexity". This means that even if for any reason the project is not able
-to be run in the future, the content, analysis scripts, etc. are accesible
-for the interested reader (because it is in plain text). Unlike Nix or Guix
-it also doesn't have a third-party package package manager: the
-instructions of building all the software of a project are directly in the
-same project as the high-level analysis software. So, it is transparent in
-any case and the interested reader can follow the analysis and study the
-different decissions of each step (why and how the analysis was done).
+accounts (due to the high storage costs).
+
+Furthermore, as a unique new feature, Maneage has the criterion of
+"Minimal complexity". This means that even if for any reason the
+project is not able to be run in the future, the content, analysis
+scripts, etc. are accessible for the interested reader since they
+are stored as plain text (only the development history - the git
+history - is stored in git's binary format). Unlike Nix or Guix,
+our approach doesn't need a third-party package manager: the
+instructions for building all the software of a project are directly
+in the same project as the high-level analysis software. The full
+end-to-end process is transparent in our case, and the interested
+scientist can follow the analysis and study the different decisions
+of each step (why and how the analysis was done).
 
 ------------------------------
 
@@ -993,16 +1069,16 @@ different decissions of each step (why and how the analysis was done).
     additional work highly relevant to this paper.
 
 ANSWER: Thank you for the interesting paper by Lofstead+2019 on Data
-pallets. We have cited it in Appendix A as examples of how generic the
+pallets. We have cited it in Appendix A as an example of how generic the
 concept of containers is.
 
 The topic of linking data to analysis is also a core result of the criteria
-presented here, and is also discussed shortly in the paper.  There are
+presented here, and is also discussed briefly in our paper.  There are
 indeed many very interesting works on this topic. But the format of CiSE is
-very short (a maximum of ~6000 words with 12 references), so we don't have
+very short (a maximum of ~6500 words with 12 references), so we don't have
 the space to go into this any further. But this is indeed a very
-interesting aspect for follow up studies, especially as the usage of
-Maneage incrases, and we have more example workflows by users to study the
+interesting aspect for follow-up studies, especially as usage of
+Maneage grows, and we have more example workflows by users to study the
 linkage of data analysis.
 
 ------------------------------
@@ -1039,18 +1115,8 @@ Appendix.
 
 
 
-56. [Reviewer 5] Offers criteria any system that offers reproducibility
-   should have.
-
-ANSWER:
-
-------------------------------
-
-
-
-
 
-57. [Reviewer 5] Yet another example of a reproducible workflows project.
+56. [Reviewer 5] Yet another example of a reproducible workflows project.
 
 ANSWER: As the newly added thorough comparisons with existing systems
 show, this set of criteria and the proof-of-concept offer uniquely new
@@ -1070,12 +1136,12 @@ is new.
 
 
 
-58. [Reviewer 5] There are numerous examples, mostly domain specific, and
+57. [Reviewer 5] There are numerous examples, mostly domain specific, and
     this one is not the most advanced general solution.
 
 ANSWER: As the comparisons in the appendices and clarifications above show,
 there are many features in the proposed criteria and proof of concept that
-are new.
+are new and not satisfied by the domain-specific solutions known to us.
 
 ------------------------------
 
@@ -1083,7 +1149,7 @@ are new.
 
 
 
-59. [Reviewer 5] Lack of context in the field missing very relevant work
+58. [Reviewer 5] Lack of context in the field missing very relevant work
     that eliminates much, if not all, of the novelty of this work.
 
 ANSWER: The newly added appendices thoroughly describe the context and