author | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-11-27 22:32:58 +0000 |
---|---|---|
committer | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-11-27 22:32:58 +0000 |
commit | d540cddaad7b9e1369f4520e2e6c97b7fd730956 (patch) | |
tree | 7477917c064ee3a14f37c50ff0f166144294c328 /peer-review | |
parent | bb5d173399d657453533cf1bdda584203a1d096e (diff) | |
parent | 90596115b4a454c70232b2610fbca2aff913ceb6 (diff) |
Merged with Boud's corrected answers (generally very similar)
The only issue that still remains is how to address statistical
reproducibility, and I am in touch with Boud to do this in the best way
possible (it has been highlighted with '#####'s in the answers).
Diffstat (limited to 'peer-review')
-rw-r--r-- | peer-review/1-answer.txt | 262 |
1 files changed, 155 insertions, 107 deletions
diff --git a/peer-review/1-answer.txt b/peer-review/1-answer.txt index e0b0da1..b837ce4 100644 --- a/peer-review/1-answer.txt +++ b/peer-review/1-answer.txt @@ -36,7 +36,6 @@ ANSWER: With all the corrections/clarifications that have been done in this review the focus of the paper should be clear now. We are very grateful to the thorough listing of points by the referees. - ------------------------------ @@ -140,7 +139,10 @@ very long "About" page into multiple pages to help in readability: https://maneage.org/about.html Generally, the webpage will soon undergo major improvements to be even more -clear. +clear. The website is developed on a public git repository +(https://git.maneage.org/webpage.git), so any specific proposals for +improvements can be handled efficiently and transparently and we welcome +any feedback in this aspect. ------------------------------ @@ -699,23 +701,37 @@ highly modular and flexible nature of Makefiles run via 'Make'. ANSWER: Floating point errors and optimizations have been mentioned in the discussion (Section V). The issue with parallelization has also been discussed in Section IV, in the part on verification ("Where exact -reproducibility is not possible (for example due to paralleliza- tion), +reproducibility is not possible (for example due to parallelization), values can be verified by any statistical means, specified by the project authors."). +##################### +Find a good way to link to (Peper and Roukema: +https://doi.org/10.5281/zenodo.4062460) +##################### + ------------------------------ -37. [Reviewer 4] Performance ... is never mentioned +37. [Reviewer 4] ... the handling of floating point number +[reproducibility] ... will come with a tradeoff agianst +performance, which is never mentioned. -ANSWER: Performance is indeed an important issue for _immediate_ -reproducibility and we would have liked to discuss it. 
But due to the -strict word-count, we feel that adding it to the discussion points, without -having adequate space to elaborate, can confuse the readers of this paper -(which is focused on long term usability). +ANSWER: The criteria we propose and the proof-of-concept with Maneage do +not force the choice of a tradeoff between exact bitwise floating point +reproducibility versus performance (e.g. speed). The specific concepts of +"verification" and "reproducibility" will vary between domains of +scientific computation, but we expect that the criteria allow this wide +range. + +Performance is indeed an important issue for _immediate_ reproducibility +and we would have liked to discuss it. But due to the strict word-count, we +feel that adding it to the discussion points, without having adequate space +to elaborate, can confuse the readers away from the focus of this paper +(which is focused on long term usability). It has therefore not been added. ------------------------------ @@ -727,10 +743,10 @@ having adequate space to elaborate, can confuse the readers of this paper people use popular frameworks because it is easier to use them. ANSWER: That is true. In section IV, we have given the time it takes to -build Maneage (only once for a project on each computer) to be around 1.5 -hours on an 8-core CPU (a typical machine that may be used for data -analysis). We therefore conclude that when the analysis is complex (and -thus taking many hours or days to complete), this time is negligible. +build Maneage (only once on each computer) to be around 1.5 hours on an +8-core CPU (a typical machine that may be used for data analysis). We +therefore conclude that when the analysis is complex (and thus taking many +hours or days to complete), this time is negligible. But if the project's full analysis takes 10 minutes or less (like the extremely simple analysis done in this paper which takes a fraction of a @@ -738,6 +754,11 @@ second). 
Indeed, the 1.5 hour building time is significant. In those cases, as discussed in the main body, the project can be built once in a Docker image and easily moved to other computers. +Generally, it is true that the initial configuration time (only once on +each computer) of a Maneage install may discourage some scientists; but a +serious scientific research project is never started and completed on a +time scale of a few hours. + ------------------------------ @@ -748,17 +769,18 @@ image and easily moved to other computers. challenges to adoption were identified: was this anecdotal, through surveys? participant observation? -ANSWER: The results mentioned here are based on private discussions after -holding multiple seminars and Webinars with RDA's support, and also a -workshop that was planned for non-astronomers. We even invited (funded) -early career researchers to come to the workshop with the RDA funding, -however, that workshop was cancelled due to the pandemic and we had private -communications after. +ANSWER: The results mentioned here are anecdotal: based on private +discussions after holding multiple seminars and Webinars with RDA's +support, and also a workshop that was planned for +non-astronomers. We invited (funded) early career researchers to +come to the workshop with the RDA funding. However, that workshop +was cancelled due to the pandemic and we had private communications +instead. We would very much like to elaborate on this experience of training new researchers with these tools. However, as with many of the cases above, the -very strict word-limit doesn't allow us to elaborate beyond what is already -there. +very strict word-limit doesn't allow us to elaborate beyond what we have +already written. ------------------------------ @@ -769,13 +791,16 @@ there. 40. [Reviewer 4] Potentially an interesting sidebar to investigate how LaTeX/TeX has ensured its longevity! -ANSWER: That is indeed a very interesting subject to study. 
We have been in -touch with Karl Berry (one of the core people behind TeX Live, who also +ANSWER: That is indeed a very interesting subject to study (an obvious link +is that LaTeX/TeX is very strongly based on plain text files). We have been +in touch with Karl Berry (one of the core people behind TeX Live, who also plays a prominent role in GNU) and have whitnessed the TeX Live community's -efforts to become more and more portable and longer-lived. But after -looking at the strict word limit, we couldn't find a place to highlight -this. But it is indeed a subject worthy of a full paper (that can be very -useful for many software projects0.. +efforts to become more and more portable and longer-lived. + +However, as the reviewer states, this would be a sidebar, and we are +constrained for space, so we couldn't find a place to highlight this. But +it is indeed a subject worthy of a full paper (that can be very useful for +many software projects0.. ------------------------------ @@ -786,11 +811,20 @@ useful for many software projects0.. 41. [Reviewer 4] The title is not specific enough - it should refer to the reproducibility of workflows/projects. -ANSWER: Since this journal is focused on "Computing in Science and -Engineering", the fact that it relates to computational workflows will be -clear to any reader. Since the other referees didn't complain about this, -we will keep it as it was, but of course, we are open to the suggestions of -the editors in the final title. +ANSWER: A problem here is that "workflow" and "project" taken in isolation +risk being vague for wider audiences. Also, we aim at covering a wider +range of aspects of a project than just than the workflow alone; in the +other direction, the word "project" could be seen as too broad, including +the funding, principal investigator, and team coordination. 
+ +A specific title that might be appropriate could be, for example, "Towards +long-term and archivable reproducibility of scientific computational +research projects". Using a term proposed by one of our reviewers, "Towards +long-term and archivable end-to-end reproducibility of scientific +computational research projects" might also be appropriate. + +Nevertheless, we feel that in the context of an article published in CiSE, +our current short title is sufficient. ------------------------------ @@ -801,12 +835,20 @@ the editors in the final title. 42. [Reviewer 4] Whilst the thesis stated is valid, it may not be useful to practitioners of computation science and engineering as it stands. -ANSWER: We would appreciate if you could clarify this point a little -more. We have shown how it has already been used in many research projects -(also outside of observational astronomy which is the first author's main -background). It is precisely defined for computational science and -engineering problems where _publication_ of the human-readable workflow -source is also important. +ANSWER: This point appears to refer to floating point bitwise reproducibility +and possibly to the conciseness of our paper. The former is fully allowed +for, as stated above, though not obligatory, using the "verify.mk" rule +file to (typically, but not obligatorily) force bitwise reproducibility. +The latter is constrained by the 6500-word limit. The addition of appendices +in the extended version may help respond to the latter point. + +The current small number of existing research projects using +Maneage, as indicated in the revised version of our paper includes +papers outside of observational astronomy (which is the first +author's main background). The fact that the approach is precisely +defined for computational science and engineering problems where +_publication_ of the human-readable workflow source is also +important may partly respond to this issue. 
------------------------------ @@ -816,7 +858,7 @@ source is also important. 43. [Reviewer 4] Longevity is not defined. -ANSWER: It has been defined now at the start of Section II. +ANSWER: This has been defined now at the start of Section II. ------------------------------ @@ -831,11 +873,12 @@ ANSWER: It has been defined now at the start of Section II. of years, but may not be suitable for the timescale of decades where porting and emulation are used. -ANSWER: Statements on quantifying their longevity have been added in -Section II. For example in the case of Docker images: "their longevity is -determined by the host kernel, usually a decade", for Python packages: -"Python installation with a usual longevity of a few years", for Nix/Guix: -"with considerably better longevity; same as supported CPU architectures." +ANSWER: Statements on quantifying the longevity of specific tools +have been added in Section II. For example in the case of Docker +images: "their longevity is determined by the host kernel, usually a +decade", for Python packages: "Python installation with a usual +longevity of a few years", for Nix/Guix: "with considerably better +longevity; same as supported CPU architectures." ------------------------------ @@ -848,16 +891,12 @@ determined by the host kernel, usually a decade", for Python packages: longevity of the workflows that can be produced using these tools? What happens if you use a combination of all four categories of tools? -ANSWER: Thank you for highlighting this. The title has been shortend and -the section immediately starts with definitions. +ANSWER: We have changed the section title to "Longevity of existing tools" +to clarify that we refer to longevity of the tools. -The aspects of the tools discussed in this section are orthogonal to each -other. For example a VM/container, package manager, notebook: some projects -may have any different combinations of the three. 
In some aspects using -them together can improve the operations, but for example building a -VM/container with or without a package manager makes no difference on the -main issue we raise about containers (that they are large binary blobs that -don't necessarily contain how the environment within them was built). +If the four categories of tools were combined, then the overall +longevity would be that of the shortest intersection of the time +spans over which the tools remained viable. ------------------------------ @@ -911,20 +950,30 @@ branch over commit '01dd812' and makes a couple of commits ('f69e1f4' and '716b56b'), and finally asks the project leader to merge them into the project. This can be generalized to any Git based collaboration model. +Recent experience by one of us [Roukema] found that a merge of a +Maneage-based cosmology simulation project (now zenodo.4062460), +after separate evolution of about 30-40 commits on maneage and +possibly 100 on the project, needed about one day of straightforward +effort, without any major difficulties. + ------------------------------ -48. [Reviewer 4] I would also liked to have seen a comparison between this - approach and other "executable" paper approaches e.g. Jupyter - notebooks, compared on completeness, time taken to write a "paper", - ease of depositing in a repository, and ease of use by another - researcher. +48. [Reviewer 4] I would also [have] liked to have seen a comparison + between this approach and other "executable" paper approaches + e.g. Jupyter notebooks, compared on completeness, time taken to + write a "paper", ease of depositing in a repository, and ease of + use by another researcher. -ANSWER: These have been highlighted in various parts of the text (also -reviewed in previous points). +ANSWER: This type of sociological survey will make sense once the number of +projects run with Maneage is sufficiently high. 
The time taken to write a +paper should be measurable automatically: from the git history. The other +parameters suggested would require cooperation from the scientists in +responding to the survey, or will have to be collected anecdotally in the +short term. ------------------------------ @@ -953,11 +1002,12 @@ clarification to point 47 above, this should also become clear. 50. [Reviewer 5] Major figures currently working in this exact field do not have their work acknowledged in this work. -ANSWER: This was due to the strict word limit and the CiSE publication -policy (to not include a literature review because there is a limit of only -12 citations). But we had indeed done a comprehensive literature review and -the editors kindly agreed that we publish that review as appendices to the -main paper on arXiv and Zenodo. +ANSWER: This was due to the strict word limit and the CiSE +publication policy (to not include a literature review because there +is a limit of only 12 citations). But we had indeed already done a +comprehensive literature review and the editors kindly agreed that +we publish that review as appendices to the main paper on arXiv and +Zenodo. ------------------------------ @@ -965,12 +1015,16 @@ main paper on arXiv and Zenodo. -51. [Reviewer 5] The popper convention: Making reproducible systems - evaluation practical ... and the later revision that uses GitHub - Actions, is largely the same as this work. +51. [Reviewer 5] Jimenez I et al ... 2017 "The popper convention: Making + reproducible systems evaluation practical ..." and the later + revision that uses GitHub Actions, is largely the same as this + work. ANSWER: This work and the proposed criteria are very different from -Popper. A review of Popper has been given in Appendix B. +Popper. We agree that VMs and containers are an important component +of this field, and the appendices add depth to our discussion of this. +However, these do not appear to satisfy all our proposed criteria. 
+A detailed review of Popper, in particular, is given in Appendix B. ------------------------------ @@ -983,33 +1037,37 @@ Popper. A review of Popper has been given in Appendix B. generic OS version label for a VM or container, these are some of the most promising tools for offering true reproducibility. -ANSWER: Containers and VMs have been more thoroughly discussed in the main -body and also extensively discussed in appendix A (that are now available -in the arXiv and Zenodo versions of this paper). As discussed (with many -cited examples), Contains and VMs are only good when they are themselves -reproducible (for example running the Dockerfile this year and next year -gives the same internal environment). However we show that this is not the -case in most solutions (a more comprehensive review would require its own +ANSWER: Containers and VMs have been more thoroughly discussed in +the main body and also extensively discussed in appendix A (that are +now available in the arXiv and Zenodo versions of this paper). As +discussed (with many cited examples), Containers and VMs are only +appropriate when they are themselves reproducible (for example, if +running the Dockerfile this year and next year gives the same +internal environment). However, we show that this is not the case in +most solutions (a more comprehensive review would require its own paper). -However with complete/robust environment builders like Maneage, Nix or GNU +Moreover, with complete, robust environment builders like Maneage, Nix or GNU Guix, the analysis environment within a container can be exactly reproduced later. But even so, due to their binary nature and large storage volume, they are not trusable sources for the long term (it is expensive to archive them). 
We show several example in the paper of how projects that relied on VMs in 2011 and 2014 are no longer active, and how even Dockerhub will be deleting containers that are not used for more than 6 months in free -accounts (due to the large storage costs). - -Furthermore, As a unique new feature, Maneage has the criterion of "Minimal -complexity". This means that even if for any reason the project is not able -to be run in the future, the content, analysis scripts, etc. are accesible -for the interested reader (because it is in plain text). Unlike Nix or Guix -it also doesn't have a third-party package package manager: the -instructions of building all the software of a project are directly in the -same project as the high-level analysis software. So, it is transparent in -any case and the interested reader can follow the analysis and study the -different decissions of each step (why and how the analysis was done). +accounts (due to the high storage costs). + +Furthermore, as a unique new feature, Maneage has the criterion of +"Minimal complexity". This means that even if for any reason the +project is not able to be run in the future, the content, analysis +scripts, etc. are accessible for the interested reader since they +are stored as plain text (only the development history - the git +history - is storied in git's binary format). Unlike Nix or Guix, +our approach doesn't need a third-party package package manager: the +instructions for building all the software of a project are directly +in the same project as the high-level analysis software. The full +end-to-end process is transparent in our case, and the interested +scientist can follow the analysis and study the different decisions +of each step (why and how the analysis was done). ------------------------------ @@ -1027,16 +1085,16 @@ different decissions of each step (why and how the analysis was done). additional work highly relevant to this paper. 
ANSWER: Thank you for the interesting paper by Lofstead+2019 on Data -pallets. We have cited it in Appendix A as examples of how generic the +pallets. We have cited it in Appendix A as an example of how generic the concept of containers is. The topic of linking data to analysis is also a core result of the criteria -presented here, and is also discussed shortly in the paper. There are +presented here, and is also discussed briefly in our paper. There are indeed many very interesting works on this topic. But the format of CiSE is -very short (a maximum of ~6000 words with 12 references), so we don't have +very short (a maximum of ~6500 words with 12 references), so we don't have the space to go into this any further. But this is indeed a very -interesting aspect for follow up studies, especially as the usage of -Maneage incrases, and we have more example workflows by users to study the +interesting aspect for follow-up studies, especially as usage of +Maneage grows, and we have more example workflows by users to study the linkage of data analysis. ------------------------------ @@ -1073,18 +1131,8 @@ Appendix. -56. [Reviewer 5] Offers criteria any system that offers reproducibility - should have. - -ANSWER: - ------------------------------- - - - - -57. [Reviewer 5] Yet another example of a reproducible workflows project. +56. [Reviewer 5] Yet another example of a reproducible workflows project. ANSWER: As the newly added thorough comparisons with existing systems shows, these set of criteria and the proof-of-concept offer uniquely new @@ -1104,12 +1152,12 @@ is new. -58. [Reviewer 5] There are numerous examples, mostly domain specific, and +57. [Reviewer 5] There are numerous examples, mostly domain specific, and this one is not the most advanced general solution. ANSWER: As the comparisons in the appendices and clarifications above show, there are many features in the proposed criteria and proof of concept that -are new. 
+are new and not satisfied by the domain-specific solutions known to us. ------------------------------ @@ -1117,7 +1165,7 @@ are new. -59. [Reviewer 5] Lack of context in the field missing very relevant work +58. [Reviewer 5] Lack of context in the field missing very relevant work that eliminates much, if not all, of the novelty of this work. ANSWER: The newly added appendices thoroughly describe the context and |