From 90596115b4a454c70232b2610fbca2aff913ceb6 Mon Sep 17 00:00:00 2001 From: Boud Roukema Date: Thu, 26 Nov 2020 05:39:50 +0100 Subject: All questions have now been responded to This commit is intended to be of submittable quality. Point 56 was removed, and the later points renumbered, because in it Reviewer 5 described what we have done - it was not a criticism to respond to. :) The current word count (without abstract and references) is 6091. --- paper.tex | 17 +-- peer-review/1-answer.txt | 274 +++++++++++++++++++++++++++++------------------ 2 files changed, 179 insertions(+), 112 deletions(-) diff --git a/paper.tex b/paper.tex index 9dca3c9..e19a7df 100644 --- a/paper.tex +++ b/paper.tex @@ -79,7 +79,7 @@ at the end (Appendices \ref{appendix:existingtools} and \ref{appendix:existingso \emph{Reproducible supplement} --- All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}, Git history of source at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}}, - which is also archived on \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{SoftwareHeritage}. + which is also archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{SoftwareHeritage}. \end{abstract} % Note that keywords are not normally used for peer-review papers. @@ -277,7 +277,7 @@ In such cases, it is best to immediately convert the data upon collection, and a \section{Proof of concept: Maneage} With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}. 
-\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.} +\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} (which illustrates statistical reproducibility for parallelised code).} It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective. The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}. @@ -1367,7 +1367,7 @@ Its design is based on a change-based provenance model using a custom VisTrails Since XML is a plane text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository that operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}). . However, even though XML is in plain text, it is very hard to edit manually. -VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:analysisworkflow}). +VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:datalineage}). Besides the fact that it is no longer maintained, VisTrails didn't control the software that is run, it only controls the sequence of steps that they are run in. 
@@ -1613,8 +1613,8 @@ However, there is one directory which can be used to store files that must not b Popper\footnote{\inlinecode{\url{https://falsifiable.us}}} is a software implementation of the Popper Convention \citeappendix{jimenez17}. The Popper team's own solution is through a command-line program called \inlinecode{popper}. The \inlinecode{popper} program itself is written in Python. -However, job management wash initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows. -However, from October 2019 Github changed to a custom YAML-based languguage, so Popper also depreciated HCL. +However, job management was initially based on the HashiCorp configuration language (HCL) because HCL was used by ``GitHub Actions'' to manage workflows. +Moreover, from October 2019 GitHub changed to a custom YAML-based language, so Popper also deprecated HCL. This is an important issue when low-level choices are based on service providers. To start a project, the \inlinecode{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run. @@ -1622,9 +1622,10 @@ However, as of this writing, the scaffold isn't complete: it lacks a manuscript By default Popper runs in a Docker image (so root permissions are necessary and reproducible issues with Docker images have been discussed above), but Singularity is also supported. See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow. -Igonoring the failure to comply with the completeness, minimal complexity and includig narrative, the scaffold that is provided by Popper is an output of the program that is not directly under version control. -Hence tracking future changes in Popper and how they relate to the high level projects that depend on it will be very hard. 
-In Maneage, the same \inlinecode{maneage} git branch is shared by the developers and users, any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage. +Popper does not comply with the completeness, minimal complexity and including-narrative criteria. +Moreover, the scaffold that is provided by Popper is an output of the program that is not directly under version control. +Hence, tracking future changes in Popper and how they relate to the high-level projects that depend on it will be very hard. +In Maneage, the same \inlinecode{maneage} git branch is shared by the developers and users; any new feature or change in Maneage can thus be directly tracked with Git when the high-level project merges their branch with Maneage. \subsection{Whole Tale (2017)} diff --git a/peer-review/1-answer.txt b/peer-review/1-answer.txt index 5e612f8..5c27866 100644 --- a/peer-review/1-answer.txt +++ b/peer-review/1-answer.txt @@ -32,9 +32,10 @@ reader can easily access them. 2. [Associate Editor] There are general concerns about the paper lacking focus -########################### -ANSWER: -########################### +ANSWER: We believe that by responding to the specific concerns raised +by the reviewers, as detailed below, we have tightened the focus of +the paper. + ------------------------------ @@ -132,10 +133,14 @@ future. is on the article, it is important that readers not be confused when they visit your site to use your tools. -########################### -ANSWER [NOT COMPLETE]: We should separate the various sections of the -README-hacking.md webpage into smaller pages that can be entered. -########################### +ANSWER: Improving the consistency between this research paper and +the Maneage website is a useful recommendation. We have listed +this together with point 29 below at +https://savannah.nongnu.org/task/index.php?15823 +on the Maneage development task list. 
As indicated there, the +website is developed on a public git repository, so any specific +proposals for improvements can be handled efficiently and +transparently. ------------------------------ @@ -597,9 +602,9 @@ level of peer-review control. Tutorial. A topic breakdown is interesting, as the markdown reading may be too long to find information. -##################################### -ANSWER: -##################################### +ANSWER: Thank you for the very useful suggestion. We have listed this as +a task at https://savannah.nongnu.org/task/index.php?15823 . + ------------------------------ @@ -691,9 +696,18 @@ highly modular and flexible nature of Makefiles run via 'Make'. which might occur because of the way the code is written, and the hardware architecture (including if code is optimised / parallelised). -################################ -ANSWER: -################################ + +ANSWER: The authors of particular projects have to choose the level of +floating point reproducibility that they judge viable. In section IV, +within the 6500-word limit, this is briefly described in the discussion +of the "verify.mk" rule file. The main paragraph is "Just before reaching ... +All project deliverables ... are verified ... with their checksums, to +automatically ensure exact reproducibility. .... [or] by any statistical +means, specified by the project authors." + +We have added a brief reference to zenodo.4062460, pointing out that +it illustrates an approach for statistical verifiability of +parallelised code using Maneage. ------------------------------ @@ -701,20 +715,30 @@ ANSWER: -37. [Reviewer 4] Performance ... is never mentioned +37. [Reviewer 4] ... the handling of floating point number +[reproducibility] ... will come with a tradeoff against +performance, which is never mentioned. 
+ +ANSWER: The criteria we propose and the proof-of-concept with +Maneage do not force the choice of a tradeoff between exact bitwise +floating point reproducibility versus performance (e.g. speed). The +specific concepts of "verification" and "reproducibility" will vary +between domains of scientific computation, but we expect that the +criteria allow this wide range. We did not add text on this point. -################################ -ANSWER: -################################ ------------------------------ 38. [Reviewer 4] Tradeoff, which might affect Criterion 3 is time to result, people use popular frameworks because it is easier to use them. -################################ -ANSWER: -################################ +ANSWER: Section IV includes some quantified examples of timing +involved in the Maneage implementation of the criteria of our +paper. It is true that the initial build time of a Maneage install +may discourage some scientists; but a serious scientific research +project is never started and completed on a time scale of a few +hours. + ------------------------------ @@ -726,17 +750,18 @@ ANSWER: challenges to adoption were identified: was this anecdotal, through surveys? participant observation? -ANSWER: The results mentioned here are based on private discussions after -holding multiple seminars and Webinars with RDA's support, and also a -workshop that was planned for non-astronomers. We even invited (funded) -early career researchers to come to the workshop with the RDA funding, -however, that workshop was cancelled due to the pandemic and we had private -communications after. +ANSWER: The results mentioned here are anecdotal: based on private +discussions after holding multiple seminars and Webinars with RDA's +support, and also a workshop that was planned for +non-astronomers. We invited (funded) early career researchers to +come to the workshop with the RDA funding. 
However, that workshop +was cancelled due to the pandemic and we had private communications +instead. We would very much like to elaborate on this experience of training new researchers with these tools. However, as with many of the cases above, the -very strict word-limit doesn't allow us to elaborate beyond what is already -there. +very strict word-limit doesn't allow us to elaborate beyond what we have +already written. ------------------------------ @@ -747,9 +772,14 @@ there. 40. [Reviewer 4] Potentially an interesting sidebar to investigate how LaTeX/TeX has ensured its longevity! -############################## -ANSWER: -############################## + +ANSWER: We agree that this would be interesting; an obvious link is +that LaTeX/TeX is very strongly based on plain text files, making user +hacking easy, provided that the user is willing to experiment and +search and read through the source files. However, as the reviewer states, +this would be a sidebar, and we are constrained for space. + + ------------------------------ @@ -760,9 +790,22 @@ ANSWER: 41. [Reviewer 4] The title is not specific enough - it should refer to the reproducibility of workflows/projects. -############################## -ANSWER: -############################## +ANSWER: A problem here is that "workflow" and "project" taken in +isolation risk being vague for wider audiences. Also, we aim at +covering a wider range of aspects of a project than just the +workflow alone; in the other direction, the word "project" could be +seen as too broad, including the funding, principal investigator, +and team coordination. + +A specific title that might be appropriate could be, for example, +"Towards long-term and archivable reproducibility of scientific +computational research projects". Using a term proposed by one of +our reviewers, "Towards long-term and archivable end-to-end +reproducibility of scientific computational research projects" +might also be appropriate. 
+ +Nevertheless, we feel that in the context of an article published in CiSE, +our current short title is sufficient. ------------------------------ @@ -773,12 +816,20 @@ ANSWER: 42. [Reviewer 4] Whilst the thesis stated is valid, it may not be useful to practitioners of computation science and engineering as it stands. -ANSWER: We would appreciate if you could clarify this point a little -more. We have shown how it has already been used in many research projects -(also outside of observational astronomy which is the first author's main -background). It is precisely defined for computational science and -engineering problems where _publication_ of the human-readable workflow -source is also important. +ANSWER: This point appears to refer to floating point bitwise reproducibility +and possibly to the conciseness of our paper. The former is fully allowed +for, as stated above, though not obligatory, using the "verify.mk" rule +file to (typically, but not obligatorily) force bitwise reproducibility. +The latter is constrained by the 6500-word limit. The addition of appendices +in the extended version may help respond to the latter point. + +The current small number of existing research projects using +Maneage, as indicated in the revised version of our paper, includes +papers outside of observational astronomy (which is the first +author's main background). The fact that the approach is precisely +defined for computational science and engineering problems where +_publication_ of the human-readable workflow source is also +important may partly respond to this issue. ------------------------------ @@ -788,7 +839,7 @@ source is also important. 43. [Reviewer 4] Longevity is not defined. -ANSWER: It has been defined now at the start of Section II. +ANSWER: This has been defined now at the start of Section II. ------------------------------ @@ -803,11 +854,12 @@ ANSWER: It has been defined now at the start of Section II. 
of years, but may not be suitable for the timescale of decades where porting and emulation are used. -ANSWER: Statements on quantifying their longevity have been added in -Section II. For example in the case of Docker images: "their longevity is -determined by the host kernel, usually a decade", for Python packages: -"Python installation with a usual longevity of a few years", for Nix/Guix: -"with considerably better longevity; same as supported CPU architectures." +ANSWER: Statements on quantifying the longevity of specific tools +have been added in Section II. For example in the case of Docker +images: "their longevity is determined by the host kernel, usually a +decade", for Python packages: "Python installation with a usual +longevity of a few years", for Nix/Guix: "with considerably better +longevity; same as supported CPU architectures." ------------------------------ @@ -820,9 +872,13 @@ determined by the host kernel, usually a decade", for Python packages: longevity of the workflows that can be produced using these tools? What happens if you use a combination of all four categories of tools? -########################## -ANSWER: -########################## + +ANSWER: We have changed the section title to "Longevity of existing tools" +to clarify that we refer to longevity of the tools. + +If the four categories of tools were combined, then the overall +longevity would be that of the shortest intersection of the time +spans over which the tools remained viable. ------------------------------ @@ -876,21 +932,32 @@ branch over commit '01dd812' and makes a couple of commits ('f69e1f4' and '716b56b'), and finally asks the project leader to merge them into the project. This can be generalized to any Git based collaboration model. 
+Recent experience by one of us [Roukema] found that a merge of a +Maneage-based cosmology simulation project (now zenodo.4062460), +after separate evolution of about 30-40 commits on maneage and +possibly 100 on the project, needed about one day of straightforward +effort, without any major difficulties. + ------------------------------ -48. [Reviewer 4] I would also liked to have seen a comparison between this - approach and other "executable" paper approaches e.g. Jupyter - notebooks, compared on completeness, time taken to write a "paper", - ease of depositing in a repository, and ease of use by another - researcher. +48. [Reviewer 4] I would also [have] liked to have seen a comparison + between this approach and other "executable" paper approaches + e.g. Jupyter notebooks, compared on completeness, time taken to + write a "paper", ease of depositing in a repository, and ease of + use by another researcher. -####################### -ANSWER: -####################### + +ANSWER: This type of sociological survey will make sense once the + number of projects run with Maneage is sufficiently high. The + time taken to write a paper should be measurable automatically, + from the git history. The other parameters suggested would + require cooperation from the scientists in responding to + the survey, or will have to be collected anecdotally in the + short term. ------------------------------ @@ -919,11 +986,12 @@ clarification to point 47 above, this should also become clear. 50. [Reviewer 5] Major figures currently working in this exact field do not have their work acknowledged in this work. -ANSWER: This was due to the strict word limit and the CiSE publication -policy (to not include a literature review because there is a limit of only -12 citations). But we had indeed done a comprehensive literature review and -the editors kindly agreed that we publish that review as appendices to the -main paper on arXiv and Zenodo. 
+ANSWER: This was due to the strict word limit and the CiSE +publication policy (to not include a literature review because there +is a limit of only 12 citations). But we had indeed already done a +comprehensive literature review and the editors kindly agreed that +we publish that review as appendices to the main paper on arXiv and +Zenodo. ------------------------------ @@ -931,12 +999,16 @@ main paper on arXiv and Zenodo. -51. [Reviewer 5] The popper convention: Making reproducible systems - evaluation practical ... and the later revision that uses GitHub - Actions, is largely the same as this work. +51. [Reviewer 5] Jimenez I et al ... 2017 "The popper convention: Making + reproducible systems evaluation practical ..." and the later + revision that uses GitHub Actions, is largely the same as this + work. ANSWER: This work and the proposed criteria are very different from -Popper. A review of Popper has been given in Appendix B. +Popper. We agree that VMs and containers are an important component +of this field, and the appendices add depth to our discussion of this. +However, these do not appear to satisfy all our proposed criteria. +A detailed review of Popper, in particular, is given in Appendix B. ------------------------------ @@ -949,33 +1021,37 @@ Popper. A review of Popper has been given in Appendix B. generic OS version label for a VM or container, these are some of the most promising tools for offering true reproducibility. -ANSWER: Containers and VMs have been more thoroughly discussed in the main -body and also extensively discussed in appendix A (that are now available -in the arXiv and Zenodo versions of this paper). As discussed (with many -cited examples), Contains and VMs are only good when they are themselves -reproducible (for example running the Dockerfile this year and next year -gives the same internal environment). 
However we show that this is not the -case in most solutions (a more comprehensive review would require its own +ANSWER: Containers and VMs have been more thoroughly discussed in +the main body and also extensively discussed in appendix A (that are +now available in the arXiv and Zenodo versions of this paper). As +discussed (with many cited examples), Containers and VMs are only +appropriate when they are themselves reproducible (for example, if +running the Dockerfile this year and next year gives the same +internal environment). However, we show that this is not the case in +most solutions (a more comprehensive review would require its own paper). -However with complete/robust environment builders like Maneage, Nix or GNU +Moreover, with complete, robust environment builders like Maneage, Nix or GNU Guix, the analysis environment within a container can be exactly reproduced later. But even so, due to their binary nature and large storage volume, they are not trusable sources for the long term (it is expensive to archive them). We show several example in the paper of how projects that relied on VMs in 2011 and 2014 are no longer active, and how even Dockerhub will be deleting containers that are not used for more than 6 months in free -accounts (due to the large storage costs). - -Furthermore, As a unique new feature, Maneage has the criterion of "Minimal -complexity". This means that even if for any reason the project is not able -to be run in the future, the content, analysis scripts, etc. are accesible -for the interested reader (because it is in plain text). Unlike Nix or Guix -it also doesn't have a third-party package package manager: the -instructions of building all the software of a project are directly in the -same project as the high-level analysis software. So, it is transparent in -any case and the interested reader can follow the analysis and study the -different decissions of each step (why and how the analysis was done). 
+accounts (due to the high storage costs). + +Furthermore, as a unique new feature, Maneage has the criterion of +"Minimal complexity". This means that even if for any reason the +project is not able to be run in the future, the content, analysis +scripts, etc. are accessible for the interested reader since they +are stored as plain text (only the development history - the git +history - is stored in git's binary format). Unlike Nix or Guix, +our approach doesn't need a third-party package manager: the +instructions for building all the software of a project are directly +in the same project as the high-level analysis software. The full +end-to-end process is transparent in our case, and the interested +scientist can follow the analysis and study the different decisions +of each step (why and how the analysis was done). ------------------------------ @@ -993,16 +1069,16 @@ different decissions of each step (why and how the analysis was done). additional work highly relevant to this paper. ANSWER: Thank you for the interesting paper by Lofstead+2019 on Data -pallets. We have cited it in Appendix A as examples of how generic the +pallets. We have cited it in Appendix A as an example of how generic the concept of containers is. The topic of linking data to analysis is also a core result of the criteria -presented here, and is also discussed shortly in the paper. There are +presented here, and is also discussed briefly in our paper. There are indeed many very interesting works on this topic. But the format of CiSE is -very short (a maximum of ~6000 words with 12 references), so we don't have +very short (a maximum of ~6500 words with 12 references), so we don't have the space to go into this any further. 
But this is indeed a very -interesting aspect for follow up studies, especially as the usage of -Maneage incrases, and we have more example workflows by users to study the +interesting aspect for follow-up studies, especially as usage of +Maneage grows, and we have more example workflows by users to study the linkage of data analysis. ------------------------------ @@ -1039,18 +1115,8 @@ Appendix. -56. [Reviewer 5] Offers criteria any system that offers reproducibility - should have. - -ANSWER: - ------------------------------- - - - - -57. [Reviewer 5] Yet another example of a reproducible workflows project. +56. [Reviewer 5] Yet another example of a reproducible workflows project. ANSWER: As the newly added thorough comparisons with existing systems shows, these set of criteria and the proof-of-concept offer uniquely new @@ -1070,12 +1136,12 @@ is new. -58. [Reviewer 5] There are numerous examples, mostly domain specific, and +57. [Reviewer 5] There are numerous examples, mostly domain specific, and this one is not the most advanced general solution. ANSWER: As the comparisons in the appendices and clarifications above show, there are many features in the proposed criteria and proof of concept that -are new. +are new and not satisfied by the domain-specific solutions known to us. ------------------------------ @@ -1083,7 +1149,7 @@ are new. -59. [Reviewer 5] Lack of context in the field missing very relevant work +58. [Reviewer 5] Lack of context in the field missing very relevant work that eliminates much, if not all, of the novelty of this work. ANSWER: The newly added appendices thoroughly describe the context and -- cgit v1.2.1