paper-concept.git - Paper (Towards Long-term and Archivable Reproducibility)

Age	Commit message	Author	Lines
2022-05-09	Imported recent updates in Maneage, conflicts fixed	Mohammad Akhlaghi	-3/+3
	Until now, Maneage had undergone some updates. With this commit, those updates have been imported and the conflicts that resulted were fixed. They were all cosmetic and had no effect on the analysis. The most significant one was about the change in the format of 'INPUTS.conf'. In the process, I also noticed that the IEEEtran LaTeX package is now called 'ieeetran' (the 'tlmgr' of TeXLive 2022 was failing).
2021-06-15	Further copyediting on the appendices	Boud Roukema	-1/+1
	In the discussion on criteria that Popper lacks, the last-mentioned criterion, "including the narrative", is written in a way that can confuse readers into thinking that only a single criterion is lacking. Hyphenating ('including-the-narrative') has been applied to make the sentence less likely to be misunderstood. The ending of the first paragraph in the "Generational gaps" item in Appendix A.G ("... every few years is not practically possible.") sounds like "not almost possible", so it can cause confusion. Endings that are much clearer include: * is impractical. * is not possible in practice. * is not practical. * is not possible practically. [meaning 2. is less likely in this case] I've selected the first option, also replacing "they" by "scientists" to avoid the misinterpretation that "programming languages ... have their own science field to focus on". This commit and the previous one were "amended" by Mohammad (compared to the original commits that Boud had sent).
2021-06-14	Copyediting of appendices	Boud Roukema	-23/+23
	This commit makes several small copyediting fixes in the body of the appendices which should improve their readability.
2021-06-13	Add GHTorrent, some https, notabug	Boud Roukema	-1/+1
	This commit adds a few sentences in relation to the first known attempt to store and make available git repository hosting ephemera (GHTorrent, introduced to us by Roberto Di Cosmo). Since one of the two sponsors of GHTorrent is Microsoft, both the ethics and practical aspects of this in the context of reproducibility and scientific ethics as expressed by the international scientific community are rather unclear, so a link to one of the well-known lists of practical and ethical issues with Github is included. A minor fix is made in 'tex/src/appendix-existing-solutions.tex', since the word 'data' is plural (singular is 'datum').
2021-06-08	Minor edits and updated first-page Software Heritage ID	Mohammad Akhlaghi	-1/+1
	After going through Boud's corrections and edits in the previous commit, I thought some minor clarifications would be necessary, and they are implemented in this commit. Also, in preparation for submission to the journal, the top-level Software Heritage ID has been corrected to the latest commit on Software Heritage.
2021-06-08	Several minor edits, removed exact value of arXiv's size-limit	Boud Roukema	-8/+8
	This commit makes several copyediting changes to the appendices and to the supplement.tex introduction to the appendices. The unofficially increased arXiv upload limit of 50 Mb comes from a tweet: https://nitter.fdn.fr/arxiv/status/1286381643893268483 (archive: https://archive.today/PdxhT), but it is not listed on official arXiv pages. So it seems safer not to quote a value. The very old value was 0.5 Mb, out of respect to people with low bandwidth, especially scientists in poor countries. Tweets are generally not acceptable as "reliable sources" in en.Wikipedia.
2021-06-07	Clarifications added to ReproZip in the appendix	Mohammad Akhlaghi	-5/+12
	After Boud posted a notice about Maneage in an online forum [1], Rémi Rampin and Vicky Rampin (from the ReproZip project) replied with some notes about our review of ReproZip in Appendix B. We are very grateful to both Rémi and Vicky for looking into it and for their comments; their contribution has been acknowledged with this commit. The relevant comments are listed below and have been addressed in this commit (see the 'diff' of this commit). - [Rémi Rampin] ReproZip can capture the build step if you want it to, it's just another command. So if you want to trace "make" and "pip install" etc before tracing your actual experiment, you will have all that build information. - [Rémi Rampin] Bundle size is easily fixed by not putting terabyte-sized data in the bundle, which is done by editing a simple configuration file. - [Vicky Rampin] Not all the files in the bundle are compiled/binary files [in relation to the old sentence "ReproZip just copies the binary/compiled files used in a project"]. [1] https://framapiaf.org/@boud/106296894758145705
2021-05-26	ReproZip, Popper: minor fixes	Boud Roukema	-4/+4
	This commit contains minor fixes in Appendix B. ReproZip: As Vicky Rampin points out [1], ReproZip typically also includes non-binary files, so I removed "just" and improved the wording. Popper: the Popper URL that we gave is obsolete; at Wayback Machine it redirects to getpopper.io [2], so I've updated this; and I've fixed up the wording ('off of' only exists in US English). [1] https://octodon.social/@VickyRampin/106298214313216228 [2] https://web.archive.org/web/20210425223605/http://falsifiable.us/
2021-04-09	Implemented EiC (Lorena Barba) comments, and added final review	Mohammad Akhlaghi	-54/+54
	The email notice of the final acceptance of this paper in CiSE has been included in the project and the stylistic points that were raised by the editor in chief (EiC) have also been implemented. The most important points were: - Citations should be included within the text structure (as if they were footnotes), so things like "see \cite{...}" had to be changed. - Hyperlinks should be printed as footnotes (because the journal actually gets printed). Also, to avoid the second listing breaking between pages, it has been moved to after the next paragraph.
2021-04-09	Minor corrections on previous copyedit	Mohammad Akhlaghi	-1/+1
	Being immutable doesn't necessarily mean that something is always present, so an "always present" was also added for the reason we recommend a Git hash. The end of the sentence was also slightly summarized to allow the extra few words. The re-wording of the conclusion of Active papers was great! I just changed the "likely" to "possible", because as Konrad mentioned in Commit a63900bc5a8, he is now using Guix.
2021-04-09	Minor copyedits	Boud Roukema	-1/+1
	These are minor last minute copyedits for recently added text, e.g. a git hash is not literally a timestamp.
2021-04-09	Comments by IAA's AMIGA team implemented	Mohammad Akhlaghi	-6/+8
	The AMIGA team at the Instituto de Astrofísica de Andalucía (IAA) are very active proponents of reproducibility. They had already provided very constructive comments after my visit there and many subsequent interactions. So until now, the whole team's contributions were acknowledged. Since the last submission, several of the team members were able to kindly invest the time in reading the paper and providing very useful comments which are now being implemented. As a result, I was able to specifically thank them in the paper's acknowledgments (Thanks a lot AMIGA!). Below, I am listing the points in the order that is shown in 'git log -p -1' for this commit. - Javier Moldón: "PM is not defined. First appearance in the first page". Thanks for noticing this Javier, it has been corrected. - Javier Moldón: "In Section III. PROPOSED CRITERIA FOR LONGEVITY and Appendix B, you mention the FAIR principles as desirable properties of research projects and solutions, respectively which is good, but may bring confusion. Although they are general enough, FAIR principles are specifically for scientific data, not scientific software. Currently, there is an initiative promoted by the Research Data Alliance (RDA), among others, to create FAIR principles adapted to research software, and it is called FAIR4RS (FAIR for Research Software). More information here: https://www.rd-alliance.org/groups/fair-4-research-software-fair4rs-wg. In 2020 there was a kick-off meeting to divide the work in 4 WG. There is some more information in this talk: https://sorse.github.io/programme/workshops/event-016/. I have been following the work of WG1, and they are about to finish the first document describing how to adapt the FAIR principles to software. Even if all this is still work in progress, I think the paper would benefit from mentioning the existence of this effort and noticing the differences between Data and Software FAIR definitions."
Thanks for highlighting this Javier, a footnote has been added for this (hopefully faithfully summarizing it into one sentence due to space limitations). - Sebastian Luna Valero: "Would it be a good idea to define long-term as a period of time; for example, 5 years is a lot in the field of computer science (i.e. in terms of hardware and software aging), but maybe that is not the case in other domains (e.g. Astronomy)." Thanks Sebastian, in section 2, we do give longevity of the various "tools" in rough units of years (this was also a suggestion by a referee). But of course the discussion there is very generic, so going into finer detail would probably be too subjective and bore the reader. - Sebastian Luna Valero: "Why do you use git commit eeff5de instead of git tags or releases for Maneage? Shown for example in the abstract of the paper: 'This paper is itself written with Maneage (project commit eeff5de).'" Thanks for raising this important point; a sentence has been added to explain why hashes are objective and immutable for a given history, while tags can easily be removed or changed, or not cloned/pushed at all. - Susana Sanchez Exposito: "We think interoperability with other research projects would be important, do you have any plans to make maneage interoperable with, for example, the Common Workflow Language (CWL)?". Thanks a lot for raising this point Susana. Indeed, in the future I really do hope we can invest enough resources in this. In the discussion, I had already touched upon research objects as one method for interoperability, and there was also a discussion on such generic standards in Appendix A.D.10. But to further clarify this point (given its importance), I mentioned CWL (and also the even more generic CWFR) in the discussion. - Sebastian Luna Valero: "Regarding Apache Taverna, please see:" https://github.com/apache/incubator-taverna-engine/blob/master/README.md Thanks a lot for this note Sebastian! I didn't know this!
I wrote this section (and visited their webpage) before their "vote"! It was a surprise to see that their page had changed. I have modified the explanation of Taverna to mention that it has been "retired" and use the GitHub link instead. - Sebastian Luna Valero: "Page 21: 'logevity' should be 'longevity'." Thanks a lot for noticing this! It has been corrected :-). - Javier Moldón: "There is a nice diagram in Johannes Köster's article on data processing with snakemake that I find very interesting to show some key aspects of data workflows: see Fig 1 in https://www.authorea.com/users/165354/articles/441233-sustainable-data-analysis-with-snakemake " This is indeed a nice diagram! I tried to cite it, but as of today, this link is not a complete paper (with no abstract and many empty section titles). If it was complete, I would certainly have cited it in Snakemake's discussion. - Javier Moldón: "Regarding the problem mentioned in the introduction about PM not precisely identified all software versions, I would like to mention that with Snakemake, even if the analysis are usually constructed using other package managers such as conda, or containers, you don't need to depend on online servers or poorly-documented software versions, as you can now encapsulate an analysis in a tarball containing all the software needed. You still have long-term dependency problems (as you will need to install snakemake itself, and a particular OS), but at least you can keep the exact software versions for a particular platform." Thanks for highlighting this Javier. This is indeed better than nothing, but we have already discussed the dangers of this "black box" approach of archiving binaries in many contexts, and many package managers have it. So while I really appreciate the point (I didn't know this), to avoid lengthening the paper, I think it's fine to not mention it in the paper.
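The hash-versus-tag point raised above can be illustrated with a minimal sketch. It assumes only what Git itself documents: an object ID is the SHA-1 of a short header ("<type> <size>" plus a NUL byte) followed by the object's content. For simplicity the sketch hashes a 'blob' (file content) rather than a full commit object, and the file contents and the 'tags' dictionary are hypothetical stand-ins, not anything taken from Maneage.

```python
import hashlib


def git_object_id(content: bytes, kind: str = "blob") -> str:
    """Return the SHA-1 ID Git assigns to `content` for the given object type."""
    header = f"{kind} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents at two points in a project's history.
old = b"analysis: version 1\n"
new = b"analysis: version 2\n"

# A tag is only a name that points at an ID; it can be re-pointed or deleted.
tags = {"v1.0": git_object_id(old)}
tags["v1.0"] = git_object_id(new)  # same tag name, now a different object

# The ID cannot drift in the same way: it is recomputed from the bytes.
assert git_object_id(old) == git_object_id(old)
assert git_object_id(old) != git_object_id(new)
```

The same construction applies to commit objects, whose content also names the tree and parent commits, which is why a commit hash pins a whole history while a tag name alone does not.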
2021-04-09	Comments by Konrad Hinsen implemented	Mohammad Akhlaghi	-14/+12
	Konrad had kindly gone through the paper and the appendices with very good feedback that is now being addressed in the paper (thanks a lot Konrad!): - IPOL recently also allows Python code. So the respective parts of the description of IPOL have been updated. To address the dependency issue, I also added a sentence that only certain dependencies (with certain versions) are acceptable. - On Active Papers (AP: which is written by Konrad) corrections were made based on the following parts of his comments: - "The fundamental issue with ActivePapers is its platform dependence on either Java or Python, neither of which is attractive." - "The one point which is overemphasized, in my opinion, is the necessity to download large data files if some analysis script refers to it. That is true in the current implementation (which I consider a research prototype), but not a fundamental feature of the approach. Implementing an on-demand download strategy is not particularly complicated, it just needs to be done, and it wasn't a priority for my own use cases." - "A historical anecdote: you mention that HDF View requires registering for download. This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5. What happened is that the HDF Group, formerly part of NCSA and thus a public research infrastructure, was turned into a semi-commercial entity. They have committed to keeping the core HDF5 library Open Source, but not any of the tooling around it. Many users have moved away from HDF5 as a consequence. The larger lesson is that Richard Stallman was right: if software isn't GPLed, then you never know what will happen to it in the future." - On Guix, some further clarification was added to address Konrad's quote below (with a link to the blog-post mentioned there). In short, I clarified that I mean storing the Guix commit hash with any respective high-level analysis change is the extra step. 
- "I also looked at the discussion of Nix and Guix, which is what I am mainly using today. It is mostly correct as well, the one exception being the claim that 'it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future'. The environment is recorded in all detail, automatically. What requires some effort is extracting a human-readable description of that environment. For Guix, I have described how to do this in a blog post (https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix/), and in less detail in a recent CiSE paper (https://hal.archives-ouvertes.fr/hal-02877319). There should definitely be a better user interface for this, but it's no more than a user interface issue. What is pretty nice in Guix by now is the user interface for re-creating an environment, using the "guix time-machine" subcommand." - The sentence on Software Heritage being based on Git was reworded to fit this comment of Konrad: "The plural sounds quite optimistic. As far as I know, SWH is the only archive of its kind, and in view of the enormous resources and long-time commitments it requires, I don't expect to see a second one." - When introducing hashes, Konrad suggested the following useful paper that shows how they are used in content-based storage: DOI:10.1109/MCSE.2019.2949441 - On Snakemake, Konrad had the following comment: "[A system call in Python is] No slower than from bash, or even from any C code. Meaning no slower than Make. It's the creation of a new process that takes most of the time." So the point was just shifted to the many quotations necessary for calling external programs and how it is best suited for a Python-based project. In addition some minor typos that I found during the process are also fixed.
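Konrad's remark that process creation dominates the cost of calling an external program can be checked with a rough timing sketch. This is an illustration under stated assumptions, not a benchmark from the paper: the helper names are hypothetical, and the child process here is another Python interpreter, so the measurement includes interpreter start-up as well as the operating system's process creation.

```python
import subprocess
import sys
import time


def time_call(func, repeats: int = 5) -> float:
    """Return the mean wall-clock seconds per call of `func`."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


def in_process() -> None:
    # Trivial work done inside the current interpreter.
    sum(range(100))


def new_process() -> None:
    # The same trivial work, but in a freshly created child process.
    subprocess.run([sys.executable, "-c", "sum(range(100))"], check=True)


if __name__ == "__main__":
    print(f"in-process : {time_call(in_process):.6f} s per call")
    print(f"new process: {time_call(new_process):.6f} s per call")
```

Absolute numbers depend on the machine, so none are quoted here; the only expectation is that the second figure is much larger than the first, whichever language launches the child.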
2021-01-07	Removed all \new highlights after submission of review	Mohammad Akhlaghi	-1/+1
	With the submission of the revision (which highlighted all the relevant parts to the points the referees raised in the submitted PDF) it is no longer necessary to highlight these parts. If we get another revision request, we can add new '\new' parts for highlighting.
2021-01-07	Minor copyedits in appendices, e.g. parentheses	Boud Roukema	-3/+4
	This commit makes some minor fixes following the hardwired non-numerical solution to the cross-referencing issue between the main article and the supplement, such as fixing "lineage like lineage" and missing closing parentheses. From Mohammad: while re-basing the commit over the 'master' branch, I also added Boud's name at the top of the copyright holders of the appendices.
2021-01-05	appendix.bbl is now included in make dist tarball	Mohammad Akhlaghi	-1/+1
	Since the addition of the appendix bibliography we hadn't checked the 'make dist' command; as a result, the PDF couldn't be built. With this commit, in the 'dist' rule, we are now also copying 'appendix.bbl' and the created tarball could build the PDF properly. The 'peer-review' directory is now also included in the tarball created by './project make dist'. I also found a small typo in the description of Occam (an 'a' was missing) and fixed it.
2021-01-05	Polished main paper and appendices after a full re-read	Mohammad Akhlaghi	-60/+82
	In preparation for the submission of the revised manuscript, I went through the full paper and appendices one last time. The second appendix (reviewing existing reproducible solutions) in particular needed some attention because some of the tools weren't properly compared with the criteria. In the paper, I was also able to remove about 30 words, and bring our own count (which is an over-estimation already) to below 6250.
2021-01-04	Edits on points raised by Raul	Mohammad Akhlaghi	-10/+12
	After his previous two commits, we discussed some of the points and I am making these edits following those. In particular the last statement about Madagascar "could have been more useful..." was changed to simply mention that mixing workflow with analysis is against the modularity principle. We should not judge its usefulness to the community (which is beyond our scope and would need an official survey). A few other minor edits were done here and there to clarify some of the points.
2021-01-04	Minor corrections to the existing solutions appendix	Raul Infante-Sainz	-87/+104
	With this commit, I have corrected some minor typos of this appendix. In addition to that, I also put empty lines to separate subsections and subsubsections appropriately.
2021-01-03	Spell check on main body and appendices	Mohammad Akhlaghi	-25/+25
	I ran a simple Emacs spell check over the main body and the two appendices. All discovered typos have been fixed.
2021-01-03	No links to main body in the appendices in --supplement mode	Mohammad Akhlaghi	-6/+31
	Until now, in the appendices we were simply using '\ref' to refer to different parts of the published paper. However, when built in '--supplement' mode, the main body of the paper is a separate PDF and having links to a separate PDF is not impossible, but far too complicated. However, having the links adds to the richness of the text and helps point readers to specific parts of the paper. With this commit, there is a LaTeX conditional anywhere in the appendices that we want to refer the reader to sections/figures in the main body. When building a separate PDF, the respective section/figure is cited in a descriptive mode (like "Section discussing longevity of tools"). However, when the appendices go into the same PDF as the main body, the '\ref's remain.
2021-01-02	Supplement (containing appendices) optionally built separately	Mohammad Akhlaghi	-0/+15
	Until now, the build strategy of the paper was to have a single output PDF that contains either (1) the full paper with the appendices in the same PDF, or (2) only the main body of the paper with no appendices. But the editor in chief of CiSE recently recommended publishing the appendices as a supplement, that is, as a separate PDF (on its webpage). So with this commit, the project can make either (1) a single PDF (containing both the main body and the appendices) that will be published on arXiv and will be the default output (this is the same as before), or (2) two PDFs: one that is only the main body of the paper and another that is only the appendices. Since the appendices will be printed as a PDF in any case now, the old '--no-appendix' option has been replaced by '--supplement'. Also, the internal shell/TeX variable 'noappendix' has been renamed to 'separatesupplement'.
2020-12-30	Each appendix moved to a separate .tex file	Mohammad Akhlaghi	-0/+443
	As recommended by Lorena Barba (editor in chief of CiSE), we should prepare the appendices as a separate "Supplement" for the journal. But we also want them to be appendices within the paper when built for arXiv. As a first step, with this commit, each appendix has been put in a separate 'tex/src/appendix-*.tex' file and '\input' into the paper. We will then be able to conditionally include them in the PDF or not. Also, as recommended by Lorena, the general "necessity for reproducible research" appendix isn't included (possibly going into the webpage later).