paper-concept.git - Paper (Towards Long-term and Archivable Reproducibility)

Age	Commit message	Author	Lines
2020-05-30	Corrected a few words to make POSIX-fuzziness paragraph more clear	Mohammad Akhlaghi	-2/+2
	Hopefully, it is more to the point with these few word-corrections.
2020-05-30	Discussion on issues with POSIX and minor edits to shorten paper	Mohammad Akhlaghi	-39/+47
	Konrad raised some very interesting points, in particular about the limitations of POSIX as a fuzzy standard that does not guarantee reproducibility. A relatively long paragraph was thus added in the discussion to address this important point. In order to fit it in, the paragraph on "unwanted competition" was removed since the POSIX issue was much more relevant for a curious reader. Throughout the text, some other parts were edited to decrease the length of the paper while making it easier to read.
2020-05-30	Minor edits removing redundant sentences	Mohammad Akhlaghi	-4/+3
	Some of the redundant sentences have been removed and some minor edits made.
2020-05-29	Minor tidying of about half a dozen words	Boud Roukema	-11/+11
	The changes in this commit are best shown with `git diff --word-diff` or `git log -p --word-diff`. There are about half a dozen changes of 1-2 words or a comma; the reasons should be obvious. The sentence with "can not just" seems to be correct formally, but "can not only" seems to me better to warn the reader that this is a phrase of the form "can not only do X but can also do Y"; "can not just" sounds a bit like "You cannot just enter the room without knocking" - it doesn't require a second part.
2020-05-29	Edits to the text, making it slightly shorter and more clear	Mohammad Akhlaghi	-62/+54
	One major point was that following Konrad's suggestion the issue of not being familiar with the Lisp/Scheme framework of GWL is now removed. We actually mention the main problem we have had with Guix, but also highlight that their solution was one of the main inspirations for this work.
2020-05-29	Adding small paragraph for Raul's biography	Raul Infante-Sainz	-1/+4
	Until this commit, there was only a small description of me. With this commit, I have added a small paragraph with my biography. I know we are very restricted because of the word limit so I tried to be very short!
2020-05-29	Minor typos corrected	Raul Infante-Sainz	-5/+6
	With this commit, I have corrected several minor typos.
2020-05-29	Minor corrections in abstract and introduction	Raul Infante-Sainz	-9/+9
	With this commit, I made some minor changes in these Sections. Main changes are: define the contraction `OS' from Operating System and use only `OS' later on, and not use contractions like `isn't'
2020-05-29	Cut down biography and inclusion of a mention to reproducibility	Roberto Baena-Gallé	-5/+5
	Before this commit: Roberto's bio was about 120 words. With this commit: it is now less than 100 words. A comment about reproducibility has been added.
2020-05-29	Reproducible research based on open-access papers	Boud Roukema	-1/+1
	Publishing a paper on reproducible research without making it easy for readers to read the references would defeat the point. Of course we have to make some compromises with some journals' reluctance to shift towards the free world, but to satisfy scientific ethics, we should at least provide clickable URLs to the references, preferably to the ArXiv version if available [1], and also to the DOI, again, preferably to an open-access version of the URL if available. I was not able to fully get this done in the .bst file, so there's an sed/tr hack done to the .bbl file in `reproduce/analysis/make/paper.mk` to tidy up commas and spaces. This commit also reverts some of the hacks in the Akhlaghi IAU Symposium `tex/src/references.tex` entry, to match the improved .bst file, `tex/src/IEEEtran_openaccess.bst`, provided here with a different name to the original, in order to satisfy the LaTeX licence. [1] https://cosmo.torun.pl/blog/arXiv_refs
2020-05-29	pdftotext only called if present in system, minor edit	Mohammad Akhlaghi	-2/+2
	David and Raul had both reported that because 'pdftotext' wasn't available on their system, the project failed (even though the PDF was built!). So with this commit, we first check if the system has 'pdftotext' and call it only if it is available. Some minor edits were made, building upon Boud's previous commit.
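The check described in this commit can be sketched roughly as follows. This is only an illustration, not the code actually added to the project: the helper name `run_if_available` and the file names in the usage comment are assumptions, and the real rule in the Makefiles may look quite different.

```shell
#!/bin/sh
# Hypothetical helper (not from the project): run a program only when it
# is installed, so a missing optional tool does not fail the whole build.
run_if_available() {
    prog="$1"; shift
    if command -v "$prog" >/dev/null 2>&1; then
        # The program was found in PATH: call it with the remaining arguments.
        "$prog" "$@"
    else
        # Not installed: report on standard error and carry on.
        echo "$prog not found; skipping" >&2
    fi
}

# Usage (file names are illustrative):
#   run_if_available pdftotext paper.pdf paper.txt
```

`command -v` is specified by POSIX, so the test should behave the same way in any POSIX-compatible shell.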
2020-05-29	Section V - small changes	Boud Roukema	-25/+26
	This commit provides mostly small changes. There didn't seem much point in repeating the `lessons learned` jargon and claiming that we draw good conclusions - insights - from our experience. Better just state what hypotheses we have generated from the experience rather than give the misleading impression that our hypotheses are well-established facts. In the comments, I put a suggested translation of what the `lessons learned` jargon means. I seem to have first heard this term in the mainstream media a few years after the US 2003 attack on Iraq, when a US military representative stated that the US forces had "learned lessons" after having started a war of aggression against Iraq.
2020-05-29	Sentence with the clerk who can do it, software as uncountable noun	Boud Roukema	-2/+2
	This commit changes two lines. (1) Keeping the exact quote with the clerk while having a sentence that makes sense in plain English cannot be done, it seems to me, without making the sentence a bit longer. Here's one option that seems about the best we can do, even though it still sounds a bit funny, because it's hard to write a future conditional with the present "can". Since it's a quote, it will probably survive the proofreaders. (2) Software is an uncountable noun [1], so we say "software is", like "water is"; "used software" sounds odd; I added "is itself" to emphasise that we're especially talking about the full chain of software for running the project. This commit modifies the "When the ..." sentence and hopefully sounds better. [1] https://en.wiktionary.org/wiki/software#Noun
2020-05-29	Added top-make.mk as a listing for demonstration, minor edits	Mohammad Akhlaghi	-31/+53
	To help show the simplicity of 'top-make.mk', it was included as a listing. I also went over some of Boud's corrections and made small edits. In particular:
	- The '\label' and '\ref' to a section were removed. I did this after inspecting some of their recent papers and noticing that they generally have a simple flow, without such redirections.
	- In the part about the RDA adoption grant, I moved the "from the researcher perspective" to the end. Because Austin+2017 is mainly focused on data-center management, not the researcher's. They do touch upon researcher solutions that can help data-base managers, but not directly the researchers. In effect with this grant, they acknowledged that our researcher-focused solution conforms to their criteria for data-base management.
2020-05-29	Many small changes to Section IV - proof of concept: maneage	Boud Roukema	-60/+63
	Possibly the least trivial edit in this commit is that the previous text appeared to state that it's normal to find that a project prepared with `maneage` may be ... unbuildable. Which would defeat our whole claim of reproducibility! Obviously, `maneage` is still in a rapid development stage and might still have significant, not-yet-detected bugs. But the wording has to explain that this would constitute a bug in `Maneage` (in a particular version of it), not an expected regular event. :) This commit aims to fix that and other minor wording issues in IV. Pdf word count 5855.
2020-05-28	Cherry-pick 7bf5fcd to make merging easier	Boud Roukema	-6/+3
	This series of commits aims to edit sections II+III, but first implements the changes from 7bf5fcd, apart from one that conflicts in the abstract: this commit has ``Maneage'' without `(managing+lineage)` in the abstract. From Mohammad: this commit has been rebased after several other parallel branches, so some things may differ from the message.
2020-05-23	Some minor edits on Boud's recent corrections	Mohammad Akhlaghi	-12/+12
	Generally they were great, but after looking through them I thought a handful of them slightly changed my original idea so I am correcting them here. Boud, if you feel the changes aren't good, let's talk about it and find the best way forward ;-). They are mostly clear from a '--word-diff', just some notes on the ones that have changed the meaning:
	* On the "a clerk can do it" quotation, since it's so short, I think it's better to keep its original form, otherwise a reader may think there were paragraphs instead of the "to" and we have changed their intention.
	* In the part where we are saying that the workflow can get "separated" from the paper, I mostly meant to highlight that the data-centers and journals (hosts) may diverge in decades, or one of them may go bankrupt, etc. Hence losing the connection. The issue of it evolving can in theory be addressed through version control, so I think this is a more fundamental problem.
	* In the part about free software, in the list, the original point was the free software that is used by the project, not the project itself (after all, the project itself falls under the "Open Science" title that is very fashionable these days, but my point here is to those people who claim to do "Open Science" with closed software, like Microsoft Excel!).
2020-05-23	Section III edits - 5901 words	Boud Roukema	-43/+43
	This commit makes several small changes to Section III, some of which are quite significant in terms of meaning. It was difficult to improve the clarity without extending the word length. Now we're at 5901 words.
2020-05-23	Section II edits + definition of solutions	Boud Roukema	-23/+23
	This commit implements quite a few minor changes in section II. The aim of most is to clarify the meaning and remove ambiguity. A few changes rely on the fact that the reader will normally assume that successive sentences in a paragraph are closely related in terms of logical flow. It is superfluous - and considered excessive - to put too many "Therefore"'s and "Hence"'s in (at least) modern astronomy style. These are supposed to be used when there is a strong chain of reasoning. One change is done in the Introduction, because if we're going to use "solution(s)" throughout to mean "reproducible workflow solution(s)", then we have to clearly define this as jargon for this particular paper. It's probably preferable to RWS - reproducible workflow solution - or RWI - reproducible workflow implementation. But we can't just keep saying "solution" because that has many different meanings in a scientific context. Pdf word count = 5880
2020-05-23	Cherry-pick 7bf5fcd to make merging easier	Boud Roukema	-1/+1
	This series of commits aims to edit sections II+III, but first implements the changes from 7bf5fcd, apart from one that conflicts in the abstract: this commit has ``Maneage'' without `(managing+lineage)` in the abstract.
2020-05-23	Main text: implement most of David's changes	Boud Roukema	-31/+33
	This commit implements most of David's changes from c76727b, but excluding some, such as the proposal to use 'which' in a restrictive clause in the abstract. This is allowed, but the Fowler brothers' rule tends to be followed in science writing: https://ell.stackexchange.com/questions/5/is-there-any-difference-between-which-and-that A few points on the abstract: * an immediate solution = singular * The "immediate, fast short-term" benefits sentence sounded like it was redundantly superfluously repetitively repeating doubled-up information. Hopefully this edit is better. * in the %Conclusion, "solutions" is vague, like people who say "technology" when they're only talking about software, so this edit reminds the reader to make the sentence more self-contained and understandable.
2020-05-23	Biography style reverted to CiSE PDF mode (different from webpage)	Mohammad Akhlaghi	-18/+8
	After a look at the PDFs of the linked papers of the previous commit and a few 2020 papers, we noticed that the biography format of the webpage and PDFs are different! So it is now back in its old way (which is how biographies are presented in the PDF). A few other minor edits were made in the text.
2020-05-23	Affiliations CiSE style	Boud Roukema	-6/+19
	It appears from looking at https://ieeexplore.ieee.org/document/5725236/authors#authors https://ieeexplore.ieee.org/document/7878935/authors#authors that the affiliations section needs to start with a one-phrase definition of the author's main affiliation. In 5725236, the typesetters/proofreaders swapped van der Walt and Colbert, so don't be confused by that. It shows that nobody proofread properly. With this commit, each author's institute (single hierarchical level) is written as the first paragraph of the author's affiliation section. Since 5725236 allows a very-well-known acronym, I'm guessing that IAC can be defined for Mohammad and then re-used for the others. I've added a brief CV for me. If necessary, we could compress my main research together as "observational cosmology", but let's see how we go in the word count. I have not (yet) worked through the main text. There is also one minor language fix - `Because is complete` was incomplete. Pdf word count: 5873
2020-05-23	Edits, to make the text more readable	Mohammad Akhlaghi	-84/+81
	After one day not looking at the first draft of this new version (commit 7b008dfbb9b2), I went through the text and made some general edits to make its presentation and logic smoother.
2020-05-23	Typo and style corrections in the text, Roberto's bio added	Roberto Baena-Gallé	-37/+41
	Before this commit: several typos were present throughout the text. With this commit several typos have been corrected (types listed below) and my bio has been added.
	a) double words
	b) general typos
	c) commas after adverbs at the beginning of a sentence
	d) contractions are removed, e.g., don't vs do not
	e) three sentences in parentheses have been removed since I think they were out of context or unnecessary
	f) etc
2020-05-22	Corrected copyright notices to fit GPL suggested format	Mohammad Akhlaghi	-7/+10
	In time, some of the copyright license descriptions had been mistakenly shortened to two paragraphs instead of the original three that are recommended in the GPL. With this commit, they are corrected to be exactly in the same three-paragraph format suggested by GPL. The following files also didn't have a copyright notice, so one was added for them:
	reproduce/software/make/README.md
	reproduce/software/bibtex/healpix.tex
	reproduce/analysis/config/delete-me-num.conf
	reproduce/analysis/config/verify-outputs.conf
2020-05-22	Re-write of the paper to fit in ~6000 words and IEEE format	Mohammad Akhlaghi	-121/+335
	Following the fact that the DSJ editor decided that this paper doesn't fit into their scope, we decided to submit it to IEEE's Computing in Science and Engineering (CiSE). So with this commit the text was re-written to fit into their style and word-count limitations.
2020-05-02	First implementation of style in IEEEtran style	Mohammad Akhlaghi	-520/+119
	The paper is no longer using LuaLaTeX, but raw LaTeX (that saves a DVI), it is so much faster! Initially I had used LuaLaTeX to use special fonts to resemble the CODATA Data Science Journal, but all that overhead is no longer necessary. Therefore I also removed the MANY extra LaTeX packages we were importing. The paper builds and is able to construct one of its images (the git-branching figure) with only 7 packages beyond the minimal TeX/LaTeX installation. Also in terms of processing it is so much faster. The text is just temporary now, and mainly just a place holder. With the next commit, I'll fill it with proper text.
2020-05-01	Imported recent changes in Maneage, minor conflicts fixed	Mohammad Akhlaghi	-12/+9
	A few small conflicts showed up here and there. They are fixed with this merge.
2020-05-01	Abstract: three minor language edits	Boud Roukema	-4/+4
	The difference between `that` and `which` is not strictly required, but it helps clarify the difference in meaning, which is important in science and software :). This is best shown by an example: * Maneage provides reproducibility, which is a good thing. The sentence would make sense if we drop `, which is a good thing.` The last part of the sentence is a comment rather than a necessary part of the sentence. * Maneage provides a quality of reproducibility that is missing from other implementations. The sentence would not quite make sense if we drop `that is ...`, since we would not know what sort of quality is provided. The fact that the quality is missing is key to the intended meaning of the sentence.
2020-05-01	Merged David's suggestions, further edited to be more clear	Mohammad Akhlaghi	-7/+5
	It is also slightly shorter with this commit, without losing anything substantial.
2020-05-01	Minor edits in abstract	David Valls-Gabaud	-7/+7
	No need to invent a new word (archive-able) when an existing one (archivable) does the job. One issue that we have not included and which perhaps we could discuss in the paper (space permitting), is that this tool could bypass the use of blockchains in this context.
2020-05-01	Minor edits in abstract, link between analysis and narrative added	Mohammad Akhlaghi	-3/+3
	As discussed by Boud in the previous commit, this is an important feature that was lost in the new abstract. So I added it as a criterion.
2020-05-01	Several minor edits to the title + abstract	Boud Roukema	-12/+13
	Most are minor English tidying, e.g. * spelling: achieving * archivable - https://en.wiktionary.org/wiki/archivable * `i.e.` does not look good in an abstract; * `when` didn't sound quite right; Comment: we no longer state one of the most interesting aspects of Maneage - producing the draft paper that is submittable for peer review in a way that makes it natural for the authors to achieve automatic consistency between the calculations/analysis and the values in the paper. But this is hard to describe in a compact way without disrupting the overall argument of the abstract, so it's a bit of a pity, but people will learn about it anyway from the body of the article (or from trying out the package!) `Peer-review verification` does not directly state producing a pdf. Related to this absence of talking about reproducing the paper, not just the calculations, I suggest dropping `, with snapshot \projectversion` from the abstract initially sent to the journal (they can't stop us updating it afterwards), because without the context of explaining that the paper itself is produced from the package, it's not clear what the snapshot means - a snapshot of the abstract? In the `real` paper, it makes sense, because the reader will have access to the rest of the paper.
2020-05-01	Edited abstract for more clarity, still in the 250 word limit	Mohammad Akhlaghi	-28/+15
	Boud's suggestions in the previous commit were great and really helped in improving the tone of the abstract (and thus the whole paper shortly!), better putting it in the big picture. I had forgotten to give the exact word limit (which was 250), so Boud had set it to a very conservative value of 190. I added around 22 words to better highlight the points we want to make, while still being below the limit.
2020-05-01	Abstract re-organized to be more research-oriented	Boud Roukema	-7/+28
	To make this a research article, we either have to present it as a theoretical advance, or as an empirical advance. An empirical research result would be something like doing a survey of users and getting statistics of their success/failure in using the system, and of whether their experience is consistent with the claimed properties and principles of Maneage (e.g. success/failure in creating paper.pdf as expected? was the user's system POSIX? did the user do the install with non-root privileges? was this a with-network or without-network ./project configure ?) This is doable, but would require a bit of extra work that we are not necessarily motivated to do or have the time to do right now. I think it's possible to present Maneage as a theoretical advance, but it has to be worded properly. Maneage is a tool, but it's a tool that satisfies what we can reasonably present as a unique theoretical proposal. Here's my proposed rewrite. I've aimed at minimum word length. I've also included (commented out) keywords for a structured research abstract - these are just for us, as a guideline to improve the abstract. I think "criteria" is safer than "standards". Whether a principle is good or bad tends to lead to debate. Whether a criterion is satisfied or not is a more objective question, independent of whether you agree with the criterion or not. In the rewrite below, we propose a theoretical standard and show that the new standard can be satisfied. Maneage is used as a tool to prove that the standard is not too difficult to achieve. Maneage is no longer the subject of the paper. (That won't change the main body of the paper too much, apart from compression, but the way it's presented will have to change, under this proposal.) The title would need to match this. E.g. 
	TITLE.1: Evidence that a higher standard of reproducibility criteria is attainable
	TITLE.2: Evidence that a rigorous standard of reproducibility criteria is attainable
	TITLE.3: Towards a more rigorous standard of reproducibility criteria
	I would probably go for TITLE.3.
2020-05-01	Abstract re-written to better highlight the uniqueness of Maneage	Mohammad Akhlaghi	-9/+9
	This abstract is a first step in order to put more focus on the research aspects of Maneage.
2020-05-01	Removed Definition and Summary sections and low-level figures	Mohammad Akhlaghi	-129/+19
	Given the very strict limits of journals, we needed to remove these sections and images. The removed images are: the `figure-file-architecture', `figure-src-topmake' and `figure-src-inputconf'. In total, with `wc' we now have 9019 words. This will be further reduced when we remove all the technical parts of the Maneage section. In short, we will only describe the generalities, not any specific details.
2020-04-27	Thanked Fabrizio, Tamara and Nadia for their support	Mohammad Akhlaghi	-1/+4
	They supported my visit and talk on Maneage at the Barcelona Super Computing center. They have also offered to read the paper and are providing comments. Also, I noticed that in the author list, we had forgotten to put a `,' after Boud's name. That is also corrected here.
2020-04-25	Demonstration cloning URL set to https://git.maneage.org/project.git	Mohammad Akhlaghi	-2/+2
	Until now, we were using GitLab as the main Git repository of Maneage. But today I finally set up our own Git repository under `git.maneage.org' and enabled a CGit web interface for simple and fast viewing of the commits and changes. Since this URL is under our own control, we can always ensure that it will point to somewhere meaningful, on any server, so in the long run it's much better than publishing the paper with an explicit reliance on `gitlab.com'.
2020-04-25	IMPORTANT: Primary Maneage repositories are now under maneage.org	Mohammad Akhlaghi	-15/+11
	Until now, the primary Maneage URLs were under GitLab, but since we now have a dedicated URL and Git repository, it's better to transfer to this as soon as possible. Therefore with this commit, throughout Maneage, any place that Maneage was referenced through GitLab has been corrected. Please correct your project's remote to point to the new repository at `git.maneage.org/project.git', and please make sure it follows the `maneage' branch. There is no more `master' branch on Maneage.
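For an existing clone, re-pointing the remote as requested here might look like the sketch below. The function name `set_maneage_remote` is hypothetical, and the remote name `origin` is only an assumed default: many projects track Maneage under a differently named remote, so check `git remote -v` first.

```shell
#!/bin/sh
# Hypothetical helper: point an existing remote at the new Maneage URL.
#   $1: path to the project's Git clone
#   $2: name of the remote that tracks Maneage (assumed default: origin)
set_maneage_remote() {
    repo_dir="$1"
    remote_name="${2:-origin}"
    git -C "$repo_dir" remote set-url "$remote_name" \
        https://git.maneage.org/project.git
}

# Usage (needs network access for the fetch):
#   set_maneage_remote /path/to/my-project
#   git -C /path/to/my-project fetch origin maneage
```

Fetching the `maneage` branch afterwards, as in the usage comment, matches the note above that there is no longer a `master` branch.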
2020-04-23	Minor edits on Boud's great corrections	Mohammad Akhlaghi	-13/+12
	Reading over Boud's edits, I noticed a few other parts that I could summarize more and corrected one or two other parts to fit the original purpose of the sentence better.
2020-04-23	Conclusion	Boud Roukema	-9/+9
	Reduction by about 5 words. Although it's true that the low-level tools - make, bash, gcc - are still being actively developed, only expert users will tend to notice the differences, and in this context, it's probably more useful to point out that these are actively maintained. (Comment: I felt that the first sentence in the Conclusion is missing one of the obvious criteria for handling big data - citizen control so that big data could hopefully become less Orwellian than it is right now, with GAFAM having the main big data databases that are used by AI researchers and will tend to affect people's lives more than traditional "scientific" databases. But there's no point adding this here, since the criteria that tend to satisfy the scientific requirements ("principles") and citizens' rights tend to overlap to a fair degree...)
2020-04-23	Discussion/caveats section.	Boud Roukema	-27/+27
	Reduction of about 50 words. There were a couple of expressions that look a bit like some sort of software/research analysis jargon, such as `Research Objects`, `Software Heritage`, `Machine actionable`. Unless these are defined, capitalising them makes the reader assume that there is some well-known formal meaning and that s/he has to search for that him/herself. As lower case expressions, the reader can guess some reasonable meanings of these. The word "embargo" was introduced for proposal 2) to handle the third caveat.
2020-04-23	Further edits to summarize the parts corrected by Boud	Mohammad Akhlaghi	-43/+39
	[Compared to first submission to DSJ last week with 11436 words in raw PDF, we have decreased the paper by ~1000 words to 10493 :-)] As with the previous commits, the moment Boud changed the structure of sentences, I was able to find the redundancies and remove them! This is a fascinating feature of collaboration I had never felt before: it is so hard to find redundancies in my own raw text, but even a minor correction by someone else suddenly breaks my mental memories/barrier on the sentence, allowing me to be more critical of it! Anyway, besides such corrections, I fixed a few other things: 1) In the DSJ's recently published papers, there is no `~' between "Figure" and its number. 2) I noticed that in `tex/src/figure-src-inputconf.tex' I was actually using manually input strings for the filename, checksum and size! This was contrary to the whole philosophy of Maneage(!), I must have rushed and forgotten! So LaTeX variables are now defined and used.
2020-04-23	4.6 Project analysis - publication	Boud Roukema	-14/+12
	About 20 words less. The ArXiv URL is added - this adds no extra length in words, and some readers will not be familiar with ArXiv (although the COVID-19 pandemic has attracted attention to bioRxiv).
2020-04-23	4.5 Project analysis - multi-user	Boud Roukema	-4/+4
	Increase by 5 words. We don't need to give a big warning here, but "Permissions management" is meant to be a brief way of saying that whether or not different users can really read/write/execute in subdirectories will firstly depend on whether the user who cloned Maneage has handled these permissions correctly and whether s/he is able to allow others to edit in his/her subdirectories. Comment: Users would have to check who else is logged in at the time, who else is running jobs, and so on. On a supercomputer this might make sense, to avoid unnecessary recompiles. Anyway, this edit summary is not the place to discuss this...
2020-04-23	4.4 Project analysis - git branches	Boud Roukema	-12/+12
	Reduction by 15 words. "Branch" is fine as a verb, and "off" is fine as a preposition; there's no need for a second preposition. "We branched off the main forest path onto a smaller path".
2020-04-23	4.3.6 Project analysis - configure files	Boud Roukema	-13/+14
	Length reduction by about 15 words. A semantically significant change is from `leading to more robust scientific results` to `evolves in the case of exploratory research papers, and better self-consistency in hypothesis testing papers`. I said this in a previous commit, but it can't hurt repeating: In the covidian epoch (though not only), it is especially important to distinguish bayesian type exploratory research (typical in astronomy or searching for a good COVID-19 treatment or vaccine) from hypothesis testing (clinical testing in double-blind random access trials with clinical trials methods published on a public registry prior to the trials taking place). In the latter case, you want your results to be analysed consistently with the plan published before the trials even begin, and ideally you want them to be published (or at least posted on the trial registry website) even if your results are insignificant, to avoid a publication bias in favour of significant results. Test homeopathy against placebos in 1000 independent experiments, analyse them all the same way, and 2-3 experiments will be significant at the 3 sigma level...
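The "2-3 experiments" figure in the homeopathy example can be checked with a short calculation. The assumptions here are not stated in the commit message: Gaussian statistics, a two-sided 3 sigma threshold, a true null effect, and 1000 independent experiments.

```latex
\begin{align}
  p &= \operatorname{erfc}\!\left(\frac{3}{\sqrt{2}}\right) \approx 0.0027, \\
  \langle N_{\mathrm{false}} \rangle &= 1000\,p \approx 2.7 .
\end{align}
```

Under these assumptions the expected number of spurious 3 sigma detections is about 2.7, consistent with the estimate quoted above.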
2020-04-23	4.3.5 Project analysis - downloads	Boud Roukema	-5/+4
	Reduction by about 7 words. I added "internet security" as an extra reason for having all the downloads in a single file. Modularity and minimal complexity in themselves generally contribute to internet security, but in this case, it's obvious that having all the communication with the outside world managed through a single file makes internet security management much simpler. I replaced the "fake URL" by the real one, because at least in the present format, the URL fits in nicely. So both `paper.tex` and `tex/src/figure-src-inputconf.tex` are modified in this commit.