path: root/paper.tex
AgeCommit message (Collapse)AuthorLines
2020-09-03Added example of DockerHub deleting unused Docker imagesMohammad Akhlaghi-1/+3
I saw this link today in the news (to be implemented from November 1st, 2020), and because it is directly related to this work, I added it. Many people assume that simply pushing a Docker image to DockerHub is enough to preserve it, but ignore how much it costs to maintain the storage and network capacity.
2020-08-20Data lineage and replicated plot in one rowMohammad Akhlaghi-9/+15
Until now, the replicated plot had the width of the full page and the data lineage graph was under it. Together they were covering more than half of the height of the page! But the plot showing the number of papers with tools really doesn't have too much detail, and all the space was being wasted. With this commit, the plot is now much much thinner and the data lineage graph has been fitted to the right of it.
2020-07-01Corrected small typo: ny --> anyMohammad Akhlaghi-1/+1
This was pointed out by Mervyn O'Luing.
2020-06-30Implemented comments by Mervyn O'LuingMohammad Akhlaghi-19/+17
Mervyn had read the paper and provided some interesting thoughts that I tried to implement. Mervyn's comments are shown below. I just haven't addressed the last point yet, because I am afraid it may make the text too long (we are already on the boundary of the word-limit). We have already discussed that it is a good research topic, and have hopefully triggered the curiosity of the readers to test it ;-).
-------------------
Page 2: Regarding Criterion 1: Completeness. A project must be self contained? So this includes not requiring root or administrator privileges. This suggests that the project is only made open after the development has been completed?
Regarding Criterion 5: 'a clerk can do it' -- in the pc world that we live in could this be taken as a disparaging comment?
Page 5: 'The C library is linked with all programs, and this dependence can hypothetically hinder exact reproducibility of results, but we have not encountered this so far.' - what do you think might happen if this does affect reproducibility? Do you have a plan to deal with this? Or are you going to wait until you hear of such cases as the number will probably be small? Have you done probability analysis to show that the rates are likely to be very small? Or should you have a disclaimer with maneage?
2020-06-28Zenodo identifier is extracted automatically from metadata.confMohammad Akhlaghi-2/+2
Until now, the Zenodo identifier was manually written in the paper. But now we have the Zenodo DOI in 'metadata.conf', so it's much more robust to get it from there (in case updated versions of the paper are published).
2020-06-22Acknowledged very useful discussions at the AMIGA group of IAAMohammad Akhlaghi-1/+2
I visited the AMIGA group in January this year and we had some very useful discussions on Maneage.
2020-06-21Minor edits to clarify some pointsMohammad Akhlaghi-241/+116
After going through Terry's corrections, some things were clarified more. Technically, I realized that many new-lines were introduced and corrected them. Also, in Roberto's biography, I noticed that compared to the others it has too many non-reproducibility details, so I removed the redundant parts for this paper.
2020-06-21Edits by Terry MahoneyTerry Mahoney-147/+285
Terry is an astronomer at IAC's Scientific Editorial Service and kindly agreed to review this paper for us and actually pushed this commit. I am just adding a commit message here.
2020-06-16Acknowledged contributions of Marios KarouzosMohammad Akhlaghi-3/+7
Marios had read the first draft of the paper (Commit f990bba) and provided valuable feedback (shown below) that ultimately helped in the current version. But because of all the work that was necessary in those days, I forgot to actually thank him in the acknowledgment, while I had implemented most of his thoughts. Following Marios' thoughts on the Git branching figure, with this commit, I am also adding a few sentences at the end of the caption with a very rough summary of Git. I also changed the branch commit-colors to shades of brown (incrementally becoming lighter as higher-level branches are shown) to avoid the confusion with the blue and green signs within the schematic papers shown in the figure.
Marios' comments (April 28th, 2020, on Commit f990bba)
------------------------------------------------------
I think the structure of the paper is more or less fine. There are two places that I thought could be improved:
1) Section 3 (Principles) was somewhat confusing to me in the way that it was structured. I think the main source of confusion is the mixing of what Maneage is about and what other programs have done. I would suggest to separate the two. I would have a short intro for the section, similar to what you have now. However, I would suggest to highlight the underlying goals motivating the principles that follow: reproducibility, open science, something else? Then I would go into the details of the seven principles. Some of the principles are less clear to me than others. For example, why is simplicity a guiding principle? Then some other principles appear to be related, for example modularity, minimal complexity and scalability to my eyes are not necessarily separate. Finally, I would separate the comparison with other software and either dedicate a section to that somewhere toward the end of the paper (perhaps a subsection for section 5) or at least condense it and put it as a closing paragraph for Section 3. As it is now I think it draws focus from Maneage and also includes some repetitions.
2) Section 4 (Maneage) was at times confusing because it is written, I think in part as a demonstration of Maneage (i.e., including examples that showed how Maneage was used to write this or other papers) and a manual/description of the software. I wonder whether these two aspects can be more cleanly separated. Perhaps it would be possible to first have a section 4 where each of the modules/units of Maneage are listed and explained and then have the following section discuss a working example of Maneage using this or another paper.
3) I found Figure 7 [the git branching figure] and its explanation not very intuitive. This probably has to do with my zero knowledge of github and how versioning there works, but perhaps the description can be a bit more "user friendly" even for those who are not familiar with the tool.
4) I find Section 6 to be rather inconsequential. It does not add anything and it more or less is just a summary of what was discussed. I would personally remove it and include a very short summary of the ideals/principles/goals of Maneage at the beginning of Section 5, before the discussion.
2020-06-14Corrected the relation of POSIX and IEEEMohammad Akhlaghi-2/+3
Until now, we were saying "POSIX is defined by the IEEE", but in issue #12, Michael Crusoe pointed out that this is not accurate. It is actually jointly developed and operated by the IEEE, The Open Group and ISO/IEC JTC 1/SC 22, which together form the Austin Group. So the sentence was modified to say that the IEEE (potential publisher of this paper) is part of the Austin Group that develops the POSIX standard. Thanks a lot for bringing this up Michael.
2020-06-13Custom-built EPS icons in branching figureMarjan Akbari-1/+1
Until now, we were using three EPS icons (created from SVG) that were downloaded from https://www.flaticon.com. Therefore it was necessary to acknowledge the creators and put a link to the webpage. This consumed space in the caption and decreased the originality of the plot. Another problem was that the "collaboration" icon (with three people in it) had arrows, and some of those arrows pointed downwards, creating ambiguity in relation to the upward arrows under the commits. With this commit, three alternative icons are added that I made from scratch, using Inkscape. The collaboration icon now is two figures and two speech-bubbles, without any arrows.
2020-06-13Two small edits in demo listing and paragraph after itMohammad Akhlaghi-3/+3
Recently, by default, Maneage does not take the title directly in the PDF: the title should be given in the 'metadata.conf' file and it is passed onto LaTeX as a variable. So the comment to "add project title" in the listing could be confusing. To avoid confusion, I edited it to "Set your name as author". The comments above the '\title' part are very complete and users will clearly be able to modify the title if they want. Also, we had an extra ')' in the line just under it which is now corrected.
2020-06-09Two minor typos correctedMohammad Akhlaghi-2/+2
Two words were corrected in the text that made the sentences grammatically wrong (they were actually typos! historically they were correct, but we later changed the later part of the sentence without fixing the first part).
2020-06-07Added SoftwareHeritage link, minor typo corrections and clarificationsMohammad Akhlaghi-22/+25
The git history of the project is now archived on SoftwareHeritage and a link to it was added in the "Reproducible supplement" tag just under the abstract. Some corrections were also made in the text. In particular, the part explaining the separation of software and data reproducibility was slightly clarified.
2020-06-06Summarized abstract to be less than 150 wordsMohammad Akhlaghi-16/+15
Upon submission to CiSE we were informed that the abstract has to be less than 150 words to be processed. So with this commit, I am shrinking the abstract slightly, trying to remove some points that are less important and trying to shrink some of the sentences. Also, to avoid confusion and be more clear, the term "temporal provenance" has been replaced by "Recorded history".
2020-06-04Scale element in includegraphics for roughly similar-sized figuresMohammad Akhlaghi-3/+3
Until now, when the figures were built directly from EPS ('\newcommand{\makepdf}{}' was commented), they would take the full line-width becoming a little too large! I noticed this after letting arXiv build the PDF. With this commit, the 'includetikz' tool takes a second argument to be a parameter given to 'includegraphics' (which is scale in this case).
2020-06-04Final full reading, and minor edits to submit to Zenodo and arXivMohammad Akhlaghi-58/+57
Everything else regarding the submission to arXiv and Zenodo has been completed, so I did a final read, making some minor edits to hopefully make the text easier to read.
2020-06-04Verification activated, README added, Proper metadata in plot dataMohammad Akhlaghi-2/+8
All the steps following the to-be-added (in 'README-hacking.md') publication checklist prior to the final check from new clone have been added:
- 'README.md' file has been set.
- "Reproducible supplement" was added just above the keywords, pointing to Zenodo.
- A link to the to-be-uploaded data underlying the plot was added in the caption of the tools-per-year plot.
- A new meta-data configuration file was added to store basic project metadata to be used throughout the project. This will later be taken into Maneage. For example the project title is now stored here and written into the paper's LaTeX source and output datasets automatically.
- Verification was activated and plot's data and LaTeX macro files are now automatically verified.
- Complete metadata was added for the data underlying the plot.
- A generic function was added in 'initialize.mk' that will automatically write project info and copyright in all plain-text outputs.
2020-06-03Adding point on small-ness of final product, some summarizationMohammad Akhlaghi-83/+65
I noticed that we hadn't included the publication of the workflow and the advantage that Maneage provides in this regard. So it was added at the end of the proof-of-concept section. However, it was necessary to summarize some other parts to not increase the wordcount.
2020-06-01Edits by DavidDavid Valls-Gabaud-100/+110
These are some corrections that David sent to me by email and I am committing here.
2020-06-01Implemented Antonio's suggestion and thanked himMohammad Akhlaghi-1/+2
Antonio Diaz Diaz (author of the Lzip program/library), has had a very supportive role in what became Maneage in the last 4 years. For example I really started to appreciate the value of simplicity and archivability while reading Lzip's documentation. Fortunately he also read a recent version of the paper and was again very supportive. Some of the minor points he raised had already been fixed, but using 'supplier' instead of 'server' (in the Free Software criterion) was new so I implemented it here with this commit. With this, I am also thanking him for all his wonderful support and encouragement in the last 4 years.
2020-06-01Minor edits to clarify some of the previous correctionsMohammad Akhlaghi-3/+3
Boud's point about a "random reader" not being a good example case was correct. But "user" also gives it a software perspective that is of course not wrong, it can just be confusing. So I thought of changing it to "interested reader". In the part about the C-library dependency of high-level software, from Boud's correction, I found out that it is very hard to convey what I wanted to say (that separating errors due to C-library implementation and measurement errors will be easy, because they should be on much different scales). But I then corrected it to give it a slightly better tone while mentioning the same thing: that with Maneage we can now accurately measure the effect of the C library.
2020-05-31Mostly minor edits of nearly final versionBoud Roukema-17/+18
Changes with this commit are mostly minor and obvious. Some worth commenting on include:
* `technologies develop very fast` - As a general statement, this is too jargony, since technology is much wider than just `software`; `some technologies` makes it clear that we're referring to the specific case of the previous sentence
* `in a functional-like paradigm, enabling exact provenance` - While `make` is not an imperative programming language, I don't see how `make` is `like` a functional programming language. Classifying it as a declarative and a dataflow programming language and as a metaprogramming language would seem to go in the right direction [1-3]. I also couldn't see how the language type relates to tracking exact provenance. But since we don't want to lengthen the text, my proposal is to put `and efficient in managing exact provenance` without trying to explain this in terms of a taxonomy of programming languages.
[1] https://en.wikipedia.org/wiki/Functional_programming
[2] https://en.wikipedia.org/wiki/Comparison_of_multi-paradigm_programming_languages
[3] https://en.wikipedia.org/wiki/Dataflow_programming
* `A random reader` - In the scientific programming context, `random` has quite specific meanings which we are not using here; a `reader` has not necessarily tried to reproduce the project. So I've proposed `A user` here - with the idea that a `user` is more likely to be someone who has done `./project configure && ./project make`.
* `studying this is another research project` - the present tense `is` doesn't sound so good; I've put what seems to be about the shortest natural equivalent.
Pdf word count: 5856
2020-05-30Corrected a few words for more clarityMohammad Akhlaghi-2/+2
An "internally" was added to the part about core GNU tools accounting for the differences between POSIX-compatible systems. One extra word was also removed in the next sentence.
2020-05-30Corrected a few words to make POSIX-fuzzyness paragraph more clearMohammad Akhlaghi-2/+2
Hopefully, it is more to the point with these few word-corrections.
2020-05-30Discussion on issues with POSIX and minor edits to shorten paperMohammad Akhlaghi-39/+47
Konrad raised some very interesting points in particular about the limitations of POSIX as a fuzzy standard that does not guarantee reproducibility. A relatively long paragraph was thus added in the discussion to address this important point. In order to fit it in, the paragraph on "unwanted competition" was removed since the POSIX issue was much more relevant for a curious reader. Throughout the text, some other parts were edited to decrease the length of the paper while making it easier to read.
2020-05-30Minor edits removing redundant sentencesMohammad Akhlaghi-4/+3
Some of the redundant sentences have been removed and some minor edits made.
2020-05-29Minor tidying of about half a dozen wordsBoud Roukema-11/+11
The changes in this commit are best shown with `git diff --word-diff` or `git log -p --word-diff`. There are about half a dozen changes of 1-2 words or a comma, the reasons should be obvious. The sentence with "can not just" seems to be correct formally, but "can not only" seems to me better to warn the reader that this is a phrase of the form "can not only do X but can also do Y"; "can not just" sounds a bit like "You cannot just enter the room without knocking" - it doesn't require a second part.
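For readers unfamiliar with word-level diffs, here is a minimal sketch of what `git diff --word-diff` reports for a one-word change like the one discussed above. The throwaway repository, the file name 'sentence.txt' and the demo identity are assumptions made only for this illustration; they are not part of the project.

```shell
# Create a throwaway repository (hypothetical, for illustration only).
demo=$(mktemp -d)
git -C "$demo" init -q

# Commit the first wording of the sentence.
printf 'It can not just do X but can also do Y.\n' > "$demo/sentence.txt"
git -C "$demo" add sentence.txt
git -C "$demo" -c user.name=Demo -c user.email=demo@example.org \
    commit -q -m "First wording"

# Change a single word, then ask for a word-level (not line-level) diff.
printf 'It can not only do X but can also do Y.\n' > "$demo/sentence.txt"
word_diff=$(git -C "$demo" diff --word-diff)
printf '%s\n' "$word_diff"
```

In the default plain mode, removed words are wrapped in `[-...-]` and added words in `{+...+}`, so a one-word edit stands out even inside a long line of LaTeX source.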
2020-05-29Edits to the text, making it slightly shorter and more clearMohammad Akhlaghi-62/+54
One major point was that following Konrad's suggestion the issue of not being familiar with the Lisp/Scheme framework of GWL is now removed. We actually mention the main problem we have had with Guix, but also highlight that their solution was one of the main inspirations for this work.
2020-05-29Adding small paragraph for Raul's biographyRaul Infante-Sainz-1/+4
Until this commit, there was only a small description of me. With this commit, I have added a small paragraph with my biography. I know we are very restricted because of the word limit so I tried to be very short!
2020-05-29Minor typos correctedRaul Infante-Sainz-5/+6
With this commit, I have corrected several minor typos.
2020-05-29Minor corrections in abstract and introductionRaul Infante-Sainz-9/+9
With this commit, I did some minor changes in these Sections. Main changes are: define the contraction `OS' from Operating System and use only `OS' later on, and not use contractions like `isn't'
2020-05-29Cut down biography and inclusion of a mention to reproducibilityRoberto Baena-Gallé-5/+5
Before this commit: Roberto's bio was about 120 words. With this commit: it is now less than 100 words. A comment about reproducibility has been added.
2020-05-29Reproducible research based on open-access papersBoud Roukema-1/+1
Publishing a paper on reproducible research without making it easy for readers to read the references would defeat the point. Of course we have to make some compromises with some journals' reluctance to shift towards the free world, but to satisfy scientific ethics, we should at least provide clickable URLs to the references, preferably to the ArXiv version if available [1], and also to the DOI, again, preferably to an open-access version of the URL if available. I was not able to fully get this done in the .bst file, so there's an sed/tr hack done to the .bbl file in `reproduce/analysis/make/paper.mk` to tidy up commas and spaces. This commit also reverts some of the hacks in the Akhlaghi IAU Symposium `tex/src/references.tex` entry, to match the improved .bst file, `tex/src/IEEEtran_openaccess.bst`, provided here with a different name to the original, in order to satisfy the LaTeX licence. [1] https://cosmo.torun.pl/blog/arXiv_refs
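As a rough illustration of the kind of sed/tr tidying mentioned above, the sketch below removes spaces before commas and squeezes repeated spaces. It is a hypothetical example, not the actual rule in `reproduce/analysis/make/paper.mk`; the function name `tidy_refs` and the sample input are assumptions.

```shell
# Hypothetical clean-up, NOT the project's actual rule in paper.mk:
# drop any spaces before a comma, then squeeze runs of spaces into one.
tidy_refs() {
    sed -e 's/ *,/,/g' | tr -s ' '
}

# Example input with stray spaces around the commas.
tidy_refs <<'EOF'
Akhlaghi ,  M. ,  2020 , arXiv
EOF
```

Keeping such post-processing as a small filter applied to the generated file leaves the bibliography style file itself untouched.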
2020-05-29pdftotext only called if present in system, minor editMohammad Akhlaghi-2/+2
David and Raul had both reported that because 'pdftotext' wasn't available on their system, the project failed (even though the PDF was built!). So with this commit, we first check if the system has 'pdftotext' and call it only if it is available. Some minor edits were made, building upon Boud's previous commit.
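The check described above can be sketched in POSIX shell as follows. This is only an assumed illustration: the function names are hypothetical, the project's real rule lives in its Makefiles, and using 'pdftotext' to count words is an assumption here.

```shell
# Succeed only when the named program is found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Count the words of a PDF only when 'pdftotext' is installed; otherwise
# print a notice and carry on, so a missing optional tool never fails
# the whole build. (Hypothetical helper, not the project's code.)
count_pdf_words() {
    if have_tool pdftotext; then
        pdftotext "$1" - | wc -w
    else
        echo "pdftotext not found: skipping word count" >&2
    fi
}
```

Calling `count_pdf_words paper.pdf` then prints a word count where the tool exists and only a notice where it does not.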
2020-05-29Section V - small changesBoud Roukema-25/+26
This commit provides mostly small changes. There didn't seem much point in repeating the `lessons learned` jargon and claiming that we draw good conclusions - insights - from our experience. Better just state what hypotheses we have generated from the experience rather than give the misleading impression that our hypotheses are well-established facts. In the comments, I put a suggested translation of what the `lessons learned` jargon means. I seem to have first heard this term in the mainstream media a few years after the US 2003 attack on Iraq, when a US military representative stated that the US forces had "learned lessons" after having started a war of aggression against Iraq.
2020-05-29Sentence with the clerk who can do it, software as uncountable nounBoud Roukema-2/+2
This commit changes two lines. (1) Keeping the exact quote with the clerk while having a sentence that makes sense in plain English cannot be done, it seems to me, without making the sentence a bit longer. Here's one option that seems about the best we can do, even though it still sounds a bit funny, because it's hard to write a future conditional with the present "can". Since it's a quote, it will probably survive the proofreaders. (2) Software is an uncountable noun [1], so we say "software is", like "water is"; "used software" sounds odd; I added "is itself" to emphasise that we're especially talking about the full chain of software for running the project. This commit modifies the "When the ..." sentence and hopefully sounds better. [1] https://en.wiktionary.org/wiki/software#Noun
2020-05-29Added top-make.mk as a listing for demonstration, minor editsMohammad Akhlaghi-31/+53
To help show the simplicity of 'top-make.mk', it was included as a listing. I also went over some of Boud's corrections and made small edits. In particular:
- The '\label' and '\ref' to a section were removed. I did this after inspecting some of their recent papers and noticing that they generally have a simple flow, without such redirections.
- In the part about the RDA adoption grant, I moved the "from the researcher perspective" to the end. Because Austin+2017 is mainly focused on data-center management, not the researcher's. They do touch upon researcher solutions that can help data-base managers, but not directly the researchers. In effect with this grant, they acknowledged that our researcher-focused solution conforms to their criteria for data-base management.
2020-05-29Many small changes to Section IV - proof of concept: maneageBoud Roukema-60/+63
Possibly the least trivial edit in this commit is that the previous text appeared to state that it's normal to find that a project prepared with `maneage` may be ... unbuildable. Which would defeat our whole claim of reproducibility! Obviously, `maneage` is still in a rapid development stage and might still have significant, not-yet-detected bugs. But the wording has to explain that this would constitute a bug in `Maneage` (in a particular version of it), not an expected regular event. :) This commit aims to fix that and other minor wording issues in IV. Pdf word count 5855.
2020-05-28Cherry-pick 7bf5fcd to make merging easierBoud Roukema-6/+3
This series of commits aims to edit sections II+III, but first implements the changes from 7bf5fcd, apart from one that conflicts in the abstract: this commit has ``Maneage'' without `(managing+lineage)` in the abstract. From Mohammad: this commit has been rebased after several other parallel branches, so some things may differ from the message.
2020-05-23Some minor edits on Boud's recent correctionsMohammad Akhlaghi-12/+12
Generally they were great, but after looking through them I thought a handful of them slightly changed my original idea so I am correcting them here. Boud, if you feel the changes aren't good, let's talk about it and find the best way forward ;-). They are mostly clear from a '--word-diff', just some notes on the ones that have changed the meaning:
* On the "a clerk can do it" quotation, since it's so short, I think it's better to keep its original form, otherwise a reader may think there were paragraphs instead of the "to" and we have changed their intention.
* In the part where we are saying that the workflow can get "separated" from the paper, I mostly meant to highlight that the data-centers and journals (hosts) may diverge in decades, or one of them may go bankrupt, etc. Hence losing the connection. The issue of it evolving can in theory be addressed through version control, so I think this is a more fundamental problem.
* In the part about free software, in the list, the original point was the free software that is used by the project, not the project itself (after all, the project itself falls under the "Open Science" title that is very fashionable these days, but my point here is to those people who claim to do "Open Science" with closed software (like Microsoft Excel!)).
2020-05-23Section III edits - 5901 wordsBoud Roukema-43/+43
This commit makes several small changes to Section III, some of which are quite significant in terms of meaning. It was difficult to improve the clarity without extending the word length. Now we're at 5901 words.
2020-05-23Section II edits + definition of solutionsBoud Roukema-23/+23
This commit implements quite a few minor changes in section II. The aim of most is to clarify the meaning and remove ambiguity. A few changes are that the reader will normally assume that successive sentences in a paragraph are closely related in terms of logical flow. It is superfluous - and considered excessive - to put too many "Therefore"'s and "Hence"'s in (at least) modern astronomy style. These are supposed to be used when there is a strong chain of reasoning. One change is done in the Introduction, because if we're going to use "solution(s)" throughout to mean "reproducible workflow solution(s)", then we have to clearly define this as jargon for this particular paper. It's probably preferable to RWS - reproducible workflow solution - or RWI - reproducible workflow implementation. But we can't just keep saying "solution" because that has many different meanings in a scientific context. Pdf word count = 5880
2020-05-23Cherry-pick 7bf5fcd to make merging easierBoud Roukema-1/+1
This series of commits aims to edit sections II+III, but first implements the changes from 7bf5fcd, apart from one that conflicts in the abstract: this commit has ``Maneage'' without `(managing+lineage)` in the abstract.
2020-05-23Main text: implement most of David's changesBoud Roukema-31/+33
This commit implements most of David's changes from c76727b, but excluding some, such as the proposal to use 'which' in a restrictive clause in the abstract. This is allowed, but the Fowler brothers' rule tends to be followed in science writing: https://ell.stackexchange.com/questions/5/is-there-any-difference-between-which-and-that
A few points on the abstract:
* an immediate solution = singular
* The "immediate, fast short-term" benefits sentence sounded like it was redundantly superfluously repetitively repeating doubled-up information. Hopefully this edit is better.
* in the %Conclusion, "solutions" is vague, like people who say "technology" when they're only talking about software, so this edit reminds the reader to make the sentence more self-contained and understandable.
2020-05-23Biography style reverted to CiSE PDF mode (different from webpage)Mohammad Akhlaghi-18/+8
After a look at the PDFs of the linked papers of the previous commit and a few 2020 papers, we noticed that the biography format of the webpage and PDFs are different! So it is now back in its old way (which is how biographies are presented in the PDF). A few other minor edits were made in the text.
2020-05-23Affiliations CiSE styleBoud Roukema-6/+19
It appears from looking at https://ieeexplore.ieee.org/document/5725236/authors#authors https://ieeexplore.ieee.org/document/7878935/authors#authors that the affiliations section needs to start with a one-phrase definition of the author's main affiliation. In 5725236, the typesetters/proofreaders swapped van der Walt and Colbert, so don't be confused by that. It shows that nobody proofread properly. With this commit, each author's institute (single hierarchical level) is written as the first paragraph of the author's affiliation section. Since 5725236 allows a very-well-known acronym, I'm guessing that IAC can be defined for Mohammad and then re-used for the others. I've added a brief CV for me. If necessary, we could compress my main research together as "observational cosmology", but let's see how we go in the word count. I have not (yet) worked through the main text. There is also one minor language fix - `Because is complete` was incomplete. Pdf word count: 5873
2020-05-23Edits, to make the text more readableMohammad Akhlaghi-84/+81
After one day not looking at the first draft of this new version (commit 7b008dfbb9b2), I went through the text and did some general edits to make its presentation and logic smoother.
2020-05-23Typo and style corrections in the text, Roberto's bio addedRoberto Baena-Gallé-37/+41
Before this commit: several typos were present throughout the text. With this commit several typos have been corrected (types listed below) and my bio has been added. a) double words b) general typos c) commas after adverbs at the beginning of a sentence d) contractions are removed, e.g., don't vs do not e) three sentences in parenthesis have been removed since I think they were out of context or unnecessary f) etc
2020-05-22Re-write of the paper to fit in ~6000 words and IEEE formatMohammad Akhlaghi-121/+335
Following the fact that the DSJ editor decided that this paper doesn't fit into their scope, we decided to submit it to IEEE's Computing in Science and Engineering (CiSE). So with this commit the text was re-written to fit into their style and word-count limitations.