paper-concept.git - Paper (Towards Long-term and Archivable Reproducibility)

Age	Commit message	Author	Lines
2021-06-08	Improved appendix on archival	Mohammad Akhlaghi	-0/+15
	Until now the appendix only touched upon the archival aspects of scholarly research products (data, code, narrative). To improve clarity, the context of this section has been improved, giving more explanations and examples.
2021-04-09	Comments by Konrad Hinsen implemented	Mohammad Akhlaghi	-0/+18
	Konrad had kindly gone through the paper and the appendices with very good feedback that is now being addressed in the paper (thanks a lot Konrad!):
	- IPOL recently also allows Python code. So the respective parts of the description of IPOL have been updated. To address the dependency issue, I also added a sentence that only certain dependencies (with certain versions) are acceptable.
	- On Active Papers (AP: which is written by Konrad) corrections were made based on the following parts of his comments:
	  - "The fundamental issue with ActivePapers is its platform dependence on either Java or Python, neither of which is attractive."
	  - "The one point which is overemphasized, in my opinion, is the necessity to download large data files if some analysis script refers to it. That is true in the current implementation (which I consider a research prototype), but not a fundamental feature of the approach. Implementing an on-demand download strategy is not particularly complicated, it just needs to be done, and it wasn't a priority for my own use cases."
	  - "A historical anecdote: you mention that HDF View requires registering for download. This is true today, but wasn't when I started ActivePapers. Otherwise I'd never have built on HDF5. What happened is that the HDF Group, formerly part of NCSA and thus a public research infrastructure, was turned into a semi-commercial entity. They have committed to keeping the core HDF5 library Open Source, but not any of the tooling around it. Many users have moved away from HDF5 as a consequence. The larger lesson is that Richard Stallman was right: if software isn't GPLed, then you never know what will happen to it in the future."
	- On Guix, some further clarification was added to address Konrad's quote below (with a link to the blog-post mentioned there). In short, I clarified that I mean storing the Guix commit hash with any respective high-level analysis change is the extra step.
	  - "I also looked at the discussion of Nix and Guix, which is what I am mainly using today. It is mostly correct as well, the one exception being the claim that 'it is up to the user to ensure that their created environment is recorded properly for reproducibility in the future'. The environment is recorded in all detail, automatically. What requires some effort is extracting a human-readable description of that environment. For Guix, I have described how to do this in a blog post (https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix/), and in less detail in a recent CiSE paper (https://hal.archives-ouvertes.fr/hal-02877319). There should definitely be a better user interface for this, but it's no more than a user interface issue. What is pretty nice in Guix by now is the user interface for re-creating an environment, using the "guix time-machine" subcommand."
	- The sentence on Software Heritage being based on Git was reworded to fit this comment of Konrad: "The plural sounds quite optimistic. As far as I know, SWH is the only archive of its kind, and in view of the enormous resources and long-time commitments it requires, I don't expect to see a second one."
	- When introducing hashes, Konrad suggested the following useful paper that shows how they are used in content-based storage: DOI:10.1109/MCSE.2019.2949441
	- On Snakemake, Konrad had the following comment: "[A system call in Python is] No slower than from bash, or even from any C code. Meaning no slower than Make. It's the creation of a new process that takes most of the time." So the point was just shifted to the many quotations necessary for calling external programs and how it is best suited for a Python-based project.
	In addition some minor typos that I found during the process are also fixed.
2021-01-03	Imported recent updates in Maneage, minor conflicts fixed	Mohammad Akhlaghi	-1/+1
	There were only three very small conflicts that have been fixed.
2021-01-02	Copyright year updated in all source files	Mohammad Akhlaghi	-1/+1
	Having entered 2021, it was necessary to update the copyright years at the top of the source files. We recommend that you do this for all your project-specific source files also.
2020-12-29	Copyedit on Appendix A	Boud Roukema	-34/+34
	This commit makes many small wording fixes, mainly to Appendix A. It also inserts "quotes" around some of the title fields in 'tex/src/references.tex', since otherwise capitalisation is lost (DNA becomes Dna; 'of Reinhart and Rogoff' becomes 'of reinhart and rogoff'; and so on). I didn't do this for all titles, because some Have All Words Capitalised, which blocks the .bib file from choosing a consistent style.
2020-12-28	Minor edits, updated citation to published Menke+20 paper	Mohammad Akhlaghi	-4/+5
	Some minor edits were made to the paper to shorten it. In particular the example of IPOL was removed from the main body of the paper, and we'll just rely on the more extensive review of IPOL in the appendix. I also updated the referee report to account for the new Appendix A that is just an extended introduction. Also, I noticed that the Menke+20 paper that we replicate here has recently been published in the iScience journal. So its bibliography was updated from the bioRxiv information to the journal information. Also, the number of words (after removing abstract and captions and accounting for figures) is now only printed when the project is built with '--no-appendix'. This was done because this information is extra/annoying/unnecessary for the case where there is an appendix.
2020-12-28	The old/long introduction is now an appendix on necessity	Mohammad Akhlaghi	-5/+5
	In the first/long draft of this work, we had a good introduction on the necessity of reproducibility. But we were forced to remove it because of word-count limits. Having moved a major portion of the previous work into the appendices, I thought it would be good to put that introduction as a first appendix also, focused on the necessity for reproducible research.
2020-11-23	First draft of all the points addressed by the referees	Mohammad Akhlaghi	-3/+216
	A new directory has been added at the top of the project's source called 'peer-review'. The raw reviews of the paper by the editors and referees have been added there as '1-review.txt'. All the main points raised by the referees have been listed in a numbered list and addressed (mostly) in '1-answers.txt'. The text of the paper now also includes all the implemented answers to the various points.
2020-11-15	First edits on the newly added appendices in new form	Mohammad Akhlaghi	-12/+11
	With the optional appendices added recently to the paper, it was important to go through them and make them fit better into the paper.
2020-11-04	Appendix of long paper added, optionally we can disable it	Mohammad Akhlaghi	-5/+5
	Given the referee reports, after discussing with the editors of CiSE, we decided that it is important to include the complete appendix we had before that included a thorough review of existing tools and methods. However, the appendix will not be published in the paper (due to the strict word-count limit). It will only be used in the arXiv/Zenodo versions of the paper. This actually created a technical problem: we want the commit hash of the project source to remain the same when the paper is built with an appendix or without it. To fix this problem the choice of including an appendix has gone into the 'project' script as a run-time option called '--no-appendix'. So by default (when someone just runs './project make'), the PDF will have an appendix, but when we want to submit to the journal, or when the appendix isn't needed for a certain reason, we can use this new option. The appendix also has its own separate bibliography. Some other corrections made in this commit: 1. Some new references were added that had an '_' in their source, they were corrected in 'references.tex'. 2. I noticed that 'preamble-style.tex' is not actually used in this paper, so it has been deleted.
2020-08-20	Imported recent updates in Maneage, minor conflicts fixed	Mohammad Akhlaghi	-0/+12
	Some very minor conflicts came up and were easily corrected. They were mostly in parts that are also shared with the demonstration in the core Maneage branch.
2020-07-04	Citing Maneage paper in acknowledgments	Mohammad Akhlaghi	-1/+1
	In the previous commit, the modified abstract of the acknowledgments only included the URL of Maneage, but it's more formal to cite the Maneage paper; the URL is already present in the paper.
2020-06-10	Updated text of default paper.tex, putting more recent examples	Mohammad Akhlaghi	-22/+78
	The text of the default paper hadn't been changed for a very long time! In this time, three papers using Maneage have been published (which can be very good as an example), Maneage also now has a webpage! With this commit, these examples and the webpage have been added and generally it was also polished a little to hopefully be more useful.
2020-05-29	Reproducible research based on open-access papers	Boud Roukema	-2/+2
	Publishing a paper on reproducible research without making it easy for readers to read the references would defeat the point. Of course we have to make some compromises with some journals' reluctance to shift towards the free world, but to satisfy scientific ethics, we should at least provide clickable URLs to the references, preferably to the ArXiv version if available [1], and also to the DOI, again, preferably to an open-access version of the URL if available. I was not able to fully get this done in the .bst file, so there's an sed/tr hack done to the .bbl file in `reproduce/analysis/make/paper.mk` to tidy up commas and spaces. This commit also reverts some of the hacks in the Akhlaghi IAU Symposium `tex/src/references.tex` entry, to match the improved .bst file, `tex/src/IEEEtran_openaccess.bst`, provided here with a different name to the original, in order to satisfy the LaTeX licence. [1] https://cosmo.torun.pl/blog/arXiv_refs
2020-05-22	Re-write of the paper to fit in ~6000 words and IEEE format	Mohammad Akhlaghi	-0/+1786
	Following the fact that the DSJ editor decided that this paper doesn't fit into their scope, we decided to submit it to IEEE's Computing in Science and Engineering (CiSE). So with this commit the text was re-written to fit into their style and word-count limitations.
2020-05-02	First implementation of style in IEEEtran style	Mohammad Akhlaghi	-1772/+0
	The paper is no longer using LuaLaTeX, but raw LaTeX (that saves a DVI), it is so much faster! Initially I had used LuaLaTeX to use special fonts to resemble the CODATA Data Science Journal, but all that overhead is no longer necessary. Therefore I also removed the MANY extra LaTeX packages we were importing. The paper builds and is able to construct one of its images (the git-branching figure) with only 7 packages beyond the minimal TeX/LaTeX installation. Also in terms of processing it is so much faster. The text is just temporary now, and mainly just a placeholder. With the next commit, I'll fill it with proper text.
2020-05-01	Added interesting references by David	Mohammad Akhlaghi	-0/+28
	David suggested some interesting references, in particular about the problems with Jupyter notebooks, that are now added to the long version of the paper. We'll later decide if/how they can be used.
2020-04-18	Two papers cited, for research software and data management plans	Mohammad Akhlaghi	-0/+13
	These are important aspects that are highly relevant to Maneage: its philosophy (the former) and usability (the latter). To add them, I tried to summarize some other parts of the paper.
2020-04-14	Further text shrink, added Competing interest and Author contributions	Mohammad Akhlaghi	-0/+14
	To make the text easier to read and further comply with the author guidelines, the text was shrunk a little more and the two final sections were also added on "Competing interest" and "Author contributions". I also found the CODATA logo on Wikipedia in SVG format (vector graphics), so I replaced the previous pixelated PNG format with the PDF (converted from SVG).
2020-03-30	Section on starting new projects, and publishing project added	Mohammad Akhlaghi	-0/+20
	With the main structure of Maneage explained, I have started to explain how a new project is created, along with a schematic diagram that shows two scenarios of how Git can help with project management.
2020-03-28	Cleaned up the introduction, definitions for provenance and lineage	Mohammad Akhlaghi	-1/+35
	Until now, the introduction had repeated several things and also had a relatively long list of things to add in its end. Also, it was highly focused on scientific papers. With this commit, I effectively re-wrote it, with the starting paragraphs becoming more industry-friendly, while also focusing on the scientific cases. Many of the repetitive parts were removed and the listed items in the end were put into the text in a much better context. Also, now that the name of the system involves "lineage" (and a lot of focus is put on it in the start) the terms data provenance and lineage were defined in the definition section. Some other interesting points that I encountered during the research on definitions were added to the discussion and final lists, and the DOI of one reference paper was corrected.
2020-03-02	Described the first analysis phase with a demo subMakefile	Mohammad Akhlaghi	-0/+59
	Until now, there was no explanation of an actual analysis phase, therefore with this commit an example scenario with a readable Makefile is included. The Data lineage graph was also simplified to both be more readable, and also to correspond to this new explanation and subMakefile. Some random edits/typos were also corrected and some references added for discussion.
2020-02-16	Menke+2020 data is now imported and ready for later steps in plain text	Mohammad Akhlaghi	-1/+1
	The main problem with this dataset was the names of the journals (which sometimes have single quotes or apostrophes in them, which is really annoying for SED)! But ultimately, for the simple study we want to do here, the journal names are irrelevant, so in the end I just ignored the names. Later we can set an identifier for the journals if necessary. But now we have the basic information in a way that is usable in a plot to show in this paper.
2020-02-15	Edits in text, added Menke+2020 as a reference	Mohammad Akhlaghi	-0/+14
	The text was slightly improved/edited and I also recently came across the Menke et al. 2020 paper (DOI:10.1101/2020.01.15.908111), which also has some good datasets we can use as a demonstration here.
2020-02-07	Edited parts of the text	Mohammad Akhlaghi	-4/+4
	While reading over the already-written parts (hopefully on the way to completing the paper), they were edited/corrected to be clearer.
2020-02-06	Minor edits to various parts	Mohammad Akhlaghi	-0/+28
	Some edits were made after rereading of some parts.
2020-01-26	General project structure and configuration described	Mohammad Akhlaghi	-22/+67
	In the last few days I have been writing these two sections in the middle of other work. But I am making this commit because it has already become a lot! I am now going on to the description of `./project make'.
2020-01-20	Added figure showing project's file structure	Mohammad Akhlaghi	-3/+17
	It was a little hard to describe the file structure so instead of using a standard listing as most papers do, I thought of showing the file and directory structure as boxes within each other (modeled on the Gnome disk-utility). Some other polishing was done throughout the paper also.
2020-01-18	Raw draft (until now as a separate repository) imported	Mohammad Akhlaghi	-30/+1500
	Until now, I was writing the paper without the template. But we will soon be adding a tutorial to the template, and I thought it will be good to have an example demonstration here too. So I just brought the whole project into the template structure, allowing us to add the template analysis later when it's ready, and also allowing us to easily reproduce this paper of course (without having to worry about the host's TeXLive installation).
2020-01-01	Copyright statements updated to include 2020	Mohammad Akhlaghi	-1/+1
	Now that it's 2020, it's necessary to include this year in the copyright statements.
2019-04-13	Corrected copyright notices and info about adding copyright info	Mohammad Akhlaghi	-2/+2
	Until now, the files that people were meant to change didn't have a proper copyright notice (for example `Copyright (C) YOUR NAME.'). This was wrong because the license does not convey copyright ownership. So the name of the file's original author must always be included, even when people modify it (and add their own copyright-able modifications). With this commit, the file's original author (and email) are added to the copyright notice and when more than one person modified a file, both names have their individual copyright notice. Based on this, the description for adding a copyright notice in `README-hacking.md' has also been modified.
2019-04-12	Dependency BibTeX entries included only when necessary	Mohammad Akhlaghi	-249/+7
	Until now, there was a single `tex/src/references.tex' file that housed the BibTeX entries for everything (software and non-software). Since we have started to include the BibTeX entry for more software, it will be hard to manage the large (sometimes unused) BibTeX entries of the software in the middle of the non-software related citations in the text of the paper. Therefore, with this commit, a `tex/dependencies' directory has been made which has a separate BibTeX entry file for each software that needs one. After the software is built, this file is copied to the new `.local/version-info/cite' directory. At the end, the configure script will concatenate all the files in this directory into one file which will later be used with `tex/src/references.tex' by BibLaTeX. This greatly simplifies managing of citations, allowing us to focus on the software-building and paper-writing citations separately/cleanly (and thus be more efficient in both).
2019-04-12	Imported recent corrections, no conflicts	Mohammad Akhlaghi	-0/+115
	Some recent corrections that were done by Raul are now merged into the pipeline. There weren't any conflicts.
2019-04-12	Fixed some Scipy-related packages citations	Raul Infante-Sainz	-12/+42
	Until now, the Scipy citation was only one paper and not the correct one (it was the online manual). With this commit, Scipy is properly cited using the two papers. Also some modifications in `tex/src/references.tex' have been done (removing the last page number).
2019-04-12	Acknowledged Scipy-related packages: Cython, Matplotlib, Numpy and Scipy	Raul Infante-Sainz	-0/+85
	Until now, the name and version of all Python packages were indicated in the final paper, but not their main paper (if it exists). With this commit, some Python packages (Cython, Matplotlib, Numpy and Scipy) are now properly acknowledged by citing the source paper. `mpi4py' is also cited although this package is not yet included in the pipeline.
2019-04-12	Gnuastro's citation included in its build target	Mohammad Akhlaghi	-1/+1
	With this commit, we are applying the new style of citing software within the build rule of Gnuastro.
2019-04-02	Copyright notice added to remaining files	Mohammad Akhlaghi	-0/+21
	After doing a systematic search for files without a copyright notice, a few more were found that didn't have a notice. So a notice was added for them. I used this Bash command to find the files:
	    for f in $(find ./ -type f); do \
	      if [[ $f != .git ]]; then \
	        n=$(grep -i copyright $f | wc -l); \
	        echo "$n $f"; \
	      fi; \
	    done | awk '$1==0'
2019-02-13	Imported recent work on building Python within the pipeline	Mohammad Akhlaghi	-0/+94
	Raul Infante-Sainz added the building of Python (along with the Numpy and Astropy packages) into the pipeline. That work is now being merged into the main pipeline branch. There was only this small problem that needed to be fixed: the Python tarball's name after unpacking is actually `Python-X.X.X' (with a capital P), not `python-X.X.X'. This has been corrected with this merge.
2019-02-13	Astropy installed in the pipeline	Raul Infante-Sainz	-0/+94
	Astropy was added and one very important thing is that we have to use the pypi tarball (https://pypi.org/) (which is bootstrapped) and not the github tarball.
2019-02-06	Better management for .tex directories to build from tarball	Mohammad Akhlaghi	-0/+45
	In order to collaborate effectively in the project, even project members that don't necessarily want (or have the capacity) to do the whole analysis must be able to contribute to the project. Until now, the users of the distributed tarball could only modify the text and not the figures (built with PGFPlots) of the paper. With this commit, the management of TeX source files in the pipeline was slightly modified to allow this as cleanly as I could think of now! In short, the hand-written TeX files are now kept in `tex/src' and for the pipeline's generated TeX files (in particular the old `tex/pipeline.tex'), we now have a `tex/pipeline' symbolic-link/directory that points to the `tex' directory under the build directory. When packaging the project, `tex/pipeline' will be a full directory with a copy of all the necessary files. Therefore as far as LaTeX is concerned, having a build-directory is no longer relevant. Many other small changes were made to do this job cleanly which will just make this commit message too long! Also, the old `tarball' and `zip' targets are now `dist' and `dist-zip' (as in the standard GNU Build system).