Age | Commit message (Collapse) | Author | Lines |
|
Until now the appendix only touched upon the archival aspects of scholarly
research products (data, code, narrative). To improve clarity, the context
of this section has been expanded with more explanations and examples.
|
|
Konrad had kindly gone through the paper and the appendices with very good
feedback that is now being addressed in the paper (thanks a lot Konrad!):
- IPOL recently also allows Python code. So the respective parts of the
description of IPOL have been updated. To address the dependency issue, I
also added a sentence that only certain dependencies (with certain
versions) are acceptable.
- On Active Papers (AP: which is written by Konrad) corrections were made
based on the following parts of his comments:
- "The fundamental issue with ActivePapers is its platform dependence on
either Java or Python, neither of which is attractive."
- "The one point which is overemphasized, in my opinion, is the necessity
to download large data files if some analysis script refers to it. That
is true in the current implementation (which I consider a research
prototype), but not a fundamental feature of the approach. Implementing
an on-demand download strategy is not particularly complicated, it just
needs to be done, and it wasn't a priority for my own use cases."
- "A historical anecdote: you mention that HDF View requires registering
for download. This is true today, but wasn't when I started
ActivePapers. Otherwise I'd never have built on HDF5. What happened is
that the HDF Group, formerly part of NCSA and thus a public research
infrastructure, was turned into a semi-commercial entity. They have
committed to keeping the core HDF5 library Open Source, but not any of
the tooling around it. Many users have moved away from HDF5 as a
consequence. The larger lesson is that Richard Stallman was right: if
software isn't GPLed, then you never know what will happen to it in the
future."
- On Guix, some further clarification was added to address Konrad's quote
below (with a link to the blog-post mentioned there). In short, I
clarified that the extra step I mean is storing the Guix commit hash
together with any respective high-level analysis change.
- "I also looked at the discussion of Nix and Guix, which is what I am
mainly using today. It is mostly correct as well, the one exception
being the claim that 'it is up to the user to ensure that their created
environment is recorded properly for reproducibility in the
future'. The environment is *recorded* in all detail,
automatically. What requires some effort is extracting a human-readable
description of that environment. For Guix, I have described how to do
this in a blog post
(https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix/),
and in less detail in a recent CiSE paper
(https://hal.archives-ouvertes.fr/hal-02877319). There should
definitely be a better user interface for this, but it's no more than a
user interface issue. What is pretty nice in Guix by now is the user
interface for re-creating an environment, using the "guix time-machine"
subcommand."
- The sentence on Software Heritage being based on Git was reworded to fit
this comment of Konrad: "The plural sounds quite optimistic. As far as I
know, SWH is the only archive of its kind, and in view of the enormous
resources and long-time commitments it requires, I don't expect to see a
second one."
- When introducing hashes, Konrad suggested the following useful paper that
shows how they are used in content-based storage:
DOI:10.1109/MCSE.2019.2949441
- On Snakemake, Konrad had the following comment: "[A system call in Python
is] No slower than from bash, or even from any C code. Meaning no slower
than Make. It's the creation of a new process that takes most of the
time." So the point was just shifted to the many quotations necessary for
calling external programs and how it is best suited for a Python-based
project.
In addition some minor typos that I found during the process are also
fixed.
|
|
There were only three very small conflicts that have been fixed.
|
|
Having entered 2021, it was necessary to update the copyright years at the
top of the source files. We recommend that you do this for all your
project-specific source files also.
|
|
This commit makes many small wording fixes, mainly to Appendix A.
It also inserts "quotes" around some of the title fields in
'tex/src/references.tex', since otherwise capitalisation is lost (DNA
becomes Dna; 'of Reinhart and Rogoff' becomes 'of reinhart and rogoff'; and
so on). I didn't do this for all titles, because some Have All Words
Capitalised, which blocks the .bib file from choosing a consistent style.
|
|
Some minor edits were made to the paper to shorten it. In particular the
example of IPOL was removed from the main body of the paper, and we'll just
rely on the more extensive review of IPOL in the appendix. I also updated
the referee report to account for the new Appendix A that is just an
extended introduction.
Also, I noticed that the Menke+20 paper that we replicate here has recently
been published in the iScience journal. So its bibliography was updated
from the bioRxiv information to the journal information.
Also, the number of words (after removing abstract and captions and
accounting for figures) is now only printed when the project is built with
'--no-appendix'. This was done because this information is
extra/annoying/unnecessary for the case where there is an appendix.
|
|
In the first/long draft of this work, we had a good introduction on the
necessity of reproducibility. But we were forced to remove it because of
word-count limits. Having moved a major portion of the previous work into
the appendices, I thought it would be good to put that introduction as a
first appendix also, focused on the necessity for reproducible research.
|
|
A new directory has been added at the top of the project's source called
'peer-review'. The raw reviews of the paper by the editors and referees have
been added there as '1-review.txt'. All the main points raised by the
referees have been listed in a numbered list and addressed (mostly) in
'1-answers.txt'. The text of the paper now also includes all the
implemented answers to the various points.
|
|
With the optional appendices added recently to the paper, it was important
to go through them and make them fit better into the paper.
|
|
Given the referee reports, after discussing with the editors of CiSE, we
decided that it is important to include the complete appendix we had before
that included a thorough review of existing tools and methods. However, the
appendix will not be published in the paper (due to the strict word-count
limit). It will only be used in the arXiv/Zenodo versions of the paper.
This actually created a technical problem: we want the commit hash of the
project source to remain the same when the paper is built with an appendix
or without it.
To fix this problem the choice of including an appendix has gone into the
'project' script as a run-time option called '--no-appendix'. So by default
(when someone just runs './project make'), the PDF will have an appendix,
but when we want to submit to the journal, or when the appendix isn't
needed for a certain reason, we can use this new option. The appendix also
has its own separate bibliography.
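The run-time choice described above can be sketched roughly as follows.
This is only a minimal illustration: the function name 'parse_options' and
the variable 'appendix' are hypothetical, and the real 'project' script may
handle the option differently.

```shell
# Hypothetical sketch: 'parse_options' and 'appendix' are illustrative
# names, not taken from the real 'project' script.
parse_options() {
    appendix=1                      # default: build the PDF with the appendix
    for arg in "$@"; do
        case "$arg" in
            --no-appendix) appendix=0 ;;   # requested on the command line
        esac
    done
}

# Default build (appendix included).
parse_options make
echo "appendix=$appendix"

# Journal submission (appendix left out).
parse_options make --no-appendix
echo "appendix=$appendix"
```

Because the choice is passed on the command line and not written into any
tracked file, both builds come from the same commit.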
Some other corrections made in this commit:
1. Some new references were added that had an '_' in their source; they
were corrected in 'references.tex'.
2. I noticed that 'preamble-style.tex' is not actually used in this paper,
so it has been deleted.
|
|
Some very minor conflicts came up and were easily corrected. They were
mostly in parts that are also shared with the demonstration in the core
Maneage branch.
|
|
In the previous commit, the modified abstract of the acknowledgments only
included the URL of Maneage, but it is more formal to cite the Maneage
paper; the URL is already present in the paper.
|
|
The text of the default paper hadn't been changed for a very long time! In
this time, three papers using Maneage have been published (which can serve
as very good examples), and Maneage also now has a webpage!
With this commit, these examples and the webpage have been added and
the text was also polished a little to hopefully be more useful.
|
|
Publishing a paper on reproducible research without making it easy for
readers to read the references would defeat the point. Of course we have to
make some compromises with some journals' reluctance to shift towards the
free world, but to satisfy scientific ethics, we should at least provide
clickable URLs to the references, preferably to the ArXiv version if
available [1], and also to the DOI, again, preferably to an open-access
version of the URL if available.
I was not able to fully get this done in the .bst file, so there's an
sed/tr hack done to the .bbl file in `reproduce/analysis/make/paper.mk` to
tidy up commas and spaces.
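As a rough idea of what such a clean-up can look like (this is NOT the
actual rule in `paper.mk`; the function name `tidy_bbl` and the exact
substitutions are assumptions made only for illustration):

```shell
# Hypothetical helper, not the real hack in paper.mk: squeeze repeated
# spaces with 'tr', then fix stray commas with 'sed'.
tidy_bbl() {
    tr -s ' ' | sed -e 's/ ,/,/g' -e 's/,,*/,/g'
}

# Example with a made-up reference line.
printf 'Author A. ,  Some title,, 2020\n' | tidy_bbl
```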
This commit also reverts some of the hacks in the Akhlaghi IAU Symposium
`tex/src/references.tex` entry, to match the improved .bst file,
`tex/src/IEEEtran_openaccess.bst`, provided here with a different name to
the original, in order to satisfy the LaTeX licence.
[1] https://cosmo.torun.pl/blog/arXiv_refs
|
|
Following the fact that the DSJ editor decided that this paper doesn't fit
into their scope, we decided to submit it to IEEE's Computing in Science
and Engineering (CiSE). So with this commit the text was re-written to fit
into their style and word-count limitations.
|
|
The paper is no longer using LuaLaTeX, but raw LaTeX (that saves a DVI); it
is so much faster! Initially I had used LuaLaTeX to use special fonts to
resemble the CODATA Data Science Journal, but all that overhead is no
longer necessary. Therefore I also removed the MANY extra LaTeX packages we
were importing. The paper builds and is able to construct one of its images
(the git-branching figure) with only 7 packages beyond the minimal
TeX/LaTeX installation.
The text is just temporary now, and mainly just a placeholder. With the
next commit, I'll fill it with proper text.
|
|
David suggested some interesting references, in particular about the
problems with Jupyter notebooks, that are now added to the long version of
the paper. We'll later decide if/how they can be used.
|
|
These are important aspects that are highly relevant to Maneage: its
philosophy (the former) and usability (the latter). To add them, I tried to
summarize some other parts of the paper.
|
|
To make the text easier to read and further comply with the author
guidelines, the text was shrunk a little more and the two final sections
were also added on "Competing interest" and "Author contributions".
I also found the CODATA logo on Wikipedia in SVG format (vector graphics),
so I replaced the previous pixelated PNG format with the PDF (converted
from SVG).
|
|
With the main structure of Maneage explained, I have started to explain how
a new project is created, along with a schematic diagram that shows two
scenarios of how Git can help with project management.
|
|
Until now, the introduction had repeated several things and also had a
relatively long list of things to add at its end. Also, it was highly
focused on scientific papers.
With this commit, I effectively re-wrote it, with the starting paragraphs
becoming more industry-friendly, while also focusing on the scientific
cases. Many of the repetitive parts were removed and the listed items at
the end were put into the text in a much better context.
Also, now that the name of the system involves "lineage" (and a lot of
focus is put on it in the start) the terms data provenance and lineage were
defined in the definition section.
Some other interesting points that I encountered during the research on
definitions were added to the discussion and final lists, and the DOI of
one reference paper was corrected.
|
|
Until now, there was no explanation on an actual analysis phase, therefore
with this commit an example scenario with a readable Makefile is included.
The Data lineage graph was also simplified to both be more readable, and
also to correspond to this new explanation and subMakefile.
Some random edits/typos were also corrected and some references added for
discussion.
|
|
The main problem with this dataset was the names of the journals (which
sometimes have single quotes or apostrophes in them, which is really
annoying for SED)! But ultimately, for the simple study we want to do here, the
journal names are irrelevant, so in the end I just ignored the names. Later
we can set an identifier for the journals if necessary.
But now we have the basic information in a way that is usable in a plot to
show in this paper.
|
|
The text was slightly improved/edited and I also recently came across
Menke et al. 2020 (DOI:10.1101/2020.01.15.908111), which also has some good
datasets we can use as a demonstration here.
|
|
While reading over the already written parts (hopefully to complete the
paper), they were edited/corrected to be clearer.
|
|
Some edits were made after rereading of some parts.
|
|
In the last few days I have been writing these two sections in the middle
of other work. But I am making this commit because it has already become a
lot! I am now going onto the description of `./project make'.
|
|
It was a little hard to describe the file structure so instead of using a
standard listing as most papers do, I thought of showing the file and
directory structure as boxes within each other (modeled on the Gnome
disk-utility).
Some other polishing was done throughout the paper also.
|
|
Until now, I was writing the paper without the template. But we will soon
be adding a tutorial to the template, and I thought it would be good to have
an example demonstration here too. So I just brought the whole project into
the template structure, allowing us to add the template analysis later when
it's ready, and also allowing us to easily reproduce this paper of course
(without having to worry about the host's TeXLive installation).
|
|
Now that it's 2020, it's necessary to include this year in the copyright
statements.
|
|
Until now, the files that people were meant to change didn't have a
proper copyright notice (for example `Copyright (C) YOUR NAME.'). This was
wrong because the license does not convey copyright ownership. So the name
of the file's original author must always be included, also when people
modify it (and add their own copyright-able modifications).
With this commit, the file's original author (and email) are added to the
copyright notice and when more than one person modified a file, both names
have their individual copyright notice.
Based on this, the description for adding a copyright notice in
`README-hacking.md' has also been modified.
|
|
Until now, there was a single `tex/src/references.tex' file that housed the
BibTeX entries for everything (software and non-software).
Since we have started to include the BibTeX entry for more software, it
will be hard to manage the large (sometimes unused) BibTeX entries of the
software in the middle of the non-software related citations in the text of
the paper.
Therefore, with this commit, a `tex/dependencies' directory has been made
which has a separate BibTeX entry file for each software that needs
one. After the software is built, this file is copied to the new
`.local/version-info/cite' directory. At the end, the configure script will
concatenate all the files in this directory into one file which will later
be used with `tex/src/references.tex' by BibLaTeX.
This greatly simplifies the management of citations, allowing us to focus
on the software-building and paper-writing citations separately/cleanly
(and thus be more efficient in both).
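The concatenation step described above can be sketched as follows. This is
a minimal illustration under stated assumptions: the function name
`merge_citations' and the demo file names are hypothetical, and the real
configure script may do this differently.

```shell
# Hypothetical sketch: 'merge_citations' is an illustrative name, not the
# function used by the real configure script.
merge_citations() {
    citedir=$1      # directory with one BibTeX file per built software
    output=$2       # single merged file (must be outside of 'citedir')
    : > "$output"   # start from an empty file
    for f in "$citedir"/*; do
        if [ -f "$f" ]; then cat "$f" >> "$output"; fi
    done
}

# Small demonstration in a temporary directory (made-up entries).
tmp=$(mktemp -d)
mkdir "$tmp/cite"
printf '@article{software-a}\n' > "$tmp/cite/a.tex"
printf '@article{software-b}\n' > "$tmp/cite/b.tex"
merge_citations "$tmp/cite" "$tmp/all.tex"
cat "$tmp/all.tex"
rm -rf "$tmp"
```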
|
|
Some recent corrections that were done by Raul are now merged into the
pipeline. There weren't any conflicts.
|
|
Until now, the Scipy citation was only one paper and not the correct one
(it was the online manual).
With this commit, Scipy is properly cited using the two papers. Also
some modifications in the `tex/src/references.tex' have been done
(removing the last page number).
|
|
Until now, the name and version of all Python packages were indicated in
the final paper, but not their main paper (if one exists).
With this commit, some Python packages (Cython, Matplotlib, Numpy and
Scipy) are now properly acknowledged by citing their source papers.
`mpi4py' is also cited although this package is not yet included in
the pipeline.
|
|
With this commit, we are applying the new style of citing software within
the build rule of Gnuastro.
|
|
After doing a systematic search for files without a copyright notice, a few
more were found that didn't have a notice. So a notice was added for them.
I used this Bash command to find the files:
    for f in $(find ./ -type f); do \
        if [[ $f != *.git* ]]; then \
            n=$(grep -i copyright "$f" | wc -l); \
            echo "$n $f"; \
        fi; \
    done | awk '$1==0'
|
|
Raul Infante-Sainz added the building of Python (along with the Numpy and
Astropy packages) into the pipeline. That work is now being merged into the
main pipeline branch.
There was only this small problem that needed to be fixed: the Python
tarball's name after unpacking is actually `Python-X.X.X' (with a capital
P), not `python-X.X.X'. This has been corrected with this merge.
|
|
Astropy was added and one very important thing is that we have to
use the PyPI tarball (https://pypi.org/) (which is bootstrapped)
and not the GitHub tarball.
|
|
In order to collaborate effectively in the project, even project members
that don't necessarily want (or have the capacity) to do the whole analysis
must be able to contribute to the project. Until now, the users of the
distributed tarball could only modify the text and not the figures (built
with PGFPlots) of the paper.
With this commit, the management of TeX source files in the pipeline was
slightly modified to allow this as cleanly as I could think of now! In
short, the hand-written TeX files are now kept in `tex/src' and for the
pipeline's generated TeX files (in particular the old `tex/pipeline.tex'),
we now have a `tex/pipeline' symbolic-link/directory that points to the
`tex' directory under the build directory.
When packaging the project, `tex/pipeline' will be a full directory with a
copy of all the necessary files. Therefore as far as LaTeX is concerned,
having a build-directory is no longer relevant. Many other small changes
were made to do this job cleanly which will just make this commit message
too long!
Also, the old `tarball' and `zip' targets are now `dist' and `dist-zip' (as
in the standard GNU Build system).
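The two states of `tex/pipeline' described above (a symbolic link while
developing, a full directory when packaging) can be sketched as follows.
This is only an illustration: the function names `link_tex' and
`package_tex' and the demo file are hypothetical, and the pipeline's own
Makefiles may implement this differently.

```shell
# Hypothetical sketch; 'link_tex' and 'package_tex' are illustrative names.

# While developing: 'tex/pipeline' is a link into the build directory.
link_tex() {
    bdir=$1      # build directory (absolute path, so the link resolves)
    srcdir=$2    # top of the project source
    mkdir -p "$bdir/tex" "$srcdir/tex"
    rm -f "$srcdir/tex/pipeline"
    ln -s "$bdir/tex" "$srcdir/tex/pipeline"
}

# When packaging: copy 'tex' while following links ('-L'), so that
# 'tex/pipeline' becomes a real directory with the generated files.
package_tex() {
    srcdir=$1
    distdir=$2   # assumed not to contain a 'tex' directory yet
    mkdir -p "$distdir"
    cp -RL "$srcdir/tex" "$distdir/tex"
}

# Small demonstration in a temporary directory.
tmp=$(mktemp -d)
link_tex "$tmp/build" "$tmp/src"
printf 'generated macros\n' > "$tmp/build/tex/macros.tex"
package_tex "$tmp/src" "$tmp/dist"
ls "$tmp/dist/tex/pipeline"
rm -rf "$tmp"
```

With the packaged copy, LaTeX only sees ordinary files under `tex/', so a
build directory is no longer needed to compile the paper.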
|