Age | Commit message | Author | Lines |
|
David and Raul had both reported that because 'pdftotext' wasn't available
on their system, the project failed (even though the PDF was built!). So
with this commit, we first check if the system has 'pdftotext' and call it
only if it is available.
Some minor edits were made, building upon Boud's previous commit.
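The availability check described above can be sketched roughly as follows. This is only an illustration, not the code of the commit: the function name `pdf_word_count` and the wording of the notice are assumptions.

```shell
# Hypothetical helper (not from the commit): print the word count of a
# PDF, but only when 'pdftotext' is installed. When it is missing, print
# a notice and carry on instead of failing the whole build.
pdf_word_count () {
    pdf=$1
    if command -v pdftotext > /dev/null 2>&1; then
        # The '-' asks pdftotext to write the text to standard output.
        pdftotext "$pdf" - | wc -w
    else
        echo "pdftotext not found; skipping word count of $pdf" >&2
    fi
}
```

Calling `pdf_word_count paper.pdf` after the PDF is built would then be harmless on systems without 'pdftotext'.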
|
|
This commit provides mostly small changes. There didn't seem much point in
repeating the `lessons learned` jargon and claiming that we draw good
conclusions - insights - from our experience. Better just state what
hypotheses we have generated from the experience rather than give the
misleading impression that our hypotheses are well-established facts. In
the comments, I put a suggested translation of what the `lessons learned`
jargon means. I seem to have first heard this term in the mainstream media
a few years after the US 2003 attack on Iraq, when a US military
representative stated that the US forces had "learned lessons" after having
started a war of aggression against Iraq.
|
|
This commit changes two lines.
(1) Keeping the exact quote with the clerk while having a sentence that
makes sense in plain English cannot be done, it seems to me, without
making the sentence a bit longer. Here's one option that seems about
the best we can do, even though it still sounds a bit funny, because
it's hard to write a future conditional with the present "can". Since
it's a quote, it will probably survive the proofreaders.
(2) Software is an uncountable noun [1], so we say "software is", like
"water is"; "used software" sounds odd; I added "is itself" to
emphasise that we're especially talking about the full chain of
software for running the project. This commit modifies the "When the
..." sentence and hopefully sounds better.
[1] https://en.wiktionary.org/wiki/software#Noun
|
|
To help show the simplicity of 'top-make.mk', it was included as a
listing. I also went over some of Boud's corrections and made small
edits. In particular:
- The '\label' and '\ref' to a section were removed. I did this after
inspecting some of their recent papers and noticing that they generally
have a simple flow, without such redirections.
- In the part about the RDA adoption grant, I moved the "from the
researcher perspective" to the end, because Austin+2017 is mainly
focused on data-center management, not the researcher's perspective.
They do touch upon researcher solutions that can help data-base
managers, but not directly the researchers. In effect with this grant,
they acknowledged that our researcher-focused solution conforms to
their criteria for data-base management.
|
|
Possibly the least trivial edit in this commit is that the previous text
appeared to state that it's normal to find that a project prepared with
`maneage` may be ... unbuildable. Which would defeat our whole claim of
reproducibility! Obviously, `maneage` is still in a rapid development
stage and might still have significant, not-yet-detected bugs. But the
wording has to explain that this would constitute a bug in `Maneage` (in a
particular version of it), not an expected regular event. :) This commit
aims to fix that and other minor wording issues in IV.
Pdf word count 5855.
|
|
This series of commits aims to edit sections II+III, but first implements
the changes from 7bf5fcd, apart from one that conflicts in the abstract:
this commit has ``Maneage'' without `(managing+lineage)` in the abstract.
From Mohammad: this commit has been rebased after several other parallel
branches, so some things may differ from the message.
|
|
Generally they were great, but after looking through them I thought a
handful of them slightly changed my original idea, so I am correcting them
here. Boud, if you feel the changes aren't good, let's talk about it and
find the best way forward ;-).
They are mostly clear from a '--word-diff', just some notes on the ones
that have changed the meaning:
* On the "a clerk can do it" quotation, since it's so short, I think it's
better to keep its original form, otherwise a reader may think there
were paragraphs instead of the "to" and we have changed their
intention.
* In the part where we are saying that the workflow can get "separated"
from the paper, I mostly meant to highlight that the data-centers and
journals (hosts) may diverge over decades, or one of them may go
bankrupt, etc., hence losing the connection. The issue of it
evolving can in theory be addressed through version control, so I think
this is a more fundamental problem.
* In the part about free software, in the list, the original point was
the free software that is used by the project, not the project itself
(after all, the project itself falls under the "Open Science" title
that is very fashionable these days, but my point here is aimed at
those people who claim to do "Open Science" with closed software, like
Microsoft Excel!).
|
|
This commit makes several small changes to Section III, some of
which are quite significant in terms of meaning.
It was difficult to improve the clarity without extending
the word length. Now we're at 5901 words.
|
|
This commit implements quite a few minor changes in section II.
The aim of most is to clarify the meaning and remove ambiguity.
A few changes rely on the fact that the reader will normally assume
that successive sentences in a paragraph are closely related in terms
of logical flow. It is superfluous - and considered excessive -
to put too many "Therefore"s and "Hence"s in (at least) modern
astronomy style. These are supposed to be used when there is a
strong chain of reasoning.
One change is done in the Introduction, because if we're going to
use "solution(s)" throughout to mean "reproducible workflow
solution(s)", then we have to clearly define this as jargon for
this particular paper. It's probably preferable to RWS - reproducible
workflow solution - or RWI - reproducible workflow implementation.
But we can't just keep saying "solution" because that has many
different meanings in a scientific context.
Pdf word count = 5880
|
|
This series of commits aims to edit sections II+III,
but first implements the changes from 7bf5fcd, apart
from one that conflicts in the abstract: this commit
has ``Maneage'' without `(managing+lineage)` in the
abstract.
|
|
This commit implements most of David's changes from c76727b, but excluding
some, such as the proposal to use 'which' in a restrictive clause in the
abstract. This is allowed, but the Fowler brothers' rule tends to be followed
in science writing:
https://ell.stackexchange.com/questions/5/is-there-any-difference-between-which-and-that
A few points on the abstract:
* an immediate solution = singular
* The "immediate, fast short-term" benefits sentence sounded like it was
redundantly superfluously repetitively repeating doubled-up
information. Hopefully this edit is better.
* in the %Conclusion, "solutions" is vague, like people who say
"technology" when they're only talking about software, so this edit
reminds the reader of what is meant, to make the sentence more
self-contained and understandable.
|
|
David reported this problem; it happened right after importing IEEEtran,
but for some reason, it didn't happen for me.
|
|
When entering the name of the "listings" package, I had forgotten to add the
final 's', so it wasn't being installed on a clean system! I didn't have a
problem until now, because it remained from previous builds.
|
|
After a look at the PDFs of the linked papers of the previous commit and a
few 2020 papers, we noticed that the biography formats of the webpage and
the PDFs are different! So it is now back to its old form (which is how
biographies are presented in the PDF).
A few other minor edits were made in the text.
|
|
It appears from looking at
https://ieeexplore.ieee.org/document/5725236/authors#authors
https://ieeexplore.ieee.org/document/7878935/authors#authors
that the affiliations section needs to start with a one-phrase
definition of the author's main affiliation. In 5725236,
the typesetters/proofreaders swapped van der Walt and Colbert,
so don't be confused by that. It shows that nobody proofread
properly.
With this commit, each author's institute (single hierarchical
level) is written as the first paragraph of the author's
affiliation section. Since 5725236 allows a very-well-known
acronym, I'm guessing that IAC can be defined for Mohammad
and then re-used for the others.
I've added a brief CV for me. If necessary, we could compress
my main research together as "observational cosmology", but let's
see how we go in the word count.
I have not (yet) worked through the main text.
There is also one minor language fix - `Because is complete` was
incomplete.
Pdf word count: 5873
|
|
After one day not looking at the first draft of this new version (commit
7b008dfbb9b2), I went through the text and did some general edits to make
its presentation and logic smoother.
|
|
Before this commit, several typos were present throughout the text. With
this commit several typos have been corrected (types listed below) and my
bio has been added.
a) double words
b) general typos
c) commas after adverbs at the beginning of a sentence
d) contractions are removed, e.g., don't vs do not
e) three sentences in parentheses have been removed since I think they
were out of context or unnecessary
f) etc
|
|
In order to correspond to the updated datalineage plot, the name of the
plotted columns was changed to 'columns.txt', but I had forgotten to update
it in the LaTeX source, and since the old file still remained I hadn't
noticed. This was found by Boud and corrected.
|
|
Following the fact that the DSJ editor decided that this paper doesn't fit
into their scope, we decided to submit it to IEEE's Computing in Science
and Engineering (CiSE). So with this commit the text was re-written to fit
into their style and word-count limitations.
|
|
The paper is no longer using LuaLaTeX, but raw LaTeX (that saves a DVI); it
is so much faster! Initially I had used LuaLaTeX to use special fonts to
resemble the CODATA Data Science Journal, but all that overhead is no
longer necessary. Therefore I also removed the MANY extra LaTeX packages we
were importing. The paper builds and is able to construct one of its images
(the git-branching figure) with only 7 packages beyond the minimal
TeX/LaTeX installation.
The text is just temporary now, and mainly just a placeholder. With the
next commit, I'll fill it with proper text.
|
|
As explained in the new `README-hacking.md', this file greatly helps in
avoiding unnecessary conflicts.
|
|
A few small conflicts showed up here and there. They are fixed with this
merge.
|
|
Until this commit, the configure step would fail with an error when
compiling libgit2 on a test system. The origin of this bug, on the OS that
was tested, appears to be that in OpenSSL Version 1.1.1a, openssl/ec.h
fails to include openssl/openconf.h. The bug is described in more detail at
https://savannah.nongnu.org/bugs/index.php?58263
With this commit, this is fixed by manually inserting the necessary
component. In particular, `sed` is used to insert a preprocessor
instruction into `openssl/openconf.h`, defining `DEPRECATED_1_2_0(f)`, for
an arbitrary section of code `f`, to include that code rather than exclude
it or warn about it.
This commit is valid provided that openssl remains at a version earlier
than 1.2.0. Starting at version 1.2.0, deprecation warnings should be run
normally. We have thus moved the version of OpenSSL in `versions.conf' to
the section for programs that need to be manually checked for version
updates with a note to remind the user when reaching that version.
Other packages that use OpenSSL may benefit from this commit, not just
libgit2.
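As a rough sketch of the technique (inserting a preprocessor definition into a header with `sed`), the lines below work on a scratch file rather than on the real OpenSSL header. The file name `demo-conf.h`, its include guard and the insertion point are all assumptions for illustration. Only the macro name is taken from the description above, and the exact command used in the commit may differ.

```shell
# Scratch header standing in for the real one (hypothetical content).
demo_dir=$(mktemp -d)
header="$demo_dir/demo-conf.h"
printf '%s\n' '#ifndef DEMO_CONF_H' '#define DEMO_CONF_H' '#endif' > "$header"

# Append the macro definition right after the include guard's '#define'.
# With this definition, DEPRECATED_1_2_0(f) simply expands to 'f', so the
# wrapped code is included instead of being excluded or warned about.
sed -e '/^#define DEMO_CONF_H$/a\
#define DEPRECATED_1_2_0(f) f' "$header" > "$header.new"
mv "$header.new" "$header"

cat "$header"
```

Writing to a temporary file and moving it into place avoids relying on in-place editing options, which differ between `sed` implementations.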
|
|
The difference between `that` and `which` is not strictly
required, but it helps clarify the difference in meaning,
which is important in science and software :).
This is best shown by an example:
* Maneage provides reproducibility, which is a good thing.
The sentence would make sense if we drop `, which is a good
thing.` The last part of the sentence is a comment rather than a
necessary part of the sentence.
* Maneage provides a quality of reproducibility that is missing
from other implementations.
The sentence would not quite make sense if we drop `that is ...`,
since we would not know what sort of quality is provided. The
fact that the quality is missing is key to the intended meaning
of the sentence.
|
|
It is also slightly shorter with this commit, without losing anything
substantial.
|
|
No need to invent a new word (archive-able) when an existing one
(archivable) does the job.
One issue that we have not included and which perhaps we could discuss in
the paper (space permitting), is that this tool could bypass the use of
blockchains in this context.
|
|
As discussed by Boud in the previous commit, this is an important feature
that was lost in the new abstract. So I added it as a criterion.
|
|
Most are minor English tidying, e.g.
* spelling: achieving
* archivable - https://en.wiktionary.org/wiki/archivable
* `i.e.` does not look good in an abstract;
* `when` didn't sound quite right;
Comment: we no longer state one of the most interesting aspects
of Maneage - producing the draft paper that is submittable for
peer review in a way that makes it natural for the authors to
achieve automatic consistency between the calculations/analysis
and the values in the paper. But this is hard to describe in a
compact way without disrupting the overall argument of the
abstract, so it's a bit of a pity, but people will learn about it
anyway from the body of the article (or from trying out the
package!) `Peer-review verification` does not directly state
producing a pdf.
Related to this absence of talking about reproducing the *paper*,
not just the calculations, I suggest dropping `, with snapshot
\projectversion` from the abstract initially sent to the journal
(they can't stop us updating it afterwards), because without the
context of explaining that the paper itself is produced from the
package, it's not clear what the snapshot means - a snapshot of
the abstract? In the `real` paper, it makes sense, because the
reader will have access to the rest of the paper.
|
|
Boud's suggestions in the previous commit were great and really helped in
improving the tone of the abstract (and thus the whole paper shortly!),
better putting it in the big picture. I had forgotten to give the exact word
limit (which was 250), so Boud had set it to a very conservative value of
190. I added around 22 words to better highlight the points we want to
make, while still being below the limit.
|
|
To make this a research article, we either have to present it as a
theoretical advance, or as an empirical advance.
An empirical research result would be something like doing a survey of
users and getting statistics of their success/failure in using the system,
and of whether their experience is consistent with the claimed properties
and principles of Maneage (e.g. success/failure in creating paper.pdf as
expected? was the user's system POSIX? did the user do the install with
non-root privileges? was this a with-network or without-network ./project
configure ?) This is doable, but would require a bit of extra work that we
are not necessarily motivated to do or have the time to do right now.
I think it's possible to present Maneage as a theoretical advance, but it
has to be worded properly. Maneage is a tool, but it's a tool that
satisfies what we can reasonably present as a unique theoretical proposal.
Here's my proposed rewrite. I've aimed at minimum word length. I've also
included (commented out) keywords for a structured research abstract -
these are just for us, as a guideline to improve the abstract.
I think "criteria" is safer than "standards". Whether a principle is good
or bad tends to lead to debate. Whether a criterion is satisfied or not is
a more objective question, independent of whether you agree with the
criterion or not.
In the rewrite below, we propose a theoretical standard and show that the
new standard can be satisfied. Maneage is *used as a tool* to prove that
the standard is not too difficult to achieve. Maneage is no longer the
subject of the paper. (That won't change the main body of the paper too
much, apart from compression, but the way it's presented will have to
change, under this proposal.)
The title would need to match this. E.g.
TITLE.1: Evidence that a higher standard of reproducibility criteria is
attainable
TITLE.2: Evidence that a rigorous standard of reproducibility criteria is
attainable
TITLE.3: Towards a more rigorous standard of reproducibility criteria
I would probably go for TITLE.3.
|
|
This abstract is a first step in order to put more focus on the research
aspects of Maneage.
|
|
Given the very strict limits of journals, we needed to remove these
sections and images. The removed images are: the
`figure-file-architecture', `figure-src-topmake' and
`figure-src-inputconf'. In total, with `wc' we now have 9019 words.
This will be further reduced when we remove all the technical parts of the
Maneage section; in short, we will only describe the generalities, not any
specific details.
|
|
David suggested some interesting references, in particular about the
problems with Jupyter notebooks, that are now added to the long version of
the paper. We'll later decide if/how they can be used.
|
|
Until now, if GCC couldn't be built for any reason, Maneage would crash and
the user had no way forward. Since GCC is complicated, build failures can
happen, and it is frustrating to wait until the bug is fixed. Also, while
debugging Maneage, when we know GCC has no problem, its long build time
discourages testing.
With this commit, we have re-activated the `--host-cc' option. It was
already defined in the options of `./project', but its effect was nullified
by hard-coding it to zero in the configure script on GNU/Linux systems. So
with this commit that hard-coding has been removed and the user can also
use their own C compiler on a GNU/Linux operating system.
Furthermore, to inform the user about this option and its usefulness, when
GCC fails to build, a clear warning message is printed, instructing the
user to post the problem as a bug and telling them how to continue building
the project with the `--host-cc' option.
|
|
Until now, at the end of the configuration step, we would tell the user
this: "To change the configuration later, please re-run './project
configure', DO NOT manually edit the relevant files". However, as Boud
suggested in Bug #58243, this is against our principle to encourage users
to modify Maneage.
With this commit, that explanation has been expanded by a few sentences to
tell the users what to change and warn them in case they decide to change
the build-directory.
|
|
Until now Gnuastro and Astropy were installed by default in any clean
build of Maneage. Gnuastro is used to do the demonstration analysis that is
reported in the paper and Astropy was just there to help in testing the
building of the MANY tools it depends on! It (and its dependencies) also
had several papers that helped show software citation.
However, as Boud suggested in task #15619, the burden of installing them
for a new user may be too much and any future changes will cause merge
conflicts. It may also give the impression that Maneage is only/mainly
written for astronomers.
So with this commit, I am removing Astropy as a default target. But we can
only remove Gnuastro after we include an alternative analysis in the
demonstration `delete-me' files. Following Boud's suggestion in that task,
`TARGETS.conf' was also added to the files to be ignored in any future
merge (in the checklist of `README-hacking.md'). The solution was already
described there, but mainly focused on the deleted `delete-me' files. So
with this commit, I brought out this item as a more prominent item in the
list. Maybe we can later add the analysis done in the Maneage paper (not
yet published).
In terms of testing the software builds, we already have task #15272
(Single target to build all high-level software, for testing) that aims to
have a single configure option to install ALL high-level software and we
can ask people to try if they like and report errors.
|
|
Similar to the previous commit (e43e3291483699), following a change made
yesterday in the identification of software names from their tarballs, a
few other problematic names are corrected with this commit: `apr-util',
HDF5, TeX Live's installation tarball and `rpcsvc-proto'.
Even though we have visually checked the list of software, other
unidentified similar cases may remain and will be fixed when found in
practice.
|
|
Until Commit 3409a54 (from yesterday), pkg-config was found correctly in
`reproduce/software/make/basic.mk` by searching for `pkg`. However, commit
a21ea20 made an improvement in the regular expression for relating package
names and download filenames, and the string `pkg-config` with the new
regex no longer simplifies to `pkg`. The result of this was that the
basic.mk could not find `pkg-config` in the list of packages, since it was
still listed as `pkg`. This blocked downloading for a system without
pkg-config preloaded.
With this commit (of just a few bytes), the bug is fixed.
|
|
Until now, we wouldn't explicitly check for GNU gettext. If it was present
on the system, we would just add a link to it in Maneage's installation
directory. However, in bug #58248, Boud noticed that Git (a basic software)
actually needs it to complete its installation. Unfortunately we haven't
had the time to include a build of Gettext in Maneage: because it is
available on most systems, the problem hasn't been reported often, and it
also has many dependencies, which make it a little time consuming to install.
So with this commit, we actually check for GNU gettext right after checking
the compiler, and if it's not available an informative error message is
printed to inform the user of the problem, along with suggestions on fixing
it (how to install GNU gettext from their package manager).
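A check of this kind could look roughly like the sketch below. The function name `require_program` and the message wording are assumptions, and the actual configure script may test for gettext differently.

```shell
# Hypothetical helper (not from the commit): stop early with an
# informative message when a required program is missing, instead of
# failing later in the middle of a build.
require_program () {
    prog=$1
    hint=$2
    if ! command -v "$prog" > /dev/null 2>&1; then
        echo "Error: '$prog' was not found in PATH." >&2
        echo "$hint" >&2
        return 1
    fi
    echo "'$prog' found: $(command -v "$prog")"
}
```

It could then be called as `require_program gettext "Please install GNU gettext with your package manager and re-run the configuration." || exit 1`.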
|
|
They supported my visit and talk on Maneage at the Barcelona
Supercomputing Center. They have also offered to read the paper and are
providing comments.
Also, I noticed that in the author list, we had forgotten to put an `,'
after Boud's name. That is also corrected here.
|
|
Until now, the sed script for determining URL download rules in the three
software building Makefiles (`basic.mk', `high-level.mk' and `python.mk')
considered package names such as `fftw-3...` and `fftw2-2.1...` to be
identical. As this example shows, this would make it hard to include
some software that may have conflicting non-number names.
With this commit, the sed script that is used to separate the version from
the tarball name only matches numbers that are after a dash (`-').
Therefore it considers `fftw-3...` and `fftw-2...` to be identical, but
`fftw-3-...` and `fftw2-2.1...` to be different. As a result of this
change, the `elif' check for some of the other programs like `m4', or
`help2man' was also corrected in all three Makefiles.
While doing this check on all the software, we noticed that `zlib-version'
was repeated twice in `version.conf', so one copy was removed. It caused
no complications, because both were the same number, but could lead to bugs
later.
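The dash-anchored matching described above can be illustrated with a minimal sketch. The function name `software_name`, the regular expression and the full tarball names are assumptions chosen for illustration; the actual sed script in the Makefiles may be written differently.

```shell
# Hypothetical helper (not the Makefiles' actual script): reduce a
# tarball name to the software's name by removing everything from the
# first dash that is immediately followed by a digit.
software_name () {
    echo "$1" | sed -e 's/-[0-9].*$//'
}

# Digits inside the name (not right after a dash) are left alone.
software_name fftw-3.3.8.tar.gz          # prints: fftw
software_name fftw2-2.1.5.tar.gz         # prints: fftw2
software_name help2man-1.47.11.tar.xz    # prints: help2man
software_name pkg-config-0.29.2.tar.gz   # prints: pkg-config
```

Because the match starts only at a dash followed by a digit, a name like `pkg-config` keeps its internal dash, and `fftw2` stays distinct from `fftw`.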
|
|
Recently (since Commit 7d0c5ef77), the preparation is not run automatically
every time. It is only run automatically the first time and afterwards
needs to be manually called with the `--prepare-redo' option. But this
wasn't explained in `README-hacking.md' (currently the main documentation
of Maneage).
With this commit, a description of how to invoke the preparation process
after the first run of the project has been added to
`README-hacking.md'.
|
|
Recently (in Commit 8eb0892e) the Gnuastro configuration files moved under
the "reproduce/analysis/config/gnuastro" directory (before that they were
in `reproduce/software/config/gnuastro'). But this hadn't been reflected in
the variable that defines this directory in `initialize.mk'.
With this commit, the address of the Gnuastro configuration files directory
is corrected, allowing Gnuastro programs to operate properly when they are
used.
|
|
Until now, the comment in the file said that setting the `verify-outputs`
variable to `yes` disables the verification. Looking at
`reproduce/analysis/make/verify.mk` shows that the opposite is true.
With this commit, the word `disable` is replaced with `enable` so that the
user is not confused by the conflict between the source code in the other
file and this comment.
|
|
Until now we only checked for the existence and write-ability of the build
directory. But we recently discovered that if the specified build-directory
is in a non-POSIX compatible partition (for example NTFS), permissions
can't be modified and this can cause crashes in some programs (in
particular, while building Perl, see [1]). The thing that makes this
problem hard to identify is that on such partitions, `chmod' will still
return 0.
With this commit, a check has been added after the user specifies the
build-directory. If the proposed build directory is not able to handle
permissions as expected, the configure script will not continue and will
let the user know and will ask them for another directory.
Also, the two printed characters at the start of error messages were
changed to `**' (instead of `--'). When everything is good, we'll use `--'
to tell the user that their given directory will be used as the build
directory. And since there are multiple checks now, the final message to
specify a new build directory is now moved to the end and not repeated in
every check.
[1] https://savannah.nongnu.org/support/?110220
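Since `chmod' returning 0 cannot be trusted on such partitions, one way to test a directory is to change a file's permissions and then look at whether they actually changed. The sketch below does this under stated assumptions: the function name `dir_handles_permissions` and the use of `ls -l' for the comparison are illustrative and not necessarily how the configure script does it.

```shell
# Hypothetical helper (not from the commit): return 0 only if permission
# changes really take effect inside directory "$1".
dir_handles_permissions () {
    dir=$1
    testfile="$dir/.perm-check-$$"
    touch "$testfile" 2> /dev/null || return 1    # Not writable at all.

    # Set two different modes and record the permission string each time.
    chmod 600 "$testfile"
    before=$(ls -l "$testfile" | cut -c1-10)
    chmod 700 "$testfile"
    after=$(ls -l "$testfile" | cut -c1-10)
    rm -f "$testfile"

    # Where chmod is silently ignored, both strings are identical.
    [ "$before" != "$after" ]
}
```

A caller could then refuse the proposed directory with something like `dir_handles_permissions "$bdir" || echo "** $bdir cannot handle permissions"`, where `bdir` is a placeholder variable name.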
|
|
Until now, we were using GitLab as the main Git repository of Maneage. But
today I finally set up our own Git repository under `git.maneage.org' and
enabled a CGit web interface for simple and fast viewing of the commits
and changes.
Since this URL is under our own control, we can always ensure that it will
point to somewhere meaningful, on any server, so in the long run it's much
better than publishing the paper with an explicit reliance on `gitlab.com'.
|
|
Until now, the primary Maneage URLs were under GitLab, but since we now
have a dedicated URL and Git repository, it's better to transfer to this as
soon as possible. Therefore with this commit, throughout Maneage, any place
where Maneage was referenced through GitLab has been corrected.
Please correct your project's remote to point to the new repository at
`git.maneage.org/project.git', and please make sure it follows the
`maneage' branch. There is no more `master' branch on Maneage.
|
|
|
|
Reading over Boud's edits, I noticed a few other parts that I could
summarize more and corrected one or two other parts to fit the original
purpose of the sentence better.
|
|
Reduction by about 5 words.
Although it's true that the low-level tools - make, bash, gcc -
are still being actively developed, only expert users will tend
to notice the differences, and in this context, it's probably
more useful to point out that these are actively *maintained*.
(Comment: I felt that the first sentence in the Conclusion is
missing one of the obvious criteria for handling big data -
citizen control so that big data could hopefully become less
Orwellian than it is right now, with GAFAM having the main big
data databases that are used by AI researchers and will tend to
affect people's lives more than traditional "scientific"
databases. But there's no point adding this here, since the
criteria that tend to satisfy the scientific requirements
("principles") and citizens' rights tend to overlap to a fair
degree...)
|