|
The AMIGA team at the Instituto de Astrofísica de Andalucía (IAA) are very
active proponents of reproducibility. They had already provided very
constructive comments after my visit there and many subsequent
interactions. So until now, the team's contributions were only acknowledged
collectively.
Since the last submission, several of the team members were able to kindly
invest the time in reading the paper and providing very useful comments
which are now being implemented. As a result, I was able to specifically
thank them in the paper's acknowledgments (Thanks a lot AMIGA!). Below, I
am listing the points in the order that is shown in 'git log -p -1' for
this commit.
- Javier Moldón: "PM is not defined. First appearance in the first page".
Thanks for noticing this Javier, it has been corrected.
- Javier Moldón: "In Section III. PROPOSED CRITERIA FOR LONGEVITY and
Appendix B, you mention the FAIR principles as desirable properties of
research projects and solutions, respectively which is good, but may
bring confusion. Although they are general enough, FAIR principles are
specifically for scientific data, not scientific software. Currently,
there is an initiative promoted by the Research Data Alliance (RDA),
among others, to create FAIR principles adapted to research software, and
it is called FAIR4RS (FAIR for Research Software). More information here:
https://www.rd-alliance.org/groups/fair-4-research-software-fair4rs-wg. In
2020 there was a kick-off meeting to divide the work in 4 WG. There is
some more information in this talk:
https://sorse.github.io/programme/workshops/event-016/. I have been
following the work of WG1, and they are about to finish the first
document describing how to adapt the FAIR principles to software. Even if
all this is still work in progress, I think the paper would benefit from
mentioning the existence of this effort and noticing the differences
between Data and Software FAIR definitions."
Thanks for highlighting this Javier. A footnote has been added for this
(hopefully summarizing it faithfully in one sentence, due to space
limitations).
- Sebastian Luna Valero: "Would it be a good idea to define long-term as a
period of time; for example, 5 years is a lot in the field of computer
science (i.e. in terms of hardware and software aging), but maybe that is
not the case in other domains (e.g. Astronomy)."
Thanks Sebastian, in section 2, we do give longevity of the various
"tools" in rough units of years (this was also a suggestion by a
referee). But of course the discussion there is very generic, so going
into finer detail would probably be too subjective and bore the reader.
- Sebastian Luna Valero: "Why do you use git commit eeff5de instead of git
tags or releases for Maneage? Shown for example in the abstract of the
paper: 'This paper is itself written with Maneage (project commit
eeff5de).'"
Thanks for raising this important point, a sentence has been added to
explain why hashes are objective and immutable for a given history, while
tags can easily be removed or changed, or not cloned/pushed at all.
- Susana Sanchez Exposito: "We think interoperability with other research
projects would be important, do you have any plans to make maneage
interoperable with, for example, the Common Workflow Language (CWL)?".
Thanks a lot for raising this point Susana. Indeed, in the future I
really do hope we can invest enough resources in this. In the discussion,
I had already touched upon research objects as one method for
interoperability, and there was also a discussion of such generic
standards in Appendix A.D.10. But to further clarify this point (given its
importance), I mentioned CWL (and also the even more generic CWFR) in the
discussion.
- Sebastian Luna Valero: "Regarding Apache Taverna, please see:"
https://github.com/apache/incubator-taverna-engine/blob/master/README.md
Thanks a lot for this note Sebastian! I didn't know this! I wrote this
section (and visited their webpage) before their "vote"! It was a
surprise to see that their page had changed. I have modified the
explanation of Taverna to mention that it has been "retired" and to use
the GitHub link instead.
- Sebastian Luna Valero: "Page 21: 'logevity' should be 'longevity'."
Thanks a lot for noticing this! It has been corrected :-).
- Javier Moldón: "There is a nice diagram in Johannes Köster's article on
data processing with snakemake that I find very interesting to show some
key aspects of data workflows: see Fig 1 in
https://www.authorea.com/users/165354/articles/441233-sustainable-data-analysis-with-snakemake "
This is indeed a nice diagram! I tried to cite it, but as of today, this
link is not a complete paper (with no abstract and many empty section
titles). If it was complete, I would certainly have cited it in
Snakemake's discussion.
- Javier Moldón: "Regarding the problem mentioned in the introduction about
PM not precisely identified all software versions, I would like to
mention that with Snakemake, even if the analysis are usually constructed
using other package managers such as conda, or containers, you don't need
to depend on online servers or poorly-documented software versions, as
you can now encapsulate an analysis in a tarball containing all the
software needed. You still have long-term dependency problems (as you
will need to install snakemake itself, and a particular OS), but at least
you can keep the exact software versions for a particular platform."
Thanks for highlighting this Javier. This is indeed better than nothing,
but we have already discussed the dangers of this "black box" approach of
archiving binaries in many contexts, and many package managers have this
feature. So while I really appreciate the point (I didn't know this), to
avoid lengthening the paper, I think it's fine to not mention it there.
|
|
Konrad had kindly gone through the paper and the appendices with very good
feedback that is now being addressed in the paper (thanks a lot Konrad!):
- IPOL now also allows Python code, so the respective parts of the
description of IPOL have been updated. To address the dependency issue, I
also added a sentence noting that only certain dependencies (with certain
versions) are acceptable.
- On Active Papers (AP: which is written by Konrad) corrections were made
based on the following parts of his comments:
- "The fundamental issue with ActivePapers is its platform dependence on
either Java or Python, neither of which is attractive."
- "The one point which is overemphasized, in my opinion, is the necessity
to download large data files if some analysis script refers to it. That
is true in the current implementation (which I consider a research
prototype), but not a fundamental feature of the approach. Implementing
an on-demand download strategy is not particularly complicated, it just
needs to be done, and it wasn't a priority for my own use cases."
- "A historical anecdote: you mention that HDF View requires registering
for download. This is true today, but wasn't when I started
ActivePapers. Otherwise I'd never have built on HDF5. What happened is
that the HDF Group, formerly part of NCSA and thus a public research
infrastructure, was turned into a semi-commercial entity. They have
committed to keeping the core HDF5 library Open Source, but not any of
the tooling around it. Many users have moved away from HDF5 as a
consequence. The larger lesson is that Richard Stallman was right: if
software isn't GPLed, then you never know what will happen to it in the
future."
- On Guix, some further clarification was added to address Konrad's quote
below (with a link to the blog-post mentioned there). In short, I
clarified that the extra step I meant is storing the Guix commit hash
alongside each respective high-level analysis change.
- "I also looked at the discussion of Nix and Guix, which is what I am
mainly using today. It is mostly correct as well, the one exception
being the claim that 'it is up to the user to ensure that their created
environment is recorded properly for reproducibility in the
future'. The environment is *recorded* in all detail,
automatically. What requires some effort is extracting a human-readable
description of that environment. For Guix, I have described how to do
this in a blog post
(https://guix.gnu.org/en/blog/2020/reproducible-computations-with-guix/),
and in less detail in a recent CiSE paper
(https://hal.archives-ouvertes.fr/hal-02877319). There should
definitely be a better user interface for this, but it's no more than a
user interface issue. What is pretty nice in Guix by now is the user
interface for re-creating an environment, using the "guix time-machine"
subcommand."
- The sentence on Software Heritage being based on Git was reworded to fit
this comment of Konrad: "The plural sounds quite optimistic. As far as I
know, SWH is the only archive of its kind, and in view of the enormous
resources and long-time commitments it requires, I don't expect to see a
second one."
- When introducing hashes, Konrad suggested the following useful paper that
shows how they are used in content-based storage:
DOI:10.1109/MCSE.2019.2949441
- On Snakemake, Konrad had the following comment: "[A system call in Python
is] No slower than from bash, or even from any C code. Meaning no slower
than Make. It's the creation of a new process that takes most of the
time." So the point was just shifted to the many quotations necessary for
calling external programs and how it is best suited for a Python-based
project.
In addition, some minor typos that I found during the process have also
been fixed.
|
|
With the submission of the revision (where all the parts relevant to the
points raised by the referees were highlighted in the submitted PDF), it is
no longer necessary to highlight these parts.
If we get another revision request, we can add new '\new' parts for
highlighting.
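A minimal sketch of how such a '\new' highlighting macro could be switched
on and off is shown below. This is not taken from the project: the
definition of '\new' as a one-argument macro, the switch name
'\highlightchanges' and the color are all assumptions for illustration.

```latex
\documentclass{article}
\usepackage{xcolor}

% Hypothetical switch (not a name from the project): comment out the
% next line to disable highlighting in the whole document.
\newcommand{\highlightchanges}{}

% Assumed definition of '\new': color the argument when the switch
% is defined, otherwise print the argument unchanged.
\ifdefined\highlightchanges
  \newcommand{\new}[1]{\textcolor{blue}{#1}}
\else
  \newcommand{\new}[1]{#1}
\fi

\begin{document}
Unchanged text. \new{Text added in response to a referee.}
\end{document}
```

With a switch like this, the '\new{...}' wrappers for a future revision
would only need the single switch line to be commented out once the
highlights are no longer wanted.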
|
|
This commit makes some minor fixes following the hardwired non-numerical
solution to the cross-referencing issue between the main article and the
supplement, such as fixing "lineage like lineage" and missing closing
parentheses.
From Mohammad: while re-basing the commit over the 'master' branch, I also
added Boud's name at the top of the copyright holders of the appendices.
|
|
In preparation for the submission of the revised manuscript, I went through
the full paper and appendices one last time. The second appendix (reviewing
existing reproducible solutions) in particular needed some attention
because some of the tools weren't properly compared with the criteria.
In the paper, I was also able to remove about 30 words, and bring our own
count (which is an over-estimation already) to below 6250.
|
|
I ran a simple Emacs spell check over the main body and the two
appendices. All discovered typos have been fixed.
|
|
With this commit, I have corrected some minor typos in this appendix. In
addition, I also added empty lines to separate subsections and
subsubsections appropriately (5 lines and 1 line, respectively).
|
|
Until now, in the appendices we were simply using '\ref' to refer to
different parts of the published paper. However, when built in
'--supplement' mode, the main body of the paper is a separate PDF and
having links to a separate PDF is not impossible, but far too complicated.
Still, having the links adds to the richness of the text and helps point
readers to specific parts of the paper.
With this commit, there is a LaTeX conditional anywhere in the appendices
that we want to refer the reader to sections/figures in the main body. When
building a separate PDF, the respective section/figure is cited in a
descriptive mode (like "Section discussing longevity of tools"). However,
when the appendices go into the same PDF as the main body, the '\ref's
remain.
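The conditional described above can be sketched as follows. This is only
an illustration under stated assumptions: the helper name '\mainref' and
the label 'sec:longevity' are hypothetical, and it is assumed that
'--supplement' mode makes the TeX macro '\separatesupplement' defined (the
commit only names the variable, not how it is set).

```latex
\documentclass{article}

% Assumption: in '--supplement' mode this macro is defined.
% Uncomment the next line to emulate that mode.
%\newcommand{\separatesupplement}{}

% Hypothetical helper: \mainref{<label>}{<descriptive text>}.
\ifdefined\separatesupplement
  % Separate PDF: no link is possible, so print the description.
  \newcommand{\mainref}[2]{#2}
\else
  % Single PDF: keep the normal numbered cross-reference.
  \newcommand{\mainref}[2]{Section~\ref{#1}}
\fi

\begin{document}
\section{Longevity of existing tools}\label{sec:longevity}
Main-body text.

\appendix
\section{Survey of existing solutions}
As discussed in \mainref{sec:longevity}{the section discussing
longevity of tools}, most tools have a limited lifetime.
\end{document}
```

As with any '\ref', the single-PDF case needs two LaTeX runs before the
section number is resolved.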
|
|
Until now, the build strategy of the paper was to have a single output PDF
that contains either (1) the full paper with the appendices, or (2) only
the main body of the paper with no appendices.
But the editor in chief of CiSE recently recommended publishing the
appendices as a supplement, that is, a separate PDF (on its webpage). So
with this commit, the project can make either (1) a single PDF (containing
both the main body and the appendices) that will be published on arXiv and
will be the default output (this is the same as before), or (2) two PDFs:
one that is only the main body of the paper and another that is only the
appendices.
Since the appendices will be printed as a PDF in any case now, the old
'--no-appendix' option has been replaced by '--supplement'. Also, the
internal shell/TeX variable 'noappendix' has been renamed to
'separatesupplement'.
|
|
As recommended by Lorena Barba (editor in chief of CiSE), we should prepare
the appendices as a separate "Supplement" for the journal. But we also want
them to be appendices within the paper when built for arXiv.
As a first step, with this commit, each appendix has been put in a separate
'tex/src/appendix-*.tex' file and '\input' into the paper. We will then be
able to conditionally include them in the PDF or not.
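A minimal sketch of such a conditional '\input' is given below. The file
name 'appendix-demo.tex' is hypothetical (written to the current directory
by 'filecontents' only so that the example is self-contained), and it is
assumed that the no-appendix mode makes the TeX macro '\noappendix'
defined.

```latex
% Write a small stand-in appendix file so the example compiles as is.
\begin{filecontents}{appendix-demo.tex}
\section{Demo appendix}
Text of the appendix.
\end{filecontents}

\documentclass{article}

% Assumption: defined when the appendices should be left out.
% Uncomment the next line to build only the main body.
%\newcommand{\noappendix}{}

\begin{document}
\section{Main body}
Main text of the paper.

% Include the appendix file only when '\noappendix' is not defined.
\ifdefined\noappendix
\else
  \appendix
  \input{appendix-demo.tex}
\fi
\end{document}
```

Keeping each appendix in its own file means the same source can later be
read into a different top-level document without copying any text.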
Also, as recommended by Lorena, the general "necessity for reproducible
research" appendix isn't included (possibly going into the webpage later).
|