author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-04-09 03:58:46 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-04-09 03:58:46 +0100
commit e8de7ed202ef4e944631cc5ff0246d9be64d4afc
tree 026bb532671540f5890188e489c9ec480c14ab70
parent a63900bc5a83052081e6ca6bcc0a2bb4ee5a860e
Comments by IAA's AMIGA team implemented
The AMIGA team at the Instituto de Astrofísica de Andalucía (IAA) are very active proponents of reproducibility. They had already provided very constructive comments after my visit there and in many subsequent interactions, so until now the whole team's contributions were acknowledged. Since the last submission, several of the team members kindly invested the time to read the paper and provide very useful comments, which are now being implemented. As a result, I was able to specifically thank them in the paper's acknowledgments (Thanks a lot AMIGA!).

Below, I am listing the points in the order that is shown in 'git log -p -1' for this commit.

- Javier Moldón: "PM is not defined. First appearance in the first page".

  Thanks for noticing this Javier, it has been corrected.

- Javier Moldón: "In Section III. PROPOSED CRITERIA FOR LONGEVITY and Appendix B, you mention the FAIR principles as desirable properties of research projects and solutions, respectively which is good, but may bring confusion. Although they are general enough, FAIR principles are specifically for scientific data, not scientific software. Currently, there is an initiative promoted by the Research Data Alliance (RDA), among others, to create FAIR principles adapted to research software, and it is called FAIR4RS (FAIR for Research Software). More information here: https://www.rd-alliance.org/groups/fair-4-research-software-fair4rs-wg. In 2020 there was a kick-off meeting to divide the work in 4 WG. There is some more information in this talk: https://sorse.github.io/programme/workshops/event-016/. I have been following the work of WG1, and they are about the finish the first document describing how to adapt the FAIR principles to software. Even if all this is still work in progress, I think the paper would benefit from mentioning the existence of this effort and noticing the diferences between Data and Software FAIR definitions."

  Thanks for highlighting this Javier, a footnote has been added for this (hopefully summarizing it faithfully in one sentence, due to space limitations).

- Sebastian Luna Valero: "Would it be a good idea to define long-term as a period of time; for example, 5 years is a lot in the field of computer science (i.e. in terms of hardware and software aging), but maybe that is not the case in other domains (e.g. Astronomy)."

  Thanks Sebastian, in section 2 we do give the longevity of the various "tools" in rough units of years (this was also a suggestion by a referee). But the discussion there is very generic, so going into finer detail would probably be too subjective and bore the reader.

- Sebastian Luna Valero: "Why do you use git commit eeff5de instead of git tags or releases for Maneage? Shown for example in the abstract of the paper: 'This paper is itself written with Maneage (project commit eeff5de).'"

  Thanks for raising this important point, a sentence has been added to explain that hashes are objective and immutable for a given history, while tags can easily be removed or changed, or not cloned/pushed at all.

- Susana Sanchez Exposito: "We think interoperability with other research projects would be important, do you have any plans to make maneage interoperable with, for example, the Common Workflow Language (CWL)?".

  Thanks a lot for raising this point Susana. Indeed, I really do hope we can invest enough resources in this in the future. In the discussion, I had already touched upon research objects as one method for interoperability, and there was also a discussion of such generic standards in Appendix A.D.10. But to further clarify this point (given its importance), I mentioned CWL (and also the even more generic CWFR) in the discussion.

- Sebastian Luna Valero: "Regarding Apache Taverna, please see:" https://github.com/apache/incubator-taverna-engine/blob/master/README.md

  Thanks a lot for this note Sebastian! I didn't know this! I wrote this section (and visited their webpage) before their "vote"! It was a surprise to see that their page had changed. I have modified the explanation of Taverna to mention that it has been "retired" and to use the GitHub link instead.

- Sebastian Luna Valero: "Page 21: 'logevity' should be 'longevity'."

  Thanks a lot for noticing this! It has been corrected :-).

- Javier Moldón: "There is a nice diagram in Johannes Köster's article on data processing with snakemake that I find very interesting to show some key aspects of data workflows: see Fig 1 in https://www.authorea.com/users/165354/articles/441233-sustainable-data-analysis-with-snakemake "

  This is indeed a nice diagram! I tried to cite it, but as of today this link is not a complete paper (it has no abstract and many empty section titles). If it were complete, I would certainly have cited it in Snakemake's discussion.

- Javier Moldón: "Regarding the problem mentioned in the introduction about PM not precisely identified all software versions, I would like to mention that with Snakemake, even if the analysis are usually constructed using other package managers such as conda, or containers, you don't need to depend on online servers or poorly-documented software versions, as you can now encapsulate an analysis in a tarball containing all the software needed. You still have long-term dependency problems (as you will need to install snakemake itself, and a particular OS), but at least you can keep the exact software versions for a particular platform."

  Thanks for highlighting this Javier. This is indeed better than nothing, but we have already discussed the dangers of this "black box" approach of archiving binaries in many contexts, and many package managers offer it. So while I really appreciate the point (I didn't know this), to avoid lengthening the paper, I think it's fine not to mention it in the paper.
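
The hash-versus-tag point in this message can be illustrated with a minimal sketch. It assumes Git's default SHA-1 object format and uses a plain dictionary as a toy stand-in for tag storage; the names `git_blob_hash`, `tags`, `v1` and `v2` are hypothetical and are not part of Maneage or Git. It shows that an object's hash is computed from its content, while a tag is only a name that can later be pointed at something else.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob holding `content`.

    Git hashes the header "blob <size>\\0" followed by the raw bytes,
    so the identifier is a pure function of the content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two versions of a (hypothetical) analysis file give two different hashes.
v1 = git_blob_hash(b"analysis, first version\n")
v2 = git_blob_hash(b"analysis, second version\n")

# Toy model of tags: a mutable mapping from a name to a hash.
tags = {"v1.0": v1}
tags["v1.0"] = v2  # the tag now names different content; v1 itself is unchanged

print("hash of first version :", v1)
print("tag v1.0 now points to:", tags["v1.0"])
```

Commit objects are hashed in a similar way, and their content includes the tree and parent hashes, which is why a commit hash pins down a whole history while a tag name alone does not.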
-rw-r--r-- paper.tex | 27
-rw-r--r-- tex/src/appendix-existing-solutions.tex | 14
-rw-r--r-- tex/src/appendix-existing-tools.tex | 2
3 files changed, 29 insertions, 14 deletions
diff --git a/paper.tex b/paper.tex
index d6ea107..115f5f7 100644
--- a/paper.tex
+++ b/paper.tex
@@ -170,7 +170,7 @@ Because of this, in October 2020 Docker Hub (where many workflows are archived)
Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel.
Hence older Docker images may not be executable (their longevity is determined by the host kernel, typically a decade).
-Once the host OS is ready, PMs are used to install the software or environment.
+Once the host OS is ready, package managers (PMs) are used to install the software or environment.
Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
The former has the same longevity as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation with a typical longevity of a few years.
Nix and GNU Guix produce bit-wise identical programs with considerably better longevity; that of their supported CPU architectures.
@@ -206,9 +206,9 @@ Notebooks can therefore rarely deliver their promised potential \cite{rule18} an
\section{Proposed criteria for longevity}
\label{criteria}
The main premise here is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it just before publication \cite{austin17,fineberg19}.
-In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
+In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles\footnote{FAIR originally targeted data, work is ongoing to adopt it for software through initiatives like FAIR4RS (FAIR for Research Software).}).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically unfeasible or unscalable.
-We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
+We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers.
\textbf{Criterion 1: Completeness.}
A project that is complete (self-contained) has the following properties.
@@ -449,8 +449,9 @@ The core Maneage git repository is hosted at \href{http://git.maneage.org/projec
Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
There is a thoroughly elaborated customization checklist in \inlinecode{README-hacking.md}.
-The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown in the acknowledgments).
-These macros are created in \inlinecode{initialize.mk}, with other basic information from the running system like the CPU architecture, byte order or address sizes (shown in the acknowledgments).
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here in the abstract and acknowledgments), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
+These macros are created in \inlinecode{initialize.mk}, with other basic information from the running system like the CPU details (shown in the acknowledgments).
+As opposed to Git ``tag''s, the hash is a core concept in the Git paradigm and is immutable for a given history, it is therefore the recommended timestamp.
Figure \ref{fig:branching} shows how projects can re-import Maneage at a later time (technically: \emph{merge}), thus improving their low-level infrastructure: in (a) authors do the merge during an ongoing project;
in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
@@ -543,10 +544,16 @@ The completeness criterion implies that algorithms and data selection can be inc
Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis+narrative.
Parsers can be written over projects for meta-research and provenance studies, e.g., to generate Research Objects
\ifdefined\separatesupplement
-(see the supplement appendix).
+(see supplement appendix B).
\else
(see Appendix \ref{appendix:researchobject}).
\fi
+or allow interoperability with Common Workflow Language (CWL) or higher-level concepts like Canonical Workflow Framework for Research, or CWFR
+\ifdefined\separatesupplement
+(see supplement appendix A).
+\else
+(see Appendix \ref{appendix:genericworkflows}).
+\fi
Likewise, when a bug is found in one science software, affected projects can be detected and the scale of the effect can be measured.
Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
@@ -581,17 +588,21 @@ Konrad Hinsen,
Marios Karouzos,
Johan Knapen,
Tamara Kovazh,
+Sebastian Luna Valero,
Terry Mahoney,
+Javier Mold\'on,
Ryan O'Connor,
Mervyn O'Luing,
Simon Portegies Zwart,
+Susana Sanchez Exposito,
Idafen Santana-P\'erez,
Elham Saremi,
Yahya Sefidbakht,
Zahra Sharbaf,
Nadia Tonello,
-Ignacio Trujillo and
-the AMIGA team at the Instituto de Astrof\'isica de Andaluc\'ia for their useful help, suggestions, and feedback on Maneage and this paper.
+Ignacio Trujillo
+and Lourdes Verdes-Montenegro
+for their useful help, suggestions, and feedback on Maneage and this paper.
The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) provided many points that greatly helped to clarify this paper.
This project (commit \inlinecode{\projectversion}) is maintained in Maneage (\emph{Man}aging data lin\emph{eage}).
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 5166703..2396b1b 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -101,10 +101,12 @@ Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see A
-\subsection{Apache Taverna (2003)}
+\subsection{Taverna (2003)}
\label{appendix:taverna}
-Apache Taverna\footnote{\inlinecode{\url{https://taverna.incubator.apache.org}}} \citeappendix{oinn04} is a workflow management system written in Java with a graphical user interface which is still being used and developed.
-A workflow is defined as a directed graph, where nodes are called ``processors''.
+Taverna\footnote{\inlinecode{\url{https://github.com/taverna}}} \citeappendix{oinn04} was a workflow management system written in Java with a graphical user interface.
+In 2014 it was sponsored by the Apache Incubator project and called ``Apache Taverna'', but its developers \href{https://lists.apache.org/thread.html/r559e0dd047103414fbf48a6ce1bac2e17e67504c546300f2751c067c\%40\%3Cdev.taverna.apache.org\%3E}{voted} to \emph{retire} it in 2020 because development has come to a standstill (as of April 2021, latest public Github commit was in 2016).
+
+In Taverna, a workflow is defined as a directed graph, where nodes are called ``processors''.
Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task).
Other components of the workflow are ``Data links'' and ``Coordination constraints''.
The main user interface is graphical, where users move processors in the given space and define links between their inputs and outputs (manually constructing a lineage, as in the
@@ -179,7 +181,7 @@ the lineage figure shown in the main paper).
Figure \ref{fig:datalineage}).
\fi
Each actor is connected to others through Ptolemy II\footnote{\inlinecode{\url{https://ptolemy.berkeley.edu}}} \citeappendix{eker03}.
-In many aspects, the usage of Kepler and its issues for long-term reproducibility is like Apache Taverna (see Section \ref{appendix:taverna}).
+In many aspects, the usage of Kepler and its issues for long-term reproducibility is like Taverna (see Section \ref{appendix:taverna}).
@@ -334,7 +336,7 @@ the first figure in the main body of the paper,
Figure \ref{fig:datalineage},
\fi
where you can click on the given Zenodo link and be taken to the raw data that created the plot.
-However, instead of a long and hard to read hash, we simply point to the plotted file's source as a Zenodo DOI (which has long term funding for logevity).
+However, instead of a long and hard to read hash, we simply point to the plotted file's source as a Zenodo DOI (which has long term funding for longevity).
Unfortunately, most parts of the web page are not complete as of January 2021.
The VCR web page contains an example PDF\footnote{\inlinecode{\url{http://vcr.stanford.edu/paper.pdf}}} that is generated with this system, but the linked VCR repository\footnote{\inlinecode{\url{http://vcr-stat.stanford.edu}}} did not exist (again, as of January 2021).
@@ -386,7 +388,7 @@ It just captures the environment, it does not store \emph{how} that environment
The Research object\footnote{\inlinecode{\url{http://www.researchobject.org}}} is collection of meta-data ontologies, to describe aggregation of resources, or workflows, see \citeappendix{bechhofer13} and \citeappendix{belhajjame15}.
It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.
-Ref.\/~\citeappendix{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
+Ref.\/~\citeappendix{bechhofer13} describes how a workflow in Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
The important thing is that the research object concept is not specific to any special workflow, it is just a metadata bundle/standard which is only as robust in reproducing the result as the running workflow.
Therefore if implemented over a complete workflow like Maneage, it can be very useful in analysing/optimizing the workflow, finding common components between many Maneage'd workflows, or translating to other complete workflows.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index 0c9a1c2..a773322 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -441,8 +441,10 @@ GWL has two high-level concepts called ``processes'' and ``workflows'' where the
Nextflow\footnote{\inlinecode{\url{https://www.nextflow.io}}} \citeappendix{tommaso17} workflow language with a command-line interface that is written in Java.
\subsubsection{Generic workflow specifications (CWL and WDL)}
+\label{appendix:genericworkflows}
Due to the variety of custom workflows used in existing reproducibility solution (like those of Appendix \ref{appendix:existingsolutions}), some attempts have been made to define common workflow standards like the Common workflow language (CWL\footnote{\inlinecode{\url{https://www.commonwl.org}}}, with roots in Make, formatted in YAML or JSON) and Workflow Description Language (WDL\footnote{\inlinecode{\url{https://openwdl.org}}}, formatted in JSON).
These are primarily specifications/standards rather than software.
+At an even higher level solutions like Canonical Workflow Frameworks for Research (CWFR) are being proposed\footnote{\inlinecode{\href{https://codata.org/wp-content/uploads/2021/01/CWFR-position-paper-v3.pdf}{https://codata.org/wp-content/uploads/2021/01/}}\\\inlinecode{\href{https://codata.org/wp-content/uploads/2021/01/CWFR-position-paper-v3.pdf}{CWFR-position-paper-v3.pdf}}}.
With these standards, ideally, translators can be written between the various workflow systems to make them more interoperable.
In conclusion, shell scripts and Make are very common and extensively used by users of Unix-based OSs (which are most commonly used for computations).