path: root/paper.tex
author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-06-03 05:53:44 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-06-03 05:53:44 +0100
commit 80e1cb81ac9a020756d82dfaa7007c4146aab64c
tree 3f944ca9e48ddd4fe4c483c0347da76b172f57ca /paper.tex
parent 9363210b37d6399acdc1d990cb9826e64c38ef5a
Adding point on small-ness of final product, some summarization
I noticed that we hadn't included the publication of the workflow and the advantage that Maneage provides in this regard, so it was added at the end of the proof-of-concept section. However, it was necessary to summarize some other parts so as not to increase the word count.
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 148
1 files changed, 65 insertions, 83 deletions
diff --git a/paper.tex b/paper.tex
index 801b380..685849f 100644
--- a/paper.tex
+++ b/paper.tex
@@ -60,17 +60,16 @@
% in the abstract or keywords.
\begin{abstract}
%% CONTEXT
- Reproducible workflow solutions commonly use high-level technologies that were popular when they were created, providing an immediate solution which is however unlikely to be sustainable in the long term.
+ Reproducible workflow solutions commonly use high-level technologies that were popular when they were created, providing an immediate solution which is unlikely to be sustainable in the long term.
%% AIM
- We therefore introduce a set of criteria to address this problem and demonstrate their practicality and implementation.
+ We therefore introduce a set of criteria to address this problem and demonstrate their practicality and implementation.
%% METHOD
- The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; minimal
- complexity; scalability; verifiable inputs and outputs; temporal provenance; linking analysis with narrative; and free-and-open-source software. These criteria are not limited to long-term reproducibility, but also provide immediate benefits for short-term reproducibility.
+ The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; temporal provenance; linking analysis with narrative; and free-and-open-source software.
%% RESULTS
They are implemented in a tool, called ``Maneage'', which stores the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
%% CONCLUSION
- We show that requiring longevity of a reproducible workflow solution is realistic, and
- discuss the benefits of these criteria for scientific progress.
+ We show that requiring longevity of a reproducible workflow solution is realistic, and discuss the benefits of the criteria for scientific progress, as well as their immediate benefits for short-term reproducibility.
+ This paper has itself been written in Maneage, with snapshot \projectversion.
\end{abstract}
% Note that keywords are not normally used for peerreview papers.
@@ -103,13 +102,9 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, to Java in the 2000s and mostly shifting to Python during the last decade.
-However, these technologies develop very fast, e.g., Python 2 code often cannot run with Python 3,
- % interrupting many projects in the last decade.
-% DVG: I would refrain from saying this unless we can cite examples which have shown that going from 2 to 3 has prevented them.
-and the cost of staying up to date within this rapidly-evolving landscape is high.
+However, these technologies develop very fast, and the cost of staying up to date within this rapidly-evolving landscape is high; e.g., Python 2 code often cannot run with Python 3.
Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations.
-Decades later, scientists are still held accountable for their results and therefore
- the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons which are too hands-on to be published in a traditional scientific paper.
+Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
@@ -118,19 +113,18 @@ Decades later, scientists are still held accountable for their results and there
\section{Commonly used tools and their longevity}
Longevity is as important in science as in some fields of industry, but this is not always the case, e.g., fast-evolving tools can be appropriate in short-term commercial projects.
To highlight the necessity of longevity, some of the most commonly-used tools are reviewed here from this perspective.
-A common set of third-party tools that are used by most solutions can be categorized as:
+The third-party tools that are commonly used can be categorized as:
(1) environment isolators -- virtual machines (VMs) or containers;
(2) package managers (PMs) -- Conda, Nix, or Spack;
(3) job management -- shell scripts, Make, SCons, or CGAT-core;
(4) notebooks -- such as Jupyter.
To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
-However, since containers (in particular, Docker, and to a lesser degree, Singularity) are by far the most widely-used solution today, we will focus on Docker here.
+However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution; we will thus focus on Docker here.
-Ideally, it is possible to precisely identify the images that are imported into a Docker container by their checksums,
-but that is rarely practiced in most solutions that we have surveyed.
+Ideally, it is possible to precisely identify the Docker ``images'' that are imported by their checksums, but that is rarely practiced in most solutions that we have surveyed.
Usually, images are imported with generic operating system (OS) names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
-The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly with different software versions and only archives the most recent five images.
+The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly and only the most recent five are archived.
Hence, if the Dockerfile is run in different months, it will contain different core OS components.
In the year 2024, when long-term support for this version of Ubuntu will expire, the image will be unavailable at the expected URL.
This is entirely similar in other OSes: pre-built binary files are large and expensive to maintain and archive.
@@ -138,28 +132,27 @@ Furthermore, Docker requires root permissions, and only supports recent (``long-
Once the host OS is ready, PMs are used to install the software, or environment.
Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software is built with generic PMs.
-The former suffers from the same longevity problem as the OS, while
-some of the latter (like Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
-Nix and GNU Guix produce bit-wise identical programs, but they need root permissions.
+The former suffers from the same longevity problem as the OS, while some of the latter (like Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
+Nix and GNU Guix produce bit-wise identical programs, but they need root permissions and are primarily targeted at the Linux kernel.
Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although this could be implemented).
Therefore, unless precise version identifiers of \emph{every software package} are stored, a PM will use the most recent version.
Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity.
With the software environment built, job management is the next component of a workflow.
-Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors of the project.
+Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails (mostly introduced in the 2000s and using Java) encourage modularity and robust job management, but the more recent tools (mostly in Python) leave this to the authors of the project.
Designing a modular project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).
-Finally, to add narrative, computational notebooks \cite{rule18}, like Jupyter, are being increasingly used.
-However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+Finally, to add narrative, computational notebooks \cite{rule18}, like Jupyter, are currently gaining popularity.
+However, due to their complex dependency trees, they are very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Further, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
Hence they can rarely deliver their promised potential \cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage.
-This is possible due to the focus on low-level algorithms that do not need any dependencies beyond an ISO C compiler.
-Unfortunately, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, and hence this solution is not scalable.
+This is possible due to the focus on low-level algorithms with no dependencies beyond an ISO C compiler.
+However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, and hence this solution is not scalable.
@@ -169,7 +162,7 @@ Unfortunately, many data-intensive projects commonly involve dozens of high-leve
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically infeasible or unscalable.
-We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
+We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
\textbf{Criterion 1: Completeness.}
A project that is complete (self-contained) has the following properties.
@@ -180,8 +173,7 @@ IEEE defined POSIX (a minimal Unix-like environment) and many OSes have complied
(4) It does not require root or administrator privileges.
(5) It builds its own controlled software for an independent environment.
(6) It can run locally (without an internet connection).
-(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures, e.g., PDF or HTML.
-% DVG: but the PDF standard is owned by Adobe(TM), and there are many versions of HTML ... so the long-term validity is jeopardised.
+(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures \emph{as output}, e.g., PDF or HTML.
(8) It can run automatically, with no human interaction.
\textbf{Criterion 2: Modularity.}
@@ -223,8 +215,8 @@ This is related to longevity, because if a workflow only contains the steps to d
\textbf{Criterion 8: Free and open source software:}
Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box).
-This criterion is necessary to complement that definition (nature is already a black box!) because
-if a project is free software (as formally defined), then others can learn from, modify, and build on it.
+This criterion is necessary to complement that definition (nature is already a black box!).
+If a project is free software (as formally defined), then others can learn from, modify, and build on it.
When the software used by the project is itself also free:
(1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level.
(2) The source can be modified to work on future hardware.
@@ -248,44 +240,39 @@ The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending
It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
-Technically, the hardest criterion to implement was the first one (completeness) and, in particular, avoiding non-POSIX dependencies).
-Minimizing complexity (criterion 3) was also difficult.
-A proposed solution was the Guix Workflow Language (GWL), written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM), but because
- Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
-
+Technically, the hardest criterion to implement was the first one (completeness) and, in particular, avoiding non-POSIX dependencies.
+One solution we considered was GNU Guix and Guix Workflow Language (GWL).
+However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make.
+
Make is not an analysis language, it is a job manager, deciding when and how to call analysis programs (in any language like Python, R, Julia, Shell or C).
Make is standardized in POSIX and is used in almost all core OS components.
-It is thus mature, actively maintained and highly optimized (and efficient in managing exact provenance), and
- was recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000} and many researchers have already had some exposure to it. % DVG: I think this parenthesis is not needed: (when building research software).
+It is thus mature, actively maintained, highly optimized, efficient in managing exact provenance, and even recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
+Researchers using free software tools have also already had some exposure to it.
%However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics.
-Linking the analysis and narrative (criterion 7) was another major design choice.
-Literate programming, implemented as Computational Notebooks like Jupyter, is currently popular.
-However, due to the problems above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection (modularized into specialised files).
-
-Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is
-through automatically-created \LaTeX{} macros, during the analysis.
+Linking the analysis and narrative (criterion 7) was historically our first design element.
+To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
+Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built.
Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
+
These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
-This allows accurate post-publication provenance \emph{and} automatic updates to the text prior to publication, thus by-passing
-the manual update in the narrative which is prone to errors and discourages improvements after writing the first draft.
+This allows accurate post-publication provenance \emph{and} automatic updates to the embedded numbers during a project.
+The latter by-passes manual updates by the authors, which are prone to errors and discourage improvements after writing the first draft.
-Acting as a link, these macro files build the core skeleton of Maneage.
+Acting as a link, the macro files build the core skeleton of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
-These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which
-are excluded here due to the strict word limit.
+These are combined at the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}), which are excluded here due to the strict word limit.
-The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment.
-On GNU/Linux operating systems, the GNU Compiler Collection (GCC) is also built from source and the GNU C library is being added (task 15390).
-The fast relocation of a project (without building from source) can be done by building the project in a container or VM.
+On GNU/Linux, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library is being added (task 15390).
+Relocation of a project without building from source can be done by building the project in a container or VM.
-When building software, the only difference between projects is usually the choice of the software.
-However, the analysis will naturally be different from one project to another at a low-level.
+The analysis phase of the project, however, is naturally very different from one project to another at a low level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
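The macro-based link between analysis and narrative described in this hunk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Maneage's implementation (which, per the text, is driven by Make): the helper `write_macro` and the file name `demo-macros.tex` are hypothetical, while the macro name `\demosfoptimizedsn` and the value 0.25 are the example quoted from the paper.

```python
from pathlib import Path
import tempfile


def write_macro(macro_file: Path, name: str, value: str) -> str:
    """Write a single LaTeX macro definition to macro_file and return that line.

    Hypothetical helper: an analysis step would call it after computing a
    number that the narrative needs to quote.
    """
    line = "\\newcommand{\\" + name + "}{" + value + "}\n"
    macro_file.write_text(line)
    return line


# Stand-in for a number produced by the analysis (assumed here, not computed).
optimized_sn = 0.25

with tempfile.TemporaryDirectory() as tmp:
    macro_path = Path(tmp) / "demo-macros.tex"  # hypothetical file name
    written = write_macro(macro_path, "demosfoptimizedsn", f"{optimized_sn:.2f}")
    print(written, end="")  # prints: \newcommand{\demosfoptimizedsn}{0.25}
```

If the generated file is read by the paper's source with `\input`, then `$\demosfoptimizedsn$` in the narrative expands to 0.25 when the PDF is built. Re-running the analysis with a different result changes the printed number without anyone editing the sentence by hand.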
@@ -345,8 +332,7 @@ Other built files (intermediate analysis steps) cascade down in the lineage to o
Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criteria.
All project deliverables (macro files, plot or table data and other datasets) are thus verified at this stage, with their checksums, to automatically ensure exact reproducibility.
-Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors
-(this step was not implemented in \cite{akhlaghi19, infante20}).
+Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors (this step was not available in \cite{akhlaghi19, infante20}).
To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings or parameters.
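The verification stage described in this hunk (exact checks through checksums, with a statistical fallback where bit-wise equality is not possible) can be sketched as follows. It is a minimal illustration under assumptions, not the contents of `verify.mk`: the text only says "checksums" and "any statistical means", so the choice of SHA-256, the relative tolerance, and all function names here are hypothetical.

```python
import hashlib
import math


def sha256_of(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def verify_exact(data: bytes, expected_checksum: str) -> bool:
    """Exact reproducibility: the deliverable must match the recorded checksum."""
    return sha256_of(data) == expected_checksum


def verify_within(value: float, expected: float, rel_tol: float) -> bool:
    """Fallback check: accept a value within a relative tolerance (an assumed criterion)."""
    return math.isclose(value, expected, rel_tol=rel_tol)


# A deliverable (here a macro line) and the checksum recorded for it.
deliverable = b"\\newcommand{\\demosfoptimizedsn}{0.25}\n"
recorded = sha256_of(deliverable)

exact_ok = verify_exact(deliverable, recorded)           # unchanged content passes
changed_ok = verify_exact(deliverable + b" ", recorded)  # any changed byte fails
approx_ok = verify_within(0.2501, 0.25, rel_tol=1e-2)    # small numerical drift passes
```

Placing such checks just before the final target means the paper is only built when every deliverable has been verified.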
@@ -397,13 +383,13 @@ $ ./project make # Re-build to see effect.
$ git add -u && git commit # Commit changes
\end{lstlisting}
-Thanks to this architecture (Figure \ref{fig:branching}), it is always possible to import (technically: \emph{merge}) Maneage into a project and improve the low-level infrastructure:
+Thanks to this architecture (Figure \ref{fig:branching}), it is always possible to import Maneage (technically: \emph{merge}) into a project and improve the low-level infrastructure:
in (a) the authors merge Maneage during an ongoing project;
in (b) readers do it after the paper's publication, e.g., when the project remains reproducible but the infrastructure is outdated, or a bug is found in Maneage.
-In this way, low-level improvements in Maneage can easily propagate to all projects, greatly reducing
- the cost of curation and maintenance of each individual project, before \emph{and} after publication.
-
+In this way, low-level improvements in Maneage can easily propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
+Finally, the complete project source is usually $\sim100$ kilo-bytes.
+It can thus easily be published or archived in many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source, see \cite{akhlaghi19, infante20}), published on Zenodo and archived in SoftwareHeritage.
@@ -418,10 +404,8 @@ In this way, low-level improvements in Maneage can easily propagate to all proje
%% Attempt to generalise the significance.
%% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
-We have shown that it is possible to build workflows satisfying all the proposed criteria, and
-we comment here on our experience in testing them through this proof-of-concept tool, which,
-%We will discuss the design principles, and how they may be generalized and usable in other projects.
- with the support of RDA, enabled its user base to grow phenomenally, underscoring some difficulties for a widespread adoption.
+We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool.
+With the support of RDA, our user-base grew phenomenally, underscoring some difficulties for a widespread adoption.
Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
@@ -432,19 +416,16 @@ Scientists, on the other hand, need to focus on their own research fields, and n
Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, the software management \emph{and} a robust data carpentry.
We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
-Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems
-(Maneage was tested on several different GNU/Linux distributions and on macOS).
-This requires maintenance by our core team and consumes time and energy, but
- the PM and analysis share the same job manager (Make), design principles and conventions.
-We have found that, more than once, advanced users add, or fix, their required software alone and share their low-level commits on the core branch, thus propagating it to all derived projects.
+Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions and macOS, and is being ported to FreeBSD).
+This requires maintenance by our core team and consumes time and energy.
+However, because the PM and analysis components share the same job manager (Make), design principles and conventions, we have already noticed some early users adding, or fixing, their required software alone.
+They later share their low-level commits on the core branch, thus propagating them to all derived projects.
On a related note, POSIX is a fuzzy standard that does not guarantee the bit-wise reproducibility of programs.
It has been chosen here, however, as the underlying platform because we focus on the results (data), not on the compiled software.
-POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on most POSIX systems; each internally corrects for differences affecting its functionality (partly as part of the GNU portability library).
-On GNU/Linux hosts, Maneage builds precise versions of the GNU Compiler Collection (GCC), GNU Binutils and GNU C library (glibc), but
- glibc is not installable on some POSIX OSs (e.g., macOS).
-The C library is linked with all programs, and
-this dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far.
+POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on most POSIX systems; each internally corrects for differences affecting its functionality (partly as part of the GNU portability library).
+On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain, but glibc is not installable on some POSIX OSs (e.g., macOS).
+The C library is linked with all programs, and this dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far.
With everything else under precise control, the effect of differing Kernel and C libraries on high-level science results can now be systematically studied with Maneage.
% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
@@ -454,11 +435,10 @@ With everything else under precise control, the effect of differing Kernel and C
%This is a long-term goal and would require major changes to academic value systems.
%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
-Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but
-this proof of concept has shown many advantages in adopting the proposed criteria.
+Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but this proof of concept has shown many advantages in adopting them.
For example, the publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
-The completeness criteria implies that algorithms and data selection can be similarly optimized and
-furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative.
+The completeness criterion implies that algorithms and data selection can be similarly optimized.
+Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative.
Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
Likewise, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured.
Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point).
@@ -468,9 +448,10 @@ From the data repository perspective, these criteria can also be very useful wit
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just database curators, improving sustainability.
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control.
(3) Software management.
-With these criteria, we ensure that each project's unique and complete software management is included. It is not a third-party PM that needs to be maintained by the data center, and they
- enable the easy management, preservation, publishing and citation of the software used.
-For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to distribute the tarballs of all the software used with the project's source and deliverables.
+With these criteria, we ensure that each project's unique and complete software management is included.
+It is not a third-party PM that needs to be maintained by the data center.
+This enables robust software management, preservation, publishing and citation.
+For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to distribute the tarballs of all the software used with each project's source as deliverables.
(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarises the whole purpose of these criteria.
@@ -505,7 +486,7 @@ for their useful help, suggestions and feedback on Maneage and this paper.
Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhlaghi and its Grant-in-Aid for Scientific Research (21244012, 24253003).
The European Research Council (ERC) advanced grant 339659-MUSICOS.
-The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie
+The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie
Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
@@ -552,8 +533,8 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
\begin{IEEEbiographynophoto}{David Valls-Gabaud}
is a CNRS Research Director at the Observatoire de Paris, France.
His research interests span from cosmology and galaxy evolution to stellar physics and instrumentation.
- He is adamant about ensuring scientific results are fully reproducible. Educated at the universities of
- Madrid (Complutense), Paris and Cambridge, he obtained his PhD in astrophysics in 1991.
+ He is adamant about ensuring scientific results are fully reproducible.
+ Educated at the universities of Madrid (Complutense), Paris and Cambridge, he obtained his PhD in astrophysics in 1991.
Contact him at david.valls-gabaud@obspm.fr.
\end{IEEEbiographynophoto}
@@ -565,6 +546,7 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
Baena-Gall\'e holds MS degrees in both Telecommunication and Electronic Engineering from the University of Seville (Spain), and received a PhD in astronomy from the University of Barcelona (Spain).
Contact him at rbaena@iac.es.
\end{IEEEbiographynophoto}
+\vfill
\end{document}
%% This file is free software: you can redistribute it and/or modify it