From 70597b66d8d8f8f4347cca63d432a4a9ed2fe6b8 Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Sat, 23 May 2020 02:32:17 +0100 Subject: Edits, to make the text more readable After one day not looking at the first draft of this new version (commit 7b008dfbb9b2), I went through the text and done some general edits to make its presentation and logic smoother. --- paper.tex | 165 ++++++++++++++++++++++++++++++-------------------------------- 1 file changed, 81 insertions(+), 84 deletions(-) diff --git a/paper.tex b/paper.tex index 004f9d2..09412fb 100644 --- a/paper.tex +++ b/paper.tex @@ -118,21 +118,21 @@ As a solution to this problem, here we introduce a criteria that can guarantee t \section{Commonly used tools and their longevity} -To highlight the proposed criteria, some of the most commonly used tools are reviewed from the long-term usability perspective. -We recall that while longevity is important in some fields (like the sciences), it is not necessarily of interest in others (e.g., short term commercial projects), hence the wide usage of tools that evolve very fast. -Most existing reproducible workflows use a common set of third-party tools that can be categozied as: +To highlight the necessity of longevity, some of the most commonly used tools are reviewed here, from the perspective of long-term usability. +We recall that while longevity is important in some fields (like the sciences and some industries), it isn't the case in others (e.g., short term commercial projects), hence the usage of fast-evolving tools. +Most existing reproducible workflows use a common set of third-party tools that can be categorized as: (1) Environment isolators like virtual machines, containers and etc. (2) PMs like Conda, Nix, or Spack, -(3) Job orchestrators like scripts, Make, SCons, and CGAT-core, +(3) Job management like scripts, Make, SCons, and CGAT-core, (4) Notebooks like Jupyter. 
To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded 2nd prize in the Elseiver Executable Paper Grand Challenge of 2011 and discontinued in 2019). -However, containers (in particular Docker and lesser, Singularity) are by far the most used solution today, so we will focus on Docker here. +However, containers (in particular Docker and, to a lesser extent, Singularity) are by far the most used solution today, so we will focus on Docker here. %% Note that L. Barba (second author of this paper) is the editor in chief of CiSE. Ideally, it is possible to precisely version/tag the images that are imported into a Docker container. But that is rarely practiced in most solutions that we have studied. -Usually, images are imported with generic operating system names e.g., `\inlinecode{FROM ubuntu:16.04}'\cite{mesnard20}. +Usually, images are imported with generic operating system names; e.g., \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'. The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the last 5 images. Hence if the Dockerfile is run in different months, it will contain different core operating system components. Furthermore, in the year 2024, when the long-term support for this version of Ubuntu expires, it will be totally removed. @@ -157,7 +157,8 @@ This leads to many inefficiencies in project cost and/or scientific accuracy (re Finally, to add narrative, computational notebooks\cite{rule18}, like Jupyter, are being increasingly used in many solutions. However, the complex dependency trees of such web-based tools make them very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib; one of the more simple Jupyter dependencies. The longevity of a project is determined by its shortest-lived dependency.
-Furthermore, similar to the point above on job management, by not actively encouraging good practices in programming or project management, such tools can rarely deliver their promised potential\cite{rule18} or can even hamper reproducibility \cite{pimentel19}. +Furthermore, similar to the point above on job management, notebooks don't actively encourage good practices in programming or project management. +Therefore, they can rarely deliver their promised potential\cite{rule18} or can even hamper reproducibility \cite{pimentel19}. An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}). Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is build-able on all operating systems) with example images/data that can also be executed on their webpage. @@ -173,7 +174,7 @@ Many data-intensive projects, commonly involve dozens of high-level dependencies The main premise is that starting a project with robust data management strategy (or tools that provide it) is much more effective, for the researchers and community, than imposing it in the end \cite{austin17,fineberg19}. Researchers play a critical role\cite{austin17} in making their research more Findabe, Accessible, Interoperable, and Reusable (the FAIR principles). Actively curating workflows for evolving technologies by repositories alone is not practically feasible, or scalable. -In this paper we argue that workflows satisfying the criteria below can reduce the cost of curation for repositories, while maximizing the FAIRness of the deliverables for future researchers. +In this paper we argue that workflows satisfying the criteria below can improve researcher workflows during the project, reduce the cost of curation for repositories after publication, and maximize the FAIRness of the deliverables for future researchers.
\textbf{Criteria 1: Completeness.} A project that is complete, or self-contained, has the following properties: @@ -209,7 +210,7 @@ A scalable project can easily be used in arbitrarily large and/or complex projec On a small scale, the criteria here are trivial to implement, but as the projects get more complex, it can become unsustainable. \textbf{Criteria 5: Verifiable inputs and outputs.} -The project should automatically verify its inputs (software source code and data) \emph{and} outputs. +The project should verify its inputs (software source code and data) \emph{and} outputs. Expert knowledge should not be required to confirm a reproduction, such that ``\emph{a clerk can do it}''\cite{claerbout1992}. \textbf{Criteria 6: History and temporal provenance.} @@ -224,7 +225,6 @@ A project is not just its computational analysis. A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it. A narrative description is also part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, possibly with respect to other projects/papers. This is related to longevity because if a workflow only contains the steps to do the analysis, or generate the plots, in time, it may be separated from its accompanying published paper. -A raw analysis workflow with no context is hardly useful. \textbf{Criteria 8: Free and open source software:} Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box). @@ -232,7 +232,7 @@ This criteria is thus necessary to complement that definition (nature is already As free software, others can learn from, modify, and build upon a project. When the used software are also free, (1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level. -(2) It can be modified to work on a future hardware by others. 
+(2) Their source can be modified by others to work on future hardware. (3) A non-free software typically cannot be distributed by others, making it reliant on a single server (even without payments). @@ -247,10 +247,10 @@ When the used software are also free, \section{Proof of concept: Maneage} Given the limitations of existing tools with the proposed criteria, it is necessary to show a proof of concept. -The proof presented here has already been tested in previously published papers \cite{akhlaghi19, infante20} and was recently awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17} from the researcher perspective to ensure longevity. +The proof presented here has already been tested in previously published papers \cite{akhlaghi19, infante20} and was recently awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17}, from the researcher perspective to ensure longevity. -The proof of concept is called Maneage (MANaging+LinEAGE, ending is pronounced like ``Lineage''). -It was developed along with the criteria, as a parallel research project in 5 years for publishing our reproducible research workflows with our research. +The proof of concept is called Maneage (MANaging + LinEAGE, ending is pronounced like ``Lineage''). +It was developed along with the criteria, as a parallel research project over 5 years, for publishing our reproducible workflows to supplement our research. Its primordial form was implemented in \cite{akhlaghi15} and later evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
Technically, the hardest criteria to implement was the completeness criteria (and in particular no dependency beyond POSIX), blended with minimal complexity. @@ -260,7 +260,7 @@ Not having any exposure to Lisp/Scheme and their fundamentally different style, Furthermore, the desired solution was meant to be easily understandable/usable by fellow scientists, who generally have not had exposure to Lisp/Scheme. Inspired by GWL+Guix, a single job management tool was used for both installing of software \emph{and} the analysis workflow: Make. -Make is not an analysis language, it is a job manager, deciding when to call analysis programs (written in any language like Shell, Python, Julia or C). +Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C). Make is standardized in POSIX and is used in almost all core OS components. It is thus mature, actively maintained and highly optimized. Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (when building research software). @@ -275,25 +275,24 @@ For example, in the abstract of \cite{akhlaghi19} we say `\emph{... detect the o The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'. The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is set during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built. Such values also depend on the analysis, hence just as plots, figures or tables they should also be reproduced. -As a side-effect, these macros act as a quantifiable link between the narrative and analysis, with the granulity of a word in a sentence and exact analysis command. 
-This allows accurate provenance tracking \emph{and} automatic updates to the text when any part of the analysis is changed. -Manually typing such numbers in the narrative is prone to errors and discourages experimentation after the first writing of the project. - -The ultimate aim of any project is to produce a report accompaning a dataset with some visualizations, or a research article in a journal, let's call it \inlinecode{paper.pdf}. -Hence the files with the relevant macros of each (modular) step, build the core structure (skeleton) of Maneage. -During the software building (configuration) phase, each package is identified by a \LaTeX{} file, containing its official name, version and possible citation. -In the end, they are combined to enable precise software acknowledgement and citation (see the appendices of \cite{akhlaghi19, infante20}, not included here due to the strict word-limit). -Simultaneously, they act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (parallel, no redundancies), for any complexity (e.g., Maneage also builds Matplotlib if requested, see Figure 1 of \cite{alliez19}). -Dependencies go down to precise versions of the shell, C compiler, and the C library (task 15390) for an exactly reproducible environment. -To enable easy and fast relocation of the project without building from source, it is possible to build it in any existing container/VM. -The important factor is that, the precise environment isolator is irrelevant, it can always be rebuilt. - -During configuration, only the very high-level choice of which software to built differs between projects. -The Makefiles containig build recipes of each software do not generally change. +As a side-effect, these macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and exact analysis command. 
+This allows accurate provenance \emph{and} automatic updates to the text when necessary. +Manually typing such numbers in the narrative is prone to errors and discourages experimentation after writing the first draft. + +The ultimate aim of any project is to produce a report accompanying a dataset with some visualizations, or a research article in a journal. +Let's call it \inlinecode{paper.pdf}. +Hence the files hosting the macros (that go into the report) of each analysis step build the core structure (skeleton) of Maneage. +During the software building (``configuration'') phase, each software package is identified by a \LaTeX{} file containing its official name, version and possible citation. +In the end, they are combined for precise software acknowledgment and citation (see the appendices of \cite{akhlaghi19, infante20}, not included here due to the strict word-limit). +Simultaneously, these files act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (parallel, no redundancies), for any complexity (e.g., Maneage also builds Matplotlib if requested, see Figure 1 of \cite{alliez19}). +Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live, C compiler, and the C library (task 15390) for an exactly reproducible environment. +For fast relocation of the project (without building from source) it is possible to build it in the popular container, or VM, technology of the day. + +When building software, only the very high-level choice of which software to build differs between projects; the build recipes of each software package do not generally change. However, the analysis will naturally be different from one project to another. -Therefore, a design was necessary to satisfy the modularity, scalability and minimal complexity criteria.
+Therefore, a design was necessary to satisfy the modularity, scalability and minimal complexity criteria while being generic enough to host any project. To avoid getting too abstract, we will demonstrate it by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top). -Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (with this whole paper). +Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper). \begin{figure*}[t] \begin{center} @@ -314,21 +313,21 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it } \end{figure*} -Analysis is orchestrated in a single point of entry (the Makefile \inlinecode{top-make.mk}). +Analysis is orchestrated in a single point of entry (\inlinecode{top-make.mk}, which is a Makefile). It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, not doing any analysis itself. This is shown in Figure \ref{fig:datalineage} (bottom) where all the built/blue files are placed over subMakefiles. A random reader will be able to understand the high-level logic of the project (irrespective of the low-level implementation details) with simple visual inspection of this file, provided that the subMakefile names are descriptive. A human-friendly design (that is also optimized for execution) is a critical component of publishing reproducible workflows. -In all projects \inlinecode{top-make.mk} will first load the subMakefiles \inlinecode{initialize.mk} and \inlinecode{download.mk}, while concluding with \inlinecode{verify.mk} and \inlinecode{paper.mk}. +In all projects \inlinecode{top-make.mk} first loads \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finishes with \inlinecode{verify.mk} and \inlinecode{paper.mk}.
Project authors add their modular subMakefiles in between (after \inlinecode{download.mk} and before \inlinecode{verify.mk}), in Figure \ref{fig:datalineage} (bottom), the project-specific subMakefiles are \inlinecode{format.mk} \& \inlinecode{demo-plot.mk}. -Except for \inlinecode{paper.mk} which builds the ultimate target \inlinecode{paper.pdf}, all subMakefiles build at least one file: a \LaTeX{} macro file with the same base-name, see the \inlinecode{.tex} files in each subMakefile of Figure \ref{fig:datalineage}. -The other built files will ultimately (through other files) lead to one of the macro files. +Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}) all subMakefiles build a \LaTeX{} macro file with the same base-name (a \inlinecode{.tex} in each subMakefile of Figure \ref{fig:datalineage}). +Other built files ultimately cascade down in the lineage (through other files) to one of these macro files. -Irrespective of the number of subMakefiles, there lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria. +Irrespective of the number of subMakefiles, just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criteria. All the macro files, plot information and published datasets of the project are verified with their checksums here to automatically ensure exact reproducibility. Where exact reproducibility is not possible, values can be verified by any statistical means (specified by the project authors). -Finally, having verified quantitative results, the project builds the ultimate target in \inlinecode{paper.mk}. +We note that this step was not yet implemented in \cite{akhlaghi19, infante20}. 
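The checksum comparison described above for \inlinecode{verify.mk} can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the choice of SHA-256 and the example file name are hypothetical and are not taken from Maneage, where verification is done inside a subMakefile.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def failed_checksums(contents: Dict[str, bytes],
                     expected: Dict[str, str]) -> List[str]:
    """Return the sorted names of products that are missing or whose
    checksum differs from the stored (expected) value.

    `contents` maps a product name to its bytes (hypothetical stand-in
    for reading the built files); `expected` maps the same names to the
    checksums recorded by the project authors.
    """
    failed = []
    for name, checksum in expected.items():
        data = contents.get(name)
        if data is None or sha256_hex(data) != checksum:
            failed.append(name)
    return sorted(failed)


# Usage: an empty list means every verified product reproduced exactly.
built = {"demo-plot.tex": b"\\newcommand{\\demovalue}{1}\n"}
stored = {name: sha256_hex(data) for name, data in built.items()}
print(failed_checksums(built, stored))
```

A single list of failures (rather than stopping at the first mismatch) is one possible design: it lets a reader see at once which products did not reproduce, without expert knowledge of the analysis.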
\begin{figure*}[t] \begin{center} \includetikz{figure-branching}\end{center} @@ -344,24 +343,25 @@ Finally, having verified quantitative results, the project builds the ultimate t \end{figure*} To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files. -By convention in Maneage, the subMakefiles (and the Python, Julia, C, Fortran, or etc, programs that they call for doing the number crunching), only organize the analysis, they do not contain any fixed numbers, settings or parameters. -Parameters are set as Make variables in ``configuration files'' and passed to the respective program (\inlinecode{.conf} files in Figure \ref{fig:datalineage}). -In the demo lineage, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage. +By convention in Maneage, the subMakefiles, and the programs they call for number-crunching, do not contain any fixed numbers, settings or parameters. +Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program. +For example, in Figure \ref{fig:datalineage}, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage. As another demo, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot). The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf}. -The result \inlinecode{\menkenumpapersdemocount} was calculated after generating \inlinecode{columns.txt}. -Both are expanded in the PDF as \LaTeX{} macros. -Enabling the reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing how it was generated. 
-Since a configuration file is a prerequisite of the target that uses it, if it is changed, Make will re-execute the recipe and its descendants. -This encourages testing (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency. +As the lineage shows, the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}. +Both are expanded in this PDF as \LaTeX{} macros. +This enables the reader to change the value in \inlinecode{demo-year.conf} to automatically update the result, without necessarily knowing how it was generated. +Furthermore, the configuration files are prerequisites of the targets that use them. +Hence if one is changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products. +This fast/cheap testing encourages experimentation (without necessarily knowing the implementation details, e.g., by co-authors or future readers), and ensures self-consistency. Finally, to satisfy the temporal provenance criteria, version control (currently implemented in Git), plays a defining role in Maneage as shown in Figure \ref{fig:branching}. In practice, Maneage is a Git branch that contains the shared components, or infrastructure of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script). -Every project starts by branching-off the Maneage branch and customizing it by adding their own title, input data links, writing their narrative, and subMakefiles for their analsyis, see Listing \ref{code:branching}. +Every project starts by branching-off the Maneage branch and customizing it (e.g., adding their own title, input data links, writing their narrative, and subMakefiles for their analysis), see Listing \ref{code:branching}.
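The statement above that a changed configuration file triggers re-execution of only the dependent recipes follows from how Make compares modification times. The Python sketch below is a simplified model of that rule, not Make itself and not Maneage code: the function names are hypothetical, and pairing \inlinecode{demo-year.conf} with a target called \inlinecode{demo-plot.tex} is an assumption made only for illustration.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Simplified model of Make's rule: a target is rebuilt when it does
    not exist, or when any prerequisite is newer than the target."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def demo():
    """Create two placeholder files and check them before and after the
    configuration file is 'edited' (its timestamp is moved forward)."""
    with tempfile.TemporaryDirectory() as tmp:
        conf = os.path.join(tmp, "demo-year.conf")    # prerequisite
        target = os.path.join(tmp, "demo-plot.tex")   # assumed target
        for path in (conf, target):
            with open(path, "w") as f:
                f.write("placeholder\n")

        # Target built after the configuration file: nothing to do.
        os.utime(conf, (100, 100))
        os.utime(target, (200, 200))
        before_edit = needs_rebuild(target, [conf])

        # Configuration file modified later: the target is out of date.
        os.utime(conf, (300, 300))
        after_edit = needs_rebuild(target, [conf])
    return before_edit, after_edit


print(demo())
```

Targets that do not list the edited file among their prerequisites (directly or through other targets) are never flagged by this check, which is why unrelated built products are left untouched.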
\begin{lstlisting}[ label=code:branching, - caption={Starting new project with Maneage, and building it}, + caption={Starting a new project with Maneage, and building it}, ] # Cloning main Maneage branch and branching-off of it. $ git clone https://git.maneage.org/project.git @@ -376,9 +376,9 @@ $ ./project make # Do analysis, build PDF paper. As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import or merge Maneage into the project to improve the low-level infrastructure: in (a) the authors merge into Maneage during an ongoing project, -in (b) readers can do it after the paper's publication, even when authors cannot be accessed, and the project's infrastructure is outdated, or does not build. +in (b) readers can do it after the paper's publication, e.g., when the project's infrastructure is outdated, or does not build, and authors cannot be accessed. Low-level improvements in Maneage are thus automatically propagated to all projects. -This greatly reduces the cost of curation, or maintenance, of each individual project, before and after publication. +This greatly reduces the cost of curation, or maintenance, of each individual project, before \emph{and} after publication. @@ -395,59 +395,56 @@ This greatly reduces the cost of curation, or maintenance, of each individual pr %% Attempt to generalise the significance. %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally. -As shown in the proof of concept above, it is possible to define a workflow that satisfies the criteria presented in this paper. 
-Here we will review the lessons learnt and insights gained, while sharing the experience of implementing the RDA recommendations +Having shown that it is possible to build workflows satisfying the proposed criteria, here we will review the lessons learned and insights gained, while sharing the experience of implementing the RDA/WDS recommendations. We will also discuss the design principles, an how they may be generalized and usable in other projects. +In particular, with the support of RDA, the user base and development of the criteria and Maneage grew phenomenally, highlighting some difficulties for the wide-spread adoption of these criteria. -With the support of RDA, the user base and development of the criteria and Maneage grew phenomenally, highlighting some difficulties for wide-spread adoption of these criteria. -Firstly, the low-level tools are not widely used by many scientists, e.g., Git, \LaTeX, the command-line and Make. -This is primarily because of a lack of exposure, we noticed that after witnessing the improvements in their research, many (especially early career researchers) have started mastering these tools. -Fortunately, many research institutes are having courses on these generic tools and we will also be adding more tutorials and demonstration videos in its documentation. +Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used. +But we have noticed that after witnessing the improvements in their research, many (especially early career researchers) have started mastering these tools. +Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them. +Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving to the next technology.
+Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity. +Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, the paper's narrative, software management \emph{and} a modular lineage for analysis. +We have seen that a complete \emph{and} customizable template with a clear checklist of first steps is much more effective in encouraging mastery of these essential tools for modern science than abstract/isolated tutorials on each tool individually. Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several GNU/Linux distributions and macOS). This requires maintenance by our core team and consumes time and energy. -However, due to the complexity criteria, the PM and analysis share the same job manager. -Our experience has shown that users' experience in the analysis empowers some of them to add/fix their required software on their own systems, and share that commits on the core branch, thus propagating to all derived projects. -This has already happened in multiple cases. +However, the PM and analysis share the same job manager, and our experience so far has shown that users' experience in the analysis empowers some of them to add/fix their required software on their own systems. +Later, they share their fixes as commits on the core branch, thus propagating them to all derived projects. +This has already occurred multiple times. Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers in competition with the original authors.
We propose these solutions: 1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote. -This is a long-term goal and requires major changes to academic value systems. +However, this is a long-term goal and requires major changes to academic value systems. 2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it. -Other implementations of the criteria, or future improvements in Maneage, may solve the caveats above. +Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above. However, the proof of concept already shows many advantages to adopting the criteria. -Above, the benefits for researchers was the main focus, but these criteria also help in data centers, for example with regard to the challenges mentioned in \cite{austin17}: -(1) The burden of curation is shared among all project authors and/or readers (who may find a bug and fix it), not just by data-base curators, improving the sustainability of data centers. -(2) Automated and persistent bi-directional linking of data and publication can be established through the published \& \emph{complete} data lineage that is version controlled. -(3) Software management. -With these criteria, each project's unique and complete software management is included: its not a third-party PM, that needs to be maintained by the data center employees. -This enables easy management, preservation, publishing and citation of used software. 
-For example see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} where we have exploited the free software criteria to distribute all the used software tarballs with the other project files. -(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which results from the criteria. - -Generally, scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years does not help. -Such high-level tools are primarily targetted at software developers, who are paid to learn them and use them effectively for short-term projects. -Scientists, on the other hand, need to focus on their own research fields, and need to think about longevity. -Hence, arguably the most important feature is that the un-customized project is already a fully working template blending version control, paper's narrative, software management \emph{and} a modular lineage for analysis with mature tools, allowing scientists to learn them in practice, not abstractly. - -Publication of projects with these criteria on a wide scale allows automatic workflow generation, optimized for desired characteristics of the results (for example via machine learning). +For example, publication of projects with these criteria on a wide scale allows automatic workflow generation, optimized for desired characteristics of the results (for example via machine learning). Because is complete, algorithms and data selection methods can be similarly optimized. -Furthermore, through elements like the macros, natural language processing can also be included, allowing a direct connection between an analysis and the resulting narrative \emph{and} history of that narrative. 
+Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis and the resulting narrative \emph{and} the history of that analysis/narrative. Parsers can be written over projects for meta-research and data provenance studies, for example to generate ``research objects''. As another example, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured. Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point). Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders. - +From the data repository perspective, these criteria can also be very useful, for example with regard to the challenges mentioned in \cite{austin17}: +(1) The burden of curation is shared among all project authors and/or readers (who may find a bug and fix it), not just by database curators, improving the sustainability of data repositories. +(2) Automated and persistent bi-directional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control. +(3) Software management. +With these criteria, each project's unique and complete software management is included: it is not a third-party PM that needs to be maintained by the data center employees. +This enables easy management, preservation, publishing and citation of used software. +For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} where we have exploited the free software criteria to distribute all the used software tarballs with the project's source and deliverables. 
+(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which is the whole purpose of these criteria. % use section* for acknowledgment -\section*{Acknowledgement} +\section*{Acknowledgment} The authors wish to thank (sorted alphabetically) Julia Aguilar-Cabello, @@ -471,7 +468,7 @@ and Ignacio Trujillo for their useful help, suggestions and feedback on Maneage and this paper. Work on Maneage, and this paper, has been partially funded/supported by the following institutions: -The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhl\-aghi and its Grant-in-Aid for Scientific Research (21244012, 24253003). +The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhlaghi and its Grant-in-Aid for Scientific Research (21244012, 24253003). The European Research Council (ERC) advanced grant 339659-MUSICOS. The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN. The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P. @@ -495,24 +492,24 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314 %% Biography \begin{IEEEbiographynophoto}{Mohammad Akhlaghi} - is currently a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. + is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. His main scientific interest is in early galaxy evolution, but to extract information from the modern complex datasets, he has been involved in image processing and reproducible workflow management where he has founded GNU Astronomy Utilities (Gnuastro) and Maneage (introduced here). 
He received his PhD in astronomy from Tohoku University, Sendai, Japan, and before coming to Tenerife, held a CNRS postdoc position at the Centre de Recherche Astrophysique de Lyon (CRAL). Contact him at mohammad@akhlaghi.org and find his website at \url{https://akhlaghi.org}. \end{IEEEbiographynophoto} \begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz} - is currently a doctoral student at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. + is a doctoral student at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. Contact him at infantesainz@gmail.com. \end{IEEEbiographynophoto} \begin{IEEEbiographynophoto}{Boudewijn F. Roukema} - is currently a professor at the Astronomy and Informatics department of Nicolaus Copernicus University in Toru\'n, Poland. + is a professor at the Astronomy and Informatics department of Nicolaus Copernicus University in Toru\'n, Poland. Contact him at boud@astro.uni.torun.pl. \end{IEEEbiographynophoto} \begin{IEEEbiographynophoto}{David Valls-Gabaud} - is currently a professor at the Observatoire de Paris, France. + is a professor at the Observatoire de Paris, France. Contact him at david.valls-gabaud@obspm.fr. \end{IEEEbiographynophoto} -- cgit v1.2.1