-rw-r--r-- | paper.tex | 22 |
1 files changed, 11 insertions, 11 deletions
@@ -168,7 +168,7 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
 The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
 Researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
 Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically infeasible or unscalable.
-We argue that workflows satisfying the criteria below can not just improve researcher flexibility during a research project, but can increase the FAIRness of the deliverables for future researchers.
+We argue that workflows satisfying the criteria below can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers.
 \textbf{Criterion 1: Completeness.}
 A project that is complete (self-contained) has the following properties.
@@ -255,7 +255,7 @@ Inspired by GWL+Guix, a single job management tool was used for both installing
 Make is not an analysis language, it is a job manager, deciding when to call analysis programs (in any language like Python, R, Julia, Shell or C).
 Make is standardized in POSIX and is used in almost all core OS components.
 It is thus mature, actively maintained and highly optimized.
-Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (atleast when building research software).
+Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (at least when building research software).
 %However, because they didn't attempt to build the software environment, in 2006 they moved to SCons (Make-simulator in Python which also attempts to manage software dependencies) in a project called Madagascar (\url{http://ahay.org}), which is highly tailored to Geophysics.
 Linking the analysis and narrative was another major design choice.
@@ -268,23 +268,23 @@ The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer
 The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built.
 Since values like this depend on the analysis, they should also be reproducible, along with figures and tables.
 These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
-This allows accurate provenance post-publication \emph{and} automatic updates to the text pre-publication.
-Manually updating them in the narrative is prone to errors and discourages improvements after writing the first draft.
+This allows accurate provenance post-publication \emph{and} automatic updates to the text prior to publication.
+Manually updating these in the narrative is prone to errors and discourages improvements after writing the first draft.
 The ultimate aim of any project is to produce a report accompanying a dataset, providing visualizations, or a research article in a journal.
 Let's call this \inlinecode{paper.pdf}.
-Acting as a link, the macro filess of each analysis step (which produce numbers, tables, figures included in the report) thus build the core structure (skeleton) of Maneage.
+Acting as a link, the macro files of each analysis step (which produce numbers, tables, figures included in the report) thus build the core structure (skeleton) of Maneage.
 For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
 These are combined in the end to generate precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; excluded here due to the strict word limit).
 The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
-Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live and etc for an \emph{almost} exact reproducible environment on POSIX-compatible systems.
+Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live) for an \emph{almost} exact reproducible environment on POSIX-compatible systems.
 On GNU/Linux operating systems, the GNU Compiler Collection (GCC) is also built from source and the GNU C library is being added (task 15390).
 Fast relocation of a project (without building from source) can be done by building the project in a container or VM.
 In building software, normally only the high-level choice of which software to build differs between projects.
 However, the analysis will naturally be different from one project to another at a low-level.
-It was thus necessary to design generic system to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
+It was thus necessary to design a generic system to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
 We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
 Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
@@ -336,7 +336,7 @@ This is visualized in Figure \ref{fig:datalineage} (bottom) where no built/blue
 As shown in Listing \ref{code:topmake}, a visual inspection of this file allows a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details); provided that the subMakefile names are descriptive (thus encouraging good practice).
 A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
-All projects, first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
+All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
 Project authors add their modular subMakefiles in between.
 Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same basename (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
 Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
@@ -414,7 +414,7 @@ This greatly reduces the cost of curation and maintenance of each individual pro
 We have shown that it is possible to build workflows satisfying the proposed criteria.
 Here, we comment on our experience in testing them through the proof of concept.
 We will discuss the design principles, and how they may be generalized and usable in other projects.
-In particular, with the support of RDA, the user base grew phenomenally, highlighting some difficulties for the wide-spread adoption.
+In particular, with the support of RDA, the user base grew phenomenally, highlighting some difficulties for wide-spread adoption.
 Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
 Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
@@ -519,8 +519,8 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
 \begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
  is a doctoral student at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain.
- He has been concerned about the ability of reproducing scientific results from the start of his research and has thus been actively involved in development and testing of Maneage.
- His main scientific interests are the galaxy formation and evolution, studying the low-surface-brightness Universe through reproducible methods.
+ He has been concerned about the ability of reproducing scientific results since the start of his research and has thus been actively involved in development and testing of Maneage.
+ His main scientific interests are galaxy formation and evolution, studying the low-surface-brightness Universe through reproducible methods.
  Contact him at infantesainz@gmail.com and find his website at \url{https://infantesainz.org}.
 \end{IEEEbiographynophoto}