-rw-r--r-- | paper.tex | 31 |
1 files changed, 15 insertions, 16 deletions
@@ -60,19 +60,18 @@ % in the abstract or keywords.
 \begin{abstract}
 %% CONTEXT
- Reproducible workflow solutions commonly use high-level technologies that were popular when they were created, providing an immediate solution which is unlikely to be sustainable in the long term.
+ Reproducible workflows commonly use high-level technologies that are popular when created, but are unlikely to be sustainable in the long term.
 %% AIM
- We therefore introduce a set of criteria to address this problem and demonstrate their practicality and implementation.
+ We therefore aim to introduce a set of criteria to address this problem.
 %% METHOD
- The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; temporal provenance; linking analysis with narrative; and free-and-open-source software.
+ They have been tested in several research publications and can be summarized as: completeness (no dependency beyond POSIX, no administrator privileges, no network connection and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
 %% RESULTS
- As a proof of concept, we have implemented ``Maneage'', a solution which stores the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
+ As a proof of concept, ``Maneage'' is introduced, storing projects in machine-actionable and human-readable plain text, enabling cheap archiving, provenance extraction, and peer verification.
 %% CONCLUSION
- We show that requiring longevity of a reproducible workflow solution is realistic, without sacrificing immediate or short-term reproducibility and discuss the benefits of the criteria for scientific progress.
- This paper has itself been written in Maneage, with snapshot \projectversion.
+ We show that longevity is a realistic requirement that doesn’t sacrifice immediate or short-term reproducibility, discuss the caveats (with proposed solutions) and conclude with the benefits for the various stakeholders. This paper is itself written with Maneage (project commit \projectversion).
 \vspace{3mm}
- \emph{Reproducible supplement} --- Necessary software, workflow and output data are published in \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{Zenodo.3872248}}.
+ \emph{Reproducible supplement} --- Necessary software, workflow and output data are published in \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{zenodo.3872248}} with Git repository at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}}.
 \end{abstract}
 % Note that keywords are not normally used for peer-review papers.
@@ -206,8 +205,8 @@ On a small scale, the criteria here are trivial to implement, but can rapidly be
 The project should verify its inputs (software source code and data) \emph{and} outputs.
 Reproduction should be straightforward enough such that ``\emph{a clerk can do it}''\cite{claerbout1992} (with no expert knowledge).
-\textbf{Criterion 6: History and temporal provenance.}
-No exploratory research project is done in a single, first attempt.
+\textbf{Criterion 6: Recorded history.}
+No exploratory research is done in a single, first attempt.
 Projects evolve as they are being completed.
 It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
 Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
@@ -300,6 +299,12 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it
 }
 \end{figure*}
+The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
+It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
+This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
+A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
+A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
+
 \begin{lstlisting}[
 label=code:topmake,
 caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}}
@@ -323,12 +328,6 @@ include $(foreach s,$(makesrc), \
 reproduce/analysis/make/$(s).mk)
 \end{lstlisting}
-The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
-It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
-This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
-A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
-A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
-
 All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
 Project authors add their modular subMakefiles in between.
 Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
@@ -363,7 +362,7 @@ This fast and cheap testing encourages experimentation (without necessarily know
 }
 \end{figure*}
-Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
+Finally, to satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
 Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
 Derived project start by branching off, and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}, there is customization checklist in \inlinecode{README-hacking.md}).