-rw-r--r-- | paper.tex | 86 |
1 files changed, 43 insertions, 43 deletions
@@ -170,69 +170,69 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
 \section{Proposed criteria for longevity}

-The main premise is that starting a project with robust data management strategy (or tools that provide it) is much more effective, for the researchers and community, than imposing it in the end \cite{austin17,fineberg19}.
-Researchers play a critical role\cite{austin17} in making their research more Findabe, Accessible, Interoperable, and Reusable (the FAIR principles).
-Actively curating workflows for evolving technologies by repositories alone is not practically feasible, or scalable.
-In this paper we argue that workflows satisfying the criteria below can will improve researcher workflows during the project, reduce the cost of curation for repositories after publication, while maximizing the FAIRness of the deliverables for future researchers.
-
-\textbf{Criteria 1: Completeness.}
-A project that is complete, or self-contained, has the following properties:
-(1) has no dependency beyond the Portable Operating System (OS) Interface, or POSIX.
-IEEE defined POSIX (a minimal Unix-like environment) and many OSs have complied.
-It is thus a sufficiently reliable foundation for longevity in execution.
-(2) No dependency implies that the project itself must be primarily stored in plain-text: not needing specialized software to open, parse or execute.
-(3) Does not affect the host OS (its libraries, programs, or environment).
-(4) Does not require root or administrator privileges.
-(5) Builds its own controlled software for an independent environment.
-(6) Can run locally (without internet connection).
-(7) Contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures, e.g., PDF or HTML.
-(8) Can run automatically, with no human interaction.
-
-\textbf{Criteria 2: Modularity.}
+The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
+Researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
+Archiving the workflow of a project in a repository after the project is finished is, on its own, insufficient, and often either practically infeasible or unscalable.
+In this paper we argue that workflows satisfying the criteria below can improve researcher workflows during the project, reduce the cost of curation for repositories after publication, while maximizing the FAIRness of the deliverables for future researchers.
+
+\textbf{Criterion 1: Completeness.}
+A project that is complete (self-contained) has the following properties.
+(1) It has no dependency beyond the Portable Operating System (OS) Interface: POSIX.
+IEEE defined POSIX (a minimal Unix-like environment) and many OSes have complied.
+It is a reliable foundation for longevity in software execution.
+(2) ``No dependency'' requires that the project itself must be primarily stored in plain text, not needing specialized software to open, parse or execute.
+(3) It does not affect the host OS (its libraries, programs, or environment).
+(4) It does not require root or administrator privileges.
+(5) It builds its own controlled software for an independent environment.
+(6) It can run locally (without an internet connection).
+(7) It contains the full project's analysis, visualization \emph{and} narrative: from access to raw inputs to doing the analysis, producing final data products \emph{and} its final published report with figures, e.g., PDF or HTML.
+(8) It can run automatically, with no human interaction.
+
+\textbf{Criterion 2: Modularity.}
 A modular project enables and encourages the analysis to be broken into independent modules with well-defined inputs/outputs and minimal side effects.
 Explicit communication between various modules enables optimizations on many levels:
 (1) Execution in parallel and avoiding redundancies (when a dependency of a module has not changed, it will not be re-run).
 (2) Usage in other projects.
-(3) Easy to debug and improve.
-(4) Facilitates citation of specific parts,
+(3) Easy debugging and improvements.
+(4) Modular citation of specific parts.
 (5) Provenance extraction.

-\textbf{Criteria 3: Minimal complexity.}
-Minimal complexity can be interpreted as
-(1) avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
-Because it is going to fall out of fashion soon and require significant resources to translate or rewrite every few years.
-More stable/basic tools can also be used with less long-term maintenance.
-(2) avoiding too many different languages and frameworks, e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices.
+\textbf{Criterion 3: Minimal complexity.}
+Minimal complexity can be interpreted as:
+(1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
+A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years.
+More stable/basic tools can be used with less long-term maintenance.
+(2) Avoiding too many different languages and frameworks, e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to adopt and encourages good practices.

-\textbf{Criteria 4: Scalability.}
+\textbf{Criterion 4: Scalability.}
 A scalable project can easily be used in arbitrarily large and/or complex projects.
-On a small scale, the criteria here are trivial to implement, but as the projects get more complex, it can become unsustainable.
+On a small scale, the criteria here are trivial to implement, but as the projects get more complex, an implementation can become unsustainable.

-\textbf{Criteria 5: Verifiable inputs and outputs.}
+\textbf{Criterion 5: Verifiable inputs and outputs.}
 The project should verify its inputs (software source code and data) \emph{and} outputs.
-Expert knowledge should not be required to confirm a reproduction, such that ``\emph{a clerk can do it}''\cite{claerbout1992}.
+Expert knowledge should not be required to confirm a reproduction; it should be possible for ``\emph{a clerk [to] do it}''\cite{claerbout1992}.

-\textbf{Criteria 6: History and temporal provenance.}
-No project is done in a single/first attempt.
+\textbf{Criterion 6: History and temporal provenance.}
+No exploratory research project is done in a single/first attempt.
 Projects evolve as they are being completed.
 It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
-This is often seen in exploratory research papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
+These types of research papers often report this with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
 The ``history'' is thus as valuable as the final/published version.

-\textbf{Criteria 7: Including narrative, linked to analysis.}
+\textbf{Criterion 7: Including narrative, linked to analysis.}
 A project is not just its computational analysis.
 A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
-A narrative description is also part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, possibly with respect to other projects/papers.
-This is related to longevity because if a workflow only contains the steps to do the analysis, or generate the plots, in time, it may be separated from its accompanying published paper.
+A narrative description is also part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
+This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may evolve to become separated from its accompanying published paper.

-\textbf{Criteria 8: Free and open source software:}
+\textbf{Criterion 8: Free and open source software:}
 Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box).
-This criteria is thus necessary to complement that definition (nature is already a black box).
-As free software, others can learn from, modify, and build upon a project.
-When the used software are also free,
+This criterion is necessary to complement that definition (nature is already a black box).
+If a project is free software (as formally defined), then others can learn from, modify, and build on it.
+When the software is free:
 (1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level.
-(2) Their source can be modified to work on a future hardware by others.
-(3) A non-free software typically cannot be distributed by others, making it reliant on a single server (even without payments).
+(2) The source can be modified to work on future hardware.
+In contrast, a non-free software package typically cannot be distributed by others, making it reliant on a single server (even without payments).