aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--paper.tex11
1 files changed, 6 insertions, 5 deletions
diff --git a/paper.tex b/paper.tex
index f8da400..d726735 100644
--- a/paper.tex
+++ b/paper.tex
@@ -166,9 +166,8 @@ Many data-intensive projects commonly involve dozens of high-level dependencies,
\section{Proposed criteria for longevity}
-
The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it in the end \cite{austin17,fineberg19}.
-Researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
+Researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintaining it by repository staff is often either practically infeasible or unscalable.
In this paper we argue that workflows satisfying the criteria below can improve researcher workflows during the project, reduce the cost of curation for repositories after publication, while maximizing the FAIRness of the deliverables for future researchers.
@@ -270,7 +269,7 @@ However, due to the problems above, our implementation follows a more abstract d
Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through \LaTeX{} macros, which are automatically defined during the analysis.
For example, in the abstract of \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
-The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is set during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built.
+The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis, and expands to the value `\inlinecode{0.25}' when the PDF output is built.
Since values like this depend on the analysis, they should be reproducible, along with figures and tables.
These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
This allows accurate provenance \emph{and} automatic updates to the text when necessary.
@@ -281,7 +280,7 @@ Let's call this \inlinecode{paper.pdf}.
The files hosting the macros of each analysis step (which produce numbers, tables, figures included in the report) build the core structure (skeleton) of Maneage.
For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
These are combined for generating precise software acknowledgment and citation (see \cite{akhlaghi19, infante20}; these software acknowledgments are excluded here due to the strict word limit).
-These files act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+These files act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel with no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
Software dependencies are built down to precise versions of the shell, POSIX tools (e.g., GNU Coreutils), \TeX{}Live and etc for an \emph{almost} exact reproducible environment.
On GNU/Linux operating systems, the C compiler is also built from source and the C library is being added (task 15390) for exact reproducibility.
Fast relocation of a project (without building from source) can be done by building the project in a container or VM.
@@ -425,6 +424,7 @@ Firstly, while most researchers are generally familiar with them, the necessary
Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving on to the next technology.
+
Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} a modular lineage for analysis.
We have seen that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these essential tools for modern science than having abstract, isolated tutorials on each tool individually.
@@ -452,7 +452,8 @@ As another example, when a bug is found in one software package, all affected pr
Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point).
Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders.
-From the data repository perspective these criteria can also be very useful, for example with regard to the challenges mentioned in \cite{austin17}:
+From the data repository perspective, these criteria can also be very useful.
+For example, with regard to the challenges mentioned in \cite{austin17}:
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, improving sustainability.
(2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control.
(3) Software management.