-rw-r--r-- paper.tex 44
1 files changed, 20 insertions, 24 deletions
diff --git a/paper.tex b/paper.tex
index fbb5c1f..43413a9 100644
--- a/paper.tex
+++ b/paper.tex
@@ -112,7 +112,7 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical
Nature is already a black box which we are trying hard to unlock, or understand.
Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers.
-\citet{miller06} found that a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science. \hl{I think this sentence is bad written, not sure}}
+\citet{miller06} found a mistaken column flipping in a project's workflow, leading to retraction of 5 papers in major journals, including Science.
\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions.
The reason such reports are mostly from genomics and bioinformatics is because they have traditionally been more open to publishing workflows: for example \href{https://www.myexperiment.org}{myexperiment.org}, which mostly uses Apache Taverna \citep{oinn04}, or \href{https://www.genepattern.org}{genepattern.org} \citep{reich06}, \href{https://galaxyproject.org}{galaxy\-project.org} \citep{goecks10}, among others.
@@ -145,7 +145,7 @@ Projects are built around specific software technologies, and research in softwa
In this paper, we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies.
A project using Maneage will start by branching from the main Git branch of Maneage and starts customizing it: specifying the necessary software tools for that particular project, adding analysis steps and writing a narrative based on the analysis results.
The temporal provenance of the project is fully preserved in Git, and allows merging of the project with the core branch to update the low-level infra-structure (common to all projects) without changing the high-level steps specific to this project.
-In Section \ref{sec:d-and-p} \hl{the section label is missed} the basic concepts are defined and the founding principles of Maneage are discussed.
+In Sections \ref{sec:definitions} \& \ref{sec:principles} the basic concepts are defined and the founding principles of Maneage are discussed.
Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects.
@@ -153,31 +153,29 @@ Section \ref{sec:maneage} describes the internal structure of Maneage and Sectio
\label{sec:definitions}
The concepts and terminologies of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution providers.
-As a consequence, before starting with the technical details it is important to clarify the specific terms used throughout this paper and its appendix. \hl{do you still have appendices (in pl if more than one)?}
+As a consequence, before starting with the technical details it is important to clarify the specific terms used.
\begin{enumerate}[label={\bf D\arabic*}]
\item \label{definition:input}\textbf{Input:}
- \hl{Here, we define as input a}ny computer file needed by a project that may be usable in other projects.
+ A project's input is any computer file that may be usable in other projects.
The inputs of a project include data or software source code, see \citet{hinsen16} on the fundamental similarity of data and source code.
- Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project.
- However, they can, and most often will, be used in other/later projects too.
+ Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project; however, they can, and most often will, be used in later projects too.
\item \label{definition:output}\textbf{Output:}
- \hl{The output is a}ny computer file that is published at the end of the project.
- The output(s) of a project can be a published narrative paper, datasets (e.g., table(s), image(s), a number, or Boolean: confirming a hypothesis as true or false), automatically generated software source code, or any other computer file.
+ A project's output is any computer file that is published at the end.
+ The output(s) of a project can be a narrative paper or report with visualizations, datasets (e.g., table(s), image(s), a number, or Boolean: confirming a hypothesis as true or false), automatically generated software source code, or any other computer file.
\item \label{definition:project}\textbf{Project:}
- \hl{The project is t}he high-level series of operations that are done on input(s) to produce outputs.
- This definition is therefore very similar to ``workflow'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, the project defined here also includes the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how the visualizations in it were created.
+ A project is the series of operations that are done on input(s) to produce outputs.
+ This definition is therefore very similar to ``workflow'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, a project also includes the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how the visualizations in it were created.
- The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis by itself.
- A good project will follow the modularity principle: analysis scripts should be well-defined as an independently managed software source with clearly defined inputs and outputs.
- For example, modules in Python, packages in R, or libraries/programs in C/C++, which can be executed by the higher-level project source when necessary.
- Maintaining these lower-level components as independent software projects enables their easy usage in other projects.
- Therefore, they are defined as inputs (not the project).
+ In a good project, all analysis scripts (e.g., scripts written in Python, packages in R, or libraries/programs in C/C++) are well-defined as independently managed software with clearly defined inputs and outputs and no side-effects.
+ This greatly helps debugging and experimentation during the project, and makes the scripts re-usable in later projects.
+ Hence such analysis scripts/programs are defined above as ``inputs'' to the project.
+ A project, as defined here, does not include any analysis source code (to the extent possible); it only contains calls to it.
\item \label{definition:provenance}\textbf{Data Provenance:}
- A dataset's provenance is \hl{is defined as} the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it.
+ A dataset's provenance is defined as the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it.
Data provenance thus provides a high-level \emph{and structured} view of a project's lineage.
A good example of this is Research Objects \citep{belhajjame15}.
@@ -341,9 +339,8 @@ Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} fil
The \inlinecode{project} script is a high-level wrapper to interface with Maneage and in its current implementation has two main phases as shown below.
As seen below, a project's operations are broken-up into two phases: 1) configuration, where the necessary software are built and the environment is setup. 2) analysis, where data are accessed and the software is run on them to create visualizations and the final report.
-\hl{Below. first, I guess 2 hours depend on the machine. Second, shouldn't you include here the prepare step too?}
\begin{lstlisting}[language=bash]
- ./project configure # Build software from source (takes around 2 hours for full build).
+ ./project configure # Build all necessary software from source.
./project make # Do the analysis (download data, run software on data, build PDF).
\end{lstlisting}
@@ -394,7 +391,7 @@ Therefore, a researcher already using Maneage easily understands and can customi
Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure \ref{fig:files}.
At the start of project configuration, Maneage needs a top-level directory to build itself on the host filesystem (software and analysis).
-We call this the ``build directory'' (or \hl{the so-called} \inlinecode{BDIR}) and it must not be under the source directory (see \ref{principle:modularity}).
+We call this the ``build directory'' and it must not be under the source directory (see \ref{principle:modularity}).
No other location on the running operating system will be affected by the project and it should not affect the result, so its value is not under version control.
Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally and do not need to be downloaded: 1) software tarball directory and 2) input data directory.
Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} elaborate more on the building of the necessary software and the important problem of software citation.
@@ -430,7 +427,7 @@ This is particularly important in the case for research software, where the rese
One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts.
This does not cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs.
Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros!
-This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] \hl{why in brackets?} if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''.
+This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''.
In bug 905674\footnote{Debian bug on the citation notice of GNU Parallel: \url{https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905674}}, the Debian developers argued that because of this extra condition, GNU Parallel should not be considered as free software, and they are using a patch to remove that part of the code for its build under Debian-based OSs.
Most other research software do not resort to such drastic measures, however, citation is important for them.
Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper.
@@ -552,8 +549,7 @@ Note that their location after the standard starting subMakefiles (initializatio
\end{center}
\vspace{-5mm}
\caption{\label{fig:toolsperyear}Fraction of papers mentioning software tools (green line, left vertical axis) to total number of papers studied in that year (light red bars, right vertical axis in log-scale).
- Data from \citet{menke20}. \hl{Maybe say here also in the caption this is a replica of figure 1C from menke20 done with Maneage }
- The subMakefile archiving the executable lineage of figure's data is shown in Figure \ref{fig:demoplotsrc} and discussed in Section \ref{sec:analysis}.
+ This is an enhanced replica of figure 1C of \citet{menke20}, shown here to demonstrate Maneage; see Figure \ref{fig:datalineage} for its lineage and Section \ref{sec:analysis} for how it was organized.
}
\end{figure}
@@ -729,7 +725,7 @@ Therefore, there are various scenarios for the publication of the project as des
\item \textbf{Public Git repository:} This is the simplest publication method.
The project will already be on a (private) Git repository prior to publication.
In such cases, the private configuration can be removed so it becomes public.
- \item \textbf{In journal or PDF-only preprint systems (e.g., bioRxiv):} If the journal or pre-print server allows publication of small supplement files to the paper, the commit that produced the final paper can be submitted as a compressed file, for example, with the \hl{Something is missed}
+ \item \textbf{In journal or PDF-only preprint systems (e.g., bioRxiv):} If the journal or pre-print server allows publication of small supplement files to the paper, the project source can be submitted as a supplement.
\item \textbf{arXiv:} arXiv will run its own internal \LaTeX{} engine on the uploaded files and produce the PDF that is published.
When the project is published, arXiv also allows users to anonymously download the \LaTeX{} source tarball that the authors uploaded.
Therefore, simply uploading the tarball from the \inlinecode{./project make dist} command is sufficient.
@@ -743,7 +739,7 @@ Therefore, there are various scenarios for the publication of the project as des
However, based on the definition of inputs in Section \ref{definition:input}, they are usable in other projects: another project may use the same data or software source code, in a different way.
Therefore, even when published with the source, it is encouraged to publish them as separate files.
- For example, strategy was followed in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481} which supplements \citet{akhlaghi19} which contains the following files. \hl{Rewrite the sentence, I think something is missed and there are two \emph{which} very close each other}
+ For example, this strategy was followed in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481} which supplements \citet{akhlaghi19} and contains the following files.
\begin{itemize}
\item \textbf{Final PDF:} for easy understanding of the project.