-rw-r--r-- | paper.tex | 49 |
1 files changed, 25 insertions, 24 deletions
@@ -175,11 +175,11 @@ Finally, in Appendix \ref{appendix:softwareacknowledge} we acknowledge the vario
-\section{Definitions of important terms}
+\section{Definition of important terms}
 \label{sec:definitions}
 The concepts and terminologies of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution providers.
-It is therefore important to clarify the specific terms used throughout this paper and its appendix, before starting the technical details.
+As a consequence, before starting with the technical details it is important to clarify the specific terms used throughout this paper and its appendix.
@@ -188,17 +188,16 @@ It is therefore important to clarify the specific terms used throughout this pap
 \subsection{Definition: input}
 \label{definition:input}
 Any computer file that may be usable in more than one project.
-The inputs of a project include data, software source code, or etc.
-See \citet{hinsen16} on the fundamental similarity of data and source code.
-Inputs may be encoded in plain text (for example tables of comma-separated values, CSV, or processing scripts) or custom binary formats, for example JPEG images, or domain-specific data formats \citep[e.g., FITS in astronomy, see][]{pence10}.
+The inputs of a project include data, software source code, etc. (see \citet{hinsen16} on the fundamental similarity of data and source code).
+Inputs may be encoded in plain text (for example tables of comma-separated values, CSV, or processing scripts), custom binary formats (for example JPEG images), or domain-specific data formats \citep[e.g., FITS in astronomy, see][]{pence10}.
-Inputs may have initially been created/writted (e.g., software soure code) or collected (e.g., data) for one specific project.
+Inputs may have initially been created/written (e.g., software soure code) or collected (e.g., data) for one specific project.
 However, they can, and most often will, be used in other/later projects also.
-Following the principle of modularity, it is therefore optimal to treat the inputs of any project as independent entities, not mixing them with how they are managed (how software is run on the data) within the project (see \ref{definition:project}).
+Following the principle of modularity, it is therefore optimal to treat the inputs of any project as independent entities, not mixing them with how they are managed (how software is run on the data) within the project (see Section \ref{definition:project}).
-Inputs are nevertheless necessary for building and running any project project.
+Inputs are nevertheless necessary for building and running any project.
 Some inputs may already be archived/published independently prior to the project's publication.
-In this case, they can easily be downloaded by independent projects and used.
+In this case, they can easily be downloaded and used by independent projects.
 Otherwise, they can be published with the project, but as independent files, for example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19}.
@@ -208,12 +207,12 @@ Otherwise, they can be published with the project, but as independent files, for
 \subsection{Definition: output}
 \label{definition:output}
 Any computer file that is published at the end of the project.
-The output(s) can be datasets (terrabyte-sized, small table(s) or image(s), a single number, a true/false (boolian) outcome), automatically generated software source code, or any other file.
+The output(s) can be datasets (terabyte-sized, small table(s) or image(s), a single number, a true/false (boolean) outcome), automatically generated software source code, or any other file.
 The raw output files are commonly supplemented with a paper/report that summarizes them in a human-friendly readable/printable/narrative format.
 The report commonly includes highlights of the input/output datasets (or intermediate datasets) as plots, figures, tables or simple numbers blended into the text.
 The outputs can either be published independently on data servers which assign specific persistent identifiers (PIDs) to be cited in the final report or published paper (in a journal for example).
-Alternatively, the datasets can be published with the project source, for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}.
+Alternatively, the datasets can be published with the project source, see for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}.
@@ -221,14 +220,14 @@ Alternatively, the datasets can be published with the project source, for exampl
 \subsection{Definition: project}
 \label{definition:project}
-The most high-level series of operations that are done on input(s) to produce outputs.
+The most high-level series of operations that are done on input(s) to produce the output(s).
 Because the project's report is also defined as an output (see above), besides the high-level analysis, the project's source also includes scripts/commands to produce plots, figures or tables.
 With this definition, this concept of a ``project'' is similar to ``workflow''.
 However, it is important to emphasize that the project's source code and inputs are distinct entities.
 For example the project may be written in the same programming language as one analysis step.
 Generally, the project source is defined as the most high-level source file that is unique to that individual project (its language is irrelevant).
-The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis itself.
+The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis by itself.
 A good project will follow the modularity principle: analysis scripts should be well-defined as an independently managed software source.
 For example modules in Python, packages in R, or libraries/programs in C/C++ that can be imported in higher-level project sources.
@@ -240,7 +239,7 @@ For example modules in Python, packages in R, or libraries/programs in C/C++ tha
 Data provenance is a very generic term which points to slightly different technical concepts in different fields like databases, storage systems and scientific workflows.
 For example within a database, an SQL query from a relational database connects a subset of the database entries to the output (\emph{why-} provenance), their more detailed dependency (\emph{how-} provenance) and the precise location of the input sources (\emph{where-} provenance), for more see \citet{cheney09}.
-In scientific workflows provenance goes beyond a single database and its datasets, but may includes many databases that aren't directly linked, the higher-level project specific analysis that is done on the data, and linking of the analysis to the text of the paper, for example see \citet{bavoil05, moreau08, malik13}.
+In scientific workflows, provenance goes beyond a single database and its datasets, but may includes many databases that aren't directly linked, the higher-level project specific analysis that is done on the data, and linking of the analysis to the text of the paper, for example see \citet{bavoil05, moreau08, malik13}.
 Here, we define provenance to be the common factor of the usages above: a dataset's provenance is the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it.
 Data provenance thus provides a high-level view of the data's genealogy.
@@ -258,26 +257,27 @@ Data provenance thus provides a high-level view of the data's genealogy.
 % Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. "
 Data lineage is commonly used interchangeably with Data provenance \citep[for example][\tonote{among many others, just search ``data lineage'' in scholar.google.com}]{cheney09}.
 However, for clarity, in this paper we refer to the term ``Data lineage'' as a low-level and fine-grained recording of the data's source, and operations that occur on it, down to the exact command that produced each intermediate step.
-This \emph{recording} doesn't necessarily have to be in a formal metadata model.
+This \emph{recording} does not necessarily have to be in a formal metadata model.
 But data lineage must be complete (see completeness principle in Section \ref{principle:complete}), and allow extraction of data provenance metadata, and thus higher-level operations like visualization of the workflow.
-\subsection{Definition: reproducibility \& replicability}
+\subsection{Definition: reproducibility and replicability}
 \label{definition:reproduction}
 These terms have been used in the literature with various meanings, sometimes in a contradictory way.
 It is therefore necessary to clarify the precise usage of these terms in this paper.
-But before doing so, it is important to highlight that in this paper, we are only considering computational analysis, in other words, analysis after data has been collected and stored as a file on a filesystem.
+But before that, it is important to highlight that in this paper we are only considering computational analysis. In other words, analysis after data has been collected and stored as a file on a filesystem.
 Therefore, many of the definitions reviewed in \citet{plesser18}, that are about data collection, are out of context here.
 We adopt the same definition of \citet{leek17,fineberg19}, among others:
 \begin{itemize}
 \item {\bf\small Reproducibility:} (same inputs $\rightarrow$ consistent result).
  Formally: ``obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis'' \citep{fineberg19}.
- This is thus synonymous with ``computational reproducibility''.
+ This is thus synonymous of ``computational reproducibility''.
  \citet{fineberg19} allow non-bitwise or non-identical numeric outputs within their definition of reproducibility, but they also acknowledge that this flexibility can lead to complexities: what is an acceptable non-identical reproduction?
  Exactly reproducible outputs can be precisely and automatically verified without statistical interpretations, even in a very complex analysis (involving many CPU cores, and random operations), see Section \ref{principle:verify}.
  It also requires no expertise, as \citet{claerbout1992} put it: ``a clerk can do it''.
+ \tonote{Raul: I don't know if this is true... at least it needs a bit of training and an extra time. Maybe remove last phrase?}
  In this paper, unless otherwise mentioned, we only consider bitwise/exact reproducibility.
 \item {\bf\small Replicability:} (different inputs $\rightarrow$ consistent result).
@@ -285,20 +285,21 @@ We adopt the same definition of \citet{leek17,fineberg19}, among others:
 Generally, since replicability involves new data collection, it can be expensive.
 For example the ``Reproducibility Project: Cancer Biology'' initiative started in 2013 to replicate 50 high-impact papers in cancer biology\footnote{\url{https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-biology}}.
-Even with a funding of atleast \$1.3 million, it later shrunk to 18 projects \citep{kaiser18} due to very high costs.
-We also note that replicability doesn't have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}.
+Even with a funding of at least \$1.3 million, it later shrunk to 18 projects \citep{kaiser18} due to very high costs.
+We also note that replicability does not have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}.
 \end{itemize}
-Some have defined these terms in the opposite manner.
+\tonote{Raul: put white line to separate next paragraph from the previous list?}
+Some authors have defined these terms in the opposite manner.
 Examples include \citet{hinsen15} and the policy guidelines of the Association for Computing Machinery\footnote{\url{https://www.acm.org/publications/policies/artifact-review-badging}} (ACM, dated April 2018).
 ACM has itself adopted the 2008 definitions of Vocabulaire international de m\'etrologie (VIM).
 Besides the two terms above, ``repeatability'' is also sometimes used in regards to the concept discussed here and must be clarified.
-For example \citet{ioannidis2009} use ``repeatability'' to encompass both the terms above.
+For example, \citet{ioannidis2009} use ``repeatability'' to encompass both the terms above.
 However, the ACM/VIM definition for repeatability is ``a researcher can reliably repeat her own computation''.
 Hence, in the ACM terminology, the only difference between replicability and repeatability is the ``team'' that is conducting the computation.
-In the context of this paper inputs are precisely defined (Section \ref{definition:input}): files with specific/registered checksums.
-Therfore our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM.
+In the context of this paper, inputs are precisely defined (Section \ref{definition:input}): files with specific/registered checksums (see Section \ref{principle:verify}).
+Therefore, our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM.