Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex | 26
1 file changed, 19 insertions, 7 deletions
diff --git a/paper.tex b/paper.tex
index 13534e0..7bd6b14 100644
--- a/paper.tex
+++ b/paper.tex
@@ -26,7 +26,7 @@
\title{Maneage: Customizable Template for Managing Data Lineage}
\author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2},
- \large\mpregular \authoraffil{Raúl Infante-Sainz}{1,2}\\
+ \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2}\\
{
\footnotesize\mplight
\textsuperscript{1} Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200 La Laguna, Tenerife, ES.\\
@@ -49,10 +49,10 @@
The era of big data has also ushered in an era of big responsibility.
Without it, the integrity of the results will be a subject of perpetual debate.
In this paper, Maneage (management + lineage) is introduced as a low-level solution to this problem.
- It is designed considering the following principles: complete (e.g., not requiring anything beyond a POSIX-compatible system), modular, fully in plain-text, minimal complexity in design, verifiable inputs and outputs, temporal lineage/provenance, and free software (in scientific applications).
- A project using Maneage will have full control over the data lineage, making it exactly reproducible.
- This control goes as far back as the automatic downloading of input data, and automatic building of necessary software that are used in the analysis.
- It also contains the narrative description of the final project's report (built into a PDF).
+ It is designed considering the following principles: complete (e.g., requiring no dependencies beyond a POSIX-compatible system, no administrator privileges and no network connection), modular, fully in plain text, minimal complexity in design, verifiable inputs and outputs, temporal lineage/provenance, and free software (in scientific applications).
+ A project that uses Maneage will be able to publish the complete data lineage, making it exactly reproducible (exact reproducibility serves as a test that the data lineage has been sufficiently conveyed).
+ The lineage goes as far back as the automatic downloading of input data, and the automatic building of the necessary software (with fixed versions and build configurations) that is used in the analysis.
+ It also contains the narrative description of the final project's report (built into a PDF), while providing automatic and direct links between the analysis and the part of the narrative description in which it is used.
Adopting Maneage on a wide scale will greatly improve scientific collaboration and make it easier to build upon the work of other researchers, replacing the technical frustrations that many researchers currently experience and that can affect their scientific results and interpretations.
It can also be used on more ambitious projects like automatic workflow creation through machine learning tools, or automating data management plans.
As a demonstration, this paper has itself been generated with Maneage (snapshot \projectversion).
@@ -78,7 +78,12 @@
The increasing volume and complexity of data analysis has been highly productive, giving rise to a new branch of ``Big Data'' in many fields of the sciences and industry.
However, given its inherent complexity, the results alone are barely useful.
-Questions such as these commonly follow any such result: What inputs were used? What operations were done on those inputs? How were the configurations or training data chosen? How did the quantiative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation? See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
+Questions such as these commonly follow any such result:
+What inputs were used?
+What operations were done on those inputs? How were the configurations or training data chosen?
+How did the quantitative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation?
+Could there be a bias in the visualization?
+See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
In data science and database management, this type of metadata is commonly known as \emph{data provenance} or \emph{data lineage}.
Their definitions are elaborated with other basic concepts in Section \ref{sec:definitions}.
@@ -148,6 +153,8 @@ However, this is not a practical solution because software updates are necessary
Generally, software is not a secular component of projects, where one software package can easily be swapped with another.
Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citep{dicosmo19}.
+\tonote{add a short summary of the advantages of Maneage.}
+
This paper introduces Maneage as a solution to these important issues.
Section \ref{sec:definitions} defines the necessary concepts and terminology used in this paper, leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}.
Section \ref{sec:maneage} introduces the implementation of Maneage, going into lower-level details in some cases.
@@ -269,10 +276,15 @@ But before that, it is important to highlight that in this paper we are only con
Therefore, many of the definitions reviewed in \citet{plesser18}, that are about data collection, are out of context here.
We adopt the same definition of \citet{leek17,fineberg19}, among others:
+%% From Zahra Sharbaf:
+%% According to the U.S. National Science Foundation (NSF), the definition of reproducibility is “reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator.
+%% That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results….
+%% Reproducibility is a minimum necessary condition for a finding to be believable and informative.”(K. Bollen, J. T. Cacioppo, R. Kaplan, J. Krosnick, J. L. Olds, Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science (National Science Foundation, Arlington, VA, 2015)).
+
\begin{itemize}
\item {\bf\small Reproducibility:} (same inputs $\rightarrow$ consistent result).
Formally: ``obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis'' \citep{fineberg19}.
- This is thus synonymous of ``computational reproducibility''.
+ This is thus synonymous with ``computational reproducibility''.
\citet{fineberg19} allow non-bitwise or non-identical numeric outputs within their definition of reproducibility, but they also acknowledge that this flexibility can lead to complexities: what is an acceptable non-identical reproduction?
Exactly reproducible outputs can be precisely and automatically verified without statistical interpretations, even in a very complex analysis (involving many CPU cores, and random operations); see Section \ref{principle:verify}.