-rw-r--r-- paper.tex | 33
1 files changed, 17 insertions, 16 deletions
diff --git a/paper.tex b/paper.tex
index 9ec7eb4..97c0980 100644
--- a/paper.tex
+++ b/paper.tex
@@ -24,7 +24,7 @@
-\title{Maneage: Customizable framework for Managing Data Lineage}
+\title{Maneage: Customizable Framework for Managing Data Lineage}
\author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2},
\large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2},
\large\mpregular \authoraffil{Roberto Baena-Gall\'e}{1,2}\\
@@ -51,9 +51,9 @@
In the absence of reproducibility, as a test on understanding data lineage, the result will be subject to perpetual debate.
To address this problem, we introduce Maneage (management + lineage) which has already been tested and used in several scientific papers.
Maneage is founded on the principles of completeness (e.g., no dependency beyond a POSIX-compatible operating system, no administrator privileges, or no network connection), modular and straight-forward design, temporal lineage and free software.
- A project using Maneage is fully stored in in machine\--action\-able, and human\--read\-able, plain-text format, facilitating version-control, publication, archival, or automatic parsing to extract data provenance.
+ A project using Maneage is fully stored in machine\--action\-able, and human\--read\-able plain-text format, facilitating version-control, publication, archival, or automatic parsing to extract data provenance.
The provided lineage is not limited to high-level processing, but also includes building the necessary software from source with fixed versions and build configurations.
- Additionally, the project's final visualizations and narrative report are also included, establishing direct, and parse-able, links between the data analysis and the narrative or plots, with the precision of a word in a sentence or a point in a plot.
+ Additionally, the project's final visualizations and narrative report are also included, establishing direct links between the data analysis and the narrative or plots, with the precision of a word in a sentence or a point in a plot.
Maneage enables incremental projects, where a new project can branch off an existing one, with moderate changes to enable experimentation on published methods.
Once Maneage is implemented in a sufficiently wide scale, it can aid in automatic and optimized workflow creation through machine learning, or automating data management plans.
Maneage was a recipient of the research data alliance (RDA) Europe Adoption Grant in 2019 and is also used to write this paper, with snapshot \projectversion.
@@ -106,11 +106,11 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical
The reason such reports are mostly from genomics and bioinformatics is because they have traditionally been more open to publishing workflows: for example \href{https://www.myexperiment.org}{myexperiment.org}, or \href{https://www.genepattern.org}{genepattern.org}, \href{https://galaxyproject.org}{galaxy\-project.org}, and others.
Such integrity checks are a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows).
-The status in other fields, where workflows are not commonly shared, is probably (much!) worse.
+The status in other fields, where workflows are not commonly shared, is probably (much) worse.
-The completeness of a paper's published metadata (or ``Methods'' section) can be measured by the ability reproduce the result without needing to contact the authors.
+The completeness of a paper's published metadata (or ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors.
Several studies have attempted to answer this with different levels of detail, for example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in medical/biological field and is currently above $80\%$.
-\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but only fully succeeded in 2 of them and partially in 6.
+\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but only fully succeeded in 2 of them and partially in 6.
\citet{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all.
\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
Even though the authors were contacted, the success rate was $26\%$.
@@ -119,13 +119,13 @@ Generally, this problem is unambiguously felt in the community: \citet{baker16}
This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}.
The proposed solutions were published in a special edition.
Before that, in an attempt to simulate research projects, \citet{ioannidis05} proved that ``most claimed research findings are false''.
-In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} describe this same problem very eloquently and also provided some solutions they used.
+In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} describe the same problem very eloquently and also provided some solutions they used.
While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists.
Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
-One of the earliest such efforts we found was \citet{roberts69} who discussed conventions in FORTRAN programming and documentation to help in publishing research codes.
+One of the earliest such efforts we found was \citet{roberts69}, who discussed conventions in FORTRAN programming and documentation to help in publishing research codes.
In this paper, we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies.
-A project using Maneage will start by branching from the main Git branch of Maneage and starts customizing it: specifying the necessary software tools for that particular project, adding analysis steps and writing a narrative based on the analysis results.
+A project using Maneage will start by branching from the main Git branch of Maneage and starts customizing itself: specifying the necessary software tools for that particular project, adding analysis steps and writing a narrative based on the analysis results.
The temporal provenance of the project is fully preserved in Git, and allows merging of the project with the core branch to update the low-level infra-structure (common to all projects) without changing the high-level steps specific to this project.
In Sections \ref{sec:definitions} \& \ref{sec:principles} the basic concepts are defined and the founding principles of Maneage are discussed.
Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects.
@@ -151,10 +151,10 @@ As a consequence, before starting with the technical details it is important to
A project is the series of operations that are done on input(s) to produce outputs.
This definition is therefore very similar to ``workflow'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, a project also includes the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how the visualizations in it were created.
- In a good project, all analysis scripts (e.g., written in Python, packages in R, or libraries/programs in C/C++, or etc) are well-defined as an independently managed software with clearly defined inputs, outputs and no side-effects.
+ In a good project, all analysis scripts (e.g., written in Python, packages in R, libraries/programs in C/C++, etc.) are well-defined as an independently managed software with clearly defined inputs, outputs and no side-effects.
This greatly helps in debugging and experimentation during the project, and their re-usability in later projects.
- Hence such analysis scripts/programs are defined above as ``inputs'' to the project.
- A project hence doesn't include any analysis source code (to the extent possible), it only manages calls to them.
+ As a consequence, such analysis scripts/programs are defined above as ``inputs'' for the project.
+ A project hence does not include any analysis source code (to the extent possible), it only manages calls to them.
\item \label{definition:provenance}\textbf{Data Provenance:}
A dataset's provenance is defined as the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it.
@@ -171,7 +171,7 @@ As a consequence, before starting with the technical details it is important to
% Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. "
\item \label{definition:lineage}\textbf{Data Lineage:}
Data lineage is commonly used interchangeably with Data provenance \citep[for example][]{cheney09}.
-For clarity, we define term ``Data lineage'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands).
+For clarity, we define the term ``Data lineage'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands).
Therefore, data lineage is synonymous with ``project'' as defined above.
\item \label{definition:reproduction}\textbf{Reproducibility \& Replicability:}
These two terms have been used in the literature with various meanings, sometimes in a contradictory way.
@@ -203,7 +203,7 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora
\begin{enumerate}[label={\bf P\arabic*}]
\item \label{principle:complete}\textbf{Complete:}
- A project that is complete, or self-contained, does not depend on anything beyond the Portable operating system Interface (POSIX), does not affect the host system, does not require root/administrator privileges, does not need an internet connection (when its inputs are on the file-system), and is stored in a format that doesn't require any software beyond POSIX tools to open, parse or execute.
+ A project that is complete, or self-contained, does not depend on anything beyond the Portable operating system Interface (POSIX), does not affect the host system, does not require root/administrator privileges, does not need an internet connection (when its inputs are on the file-system), and is stored in a format that does not require any software beyond POSIX tools to open, parse or execute.
A complete project can automatically access to the inputs (see definition \ref{definition:input}), build its necessary software (instructions on configuring, building and installing those software in a fixed environment), do the analysis (run the software on the data) and create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML).
No manual/human interaction is required within a complete project, as \citet{claerbout1992} put it: ``a clerk can do it''.
@@ -228,7 +228,7 @@ In a modular project, communication between the independent modules is explicit,
3) Citation: allowing others to credit specific parts of a project.
4) Usage in other projects.
-\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools leaving such design choices to the experience of project authors.
+\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools leave such design choices to the experience of project authors.
However, designing a modular project needs to be encouraged and facilitated, otherwise, scientists (that are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost or scientific accuracy.
\item \label{principle:complexity}\textbf{Minimal complexity:}
@@ -238,6 +238,7 @@ However, designing a modular project needs to be encouraged and facilitated, oth
The same job can be done with more stable/basic tools, and less effort in the long-run.
\emph{Comparison with existing:} Most of the existing tools use the language that was in vogue when they were created, for example, a larger fraction of them are written in Python as we come closer to the present time.
+ \tonote{Raul: Maybe we can avoid the next sentences. That project is not the only one with this problem, and explicity point it could make think to someone that there are something personal here.}
Again, IPOL stands out from the rest in this principle.
\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
@@ -258,7 +259,7 @@ A project's ``history'' is thus as scientifically relevant as the final, or publ
However, because they are rarely complete (as discussed in principle \ref{principle:complete}), this history is also not complete.
IPOL, which uniquely stands out in other principles, fails here: only the final snapshot is published.
-\item \label{principle:freesoftware}\textbf{Free and open source software}
+\item \label{principle:freesoftware}\textbf{Free and open source software:}
Technically, as defined in Section \ref{definition:reproduction}, reproducibility is also possible with a non-free or non-open-source software (a black box).
This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and the industry:
1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source.