| -rw-r--r-- | paper.tex | 105 | 
1 files changed, 55 insertions, 50 deletions
@@ -25,7 +25,8 @@
 \title{Maneage: Customizable Template for Managing Data Lineage}
-\author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2}\\
+\author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2},
+        \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2}\\
   {
     \footnotesize\mplight
     \textsuperscript{1} Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200 La Laguna, Tenerife, ES.\\
@@ -46,21 +47,19 @@
 %% Abstract
 {\noindent\mpregular
   The era of big data has also ushered an era of big responsability.
-  Without it, the integrity of the result will be a subject of perpetual debate.
-  In this paper Maneage is introduced as a low-level solution to this problem.
-  Maneage (management + lineage) is an executable workflow for project authors and readers in the sciences or the industry.
-  It is designed following principles: complete (e.g., not requiring anything beyond a POSIX-compatible system, administrator previlages or a network connection), modular, fully in plain-text, minimal complexity in design, verifiable inputs and outputs, temporal lineage/provenance, and free software (in scientific applications).
+  Without it, the integrity of the results will be a subject of perpetual debate.
+  In this paper, Maneage (management + lineage) is introduced as a low-level solution to this problem.
+  It is designed considering the following principles: complete (e.g., not requiring any dependencies beyond a POSIX-compatible system, administrator previlages or a network connection), modular, fully in plain-text, minimal complexity in design, verifiable inputs and outputs, temporal lineage/provenance, and free software (in scientific applications).
   A project that uses Maneage will be able to publish the complete data lineage, making it exactly reproducible (as a test on sufficiently conveying the data lineage).
-  This control goes as far back as the automatic downloading of input data, and automatic building of necessary software that are used to analyze the data, with fixed versions and build configurations.
+  This control goes as far back as the automatic downloading of input data, and automatic building of necessary software (with fixed versions and build configurations) that are used in the analysis.
   It also contains the narrative description of the final project's report (built into a PDF), while providing automatic and direct links between the analysis and the part of the narrative description that it was used.
-  Also, starting new projects, or editing previously published papers is trivial because of its version control system.
-  If adopted on a wide scale, Maneage can greatly improve scientific collaborations and building upon the work of other researchers instead of the current technical frustrations many researchers experience and can affect their scientific result and interpretations.
+  Adopting Maneage on a wide scale will greatly improve scientific collaborations and building upon the work of other researchers, instead of the current technical frustrations that many researchers experience and can affect their scientific result and interpretations.
   It can also be used on more ambitious projects like automatic workflow creation through machine learning tools, or automating data management plans.
-  This paper has itself been written in Maneage (snapshot \projectversion).
+  As a demostration, this paper has itself been generated with Maneage (snapshot \projectversion).
   \horizontalline
   \noindent
-  {\mpbold Keywords:} Data Lineage, Data Provenance, Reproducibility, Workflows, scientific pipelines
+  {\mpbold Keywords:} Data Lineage, Data Provenance, Reproducibility, Scientific Pipelines, Workflows
 }
 \horizontalline
@@ -79,11 +78,16 @@
 The increasing volume and complexity of data analysis has been highly productive, giving rise to a new branch of ``Big Data'' in many fields of the sciences and industry.
 However, given its inherent complexity, the mere results are barely useful alone.
-Questions such as these commonly follow any such result: What inputs were used? What operations were done on those inputs? How were the configurations or training data chosen? How did the quantiative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation (may there be a bias in the visualization)? See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
+Questions such as these commonly follow any such result:
+What inputs were used?
+What operations were done on those inputs? How were the configurations or training data chosen?
+How did the quantiative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation?
+May there be a bias in the visualization?
+See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
 In data science and database management, this type of metadata are commonly known as \emph{data provenance} or \emph{data lineage}.
 Their definitions are elaborated with other basic concepts in Section \ref{sec:definitions}.
-Data lineage is being increasingly demaded for integrity checking from both the scientific and industrial/legal domains.
+Data lineage is being increasingly demanded for integrity checking from both the scientific and industrial/legal domains.
 Notable examples in each domain are respectively the ``Reproducibility crisis'' in the sciences that was claimed by the Nature journal \citep{baker16}, and the General Data Protection Regulation (GDPR) by the European parliment and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively.
 The former argues that reproducibility (as a test on sufficiently conveying the data lineage) is necessary for other scientists to study, check and build-upon each other's work.
 The latter requires the data intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage.
@@ -93,7 +97,7 @@ In the sciences, the results of a project's analysis are published as scientific
 From our own experiences, this section is usually most discussed during peer review and conference presentations, showing its importance.
 After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology.
 In the industry however, data governance is usually kept as a trade secret and isn't publicly published or scrutinized.
-Therefore while the proposed approach introduced in this paper (Maneage) is also useful in industrial contexts, the main practical focus will be in the scientific front which has traditionally been more open to publishing the methods and anonymous peer scrutiny.
+Therefore while the proposed approach introduced in this paper (Maneage) is also useful in industrial contexts, the main practical focus would be in the scientific front which has traditionally been more open to publishing the methods and anonymous peer scrutiny.
 \begin{figure}[t]
   \begin{center}
@@ -108,29 +112,29 @@ Therefore while the proposed approach introduced in this paper (Maneage) is also
 \end{figure}
 The traditional format of a scientific paper has been very successful in conveying the method with the result in the last centuries.
-However, the complexity mentioned above has made it impossible to describe all the analytical steps of a project to a sufficient level of detail, in the traditional format of a published paper.
+However, the complexity mentioned above has made it impossible to describe all the analytical steps of a project to a sufficient level of detail.
 Citing this difficulty, many authors suffice to describing the very high-level generalities of their analysis, while even the most basic calculations (like the mean of a distribution) can depend on the software implementation.
 Due to the complexity of modern scientific analysis, a small deviation in the final result can be due to many different steps, which may be significant.
 Publishing the precise codes of the analysis is the only guarantee.
 For example, \citet{smart18} describes how a 7-year old conflict in theoretical condensed matter physics was only identified after the relative codes were shared.
 Nature is already a black box which we are trying hard to unlock, or understand.
-Not being able to experiment on the methods of other researchers is an artificial and self-imposed back box, wrapped over the original, and taking most of the energy of fellow researchers.
+Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers.
-\citet{miller06} found that a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science.
-\citet{baggerly09} highlighted the inadequet narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
+\citet{miller06} found that a mistaken column flipping caused the retraction of 5 papers in major journals, including Science.
+\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
 \citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions.
 Such integrity checks tests are a critical component of the scientific method, but are only possible with access to the data and codes.
 The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (supposedly on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors?
-Several studies have attempted to answer this with differnet levels of detail.
-For example \citet{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved over the last two decades.
+Several studies have attempted to answer this question with differnet levels of detail.
+For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved over the last two decades.
-\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but only fully succeeded in 2 of them and partially in 6.
-\citet{chang15} attempted to reproduce 67 papers in well-regarded economics journals with data and code: only 22 could be reproduced without contacting authors, more than half couldn't be replicated at all.
+\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but, only fully succeeded in 2 of them and partially in 6.
+\citet{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all.
 \citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
 Even though the authors were contacted, the success rate was $26\%$.
-Generally, this problem is unambiguously felt in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ didn't see a ``reproducibility crisis''.
+Generally, this problem is unambiguously felt in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``reproducibility crisis''.
 This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}.
 The proposed solutions were published in a special edition.
@@ -139,22 +143,22 @@ Before that, \citet{ioannidis05} proved that ``most claimed research findings ar
 In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} describe this same problem very eloquently and also provided some solutions that they used.
 While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists.
 Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
-One of the earliest such efforts we found was \citet{roberts69} who discussed conventions in Fortran programming and documentation to help in publishing research codes.
+One of the earliest such efforts we found was \citet{roberts69}, who discussed conventions in Fortran programming and documentation to help in publishing research codes.
 From a practical point of view, for those who publish the data lineage, a major problem is the fast evolving and diverse software technologies and methodologies that are used by different teams in different epochs.
 \citet{zhao12} describe it as ``workflow decay'' and recommend preserving these auxilary resources.
-But in the case of software its not as streightforward as data: if preserved in binary form, software can only be run on certain hardware and if kept as source-code, their build dependencies and build configuration must also be preserved.
+But in the case of software, its not as straightforward as data: if preserved in binary form, software can only be run on certain hardware and if kept as source-code, their build dependencies and build configuration must also be preserved.
 \citet{gronenschild12} specifically study the effect of software version and environment and encourage researchers to not update their software environment.
-However, this is not a practical solution because software updates are necessary, atleast to fix bugs in the same research software.
+However, this is not a practical solution because software updates are necessary, at least to fix bugs in the same research software.
 Generally, software is not a secular component of projects, where one software can easily be swapped with another.
 Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citep{dicosmo19}.
 \tonote{add a short summary of the advantages of Maneage.}
 This paper introduces Maneage as a solution to these important issues.
-Section \ref{sec:definitions} defines the necessay concepts and terminology used in this paper leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}.
+Section \ref{sec:definitions} defines the necessay concepts and terminology used in this paper, leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}.
 Section \ref{sec:maneage} introduces the implementation of Maneage, going into lower-level details in some cases.
-Finally in Section \ref{sec:discussion} the future prospects of using systems like this template are discussed.
+Finally, in Section \ref{sec:discussion}, the future prospects of using systems like this template are discussed.
 After the main body, Appendix \ref{appendix:existingtools} reviews the most commonly used lower-level technologies used today.
 In light of the guiding principles, in Appendix \ref{appendix:existingsolutions} a critical review of many workflow management systems that have been introduced over the last three decades is given.
 Finally, in Appendix \ref{appendix:softwareacknowledge} we acknowledge the various software (with a name and version number) that were used for this project.
@@ -178,11 +182,11 @@ Finally, in Appendix \ref{appendix:softwareacknowledge} we acknowledge the vario
-\section{Definitions of important terms}
+\section{Definition of important terms}
 \label{sec:definitions}
 The concepts and terminologies of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution provides.
-It is therefore important to clarify the specific terms used throughout this paper and its appendix, before starting the technical details.
+As a consequence, before starting with the technical details it is important to clarify the specific terms used throughout this paper and its appendix.
@@ -191,17 +195,16 @@ It is therefore important to clarify the specific terms used throughout this pap
 \subsection{Definition: input}
 \label{definition:input}
 Any computer file that may be usable in more than one project.
-The inputs of a project include data, software source code, or etc.
-See \citet{hinsen16} on the fundamental similarity of data and source code.
-Inputs may be encoded in plain text (for example tables of comma-separated values, CSV, or processing scripts) or custom binary formats, for example JPEG images, or domain-specific data formats \citep[e.g., FITS in astronomy, see][]{pence10}.
+The inputs of a project include data, software source code, etc. (see \citet{hinsen16} on the fundamental similarity of data and source code).
+Inputs may be encoded in plain text (for example tables of comma-separated values, CSV, or processing scripts), custom binary formats (for example JPEG images), or domain-specific data formats \citep[e.g., FITS in astronomy, see][]{pence10}.
-Inputs may have initially been created/writted (e.g., software soure code) or collected (e.g., data) for one specific project.
+Inputs may have initially been created/written (e.g., software soure code) or collected (e.g., data) for one specific project.
 However, they can, and most often will, be used in other/later projects also.
-Following the principle of modularity, it is therefore optimal to treat the inputs of any project as independent entities, not mixing them with how they are managed (how software is run on the data) within the project (see \ref{definition:project}).
+Following the principle of modularity, it is therefore optimal to treat the inputs of any project as independent entities, not mixing them with how they are managed (how software is run on the data) within the project (see Section \ref{definition:project}).
-Inputs are nevertheless necessary for building and running any project project.
+Inputs are nevertheless necessary for building and running any project.
 Some inputs may already archived/published independently prior to the project's publication.
-In this case, they can easily be downloaded by independent projects and used.
+In this case, they can easily be downloaded and used by independent projects.
 Otherwise, they can be published with the project, but as independent files, for example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19}.
@@ -211,12 +214,12 @@ Otherwise, they can be published with the project, but as independent files, for
 \subsection{Definition: output}
 \label{definition:output}
 Any computer file that is published at the end of the project.
-The output(s) can be datasets (terrabyte-sized, small table(s) or image(s), a single number, a true/false (boolian) outcome), automatically generated software source code, or any other file.
+The output(s) can be datasets (terabyte-sized, small table(s) or image(s), a single number, a true/false (boolean) outcome), automatically generated software source code, or any other file.
 The raw output files are commonly supplemented with a paper/report that summarizes them in a human-friendly readable/printable/narrative format.
 The report commonly includes highlights of the input/output datasets (or intermediate datasets) as plots, figures, tables or simple numbers blended into the text.
 The outputs can either be published independently on data servers which assign specific persistant identifers (PIDs) to be cited in the final report or published paper (in a journal for example).
-Alternatively, the datasets can be published with the project source, for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}.
+Alternatively, the datasets can be published with the project source, see for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}.
@@ -224,14 +227,14 @@ Alternatively, the datasets can be published with the project source, for exampl
 \subsection{Definition: project}
 \label{definition:project}
-The most high-level series of operations that are done on input(s) to produce outputs.
+The most high-level series of operations that are done on input(s) to produce the output(s).
 Because the project's report is also defined as an output (see above), besides the high-level analysis, the project's source also includes scripts/commands to produce plots, figures or tables.
 With this definition, this concept of a ``project'' is similar to ``workflow''.
 However, it is imporant to emphasize that the project's source code and inputs are distinct entities.
 For example the project may be written in the same programming language as one analysis step.
 Generally, the project source is defined as the most high-level source file that is unique to that individual project (its language is irrelevant).
-The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis itself.
+The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis by itself.
 A good project will follow the modularity principle: analysis scripts should be well-defined as an independently managed software source.
 For example modules in Python, packages in R, or libraries/programs in C/C++ that can be imported in higher-level project sources.
@@ -243,7 +246,7 @@ For example modules in Python, packages in R, or libraries/programs in C/C++ tha
 Data provenance is a very generic term which points to slightly different technical concepts in different fields like databases, storage systems and scientific workflows.
 For example within a database, an SQL query from a relational database connects a subset of the database entries to the output (\emph{why-} provenance), their more detailed dependency (\emph{how-} provenance) and the precise location of the input sources (\emph{where-} provenance), for more see \citet{cheney09}.
-In scientific workflows provenance goes beyond a single database and its datasets, but may includes many databases that aren't directly linked, the higher-level project specific analysis that is done on the data, and linking of the analysis to the text of the paper, for example see \citet{bavoil05, moreau08, malik13}.
+In scientific workflows, provenance goes beyond a single database and its datasets, but may includes many databases that aren't directly linked, the higher-level project specific analysis that is done on the data, and linking of the analysis to the text of the paper, for example see \citet{bavoil05, moreau08, malik13}.
 Here, we define provenance to be the common factor of the usages above: a dataset's provenance is the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it.
 Data provenance thus provides a high-level view of the data's genealogy.
@@ -261,15 +264,15 @@ Data provenance thus provides a high-level view of the data's genealogy.
 % Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. "
 Data lineage is commonly used interchangably with Data provenance \citep[for example][\tonote{among many others, just search ``data lineage'' in scholar.google.com}]{cheney09}.
 However, for clarity, in this paper we refer to the term ``Data lineage'' as a low-level and fine-grained recording of the data's source, and operations that occur on it, down to the exact command that produced each intermediate step.
-This \emph{recording} doesn't necessarily have to be in a formal metadata model.
+This \emph{recording} does not necessarily have to be in a formal metadata model.
 But data lineage must be complete (see completeness principle in Section \ref{principle:complete}), and allow extraction of data provenance metadata, and thus higher-level operations like visualization of the workflow.
-\subsection{Definition: reproducibility \& replicability}
+\subsection{Definition: reproducibility and replicability}
 \label{definition:reproduction}
 These terms have been used in the literature with various meanings, sometimes in a contradictory way.
 It is therefore necessary to clarify the precise usage of this term in this paper.
-But before doing so, it is important to highlight that in this paper, we are only considering computational analysis, in other words, analysis after data has been collected and stored as a file on a filesystem.
+But before that, it is important to highlight that in this paper we are only considering computational analysis. In other words, analysis after data has been collected and stored as a file on a filesystem.
 Therefore, many of the definitions reviewed in \citet{plesser18}, that are about data collection, are out of context here.
 We adopt the same definition of \citet{leek17,fineberg19}, among others:
@@ -286,6 +289,7 @@ We adopt the same definition of \citet{leek17,fineberg19}, among others:
   \citet{fineberg19} allow non-bitwise or non-identical numeric outputs within their definition of reproducibility, but they also acknowledge that this flexibility can lead to complexities: what is an acceptable non-identical reproduction?
   Exactly reproducbile outputs can be precisely and automatically verified without statistical interpretations, even in a very complex analysis (involving many CPU cores, and random operations), see Section \ref{principle:verify}.
   It also requires no expertise, as \citet{claerbout1992} put it: ``a clerk can do it''.
+  \tonote{Raul: I don't know if this is true... at least it needs a bit of training and an extra time. Maybe remove last phrase?}
   In this paper, unless otherwise mentioned, we only consider bitwise/exact reproducibility.
 \item {\bf\small Replicability:} (different inputs $\rightarrow$ consistant result).
@@ -293,20 +297,21 @@ We adopt the same definition of \citet{leek17,fineberg19}, among others:
 Generally, since replicability involves new data collection, it can be expensive.
 For example the ``Reproducibility Project: Cancer Biology'' initiative started in 2013 to replicate 50 high-impact papers in cancer biology\footnote{\url{https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-biology}}.
-Even with a funding of atleast \$1.3 million, it later shrunk to 18 projects \citep{kaiser18} due to very high costs.
-We also note that replicability doesn't have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}.
+Even with a funding of at least \$1.3 million, it later shrunk to 18 projects \citep{kaiser18} due to very high costs.
+We also note that replicability does not have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}.
 \end{itemize}
-Some have defined these terms in the opposite manner.
+\tonote{Raul: put white line to separate next paragraph from the previous list?}
+Some authors have defined these terms in the opposite manner.
 Examples include \citet{hinsen15} and the policy guidelines of the Association of Computing Machinery\footnote{\url{https://www.acm.org/publications/policies/artifact-review-badging}} (ACM, dated April 2018).
 ACM has itself adopted the 2008 definitions of Vocabulaire international de m\'etrologie (VIM).
 Besides the two terms above, ``repeatability'' is also sometimes used in regards to the concept discussed here and must be clarified.
-For example \citet{ioannidis2009} use ``repeatability'' to encompass both the terms above.
+For example, \citet{ioannidis2009} use ``repeatability'' to encompass both the terms above.
 However, the ACM/VIM definition for repeatability is ``a researcher can reliably repeat her own computation''.
 Hence, in the ACM terminology, the only difference between replicability and repeatability is the ``team'' that is conducting the computation.
-In the context of this paper inputs are precisely defined (Section \ref{definition:input}): files with specific/registered checksums.
-Therfore our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM.
+In the context of this paper, inputs are precisely defined (Section \ref{definition:input}): files with specific/registered checksums (see Section \ref{principle:verify}).
+Therefore, our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM.
