From ce485fca250477546d8c96c5a9034f8768f884df Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Fri, 1 May 2020 01:50:14 +0100
Subject: Removed Definitions and Summary sections and low-level figures

Given the very strict word limits of journals, we needed to remove these
sections and images. The removed images are `figure-file-architecture',
`figure-src-topmake' and `figure-src-inputconf'.

In total, with `wc' we now have 9019 words. This will be further reduced
when we remove all the technical parts of the Maneage section; in short, we
will only describe the generalities, not any specific details.
---
 paper.tex | 148 ++++++++------------------------------------------------------
 1 file changed, 19 insertions(+), 129 deletions(-)

diff --git a/paper.tex b/paper.tex
index 94153c7..758c418 100644
--- a/paper.tex
+++ b/paper.tex
@@ -133,8 +133,8 @@ While the situation has somewhat improved, all these papers still resonate stron
To address the collective problem of preserving a project's data lineage as well as its software dependencies, we introduce Maneage (Maneage+Lineage), pronounced man-ee-ij or \textipa{[m{\ae}n}i{\textsci}d{\textyogh}], hosted at \url{http://maneage.org}.
A project using Maneage starts by branching from its main Git branch, allowing the authors to customize it: specifying the necessary software tools for that particular project, adding analysis steps and adding visualizations and a narrative based on the results.
-In Sections \ref{sec:definitions} \& \ref{sec:principles} the basic concepts are defined and the founding principles of Maneage are discussed.
-Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects and we conclude with a summary in Section \ref{sec:conclusion}
+In Section \ref{sec:principles} the founding principles behind Maneage are discussed.
+Section \ref{sec:maneage} describes the internal structure of Maneage, and Section \ref{sec:discussion} discusses its benefits, caveats and future prospects.
@@ -145,61 +145,6 @@ Section \ref{sec:maneage} describes the internal structure of Maneage and Sectio
-\section{Definitions}
-\label{sec:definitions}
-
-The concepts and terminologies of reproducibility and project/workflow management and design are used differently by different research communities or different solution providers.
-It is therefore important to clarify some specific terms used in this paper.
-
-\begin{enumerate}[label={\bf D\arabic*}]
-\item \label{definition:input}\textbf{Input:}
-  A project's input is any file that may be usable in other projects.
-  The inputs of a project include data or software source code \citep[see][on the fundamental similarity of data and source code]{hinsen16}.
-  Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project. However, they can, and most often will, be used in later projects as well.
-
-\item \label{definition:output}\textbf{Output:}
-  A project's output is any file that is published at the end.
-  The output(s) of a project can be a narrative paper or report with visualizations, datasets (e.g., table(s), image(s), a number, or Boolean: confirming a hypothesis as being true or false), automatically-generated software source code, or any other computer file.
-
-\item \label{definition:project}\textbf{Project:}
-  A project is the series of operations that are done on input(s) to produce outputs.
- This definition is therefore very similar to ``\emph{workflow}'' \citep[e.g.,][]{oinn04, goecks10}, but because the published narrative paper/report is also an output, a project also includes both the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how - its visualizations were created. - - In a well-designed project, all analysis steps (e.g., written in Python, packages in R, libraries/programs in C/C++, etc.) are written to be modular, or executable independent of the rest with well-defined inputs, outputs and no side-effects. - This is crucial help for debugging and experimenting during the project, and also for their re-usability in later projects. - As a consequence, such analysis scripts/programs are defined above as ``inputs'' for the project. - A project hence does not include any analysis source code (to the extent this is possible), it only manages calls to them. - -\item \label{definition:provenance}\textbf{Data Provenance:} - A dataset's provenance is defined as the set of metadata (in any ontology, standard or structure) that connects it to the components (other datasets or scripts) that produced it. - Data provenance thus provides a high-level \emph{and structured} view of a project's lineage. - A good example of this is Research Objects \citep{belhajjame15}. - -% This definition of data lineage is inspired from https://stackoverflow.com/questions/43383197/what-are-the-differences-between-data-lineage-and-data-provenance: - -% "data provenance includes only high level view of the system for business users, so they can roughly navigate where their data come from. -% It's provided by variety of modeling tools or just simple custom tables and charts. -% Data lineage is a more specific term and includes two sides - business (data) lineage and technical (data) lineage. -% Business lineage pictures data flows on a business-term level and it's provided by solutions like Collibra, Alation and many others. -% Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level - actual tables, scripts and statements. -% Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. " -\item \label{definition:lineage}\textbf{Data Lineage:} - Data lineage is commonly used interchangeably with Data provenance \citep[for example][]{cheney09}. - For clarity, we define the term ``\emph{Data lineage}'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands). - Therefore, data lineage is synonymous with ``\emph{project}'' as defined above. -\item \label{definition:reproduction}\textbf{Reproducibility \& Replicability:} - These two terms have been used in the literature with various meanings, sometimes in a contradictory way. - It is important to highlight that in this paper we are only considering computational analysis: \emph{after} data has been collected and stored as a file. - Therefore, many of the definitions reviewed in \citet{plesser18}, which are about data collection, do not apply here, and - we adopt the same definition of \citet{leek17,fineberg19}, among others. - \citet{fineberg19} define reproducibility as \emph{obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis}, or same inputs $\rightarrow$ consistent result. 
-  They define Replicability as \emph{obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data}, or different inputs $\rightarrow$ consistent result.
-\end{enumerate}
-
-
-
-
\section{Principles}
\label{sec:principles}
@@ -218,6 +163,7 @@ Many solutions have been proposed in the last decades, including (but not limite
2019: \href{https://wholetale.org}{WholeTale}.
To help in the comparison, the founding principles of Maneage are listed below.
+
\begin{enumerate}[label={\bf P\arabic*}]
\item \label{principle:complete}\textbf{Completeness:}
  A project that is complete, or self-contained,
@@ -290,7 +236,7 @@ However, IPOL, which uniquely stands out in satisfying most principles, fails he
IPOL is thus not scalable to large projects, which commonly involve dozens of high-level dependencies, with complex data formats and analysis.

\item \label{principle:freesoftware}\textbf{Free and open source software:}
-  Technically, reproducibility (see \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
+  Technically, reproducibility \citep[as defined in][]{fineberg19} is possible with non-free or non-open-source software (a black box).
  This principle is thus necessary to complement that definition (nature is already a black box, we don't need another one):
  (1) As free software, others can learn from, modify, and build upon it.
  (2) The lineage can be traced to free software's implemented algorithms, also enabling optimizations on that level.
@@ -328,25 +274,6 @@ This will be done in multiple commits during the project (perhaps years), preser
git checkout -b master # Make new `master' branch, start customizing.
\end{lstlisting}

-\begin{figure}[t]
-  \begin{center}
-    \includetikz{figure-file-architecture}
-  \end{center}
-  \vspace{-5mm}
-  \caption{\label{fig:files}
-    Directory and file structure in a hypothetical project using Maneage.
-    Files are shown with small green boxes that have a suffix in their names (for example \inlinecode{format.mk} or \inlinecode{download.tex}).
-    Directories (containing multiple files) are shown as large brown boxes, where the name ends in a slash (\inlinecode{/}).
-    Directories with dashed lines and no files (just a description) are symbolic links that are created after building the project, pointing to commonly-needed built directories.
-    Symbolic links and their contents are not considered part of the source and are not under version control.
-    Files and directories are shown within their parent directory.
-    For example, the full address of \inlinecode{format.mk} from the top project directory is \inlinecode{reproduce/analysis/make/format.mk}.
-  }
-\end{figure}
-
-Figure \ref{fig:files} shows the directory structure of the cloned project and typical files.
-The top-level source has only very high-level components: the \inlinecode{project} shell script (POSIX-compliant) that is the main interface to the project, as well as the paper's \LaTeX{} source, documentation and a copyright statement.
-Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} files) and \inlinecode{reproduce/} (containing all other parts of the project).
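To make the branching workflow concrete: because a project starts as a Git branch of the Maneage repository (the \inlinecode{git checkout} step above), the original repository can be kept as a remote, and later improvements to the core infrastructure can be merged into a running project. A minimal sketch of such an update follows; the remote and branch names are illustrative assumptions, not prescribed by the text:

\begin{lstlisting}[language=bash]
git fetch origin           # Fetch recent Maneage commits (remote name illustrative).
git merge origin/maneage   # Merge infrastructure updates into the project's history.
\end{lstlisting}

Because the project lives on its own branch, such merges keep the project's analysis commits and Maneage's infrastructure commits cleanly separated in the history.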
Maneage has two main phases: (1) configuration, where the necessary software is built and the environment is set up, and (2) analysis, where data are accessed and the software is run to create the final visualizations and report:

\begin{lstlisting}[language=bash]
@@ -397,11 +324,11 @@ Thus, a researcher using Maneage for high-level analysis easily understands and
The existing tools listed in Section \ref{sec:principles} mostly use package managers like Conda to maintain the software environment, but Conda itself is written in Python, contrary to our completeness principle \ref{principle:complete}.
Highly-robust solutions like Nix and GNU Guix exist, but these require root permissions, contrary to principle P1.3.
-Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source (see Figure \ref{fig:files}).
+Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source.
At the start of project configuration, Maneage needs a top-level directory to build itself on the host (software and analysis).
We call this the ``build directory'' and it must not be located inside the source directory (see \ref{principle:modularity}).
No other location on the running OS will be affected by the project, including the source directory.
-Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally: 1) software tarball directory and 2) input data directory.
+Two other local directories can optionally be specified by the project when inputs are present locally: 1) the software tarball directory and 2) the input data directory.
Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} detail the building of the required software and the important issue of software citation.

\subsubsection{Verifying and building necessary software from source}
@@ -420,7 +347,7 @@ However, such binary blobs are not the primary storage/archival format of Maneag
Before being built, the software source codes are validated against their SHA-512 checksums (stored in the project).

Maneage includes a growing collection of scientific software (and its dependencies), much of which is superfluous for any single project.
-Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory (Figure \ref{fig:files}).
+Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file.

\subsubsection{Software citation}
\label{sec:softwarecitation}
@@ -460,7 +387,7 @@ Large files are in general a bad practice and against the modularity and minimal
Maneage is thus designed to encourage and facilitate modularity by distributing the analysis into many Makefiles that contain contextually-similar analysis steps.
Hereafter, these lower-level Makefiles are termed \emph{subMakefiles}.
When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}) calls \inlinecode{top-make.mk}, which loads the subMakefiles using the \inlinecode{include} directive (see Section \ref{sec:analysis}).
-All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make} (see Figure \ref{fig:files}). Figure \ref{fig:datalineage} shows their relationship with the target/built files that they manage.
+All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make}. Figure \ref{fig:datalineage} shows their relationship with the target/built files that they manage.
To keep the project's logic clear and simple (minimal complexity principle, \ref{principle:complexity}), recursion (where one instance of Make calls Make internally) is, by default, not used.

\begin{figure}[t]
@@ -480,7 +407,7 @@ To keep the project's logic clear and simple (minimal complexity principle, \ref
To avoid getting too abstract in the subsections below, where necessary we will do a basic analysis on the data of \citet{menke20} (hereafter M20) and replicate one of the results.
We cannot use the same software as M20, because M20 used Microsoft Excel for their analysis, violating several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}.
-Since we do not use the same software, this does not qualify as a reproduction (see \ref{definition:reproduction}).
+Since we do not use the same software, this does not qualify as a reproduction \citep{fineberg19}.
In the subsections below, this paper's analysis on that dataset is described using the data lineage graph of Figure \ref{fig:datalineage}.
We will follow Make's paradigm (see Section \ref{sec:usingmake}) of starting the lineage backwards from the ultimate target in Section \ref{sec:paperpdf} (bottom of Figure \ref{fig:datalineage}) to the configuration files in Section \ref{sec:configfiles} (top of Figure \ref{fig:datalineage}).
To better understand this project, we recommend studying this paper's own Maneage source, published as a supplement.
@@ -530,7 +457,6 @@ The analysis is demonstrated with the practical example of replicating Figure 1C
As shown in Figure \ref{fig:datalineage}, for this example we split this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}.
The former converts the Excel-formatted input into comma-separated value (CSV) format, and the latter generates the table to build Figure \ref{fig:toolsperyear}.
In a real project, subMakefiles could, and will, be much more complex.
-Figure \ref{fig:topmake} shows how the two subMakefiles are placed as values in the \inlinecode{makesrc} variable of \inlinecode{top-make.mk}, without their suffixes as described in Section \ref{sec:valuesintext}.
Their location after the standard starting subMakefiles (initialization and download) and before the standard ending subMakefiles (verification and final paper) is important, along with their order.

\begin{figure}[t]
@@ -543,14 +469,6 @@ Their location after the standard starting subMakefiles (initialization and down
  }
\end{figure}

-\begin{figure}[t]
-  \input{tex/src/figure-src-topmake.tex}
-  \vspace{-3mm}
-  \caption{\label{fig:topmake} General view of the high-level \inlinecode{top-make.mk} Makefile which manages the project's analysis that is in various subMakefiles.
-    See Figures \ref{fig:files} \& \ref{fig:datalineage} for its location in the project's file structure and its data lineage, as well as the subMakefiles it includes.
-  }
-\end{figure}
-
To enhance the original M20 plot, Figure \ref{fig:toolsperyear} also shows the number of papers in each year, and its horizontal axis shows the full range of the data (starting from \menkefirstyear), while M20 starts from 1997.
This was probably because the authors judged the earlier years' data to be too noisy.
For example, in \menkenumpapersdemoyear, only \menkenumpapersdemocount{} papers were analysed.
Both the numbers in the previous sentence (\menkenumpapersdemoyear{} and \menkenumpapersdemocount), and the dataset's oldest year (mentioned above: \menkefirstyear) are automatically generated \LaTeX{} macros; see Section \ref{sec:valuesintext}.
@@ -580,17 +498,9 @@ The relation between the project and the outside world is maintained in this sin
Each external dataset has some basic information, including its expected name on the local system (for offline access), a checksum to validate it (either the whole file or just its main ``data'', as discussed in Section \ref{sec:outputverification}), and its URL/PID.
In Maneage, they are stored in the \inlinecode{INPUTS.conf} file.
-See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively.
-Figure \ref{fig:inputconf} demonstrates this for the dataset of M20 that is stored in one \inlinecode{.xlsx} file on bioXriv.
+See Figure \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's data lineage.
Each is stored as a Make variable, and is automatically loaded into the full project when Make starts, like other configuration files, usable in any subMakefile.

-\begin{figure}[t]
-  \input{tex/src/figure-src-inputconf.tex}
-  \vspace{-3mm}
-  \caption{\label{fig:inputconf} The \inlinecode{INPUTS.conf} configuration file keeps references to external (input) datasets of a project, as well as their checksums for validation, see Sections \ref{sec:download} \& \ref{sec:configfiles}.
-    Shown here are the entries for the demonstration dataset of \citet{menke20}.
-  }
-\end{figure}

\subsubsection{Configuration files}
\label{sec:configfiles}
@@ -598,8 +508,8 @@
The subMakefiles discussed above should only organize the analysis; they should not contain any fixed numbers, settings or parameters, which should instead be set as variables in configuration files.
Configuration files logically separate the low-level implementation from the high-level running of a project.
-In the data lineage plot of Figure \ref{fig:datalineage}, configuration files are shown as sharp-edged, green \inlinecode{*.conf} boxes in the top row (for example, the file \inlinecode{INPUTS.conf} that was shown in Figure \ref{fig:inputconf} and mentioned in Section \ref{sec:download}).
-All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} subdirectory (see Figure \ref{fig:files}), and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles (Figure \ref{fig:topmake}), hence they are available to all of them.
+In the data lineage plot of Figure \ref{fig:datalineage}, configuration files are shown as sharp-edged, green \inlinecode{*.conf} boxes in the top row (for example, the file \inlinecode{INPUTS.conf} that was mentioned in Section \ref{sec:download}).
+All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} subdirectory, and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles, hence they are available to all of them.
The example analysis in Section \ref{sec:analysis}, in which we reported the number of papers studied by M20 in \menkenumpapersdemoyear, illustrates this.
The year ``\menkenumpapersdemoyear'' is not written by hand in \inlinecode{demo-plot.mk}.
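To make the flow from a configuration file into the narrative concrete, the following is a minimal sketch; the file name, variable names and value are hypothetical, only the \inlinecode{\textbackslash{}menkenumpapersdemoyear} macro name is from the text above. A configuration file defines the value as a Make variable, and a rule in a subMakefile writes it into a \LaTeX{} macro file that the paper loads:

\begin{lstlisting}[language=bash]
# reproduce/analysis/config/demo-year.conf (hypothetical file/variable):
menke-demo-year = 1996

# Rule sketch in demo-plot.mk; $(mtexdir) stands for a hypothetical
# directory holding the project's generated LaTeX macro files.
$(mtexdir)/demo-plot.tex: reproduce/analysis/config/demo-year.conf
	printf '\\newcommand{\\menkenumpapersdemoyear}{%s}\n' \
	       "$(menke-demo-year)" > $@
\end{lstlisting}

Changing the year in the configuration file then updates every occurrence of the value in the final paper on the next run of Make, with no edits to the analysis or narrative sources.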
@@ -613,7 +523,7 @@ analysis as the project evolves in the case of exploratory research papers, and
\subsubsection{Project initialization (\inlinecode{initialize.mk})}
\label{sec:initialize}
-The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figures \ref{fig:datalineage} \& \ref{fig:topmake}).
+The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figure \ref{fig:datalineage}).
It does not contain any analysis or major processing steps; it just initializes the system by setting up the necessary Make environment, as well as doing other general jobs like defining the Git commit hash of the run as a \LaTeX{} (\inlinecode{\textbackslash{}projectversion}) macro that can be loaded into the narrative.
Papers using Maneage usually put this hash as the last word in their abstract; for example, see \citet{akhlaghi19} and \citet{infante20}.
For the current version of this paper, it expands to \projectversion.
@@ -674,7 +584,7 @@ This is useful for publishers to create the report without necessarily building
The \inlinecode{dist-zip} target provides Zip compression as an alternative.
Depending on the built graphics used in the report, this compressed file will usually be roughly a megabyte.
-However, the required inputs (\ref{definition:input}) and the outputs may be much bigger, from megabytes to petabytes.
+However, the required inputs and the outputs may be much bigger, from megabytes to petabytes.
This gives two scenarios for publication of the project: 1) publishing only the source, or 2) publishing the source with the data.
In the former case, the output of \inlinecode{dist} can be submitted to the journal as a supplement, or uploaded to pre-print servers like \href{https://arXiv.org}{arXiv} that will compile the \LaTeX{} source and build their own PDFs.
The Git history can also be archived as a single ``bundle'' file and submitted as a supplement.
@@ -692,6 +602,11 @@ For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481
\section{Discussion \& Caveats}
\label{sec:discussion}

+To optimally extract the potential of big data in science, we need a complete view of its lineage.
+Scientists are, however, rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years does not help.
+Such high-level tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects.
+Scientists, on the other hand, need to focus on their own research fields, and need to think about longevity.
+
The primordial implementation was written for \citet{akhlaghi15}.
To use it in other projects without a full re-write, the skeleton was separated from the flesh as a more abstract ``template'' that was used in \citet{bacon17}, in particular Sections 4 and 7.3 (respectively in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}).
Later, software building was incorporated and used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
@@ -730,31 +645,6 @@ This is a long-term goal and requires major changes to academic value systems.
-\section{Conclusion \& Summary} -\label{sec:conclusion} - -To optimally extract the potentials of big data in science, we need to have a complete view of its lineage. -Scientists are, however, rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years does not help. -Such high-level tools are primarily targetted at software developers, who are paid to learn them and use them effectively for short-term projects. -Scientists, on the other hand, need to focus on their own research fields, and need to think about longevity. - -Maneage is designed as a complete template, providing scientists with a pre-built low-level skeleton, using simple and robust tools that have withstood the test of time and are actively maintained. -Scientists can customize Maneage's existing data management for their own projects, enabling them to learn and master the lower-level tools. -This improves their efficiency and the robustness of their scientific results, while also enabling future scientists to reproduce and build upon their work. - -We discussed the founding principles of Maneage that are completeness, modularity, minimal complexity, verifiable inputs and outputs, temporal provenance, scalability, and free software. -We showed how these principles are implemented in an existing structure, ready for customization, and discussed its advantages and disadvantages. -With a larger user-base and wider application in scientific (and hopefully industrial) applications, Maneage will grow and become even more robust, stable and user friendly. - - - - - - - - - - %% Acknowledgements \section*{Acknowledgments} The authors wish to thank (sorted alphabetically) -- cgit v1.2.1