From 646756675566a0907edf143c6b6950e0479d9e7e Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Thu, 2 Apr 2020 04:06:03 +0100 Subject: Rewrote abstract, better organization in publishing section I hadn't updated the abstract since first writing it. With this commit, it has been updated to be more precise and generically interesting, focusing more on the principles and usability. I also greatly improved the section on publishing the workflow. --- paper.tex | 142 +++++++++++++++++++++++++++++++++++++++++++++++----------- tex/README.md | 59 ++++++++++++++++++++++++ 2 files changed, 175 insertions(+), 26 deletions(-) create mode 100644 tex/README.md diff --git a/paper.tex b/paper.tex index ca88ec3..a91e86f 100644 --- a/paper.tex +++ b/paper.tex @@ -45,18 +45,22 @@ %% Abstract {\noindent\mpregular - Computational methods, implemented as software, are a major component of almost all scientific datasets, results, papers, or discoveries. - However, the full software environment and details of how they were run to do an analysis cannot be reported with sufficient detail within the confines of a traditional research paper like this one. - It is thus becoming harder to archive, understand, reproduce, or validate a scientific result, even by the original author. - To fasciliate reproducible, or archivable, data analysis this paper introduces the ''Reproducible paper template''. - It provides the necessary low-level infrastructure in a generic design/template to easily allow the addition of higher-level analysis steps in individual projects. - It is designed, and later published, fully as plain-text files, with no binary component. 
- The workflow of a project that uses this template will contain the following steps that can all be executed automatically, or inspected/parsed as a plain text file: software tarball URLs, input data URLs/PIDs, checksums for the data and software tarballs, scripts to build the software (containing all the dependencies), scripts to do the analysis, and finally, the \LaTeX{} source of the narrative paper, or data description.
- This paper itself is exactly reproducible (snapshot \projectversion).
+ The era of big data has also ushered in an era of big responsibility.
+ Without taking this responsibility seriously, the integrity of the results will be a subject of perpetual debate.
+ In this paper, Maneage is introduced as a low-level solution to this problem.
+ Maneage (management + lineage) is an executable workflow for project authors and readers in the sciences or industry.
+ It is designed following these principles: complete (e.g., not requiring anything beyond a POSIX-compatible system, administrator privileges, or a network connection), modular, fully in plain-text, minimal complexity in design, verifiable inputs and outputs, temporal lineage/provenance, and free software (in scientific applications).
+ A project that uses Maneage will have full control over the data lineage, making it exactly reproducible.
+ This control goes as far back as the automatic downloading of input data, and the automatic building of the necessary software used to analyze the data, with fixed versions and build configurations.
+ It also contains the narrative description of the final project's report (built into a PDF), while providing automatic and direct links between the analysis and the parts of the narrative that use it.
+ Also, starting new projects, or editing previously published papers, is trivial because of its version control system.
+ If adopted on a wide scale, Maneage can greatly improve scientific collaboration, helping researchers build upon each other's work instead of facing the technical frustrations that many currently experience and that can affect their scientific results and interpretations.
+ It can also be used in more ambitious projects, like automatic workflow creation through machine learning tools, or automating data management plans.
+ This paper has itself been written in Maneage (snapshot \projectversion).

 \horizontalline

 \noindent
- {\mpbold Keywords:} Reproducibility, Workflows, scientific pipelines
+ {\mpbold Keywords:} Data Lineage, Data Provenance, Reproducibility, Workflows, Scientific Pipelines
 }

 \horizontalline

@@ -944,24 +948,24 @@ When the names of the subMakefiles are descriptive enough, this enables both the

 \subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})}
 \label{sec:paperpdf}
-The ultimate purpose of a project is to report its result.
-In scientific projects, this ``report'' is the published, or draft, paper.
-In the industry, it is a quality-check and analysis of the final data product.
-The raw result is usually dataset(s) that is (are) visualized in the report, for example as a plot, figure or table and blended into the narrative description.
-In Figure \ref{fig:datalineage} it is shown as \inlinecode{paper.pdf}.
-Note that it is the only built file (blue box) with no arrows leaving it.
-In other words, nothing depends on it: highlighting its unique ``ultimate target'' position in the lineage.
+The ultimate purpose of a project is to report the data analysis result.
+In scientific projects, this ``report'' is the published (or draft) paper.
+In industry, it is a quality check and analysis of the final data product(s).
+In both cases, the report contains many visualizations of the final data product of the project, for example as a plot, figure, table, or numbers blended into the narrative description.
+In Figure \ref{fig:datalineage} it is shown as \inlinecode{paper.pdf}; note that it is the only built file (blue box) with no arrows leaving it.
+In other words, nothing depends on it: highlighting its unique ``ultimate target'' position in the lineage.

 The instructions to build \inlinecode{paper.pdf} are in \inlinecode{paper.mk}.
 The report's source (containing the main narrative, its typesetting as well as that of the figures or tables) is \inlinecode{paper.tex}.
 To build the final report's PDF, \inlinecode{references.tex} and \inlinecode{project.tex} are also loaded into \LaTeX{}.
 \inlinecode{references.tex} is part of the project's source and can contain the Bib\TeX{} entries for the bibliography of the final report.
-In other words, it formalizes the connections of this scholarship with previous scholarship.
+In other words, it formalizes the connections of this project with previous projects.

 Another class of files that may be loaded into \LaTeX{}, but are not shown to avoid complications in the figure, are the figure or plot data, or built figures.
 For example in this paper, the demonstration figure shown in Section \ref{sec:analysis} is drawn directly within \LaTeX{} (using its PGFPlots package).
 The project only needed to build the plain-text table of numbers that were fed into PGFPlots (\inlinecode{tools-per-year.txt} in Figure \ref{fig:datalineage}).
-However, building some plots may not be possible with PGFPlots, or the authors may prefer another tool to generate the visualization's image file \citep[for example with Python's Matplotlib, ][]{matplotlib2007}, then load that image file into \LaTeX{} as a graphic.
+
+However, building some plots may not be possible with PGFPlots, or the authors may prefer another tool to generate the visualization's image file \citep[for example with Python's Matplotlib, ][]{matplotlib2007}.
+For this scenario, the actual image file of the visualization can be used in the lineage, for example \inlinecode{tools-per-year.pdf} instead of \inlinecode{tools-per-year.txt}.
 See Section \ref{sec:publishing} on the project publication for special considerations regarding these files.

@@ -1192,7 +1196,8 @@ However, unlike before, this directory is placed under \inlinecode{texdir} which
 This is because the plot of Figure \ref{fig:toolsperyear} is directly made within \LaTeX{}, using its PGFPlots package\footnote{PGFPlots package of \LaTeX: \url{https://ctan.org/pkg/pgfplots}.
 \inlinecode{texdir} has some special features when using \LaTeX{}, see Section \ref{sec:buildingpaper}.
 PGFPlots uses the same graphics engine that is building the paper, producing a high-quality figure that blends nicely in the paper.}.
-Note that this is just our personal choice, other methods of generating plots (for example with R, Gnuplot or Matplotlib) are also possible within this system, see Section \ref{sec:buildingpaper}.
+Note that this is just our personal choice; other methods of generating plots (for example with R, Gnuplot or Matplotlib) are also possible within this system.
+As with the input data files of PGFPlots, it is just necessary to put the files that are loaded into \LaTeX{} under the \inlinecode{\$(BDIR)/tex} directory; see Section \ref{sec:publishing}.
 The plain-text table that is used to build Figure \ref{fig:toolsperyear} is defined as the variable \inlinecode{a2mk20f1c} of Figure \ref{fig:demoplotsrc} (just above the second rule).
 As shown in the second rule, again we use GNU AWK to extract the necessary information from \inlinecode{mk20tab3} (which was built in \inlinecode{format.mk}).
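The kind of AWK extraction step described above can be sketched as follows. This is a hypothetical, self-contained illustration with invented file names and rows, not the project's actual rule in \inlinecode{format.mk}: it reduces a raw ``year tool'' table into the two-column ``year count'' table that a PGFPlots axis could load directly.

```shell
# Hypothetical sketch (invented file names and contents): count how
# many tools appear per year, producing a plain-text table for LaTeX.
printf '1991 gnuplot\n1994 gnuplot\n1994 matplotlib\n' > tools.txt
awk '{count[$1]++} END {for (y in count) print y, count[y]}' tools.txt \
    | sort -n > tools-per-year.txt
cat tools-per-year.txt
```

Here \inlinecode{tools-per-year.txt} plays the role of the plain-text table that is placed under the build directory for \LaTeX{} to load.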
@@ -1349,17 +1354,99 @@ However, human error is inevitable, so when the project takes long in some phase

 \subsection{Publishing the project}
 \label{sec:publishing}
-Once the project is complete, publishing the project (its narrative report as well as the full lineage) is the final step.

-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Once the project is complete, publishing the project is the final step.
+In a scientific scenario, this means making the project public.
+As discussed in the various steps before, the source of the project (the software configuration, data lineage and narrative text) is fully in plain text, greatly facilitating the publication of the project.
+
+
+\subsubsection{Automatic creation of publication tarball}
+\label{sec:makedist}
+To facilitate the publication of the project source, Maneage has a special \inlinecode{dist} target during the build process which is activated with the command \inlinecode{./project make dist}.
+In this mode, Maneage will not do any analysis; it will simply copy the full project's source (on the given commit) into a temporary directory and compress it into a \inlinecode{.tar.gz} file.
+If Zip compression is necessary, the \inlinecode{dist-zip} target can be called instead of \inlinecode{dist}.
+
+The \inlinecode{dist} tarball contains the project's full data lineage and is enough to reproduce the full project: it can build the software, download the data, run the analysis, and build the final PDF.
+However, it doesn't contain the Git history; it is just a checkout of one commit.
+Instead of the history, it contains all the necessary \emph{built products} that go into building the final paper without the analysis: for example the used plots, figures, tables, and \inlinecode{project.tex}, see Section \ref{sec:valuesintext}.
+As a result, the tarball can \emph{also} be used to only build the final report with a simple \inlinecode{pdflatex paper} command, \emph{without} running \inlinecode{./project}.
+When the project is distributed as a tarball (not as a Git repository), building the report may be the main purpose (as in the arXiv distribution scenario discussed below), and the data lineage (under the \inlinecode{reproduce/} directory) is likely just a supplement.
+
+\subsubsection{What to publish, and where?}
+\label{sec:whatpublish}
+The project's source, which is fully in hand-written plain-text, has a very small volume, usually much less than one megabyte.
+However, the necessary input files (see Section \ref{definition:input}) and built datasets may be arbitrarily large, from megabytes to petabytes or more.
+Therefore, there are various scenarios for the publication of the project as described below:

 \begin{itemize}
-\item \inlinecode{./project make dist}.
-\item The data files, or image files that go into the \LaTeX{} paper in \inlinecode{texdir}. Especially how \inlinecode{tex/build} points to it.
-\item Discuss how easy it is to built graphics outside of \LaTeX{}.
+\item \textbf{Only source:} Publishing the project source is very easy because it only contains plain-text files with a very small volume: a commit will usually be on the scale of $\sim100$kB. With the Git history, it will usually only be on the scale of $\sim5$MB.
+
+  \begin{itemize}
+  \item \textbf{Public Git repository:} This is the simplest publication method.
+    The project will already be in a (private) Git repository prior to publication.
+    In such cases, the private configuration can be removed so it becomes public.
+  \item \textbf{In journal or PDF-only preprint systems (e.g., bioRxiv):} If the journal or pre-print server allows publication of small supplement files to the paper, the commit that produced the final paper can be submitted as a compressed file, for example the tarball produced by \inlinecode{./project make dist}.
+  \item \textbf{arXiv:} Besides simply uploading a PDF pre-print, on arXiv, it is also possible to upload the \LaTeX{} source of the paper.
+    arXiv will run its own internal \LaTeX{} engine on the uploaded files and produce the PDF that is published.
+    When the project is published, arXiv also allows users to anonymously download the \LaTeX{} source tarball that the authors uploaded\footnote{In the current arXiv user interface, the tarball is available by clicking the ``Other formats'' link on the paper's main page, and then clicking ``Download source''; this can be checked with \url{https://arxiv.org/abs/1909.11230} of \citet{akhlaghi19}.}.
+    Therefore, simply uploading the tarball from the \inlinecode{./project make dist} command is sufficient for arXiv, and will allow the full project data lineage to also be published there with the \LaTeX{} source.
+    We have done this in \citet[arXiv:1909.11230]{akhlaghi19} and \citet[arXiv:1911.01430]{infante20}.
+    Since arXiv is mirrored by many institutes across the planet, this is a robust way to preserve the reproducible lineage.
+  \item \textbf{In output datasets:} Many data storage formats support an internal structure within the data file.
+    One commonly used example today is the Hierarchical Data Format (HDF), in particular HDF5, which can host a complex filesystem in POSIX syntax.
+    It is even used by some reproducible analysis solutions like the Active Papers project \citep[for more, see Appendix \ref{appendix:activepapers}]{hinsen11}.
+    Since the volume of the project source is so insignificant compared to the output datasets of most projects, the whole project source can be stored with each published data file if the format supports it.
+  \end{itemize}
+\item \textbf{Source and data:} The project inputs (including the software tarballs, or possible datasets) may have a large volume.
+  Publishing them with the source is thus not always possible.
+  However, based on the definition of inputs in Section \ref{definition:input}, they are usable in other projects: another project may use the same data or software source code, in a different way.
+  Therefore, even when they can be published with the source, it is encouraged to publish them as separate files.
+
+  For example, this strategy was followed in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481}, which supplements \citet{akhlaghi19} and contains the following files:
+
+  \begin{itemize}
+  \item \textbf{Final PDF:} for easy understanding of the project.
+  \item \textbf{Git history:} as the Git ``bundle'' of the project.
+    This single file contains the full Git history of the project until its publication date (only 4MB), see Section \ref{sec:starting}.
+  \item \textbf{Project source tarball}: output of \inlinecode{./project make dist}, as explained above.
+  \item \textbf{Tarballs of all necessary software:} This is necessary in case the software webpages are not accessible for any reason at a later date, or the project must be run with no internet access.
+    This is only possible because of the free software principle discussed in Section \ref{principle:freesoftware}.
+  \end{itemize}
+
+  Note that \citet{akhlaghi19} used previously published datasets which are automatically accessed when necessary.
+  Also, that paper didn't produce any output datasets beyond the figures shown in the report; therefore the Zenodo upload doesn't contain any datasets.
+  When a project involves data collection, or added-value data products, they can also be uploaded with the files above.
 \end{itemize}

+\subsubsection{Worries about getting scooped!}
+\label{sec:scooped}
+Publishing the project source with the paper can have many benefits for the researcher and the larger community.
+For example, if the source is published with a pre-print, others may help the authors find bugs or improvements to the source that can affect the validity or precision of the result, or simply optimize it, for example so it does the same work in half the time.
+
+However, one particular concern raised by a minority of researchers is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue harvesting from all their hard work.
+Because others can easily reproduce the work, they may take on the follow-up project that the original authors had intended to do next.
+This is informally known as getting scooped.
+
+The extent to which this may happen is an interesting subject to be studied once many papers become reproducible.
+But it is a valid concern that must be addressed.
+Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways:
+
+\begin{enumerate}
+\item This worry is essentially the second phase of Figure \ref{fig:branching}.
+  The commits of the other team are built upon the commits of the original authors.
+  It is therefore perfectly clear (with the precision of a character!) how much of their result is purely their own work (qualitatively or quantitatively).
+  In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``standing on the shoulders of giants'' quote.
+  However, this is a long-term goal and requires major changes to academic value systems.
+\item The authors can be given a grace period where the journal, or some third authority, keeps the source and publishes it a certain time after publication.
+  In fact, journals can create specific policies for such scenarios, for example saying that all project sources will be available publicly $N$ months/years after publication, while allowing authors to opt out of this if they like, so that the source is published immediately with the paper.
+  However, journals cannot expect exclusive copyright to distribute the project source, in the same manner as they do with the final paper.
+  As discussed in the free software principle of Section \ref{principle:freesoftware}, it is critical that the project source be free for the community to use, modify and distribute.
+
+  This can also be done by the authors on servers like Zenodo, where the dataset's final DOI can be reserved first, with publication at a later date.
+  Reproducibility is indeed very important for the sciences, but the hard work that went into a project should also be acknowledged; this option accommodates authors who would like to publish the source at a later date.
+\end{enumerate}
+
+
+
 \subsection{Future of Maneage and its past}
 \label{sec:futurework}
@@ -1436,6 +1523,8 @@ Once the improvements become substantial, new paper(s) will be written to comple
 \item \url{https://arxiv.org/pdf/2003.04915.pdf}: how data lineage can help machine learning.
 \item Interesting patent on ``documenting data lineage'': \url{https://patentimages.storage.googleapis.com/c0/51/6e/1f3af366cd73b1/US10481961.pdf}
 \item Automated data lineage extractor: \url{http://hdl.handle.net/20.500.11956/110206}.
+\item Caveat: Many low-level tools.
+\item High-level tools can be written to exploit the low-level features.
 \end{itemize}

@@ -2172,6 +2261,7 @@ When the Python module contains a component written in other languages (mostly C

 As mentioned in \citep{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to access it.
 Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required).
 Installing it using the Debian or Arch Linux package managers also failed due to dependencies.
+Furthermore, as a high-level data format, HDF5 evolves very fast; for example, HDF5 1.12.0 (February 29th, 2020) is not usable with older libraries provided by the HDF5 team.

 While data and code are indeed fundamentally similar concepts technically \tonote{cite Konrad's paper on this}, they are used by humans differently.
 This becomes a burden when large datasets are used; this was also acknowledged in \citet{hinsen15}.

diff --git a/tex/README.md b/tex/README.md
new file mode 100644
index 0000000..0f2f0d6
--- /dev/null
+++ b/tex/README.md
@@ -0,0 +1,59 @@
+Directory containing LaTeX-related files
+----------------------------------------
+
+Copyright (C) 2018-2020 Mohammad Akhlaghi \
+See the end of the file for license conditions.
+
+This directory contains the various components of the LaTeX part of the
+project. In a running project, it will contain at least the following
+sub-directories:
+
+- The `src/` directory contains the LaTeX files that are loaded into
+  `paper.tex`. This includes the necessary preambles, the LaTeX source
+  files to build tables or figures (for example with TikZ or PGFPlots),
+  etc. These files are under version control and are an integral part of
+  the project's source.
+
+- The `build/` directory contains all the built products (not source!)
+  that are created during the analysis and are necessary for building the
+  paper. This includes figures, plots, images, table source contents,
+  etc. Note that this directory is not under version control.
+
+- The `tikz/` directory is only relevant if some of the project's figures
+  are built with the LaTeX packages of TikZ or PGFPlots. It points to the
+  directory containing the figures (in PDF) that were built by these
+  tools. Note that this directory is not under version control.
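The link arrangement described in the following paragraphs can be sketched as below. This is a self-contained illustration with throwaway paths; in a real project the links are created by the configure step, not by hand.

```shell
# Illustration with throwaway paths (not the project's real configure
# step): `build/` and `tikz/` under the source's tex/ directory become
# symbolic links into the build directory, so LaTeX always loads built
# products from a fixed relative path.
rm -rf /tmp/demo-bdir /tmp/demo-src
mkdir -p /tmp/demo-bdir/tex/build /tmp/demo-bdir/tex/tikz /tmp/demo-src/tex
ln -s /tmp/demo-bdir/tex/build /tmp/demo-src/tex/build
ln -s /tmp/demo-bdir/tex/tikz  /tmp/demo-src/tex/tikz
# Anything the analysis writes under the build directory is then
# visible to LaTeX through the source tree:
echo "demo contents" > /tmp/demo-bdir/tex/build/example.txt
cat /tmp/demo-src/tex/build/example.txt
```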
+
+The latter two directories and their contents are not under version
+control, so if you have just cloned the project or are viewing its
+contents in a browser, they don't exist. They will be created after the
+project is configured for the running system.
+
+When the full project is built from scratch (building the software,
+downloading necessary datasets, running the analysis and building the
+paper's PDF), the latter two directories will be symbolic links to special
+places under the Build directory.
+
+However, when the distributed tarball is used to only build the PDF paper
+(without doing any analysis), the latter two directories will not be
+symbolic links and will contain the necessary components for the paper to
+be built.
+
+
+
+
+
+### Copyright information
+
+This project is free software: you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the Free
+Software Foundation, either version 3 of the License, or (at your option)
+any later version.
+
+This project is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+more details.
+
+You should have received a copy of the GNU General Public License along
+with this project. If not, see <https://www.gnu.org/licenses/>.
\ No newline at end of file
-- cgit v1.2.1