-rw-r--r-- paper.tex | 91
-rw-r--r-- tex/img/collaboration-icon.pdf | bin 0 -> 2855 bytes
-rw-r--r-- tex/img/paper-icon.pdf | bin 0 -> 2165 bytes
-rw-r--r-- tex/src/figure-branching.tex | 140
-rw-r--r-- tex/src/figure-data-lineage.tex | 4
-rw-r--r-- tex/src/figure-src-download.tex | 2
-rw-r--r-- tex/src/figure-src-topmake.tex | 4
-rw-r--r-- tex/src/references.tex | 20
8 files changed, 242 insertions, 19 deletions
diff --git a/paper.tex b/paper.tex
index dc5c032..0a85da2 100644
--- a/paper.tex
+++ b/paper.tex
@@ -254,7 +254,9 @@ Data provenance thus provides a high-level view of the data's genealogy.
% Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level - actual tables, scripts and statements.
% Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. "
Data lineage is commonly used interchangeably with Data provenance \citep[for example][\tonote{among many others, just search ``data lineage'' in scholar.google.com}]{cheney09}.
-However, for clarity, in this paper we refer to the term Data lineage as a low-level and fine-grained record of the data's genealogy, down to the exact command that produced each intermediate step.
+However, for clarity, in this paper we use the term ``Data lineage'' to mean a low-level and fine-grained recording of the data's source and of the operations applied to it, down to the exact command that produced each intermediate step.
+This \emph{recording} doesn't necessarily have to be in a formal metadata model.
+But data lineage must be complete (see the completeness principle in Section \ref{principle:complete}) and must allow the extraction of data provenance metadata, and thus higher-level operations like visualization of the workflow.
\subsection{Definition: reproducibility \& replicability}
@@ -901,7 +903,7 @@ We'll follow Make's paradigm (see Section \ref{sec:usingmake}) of starting form
\includetikz{figure-data-lineage}
\end{center}
\vspace{-7mm}
- \caption{\label{fig:datalineage}Schematic representation of built file dependencies in a hypothetical project/pipeline using the reproducible paper template.
+ \caption{\label{fig:datalineage}Schematic representation of data lineage in a hypothetical project/pipeline using Maneage.
Each colored box is a file in the project and the arrows show the dependencies between them.
Green files/boxes are plain text files that are under version control and in the source-directory.
Blue files/boxes are output files of various steps in the build-directory, located within the Makefile (\inlinecode{*.mk}) that generates them.
@@ -942,18 +944,27 @@ When the names of the subMakefiles are descriptive enough, this enables both the
\subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})}
\label{sec:paperpdf}
-The ultimate purpose of a project is to report its result and interpret it in a larger context of human knowledge.
-In scientific projects, this is the final published paper.
-The raw result is usually dataset(s) that is (are) visualized, for example as a plot, figure or table and blended into the narrative description.
-In Figure \ref{fig:datalineage} it is shown as \inlinecode{paper.pdf} and the instructions to build \inlinecode{paper.pdf} are in \inlinecode{paper.mk}.
-In the complete directed graph of Figure \ref{fig:datalineage}, \inlinecode{paper.pdf} is the only node that has no outward edge or arrows (further showing that it is the ultimate target: nothing depends on it).
+The ultimate purpose of a project is to report its result.
+In scientific projects, this ``report'' is the published, or draft, paper.
+In industry, it is a quality-check and analysis of the final data product.
+The raw result is usually one or more datasets that are visualized in the report, for example as a plot, figure or table, and blended into the narrative description.
+In Figure \ref{fig:datalineage} it is shown as \inlinecode{paper.pdf}.
+Note that it is the only built file (blue box) with no arrows leaving it.
+In other words, nothing depends on it, which highlights its unique ``ultimate target'' position in the lineage.
+The instructions to build \inlinecode{paper.pdf} are in \inlinecode{paper.mk}.
The report's source (containing the main narrative, its typesetting as well as that of the figures or tables) is \inlinecode{paper.tex}.
To build the final report's PDF, \inlinecode{references.tex} and \inlinecode{project.tex} are also loaded into \LaTeX{}.
-Another class of files that maybe loaded into \LaTeX{}, but are not shown to avoid complications in the figure, are the figure or plot PDFs which may have been created during the analysis steps.
\inlinecode{references.tex} is part of the project's source and can contain the Bib\TeX{} entries for the bibliography of the final report.
In other words, it formalizes the connections of this scholarship with previous scholarship.
+Another class of files that may be loaded into \LaTeX{}, but are not shown to avoid complications in the figure, are the figure or plot data, or built figures.
+For example in this paper, the demonstration figure shown in Section \ref{sec:analysis} is drawn directly within \LaTeX{} (using its PGFPlots package).
+The project only needed to build the plain-text table of numbers that were fed into PGFPlots (\inlinecode{tools-per-year.txt} in Figure \ref{fig:datalineage}).
+However, some plots may not be possible to build with PGFPlots, or the authors may prefer to generate the visualization's image file with another tool \citep[for example with Python's Matplotlib, ][]{matplotlib2007} and then load that image file into \LaTeX{} as a graphic.
+For this scenario, the actual image file of the visualization can be used in the lineage, for example \inlinecode{tools-per-year.pdf} instead of \inlinecode{tools-per-year.txt}.
+See Section \ref{sec:publishing} on the project publication for special considerations regarding these files.
+
@@ -1214,7 +1225,8 @@ When descriptive names are chosen for the subMakefiles, a simple glance over the
\begin{figure}[t]
\input{tex/src/figure-src-topmake.tex}
\vspace{-3mm}
- \caption{\label{fig:topmake} Important parts of High-level \inlinecode{top-make.mk}.
+ \caption{\label{fig:topmake} General view of the high-level \inlinecode{top-make.mk} Makefile, which manages the project's analysis that is spread over various subMakefiles.
+ See Figures \ref{fig:files} \& \ref{fig:datalineage} for its location in the project's file structure and its data lineage, as well as the subMakefiles it includes.
}
\end{figure}
@@ -1253,17 +1265,66 @@ This style of managing project parameters therefore produces a much more healthy
+\subsection{Projects as Git branches of Maneage}
+\label{sec:starting}
+
+Maneage is fully composed of plain-text files distributed in a directory structure (see Sections \ref{principle:text} \& \ref{sec:generalimplementation} and Figure \ref{fig:files}).
+Therefore it can be maintained under version control systems like Git (for more on version control, see Appendix \ref{appendix:versioncontrol}).
+Every commit in the version-controlled history contains a complete snapshot of the executable data lineage (for more, see the completeness principle in Section \ref{principle:complete}).
+Maneage is maintained by its developers in a central branch, which we'll call \inlinecode{maneage} hereafter.
+
+The \inlinecode{maneage} branch contains all the low-level infrastructure that is necessary for any project; primarily the configuration features discussed in Section \ref{sec:projectconfigure} that are located under \inlinecode{reproduce/software} in Figure \ref{fig:files}, and executed with the \inlinecode{./project configure} command.
+The main branch only contains a minimal/demonstration analysis in order to be complete.
+The names of all the files related to the demonstration of the \inlinecode{maneage} branch have a \inlinecode{delete-me} prefix to highlight that they must be deleted when starting a new project.
-\subsubsection{Building the paper}
-\label{sec:buildingpaper}
+To start a new project, users simply clone Maneage from its reference repository and build their own Git branch over the most recent commit.
+They can then start customizing Maneage for their project in their own branch, and push that branch to their own Git repository for management.
+Project customization will usually start with deleting the demonstration files that have a \inlinecode{delete-me} prefix, and adding their own input datasets, analysis and narrative instead.
+Maneage contains a file called \inlinecode{README-hacking.md} that has a complete checklist of steps to start a new project and remove demonstration parts.
+This file is updated on the \inlinecode{maneage} branch and will always be up to date with the low-level infrastructure.
+
+\begin{figure}[t]
+ \includetikz{figure-branching}
+ \vspace{-3mm}
+ \caption{\label{fig:branching} Schematic view of the main Maneage branch, and how new projects start by branching off of it in two scenarios: pre-publication (allowing easy collaboration) and post-publication (allowing other scientists to easily build upon each other's published work).
+ In the pre-publication scenario, another co-author has made two commits in parallel to the main author, which have later been merged as the project evolves.
+ In the post-publication scenario, a scientist builds upon the published work of another scientist, then merges with the Maneage branch to improve the low-level infrastructure.
+ }
+\end{figure}
+
+Having the uniform Maneage branch at the core of all projects that use it has the following benefits:
\begin{itemize}
-\item Discuss the importance of putting the \LaTeX{} related files in \inlinecode{texdir}. Especially how \inlinecode{tex/build} points to it.
+\item The project's infrastructure can be updated even after publication, when the operating systems or hardware may no longer be compatible with the project's core components.
+\item Other projects can branch off an existing project and update their infrastructure while retaining the high-level analysis that they will edit.
+\item Merging work from multiple projects can be as easy as merging their branches: they all share the common \inlinecode{maneage} branch.
+\end{itemize}
+
+
+\subsection{Collaborating with the same build directory}
+\label{sec:collaborating}
+
+Because the project's source and build directories are separate, it is possible for different users to share a build directory, while working on their own separate branch of Maneage during a collaboration.
+
+
+
+
+\subsection{Publishing the project}
+\label{sec:publishing}
+
+Once the project is complete, publishing the project (its narrative report as well as the full lineage) is the final step.
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\begin{itemize}
+\item \inlinecode{./project make dist}.
+\item The data files, or image files that go into the \LaTeX{} paper in \inlinecode{texdir}. Especially how \inlinecode{tex/build} points to it.
\item Discuss how easy it is to build graphics outside of \LaTeX{}.
\end{itemize}
-\subsection{Future work and history}
+\subsection{Future of Maneage and its past}
\label{sec:futureworkx}
As with any software, the core architecture of Maneage will inevitably evolve after the publication of this paper.
The current version introduced here has already experienced 5 years of evolution and several reincarnations.
@@ -1349,7 +1410,9 @@ Once the improvements become substantial, new paper(s) will be written to comple
Work on the reproducible paper template has been funded by the Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) scholarship and its Grant-in-Aid for Scientific Research (21244012, 24253003), the European Research Council (ERC) advanced grant 339659-MUSICOS, European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement No 721463 to the SUNDIAL ITN, and from the Spanish Ministry of Economy and Competitiveness (MINECO) under grant number AYA2016-76219-P.
The reproducible paper template was also supported by European Union’s Horizon 2020 (H2020) research and innovation programme via the RDA EU 4.0 project (ref. GA no. 777388).
-
+%% Collaboration icon: https://www.flaticon.com/free-icon/collaboration_809522?term=collaboration&page=1&position=36
+%% Paper icon: https://www.flaticon.com/free-icon/paper_2541979?term=paper&page=1&position=28
+The collaboration and paper icons in Figure \ref{fig:branching} were respectively made by `mynamepong' and `freepik' and downloaded from \url{www.flaticon.com}.
%% Tell BibLaTeX to put the bibliography list here.
\printbibliography
diff --git a/tex/img/collaboration-icon.pdf b/tex/img/collaboration-icon.pdf
new file mode 100644
index 0000000..7bb5795
--- /dev/null
+++ b/tex/img/collaboration-icon.pdf
Binary files differ
diff --git a/tex/img/paper-icon.pdf b/tex/img/paper-icon.pdf
new file mode 100644
index 0000000..db7660c
--- /dev/null
+++ b/tex/img/paper-icon.pdf
Binary files differ
diff --git a/tex/src/figure-branching.tex b/tex/src/figure-branching.tex
new file mode 100644
index 0000000..bc6fb41
--- /dev/null
+++ b/tex/src/figure-branching.tex
@@ -0,0 +1,140 @@
+% Copyright (C) 2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
+%
+%% This LaTeX source is free software: you can redistribute it and/or
+%% modify it under the terms of the GNU General Public License as published
+%% by the Free Software Foundation, either version 3 of the License, or (at
+%% your option) any later version.
+%
+%% This LaTeX source is distributed in the hope that it will be useful, but
+%% WITHOUT ANY WARRANTY; without even the implied warranty of
+%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+%% General Public License for more details.
+%
+%% You should have received a copy of the GNU General Public License along
+%% with this LaTeX source. If not, see <https://www.gnu.org/licenses/>.
+
+
+
+
+
+%% To simplify the adding of new commits.
+\newcommand{\branchcommit}[4]{
+ \draw [fill=#1, opacity=0.8] (#2,#3) circle [x radius=5.5mm, y radius=2.1mm];
+ \draw [anchor=center] (#2,#3) node {\textcolor{white}{\scriptsize\texttt{#4}}};
+}
+
+
+
+
+
+\begin{tikzpicture}
+
+ %% Just for a reference (so the image size always remains fixed). It also
+ %% helps in defining easy coordinates for all the other elements.
+ \draw [white] (0,0) -- (0,10cm);
+ \draw [white] (0,0) -- (\linewidth,0);
+
+ %% Maneage branch line.
+ \draw [black!40!white, dashed, line width=2mm] (2cm,0) -- (2cm,0.6cm);
+ \draw [->, black!40!white, line width=2mm] (2cm,0.6cm) -- (2cm,7.9cm);
+ \draw [anchor=south, black!20!white] (2cm,4cm) node [rotate=90, scale=2]
+ {\bf Maneage branch};
+
+ %% Project branch line.
+ \draw [->, black!40!white, rounded corners, line width=2mm]
+ (2cm,2cm) -- (3.5cm,2.5cm) -- (3.5cm,7.9cm);
+ \draw [black!40!white, line width=2mm] (2cm,5cm) -- (3.5cm,5.5cm);
+ \draw [anchor=south, black!20!white] (3.5cm,5cm) node [rotate=90, scale=2]
+ {\bf Project branch};
+
+ %% Derivative project
+ \draw [black!40!white, rounded corners, line width=2mm]
+ (3.5cm,4.5cm) -- (5cm,5cm) -- (5cm,6cm) -- (3.5cm,6.5cm);
+
+ %% Maneage commits.
+ \branchcommit{green!70!blue}{2cm}{1cm}{1d72e26}
+ \branchcommit{green!70!blue}{2cm}{2cm}{0c120cb}
+ \branchcommit{green!70!blue}{2cm}{3cm}{5781173}
+ \branchcommit{green!70!blue}{2cm}{4cm}{0774aac}
+ \branchcommit{green!70!blue}{2cm}{5cm}{3c05235}
+ \branchcommit{green!70!blue}{2cm}{6cm}{6ec4881}
+ \branchcommit{green!70!blue}{2cm}{7cm}{852d996}
+
+ %% Project commits.
+ \branchcommit{red!60!green}{3.5cm}{2.5cm}{4483a81}
+ \branchcommit{red!60!green}{3.5cm}{3.5cm}{5e830f5}
+ \branchcommit{red!60!green}{3.5cm}{4.5cm}{01dd812}
+ \branchcommit{red!60!green}{3.5cm}{5.5cm}{2ed0c82}
+ \branchcommit{red!60!green}{3.5cm}{6.5cm}{f62596e}
+
+ %% Derivative project commits.
+ \branchcommit{red!60!green}{5cm}{5cm}{f69e1f4}
+ \branchcommit{red!60!green}{5cm}{6cm}{716b56b}
+ \node[inner sep=0pt] at (4.5cm,7cm)
+ {\includegraphics[width=9mm]{tex/img/collaboration-icon.pdf}};
+
+
+ %% Description of this scenario:
+ \draw [anchor=west, black] (2.7cm,1.5cm) node {\textbf{Scenario 1} (pre-publication):};
+ \draw [anchor=west, black] (2.8cm,1.1cm) node {\small Collaborating on a project while};
+ \draw [anchor=west, black] (2.8cm,0.7cm) node {\small working in parallel, then merging.};
+
+
+
+ %% Middle line.
+ \draw [black] (8cm,0.5cm) -- (8cm,7.5cm);
+
+
+
+
+ %% Maneage branch line.
+ \draw [black!40!white, dashed, line width=2mm] (10cm,0) -- (10cm,0.6cm);
+ \draw [->, black!40!white, line width=2mm] (10cm,0.6cm) -- (10cm,9.9cm);
+ \draw [anchor=south, black!20!white] (10cm,4cm) node [rotate=90, scale=2]
+ {\bf Maneage branch};
+
+ %% Project branch line.
+ \draw [black!40!white, rounded corners, line width=2mm]
+ (10cm,2cm) -- (11.5cm,2.5cm) -- (11.5cm,6.9cm);
+ \draw [black!40!white, line width=2mm] (10cm,5cm) -- (11.5cm,5.5cm);
+ \draw [anchor=south, black!20!white] (11.5cm,5cm) node [rotate=90, scale=2]
+ {\bf Project branch};
+
+ %% Derivative project
+ \draw [->, black!40!white, rounded corners, line width=2mm]
+ (11.5cm,6.5cm) -- (13cm,7cm) -- (13cm,9.9cm);
+ \draw [black!40!white, line width=2mm] (10cm,8cm) -- (13cm,9cm);
+ \draw [anchor=south, black!20!white] (13cm,6.5cm) node [rotate=90, scale=2]
+ {\bf Derivative branch};
+
+ %% Maneage commits.
+ \branchcommit{green!70!blue}{10cm}{1cm}{1d72e26}
+ \branchcommit{green!70!blue}{10cm}{2cm}{0c120cb}
+ \branchcommit{green!70!blue}{10cm}{3cm}{5781173}
+ \branchcommit{green!70!blue}{10cm}{4cm}{0774aac}
+ \branchcommit{green!70!blue}{10cm}{5cm}{3c05235}
+ \branchcommit{green!70!blue}{10cm}{6cm}{6ec4881}
+ \branchcommit{green!70!blue}{10cm}{7cm}{852d996}
+ \branchcommit{green!70!blue}{10cm}{8cm}{13a1881}
+ \branchcommit{green!70!blue}{10cm}{9cm}{61b6b01}
+
+ %% Project commits.
+ \branchcommit{red!60!green}{11.5cm}{2.5cm}{4483a81}
+ \branchcommit{red!60!green}{11.5cm}{3.5cm}{5e830f5}
+ \branchcommit{red!60!green}{11.5cm}{4.5cm}{01dd812}
+ \branchcommit{red!60!green}{11.5cm}{5.5cm}{2ed0c82}
+ \branchcommit{red!60!green}{11.5cm}{6.5cm}{f62596e}
+ \node[inner sep=0pt] at (11.5cm,7.2cm) {\includegraphics[width=9mm]{tex/img/paper-icon.pdf}};
+ \draw [anchor=north, black] (11.5cm,8cm) node {\scriptsize Published};
+
+ %% Derivative project commits.
+ \branchcommit{purple!60!yellow}{13cm}{7cm}{b177c7e}
+ \branchcommit{purple!60!yellow}{13cm}{8cm}{5ae1fdc}
+ \branchcommit{purple!60!yellow}{13cm}{9cm}{bcf4512}
+
+ %% Description of this scenario:
+ \draw [anchor=west, black] (10.7cm,1.5cm) node {\textbf{Scenario 2} (post-publication):};
+ \draw [anchor=west, black] (10.8cm,1.1cm) node {\small Other researchers building upon};
+ \draw [anchor=west, black] (10.8cm,0.7cm) node {\small previously published work.};
+
+\end{tikzpicture}
diff --git a/tex/src/figure-data-lineage.tex b/tex/src/figure-data-lineage.tex
index 7379b2f..146a833 100644
--- a/tex/src/figure-data-lineage.tex
+++ b/tex/src/figure-data-lineage.tex
@@ -61,8 +61,8 @@
label={[shift={(0,-5mm)}]\texttt{format.mk}}] {};
\node (analysis2mk) [node-makefile, at={(2.67cm,-1.3cm)},
label={[shift={(0,-5mm)}]\texttt{demo-plot.mk}}] {};
- \node (analysis3mk) [node-makefile, at={(5.47cm,-1.3cm)},
- label={[shift={(0,-5mm)}]\texttt{another-step.mk}}] {};
+ \node [opacity=0.6] (analysis3mk) [node-makefile, at={(5.47cm,-1.3cm)},
+ label={[shift={(0,-5mm)}, opacity=0.6]\texttt{another-step.mk}}] {};
%% verify.mk
\node [at={(-5.3cm,-2.8cm)},
diff --git a/tex/src/figure-src-download.tex b/tex/src/figure-src-download.tex
index 74026b8..4d4b755 100644
--- a/tex/src/figure-src-download.tex
+++ b/tex/src/figure-src-download.tex
@@ -1,4 +1,4 @@
-\begin{tcolorbox}[title=\inlinecode{\textcolor{white}{download.mk}} \textcolor{white}{(only \LaTeX{} macro's rule.}]
+\begin{tcolorbox}[title=\inlinecode{\textcolor{white}{download.mk}} \hfill\textcolor{white}{(only \LaTeX{} macro's rule)}]
\footnotesize
\texttt{\mkcomment{Write download URL into the paper (through a LaTeX macro).}}
diff --git a/tex/src/figure-src-topmake.tex b/tex/src/figure-src-topmake.tex
index bd4b67d..6ed315b 100644
--- a/tex/src/figure-src-topmake.tex
+++ b/tex/src/figure-src-topmake.tex
@@ -15,10 +15,10 @@
\texttt{{ }{ }{ }{ }{ }{ }{ }{ }{ }{ }paper}\par
\vspace{1em}
- \texttt{\mkcomment{Load all the configuration files.}}\par
+ \texttt{\mkcomment{Include all the configuration files.}}\par
\texttt{\textcolor{purple}{include} reproduce/analysis/config/*.conf}
\vspace{1em}
- \texttt{\mkcomment{Load/include the subMakefiles in the specified order.}}\par
+ \texttt{\mkcomment{Include the subMakefiles in the specified order.}}\par
\texttt{\textcolor{purple}{include} \$(\textcolor{blue}{foreach} s, \$(\mkvar{makesrc}), reproduce/analysis/make/\$(\mkvar{s}).mk)}
\end{tcolorbox}
diff --git a/tex/src/references.tex b/tex/src/references.tex
index 6e1de41..510fc89 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -1428,6 +1428,26 @@ Reproducible Research in Image Processing},
+@Article{matplotlib2007,
+ Author = {Hunter, J. D.},
+ Title = {Matplotlib: A 2D graphics environment},
+ Journal = {CiSE},
+ Volume = {9},
+ Number = {3},
+ Pages = {90},
+ abstract = {Matplotlib is a 2D graphics package used for Python
+ for application development, interactive scripting, and
+ publication-quality image generation across user
+ interfaces and operating systems.},
+ publisher = {IEEE COMPUTER SOC},
+ doi = {10.1109/MCSE.2007.55},
+ year = 2007
+}
+
+
+
+
+
@ARTICLE{witten2007,
author = {Ben Witten and Bill Curry and Jeff Shragge},
title = {A New Build Environment for SEP},