author | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-01-27 03:25:16 +0000 |
---|---|---|
committer | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-01-27 03:25:16 +0000 |
commit | c9383127fae0acb87f900a8832b820cb46bc589b | |
tree | 030042dd3249aab6a1c964124f371a755f061550 | |
parent | f62596ea8b97727ab8366965faf6f316d463ebf7 | |
Analysis phase description started with Final paper and verification
With this commit, the general outline of the analysis phase is given, as
well as a description of the LaTeX macros and their relation to the paper
and their verification.
Also, the data-lineage figure was updated to include references.tex, and
the folders in file-architecture were resized to be clearer.
-rw-r--r-- | paper.tex | 119
-rw-r--r-- | tex/src/figure-data-lineage.tex | 46
-rw-r--r-- | tex/src/figure-file-architecture.tex | 44
-rw-r--r-- | tex/src/preamble-style.tex | 2
4 files changed, 157 insertions, 54 deletions
@@ -697,8 +697,10 @@ On macOS systems, we currently don't build a C compiler, but it is planned to do
 \label{sec:buildsoftware}

 All necessary software for the project, and their dependencies, are installed from source.
-Based on the completeness principle (Section \ref{principle:complete}, the dependency tree is tracked down to the GNU C Library and GNU Compiler Collection on GNU/Linux systems.
-When these two can't be installed (for example on macOS systems), the users are forced to rely on the host's C compiler and library, and this may hamper the exact reproducibility of the final result.
+Researchers using the template only have to specify the most high-level analysis software they need in \inlinecode{reproduce\-/software\-/config\-/installation\-/TARGETS.conf}.
+Based on the completeness principle (Section \ref{principle:complete}), the dependency tree is automatically traced down to the GNU C Library and GNU Compiler Collection on GNU/Linux systems.
+This creates identical high-level analysis software on any system.
+When the C library and compiler can't be installed (for example on macOS systems), the users are forced to rely on the host's C compiler and library, and this may hamper the exact reproducibility of the final result.
 Because the project's main output is a \LaTeX{}-built PDF, the project also contains an internal installation of \TeX{}Live, providing all the necessary tools to build the PDF, independent of the host operating system's \LaTeX{} version and packages.

 To build the software, the project needs access to the software source code.
@@ -773,21 +775,52 @@ We are considering using these tools, and export Bib\TeX{} entries when necessar
 \subsection{Running the analysis}
 \label{sec:projectmake}

-Once a project is configured (Section \ref{sec:projectconfigure}), all the necessary software, with precise versions and configurations have been built.
-The project is now ready to do the analysis or, run the built software on the data.
-
-
-
-
-
-
+Once a project is configured (Section \ref{sec:projectconfigure}), all the necessary software, with precise versions and configurations, is built and ready to use.
+The analysis phase of the project (running the software on the data) is also orchestrated through Makefiles (see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation} for the benefits of using Make).
+In particular, after running \inlinecode{./project make}, two high-level Makefiles are called in sequence: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk} (both are under \inlinecode{reproduce\-/analysis\-/make}, see Figure \ref{fig:files}).
+
+These two high-level Makefiles don't see any of the host's environment variables\footnote{The host environment is fully ignored before calling the analysis Makefiles through the \inlinecode{env -i} command (\inlinecode{-i} is short for \inlinecode{--ignore-environment}).
+  Note that the locally built \inlinecode{env} program is used, not the one provided by the host.
+  \inlinecode{env} is installed by GNU Coreutils.}.
+The project will define its own values for standard environment variables.
+Combined with the fact that all the software was compiled from source for this project at configuration time (Section \ref{sec:buildsoftware}), this completely isolates the analysis from the host operating system, creating an exactly reproducible result on any machine on which the project can be configured.
+For example, the project builds its own fixed version of GNU Bash (a shell).
+It also has its own \inlinecode{bashrc} startup script\footnote{The project's Bash startup script is under \inlinecode{reproduce\-/software\-/bash\-/bashrc.sh}, see Figure \ref{fig:files}.}.
+Therefore the \inlinecode{BASH\_ENV} environment variable is set to load this startup script and the \inlinecode{HOME} environment variable is set to the build directory, so that no existing Bash startup file in the user's home directory can leak into the analysis.
+The former, \inlinecode{top-prepare.mk}, is in charge of optimizing the main job orchestration of \inlinecode{top-make.mk}, or ``preparing'' for it.
+In many situations it may not be necessary at all, but we'll introduce its role with an example.
+Let's assume the raw input data (that the project received from a database) has 5000 rows (potential targets).
+However, based on an initial selection criterion, the project only needs to work on 100 of them.
+If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to follow the data lineage of all of them.
+If the analysis is complex (has many steps), this can be slow (many of its executions will be redundant), and the researchers have to add checks in many places to ignore those that aren't necessary (which will add to complexity and cause bugs).
+However, if this basic selection is done before calling \inlinecode{top-make.mk}, only the required 100 targets and their lineage are orchestrated.
+This allows Make to optimally organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel), and also greatly simplifies the coding (no extra checks are necessary there).
+Where necessary this preparation is done in \inlinecode{top-prepare.mk}.
+
+Generally, \inlinecode{top-prepare.mk} will not be necessary in many scenarios and its internal design and concepts are identical to \inlinecode{top-make.mk}.
+Hence, we'll continue with a detailed discussion of \inlinecode{top-make.mk} below and touch upon the differences with \inlinecode{top-prepare.mk} in the end.
+
+A normal project will usually consist of many analysis steps, including data access (possibly by downloading), and running various steps of the analysis on them.
+Having everything in one Makefile will create a very large file, which can be hard to maintain, extend/grow, read, reuse, and cite.
+Generally, this is against the modularity principle (Section \ref{principle:modularity} above).
+Therefore the project is designed to encourage modularity and facilitate all these points by distributing the analysis in multiple Makefiles that contain contextually-similar analysis steps or ``rules''.
+For Make this distribution is just cosmetic: they are all loaded into \inlinecode{top-make.mk} and executed in one instance of Make.
+Within the project's source the lower-level Makefiles are also placed in \inlinecode{reproduce\-/analysis\-/make} (like \inlinecode{top-make.mk}), see Figure \ref{fig:files}.
+Therefore by design, \inlinecode{top-make.mk} is very simple: it just defines the ultimate target, and the name and order of the lower-level Makefiles that should be loaded.
+
+Figure \ref{fig:datalineage} is a general overview of the analysis phase in a hypothetical project using this template.
+As described above and shown in that figure, \inlinecode{top-make.mk} imports the various lower-level Makefiles that are in charge of the different phases of the analysis.
+Each of the lower-level Makefiles builds intermediate targets (files) which are also shown there.
+In the subsections below, the project's analysis is described using this graph.
+We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting from the ultimate target in Section \ref{sec:paperpdf}, and tracing the data's lineage up to the inputs.

 \begin{figure}[t]
   \begin{center}
     \includetikz{figure-data-lineage}
   \end{center}
   \vspace{-7mm}
-  \caption{\label{fig:analysisworkflow}Schematic representation of built file dependencies in a hypothetical project/pipeline using the reproducible paper template.
+  \caption{\label{fig:datalineage}Schematic representation of built file dependencies in a hypothetical project/pipeline using the reproducible paper template.
     Each colored box is a file in the project and the arrows show the dependencies between them.
     Green files/boxes are plain text files that are under version control and in the source-directory.
     Blue files/boxes are output files of various steps in the build-directory, located within the Makefile (\inlinecode{*.mk}) that generates them.
@@ -800,6 +833,70 @@ The project is now ready to do the analysis or, run the built software on the da
+\subsubsection{Ultimate target: the project's paper or report}
+\label{sec:paperpdf}
+
+The ultimate purpose of a project is to report its result and interpret it in a larger context.
+In scientific projects, this is the final published paper.
+The raw result is usually one or more datasets that are visualized, for example as a plot or figure, and blended into the narrative description.
+In Figure \ref{fig:datalineage} this report is shown as \inlinecode{paper.pdf}.
+In the complete directed graph of this figure, \inlinecode{paper.pdf} is the only node that has no outward edges (arrows).
+
+The source of this report (containing the main narrative and positioning of figures or tables) is \inlinecode{paper.tex}.
+To build the PDF, \inlinecode{references.tex} and \inlinecode{project.tex} are also loaded into \LaTeX{}.
+Another class of files that may be loaded into \LaTeX{}, but are not shown to avoid complications in the figure, are the figure or plot PDFs which may have been created during the analysis steps.
+\inlinecode{references.tex} is part of the project's source and can contain the Bib\TeX{} entries for the bibliography of the final report.
+
+To understand \inlinecode{project.tex}, note that besides figures, plots or tables, the analysis also outputs quantitative values that are blended into the sentences of the report's narration.
+For example this sentence in the abstract of \citet{akhlaghi19}: ``... the outer wings of M51 down to S/N of 0.25 ...''.
+The reported signal-to-noise (S/N) value ``0.25'' depends on the analysis and is an output of the analysis just like the paper's figures and plots.
+Manually typing the number in the \LaTeX{} source is prone to serious errors: the author may forget to check it after a change in the analysis, like using a newer version of the software, or changing an analysis parameter for another part of the paper.
+Given the evolution of scientific projects, this type of human error is very hard to avoid.
+
+Calculated values mentioned within the narrative's sentences must therefore be automatically generated.
+In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} uses this template to be reproducible, so its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} links that are also available on arXiv.} looks like this: ``\inlinecode{\small the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''.
+The \LaTeX{} macro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically defined through \inlinecode{project.tex} and expands to the value ``0.25''.
+\inlinecode{project.tex} contains such a macro for every project output that is included within the narrative's sentences (or any other usage by \LaTeX).
+
+However, managing all the necessary \LaTeX{} macros for a full project in one file is against the modularity principle and can be frustrating.
+Therefore, in this system, all lower-level Makefiles \emph{must} contain a fixed target with the same name, but with a \inlinecode{.tex} suffix.
+For example in Figure \ref{fig:datalineage}, assume \texttt{out-1b.dat} is a table and the mean of its third column must be reported in the paper.
+To do this, \inlinecode{out-1b.dat} is set to be a prerequisite of the rule for \inlinecode{analysis1.tex} (as shown by the arrow in Figure \ref{fig:datalineage}).
+The recipe of this rule will calculate the mean of the column and put the result in the \LaTeX{} macro which is written in \inlinecode{analysis1.tex}.
+The same is done for any other reported calculation from \inlinecode{analysis1.mk}.
+
+These \LaTeX{} macro files are thus the core skeleton of the project: as shown in Figure \ref{fig:datalineage}, the outward arrows of all built files ultimately lead to one of these \LaTeX{} macro files.
+However, note that built files in a lower-level Makefile don't have to point to that Makefile's \LaTeX{} macro file.
+For example even though \inlinecode{input1.dat} is a target in \inlinecode{download.mk}, it isn't a prerequisite of \inlinecode{download.tex}, it is a prerequisite of \inlinecode{out-2a.dat} (a target in \inlinecode{analysis2.mk}), and ultimately ends in \inlinecode{analysis3.tex}.
+
+
+
+
+
+\subsubsection{Verification of outputs}
+\label{sec:outputverification}
+An important principle for this template is that outputs should be automatically verified, see Section \ref{principle:verify}.
+As shown in Figure \ref{fig:datalineage}, the \LaTeX{} macro files are not direct prerequisites of \inlinecode{project.tex}, but of \inlinecode{verify\-.tex}.
+Prior to publication, the project authors should add the MD5 checksums of all the \LaTeX{} macro files in \inlinecode{verify\-.tex} (the necessary structure is already there).
+If any \LaTeX{} macro is different in future builds of the project, the project will abort with a warning that names the problematic file.
+
+When projects involve other outputs (for example images, tables or datasets that will also be published), their contents should also be validated.
+To do this, prerequisites should be added to the \inlinecode{verify\-.tex} rule that automatically check the contents of other project outputs.
+However, verification of other files should be done with care: as mentioned in Section \ref{principle:verify}, many tools that produce datasets write the creation date into the produced file.
+The verification in such cases should be done by ignoring such terms in the generated files.
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/tex/src/figure-data-lineage.tex b/tex/src/figure-data-lineage.tex
index d849e8c..aa2a397 100644
--- a/tex/src/figure-data-lineage.tex
+++ b/tex/src/figure-data-lineage.tex
@@ -1,24 +1,25 @@
-\newcommand{\paperpdf}{}
-\newcommand{\papertex}{}
-\newcommand{\projecttex}{}
-\newcommand{\verifytex}{}
-\newcommand{\initializetex}{}
-\newcommand{\downloadtex}{}
-\newcommand{\inputtwo}{}
-\newcommand{\inputsconf}{}
-\newcommand{\analysisonetex}{}
-\newcommand{\outoneb}{}
-\newcommand{\outonebdep}{}
-\newcommand{\inputone}{}
-\newcommand{\inputonedep}{}
-\newcommand{\analysistwotex}{}
-\newcommand{\outtwob}{}
-\newcommand{\outtwobdep}{}
-\newcommand{\analysisthreetex}{}
-\newcommand{\analysisthreeouts}{}
-\newcommand{\outtwoa}{}
-\newcommand{\outtwoadep}{}
-\newcommand{\outthreeadep}{}
+% All macros commented % 1
+\newcommand{\paperpdf}{} % 2
+\newcommand{\papertex}{} % 3
+\newcommand{\projecttex}{} % 4
+\newcommand{\verifytex}{} % 5
+\newcommand{\initializetex}{} % 6
+\newcommand{\downloadtex}{} % 7
+\newcommand{\inputtwo}{} % 8
+\newcommand{\inputsconf}{} % 9
+\newcommand{\analysisonetex}{} % 10
+\newcommand{\outoneb}{} % 11
+\newcommand{\outonebdep}{} % 12
+\newcommand{\inputone}{} % 13
+\newcommand{\inputonedep}{} % 14
+\newcommand{\analysistwotex}{} % 15
+\newcommand{\outtwob}{} % 16
+\newcommand{\outtwobdep}{} % 17
+\newcommand{\analysisthreetex}{} % 18
+\newcommand{\analysisthreeouts}{} % 19
+\newcommand{\outtwoa}{} % 20
+\newcommand{\outtwoadep}{} % 21
+\newcommand{\outthreeadep}{} % 22
@@ -96,7 +97,10 @@
 %% paper.tex
 \ifdefined\papertex
+  \node (reftex) [node-nonterminal, at={(2.67cm,-4.2cm)}] {references.tex};
   \node (papertex)
[node-nonterminal, at={(5.47cm,-4.2cm)}] {paper.tex};
+  \node (papertex-north) [node-point, at={(5.47cm,-3.58cm)}] {};
+  \draw [rounded corners] (reftex) |- (papertex-north);
   \draw [->] (papertex) -- (paperpdf);
 \fi
diff --git a/tex/src/figure-file-architecture.tex b/tex/src/figure-file-architecture.tex
index c892232..a4d4755 100644
--- a/tex/src/figure-file-architecture.tex
+++ b/tex/src/figure-file-architecture.tex
@@ -54,7 +54,7 @@
   \fi
   %% reproduce/software/bash
-  \node [dirbox, at={(-5.75cm,-0.8cm)}, minimum width=2.6cm, minimum height=2.1cm,
+  \node [dirbox, at={(-5.75cm,-0.8cm)}, minimum width=2.6cm, minimum height=1.6cm,
         label={[shift={(0,-5mm)}]\texttt{bash/}}, fill=brown!25!white] {};
   \ifdefined\fullfilearchitecture
     \node [node-nonterminal-thin, at={(-5.75cm,-1.5cm)}] {bashrc.sh};
@@ -75,32 +75,34 @@
         label={[shift={(0,-5mm)}]\texttt{analysis/}}, fill=brown!20!white] {};
   %% reproduce/analysis/config
-  \node [dirbox, at={(0.15cm,1.5cm)}, minimum width=2.6cm, minimum height=2.1cm,
+  \node [dirbox, at={(0.15cm,1.5cm)}, minimum width=2.6cm, minimum height=2.6cm,
         label={[shift={(0,-5mm)}]\texttt{config/}}, fill=brown!25!white] {};
   \node [node-nonterminal-thin, at={(0.15cm,0.8cm)}] {INPUTS.conf};
   \node [node-nonterminal-thin, at={(0.15cm,0.3cm)}] {param-1.conf};
-  \node [node-nonterminal-thin, at={(0.15cm,-0.2cm)}] {param-2.conf};
+  \node [node-nonterminal-thin, at={(0.15cm,-0.2cm)}] {param-2a.conf};
+  \node [node-nonterminal-thin, at={(0.15cm,-0.7cm)}] {param-2b.conf};
   %% reproduce/analysis/make
-  \node [dirbox, at={(2.95cm,1.5cm)}, minimum width=2.6cm, minimum height=2.1cm,
+  \node [dirbox, at={(2.95cm,1.5cm)}, minimum width=2.6cm, minimum height=2.6cm,
         label={[shift={(0,-5mm)}]\texttt{make/}}, fill=brown!25!white] {};
-  \node [node-nonterminal-thin, at={(2.95cm,0.8cm)}] {initialize.mk};
-  \node [node-nonterminal-thin, at={(2.95cm,0.3cm)}] {download.mk};
-  \node [node-nonterminal-thin, at={(2.95cm,-0.2cm)}] {analysis-1.mk};
+  \node [node-nonterminal-thin,
at={(2.95cm,0.8cm)}] {top-prepare.mk};
+  \node [node-nonterminal-thin, at={(2.95cm,0.3cm)}] {top-make.mk};
+  \node [node-nonterminal-thin, at={(2.95cm,-0.2cm)}] {initialize.mk};
+  \node [node-nonterminal-thin, at={(2.95cm,-0.7cm)}] {analysis1.mk};
   %% reproduce/analysis/bash
-  \node [dirbox, at={(0.15cm,-0.8cm)}, minimum width=2.6cm, minimum height=2.1cm,
+  \node [dirbox, at={(0.15cm,-1.3cm)}, minimum width=2.6cm, minimum height=1.1cm,
         label={[shift={(0,-5mm)}]\texttt{bash/}}, fill=brown!25!white] {};
   \ifdefined\fullfilearchitecture
-    \node [node-nonterminal-thin, at={(0.15cm,-1.5cm)}] {process-A.sh};
+    \node [node-nonterminal-thin, at={(0.15cm,-2.0cm)}] {process-A.sh};
   \fi
   %% reproduce/analysis/python
-  \node [dirbox, at={(2.95cm,-0.8cm)}, minimum width=2.6cm, minimum height=2.1cm,
+  \node [dirbox, at={(2.95cm,-1.3cm)}, minimum width=2.6cm, minimum height=1.6cm,
         label={[shift={(0,-5mm)}]\texttt{python/}}, fill=brown!25!white] {};
   \ifdefined\fullfilearchitecture
-    \node [node-nonterminal-thin, at={(2.95cm,-1.5cm)}] {operation-B.py};
-    \node [node-nonterminal-thin, at={(2.95cm,-2.0cm)}] {fitting-plot.py};
+    \node [node-nonterminal-thin, at={(2.95cm,-2.0cm)}] {operation-B.py};
+    \node [node-nonterminal-thin, at={(2.95cm,-2.5cm)}] {fitting-plot.py};
   \fi
   %% tex/
@@ -108,28 +110,28 @@
         label={[shift={(0,-5mm)}]\texttt{tex/}}, fill=brown!15!white] {};
   %% tex/src
-  \node [dirbox, at={(6cm,2.1cm)}, minimum width=2.5cm, minimum height=2.1cm,
+  \node [dirbox, at={(6cm,2.1cm)}, minimum width=2.5cm, minimum height=1.6cm,
         label={[shift={(0,-5mm)}]\texttt{src/}}, fill=brown!20!white] {};
+  \node [node-nonterminal-thin, at={(6cm,1.4cm)}] {references.tex};
   \ifdefined\fullfilearchitecture
-    \node [node-nonterminal-thin, at={(6cm,1.4cm)}] {preamble-1.tex};
     \node [node-nonterminal-thin, at={(6cm,0.9cm)}] {figure-1.tex};
   \fi
   %% tex/build
   \ifdefined\fullfilearchitecture
-    \node [dirbox, at={(6cm,-0.2cm)}, minimum width=2.5cm, minimum height=1.3cm,
+    \node [dirbox, at={(6cm,0.1cm)},
minimum width=2.5cm, minimum height=1.3cm,
          label={[shift={(0,-5mm)}]\texttt{build/}}, dashed, , fill=brown!20!white] {};
-    \node [anchor=west, at={(4.7cm,-1.0cm)}] {\scriptsize\sf Symbolic link to};
-    \node [anchor=west, at={(4.7cm,-1.3cm)}] {\scriptsize\sf \LaTeX{} build directory.};
+    \node [anchor=west, at={(4.7cm,-0.7cm)}] {\scriptsize\sf Symbolic link to};
+    \node [anchor=west, at={(4.7cm,-1.0cm)}] {\scriptsize\sf \LaTeX{} build directory.};
   \fi
   %% tex/tikz
   \ifdefined\fullfilearchitecture
-    \node [dirbox, at={(6cm,-1.7cm)}, minimum width=2.5cm, minimum height=1.6cm,
+    \node [dirbox, at={(6cm,-1.6cm)}, minimum width=2.5cm, minimum height=1.6cm,
          label={[shift={(0,-5mm)}]\texttt{tikz/}}, dashed, fill=brown!20!white] {};
-    \node [anchor=west, at={(4.67cm,-2.5cm)}] {\scriptsize\sf Symbolic link to TikZ};
-    \node [anchor=west, at={(4.67cm,-2.8cm)}] {\scriptsize\sf directory (figures built};
-    \node [anchor=west, at={(4.67cm,-3.1cm)}] {\scriptsize\sf by \LaTeX).};
+    \node [anchor=west, at={(4.67cm,-2.4cm)}] {\scriptsize\sf Symbolic link to TikZ};
+    \node [anchor=west, at={(4.67cm,-2.7cm)}] {\scriptsize\sf directory (figures built};
+    \node [anchor=west, at={(4.67cm,-3.0cm)}] {\scriptsize\sf by \LaTeX).};
   \fi
   %% .local
diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex
index fe36eec..c3aeca2 100644
--- a/tex/src/preamble-style.tex
+++ b/tex/src/preamble-style.tex
@@ -136,4 +136,4 @@
 }
 %% Custom macros
-\newcommand{\inlinecode}[1]{`\textcolor{blue!35!black}{\texttt{#1}}'}
+\newcommand{\inlinecode}[1]{\textcolor{blue!35!black}{\texttt{#1}}}
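
The macro mechanism described in the paper.tex hunk of this commit (values computed by the analysis are written into LaTeX macro files, loaded through project.tex, and expanded inside the narrative) can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the template's actual files: the macro definition is inlined here so the sketch compiles on its own, whereas in the project it would be generated by a Makefile recipe. Only the macro name \demosfoptimizedsn and the value 0.25 are taken from the text of the diff.

```latex
\documentclass{article}

% Assumption for this sketch: the definition below stands in for the
% generated macro file that the project loads through project.tex.
% In the real project it is written by the analysis, never typed by hand.
\newcommand{\demosfoptimizedsn}{0.25}

\begin{document}

% The narrative uses the macro instead of a hard-coded number, so the
% sentence follows the analysis output whenever the value changes.
... the outer wings of M51 down to S/N of $\demosfoptimizedsn$ ...

\end{document}
```

Keeping the number out of the narrative source means a rerun of the analysis updates the sentence without anyone having to remember to edit it.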