author    | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-03-02 02:55:00 +0000
committer | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-03-02 02:55:00 +0000
commit    | e9c81f9f40187bc4701ac539d110003e92b9ca69 (patch)
tree      | e3929f2a116cf1bc790eaf88883e0315ef0ffa6c
parent    | f51082b27e47e552658000689161c150d9c9a70e (diff)
Described the first analysis phase with a demo subMakefile
Until now, there was no explanation of an actual analysis phase, so this
commit adds an example scenario with a readable subMakefile.
The data lineage graph was also simplified, both to be more readable and
to correspond to the new explanation and subMakefile.
Some scattered edits/typos were corrected and some references were added
for the discussion.
-rw-r--r-- | paper.tex | 300
-rw-r--r-- | reproduce/analysis/config/INPUTS.mk | 2
-rw-r--r-- | reproduce/analysis/make/analysis-1.mk (renamed from reproduce/analysis/make/menke2020.mk) | 10
-rw-r--r-- | reproduce/analysis/make/download.mk | 8
-rw-r--r-- | reproduce/analysis/make/top-make.mk | 2
-rw-r--r-- | reproduce/software/config/installation/texlive.mk | 3
-rw-r--r-- | tex/src/figure-data-lineage.tex | 28
-rw-r--r-- | tex/src/figure-download.tex | 8
-rw-r--r-- | tex/src/figure-mk20tab3.tex | 50
-rw-r--r-- | tex/src/preamble-style.tex | 11
-rw-r--r-- | tex/src/references.tex | 59
11 files changed, 383 insertions, 98 deletions
@@ -24,7 +24,7 @@ -\title{Reproducible data analysis, preserving data lineage} +\title{Maneage: Customizable Template for Managing Data Lineage} \author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2}\\ { \footnotesize\mplight @@ -190,6 +190,7 @@ Finally in Section \ref{sec:discussion} the future prospects of using systems li \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}. \item \citet{menke20} on the ``Rigor and Transparency Index'', in particular showing how practices have improved but not enough. Also, how software identifiability has seen the best improvement. +\item \citet{dicosmo19} summarize the special place of software in modern science very nicely: ``Software is a hybrid object in the world research as it is equally a driving force (as a tool), a result (as proof of the existence of a solution) and an object of study (as an artefact)''. \end{itemize} @@ -796,69 +797,103 @@ We are considering using these tools, and export Bib\TeX{} entries when necessar -\subsection{Running the analysis} -\label{sec:projectmake} + + + +\subsection{High-level organization of analysis} +\label{sec:highlevelanalysis} Once a project is configured (Section \ref{sec:projectconfigure}), all the necessary software, with precise versions and configurations, are built and ready to use. -The analysis phase of the project (running the software on the data) is also orchestrated through Makefiles (see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation} for the benefits of using Make). +The analysis phase of the project (running the software on the data) is also orchestrated through Makefiles. +For the unique advantages of using Make to manage a research project, see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation}. +In order to best follow the principle of modularity (Section \ref{principle:modularity}), the analysis is not done in one phase or with a single Makefile.
+Here, the two high-level phases of the analysis are reviewed. +The organization of the lower-level analysis, in many modular Makefiles, is discussed in Section \ref{sec:lowlevelanalysis}. + +After running \inlinecode{./project make}, the analysis is done in two sequential phases: 1) preparation and 2) main analysis. +The purpose of the preparation phase is further elaborated in Section \ref{sec:prepare}. +Technically, these two phases are managed by the two high-level Makefiles: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk}. +Both are under \inlinecode{reproduce\-/analysis\-/make} (see Figure \ref{fig:files}) and both have an identical lower-level analysis structure. +But before that, in Section \ref{sec:analysisenvironment} the isolation of the analysis environment from the host is discussed. + + + + + +\subsubsection{Isolated analysis environment} +\label{sec:analysisenvironment} +By default, the analysis part of the project is not exposed to any of the host's environment variables. +This is accomplished through the `\inlinecode{env -i}' command\footnote{Note that the project's self-built \inlinecode{env} program is used, not the one provided by the host operating system. Within the project, \inlinecode{env} is installed as part of GNU Coreutils and \inlinecode{-i} is short for \inlinecode{--ignore-environment}.}, which will remove the host environment. +The project will define its own values for standard environment variables to avoid using system or user defaults. +Combined with the fact that all the software were configured and compiled from source for each project at configuration time (Section \ref{sec:buildsoftware}), this completely isolates the analysis from the host operating system, creating an exactly reproducible result on any machine where the project can be configured. + +For example, the project builds its own fixed version of GNU Bash (a command-line shell environment).
+It also has its own \inlinecode{bashrc} startup script\footnote{The project's Bash startup script is under \inlinecode{reproduce\-/software\-/bash\-/bashrc.sh}, see Figure \ref{fig:files}.}, and the \inlinecode{BASH\_ENV} environment variable is set to load this startup script. +Furthermore, the \inlinecode{HOME} environment variable is set to \inlinecode{BDIR} to avoid the penetration of any existing Bash startup file of the user's home directory into the analysis. + + -The analysis Makefiles don't see any of the the host's environment variables\footnote{The host environment is fully ignored before calling the analysis Makefiles through the \inlinecode{env -i} command (\inlinecode{-i} is short for \inlinecode{--ignore-environment}). - Note that the project's own \inlinecode{env} program is used, not the one provided by the host OS, \inlinecode{env} is installed by GNU Coreutils.}. -The project will define its own values for standard environment variables. -Combined with the fact that all the software were compiled from source for this project at configuration time (Section \ref{sec:buildsoftware}), this completely isolates the analysis from the host operating system, creating an exactly reproducible result on any machine that the project can be configured. -For example, the project builds is own fixed version of GNU Bash (a shell). -It also has its own \inlinecode{bashrc} startup script\footnote{The project's Bash startup script is under \inlinecode{reproduce\-/software\-/bash\-/bashrc.sh}, see Figure \ref{fig:files}.}. -Therefore the \inlinecode{BASH\_ENV} environment variable is set to load this startup script and the \inlinecode{HOME} environment variable is set to \inlinecode{BDIR} to avoid the penetration of any existing Bash startup file of the user's home directory into the analysis. 
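The isolation mechanism described above can be sketched with a short shell session. This is only an illustration: the build directory, the `bashrc.sh` content and the re-defined variable values below are hypothetical stand-ins for the project's real configuration.

```shell
# Minimal sketch of the isolation described above (hypothetical paths).
BDIR=$(mktemp -d)                      # stand-in for the build directory
printf 'echo "loaded project bashrc"\n' > "$BDIR/bashrc.sh"

# `env -i` drops every inherited variable; the project then re-defines
# only the standard variables it needs (PATH, HOME, BASH_ENV, ...).
env -i PATH=/usr/bin:/bin HOME="$BDIR" BASH_ENV="$BDIR/bashrc.sh" \
    bash -c 'echo "HOME inside analysis: $HOME"'
```

Because `BASH_ENV` points at the project's own startup script, every non-interactive Bash invoked this way first sources that script, and `HOME` resolves to the build directory rather than the user's home.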
-In particular, after running \inlinecode{./project make}, the analysis is done in two phases: a preparation step which is described in Section \ref{sec:prepare} and the final analysis step that is described in Section \ref{sec:prepare}. -Technically, these two phases are managed by two top-level Makefiles: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk}. -Both are under \inlinecode{reproduce\-/analysis\-/make} (see Figure \ref{fig:files}). \subsubsection{Preparation phase} \label{sec:prepare} -The first analysis Makefile that is run is \inlinecode{top-prepare.mk}. -It is in charge of any selection steps that may be necessary to optimize \inlinecode{top-make.mk}, or to ``prepare'' for it. -In many situations it may not be necessary at all and can be completely ignored. +When \inlinecode{./project make} is called, the first Makefile that is run is \inlinecode{top-prepare.mk}. +It is designed for any selection steps that may be necessary to optimize \inlinecode{top-make.mk}, or to ``prepare'' for it. +It is mainly useful when the research targets are more focused than the raw input and may not be necessary in many scenarios. +Its role is described here with an example. -We'll introduce its role with an example. Let's assume the raw input data (that the project received from a database) has 5000 rows (potential targets for doing the analysis on). However, this particular project only needs to work on 100 of them, not the full 5000. If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to create a data lineage for all 5000 targets and project authors have to add checks in many places to ignore those that aren't necessary. -This will add to the project's complexity and cause bugs. -Furthermore, if the filesystem isn't fast (for example a filesystem that exists over a network), checking the file dates over the full lineage can be slow. +This will add to the project's complexity and is prone to many bugs.
+Furthermore, if the filesystem isn't fast (for example a filesystem that exists over a network), checking all the intermediate and final files over the full lineage can be slow. + +In this scenario, the preparation phase finds the IDs of the 100 targets of interest and saves them as a Make variable in a file under \inlinecode{BDIR}. +Later, this file is loaded into the analysis phase, precisely identifying the project's targets-of-interest. +This selection phase can't be done within \inlinecode{top-make.mk} because the full data lineage (all input and output files) must be known to Make before it starts to execute the necessary operations. +It is possible for Make to call itself as another Makefile, but this practice is strongly discouraged here because it makes the flow very hard to read. +However, if the project authors insist on calling Make within Make, it is certainly possible. -In the scenario above, the Makefiles called by \inlinecode{top-prepare.mk} would be in charge of finding the IDs of the 100 targets of interest and saving them as a Make variable that is later loaded into one of the analysis Makefiles (that are loaded by \inlinecode{top-make.mk}). -This can't be done within \inlinecode{top-make.mk} because the full data lineage (all input and output files) must be known before Make is run. The ``preparation'' phase thus allows \inlinecode{top-make.mk} to optimally organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel). It also greatly simplifies the coding for the project authors. - Ideally \inlinecode{top-prepare.mk} is only for the ``preparation phase''. -However, projects can be complex and its up to the authors which parts of an analysis are ``preparation'' for an analysis and and which parts are the actual analysis. -Generally, the internal design and concepts of \inlinecode{top-prepare.mk} are identical to \inlinecode{top-make.mk} so it won't be discussed any further.
+However, projects can be complex and ultimately the choice of which parts of an analysis constitute the ``preparation'' can be highly subjective. +Generally, the internal design and concepts of \inlinecode{top-prepare.mk} are identical to \inlinecode{top-make.mk}. +Therefore in Section \ref{sec:lowlevelanalysis}, where the lower-level management is discussed, we will only focus on the latter to avoid confusion. + -\subsubsection{Main analysis phase} -\label{sec:analysis} -A normal project will usually consist of many analysis steps, including data access (possibly by downloading), running various steps of the analysis on them, and creating the necessary plots, figures or outputs for the report/paper. -If all of these steps are organized in a single Makefile, it will become very large and will be hard to maintain, extend/grow, read, reuse, and cite. -Generally, large files are bad practice in any management style because it is against the modularity principle (Section \ref{principle:modularity}). -This solution is thus designed to encourage and facilitate modularity by distributing the analysis in many Makefiles that contain contextually-similar (or modular) analysis steps. -This distribution is thus primarily done for the human writers and readers of the project. -For Make it is cosmetic: they are all loaded into \inlinecode{top-make.mk} and executed in one instance of Make. -In other words, Make sees them all as one file anyway. -Within the project's source, the subMakefiles are also placed in \inlinecode{reproduce\-/analysis\-/make} (like \inlinecode{top-make\-.mk}), see Figure \ref{fig:files}. -Therefore by design, \inlinecode{top-make.mk} is very simple: it just defines the ultimate target, and the name and order of the subMakefiles that should be loaded.
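The hand-over between the two phases described above can be sketched in plain shell. The file name `targets.mk`, the variable name `TARGET-IDS` and the selection criterion are all hypothetical; the point is only the mechanism: the preparation phase writes the selected IDs as a Make variable into a file under the build directory, which the analysis Makefiles later read with an `include` statement.

```shell
# Hypothetical sketch of the preparation phase's output: select 100 of
# the 5000 potential targets and record their IDs under BDIR.
BDIR=$(mktemp -d)                        # stand-in for the build directory

# Stand-in selection step: pretend the targets of interest are IDs 0-99.
selected=$(seq 0 99 | tr '\n' ' ')

# Write them as a Make variable; top-make.mk would later load this file
# with `include`, so the full lineage is known before Make executes.
printf 'TARGET-IDS = %s\n' "$selected" > "$BDIR/targets.mk"
head -c 40 "$BDIR/targets.mk"
```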
+ + +\subsection{Low-level organization of analysis} +\label{sec:lowlevelanalysis} + +A project consists of many steps, including data access (possibly by downloading), running various steps of the analysis on the obtained data, and creating the necessary plots, figures or tables for a published report, or output datasets for a database. +If all of these steps are organized in a single Makefile, it will become very long and hard to maintain, extend/grow, read, reuse, and cite. +Generally, large files are a bad practice because they go against the modularity principle (Section \ref{principle:modularity}). + +The proposed template is thus designed to encourage and facilitate modularity by distributing the analysis in many Makefiles that contain contextually-similar (or modular) analysis steps. +In the rest of this paper these modular, or lower-level, Makefiles will be called \emph{subMakefiles}. +The subMakefiles are loaded into \inlinecode{top-make.mk} in a certain order and executed in one instance of Make without recursion (see Section \ref{sec:nonrecursivemake} below). +In other words, this modularity is just cosmetic for Make: Make ``sees'' all the subMakefiles as parts of one file. +However, this modularity plays a critical role for the human reader/author of the project and is necessary for re-using or citing parts of the analysis in other projects. + +Within the project's source, the subMakefiles are placed in \inlinecode{reproduce\-/analysis\-/make} (with \inlinecode{top-make\-.mk}), see Figure \ref{fig:files}. +Therefore by design, \inlinecode{top-make.mk} is very simple: it just defines the ultimate target (\inlinecode{paper\-.pdf}), and the name and order of the subMakefiles that should be loaded into Make. + +The precise organization of the analysis steps highly depends on each individual project. +However, many aspects of the project management are the same irrespective of the particular project; here we will focus on those.
Figure \ref{fig:datalineage} is a general overview of the analysis phase in a hypothetical project using this template. -As described above and shown in that figure, \inlinecode{top-make.mk} imports the various modular Makefiles under the \inlinecode{reproduce/} directory that are in charge of the different phases of the analysis. -Let's call them `subMakefiles'. -Each of the subMakefiles builds intermediate targets (files) which are shown there as blue boxes. +As described above and shown in Figure \ref{fig:datalineage}, \inlinecode{top-make.mk} imports the various Makefiles under the \inlinecode{reproduce/} directory that are in charge of the different phases of the analysis. +Each of the subMakefiles builds intermediate targets, or outputs (files), which are shown there as blue boxes. In the subsections below, the project's analysis is described using this graph. -We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form the ultimate target in Section \ref{sec:paperpdf}, and tracing back its lineage all the way up to the inputs and configuration files. +We'll follow Make's paradigm (see Section \ref{sec:usingmake}) of starting from the ultimate target in Section \ref{sec:paperpdf}, and tracing back its lineage all the way up to the inputs and configuration files. \begin{figure}[t] \begin{center} @@ -874,8 +909,31 @@ We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form th } \end{figure} +To avoid getting too abstract in the subsections below, where necessary, we'll do a basic analysis on the data of \citet[data were published as supplementary material on bioRxiv]{menke20} and try to replicate some of their results. +Note that because we are not using the same software, this isn't a reproduction (see Section \ref{definition:reproduction}).
+We can't use the same software because they use Microsoft Excel for the analysis which violates several of our principles: 1) Completeness (as a graphical user interface program, it needs human interaction, Section \ref{principle:complete}), 2) Minimal complexity (even free software alternatives like LibreOffice involve many dependencies and are extremely hard to build, Section \ref{principle:complexity}) and 3) Free software (Section \ref{principle:freesoftware}). + + +\subsubsection{Non-recursive Make} +\label{sec:nonrecursivemake} + +It is possible to call a new instance of Make within an existing Make instance. +This is also known as recursive Make\footnote{\url{https://www.gnu.org/software/make/manual/html_node/Recursion.html}}. +Recursive Make is in fact used by many Make users, especially in the software development communities. +It is also possible within a project using the proposed template. + +However, recursive Make is discouraged in the template, and not used in it. +All the subMakefiles mentioned above are \emph{included}\footnote{\url{https://www.gnu.org/software/make/manual/html_node/Include.html}} into \inlinecode{top-make.mk}, i.e., their contents are read into Make as it is parsing \inlinecode{top-make.mk}. +In Make's view, this is identical to having one long file with all the subMakefiles concatenated to each other. +Make is only called once and there is no recursion. + +Furthermore, we have the convention that only \inlinecode{top-make.mk} (or \inlinecode{top-prepare.mk}) can include subMakefiles. +SubMakefiles should not include other subMakefiles. +The main reason behind this convention is the Minimal Complexity principle (Section \ref{principle:complexity}): a simple glance at \inlinecode{top-make.mk} will immediately show \emph{all} the project's subMakefiles \emph{and} their loading order.
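The include-based, non-recursive loading can be demonstrated with a toy example. All file names here are invented for the demo, and `.RECIPEPREFIX` (GNU Make >= 3.82) is used only so the sketch avoids literal TAB characters; the template's real Makefiles use ordinary TAB-indented recipes.

```shell
# Toy demonstration of non-recursive Make: two "subMakefiles" are
# included into a top-level Makefile and Make runs exactly once.
dir=$(mktemp -d) && cd "$dir"

printf '.RECIPEPREFIX = >\nall: final.txt\ninclude sub1.mk\ninclude sub2.mk\n' > top.mk
printf 'mid.txt:\n>echo mid > $@\n' > sub1.mk
printf 'final.txt: mid.txt\n>cat $< > $@; echo final >> $@\n' > sub2.mk

# One Make instance parses top.mk, sub1.mk and sub2.mk as one file:
make -f top.mk
cat final.txt
```

In Make's view this is one concatenated file, so the dependency from `final.txt` to `mid.txt` crosses the "subMakefile" boundary with no recursion.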
+When the names of the subMakefiles are descriptive enough, this enables both the project authors, and later, project readers to get a complete view of the various stages of the project. + \subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})} @@ -884,8 +942,8 @@ We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form th The ultimate purpose of a project is to report its result and interpret it in a larger context of human knowledge. In scientific projects, this is the final published paper. The raw result is usually dataset(s) that is (are) visualized, for example as a plot, figure or table and blended into the narrative description. -In Figure \ref{fig:datalineage} this final report is shown as \inlinecode{paper.pdf} and the instructions to build it are in \inlinecode{paper.mk}. -In the complete directed graph of this figure, \inlinecode{paper.pdf} is the only node that has no outward edge or arrows (further showing that it is the ultimate target: nothing depends on it). +In Figure \ref{fig:datalineage} it is shown as \inlinecode{paper.pdf} and the instructions to build \inlinecode{paper.pdf} are in \inlinecode{paper.mk}. +In the complete directed graph of Figure \ref{fig:datalineage}, \inlinecode{paper.pdf} is the only node that has no outward edge or arrows (further showing that it is the ultimate target: nothing depends on it). The report's source (containing the main narrative, its typesetting as well as that of the figures or tables) is \inlinecode{paper.tex}. To build the final report's PDF, \inlinecode{references.tex} and \inlinecode{project.tex} are also loaded into \LaTeX{}. @@ -894,6 +952,9 @@ Another class of files that maybe loaded into \LaTeX{}, but are not shown to avo In other words, it formalizes the connections of this scholarship with previous scholarship. 
+ + + \subsubsection{Values within text (\inlinecode{project.tex})} \label{sec:valuesintext} Figures, plots, tables and narrative aren't the only analysis output that goes into the paper. @@ -928,7 +989,7 @@ The lineage ultimate ends in a \LaTeX{} macro file in \inlinecode{analysis3.tex} -\subsubsection{Verification of outputs} +\subsubsection{Verification of outputs (\inlinecode{verify.mk})} \label{sec:outputverification} An important principle for this template is that outputs should be automatically verified, see Section \ref{principle:verify}. However, simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: as mentioned in Section \ref{principle:verify}, many tools that produce datasets or PDFs write the creation date into the produced files. @@ -949,21 +1010,136 @@ Recall that many tools print the creation date automatically when creating a fil -\subsubsection{Orchestrating the analysis} -\label{sec:analysisorchestration} -As described in Section \ref{sec:valuesintext}, the output files of a project's analysis steps are ultimately prerequisites of a subMakefile's final target (a \LaTeX{} macro file with the same base name, see the data lineage in Figure \ref{fig:datalineage}). -The detailed organization of the analysis steps highly depends on the particular project and because Make already knows which files are independent of others, it can run them in any order or on any number of threads in parallel. +\subsubsection{Project initialization (\inlinecode{initialize.mk})} +\label{sec:initialize} +The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figure \ref{fig:datalineage}). +Project authors rarely need to modify/edit this file; it is part of the template's low-level infrastructure. +Nevertheless, project authors are strongly encouraged to study it and use all the useful variables and targets that it defines.
+\inlinecode{initial\-ize\-.mk} doesn't contain any analysis or major processing steps; it just initializes the system. +For example it sets the necessary environment variables, internal Make variables and defines generic rules like \inlinecode{./project make clean} (to clean/delete all built products, not software) or \inlinecode{./project make dist} (to package the project into a tarball for distribution) among others. -Two subMakefiles are common to all projects: \inlinecode{initial\-ize\-.mk} and \inlinecode{download\-.mk}. -\inlinecode{init\-ial\-ize\-.mk} is the first subMakefile that is loaded into \inlinecode{top-make.mk}. -It doesn't actually contain any analysis, it just initializes the system: setting environment variables, internal Make variables and generic rules like \inlinecode{./project make clean} (to clean/delete all built products, not software) or \inlinecode{./project make dist} (to package the project into a tarball for distribution) among others. -To get a good fealing of the system, it is recommended to look through this file. -\inlinecode{download.mk} has some commonly necessary steps to facilitate the importation of input datasets: a simple \inlinecode{wget} command is not usually enough. -We also want to check if a local copy exists and also calculate the file's checksum to verfiy it. +It also adds one special \LaTeX{} macro in \inlinecode{initial\-ize\-.tex}: the current Git commit that is generated every time the analysis is run. +It is stored in the \inlinecode{{\footnotesize\textbackslash}projectversion} macro and can be used anywhere within the final report. +For this PDF it has a value of \inlinecode{\projectversion}. +One good place to put it is at the end of the abstract for any reader to be able to identify the exact point in history that the report was created.
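The commit-to-macro step can be sketched as below. The repository, commit and file names are hypothetical stand-ins; the real rule lives in `initialize.mk` and writes into the project's macro directory.

```shell
# Hypothetical sketch: record the current commit as a LaTeX macro, as
# initialize.mk does for \projectversion.
dir=$(mktemp -d) && cd "$dir"
git init -q .
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "demo commit"

# --always falls back to the abbreviated hash when no tag exists;
# --dirty appends a suffix when uncommitted changes are present.
version=$(git describe --dirty --always)
printf '\\newcommand{\\projectversion}{%s}\n' "$version" > initialize.tex
cat initialize.tex
```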
+It also uses the \inlinecode{--dirty} feature of Git's \inlinecode{--describe} output: if any version-controlled file is not already committed, the value of this macro will have a \inlinecode{-dirty} suffix. +If it's in a prominent place (like the abstract), it will always remind the author to commit their work. -\tonote{Start going into the details of how we are obtaining the \citet{menke20} results.} -For example \citet{menke20} studied $\menkenumpapers$ papers in $\menkenumjournals$ journals. + + + +\subsubsection{Importing and validating inputs (\inlinecode{download.mk})} +\label{sec:download} +The \inlinecode{download.mk} subMakefile is present in all Maneage projects and contains the common steps for importing the input dataset(s) into the project. +All necessary input datasets to the project are imported through this subMakefile. +This helps in modularity and minimal complexity (Sections \ref{principle:modularity} \& \ref{principle:complexity}): to see what external datasets were used in a project, this is the only necessary file to manage/read. +Also, a simple call to a downloader (for example \inlinecode{wget}) is not usually enough. +Irrespective of where the dataset is \emph{used} in the project's lineage, some operations are always necessary when importing datasets: +\begin{itemize} +\item The file may already be present on the host, or the user may not have an internet connection. + Hence it is necessary to check the given \emph{input directory} on the host before attempting to download over the internet (see Section \ref{sec:localdirs}). +\item The network might temporarily fail but succeed on an automatic retry (even today this isn't uncommon). + Crashing the whole analysis because of a temporary network issue requires human intervention and is against the completeness principle (Section \ref{principle:complete}). +\item Before it can be used, the integrity of the imported file must be confirmed with its stored checksum.
+\end{itemize} + +In some scenarios the generic download script may not be useful, for example, when the database takes queries and generates the dataset for downloading on the fly. +In such cases, users can add their own Make rules in this \inlinecode{download.mk} to import the file. +They can use its pre-defined structure to do the extra steps like validating it. +Note that in such cases the servers often encode the creation date and version of their database system in the resulting file as metadata. +Even when the actual data is identical, this metadata (which is in the same file) will differ based on the moment the query was done. +Therefore a simple checksum of the whole downloaded file can't be used for validation in such scenarios, see Section \ref{principle:verify}. + +Each external dataset has some basic information, including its expected name on the local system (for offline access), the necessary checksum to validate it (either the whole file or just its main ``data''), and its URL/PID. +In this template, such information regarding a project's input dataset(s) is in the \inlinecode{INPUTS.conf} file. +See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage respectively. +For demonstration, in this paper, we are using the datasets of \citet{menke20} which are stored in one \inlinecode{.xlsx} file on bioRxiv. +In \inlinecode{INPUTS.conf}, the example lines below show the necessary information as Make variables for this dataset. +Just note that the full URL was too long to show in this demonstration\footnote{\label{footnote:dataurl}This is the full URL: \url{\menketwentyurl}. Note that in the \LaTeX{} source, this URL is just a macro that was created in \inlinecode{download.mk}, and directly comes from \inlinecode{INPUTS.mk}, it is not hand-written.}.
+ +\begin{lstlisting}[language=bash] + MK20DATA = menke20.xlsx + MK20MD5 = 8e4eee64791f351fec58680126d558a0 + MK20SIZE = 1.9MB + MK20URL = https://the.full.url/is/too/large/for/here/media-1.xlsx +\end{lstlisting} + +\begin{figure}[t] + \input{tex/src/figure-download.tex} + \caption{\label{fig:download} Simplified Make rule, showing how the downloaded data URL is written into this paper. + In Make, the \emph{target} is placed before a colon (\inlinecode{:}) and its \emph{prerequisite(s)} are placed after the colon. + The executable recipe lines (commands to build the target from the prerequisite) start with a \inlinecode{TAB} (shown here with a light gray \inlinecode{\_\_\_TAB\_\_\_}). + The command names are shown in dark green. + Comment lines (ignored by Make) are shown in light red and start with a \inlinecode{\#}. + The \inlinecode{MK20URL} variable is defined in \inlinecode{INPUTS.conf} and directly used to download the input dataset. + The same URL is then passed to this paper through the definition of the \inlinecode{\textbackslash{}menketwentyurl} \LaTeX{} variable that is written in \inlinecode{\$(mtexdir)/download.tex} (the main target, shown as \inlinecode{\$@} in the recipe). + Later, when the paper's PDF is being built, this \inlinecode{.tex} file is loaded into it. + \inlinecode{mtexdir} is the directory hosting all the \LaTeX{} macro files for various stages of the analysis, see Section \ref{sec:valuesintext}. + } +\end{figure} + +If \inlinecode{menke20.xlsx} exists in the \emph{input} directory, it will just be validated and put in the \emph{build} directory. +Otherwise, it will be downloaded from the given URL, validated, and put in the build directory. +Recall that the input and build directories differ from system to system and are specified at project configuration time, see Section \ref{sec:localdirs}.
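The import logic just described can be sketched in plain shell. This is a hypothetical stand-alone version: the checksum below is computed inside the demo (not the real `MK20MD5` value from `INPUTS.conf`), and the directories are temporary stand-ins for the configured input and build directories.

```shell
# Hypothetical sketch of download.mk's import logic: prefer a local
# copy, fall back to downloading, then validate against a checksum.
INDIR=$(mktemp -d)     # stand-in for the configured input directory
BDIR=$(mktemp -d)      # stand-in for the configured build directory

# Stand-in for the real dataset and its recorded checksum:
printf 'demo-data\n' > "$INDIR/menke20.xlsx"
MK20MD5=$(md5sum "$INDIR/menke20.xlsx" | cut -d' ' -f1)

if [ -f "$INDIR/menke20.xlsx" ]; then
    cp "$INDIR/menke20.xlsx" "$BDIR/menke20.xlsx"   # offline copy found
else
    wget -O "$BDIR/menke20.xlsx" "$MK20URL"         # network fallback
fi

# Validation: md5sum -c exits non-zero (aborting the rule) on mismatch.
echo "$MK20MD5  $BDIR/menke20.xlsx" | md5sum -c -
```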
+In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{INPUTS.conf} to \inlinecode{menke20.xlsx} symbolizes this step. +Note that in our notation, once an external dataset is imported, it is a \emph{built} product; it thus has a blue box in Figure \ref{fig:datalineage}. + +It is sometimes necessary to report basic information about external datasets in the report/paper. +As described in Section \ref{sec:valuesintext}, here this is done with \LaTeX{} macros to avoid human error. +For example in Footnote \ref{footnote:dataurl}, we gave the full URL that this dataset was downloaded from. +In the \LaTeX{} source of that footnote, this URL is stored as the \inlinecode{\textbackslash{}menketwentyurl} macro which is created with the simplified\footnote{This Make rule is simplified by removing the directory variable names to help in readability.} Make rule below (it is located at the bottom of \inlinecode{download.mk}). + +In this rule, \inlinecode{download.tex} is the \emph{target} and \inlinecode{menke20.xlsx} is its \emph{prerequisite}. +The \emph{recipe} to build the target from the prerequisite is the \inlinecode{echo} shell command which writes the \LaTeX{} macro definition as a simple string (enclosed in double-quotes) into \inlinecode{download.tex}. +The target is built after the prerequisite(s) are built, or when the prerequisites are newer than the target (for more on Make, see Appendix \ref{appendix:make}). +Note that \inlinecode{\$(MK20URL)} is a call to the variable defined above in \inlinecode{INPUTS.conf}. +Also recall that in Make, \inlinecode{\$@} is an \emph{automatic variable}, which is expanded to the rule's target name (in this case, \inlinecode{download.tex}). +Therefore if the dataset is re-imported (possibly with a new URL), the URL in Footnote \ref{footnote:dataurl} will also be re-created automatically.
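A toy version of such a rule can be run end-to-end as below. The URL and file names are hypothetical, and `.RECIPEPREFIX` (GNU Make >= 3.82) is used only to avoid literal TAB characters in this demo; the mechanism (a recipe expanding `$(MK20URL)` and `$@` to write a LaTeX macro) is the same as in the rule described above.

```shell
# Toy version of the download.tex rule (hypothetical URL and paths).
dir=$(mktemp -d) && cd "$dir"
touch menke20.xlsx                 # stand-in for the imported dataset

cat > Makefile <<'EOF'
.RECIPEPREFIX = >
MK20URL = https://example.org/media-1.xlsx

download.tex: menke20.xlsx
>printf '\\newcommand{\\menketwentyurl}{%s}\n' '$(MK20URL)' > $@
EOF

make download.tex
cat download.tex   # \newcommand{\menketwentyurl}{https://example.org/media-1.xlsx}
```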
+In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{menke20.xlsx} to \inlinecode{download.tex} symbolizes this step. + + + + + +\subsubsection{The analysis} +\label{sec:analysis} + +The analysis subMakefile(s) are loaded after the initialization and download steps (see Sections \ref{sec:download} and \ref{sec:initialize}). +However, the analysis phase involves much more complexity for all the various research operations done on the raw inputs until the generation of the final plots, data or report. +If done without modularity in mind from the start, research project sources can become very long, thus becoming hard to modify, debug, improve or read. +Maneage is therefore designed to encourage and facilitate splitting the analysis into multiple/modular subMakefiles. +For example in the data lineage graph of Figure \ref{fig:datalineage}, the analysis is broken into three subMakefiles: \inlinecode{analysis-1.mk}, \inlinecode{analysis-2.mk} and \inlinecode{analysis-3.mk}. + +Theoretical discussion of this phase can be hard to follow; we will thus describe the contents of \inlinecode{analysis1\-.mk} in a demo project on data from \citet{menke20}, see Figure \ref{fig:mk20tab3}. +In Section \ref{sec:download}, the process of importing this dataset into the project was described. +The first issue is that \inlinecode{menke20.xlsx} must be converted to a simple plain-text table which is generically usable by simple tools (see principle of minimal complexity in Section \ref{principle:complexity}). +For more on the problems with Microsoft Office and this format, see Section \ref{sec:lowlevelanalysis}. +In \inlinecode{analysis1.mk} (Figure \ref{fig:mk20tab3}), we thus convert it to a simple white-space separated, plain-text table (\inlinecode{menke20-table-3.txt}) and do a basic calculation on it to report. + +\begin{figure}[t] + \input{tex/src/figure-mk20tab3.tex} + \caption{\label{fig:mk20tab3}Simplified contents of \inlinecode{analysis1.mk}.
+ For the position of this subMakefile in the full project's data lineage, see Figure \ref{fig:datalineage}. + In particular, here the arrows of that figure from \inlinecode{menke20.xlsx} to \inlinecode{menke20-table-3.txt} and from the latter to \inlinecode{analysis-1.tex} are shown as the second and third Make rules. + See Figure \ref{fig:download} and Appendix \ref{appendix:make} for more on the Make notation. + The general job here is to convert the raw download into a usable table for our analysis (a simple, space-separated, fixed-column-width plain-text table) along with the directory hosting it (\inlinecode{a1dir}), followed by a small measurement on it. + Note that \inlinecode{a1dir} is an order-only prerequisite of \inlinecode{mk20tab3} (it follows the \inlinecode{|}), so the former is always built \emph{before} the latter, and only when it doesn't already exist on the running system.} +\end{figure} + +As shown in Figure \ref{fig:mk20tab3}, the lowest-level operation is to define a directory to keep the generated files. +To keep the source and build directories separate, we define \inlinecode{a1dir} under the build directory (\inlinecode{BDIR}, see Section \ref{sec:localdirs}). +All outputs/targets will then be defined under this directory. +The second rule (which depends on the directory) then converts the Microsoft Excel spreadsheet file to a simple plain-text format using the XLSX I/O program. +But XLSX I/O only converts to CSV and we don't need many columns here, so we further shorten the table using the AWK program. +As described in Section \ref{sec:valuesintext}, the last rule of a subMakefile should be a \LaTeX{} macro file. +This is natural; for example, we need to report how many journal papers were studied in \citet{menke20}. +So in the third rule, we count the number of papers they studied (summing over the second column). +The final sum is written into the \inlinecode{\textbackslash{}menkenumpapers} \LaTeX{} macro, which expands to $\menkenumpapers$ when this PDF is built. 
+Figure \ref{fig:mk20tab3} thus implements two of the arrows in the data lineage of Figure \ref{fig:datalineage}: the arrow from \inlinecode{menke20.xlsx} to \inlinecode{menke20-table-3.txt} and the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{analysis-1.tex}. + +studied $\menkenumpapers$ papers in $\menkenumjournals$ journals. @@ -998,6 +1174,8 @@ For example \citet{menke20} studied $\menkenumpapers$ papers in $\menkenumjourna \item \citet{gibney20}: After code submission was encouraged by the Neural Information Processing Systems (NeurIPS), the frac \item When the data are confidential, \citet{perignon19} suggest to have a third party familiar with coding to referee the code and give its approval. In this system, because of automatic verification of inputs and outputs, no technical knowledge is necessary for the verification. +\item \citet{miksa19b} Machine-actionable data management plans (maDMPs) embedded in workflows, allowing +\item \citet{miksa19a} RDA recommendation on maDMPs. 
\end{itemize} @@ -1321,16 +1499,16 @@ Figure \ref{fig:makeexample} demonstrates a hypothetical Makefile with the targe {\small \texttt{\mkcomment{\# The ultimate "target" of this Makefile is 'ultimate.txt' (the first target Make finds).}} - \texttt{\mktarget{ultimate.txt}: \mkprereq{out.txt}\hfill\mkcomment{\# 'ultimate.txt' depends on 'out.txt'.{ }{ }{ }{ }{ }}} + \texttt{\mktarget{ultimate.txt}: out.txt\hfill\mkcomment{\# 'ultimate.txt' depends on 'out.txt'.{ }{ }{ }{ }{ }}} - \texttt{\mktab{}awk '\$1\textless5' \mkprereq{out.txt} \textgreater{ }\mktarget{ultimate.txt}\hfill\mkcomment{\# Only rows with 1st column less than 5.{ }{ }{ }}} + \texttt{\mktab{}awk '\$1\textless5' out.txt \textgreater{ }\mktarget{ultimate.txt}\hfill\mkcomment{\# Only rows with 1st column less than 5.{ }{ }{ }}} \vspace{1em} \texttt{\mkcomment{\# But 'out.txt', is created by a Python script, and 'params.conf' keeps its configuration.}} - \texttt{\mktarget{out.txt}: \mkprereq{run.py} \mkprereq{params.conf}} + \texttt{\mktarget{out.txt}: run.py params.conf} - \texttt{\mktab{}python \mkprereq{run.py} --in=\mkprereq{params.conf} --out=\mktarget{out.txt}} + \texttt{\mktab{}python run.py --in=params.conf --out=\mktarget{out.txt}} } \caption{\label{fig:makeexample}An example Makefile that describes how to build \inlinecode{ultimate.txt} with two \emph{rules}. @@ -1554,7 +1732,9 @@ For each solution, we summarize its methodology and discuss how it relates to th Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable. Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or Nextjournal (\url{https://nextjournal.com}) will not be reviewed here. - +\begin{itemize} +\item \citet{konkol20} have also done a review of some tools from various points of view. 
+\end{itemize} diff --git a/reproduce/analysis/config/INPUTS.mk b/reproduce/analysis/config/INPUTS.mk index 9332df3..b1cf546 100644 --- a/reproduce/analysis/config/INPUTS.mk +++ b/reproduce/analysis/config/INPUTS.mk @@ -9,7 +9,7 @@ # this notice are preserved. This file is offered as-is, without any # warranty. -MK20DATA = menke-etal-2020.xlsx +MK20DATA = menke20.xlsx MK20MD5 = 8e4eee64791f351fec58680126d558a0 MK20SIZE = 1.9MB MK20URL = https://www.biorxiv.org/content/biorxiv/early/2020/01/18/2020.01.15.908111/DC1/embed/media-1.xlsx diff --git a/reproduce/analysis/make/menke2020.mk b/reproduce/analysis/make/analysis-1.mk index 4dd9897..9d0018e 100644 --- a/reproduce/analysis/make/menke2020.mk +++ b/reproduce/analysis/make/analysis-1.mk @@ -20,10 +20,10 @@ # Save the "Table 3" spreadsheet from the downloaded `.xlsx' file into a # simple plain-text file that is easy to use. -mk20dir = $(BDIR)/menke2020 -mk20tab3 = $(mk20dir)/table-3.txt -$(mk20dir):; mkdir $@ -$(mk20tab3): $(indir)/menke-etal-2020.xlsx | $(mk20dir) +a1dir = $(BDIR)/analysis-1 +mk20tab3 = $(a1dir)/menke20-table-3.txt +$(a1dir):; mkdir $@ +$(mk20tab3): $(indir)/menke20.xlsx | $(a1dir) # Set a base-name for the table-3 data. base=$(basename $(notdir $<))-table-3 @@ -62,7 +62,7 @@ $(mk20tab3): $(indir)/menke-etal-2020.xlsx | $(mk20dir) # Main LaTeX macro file -$(mtexdir)/menke2020.tex: $(mk20tab3) | $(mtexdir) +$(mtexdir)/analysis-1.tex: $(mk20tab3) | $(mtexdir) # Count the total number of papers in their study. v=$$(awk '!/^#/{c+=$$2} END{print c}' $(mk20tab3)) diff --git a/reproduce/analysis/make/download.mk b/reproduce/analysis/make/download.mk index 7e61cb8..e4f2ccd 100644 --- a/reproduce/analysis/make/download.mk +++ b/reproduce/analysis/make/download.mk @@ -49,11 +49,11 @@ # progress at every moment. 
$(indir):; mkdir $@ downloadwrapper = $(bashdir)/download-multi-try -inputdatasets = $(indir)/menke-etal-2020.xlsx +inputdatasets = $(indir)/menke20.xlsx $(inputdatasets): $(indir)/%: | $(indir) $(lockdir) # Set the necessary parameters for this input file. - if [ $* = menke-etal-2020.xlsx ]; then + if [ $* = menke20.xlsx ]; then origname=$(MK20DATA); fullurl=$(MK20URL); mdf=$(MK20MD5); else echo; echo; echo "Not recognized input dataset: '$*.fits'." @@ -93,5 +93,5 @@ $(inputdatasets): $(indir)/%: | $(indir) $(lockdir) # # It is very important to mention the address where the data were # downloaded in the final report. -$(mtexdir)/download.tex: $(pconfdir)/INPUTS.mk | $(mtexdir) - echo > $@ +$(mtexdir)/download.tex: $(indir)/menke20.xlsx | $(mtexdir) + echo "\newcommand{\menketwentyurl}{$(MK20URL)}" > $@ diff --git a/reproduce/analysis/make/top-make.mk b/reproduce/analysis/make/top-make.mk index 29bcd83..6dd322f 100644 --- a/reproduce/analysis/make/top-make.mk +++ b/reproduce/analysis/make/top-make.mk @@ -113,7 +113,7 @@ endif makesrc = initialize \ download \ verify \ - menke2020 \ + analysis-1 \ paper diff --git a/reproduce/software/config/installation/texlive.mk b/reproduce/software/config/installation/texlive.mk index c918748..6760eba 100644 --- a/reproduce/software/config/installation/texlive.mk +++ b/reproduce/software/config/installation/texlive.mk @@ -23,4 +23,5 @@ texlive-packages = tex fancyhdr ec newtx fontaxes xkeyval etoolbox xcolor \ trimspaces pdftexcmds pdfescape letltxmacro bitset \ mweights \ \ - alegreya enumitem fontspec lastpage listings + alegreya enumitem fontspec lastpage listings environ \ + tcolorbox diff --git a/tex/src/figure-data-lineage.tex b/tex/src/figure-data-lineage.tex index aa2a397..010a0be 100644 --- a/tex/src/figure-data-lineage.tex +++ b/tex/src/figure-data-lineage.tex @@ -136,14 +136,14 @@ %% input-2.dat \ifdefined\inputtwo - \node (input2) [node-terminal, at={(-2.93cm,1.1cm)}] {input2.dat}; + \node (input2) [node-terminal, 
at={(-2.93cm,1.9cm)}] {menke20.xlsx}; \draw [->] (input2) -- (downloadtex); \fi %% INPUTS.conf \ifdefined\inputsconf \node (INPUTS) [node-nonterminal, at={(-2.93cm,4.6cm)}] {INPUTS.conf}; - \node (input2-west) [node-point, at={(-4.33cm,1.1cm)}] {}; + \node (input2-west) [node-point, at={(-4.33cm,1.9cm)}] {}; \draw [->,rounded corners] (INPUTS.west) -| (input2-west) |- (input2); \fi @@ -155,31 +155,14 @@ %% out1b.dat \ifdefined\outoneb - \node (out1b) [node-terminal, at={(-0.13cm,1.1cm)}] {out-1b.dat}; + \node (out1b) [node-terminal, at={(-0.13cm,1.1cm)}] {menke20-table-3.txt}; \draw [->] (out1b) -- (a1tex); \fi %% outonebdep \ifdefined\outonebdep \node (out1b-west) [node-point, at={(-1.53cm,1.1cm)}] {}; - \node (out1a) [node-terminal, at={(-0.13cm,2.7cm)}] {out-1a.dat}; - \node (a1conf1) [node-nonterminal, at={(-0.13cm,4.6cm)}] {param-1.conf}; - \draw [->] (input2) -- (out1b); - \draw [->] (out1a) -- (out1b); - \draw [->,rounded corners] (a1conf1.west) -| (out1b-west) |- (out1b); - \fi - - %% input1.dat - \ifdefined\inputone - \node (input1) [node-terminal, at={(-2.93cm,1.9cm)}] {input1.dat}; - \draw [->,rounded corners,] (input1.north) |- (out1a); - \fi - - %% input1 dependencies - \ifdefined\inputonedep - \node (input1-east) [node-point, at={(-1.53cm,1.9cm)}] {}; - \node (input1-west) [node-point, at={(-4.33cm,1.9cm)}] {}; - \draw [->,rounded corners] (INPUTS.west) -| (input1-west) |- (input1); + \draw [->, rounded corners] (input2) |- (out1b); \fi %% analysis2.tex @@ -227,14 +210,13 @@ \node (out2a-west) [node-point, at={(1.27cm,1.9cm)}] {}; \draw [->,rounded corners] (a2conf1.west) -| (out2a-west) |- (out2a); \draw [->,rounded corners] (a2conf2.west) -| (out2a-west) |- (out2a); - \draw [->] (input1) -- (out2a); + %\draw [->] (input1) -- (out2a); \fi %% Dependencies of out-3a \ifdefined\outthreeadep \node (out3a-west) [node-point, at={(4.07cm,2.7cm)}] {}; \node (a3conf1) [node-nonterminal, at={(5.47cm,4.6cm)}] {param-3.conf}; - \draw [->] (out1a) -- (out3a); 
\draw [rounded corners] (a3conf1.west) -| (out3a-west) |- (out3a); \fi \end{tikzpicture} diff --git a/tex/src/figure-download.tex b/tex/src/figure-download.tex new file mode 100644 index 0000000..b9da02f --- /dev/null +++ b/tex/src/figure-download.tex @@ -0,0 +1,8 @@ +\begin{tcolorbox} + \footnotesize + \texttt{\mkcomment{Write download URL into the paper (through a LaTeX macro).}} + + \texttt{\mktarget{\$(mtexdir)/download.tex}: \$(\mkvar{indir})/menke20.xlsx} + + \texttt{\mktab{}\mkprog{echo} "\textbackslash{}newcommand\{\textbackslash{}menketwentyurl\}\{\mktarget{\$(MK20URL)}\}" > \$@} +\end{tcolorbox} diff --git a/tex/src/figure-mk20tab3.tex b/tex/src/figure-mk20tab3.tex new file mode 100644 index 0000000..ebeac0e --- /dev/null +++ b/tex/src/figure-mk20tab3.tex @@ -0,0 +1,50 @@ +\begin{tcolorbox} + \footnotesize + \texttt{\mkcomment{Define and build the directory hosting the final table.}} + + \texttt{\mkvar{a1dir} = \$(\mkvar{BDIR})/analysis-1} + + \texttt{\mktarget{\$(a1dir)}:} + + \texttt{\mktab{}\mkprog{mkdir} \$@} + + \vspace{1.5em} + \texttt{\mkcomment{Define and build the main target.}} + + \texttt{\mkvar{mk20tab3} = \$(\mkvar{a1dir})/menke20-table-3.txt} + + \texttt{\mktarget{\$(mk20tab3)}: \$(\mkvar{indir})/menke20.xlsx | \$(\mkvar{a1dir})} + + \texttt{\recipecomment{Call XLSX I/O to convert all the spreadsheets into different CSV files.}} + + \texttt{\recipecomment{We only want the `table-3' spreadsheet, but XLSX I/O doesn't allow setting its}} + + \texttt{\recipecomment{output filename. 
For simplicity, let's assume it's written in `table-3.csv'.}} + + \texttt{\mktab{}\mkprog{xlsxio\_xlsx2csv} \$<} + + \vspace{0.5em} + \texttt{\recipecomment{Use GNU AWK to keep the desired columns in space-separated, fixed-width format.}} + + \texttt{\recipecomment{With `FPAT', commas within double quotes are not counted as columns.}} + + \texttt{\mktab{}\mkprog{awk} 'NR>1\{printf("\%-10d\%-10d\%-10d \%s\textbackslash{}n", \$\$2, \$\$3, \$\$(NF-1)*\$\$NF, \$\$1)\}' \textbackslash} + + \texttt{\mktab{}{ }{ }{ }{ }FPAT='([\^{},]+)|("[\^{}"]+")' table-3.csv > \$@} + + \vspace{0.5em} + \texttt{\recipecomment{Delete the temporary CSV file.}} + + \texttt{\mktab{}\mkprog{rm} table-3.csv} + + \vspace{1.5em} + \texttt{\mkcomment{Main LaTeX macro file}} + + \texttt{\mktarget{\$(mtexdir)/analysis-1.tex}: \$(\mkvar{mk20tab3})} + + \texttt{\recipecomment{Count the total number of papers in their study to report in this paper.}} + + \texttt{\mktab{}v=\$\$(\mkprog{awk} '\!/\^{}\#/\{c+=\$\$2\} END\{print c\}' \$(\mkvar{mk20tab3}))} + + \texttt{\mktab{}\mkprog{echo} "\textbackslash{}newcommand\{\textbackslash{}menkenumpapers\}\{\$\$v\}" > \$@} +\end{tcolorbox} diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex index 9556f54..e20c73c 100644 --- a/tex/src/preamble-style.tex +++ b/tex/src/preamble-style.tex @@ -135,11 +135,16 @@ \renewcommand\footrulewidth{0.0pt} } +%% For creating color boxes +\usepackage[many]{tcolorbox} + %% Custom macros \newcommand{\inlinecode}[1]{\textcolor{blue!35!black}{\texttt{#1}}} %% Example Makefile macros -\newcommand{\mkcomment}[1]{\textcolor{red!35!white}{#1}} +\newcommand{\mkcomment}[1]{\textcolor{red!70!white}{\# #1}} +\newcommand{\mkvar}[1]{\textcolor{orange!40!black}{#1}} \newcommand{\mktarget}[1]{\textcolor{blue!40!black}{#1}} -\newcommand{\mkprereq}[1]{\textcolor{green!30!black}{#1}} -\newcommand{\mktab}[1]{\textcolor{black!25!white}{\_\_\_TAB\_\_\_}} +\newcommand{\mkprog}[1]{\textcolor{green!30!black}{#1}} 
+\newcommand{\mktab}[1]{\textcolor{black!30!white}{\_\_\_TAB\_\_\_}} +\newcommand{\recipecomment}[1]{{ }{ }{ }{ }{ }{ }{ }{ }{ }\mkcomment{#1}} diff --git a/tex/src/references.tex b/tex/src/references.tex index 2a67584..f4dee1d 100644 --- a/tex/src/references.tex +++ b/tex/src/references.tex @@ -12,6 +12,24 @@ +@ARTICLE{konkol20, + author = {{Konkol}, Markus and {N{\"u}st}, Daniel and {Goulier}, Laura}, + title = "{Publishing computational research -- A review of infrastructures for reproducible and transparent scholarly communication}", + journal = {arXiv}, + year = 2020, + month = jan, + pages = {2001.00484}, +archivePrefix = {arXiv}, + eprint = {2001.00484}, + primaryClass = {cs.DL}, + adsurl = {https://ui.adsabs.harvard.edu/abs/2020arXiv200100484K}, + adsnote = {Provided by the SAO/NASA Astrophysics Data System} +} + + + + + @ARTICLE{infante20, author = {{Infante-Sainz}, Ra{\'u}l and {Trujillo}, Ignacio and {Rom{\'a}n}, Javier}, @@ -49,6 +67,47 @@ archivePrefix = {arXiv}, +@ARTICLE{miksa19a, + author = {Tomasz Miksa and Paul Walk and Peter Neish}, + title = {RDA DMP Common Standard for Machine-actionable Data Management Plans}, + year = {2019}, + journal = {RDA}, + pages = {doi:10.15497/rda00039}, + doi = {10.15497/rda00039}, +} + + + + + +@ARTICLE{miksa19b, + author = {Tomasz Miksa and Stephanie Simms and Daniel Mietchen and Sarah Jones}, + title = {Ten principles for machine-actionable data management plans}, + year = {2019}, + journal = {PLoS Computational Biology}, + volume = {15}, + pages = {e1006750}, + doi = {10.1371/journal.pcbi.1006750}, +} + + + + + +@ARTICLE{dicosmo19, + author = {Roberto {Di Cosmo} and Francois Pellegrini}, + title = {Encouraging a wider usage of software derived from research}, + year = {2019}, + journal = {\doihref{https://www.ouvrirlascience.fr/wp-content/uploads/2020/02/Opportunity-Note_software-derived-from-research_EN.pdf}{Ouvrir la science}}, + volume = {}, + pages = {}, + doi = {}, +} + + + + + @ARTICLE{perignon19, author = 
{Christophe P\'erignon and Kamel Gadouche and Christophe Hurlin and Roxane Silberman and Eric Debonnel}, title = {Certify reproducibility with confidential data}, |