From f69e1f407831dabddc20a0593716dcc4edcb4148 Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Sat, 15 Feb 2020 21:16:00 +0000 Subject: Edits in text, added Menke+2020 as a reference The text was slightly improved/edited and I also recently came up to the Menke et al. 2020 (DOI:10.1101/2020.01.15.908111) which also has some good datasets we can use as a demonstration here. --- paper.tex | 148 ++++++++++++++++++++++++++++++------------------- tex/src/references.tex | 14 +++++ 2 files changed, 106 insertions(+), 56 deletions(-) diff --git a/paper.tex b/paper.tex index e41cf35..d50cc33 100644 --- a/paper.tex +++ b/paper.tex @@ -188,6 +188,8 @@ Finally in Section \ref{sec:discussion} the future prospects of using systems li \item \citet{chang15}: 67 studies in well-regarded economics journals with data and code. Only 22 could be reproduced without contacting authors, more than half couldn't be replicated at all (they use ``replicate''). \item \citet{horvath15}: errartum, describing the effect of a software mistake on result. \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}. +\item \citet{menke20} on the ``Rigor and Transparency Index'', in particular showing how practices have improved but not enough. + Also, how software identifiability has seen the best improvement. \end{itemize} @@ -799,9 +801,8 @@ We are considering using these tools, and export Bib\TeX{} entries when necessar Once a project is configured (Section \ref{sec:projectconfigure}), all the necessary software, with precise versions and configurations, are built and ready to use. The analysis phase of the project (running the software on the data) is also orchestrated through Makefiles (see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation} for the benefits of using Make). 
-In particular, after running \inlinecode{./project make}, two high-level Makefiles are called in sequence, both are under \inlinecode{reproduce\-/analysis\-/make}: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk} (see Figure \ref{fig:files}). -These two high-level Makefiles don't see any of the the host's environment variables\footnote{The host environment is fully ignored before calling the analysis Makefiles through the \inlinecode{env -i} command (\inlinecode{-i} is short for \inlinecode{--ignore-environment}). +The analysis Makefiles don't see any of the host's environment variables\footnote{The host environment is fully ignored before calling the analysis Makefiles through the \inlinecode{env -i} command (\inlinecode{-i} is short for \inlinecode{--ignore-environment}). Note that the project's own \inlinecode{env} program is used, not the one provided by the host OS, \inlinecode{env} is installed by GNU Coreutils.}. The project will define its own values for standard environment variables. Combined with the fact that all the software were compiled from source for this project at configuration time (Section \ref{sec:buildsoftware}), this completely isolates the analysis from the host operating system, creating an exactly reproducible result on any machine that the project can be configured. @@ -809,32 +810,55 @@ For example, the project builds is own fixed version of GNU Bash (a shell). It also has its own \inlinecode{bashrc} startup script\footnote{The project's Bash startup script is under \inlinecode{reproduce\-/software\-/bash\-/bashrc.sh}, see Figure \ref{fig:files}.}. Therefore the \inlinecode{BASH\_ENV} environment variable is set to load this startup script and the \inlinecode{HOME} environment variable is set to \inlinecode{BDIR} to avoid the penetration of any existing Bash startup file of the user's home directory into the analysis. 
-The former \inlinecode{top-prepare.mk} is in charge of optimizing the main job orchestration of \inlinecode{top-make.mk}, or to ``prepare'' for it. -In many sitations it may not be necessary at all, but we'll introduce its role with an example. -Let's assume the raw input data (that the project recieved from a database) has 5000 rows (potential targets). -However, based on an initial selection criteria, the project only needs to work on 100 of them. -If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to create a data lineage for all 5000 targets. -If the analysis is complex (has many steps), this can be slow (many of its executions will be redundant), and project authors have to add checks in many places to ignore those that aren't necessary (which will add to the project's complexity and cause bugs). -However, if this basic selection is done before calling \inlinecode{top-make.mk}, only the required 100 targets and their lineage are orchestrated. -This allows Make to optimially organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel), and also greatly simplifies the coding (no extra checks are necessary there). -Where necessary this preparation is done in \inlinecode{top-prepare.mk}. - -Generally, \inlinecode{top-prepare.mk} will not be necessary in many scenarios and its internal design and concepts are identical to \inlinecode{top-make.mk}. -Hence, we'll continue with a detailed discussion of \inlinecode{top-make.mk} below and touch upon the differences with \inlinecode{top-prepare.mk} in the end. - -A normal project will usually consist of many analysis steps, including data access (possibly by downloading), and running various steps of the analysis on them. -Having all the rules in one Makefile will create a very large file, which can be hard to maintain, extend/grow, read, reuse, and cite. 
-Generally, this is bad practice and is against the modularity principle (Section \ref{principle:modularity}). -This solution is thus designed to encourage modularity and facilitate modularity by distributing the analysis in many Makefiles that contain contextually-similar (or modular) analysis steps. -For Make this distribution is just cosmetic: they are all loaded into \inlinecode{top-make.mk} and executed in one instance of Make. -Within the project's source, the lower-level Makefiles are also placed in \inlinecode{reproduce\-/analysis\-/make} (like \inlinecode{top-make.mk}), see Figure \ref{fig:files}. -Therefore by design, \inlinecode{top-make.mk} is very simple: it just defines the ultimate target, and the name and order of the lower-level Makefiles that should be loaded. +In particular, after running \inlinecode{./project make}, the analysis is done in two phases: a preparation step which is described in Section \ref{sec:prepare} and the final analysis step that is described in Section \ref{sec:analysis}. +Technically, these two phases are managed by two top-level Makefiles: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk}. +Both are under \inlinecode{reproduce\-/analysis\-/make} (see Figure \ref{fig:files}). + +\subsubsection{Preparation phase} +\label{sec:prepare} +The first analysis Makefile that is run is \inlinecode{top-prepare.mk}. +It is in charge of any selection steps that may be necessary to optimize \inlinecode{top-make.mk}, or to ``prepare'' for it. +In many situations it may not be necessary at all and can be completely ignored. + +We'll introduce its role with an example. +Let's assume the raw input data (that the project received from a database) has 5000 rows (potential targets for doing the analysis on). +However, this particular project only needs to work on 100 of them, not the full 5000. 
+If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to create a data lineage for all 5000 targets and project authors have to add checks in many places to ignore those that aren't necessary. +This will add to the project's complexity and cause bugs. +Furthermore, if the filesystem isn't fast (for example a filesystem that exists over a network), checking the file dates over the full lineage can be slow. + +In the scenario above, the Makefiles called by \inlinecode{top-prepare.mk} would be in charge of finding the IDs of the 100 targets of interest and saving them as a Make variable that is later loaded into one of the analysis Makefiles (that are loaded by \inlinecode{top-make.mk}). +This can't be done within \inlinecode{top-make.mk} because the full data lineage (all input and output files) must be known before Make is run. +The ``preparation'' phase thus allows \inlinecode{top-make.mk} to optimally organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel). +It also greatly simplifies the coding for the project authors. + +Ideally \inlinecode{top-prepare.mk} is only for the ``preparation phase''. +However, projects can be complex and it's up to the authors to decide which parts of an analysis are ``preparation'' for an analysis and which parts are the actual analysis. +Generally, the internal design and concepts of \inlinecode{top-prepare.mk} are identical to \inlinecode{top-make.mk} so it won't be discussed any further. + + + + + +\subsubsection{Main analysis phase} +\label{sec:analysis} +A normal project will usually consist of many analysis steps, including data access (possibly by downloading), running various steps of the analysis on them, and creating the necessary plots, figures or outputs for the report/paper. +If all of these steps are organized in a single Makefile, it will become very large and will be hard to maintain, extend/grow, read, reuse, and cite. 
+Generally, large files are bad practice in any management style because they go against the modularity principle (Section \ref{principle:modularity}). + +This solution is thus designed to encourage and facilitate modularity by distributing the analysis in many Makefiles that contain contextually-similar (or modular) analysis steps. +This distribution is thus primarily done for the human writers and readers of the project. +For Make it is cosmetic: they are all loaded into \inlinecode{top-make.mk} and executed in one instance of Make. +In other words, Make sees them all as one file anyway. +Within the project's source, the subMakefiles are also placed in \inlinecode{reproduce\-/analysis\-/make} (like \inlinecode{top-make\-.mk}), see Figure \ref{fig:files}. +Therefore by design, \inlinecode{top-make.mk} is very simple: it just defines the ultimate target, and the name and order of the subMakefiles that should be loaded. Figure \ref{fig:datalineage} is a general overview of the analysis phase in a hypothetical project using this template. -As described above and shown in that figure, \inlinecode{top-make.mk} imports the various lower-level Makefiles that are in charge of the different phases of the analysis. -Each of the lower-level Makefiles builds intermediate targets (files) which are also shown there. +As described above and shown in that figure, \inlinecode{top-make.mk} imports the various modular Makefiles under the \inlinecode{reproduce/} directory that are in charge of the different phases of the analysis. +Let's call them `subMakefiles'. +Each of the subMakefiles builds intermediate targets (files) which are shown there as blue boxes. In the subsections below, the project's analysis is described using this graph. -We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form the ultimate target in Section \ref{sec:paperpdf}, and tracing the data's lineage all the way up to the inputs and configuration files. 
+We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting from the ultimate target in Section \ref{sec:paperpdf}, and tracing back its lineage all the way up to the inputs and configuration files. \begin{figure}[t] \begin{center} @@ -854,41 +878,51 @@ We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form th -\subsubsection{Ultimate target: the project's paper or report} +\subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})} \label{sec:paperpdf} The ultimate purpose of a project is to report its result and interpret it in a larger context of human knowledge. In scientific projects, this is the final published paper. The raw result is usually dataset(s) that is (are) visualized, for example as a plot, figure or table and blended into the narrative description. -In Figure \ref{fig:datalineage} this final report is shown as \inlinecode{paper.pdf}. -In the complete directed graph of this figure, \inlinecode{paper.pdf} is the only node that has no outward edge or arrows. +In Figure \ref{fig:datalineage} this final report is shown as \inlinecode{paper.pdf} and the instructions to build it are in \inlinecode{paper.mk}. +In the complete directed graph of this figure, \inlinecode{paper.pdf} is the only node that has no outward edge or arrows (further showing that it is the ultimate target: nothing depends on it). -The source of this report (containing the main narrative and positioning of figures or tables) is \inlinecode{paper.tex}. -To build the PDF, \inlinecode{references.tex} and \inlinecode{project.tex} are also loaded into \LaTeX{}. +The report's source (containing the main narrative, its typesetting as well as that of the figures or tables) is \inlinecode{paper.tex}. +To build the final report's PDF, \inlinecode{references.tex} and \inlinecode{project.tex} are also loaded into \LaTeX{}. 
Another class of files that maybe loaded into \LaTeX{}, but are not shown to avoid complications in the figure, are the figure or plot PDFs which may have been created during the analysis steps. \inlinecode{references.tex} is part of the project's source and can contain the Bib\TeX{} entries for the bibliography of the final report. +In other words, it formalizes the connections of this scholarship with previous scholarship. -To understand \inlinecode{project.tex}, note that that besides figures, plots or tables, another output of the analysis are quantitative values that are blended into the sentences of the report's narration. + +\subsubsection{Values within text (\inlinecode{project.tex})} +\label{sec:valuesintext} +Figures, plots, tables and narrative aren't the only analysis output that goes into the paper. +In many cases, quantitative values from the analysis are also blended into the sentences of the report's narration. For example this sentence in the abstract of \citet{akhlaghi19}: ``... the outer wings of M51 down to S/N of 0.25 ...''. -The reported signal-to-noise (S/N) value ``0.25'' depends on the analysis and is an output of the analysis just like paper's figures and plots. -Manually typing the number in the \LaTeX{} source is prone to very important bugs: the author may forget to check it after a change in an analysis like using a newer version of the software, or changing an analysis parameter for another part of the paper. -Given the evolution of a scientific projects, this type of human error is very hard to avoid. +The reported signal-to-noise ratio (S/N) value ``0.25'' depends on the analysis and is an output of the analysis just like paper's figures and plots. +Manually typing the number in the \LaTeX{} source is prone to very important bugs: the author may forget to check it after a change in an analysis (e.g., using a newer version of the software, or changing an analysis parameter for another part of the paper). 
+Given the evolution of scientific projects, this type of human error is very hard to avoid when such values are manually written. +Such values must also be automatically generated. -Calculated values mentioned within the narrative's sentences must therefore be automatically generated. +To automatically generate and blend them in the text, we use \LaTeX{} macros. In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} uses this templat to be reproducible, so its LaTeX source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} links is also available on arXiv.} looks like this: ``\inlinecode{\small the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. -The \LaTeX{} macro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically defined through \inlinecode{project.tex} and expands to the value ``0.25''. -\inlinecode{project.tex} contains such a macro for every project output that is included within the narrative's sentences (or any other usage by \LaTeX). +The \LaTeX{} macro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically calculated and recorded during the project and expands to the value ``\inlinecode{0.25}''. +The automatically generated file \inlinecode{project.tex} stores all such inline output macros. +Furthermore, Figure \ref{fig:datalineage} shows that it is a prerequisite of \inlinecode{paper.pdf} (as well as the manually written \LaTeX{} sources that are shown in green). +Therefore \inlinecode{paper.pdf} will not be built until this file is ready and up-to-date. -However, managing all the necessary \LaTeX{} macros for a full project in one file is against the modularity principle and can be frustrating. 
-Therefore, in this system, all lower-level Makefiles \emph{must} contain a fixed target with the same name, but with a \inlinecode{.tex} suffix. -For example in Figure \ref{fig:datalineage}, assume \texttt{out-1b.dat} is a table and the mean of its third column must be reported in the paper. -To do this, \inlinecode{out-1b.dat} is set to be a prerequisite of the rule for \inlinecode{analysis1.tex} (as shown by the arrow in Figure \ref{fig:datalineage}). -The recipe of this rule will calculate the mean of the column and put the result in the \LaTeX{} macro which is written in \inlinecode{analysis1.tex}. -The same is done for any other reported calculation from \inlinecode{analysis1.mk}. +However, managing all the necessary \LaTeX{} macros for a full project in one file is against the modularity principle and can be frustrating and buggy. +To address this problem, all subMakefiles \emph{must} contain a fixed target with the same base-name, but with a \inlinecode{.tex} suffix. +For example in Figure \ref{fig:datalineage}, assume \inlinecode{out-1b.dat} is a table and the mean of its third column must be reported in the paper. +Therefore in \inlinecode{analysis1.mk}, a prerequisite of \inlinecode{analysis1.tex} is \inlinecode{out-1b.dat} (as shown by the arrow in Figure \ref{fig:datalineage}). +The recipe of this rule will calculate the mean of the column and put it in the \LaTeX{} macro which is written in \inlinecode{analysis1.tex}. +In a similar way, any other reported calculation from \inlinecode{analysis1.mk} is stored as a \LaTeX{} macro in \inlinecode{analysis1.tex}. -These \LaTeX{} macro files are thus the core skeleton of the project: as shown in Figure \ref{fig:datalineage}, the outward arrows of all built files ultimately lead to one of these \LaTeX{} macro files. -However, note that built files in a lower-level Makefile don't have to point to that Makefile's \LaTeX{} macro file. 
-For example even though \inlinecode{input1.dat} is a target in \inlinecode{download.mk}, it isn't a prerequisite of \inlinecode{download.tex}, it is a prerequisite of \inlinecode{out-2a.dat} (a target in \inlinecode{analysis2.mk}), and ultimate ends in \inlinecode{analysis3.tex}. +These \LaTeX{} macro files thus form the core skeleton of the project: as shown in Figure \ref{fig:datalineage}, the outward arrows of all built files of any subMakefile ultimately lead to one of these \LaTeX{} macro files. +Note that \emph{built} files in a subMakefile don't have to be a prerequisite of its \inlinecode{.tex} file. +They may point to another Makefile's \LaTeX{} macro file. +For example even though \inlinecode{input1.dat} is a target in \inlinecode{download.mk}, it isn't a prerequisite of \inlinecode{download.tex}; it is a prerequisite of \inlinecode{out-2a.dat} (a target in \inlinecode{analysis2.mk}). +The lineage ultimately ends in the \LaTeX{} macro file \inlinecode{analysis3.tex}. @@ -898,17 +932,18 @@ For example even though \inlinecode{input1.dat} is a target in \inlinecode{downl \label{sec:outputverification} An important principle for this template is that outputs should be automatically verified, see Section \ref{principle:verify}. However, simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: as mentioned in Section \ref{principle:verify}, many tools that produce datasets or PDFs write the creation date into the produced files. - -To facilitate output verification this template has the lower-level \inlinecode{verify.mk} Makefile, see Figure \ref{fig:datalineage}. +Therefore it is necessary to verify the project's outputs before the PDF is created. +To facilitate output verification, the project has a \inlinecode{verify.mk} Makefile, see Figure \ref{fig:datalineage}. It is the only prerequisite of \inlinecode{project.tex} that was described in Section \ref{sec:paperpdf}. 
-As shown in Figure \ref{fig:datalineage}, the \LaTeX{} macros of all lower-level Makefiles are a prerequisite to \inlinecode{verify.tex}. Verification is therefore the connection-point, or bottleneck, between the analysis steps of the project and its final report. -Prior to publication, the project authors should add the MD5 checksums of all the\LaTeX{} macro files in the recipe of \inlinecode{verify\-.tex}, the necessary structure is already there. +Prior to publication, the project authors should add the MD5 checksums of all the \LaTeX{} macro files in the recipe of \inlinecode{verify\-.tex}. +The necessary structure is already there, so adding/changing the values is trivial. If any \LaTeX{} macro is different in future builds of the project, the project will abort with a warning of the problematic file. When projects involve other outputs (for example images, tables or datasets that will also be published), their contents should also be validated. To do this, prerequisites should be added to the \inlinecode{verify\-.tex} rule that automatically check the \emph{contents} of other project outputs. -Recall that many tools print the creation date automatically when creating a file, so to verify a file, such metadata must be ignored. +Recall that many tools print the creation date automatically when creating a file, so to verify a file, this kind of metadata must be ignored. +\inlinecode{verify\-.tex} contains some Make functions to facilitate checking with some file formats; others can be added easily. @@ -916,18 +951,18 @@ Recall that many tools print the creation date automatically when creating a fil \subsubsection{Orchestrating the analysis} \label{sec:analysisorchestration} -All the analysis steps of a project are ultimately prerequisites of the verification step, see the data lineage in Figure \ref{fig:datalineage} and Section \ref{sec:outputverification}. 
+As described in Section \ref{sec:valuesintext}, the output files of a project's analysis steps are ultimately prerequisites of a subMakefile's final target (a \LaTeX{} macro file with the same base name, see the data lineage in Figure \ref{fig:datalineage}). The detailed organization of the analysis steps highly depends on the particular project and because Make already knows which files are independent of others, it can run them in any order or on any number of threads in parallel. -Two lower-level analysis Makefiles are common to all projects: \inlinecode{initial\-ize\-.mk} and \inlinecode{download\-.mk}. -\inlinecode{init\-ial\-ize\-.mk} is the first lower-level Makefile that is loaded into \inlinecode{top-make.mk}. -It doesn't actually contain any analysis, it just initializes the system: setting environment variables, internal Make variables and generic rules like \inlinecode{./project make clean} (to clean all builts products, not software) or \inlinecode{./project make dist} (to package the project into a tarball for distribution) among others. +Two subMakefiles are common to all projects: \inlinecode{initial\-ize\-.mk} and \inlinecode{download\-.mk}. +\inlinecode{init\-ial\-ize\-.mk} is the first subMakefile that is loaded into \inlinecode{top-make.mk}. +It doesn't actually contain any analysis, it just initializes the system: setting environment variables, internal Make variables and generic rules like \inlinecode{./project make clean} (to clean/delete all built products, not software) or \inlinecode{./project make dist} (to package the project into a tarball for distribution) among others. To get a good fealing of the system, it is recommended to look through this file. \inlinecode{download.mk} has some commonly necessary steps to facilitate the importation of input datasets: a simple \inlinecode{wget} command is not usually enough. We also want to check if a local copy exists and also calculate the file's checksum to verfiy it. 
+\tonote{------------------------Continue here===} @@ -1957,6 +1992,7 @@ Furthermore, the fact that a Tale is stored as a binary Docker container causes The XML that contains a log of the outputs is also interesting. \item \citet{becker17} Discuss reproducibility methods in R. \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citep{gabriel11}. + \item \citet{menke20} show how software identifiability has seen the best improvement, so there is hope! \end{itemize} diff --git a/tex/src/references.tex b/tex/src/references.tex index f57b26e..63ea0b2 100644 --- a/tex/src/references.tex +++ b/tex/src/references.tex @@ -1,3 +1,17 @@ +@ARTICLE{menke20, + author = {Joe Menke and Martijn Roelandse and Burak Ozyurt and Maryann Martone and Anita Bandrowski}, + title = {Rigor and Transparency Index, a new metric of quality for assessing biological and medical science methods}, + year = {2020}, + journal = {bioRxiv}, + volume = {}, + pages = {2020.01.15.908111}, + doi = {10.1101/2020.01.15.908111}, +} + + + + + @ARTICLE{infante20, author = {{Infante-Sainz}, Ra{\'u}l and {Trujillo}, Ignacio and {Rom{\'a}n}, Javier}, -- cgit v1.2.1