-rw-r--r--  paper.tex                             138
-rw-r--r--  tex/src/figure-file-architecture.tex    2
-rw-r--r--  tex/src/preamble-biblatex.tex           2
3 files changed, 73 insertions, 69 deletions
diff --git a/paper.tex b/paper.tex
index abedeb5..fce9988 100644
--- a/paper.tex
+++ b/paper.tex
@@ -82,16 +82,15 @@
\label{sec:introduction}
The increasing volume and complexity of data analysis has been extraordinarily productive, giving rise to a new branch of ``Big Data'' in many fields of the sciences and industry.
-However, given its inherent complexity, as the mere results are barely useful only by themselves, questions naturally arise on its lineage/provenance:
+However, given its inherent complexity, the results alone are barely useful; questions naturally arise about their lineage or provenance:
What inputs were used?
How were the configurations or training data chosen?
What operations were done on those inputs, how were the plots made?
-Figure \ref{fig:questions} provides a more detailed visual representation of such questions at various stages of the workflow.
+See Figure \ref{fig:questions} for some similar questions, classified by their stage in the project.
\tonote{Johan: add some general references.}
-Due to the complexity of modern data analysis, a small deviation in the final result can be due to any or many different steps, which may be significant for its interpretation, and hence
-publishing the \emph{complete} codes of the analysis is the only way to fully understand the results, and to avoid wasting time
-and resources.
+Due to the complexity of modern data analysis, a small deviation in the final result can arise from any of many different steps, each of which may be significant for its interpretation.
+Publishing the \emph{complete} set of operations is the only way to avoid ambiguity and wasted resources.
For example, \citet{smart18} describes how a 7-year-old conflict in theoretical condensed matter physics was only identified after the relevant codes were shared.
\citet{miller06} found a mistaken column flipping in a project's workflow, leading to the retraction of 5 papers in major journals, including \emph{Science}.
\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``\emph{forensic bioinformatics}''.
@@ -115,7 +114,7 @@ The status in other fields, where workflows are not commonly shared, is highly l
Nature is already a black box which we are trying hard to unlock.
Not being able to experiment on the methods of other researchers is a self-imposed black box over it.
-The completeness of a paper's published workflow (usually within the ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors.
+The completeness of a project's published workflow (usually within the ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors.
Several studies have attempted to answer this with different levels of detail. For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in medical/biological field and is currently above $80\%$.
\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but fully succeeded in only 2 of them and partially in 6.
\citet{chang15} attempted to reproduce 67 papers in well-regarded journals in Economics with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all. \tonote{DVG: even after contacting the authors?}
@@ -162,10 +161,10 @@ As a consequence, before starting with the technical details it is important to
\item \label{definition:project}\textbf{Project:}
A project is the series of operations that are done on input(s) to produce outputs.
- This definition is therefore very similar to ``\emph{workflow}'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, a project also includes both the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how
+ This definition is therefore very similar to ``\emph{workflow}'' \citep[e.g.,][]{oinn04, goecks10}, but because the published narrative paper/report is also an output, a project also includes both the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how
its visualizations were created.
- In a good project, all analysis scripts (e.g., written in Python, packages in R, libraries/programs in C/C++, etc.) are well-defined as an independently managed software with clearly defined inputs, outputs and no side-effects.
+ In a well-designed project, all analysis steps (e.g., scripts in Python, packages in R, libraries/programs in C/C++, etc.) are written to be modular, i.e., executable independently of the rest, with well-defined inputs and outputs and no side effects.
This is crucial help for debugging and experimenting during the project, and also for their re-usability in later projects.
As a consequence, such analysis scripts/programs are defined above as ``inputs'' for the project.
A project hence does not include any analysis source code (to the extent this is possible), it only manages calls to them.
@@ -233,9 +232,13 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora
They also have very complex dependency trees, making them extremely vulnerable and hard to maintain. For example, see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies).
It is important to remember that the longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency.
- Many existing tools therefore do not attempt to store the project as plain text, but pre-built binary blobs (containers or virtual machines) that can rarely be recreated\footnote{Using the package manager of the container's OS, or Conda which are both highly dependent on the time they were created.} and also have a short lifespan\footnote{For example Docker only works on Linux kernels that are on long-term support, not older.
- Currently, this is Linux 3.2.x that was initially released in 2012. The current Docker images may not be usable in a similar time frame in the future.}.
- As plain-text, even if it is no longer executable due to much-evolved technologies, it is still human-readable and parseable by any machine.
+ Many existing tools therefore do not attempt to store the project as plain text, but as pre-built binary blobs (containers or virtual machines) that can rarely be recreated and also have a short lifespan.
+ Their recreation is hard because most are built with the package manager of the blob's OS, or Conda.
+ Both are highly dependent on the time they are executed: precise versions are rarely stored, and the servers remove old binaries.
+ Docker containers are a good example of this short lifespan: Docker only runs on long-term support OSs, not on older releases.
+ On GNU/Linux systems, this corresponds to Linux kernel 3.2.x (initially released in 2012) and above.
+ Docker images made today may therefore not be usable in a similar time frame in the future.
+ As plain text, besides being extremely small in volume ($\sim100$ kilobytes), the project remains human-readable and parseable by any machine, even when it can no longer be executed.
\item \label{principle:modularity}\textbf{Modularity:}
A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs having no side-effects.
@@ -298,15 +301,23 @@ IPOL, which uniquely stands out in other principles, fails here: only the final
\label{sec:maneage}
Maneage is an implementation of the principles of Section \ref{sec:principles}: it is complete (\ref{principle:complete}), modular (\ref{principle:modularity}), has minimal complexity (\ref{principle:complexity}), verifies its inputs \& outputs (\ref{principle:verify}), preserves temporal provenance (\ref{principle:history}) and finally, it is free software (\ref{principle:freesoftware}).
In practice, Maneage is a collection of plain-text files that are distributed in pre-defined sub-directories by context (a modular source), and are all under version-control, currently with Git.
-The main Maneage Branch is a fully-working skeleton of a project without much flesh: it contains all the low-level infrastructure, but without any actual high-level analysis operations\footnote{In the core Maneage branch, only a simple demo analysis is included to be complete, and can easily be removed: all its files and steps have a \inlinecode{delete-me} prefix.}.
-Maneage contains a file called \inlinecode{README-hacking.md}\footnote{Note that the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself.} that has a complete checklist of steps to start a new project and remove demonstration parts.
+The main Maneage Branch is a fully-working skeleton of a project without much flesh: it contains all the low-level infrastructure, but without any actual high-level analysis operations.
+Maneage contains a file called \inlinecode{README-hacking.md} (the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself) that has a complete checklist of steps to start a new project and remove demonstration parts.
There are also hands-on tutorials to help new users.
-To start a new project, the authors just \emph{clone}\footnote{In Git, the ``clone'' operation is the process of copying all the project's files and history from a repository onto the local system.} Maneage, create their own Git branch over the latest commit, and start their project by customizing that branch.
-Customization in their project branch is done by adding the names of the software they need, references to their input data, the analysis commands, visualization commands, and a narrative report which includes the visualizations.
+To start a new project, the authors \emph{clone} Maneage, create a branch, and start their project by customizing that branch as shown below.
+In Git, ``cloning'' imports the project's files and history from a repository to the local system.
+Customization is done by adding the names of the necessary software, references to the input data, the analysis and visualization commands, and a narrative description.
This will usually be done in multiple commits in the project's duration (perhaps multiple years), thus preserving the project's history: the causes of all choices, the authors and times of each change, failed tests, etc.
-Figure \ref{fig:files} shows this directory structure containing the modular plain-text files (classified by context in sub-directories) and some representative files in each directory.
+\begin{lstlisting}[language=bash]
+ git clone https://gitlab.com/maneage/project.git # Clone Maneage, default branch `maneage'.
+ mv project my-project && cd my-project # Set custom name and enter directory.
+ git remote rename origin origin-maneage # Rename remote server to use `origin' later.
+ git checkout -b master # Make new `master' branch, start customizing.
+\end{lstlisting}
+
+Figure \ref{fig:files} shows the directory structure of the cloned project and some representative files in each directory.
The top-level source has only very high-level components: the \inlinecode{project} shell script (POSIX-compliant) that is the main interface to the project, as well as the paper's \LaTeX{} source, documentation and a copyright statement.
Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} files) and \inlinecode{reproduce/} (containing all other parts of the project).
@@ -326,7 +337,8 @@ Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} fil
}
\end{figure}
-The \inlinecode{project} script is a high-level wrapper to interface with Maneage and in its current implementation has two main phases as shown below: (1) configuration, where the necessary software are built and the environment is setup, and (2) analysis, where data are accessed and the software is run on them to create visualizations and the final report.
+The \inlinecode{project} script is a high-level wrapper to interface with Maneage.
+It has two main phases as shown below: (1) configuration, where the necessary software are built and the environment is set up, and (2) analysis, where data are accessed and the software is run on them to create visualizations and the final report.
In practice, these steps are run with two commands:
\begin{lstlisting}[language=bash]
@@ -334,8 +346,8 @@ In practice, these steps are run with two commands:
./project make # Do the analysis (download data, run software on data, build PDF).
\end{lstlisting}
-We now delve deeper into the implementation and some usage details of Maneage.
-Section \ref{sec:usingmake} elaborates why Make (a POSIX tool) was chosen as the main job manager in Maneage.
+The general implementation of Maneage is discussed below.
+Section \ref{sec:usingmake} elaborates why Make was chosen as the main job manager.
Sections \ref{sec:projectconfigure} \& \ref{sec:projectanalysis} then discuss the operations done during the configuration and analysis phase.
Afterwards, we describe how Maneage projects benefit from version control in Section \ref{sec:projectgit}.
Section \ref{sec:collaborating} discusses the sharing of a built environment, and in Section \ref{sec:publishing} the publication/archival of Maneage projects is discussed.
@@ -343,11 +355,12 @@ Section \ref{sec:collaborating} discusses the sharing of a built environment, an
\subsection{Job orchestration with Make}
\label{sec:usingmake}
-Scripts (in Shell, Python, or any other high-level language) are usually the first solution that come to mind when non-interactive, or batch, processing is needed (the completeness principle, see \ref{principle:complete}),
-However, the inherent complexity and non-linearity of progress, as a project evolves, makes it hard to manage such scripts.
-For example, if $90\%$ of a research project is done and only the newly-added final $10\%$ must be executed, a script will always start from the beginning.
-It is possible to manually ignore (with some conditionals) or comment already completed parts, however this only adds to the complexity and will discourage experimentation on an already-completed part of the project.
+Scripts (in Shell, Python, or any other high-level language) are usually the first solution that comes to mind when non-interactive, or batch, processing is needed.
+However, as a project evolves, its inherent complexity and the non-linearity of its progress make such scripts hard to manage.
+For example, if $90\%$ of a research project is done and only the newly-added final $10\%$ must be executed, a script will re-do the whole project every time.
+It is possible to manually ignore (with some conditionals) already completed parts; however, this only adds to the complexity and discourages experimentation on already-completed parts of the project.
These problems motivated the creation of Make in the early Unix operating system \citep{feldman79}.
+Make continues to be a core component of modern OSs, is actively maintained, and has withstood the test of time.
The Make paradigm starts from the end: the final \emph{target}.
In Make's syntax, the process is broken into atomic \emph{rules} where each rule has a single \emph{target} file which can depend on any number of \emph{prerequisite} files.
@@ -355,7 +368,7 @@ To build the target from the prerequisites, each rule also has a \emph{recipe} (
The plain-text files containing Make rules and their components are called Makefiles.
Note that Make does not replace scripting languages like the shell, Python or R.
It is a higher-level structure enabling modular/atomic scripts (in any language) to be put into a workflow.
-The formal connection of targets with prerequisites, as defined in Make, enables the creation of an optimized workflow that is very mature and has withstood the test of time: almost all OSs rely on it.
+The formal connection of targets with prerequisites that it defines enables the creation of an optimized workflow.
Besides formalizing data lineage, Make also greatly encourages experimentation in a project because a recipe is executed only when at least one prerequisite is more recent than its target.
Therefore, when only $5\%$ of a project's targets are affected by a change, only they will be recreated, the other $95\%$ remaining dormant.
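+As a minimal illustration of this paradigm (a hypothetical rule with made-up file names, not part of Maneage's source), a plot that is built from a table by a plotting script could be expressed as below; its recipe is only re-run when the table or the script is newer than the plot.
+
+\begin{lstlisting}[language=make]
+# Hypothetical rule: `plot.pdf' is the target, `table.txt' and
+# `plot.py' are its prerequisites, and the command below them is
+# the recipe (which must be indented with a TAB character).
+plot.pdf: table.txt plot.py
+	python plot.py table.txt plot.pdf
+\end{lstlisting}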
@@ -374,49 +387,46 @@ Because of its simplicity, we have also had very good feedback on using Make fro
\label{sec:projectconfigure}
Maneage orchestrates the building of its necessary software in the same language that it orchestrates the analysis: Make (see Section \ref{sec:usingmake}).
-Therefore, a researcher already using Maneage for their high-level analysis easily understands, and can customize, the software environment too, without delving into the intricacies of third-party tools.
+Therefore, a researcher already using Maneage for their high-level analysis can easily understand and customize the software environment too, without delving into the intricacies of third-party tools.
Most existing tools reviewed in Section \ref{sec:principles} use package managers like Conda to maintain the software environment, but since Conda itself is written in Python, it does not fulfill our completeness principle \ref{principle:complete}.
Highly-robust solutions like Nix and GNU Guix do exist, but they require root permissions which is also problematic for this principle.
Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure \ref{fig:files}.
-At the start of project configuration, Maneage needs a top-level directory to build itself on the host filesystem (software and analysis).
-We call this the ``build directory'' and it must not be under the source directory (see \ref{principle:modularity}): by default Maneage will not touch any file in its source.
-No other location on the running operating system will be affected by the project and the build directory should not affect the result, so its value is not under version control.
-Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally and do not need to be downloaded: 1) software tarball directory and 2) input data directory.
+At the start of project configuration, Maneage needs a top-level directory on the host in which to build itself (both the software and the analysis outputs).
+We call this the ``build directory'' and it must not be under the source directory (see \ref{principle:modularity}).
+No other location on the running OS will be affected by the project, including the source directory.
+Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally: 1) software tarball directory and 2) input data directory.
Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} elaborate more on the building of the required software and the important problem of software citation.
+A Maneage project can be configured in a container or virtual machine to facilitate moving the project without rebuilding everything from source.
+However, such binary blobs are optional outputs of Maneage; they are not its primary storage/archival format.
+
\subsubsection{Verifying and building necessary software from source}
\label{sec:buildsoftware}
-To compile the necessary software from source Maneage currently needs the host to have a \emph{C} compiler (available on any POSIX-compliant OS).
-This \emph{C} compiler will be used by Maneage to build and install (in the build directory) all necessary software and their dependencies with fixed versions.
-The dependency tree goes all the way down to core operating system components like GNU Bash, GNU AWK, GNU Coreutils, and many more on all supported operating systems (including macOS, not just GNU/Linux).
+To compile the necessary software from source, Maneage currently needs the host to have C and C++ compilers (available on any POSIX-compliant OS).
+These compilers will be used by Maneage to build and install (in the build directory) all necessary software and their dependencies with fixed versions.
+The dependency tree goes all the way down to core operating system components like GNU Bash, GNU AWK and GNU Coreutils on all supported operating systems (including macOS, not just GNU/Linux).
For example, the full list of installed software for this paper is automatically available in the Acknowledgments section of this paper.
-On GNU/Linux OSs, a fixed version of the GNU Binutils and GNU C Compiler (GCC) is also included, and soon Maneage will also install its own fixed version of the GNU C Library to be fully independent of the host on such systems (Task 15390\footnote{\url{https://savannah.nongnu.org/task/?15390}}).
-In effect, except for the Kernel, Maneage builds all other components of the GNU OS on the host from source.
+On GNU/Linux OSs, fixed versions of GNU Binutils and the GNU C Compiler (GCC) are also built from source, and a fixed version of the GNU C Library will soon be added to make Maneage fully independent of the host on such systems (task 15390).
+In effect, except for the kernel, Maneage builds all other necessary components of the OS.
-The software source code may already be present on the host filesystem, if not, it can be downloaded.
-Before being used to build the software, it will be validated by its SHA-512 checksum (which is already stored in the project).
+Before building each software package, its source code is validated against its SHA-512 checksum (which is already stored in the project).
Maneage includes a large collection of scientific software (and their dependencies) that are usually not necessary in all projects.
Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory, see Figure \ref{fig:files}.
-All the high-level software dependencies are codified in Maneage as Make \emph{prerequisites}, and hence the specified software will be automatically built after its dependencies.
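+As a rough sketch of this mechanism (all file names, Make variables and configure options below are illustrative, not Maneage's actual rules), the build of one software package could be written as the rule below: the verified source tarball and the already-built dependencies are prerequisites of the installed program.
+
+\begin{lstlisting}[language=make]
+# Hypothetical sketch: names, Make variables and options are
+# illustrative only. The first recipe line aborts the build if the
+# tarball's SHA-512 checksum does not match the stored value.
+$(ibdir)/gawk: $(tardir)/gawk-5.1.0.tar.gz $(ibdir)/bash
+	echo "$(gawk-checksum)  $<" | sha512sum -c -
+	tar -xf $< && cd gawk-5.1.0 && \
+	  ./configure --prefix=$(idir) && make && make install
+\end{lstlisting}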
-
-Note that project configuration can be done in a container or virtual machine to facilitate moving the project.
-The important factor, however, is that such binary blobs are optional outputs of Maneage, they are not its primary storage/archival format.
\subsubsection{Software citation}
\label{sec:softwarecitation}
-Maneage contains the full list of built software for each project, their versions and their configuration options, but
- this information is buried deep into each project's source.
-Maneage also prints a distilled fraction of this information in the project's final report, blended into the narrative, as seen in the Acknowledgments of this paper.
+Maneage contains the full list of built software for each project, their versions and their configuration options, but this information is buried deep into each project's source.
+Therefore Maneage also prints a distilled fraction of this information in the project's final report, blended into the narrative, as seen in the Acknowledgments of this paper.
Furthermore, when the software is associated with a published paper, that paper's Bib\TeX{} entry is also added to the final report and is duly cited with the software's name and version.
-For example\footnote{In this paper we have used very basic tools that are not accompanied by a paper}, see the software acknowledgement sections of \citet{akhlaghi19} and \citet{infante20}.
+This paper uses basic tools that are not accompanied by a paper; for examples of software citation, see \citet{akhlaghi19} and \citet{infante20}.
This is particularly important in the case for research software, where citation is critical to justify continued development.
One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts.
Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros!
-This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``\emph{history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''.
+This is justified by an uncomfortably true statement: ``\emph{history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''.
Most research software does not resort to such drastic measures, however, but proper citation is not only important but also ethical.
Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software.
@@ -428,11 +438,11 @@ We plan to enable these tools in future versions of Maneage.
-\subsection{Analysis of the Project}
+\subsection{Project analysis}
\label{sec:projectanalysis}
Once the project is configured (Section \ref{sec:projectconfigure}), a unique and fully-controlled environment is available to execute the analysis.
-All analysis operations run such that the host's OS settings cannot penetrate it, enabling an isolated environment without the extra layer of containers or a virtual machine.
+All analysis operations run with no influence from the host OS, enabling an isolated environment without the extra layer of containers or a virtual machine.
In Maneage, a project's analysis is broken into two phases: 1) preparation, and 2) analysis.
The former is mostly necessary to optimize extremely large datasets and is only useful for advanced users, while following an identical internal structure to the latter.
We will therefore not go any further into it and refer the interested reader to the documentation.
@@ -443,11 +453,10 @@ Large files are in general a bad practice and do not fulfil the modularity princ
Maneage is thus designed to encourage and facilitate modularity by distributing the analysis into many Makefiles that contain contextually-similar analysis steps.
In the rest of this paper these modular, or lower-level, Makefiles will be called \emph{subMakefiles}.
-The subMakefiles are loaded into the special Makefile \inlinecode{top-make.mk} with a certain order and executed in one instance of Make\footnote{The subMakefiles are loaded into \inlinecode{top-make.mk} using Make's \inlinecode{include} directive.
- Hence no recursion is used (where one instance of Make, calls Make within itself) because recursion is against the minimal complexity principle and can make the code very hard to read \ref{principle:complexity}.}.
-When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}), calls \inlinecode{top-make.mk}.
-All these Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make}, see Figure \ref{fig:files}.
-Figure \ref{fig:datalineage} schematically shows these subMakefiles and their relation with each other with the targets they build.
+When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}) calls \inlinecode{top-make.mk}, which loads the subMakefiles in a certain order.
+They are loaded using Make's \inlinecode{include} feature, so Make sees everything as one file within a single instance.
+By default Maneage does not use recursion (where one instance of Make calls another instance of Make within itself), to comply with the minimal complexity principle (\ref{principle:complexity}) and keep the code's logic clear and simple.
+All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make} (see Figure \ref{fig:files}) and Figure \ref{fig:datalineage} shows their inter-relation with the target/built files that they manage.
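+A simplified, illustrative sketch of this loading (the real \inlinecode{top-make.mk} differs in its details) is shown below.
+
+\begin{lstlisting}[language=make]
+# Simplified sketch: all subMakefiles are read into the single
+# running instance of Make with `include'; no recursion is used.
+makesrc = initialize download format demo-plot verify paper
+include $(foreach s, $(makesrc), reproduce/analysis/make/$(s).mk)
+\end{lstlisting}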
\begin{figure}[t]
\begin{center}
@@ -506,9 +515,9 @@ These \LaTeX{} macro files thus form the core skeleton of a Maneage project: as
Before the modular \LaTeX{} macro files of Section \ref{sec:valuesintext} are merged into the single \inlinecode{project.tex} file, they need to pass through the verification filter, which is another core principle of Maneage (\ref{principle:verify}).
Note that simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: many tools write the creation date into the produced files.
-To avoid such cases the raw data (independently of their metadata-like creation date) must be verified. Some standards include such features, for example, the \inlinecode{DATASUM} keyword in the FITS format \citep{pence10}.
-To facilitate output verification, Maneage has the \inlinecode{verify.mk} subMakefile (see Figure \ref{fig:datalineage}) which is the second-last subMakefile.
-Verification is thus the boundary between the analytical phase of the paper, and the production of the report.
+To avoid such cases, the raw data must be verified independently of metadata like the creation date.
+Some standards include such features, for example, the \inlinecode{DATASUM} keyword in the FITS format \citep{pence10}.
+To facilitate output verification, Maneage has the \inlinecode{verify.mk} subMakefile, which is the boundary between the analytical phase of the paper and the production of the report (see Figure \ref{fig:datalineage}).
It has some tests on pre-defined formats, and other formats can easily be added.
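+As an illustrative sketch of such a test (not the actual contents of \inlinecode{verify.mk}; the pattern rule and checksum variable are hypothetical), a plain-text table could be verified by stripping its comment/metadata lines before computing the checksum:
+
+\begin{lstlisting}[language=make]
+# Illustrative only: comment/metadata lines (starting with `#') are
+# removed so that only the raw data rows are checksummed, then the
+# result is compared with the expected value stored in the project.
+%.verified: %.txt
+	test "x$$(sed -e '/^#/d' $< | sha512sum | awk '{print $$1}')" \
+	     = "x$(expected-checksum)" && touch $@
+\end{lstlisting}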
\subsubsection{The analysis}
@@ -541,29 +550,24 @@ Note that their location after the standard starting subMakefiles (initializatio
\end{figure}
To enhance the original plot, Figure \ref{fig:toolsperyear} also shows the number of papers that were studied each year.
-Its horizontal axis shows the full range of the data (starting from \menkefirstyear, while the original Figure 1C in \citet{menke20} starts from 1997).
-The probable reason is that \citet{menke20} decided to avoid earlier years due to their small number of papers.
-For example, in \menkenumpapersdemoyear, they had only studied \menkenumpapersdemocount{} papers.
-Note that both the numbers of the previous sentence (\menkenumpapersdemoyear{} and \menkenumpapersdemocount), and the dataset's oldest year (mentioned above: \menkefirstyear) are automatically generated \LaTeX{} macros, see \ref{sec:valuesintext}.
+Furthermore, its horizontal axis shows the full range of the data (starting from \menkefirstyear), while the original starts from 1997.
+This was probably because they did not have sufficient data for older papers; for example, in \menkenumpapersdemoyear, they had only studied \menkenumpapersdemocount{} papers.
+Note that both the numbers of the previous sentence (\menkenumpapersdemoyear{} and \menkenumpapersdemocount), and the dataset's oldest year (mentioned above: \menkefirstyear) are automatically generated \LaTeX{} macros, see Section \ref{sec:valuesintext}.
They are \emph{not} typeset manually in this narrative explanation.
This step (generating the macros) is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{tools-per-year.txt} to \inlinecode{demo-plot.tex}.
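+As an illustrative sketch of this step (the directory variables and the exact AWK command are hypothetical; only the file and macro names come from the lineage above), such a macro could be produced by a rule like the following.
+
+\begin{lstlisting}[language=make]
+# Hypothetical sketch: a value computed from the table is written
+# into the macro file as a LaTeX \newcommand, which the narrative
+# can then use (assuming the table's first row holds the oldest year).
+$(mtexdir)/demo-plot.tex: $(a2dir)/tools-per-year.txt
+	v=$$(awk 'NR==1 {print $$1}' $<); \
+	printf '\\newcommand{\\menkefirstyear}{%s}\n' "$$v" > $@
+\end{lstlisting}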
-To create Figure \ref{fig:toolsperyear}, we used the \LaTeX{} package PGFPlots. The final analysis output we needed was therefore a simple plain-text table with 3 columns (year, paper per year, tool fraction per year).
+To create Figure \ref{fig:toolsperyear}, we used the PGFPlots package within \LaTeX{}.
+Therefore, the necessary analysis output to feed into PGFPlots was a simple plain-text table with 3 columns (year, paper per year, tool fraction per year).
This table is shown in the lineage graph of Figure \ref{fig:datalineage} as \inlinecode{tools-per-year.txt} and the PGFPlots source to generate this figure is located in \inlinecode{tex\-/src\-/figure\--tools\--per\--year\-.tex}.
If another plotting tool was desired (for example \emph{Python}'s Matplotlib, or Gnuplot), the built graphic file (for example \inlinecode{tools-per-year.pdf}) could be the target instead of the raw table.
-The \inlinecode{tools-per-year.txt} is a value-added table with only \menkenumyears{} rows (counting per year), the original dataset had \menkenumorigrows{} rows (one row for each year of each journal).
+Note that \inlinecode{tools-per-year.txt} is a value-added table with only \menkenumyears{} rows (one row for every year).
+The original dataset had \menkenumorigrows{} rows (one row for each year of each journal).
We see in Figure \ref{fig:datalineage} that it is defined as a Make \emph{target} in \inlinecode{demo-plot.mk} and that its prerequisite is \inlinecode{menke20-table-3.txt} (schematically shown by the arrow connecting them).
Note that both the row numbers mentioned at the start of this paragraph are also macros.
Again from Figure \ref{fig:datalineage}, we see that \inlinecode{menke20-table-3.txt} is a target in \inlinecode{format.mk} and its prerequisite is the input file \inlinecode{menke20.xlsx}.
The input files (which come from outside the project) are all \emph{targets} in \inlinecode{download.mk} and further discussed in Section \ref{sec:download}.
-Having prepared the full dataset in a simple format, let's report the number of subjects (papers and journals) that were studied in \citet{menke20}.
-The necessary file for this measurement is \inlinecode{menke20-table-3.txt}.
-and this calculation is done with a simple AWK command, and the results are written in \inlinecode{format.tex}, which is automatically loaded into this paper's source along with the macro files.
-In the built PDF paper, the two macros expand to $\menkenumpapers$ (number of papers studied) and $\menkenumjournals$ (number of journals studied) respectively.
-This step is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{format.tex}.
-
\subsubsection{Importing and validating inputs (\inlinecode{download.mk})}
diff --git a/tex/src/figure-file-architecture.tex b/tex/src/figure-file-architecture.tex
index 1fc26c5..c3b55ff 100644
--- a/tex/src/figure-file-architecture.tex
+++ b/tex/src/figure-file-architecture.tex
@@ -35,7 +35,7 @@
\node [dirbox, at={(-5.75cm,1.5cm)}, minimum width=2.6cm, minimum height=2.1cm,
label={[shift={(0,-5mm)}]\texttt{config/}}, fill=brown!25!white] {};
\ifdefined\fullfilearchitecture
- \node [node-nonterminal-thin, at={(-5.75cm,0.8cm)}] {LOCAL.conf.in};
+ \node [node-nonterminal-thin, at={(-5.75cm,0.8cm)}] {TARGETS.conf};
\fi
\node [node-nonterminal-thin, at={(-5.75cm,0.3cm)}] {versions.conf};
\ifdefined\fullfilearchitecture
diff --git a/tex/src/preamble-biblatex.tex b/tex/src/preamble-biblatex.tex
index caa03b4..fbf72ce 100644
--- a/tex/src/preamble-biblatex.tex
+++ b/tex/src/preamble-biblatex.tex
@@ -74,7 +74,7 @@
\addbibresource{tex/build/macros/dependencies-bib.tex}
\renewbibmacro{in:}{}
\AtEveryBibitem{\clearfield{month}}
-\renewcommand*{\bibfont}{\footnotesize}
+\renewcommand*{\bibfont}{\normalsize}
\DefineBibliographyStrings{english}{references = {References}}
%% Include the adsurl field key into those that are recognized: