Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 82
1 files changed, 39 insertions, 43 deletions
diff --git a/paper.tex b/paper.tex
index 6613d6f..a4f546d 100644
--- a/paper.tex
+++ b/paper.tex
@@ -397,7 +397,7 @@ Thus, a researcher using Maneage for high-level analysis easily understands and
The existing tools listed in Section \ref{sec:principles} mostly use package managers like Conda to maintain the software environment, but Conda itself is written in Python, contrary to our completeness principle \ref{principle:complete}.
Highly-robust solutions like Nix and GNU Guix exist, but these require root permissions, contrary to principle P1.3.
-Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source (see Fig.~\ref{fig:files}).
+Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source (see Figure \ref{fig:files}).
At the start of project configuration, Maneage needs a top-level directory to build itself on the host (software and analysis).
We call this the ``build directory'' and it must not be located inside the source directory (see \ref{principle:modularity}).
No other location on the running OS will be affected by the project, including the source directory.
@@ -420,7 +420,7 @@ However, such binary blobs are not the primary storage/archival format of Maneag
Before building the software, their source codes are validated by their SHA-512 checksum (stored in the project).
Maneage includes a growing collection of scientific software (and its dependencies), much of which is superfluous for any single project.
-Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory (Fig.~\ref{fig:files}).
+Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory (Figure \ref{fig:files}).
\subsubsection{Software citation}
\label{sec:softwarecitation}
@@ -460,7 +460,7 @@ Large files are in general a bad practice and against the modularity and minimal
Maneage is thus designed to encourage and facilitate modularity by distributing the analysis into many Makefiles that contain contextually-similar analysis steps.
Hereafter, these lower-level Makefiles are termed \emph{subMakefiles}.
When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}) calls \inlinecode{top-make.mk}, which loads the subMakefiles using the \inlinecode{include} directive (see Section \ref{sec:analysis}).
-All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make} (see Figure \ref{fig:files}). Figure~\ref{fig:datalineage} shows their relationship with the target/built files that they manage.
+All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make} (see Figure \ref{fig:files}). Figure \ref{fig:datalineage} shows their relationship with the target/built files that they manage.
To keep the project's logic clear and simple (minimal complexity principle, \ref{principle:complexity}), recursion (where one instance of Make calls Make internally) is, by default, not used.
\begin{figure}[t]
@@ -526,12 +526,11 @@ Other formats can easily be added.
\subsubsection{The analysis}
\label{sec:analysis}
-The organizing of the analysis into modular subMakefiles is discussed above.
-We illustrate this here with the practical example of replicating Figure~1C of M20, with some enhancements, in Figure~\ref{fig:toolsperyear}.
-As shown in Figure~\ref{fig:datalineage}, for this example we split this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}.
+The analysis is demonstrated with the practical example of replicating Figure 1C of M20, with some enhancements, in Figure \ref{fig:toolsperyear}.
+As shown in Figure \ref{fig:datalineage}, for this example we split this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}.
The former converts the Excel-formatted input into comma-separated value (CSV) format, and the latter generates the table to build Figure \ref{fig:toolsperyear}.
In a real project, subMakefiles could, and will, be much more complex.
-Figure \ref{fig:topmake} shows how the two subMakefiles are placed as values in the \inlinecode{makesrc} variable of \inlinecode{top-make.mk}, without their suffixes (see Section \ref{sec:valuesintext}).
+Figure \ref{fig:topmake} shows how the two subMakefiles are placed as values in the \inlinecode{makesrc} variable of \inlinecode{top-make.mk}, without their suffixes as described in Section \ref{sec:valuesintext}.
Their location after the standard starting subMakefiles (initialization and download) and before the standard ending subMakefiles (verification and final paper) is important, along with their order.
\begin{figure}[t]
@@ -552,23 +551,22 @@ Their location after the standard starting subMakefiles (initialization and down
}
\end{figure}
-To enhance the original plot, Figure \ref{fig:toolsperyear} also shows the number of papers that were studied each year.
-Furthermore, its horizontal axis shows the full range of the data (starting from \menkefirstyear), while the original starts from 1997.
+To enhance the original M20 plot, Figure \ref{fig:toolsperyear} also shows the number of papers in each year and its horizontal axis shows the full range of the data (starting from \menkefirstyear), while M20 starts from 1997.
This was probably because the authors judged the earlier years' data to be too noisy. For example, in \menkenumpapersdemoyear, only \menkenumpapersdemocount{} papers were analysed.
Both the numbers in the previous sentence (\menkenumpapersdemoyear{} and \menkenumpapersdemocount), and the dataset's oldest year (mentioned above: \menkefirstyear) are automatically generated \LaTeX{} macros, see Section \ref{sec:valuesintext}.
These are \emph{not} typeset manually in this narrative explanation.
This step (generating the macros) is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{tools-per-year.txt} to \inlinecode{demo-plot.tex}.
To create Figure \ref{fig:toolsperyear}, we used the PGFPlots package within \LaTeX{}.
-Therefore, the necessary analysis output to feed into PGFPlots was a plain-text table with 3 columns (year, paper per year, tool fraction per year).
+Therefore, the necessary analysis output to feed into \LaTeX{} was a plain-text table with 3 columns (year, paper per year, tool fraction per year).
This table is shown in the lineage graph of Figure \ref{fig:datalineage} as \inlinecode{tools-per-year.txt}, and the PGFPlots source to generate this figure is located in \inlinecode{tex\-/src\-/figure\--tools\--per\--year\-.tex}.
-If another plotting tool was desired (for example \emph{Python}'s Matplotlib, or Gnuplot), the built graphic file (for example \inlinecode{tools-per-year.pdf}) could be the target instead of the raw table.
+If another plotting tool was desired (for example Python's Matplotlib, or Gnuplot), the built graphic file (for example \inlinecode{tools-per-year.pdf}) would be the target instead.
The file \inlinecode{tools-per-year.txt} is a value-added table with only \menkenumyears{} rows (one row for every year).
The original dataset had \menkenumorigrows{} rows (one row for each year of each journal).
We see in Figure \ref{fig:datalineage} that it is defined as a Make \emph{target} in \inlinecode{demo-plot.mk} and that its prerequisite is \inlinecode{menke20-table-3.txt} (schematically shown by the arrow connecting them).
Both the row counts mentioned at the start of this paragraph are again macros.
-In Figure \ref{fig:datalineage}, we see that \inlinecode{menke20-table-3.txt} is a target in \inlinecode{format.mk} and its prerequisite is the input file \inlinecode{menke20.xlsx}.
+In Figure \ref{fig:datalineage}, we see that \inlinecode{menke20-table-3.txt} is a target in \inlinecode{format.mk} and its prerequisite is the input file \inlinecode{menke20.xlsx} (XLSX I/O is used for the conversion).
The input files (which come from outside the project) are all \emph{targets} in \inlinecode{download.mk} and are further discussed in Section \ref{sec:download}.
@@ -577,14 +575,14 @@ The input files (which come from outside the project) are all \emph{targets} in
\label{sec:download}
The \inlinecode{download.mk} subMakefile is present in all projects, containing common steps for importing the input dataset(s).
-All necessary input datasets for the project are imported through this subMakefile.
-Irrespective of where the dataset is \emph{used} in the project's lineage, the relation between the project and the outside world is maintained in this single subMakefile, aiming at modularity and minimal complexity (\ref{principle:modularity} \& \ref{principle:complexity}), and internet security.
+All necessary datasets are imported through this subMakefile, irrespective of where the dataset is \emph{used}.
+The relation between the project and the outside world is maintained in this single subMakefile, aiming at modularity (\ref{principle:modularity}), minimal complexity (\ref{principle:complexity}), and internet security.
Each external dataset has some basic information, including its expected name on the local system (for offline access), a checksum to validate it (either the whole file or just its main ``data'', as discussed in Section \ref{sec:outputverification}), and its URL/PID.
-In Maneage, such information regarding a project's input dataset(s) is in the \inlinecode{INPUTS.conf} file.
+In Maneage, this information is stored in the \inlinecode{INPUTS.conf} file.
See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively.
-We demonstrate this with the datasets of M20 stored in one \inlinecode{.xlsx} file on bioXriv.
-Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} where the necessary information is stored as Make variables and is automatically loaded into the full project when Make starts (and is most often used in \inlinecode{download.mk}).
+Figure \ref{fig:inputconf} demonstrates this for the dataset of M20 that is stored in one \inlinecode{.xlsx} file on bioRxiv.
+Each piece of information is stored as a Make variable and, like the variables of other configuration files, is automatically loaded into the full project when Make starts, so it is usable in any subMakefile.
\begin{figure}[t]
\input{tex/src/figure-src-inputconf.tex}
@@ -598,18 +596,16 @@ Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} wher
\subsubsection{Configuration files}
\label{sec:configfiles}
-The subMakefiles discussed above should only organize the analysis, they should not contain any fixed numbers, settings or parameters, which
-should instead be set as variables in configuration files.
+The subMakefiles discussed above should only organize the analysis; they should not contain any fixed numbers, settings or parameters, which should instead be set as variables in configuration files.
Configuration files logically separate the low-level implementation from the high-level running of a project.
In the data lineage plot of Figure \ref{fig:datalineage}, configuration files are shown as sharp-edged, green \inlinecode{*.conf} boxes in the top row (for example, the file \inlinecode{INPUTS.conf} that was shown in Figure \ref{fig:inputconf} and mentioned in Section \ref{sec:download}).
-All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} (see Figure \ref{fig:files}) subdirectory, and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles (Figure \ref{fig:topmake}).
+All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} subdirectory (see Figure \ref{fig:files}), and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles (Figure \ref{fig:topmake}), hence they are available to all of them.
The example analysis in Section \ref{sec:analysis}, in which we reported the number of papers studied by M20 in \menkenumpapersdemoyear, illustrates this.
-The year ``\menkenumpapersdemoyear'' is not written by hand in \inlinecode{demo-plot.mk}.\sloppy
-It is referenced through the \inlinecode{menke-year-demo} variable, which is defined in \inlinecode{menke-demo-year.conf}, which is a prerequisite of the \inlinecode{demo-plot.tex} rule.
-This is also visible in the data lineage of Figure \ref{fig:datalineage}.
+The year ``\menkenumpapersdemoyear'' is not written by hand in \inlinecode{demo-plot.mk}.
+It is referenced through the \inlinecode{menke-year-demo} variable, defined in \inlinecode{menke-demo-year.conf}, which is a prerequisite of the \inlinecode{demo\--plot\-.tex} rule (see Figure \ref{fig:datalineage}).
If we wished to report the number in a different year, it would be sufficient to change the value in \inlinecode{menke-demo-year.conf}.
-A configuration file is a prerequisite of the target that uses it, so its timestamp will be newer than \inlinecode{demo-plot.tex}.
+A configuration file is a prerequisite of the target that uses it, so after the change, its timestamp will be newer than \inlinecode{demo-plot.tex}.
Thus, Make will re-execute the recipe to generate the macro file before this paper is re-built and the corresponding year and value will be updated in this paper, always in synchronization with each other and no matter how many times they are used.
Combined with the fact that all source files in Maneage are under version control, this encourages testing of various settings of the
analysis as the project evolves in the case of exploratory research papers, and better self-consistency in hypothesis testing papers.
@@ -617,8 +613,7 @@ analysis as the project evolves in the case of exploratory research papers, and
\subsubsection{Project initialization (\inlinecode{initialize.mk})}
\label{sec:initialize}
-\fussy
-The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figure \ref{fig:datalineage}).
+The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figures \ref{fig:datalineage} \& \ref{fig:topmake}).
It does not contain any analysis or major processing steps; it just initializes the system by setting the necessary Make environment and doing other general jobs, like defining the Git commit hash of the run as a \LaTeX{} macro (\inlinecode{\textbackslash{}projectversion}) that can be loaded into the narrative.
Papers using Maneage usually put this hash as the last word in their abstract, for example, see \citet{akhlaghi19} and \citet{infante20}.
For the current version of this paper, it expands to \projectversion.
@@ -626,12 +621,13 @@ For the current version of this paper, it expands to \projectversion.
\subsection{Projects as Git branches of Maneage}
\label{sec:projectgit}
-Maneage contains only plain-text files. It can thus be efficiently maintained under version control systems (currently using Git).
+Maneage projects are primarily stored as plain-text files.
+They can thus be efficiently maintained under version control systems (currently using Git).
Every commit in the version-controlled history contains \emph{a complete} snapshot of the data lineage (see the completeness principle \ref{principle:complete}).
Maneage is maintained by its developers in a central branch, \inlinecode{man\-eage}.
The \inlinecode{man\-eage} branch contains all the low-level infrastructure, a skeleton, that is needed by any new project.
-To start a new project (Section \ref{sec:maneage}), users clone \inlinecode{man\-eage} from its reference repository and build their own Git branch or fork.
-This is demonstrated in Figure \ref{fig:branching}(a), where a project has started by branching off commit \inlinecode{0c120cb}.
+As shown in Section \ref{sec:maneage}, new projects start by cloning \inlinecode{man\-eage} and customizing their own Git branch or fork.
+Figure \ref{fig:branching}(a) shows how a project has started by branching off commit \inlinecode{0c120cb}.
%% Exact URLs of imported images.
%% Collaboration icon: https://www.flaticon.com/free-icon/collaboration_809522
@@ -650,10 +646,10 @@ This is demonstrated in Figure \ref{fig:branching}(a), where a project has start
}
\end{figure}
-After a project starts, Maneage will evolve. For example, new features will be added or low-level bugs will be fixed.
-Because all projects branch off the same branch that these infrastructure improvements are made, updating the project's low-level skeleton is as easy as merging the \inlinecode{maneage} branch into the project's branch.
-For example, in Figure \ref{fig:branching}(a), see how Maneage's \inlinecode{3c05235} commit has been merged into project's branch through commit \inlinecode{2ed0c82}.
-This allows infrastructure improvements and fixes to be easily propagated to all projects.
+After a project starts, Maneage will evolve with new features or fixed bugs.
+Because all projects branch from it, updating the project's low-level skeleton is as easy as merging the \inlinecode{maneage} branch into the project's branch.
+For example, in Figure \ref{fig:branching}(a), see how Maneage's \inlinecode{3c05235} commit has been merged into the project's branch in commit \inlinecode{2ed0c82}.
+Hence infrastructure improvements and fixes are easily propagated to all projects.
Another useful scenario is reviving a finished/published project at a later date, possibly by other researchers as shown in Figure \ref{fig:branching}(b), e.g., assuming the original project was completed years ago, and is no longer directly executable.
Other scenarios include projects that are created by merging various other projects.
@@ -665,24 +661,25 @@ Modern version control systems provide many more capabilities that can be levera
Because the project's source and build directories are separate, an option is provided for different users to share a build directory, while working on their own separate project branches during a collaboration.
This is similar to the parallel branch that is later merged in Figure \ref{fig:branching}(a).
To enable this mode, the \inlinecode{./project} script has an option \inlinecode{--group} that must be given the name of a (POSIX) user group in the host OS.
-All files built in the build directory are then automatically assigned to this user group, with read and write permissions.
-Permission management and avoiding conflicts in the build directory while members work on different branches is the responsibility of the team.
+All built files are then automatically assigned to this user group, with read and write permissions for all members.
+Permission management and avoiding conflicts in the build directory (while members work on different branches) are the responsibility of the team.
\subsection{Publishing the project}
\label{sec:publishing}
-In a scientific scenario, the subject report is submitted to a journal, while in an industrial context it is submitted to the customers or employers.
-To facilitate publication of the project's source, Maneage has a \inlinecode{dist} target, which is activated with the command \inlinecode{./project make dist}.
-In this mode, Maneage will not do any analysis, but will instead copy the full project's source (for the given commit, without the version history) into a temporary directory and compress it into a \inlinecode{.tar.gz} file.
-(The \inlinecode{dist-zip} target provides Zip compression as an alternative.)
-Since the complete project is in plain text, this compressed file is typically abut 100 kilobytes in size.
+In a scientific scenario, the final report is submitted to a journal, while in an industrial context it is submitted to the customers or employers.
+To facilitate publication of the project's source with the narrative, Maneage has a \inlinecode{dist} target, which is activated with \inlinecode{./project make dist}.
+In this mode, Maneage will not do any analysis, but will instead put the full project's source (for the given commit, without the version history), with all the built files that are necessary for \LaTeX{}, into a compressed \inlinecode{.tar.gz} file.
+This lets publishers create the report without necessarily building the full project; because the full project source is included, it can still be rebuilt.
+The \inlinecode{dist-zip} target provides Zip compression as an alternative.
+Depending on the built graphics used in the report, this compressed file will usually be roughly one megabyte.
However, the required inputs (\ref{definition:input}) and the outputs may be much bigger, from megabytes to petabytes.
This gives two scenarios for publication of the project: 1) publishing only the source, or 2) publishing the source with the data.
In the former case, the output of \inlinecode{dist} can be submitted to the journal as a supplement, or uploaded to pre-print servers like \href{https://arXiv.org}{arXiv} that will compile the \LaTeX{} source and build their own PDFs.
The Git history can also be archived as a single ``bundle'' file and submitted as a supplement.
When publishing with datasets, the project's outputs, and/or inputs, can be published on servers like Zenodo.
-For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} uploaded all the project's required software and its final PDF along with the project's source tarball and a Git ``bundle'' to Zenodo.
+For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} uploaded all the project's required software tarballs (mentioned in the acknowledgements) and its final PDF, along with the project's source and a Git ``bundle''.
@@ -695,11 +692,10 @@ For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481
\section{Discussion \& Caveats}
\label{sec:discussion}
-Maneage was created, and has evolved during various research projects.
The primordial implementation was written for \citet{akhlaghi15}.
It later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi (\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}).
With these, the customizable skeleton was separated from the flesh as a more abstract ``template''.
-Later, steps to build necessary software were included added, as used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
+Later, software building was also included and used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
After this paper is published, bugs will still be found and Maneage will continue to evolve and improve; notable changes beyond this paper will be kept in \inlinecode{README-hacking.md}.
Once adopted on a wide scale, Maneage projects can be fed into machine learning (ML) tools for automatic workflow generation, optimized for certain aspects of the result.