author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-03-08 00:51:30 +0000
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-03-08 00:51:30 +0000
commit c66f973ff865d0cdec38f940430221addb32c76f
tree 725e54663c2516a02acce4fbcb243eb22fb96355
parent e9c81f9f40187bc4701ac539d110003e92b9ca69
Edited description of example subMakefile for analysis-1
To make the description clearer and more readable, the rules in the demonstrated Makefile (and their links to the data lineage plot) are now described more explicitly.
-rw-r--r-- paper.tex 56
-rw-r--r-- tex/src/figure-mk20tab3.tex 19
2 files changed, 48 insertions, 27 deletions
diff --git a/paper.tex b/paper.tex
index 0c3a61d..24c401e 100644
--- a/paper.tex
+++ b/paper.tex
@@ -191,6 +191,8 @@ Finally in Section \ref{sec:discussion} the future prospects of using systems li
\item \citet{menke20} on the ``Rigor and Transparency Index'', in particular showing how practices have improved but not enough.
Also, how software identifiability has seen the best improvement.
\item \citet{dicosmo19} summarize the special place of software in modern science very nicely: ``Software is a hybrid object in the world research as it is equally a driving force (as a tool), a result (as proof of the existence of a solution) and an object of study (as an artefact)''.
+\item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials}
+\item Nice paper about software citation: \url{https://doi.org/10.1109/MCSE.2019.2963148}.
\end{itemize}
@@ -1068,13 +1070,15 @@ Just note that the full URL was too large to show in this demonstration\footnote{
\begin{figure}[t]
\input{tex/src/figure-download.tex}
- \caption{\label{fig:download} Simplified Make rule, showing how the downloaded data URL is written into this paper.
- In Make, the \emph{target} is placed before a colon (\inlinecode{:}) and its \emph{prerequisite(s)} are placed after the colon.
- The executable recipe lines (commands to build the target from the prerequisite), start with a \inlinecode{TAB} (shown here with a light gray \inlinecode{\_\_\_TAB\_\_\_}).
- The command names are shown in dark green.
- Comment lines (ignored by Make) are shown in light red and start with a \inlinecode{\#}.
+ \vspace{-3mm}
+ \caption{\label{fig:download} Simplified Make rule, showing how the downloaded data URL is written into this paper (Footnote \ref{footnote:dataurl}).
+ In Make, lines starting with a \inlinecode{\#} are ignored (and are thus used for comments, like the first line here).
+ The \emph{target} is placed before a colon (\inlinecode{:}) and its \emph{prerequisite(s)} come after the colon (here, both can be seen in the second line).
+ The executable \emph{recipe} lines (shell commands to build the target from the prerequisite) start with a \inlinecode{TAB} (shown here with a light gray \inlinecode{\_\_\_TAB\_\_\_}).
+ A Make recipe can be viewed as a containerized shell script.
+ In the recipe, \inlinecode{\$@} is an \emph{automatic variable}, expanding to the target file's name.
The \inlinecode{MK20URL} variable is defined in \inlinecode{INPUTS.conf} and directly used to download the input dataset.
- The same URL is then passed to this paper through the definition of the \inlinecode{\textbackslash{}menketwentyurl} \LaTeX{} variable that is written in \inlinecode{\$(mtexdir)/download.tex} (the main target, shown as \inlinecode{\$@} in the recipe).
+ The same URL is then passed to this paper through the definition of the \inlinecode{\textbackslash{}menketwentyurl} \LaTeX{} variable that is written in \inlinecode{\$(mtexdir)/download.tex}.
Later, when the paper's PDF is being built, this \inlinecode{.tex} file is loaded into it.
\inlinecode{mtexdir} is the directory hosting all the \LaTeX{} macro files for various stages of the analysis, see Section \ref{sec:valuesintext}.
}
@@ -1106,13 +1110,13 @@ In the data lineage of Figure \ref{fig:datalineage}, the arrow from \inlinecode{
\subsubsection{The analysis}
\label{sec:analysis}
-The analysis subMakefile(s) are loaded after the initialization and download steps (see Sections \ref{sec:download} and \ref{sec:initialize}).
-However, the analysis phase involves much more complexity for all the various research operations done on the raw inputs until the generation of the final plots, data or report.
+The analysis subMakefile(s) are loaded into \inlinecode{top-make.mk} after the initialization and download steps (see Sections \ref{sec:download} and \ref{sec:initialize}).
+However, the analysis phase involves much more complexity.
If done without modularity in mind from the start, research project sources can become very long, thus becoming hard to modify, debug, improve or read.
Maneage is therefore designed to encourage and facilitate splitting the analysis into multiple/modular subMakefiles.
For example in the data lineage graph of Figure \ref{fig:datalineage}, the analysis is broken into three subMakefiles: \inlinecode{analysis-1.mk}, \inlinecode{analysis-2.mk} and \inlinecode{analysis-3.mk}.
-Theoretical discussion of this phase can be hard to follow, we will thus describe the contents of \inlinecode{analysis1\-.mk} in a demo project on data from \citet{menke20}, see Figure \ref{fig:mk20tab3}.
+Theoretical discussion of this phase can be hard to follow; we will thus describe the contents of \inlinecode{analysis1\-.mk} (Figure \ref{fig:mk20tab3}) in a demo project on data from \citet{menke20}.
In Section \ref{sec:download}, the process of importing this dataset into the project was described.
The first issue is that \inlinecode{menke20.xlsx} must be converted to a simple plain-text table which is generically usable by simple tools (see principle of minimal complexity in Section \ref{principle:complexity}).
For more on the problems with Microsoft Office and this format, see Section \ref{sec:lowlevelanalysis}.
@@ -1120,26 +1124,34 @@ In \inlinecode{analysis1.mk} (Figure \ref{fig:mk20tab3}), we thus convert it to
\begin{figure}[t]
\input{tex/src/figure-mk20tab3.tex}
+ \vspace{-3mm}
\caption{\label{fig:mk20tab3}Simplified contents of \inlinecode{analysis1.mk}.
For the position of this subMakefile in the full project's data lineage, see Figure \ref{fig:datalineage}.
In particular, here the arrows of that figure from \inlinecode{menke20.xlsx} to \inlinecode{menke20-table-3.txt} and from the latter to \inlinecode{analysis1.tex} are shown as the second and third Make rules.
- See Figure \ref{fig:download} and Appendix \ref{appendix:make} for more on the Make notation.
- The general job here is to convert the raw download into a usable table for our analysis (a simple, space-separated, fixed-column-width plain-text table) along with the directory hosting it (\inlinecode{a1dir}), followed by a small measurement on it.
- Note that \inlinecode{a1dir} is a prerequisite of \inlinecode{mk20tab3}, so the former is always built \emph{before} the latter \emph{and} when it doesn't already exist on the running system.}
+ See Figure \ref{fig:download} and Appendix \ref{appendix:make} for more on the Make notation and Section \ref{sec:analysis} for a description of the steps.
+ }
\end{figure}
-As shown in Figure \ref{fig:mk20tab3}, the lowest-level operation is to define a directory to keep the generated files.
+As shown in Figure \ref{fig:mk20tab3}, the first operation (or Make \emph{rule}) is to define a directory to keep the generated files.
To keep the source and build-directories separate, we thus define \inlinecode{a1dir} under the build-directory (\inlinecode{BDIR}, see Section \ref{sec:localdirs}).
We'll then define all outputs/targets to be under this directory.
-The second rule (which depends on the directory), then converts the Microsoft Excel spreadsheet file to a simple plain-text format using the XLSX I/O program.
-But XLSX I/O only converts to CSV and we don't need many columns here, so we further shorten the table using the AWK program.
-As described in Section \ref{sec:valuesintext}, the last rule of a subMakefile should be a \LaTeX{} macro file.
-This is natural, for example we need to report how many journal papers were studied in \citet{menke20}.
-So in the third rule, we count the number of papers they studied (summing over the second column).
-The final sum is written into the \inlinecode{\textbackslash{}menkenumpapers} \LaTeX{} macro, which expands to $\menkenumpapers$ when this PDF is built.
-Figure \ref{fig:mk20tab3} thus quantifies two of the arrows in the data lineage of Figure \ref{fig:datalineage}: the arrow from \inlinecode{menke20.xlsx} to \inlinecode{menke20-table-3.txt} and the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{analysis1.tex}.
+The second rule (which depends on the directory as a prerequisite) then converts the Microsoft Excel spreadsheet file to a simple plain-text format using the XLSX I/O program.
+But XLSX I/O only converts to CSV and we don't need all the columns here, so we further shorten and modify the table (re-order columns and multiply them) using the AWK program (which is available on any Unix-like operating system).
+In Figure \ref{fig:datalineage} on the example data lineage, this second rule is shown with the arrow from \inlinecode{menke20.xlsx} to \inlinecode{menke20-table-3.txt}.
+
+Finally, as described in Section \ref{sec:valuesintext}, the last rule of a subMakefile should be a \LaTeX{} macro file (in Figure \ref{fig:mk20tab3}, this is the third rule).
+Ending each analysis phase with a \LaTeX{} macro is natural in many reports.
+For example, here, once the dataset is ready, we want to give the reader a general view of the dataset size.
+We thus need to report the number of subjects (papers/journals) studied in \citet{menke20}.
+Therefore in the \LaTeX{} macro rule, we count them from the simplified table of the second rule.
+For both the papers and the journals, we write the count into a temporary shell variable \inlinecode{v}, then write the value of \inlinecode{v} into the \inlinecode{\textbackslash{}menkenumpapers} and \inlinecode{\textbackslash{}menkenumjournals} \LaTeX{} macros respectively.
+In the built PDF paper, they expand to $\menkenumpapers$ (number of papers studied) and $\menkenumjournals$ (number of journals studied) respectively.
+This rule is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{analysis1.tex}.
-studied $\menkenumpapers$ papers in $\menkenumjournals$ journals.
+Figure \ref{fig:mk20tab3} also shows one major advantage of Maneage through two portability problems: 1) the XLSX I/O software may not be present on many systems, and 2) the \inlinecode{FPAT} feature is only present in GNU AWK, not in all implementations of AWK.
+Therefore, while this Makefile can work when run alone, on many systems it won't complete successfully because of these major portability problems.
+However, because Maneage installs its own software, these problems don't exist: specific versions of XLSX I/O and GNU AWK are installed within the project.
+Such portability problems are much more pronounced and relevant in higher-level science software.
@@ -1255,7 +1267,7 @@ Below we'll review some of the most common container solutions: Docker and Singu
It is primarily driven by the need of software developers: they need to be able to reproduce a bug on the ``cloud'' (which is just a remote VM), where they have root access.
A Docker container is composed of independent Docker ``images'' that are built with Dockerfiles.
It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build).
- To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the kernel and C library.
+ To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the C library.
Another important drawback of Docker for scientific applications is that it runs as a daemon (a program that is always running in the background) with root permissions.
This is a major security flaw that discourages many high performance computing (HPC) facilities from installing it.
diff --git a/tex/src/figure-mk20tab3.tex b/tex/src/figure-mk20tab3.tex
index ebeac0e..96468bb 100644
--- a/tex/src/figure-mk20tab3.tex
+++ b/tex/src/figure-mk20tab3.tex
@@ -1,6 +1,6 @@
\begin{tcolorbox}
\footnotesize
- \texttt{\mkcomment{Define and build the directory hosting the final table.}}
+ \texttt{\mkcomment{1ST MAKE RULE: build the directory hosting the used table.}}
\texttt{\mkvar{a1dir} = \$(\mkvar{BDIR})/analysis-1}
@@ -8,8 +8,8 @@
\texttt{\mktab{}\mkprog{mkdir} \$@}
- \vspace{1.5em}
- \texttt{\mkcomment{Define and build the main target.}}
+ \vspace{2em}
+ \texttt{\mkcomment{2ND MAKE RULE: Convert the XLSX table to a simple plain-text table.}}
\texttt{\mkvar{mk20tab3} = \$(\mkvar{a1dir})/menke20-table-3.txt}
@@ -37,8 +37,8 @@
\texttt{\mktab{}\mkprog{rm} table-3.csv}
- \vspace{1.5em}
- \texttt{\mkcomment{Main LaTeX macro file}}
+ \vspace{2em}
+ \texttt{\mkcomment{3RD MAKE RULE: Main LaTeX macro file for reported values.}}
\texttt{\mktarget{\$(mtexdir)/analysis1.tex}: \$(\mkvar{mk20tab3)}}
@@ -47,4 +47,13 @@
\texttt{\mktab{}v=\$\$(\mkprog{awk} '\!/\^{}\#/\{c+=\$\$2\} END\{print c\}' \$(\mkvar{mk20tab3)})}
\texttt{\mktab{}\mkprog{echo} "\textbackslash{}newcommand\{\textbackslash{}menkenumpapers\}\{\$\$v\}" > \$@}
+
+ \vspace{0.5em}
+ \texttt{\recipecomment{Count total number of journals in that study.}}
+
+ \texttt{\mktab{}v=\$\$(awk 'BEGIN\{FIELDWIDTHS="31 10000"\} !/\^{}\#/\{print \$\$2\}' \$(mk20tab3) \textbackslash}
+
+ \texttt{\mktab{ }{ }{ }{ }{ }{ }{ }{ }{ }{ }| uniq | wc -l)}
+
+ \texttt{\mktab{}\mkprog{echo} "\textbackslash{}newcommand\{\textbackslash{}menkenumjournals\}\{\$\$v\}" >> \$@}
\end{tcolorbox}
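
Reviewer sketch (not part of the patch): the third rule's recipe, as typeset above, can be tried outside of Make with a few lines of plain shell. This is only a minimal illustration under stated assumptions. The three-row sample table is hypothetical (the real layout of menke20-table-3.txt is not shown in this diff), the GNU AWK FIELDWIDTHS call is swapped for the POSIX `cut -c32-` so that it also runs where GNU AWK is absent, and a `sort` is added before `uniq` so that the count does not depend on the row order.

```shell
#!/bin/sh
# Hypothetical sample table: numeric columns in the first 31 characters,
# journal name afterwards (assumed layout, for illustration only).
table=$(mktemp)
macros=$(mktemp)
{
  printf '# Hypothetical columns: year, papers (31 characters wide), then journal name\n'
  printf '%-31s%s\n' '2016 10' 'Journal A'
  printf '%-31s%s\n' '2017 20' 'Journal A'
  printf '%-31s%s\n' '2016 5'  'Journal B'
} > "$table"

# Sum the second whitespace-separated column, skipping comment lines
# (same AWK program as in the recipe, with Make's $$ written as $).
numpapers=$(awk '!/^#/{c+=$2} END{print c}' "$table")

# Count distinct journal names: everything from character 32 onwards.
# 'cut -c32-' stands in for FIELDWIDTHS="31 10000" of GNU AWK.
numjournals=$(grep -v '^#' "$table" | cut -c32- | sort | uniq | wc -l | tr -d ' ')

# Write both values as LaTeX macros, as the recipe does with echo.
printf '\\newcommand{\\menkenumpapers}{%s}\n' "$numpapers" > "$macros"
printf '\\newcommand{\\menkenumjournals}{%s}\n' "$numjournals" >> "$macros"
cat "$macros"

rm -f "$table" "$macros"
```

With the sample rows above, the two macros are written with the values 35 and 2. Inside Maneage itself the substitution of `cut` should not be needed, since the paper states that a specific version of GNU AWK is installed within the project.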