author     Mohammad Akhlaghi <mohammad@akhlaghi.org>   2020-02-07 21:11:40 +0100
committer  Mohammad Akhlaghi <mohammad@akhlaghi.org>   2020-02-07 21:11:40 +0100
commit     4b42f6fb445b3b6f103154cfc688c539b1854434
tree       93c9686fc0bf333ed30fdfd7b2967a167915d413
parent     79ec93bc2c6f9868e20ba081489f5e74ced58d9e
Edited parts of the text
While reading over the already written parts (and hopefully completing the
paper), they were edited/corrected to be clearer.
-rw-r--r--  paper.tex                             262
-rw-r--r--  tex/src/figure-file-architecture.tex   28
-rw-r--r--  tex/src/preamble-style.tex              6
-rw-r--r--  tex/src/references.tex                  8
4 files changed, 177 insertions, 127 deletions
@@ -498,7 +498,8 @@ A system that uses this principle will also provide ``temporal provenance'', qua
 \subsection{Principle: Free and open source software}
 \label{principle:freesoftware}
 Technically, as defined in Section \ref{definition:reproduction}, reproducibility is also possible with non-free and non-open-source software (a black box).
-However, as shown below, software freedom as an important pillar for the sciences.
+This principle is thus necessary to complement the definition of reproducibility.
+This is because software freedom is an important pillar for the sciences, as shown below:
 \begin{itemize}
 \item Based on the completeness principle (Section \ref{principle:complete}), it is possible to trace the output's provenance back to the exact source code lines within an analysis software.
 If the software's source code isn't available, such important and useful provenance information is lost.
@@ -534,20 +535,21 @@ However, as shown below, software freedom as an important pillar for the science
 The proposed solution is an implementation of the principles discussed in Section \ref{sec:principles}: it is complete and automatic (Section \ref{principle:complete}), modular (Section \ref{principle:modularity}), fully in plain text (Section \ref{principle:text}), having minimal complexity (see Section \ref{principle:complexity}), with automatically verifiable inputs \& outputs (Section \ref{principle:verify}), preserving temporal provenance, or project evolution (Section \ref{principle:history}) and finally, it is free software (Section \ref{principle:freesoftware}).
-In practice it is a collection of plain-text files, that are distributed in sub-directories by context, and are all under version-control (currently with Git).
-In its raw form (before customizing for different projects), it is just a skeleton of a project without much flesh: containing all the low-level infrastructure, but without any real analysis.
-To start a new project, users will clone the core template, create their own Git branch, and start customizing it by adding their high-level analysis steps, figure creation sources and narrative.
+In practice it is a collection of plain-text files that are distributed in pre-defined sub-directories by context, and are all under version control (currently with Git).
+In its raw form (before customizing for different projects), it is a fully working skeleton of a project without much flesh: containing all the low-level infrastructure, with just a small demonstrative ``delete-me'' analysis.
+To start a new project, users will \emph{clone}\footnote{In Git, cloning is the process of copying all the project's files and their history onto the host system.} the core skeleton, create their own Git branch, and start customizing the core files (adding their high-level analysis steps, scripts to generate figures, and narrative) within their custom branch.
 Because of this, we also refer to the proposed system as a ``template''.
-In this section we'll review its current implementation.
-Section \ref{sec:usingmake} describes the reasons behind using Make for job orchestration.
-It is followed with a general outline of the project's file structure in Section \ref{sec:generalimplementation}.
-As described there, we make a cosmetic distinction between ``configuration'' (or building of necessary software) and execution (or running the software on the data), these two phases are discussed in Sections \ref{sec:projectconfigure} \& \ref{sec:projectmake}.
+Before going into the details, it is important to note that, as with any software, the template's core architecture will inevitably evolve after the publication of this paper.
+We already have roughly 30 tasks that are left for the future and will affect various high-level phases of the project as described here.
+However, the core of the system has already been used and has become stable enough that we don't foresee any major change in the core methodology in the near future.
+A list of the notable changes after the publication of this paper will be kept in the project's \inlinecode{README-hacking.md} file.
+Once the improvements become substantial, new paper(s) will be written to complement or replace this one.
-Finally, it is important to note that as with any software, the low-level implementation of this solution will inevitably evolve after the publication of this paper.
-We already have roughly 30 tasks that are left for the future and will affect various phases of the project as described here.
-However, we need to focus on more important issues at this stage and can't implement them before this paper's publication.
-Therefore, a list of the notable changes after the publication of this paper will be kept in in the project's \inlinecode{README-hacking.md} file, and once it becomes substantial, new papers will be written.
+In this section we will review the current implementation of the reproducible paper template.
+Generally, job orchestration is implemented in Make (a POSIX tool); the reasons for this choice are elaborated in Section \ref{sec:usingmake}.
+We continue with a general outline of the project's file structure in Section \ref{sec:generalimplementation}.
+As described there, we make a cosmetic distinction between ``configuration'' (or building of necessary software) and execution (or running the software on the data); these two phases are discussed in Sections \ref{sec:projectconfigure} \& \ref{sec:projectmake}.

 \subsection{Job orchestration with Make}
@@ -555,24 +557,24 @@ Therefore, a list of the notable changes after the publication of this paper wil
 When non-interactive, or batch, processing is needed (see Section \ref{principle:complete}), shell scripts are usually the first solution that comes to mind (see Appendix \ref{appendix:scripts}).
 However, the inherent complexity and non-linearity of progress in a scientific project (where experimentation is key) makes it hard and inefficient to manage the script(s) as the project evolves.
 For example, a script will start from the top/start every time it is run.
-Therefore, if $90\%$ of a research project is done and only the newly added, final $10\%$ must be executed, its necessary to run whole script from the start.
+Therefore, even if $90\%$ of a research project is done and only the newly added, final $10\%$ must be executed, a script will always execute from the beginning.
 It is possible to manually ignore (by conditionals), or comment, parts of a script to only do a special part.
 However, such conditionals/comments will only add to the complexity and will discourage experimentation on an already completed part of the project.
-This is also prone to very serious bugs in the end, when trying to reproduce from scratch.
+This is also prone to very serious bugs in the end (e.g., due to human error, some parts may be left out or not up to date) when re-running from scratch.
 Such bugs are very hard to notice during the work and frustrating to find in the end.
-Similar problems motivated the creation of Make in the early Unix operating system (see Appendix \ref{appendix:make}).
+These problems motivated the creation of Make in the early Unix operating system \citep{feldman79}.
 In the Make paradigm, process execution starts from the end: the final \emph{target}.
-Through its syntax, the user specifies the \emph{prerequisite}(s) of each target and a \emph{recipe} (a small shell script) to create the target from the prerequisites.
-Make is thus able to build a dependency tree internally and find where it should start executing the recipes, each time the project is run.
+Through the Make syntax, the user specifies the \emph{prerequisite}(s) of each target and a \emph{recipe} (a small shell script) to create the target from the prerequisites (for more, see Appendix \ref{appendix:make}).
+With this lineage, Make is able to build a dependency tree internally and find the rule(s) that need to be executed on each run.
 This has many advantages:
 \begin{itemize}
 \item \textbf{\small Only executing necessary steps:} in the scenario above, a researcher who has just added the final $10\%$ of her research will only have to run those extra steps, without any modification to the previous parts.
 With Make, it is also trivial to change the processing of any intermediate (already written) \emph{rule} (or step) in the middle of an already written analysis: the next time Make is run, only rules that are affected by the changes/additions will be re-run, not the whole analysis/project.
 Most importantly, this enables full reproducibility from scratch with no changes in the project code that was working during the research.
-This will allow robust results and let the scientists get to what they do best: experiment and be critical to the methods/analysis without having to waste energy on the added complexity of experimentation in scripts.
+This will allow robust results and let scientists do what they do best: experiment, and be critical of the methods/analysis, without having to waste energy on the added complexity of experimentation in scripts.
 \item \textbf{\small Parallel processing:} Since the dependencies are clearly demarcated in Make, it can identify independent steps and run them in parallel.
 This greatly speeds up the processing, with no cost in terms of complexity.
@@ -584,11 +586,11 @@ This will allow robust results and let the scientists get to what they do best:
 Make has been a fixed component of POSIX (or Unix-like operating systems including Unix, GNU/Linux, BSD, and macOS, among others) from the very early days of Unix, almost 40 years ago.
 It is therefore, by far, the most reliable, commonly used, well-known and well-maintained workflow manager today.
 Because the core operating system components are built with it, Make is expected to keep this unique position into the foreseeable future.
- Make is also well known by many outside of the software developing communities.
 For example, \citet{schwab2000} report how geophysics students easily adopted it for the RED project management tool used in their lab at that time (see Appendix \ref{appendix:red} for more on RED).
 Because of its simplicity, we have also had very good feedback on using Make from the early adopters of this system during the last year, in particular graduate students and postdocs.
-In summary Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project managment, including the creation of generational gap between researchers and the ``dependency hell'', see Appendix \ref{appendix:highlevelinworkflow}.
+
+In summary, Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project management, such as a generational gap between researchers and ``dependency hell''; see Appendix \ref{appendix:highlevelinworkflow}.
 For more on Make and a discussion on some other job orchestration tools, see Appendices \ref{appendix:make} and \ref{appendix:jobmanagement} respectively.
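To make the target/prerequisite/recipe terminology concrete, consider the minimal sketch below (a hypothetical two-rule example; the file names are illustrative and not part of the template):

\begin{lstlisting}[language=make]
# Hypothetical example, for illustration only.
# Final target: rebuilt only when one of its prerequisites changes.
paper.pdf: paper.tex result.txt
	pdflatex paper.tex

# Intermediate target: re-made only when input.txt changes.
# Recipe lines must start with a TAB; '$' is doubled so Make
# passes a literal '$1' to the shell/AWK.
result.txt: input.txt
	awk '{s+=$$1} END {print s}' input.txt > result.txt
\end{lstlisting}

Running Make a second time does nothing, because no prerequisite is newer than its target. Editing \inlinecode{input.txt} re-runs both recipes in order, while editing only \inlinecode{paper.tex} re-runs just the first recipe.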
-In summary Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project managment, including the creation of generational gap between researchers and the ``dependency hell'', see Appendix \ref{appendix:highlevelinworkflow}. + +In summary Make satisfies all our principles (see Section \ref{sec:principles}), while avoiding the well-known problems of using high-level languages for project managment like a generational gap and ``dependency hell'', see Appendix \ref{appendix:highlevelinworkflow}. For more on Make and a discussion on some other job orchestration tools, see Appendices \ref{appendix:make} and \ref{appendix:jobmanagement} respectively. @@ -623,13 +625,12 @@ The latter is necessary for many web-based automatic paper generating systems li Symbolic links and their contents are not considered part of the source and are not under version control. Files and directories are shown within their parent directory. For example the full address of \inlinecode{analysis-1.mk} from the top project directory is \inlinecode{reproduce/analysis/make/analysis-1.mk}. - \tonote{Add the `.git' directory also.} } \end{figure} -\inlinecode{project} is a simple executable POSIX-compliant shell script, that is just a high-level wrapper script to call the project Makefiles. -Recall that the main job orchestrator in this system is Make, see Section \ref{sec:usingmake} for a discussion on the benefits of Make. -In the current implementation, the project's execution consists of the following two calls to \inlinecode{project}: +\inlinecode{project} is a simple executable POSIX-compliant shell script, that is just a high-level wrapper script to call the project's Makefiles. +Recall that the main job orchestrator in this system is Make, see Section \ref{sec:usingmake} for why Make was chosen. +In the current implementation, the project's execution consists of the following two calls to the \inlinecode{project} script: \begin{lstlisting}[language=bash] ./project configure # Build software from source (takes around 2 hours for full build). @@ -639,12 +640,9 @@ In the current implementation, the project's execution consists of the following The operations of both are managed by files under the top-level \inlinecode{reproduce/} directory. When the first command is called, the contents of \inlinecode{reproduce\-/software} are used, and the latter calls files uner \inlinecode{reproduce\-/analysis}. This highlights the \emph{cosmetic} distinction we have adopted between the two main steps of a project: 1) building the project's full software environment and 2) doing the analysis (running the software). - -Technically there is no difference between the two and they could easily be merged because both steps manage their operations through Makefiles. -However, in a project, the analysis will involve many files, e.g., Python, R or shell scripts, C or Fortran program sources, or Makefiles. -On the other hand, software building and project configuration also includes many files. -Mixing all these two conceptually different (for humans!) set of files under one directory can cause confusion for the people building or running the project. -During a project, researchers will mostly be working on their analysis, they will rarely want to modify the software building steps. +Technically there is no difference between the two and they could easily be merged under one directory. 
+However, during a research project, researchers commonly just need to focus on their analysis steps and will rarely need to edit the software environment settings (maybe only once at the start of the project).
+Therefore, having the two sets of files mixed under the same directory can be confusing.
 In summary, the same structure governs both aspects of a project: software building and analysis.
 This is an important and unique feature in this template.
@@ -657,30 +655,32 @@ Most other systems use third-party package managers for their project's software
 \subsection{Project configuration}
 \label{sec:projectconfigure}
-
 A critical component of any project is the set of software used to do the analysis.
-However, verifying an already built software environment, which is critical to reproducing the research result, is a very hard.
-This has forced most projects resort to moving around the whole \emph{built} software environment (a black box) as virtual machines or containers, see Appendix \ref{appendix:independentenvironment}.
-But since these black boxes are almost impossible to reproduce themselves, they need to be archived, even though they can take gigabytes of space.
+However, verifying an already built software environment (which is critical to reproducing the research result) is very hard.
+This has forced most projects to move around the whole \emph{built} software environment (a black box) as virtual machines or containers; see Appendix \ref{appendix:independentenvironment}.
+Because these black boxes are almost impossible to reproduce themselves, they need to be archived, even though they can take gigabytes of space.
 Package managers like Nix or GNU Guix do provide a verifiable, i.e., reproducible, software building environment, but because they aim to be generic package managers, they have their own limitations on a project-specific level, see Appendix \ref{appendix:nixguix}.
-Based on the principles of completeness and minimal complexity (Sections \ref{principle:complete} \& \ref{principle:complexity}), a project that uses this solution, also contains the full instructions to build its necessary software.
-As described in Section \ref{sec:generalimplementation}, this is managed by the files under \inlinecode{reproduce/software}.
-Project configuration involves three high-level steps which are discussed in the subsections below: setting the local directories (Section \ref{sec:localdirs}), checking a working C compiler (Secti
-on \ref{sec:ccompiler}), and the software source code download, build and install (Section \ref{sec:buildsoftware}).
+Based on the principles of completeness and minimal complexity (Sections \ref{principle:complete} \& \ref{principle:complexity}), a project that uses this solution also contains the full instructions to build its necessary software, in the same language that the analysis is orchestrated in: Make.
+Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/software}.
+Project configuration involves three high-level steps which are discussed in the subsections below: setting the local directories (Section \ref{sec:localdirs}), checking for a working C compiler (Section \ref{sec:ccompiler}), and the software source code download, build and install (Section \ref{sec:buildsoftware}).
+
+
+
+
 \subsubsection{Setting local directories}
 \label{sec:localdirs}
-The ``build directory'' (\inlinecode{BDIR}) is a directory on the host filesystem.
-All files built by the project will be under it, and no other location on the running operating system will be affected.
-Following the modularity principle (Section \ref{principle:modularity}), this directory should be separate from the source directory and the project will now allow specifying a build directory anywhere under its top source directory.
+All files built by the project (software or analysis) will be under a ``build directory'' (or \inlinecode{BDIR}) on the host filesystem.
+No other location on the running operating system will be affected by the project.
+Following the modularity principle (Section \ref{principle:modularity}), this directory should be separate from the source directory.
 Therefore, at configuration time, the first thing to specify is the build directory on the running system.
 The build directory can be specified in two ways: 1) on the command line with the \inlinecode{--build-dir} option, or 2) manually giving the directory after running the configuration: it will stop with a prompt and some explanation.
-Two other local directories can optionally be used by the project for its inputs (Section \ref{definition:input}): 1) software tarball directory and 2) input data directory.
-The contents of these directories just need read permissions by the user running the project.
-If given, nothing will be written inside of them: the project will only look into them for the necessary software tarballs and input data.
-If they are not found, the project will attempt to download them from the provided URLs/PIDs within the project source.
+Two other local directories can optionally be specified when inputs are present locally (for the definition of inputs, see Section \ref{definition:input}) and don't need to be downloaded: 1) a software tarball directory and 2) an input data directory.
+The project just needs read permissions on these directories: when given, nothing will be written inside of them.
+The project will only look into them for the necessary software tarballs and input data.
+If they are not found, the project will attempt to download any necessary file from the recorded URLs/PIDs within the project source.
 These directories are therefore primarily tailored to scenarios where the project must run offline (based on the completeness principle of Section \ref{principle:complete}).

 After project configuration, a symbolic link is built in the top project source directory that points to the build directory.
@@ -688,14 +688,18 @@ The symbolic link is a hidden file named \inlinecode{.build}, see Figure \ref{fi
 With this symbolic link, it is always very easy to access the built files, no matter where the build directory is actually located on the filesystem.
+
+
+
 \subsubsection{Checking for a C compiler}
 \label{sec:ccompiler}
 This template builds all its necessary software internally to avoid dependency issues with various software versions on different hosts.
-A working C compiler is thus a necessary prerequisite and the configure script will abort if a working C compiler isn't found.
-In particular, on GNU/Linux systems, the project builds its own version of the GNU Compiler Collection (GCC), therefore a static C library is necessary.
+A working C compiler is thus mandatory and the configure script will abort if one isn't found.
+In particular, on GNU/Linux systems, the project builds its own version of the GNU Compiler Collection (GCC); therefore a static C library is necessary with the compiler.
+If not found, an informative error message will be printed and the project will abort.
 The custom version of GCC is configured to also build Fortran, C++, Objective-C and Objective-C++ compilers.
-Python and R running environments are themselves written in C, therefore they are also automatically built afterwards when necessary.
+The Python and R running environments are themselves written in C; therefore they are also automatically built afterwards if the project uses these languages.
 On macOS systems, we currently don't build a C compiler, but it is planned to do so in the future.
@@ -706,72 +710,81 @@ On macOS systems, we currently don't build a C compiler, but it is planned to do
 \label{sec:buildsoftware}
 All necessary software for the project, and their dependencies, are installed from source.
-Researchers using the template only have to specify the most high-level analysis software they need in \inlinecode{reproduce\-/software\-/config\-/installation\-/TARGETS.conf}.
-Based on the completeness principle (Section \ref{principle:complete}), the dependency tree is automatically traced down to the GNU C Library and GNU Compiler Collection on GNU/Linux systems.
+Researchers using the template only have to specify the most high-level analysis software they need in \inlinecode{reproduce\-/software\-/config\-/installation\-/TARGETS.conf} (see Figure \ref{fig:files}).
+Based on the completeness principle (Section \ref{principle:complete}), on GNU/Linux systems the dependency tree is automatically traced down to the GNU C Library and the GNU Compiler Collection (GCC).
 This creates identical high-level analysis software on any system.
-When the C library and compiler can't be installed (for example on macOS systems), the users are forced to rely on the host's C compiler and library, and this may hamper the exact reproducibility of the final result.
-Because the project's main output is a \LaTeX{}-built PDF, the project also contains an internal installation of \TeX{}Live, providing all the necessary tools to build the PDF, indepedent of the host operating system's \LaTeX{} version and packages.
+When the C library and compiler can't be installed (for example on macOS systems), the users are forced to rely on the host's C compiler and library, and this may hamper the exact reproducibility of the final result: the project will abort if the final outputs have changed.
+Because the project's main output is currently a \LaTeX{}-built PDF, the project also contains an internal installation of \TeX{}Live, providing all the necessary tools to build the PDF, independent of the host operating system's \LaTeX{} version and packages.

-To build the software, the project needs access to the software source code.
-If the tarballs are already present on the system, the user can specify the directory at the start of the configuration process (Section \ref{sec:localdirs}), if not, the software tarballs will be downloaded from pre-defined servers.
-Ultimately the origin of the tarballs is irrelevant, what matters is their contents: checked through the SHA-512 checksum \citep[part of the SHA-2 algorithms, see][]{romine15}.
-If the SHA-512 checksum of the tarball is different from the checksum stored for it in the project's source, it will complain and abort.
-Because the server is irrelevant, one planned improvement is thus to allow users to identify the most convenient server themselves, for example to improve download speed.
+To build the software from source, the project needs access to its source tarball or zip-file.
+If the tarballs are already present on the system, the user can specify the respective directory at the start of project configuration (Section \ref{sec:localdirs}).
+If not, the software tarballs will be downloaded from pre-defined servers.
+Ultimately the origin of the tarballs is irrelevant for this project; what matters is the tarball's contents, which are checked through the SHA-512 checksum \citep[part of the SHA-2 algorithms, see][]{romine15}.
+If the SHA-512 checksum of the tarball is different from the checksum stored for it in the project's source, the project will complain and abort.
+Because the server is irrelevant, one planned task\tonote{add task number} is to allow users to identify the most convenient server themselves, for example to improve download speed.

-Software tarball access, building and installation is managed through Makefiles, see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation}.
+Software tarball access, unpacking, building and installation are managed through Makefiles; see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation}.
 The project's software are classified into two classes: 1) basic and 2) high-level.
-The former contains meta-software: software to build other software, for example GNU Gzip, GNU Tar, GNU Make, GNU Bash, GNU Coreutils, GNU SED, Git, GNU Binutils, GNU Compiler Collection (GCC) and etc\footnote{Note that almost all these GNU software are also installable on non-GNU/Linux operating systems like BSD or macOS also, exceptions include GNU Binutils.}.
-The basic software are built with the host operating system's tools and are installed within all projects.
-The high-level software are those that are used in analysis and can differ from project to project.
-However, because the basic software have already been built by the project, the higher-level software are built with them and independent of the host operating system.
+The former contains meta-software: software needed to build other software, for example GNU Gzip, GNU Tar, GNU Make, GNU Bash, GNU Coreutils, GNU SED, GNU Binutils, the GNU Compiler Collection (GCC) and so on\footnote{Almost all of these GNU software are also installable on non-GNU/Linux operating systems like BSD or macOS; exceptions include GNU Binutils.}.
+The basic software are built with the host operating system's tools and are installed with any project.
+The high-level software are those that are used directly in the science analysis and can differ from project to project.
+However, because the basic software have already been built by the project, the higher-level software are built with them, independent of the host operating system's tools.

 Software building is managed by two top-level Makefiles that follow the same classification.
 Both are under the \inlinecode{reproduce\-/software\-/make/} directory (Figure \ref{fig:files}): \inlinecode{basic.mk} and \inlinecode{high-level.mk}.
-Because \inlinecode{basic.mk} can't assume anything about the host, it is written to comply with POSIX Make and POSIX shell, which is very limited compared to GNU Make and GNU Bash.
-However, after its finished, a specific version of GNU Make (among other basic software), is present, enabling us to assume the much advanced features of GNU Make in \inlinecode{high-level.mk}.
+Because \inlinecode{basic.mk} can't assume anything about the host, it is written to comply with POSIX Make and POSIX shell, which are very limited compared to GNU Make and GNU Bash respectively.
+However, after it is finished, a specific version of GNU Make (among other basic software) is present, enabling us to assume the much more advanced features of the GNU tools in \inlinecode{high-level.mk}.

-The project's software are installed under \inlinecode{BDIR/software/installed} (containing subdirectories like \inlinecode{bin/}, \inlinecode{lib/}, \inlinecode{include/} and etc).
-For example the custom-built GNU Make executable is placed under \inlinecode{BDIR\-/software\-/installed\-/bin\-/make}.
-The symbolic link \inlinecode{.local} in the top project source directory points to it (see Figure \ref{fig:files}), so \inlinecode{.local/bin/make} is identical to the long address before.
+The project's software are installed under \inlinecode{BDIR/software/installed}.
+The \inlinecode{.local} symbolic link in the top project source directory points to it for easy access (see Figure \ref{fig:files}).
+It contains the project's top-level POSIX filesystem hierarchy subdirectories, including \inlinecode{bin/}, \inlinecode{lib/} and \inlinecode{include/}, among others.
+For example, the custom-built GNU Make executable is placed under \inlinecode{BDIR\-/software\-/installed\-/bin\-/make}, or alternatively \inlinecode{.local/bin/make}.

-To orchestrate software building with Make, the building of each software has to conclude in a file, which should be used as a target or prerequisite in the Makefiles.
-Initially we tried using the actual software's built files (executable programs, libraries or etc), but managing all the different types of installed files was prone to many bugs and confusions.
-Therefore, once a software is built, a simple plain-text file is created in \inlinecode{.local\-/version-info} and Make uses this file to refer to the software and arrange the order of software execution.
-The contents of this plain-text file are directly imported into the paper, for more see Section \ref{sec:softwarecitation} on software citation.
+To orchestrate software building with Make, the building of each software should be represented as a file.
+In the Makefiles, that file should be used as a \emph{target} in the rule that builds the software, or as a \emph{prerequisite} in the rule(s) of software that depend on it.
+For more on Make, see Appendix \ref{appendix:make}.
+Initially we tried using the software's actual built files (executable programs, libraries, etc).
+However, given the variety of installed files, using them as the software's representatives caused many complexities, confusions and bugs.
+Therefore, in the current system, once a software is built, a simple plain-text file is created under a sub-directory of \inlinecode{.local\-/version-info}.
+The representative files for C/C++ programs or libraries are placed under the \inlinecode{proglib} sub-directory.
+The Python and \TeX{}Live representative files are placed under the \inlinecode{python} and \inlinecode{tex} subdirectories respectively.
+Make uses these files to refer to the software and to arrange the order of the software builds.
+The contents of each plain-text file are the name and possible citation of the software, which are directly imported into the final paper in the end.
+For more on software citation, see Section \ref{sec:softwarecitation}.

 \subsubsection{Software citation}
 \label{sec:softwarecitation}
-The template's Makefiles contain the full instructions to automatically build all the software.
-It thus contains the full list of installed software, their versions and their configuration options.
-However, this information is burried deep into the project's source, a distilled fraction of this information must also be printed in the project's final report, blended into the narrative.
-Furthermore, when a published paper is associated with the used software, it is important to cite that paper, the citations help software authors gain more recognition and grants.
-This is particularly important in the case for research software, where the researcher has invested significant time in building the software, and requires official citation to continue working on it.
-
-One notable example is GNU Parallel \citep{tange18}: everytime it is run, it prints the citation information before its actual outputs.
-This doesn't cause any problem in automatic scripts, but can be annoying when debugging the outputs.
-Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support it directly by paying $10000$ euros.
+Based on the completeness principle (Section \ref{principle:complete}), the project contains the full list of installed software, their versions and their configuration options.
+However, this information is buried deep in the project's source.
+A distilled fraction of this information must also be printed in the project's final report, blended into the narrative.
+Furthermore, when a published paper is associated with the used software, it is important to cite that paper: citations help software authors gain more recognition and grants, encouraging them to further develop it.
+This is particularly important in the case of research software, where the researcher has invested significant time in building the software and requires official citation to justify continued work on it.
+
+One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts.
+This doesn't cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs.
+Users can disable the notice with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros!
 This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''.
 In bug 905674\footnote{Debian bug on the citation notice of GNU Parallel: \url{https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905674}}, the Debian developers argued that because of this extra condition, GNU Parallel should not be considered as free software, and they are using a patch to remove that part of the code for its build under Debian-based operating systems.
 Most other research software don't resort to such drastic measures; however, citation is important for them as well.
-Given the increasing number of software used in scientific research, the only reliable solution to automaticly cite the used software in the final paper.
+Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper.
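As a hypothetical sketch of how such automatic citation can be collected in a Make-based system (the variable and file names below are illustrative, not the template's exact implementation), the last step of a software's build rule can write the representative file that is later merged into the \LaTeX{} macro described next:

\begin{lstlisting}[language=make]
# Illustrative only: after a successful build, record the software's
# name, version and citation command; a later step concatenates all
# such files into one LaTeX macro for the final paper.
$(verdir)/gzip: $(tardir)/gzip-1.10.tar.gz
	tar -xf $(tardir)/gzip-1.10.tar.gz
	cd gzip-1.10 && ./configure --prefix=$(idir) && make && make install
	printf '%s\n' 'GNU Gzip 1.10 \citep{gzip}' > $@
\end{lstlisting}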
 As mentioned above in Section \ref{sec:buildsoftware}, a plain-text file is built automatically at the end of a software's successful build and installation.
-This file contains the name, version and possible citation.
+This file contains the name, version and possible citation of that software.
 At the end of the configuration phase, all these plain-text files are merged into one \LaTeX{} macro that can be imported directly into the final paper or report.
 In this paper, this macro's value is shown in Appendix \ref{appendix:softwareacknowledge}.
 The paragraph produced by this macro won't be too large, but it will greatly help in the recording of the used software environment and will automatically cite the software where necessary.

 In the current version of this template it is assumed that the published report of a project is built by \LaTeX{}.
 Therefore, every software that has an associated paper has a Bib\TeX{} file under the \inlinecode{reproduce\-/software\-/bibtex} directory.
-If the software is built for the project, the Bib\TeX{} entry(s) are copied to the build directory and the command to cite that Bib\TeX{} record is included in the \LaTeX{} macro with the name and version of the software, as shown in Appendix \ref{appendix:softwareacknowledge}.
+When the software is built for the project (possibly as a dependency of another software specified by the user), the Bib\TeX{} entry(s) are copied to the build directory and the command to cite that Bib\TeX{} record is included in the \LaTeX{} macro with the name and version of the software, as shown in Appendix \ref{appendix:softwareacknowledge}.

 For a review of the necessity and basic elements in software citation, see \citet{katz14} and \citet{smith16}.
 There are ongoing projects specifically tailored to software citation, including CodeMeta (\url{https://codemeta.github.io}) and the Citation File Format (CFF: \url{https://citation-file-format.github.io}).
 Both are based on schema.org, but are implemented in JSON-LD and YAML respectively.
 Another robust approach is provided by SoftwareHeritage \citep{dicosmo18}.
-The advantage of the SoftwareHeritage approach is that a published paper isn't necessary and it won't populate a paper's bibliography list.
+A feature of the SoftwareHeritage approach is that a published paper isn't necessary, and it won't populate a research paper's bibliography list.
 However, this also makes it hard to count as academic credit.
 We are considering using these tools, and exporting Bib\TeX{} entries when necessary.
@@ -786,23 +799,22 @@ We are considering using these tools, and export Bib\TeX{} entries when necessar
 Once a project is configured (Section \ref{sec:projectconfigure}), all the necessary software, with precise versions and configurations, are built and ready to use.
 The analysis phase of the project (running the software on the data) is also orchestrated through Makefiles (see Sections \ref{sec:usingmake} \& \ref{sec:generalimplementation} for the benefits of using Make).
-In particular, after running \inlinecode{./project make}, two high-level Makefiles are called in sequence, both are under \inlinecode{reproduce\-/analysis\-/make}, see Figure \ref{fig:files}: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk}.
+In particular, after running \inlinecode{./project make}, two high-level Makefiles are called in sequence, both under \inlinecode{reproduce\-/analysis\-/make}: \inlinecode{top-prepare.mk} and \inlinecode{top-make.mk} (see Figure \ref{fig:files}).
 These two high-level Makefiles don't see any of the host's environment variables\footnote{The host environment is fully ignored before calling the analysis Makefiles, through the \inlinecode{env -i} command (\inlinecode{-i} is short for \inlinecode{--ignore-environment}).
-  Note that the locally built \inlinecode{env} program is used, not the one provided by the host.
-  \inlinecode{env} is installed by GNU Coreutils.}.
+  Note that the project's own \inlinecode{env} program is used, not the one provided by the host OS; \inlinecode{env} is installed by GNU Coreutils.}.
 The project will define its own values for standard environment variables.
 Combined with the fact that all the software were compiled from source for this project at configuration time (Section \ref{sec:buildsoftware}), this completely isolates the analysis from the host operating system, creating an exactly reproducible result on any machine where the project can be configured.
 For example, the project builds its own fixed version of GNU Bash (a shell).
 It also has its own \inlinecode{bashrc} startup script\footnote{The project's Bash startup script is under \inlinecode{reproduce\-/software\-/bash\-/bashrc.sh}, see Figure \ref{fig:files}.}.
-Therefore the \inlinecode{BASH\_ENV} environment variable is set to load this startup script and the \inlinecode{HOME} environment variable is set to the build directory to avoid the penetration of any existing Bash startup file in the user's home directory into the analysis.
+Therefore the \inlinecode{BASH\_ENV} environment variable is set to load this startup script, and the \inlinecode{HOME} environment variable is set to \inlinecode{BDIR}, to avoid the penetration of any existing Bash startup file in the user's home directory into the analysis.

 The former, \inlinecode{top-prepare.mk}, is in charge of optimizing the main job orchestration of \inlinecode{top-make.mk}, or ``preparing'' for it.
 In many situations it may not be necessary at all, but we'll introduce its role with an example.
 Let's assume the raw input data (that the project received from a database) have 5000 rows (potential targets).
 However, based on an initial selection criterion, the project only needs to work on 100 of them.
-If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to fllow the data lineage of all of them.
-If the analysis is complex (has many steps), this can be slow (many of its executions will be redundant), and the researchers have to add checks in many places to ignore those that aren't necessary (which will add to complexity and cause bugs).
+If the full 5000 targets are given to \inlinecode{top-make.mk}, Make will need to create a data lineage for all 5000 targets.
+If the analysis is complex (has many steps), this can be slow (many of its executions will be redundant), and project authors have to add checks in many places to ignore those that aren't necessary (which will add to the project's complexity and cause bugs).
 However, if this basic selection is done before calling \inlinecode{top-make.mk}, only the required 100 targets and their lineage are orchestrated.
 This allows Make to optimally organize the complex set of operations that must be run on each input and the dependencies (possibly in parallel), and also greatly simplifies the coding (no extra checks are necessary there).
 Where necessary, this preparation is done in \inlinecode{top-prepare.mk}.
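As a hypothetical sketch of such a preparation step (the file names and the selection criterion are illustrative, not part of the template), \inlinecode{top-prepare.mk} could reduce the target list with a rule like the following, so that the analysis Makefiles never build the lineage of the unselected inputs:

\begin{lstlisting}[language=make]
# Illustrative preparation rule: of the 5000 raw targets, keep only
# those passing a simple cut (here: second column larger than 10).
# '$' is doubled so Make passes a literal '$2' to AWK.
selected.txt: raw-targets.txt
	awk '$$2 > 10' raw-targets.txt > selected.txt
\end{lstlisting}

\inlinecode{top-make.mk} can then read the list in \inlinecode{selected.txt} to define its targets, so only the selected inputs enter the data lineage.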
@@ -811,18 +823,18 @@ Generally, \inlinecode{top-prepare.mk} will not be necessary in many scenarios a
 Hence, we'll continue with a detailed discussion of \inlinecode{top-make.mk} below and touch upon the differences with \inlinecode{top-prepare.mk} in the end.

 A normal project will usually consist of many analysis steps, including data access (possibly by downloading), and running various steps of the analysis on the data.
-Having everything in one Makefile will create a very large file, which can be hard to maintain, extend/grow, read, reuse, and cite.
-Generally, this is against the modularity principle (Section \ref{principle:modularity} above).
-Therefore the project is designed to encourage modularity and facilitate all these points by distributing the analysis in multiple Makefiles that contain contextually-similar analysis steps or ``rules''.
+Having all the rules in one Makefile will create a very large file, which can be hard to maintain, extend/grow, read, reuse, and cite.
+Generally, this is bad practice and is against the modularity principle (Section \ref{principle:modularity}).
+This solution is thus designed to encourage and facilitate modularity by distributing the analysis over many Makefiles that contain contextually similar (or modular) analysis steps.
 For Make this distribution is just cosmetic: they are all loaded into \inlinecode{top-make.mk} and executed in one instance of Make.
-Within the project's source the lower-level Makefiles are also placed in \inlinecode{reproduce\-/analysis\-/make} (like \inlinecode{top-make.mk}), see Figure \ref{fig:files}.
+Within the project's source, the lower-level Makefiles are also placed in \inlinecode{reproduce\-/analysis\-/make} (like \inlinecode{top-make.mk}), see Figure \ref{fig:files}.
 Therefore by design, \inlinecode{top-make.mk} is very simple: it just defines the ultimate target, and the name and order of the lower-level Makefiles that should be loaded.

 Figure \ref{fig:datalineage} is a general overview of the analysis phase in a hypothetical project using this template.
 As described above and shown in that figure, \inlinecode{top-make.mk} imports the various lower-level Makefiles that are in charge of the different phases of the analysis.
 Each of the lower-level Makefiles builds intermediate targets (files), which are also shown there.
 In the subsections below, the project's analysis is described using this graph.
-We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form the ultimate target in Section \ref{sec:paperpdf}, and tracing the data's lineage up to the inputs.
+We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting from the ultimate target in Section \ref{sec:paperpdf}, and tracing the data's lineage all the way up to the inputs and configuration files.

 \begin{figure}[t]
 \begin{center}
@@ -845,10 +857,10 @@ We'll follow Make's paradigm (see Section \ref{sec:usingmake}): starting form th
 \subsubsection{Ultimate target: the project's paper or report}
 \label{sec:paperpdf}
-The ultimate purpose of a project is to report its result and interpret it in a larger context.
+The ultimate purpose of a project is to report its result and interpret it in a larger context of human knowledge.
 In scientific projects, this is the final published paper.
-The raw result is usually a dataset(s) that is(are) visualized, for example as a plot or figure, and blended into the narrative description.
-In Figure \ref{fig:datalineage} this report is shown as \inlinecode{paper.pdf}.
+The raw result is usually dataset(s) that is (are) visualized, for example as a plot, figure or table, and blended into the narrative description.
+In Figure \ref{fig:datalineage} this final report is shown as \inlinecode{paper.pdf}.
 In the complete directed graph of this figure, \inlinecode{paper.pdf} is the only node that has no outward edge or arrows.

 The source of this report (containing the main narrative and positioning of figures or tables) is \inlinecode{paper.tex}.
@@ -1254,20 +1266,52 @@ While its not impossible, because of the high-level nature of scripts, it is not
 \subsubsection{Make}
 \label{appendix:make}
 Make was originally designed to address the problems mentioned in Appendix \ref{appendix:scripts} for scripts \citep{feldman79}.
-In particular to manage the compilation of programs with many source code files.
-With it, the various source files of a program that haven't been changed, wouldn't be recompiled.
+In particular, this motivation arose from management issues related to program compilation with many source code files.
+With Make, the various source files of a program that haven't been changed won't be recompiled.
 Also, when two source files didn't depend on each other, and both needed to be rebuilt, they could be built in parallel.
-This greatly helped in debugging of software projects, and speeding up test builds, giving Make a core place in software building tools.
+This greatly helped in debugging software projects and in speeding up test builds, giving Make a core place among software building tools ever since.
 The most common implementation of Make, since the early 1990s, is GNU Make \citep[\url{http://www.gnu.org/s/make}]{stallman88}.
+The proposed solution uses Make to organize its workflow, see Section \ref{sec:usingmake}.
+Here, we'll complement that section with more technical details on Make.
+
+Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
+Each stage/step in the analysis is defined through a \emph{rule}.
+Rules define \emph{recipes} to build \emph{targets} from \emph{prerequisites}.
+In POSIX operating systems (Unix-like), everything is a file, even directories and devices.
+Therefore all three components in a rule must be files on the running filesystem.
+Figure \ref{fig:makeexample} demonstrates a hypothetical Makefile with the targets, prerequisites and recipes highlighted.
+
+\begin{figure}[t]
+  {\small
+  \texttt{\mkcomment{\# The ultimate "target" of this Makefile is 'ultimate.txt' (the first target Make finds).}}
+
+  \texttt{\mktarget{ultimate.txt}: \mkprereq{out.txt}\hfill\mkcomment{\# 'ultimate.txt' depends on 'out.txt'.{ }{ }{ }{ }{ }}}
+
+  \texttt{\mktab{}awk '\$1\textless5' \mkprereq{out.txt} \textgreater{ }\mktarget{ultimate.txt}\hfill\mkcomment{\# Only rows with 1st column less than 5.{ }{ }{ }}}
+
+  \vspace{1em}
+  \texttt{\mkcomment{\# But 'out.txt' is created by a Python script, and 'params.conf' keeps its configuration.}}
+
+  \texttt{\mktarget{out.txt}: \mkprereq{run.py} \mkprereq{params.conf}}
+
+  \texttt{\mktab{}python \mkprereq{run.py} --in=\mkprereq{params.conf} --out=\mktarget{out.txt}}
+  }
+
+  \caption{\label{fig:makeexample}An example Makefile that describes how to build \inlinecode{ultimate.txt} with two \emph{rules}.
+    \emph{Targets} (blue) are placed before the colon (\texttt{:}).
+    \emph{Prerequisites} (green) are placed after the colon.
+    The \emph{recipe} to build the targets from the prerequisites is placed after a \texttt{TAB}.
+    The final target is the first one that Make encounters (\inlinecode{ultimate.txt}).
+    It depends on the output of a Python program (\inlinecode{run.py}), which is configured by \inlinecode{params.conf}.
+    Any time \inlinecode{run.py} or \inlinecode{params.conf} are edited/updated, \inlinecode{out.txt} is re-created and thus \inlinecode{ultimate.txt} is also re-created.
+  }
+\end{figure}

-The proposed template uses Make to organize its workflow (as described in more detail above \tonote{add section reference later}).
-We'll thus do a short review here.
-A file containing Make instructions is known as a `Makefile'.
-Make manages `rules', which are composed of three components: `targets', `pre-requisites' and `recipes'.
-All three components must be files on the running system (note that in Unix-like operating systems, everything is a file, even directories and devices).
-To manage operations and decide which operation should be re-done, Make compares the time stamp of the files.
-A rule's `recipe' contains instructions (most commonly shell scripts) to produce the `target' file when any of the `prerequisite' files are newer than the target.
-When all the prerequisites are older than the target, in Make's paradigm, that target doesn't need to be re-built.
+To decide which operation should be re-done when executed, Make compares the time stamps of the targets and prerequisites.
+When any of the prerequisite(s) is newer than a target, the recipe is re-run to re-build the target.
+When all the prerequisites are older than the target, that target doesn't need to be rebuilt.
+The recipe can contain any number of commands; they just need to all start with a \inlinecode{TAB}.
+Going deeper into the syntax of Make is beyond the scope of this paper, but we recommend that interested readers consult the GNU Make manual\footnote{\url{http://www.gnu.org/software/make/manual/make.pdf}}.

 \subsubsection{SCons}
 SCons (\url{https://scons.org}) is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
diff --git a/tex/src/figure-file-architecture.tex b/tex/src/figure-file-architecture.tex
index e08ee52..8fb1a6d 100644
--- a/tex/src/figure-file-architecture.tex
+++ b/tex/src/figure-file-architecture.tex
@@ -134,32 +134,32 @@
   \node [anchor=west, at={(4.67cm,-3.0cm)}] {\scriptsize\sf by \LaTeX).};
   \fi

+  %% .git/
+  \ifdefined\fullfilearchitecture
+    \node [dirbox, at={(0,-3.6cm)}, minimum width=14.2cm, minimum height=7mm,
+           label={[shift={(0,-5mm)}]\texttt{.git/}}, fill=brown!15!white] {};
+    \node [anchor=north, at={(0cm,-3.9cm)}]
+          {\scriptsize\sf Full project temporal provenance (version controlled history) in Git.};
+  \fi
+
   %% .local/
   \ifdefined\fullfilearchitecture
-    \node [dirbox, at={(-3.6cm,-3.6cm)}, minimum width=7cm, minimum height=1.2cm,
+    \node [dirbox, at={(-3.6cm,-4.5cm)}, minimum width=7cm, minimum height=1.2cm,
           label={[shift={(0,-5mm)}]\texttt{.local/}}, dashed, fill=brown!15!white] {};
-    \node [anchor=west, at={(-7.1cm,-4.3cm)}]
+    \node [anchor=west, at={(-7.1cm,-5.2cm)}]
          {\scriptsize\sf Symbolic link to project's software environment, e.g., };
-    \node [anchor=west, at={(-7.1cm,-4.6cm)}]
+    \node [anchor=west, at={(-7.1cm,-5.5cm)}]
          {\scriptsize\sf Python or R, run `\texttt{.local/bin/python}' or `\texttt{.local/bin/R}'};
   \fi

   %% .build/
   \ifdefined\fullfilearchitecture
-    \node [dirbox, at={(3.6cm,-3.6cm)}, minimum width=7cm, minimum height=1.2cm,
+    \node [dirbox, at={(3.6cm,-4.5cm)}, minimum width=7cm, minimum height=1.2cm,
           label={[shift={(0,-5mm)}]\texttt{.build/}}, dashed, fill=brown!15!white] {};
-    \node [anchor=west, at={(0.1cm,-4.3cm)}]
+    \node [anchor=west, at={(0.1cm,-5.2cm)}]
          {\scriptsize\sf Symbolic link to project's top-level build directory.};
-    \node [anchor=west, at={(0.1cm,-4.6cm)}]
+    \node [anchor=west, at={(0.1cm,-5.5cm)}]
          {\scriptsize\sf Enabling easy access to all of project's built components.};
   \fi

-  %% .git/
-  \ifdefined\fullfilearchitecture
-    \node [dirbox, at={(0,-5cm)}, minimum width=14.2cm, minimum height=7mm,
-           label={[shift={(0,-5mm)}]\texttt{.git/}}, dashed, fill=brown!15!white] {};
-    \node [anchor=north, at={(0cm,-5.3cm)}]
-          {\scriptsize\sf Full project temporal provenance (version controlled history) in Git.};
-  \fi
-
 \end{tikzpicture}
diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex
index c3aeca2..9556f54 100644
--- a/tex/src/preamble-style.tex
+++ b/tex/src/preamble-style.tex
@@ -137,3 +137,9 @@
 %% Custom macros
 \newcommand{\inlinecode}[1]{\textcolor{blue!35!black}{\texttt{#1}}}
+
+%% Example Makefile macros (\mktab takes no argument; it just prints a TAB placeholder).
+\newcommand{\mkcomment}[1]{\textcolor{red!35!white}{#1}}
+\newcommand{\mktarget}[1]{\textcolor{blue!40!black}{#1}}
+\newcommand{\mkprereq}[1]{\textcolor{green!30!black}{#1}}
+\newcommand{\mktab}{\textcolor{black!25!white}{\_\_\_TAB\_\_\_}}
diff --git a/tex/src/references.tex b/tex/src/references.tex
index 305c3ab..f57b26e 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -845,13 +845,13 @@ archivePrefix = {arXiv},
 @ARTICLE{courtes15,
   author = {{Court{\`e}s}, Ludovic and {Wurmus}, Ricardo},
-   title = "{Reproducible and User-Controlled Software Environments in HPC with Guix}",
+   title = {Reproducible and User-Controlled Software Environments in HPC with Guix},
  journal = {Euro-Par},
  volume = {9523},
 keywords = {Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Operating Systems, Computer Science - Software Engineering},
-    year = "2015",
-   month = "Jun",
+    year = {2015},
+   month = {Jun},
     eid = {arXiv:1506.02822},
    pages = {arXiv:1506.02822},
archivePrefix = {arXiv},