-rw-r--r-- | paper.tex | 138
-rw-r--r-- | tex/src/figure-file-architecture.tex | 43
-rw-r--r-- | tex/src/references.tex | 28
3 files changed, 136 insertions, 73 deletions
@@ -313,23 +313,23 @@ Therfore our inputs are team-agnostic, allowing us to safely ignore ``repeatabil \section{Principles of the proposed solution} \label{sec:principles} -The core principle behind this work is simple: science is defined by its method, not its result. +The core principle behind this solution is simple: science is defined by its method, not its result. Statements that convey a ``result'' abound in all aspects of human life (e.g., in fiction, religion and science). -The distinguishing factor is the ``method'' the result was derived. +What distinguishes one from the other is the ``method'' by which the result was derived. Science is the only class that attempts to be as objective as possible through the ``scientific method''. \citet{buckheit1995} nicely summarize this by pointing out that modern scientific papers (narrative combined with plots, tables and figures) are merely advertisements of a scholarship; the actual scholarship is the scripts and software usage that went into doing the analysis. -This paper thus proposes a framework that is optimally designed for both running a project, \emph{and} publication of the (computational) methods along with the published result. +This paper thus proposes a framework that is optimally designed for designing and executing a project, \emph{as well as} for publication of the (computational) methods along with the published paper/result. However, this paper is not the first attempted solution to this fundamental problem. Various solutions have been proposed since the early 1990s; see Appendix \ref{appendix:existingsolutions} for a review. -To better highlight the differences with those methods, and also highlight the core foundations of this method (which help in understanding certain implementation choices), in the sub-sections below we elaborate on the core principle, by breaking it into logically independent sub-components. +To better highlight the differences from those methods, and the foundations of this method (which help in understanding certain implementation choices), in the sub-sections below the core principle above is expanded by breaking it into logically independent sub-components. It is important to note that based on the definition of a project (Section \ref{definition:project}) and the first principle below (modularity, Section \ref{principle:modularity}), this paper is designed to be modular and thus agnostic to high-level choices. -For example the choice of hardware (e.g., high performance computing facility or a personal computer), or high-level interfaces (i.e., beyond the raw project's source, for example a webpage or specialized graphic user interface). +For example, the choice of hardware (e.g., a high performance computing facility or a personal computer), or of high-level interfaces (for example a webpage or specialized graphical user interface). The proposed solution in this paper is a low-level skeleton that is designed to be easily adapted to any high-level, project-specific choice. For example, in terms of hardware choice, a large simulation project simply cannot be run on smaller machines. However, when such a project is managed in the proposed system, the complete project (see Section \ref{principle:complete}) is published and readable by peers, who can be sure that what they are reading contains the full/exact environment and commands that produced the result.
-In terms of interfaces, wrappers can be written over this core skeleton for various other high-level cosmetics, for example a web interface, or plugins to text editors or notebooks (see Appendix \ref{appendix:existingtools}). +In terms of interfaces, wrappers can be written over this core skeleton for various other high-level cosmetics, for example a web interface, a graphical user interface or plugins to text editors or notebooks (see Appendix \ref{appendix:editors}). @@ -337,28 +337,31 @@ In terms of interfaces, wrappers can be written over this core skeleton for vari \subsection{Principle: Complete/Self-contained} \label{principle:complete} -A project should be self-contained, needing no particular features from the host operating system and not affecting it. +A project should be self-contained, needing no particular features from the host operating system, and not affecting it. At build-time (when the project is building its necessary tools), the project shouldn't need anything beyond a minimal POSIX environment on the host, which is available in Unix-like operating systems like GNU/Linux, BSD-based systems or macOS. At run-time (once the environment/software have been built), it should not use or affect any host operating system programs or libraries. -Generally, a project's source should include the whole project: access to the inputs and validating them (see Section \ref{sec:definitions}), building necessary software (access to tarballs and instructions on configuring, building and installing those software), doing the analysis (run the software on the data) and creating the final narrative report/paper. +Generally, a project's source should include the whole project: access to the inputs (see Section \ref{sec:definitions}), building the necessary software (access to tarballs and instructions on configuring, building and installing that software), doing the analysis (running the software on the data) and creating the final narrative report/paper in its final format. This principle has several important consequences: \begin{itemize} -\item A complete project doesn't need an internet connection to build itself or to do its analysis and possibly make a report. - Of course this only holds when the analysis doesn't require internet, for example needing a live data feed. - -\item A Complete project doesn't need any previlaged/root permissions for system-wide installation or environment preparations. +\item A complete project doesn't need any privileged/root permissions for system-wide installation, or environment preparations. Even when the user does have root privileges, interfering with the host operating system for a project may lead to many conflicts with the host or other projects. - This allows a safe execution of the project, and will not cause any security problems. + This principle thus allows safe execution of the project, and will not cause any security problems. -\item A complete project inherently includes the complete data lineage and provenance: automatically enabling a full backtrace of the output datasets or narrative, to raw inputs: data or software source code lines (for the definition of inputs, please see \ref{definition:input}). - This is very important because existing data provenance solutions require manual tagging within the data workflow or connecting the data with the paper's text (Appendix \ref{appendix:existingsolutions}). +\item A complete project doesn't need an internet connection to build itself or to do its analysis and possibly make a report.
+ Of course this only holds when the analysis doesn't inherently require an internet connection, for example a live data feed. + +\item A complete project inherently includes the complete data lineage and provenance: automatically enabling a full backtrace of the output datasets or narrative to the raw inputs: data or software source code lines. + This is very important because many existing data provenance solutions require manual tagging within the data workflow or connecting the data with the paper's text (Appendix \ref{appendix:existingsolutions}). Manual tagging can be highly subjective, prone to many errors, and incomplete. + +\item A complete project will not need any user interaction and can complete itself automatically. + This is because manual interaction is itself a form of incompleteness. + Interactivity is also an inherently irreproducible operation, exposing the analysis to human error and requiring expert knowledge. \end{itemize} -The first two components are particularly important for high performace computing (HPC) facilities. -Because of security reasons, HPC users commonly don't have previlaged permissions or internet access. +The first two components are particularly important for high performance computing (HPC) facilities: for security reasons, HPC users commonly don't have privileged permissions or internet access. @@ -369,15 +372,21 @@ Because of security reasons, HPC users commonly don't have previlaged permission A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs and no side-effects. In a modular project, communication between the independent modules is explicit, providing optimizations on multiple levels: 1) Execution: independent modules can run in parallel, or modules that don't need to be run (because their dependencies haven't changed) won't be re-done. -2) Data lineage (for example experimenting on project), and data provenance extraction (recording any dataset's origins). +2) Data lineage and data provenance extraction (recording any dataset's origins). 3) Citation: allowing others to credit specific parts of a project. - This principle doesn't just apply to the analysis, it also applies to the whole project, for example see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:definitions}. + +Within the analysis phase, this principle is best summarized by the Unix philosophy, as described by \citet{mcilroy78} in the ``Style'' section. +In particular: ``Make each program do one thing well. +To do a new job, build afresh rather than complicate old programs by adding new `features'''. +Independent parts of the analysis can be maintained as independent software (for example shell, Python, or R scripts, or programs written in C, C++ or Fortran, among others). +This core aspect of the Unix philosophy has been the cause of its continued success (particularly through GNU and BSD) and development over the last half century. + For the highest-level analysis/operations, the boundary between the ``analysis'' and ``project'' can become blurry. -It is thus inevitable that some highly project-specific analysis is ultimately kept within the project and not maintained as a separate project. +It is thus inevitable that some small, highly project-specific analysis steps are kept within the project and not maintained as a separate software package (that is built before the project is run).
This isn't a problem, because inputs are defined as files that are \emph{usable} by other projects (see Section \ref{definition:input}). -Such highly project-specific software can later spin-off into a separate software package later if necessary. -%One nice example of an existing system that doesn't follow this principle is Madagascar, see Appendix \ref{appendix:madagascar}. +If necessary, such highly project-specific software can later spin off into a separate software package. +One example of an existing system that doesn't follow this principle is Madagascar, which builds a large number of analysis programs as part of the project (see Appendix \ref{appendix:madagascar}). %\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository. % This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake. @@ -393,19 +402,25 @@ The reason behind this principle is that opening, reading, or editing non-plain Binary formats will complicate various aspects of the project: its usage, archival, automatic parsing, or human readability. This is a critical principle for long-term preservation and portability: when the software to read a binary format has been deprecated or become obsolete and isn't installable on the running system, the project will not be readable/usable any more. -A project that is solely in plain text format can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development (software source code is also in plain text). -After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18}, enabling future projects to easily build ontop of old ones. +A project that is solely in plain text format can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development: software source code is also in plain text. +After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18}, enabling future projects to easily build on top of old ones, or cite specific parts of a project. Archiving a binary version of the project is like archiving a well-cooked dish itself, which will be inedible with changes in hardware (temperature, humidity, and the natural world in general). -But archiving a project's plain text source is like archiving the dish's recipe (which is also in plain text!): you can re-cook it any time. -When the environment is under perfect control (as in the proposed system), the binary/executable, or re-cooked, output will be identical. -\citet{smart18} describe a nice example of the how a seven-year old dispute between condensed matter scientists could only be solved when they shared the plain text source of their respective projects. +But archiving the project's plain-text source is like archiving the dish's recipe (which is also in plain text!): you can re-cook it any time. +When the environment is under perfect control (as in the proposed system), the binary/executable, or re-cooked, output will be verifiably identical.
+One illustrative example of the importance of source code is mentioned in \citet{smart18}: a seven-year-old dispute between condensed matter scientists could only be resolved when they shared the plain text source of their respective projects. This principle doesn't conflict with having an executable or immediately-runnable project\footnote{In their recommendation 4-1 on reproducibility, \citet{fineberg19} mention: ``a detailed description of the study methods (ideally in executable form)''.}. This is because it is trivial to build a text-based project within an executable container or virtual machine. For more on containers, please see Appendix \ref{appendix:independentenvironment}. -Similar to how a software is built from its plain-text source in such systems: a project is just a higher-level software. -A plain-text project's built/executable form can be published as an output of the project to help contemporary researchers (see Section \ref{definition:output}). +To help contemporary researchers, this built/executable form of the project can be published as an output on servers like \url{http://hub.docker.com} (see Section \ref{definition:output}). + +Note that this principle applies to the whole project, not just the initial phase. +Therefore a project like Conda, which currently includes a $+500$MB binary blob in a plain-text shell script (see Appendix \ref{appendix:conda}), does not satisfy this principle. +The same failure applies to projects that build tools to read binary sources. +In short, the full source of a project should be in plain text. + + @@ -416,12 +431,13 @@ In principle this is similar to Occum's rasor: ``Never posit pluralities without In this context Occam's razor can be interpreted in the following ways: minimize the number of a project's software dependencies (there are often multiple ways of doing something), avoid complex relations between analysis steps (which is not unrelated to the principle of modularity in Section \ref{principle:modularity}), -or avoid the vogue programming language of the day (since its going to fall out of fashion soon and take the project down with it, see Appendix \ref{appendix:highlevelinworkflow}). +or avoid the programming language that is currently in vogue, because it is likely to fall out of fashion soon and take the project down with it (see Appendix \ref{appendix:highlevelinworkflow}). This principle has several important consequences: \begin{itemize} \item Easier learning curve. Scientists can't adopt new tools and methods as fast as software developers. They have to invest the majority of their time in their own research domain. +Because of this, researchers usually continue their careers with the language/tools they learned when they started. \item Future usage. Scientific projects require longevity: unlike software engineering, there is no end-of-life in science (e.g., Aristotle's work 2.5 millennia ago is still ``science''). @@ -439,24 +455,16 @@ However, Github dropped HCL in October 2019, for more see Appendix \ref{appendix -\subsection{Principle: non-interactive processing} -\label{principle:batch} -A reproducible project should run without any manual interaction. -Manual interaction is an inherently irreproducible operation, exposing the analysis to human error, and will require expert knowledge. - - - - - -\subsection{Principle: Verifiable outputs} +\subsection{Principle: Verifiable inputs and outputs} \label{principle:verify} -The project should contain verification checks its outputs.
-Combined with the principle on batch processing (Section \ref{principle:batch}), expert knowledge won't be necessary to confirm the correct reproduction. +The project should contain automatic verification checks on its inputs (software source code and data) and outputs. +When such checks are applied, expert knowledge won't be necessary to confirm the correct reproduction. It is important to emphasize, however, that in practice exact or bit-wise reproduction is very hard to implement at the level of a file. This is because many specialized scientific software tools commonly print the running date on their output files (which is very useful in its own context). + For example, in plain-text tables, such metadata are commonly printed as commented lines (usually starting with \inlinecode{\#}). -Therefore when verifying a plain text table, the checksum which is used to validate the data, can be recorded after removing all commented lines. -Fortunately, the tools to operate on specialized data formats also usually have ways to remove requested metadata (like creation date), or ignore them. +Therefore, when verifying such a plain-text table, the checksum that is used to validate the data can be recorded after removing all commented lines. +Fortunately, the tools to operate on specialized data formats also usually have ways to remove requested metadata (like the creation date), or ignore metadata altogether. For example, the FITS standard in astronomy \citep{pence10} defines a special \inlinecode{DATASUM} keyword, which is a checksum calculated only from the raw data, ignoring all metadata. @@ -468,9 +476,9 @@ For example the FITS standard in astronomy \citep{pence10} defines a special \in No project is done in a single/first attempt. Projects evolve as they are being completed. It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed. -This is often seen in scientific papers, with statements like ``we also [first] tried method [or parameter] XXXX, but YYYY is used here because it showed to have better precision [or less bias, or etc]''. -A project's ``history'' is thus as scientifically relevant as the final, or published, state, or snapshot, of the project. -All the outputs (datasets or narrative papers) need to contain the the exact point in the project's history that produced them. +This is often seen in scientific papers, with statements like ``we [first] tried method [or parameter] XXXX, but YYYY is used here because it proved to have better precision [or less bias, etc.]''. +A project's ``history'' is thus as scientifically relevant as the final, or published, snapshot of the project. +All the outputs (datasets or narrative papers) need to contain the exact point in the project's history that produced them. For a complete project (see Section \ref{principle:complete}) that is under version control (like Git), this would be the unique commit checksum (for more on version control, see Appendix \ref{appendix:versioncontrol}; a minimal sketch of recording this checksum is given below). This principle thus benefits from the plain-text principle (Section \ref{principle:text}). @@ -479,7 +487,7 @@ After publication, the project's history can also be published on services like Taking this principle to a higher level, newer projects are built upon the shoulders of previous projects. A project management system should be able to provide this temporal connection between projects.
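As a minimal sketch of how the commit checksum mentioned above could be recorded (an illustration under stated assumptions, not the template's verbatim code; the directory, file and macro names are hypothetical), a small Make rule can write the current Git commit into a LaTeX macro file that the final report then includes:

    # Record the Git commit that produced this build as a LaTeX macro, so the
    # final PDF carries its own temporal provenance. Recipe lines start with a tab.
    tex/build/project-version.tex: FORCE
            commit=$$(git describe --dirty --always); \
            printf '\\newcommand{\\projectversion}{%s}\n' "$$commit" > $@
    FORCE:

The FORCE prerequisite makes the rule run on every build; a real implementation would only overwrite the file when the commit has actually changed, to avoid unnecessary rebuilds of everything that depends on it.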
-Quantifying how newer projects relate to older projects will enable 1) scientists to simply use the relevant parts of an older project, 2) quantify the connections of various projects, which is primarily of interest for meta-research (research on research) or historical studies. +Quantifying how newer projects relate to older projects (for example through Git branches) will enable 1) scientists to simply use the relevant parts of an older project, and 2) quantifying the connections between various projects, which is primarily of interest for meta-research (research on research) or historical studies. In data science, ``provenance'' is used to track the analysis and original datasets that were used in producing a higher-level dataset. A system that uses this principle will also provide ``temporal provenance'', quantifying how a certain project grew/evolved in the time dimension. @@ -524,7 +532,7 @@ However, as shown below, software freedom as an important pillar for the science \section{Reproducible paper template} \label{sec:template} -The proposed solution is an implementation of the principles discussed in Section \ref{sec:principles}: it is complete (Section \ref{principle:complete}) modular (Section \ref{principle:modularity}), fully in plain text (Section \ref{principle:text}), having minimal complexity (e.g., no dependencies beyond a minimal POSIX environment, see Section \ref{principle:complexity}), runnable without any human interaction (Section \ref{principle:batch}), with verifiable outputs (Section \ref{principle:verify}), preserving temporal provenance, or project evolution (Section \ref{principle:history}) and finally, it is free software (Section \ref{principle:freesoftware}). +The proposed solution is an implementation of the principles discussed in Section \ref{sec:principles}: it is complete and automatic (Section \ref{principle:complete}), modular (Section \ref{principle:modularity}), fully in plain text (Section \ref{principle:text}), has minimal complexity (see Section \ref{principle:complexity}), has automatically verifiable inputs \& outputs (Section \ref{principle:verify}), preserves temporal provenance, or project evolution (Section \ref{principle:history}) and finally, it is free software (Section \ref{principle:freesoftware}). In practice it is a collection of plain-text files that are distributed in sub-directories by context, and are all under version control (currently with Git). In its raw form (before customizing for different projects), it is just a skeleton of a project without much flesh: it contains all the low-level infrastructure, but no real analysis. @@ -544,7 +552,7 @@ Therefore, a list of the notable changes after the publication of this paper wil \subsection{Job orchestration with Make} \label{sec:usingmake} -When non-interactive, or batch, processing is needed (see Section \ref{principle:batch}), shell scripts are usually the first solution that come to mind (see Appendix \ref{appendix:scripts}). +When non-interactive, or batch, processing is needed (see Section \ref{principle:complete}), shell scripts are usually the first solution that comes to mind (see Appendix \ref{appendix:scripts}). However, the inherent complexity and non-linearity of progress in a scientific project (where experimentation is key) make it hard and inefficient to manage the script(s) as the project evolves. For example, a script will start from the top every time it is run.
Therefore, if $90\%$ of a research project is done and only the newly added, final $10\%$ must be executed, it is necessary to run the whole script from the start. @@ -615,6 +623,7 @@ The latter is necessary for many web-based automatic paper generating systems li Symbolic links and their contents are not considered part of the source and are not under version control. Files and directories are shown within their parent directory. For example, the full address of \inlinecode{analysis-1.mk} from the top project directory is \inlinecode{reproduce/analysis/make/analysis-1.mk}. + \tonote{Add the `.git' directory also.} } \end{figure} @@ -876,20 +885,34 @@ For example even though \inlinecode{input1.dat} is a target in \inlinecode{downl \subsubsection{Verification of outputs} \label{sec:outputverification} An important principle for this template is that outputs should be automatically verified; see Section \ref{principle:verify}. -As shown in Figure \ref{fig:datalineage}, all the \LaTeX{} macro files are not directly a prerequisite of \inlinecode{project.tex}, but of \inlinecode{verify\-.tex}. -Prior to publication, the project authors should add the MD5 checksums of all the\LaTeX{} macro files in \inlinecode{verify\-.tex}, the necessary structure is already there. -If any \LaTeX{} macro is different in future builds of the project, the project will abort with a warning of the problematic file. +However, simply confirming the checksum of the final PDF, or of its figures and datasets, is not generally possible: as mentioned in Section \ref{principle:verify}, many tools that produce datasets or PDFs write the creation date into the produced files. +To facilitate output verification, this template has the lower-level \inlinecode{verify.mk} Makefile (see Figure \ref{fig:datalineage}). +It is the only prerequisite of \inlinecode{project.tex}, which was described in Section \ref{sec:paperpdf}. +As shown in Figure \ref{fig:datalineage}, the \LaTeX{} macros of all lower-level Makefiles are prerequisites of \inlinecode{verify.tex}. +Verification is therefore the connection-point, or bottleneck, between the analysis steps of the project and its final report. + +Prior to publication, the project authors should add the MD5 checksums of all the \LaTeX{} macro files in the recipe of \inlinecode{verify\-.tex}; the necessary structure is already there. +If any \LaTeX{} macro is different in future builds of the project, the project will abort with a warning about the problematic file. When projects involve other outputs (for example images, tables or datasets that will also be published), their contents should also be validated. -To do this, prerequisites should be added to the \inlinecode{verify\-.tex} rule that automatically check the contents of other project outputs also. -However, verification of other files should be done with care: as mentioned in Section \ref{principle:verify}, many tools that produce datasets write the creation date into the produced file. -The verification in such cases should be done by ignoring such terms in the generated files. +To do this, prerequisites should be added to the \inlinecode{verify\-.tex} rule that automatically check the \emph{contents} of other project outputs. +Recall that many tools print the creation date automatically when creating a file, so to verify a file such metadata must be ignored (a minimal sketch of such a check is given below).
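To make the verification step concrete, here is a minimal sketch in Make (the file name and checksum are placeholders, not the template's verbatim rule): it checks the data rows of a plain-text table against a checksum recorded by the authors, after stripping the \inlinecode{\#}-commented lines that carry volatile metadata such as the creation date.

    # Placeholder checksum (here the MD5 of empty input); the authors would
    # record the real value before publication.
    table-checksum = d41d8cd98f00b204e9800998ecf8427e

    # This stamp file could then be added as a prerequisite of the verify.tex rule.
    verify-table.stamp: table.txt
            actual=$$(grep -v '^#' table.txt | md5sum | awk '{print $$1}'); \
            if [ "$$actual" = "$(table-checksum)" ]; then \
                echo "table.txt verified" > $@; \
            else \
                echo "table.txt: checksum mismatch"; exit 1; \
            fi

If the comparison fails, the recipe exits with an error, so Make aborts the build before the final report can be produced from unverified results.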
+\subsubsection{Orchestrating the analysis} +\label{sec:analysisorchestration} +All the analysis steps of a project are ultimately prerequisites of the verification step; see the data lineage in Figure \ref{fig:datalineage} and Section \ref{sec:outputverification}. +The detailed organization of the analysis steps depends strongly on the particular project; because Make already knows which files are independent of others, it can run them in any order, or in parallel on any number of threads. +Two lower-level analysis Makefiles are common to all projects: \inlinecode{initial\-ize\-.mk} and \inlinecode{download\-.mk}. +\inlinecode{init\-ial\-ize\-.mk} is the first lower-level Makefile that is loaded into \inlinecode{top-make.mk}. +It doesn't actually contain any analysis; it just initializes the system: setting environment variables, internal Make variables and generic rules like \inlinecode{./project make clean} (to clean all built products, not software) or \inlinecode{./project make dist} (to package the project into a tarball for distribution), among others. +To get a good feel for the system, it is recommended to look through this file. +\inlinecode{download.mk} contains some commonly necessary steps to facilitate the importation of input datasets: a simple \inlinecode{wget} command is usually not enough. +We also want to check whether a local copy exists, and to calculate the file's checksum to verify it (a rough sketch of such a rule is given below). @@ -925,6 +948,8 @@ The verification in such cases should be done by ignoring such terms in the gene \item Research objects (Appendix \ref{appendix:researchobject}) can automatically be generated from the Makefiles; we can also apply special commenting conventions, to be included as annotations/descriptions in the research object metadata. \item Provenance between projects: through Git, all projects based on this template are automatically connected; through inputs/outputs, the lineage of a project can also be traced back to the projects that came before it. \item \citet{gibney20}: After code submission was encouraged by the Neural Information Processing Systems (NeurIPS), the frac +\item When the data are confidential, \citet{perignon19} suggest having a third party familiar with coding referee the code and give its approval. + In this system, because of automatic verification of inputs and outputs, no technical knowledge is necessary for the verification. \end{itemize} @@ -1067,6 +1092,7 @@ In summary, these package managers are primarily meant for the operating system Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) don't use the host's package manager, but an independent package manager, like the ones below. \subsubsection{Conda/Anaconda} +\label{appendix:conda} Conda is an independent package manager that can be used on GNU/Linux, macOS, or Windows operating systems, although not all software packages are available on all operating systems. Conda is able to maintain an approximately independent environment on an operating system without requiring root access.
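Returning to the \inlinecode{download.mk} description above, a rough sketch of the kind of rule it contains is given here (the URL, directories, file name and checksum are all placeholders, and this is not the template's verbatim code): prefer a local copy of the input when one exists, download it otherwise, and verify the checksum in either case.

    # Illustrative download rule: use a local copy if available, otherwise wget,
    # then verify the MD5 checksum. In the real template such checksums would
    # presumably live in a configuration file; they are hard-coded here for brevity.
    input1-md5 = d41d8cd98f00b204e9800998ecf8427e

    indir/input1.dat:
            mkdir -p indir; \
            if [ -f "$(LOCALCOPY)/input1.dat" ]; then \
                cp "$(LOCALCOPY)/input1.dat" $@; \
            else \
                wget -O $@ https://example.org/input1.dat; \
            fi; \
            sum=$$(md5sum $@ | awk '{print $$1}'); \
            [ "$$sum" = "$(input1-md5)" ] \
                || { echo "$@: checksum mismatch"; rm -f $@; exit 1; }

This mirrors the completeness principle above: with a local copy (for example on an HPC cluster without internet access) the project still builds, and the checksum check ensures that whichever route was taken, the input is exactly the one the authors intended.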
diff --git a/tex/src/figure-file-architecture.tex b/tex/src/figure-file-architecture.tex index a4d4755..e08ee52 100644 --- a/tex/src/figure-file-architecture.tex +++ b/tex/src/figure-file-architecture.tex @@ -10,7 +10,7 @@ \footnotesize %% project/ - \node [dirbox, at={(0,4cm)}, minimum width=15cm, minimum height=9cm, + \node [dirbox, at={(0,4cm)}, minimum width=15cm, minimum height=9.9cm, label={[shift={(0,-5mm)}]\texttt{project/}}] {}; \ifdefined\fullfilearchitecture \node [node-nonterminal-thin, at={(-6.0cm,3.3cm)}] {COPYING}; @@ -31,7 +31,7 @@ \node [dirbox, at={(-4.35cm,2.1cm)}, minimum width=5.7cm, minimum height=5.3cm, label={[shift={(0,-5mm)}]\texttt{software/}}, fill=brown!20!white] {}; - %% reproduce/software/config + %% reproduce/software/config/ \node [dirbox, at={(-5.75cm,1.5cm)}, minimum width=2.6cm, minimum height=2.1cm, label={[shift={(0,-5mm)}]\texttt{config/}}, fill=brown!25!white] {}; \ifdefined\fullfilearchitecture @@ -42,7 +42,7 @@ \node [node-nonterminal-thin, at={(-5.75cm,-0.2cm)}] {checksums.conf}; \fi - %% reproduce/software/make + %% reproduce/software/make/ \node [dirbox, at={(-2.95cm,1.5cm)}, minimum width=2.6cm, minimum height=2.1cm, label={[shift={(0,-5mm)}]\texttt{make/}}, fill=brown!25!white] {}; \ifdefined\fullfilearchitecture @@ -53,15 +53,15 @@ \node [node-nonterminal-thin, at={(-2.95cm,-0.2cm)}] {python.mk}; \fi - %% reproduce/software/bash + %% reproduce/software/bash/ \node [dirbox, at={(-5.75cm,-0.8cm)}, minimum width=2.6cm, minimum height=1.6cm, - label={[shift={(0,-5mm)}]\texttt{bash/}}, fill=brown!25!white] {}; + label={[shift={(0,-5mm)}]\texttt{shell/}}, fill=brown!25!white] {}; \ifdefined\fullfilearchitecture - \node [node-nonterminal-thin, at={(-5.75cm,-1.5cm)}] {bashrc.sh}; - \node [node-nonterminal-thin, at={(-5.75cm,-2.0cm)}] {configure.sh}; + \node [node-nonterminal-thin, at={(-5.75cm,-1.5cm)}] {configure.sh}; + \node [node-nonterminal-thin, at={(-5.75cm,-2.0cm)}] {bashrc.sh}; \fi - %% reproduce/software/bibtex + %% reproduce/software/bibtex/ \node [dirbox, at={(-2.95cm,-0.8cm)}, minimum width=2.6cm, minimum height=2.1cm, label={[shift={(0,-5mm)}]\texttt{bibtex/}}, fill=brown!25!white] {}; \ifdefined\fullfilearchitecture @@ -74,7 +74,7 @@ \node [dirbox, at={(1.55cm,2.1cm)}, minimum width=5.7cm, minimum height=5.3cm, label={[shift={(0,-5mm)}]\texttt{analysis/}}, fill=brown!20!white] {}; - %% reproduce/analysis/config + %% reproduce/analysis/config/ \node [dirbox, at={(0.15cm,1.5cm)}, minimum width=2.6cm, minimum height=2.6cm, label={[shift={(0,-5mm)}]\texttt{config/}}, fill=brown!25!white] {}; \node [node-nonterminal-thin, at={(0.15cm,0.8cm)}] {INPUTS.conf}; @@ -82,7 +82,7 @@ \node [node-nonterminal-thin, at={(0.15cm,-0.2cm)}] {param-2a.conf}; \node [node-nonterminal-thin, at={(0.15cm,-0.7cm)}] {param-2b.conf}; - %% reproduce/analysis/make + %% reproduce/analysis/make/ \node [dirbox, at={(2.95cm,1.5cm)}, minimum width=2.6cm, minimum height=2.6cm, label={[shift={(0,-5mm)}]\texttt{make/}}, fill=brown!25!white] {}; \node [node-nonterminal-thin, at={(2.95cm,0.8cm)}] {top-prepare.mk}; @@ -90,14 +90,14 @@ \node [node-nonterminal-thin, at={(2.95cm,-0.2cm)}] {initialize.mk}; \node [node-nonterminal-thin, at={(2.95cm,-0.7cm)}] {analysis1.mk}; - %% reproduce/analysis/bash + %% reproduce/analysis/bash/ \node [dirbox, at={(0.15cm,-1.3cm)}, minimum width=2.6cm, minimum height=1.1cm, label={[shift={(0,-5mm)}]\texttt{bash/}}, fill=brown!25!white] {}; \ifdefined\fullfilearchitecture \node [node-nonterminal-thin, at={(0.15cm,-2.0cm)}] {process-A.sh}; \fi - %% 
reproduce/analysis/python + %% reproduce/analysis/python/ \node [dirbox, at={(2.95cm,-1.3cm)}, minimum width=2.6cm, minimum height=1.6cm, label={[shift={(0,-5mm)}]\texttt{python/}}, fill=brown!25!white] {}; \ifdefined\fullfilearchitecture @@ -109,7 +109,7 @@ \node [dirbox, at={(6cm,2.6cm)}, minimum width=2.7cm, minimum height=6cm, label={[shift={(0,-5mm)}]\texttt{tex/}}, fill=brown!15!white] {}; - %% tex/src + %% tex/src/ \node [dirbox, at={(6cm,2.1cm)}, minimum width=2.5cm, minimum height=1.6cm, label={[shift={(0,-5mm)}]\texttt{src/}}, fill=brown!20!white] {}; \node [node-nonterminal-thin, at={(6cm,1.4cm)}] {references.tex}; @@ -117,7 +117,7 @@ \node [node-nonterminal-thin, at={(6cm,0.9cm)}] {figure-1.tex}; \fi - %% tex/build + %% tex/build/ \ifdefined\fullfilearchitecture \node [dirbox, at={(6cm,0.1cm)}, minimum width=2.5cm, minimum height=1.3cm, label={[shift={(0,-5mm)}]\texttt{build/}}, dashed, fill=brown!20!white] {}; @@ -125,7 +125,7 @@ \node [anchor=west, at={(4.7cm,-1.0cm)}] {\scriptsize\sf \LaTeX{} build directory.}; \fi - %% tex/tikz + %% tex/tikz/ \ifdefined\fullfilearchitecture \node [dirbox, at={(6cm,-1.6cm)}, minimum width=2.5cm, minimum height=1.6cm, label={[shift={(0,-5mm)}]\texttt{tikz/}}, dashed, fill=brown!20!white] {}; @@ -134,7 +134,7 @@ \node [anchor=west, at={(4.67cm,-3.0cm)}] {\scriptsize\sf by \LaTeX).}; \fi - %% .local + %% .local/ \ifdefined\fullfilearchitecture \node [dirbox, at={(-3.6cm,-3.6cm)}, minimum width=7cm, minimum height=1.2cm, label={[shift={(0,-5mm)}]\texttt{.local/}}, dashed, fill=brown!15!white] {}; @@ -144,7 +144,7 @@ {\scriptsize\sf Python or R, run `\texttt{.local/bin/python}' or `\texttt{.local/bin/R}'}; \fi - %% .build + %% .build/ \ifdefined\fullfilearchitecture \node [dirbox, at={(3.6cm,-3.6cm)}, minimum width=7cm, minimum height=1.2cm, label={[shift={(0,-5mm)}]\texttt{.build/}}, dashed, fill=brown!15!white] {}; @@ -153,4 +153,13 @@ \node [anchor=west, at={(0.1cm,-4.6cm)}] {\scriptsize\sf Enabling easy access to all of the project's built components.}; \fi + + %% .git/ + \ifdefined\fullfilearchitecture + \node [dirbox, at={(0,-5cm)}, minimum width=14.2cm, minimum height=7mm, + label={[shift={(0,-5mm)}]\texttt{.git/}}, dashed, fill=brown!15!white] {}; + \node [anchor=north, at={(0cm,-5.3cm)}] + {\scriptsize\sf Full project temporal provenance (version-controlled history) in Git.}; + \fi + \end{tikzpicture} diff --git a/tex/src/references.tex b/tex/src/references.tex index 68d59d3..305c3ab 100644 --- a/tex/src/references.tex +++ b/tex/src/references.tex @@ -35,6 +35,20 @@ archivePrefix = {arXiv}, +@ARTICLE{perignon19, + author = {Christophe P\'erignon and Kamel Gadouche and Christophe Hurlin and Roxane Silberman and Eric Debonnel}, + title = {Certify reproducibility with confidential data}, + year = {2019}, + journal = {Science}, + volume = {365}, + pages = {127}, + doi = {10.1126/science.aaw2825}, +} + + + + + @ARTICLE{munafo19, author = {Marcus Munaf\'o}, title = {Raising research quality will require collective action}, @@ -1537,6 +1551,20 @@ Reproducible Research in Image Processing}, +@ARTICLE{mcilroy78, + author = {M. D. McIlroy and E. N. Pinson and B. A. Tague}, + title = {UNIX Time-Sharing System: Foreword}, + journal = {\doihref{https://archive.org/details/bstj57-6-1899/mode/2up}{Bell System Technical Journal}}, + year = {1978}, + volume = {57}, + pages = {6, ark:/13960/t0gt6xf72}, + doi = {}, +} + + + + + @ARTICLE{anscombe73, author = {{Anscombe}, F.J.}, title = {Graphs in Statistical Analysis},