 paper.tex                             | 168
 reproduce/software/make/high-level.mk |  33
 tex/src/figure-file-architecture.tex  | 223
 tex/src/preamble-pgfplots.tex         |  11
 tex/src/preamble-style.tex            |   1
 tex/src/references.tex                |  20
 6 files changed, 364 insertions(+), 92 deletions(-)
diff --git a/paper.tex b/paper.tex
index 58bd4ef..7237876 100644
--- a/paper.tex
+++ b/paper.tex
@@ -107,7 +107,7 @@ Even earlier, \citet{ioannidis05} prove that ``most claimed research findings ar
Given the scale of the problem, the USA National Science Foundation (NSF, at the request of the USA Congress) asked a committee of the National Academy of Sciences to assess its impact.
The results were recently published by \citet{fineberg19} and provide a good review of the status.
-That committee doesn't recognize a ``crisis'', but the importance is stressed along with definitions (see Section \ref{sec:terminology}) and proposals.
+That committee doesn't recognize a ``crisis'', but it stresses the importance of the issue, along with definitions (see Section \ref{sec:definitions}) and proposals.
Earlier, in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}, and the best working solutions (at that time) were recognized.
Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then.
@@ -159,7 +159,7 @@ It later inspired \citet{buckheit1995} to publish a reproducible paper (in Matla
\tonote{Find some other historical examples.}
In this paper, a solution is introduced that attempts to address the problems above and has already been used in scientific papers.
-In Section \ref{sec:terminology}, the problem that is addressed by this template is clearly defined and Section \ref{appendix:existingsolutions} reviews some existing solutions and their pros and cons with respect to reproducibility in a scientific framework.
+In Section \ref{sec:definitions}, the problem that is addressed by this template is clearly defined, and Appendix \ref{appendix:existingsolutions} reviews some existing solutions and their pros and cons with respect to reproducibility in a scientific framework.
Section \ref{sec:template} introduces the template, and its approach to the problem, with full design details.
Finally in Section \ref{sec:discussion} the future prospects of using systems like this template are discussed.
@@ -185,7 +185,7 @@ Finally in Section \ref{sec:discussion} the future prospects of using systems li
\section{Definitions of important terms}
-\label{sec:terminology}
+\label{sec:definitions}
The concepts and terminology of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution providers.
It is therefore important to clarify the specific terms used throughout this paper and its appendix, before starting the technical details.
@@ -331,13 +331,11 @@ In a modular project, communication between the independent modules is explicit,
2) Data lineage (for example, experimenting on the project), and data provenance extraction (recording any dataset's origins).
3) Citation: allowing others to credit specific parts of a project.
-This modularity principle doesn't just apply to the analysis, it also applies to the whole project, for example see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:terminology}.
-Based on this principle, an ideal project source, shouldn't itself have any analysis tools.
-Software that does a specific analysis must be a separate entity (a software package), the project should just contain its installation and usage instructions.
-
+This principle doesn't just apply to the analysis; it also applies to the whole project: for example, see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:definitions}.
For the most high-level analysis/operations, the boundary between the ``analysis'' and ``project'' can become blurry.
It is thus inevitable that some highly project-specific analysis is ultimately kept within the project and not maintained as a separate project.
-We don't see this as a problem, because it can later spin-off into a separate software package if the need is felt by the community.
+This isn't a problem, because inputs are defined as files that are \emph{usable} by other projects (see Section \ref{definition:input}).
+Such highly project-specific software can later spin off into a separate software package if necessary.
%One nice example of an existing system that doesn't follow this principle is Madagascar, see Appendix \ref{appendix:madagascar}.
@@ -348,11 +346,10 @@ We don't see this as a problem, because it can later spin-off into a separate so
\label{principle:text}
A project's primarily stored/archived format should be plain text with human-readable encoding\footnote{Plain text format doesn't include document container formats like `\texttt{.odf}' or `\texttt{.doc}', for software like LibreOffice or Microsoft Office.}, for example ASCII or Unicode (for the definition of a project, see Section \ref{definition:project}).
The reason behind this principle is that opening, reading, or editing non-plain text (executable or binary) file formats needs specialized software.
-Binary formats will complicate various aspects of the project: its archival, automatic parsing, or human readability.
+Binary formats will complicate various aspects of the project: its usage, archival, automatic parsing, or human readability.
This is a critical principle for long-term preservation and portability: when the software needed to read a binary format has been deprecated or become obsolete and isn't installable on the running system, the project will not be readable/usable any more.
-A project that is solely in plain text format has many useful advantages in a scientific context (portable, with long term archivability, and generic parsing).
-For example the project itself can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development (for more, see the principle on project history in Section \ref{principle:history}).
+A project that is solely in plain text format can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development (software source code is also in plain text).
After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18}, enabling future projects to easily build on top of old ones.
Archiving a binary version of the project is like archiving the well-cooked dish itself: it will become inedible with changes in its environment (temperature, humidity, and the natural world in general).
@@ -371,24 +368,25 @@ A plain-text project's built/executable form can be published as an output of th
\subsection{Principle: Complete/Self-contained}
\label{principle:complete}
-A project should be self-contained, needing no particular features from the host.
+A project should be self-contained, needing no particular features from the host operating system and not affecting it.
At build-time (when the project is building its necessary tools), the project shouldn't need anything beyond a minimal POSIX environment on the host, which is available in Unix-like operating systems like GNU/Linux, the BSDs, or macOS.
-At run-time (when environment/software are built), it should not use any host operating system programs or libraries.
+At run-time (when environment/software are built), it should not use or affect any host operating system programs or libraries.
-Generally, a project's source should include the whole project: access to the inputs and validating them (see Section \ref{sec:terminology}), building necessary software (access to tarballs and instructions on configuring, building and installing those software), doing the analysis (run the software on the data) and creating the final report/paper PDF.
+Generally, a project's source should include the whole project: access to the inputs and their validation (see Section \ref{sec:definitions}), building the necessary software (access to tarballs and instructions on configuring, building and installing that software), doing the analysis (running the software on the data) and creating the final narrative report/paper.
This principle has several important consequences:
-\begin{enumerate}
+\begin{itemize}
\item A complete project doesn't need an internet connection to build itself or to do its analysis and possibly make a report.
Of course, this only holds when the analysis itself doesn't require an internet connection, for example a live data feed.
\item A complete project doesn't need any privileged/root permissions for system-wide installation or environment preparations.
Even when the user does have root privileges, interfering with the host operating system for a project may lead to many conflicts with the host or other projects.
+ This allows safe execution of the project and avoids causing security problems on the host.
\item A complete project inherently includes the complete data lineage and provenance, automatically enabling a full backtrace of the output datasets or narrative to the raw inputs: data or software source code lines (for the definition of inputs, please see Section \ref{definition:input}).
This is very important because existing data provenance solutions require manual tagging within the data workflow or connecting the data with the paper's text (Appendix \ref{appendix:existingsolutions}).
Manual tagging can be highly subjective, prone to many errors, and incomplete.
-\end{enumerate}
+\end{itemize}
The first two points are particularly important for high performance computing (HPC) facilities.
For security reasons, HPC users commonly don't have privileged permissions or internet access.
@@ -485,73 +483,68 @@ However, as shown below, software freedom as an important pillar for the science
-\subsection{Comparisons of principles with other attempts}
-
-\tonote{Also have a look at the goals in \citep{nowakowski11} and \citet{hinsen15}.}
-
-\tonote{Add in larger context:} Each step should be completely independent and not need memory filled from previous steps.
-This is critical for the workflow to start at any random point.
-
-The Popper Convention \citep{jimenez17} for reproducible systems defines a list of guidelines for a ``popperized'' workflow.
-Note that in this section we only discuss the Popper convention, the software implementation/solution that is also discussed in \citet{jimenez17} is reviewed in Appendix \ref{appendix:popper}.
-The proposed template builds upon the Popper convention, with extra addition which is very important for scientists when using the template, and also long term archivability of the project.
-The goals of the proposed template are listed below, the first few from the Popper convention are marked.
-\begin{enumerate}
- \setlength\itemsep{-1mm}
-\item {\bf\small Self-contained project:} (from Popper) everything (e.g., code or paper's manuscript) must be included.
-\item {\bf\small Automated validation:} (from Popper) check if analysis can complete successfully.
-\item {\bf\small Toolchain Agnosticism:} (from Popper) unique identifiers and automatically download of inputs.
-\item {\bf\small Existing template:} (from Popper) containing the full research cycle, for new adopters.
-\item {\bf\small Minimize meta-dependencies}: managing the project with a minimal requirements.
-\end{enumerate}
-
-By meta-dependencies, we mean software/language dependencies of the project management phase, not the core science software/dependencies.
-For example tools for the following operations: version control, job managers, containers, package managers, specialized editors, web interfaces and etc (see Appendix \ref{appendix:existingtools}).
-This additional goal is based on the real experience of scientists (with no engineering or software engineering background), derived since our first attempts \citep[see the version on arXiv]{akhlaghi15}, and listed below:
-
-
-
-
+%\section{Reproducible paper template (OLD)}
+%This template is based on a simple principle: the output's full lineage should be stored in a human-readable, plain text format, that can also be automatically run on a computer.
+%The primary components of the research output's lineage can be summarized as:
+%1) URLs/PIDs and checksums of external inputs.
+%These external inputs can be datasets, if the project needs any (not simulations), or software source code that must be downloaded and built.
+%2) Software building scripts.
+%3) Full series of steps to run the software on the data, i.e., to do the data analysis.
+%4) Building the narrative description of the project that describes the output (published paper).
+%where persistent identifiers, or URLs of input data and software source code, as well as instructions/scripts on how to build the software and run them on the data to do the analysis
+%It started with \citet{akhlaghi15} which was a paper describing a new detection algorithm.
+%and further evolving in \citet{bacon17}, in particular the following two sections of that paper: \citet{akhlaghi18a} and \citet{akhlaghi18b}.
+%\tonote{Mention the \citet{smart18} paper on how a binary version is not sufficient, low-level, human-readable plain text source is mandatory.}
+%\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository.
+% This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake.
+% Software developers already have decades of experience with the problems of mixing hand-written source files with the automatically generated (built) files.
+% This principle has proved to be an exceptionally useful in this model, grealy
\section{Reproducible paper template}
\label{sec:template}
-This template is based on a simple principle: the output's full lineage should be stored in a human-readable, plain text format, that can also be automatically run on a computer.
-The primary components of the research output's lineage can be summarized as:
-1) URLs/PIDs and checksums of external inputs.
-These external inputs can be datasets, if the project needs any (not simulations), or software source code that must be downloaded and built.
-2) Software building scripts.
-3) Full series of steps to run the software on the data, i.e., to do the data analysis.
-4) Building the narrative description of the project that describes the output (published paper).
+The proposed solution is an implementation of the definitions and principles of Sections \ref{sec:definitions} and \ref{sec:principles}.
+In short, it is a version-controlled directory containing many plain-text files, organized into various subdirectories by context.
-where persistent identifiers, or URLs of input data and software source code, as well as instructions/scripts on how to build the software and run run them on the data to do the analysis
-It started with \citet{akhlaghi15} which was a paper describing a new detection algorithm.
+\begin{figure}[t]
+ \begin{center}
+ \includetikz{figure-file-architecture}
+ \end{center}
+ \vspace{-5mm}
+ \caption{\label{fig:files}
+ Directory and (plain-text) file structure in a hypothetical project using this solution.
+ Files are shown with thin, darker boxes that have a suffix in their names (for example `\texttt{*.mk}' or `\texttt{*.conf}').
+ Directories are shown as large, brighter boxes, where the name ends in a slash (\texttt{/}).
+ Files and directories are shown within their parent directory.
+ For example the full address of \texttt{analysis-1.mk} from the top project directory is \texttt{reproduce/analysis/make/analysis-1.mk}.
+ }
+\end{figure}
+
-and further evolving in \citet{bacon17}, in particular the following two sections of that paper: \citet{akhlaghi18a} and \citet{akhlaghi18b}.
-\tonote{Mention the \citet{smart18} paper on how a binary version is not sufficient, low-level, human-readable plain text source is mandatory.}
+\subsection{A project is a directory containing plain-text files}
+\label{sec:projectdir}
+Based on the plain-text principle (Section \ref{principle:text}), a project in the proposed solution is a top-level directory that is ultimately filled with plain-text files (many of them in subdirectories).
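To make this concrete, here is a minimal sketch of what one of the plain-text configuration files shown in the file-architecture figure (for example INPUTS.conf) might contain, assuming Make-parsable variable assignments; the variable names, URL and checksum below are purely illustrative and not taken from the template.

# Hypothetical entry for one external input dataset: where to download
# it and how to verify it (name, URL and checksum are made up).
INPUT-SURVEY     = survey-catalog-dr1.fits
INPUT-SURVEY-MD5 = 00112233445566778899aabbccddeeff
INPUT-SURVEY-URL = https://example.org/data/$(INPUT-SURVEY)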
-\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository.
- This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake.
- Software developers already have decades of experience with the problems of mixing hand-written source files with the automatically generated (built) files.
- This principle has proved to be an exceptionally useful in this model, grealy
-\subsection{Make}
+\subsection{Job orchestration with Make}
\label{sec:make}
-The template manages the higher-level dependency tracking of a project using Make.
-The very high-level user-interface of the template is written in shell script (in particular GNU Bash).
-All the individual operations (like downloading and installing software or doing the analysis) is done through Makefiles that are called by that script.
+
+Job orchestration is done with Make (see Appendix \ref{appendix:jobmanagement}).
+To start with, any POSIX-compliant Make version is enough.
+However, the project will build its own necessary software, and that includes a custom version of GNU Make \citep{stallman88}.
+Several other job orchestration tools are reviewed in Appendix \ref{appendix:jobmanagement}, but we chose Make because of some unique features: 1) it is present on any POSIX-compliant system, satisfying the principle of
\tonote{\citet{schwab2000} discuss the ease of students learning Make.}
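To illustrate the kind of dependency tracking Make provides here, the sketch below shows two hypothetical rules (the file names are illustrative, not the template's actual rules; only process-1.sh and paper.tex appear elsewhere in this commit). A target is rebuilt only when one of its prerequisites changes, so the full lineage of the paper can be expressed as a chain of such rules. Recipe lines must begin with a tab.

# Hypothetical lineage rules: the table is regenerated only when the
# input catalog or the script producing it changes, and the paper is
# recompiled only when the table or the LaTeX source changes.
tex/table-1.tex: reproduce/analysis/bash/process-1.sh input/catalog.fits
	bash reproduce/analysis/bash/process-1.sh input/catalog.fits > $@

paper.pdf: paper.tex tex/table-1.tex
	pdflatex paper.tex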
@@ -1123,7 +1116,7 @@ Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or
-\subsection{Reproducible Electronic Documents (RED), 1992}
+\subsection{Reproducible Electronic Documents, RED (1992)}
\label{appendix:electronicdocs}
Reproducible Electronic Documents (\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}) is the first attempt that we could find at doing reproducible research \citep{claerbout1992,schwab2000}.
@@ -1150,7 +1143,7 @@ Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see A
-\subsection{Apache Taverna, 2003}
+\subsection{Apache Taverna (2003)}
\label{appendix:taverna}
Apache Taverna (\url{https://taverna.incubator.apache.org}) is a workflow management system written in Java with a graphical user interface, see \citet[still being actively developed]{oinn04}.
A workflow is defined as a directed graph, where nodes are called ``processors''.
@@ -1164,7 +1157,7 @@ In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails}
-\subsection{Madagascar, 2003}
+\subsection{Madagascar (2003)}
\label{appendix:madagascar}
Madagascar (\url{http://ahay.org}) is a set of extensions to the SCons job management tool \citep{fomel13}.
For more on SCons, see Appendix \ref{appendix:jobmanagement}.
@@ -1183,7 +1176,7 @@ Since RSF contains program options also, the inputs and outputs of Madagascar's
Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\url{http://www.ahay.org/wiki/Reproducible_Documents}} \citep[120 prior to][]{fomel13}.
-\subsection{GenePattern, 2004}
+\subsection{GenePattern (2004)}
\label{appendix:genepattern}
GenePattern (\url{https://www.genepattern.org}) is client-server software containing many common analysis functions/modules, primarily focused on gene studies \citep[first released in 2004]{reich06}.
Although it is highly focused on a special research field, it is reviewed here because its concepts/methods are generic and relevant in the context of this paper.
@@ -1205,7 +1198,7 @@ This is a very nice example of the fragility of solutions that depend on archivi
-\subsection{Kepler, 2005}
+\subsection{Kepler (2005)}
Kepler (\url{https://kepler-project.org}) is a Java-based Graphic User Interface workflow management tool \citep{ludascher05}.
Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to Figure \ref{fig:analysisworkflow}).
Each actor is connected to others through the Ptolemy approach \citep{eker03}.
@@ -1216,7 +1209,7 @@ In many aspects Kepler is like VisTrails, see Appendix \ref{appendix:vistrails}.
-\subsection{VisTrails, 2005}
+\subsection{VisTrails (2005)}
\label{appendix:vistrails}
VisTrails (\url{https://www.vistrails.org}) was a graphical workflow management system, described in \citet{bavoil05}.
@@ -1242,7 +1235,7 @@ Scripts can easily be written to generate an XML-formatted output from Makefiles
-\subsection{Galaxy, 2010}
+\subsection{Galaxy (2010)}
\label{appendix:galaxy}
Galaxy (\url{https://galaxyproject.org}) is a web-based Genomics workbench \citep{goecks10}.
@@ -1255,7 +1248,7 @@ Besides some small differences, this seems to be very similar to GenePattern (Ap
-\subsection{Image Processing On Line (IPOL) journal, 2010}
+\subsection{Image Processing On Line journal, IPOL (2010)}
The IPOL journal (\url{https://www.ipol.im}) attempts to publish the full implementation details of proposed image processing algorithms as scientific papers \citep[first published article in July 2010]{limare11}.
An IPOL paper is a traditional research paper, but with a focus on implementation.
The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (extremely detailed).
@@ -1274,9 +1267,20 @@ Ideally (if any referee/reader was inclined to do so), the proposed template of
+\subsection{WINGS (2010)}
+\label{appendix:wings}
+
+WINGS (\url{https://wings-workflows.org}) is an automatic workflow generation algorithm \citep{gil10}.
+It runs on a centralized web server, requiring many dependencies (such that it is recommended to download Docker images).
+It allows users to define various workflow components (for example datasets, analysis components, etc.), along with high-level goals.
+It then uses selection and rejection algorithms to find, from a pool of analysis components, those that can satisfy the requested high-level constraints.
+\tonote{Read more about this}
+
+
+
-\subsection{Active Papers, 2011}
+\subsection{Active Papers (2011)}
\label{appendix:activepapers}
Active Papers (\url{http://www.activepapers.org}) attempts to package the code and data of a project into one file (in HDF5 format).
It was initially written in Java because compiled Java bytecode is portable to any machine with a JVM \citep[see][]{hinsen11}.
@@ -1303,7 +1307,7 @@ Hence the extra volume for data, and obscure HDF5 format that needs special tool
-\subsection{Collage Authoring Environment, 2011}
+\subsection{Collage Authoring Environment (2011)}
\label{appendix:collage}
The Collage Authoring Environment \citep{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citep{gabriel11}.
It is based on the GridSpace2\footnote{\url{http://dice.cyfronet.pl}} distributed computing environment\tonote{find citation}, which has a web-based graphical user interface.
@@ -1314,7 +1318,7 @@ Through its web-based interface, viewers of a paper can actively experiment with
-\subsection{SHARE, 2011}
+\subsection{SHARE (2011)}
\label{appendix:SHARE}
SHARE (\url{https://is.ieis.tue.nl/staff/pvgorp/share}) is a web portal that hosts virtual machines (VMs) for storing the environment of a research project; for more, see \citet{vangorp11}.
The project's top webpage above is still active; however, the virtual machines and the SHARE system itself have been removed since 2019.
@@ -1327,7 +1331,7 @@ The limitations of VMs for reproducibility were discussed in Appendix \ref{appen
-\subsection{Verifiable Computational Result (VCR), 2011}
+\subsection{Verifiable Computational Result, VCR (2011)}
\label{appendix:verifiableidentifier}
A ``verifiable computational result'' (\url{http://vcr.stanford.edu}) is an output (table, figure, etc.) that is associated with a ``verifiable result identifier'' (VRI); see \citet{gavish11}.
It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citep{gabriel11}.
@@ -1346,7 +1350,7 @@ Finally, the date of the files in the Matlab extension tarball are set to 2011,
-\subsection{SOLE, 2012}
+\subsection{SOLE (2012)}
\label{appendix:sole}
SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citep[for more, see ][]{pham12,malik13}.
An SO is any code/content that is wrapped in begin/end tags with an associated type and name.
@@ -1364,7 +1368,7 @@ The solution of the proposed template (where anything coming out of the analysis
-\subsection{Sumatra, 2012}
+\subsection{Sumatra (2012)}
Sumatra (\url{http://neuralensemble.org/sumatra}) attempts to capture the environment information of a running project \citep{davison12}.
It is written in Python and is a command-line wrapper over the analysis script; by controlling its execution, it is able to capture the environment it was run in.
The captured environment can be viewed in plain text or through a web interface.
@@ -1380,7 +1384,7 @@ Sumatra thus needs to know the language of the running program.
-\subsection{Research Object, 2013}
+\subsection{Research Object (2013)}
\label{appendix:researchobject}
The Research Object (\url{http://www.researchobject.org}) is a collection of meta-data ontologies to describe the aggregation of resources or workflows; see \citet{bechhofer13} and \citet{belhajjame15}.
@@ -1395,7 +1399,7 @@ But when a translator is written to convert the proposed template into research
-\subsection{Sciunit, 2015}
+\subsection{Sciunit (2015)}
\label{appendix:sciunit}
Sciunit (\url{https://sciunit.run}) defines ``sciunits'' that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
@@ -1415,7 +1419,7 @@ This is a major problem for scientific projects, in principle (not knowing how t
-\subsection{Binder, 2017}
+\subsection{Binder (2017)}
Binder (\url{https://mybinder.org}) is a tool to containerize already existing Jupyter-based processing steps.
Users simply add a set of Binder-recognized configuration files to their repository.
Binder will build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages comes from Conda).
@@ -1428,7 +1432,7 @@ Binder is used by \citet{jones19}.
-\subsection{Gigantum, 2017}
+\subsection{Gigantum (2017)}
Gigantum (\url{https://gigantum.com}) is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image and is free software (MIT License).
\tonote{I couldn't find the license to the server software yet, but it says that 20GB is provided for ``free'', so it is a little confusing if anyone can actually run the server.}
\tonote{I took the date from their PiPy page, where the first version 0.1 was published in November 2016.}
@@ -1445,7 +1449,7 @@ However, there is one directory which can be used to store files that must not b
-\subsection{Popper, 2017}
+\subsection{Popper (2017)}
\label{appendix:popper}
Popper (\url{https://falsifiable.us}) is a software implementation of the Popper Convention \citep{jimenez17}.
The Convention is a set of very generic conditions that are also applicable to the template proposed in this paper.
@@ -1467,7 +1471,7 @@ See Appendix \ref{appendix:independentenvironment} for more on containers, and A
-\subsection{Whole Tale, 2019}
+\subsection{Whole Tale (2019)}
\label{appendix:wholetale}
Whole Tale (\url{https://wholetale.org}) is a web-based platform for managing a project and organizing data provenance; see \citet{brinckman19}.
diff --git a/reproduce/software/make/high-level.mk b/reproduce/software/make/high-level.mk
index f46480a..735a24a 100644
--- a/reproduce/software/make/high-level.mk
+++ b/reproduce/software/make/high-level.mk
@@ -1220,13 +1220,32 @@ $(itidir)/texlive: reproduce/software/config/installation/texlive.mk \
texlive=$$(pdflatex --version | awk 'NR==1' | sed 's/.*(\(.*\))/\1/' \
| awk '{print $$NF}');
- # Package names and versions.
- rm -f $@
+ # Package names and versions. Note that unfortunately not all
+ # TeXLive packages have a version! So we also need to read the
+ # `revision' and `cat-date' elements and print them in case the
+ # version isn't available.
tlmgr info $(texlive-packages) --only-installed | awk \
- '$$1=="package:" {version=0; \
- if($$NF=="tex-gyre") name="texgyre"; \
- else name=$$NF} \
+ '$$1=="package:" { \
+ if(name!="") \
+ { \
+ if(version=="") \
+ { \
+ if(revision=="") \
+ { \
+ if(date=="") printf("%s (no version)\n", name); \
+ else printf("%s %s (date)\n", name, date); \
+ } \
+ else \
+ printf("%s %s (revision)\n", name, revision); \
+ } \
+ else \
+ printf("%s %s\n", name, version); \
+ } \
+ name=""; version=""; revision=""; date=""; \
+ if($$NF=="tex-gyre") name="texgyre"; \
+ else name=$$NF \
+ } \
+ $$1=="cat-date:" {date=$$NF} \
$$1=="cat-version:" {version=$$NF} \
- $$1=="cat-date:" {if(version==0) version=$$2; \
- printf("%s %s\n", name, version)}' >> $@
+ $$1=="revision:" {revision=$$NF}' > $@
fi
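For readability, the version fallback implemented in the recipe above can be sketched as a stand-alone (hypothetical) Make rule, written outside the `if ... fi' block of the real recipe. The target name is made up; the precedence mirrors the recipe (cat-version first, then revision, then cat-date, using the field names that tlmgr prints and the same output formats), and an END rule flushes the last package. Recipe lines must begin with a tab.

# Hypothetical rule illustrating the same fallback logic as above.
texlive-versions.txt:
	tlmgr info $(texlive-packages) --only-installed | awk \
	  'function report() { \
	     if(name=="") return; \
	     if(version!="") printf("%s %s\n", name, version); \
	     else if(revision!="") printf("%s %s (revision)\n", name, revision); \
	     else if(date!="") printf("%s %s (date)\n", name, date); \
	     else printf("%s (no version)\n", name); \
	   } \
	   $$1=="package:" { report(); \
	                     name=($$NF=="tex-gyre" ? "texgyre" : $$NF); \
	                     version=""; revision=""; date="" } \
	   $$1=="cat-version:" { version=$$NF } \
	   $$1=="revision:" { revision=$$NF } \
	   $$1=="cat-date:" { date=$$NF } \
	   END { report() }' > $@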
diff --git a/tex/src/figure-file-architecture.tex b/tex/src/figure-file-architecture.tex
new file mode 100644
index 0000000..3a7b760
--- /dev/null
+++ b/tex/src/figure-file-architecture.tex
@@ -0,0 +1,223 @@
+\begin{tikzpicture}[
+ line width=1.5pt,
+ black!50,
+ text=black,
+]
+
+ %% Use small fonts
+ \footnotesize
+
+ %% project
+ \node [at={(0,4cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum width=15cm,
+ minimum height=7.5cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{top-project-directory/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(-6.0cm,-3cm)}] {COPYING};
+ \node [node-nonterminal-thin, at={(-3.5cm,-3cm)}] {paper.tex};
+ \node [node-nonterminal-thin, at={(-1.0cm,-3cm)}] {project};
+ \node [node-nonterminal-thin, at={(+1.5cm,-3cm)}] {README.md};
+ \node [node-nonterminal-thin, at={(+4.25cm,-3cm)},
+ text width=2.5cm, text depth=-3pt] {README-hacking.md};
+
+ %% reproduce
+ \node [at={(-1.4cm,3.5cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=6cm,
+ minimum width=11.9cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{reproduce/}}] {};
+
+ %% reproduce/software
+ \node [at={(-4.35cm,3cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=5.3cm,
+ minimum width=5.7cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{software/}}] {};
+
+ %% reproduce/analysis
+ \node [at={(1.55cm,3cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=5.3cm,
+ minimum width=5.7cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{analysis/}}] {};
+
+ %% reproduce/software/config
+ \node [at={(-5.75cm,2.5cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{config/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(-5.75cm,1.8cm)}] {LOCAL.conf.in};
+ \node [node-nonterminal-thin, at={(-5.75cm,1.3cm)}] {versions.conf};
+ \node [node-nonterminal-thin, at={(-5.75cm,0.8cm)}] {checksums.conf};
+
+
+ %% reproduce/software/make
+ \node [at={(-2.95cm,2.5cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{make/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(-2.95cm,1.8cm)}] {basic.mk};
+ \node [node-nonterminal-thin, at={(-2.95cm,1.3cm)}] {high-level.mk};
+ \node [node-nonterminal-thin, at={(-2.95cm,0.8cm)}] {python.mk};
+
+ %% reproduce/software/bash
+ \node [at={(-5.75cm,0.2cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{bash/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(-5.75cm,-0.5cm)}] {bashrc.sh};
+ \node [node-nonterminal-thin, at={(-5.75cm,-1.0cm)}] {configure.sh};
+ \node [node-nonterminal-thin, at={(-5.75cm,-1.5cm)}] {git-pre-commit};
+
+ %% reproduce/software/bibtex
+ \node [at={(-2.95cm,0.2cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{bibtex/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(-2.95cm,-0.5cm)}] {fftw.tex};
+ \node [node-nonterminal-thin, at={(-2.95cm,-1.0cm)}] {numpy.tex};
+ \node [node-nonterminal-thin, at={(-2.95cm,-1.5cm)}] {gnuastro.tex};
+
+ %% reproduce/analysis/config
+ \node [at={(0.15cm,2.5cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{config/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(0.15cm,1.8cm)}] {INPUTS.conf};
+ \node [node-nonterminal-thin, at={(0.15cm,1.3cm)}] {param-1.conf};
+ \node [node-nonterminal-thin, at={(0.15cm,0.8cm)}] {param-2.conf};
+
+ %% reproduce/analysis/make
+ \node [at={(2.95cm,2.5cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{make/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(2.95cm,1.8cm)}] {initialize.mk};
+ \node [node-nonterminal-thin, at={(2.95cm,1.3cm)}] {download.mk};
+ \node [node-nonterminal-thin, at={(2.95cm,0.8cm)}] {analysis-1.mk};
+
+ %% reproduce/analysis/bash
+ \node [at={(0.15cm,0.2cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{bash/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(0.15cm,-0.5cm)}] {process-1.sh};
+ \node [node-nonterminal-thin, at={(0.15cm,-1.0cm)}] {process-2.sh};
+ \node [node-nonterminal-thin, at={(0.15cm,-1.5cm)}] {process-3.sh};
+
+ %% reproduce/analysis/python
+ \node [at={(2.95cm,0.2cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=2.1cm,
+ minimum width=2.6cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{python/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(2.95cm,-0.5cm)}] {operation-1.py};
+ \node [node-nonterminal-thin, at={(2.95cm,-1.0cm)}] {operation-2.py};
+ \node [node-nonterminal-thin, at={(2.95cm,-1.5cm)}] {fitting-plot.py};
+
+ %% tex
+ \node [at={(6cm,3.5cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum height=6cm,
+ minimum width=2.7cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{tex/}}] {};
+
+ %% tex/src
+ \node [at={(6cm,3cm)}, anchor=north,
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ minimum width=2.5cm,
+ minimum height=5.3cm,
+ draw=green!50!black!50,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{src/}}] {};
+ %% Its files
+ \node [node-nonterminal-thin, at={(6cm,2.2cm)}] {preamble-1.tex};
+ \node [node-nonterminal-thin, at={(6cm,1.7cm)}] {preamble-2.tex};
+ \node [node-nonterminal-thin, at={(6cm,1.2cm)}] {figure-1.tex};
+ \node [node-nonterminal-thin, at={(6cm,0.7cm)}] {figure-2.tex};
+
+\end{tikzpicture}
diff --git a/tex/src/preamble-pgfplots.tex b/tex/src/preamble-pgfplots.tex
index 3f467a6..57a0128 100644
--- a/tex/src/preamble-pgfplots.tex
+++ b/tex/src/preamble-pgfplots.tex
@@ -150,6 +150,17 @@
bottom color=green!80!black!50,
font=\ttfamily}}
+\tikzset{node-nonterminal-thin/.style={
+ rectangle,
+ very thick,
+ text centered,
+ top color=white,
+ text width=2cm,
+ minimum size=4mm,
+ draw=green!50!black!50,
+ bottom color=green!80!black!50,
+ font=\ttfamily\scriptsize}}
+
\tikzset{node-makefile/.style={
thick,
rectangle,
diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex
index e843903..f3a96e9 100644
--- a/tex/src/preamble-style.tex
+++ b/tex/src/preamble-style.tex
@@ -22,6 +22,7 @@
%% To allow a prefix to the enumeration.
\usepackage{enumitem}
+\setlist{nolistsep} % No space before `\begin{itemize}'
%% Horizontal line with spacing
\newcommand{\horizontalline}{\vspace{3mm}\hrule\vspace{3mm}}
diff --git a/tex/src/references.tex b/tex/src/references.tex
index 0f8171b..170ace6 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -1091,7 +1091,7 @@ Reproducible Research in Image Processing},
author = {{Peng}, R.D.},
title = {Reproducible Research in Computational Science},
journal = {Science},
- year = 2011,
+ year = {2011},
month = dec,
volume = 334,
pages = {1226},
@@ -1102,6 +1102,20 @@ Reproducible Research in Image Processing},
+@ARTICLE{gil10,
+ author = {Yolanda Gil and Pedro A. González-Calero and Jihie Kim and Joshua Moody and Varun Ratnakar},
+ title = {A semantic framework for automatic generation of computational workflows using distributed data and component catalogues},
+ journal = {Journal of Experimental \& Theoretical Artificial Intelligence},
+ year = {2010},
+ volume = {23},
+ pages = {389},
+ doi = {10.1080/0952813X.2010.490962},
+}
+
+
+
+
+
@ARTICLE{pence10,
author = {{Pence}, W.~D. and {Chiappetti}, L. and {Page}, C.~G. and {Shaw}, R.~A. and
{Stobie}, E.},
@@ -1126,8 +1140,8 @@ Reproducible Research in Image Processing},
author = {Jeremy Goecks and Anton Nekrutenko and James Taylor},
title = {Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences},
journal = {Genome Biology},
- year = 2010,
- volume = 11,
+ year = {2010},
+ volume = {11},
pages = {R86},
doi = {10.1186/gb-2010-11-8-r86},
}