authorMohammad Akhlaghi <mohammad@akhlaghi.org>2020-01-18 04:03:13 +0000
committerMohammad Akhlaghi <mohammad@akhlaghi.org>2020-01-18 04:30:24 +0000
commit5e830f5fb60c4bb186cbd4bd92908e187c037af4 (patch)
treeb727dc5653ec833f93efe29d80e806192563cfbe
parent4483a81c4254596dd2fa977e7a2faf6f28a7ac6f (diff)
Raw draft (until now as a separate repository) imported
Until now, I was writing the paper without the template. But we will soon be adding a tutorial to the template, and I thought it would be good to have an example demonstration here too. So I just brought the whole project into the template structure, allowing us to add the template analysis later when it's ready, and also allowing us to easily reproduce this paper of course (without having to worry about the host's TeXLive installation).
-rw-r--r--  paper.tex                                           | 1551
-rw-r--r--  reproduce/analysis/make/initialize.mk               |    1
-rw-r--r--  reproduce/analysis/make/paper.mk                    |    6
-rw-r--r--  reproduce/software/config/installation/texlive.mk   |    4
-rw-r--r--  tex/img/codata.png                                  |  bin 0 -> 112554 bytes
-rw-r--r--  tex/src/figure-data-lineage.tex                     |  236
-rw-r--r--  tex/src/figure-project-outline.tex                  |  229
-rw-r--r--  tex/src/preamble-biblatex.tex                       |    9
-rw-r--r--  tex/src/preamble-header.tex                         |   89
-rw-r--r--  tex/src/preamble-necessary.tex                      |   90
-rw-r--r--  tex/src/preamble-pgfplots.tex                       |   72
-rw-r--r--  tex/src/preamble-style.tex                          |  257
-rw-r--r--  tex/src/references.tex                              | 1530
13 files changed, 3663 insertions, 411 deletions
diff --git a/paper.tex b/paper.tex
index 0b92607..58bd4ef 100644
--- a/paper.tex
+++ b/paper.tex
@@ -1,6 +1,4 @@
-%% Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%% See the end of the file for license conditions.
-\documentclass[10pt, twocolumn]{article}
+\documentclass[10.5pt]{article}
%% This is a convenience variable if you are using PGFPlots to build plots
%% within LaTeX. If you want to import PDF files for figures directly, you
@@ -16,92 +14,1541 @@
%% only for discussion.
\newcommand{\highlightchanges}{}
-%% Necessary LaTeX preambles to include for relevant functionality. We want
-%% to start this file as fast as possible with the actual body of the
-%% paper, while keeping modularity in the preambles.
+%% Import the necessary preambles.
\input{tex/src/preamble-style.tex}
-\input{tex/src/preamble-header.tex}
-\input{tex/src/preamble-biblatex.tex}
+\input{tex/build/macros/project.tex}
\input{tex/src/preamble-pgfplots.tex}
-\input{tex/src/preamble-necessary.tex}
-
-%% Title and author information. For a more fine-grained control of the
-%% headers including author name, or paper info, see
-%% `tex/src/preamble-header.tex'. Note that if you plan to use a journal's
-%% LaTeX style file, you will probably not need to set them, and can also
-%% replace this "Title and author information" section with the journal's
-%% preferred format.
-\title{\large \uppercase{The paper's title goes here}}
-\author[1]{Your name}
-\author[2]{Coauthor one}
-\author[1,3]{Coauthor two}
-\affil[1]{The first affiliation in the list.; \url{your@email.address}}
-\affil[2]{Another affilation can be put here.}
-\affil[3]{And generally as many affiliations as you like.
-\par \emph{Received YYYY MM DD; accepted YYYY MM DD; published YYYY MM DD}}
-\date{}
+\input{tex/src/preamble-biblatex.tex}
+
+\title{Reproducible data analysis, preserving data lineage}
+\author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2}\\
+ {
+ \footnotesize\mplight
+ \textsuperscript{1} Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200 La Laguna, Tenerife, ES.\\
+    \textsuperscript{2} Facultad de F\'isica, Universidad de La Laguna, Avda. Astrof\'isico Fco. S\'anchez s/n, 38200 La Laguna, Tenerife, ES.\\
+ Corresponding author: Mohammad Akhlaghi
+ (\href{mailto:mohammad@akhlaghi.org}{\textcolor{black}{mohammad@akhlaghi.org}})
+ }}
+\date{}
+
+\begin{document}%\layout
+\thispagestyle{firstpage}
+\maketitle
+%% Abstract
+{\noindent\mpregular
+ Computational methods, implemented as software, are a major component of almost all scientific datasets, results, papers, or discoveries.
+  However, the full software environment and the details of how it was run to do an analysis cannot be reported with sufficient detail within the confines of a traditional research paper like this one.
+ It is thus becoming harder to archive, understand, reproduce, or validate a scientific result, even by the original author.
+  To facilitate reproducible, or archivable, data analysis, this paper introduces the ``Reproducible paper template''.
+ It provides the necessary low-level infrastructure in a generic design/template to easily allow the addition of higher-level analysis steps in individual projects.
+ It is designed, and later published, fully as plain-text files, with no binary component.
+ The workflow of a project that uses this template will contain the following steps that can all be executed automatically, or inspected/parsed as a plain text file: software tarball URLs, input data URLs/PIDs, checksums for the data and software tarballs, scripts to build the software (containing all the dependencies), scripts to do the analysis, and finally, the \LaTeX{} source of the narrative paper, or data description.
+ This paper itself is exactly reproducible (snapshot \projectversion).
+ \horizontalline
-%% Start creating the paper.
-\begin{document}
+ \noindent
+ {\mpbold Keywords:} Reproducibility, Workflows, scientific pipelines
+}
-%% Project abstract and keywords.
-\includeabstract{
+\horizontalline
- To be filled.
- \vspace{0.25cm}
- \textsl{Keywords}: Add some keywords for your research here.
- \textsl{Reproducible paper}: All quantitave results (numbers and plots)
- in this paper are exactly reproducible (version \projectversion{},
- \url{https://gitlab.com/makhlaghi/reproducible-paper}).}
-%% To add the first page's headers.
-\thispagestyle{firststyle}
-%% Start of main body.
\section{Introduction}
+\label{sec:introduction}
+
+In the last two decades, several technological advancements have profoundly affected how science is being done: much improved processing power, storage capacity and internet connections, combined with larger public datasets and more robust free and open source software solutions.
+Given these powerful tools, scientists are using ever more complex processing steps and datasets, often mixing many different software components in a single project.
+For example, see the nearly complete list of the necessary software in Appendix A of the following two research papers: \citet{akhlaghi19} and \citet{infante19}.
+
+The increased complexity of scientific analysis has made it impossible to describe all the analytical steps of a project in the traditional format of a published paper, to a sufficient level of detail.
+Citing this difficulty, many authors settle for describing only the very high-level generalities of their analysis.
+However, even the most basic calculations (like the mean of a distribution) can depend on the software implementation.
+Therefore, even if the raw collected data are published with the paper, it is very hard, and expensive, to study the validity/integrity of a result because of incomplete metadata.
+
+Due to the complexity of modern scientific analysis, a small deviation in the final result can originate in any of many different steps, each of which may be significant in its own right.
+This makes it critically important to share the research steps that go into the analysis in the finest possible detail.
+Attempting to reproduce an incompletely reported analysis is simply too expensive for anyone (even for the original authors, one year after publication).
+For example, \citet{smart18} describes how a 7-year-old conflict in theoretical condensed matter was only identified after the relevant codes were shared.
+
+The energy/cost needed to independently repeat the mistakes of other researchers is a waste of precious scientific funding, and erodes public trust in the sciences.
+Reproducibility is therefore a critical component of the current big-data era.
+Nature is already a black box which we are trying hard to unlock, or understand.
+Not being able to experiment on the methods of other researchers is an artificial, self-imposed black box wrapped around the original.
+
+The completeness of a paper's metadata can be measured by a simple question: given the same input datasets, can another researcher reproduce the exact same result automatically (without needing to contact the authors)?
+Several studies have actually attempted to answer this with different levels of detail.
+For example \citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science after that journal adopted a policy of publishing the data and code associated with the papers.
+Even though the authors were contacted, the success rate was only $26\%$, leading to the conclusion that the policy change alone is insufficient.
+\citet{allen18} \tonote{Add a short summary of its results}.
+\citet{zhao12} study ``workflow decay'' in papers using Taverna workflows (Appendix \ref{appendix:taverna}). \tonote{Review some of their major results}
+
+This problem is also widely felt in the community: \citet{baker16} found that $52\%$ and $38\%$ of the 1576 researchers surveyed respectively acknowledged ``a significant crisis'' and ``a slight crisis'' regarding the reproducibility of scientific results.
+Only $3\%$ believed that there is no reproducibility crisis.
+It must be added that this is not a recent problem; it was also strongly felt in previous decades.
+For example, \citet{baggerly09} complained about inadequate narrative descriptions of the analysis, showed the prevalence of simple errors, and called their work ``forensic bioinformatics''.
+Even earlier, \citet{ioannidis05} argued that ``most claimed research findings are false''.
+
+Given the scale of the problem, the US National Science Foundation (NSF), at the request of the US Congress, asked a committee of the National Academies of Sciences to assess its impact.
+The results were recently published by \citet{fineberg19} and provide a good review of the status.
+That committee does not declare a ``crisis'', but it stresses the importance of the problem and provides definitions (see Section \ref{sec:terminology}) and proposals.
+Earlier, in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}, and the best working solutions (at that time) were recognized.
+Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then.
+
+\begin{figure}[t]
+ \begin{center}
+ \includetikz{figure-project-outline}
+ \end{center}
+ \vspace{-17mm}
+ \caption{Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step.
+ The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs.
+ The red boxes with dashed edges highlight the main questions on the respective stage.
+    The orange box surrounding the software download and build phases marks the various commonly recognized solutions to the questions in it; for more, see Appendix \ref{appendix:jobmanagement}.
+ }
+\end{figure}
+
+Modern analysis tools are almost entirely implemented as software packages.
+This has led many scientists to adopt solutions that software developers use for reproducing software (for example to fix bugs, or avoid security issues).
+These tools and how they are used are thoroughly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}.
+However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering.
+This difference can be broken up into the following categories, which are described more fully below:
+1) Reading vs. executing, 2) Archiving how software is used and 3) Citation of the software/methods used for scientific credit.
+
+The first difference is because in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient).
+For a scientist it is more important to read/study a method of a paper that is 1, 10, or 100 years old.
+The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another random scientist to execute.
+Another scientist just needs to be assured that the commands they are reading are exactly what was (and can potentially be) executed.
+
+On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because they are done increasingly by pre-written software programs and libraries.
+Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard gained experience.
+Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citep{dicosmo18} do a wonderful job at long term preservation and archival of the software source code.
+However, preservation of the software's raw code is only part of the process; it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, in conjunction with which other software tools, and so on.
+
+The third major difference is scientific credit, which is measured in units of citations, not dollars.
+As described above, scientific software are playing an increasingly important role in modern science.
+Because of the domain-specific knowledge necessary to produce such software, they are mostly written by scientists for scientists.
+Therefore a significant amount of effort and research funding has gone into producing scientific software.
+At least for the software that do have an accompanying paper, it is thus important that those papers be cited when the software is used.
+
+Similar community concerns on the importance of metadata in research products have led to the wide adoption of the FAIR principles (Findable, Accessible, Interoperable, Reusable) for data management and stewardship \citep{wilkinson16}.
+These are very good generic guidelines that don't go into any implementation details.
+\tonote{Discuss this and other similar attempts in a little more detail.}
+
+The importance of publishing the processing source code along with a scientific paper, allowing critical analysis of the methods, is not a recently recognized problem.
+For example \citet{roberts69} discussed conventions in Fortran programming and documentation to help in publishing research codes.
+%\citet{anscombe73} showed how the distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
+\citet[Geophysicists]{claerbout1992} is the first paper we have found that discusses the issue directly in the same sense as this paper: a scientific paper must be accompanied by the code that generated its results.
+They describe a model they had started in 1990 and used in a PhD dissertation and 6 other documents, with nearly a thousand reproducible plots.
+The high-level analysis orchestration was organized through Cake, and the projects were distributed on CD-ROMs along with the analysis code, text and all non-proprietary software (including \LaTeX{}).
+It later inspired \citet{buckheit1995} to publish a reproducible paper (in Matlab).
+\tonote{Find some other historical examples.}
+
+In this paper, a solution is introduced that attempts to address the problems above and has already been used in scientific papers.
+In Section \ref{sec:terminology}, the problem that is addressed by this template is clearly defined, and Appendix \ref{appendix:existingsolutions} reviews some existing solutions and their pros and cons with respect to reproducibility in a scientific framework.
+Section \ref{sec:template} introduces the template, and its approach to the problem, with full design details.
+Finally in Section \ref{sec:discussion} the future prospects of using systems like this template are discussed.
+
+\begin{itemize}
+\item \citep{claerbout1992,schwab2000}: These papers describe the practical need very nicely.
+\item \citet{herndon14}: 1) simple typos (from the original spreadsheets, they found typos in number of rows). 2) Importance of releasing data (by the study they reproduced).
+\item \citet{ziemann16}: one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.
+\item \citet{ioannidis2009}: Two teams attempted to independently replicate results from 18 articles.
+ Two were successful, six were partial and ten weren't replicated.
+ The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis.
+\item \citet{miller06}: an incorrect column flipping in a custom analysis caused the retraction of 5 papers in major journals (including Science).
+\item \citet{gronenschild12}: effect of software version and environment on scientific results: encouraging researchers to not update environment.
+\item \citet{chang15}: 67 studies in well-regarded economics journals with data and code. Only 22 could be reproduced without contacting authors, more than half couldn't be replicated at all (they use ``replicate'').
+\item \citet{horvath15}: erratum, describing the effect of a software mistake on the results.
+\item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}.
+\end{itemize}
+
+
+
+
+
+
+
+
+\section{Definitions of important terms}
+\label{sec:terminology}
+
+The concepts and terminologies of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution providers.
+It is therefore important to clarify the specific terms used throughout this paper and its appendix, before starting the technical details.
+
+
+
+
+
+\subsection{Definition: input}
+\label{definition:input}
+Any computer file that may be usable in more than one project.
+The inputs of a project include data, software source code, and so on.
+See \citet{hinsen16} on the fundamental similarity of data and source code.
+Inputs may be encoded in plain text (for example tables of comma-separated values, CSV, or processing scripts) or custom binary formats, for example JPEG images, or domain-specific data formats \citep[e.g., FITS in astronomy, see][]{pence10}.
+
+Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project.
+However, they can, and most often will, be used in other/later projects also.
+Following the principle of modularity, it is therefore optimal to treat the inputs of any project as independent entities, not mixing them with how they are managed (how software is run on the data) within the project (see Section \ref{definition:project}).
+
+Inputs are nevertheless necessary for building and running any project.
+Some inputs may already be archived/published independently prior to the project's publication.
+In this case, they can easily be downloaded by independent projects and used.
+Otherwise, they can be published with the project, but as independent files, for example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19}.
+
+
+
+
+
+\subsection{Definition: output}
+\label{definition:output}
+Any computer file that is published at the end of the project.
+The output(s) can be datasets (terabyte-sized files, small table(s) or image(s), a single number, or a true/false (Boolean) outcome), automatically generated software source code, or any other file.
+The raw output files are commonly supplemented with a paper/report that summarizes them in a human-friendly readable/printable/narrative format.
+The report commonly includes highlights of the input/output datasets (or intermediate datasets) as plots, figures, tables or simple numbers blended into the text.
+
+The outputs can either be published independently on data servers that assign specific persistent identifiers (PIDs) to be cited in the final report or published paper (in a journal for example).
+Alternatively, the datasets can be published with the project source, for example \href{https://doi.org/10.5281/zenodo.1164774}{zenodo.1164774} \citep[Sections 7.3 \& 3.4]{bacon17}.
+
+
+
+
+
+\subsection{Definition: project}
+\label{definition:project}
+The most high-level series of operations that are done on input(s) to produce outputs.
+Because the project's report is also defined as an output (see above), besides the high-level analysis, the project's source also includes scripts/commands to produce plots, figures or tables.
+
+With this definition, this concept of a ``project'' is similar to ``workflow''.
+However, it is important to emphasize that the project's source code and inputs are distinct entities.
+For example the project may be written in the same programming language as one analysis step.
+Generally, the project source is defined as the most high-level source file that is unique to that individual project (its language is irrelevant).
+The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not of doing the analysis itself.
+A good project will follow the modularity principle: analysis scripts should be well-defined as an independently managed software source.
+For example modules in Python, packages in R, or libraries/programs in C/C++ that can be imported in higher-level project sources.
+
+
+
+
+
+\subsection{Definition: reproducibility \& replicability}
+\label{definition:reproduction}
+These terms have been used in the literature with various meanings, sometimes in a contradictory way.
+It is therefore necessary to clarify the precise usage of this term in this paper.
+But before doing so, it is important to highlight that in this paper, we are only considering computational analysis, in other words, analysis after data has been collected and stored as a file on a filesystem.
+Therefore, many of the definitions reviewed in \citet{plesser18}, that are about data collection, are out of context here.
+We adopt the same definition of \citet{leek17,fineberg19}, among others:
+
+\begin{itemize}
+\item {\bf\small Reproducibility:} (same inputs $\rightarrow$ consistent result).
+ Formally: ``obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis'' \citep{fineberg19}.
+ This is thus synonymous with ``computational reproducibility''.
+
+ \citet{fineberg19} allow non-bitwise or non-identical numeric outputs within their definition of reproducibility, but they also acknowledge that this flexibility can lead to complexities: what is an acceptable non-identical reproduction?
+  Exactly reproducible outputs can be precisely and automatically verified without statistical interpretations, even in a very complex analysis (involving many CPU cores, and random operations), see Section \ref{principle:verify}.
+ It also requires no expertise, as \citet{claerbout1992} put it: ``a clerk can do it''.
+ In this paper, unless otherwise mentioned, we only consider bitwise/exact reproducibility.
+
+\item {\bf\small Replicability:} (different inputs $\rightarrow$ consistent result).
+ Formally: ``obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data'' \citep{fineberg19}.
+
+Generally, since replicability involves new data collection, it can be expensive.
+For example the ``Reproducibility Project: Cancer Biology'' initiative started in 2013 to replicate 50 high-impact papers in cancer biology\footnote{\url{https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-biology}}.
+Even with funding of at least \$1.3 million, it later shrank to 18 projects \citep{kaiser18} due to very high costs.
+We also note that replicability doesn't have to be limited to different input data: using the same data, but with different implementations of methods, is also a replication attempt \citep[also known as ``in silico'' experiments, see][]{stevens03}.
+\end{itemize}
+
+Some have defined these terms in the opposite manner.
+Examples include \citet{hinsen15} and the policy guidelines of the Association of Computing Machinery\footnote{\url{https://www.acm.org/publications/policies/artifact-review-badging}} (ACM, dated April 2018).
+ACM has itself adopted the 2008 definitions of Vocabulaire international de m\'etrologie (VIM).
+
+Besides the two terms above, ``repeatability'' is also sometimes used in regards to the concept discussed here and must be clarified.
+For example \citet{ioannidis2009} use ``repeatability'' to encompass both the terms above.
+However, the ACM/VIM definition for repeatability is ``a researcher can reliably repeat her own computation''.
+Hence, in the ACM terminology, the only difference between replicability and repeatability is the ``team'' that is conducting the computation.
+In the context of this paper inputs are precisely defined (Section \ref{definition:input}): files with specific/registered checksums.
+Therefore our inputs are team-agnostic, allowing us to safely ignore ``repeatability'' as defined by ACM/VIM.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+\section{Principles of the proposed solution}
+\label{sec:principles}
+
+The core principle behind this work is simple: science is defined by its method, not its result.
+Statements that convey a ``result'' abound in all aspects of human life (e.g., in fiction, religion and science).
+The distinguishing factor is the ``method'' by which the result was derived.
+Science is the only class that attempts to be as objective as possible through the ``scientific method''.
+\citet{buckheit1995} nicely summarize this by pointing out that modern scientific papers (narrative combined with plots, tables and figures) are merely advertisements of scholarship; the actual scholarship is the scripts and software usage that went into doing the analysis.
+
+This paper thus proposes a framework that is optimally designed for both running a project, \emph{and} publication of the (computational) methods along with the published result.
+However, this paper is not the first attempted solution to this fundamental problem.
+Various solutions have been proposed since the early 1990s, see Appendix \ref{appendix:existingsolutions} for a review.
+To better highlight the differences with those methods, and also highlight the core foundations of this method (which help in understanding certain implementation choices), in the sub-sections below we elaborate on the core principle, by breaking it into logically independent sub-components.
+
+It is important to note that based on the definition of a project (Section \ref{definition:project}) and the first principle below (modularity, Section \ref{principle:modularity}), the proposed template is designed to be modular and thus agnostic to high-level choices.
+For example the choice of hardware (e.g., high performance computing facility or a personal computer), or high-level interfaces (i.e., beyond the raw project's source, for example a webpage or specialized graphic user interface).
+The proposed solution in this paper is a low-level skeleton that is designed to be easily adapted to any high-level, project-specific, choice.
+For example, in terms of hardware choice, a large simulation project simply cannot be run on smaller machines.
+However, when such a project is managed in the proposed system, the complete project (see Section \ref{principle:complete}) is published and readable by peers, who can be sure that what they are reading, contains the full/exact environment and commands that produced the result.
+In terms of interfaces, wrappers can be written over this core skeleton for various other high-level cosmetics, for example a web interface, or plugins to text editors or notebooks (see Appendix \ref{appendix:existingtools}).
+
+
+
+
+
+
+\subsection{Principle: Modularity}
+\label{principle:modularity}
+A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs having no side-effects.
+In a modular project, communication between the independent modules is explicit, providing optimizations on multiple levels:
+1) Execution: independent modules can run in parallel, or modules that don't need to be run (because their dependencies haven't changed) won't be re-done.
+2) Data lineage (for example, when experimenting on the project) and data provenance extraction (recording any dataset's origins).
+3) Citation: allowing others to credit specific parts of a project.
+
+This modularity principle doesn't just apply to the analysis, it also applies to the whole project, for example see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:terminology}.
+Based on this principle, an ideal project source shouldn't itself contain any analysis tools.
+Software that does a specific analysis must be a separate entity (a software package), the project should just contain its installation and usage instructions.
+
+For the most high-level analysis/operations, the boundary between the ``analysis'' and ``project'' can become blurry.
+It is thus inevitable that some highly project-specific analysis is ultimately kept within the project and not maintained as a separate project.
+We don't see this as a problem, because it can later spin-off into a separate software package if the need is felt by the community.
+%One nice example of an existing system that doesn't follow this principle is Madagascar, see Appendix \ref{appendix:madagascar}.
+
+
+
+
+
+\subsection{Principle: Plain text}
+\label{principle:text}
+A project's primary stored/archived format should be plain text with human-readable encoding\footnote{Plain text format doesn't include document container formats like `\texttt{.odf}' or `\texttt{.doc}', for software like LibreOffice or Microsoft Office.}, for example ASCII or Unicode (for the definition of a project, see Section \ref{definition:project}).
+The reason behind this principle is that opening, reading, or editing non-plain text (executable or binary) file formats needs specialized software.
+Binary formats will complicate various aspects of the project: its archival, automatic parsing, or human readability.
+This is a critical principle for long term preservation and portability: when the software needed to read a binary format has been deprecated or become obsolete and isn't installable on the running system, the project will no longer be readable/usable.
+
+A project that is solely in plain text format has many useful advantages in a scientific context (portable, with long term archivability, and generic parsing).
+For example the project itself can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development (for more, see the principle on project history in Section \ref{principle:history}).
+After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18}, enabling future projects to easily build on top of old ones.
+
+Archiving a binary version of the project is like archiving the well-cooked dish itself, which will become inedible with changes in its environment (temperature, humidity, and the natural world in general; the analogue of changing hardware).
+But archiving a project's plain text source is like archiving the dish's recipe (which is also in plain text!): you can re-cook it any time.
+When the environment is under perfect control (as in the proposed system), the binary/executable, or re-cooked, output will be identical.
+
+This principle doesn't conflict with having an executable or immediately-runnable project\footnote{In their recommendation 4-1 on reproducibility, \citet{fineberg19} mention: ``a detailed description of the study methods (ideally in executable form)''.}, because it is trivial to build a text-based project within an executable container or virtual machine.
+For more on containers, please see Appendix \ref{appendix:independentenvironment}.
+This is similar to how software is built from its plain-text source in such systems: a project is just higher-level software.
+A plain-text project's built/executable form can be published as an output of the project to help contemporary researchers (see Section \ref{definition:output}).
+
+
+
+
+
+\subsection{Principle: Complete/Self-contained}
+\label{principle:complete}
+A project should be self-contained, needing no particular features from the host.
+At build-time (when the project is building its necessary tools), the project shouldn't need anything beyond a minimal POSIX environment on the host, which is available in Unix-like operating systems such as GNU/Linux, the BSDs, or macOS.
+At run-time (once its environment/software have been built), it should not use any host operating system programs or libraries.
+
+Generally, a project's source should include the whole project: access to the inputs and validating them (see Section \ref{sec:terminology}), building necessary software (access to tarballs and instructions on configuring, building and installing those software), doing the analysis (run the software on the data) and creating the final report/paper PDF.
+This principle has several important consequences:
+
+\begin{enumerate}
+\item A complete project doesn't need an internet connection to build itself or to do its analysis and possibly make a report.
+ Of course this only holds when the analysis doesn't require internet, for example needing a live data feed.
+
+\item A complete project doesn't need any privileged/root permissions for system-wide installation or environment preparations.
+  Even when the user does have root privileges, interfering with the host operating system for a project may lead to many conflicts with the host or other projects.
+
+\item A complete project inherently includes the complete data lineage and provenance: automatically enabling a full backtrace of the output datasets or narrative, to raw inputs: data or software source code lines (for the definition of inputs, please see \ref{definition:input}).
+ This is very important because existing data provenance solutions require manual tagging within the data workflow or connecting the data with the paper's text (Appendix \ref{appendix:existingsolutions}).
+ Manual tagging can be highly subjective, prone to many errors, and incomplete.
+\end{enumerate}
+
+The first two points are particularly important for high performance computing (HPC) facilities.
+For security reasons, HPC users commonly don't have privileged permissions or internet access.
+
+
+
+
+\subsection{Principle: Minimal complexity (i.e., maximal compatibility)}
+\label{principle:complexity}
+An important measure of the quality of a project is how much it avoids complexity.
+In principle this is similar to Occam's razor: ``Never posit pluralities without necessity'' \citep{schaffer15}, but extrapolated to project management.
+In this context, Occam's razor can be interpreted as in the following cases:
+minimize the number of a project's software dependencies (there are often multiple ways of doing something),
+avoid complex relations between analysis steps (which is not unrelated to the principle of modularity in Section \ref{principle:modularity}),
+or avoid the vogue programming language of the day (since it's going to fall out of fashion soon and take the project down with it, see Appendix \ref{appendix:highlevelinworkflow}).
+This principle has several important consequences:
+\begin{itemize}
+\item Easier learning curve.
+Scientists can't adopt new tools and methods as fast as software developers.
+They have to invest the majority of their time on their own research domain.
+
+\item Future usage.
+Scientific projects require longevity: unlike software engineering, there is no end-of-life in science (e.g., Aristotle's work 2.5 millennia ago is still ``science'').
+Scientific projects that depend too much on an ever evolving, high-level software developing toolchain, will be harder to archive, run, or even study for their immediate and future peers.
+One recent example is the Popper software implementation: it was originally designed in the HashiCorp configuration language (HCL) because it was the default for organizing operations in GitHub.
+However, GitHub dropped HCL in October 2019; for more, see Appendix \ref{appendix:popper}.
+
+\item Compatible and extensible.
+  A project that has minimal complexity can easily adapt to any kind of data, programming language, host hardware or software, and so on.
+ It can also be easily extended for new inputs and environments.
+ For example when a project management system is designed only to manage Python functions (like CGAT-core, see Appendix \ref{appendix:jobmanagement}), it will be hard, inefficient and buggy for managing an analysis step that is written in R and another written in Fortran.
+\end{itemize}
+
+
+
+
+
+
+
+\subsection{Principle: automatic verification}
+\label{principle:verify}
+The project should automatically verify its outputs, such that expert knowledge won't be necessary to make sure a re-run was correct.
+This follows from the definition of exact reproducibility (Section \ref{definition:reproduction}).
+It is just important to emphasize that in practice, exact or bit-wise reproduction is very hard to implement at the level of a file.
+This is because many specialized scientific software commonly print the running date on their output files (which is very useful in its own context).
+For example in plain text tables, such meta-data are commonly printed as commented lines (usually starting with `\texttt{\#}').
+Therefore, when verifying a plain-text table, the checksum that is used to validate the data can be calculated after removing all commented lines.
+Fortunately, the tools to operate on specialized data formats also usually have ways to remove requested metadata (like creation date), or ignore them.
+For example the FITS standard in astronomy \citep{pence10} defines a special \texttt{DATASUM} keyword which is a checksum calculated only from the raw data, ignoring all metadata.
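+
+As a minimal illustrative sketch of such metadata-agnostic verification (the file name and the checksum value below are hypothetical, not part of any actual project), the commented lines can be stripped before computing the checksum with standard command-line tools:
+
+\begin{verbatim}
+# Verify a plain-text table while ignoring commented metadata lines
+# (for example a creation date written after a `#').  The expected
+# checksum would be recorded in the project's plain-text source.
+expected=0a1b2c3d4e5f...   # hypothetical value kept in the project source
+actual=$(grep -v '^#' out/table.txt | sha256sum | awk '{print $1}')
+if [ "x$actual" != "x$expected" ]; then
+    echo "Verification of out/table.txt failed"; exit 1
+fi
+\end{verbatim}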
-%% End of main body.
+\subsection{Principle: History and temporal provenance (version control)}
+\label{principle:history}
+No project is done in a single/first attempt.
+Projects evolve as they are being completed.
+It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
+This is often seen in scientific papers, with statements like ``we also [first] tried method [or parameter] XXXX, but YYYY is used here because it was shown to have better precision [or less bias, or etc.]''.
+A project's ``history'' is thus as scientifically relevant as the final, or published, state, or snapshot, of the project.
+All the outputs (datasets or narrative papers) need to contain the exact point in the project's history that produced them.
+For a complete project (see Section \ref{principle:complete}) that is under version control (like Git), this would be the unique commit checksum (for more on version control, see Appendix \ref{appendix:versioncontrol}).
+This principle thus benefits from the plain-text principle (Section \ref{principle:text}).
+Note that with our definition of a project (Section \ref{definition:project}), ``changes'' in the project include changes in the software building or versions, changes in the running environment, changes in the analysis, or changes in the narrative.
+After publication, the project's history can also be published on services like Software Heritage \citep{dicosmo18}, enabling precise citation and archival of all stages of the project's evolution.
+
+Taking this principle to a higher level, newer projects are built upon the shoulders of previous projects.
+A project management system should be able to provide this temporal connection between projects.
+Quantifying how newer projects relate to older projects will enable 1) scientists to simply use the relevant parts of an older project, 2) quantify the connections of various projects, which is primarily of interest for meta-research (research on research) or historical studies.
+In data science, ``provenance'' is used to track the analysis and original datasets that were used in producing a higher-level dataset.
+A system that uses this principle will also provide ``temporal provenance'', quantifying how a certain project grew/evolved in the time dimension.
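+
+As a minimal sketch of how this can be implemented (the file name and rule below are illustrative, not the template's exact implementation), the commit checksum can be written into a \LaTeX{} macro at build time, so that every built PDF reports the precise point in history that produced it:
+
+\begin{verbatim}
+# Record the project's Git commit (its temporal provenance) in a LaTeX
+# macro file that the narrative paper imports.  Recipe lines in Make
+# must begin with a TAB character.
+tex/build/macros/version.tex: .git/HEAD
+	printf '\\newcommand{\\projectversion}{%s}\n' \
+	       "$$(git describe --dirty --always)" > $@
+\end{verbatim}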
+
+
+
+\subsection{Principle: Free and open source software}
+\label{principle:freesoftware}
+Technically, as defined in Section \ref{definition:reproduction}, reproducibility is also possible with non-free and non-open-source software (a black box).
+However, as shown below, software freedom is an important pillar for the sciences.
+\begin{itemize}
+\item Based on the completeness principle (Section \ref{principle:complete}), it is possible to trace the output's provenance back to the exact source code lines within an analysis software.
+  If the software's source code isn't available, such important and useful provenance information is lost.
+\item A non-free software may not be runnable on a given hardware.
+  Since free software is modifiable, others can modify (or hire someone to modify) it and make it runnable on their particular platform.
+\item A non-free software cannot be distributed by the authors, making the whole community reliant only on the proprietary owner's server (even if the proprietary software doesn't ask for payments).
+ A project that uses free software can also release the necessary tarballs of the software it uses.
+ For example see \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} \citep{akhlaghi19} or \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937} \citep{infante19}.
+\item A core component of reproducibility is that anonymous peers should be able to confirm the result from the same datasets with minimal effort, which includes minimal financial cost beyond the hardware.
+\end{itemize}
+
+
+
+
+
+
+\subsection{Comparisons of principles with other attempts}
+
+\tonote{Also have a look at the goals in \citep{nowakowski11} and \citet{hinsen15}.}
+
+\tonote{Add in larger context:} Each step should be completely independent and not rely on state left in memory by previous steps.
+This is critical for the workflow to be able to restart at any arbitrary point.
+
+The Popper Convention \citep{jimenez17} for reproducible systems defines a list of guidelines for a ``popperized'' workflow.
+Note that in this section we only discuss the Popper convention, the software implementation/solution that is also discussed in \citet{jimenez17} is reviewed in Appendix \ref{appendix:popper}.
+The proposed template builds upon the Popper convention, with an extra addition that is very important for scientists using the template, and also for the long-term archivability of the project.
+The goals of the proposed template are listed below; those taken from the Popper convention are marked.
+\begin{enumerate}
+ \setlength\itemsep{-1mm}
+\item {\bf\small Self-contained project:} (from Popper) everything (e.g., code or paper's manuscript) must be included.
+\item {\bf\small Automated validation:} (from Popper) check if analysis can complete successfully.
+\item {\bf\small Toolchain Agnosticism:} (from Popper) unique identifiers and automatic download of inputs.
+\item {\bf\small Existing template:} (from Popper) containing the full research cycle, for new adopters.
+\item {\bf\small Minimize meta-dependencies}: managing the project with minimal requirements.
+\end{enumerate}
+
+By meta-dependencies, we mean software/language dependencies of the project management phase, not the core science software/dependencies.
+For example tools for the following operations: version control, job managers, containers, package managers, specialized editors, web interfaces and etc (see Appendix \ref{appendix:existingtools}).
+This additional goal is based on the real experience of scientists (with no software engineering background), derived since our first attempts \citep[see the version on arXiv]{akhlaghi15}.
+
+
+
+
+
+
+
+
+
+
+
+
+
+\section{Reproducible paper template}
+\label{sec:template}
+
+This template is based on a simple principle: the output's full lineage should be stored in a human-readable, plain text format, that can also be automatically run on a computer.
+The primary components of the research output's lineage can be summarized as:
+1) URLs/PIDs and checksums of external inputs.
+These external inputs can be datasets, if the project needs any (not simulations), or software source code that must be downloaded and built.
+2) Software building scripts.
+3) Full series of steps to run the software on the data, i.e., to do the data analysis.
+4) Building the narrative description of the project that describes the output (published paper).
+
+In short, the lineage is composed of persistent identifiers or URLs of the input data and software source code, as well as instructions/scripts on how to build the software and run it on the data to do the analysis.
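+
+As a minimal sketch of the first two components above (the URL, file name and checksum placeholder below are purely illustrative, not actual template contents), an input dataset can be imported and validated with a rule like the following; software tarballs can be treated in exactly the same way:
+
+\begin{verbatim}
+# Download an input dataset only if it is not already present, and
+# verify its recorded checksum before any analysis step may use it
+# (recipe lines start with a TAB character).
+input/catalog.fits:
+	mkdir -p input
+	wget -O $@ "https://example.org/data/catalog.fits"
+	echo "<checksum-recorded-in-project-source>  $@" | sha256sum --check
+\end{verbatim}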
+
+The template started with \citet{akhlaghi15}, which was a paper describing a new detection algorithm, and further evolved in \citet{bacon17}, in particular the following two sections of that paper: \citet{akhlaghi18a} and \citet{akhlaghi18b}.
+
+\tonote{Mention the \citet{smart18} paper on how a binary version is not sufficient, low-level, human-readable plain text source is mandatory.}
+
+\tonote{Find a place to put this:} Note that input files are a subset of built files: they are imported/built (copied or downloaded) using the project's instructions, from an external location/repository.
+ This principle is inspired by a similar concept in the free and open source software building tools like the GNU Build system (the standard `\texttt{./configure}', `\texttt{make}' and `\texttt{make install}'), or CMake.
+ Software developers already have decades of experience with the problems of mixing hand-written source files with the automatically generated (built) files.
+  This principle has proved to be exceptionally useful in this model.
+
+
+
+\subsection{Make}
+\label{sec:make}
+The template manages the higher-level dependency tracking of a project using Make.
+The very high-level user-interface of the template is written in shell script (in particular GNU Bash).
+All the individual operations (like downloading and installing software, or doing the analysis) are done through Makefiles that are called by that script.
+
+\tonote{\citet{schwab2000} discuss the ease of students learning Make.}
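+
+To illustrate how Make encodes this lineage (the script and file names below are hypothetical, not those of the actual template), each rule declares a target, its prerequisites, and the recipe that builds the target; Make then only re-runs the steps whose prerequisites have changed:
+
+\begin{verbatim}
+# The paper depends on the hand-written narrative and on the
+# automatically generated macros; those macros depend on the analysis
+# output, which depends on the input data and the analysis script.
+paper.pdf: paper.tex tex/build/macros/project.tex
+	pdflatex paper.tex
+
+tex/build/macros/project.tex: tex/build/macros/analysis.tex
+	cat $^ > $@
+
+tex/build/macros/analysis.tex: out/result.txt
+	printf '\\newcommand{\\resultvalue}{%s}\n' "$$(cat $<)" > $@
+
+out/result.txt: input/catalog.fits reproduce/analysis/analyze.sh
+	./reproduce/analysis/analyze.sh input/catalog.fits > $@
+\end{verbatim}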
+
+\begin{figure}[t]
+ \begin{center}
+ \includetikz{figure-data-lineage}
+ \end{center}
+ \vspace{-7mm}
+ \caption{\label{fig:analysisworkflow}Schematic representation of built file dependencies in a hypothetical project/pipeline using the reproducible paper template.
+ Each colored box is a file in the project and the arrows show the dependencies between them.
+ Green files/boxes are plain text files that are under version control and in the source-directory.
+    Blue files/boxes are output files of various steps in the build-directory, shown inside the Makefile (\texttt{*.mk}) that generates them.
+ For example \texttt{paper.pdf} depends on \texttt{project.tex} (in the build directory and generated automatically) and \texttt{paper.tex} (in the source directory and written by hand).
+ In turn, \texttt{project.tex} depends on all the \texttt{*.tex} files at the bottom of the Makefiles above it.
+ }
+\end{figure}
+
+
+
+
+\subsection{Software citation}
+\label{sec:softwarecitation}
+
+\begin{itemize}
+\item \citet{smith16}: principles of software citation.
+\end{itemize}
+
+
+
+
+
+\section{Discussion}
+\label{sec:discussion}
+
+\begin{itemize}
+\item Science is defined by its method, not its results.
+  Just as papers are quality checked for reasonable English (which is not necessary for conveying the final result), the necessities of modern science require a similar check on a reasonable review of the computation, which is easiest to do when the result is exactly reproducible.
+\item Initiative such as \url{https://software.ac.uk} (UK) and \url{http://urssi.us} (USA) are good attempts at improving the quality of research software.
+\item Hiring software engineers is not the solution: the language of science has changed.
+ Could Galileo have created a telescope if he wasn't familiar with what a lens is?
+  Science is not independent of its tools.
+\item The actual processing is archived in multiple places (with the paper on arXiv, with the data on Zenodo, on a Git repository, in future versions of the project).
+\item As shown by the very common use of something like Conda, software (even free software) is mainly seen in executable form, but this is wrong: even if the software source code is no longer compilable, it is still readable.
+\item The software/workflow is not independent of the paper.
+\item Cost of archiving is a critical issue (the NSF director mentions this during the first Panel of the National Academies meeting\footnote{\url{https://vimeo.com/367085708} (around minute 45:00)}).
+\item Meta-science (or ``Science of science'', ``economics of science'', ``Research on research'') and its importance.
+\item Provenance tracking (like some tools in Appendix \ref{appendix:existingsolutions}) is built-in and doesn't need any manual tagging.
+\item Important that a project starts by following good practice \citep{fineberg19}, not an extra step in the end.
+\item It is possible to write graphic user interface wrappers like those in Appendix \ref{appendix:existingsolutions}.
+\item What is often discussed is ``taking data and applying different methods to it'', but an even more productive question can be this: ``take (exact) methods and give different data to them''.
+\item \citet{munafo19} discuss how collective action is necessary.
+\item Research objects (Appendix \ref{appendix:researchobject}) can automatically be generated from the Makefiles, we can also apply special commenting conventions, to be included as annotations/descriptions in the research object metadata.
+\item Provenance between projects: through Git, all projects based on this template are automatically connected, but also through inputs/outputs, the lineage of a project can be traced back to projects before it also.
+\item \citet{gibney20}: After code submission was encouraged by the Neural Information Processing Systems (NeurIPS), the frac
+\end{itemize}
+
+
+
+
+
+%% Acknowledgements
\section{Acknowledgements}
+Work on the reproducible paper template has been funded by the Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) scholarship and its Grant-in-Aid for Scientific Research (21244012, 24253003), the European Research Council (ERC) advanced grant 339659-MUSICOS, European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement No 721463 to the SUNDIAL ITN, and from the Spanish Ministry of Economy and Competitiveness (MINECO) under grant number AYA2016-76219-P.
+The reproducible paper template was also supported by European Union’s Horizon 2020 (H2020) research and innovation programme via the RDA EU 4.0 project (ref. GA no. 777388).
+
-This research was partly done with the reproducible paper template
-\projectversion. Work on Gnuastro and the reproducible paper template has
-been funded by the Japanese Ministry of Education, Culture, Sports,
-Science, and Technology (MEXT) scholarship and its Grant-in-Aid for
-Scientific Research (21244012, 24253003), the European Research Council
-(ERC) advanced grant 339659-MUSICOS, European Union’s Horizon 2020 research
-and innovation programme under Marie Sklodowska-Curie grant agreement No
-721463 to the SUNDIAL ITN, and from the Spanish Ministry of Economy and
-Competitiveness (MINECO) under grant number AYA2016-76219-P. The
-reproducible paper template was also supported by European Union’s Horizon
-2020 (H2020) research and innovation programme via the RDA EU 4.0 project
-(ref. GA no. 777388).
%% Tell BibLaTeX to put the bibliography list here.
\printbibliography
-%% Start appendix.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+\newpage
\appendix
+\noindent
+ {\Large\bf Appendices}\\
+\vspace{-5mm}
+\section{Survey of existing tools for various phases}
+\label{appendix:existingtools}
+
+Conducting a reproducible research project is a high-level process, which involves using various lower-level tools.
+In this section, a survey of the most commonly used lower-level tools for various aspects of a reproducible project is presented, with an introduction to how each relates to reproducibility and the proposed template.
+In particular, we focus on the tools used within the proposed template, and also on tools that are used by the existing reproducible frameworks that are reviewed in Appendix \ref{appendix:existingsolutions}.
+Some existing solutions for managing the different parts of a reproducible workflow are thus also reviewed here.
+
+
+
+
+
+\subsection{Independent environment}
+\label{appendix:independentenvironment}
+
+There are three general ways of controlling the environment: 1) Virtual machines, 2) Containers, 3) controlled build and environment.
+Below, a short description of each solution is provided.
+
+\subsubsection{Virtual machines}
+\label{appendix:virtualmachines}
+Virtual machines (VMs) keep a copy of a full operating system that can be run on other operating systems.
+This includes the lowest-level kernel which connects to the hardware.
+VMs thus provide the ultimate control one can have over the run-time environment of an analysis.
+However, the VM's kernel does not talk directly to the hardware that is doing the analysis, it talks to a simulated hardware that is provided by the operating system's kernel.
+Therefore, a process that is run inside a virtual machine can be much slower than one that is run on a native kernel.
+VMs are used by cloud providers, enabling them to sell fully independent operating systems on their large servers to their customers (where the customer can have root access).
+But because of all the overhead, they aren't often used for reproducing individual processes.
+
+\subsubsection{Containers}
+Containers are higher-level constructs that don't have their own kernel, they talk directly with the host operating system kernel, but have their own independent software for everything else.
+Therefore, they have much less overhead in storage, and hardware/CPU access.
+Users often choose an operating system for the container's contents (most commonly GNU/Linux distributions, which are free software).
+
+Below we'll review some of the most common container solutions: Docker and Singularity.
+
+\begin{itemize}
+\item {\bf\small Docker containers:} Docker is one of the most popular tools today for keeping an independent analysis environment.
+  It is primarily driven by the needs of software developers: they need to be able to reproduce a bug on the ``cloud'' (which is just a remote VM), where they have root access.
+ A Docker container is composed of independent Docker ``images'' that are built with Dockerfiles.
+ It is possible to precisely version/tag the images that are imported (to avoid downloading the latest/different version in a future build).
+ To have a reproducible Docker image, it must be ensured that all the imported Docker images check their dependency tags down to the initial image which contains the kernel and C library.
+
+ Another important drawback of Docker for scientific applications is that it runs as a daemon (a program that is always running in the background) with root permissions.
+  This is a major security flaw that discourages many high performance computing (HPC) facilities from installing it.
+
+\item {\bf\small Singularity:} Singularity is a single-image container (unlike Docker which is composed of modular/independent images).
+ Although it needs root permissions to be installed on the system (once), it doesn't require root permissions every time it is run.
+ Its main program is also not a daemon, but a normal program that can be stopped.
+  These features make it much easier for HPC administrators to install Singularity than Docker.
+  However, the fact that it requires root access for the initial install is still a hindrance for a random project: if it's not present on the HPC, the project can't be run as a normal user.
+
+\item {\bf\small Virtualenv:} \tonote{Discuss it later.}
+\end{itemize}
+
+When the installed software within VMs or containers is precisely under control, they are good solutions to reproducibly ``running''/repeating an analysis.
+However, because they store the already-built software environment, they are not good for ``studying'' the analysis (how the environment was built).
+Currently, the most common practice to install software within containers is to use the package manager of the operating system within the image, usually a minimal Debian-based GNU/Linux operating system.
+For example, the Dockerfile\footnote{\url{https://github.com/benmarwick/1989-excavation-report-Madjedbebe/blob/master/Dockerfile}} in the reproducible scripts of \citet{clarkso15} uses `\texttt{sudo apt-get install r-cran-rjags -y}' to install the R interface to the JAGS Bayesian statistics library (rjags).
+However, the operating system package managers aren't static.
+Therefore the versions of the tools downloaded and used within the Docker image will change depending on when it was built.
+At the time \citet{clarkso15} was published (June 2015), the \texttt{apt} command above would download and install rjags 3-15, but today (January 2020), it will install rjags 4-10.
+Such problems can be corrected with robust/reproducible package managers like Nix or GNU Guix within the Docker image (see Appendix \ref{appendix:packagemanagement}), but this is rarely practiced today.
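+
+To make the issue concrete, the following shell commands (as they would appear in a Dockerfile's \texttt{RUN} instructions) contrast the unpinned installation above with a pinned one; the version string below is only an illustrative placeholder, not a verified package version:
+
+\begin{verbatim}
+# Unpinned: installs whatever version the repository serves today.
+apt-get install -y r-cran-rjags
+
+# Pinned to an explicit (placeholder) version; note that even this
+# fails once the binary is removed from the repository's servers.
+apt-get install -y r-cran-rjags=4-10-1
+\end{verbatim}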
+
+\subsubsection{Package managers}
+The virtual machine and container solutions mentioned above install software in standard Unix locations (for example \texttt{/usr/bin}), but in their own independent operating systems.
+But if software are built in, and used from, a non-standard, project-specific directory, we can have an independent build and run-time environment without needing root access or the extra layers of a container or VM.
+This leads us to the final method of having an independent environment: a controlled build of the software and its run-time environment.
+Because this is highly intertwined with the way software are installed, we'll describe it in more detail in Appendix \ref{appendix:packagemanagement}, where package managers are reviewed.
+
+
+
+
+
+\subsection{Package management}
+\label{appendix:packagemanagement}
+
+Package management is the process of automating the installation of software.
+A package manager thus records, for each software package, the information needed to install it automatically: the URL of the software's tarball, the other software that it depends on, and how to configure and build it.
+
+Here, we review some of the package management solutions that are used by the reproducibility solutions surveyed in Appendix \ref{appendix:existingsolutions}\footnote{For a list of existing package managers, please see \url{https://en.wikipedia.org/wiki/List_of_software_package_management_systems}}.
+Note that we are not including package managers that are limited to a single language, for example \texttt{pip} (for Python) or \texttt{tlmgr} (for \LaTeX).
+
+\begin{itemize}
+\item {\bf\small Operating system's package manager:}
+The most commonly used package managers are those of the host operating system, for example `\texttt{apt}' or `\texttt{yum}' respectively on Debian-based, or RedHat-based GNU/Linux operating systems (among many others).
+
+These package managers are tightly intertwined with the operating system.
+Therefore they require root access, and arbitrary control (for different projects) of the versions and configuration options of software within them is not trivial/possible: for example a specific version of a software package that may be necessary for a project may conflict with an operating system component, or with another project.
+Furthermore, in many operating systems it is only possible to have one version of a software package at any moment (not including Nix or GNU Guix, which can also be independent of the operating system and are described below).
+Hence if two projects need different versions of a software package, it is not possible to work on them at the same time.
+
+When a full container or virtual machine (see Appendix \ref{appendix:independentenvironment}) is used for each project, it is common for projects to use the containerized operating system's package manager.
+However, it is important to remember that operating system package managers are not static: software are updated on their servers.
+For example, simply adding `\texttt{apt install gcc}' to a \texttt{Dockerfile} will install different versions of GCC based on when the Docker image is created.
+Requesting a specific version also doesn't fully address the problem, because the package manager will also download and install that package's dependencies.
+Hence a fixed version of the dependencies must also be included.
+
+In summary, these package managers are primarily meant for the operating system components.
+Hence, many robust reproducible analysis solutions (reviewed in Appendix \ref{appendix:existingsolutions}) don't use the host's package manager, but an independent package manager, like the ones below.
+
+\item {\bf\small Conda/Anaconda:} Conda is an independent package manager that can be used on GNU/Linux, macOS, or Windows operating systems, although not all software packages are available on all operating systems.
+ Conda is able to maintain an approximately independent environment on an operating system without requiring root access.
+
+ Conda tracks the dependencies of a package/environment through a YAML formatted file, where the necessary software and their acceptable versions are listed.
+  However, it is not possible to fix the exact versions of the dependencies through these YAML files alone (see the short sketch after this list).
+ This is thoroughly discussed under issue 787 of \texttt{conda-forge.github.io}\footnote{\url{https://github.com/conda-forge/conda-forge.github.io/issues/787}}, May 2019.
+  In that discussion, the authors of \citet{uhse19} report that the half-life of their environment (defined in a YAML file) is 3 months, and that at least one of their dependencies breaks shortly after this period.
+ The main reply they got in the discussion is to build the Conda environment in a container, which is also the suggested solution by \citet{gruning18}.
+  However, as described in Appendix \ref{appendix:independentenvironment}, containers just hide the reproducibility problem, they don't fix it: containers aren't static and need to evolve (i.e., be re-built) with the project.
+ Given these limitations, \citet{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
+
+  Conda installs with a shell script that contains a binary blob (over 500 megabytes, embedded in the shell script).
+ This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
+ After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
+ However, the resulting environment is not fully independent of the host operating system as described below:
+
+ \begin{itemize}
+ \item The Conda installation directory is present at the start of environment variables like \texttt{PATH} (which is used to find programs to run) and other such environment variables.
+ However, the host operating system's directories are also appended afterwards.
+  Therefore, a user or script may not notice that a software package being used is actually coming from the operating system, not from the controlled Conda installation.
+
+ \item Generally, by default Conda relies heavily on the operating system and doesn't include core analysis components like \texttt{mkdir}, \texttt{ls} or \texttt{cp}.
+  Although they are generally the same between different Unix-like operating systems, they have their differences.
+ For example `\texttt{mkdir -p}' is a common way to build directories, but this option is only available with GNU Coreutils (default on GNU/Linux systems).
+  Running the same command within a Conda environment on macOS, for example, will crash.
+ Important packages like GNU Coreutils are available in channels like conda-forge, but they are not the default.
+  Therefore, many users may not recognize this, and failing to account for it will cause unexpected crashes.
+
+  \item Many major Conda packaging ``channels'' (for example the core Anaconda channel, or the very popular conda-forge channel) don't include the C library that a package was built with as a dependency.
+ They rely on the host operating system's C library.
+ C is the core language of most modern operating systems and even higher-level languages like Python or R are written in it, and need it to run.
+ Therefore if the host operating system's C library is different from the C library that a package was built with, a Conda-packaged program will crash and the project will not be executable.
+ Theoretically, it is possible to define a new Conda ``channel'' which includes the C library as a dependency of its software packages, but it will take too much time for any individual team to practically implement all their necessary packages, up to their high-level science software.
+
+ \item Conda does allow a package to depend on a special build of its prerequisites (specified by a checksum, fixing its version and the version of its dependencies).
+  However, this is rarely practiced in the main Git repositories of channels like Anaconda and conda-forge: only the names of the high-level prerequisite packages are listed in a package's \texttt{meta.yaml} file, which is version-controlled.
+ Therefore two builds of the package from the same Git repository will result in different tarballs (depending on what prerequisites were present at build time).
+ In the Conda tarball (that contains the binaries and is not under version control) \texttt{meta.yaml} does include the exact versions of most build-time dependencies.
+ However, because the different software of one project may have been built at different times, if they depend on different versions of a single software there will be a conflict and the tarball can't be rebuilt, or the project can't be run.
+ \end{itemize}
+
+  As reviewed above, the low-level dependence of Conda on the host operating system's components and on build-time conditions is the primary reason that it is very fast to install (thus making it an attractive tool for software developers who just need to reproduce a bug in a few minutes).
+ However, these same factors are major caveats in a scientific scenario, where long-term archivability, readability or usability are important.
+
+
+
+\item {\bf\small Nix or GNU Guix:} Nix \citep{dolstra04} and GNU Guix \citep{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
+ Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
+ GNU Guix is based on Nix, so we'll focus the review here on Nix.
+
+  The Nix approach to package management is unique in that it allows exact tracking of all the dependencies, and allows multiple versions of a software package to co-exist; for more details see \citet{dolstra04}.
+ In summary, a unique hash is created from all the components that go into the building of the package.
+ That hash is then prefixed to the software's installation directory.
+ For example \citep[from][]{dolstra04} if a certain build of GNU C Library 2.3.2 has a hash of \texttt{8d013ea878d0}, then it is installed under \texttt{/nix/store/8d013ea878d0-glibc-2.3.2} and all software that are compiled with it (and thus need it to run) will link to this unique address.
+ This allows for multiple versions of the software to co-exist on the system, while keeping an accurate dependency tree.
+
+  As mentioned in \citet{courtes15}, one major caveat with using these package managers is that they require a daemon with root privileges.
+ This is necessary ``to use the Linux kernel container facilities that allow it to isolate build processes and maximize build reproducibility''.
+
+  \tonote{While inspecting the Guix build instructions for some software, I noticed they don't actually mention the version names. This creates a similar issue to the Conda example above (how to regenerate the software with a given hash, given that its dependency versions aren't explicitly mentioned). Ask Ludo' about this.}
+
+
+\item {\bf\small Spack:} is a package manager that is also influenced by Nix (similar to GNU Guix), see \citet{gamblin15}.
+ But unlike Nix or GNU Guix, it doesn't aim for full, bit-wise reproducibility and can be built without root access in any generic location.
+ It relies on the host operating system for the C library.
+
+ Spack is fully written in Python, where each software package is an instance of a class, which defines how it should be downloaded, configured, built and installed.
+ Therefore if the proper version of Python is not present, Spack cannot be used and when incompatibilities arise in future versions of Python (similar to how Python 3 is not compatible with Python 2), software building recipes, or the whole system, have to be upgraded.
+ Because of such bootstrapping problems (for example how Spack needs Python to build Python and other software), it is generally a good practice to use simpler, lower-level languages/systems for a low-level operation like package management.
+\end{itemize}
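+
+As referenced in the Conda item above, the difference between a loosely specified environment and a fully pinned one can be sketched with the following commands (the environment and file names are hypothetical; even a pinned export only helps for as long as those exact package builds remain downloadable from the channels):
+
+\begin{verbatim}
+# A hand-written YAML file usually constrains versions only loosely,
+# so re-creating the environment later can pull newer builds.
+conda env create --file environment.yml
+
+# Exporting the resolved environment records the exact versions of
+# the packages that were actually installed at this moment.
+conda env export --name myproject > environment-pinned.yml
+conda env create --file environment-pinned.yml
+\end{verbatim}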
+
+There are a few common issues regarding generic package managers that hinder their usage for high-level scientific projects, as listed below:
+\begin{itemize}
+\item {\bf\small Pre-compiled/binary downloads:} Most package managers (excluding Nix or its derivatives) only download the software in a binary (pre-compiled) format.
+ This allows users to download it very fast and almost instantaneously be able to run it.
+ However, to provide for this, servers need to keep binary files for each build of the software on different operating systems (for example Conda needs to keep binaries for Windows, macOS and GNU/Linux operating systems).
+  It is also necessary for them to store binaries for each build of the software, including builds against different versions of its dependencies.
+  This will take major space on the servers; therefore, once the shelf-life of a binary has expired, it will not be easy to reproduce a project that depends on it.
+
+ For example Debian's Long Term Support is only valid for 5 years.
+ Pre-built binaries of the ``Stable'' branch will only be kept during this period and this branch only gets updated once every two years.
+  However, scientific software commonly evolves at much faster rates.
+  Therefore scientific projects using Debian often use the ``Testing'' branch, which has more up-to-date features.
+  The problem is that binaries on the Testing branch are immediately removed when a newer version is available and no other package depends on them.
+ This is not limited to operating systems, similar problems are also reported in Conda for example, see the discussion of Conda above for one real-world example.
+
+
+\item {\bf\small Adding high-level software:} Packaging new software is not trivial and needs a good level of knowledge/experience with that package manager.
+For example each has its own special syntax/standards/languages, with pre-defined variables that must already be known to someone packaging new software.
+However, in many scenarios, the most high-level software of a research project are written and used only by the team that is doing the research, even when they are distributed with free licenses on open repositories.
+Although active package manager members are commonly very supportive in helping to package new software, many teams may not take that extra effort/time.
+They will thus manually install their high-level software in an uncontrolled, or non-standard way, thus jeopardizing the reproducibility of the whole work.
+
+\item {\bf\small Built for a generic scenario:} All the package managers above are built for one full system that can possibly be run by multiple projects.
+  This can result in not fully documenting the process by which each package was built (for example the versions of the dependent libraries of a package).
+\end{itemize}
+
+Addressing these issues has been the basic raison d'\^etre of the proposed template's approach to package management: instructions to download and build the packages are included within the actual science project (thus fully customizable), and no special/new syntax/language is used; software downloading, building and installation are done with the same language/syntax that researchers use to manage their research, namely the shell (GNU Bash) and Make (GNU Make).
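+
+As a rough sketch of this approach (the package name, version, URL and the \texttt{BDIR} variable below are placeholders, not the template's actual rules), downloading and building a package can be expressed as ordinary Make rules with shell recipes:
+
+\begin{verbatim}
+# Hypothetical sketch: download a source tarball, then configure,
+# build and install it into the project-specific directory $(BDIR).
+mysoft-1.0.tar.gz:
+        wget -O $@ https://example.org/src/mysoft-1.0.tar.gz
+
+$(BDIR)/bin/mysoft: mysoft-1.0.tar.gz
+        tar -xf $<
+        cd mysoft-1.0 \
+          && ./configure --prefix=$(BDIR) \
+          && make \
+          && make install
+\end{verbatim}
+
+In a real Makefile the recipe lines must begin with a tab character; the indentation above is only schematic.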
+
+
+
+\subsection{Version control}
+\label{appendix:versioncontrol}
+A scientific project is not written in a day.
+It commonly takes more than a year (for example a PhD project is 3 or 4 years).
+During this time, the project evolves significantly from its starting point, and components are added or updated constantly as it approaches completion.
+Combined with the complexity of modern projects, it is not trivial to manually track this evolution and its effect on the final output: files produced in one stage of the project may be used at later stages (where the project has evolved).
+Furthermore, scientific projects do not progress linearly: earlier stages of the analysis are often modified after later stages are written.
+This is a natural consequence of the scientific method, where progress is defined by experimentation and modification of hypotheses (earlier phases).
+
+It is thus very important for the integrity of a scientific project that the state/version of its processing is recorded as the project evolves, for example when better methods are found or more data arrive.
+Any intermediate dataset that is produced should also be tagged with the version of the project at the time it was created.
+In this way, later processing stages can make sure that those datasets can safely be used, i.e., that no change has been made in the steps that produced them.
+
+Solutions to keep track of a project's history have existed since the early days of software engineering in the 1970s and they have constantly improved over the last decades.
+Today the distributed model of ``version control'' is the most common, where the full history of the project is stored locally on different systems and can easily be integrated.
+There are many existing version control solutions, for example CVS, SVN, Mercurial, GNU Bazaar, or GNU Arch.
+However, currently Git is by far the most commonly used, both in individual projects and in long-term archival systems like Software Heritage \citep{dicosmo18}; it is also the system that is used in the proposed template, so we'll only review it here.
+
+\begin{itemize}
+\item {\bf\small Git:} With Git, changes in a project's contents are accurately identified by comparing them with their previous version in the archived Git repository.
+ When the user decides the changes are significant compared to the archived state, they can ``commit'' the changes into the history/repository.
+  The commit involves copying the changed files into the repository and calculating a 40-character checksum/hash from the files, an accompanying ``message'' (a narrative description of the project's state), and the previous commit (thus creating a ``chain'' of commits that are strongly connected to each other).
+  For example `\texttt{f4953cc\-f1ca8a\-33616ad\-602ddf\-4cd189\-c2eff97b}' is a commit identifier in the Git history that this paper is being written in.
+  Commits are commonly summarized by the checksum's first few characters, for example `\texttt{f4953cc}' (a minimal sketch of this workflow is given after this list).
+
+ With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
+ The team can host the Git history on a webpage and collaborate through that.
+ There are several Git hosting services for example \href{http://github.com}{github.com}, \href{http://gitlab.com}{gitlab.com}, or \href{http://bitbucket.org}{bitbucket.org} (among many others).
+
+\end{itemize}
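+
+A minimal sketch of this workflow is shown below (the file names and commit message are hypothetical); it records the project's state and prints an identifier that can be stored inside any dataset produced at that point:
+
+\begin{verbatim}
+# Record the current state of the project in its history.
+git add analysis.mk paper.tex
+git commit -m "Describe the change here"
+
+# Print an identifier for the current commit (marked as 'dirty' if
+# uncommitted changes exist); this string can be written into any
+# intermediate dataset that is produced at this point.
+git describe --always --dirty
+\end{verbatim}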
+
+
+
+
+
+\subsection{Job management}
+\label{appendix:jobmanagement}
+Any analysis will involve more than one logical step.
+For example it is first necessary to download a dataset, then to do some preparations on it, then to actually use it, and finally to make visualizations/tables that can be imported into the final report.
+Each one of these is a logically independent step which needs to be run before/after the others in a specific order.
+There are many tools for managing the sequence of jobs; below we'll review the most common ones that are also used in the proposed template, or in the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+
+\begin{itemize}
+\item {\bf\small Script:} Scripts (in any language, for example GNU Bash, or Python) are the most common ways of organizing a series of steps.
+  They are primarily designed to execute each step sequentially (one after another), making them also very intuitive.
+ However, as the series of operations become complex and large, managing the workflow in a script will become highly complex.
+  For example if 90\% of a long project is already done and a researcher wants to add a follow-up step, a script will go through all the previous steps (which can take significant time).
+  Also, if a small step in the middle of an analysis has to be changed, the full analysis needs to be re-run: scripts have no concept of dependencies (which would allow only the steps affected by the change to be re-run).
+ Such factors discourage experimentation, which is a critical component of the scientific method.
+ It is possible to manually add conditionals all over the script to add dependencies, but they just make it harder to read, and introduce many bugs themselves.
+ Parallelization is another drawback of using scripts.
+  While it is not impossible, because of the high-level nature of scripts, parallelization is not trivial and can also be very inefficient or buggy.
+
+
+\item {\bf\small Make:} (\url{https://www.gnu.org/s/make}) Make was originally designed to address the problems mentioned above for scripts \citep{feldman79}, in particular to manage the compilation of programs with many source files.
+  The most common implementation is GNU Make \citep{stallman88}.
+  With Make, the source files of a program that haven't been changed aren't re-compiled.
+  Also, when two source files don't depend on each other, and both need to be rebuilt, they can be built in parallel.
+  This greatly helps in debugging software projects, and in speeding up test builds.
+
+  Because it has been a fixed component of Unix systems and culture from the very early days, it is by far the most used workflow manager today.
+ It is already installed and used when building almost all components of Unix-like operating systems (including GNU/Linux, BSD, and macOS, among others).
+  It is also well known (to different levels) by many people, even outside of software engineering (for example even astronomers have to run the \texttt{make} command when they want to install their analysis software).
+ However, even though Make can be used to manage any series of steps that involve the creation of files (including data analysis), its usage has predominantly remained in the software-building sphere and it has yet to penetrate higher-level workflows.
+
+ The proposed template uses Make to organize its workflow (as described in more detail above \tonote{add section reference later}).
+ We'll thus do a short review here.
+ A file containing Make instructions is known as a `Makefile'.
+Make manages `rules', which are composed of three components: `targets', `pre-requisites' and `recipes'.
+Targets and prerequisites must be files on the running system (note that in Unix-like operating systems, everything is a file, even directories and devices).
+To manage operations and decide which operations should be re-done, Make compares the time stamps of these files.
+A rule's `recipe' contains the instructions (most commonly shell commands) to produce the `target' file when any of the `prerequisite' files is newer than the target.
+When all the prerequisites are older than the target, in Make's paradigm, that target doesn't need to be re-built (a minimal example rule is sketched after this list).
+
+\item {\bf\small SCons:} (\url{https://scons.org}) is a Python package for managing operations outside of Python (in contrast to CGAT-core, discussed below, which only organizes Python functions).
+ In many aspects it is similar to Make, for example it is managed through a `SConstruct' file.
+ Like a Makefile, SConstruct is also declerative: the running order is not necessarily the top-to-bottom order of the written operations within the file (unlike the the imperative paradigm which is common in languages like C, Python, or Fortran).
+ However, unlike Make, SCons doesn't use the file modification date to decide if it should be remade.
+ SCons keeps the MD5 hash of all the files (in a hidden binary file) to check if the contents has changed.
+
+ SCons thus attempts to work on a declarative file with an imperative language (Python).
+ It also goes beyond raw job management and attempts to extract information from within the files (for example to identify the libraries that must be linked while compiling a program).
+  SCons is therefore more complex than Make: its manual is almost double the length of GNU Make's.
+ Besides added complexity, all these ``smart'' features decrease its performance, especially as files get larger and more numerous: on every call, every file's checksum has to be calculated, and a Python system call has to be made (which is computationally expensive).
+
+  Finally, it has the same drawback as any other tool that uses high-level languages, see Appendix \ref{appendix:highlevelinworkflow}.
+ We encountered such a problem while testing SCons: on the Debian-10 testing system, the `\texttt{python}' program pointed to Python 2.
+ However, since Python 2 is now obsolete, SCons was built with Python 3 and our first run crashed.
+ To fix it, we had to either manually change the core operating system path, or the SCons source hashbang.
+  The former will conflict with other system tools that assume `\texttt{python}' points to Python 2, while the latter may need root permissions on some systems.
+  This can also be problematic when a Python analysis library requires a Python version that conflicts with the running SCons.
+
+\item {\bf\small CGAT-core:} (\url{https://cgat-core.readthedocs.io/en/latest}) is a Python package for managing workflows, see \citet{cribbs19}.
+ It wraps analysis steps in Python functions and uses Python decorators to track the dependencies between tasks.
+  It is used in papers like \citet{jones19}, but as mentioned there, it is primarily good for managing individual outputs (for example separate figures/tables in the paper, when they are fully created within Python).
+ Because it is primarily designed for Python tasks, managing a full workflow (which includes many more components, written in other languages) is not trivial in it.
+  Another drawback of this workflow manager is that Python is a very high-level language, and future versions of the language may no longer be compatible with Python 3, in which CGAT-core is implemented (similar to how Python 2 programs are not compatible with Python 3).
+
+\item {\bf\small Guix Workflow Language (GWL):} (\url{https://www.guixwl.org}) GWL is based on the declarative language that GNU Guix uses for package management (see Appendix \ref{appendix:packagemanagement}), which is itself based on the general purpose Scheme language.
+ It is closely linked with GNU Guix and can even install the necessary software needed for each individual process.
+  Hence in the GWL paradigm, software installation and usage don't have to be separated.
+ GWL has two high-level concepts called ``processes'' and ``workflows'' where the latter defines how multiple processes should be executed together.
+\end{itemize}
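+
+To make Make's target/prerequisite/recipe terminology from the item above concrete, a minimal hypothetical rule is shown below (the file names and the \texttt{./analyze.sh} script are placeholders):
+
+\begin{verbatim}
+# 'table.txt' is the target, 'input.dat' and 'params.conf' are its
+# prerequisites, and the indented shell command is the recipe.  Make
+# only runs the recipe when a prerequisite is newer than the target.
+table.txt: input.dat params.conf
+        ./analyze.sh input.dat params.conf > table.txt
+\end{verbatim}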
+
+As described above, shell scripts and Make have existed for several decades, are widely used, and many researchers are already familiar with them.
+The list of necessary software solutions for the various stages of a research project (listed in the subsections of Appendix \ref{appendix:existingtools}) is already very large, and each software has its own learning curve (which is a heavy burden for a natural or social scientist, for example).
+The other workflow management tools are too specific to a particular paradigm: for example, CGAT-core is written for Python, and GWL is intertwined with GNU Guix.
+Therefore their generalization into any kind of problem is not trivial.
+
+Also, high-level and specific solutions evolve very fast; for example, the Popper solution to reproducible research (see Appendix \ref{appendix:popper}) organized its workflow through the HashiCorp configuration language (HCL) because it was the default in GitHub.
+However, in September 2019, GitHub dropped HCL as its default configuration language and is now using its own custom YAML-based language.
+Such high-level or provider-specific solutions also have the problem that they are hard, or impossible, to use on generic systems.
+Therefore a robust project would avoid designing its low-level processing steps in these languages, and only use them for the highest-level layer of the project, depending on which provider it will run on.
+
+
+
+\subsection{Editing steps and viewing results}
+\label{appendix:editors}
+In order to later reproduce a project, the analysis steps must be stored in files.
+For example Shell, Python or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or Fortran.
+Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files.
+Here we'll review some common methods that are currently used.
+
+\begin{itemize}
+\item {\bf\small Text editor:} The most basic way to edit text files is through simple text editors which just allow viewing and editing such files, for example \texttt{gedit} on the GNOME graphic user interface.
+  However, working with simple plain-text editors like \texttt{gedit} can be very frustrating, since it is necessary to save the file, then go to a terminal emulator and execute the source files.
+ To solve this problem there are advanced text editors like GNU Emacs that allow direct execution of the script, or access to a terminal within the text editor.
+  However, editors that can execute or debug the source (like GNU Emacs) just run external programs for these jobs (for example GNU GCC, or GNU GDB), just as if those programs were called from outside the editor.
+
+ With text editors, the final edited file is independent of the actual editor and can be further edited with another editor, or executed without it.
+ This is a very important feature that is not commonly present for other solutions mentioned below.
+ Another very important advantage of advanced text editors like GNU Emacs or Vi(m) is that they can also be run without a graphic user interface, directly on the command-line.
+ This feature is critical when working on remote systems, in particular high performance computing (HPC) facilities that don't provide a graphic user interface.
+
+\item {\bf\small Integrated Development Environments (IDEs):} To facilitate the development of source files, IDEs add software building and running environments as well as debugging tools to a plain text editor.
+ Many IDEs have their own compilers and debuggers, hence source files that are maintained in IDEs are not necessarily usable/portable on other systems.
+ Furthermore, they usually require a graphic user interface to run.
+ In summary IDEs are generally very specialized tools, for special projects and are not a good solution when portability (the ability to run on different systems) is required.
+
+\item {\bf\small Jupyter:} Jupyter \citep[initially IPython,][]{kluyver16} is an implementation of Literate Programming \citep{knuth84}.
+ The main user interface is a web-based ``notebook'' that contains blobs of executable code and narrative.
+ Jupyter uses the custom built \texttt{.ipynb} format\footnote{\url{https://nbformat.readthedocs.io/en/latest}}.
+Jupyter's name is a combination of the three main languages it was designed for: Julia, Python and R.
+The \texttt{.ipynb} format is a simple, human-readable file (it can be opened in a plain-text editor), formatted in Javascript Object Notation (JSON).
+It contains various kinds of ``cells'', or blobs, that can contain narrative description, code, or multi-media visualizations (for example images/plots), all stored in one file.
+The cells can have any order, allowing the creation of a graphical implementation of the literate programming style, where narrative descriptions and executable snippets of code can be intertwined.
+For example, a paragraph of text about a snippet of code can be followed by that snippet, which can be run immediately on the same page.
+
+The \texttt{.ipynb} format does theoretically allow dependency tracking between cells, see IPython mailing list (discussion started by Gabriel Becker from July 2013\footnote{\url{https://mail.python.org/pipermail/ipython-dev/2013-July/010725.html}}).
+Defining dependencies between the cells can allow non-linear execution which is critical for large scale (thousands of files) and complex (many dependencies between the cells) operations.
+It allows automation, run-time optimization (deciding not to run a cell if it is not necessary) and parallelization.
+However, Jupyter currently only supports a linear run of the cells: always from the start to the end.
+It is possible to manually execute only one cell, but the previous/next cells that may depend on it, also have to be manually run (a common source of human error, and frustration for complex operations).
+Integration of directional graph features (dependencies between the cells) into Jupyter has been discussed, but as of this publication, there is no plan to implement it (see Jupyter's GitHub issue 1175\footnote{\url{https://github.com/jupyter/notebook/issues/1175}}).
+
+The fact that the \texttt{.ipynb} format stores narrative text, code and multi-media visualization of the outputs in one file is another major hurdle:
+the files can easily become very large (in volume/bytes) and hard to read from source.
+Both are critical for scientific processing, especially the latter: readability from source matters when a web browser with the required Javascript features isn't available (which can happen in a few years).
+This is further exacerbated by the fact that binary data (for example images) are not directly supported in JSON and have to be converted into much less memory-efficient textual encodings.
+
+Finally, Jupyter has an extremely complex dependency graph: on a clean Debian 10 system, Pip (a Python package manager that is necessary for installing Jupyter) required 19 dependencies to install, and installing Jupyter within Pip needed 41 dependencies!
+\citet{hinsen15} reported such conflicts when building Jupyter into the Active Papers framework (see Appendix \ref{appendix:activepapers}).
+However, the dependencies above are only on the server-side.
+Since Jupyter is a web-based system, it requires many dependencies on the viewing/running browser also (for example special Javascript or HTML5 features, which evolve very fast).
+As discussed in Appendix \ref{appendix:highlevelinworkflow} having so many dependencies is a major caveat for any system regarding scientific/long-term reproducibility (as opposed to industrial/immediate reproducibility).
+In summary, Jupyter is most useful for manual, interactive and graphical operations that are temporary (for example educational tutorials).
+\end{itemize}
+
+
+
+
+
+\subsection{Project management in high-level languages}
+\label{appendix:highlevelinworkflow}
+
+Currently the most popular high-level data analysis language is Python.
+R is closely tracking it, and has superseded Python in some fields, while Julia \citep[with its much better performance compared to R and Python, in a high-level structure, see][]{bezanson17} is quickly gaining ground.
+These languages have themselves superseded previously popular languages for data analysis of the previous decades, for example Java, Perl or C++.
+All are part of the C-family programming languages.
+In many cases, this means that the tools to use that language are written in C, which is the language of the operating system.
+
+Scientists, or data analysts, mostly use these higher-level languages.
+Therefore they are naturally drawn to also apply the higher-level languages for lower-level project management, or designing the various stages of their workflow.
+For example Conda or Spack (Appendix \ref{appendix:packagemanagement}), CGAT-core (Appendix \ref{appendix:jobmanagement}), Jupyter (Appendix \ref{appendix:editors}) or Popper (Appendix \ref{appendix:popper}) are written in Python.
+The discussion below applies to both the actual analysis software and project management software.
+In this context, it is more focused on the latter.
+
+Because of their nature, higher-level languages evolve very fast, creating incompatibilities along the way.
+The most prominent example is the transition from Python 2 (released in 2000) to Python 3 (released in 2008).
+Python 3 was incompatible with Python 2, and it was decided to abandon Python 2 by 2015.
+However, due to community pressure, this was delayed to January 1st, 2020.
+The end-of-life of Python 2 caused many problems for projects that had invested heavily in Python 2: all their previous work had to be translated, for example see \citet{jenness17} or Appendix \ref{appendix:sciunit}.
+Some projects couldn't make this investment and their developers decided to stop maintaining it, for example VisTrails (see Appendix \ref{appendix:vistrails}).
+
+The problems weren't just limited to translation.
+Python 2 was still being actively used during the transition period (and is still being used by some, after its end-of-life).
+Therefore, developers of packages used by others had to maintain (for example fix bugs in) both versions in one package.
+This isn't particular to Python; a similar evolution occurred in Perl: in 2000 it was decided to improve Perl 5, but the proposed Perl 6 was incompatible with it.
+However, the Perl community decided not to abandon Perl 5, and Perl 6 was eventually defined as a new language that is now officially called ``Raku'' (\url{https://raku.org}).
+
+It is unreasonably optimistic to assume that high-level languages won't undergo similar incompatible evolutions in the (not too distant) future.
+For software developers, this isn't a problem at all: non-scientific software, and the general population's usage of them, evolves extremely fast and it is rarely (if ever) necessary to look into codes that are more than a couple of years old.
+However, in the sciences (which are commonly funded by public money) this is a major caveat for the longer-term usability of solutions that are designed.
+
+Beyond technical, low-level, problems for the developers mentioned above, this causes major problems for scientific project management as listed below:
+
+\begin{itemize}
+\item {\bf\small Dependency hell:} The evolution of high-level languages is extremely fast, even within one version.
+  For example packages that are written in Python 3 often only work with a specific interval of Python 3 versions (for example newer than Python 3.6).
+ This isn't just limited to the core language, much faster changes occur in their higher-level libraries.
+ For example version 1.9 of Numpy (Python's numerical analysis module) discontinued support for Numpy's predecessor (called Numeric), causing many problems for scientific users \citep[see][]{hinsen15}.
+
+ On the other hand, the dependency graph of tools written in high-level languages is often extremely complex.
+  For example see Figure 1 of \citet{alliez19}, which shows the dependencies and their inter-dependencies for Matplotlib (a popular plotting module in Python).
+
+  The acceptable version intervals of the dependencies will cause incompatibilities in a year or two, when a robust package manager is not used (see Appendix \ref{appendix:packagemanagement}).
+ Since a domain scientist doesn't always have the resources/knowledge to modify the conflicting part(s), many are forced to create complex environments with different versions of Python and pass the data between them (for example just to use the work of a previous PhD student in the team).
+ This greatly increases the complexity of the project, even for the principal author.
+ A good reproducible workflow can account for these different versions.
+ However, when the actual workflow system (not the analysis software) is written in a high-level language this will cause a major problem.
+
+ For example, merely installing the Python installer (\texttt{pip}) on a Debian system (with `\texttt{apt install pip2}' for Python 2 packages), required 32 other packages as dependencies.
+ \texttt{pip} is necessary to install Popper and Sciunit (Appendices \ref{appendix:popper} and \ref{appendix:sciunit}).
+ As of this writing, the `\texttt{pip3 install popper}' and `\texttt{pip2 install sciunit2}' commands for installing each, required 17 and 26 Python modules as dependencies.
+ It is impossible to run either of these solutions if there is a single conflict in this very complex dependency graph.
+ This problem actually occurred while we were testing Sciunit: even though it installed, it couldn't run because of conflicts (its last commit was only 1.5 years old), for more see Appendix \ref{appendix:sciunit}.
+ \citet{hinsen15} also report a similar problem when attempting to install Jupyter (see Appendix \ref{appendix:editors}).
+  Of course, this also applies to tools that these systems use, for example Conda (which is also written in Python, see Appendix \ref{appendix:packagemanagement}).
+
+\item {\bf\small Generational gap:} This occurs primarily for domain scientists (for example astronomers, biologists or social scientists).
+ Once they have mastered one version of a language (mostly in the early stages of their career), they tend to ignore newer versions/languages.
+ The inertia of programming languages is very strong.
+ This is natural, because they have their own science field to focus on, and re-writing their very high-level analysis toolkits (which they have curated over their career and is often only readable/usable by themselves) in newer languages requires too much investment and time.
+
+  When this investment is not possible, either the mentee has to use the mentor's old method (and miss out on all the new tools, which they need for their future job prospects), or the mentor has to avoid implementation details in discussions with the mentee, because they don't share a common language.
+ The authors of this paper have personal experiences in both mentor/mentee relational scenarios.
+ This failure to communicate in the details is a very serious problem, leading to the loss of valuable inter-generational experience.
+\end{itemize}
+
+In summary, in this section we are discussing the bootstrapping problem as regards scientific projects: the workflow/pipeline can reproduce the analysis and its dependencies, but the dependencies of the workflow itself cannot be ignored.
+The most robust way to address this problem is with a workflow management system that ideally doesn't need any major dependencies: tools that are already part of the operating system.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+\section{Survey of common existing reproducible workflows}
+\label{appendix:existingsolutions}
+
+As reviewed in the introduction (Section \ref{sec:introduction}), the problem of reproducibility has received a lot of attention over the last three decades and various solutions have already been proposed.
+In this appendix, some of the solutions are reviewed.
+The solutions are based on an evolving software landscape, therefore they are ordered by date\footnote{When the project has a webpage, the year of its first release is used, otherwise their paper's publication year is used.}.
+For each solution, we summarize its methodology and discuss how it relates to the principles in Section \ref{sec:principles}.
+Freedom of the software/method is a core concept behind scientific reproducibility, as opposed to industrial reproducibility where a black box is acceptable/desirable.
+Therefore proprietary solutions like Code Ocean (\url{https://codeocean.com}) or Nextjournal (\url{https://nextjournal.com}) will not be reviewed here.
+
+
+
+
+
+
+\subsection{Reproducible Electronic Documents (RED), 1992}
+\label{appendix:electronicdocs}
+
+Reproducible Electronic Documents (\url{http://sep.stanford.edu/doku.php?id=sep:research:reproducible}) is the first attempt that we could find on doing reproducible research \citep{claerbout1992,schwab2000}.
+It was developed within the Stanford Exploration Project (SEP) for Geophysics publications.
+Their introduction on the importance of reproducibility resonates a lot with today's environment in computational sciences,
+in particular the heavy investment one has to make in order to re-do another scientist's work, even in the same team.
+RED also influenced other early reproducible works, for example \citet{buckheit1995}.
+
+To orchestrate the various figures/results of a project, from 1990 they used ``Cake'' \citep{somogyi87}, a dialect of Make (for more on Make, see Appendix \ref{appendix:jobmanagement}).
+As described in \citet{schwab2000}, in the latter half of that decade they moved to GNU Make \citep{stallman88}, which was much more commonly used and developed, and came with a complete and up-to-date manual.
+The basic idea behind RED's solution was to organize the analysis as independent steps, including the generation of plots, and organizing the steps through a Makefile.
+This enabled all the results to be re-executed with a single command.
+Several basic low-level Makefiles were included in the high-level/central Makefile.
+The reader/user of a project had to manually edit the central Makefile and set the variable \texttt{RESDIR} (result directory), which is the directory where built files are kept.
+Afterwards, the reader could select which figures/parts of the project to reproduce by manually adding their names to the central Makefile and running Make.
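+
+Based on the description above, the central Makefile can be schematically illustrated as follows (this is only an illustration written for this paper, not SEP's actual Makefile; the variable and file names besides \texttt{RESDIR} are hypothetical):
+
+\begin{verbatim}
+# Directory where built results are kept (edited by the reader).
+RESDIR = /path/to/results
+
+# Figures/parts of the project to reproduce (edited by the reader).
+RESULTS = figure1 figure2
+
+# Low-level Makefiles with the actual rules are included here.
+include rules.mk
+\end{verbatim}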
+
+At the time, Make was already used by individual researchers and projects as a job orchestration tool, but SEP's innovation was to standardize it as an internal policy, and to define conventions for the Makefiles to be consistent across projects.
+This enabled new members to benefit from the already existing work of previous team members (who had graduated or moved to other jobs).
+However, RED only used the existing software of the host system and had no means to control it.
+Therefore, with wider adoption, they confronted a ``versioning problem'' where the host's analysis software had different versions on different hosts, creating different results, or crashing \citep{fomel09}.
+Hence in 2006 SEP moved to a new Python-based framework called Madagascar, see Appendix \ref{appendix:madagascar}.
+
+
+
+
+
+\subsection{Apache Taverna, 2003}
+\label{appendix:taverna}
+Apache Taverna (\url{https://taverna.incubator.apache.org}) is a workflow management system written in Java with a graphical user interface, see \citet[still being actively developed]{oinn04}.
+A workflow is defined as a directed graph, where nodes are called ``processors''.
+Each Processor transforms a set of inputs into a set of outputs and they are defined in the Scufl language (an XML-based language, where each step is an atomic task).
+Other components of the workflow are ``Data links'' and ``Coordination constraints''.
+The main user interface is graphical, where users place processors in a sheet and define links between their inputs and outputs.
+\citet{zhao12} have studied the problem of workflow decays in Taverna.
+In many aspects Taverna is like VisTrails, see Appendix \ref{appendix:vistrails}. \tonote{Since Kepler is older, it may be better to bring the VisTrails features here.}
+
+
+
+
+
+\subsection{Madagascar, 2003}
+\label{appendix:madagascar}
+Madagascar (\url{http://ahay.org}) is a set of extensions to the SCons job management tool \citep{fomel13}.
+For more on SCons, see Appendix \ref{appendix:jobmanagement}.
+Madagascar is a continuation of the Reproducible Electronic Documents (RED) project that was discussed in Appendix \ref{appendix:electronicdocs}.
+
+Madagascar does include project management tools in the form of SCons extensions.
+However, it isn't just a reproducible project management tool; it is primarily a collection of analysis programs, tools to interact with RSF files, and plotting facilities.
+For example in our test of Madagascar 3.0.1, it installed 855 Madagascar-specific analysis programs (`\texttt{PREFIX/bin/sf*}').
+The analysis programs mostly target geophysical data analysis, including various project specific tools: more than half of the total built tools are under the `\texttt{build/user}' directory which includes names of Madagascar users.
+Following the Unix spirit of modularized programs that communicate through text-based pipes, Madagascar's core is the custom Regularly Sampled File (RSF) format\footnote{\url{http://www.ahay.org/wiki/Guide\_to\_RSF\_file\_format}}.
+RSF is a plain-text file that points to the location of the actual data files on the filesystem, but it can also keep the raw binary dataset within the same plain-text file.
+
+Besides the location or contents of the data, RSF also contains name/value pairs that can be used as options to Madagascar programs, which are built with inputs and outputs of this format.
+Since RSF contains program options also, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output.
+
+Madagascar has been used in the production of hundreds of research papers or book chapters\footnote{\url{http://www.ahay.org/wiki/Reproducible_Documents}} \citep[120 prior to][]{fomel13}.
+
+
+\subsection{GenePattern, 2004}
+\label{appendix:genepattern}
+GenePattern (\url{https://www.genepattern.org}) is a client-server software system containing many common analysis functions/modules, primarily focused on gene studies \citep[first released in 2004]{reich06}.
+Although it is highly focused on a specific research field, it is reviewed here because its concepts/methods are generic and relevant in the context of this paper.
+
+Its server-side software is installed with fixed software packages that are wrapped into GenePattern modules.
+The modules are used through a web interface, the modern implementation is GenePattern Notebook \citep{reich17}.
+It is an extension of the Jupyter notebook (see Appendix \ref{appendix:editors}), which also has a special ``GenePattern'' cell that will connect to GenePattern servers for doing the analysis.
+However, the wrapper modules just call an existing tool on the host system.
+Given that each server may have its own set of installed software, the analysis may differ (or crash) when run on different GenePattern servers, hampering reproducibility.
+
+The primary GenePattern server had been active since 2008 and had 40,000 registered users with 2000 to 5000 jobs running every week \citep{reich17}.
+However, it was shut down on November 15th 2019 due to end of funding\footnote{\url{https://www.genepattern.org/blog/2019/10/01/the-genomespace-project-is-ending-on-november-15-2019}}.
+All processing with this server has stopped, and any archived data on it has been deleted.
+Since GenePattern is free software, there are alternative public servers to use, so hopefully work on it will continue.
+However, funding is limited and those servers may face similar funding problems.
+This is a very nice example of the fragility of solutions that depend on archiving and running high-level research products (including data, binary/compiled code).
+
+
+
+
+
+\subsection{Kepler, 2005}
+Kepler (\url{https://kepler-project.org}) is a Java-based Graphic User Interface workflow management tool \citep{ludascher05}.
+Users drag-and-drop analysis components, called ``actors'', into a visual, directional graph, which is the workflow (similar to Figure \ref{fig:analysisworkflow}).
+Each actor is connected to others through the Ptolemy approach \citep{eker03}.
+In many aspects Kepler is like VisTrails, see Appendix \ref{appendix:vistrails}.
+\tonote{Since kepler is older, it may be better to bring the VisTrails features here.}
+
+
+
+
+
+\subsection{VisTrails, 2005}
+\label{appendix:vistrails}
+
+VisTrails (\url{https://www.vistrails.org}) was a graphical workflow managing system that is described in \citet{bavoil05}.
+According to its webpage, VisTrails maintenance stopped in May 2016; its last Git commit, as of this writing, was in November 2017.
+However, the fact that it was well maintained for over 10 years is an achievement.
+
+VisTrails (or ``visualization trails'') was initially designed for managing visualizations, but later grew into a generic workflow system with meta-data and provenance features.
+Each analysis step, or module, is recorded in an XML schema, which defines the operations and their dependencies.
+The XML attributes of each module can be used in any XML query language to find certain steps (for example those that used a certain command).
+Since the main goal was visualization (as images), apparently its primary output is in the form of image spreadsheets.
+Its design is based on a change-based provenance model using a custom VisTrails provenance query language (vtPQL), for more see \citet{scheidegger08}.
+Since XML is a plain-text format, as the user inspects the data and makes changes to the analysis, the changes are recorded as ``trails'' in the project's VisTrails repository, which operates very much like common version control systems (see Appendix \ref{appendix:versioncontrol}).
+
+With respect to keeping the history/provenance of the final dataset, VisTrails is very much like the template introduced in this paper.
+However, even though XML is in plain text, it is very hard to edit manually.
+VisTrails therefore provides a graphic user interface with a visual representation of the project's inter-dependent steps (similar to Figure \ref{fig:analysisworkflow}).
+Besides the fact that it is no longer maintained, the conceptual differences with the proposed template are substantial.
+The most important is that VisTrails doesn't control the software that is run; it only controls the sequence of steps in which it is run.
+Furthermore, the proposed template defines dependencies and operations based on the very standard and commonly known Make system, not a custom XML format.
+Scripts can easily be written to generate an XML-formatted output from Makefiles.
+
+
+
+
+
+\subsection{Galaxy, 2010}
+\label{appendix:galaxy}
+
+Galaxy (\url{https://galaxyproject.org}) is a web-based Genomics workbench \citep{goecks10}.
+The main user interface is ``Galaxy Pages'', which doesn't require any programming: users simply use abstract ``tools'', which are wrappers over command-line programs.
+Therefore the actual running version of the program can be hard to control across different Galaxy servers \tonote{confirm this}.
+Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
+Besides some small differences, this seems to be very similar to GenePattern (Appendix \ref{appendix:genepattern}).
+
+
+
+
+
+\subsection{Image Processing On Line (IPOL) journal, 2010}
+The IPOL journal (\url{https://www.ipol.im}) attempts to publish the full implementation details of proposed image processing algorithms as a scientific paper \citep[first published article in July 2010]{limare11}.
+An IPOL paper is a traditional research paper, but with a focus on implementation.
+The published narrative description of the algorithm must be detailed to a level that any specialist can implement it in their own programming language (in other words, it is extremely detailed).
+The author's own implementation of the algorithm is also published with the paper (in C, C++ or Matlab); the code must be commented well enough and link each part of it with the relevant part of the paper.
+The authors must also submit several example datasets/scenarios.
+The referee actually inspects the code and narrative, confirming that they match with each other, and with the stated conclusions of the published paper.
+After publication, each paper also has a ``demo'' button on its webpage, allowing readers to try the algorithm on a web-interface and even provide their own input.
+
+The IPOL model is indeed the single most robust model of peer review and publishing of computational research methods/implementations.
+It has grown steadily over the last 10 years, publishing 23 research articles in 2019 alone.
+We encourage the reader to visit its webpage and see some of its recent papers and their demos.
+It can be so thorough and complete because it has a very narrow scope (image processing), and the published algorithms are highly atomic, not needing significant dependencies (beyond input/output), allowing the referees to go deep into each implemented algorithm.
+In fact, high-level languages like Perl, Python or Java are not acceptable precisely because of the additional complexities/dependencies that they require.
+
+Ideally (if any referee/reader was inclined to do so), the proposed template of this paper allows for a similar level of scrutiny, but for much more complex research scenarios, involving hundreds of dependencies and complex processing on the data.
+
+
+
+
+
+\subsection{Active Papers, 2011}
+\label{appendix:activepapers}
+Active Papers (\url{http://www.activepapers.org}) attempts to package the code and data of a project into one file (in HDF5 format).
+It was initially written in Java because its compiled bytecode output can be run on any machine with a Java virtual machine (JVM) \citep[see][]{hinsen11}.
+However, Java is not a commonly used platform today, hence it was later implemented in Python \citep{hinsen15}.
+
+In the Python version, all processing steps and input data (or references to them) are stored in an HDF5 file.
+However, it can only account for pure-Python packages using the host operating system's Python modules \tonote{confirm this!}.
+When the Python module contains a component written in other languages (mostly C or C++), it needs to be an external dependency to the Active Paper.
+
+As mentioned in \citet{hinsen15}, the fact that it relies on HDF5 is a caveat of Active Papers, because many tools are necessary to access it.
+Downloading the pre-built HDF View binaries (provided by the HDF group) is not possible anonymously/automatically (login is required).
+Installing it using the Debian or Arch Linux package managers also failed due to dependencies.
+
+While data and code are indeed fundamentally similar concepts technically \tonote{cite Konrad's paper on this}, they are used by humans differently.
+This becomes a burden when large datasets are used; this was also acknowledged in \citet{hinsen15}.
+If the data are proprietary (for example medical patient data), the data must not be released, but the methods used to produce them can be.
+Furthermore, since all reading and writing is done in the HDF5 file, temporary/reproducible files can easily bloat it to very large sizes; it is then necessary to remove/dummify them, complicating the code and making it hard to read.
+For example the Active Papers HDF5 file of \citet[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 gigabytes.
+
+In many scenarios, peers just want to inspect the processing by reading the code and checking a very specific part of it (one or two lines), not necessarily needing to run it or obtain the datasets.
+Hence the extra volume of the data, and the opaque HDF5 format that needs special tools just to read plain-text code, are a major burden.
+
+
+
+
+
+\subsection{Collage Authoring Environment, 2011}
+\label{appendix:collage}
+The Collage Authoring Environment \citep{nowakowski11} was the winner of the Elsevier Executable Paper Grand Challenge \citep{gabriel11}.
+It is based on the GridSpace2\footnote{\url{http://dice.cyfronet.pl}} distributed computing environment\tonote{find citation}, which has a web-based graphic user interface.
+Through its web-based interface, viewers of a paper can actively experiment with the parameters of a published paper's displayed outputs (for example figures).
+\tonote{See how it containerizes the software environment}
+
+
+
+
+
+\subsection{SHARE, 2011}
+\label{appendix:SHARE}
+SHARE (\url{https://is.ieis.tue.nl/staff/pvgorp/share}) is a web portal that hosts virtual machines (VMs) for storing the environment of a research project, for more, see \citet{vangorp11}.
+The top project webpage above is still active; however, the virtual machines and the SHARE system itself have been removed since 2019.
+
+SHARE was awarded second place in the Elsevier Executable Paper Grand Challenge \citep{gabriel11}.
+Simply put, SHARE is just a VM that users can download and run.
+The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible.
+
+
+
+
+
+\subsection{Verifiable Computational Result (VCR), 2011}
+\label{appendix:verifiableidentifier}
+A ``verifiable computational result'' (\url{http://vcr.stanford.edu}) is an output (table, figure, etc.) that is associated with a ``verifiable result identifier'' (VRI), see \citet{gavish11}.
+It was awarded the third prize in the Elsevier Executable Paper Grand Challenge \citep{gabriel11}.
+
+A VRI is created using tags within the programming source that produced that output, also recording its version control or history.
+This enables exact identification and citation of results.
+The VRIs are automatically generated web URLs that link to public VCR repositories containing the data, inputs and scripts, which may be re-executed.
+According to \citet{gavish11}, the VRI generation routine has been implemented in Matlab, R and Python, although only the Matlab version was available during the writing of this paper.
+VCR also has special \LaTeX{} macros for loading the respective VRI into the generated PDF.
+
+Unfortunately most parts of the webpage are not complete at the time of this writing.
+The VCR webpage contains an example PDF\footnote{\url{http://vcr.stanford.edu/paper.pdf}} that is generated with this system, however, the linked VCR repository (\texttt{http://vcr-stat.stanford.edu}) does not exist at the time of this writing.
+Finally, the dates of the files in the Matlab extension tarball are set to 2011, hinting that VCR was probably abandoned soon after the publication of \citet{gavish11}.
+
+
+
+
+
+\subsection{SOLE, 2012}
+\label{appendix:sole}
+SOLE (Science Object Linking and Embedding) defines ``science objects'' (SOs) that can be manually linked with phrases of the published paper \citep[for more, see ][]{pham12,malik13}.
+An SO is any code/content that is wrapped in begin/end tags with an associated type and name.
+For example, it can be special comment lines in a Python, R or C program.
+The SOLE command-line program parses the tagged file, generating metadata elements unique to the SO (including its URI).
+SOLE also supports workflows as Galaxy tools \citep{goecks10}.
+
+For reproducibility, \citet{pham12} suggest building a SOLE-based project in a virtual machine, using any custom package manager that is hosted on a private server to obtain a usable URI.
+However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
+Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) will be very expensive.
+The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (tags the authors may not consider important, but that may be useful later).
+The solution of the proposed template (where anything coming out of the analysis is directly linked to the paper's contents with \LaTeX{} elements) avoids these problems.
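+For example, the rule below is a minimal, hypothetical sketch of this linkage (the \texttt{resultone} macro name, the AWK command and the file names are only illustrations): one number derived from the analysis is written into a \LaTeX{} macro file, which is ultimately loaded into the paper, so the printed value cannot go out of sync with the analysis that produced it.
+
+\begin{verbatim}
+# Hypothetical rule: extract one number from the analysis
+# output and save it as the LaTeX macro '\resultone'.
+$(mtexdir)/analysis1.tex: out-1b.dat
+        v=$$(awk 'NR==1{print $$1}' out-1b.dat); \
+        printf '\\newcommand{\\resultone}{%s}\n' "$$v" > $@
+\end{verbatim}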
+
+
+
+
+
+\subsection{Sumatra, 2012}
+Sumatra (\url{http://neuralensemble.org/sumatra}) attempts to capture the environment information of a running project \citep{davison12}.
+It is written in Python and is a command-line wrapper over the analysis script; by controlling the script's execution, it is able to capture the environment it was run in.
+The captured environment can be viewed in plain text or through a web interface.
+Sumatra also provides \LaTeX/Sphinx features, which will link the paper with the project's Sumatra database.
+This enables researchers to use a fixed version of a project's figures in the paper, even at later times (while the project is being developed).
+
+The actual code that Sumatra wraps around must itself be under version control, and it does not run if there are uncommitted changes (although it is not clear what happens if a commit is amended).
+Since information on the environment has been captured, Sumatra is able to identify if it has changed since a previous run of the project.
+Sumatra therefore makes no attempt at storing the environment of the analysis itself, as Sciunit does (see Appendix \ref{appendix:sciunit}); it only stores information about the environment.
+Sumatra thus needs to know the language of the running program.
+
+
+
+
+
+\subsection{Research Object, 2013}
+\label{appendix:researchobject}
+
+The Research Object (\url{http://www.researchobject.org}) is a collection of meta-data ontologies to describe the aggregation of resources or workflows; see \citet{bechhofer13} and \citet{belhajjame15}.
+It thus provides resources to link various workflow/analysis components (see Appendix \ref{appendix:existingtools}) into a final workflow.
+
+\citet{bechhofer13} describes how a workflow in Apache Taverna (Appendix \ref{appendix:taverna}) can be translated into research objects.
+The important thing is that the research object concept is not specific to any particular workflow: it is just a metadata bundle, which is only as robust in reproducing the result as the underlying workflow.
+For example, Apache Taverna cannot guarantee exact reproducibility, as described in Appendix \ref{appendix:taverna}.
+However, once a translator is written to convert the proposed template into research objects, the same can be done for projects using this template.
+
+
+
+
+
+\subsection{Sciunit, 2015}
+\label{appendix:sciunit}
+Sciunit (\url{https://sciunit.run}) defines ``sciunit''s that keep the executed commands for an analysis and all the necessary programs and libraries that are used in those commands.
+It automatically parses all the executables in the script, and copies them, and their dependency libraries (down to the C library), into the sciunit.
+Because the sciunit contains all the programs and necessary libraries, it is possible to run it readily on other systems that have a similar CPU architecture.
+For more, please see \citet{meng15}.
+
+In our tests, Sciunit installed successfully, however we couldn't run it because of a dependency problem with the \texttt{tempfile} package (in the standard Python library).
+Sciunit is written in Python 2 (which reached its end of life on January 1st, 2020) and the last Git commit in its main branch is from June 2018 (more than 1.5 years ago).
+Recent activity in a \texttt{python3} branch shows that others are attempting to translate the code into Python 3 (the main author has graduated and is apparently no longer working on Sciunit).
+
+Because we weren't able to run it, the following discussion will just be theoretical.
+The main issue with Sciunit's approach is that the copied binaries are just black boxes.
+Therefore, it is not possible to see how the used binaries from the initial system were built, or whether they have security problems.
+This is a major problem for scientific projects, both in principle (not knowing how the programs were built) and in practice (archiving a large-volume sciunit for every step of an analysis requires a lot of space).
+
+
+
+
+
+\subsection{Binder, 2017}
+Binder (\url{https://mybinder.org}) is a tool to containerize already existing Jupyter-based processing steps.
+Users simply add a set of Binder-recognized configuration files to their repository.
+Binder will then build a Docker image and install all the dependencies inside of it with Conda (the list of necessary packages is taken from the configuration files).
+One good feature of Binder is that the imported Docker image must be tagged (something like a checksum).
+This will ensure that future/latest updates of the imported Docker image are not mistakenly used.
+However, it does not make sure that the Dockerfile used by the imported Docker image follows a similar convention.
+Binder is used by \citet{jones19}.
+
+
+
+
+
+\subsection{Gigantum, 2017}
+Gigantum (\url{https://gigantum.com}) is a client/server system, in which the client is a web-based (graphical) interface that is installed as ``Gigantum Desktop'' within a Docker image and is free software (MIT License).
+\tonote{I couldn't find the license to the server software yet, but it says that 20GB is provided for ``free'', so it is a little confusing if anyone can actually run the server.}
+\tonote{I took the date from their PyPI page, where the first version 0.1 was published in November 2016.}
+
+Gigantum uses Docker containers for an independent environment, Conda (or Pip) to install packages, Jupyter notebooks to edit and run code, and Git to store its history.
+Simply put, it is a high-level wrapper for combining these components.
+Internally, a Gigantum project is organized as files in a directory that can be opened without Gigantum's own client.
+The file structure (which is under version control) includes codes, input data and output data.
+As acknowledged on their own webpage, this greatly reduces the speed of Git operations and of transmitting or archiving the project.
+Therefore there are limits on the dataset/code sizes.
+However, there is one directory which can be used to store files that must not be tracked.
+
+
+
+
+
+\subsection{Popper, 2017}
+\label{appendix:popper}
+Popper (\url{https://falsifiable.us}) is a software implementation of the Popper Convention \citep{jimenez17}.
+The Convention is a set of very generic conditions that are also applicable to the template proposed in this paper.
+For a discussion of the convention itself, please see Section \ref{sec:principles}; in this section we review their software implementation.
+
+The Popper team's own solution is through a command-line program called \texttt{popper}.
+The \texttt{popper} program itself is written in Python, but job management is with the HashiCorp configuration language (HCL).
+HCL is primarily aimed at running jobs on HashiCorp's ``infrastructure as a service'' (IaaS) products.
+Until September 30th, 2019\footnote{\url{https://github.blog/changelog/2019-09-17-github-actions-will-stop-running-workflows-written-in-hcl}}, HCL was used by ``GitHub Actions'' to manage workflows.
+
+To start a project, the \texttt{popper} command-line program builds a template, or ``scaffold'', which is a minimal set of files that can be run.
+The scaffold is very similar to the raw template that is proposed in this paper.
+However, as of this writing, the scaffold isn't complete.
+It lacks a manuscript and validation of outputs (as mentioned in the convention).
+By default Popper runs in a Docker image (so root permissions are necessary), but Singularity is also supported.
+See Appendix \ref{appendix:independentenvironment} for more on containers, and Appendix \ref{appendix:highlevelinworkflow} for using high-level languages in the workflow.
+
+
+
+
+
+\subsection{Whole Tale, 2019}
+\label{appendix:wholetale}
+
+Whole Tale (\url{https://wholetale.org}) is a web-based platform for managing a project and organizing data provenance, see \citet{brinckman19}.
+It uses online editors like Jupyter or RStudio (see Appendix \ref{appendix:editors}) that are encapsulated in a Docker container (see Appendix \ref{appendix:independentenvironment}).
+
+The web-based nature of Whole Tale's approach and its dependency on many tools (which have many dependencies themselves) are a major limitation for future reproducibility.
+For example, when following their own tutorial on ``Creating a new tale'', the provided Jupyter notebook could not be executed because of a dependency problem.
+This has been reported to the authors as issue 113\footnote{\url{https://github.com/whole-tale/wt-design-docs/issues/113}}, but as all the second-order dependencies evolve, it is not hard to envisage such dependency incompatibilities being the primary issue for older projects on Whole Tale.
+Furthermore, the fact that a Tale is stored as a binary Docker container causes two important problems: 1) it requires a very large storage capacity for every project that is hosted there, making it very expensive to scale if demand expands. 2) It is not possible to see how the environment was built accurately (when the Dockerfile uses \texttt{apt}), for more on this, please see Appendix \ref{appendix:packagemanagement}.
+
+
+
+
+
+\subsection{Things to add}
+\url{https://sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_180684.pdf} classifies the tools as follows:
+ \begin{itemize}
+ \item Research environments: \href{http://vcr.stanford.edu}{Verifiable computational research} (discussed above), \href{http://www.sciencedirect.com/science/article/pii/S1877050911001207}{SHARE} (a Virtual Machine), \href{http://www.codeocean.com}{Code Ocean} (discussed above), \href{http://jupyter.org}{Jupyter} (discussed above), \href{https://yihui.name/knitr}{knitR} (based on Sweave, dynamic report generation with R), \href{https://cran.r-project.org}{Sweave} (Function in R, for putting R code within \LaTeX), \href{http://www.cyverse.org}{Cyverse} (proprietary web tool with servers for bioinformatics), \href{https://nanohub.org}{NanoHUB} (collection of Simulation Programs for nanoscale phenomena that run in the cloud), \href{https://www.elsevier.com/about/press-releases/research-and-journals/special-issue-computers-and-graphics-incorporates-executable-paper-grand-challenge-winner-collage-authoring-environment}{Collage Authoring Environment} (discussed above), \href{https://osf.io/ns2m3}{SOLE} (discussed above), \href{https://osf.io}{Open Science framework} (a hosting webpage), \href{https://www.vistrails.org}{VisTrails} (discussed above), \href{https://pypi.python.org/pypi/Sumatra}{Sumatra} (discussed above), \href{http://software.broadinstitute.org/cancer/software/genepattern}{GenePattern} (reviewed above), Image Processing On Line (\href{http://www.ipol.im}{IPOL}) journal (publishes full analysis scripts, but doesn't deal with dependencies), \href{https://github.com/systemslab/popper}{Popper} (reviewed above), \href{https://galaxyproject.org}{Galaxy} (reviewed above), \href{http://torch.ch}{Torch.ch} (finished project for neural networks on images), \href{http://wholetale.org/}{Whole Tale} (discussed above).
+ \item Workflow systems: \href{http://www.taverna.org.uk}{Taverna}, \href{http://www.wings-workflows.org}{Wings}, \href{https://pegasus.isi.edu}{Pegasus}, \href{http://www.pgbovine.net/cde.html}{CDE}, \href{http://binder.org}{Binder}, \href{http://wiki.datakurator.org/wiki}{Kurator}, \href{https://kepler-project.org}{Kepler}, \href{https://github.com/everware}{Everware}, \href{http://cds.nyu.edu/projects/reprozip}{Reprozip}.
+ \item Dissemination platforms: \href{http://researchcompendia.org}{ResearchCompendia}, \href{https://datacenterhub.org/about}{DataCenterHub}, \href{http://runmycode.org}{RunMyCode}, \href{https://www.chameleoncloud.org}{ChameleonCloud}, \href{https://occam.cs.pitt.edu}{Occam}, \href{http://rcloud.social/index.html}{RCloud}, \href{http://thedatahub.org}{TheDataHub}, \href{http://www.ahay.org/wiki/Package_overview}{Madagascar}.
+ \end{itemize}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+\newpage
+\section{Things remaining to add}
+\begin{itemize}
+\item Special volume on ``Reproducible research'' in Computing in Science \& Engineering \citep{fomel09}.
+\item ``I’ve learned that interactive programs are slavery (unless they include the ability to arrive in any previous state by means of a script).'' \citep{fomel09}.
+\item \citet{fomel09} discuss the ``versioning problem'': on different systems, programs have different versions.
+\item \citet{fomel09}: a C program written 20 years ago was still usable.
+\item \citet{fomel09}: ``in an attempt to increase the size of the community, Matthias Schwab and I submitted a paper to Computers in Physics, one of CiSE’s forerunners. It was rejected. The editors said if everyone used Microsoft computers, everything would be easily reproducible. They also predicted the imminent demise of Fortran''.
+\item \citet{alliez19}: Software citation, with a nice dependency plot for matplotlib.
+ \item SC \href{https://sc19.supercomputing.org/submit/reproducibility-initiative}{Reproducibility Initiative} for mandatory Artifact Description (AD).
+ \item \href{https://www.acm.org/publications/policies/artifact-review-badging}{Artifact review badging} by the Association of computing machinery (ACM).
+ \item eLife journal \href{https://elifesciences.org/labs/b521cf4d/reproducible-document-stack-towards-a-scalable-solution-for-reproducible-articles}{announcement} on reproducible papers. \citet{lewis18} is their first reproducible paper.
+ \item The \href{https://www.scientificpaperofthefuture.org}{Scientific paper of the future initiative} encourages geoscientists to include associated metadata with scientific papers \citep{gil16}.
+ \item Digital objects: \url{http://doi.org/10.23728/b2share.b605d85809ca45679b110719b6c6cb11} and \url{http://doi.org/10.23728/b2share.4e8ac36c0dd343da81fd9e83e72805a0}
+ \item \citet{mesirov10}, \citet{casadevall10}, \citet{peng11}: Importance of reproducible research.
+ \item \citet{sandve13} is an editorial recommendation to publish reproducible results.
+ \item \citet{easterbrook14} Free/open software for open science.
+ \item \citet{peng15}: Importance of better statistical education.
+ \item \citet{topalidou16}: Failed attempt to reproduce a result.
+ \item \citet{hutton16} reproducibility in hydrology, criticized in \citet{melson17}.
+ \item \citet{fomel09}: Editorial on reproducible research.
+ \item \citet{munafo17}: Reproducibility in social sciences.
+ \item \citet{stodden18}: Effectiveness of journal policy on computational reproducibility.
+ \item \citet{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'', and argues that it is important to empower scientists.
+ \item \citet{burrell18} open software (in particular Python) in heliophysics.
+ \item \citet{allen18} show that many papers don't cite software.
+ \item \citet{zhang18} explicitly say that they won't release their code: ``We opt not to make the code used for the chemical evolution modeling publicly available because it is an important asset of the researchers' toolkits''.
+ \item \citet{jones19} make a genuine effort at reproducing every number in the paper (using Docker, Conda, CGAT-core and Binder), but they can ultimately only release scripts. They claim it is not possible to achieve that level of reproducibility, but here we show it is.
+ \item LSST uses Kubernetes and docker for reproducibility \citep{banek19}.
+ \item Interesting survey/paper on the importance of coding in science \citep{merali10}.
+ \item Discuss the Provenance challenge \citep{moreau08}, showing the importance of meta data and provenance tracking.
+ Especially that it is organized by medical scientists.
+ Its webpage (for latest challenge) has a nice intro: \url{https://www.cccinnovationcenter.com/challenges/provenance-challenge}.
+ \item In discussion: The XML provenance system is very interesting, scripts can be written to parse the Makefiles within this template to generate such XML outputs for easy standard metadata parsing.
+ The XML that contains a log of the outputs is also interesting.
+ \item \citet{becker17} Discuss reproducibility methods in R.
+ \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citep{gabriel11}.
+\end{itemize}
+
+
%% Mention all used software in an appendix.
\section{Software acknowledgement}
diff --git a/reproduce/analysis/make/initialize.mk b/reproduce/analysis/make/initialize.mk
index cdf2129..a4acff7 100644
--- a/reproduce/analysis/make/initialize.mk
+++ b/reproduce/analysis/make/initialize.mk
@@ -132,6 +132,7 @@ curdir := $(shell echo $$(pwd))
# we are also going to overwrite `TEXINPUTS' just before `pdflatex'.
.ONESHELL:
.SHELLFLAGS = -ec
+export TERM=xterm
export TEXINPUTS :=
export CCACHE_DISABLE := 1
export PATH := $(installdir)/bin
diff --git a/reproduce/analysis/make/paper.mk b/reproduce/analysis/make/paper.mk
index a4eeb2e..af6bdc5 100644
--- a/reproduce/analysis/make/paper.mk
+++ b/reproduce/analysis/make/paper.mk
@@ -44,7 +44,7 @@ $(mtexdir)/project.tex: $(mtexdir)/verify.tex
# If no PDF is requested, or if LaTeX isn't available, don't
# continue to building the final PDF. Otherwise, merge all the TeX
# macros into one for building the PDF.
- @if [ -f .local/bin/pdflatex ] && [ x"$(pdf-build-final)" != x ]; then
+ @if [ -f .local/bin/lualatex ] && [ x"$(pdf-build-final)" != x ]; then
# Put a LaTeX input command for all the necessary macro files.
rm -f $(mtexdir)/project.tex
@@ -100,7 +100,7 @@ $(texbdir)/paper.bbl: tex/src/references.tex $(mtexdir)/dependencies-bib.tex \
p=$$(pwd)
export TEXINPUTS=$$p:
cd $(texbdir);
- pdflatex -shell-escape -halt-on-error $$p/paper.tex
+ lualatex -shell-escape -halt-on-error $$p/paper.tex
biber paper
fi
@@ -127,7 +127,7 @@ paper.pdf: $(mtexdir)/project.tex paper.tex $(texbdir)/paper.bbl
p=$$(pwd)
export TEXINPUTS=$$p:
cd $(texbdir)
- pdflatex -shell-escape -halt-on-error $$p/paper.tex
+ lualatex -shell-escape -halt-on-error $$p/paper.tex
# Come back to the top project directory and copy the built PDF
# file here.
diff --git a/reproduce/software/config/installation/texlive.mk b/reproduce/software/config/installation/texlive.mk
index c53e170..d0b7159 100644
--- a/reproduce/software/config/installation/texlive.mk
+++ b/reproduce/software/config/installation/texlive.mk
@@ -21,4 +21,6 @@ texlive-packages = tex fancyhdr ec newtx fontaxes xkeyval etoolbox xcolor \
preprint ulem biblatex biber logreq pgf pgfplots fp \
courier tex-gyre txfonts times csquotes kastrup \
trimspaces pdftexcmds pdfescape letltxmacro bitset \
- mweights
+ mweights \
+ \
+ alegreya enumitem fontspec lastpage
diff --git a/tex/img/codata.png b/tex/img/codata.png
new file mode 100644
index 0000000..c78dbc3
--- /dev/null
+++ b/tex/img/codata.png
Binary files differ
diff --git a/tex/src/figure-data-lineage.tex b/tex/src/figure-data-lineage.tex
new file mode 100644
index 0000000..d849e8c
--- /dev/null
+++ b/tex/src/figure-data-lineage.tex
@@ -0,0 +1,236 @@
+\newcommand{\paperpdf}{}
+\newcommand{\papertex}{}
+\newcommand{\projecttex}{}
+\newcommand{\verifytex}{}
+\newcommand{\initializetex}{}
+\newcommand{\downloadtex}{}
+\newcommand{\inputtwo}{}
+\newcommand{\inputsconf}{}
+\newcommand{\analysisonetex}{}
+\newcommand{\outoneb}{}
+\newcommand{\outonebdep}{}
+\newcommand{\inputone}{}
+\newcommand{\inputonedep}{}
+\newcommand{\analysistwotex}{}
+\newcommand{\outtwob}{}
+\newcommand{\outtwobdep}{}
+\newcommand{\analysisthreetex}{}
+\newcommand{\analysisthreeouts}{}
+\newcommand{\outtwoa}{}
+\newcommand{\outtwoadep}{}
+\newcommand{\outthreeadep}{}
+
+
+
+
+
+\begin{tikzpicture}[
+ line width=1.5pt,
+ black!50,
+ text=black,
+]
+
+ %% Use small fonts
+ \footnotesize
+
+ %% These white lines are only added to fix the vertical position of
+ %% the figure so it doesn't change as we add more boxes.
+ \draw [white] (-7.5,0) -- (7.4,0);
+ \draw [white] (0,-4.7) -- (0,5.7);
+
+ %% top-make.mk
+ \node [at={(-0.05cm,2mm)},
+ rectangle,
+ very thick,
+ text centered,
+ font=\ttfamily,
+ text width=2.8cm,
+ minimum height=7.8cm,
+ draw=green!50!black!50,
+ minimum width=\linewidth,
+ fill=black!10!green!2!white,
+ label={[shift={(0,-5mm)}]\texttt{top-make.mk}}] {};
+
+ %% Work-horse Makefiles. -5.6 -> -5.73 = -0.13
+ \node (initializemk) [node-makefile, at={(-5.73cm,-1.3cm)},
+ label={[shift={(0,-5mm)}]\texttt{initialize.mk}}] {};
+ \node (downloadmk) [node-makefile, at={(-2.93cm,-1.3cm)},
+ label={[shift={(0,-5mm)}]\texttt{download.mk}}] {};
+ \node (analysis1mk) [node-makefile, at={(-0.13cm,-1.3cm)},
+ label={[shift={(0,-5mm)}]\texttt{analysis1.mk}}] {};
+ \node (analysis2mk) [node-makefile, at={(2.67cm,-1.3cm)},
+ label={[shift={(0,-5mm)}]\texttt{analysis2.mk}}] {};
+ \node (analysis2mk) [node-makefile, at={(5.47cm,-1.3cm)},
+ label={[shift={(0,-5mm)}]\texttt{analysis3.mk}}] {};
+
+ %% verify.mk
+ \node [at={(-5.3cm,-2.8cm)},
+ thick,
+ rectangle,
+ text centered,
+ font=\ttfamily,
+ text width=2.45cm,
+ minimum width=3.5cm,
+ minimum height=1.3cm,
+ draw=green!50!black!50,
+ fill=black!10!green!12!white,
+ label={[shift={(1cm,-5mm)}]\texttt{verify.mk}}] {};
+
+ %% Paper.mk
+ \node [at={(2.67cm,-2.8cm)},
+ thick,
+ rectangle,
+ text centered,
+ text width=2.8cm,
+ minimum width=8.5cm,
+ minimum height=1.3cm,
+ draw=green!50!black!50,
+ fill=black!10!green!12!white,
+ font=\ttfamily,
+ label={[shift={(0,-5mm)}]\texttt{paper.mk}}] {};
+
+ %% paper.pdf
+ \ifdefined\paperpdf
+ \node (paperpdf) [node-terminal, at={(5.47cm,-2.9cm)}] {paper.pdf};
+ \fi
+
+ %% paper.tex
+ \ifdefined\papertex
+ \node (papertex) [node-nonterminal, at={(5.47cm,-4.2cm)}] {paper.tex};
+ \draw [->] (papertex) -- (paperpdf);
+ \fi
+
+ %% project.tex
+ \ifdefined\projecttex
+ \node (projecttex) [node-terminal, at={(-0.13cm,-2.9cm)}] {project.tex};
+ \draw [->] (projecttex) -- (paperpdf);
+ \fi
+
+ %% verify.tex
+ \ifdefined\verifytex
+ \node (verifytex) [node-terminal, at={(-5.73cm,-2.9cm)}] {verify.tex};
+ \draw [->] (verifytex) -- (projecttex);
+ \fi
+
+ %% Initialize.tex
+ \ifdefined\initializetex
+ \node (initializetex) [node-terminal, at={(-5.73cm,-0.8cm)}] {initialize.tex};
+ \node (initialize-south) [node-point, at={(-5.73cm,-1.5cm)}] {};
+ \draw [->] (initializetex) -- (verifytex);
+ \node [anchor=west, at={(-7.05cm,2.30cm)}] {Basic project info};
+ \node [anchor=west, at={(-7.05cm,1.95cm)}] {(e.g., Git commit).};
+ \node [anchor=west, at={(-7.05cm,1.10cm)}] {Also defines};
+ \node [anchor=west, at={(-7.05cm,0.75cm)}] {project structure};
+ \node [anchor=west, at={(-7.05cm,0.40cm)}] {(for \texttt{*.mk} files).};
+ \fi
+
+ %% download.tex
+ \ifdefined\downloadtex
+ \node (downloadtex) [node-terminal, at={(-2.93cm,-0.8cm)}] {download.tex};
+ \draw [rounded corners, -] (downloadtex) |- (initialize-south);
+ \fi
+
+ %% input-2.dat
+ \ifdefined\inputtwo
+ \node (input2) [node-terminal, at={(-2.93cm,1.1cm)}] {input2.dat};
+ \draw [->] (input2) -- (downloadtex);
+ \fi
+
+ %% INPUTS.conf
+ \ifdefined\inputsconf
+ \node (INPUTS) [node-nonterminal, at={(-2.93cm,4.6cm)}] {INPUTS.conf};
+ \node (input2-west) [node-point, at={(-4.33cm,1.1cm)}] {};
+ \draw [->,rounded corners] (INPUTS.west) -| (input2-west) |- (input2);
+ \fi
+
+ %% analysis1.tex
+ \ifdefined\analysisonetex
+ \node (a1tex) [node-terminal, at={(-0.13cm,-0.8cm)}] {analysis1.tex};
+ \draw [rounded corners, -] (a1tex) |- (initialize-south);
+ \fi
+
+ %% out1b.dat
+ \ifdefined\outoneb
+ \node (out1b) [node-terminal, at={(-0.13cm,1.1cm)}] {out-1b.dat};
+ \draw [->] (out1b) -- (a1tex);
+ \fi
+
+ %% outonebdep
+ \ifdefined\outonebdep
+ \node (out1b-west) [node-point, at={(-1.53cm,1.1cm)}] {};
+ \node (out1a) [node-terminal, at={(-0.13cm,2.7cm)}] {out-1a.dat};
+ \node (a1conf1) [node-nonterminal, at={(-0.13cm,4.6cm)}] {param-1.conf};
+ \draw [->] (input2) -- (out1b);
+ \draw [->] (out1a) -- (out1b);
+ \draw [->,rounded corners] (a1conf1.west) -| (out1b-west) |- (out1b);
+ \fi
+
+ %% input1.dat
+ \ifdefined\inputone
+ \node (input1) [node-terminal, at={(-2.93cm,1.9cm)}] {input1.dat};
+ \draw [->,rounded corners,] (input1.north) |- (out1a);
+ \fi
+
+ %% input1 dependencies
+ \ifdefined\inputonedep
+ \node (input1-east) [node-point, at={(-1.53cm,1.9cm)}] {};
+ \node (input1-west) [node-point, at={(-4.33cm,1.9cm)}] {};
+ \draw [->,rounded corners] (INPUTS.west) -| (input1-west) |- (input1);
+ \fi
+
+ %% analysis2.tex
+ \ifdefined\analysistwotex
+ \node (a2tex) [node-terminal, at={(2.67cm,-0.8cm)}] {analysis2.tex};
+ \draw [rounded corners, -] (a2tex) |- (initialize-south);
+ \fi
+
+ %% out-2b.dat
+ \ifdefined\outtwob
+ \node (out2b) [node-terminal, at={(2.67cm,0.3cm)}] {out-2b.dat};
+ \draw [->] (out2b) -- (a2tex);
+ \fi
+
+ %% out-2b dependencies
+ \ifdefined\outtwobdep
+ \draw [->,rounded corners,] (out1b.south) |- (out2b);
+ \fi
+
+ %% analysis3.tex
+ \ifdefined\analysisthreetex
+ \node (a3tex) [node-terminal, at={(5.47cm,-0.8cm)}] {analysis3.tex};
+ \draw [rounded corners, -] (a3tex) |- (initialize-south);
+ \fi
+
+ %% Outputs of analysis3
+ \ifdefined\analysisthreeouts
+ \node (out3a) [node-terminal, at={(5.47cm,2.7cm)}] {out-3a.dat};
+ \node (out3b) [node-terminal, at={(5.47cm,1.1cm)}] {out-3b.dat};
+ \node (a3tex-east) [node-point, at={(6.87cm,-0.8cm)}] {};
+ \draw [->,rounded corners] (out3a.east) -| (a3tex-east) |- (a3tex);
+ \draw [->] (out3b) -- (a3tex);
+ \fi
+
+ %% out-2a.dat
+ \ifdefined\outtwoa
+ \node (out2a) [node-terminal, at={(2.67cm,1.9cm)}] {out-2a.dat};
+ \draw [->, rounded corners] (out2a.south) |- (out3b);
+ \fi
+
+ %% Dependencies of out-2a
+ \ifdefined\outtwoadep
+ \node (a2conf1) [node-nonterminal, at={(2.67cm,5.3cm)}] {param-2a.conf};
+ \node (a2conf2) [node-nonterminal, at={(2.67cm,4.6cm)}] {param-2b.conf};
+ \node (out2a-west) [node-point, at={(1.27cm,1.9cm)}] {};
+ \draw [->,rounded corners] (a2conf1.west) -| (out2a-west) |- (out2a);
+ \draw [->,rounded corners] (a2conf2.west) -| (out2a-west) |- (out2a);
+ \draw [->] (input1) -- (out2a);
+ \fi
+
+ %% Dependencies of out-3a
+ \ifdefined\outthreeadep
+ \node (out3a-west) [node-point, at={(4.07cm,2.7cm)}] {};
+ \node (a3conf1) [node-nonterminal, at={(5.47cm,4.6cm)}] {param-3.conf};
+ \draw [->] (out1a) -- (out3a);
+ \draw [rounded corners] (a3conf1.west) -| (out3a-west) |- (out3a);
+ \fi
+\end{tikzpicture}
diff --git a/tex/src/figure-project-outline.tex b/tex/src/figure-project-outline.tex
new file mode 100644
index 0000000..4cd933d
--- /dev/null
+++ b/tex/src/figure-project-outline.tex
@@ -0,0 +1,229 @@
+% Copyright (C) 2018-2019 Mohammad Akhlaghi <mohammad@akhlaghi.org>
+%
+% This LaTeX source is free software: you can redistribute it and/or
+% modify it under the terms of the GNU General Public License as
+% published by the Free Software Foundation, either version 3 of the
+% License, or (at your option) any later version.
+%
+% This LaTeX source is distributed in the hope that it will be useful,
+% but WITHOUT ANY WARRANTY; without even the implied warranty of
+% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+% General Public License for more details.
+%
+% You should have received a copy of the GNU General Public License
+% along with this LaTeX source. If not, see <https://www.gnu.org/licenses/>.
+
+%% Environment variables.
+\newcommand{\allopacity}{1}
+\newcommand{\sver}{}
+\newcommand{\srep}{}
+\newcommand{\dver}{}
+\newcommand{\ddver}{}
+\newcommand{\confopt}{}
+\newcommand{\confenv}{}
+\newcommand{\containers}{}
+\newcommand{\db}{}
+\newcommand{\calib}{}
+\newcommand{\corr}{}
+\newcommand{\runord}{}
+\newcommand{\runopt}{}
+\newcommand{\humanerr}{}
+\newcommand{\confirmbias}{}
+\newcommand{\depupdate}{}
+\newcommand{\coauth}{}
+\newcommand{\varsinpaper}{}
+\newcommand{\recordinfo}{}
+\newcommand{\softcite}{}
+\newcommand{\prevchange}{}
+
+\begin{tikzpicture}[>=stealth, thick, black!50, text=black,
+ every new ->/.style={shorten >=1pt},
+ hv path/.style={to path={-| (\tikztotarget)}},
+ graphs/every graph/.style={edges=rounded corners}]
+
+ \footnotesize
+
+ %% This white line is only added to fix the vertical position of the
+ %% figure so it doesn't change as we add more boxes.
+ \draw [white] (0,-4.3) -- (0,3.8);
+ \draw [white] (-0.5,0) -- (12,0);
+
+ %% Box showing containers.
+ \ifdefined\containers
+ \filldraw[orange!20!white, rounded corners=2mm] (-0.1,0.3) rectangle (5.6,3.8);
+ \draw (-0.1,3.65) node [anchor=west] {\scriptsize Existing solutions:};
+ \draw (0,3.35) node [anchor=west] {\scriptsize Virtual machines};
+ \draw (0,3.05) node [anchor=west] {\scriptsize Containers};
+ \draw (0,2.75) node [anchor=west] {\scriptsize Package managers};
+ \fi
+
+ \graph[grow right sep, simple] {
+ { [nodes={yshift=7mm}]
+ soft/Software [gbox] -> build/Build [bbox],
+ hard/Hardware/data [gbox, yshift=-0.5cm] --
+ p1 [coordinate, xshift=2cm, yshift=-0.5cm]
+ } -- [hv path]
+ p2 [coordinate] ->
+ srun/Run software on data [bbox] ->
+ paper/Paper [bbox]
+ };
+
+ \ifdefined\paperfinal
+ \node (happy) [inner sep=0pt, below=of paper, yshift=+8mm]
+ {\includegraphics[width=2cm]{img/happy-question.jpg}};
+ \node (happyurl) [below=of happy, xshift=-9.5mm, yshift=+1cm]
+ {\tiny \url{https://heywhatwhatdidyousay.wordpress.com}};
+ \node (qurl) [below=of happyurl, xshift=10.5mm, yshift=+1.2cm]
+ {\tiny \url{http://pngimages.net}};
+ \else
+ \ifdefined\paperinit
+ \node (happy) [inner sep=0pt, below=of paper, yshift=+8mm]
+ {\includegraphics[width=2cm]{img/happy.jpg}};
+ \node (happyurl) [below=of happy, xshift=-9.5mm, yshift=+1cm]
+ {\tiny \url{https://heywhatwhatdidyousay.wordpress.com}};
+ \fi
+ \fi
+
+ %% Software...
+ \let\ppopacity\undefined
+ \ifdefined\allopacity \newcommand{\ppopacity}{1}
+ \else \ifdefined\focusonpackages
+ \newcommand{\ppopacity}{1}
+ \else
+ \newcommand{\ppopacity}{0.3}
+ \fi
+ \fi
+
+ \ifdefined\sver
+ \node (sver)
+ [rbox, above=of soft, yshift=-8mm, opacity=\ppopacity]
+ {What version?};
+ \fi
+ \ifdefined\srep
+ \node (srep)
+ [rbox, above=of sver, yshift=-8mm, opacity=\ppopacity]
+ {Repository?};
+ \fi
+
+ %% Build
+ \ifdefined\dver
+ \node (dver)
+ [rbox, above=of build, yshift=-8mm, opacity=\ppopacity]
+ {Dependencies?};
+ \fi
+ \ifdefined\ddver
+ \node (ddver)
+ [rbox, above=of dver, yshift=-8mm, opacity=\ppopacity]
+ {Dep. versions?};
+ \fi
+ \ifdefined\confopt
+ \node (confopt)
+ [rbox, above=of ddver, yshift=-8mm, opacity=\ppopacity]
+ {Config options?};
+ \fi
+ \ifdefined\confenv
+ \node (confenv)
+ [rbox, above=of confopt, yshift=-8mm, opacity=\ppopacity]
+ {Config environment?};
+ \fi
+
+ %% Hardware/data
+ \let\ppopacity\undefined
+ \ifdefined\allopacity \newcommand{\ppopacity}{1}
+ \else \ifdefined\focusonhardware
+ \newcommand{\ppopacity}{1}
+ \else
+ \newcommand{\ppopacity}{0.3}
+ \fi
+ \fi
+ \ifdefined\db
+ \node (db)
+ [rbox, below=of hard, yshift=+8mm, opacity=\ppopacity]
+ {Data base, or PID?};
+ \fi
+ \ifdefined\calib
+ \node (calib)
+ [rbox, below=of db, yshift=+8mm, opacity=\ppopacity]
+ {Calibration/version?};
+ \fi
+ \ifdefined\corr
+ \node (corr)
+ [rbox, below=of calib, yshift=+8mm, opacity=\ppopacity]
+ {Integrity?};
+ \fi
+
+ %% Run software ...
+ \let\ppopacity\undefined
+ \ifdefined\allopacity \newcommand{\ppopacity}{1}
+ \else \ifdefined\focusonrun
+ \newcommand{\ppopacity}{1}
+ \else
+ \newcommand{\ppopacity}{0.3}
+ \fi
+ \fi
+ \ifdefined\runord
+ \node (runord)
+ [rbox, above=of srun, yshift=-8mm, opacity=\ppopacity]
+ {What order?};
+ \fi
+ \ifdefined\runopt
+ \node (runopt)
+ [rbox, above=of runord, yshift=-8mm, opacity=\ppopacity]
+ {Runtime options?};
+ \fi
+ \ifdefined\humanerr
+ \node (humanerr)
+ [rbox, above=of runopt, yshift=-8mm, opacity=\ppopacity]
+ {Human error?};
+ \fi
+ \ifdefined\confirmbias
+ \node (confirmbias)
+ [rbox, above=of humanerr, yshift=-8mm, opacity=\ppopacity]
+ {Confirmation bias?};
+ \fi
+ \ifdefined\depupdate
+ \node (depupdate)
+ [rbox, below=of srun, yshift=+8mm, opacity=\ppopacity]
+ {Environment update?};
+ \fi
+ \ifdefined\coauth
+ \node (coaut)
+ [rbox, below=of depupdate, yshift=+8mm, opacity=\ppopacity]
+ {In sync with coauthors?};
+ \fi
+
+ %% Paper ...
+ \let\ppopacity\undefined
+ \ifdefined\allopacity \newcommand{\ppopacity}{1}
+ \else \ifdefined\focusonpaper
+ \newcommand{\ppopacity}{1}
+ \else
+ \newcommand{\ppopacity}{0.3}
+ \fi
+ \fi
+ \ifdefined\varsinpaper
+ \node (varsinpaper)
+ [rbox, above=of paper, xshift=-1mm, yshift=-8mm, opacity=\ppopacity]
+ {Sync with analysis?};
+ \fi
+ \ifdefined\recordinfo
+ \node (recordinfo)
+ [rbox, above=of varsinpaper, yshift=-8mm, opacity=\ppopacity]
+ {Report this info?};
+ \fi
+ \ifdefined\softcite
+ \node (softcite)
+ [rbox, above=of recordinfo, yshift=-8mm, opacity=\ppopacity]
+ {Cited software?};
+ \fi
+ \ifdefined\prevchange
+ \node (prevchange)
+ [rbox, above=of softcite, yshift=-8mm, opacity=\ppopacity]
+ {History recorded?};
+ \fi
+
+ \ifdefined\gitlogo
+ \node [inner sep=0pt, opacity=0.5] at (5.5,0)
+ {\includegraphics[width=10cm]{img/git.png}};
+ \fi
+\end{tikzpicture}
diff --git a/tex/src/preamble-biblatex.tex b/tex/src/preamble-biblatex.tex
index fc9f259..caa03b4 100644
--- a/tex/src/preamble-biblatex.tex
+++ b/tex/src/preamble-biblatex.tex
@@ -59,8 +59,8 @@
url=false,
dashed=false,
eprint=false,
- maxbibnames=2,
- minbibnames=1,
+ maxbibnames=10,
+ minbibnames=4,
hyperref=true,
maxcitenames=2,
mincitenames=1,
@@ -90,7 +90,8 @@
%% Set the color of the doi link to mymg (magenta) and the ads links
%% to mypurp (or purple):
\definecolor{mypurp}{cmyk}{0.75,1,0,0}
-\newcommand{\doihref}[2]{\href{#1}{\color{magenta}{#2}}}
+\definecolor{myblue}{rgb}{0,0.669,0.885}
+\newcommand{\doihref}[2]{\href{#1}{\color{myblue}{#2}}}
\newcommand{\adshref}[2]{\href{#1}{\color{mypurp}{#2}}}
\newcommand{\blackhref}[2]{\href{#1}{\color{black}{#2}}}
@@ -117,7 +118,7 @@
\usebibmacro{begentry}%
\usebibmacro{author/translator+others}%
\newunit%
- \ifdefined\makethesis\printtext{\usebibmacro{title}}\fi%
+ \printtext{\usebibmacro{title}}%
\newunit%
\printtext[doilink]{\usebibmacro{journal}}%
\addcomma%
diff --git a/tex/src/preamble-header.tex b/tex/src/preamble-header.tex
deleted file mode 100644
index 81a08b0..0000000
--- a/tex/src/preamble-header.tex
+++ /dev/null
@@ -1,89 +0,0 @@
-%% The headers: title, authors, top of pages and section title formatting
-%% of the final LaTeX file are configured here.
-%
-%% Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%
-%% This template is free software: you can redistribute it and/or modify it
-%% under the terms of the GNU General Public License as published by the
-%% Free Software Foundation, either version 3 of the License, or (at your
-%% option) any later version.
-%
-%% This template is distributed in the hope that it will be useful, but
-%% WITHOUT ANY WARRANTY; without even the implied warranty of
-%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
-%% General Public License for more details.
-%
-%% You should have received a copy of the GNU General Public License along
-%% with this template. If not, see <http://www.gnu.org/licenses/>.
-
-
-
-
-
-%% General page header settings.
-\usepackage{fancyhdr}
-\pagestyle{fancy}
-\lhead{\footnotesize{\scshape Draft paper}, {\footnotesize nnn:i (pp), Year Month day}}
-\rhead{\scshape\footnotesize YOUR-NAME et al.}
-\cfoot{\thepage}
-\setlength{\voffset}{0.75cm}
-\setlength{\headsep}{0.2cm}
-\setlength{\footskip}{0.75cm}
-\renewcommand{\headrulewidth}{0pt}
-
-
-
-
-
-%% Specific style for first page.
-\fancypagestyle{firststyle}
-{
- \lhead{\footnotesize{\scshape Draft paper}, nnn:i (pp), YYYY Month day\\
- \scriptsize \textcopyright YYYY, Your name. All rights reserved.}
- \rhead{\footnotesize \footnotesize \today, \currenttime\\}
-}
-
-
-
-
-
-%To set the style of the titles:
-\usepackage{titlesec}
-\titleformat{\section}
- {\centering\normalfont\uppercase}
- {\thesection.}
- {0em}
- { }
-\titleformat{\subsection}
- {\centering\normalsize\slshape}
- {\thesubsection.}
- {0em}
- { }
-\titleformat{\subsubsection}
- {\centering\small\slshape}
- {\thesubsubsection.}
- {0em}
- { }
-
-
-
-
-
-% Basic Document information that goes into the PDF meta-data.
-\hypersetup
-{
- pdfauthor={YOUR NAME},
- pdfsubject={A SHORT DESCRIPTION OF THE WORK},
- pdftitle={THE TITLE OF THIS PROJECT},
- pdfkeywords={SOME, KEYWORDS, FOR, THE, PDF}
-}
-
-
-
-
-
-%% Title and author information
-\usepackage{authblk}
-\renewcommand\Authfont{\small\scshape}
-\renewcommand\Affilfont{\footnotesize\normalfont}
-\setlength{\affilsep}{0.2cm}
diff --git a/tex/src/preamble-necessary.tex b/tex/src/preamble-necessary.tex
deleted file mode 100644
index b6f5b55..0000000
--- a/tex/src/preamble-necessary.tex
+++ /dev/null
@@ -1,90 +0,0 @@
-%% Necessary (independent of style) macros for this project.
-%%
-%% These are a set of packages that have been commonly necessary in most
-%% LaTeX usages. However, if any are not needed in your work, please feel
-%% free to remove them.
-%
-%% Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%
-%% This template is free software: you can redistribute it and/or modify it
-%% under the terms of the GNU General Public License as published by the
-%% Free Software Foundation, either version 3 of the License, or (at your
-%% option) any later version.
-%
-%% This template is distributed in the hope that it will be useful, but
-%% WITHOUT ANY WARRANTY; without even the implied warranty of
-%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
-%% General Public License for more details.
-%
-%% You should have received a copy of the GNU General Public License along
-%% with this template. If not, see <http://www.gnu.org/licenses/>.
-
-
-
-
-
-%% Values from the analysis.
-\input{tex/build/macros/project.tex}
-
-
-
-
-
-% Macros for to help in typing, remove them if you don't need them, but
-% this can help as a demo on how you can simply writing of commonly used
-% words that need special formatting (like software names).
-\newcommand{\snsign}{{\small S}/{\small N}}
-\newcommand{\originsoft}{\textsf{ORIGIN}}
-\newcommand{\sextractor}{\textsf{SE\-xtractor}}
-\newcommand{\noisechisel}{\textsf{Noise\-Chisel}}
-\newcommand{\makecatalog}{\textsf{Make\-Catalog}}
-
-
-
-
-
-%% For highlighting updates. When this is set, text marked as \new
-%% will be colored in dark green and text that is marked wtih \tonote
-%% will be marked in dark red.
-\ifdefined\highlightchanges
-\newcommand{\new}[1]{\textcolor{green!60!black}{#1}}
-\newcommand{\tonote}[1]{\textcolor{red!60!black}{[#1]}}
-\else
-\newcommand{\new}[1]{\textcolor{black}{#1}}
-\newcommand{\tonote}[1]{{}}
-\fi
-
-
-
-
-
-% Better than verbatim for displaying typed text.
-\usepackage{alltt}
-
-
-
-
-
-% For arithmetic opertions within LaTeX
-\usepackage[nomessages]{fp}
-
-
-
-
-
-%To add a code font to the text:
-\usepackage{courier}
-
-
-
-
-
-%To add some enumerating styles
-\usepackage{enumerate}
-
-
-
-
-
-%Including images if necessary
-\usepackage{graphicx}
diff --git a/tex/src/preamble-pgfplots.tex b/tex/src/preamble-pgfplots.tex
index 0ffb294..3f467a6 100644
--- a/tex/src/preamble-pgfplots.tex
+++ b/tex/src/preamble-pgfplots.tex
@@ -67,7 +67,9 @@
%% slow with detailed plots). 2) You can use the PDFs of the individual
%% plots for other purposes (for example to include in slides) cleanly.
\usepackage{tikz}
+\usetikzlibrary{graphs}
\usetikzlibrary{external}
+\usetikzlibrary{positioning}
\tikzexternalize
\tikzsetexternalprefix{tikz/}
@@ -119,3 +121,73 @@
legend style = {font=\footnotesize},
label style = {font=\footnotesize}
}
+
+
+
+
+
+%% Nodes in demo graphs
+\tikzset{node-terminal/.style={
+ rectangle,
+ very thick,
+ draw=blue!50,
+ text centered,
+ top color=white,
+ minimum size=6mm,
+ text width=2.1cm,
+ rounded corners=3mm,
+ bottom color=blue!20,
+ font=\ttfamily}}
+
+\tikzset{node-nonterminal/.style={
+ rectangle,
+ very thick,
+ text centered,
+ top color=white,
+ text width=2.1cm,
+ minimum size=6mm,
+ draw=green!50!black!50,
+ bottom color=green!80!black!50,
+ font=\ttfamily}}
+
+\tikzset{node-makefile/.style={
+ thick,
+ rectangle,
+ anchor=south,
+ minimum width=2.6cm,
+ minimum height=5cm,
+ draw=green!50!black!50,
+ fill=black!10!green!12!white,
+}}
+
+\tikzset{node-point/.style={
+ circle,
+ black!50,
+ inner sep=0pt,
+ minimum size=0pt,
+ fill=white}}
+
+\tikzset{ bbox/.style={
+ rectangle,
+ minimum width=2.5cm,
+ rounded corners=2mm,
+ very thick,draw=blue!50,
+ top color=white,
+ bottom color=blue!20 } }
+
+\tikzset{ rbox/.style={
+ rectangle,
+ dotted,
+ minimum width=2.5cm,
+ rounded corners=2mm,
+ very thick,draw=red!50!black!50,
+ top color=white,
+ bottom color=red!50!black!20 } }
+
+\tikzset{ gbox/.style={
+ rectangle,
+ minimum width=2.5cm,
+ very thick,
+ draw=green!50!black!50,
+ top color=white,
+ bottom color=green!50!black!20 } }
diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex
index 95fafc8..e843903 100644
--- a/tex/src/preamble-style.tex
+++ b/tex/src/preamble-style.tex
@@ -1,152 +1,125 @@
-%% General paper's style settings.
-%%
-%% This preamble can be completely ignored when including this TeX file in
-%% another style. This is done because this LaTeX build is meant to be an
-%% initial/internal phase or part of a larger effort, so it has a basic
-%% style defined here as a preamble. To ignore it, uncomment or delete the
-%% respective line in `paper.tex'.
-%%
-%% Copyright (C) 2019-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%
-%% This template is free software: you can redistribute it and/or modify it
-%% under the terms of the GNU General Public License as published by the
-%% Free Software Foundation, either version 3 of the License, or (at your
-%% option) any later version.
-%
-%% This template is distributed in the hope that it will be useful, but
-%% WITHOUT ANY WARRANTY; without even the implied warranty of
-%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
-%% General Public License for more details.
-%
-%% You should have received a copy of the GNU General Public License along
-%% with this template. If not, see <http://www.gnu.org/licenses/>.
-
-
-
-
-
-%% Font.
+%% Set the page margins (use `showframe' to see the sides).
+%% A4 is 210mm x 297mm
+\usepackage[a4paper]{geometry}
+
+%% Horizontal space of text: total is 210mm
+\setlength{\hoffset}{0mm} % remaining: 190mm
+\setlength{\textwidth}{155mm} % remaining: 160mm
+\setlength{\marginparsep}{0pt}
+\setlength{\marginparwidth}{0pt}
+\setlength{\oddsidemargin}{0pt}
+
+%% Vertical space of text: total is 297mm.
+\setlength{\voffset}{-15.4mm}    % remaining: 287mm (10mm used: 1in + \voffset).
+\setlength{\topmargin}{0mm} % remaining: 287mm.
+\setlength{\headheight}{10mm} % remaining: 277mm.
+\setlength{\headsep}{10mm} % remaining: 272mm.
+\setlength{\textheight}{245mm} % remaining: 22mm.
+\setlength{\footskip}{7mm} % remaining: 10mm.
+
+%% To see the layout, add a `\layout' right after `\begin{document}'.
+\usepackage{layout}
+
+%% To allow a prefix to the enumeration.
+\usepackage{enumitem}
+
+%% Horizontal line with spacing
+\newcommand{\horizontalline}{\vspace{3mm}\hrule\vspace{3mm}}
+
+%% Custom title format
+\usepackage{setspace}
+\makeatletter
+\renewcommand{\maketitle}{\bgroup\setlength{\parindent}{0pt}
+ \begin{flushleft}
+ {\mpbold RESEARCH PAPER}
+
+ \vspace{3mm}
+ {\LARGE\mpmedium \@title}
+
+ \vspace{2mm}
+ \@author
+ \end{flushleft}\egroup
+ \horizontalline
+}
+\makeatother
+
+%% For authors and affiliations
+\newcommand{\authoraffil}[2]{#1\textsuperscript{\mplight#2}}
+
+%% Spacing before and after section titles.
+%% Format: \titlespacing*{<command>}{<left>}{<before-sep>}{<after-sep>}
+\usepackage{titlesec}
+\titlespacing*{\section}{0pt}{10pt plus 0pt minus 0pt}{0pt plus 0pt minus 0pt}
+\titlespacing*{\subsection}{0pt}{10pt plus 0pt minus 0pt}{0pt plus 0pt minus 0pt}
+\titlespacing*{\subsubsection}{0pt}{10pt plus 0pt minus 0pt}{0pt plus 0pt minus 0pt}
+\titleformat{\section}{\large\scshape\bf}{\thesection.{ }}{0pt}{}
+\titleformat{\subsection}{\bfseries\itshape}{\thesubsection.{ }}{0pt}{}
+\titleformat{\subsubsection}{\bfseries\itshape}{\thesubsubsection.{ }}{0pt}{}
+
+%% Set the font.
+%% After downloading, put the font in `/usr/share/fonts/TTF'.
+%% https://www.fontspace.com/m-fonts/m-2p
+%% Also for M+: https://mplus-fonts.osdn.jp/about-en.html
+%% https://www.fontpalace.com/font-download/Memento/
+\usepackage{fontspec}
\usepackage[T1]{fontenc}
-\usepackage{newtxtext}
-\usepackage{newtxmath}
-
-
-
-
-
-%% Print size
-\usepackage[a4paper, includeheadfoot, body={18.7cm, 24.5cm}]{geometry}
-
-
-
-
-
-%% Set the distance between the columns if two columns:
-\setlength{\columnsep}{0.75cm}
-
-
-
-
-
-% To allow figures to take up more space on the top of the page:
-\renewcommand{\topfraction}{.99}
-\renewcommand{\bottomfraction}{.7}
-\renewcommand{\textfraction}{.05}
-\renewcommand{\floatpagefraction}{.99}
-\renewcommand{\dbltopfraction}{.99}
-\renewcommand{\dblfloatpagefraction}{.99}
-\setcounter{topnumber}{1}
-\setcounter{bottomnumber}{0}
-\setcounter{totalnumber}{2}
-\setcounter{dbltopnumber}{1}
-
-
-
-
-
-%% Color related settings:
-\usepackage{xcolor}
-\color{black} % Text color
-\definecolor{DarkBlue}{RGB}{0,0,90}
-
-
-
-
-
-
-% figure and figure* ordering correction:
-\usepackage{fixltx2e}
-
-
-
-
-
-%% For editing the caption appearence. The `setspace' package defines
-%% the `stretch' variable. `abovecaptionskip' is the distance between
-%% the figure and the caption.
-\usepackage{setspace, caption}
-\captionsetup{font=footnotesize, labelfont={color=DarkBlue,bf}, skip=1pt}
-\captionsetup[figure]{font={stretch=1, small}}
-\setlength{\abovecaptionskip}{3pt plus 1pt minus 1pt}
-\setlength{\belowcaptionskip}{-1.25em}
-
-
-
-
-
-
-%% To make the footnotes align:
-\usepackage[hang]{footmisc}
-\setlength\footnotemargin{10pt}
-
-
-
-
-
-%For including time in the title:
-\usepackage{datetime}
-
-
-
-
-
-%To make links to webpages and include document information in the
-%properties of the PDF
+\usepackage{Alegreya}
+\newfontfamily\mplight{AlegreyaSans-Light}
+\newfontfamily\mpbold{AlegreyaSans-Bold}
+\newfontfamily\mpmedium{AlegreyaSans-Medium}
+\newfontfamily\mpregular{AlegreyaSans-Regular}
+
+%% For highlighting updates. When this is set, text marked as \new
+%% will be colored in dark green and text that is marked with \tonote
+%% will be marked in dark red.
+\ifdefined\highlightchanges
+\newcommand{\new}[1]{\textcolor{green!60!black}{#1}}
+\newcommand{\tonote}[1]{\textcolor{red!60!black}{[#1]}}
+\else
+\newcommand{\new}[1]{\textcolor{black}{#1}}
+\newcommand{\tonote}[1]{{}}
+\fi
+
+%% To have links.
\usepackage[
colorlinks,
- urlcolor=blue,
- citecolor=blue,
- linkcolor=blue,
+ urlcolor=gray,
+ citecolor=gray,
+ linkcolor=gray,
linktocpage]{hyperref}
\renewcommand\UrlFont{\rmfamily}
+%% To include figures.
+\usepackage{graphicx}
+%% To manage captions.
+\usepackage[font={footnotesize}]{caption}
+%% To use colors.
+\usepackage{xcolor}
-
-%% Define the abstract environment
-\renewenvironment{abstract}
- {\vspace{-0.5cm}\small%
- \list{}{%
- \setlength{\leftmargin}{2cm}%
- \setlength{\rightmargin}{\leftmargin}%
- }%
- \item\relax}
- {\endlist}
-
-
-
-
-
-%% To keep the main page's code clean.
-\newcommand{\includeabstract}[1]{%
-\twocolumn[%
- \begin{@twocolumnfalse}%
- \maketitle%
- \begin{abstract}%
- #1%
- \end{abstract}%
- \vspace{1cm}%
- \end{@twocolumnfalse}%
- ]%
+%% Header and footer style.
+\usepackage{lastpage}
+\usepackage{fancyhdr}
+\pagestyle{fancy}
+\lhead{\mplight\footnotesize Art.XX, page {\thepage} of \pageref{LastPage}}
+\chead{}
+\rhead{\mplight\footnotesize Akhlaghi et al; Reproducible paper template}
+\lfoot{}
+\cfoot{}
+\rfoot{}
+\renewcommand\headrulewidth{0.0pt}
+\renewcommand\footrulewidth{0.0pt}
+\fancypagestyle{firstpage} {
+ \lhead{\includegraphics[width=3.5cm]{tex/img/codata.png}}
+ \chead{}
+ \rhead{\mplight\footnotesize
+ Akhlaghi, M, et al. 2019. Reproducible paper template\\
+ \emph{Data Science Journal}, VV, NN, pp.1-N,\\
+ DOI: https://doi.org/10.5334/dsj-XXXX-XXX}
+ \lfoot{}
+ \cfoot{}
+ \rfoot{}
+ \renewcommand\headrulewidth{0.1pt}
+ \renewcommand\footrulewidth{0.0pt}
}
diff --git a/tex/src/references.tex b/tex/src/references.tex
index 02a7d50..0f8171b 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -1,33 +1,1503 @@
-%% Non-software BibTeX entries. The software-specific BibTeX entries are
-%% stored in a `*.tex' file under the `tex/dependencies' directory.
-%
-%% Copyright (C) 2018-2020 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%
-%% Copying and distribution of this file, with or without modification,
-%% are permitted in any medium without royalty provided the copyright
-%% notice and this notice are preserved. This file is offered as-is,
-%% without any warranty.
+@ARTICLE{gibney20,
+ author = {Elizabeth Gibney},
+ title = {This AI researcher is trying to ward off a reproducibility crisis},
+ year = {2020},
+ journal = {Nature},
+ volume = {577},
+ pages = {14},
+ doi = {10.1038/d41586-019-03895-5},
+}
+
+
+
+
+
+@ARTICLE{munafo19,
+ author = {Marcus Munaf\'o},
+ title = {Raising research quality will require collective action},
+ year = {2019},
+ journal = {Nature},
+ volume = {576},
+ pages = {183},
+ doi = {10.1038/d41586-019-03750-7},
+}
+
+
+
+
+
+@ARTICLE{jones19,
+ author = {{Jones}, M.~G. and {Verdes-Montenegro}, L. and {Damas-Segovia}, A. and
+ {Borthakur}, S. and {Yun}, M. and {del Olmo}, A. and {Perea}, J. and
+ {Rom{\'a}n}, J. and {Luna}, S. and {Lopez Gutierrez}, D. and
+ {Williams}, B. and {Vogt}, F.~P.~A. and {Garrido}, J. and
+ {Sanchez}, S. and {Cannon}, J. and {Ram{\'\i}rez-Moreta}, P.},
+ title = "{Evolution of compact groups from intermediate to final stages. A case study of the H I content of HCG 16}",
+ journal = {Astronomy \& Astrophysics},
+ eprint = {1910.03420},
+ keywords = {galaxies: groups: individual: HCG 16, galaxies: interactions, galaxies: evolution, galaxies: ISM, radio lines: galaxies},
+ year = "2019",
+ month = "Dec",
+ volume = {632},
+ eid = {A78},
+ pages = {A78},
+ doi = {10.1051/0004-6361/201936349},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2019A&A...632A..78J},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{banek19,
+ author = {{Banek}, Christine and {Thornton}, Adam and {Economou}, Frossie and
+ {Fausti}, Angelo and {Krughoff}, K. Simon and {Sick}, Jonathan},
+ title = "{Why is the LSST Science Platform built on Kubernetes?}",
+ journal = {Proceedings of ADASS XXIX},
+ volume = {arXiv},
+ keywords = {Astrophysics - Instrumentation and Methods for Astrophysics},
+ year = "2019",
+ month = "Nov",
+ eid = {arXiv:1911.06404},
+ pages = {1911.06404},
+archivePrefix = {arXiv},
+ eprint = {1911.06404},
+ primaryClass = {astro-ph.IM},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2019arXiv191106404B},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{infante19,
+ author = {{Infante-Sainz}, Ra{\'u}l and {Trujillo}, Ignacio and
+ {Rom{\'a}n}, Javier},
+ title = "{The Sloan Digital Sky Survey extended Point Spread Functions}",
+ journal = {MNRAS},
+ volume = {arXiv},
+ keywords = {Astrophysics - Instrumentation and Methods for Astrophysics, Astrophysics - Astrophysics of Galaxies},
+ year = "2019",
+ month = "Nov",
+ eid = {arXiv:1911.01430},
+ pages = {1911.01430},
+archivePrefix = {arXiv},
+ eprint = {1911.01430},
+ primaryClass = {astro-ph.IM},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2019arXiv191101430I},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{fineberg19,
+  author = {Harvey V. Fineberg and David B. Allison and Lorena A. Barba and Dianne Chong and David L. Donoho and Juliana Freire and Gerald Gabrielse and Constantine Gatsonis and Edward Hall and Thomas H. Jordan and Dietram A. Scheufele and Victoria Stodden and Simine Vazire and Timothy D. Wilson and Wendy Wood and Jennifer Heimberg and Thomas Arrison and Michael Cohen and Michele Schwalbe and Adrienne Stith Butler and Barbara A. Wanchisen and Tina Winters and Rebecca Morgan and Thelma Cox and Lesley Webb and Garret Tyson and Erin Hammers Forstag},
+ title = {Reproducibility and Replicability in Science},
+ journal = {The National Academies Press},
+ year = 2019,
+ pages = {1},
+ doi = {10.17226/25303},
+}
+
+
+
+
+
+@ARTICLE{akhlaghi19,
+ author = {{Akhlaghi}, Mohammad},
+ title = "{Carving out the low surface brightness universe with NoiseChisel}",
+ journal = {IAU Symposium 355},
+ volume = {arXiv},
+ keywords = {Astrophysics - Instrumentation and Methods for Astrophysics, Astrophysics - Astrophysics of Galaxies, Computer Science - Computer Vision and Pattern Recognition},
+ year = "2019",
+ month = "Sep",
+ eid = {arXiv:1909.11230},
+ pages = {1909.11230},
+archivePrefix = {arXiv},
+ eprint = {1909.11230},
+ primaryClass = {astro-ph.IM},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2019arXiv190911230A},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{cribbs19,
+ author = {Cribbs, AP and Luna-Valero, S and George, C and Sudbery, IM and Berlanga-Taylor, AJ and Sansom, SN and Smith, T and Ilott, NE and Johnson, J and Scaber, J and Brown, K and Sims, D and Heger, A},
+ title = {CGAT-core: a python framework for building scalable, reproducible computational biology workflows [version 2; peer review: 1 approved, 1 approved with reservations]},
+ journal = {F1000Research},
+ year = 2019,
+ volume = 8,
+ pages = {377},
+ doi = {10.12688/f1000research.18674.2},
+}
+
+
+
+
+
+@ARTICLE{brinckman19,
+author = "Adam Brinckman and Kyle Chard and Niall Gaffney and Mihael Hategan and Matthew B. Jones and Kacper Kowalik and Sivakumar Kulasekaran and Bertram Ludäscher and Bryce D. Mecum and Jarek Nabrzyski and Victoria Stodden and Ian J. Taylor and Matthew J. Turk and Kandace Turner",
+ title = {Computing environments for reproducibility: Capturing the ``Whole Tale''},
+ journal = {Future Generation Computer Systems},
+ year = 2019,
+ volume = 94,
+ pages = 854,
+ doi = {10.1016/j.future.2017.12.029},
+}
+
+
+
+
+
+@ARTICLE{uhse19,
+ author = {Uhse, Simon and Pflug, Florian G. and {von Haeseler}, Arndt and Djamei, Armin},
+ title = {Insertion Pool Sequencing for Insertional Mutant Analysis in Complex Host‐Microbe Interactions},
+ journal = {Current Protocols in Plant Biology},
+ volume = {4},
+ year = "2019",
+ month = "July",
+ pages = {e20097},
+ doi = {10.1002/cppb.20097},
+}
+
+
+
+
+
+@ARTICLE{alliez19,
+ author = {{Alliez}, Pierre and {Di Cosmo}, Roberto and {Guedj}, Benjamin and
+ {Girault}, Alain and {Hacid}, Mohand-Said and {Legrand}, Arnaud and
+ {Rougier}, Nicolas P.},
+ title = "{Attributing and Referencing (Research) Software: Best Practices and Outlook from Inria}",
+ journal = {Computing in Science \& Engineering},
+ volume = {22},
+ keywords = {Computer Science - Digital Libraries, Computer Science - Software Engineering},
+ year = "2019",
+ month = "May",
+ pages = {39},
+archivePrefix = {arXiv},
+ eprint = {1905.11123},
+ primaryClass = {cs.DL},
+ doi = {10.1109/MCSE.2019.2949413},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2019arXiv190511123A},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{kneller19,
+  author = {Kneller, Gerald R. and Hinsen, Konrad},
+ title = {Memory effects in a random walk description of protein structure ensembles},
+ journal = {The Journal of Chemical Physics},
+ volume = {150},
+ year = {2019},
+ pages = {064911},
+ doi = {10.1063/1.5054887},
+}
+
+
+
+
+
+@ARTICLE{plesser18,
+ author = {Hans E. Plesser},
+ title = {Reproducibility vs. Replicability: A Brief History of a Confused Terminology},
+ journal = {Frontiers in Neuroinformatics},
+ volume = {11},
+ year = {2018},
+ pages = {76},
+ doi = {10.3389/fninf.2017.00076},
+}
+
+
+
+
+
+@ARTICLE{zhang18,
+ author = {{Zhang}, Zhi-Yu and {Romano}, D. and {Ivison}, R.~J. and
+ {Papadopoulos}, Padelis P. and {Matteucci}, F.},
+ title = "{Stellar populations dominated by massive stars in dusty starburst galaxies across cosmic time}",
+ journal = {Nature},
+ keywords = {Astrophysics - Astrophysics of Galaxies},
+ year = "2018",
+ month = "Jun",
+ volume = {558},
+ number = {7709},
+ pages = {260},
+ doi = {10.1038/s41586-018-0196-x},
+archivePrefix = {arXiv},
+ eprint = {1806.01280},
+ primaryClass = {astro-ph.GA},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2018Natur.558..260Z},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{smart18,
+ author = {{Smart}, A.G.},
+ title = {The war over supercooled water},
+ journal = {Physics Today},
+ volume = {Aug},
+ year = "2018",
+ pages = {DOI:10.1063/PT.6.1.20180822a},
+ doi = {10.1063/PT.6.1.20180822a},
+}
+
+
+
+
+
+@ARTICLE{kaiser18,
+ author = {{Kaiser}, J.},
+ title = {Plan to replicate 50 high-impact cancer papers shrinks to just 18},
+ journal = {Science},
+ volume = {Jul},
+ year = "2018",
+ pages = {31},
+ doi = {10.1126/science.aau9619},
+}
+
+
+
+
+
+@ARTICLE{dicosmo18,
+ author = {{Di Cosmo}, Roberto and {Gruenpeter}, Morane and {Zacchiroli}, Stefano},
+ title = {Identifiers for Digital Objects: The case of software source code preservation},
+ journal = {Proceedings of iPRES 2018},
+ year = "2018",
+ pages = {204.4},
+ doi = {10.17605/osf.io/kde56},
+}
+
+
+
+
+
+@ARTICLE{gruning18,
+ author = {Gr\"uning, Bj\"orn and Chilton, John and K\"oster, Johannes and Dale, Ryan and Soranzo, Nicola and {van den Beek}, Marius and Goecks, Jeremy and Backofen, Rolf and Nekrutenko, Anton and Taylor, James},
+ title = {Practical Computational Reproducibility in the Life Sciences},
+ journal = {Cell Systems},
+ volume = 6,
+ year = "2018",
+ pages = {631. bioRxiv:\href{https://www.biorxiv.org/content/10.1101/200683v2}{200683}},
+ doi = {10.1016/j.cels.2018.03.014},
+}
+
+
+
+
+
+@ARTICLE{allen18,
+ author = {{Allen}, Alice and {Teuben}, Peter J. and {Ryan}, P. Wesley},
+ title = "{Schroedinger's Code: A Preliminary Study on Research Source Code Availability and Link Persistence in Astrophysics}",
+ journal = {The Astrophysical Journal Supplement Series},
+ keywords = {methods: numerical, Astrophysics - Instrumentation and Methods for Astrophysics, Computer Science - Digital Libraries},
+ year = "2018",
+ month = "May",
+ volume = {236},
+ number = {1},
+ eid = {10},
+ pages = {10},
+ doi = {10.3847/1538-4365/aab764},
+archivePrefix = {arXiv},
+ eprint = {1801.02094},
+ primaryClass = {astro-ph.IM},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2018ApJS..236...10A},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{burrell18,
+ author = {{Burrell}, A.G. and {Halford}, A. and {Klenzing}, J. and {Stoneback}, R.A. and {Morley}, S.K. and {Annex}, A.M. and {Laundal}, K.M. and {Kellerman}, A.C. and {Stansby}, D. and {Ma}, J.},
+ title = {Snakes on a Spaceship—An Overview of Python in Heliophysics},
+ journal = {Journal of Geophysical Research: Space Physics},
+ volume = {123},
+ year = "2018",
+ pages = {384},
+ doi = {10.1029/2018JA025877},
+}
+
+
+
+
+
+@article{stodden18,
+ author = {{Stodden}, V. and {Seiler}, J. and {Ma}, Z.},
+ title = {An empirical analysis of journal policy effectiveness for computational reproducibility},
+ volume = {115},
+ number = {11},
+ pages = {2584},
+ year = {2018},
+ doi = {10.1073/pnas.1708290115},
+ issn = {0027-8424},
+ URL = {https://www.pnas.org/content/115/11/2584},
+ journal = {Proceedings of the National Academy of Sciences}
+}
+
+
+
+
+
+@article {fanelli18,
+ author = {{Fanelli}, D.},
+ title = {Opinion: Is science really facing a reproducibility crisis, and do we need it to?},
+ volume = {115},
+ number = {11},
+ pages = {2628},
+ year = {2018},
+ doi = {10.1073/pnas.1708272114},
+ publisher = {National Academy of Sciences},
+ issn = {0027-8424},
+ URL = {https://www.pnas.org/content/115/11/2628},
+ journal = {Proceedings of the National Academy of Sciences}
+}
+
+
+
+
+
+
+@ARTICLE{lewis18,
+ author = {{Lewis}, L.M. and {Edwards}, M.C. and {Meyers}, Z.R. and {Conover Talbot}, C. and {Hao}, H. and {Blum}, D. },
+ title = "{Replication Study: Transcriptional amplification in tumor cells with elevated c-Myc}",
+ journal = {eLife},
+ volume = {7},
+ year = "2018",
+ month = "January",
+ pages = {e30274},
+ doi = {10.7554/eLife.30274},
+}
+
+
+
+
+
+@ARTICLE{akhlaghi18b,
+ author = {{Akhlaghi}, Mohammad and {Bacon}, Roland and {Inami}, Hanae},
+ title = "{MUSE HUDF survey I \& II, Sections 7.3 \& 3.4: photometry for objects with no prior broad-band segmentation map}",
+ journal = {Zenodo},
+ pages = {DOI:10.5281/zenodo.1164774},
+ year = "2018",
+ month = "February",
+ doi = {10.5281/zenodo.1164774},
+}
+
+
+
+
+
+@ARTICLE{akhlaghi18a,
+ author = {{Akhlaghi}, Mohammad and {Bacon}, Roland},
+ title = "{MUSE HUDF survey I, Section 4: data and reproduction pipeline for photometry and astrometry}",
+ journal = {Zenodo},
+ pages = {DOI:10.5281/zenodo.1163746},
+ year = "2018",
+ month = "January",
+ doi = {10.5281/zenodo.1163746},
+}
+
+
+
+
+
+@ARTICLE{leek17,
+ author = {Jeffrey T. Leek and Leah R. Jager},
+ title = {Is Most Published Research Really False?},
+ journal = {Annual Review of Statistics and Its Application},
+ volume = {4},
+ year = {2017},
+ pages = {109},
+ doi = {10.1146/annurev-statistics-060116-054104},
+}
+
+
+
+
+
+@ARTICLE{reich17,
+ author = {Michael Reich and Thorin Tabor and Ted Liefeld and Helga Thorvaldsdóttir and Barbara Hill and Pablo Tamayo and Jill P. Mesirov},
+ title = {The GenePattern Notebook Environment},
+ journal = {Cell Systems},
+ year = {2017},
+ volume = {5},
+ pages = {149},
+ doi = {10.1016/j.cels.2017.07.003},
+}
+
+
+
+
+
+@ARTICLE{becker17,
+ author = {Gabriel Becker and Cory Barr and Robert Gentleman and Michael Lawrence},
+ title = {Enhancing Reproducibility and Collaboration via Management of R Package Cohorts},
+ journal = {Journal of Statistical Software, Articles},
+ volume = {82},
+ pages = 1,
+ year = "2017",
+archivePrefix = {arXiv},
+ eprint = {1501.02284},
+ doi = {10.18637/jss.v082.i01},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2015arXiv150102284B},
+}
+
+
+
+
+
+@ARTICLE{jenness17,
+ author = {{Jenness}, Tim},
+ title = "{Modern Python at the Large Synoptic Survey Telescope}",
+ journal = {ADASS 27},
+ year = "2017",
+ month = "Dec",
+ eid = {arXiv:1712.00461},
+ pages = {arXiv:1712.00461},
+archivePrefix = {arXiv},
+ eprint = {1712.00461},
+ primaryClass = {astro-ph.IM},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2017arXiv171200461J},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@article{bezanson17,
+ title={Julia: A fresh approach to numerical computing},
+ author={Bezanson, Jeff and Edelman, Alan and Karpinski, Stefan and Shah, Viral B},
+ journal={SIAM {R}eview},
+ volume={59},
+ number={1},
+ pages={65},
+ year={2017},
+ archivePrefix={arXiv},
+ eprint={1411.1607},
+ publisher={SIAM},
+ doi={10.1137/141000671},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2014arXiv1411.1607B},
+}
+
+
+
+
+
+@ARTICLE{melson17,
+ author = {{Melsen}, L.A. and {Torfs}, P.J.J.F and {Uijlenhoet}, R. and {Teuling}, A.J.},
+ title = {Comment on “Most computational hydrology is not reproducible, so is it really science?” by Christopher Hutton et al.},
+ journal = {Water Resources Research},
+ volume = 53,
+ pages = {2568},
+ year = {2017},
+ doi = {10.1002/2016WR020208},
+}
+
+
+
+
+
+@ARTICLE{munafo17,
+  author = {{Munaf\'o}, M.R. and {Nosek}, B.A. and {Bishop}, D.V.M. and {Button}, K.S. and {Chambers}, C.D. and {Percie du Sert}, N. and {Simonsohn}, U. and {Wagenmakers}, E.J. and {Ware}, J.J. and {Ioannidis}, J.P.A.},
+ title = {A manifesto for reproducible science},
+ journal = {Nature Human Behaviour},
+ volume = 1,
+ pages = {21},
+ year = {2017},
+ doi = {10.1038/s41562-016-0021},
+}
+
+
+
+
+
+@ARTICLE{jimenez17,
+ title={The popper convention: Making reproducible systems evaluation practical},
+ author = {{Jimenez}, I. and {Sevilla}, M. and {Watkins}, N. and {Maltzahn}, C. and {Lofstead}, J. and {Mohror}, K. and {Arpaci-Dusseau}, A. and {Arpaci-Dusseau}, R.},
+ journal = {IEEE IPDPSW},
+ pages = {1561},
+ year = {2017},
+ doi = {10.1109/IPDPSW.2017.157},
+}
+
+
+
+
@ARTICLE{bacon17,
- author = {{Bacon}, R. and {Conseil}, S. and {Mary}, D. and {Brinchmann}, J. and
- {Shepherd}, M. and {Akhlaghi}, M. and {Weilbacher}, P.~M. and
- {Piqueras}, L. and {Wisotzki}, L. and {Lagattuta}, D. and {Epinat}, B. and
- {Guerou}, A. and {Inami}, H. and {Cantalupo}, S. and {Courbot}, J.~B. and
- {Contini}, T. and {Richard}, J. and {Maseda}, M. and {Bouwens}, R. and
- {Bouch{\'e}}, N. and {Kollatschny}, W. and {Schaye}, J. and
- {Marino}, R.~A. and {Pello}, R. and {Herenz}, C. and {Guiderdoni}, B. and
- {Carollo}, M.},
- title = "{The MUSE Hubble Ultra Deep Field Survey. I. Survey description, data reduction, and source detection}",
- journal = {A\&A},
-archivePrefix = "arXiv",
- eprint = {1710.03002},
- keywords = {galaxies: distances and redshifts, galaxies: high-redshift, cosmology: observations, methods: data analysis, techniques: imaging spectroscopy, galaxies: formation},
- year = 2017,
- month = nov,
- volume = 608,
- eid = {A1},
- pages = {A1},
- doi = {10.1051/0004-6361/201730833},
- adsurl = {http://adsabs.harvard.edu/abs/2017A\%26A...608A...1B},
- adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+ author = {{Bacon}, Roland and {Conseil}, Simon and {Mary}, David and
+ {Brinchmann}, Jarle and {Shepherd}, Martin and {Akhlaghi}, Mohammad and
+ {Weilbacher}, Peter M. and {Piqueras}, Laure and {Wisotzki}, Lutz and
+ {Lagattuta}, David and {Epinat}, Benoit and {Guerou}, Adrien and
+ {Inami}, Hanae and {Cantalupo}, Sebastiano and
+ {Courbot}, Jean Baptiste and {Contini}, Thierry and {Richard}, Johan and
+ {Maseda}, Michael and {Bouwens}, Rychard and {Bouch{\'e}}, Nicolas and
+ {Kollatschny}, Wolfram and {Schaye}, Joop and {Marino}, Raffaella Anna and
+ {Pello}, Roser and {Herenz}, Christian and {Guiderdoni}, Bruno and
+ {Carollo}, Marcella},
+ title = "{The MUSE Hubble Ultra Deep Field Survey. I. Survey description, data reduction, and source detection}",
+ journal = {Astronomy \& Astrophysics},
+ keywords = {galaxies: distances and redshifts, galaxies: high-redshift, cosmology: observations, methods: data analysis, techniques: imaging spectroscopy, galaxies: formation, Astrophysics - Astrophysics of Galaxies},
+ year = "2017",
+ month = "Nov",
+ volume = {608},
+ eid = {A1},
+ pages = {A1},
+ doi = {10.1051/0004-6361/201730833},
+archivePrefix = {arXiv},
+ eprint = {1710.03002},
+ primaryClass = {astro-ph.GA},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2017A\&A...608A...1B},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{smith16,
+ author = {Arfon M. Smith and Daniel S. Katz and Kyle E. Niemeyer},
+ title = {Software citation principles},
+ journal = {PeerJ Computer Science},
+ volume = {2},
+ year = {2016},
+ pages = {e86},
+ doi = {10.7717/peerj-cs.86},
+}
+
+
+
+
+
+@ARTICLE{ziemann16,
+ author = {Mark Ziemann and Yotam Eren and Assam El-Osta},
+ title = {Gene name errors are widespread in the scientific literature},
+ journal = {Genome Biology},
+ volume = {17},
+ year = {2016},
+ pages = {177},
+ doi = {10.1186/s13059-016-1044-7},
+}
+
+
+
+
+
+@ARTICLE{hinsen16,
+ author = {Konrad Hinsen},
+ title = {Scientific notations for the digital era},
+ journal = {The Self Journal of Science},
+ year = {2016},
+ pages = {1: arXiv:\href{https://arxiv.org/abs/1605.02960}{1605.02960}},
+}
+
+
+
+
+
+@ARTICLE{kluyver16,
+ author = {Thomas Kluyver and Benjamin Ragan-Kelley and Fernando Pérez and Brian Granger and Matthias Bussonnier and Jonathan Frederic and Kyle Kelley and Jessica Hamrick and Jason Grout and Sylvain Corlay and Paul Ivanov and Damián Avila and Safia Abdalla and Carol Willing},
+ title = "{Jupyter Notebooks – a publishing format for reproducible computational workflows}",
+ journal = {Positioning and Power in Academic Publishing: Players, Agents and Agendas},
+ year = {2016},
+ pages = {87},
+ doi = {10.3233/978-1-61499-649-1-87},
+}
+
+
+
+
+
+@ARTICLE{baker16,
+ author = {{Baker}, M.},
+ title = "{Is there a reproducibility crisis?}",
+ journal = {Nature},
+ volume = {533},
+ year = "2016",
+ month = "May",
+ pages = {452},
+ doi = {10.1038/533452a},
+}
+
+
+
+
+
+@ARTICLE{wilkinson16,
+ author = { {Wilkinson}, M.D and {Dumontier}, M. and {Aalbersberg}, I.J. and {Appleton}, G. and {Axton}, M. and {Baak}, A. and {Blomberg}, N. and {Boiten}, J. and {da Silva Santos}, L.B and {Bourne}, P.E. and {Bouwman}, J. and {Brookes}, A.J. and {Clark}, T. and {Crosas}, M. and {Dillo}, I. and {Dumon}, O. and {Edmunds}, S. and {Evelo}, C. and {Finkers}, R. and {Gonzalez-Beltran}, A. and {Gray}, A.J.G. and {Groth}, P. and {Goble}, C. and {Grethe}, Jeffrey S. and {Heringa}, J. and {’t Hoen}, P.A.C and {Hooft}, R. and {Kuhn}, T. and {Kok}, R. and {Kok}, J. and {Lusher}, S. and {Martone}, M. and {Mons}, A. and {Packer}, A. and {Persson}, B. and {Rocca-Serra}, P. and {Roos}, M. and {van Schaik}, R. and {Sansone}, S. and {Schultes}, E. and {Sengstag}, T. and {Slater}, T. and {Strawn}, G. and {Swertz}, M. and {Thompson}, M. and {van der Lei}, J. and {van Mulligen}, E. and {Velterop}, J. and {Waagmeester}, A. and {Wittenburg}, P. and {Wolstencroft}, K. and {Zhao}, J. and {Mons}, B.},
+ title = "{The FAIR Guiding Principles for scientific data management and stewardship}",
+ journal = {Scientific Data},
+ year = 2016,
+ month = mar,
+ volume = 3,
+ pages = {160018},
+ doi = {10.1038/sdata.2016.18},
+}
+
+
+
+
+@ARTICLE{hutton16,
+ author = {{Hutton}, C. and {Wagener}, T. and {Freer}, J. and {Han}, D. and {Duffy}, C. and {Arheimer}, B.},
+ title = {Most computational hydrology is not reproducible, so is it really science?},
+ journal = {Water Resources Research},
+ year = {2016},
+ volume = 52,
+ pages = {7548},
+ doi = {10.1002/2016WR019285},
+}
+
+
+
+
+
+@ARTICLE{topalidou16,
+ author = {{Topalidou}, M. and {Leblois}, A. and {Boraud}, T. and {Rougier}, N.P.},
+ title = {A long journey into reproducible computational neuroscience},
+ journal = {Frontiers in Computational Neuroscience},
+ year = {2016},
+ volume = 9,
+ pages = {30},
+ doi = {10.3389/fncom.2015.00030},
+}
+
+
+
+
+
+@ARTICLE{gil16,
+ author = {{Gil}, Yolanda and {David}, C.H. and {Demir}, I. and {Essawy}, B.T. and {Fulweiler}, R.W. and {Goodall}, J.L. and {Karlstrom}, L. and {Lee}, H. and {Mills}, H.J. and {Oh}, J. and {Pierce}, S.A. and {Pope}, A. and {Tzeng}, M.W. and {Villamizar}, S.R. and {Yu}, X},
+ title = {Toward the Geoscience Paper of the Future: Best practices for documenting and sharing research from data to software to provenance},
+ journal = {Earth and Space Science},
+ year = 2016,
+ volume = 3,
+ pages = {388},
+ doi = {10.1002/2015EA000136},
+}
+
+
+
+
+
+@ARTICLE{horvath15,
+ author = {Steve Horvath},
+ title = {Erratum to: DNA methylation age of human tissues and cell types},
+ journal = {Genome Biology},
+ volume = {16},
+ pages = {96},
+ year = {2015},
+ doi = {10.1186/s13059-015-0649-6},
+}
+
+
+
+
+
+@ARTICLE{chang15,
+ author = {Andrew C. Chang and Phillip Li},
+ title = {Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ``Usually Not''},
+ journal = {Finance and Economics Discussion Series 2015-083},
+ year = {2015},
+ pages = {1},
+ doi = {10.17016/FEDS.2015.083},
+}
+
+
+
+
+
+@ARTICLE{schaffer15,
+ author = {Jonathan Schaffer},
+ title = {What Not to Multiply Without Necessity},
+ journal = {Australasian Journal of Philosophy},
+ volume = {93},
+ pages = {644},
+ year = {2015},
+ doi = {10.1080/00048402.2014.992447},
+}
+
+
+
+
+
+@ARTICLE{clarkso15,
+ author = "Chris Clarkson and Mike Smith and Ben Marwick and Richard Fullagar and Lynley A. Wallis and Patrick Faulkner and Tiina Manne and Elspeth Hayes and Richard G. Roberts and Zenobia Jacobs and Xavier Carah and Kelsey M. Lowe and Jacqueline Matthews and S. Anna Florin",
+ title = {The archaeology, chronology and stratigraphy of Madjedbebe (Malakunanja II): A site in northern Australia with early occupation},
+ journal = {Journal of Human Evolution},
+ year = 2015,
+ volume = 83,
+ pages = 46,
+ doi = {10.1016/j.jhevol.2015.03.014},
+}
+
+
+
+
+
+@ARTICLE{meng15,
+ author = {Haiyan Meng and Rupa Kommineni and Quan Pham and Robert Gardner and Tanu Malik and Douglas Thain},
+ title = {An invariant framework for conducting reproducible computational science},
+ journal = {Journal of Computational Science},
+ year = 2015,
+ volume = 9,
+ pages = 137,
+ doi = {10.1016/j.jocs.2015.04.012},
+}
+
+
+
+
+
+@ARTICLE{gamblin15,
+ author = {Gamblin, Todd and LeGendre, Matthew and Collette, Michael R. and Lee, Gregory L. and Moody, Adam and {de Supinski}, Bronis R. and Futral, Scott},
+ title = {The Spack package manager: bringing order to HPC software chaos},
+ journal = {IEEE SC15},
+ year = 2015,
+ volume = 1,
+ pages = {1},
+ doi = {10.1145/2807591.2807623},
+}
+
+
+
+
+@ARTICLE{akhlaghi15,
+ author = {{Akhlaghi}, M. and {Ichikawa}, T.},
+ title = "{Noise-based Detection and Segmentation of Nebulous Objects}",
+ journal = {The Astrophysical Journal Supplement Series},
+ archivePrefix = "arXiv",
+ eprint = {1505.01664},
+ primaryClass = "astro-ph.IM",
+ keywords = {galaxies: irregular, galaxies: photometry, galaxies: structure, methods: data analysis, techniques: image processing, techniques: photometric},
+ year = 2015,
+ month = sep,
+ volume = 220,
+ eid = {1},
+ pages = {1},
+ doi = {10.1088/0067-0049/220/1/1},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2015ApJS..220....1A},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{courtes15,
+ author = {{Court{\`e}s}, Ludovic and {Wurmus}, Ricardo},
+ title = "{Reproducible and User-Controlled Software Environments in HPC with Guix}",
+ journal = {Euro-Par},
+ volume = {9523},
+ keywords = {Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Operating Systems, Computer Science - Software Engineering},
+ year = "2015",
+ month = "Jun",
+ eid = {arXiv:1506.02822},
+ pages = {arXiv:1506.02822},
+archivePrefix = {arXiv},
+ eprint = {1506.02822},
+ primaryClass = {cs.DC},
+ doi = {10.1007/978-3-319-27308-2_47},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2015arXiv150602822C},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{hinsen15,
+ author = {{Hinsen}, K.},
+ title = {ActivePapers: a platform for publishing and archiving computer-aided research},
+ journal = {F1000Research},
+ year = 2015,
+ volume = 3,
+ pages = {289},
+ doi = {10.12688/f1000research.5773.3},
+}
+
+
+
+
+
+@ARTICLE{belhajjame15,
+ author = {{Belhajjame}, K. and {Zhao}, Z. and {Garijo}, D. and {Gamble}, M. and {Hettne}, K. and {Palma}, R. and {Mina}, E. and {Corcho}, O. and {Gómez-Pérez}, J.M. and {Bechhofer}, S. and {Klyne}, G. and {Goble}, C},
+ title = "{Using a suite of ontologies for preserving workflow-centric research objects}",
+ journal = {Journal of Web Semantics},
+ year = 2015,
+ volume = 32,
+ pages = {16},
+ doi = {10.1016/j.websem.2015.01.003},
+}
+
+
+
+
+
+@ARTICLE{bechhofer13,
+ author = {{Bechhofer}, S. and {Buchan}, I. and {De Roure}, D. and {Missier}, P. and {Ainsworth}, J. and {Bhagat}, J. and Couch, P. and Cruickshank, D. and {Delderfield}, M and Dunlop, I. and {Gamble}, M. and {Michaelides}, D. and {Owen}, S. and {Newman}, D. and {Sufi}, S. and {Goble}, C},
+ title = "{Why linked data is not enough for scientists}",
+ journal = {Future Generation Computer Systems},
+ year = 2013,
+ volume = 29,
+ pages = {599},
+ doi = {10.1016/j.future.2011.08.004},
+}
+
+
+
+
+
+@ARTICLE{peng15,
+ author = {{Peng}, R.D.},
+ title = {The reproducibility crisis in science: A statistical counterattack},
+ journal = {Significance},
+ year = 2015,
+ month = jun,
+ volume = 12,
+ pages = {30},
+ doi = {10.1111/j.1740-9713.2015.00827.x},
+}
+
+
+
+
+
+@ARTICLE{herndon14,
+ author = {Thomas Herndon and Michael Ash and Robert Pollin},
+ title = {Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff},
+ journal = {Cambridge Journal of Economics},
+ year = {2014},
+ month = {dec},
+ volume = {38},
+ pages = {257},
+ doi = {10.1093/cje/bet075},
+}
+
+
+
+
+
+@ARTICLE{easterbrook14,
+  author = {{Easterbrook}, S.},
+ title = {Open code for open science?},
+ journal = {Nature Geoscience},
+ year = 2014,
+ month = oct,
+ volume = 7,
+ pages = {779},
+ doi = {10.1038/ngeo2283},
+}
+
+
+
+
+
+@ARTICLE{fomel13,
+ author = {Sergey Fomel and Paul Sava and Ioan Vlad and Yang Liu and Vladimir Bashkardin},
+ title = {Madagascar: open-source software project for multidimensional data analysis and reproducible computational experiments},
+  journal = {Journal of Open Research Software},
+ year = {2013},
+ volume = {1},
+ pages = {e8},
+ doi = {10.5334/jors.ag},
+}
+
+
+
+
+
+@ARTICLE{sandve13,
+ author = {{Sandve}, G.K. and {Nekrutenko}, A. and {Taylor}, J. and {Hovig}, E.},
+ title = {Ten Simple Rules for Reproducible Computational Research},
+ journal = {PLoS Computational Biology},
+ year = 2013,
+ month = oct,
+ volume = 9,
+ pages = {e1003285},
+ doi = {10.1371/journal.pcbi.1003285},
+}
+
+
+
+
+
+@ARTICLE{malik13,
+ author = {Tanu Malik and Quan Pham and Ian Foster},
+ title = {SOLE: Towards Descriptive and Interactive Publications},
+ journal = {Implementing Reproducible Research},
+ year = 2013,
+ volume = {Chapter 2},
+ pages = {1. URL: \url{https://osf.io/ns2m3}},
+}
+
+
+
+
+
+@ARTICLE{gronenschild12,
+ author = {Ed H. B. M. Gronenschild and Petra Habets and Heidi I. L. Jacobs and Ron Mengelers and Nico Rozendaal and Jim van Os and Machteld Marcelis},
+ title = {The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements},
+ journal = {PLoS ONE},
+ volume = {7},
+ year = {2012},
+ pages = {e38234},
+ doi = {10.1371/journal.pone.0038234},
+}
+
+
+
+
+
+@ARTICLE{pham12,
+ author = {Quan Pham and Tanu Malik and Ian Foster and Roberto {Di Lauro} and Raffaele Montella},
+ title = {SOLE: Linking Research Papers with Science Objects},
+ journal = {Provenance and Annotation of Data and Processes (IPAW)},
+ year = {2012},
+ pages = {203},
+ doi = {10.1007/978-3-642-34222-6_16},
+}
+
+
+
+
+
+@ARTICLE{davison12,
+ author = {Andrew Davison},
+ title = {Automated Capture of Experiment Context for Easier Reproducibility in Computational Research},
+ journal = {Computing in Science \& Engineering},
+ volume = {14},
+ year = {2012},
+ pages = {48},
+ doi = {10.1109/MCSE.2012.41},
+}
+
+
+
+
+
+@ARTICLE{zhao12,
+ author = {Jun Zhao and Jose Manuel Gomez-Perez and Khalid Belhajjame and Graham Klyne and Esteban Garcia-Cuesta and Aleix Garrido and Kristina Hettne and Marco Roos and David {De Roure} and Carole Goble},
+ title = {Why workflows break — Understanding and combating decay in Taverna workflows},
+ journal = {IEEE 8th International Conference on E-Science},
+ year = {2012},
+ pages = {1},
+ doi = {10.1109/eScience.2012.6404482},
+}
+
+
+
+
+@ARTICLE{vangorp11,
+ author = {Pieter {Van Gorp} and Steffen Mazanek},
+ title = {SHARE: a web portal for creating and sharing executable research},
+ journal = {Procedia Computer Science},
+ year = 2011,
+ volume = 4,
+ pages = {589},
+ doi = {10.1016/j.procs.2011.04.062},
+}
+
+
+
+
+
+@ARTICLE{hinsen11,
+ author = {{Hinsen}, Konrad},
+ title = {A data and code model for reproducible research and executable papers},
+ journal = {Procedia Computer Science},
+ year = 2011,
+ volume = 4,
+ pages = {579},
+ doi = {10.1016/j.procs.2011.04.061},
+}
+
+
+
+
+
+@ARTICLE{limare11,
+ author = {Nicolas Limare and Jean-Michel Morel},
+  title = {The IPOL Initiative: Publishing and Testing Algorithms on Line for Reproducible Research in Image Processing},
+ journal = {Procedia Computer Science},
+ year = 2011,
+ volume = 4,
+ pages = {716},
+ doi = {10.1016/j.procs.2011.04.075},
+}
+
+
+
+
+
+@ARTICLE{gavish11,
+ author = {Matan Gavish and David L. Donoho},
+ title = {A Universal Identifier for Computational Results},
+ journal = {Procedia Computer Science},
+ year = 2011,
+ volume = 4,
+ pages = {637},
+ doi = {10.1016/j.procs.2011.04.067},
+}
+
+
+
+
+@ARTICLE{gabriel11,
+ author = {Ann Gabriel and Rebecca Capone},
+ title = {Executable Paper Grand Challenge Workshop},
+ journal = {Procedia Computer Science},
+ volume = {4},
+ year = {2011},
+ pages = {577},
+ doi = {10.1016/j.procs.2011.04.060},
+}
+
+
+
+
+
+@ARTICLE{nowakowski11,
+ author = {Piotr Nowakowski and Eryk Ciepiela and Daniel Har\k{e}\.{z}lak and Joanna Kocot and Marek Kasztelnik and Tomasz Barty\'nski and Jan Meizner and Grzegorz Dyk and Maciej Malawski},
+ title = {The Collage Authoring Environment},
+ journal = {Procedia Computer Science},
+ volume = {4},
+ year = {2011},
+ pages = {608},
+  doi = {10.1016/j.procs.2011.04.064},
+}
+
+
+
+
+
+@ARTICLE{peng11,
+ author = {{Peng}, R.D.},
+ title = {Reproducible Research in Computational Science},
+ journal = {Science},
+ year = 2011,
+ month = dec,
+ volume = 334,
+ pages = {1226},
+ doi = {10.1126/science.1213847},
+}
+
+
+
+
+
+@ARTICLE{pence10,
+ author = {{Pence}, W.~D. and {Chiappetti}, L. and {Page}, C.~G. and {Shaw}, R.~A. and
+ {Stobie}, E.},
+ title = "{Definition of the Flexible Image Transport System (FITS), version 3.0}",
+ journal = {Astronomy and Astrophysics},
+ keywords = {instrumentation: miscellaneous, methods: miscellaneous, techniques: miscellaneous, reference systems, standards, astronomical databases: miscellaneous},
+ year = "2010",
+ month = "Dec",
+ volume = {524},
+ eid = {A42},
+ pages = {A42},
+ doi = {10.1051/0004-6361/201015362},
+ adsurl = {https://ui.adsabs.harvard.edu/abs/2010A\&A...524A..42P},
+ adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+
+
+
+@ARTICLE{goecks10,
+ author = {Jeremy Goecks and Anton Nekrutenko and James Taylor},
+ title = {Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences},
+ journal = {Genome Biology},
+ year = 2010,
+ volume = 11,
+ pages = {R86},
+ doi = {10.1186/gb-2010-11-8-r86},
+}
+
+
+
+
+
+@ARTICLE{merali10,
+ author = {Zeeya Merali},
+ title = {Computational science: ...Error},
+ journal = {Nature},
+ year = 2010,
+ volume = 467,
+ pages = {775},
+ doi = {10.1038/467775a},
+}
+
+
+
+
+
+@ARTICLE{casadevall10,
+ author = {{Casadevall}, A. and {Fang}, F.C},
+ title = {Reproducible Science},
+ journal = {Infection and Immunity},
+ year = 2010,
+ volume = 78,
+ pages = {4972},
+ doi = {10.1128/IAI.00908-10},
+}
+
+
+
+
+
+@ARTICLE{mesirov10,
+ author = {{Mesirov}, J.P.},
+ title = {Accessible Reproducible Research},
+ journal = {Science},
+ year = 2010,
+ volume = 327,
+ pages = {415},
+ doi = {10.1126/science.1179653},
+}
+
+
+
+
+
+@ARTICLE{ioannidis2009,
+ author = {John P. A. Ioannidis and David B. Allison and Catherine A. Ball and Issa Coulibaly and Xiangqin Cui and Aedín C Culhane and Mario Falchi and Cesare Furlanello and Laurence Game and Giuseppe Jurman and Jon Mangion and Tapan Mehta and Michael Nitzberg and Grier P. Page and Enrico Petretto and Vera {van Noort}},
+ title = {Repeatability of published microarray gene expression analyses},
+ journal = {Nature Genetics},
+ year = {2009},
+ volume = {41},
+ pages = {149},
+ doi = {10.1038/ng.295},
+}
+
+
+
+
+
+@ARTICLE{fomel09,
+ author = {Sergey Fomel and Jon F. Claerbout},
+ title = {Reproducible Research},
+  journal = {Computing in Science \& Engineering},
+ year = {2009},
+ volume = {11},
+ pages = {5},
+ doi = {10.1109/MCSE.2009.14},
+}
+
+
+
+
+
+@ARTICLE{baggerly09,
+ author = {Keith A. Baggerly and Kevin R Coombes},
+ title = {Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology},
+ journal = {The Annals of Applied Statistics},
+ year = {2009},
+ volume = {3},
+ pages = {1309},
+ doi = {10.1214/09-AOAS291},
+}
+
+
+
+
+
+@ARTICLE{scheidegger08,
+ author = {Carlos Scheidegger and David Koop and Emanuele Santos and Huy Vo and Steven Callahan and Juliana Freire and Cláudio Silva},
+ title = {Tackling the Provenance Challenge one layer at a time},
+ journal = {Concurrency Computation: Practice and Experiment},
+ year = {2008},
+ volume = {20},
+ pages = {473},
+ doi = {10.1002/cpe.1237},
+}
+
+
+
+
+
+@ARTICLE{moreau08,
+ author = {Moreau, Luc and Ludäscher, Bertram and Altintas, Ilkay and Barga, Roger S. and Bowers, Shawn and Callahan, Steven and Chin JR., George and Clifford, Ben and Cohen, Shirley and Cohen-Boulakia, Sarah and Davidson, Susan and Deelman, Ewa and Digiampietri, Luciano and Foster, Ian and Freire, Juliana and Frew, James and Futrelle, Joe and Gibson, Tara and Gil, Yolanda and Goble, Carole and Golbeck, Jennifer and Groth, Paul and Holland, David A. and Jiang, Sheng and Kim, Jihie and Koop, David and Krenek, Ales and McPhillips, Timothy and Mehta, Gaurang and Miles, Simon and Metzger, Dominic and Munroe, Steve and Myers, Jim and Plale, Beth and Podhorszki, Norbert and Ratnakar, Varun and Santos, Emanuele and Scheidegger, Carlos and Schuchardt, Karen and Seltzer, Margo and Simmhan, Yogesh L. and Silva, Claudio and Slaughter, Peter and Stephan, Eric and Stevens, Robert and Turi, Daniele and Vo, Huy and Wilde, Mike and Zhao, Jun and Zhao, Yong},
+ title = {The First Provenance Challenge},
+ journal = {Concurrency Computation: Practice and Experiment},
+ year = {2008},
+ volume = {20},
+ pages = {473},
+ doi = {10.1002/cpe.1237},
+}
+
+
+
+
+
+@ARTICLE{witten2007,
+ author = {Ben Witten and Bill Curry and Jeff Shragge},
+ title = {A New Build Environment for SEP},
+ journal = {Stanford Exploration Project},
+ year = {2007},
+ volume = {129},
+ pages = {247: \url{http://sepwww.stanford.edu/data/media/public/docs/sep129/ben1.pdf}},
+}
+
+
+
+
+
+@ARTICLE{miller06,
+ author = {Greg Miller},
+ title = {A Scientist's Nightmare: Software Problem Leads to Five Retractions},
+ journal = {Science},
+ year = {2006},
+ volume = {314},
+ pages = {1856},
+ doi = {10.1126/science.314.5807.1856},
+}
+
+
+
+
+
+@ARTICLE{reich06,
+ author = {Michael Reich and Ted Liefeld and Joshua Gould and Jim Lerner and Pablo Tamayo and Jill P Mesirov},
+ title = {GenePattern 2.0},
+ journal = {Nature Genetics},
+ year = {2006},
+ volume = {38},
+ pages = {500},
+ doi = {10.1038/ng0506-500},
+}
+
+
+
+
+
+@ARTICLE{ludascher05,
+ author = {Ludäs\-cher, Bertram and Altintas, Ilkay and Berkley, Chad and Higgins, Dan and Jaeger, Efrat and Jones, Matthew and Lee, Edward A. and Tao, Jing and Zhao, Yang},
+ title = {Scientific workflow management and the Kepler system},
+ journal = {Concurrency Computation: Practice and Experiment},
+ year = {2006},
+ volume = {18},
+ pages = {1039},
+ doi = {10.1002/cpe.994},
+}
+
+
+
+
+
+@ARTICLE{ioannidis05,
+ author = {John P. A. Ioannidis},
+ title = {Why Most Published Research Findings Are False},
+ journal = {PLoS Medicine },
+ year = {2005},
+ volume = {2},
+ pages = {e124},
+ doi = {10.1371/journal.pmed.0020124},
+}
+
+
+
+
+
+@ARTICLE{bavoil05,
+ author = {Louis Bavoil and Steven P. Callahan and Patricia J. Crossno and Juliana Freire and Carlos E. Scheidegger and Cláudio T. Silva and Huy T. Vo},
+ title = {VisTrails: Enabling Interactive Multiple-View Visualizations},
+ journal = {VIS 05. IEEE Visualization},
+ year = {2005},
+ volume = {},
+ pages = {135},
+ doi = {10.1109/VISUAL.2005.1532788},
+}
+
+
+
+
+
+@ARTICLE{dolstra04,
+ author = {{Dolstra}, Eelco and {de Jonge}, Merijn and {Visser}, Eelco},
+ title = {Nix: A Safe and Policy-Free System for Software Deployment},
+ journal = {Large Installation System Administration Conference},
+ year = {2004},
+ volume = {18},
+ pages = {79. \url{https://www.usenix.org/legacy/events/lisa04/tech/full_papers/dolstra/dolstra.pdf}},
+}
+
+
+
+
+
+@ARTICLE{oinn04,
+ author = {Oinn, Tom and Addis, Matthew and Ferris, Justin and Marvin, Darren and Senger, Martin and Greenwood, Mark and Carver, Tim and Glover, Kevin and Pocock, Matthew R. and Wipat, Anil and Li, Peter},
+ title = {Taverna: a tool for the composition and enactment of bioinformatics workflows},
+ journal = {Bioinformatics},
+ year = {2004},
+ volume = {20},
+ pages = {3045},
+ doi = {10.1093/bioinformatics/bth361},
+}
+
+
+
+
+
+@ARTICLE{schwab2000,
+ author = {Matthias Schwab and Martin Karrenbach and Jon F. Claerbout},
+ title = {Making scientific computations reproducible},
+ journal = {Computing in Science \& Engineering},
+ year = {2000},
+ volume = {2},
+ pages = {61},
+ doi = {10.1109/5992.881708},
+}
+
+
+
+
+
+@ARTICLE{buckheit1995,
+ author = {Jonathan B. Buckheit and David L. Donoho},
+ title = {WaveLab and Reproducible Research},
+ journal = {Wavelets and Statistics},
+ year = {1995},
+ volume = {1},
+ pages = {55},
+ doi = {10.1007/978-1-4612-2544-7\_5},
+}
+
+
+
+
+
+@ARTICLE{claerbout1992,
+ author = {Jon F. Claerbout and Martin Karrenbach},
+ title = {Electronic documents give reproducible research a new meaning},
+ journal = {SEG Technical Program Expanded Abstracts},
+ year = {1992},
+ volume = {1},
+ pages = {601},
+ doi = {10.1190/1.1822162},
+}
+
+
+
+
+
+@ARTICLE{eker03,
+ author = {Johan Eker and Jorn W Janneck and Edward A. Lee and Jie Liu and Xiaojun Liu and Jozsef Ludvig and Sonia Sachs and Yuhong Xiong and Stephen Neuendorffer},
+ title = {Taming heterogeneity - the Ptolemy approach},
+ journal = {Proceedings of the IEEE},
+ year = {2003},
+ volume = {91},
+ pages = {127},
+ doi = {10.1109/JPROC.2002.805829},
+}
+
+
+
+
+
+@ARTICLE{stevens03,
+ author = {Robert Stevens and Kevin Glover and Chris Greenhalgh and Claire Jennings and Simon Pearce and Peter Li and Melena Radenkovic and Anil Wipat},
+ title = {Performing in silico Experiments on the Grid: A Users Perspective},
+ journal = {Proceedings of UK e-Science All Hands Meeting},
+ year = {2003},
+ pages = {43},
+}
+
+
+
+
+
+@ARTICLE{knuth84,
+ author = {Donald Knuth},
+ title = {Literate Programming},
+ journal = {The Computer Journal},
+ year = {1984},
+ volume = {27},
+ pages = {97},
+ doi = {10.1093/comjnl/27.2.97},
+}
+
+
+
+
+
+@ARTICLE{stallman88,
+ author = {Richard M. Stallman and Roland McGrath and Paul D. Smith},
+ title = {GNU Make: a program for directing recompilation},
+ journal = {Free Software Foundation},
+ year = {1988},
+ pages = {ISBN:1-882114-83-3. \url{https://www.gnu.org/s/make/manual/make.pdf}},
+}
+
+
+
+
+
+@ARTICLE{somogyi87,
+ author = {Zoltan Somogyi},
+ title = {Cake: a fifth generation version of make},
+ journal = {University of Melbourne},
+ year = {1987},
+ pages = {1: \url{https://pdfs.semanticscholar.org/3e97/3b5c9af7763d70cdfaabdd1b96b3b75b5483.pdf}},
+}
+
+
+
+
+
+@ARTICLE{feldman79,
+ author = {Stuart I. Feldman},
+ title = {Make -- a program for maintaining computer programs},
+ journal = {Journal of Software: Practice and Experience},
+ volume = {9},
+ pages = {255},
+ year = {1979},
+ doi = {10.1002/spe.4380090402},
+}
+
+
+
+
+
+@ARTICLE{anscombe73,
+ author = {{Anscombe}, F.J.},
+ title = {Graphs in Statistical Analysis},
+ journal = {The American Statistician},
+ year = {1973},
+ volume = {27},
+ pages = {17},
+ doi = {10.1080/00031305.1973.10478966},
+}
+
+
+
+
+
+@ARTICLE{roberts69,
+ author = {{Roberts}, K.V.},
+ title = {The publication of scientific fortran programs},
+ journal = {Computer Physics Communications},
+ year = {1969},
+ volume = {1},
+ pages = {1},
+ doi = {10.1016/0010-4655(69)90011-3},
}