\documentclass[10.5pt]{article}

%% This is a convenience variable if you are using PGFPlots to build plots
%% within LaTeX. If you want to import PDF files for figures directly, you
%% can use the standard `\includegraphics' command. See the definition of
%% `\includetikz' in `tex/preamble-pgfplots.tex' for where the files are
%% assumed to be if you use `\includetikz' when `\makepdf' is not defined.
\newcommand{\makepdf}{}

%% When defined (value is irrelevant), `\highlightchanges' will cause text
%% in `\tonote' and `\new' to become colored. This is useful in cases that
%% you need to distribute drafts that are undergoing revision and you want
%% to highlight to your colleagues which parts are new and which parts are
%% only for discussion.
\newcommand{\highlightchanges}{}

%% Import the necessary preambles.
\input{tex/src/preamble-style.tex}
\input{tex/build/macros/project.tex}
\input{tex/src/preamble-pgfplots.tex}
\input{tex/src/preamble-biblatex.tex}





\title{Maneage: Customizable framework for Managing Data Lineage}
\author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2},
        \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2},
        \large\mpregular \authoraffil{Roberto Baena Gall\'e}{1,2}\\
  {
    \footnotesize\mplight
    \textsuperscript{1} Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200 La Laguna, Tenerife, ES.\\
    \textsuperscript{2} Facultad de F\'isica, Universidad de La Laguna, Avda. Astrof\'isico Fco. S\'anchez s/n, 38200, La Laguna, Tenerife, ES.\\
    Corresponding author: Mohammad Akhlaghi
    (\href{mailto:mohammad@akhlaghi.org}{\textcolor{black}{mohammad@akhlaghi.org}})
  }}
\date{}





\begin{document}%\layout
\thispagestyle{firstpage}
\maketitle

%% Abstract
{\noindent\mpregular
  The era of big data has ushered in an era of big responsibility.
  In the absence of reproducibility, as a test on controlling the data lineage, the result's integrity will be subject to perpetual debate.
  Maneage (management + lineage) is introduced here as a host to the computational and narrative components of an analysis.
  Analysis steps are added to a new project with lineage in mind, thus facilitating the project's execution and testing as the project evolves, while being friendly to publishing and archival because it is wholly in machine\--action\-able, and human\--read\-able, plain-text.
  Maneage is founded on the principles of completeness (e.g., no dependency beyond a POSIX-compatible operating system, no administrator privileges, or no network connection), modular and straight-forward design, temporal lineage and free software.
  The lineage isn't limited to downloading the inputs and processing them automatically, but also includes building the necessary software with fixed versions and build configurations.
  Maneage also builds the final PDF report of the project, establishing direct and automatic links between the data analysis and the narrative, with the precision of a word in a sentence.
  Maneage enables incremental projects, where a new project can branch off an existing one, with moderate changes to enable experimentation on published methods.
  Once Maneage is implemented on a sufficiently wide scale, it can aid in automatic and optimized workflow creation through machine learning, or automating data management plans.
  Maneage was a recipient of the Research Data Alliance (RDA) Europe Adoption Grant in 2019.
  This paper is itself written in Maneage with snapshot \projectversion.
  \horizontalline

  \noindent
  {\mpbold Keywords:} Data Lineage, Data Provenance, Reproducibility, Scientific Pipelines, Workflows
}

\horizontalline










\section{Introduction}
\label{sec:introduction}

The increasing volume and complexity of data analysis has been highly productive, giving rise to a new branch of ``Big Data'' in many fields of the sciences and industry.
However, given its inherent complexity, the mere results are barely useful alone.
Questions such as these commonly follow any such result:
What inputs were used?
What operations were done on those inputs? How were the configurations or training data chosen?
How did the quantitative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation?
May there be a bias in the visualization?
See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.

In data science and database management, this type of metadata is commonly known as \emph{data provenance}, and the lower-level implementation is \emph{data lineage} (for more on the definitions, see Section \ref{sec:definitions}).
Data lineage is being increasingly demanded for integrity checking from both the scientific and industrial/legal domains.
Notable examples in each domain are respectively the ``Reproducibility crisis'' in the sciences that was claimed by the Nature journal \citep{baker16}, and the General Data Protection Regulation (GDPR) by the European Parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively.
The former argues that reproducibility (as a test on sufficiently conveying the data lineage) is necessary for other scientists to study, check and build-upon each other's work.
The latter requires the data intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage.
Besides regulation and integrity checks, having robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation on alternative methods, or optimization of the workflow.

\begin{figure}[t]
  \begin{center}
    \includetikz{figure-project-outline}
  \end{center}
  \vspace{-17mm}
  \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step.
    The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs.
    The red boxes with dashed edges highlight the main questions on the respective stage.
    The orange box surrounding the software download and build phases shows the various commonly recognized solutions to the questions in it; for more, see Appendix \ref{appendix:jobmanagement}.
  }
\end{figure}

Due to the complexity of modern data analysis, a small, but possibly significant, deviation in the final result can be due to many different steps.
Publishing the precise codes of the analysis is the only guarantee that the source of such a deviation can be found.
For example, \citet{smart18} describes how a 7-year-old conflict in theoretical condensed matter physics was only identified after the relevant codes were shared.
Nature is already a black box which we are trying hard to unlock, or understand.
Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers.

\citet{miller06} reported a mistaken column flipping that led to the retraction of 5 papers in major journals, including Science.
\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions.
The reason such reports are mostly from genomics and bioinformatics is that these fields have traditionally been more open to publishing workflows: for example \href{https://www.myexperiment.org}{myexperiment.org}, which mostly uses Apache Taverna \citep{oinn04}, or \href{https://www.genepattern.org}{genepattern.org} \citep{reich06}, \href{https://galaxyproject.org}{galaxy\-project.org} \citep{goecks10}, among others.
Such integrity checks are a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows).
The status in other fields, where workflows aren't commonly shared, is probably (much) worse.

The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (supposedly on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors?
Several studies have attempted to answer this with different levels of detail.
For example \citet{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \citet{menke20} found this fraction has greatly improved in medical/biological journals over the last two decades (currently above $80\%$).

\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but only fully succeeded in 2 of them and partially in 6.
\citet{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all.
\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
Even though the authors were contacted, the success rate was $26\%$.
Generally, this problem is unambiguously felt in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``reproducibility crisis''.

This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}.
The proposed solutions were published in a special edition.
Before that, in an attempt to simulate research projects, \citet{ioannidis05} proved that ``most claimed research findings are false''.
In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} described this same problem very eloquently and also provided some solutions that they used.
While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists.
Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
One of the earliest such efforts we found was \citet{roberts69} who discussed conventions in FORTRAN programming and documentation to help in publishing research codes.

Another fundamental problem in the design of a project's lineage/workflow is ``workflow decay'' \citep{zhao12}.
Besides data availability, the ``decay'' in the software tools that the workflow needs is also critical: re-building the full software dependency tree on a random system is very hard.
Generally, software is not a secular component of projects, where one software can easily be swapped with another.
Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citep{dicosmo19}.

In this paper we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies.
A project using Maneage starts by branching from the main Git branch of Maneage and then customizes it: specifying the necessary software tools for that particular project, adding analysis steps and writing a narrative based on the analysis results.
The temporal provenance of the project is fully preserved in Git, and allows merging of the project with the core branch to update the low-level infrastructure (common to all projects) without changing the high-level steps specific to this project.
In Sections \ref{sec:definitions} and \ref{sec:principles} the basic concepts are defined and the founding principles of Maneage are discussed.
Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects.


\section{Definitions}
\label{sec:definitions}

The concepts and terminologies of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution providers.
As a consequence, before starting with the technical details it is important to clarify the specific terms used throughout this paper and its appendix.

\begin{enumerate}[label={\bf D\arabic*}]
\item \label{definition:input}\textbf{Input:}
  Any computer file needed by a project that may be usable in other projects.
  The inputs of a project include data or software source code, see \citet{hinsen16} on the fundamental similarity of data and source code.
  Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project.
  However, they can, and most often will, be used in other/later projects also.

\item \label{definition:output}\textbf{Output:}
  Any computer file that is published at the end of the project.
  The output(s) of a project can be a published narrative paper, datasets (e.g., table(s), image(s), a number, or Boolean: confirming a hypothesis as true or false), automatically generated software source code, or any other computer file.

\item \label{definition:project}\textbf{Project:}
  The high-level series of operations that are done on input(s) to produce outputs.
  This definition is therefore very similar to ``workflow'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, the project defined here also includes the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how the visualizations in it were created.

  The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not of doing the analysis itself.
  A good project will follow the modularity principle: analysis scripts should be well-defined as an independently managed software source with clearly defined inputs and outputs.
  Examples are modules in Python, packages in R, or libraries/programs in C/C++ that can be executed by the higher-level project source when necessary.
  Maintaining these lower-level components as independent software projects enables their easy usage in other projects.
  Therefore here, they are defined as inputs (not the project).

\item \label{definition:provenance}\textbf{Data Provenance:}
  A dataset's provenance is the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it.
  Data provenance thus provides a high-level \emph{and structured} view of a project's lineage.
  A good example of this is Research Objects \citep{belhajjame15}.

% This definition of data lineage is inspired from https://stackoverflow.com/questions/43383197/what-are-the-differences-between-data-lineage-and-data-provenance:

% "data provenance includes only high level view of the system for business users, so they can roughly navigate where their data come from.
% It's provided by variety of modeling tools or just simple custom tables and charts.
% Data lineage is a more specific term and includes two sides - business (data) lineage and technical (data) lineage.
% Business lineage pictures data flows on a business-term level and it's provided by solutions like Collibra, Alation and many others.
% Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level - actual tables, scripts and statements.
% Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. "
\item \label{definition:lineage}\textbf{Data Lineage:}
Data lineage is commonly used interchangeably with Data provenance \citep[for example][]{cheney09}.
For clarity, we define the term ``Data lineage'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands).
Therefore data lineage is synonymous with ``project'' as defined above.
\item \label{definition:reproduction}\textbf{Reproducibility \& Replicability:}
  These terms have been used in the literature with various meanings, sometimes in a contradictory way.
  It is important to highlight that in this paper we are only considering computational analysis: \emph{after} data has been collected and stored as a file.
  Therefore, many of the definitions reviewed in \citet{plesser18} that are about data collection do not apply here.
  We adopt the same definitions as \citet{leek17,fineberg19}, among others.
  Note that these definitions are contrary to some others, for example the ACM policy guidelines\footnote{\url{https://www.acm.org/publications/policies/artifact-review-badging}} (dated April 2018) and \citet{hinsen15}.

  \citet{fineberg19} define reproducibility as \emph{obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis}, or same inputs $\rightarrow$ consistent result.
  They define replicability as \emph{obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data}, or different inputs $\rightarrow$ consistent result.
  Since replicability involves new data collection \citep[e.g., see][]{kaiser18}, it is outside the purely computational scope considered in this paper.
\end{enumerate}










\section{Principles}
\label{sec:principles}

The core principle behind Maneage is simple: science is defined by its method, not its result.
\citet{buckheit1995} summarize this nicely by noting that modern scientific papers (narrative combined with plots, tables and figures) are merely advertisements of a scholarship; the actual scholarship is the scripts and software usage that went into doing the analysis.

Maneage is not the first attempted solution to this fundamental problem.
Various solutions have been proposed since the early 1990s, for example RED \citep{claerbout1992,schwab2000}, Apache Taverna \citep{oinn04}, Madagascar \citep{fomel13}, GenePattern \citep{reich06}, Kepler \citep{ludascher05}, VisTrails \citep{bavoil05}, Galaxy \citep{goecks10}, Image Processing On Line journal \citep[IPOL][]{limare11}, WINGS \citep{gil10}, Active papers \citep{hinsen11}, Collage Authoring Environment \citep{nowakowski11}, SHARE \citep{vangorp11}, Verifiable Computational Result \citep{gavish11}, SOLE \citep{pham12}, Sumatra \citep{davison12}, Sciunit \citep{meng15}, Popper \citep{jimenez17}, WholeTale \citep{brinckman19}, and many more.
To highlight the uniqueness of Maneage in this plethora of tools, a more elaborate list of principles is necessary, as described below.

\begin{enumerate}[label={\bf P\arabic*}]
\item \label{principle:complete}\textbf{Complete:}
  A project that is complete, or self-contained, doesn't depend on anything beyond the Portable Operating System Interface (POSIX), doesn't affect the host system, doesn't require root/administrator privileges, doesn't need an internet connection (when its inputs are on the filesystem), and is fully recorded and executable in plain-text\footnote{Plain text format doesn't include document container formats like \inlinecode{.odf} or \inlinecode{.doc}, for software like LibreOffice or Microsoft Office.} format (e.g., ASCII or Unicode).

  A complete project can automatically access the inputs (see definition \ref{definition:input}), build its necessary software (instructions on configuring, building and installing those software in a fixed environment), do the analysis (run the software on the data) and create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML).
  No manual/human interaction is required within a complete project, as \citet{claerbout1992} put it: ``a clerk can do it''.
  Generally, manual intervention in any of the steps above, or an interactive interface, is an incompleteness.
  Finally, the plain-text format is particularly important because any other storage format will require higher-level software \emph{before} the project.

  \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete.
  They all have many dependencies far beyond POSIX, for example the more recent ones are written in Python or use Jupyter notebooks \citep{kluyver16}.
  Such high-level tools have very short lifespans and evolve very fast; the most recent example is Python 3, which is not compatible with Python 2.
  They also have very complex dependency trees, making them extremely vulnerable to updates, for example see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies).
  The longevity of a data lineage, or workflow (not the analysis itself), is determined by its shortest-lived dependency.

  Many existing tools therefore don't attempt to store the project as plain text, but pre-built binary blobs (containers or virtual machines) that can rarely be recreated\footnote{Using the package manager of the container's OS, or Conda which are both highly dependent on the time they are created.} and also have a short lifespan\footnote{For example Docker only works on Linux kernels that are on long-term support, not older.
    Currently this is Linux 3.2.x that was initially released 8 years ago in 2012. The current Docker images may not be usable in a similar time frame in the future.}.
  Once the lifespan of a binary workflow's dependency has passed, it is useless: that binary file cannot be opened to read or executed.
  But as plain-text, even if it is no longer executable due to much evolved technologies, it is still human-readable and parsable by any machine.

\item \label{principle:modularity}\textbf{Modularity:}
A project should be compartmentalized or partitioned into independent modules or components with well-defined inputs/outputs having no side-effects.
In a modular project, communication between the independent modules is explicit, providing optimizations on multiple levels:
1) Execution: independent modules can run in parallel, or modules that don't need to be run (because their dependencies haven't changed) won't be re-done.
2) Data lineage and data provenance extraction (recording any dataset's origins).
3) Citation: allowing others to credit specific parts of a project.
This principle doesn't just apply to the analysis, it also applies to the whole project, for example see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:definitions}.

\emph{Comparison with existing:} Most are agnostic to modularity: leaving such design choices to the experience of project authors.
But designing a modular project needs to be encouraged and facilitated; otherwise, scientists (who are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost or scientific accuracy.
Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools above (which usually require programming) don't.

\item \label{principle:complexity}\textbf{Minimal complexity:}
  This principle is essentially Occam's razor: ``Never posit pluralities without necessity'' \citep{schaffer15}, but extrapolated to project management:
  1) avoid complex relations between analysis steps (which is not unrelated to the principle of modularity in \ref{principle:modularity}).
  2) avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and significant resources are required to translate or rewrite it every few years (to stay in vogue).
  The same job can be done with more stable/basic tools, and less effort in the long-run.
  Unlike software engineers who learn a new tool every two years and don't require long lifespan for their projects, scientists need to focus on their own research domain and are unable to stay up to date with the vogue, creating a generational gap when they do.
  This is very bad for the scientists: valuable detailed experience can't pass through generations, or tools have to be re-written at a high cost that could have gone to actual research.

  \emph{Comparison with existing:} Most of the existing tools use the language that was in vogue when they were created, for example a larger fraction of them are written in Python as we come closer to the present time.
  Again, IPOL stands out from the rest in this principle.

\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
The project should contain automatic verification checks on its inputs (software source code and data) and outputs.
When applied, expert knowledge won't be necessary to confirm the correct reproduction.

\emph{Comparison with existing:} Such verification is usually possible in most systems, but fully maintained by the user.
Automatic verification of inputs is implemented in some cases, but rarely that of the outputs.

\item \label{principle:history}\textbf{History and temporal provenance:}
No project is done in a single/first attempt.
Projects evolve as they are being completed.
It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
This is often seen in scientific papers, with statements like ``we [first] tried method [or parameter] XXXX, but YYYY is used here because it was shown to have better precision [or less bias, etc.]''.
A project's ``history'' is thus as scientifically relevant as the final, or published, snapshot of the project.

\emph{Comparison with existing:} The systems above that are implemented around version control usually support this principle.
However, because they are rarely complete (as discussed in principle \ref{principle:complete}), this history is also not complete.
IPOL, which uniquely stands out in other principles, fails at this one: only the final snapshot is published.

\item \label{principle:freesoftware}\textbf{Free and open source software:}
  Technically, as defined in definition \ref{definition:reproduction}, reproducibility is also possible with a non-free or non-open-source software (a black box).
  This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and the industry:
  1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source.
  2) A non-free software may not be executable on a given/future hardware; if it is free, the project members can modify it to work.
  3) A non-free software cannot be distributed by the authors, making the whole community reliant only on the proprietary owner's server (even if the proprietary software doesn't ask for payments), also see Section \ref{sec:publishing}.

  \emph{Comparison with existing:} The existing solutions listed above are all free software.
  There are non-free existing solutions also, but we do not consider them here because of this principle.
\end{enumerate}
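
As a minimal sketch of the automatic verification described in principle \ref{principle:verify}, the Python snippet below compares the SHA-256 checksum of a file with a previously recorded value.
It is only an illustration under stated assumptions: the function names \inlinecode{sha256\_of} and \inlinecode{verify} are hypothetical, and this is not necessarily how Maneage itself verifies its inputs or outputs.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so that large datasets need not fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the checksum matches; raise ValueError otherwise.

    `expected` is the checksum recorded when the input (or output) was
    first obtained; both names are hypothetical, for illustration only.
    """
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"{path}: expected {expected}, got {actual}")
    return True
```

With such a check stored in the project's plain-text source, a mismatch stops the run immediately, so no expert knowledge is needed to notice that an input or output differs from the recorded one.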



\section{Implementation of Maneage}
\label{sec:maneage}

\subsection{Job orchestration with Make}
\label{sec:usingmake}

\subsection{General implementation structure}
\label{sec:generalimplementation}

\subsection{Project configuration}
\label{sec:projectconfigure}

\subsubsection{Setting local directories}
\label{sec:localdirs}

\subsubsection{Checking for a C compiler}
\label{sec:ccompiler}

\subsubsection{Verifying and building necessary software from source}
\label{sec:buildsoftware}

\subsubsection{Software citation}
\label{sec:softwarecitation}

\subsection{High-level organization of analysis}
\label{sec:highlevelanalysis}

\subsubsection{Isolated analysis environment}
\label{sec:analysisenvironment}

\subsubsection{Preparation phase}
\label{sec:prepare}

\subsection{Low-level organization of analysis}
\label{sec:lowlevelanalysis}

\subsubsection{Non-recursive Make}
\label{sec:nonrecursivemake}

\subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})}
\label{sec:paperpdf}

\subsubsection{Values within text (\inlinecode{project.tex})}
\label{sec:valuesintext}

\subsubsection{Verification of outputs (\inlinecode{verify.mk})}
\label{sec:outputverification}

\subsubsection{Project initialization (\inlinecode{initialize.mk})}
\label{sec:initialize}

\subsubsection{Importing and validating inputs (\inlinecode{download.mk})}
\label{sec:download}

\subsubsection{The analysis}
\label{sec:analysis}

\subsubsection{Configuration files}
\label{sec:configfiles}

\subsection{Projects as Git branches of Maneage}
\label{sec:starting}

\subsection{Multi-user collaboration on single build directory}
\label{sec:collaborating}

\subsection{Publishing the project}
\label{sec:publishing}

\subsubsection{Automatic creation of publication tarball}
\label{sec:makedist}

\subsubsection{What to publish, and where?}
\label{sec:whatpublish}

\subsubsection{Worries about getting scooped!}
\label{sec:scooped}

\subsection{Future of Maneage and its past}
\label{sec:futurework}

\section{Discussion}
\label{sec:discussion}





%% Acknowledgements
\section{Acknowledgments}
The authors wish to thank Pedram Ashofteh Ardakani, Zahra Sharbaf and Surena Fatemi for their useful suggestions and feedback on Maneage and this paper and to David Valls-Gabaud, Ignacio Trujillo, Johan Knapen, Roland Bacon for their support.

Work on the reproducible paper template has been funded by the Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) scholarship and its Grant-in-Aid for Scientific Research (21244012, 24253003), the European Research Council (ERC) advanced grant 339659-MUSICOS, European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement No 721463 to the SUNDIAL ITN, and from the Spanish Ministry of Economy and Competitiveness (MINECO) under grant number AYA2016-76219-P.
The reproducible paper template was also supported by European Union’s Horizon 2020 (H2020) research and innovation programme via the RDA EU 4.0 project (ref. GA no. 777388).

\input{tex/build/macros/dependencies.tex}

%% Tell BibLaTeX to put the bibliography list here.
\printbibliography

%% Finish LaTeX
\end{document}

%% This file is part of a paper describing the Maneage workflow system
%%   https://gitlab.com/makhlaghi/maneage-paper
%
%% This file is free software: you can redistribute it and/or modify it
%% under the terms of the GNU General Public License as published by the
%% Free Software Foundation, either version 3 of the License, or (at your
%% option) any later version.
%
%% This file is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
%% for more details.
%
%% You should have received a copy of the GNU General Public License along
%% with this file.  If not, see <https://www.gnu.org/licenses/>.