Diffstat (limited to 'paper.tex')
 paper.tex | 869 ++++++++++++++++++++++-----
 1 file changed, 711 insertions(+), 158 deletions(-)
diff --git a/paper.tex b/paper.tex
index b7d5c8d..d190501 100644
--- a/paper.tex
+++ b/paper.tex
@@ -1,15 +1,28 @@
-%% Copyright (C) 2018-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org>
-%% See the end of the file for license conditions.
-\documentclass[10pt, twocolumn]{article}
-
-%% (OPTIONAL) CONVENIENCE VARIABLE: Only relevant when you use Maneage's
-%% '\includetikz' macro to build plots/figures within LaTeX using TikZ or
-%% PGFPlots. If so, when the Figure files (PDFs) are already built, you can
-%% avoid TikZ or PGFPlots completely by commenting/removing the definition
-%% of '\makepdf' below. This is useful when you don't want to slow-down a
-%% LaTeX-only build of the project (for example this happens when you run
-%% './project make dist'). See the definition of '\includetikz' in
-%% `tex/preamble-pgfplots.tex' for more.
+%% Main LaTeX source of project's paper.
+%
+%% Copyright (C) 2020-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org>
+%% Copyright (C) 2020-2021 Raúl Infante-Sainz <infantesainz@gmail.com>
+%% Copyright (C) 2020-2021 Boudewijn F. Roukema <boud@astro.uni.torun.pl>
+%% Copyright (C) 2020-2021 Mohammadreza Khellat <mkhellat@ideal-information.com>
+%% Copyright (C) 2020-2021 David Valls-Gabaud <david.valls-gabaud@obspm.fr>
+%% Copyright (C) 2020-2021 Roberto Baena-Gallé <roberto.baena@gmail.com>
+%
+%% This file is free software: you can redistribute it and/or modify it
+%% under the terms of the GNU General Public License as published by the
+%% Free Software Foundation, either version 3 of the License, or (at your
+%% option) any later version.
+%
+%% This file is distributed in the hope that it will be useful, but WITHOUT
+%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+%% for more details. See <http://www.gnu.org/licenses/>.
+\documentclass[journal]{IEEEtran}
+
+%% This is a convenience variable if you are using PGFPlots to build plots
+%% within LaTeX. If you want to import PDF files for figures directly, you
+%% can use the standard `\includegraphics' command. See the definition of
+%% `\includetikz' in `tex/preamble-pgfplots.tex' for where the files are
+%% assumed to be if you use `\includetikz' when `\makepdf' is not defined.
\newcommand{\makepdf}{}
%% VALUES FROM ANALYSIS (NUMBERS AND STRINGS): this file is automatically
@@ -17,207 +30,747 @@
%% (defined with '\newcommand') for various processing outputs to be used
%% within the paper.
\input{tex/build/macros/project.tex}
-
-%% MANEAGE-ONLY PREAMBLE: this file contains LaTeX constructs that are
-%% provided by Maneage (for example enabling or disabling of highlighting
-%% from the './project' script). They are not style-related.
\input{tex/src/preamble-maneage.tex}
-%% PROJECT-SPECIFIC PREAMBLE: This is where you can include any LaTeX
-%% setting for customizing your project.
+%% Import the other necessary TeX files for this particular project.
\input{tex/src/preamble-project.tex}
+%% Title and author names.
+\title{\projecttitle}
+\author{
+ Mohammad Akhlaghi,
+ Ra\'ul Infante-Sainz,
+ Boudewijn F. Roukema,
+ Mohammadreza Khellat,\\
+ David Valls-Gabaud,
+ Roberto Baena-Gall\'e
+ \thanks{Manuscript received MM DD, YYYY; revised MM DD, YYYY.}
+}
+%% The paper headers
+\markboth{Computing in Science and Engineering, Vol. X, No. X, MM YYYY}%
+{Akhlaghi \MakeLowercase{\textit{et al.}}: \projecttitle}
-%% PROJECT TITLE: The project title should also be printed as metadata in
-%% all output files. To avoid inconsistancy caused by manually typing it,
-%% the title is defined with other core project metadata in
-%% 'reproduce/analysis/config/metadata.conf'. That value is then written in
-%% the '\projectitle' LaTeX macro which is available in 'project.tex' (that
-%% was loaded above).
-%
-%% Please set your project's title in 'metadata.conf' (ideally with other
-%% basic project information) and re-run the project to have your new
-%% title. If you later use a different LaTeX style, please use the same
-%% '\projectitle' in it (after importing 'tex/build/macros/project.tex'
-%% like above), don't type it by hand.
-\title{\large \uppercase{\projecttitle}}
-%% AUTHOR INFORMATION: For a more fine-grained control of the headers
-%% including author name, or paper info, see
-%% `tex/src/preamble-header.tex'. Note that if you plan to use a journal's
-%% LaTeX style file, you will probably set the authors in a different way,
-%% feel free to change them here, this is just basic style and varies from
-%% project to project.
-\author[1]{Your name}
-\author[2]{Coauthor one}
-\author[1,3]{Coauthor two}
-\affil[1]{The first affiliation in the list.; \url{your@email.address}}
-\affil[2]{Another affilation can be put here.}
-\affil[3]{And generally as many affiliations as you like.
-\par \emph{Received YYYY MM DD; accepted YYYY MM DD; published YYYY MM DD}}
-\date{}
+%% Start the paper.
+\begin{document}
+% make the title area
+\maketitle
+
+% As a general rule, do not put math, special symbols or citations
+% in the abstract or keywords.
+\begin{abstract}
+ %% CONTEXT
+ Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term.
+ %% AIM
+ A set of criteria is introduced to address this problem:
+ %% METHOD
+ Completeness (no \new{execution requirement} beyond \new{a minimal Unix-like operating system}, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
+ They have been tested in several research publications in various fields.
+ %% RESULTS
+ As a proof of concept, ``Maneage'' is introduced, enabling cheap archiving, provenance extraction, and peer verification.
+ %% CONCLUSION
+ We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility.
+ The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders.
+ This paper is itself written with Maneage (project commit \projectversion).
+
+ \vspace{2.5mm}
+ \emph{Appendices} ---
+ Two comprehensive appendices that review the longevity of existing solutions; available
+\ifdefined\separatesupplement
+as supplementary ``Web extras'' on the journal webpage.
+\else
after the main body of the paper (Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}).
+\fi
+
+ \vspace{2.5mm}
+ \emph{Reproducibility} ---
+ All products in \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}},
+ Git history of source at \href{https://gitlab.com/makhlaghi/maneage-paper}{\texttt{gitlab.com/makhlaghi/maneage-paper}},
+ which is also archived in \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.com/makhlaghi/maneage-paper.git}{SoftwareHeritage}.
+\end{abstract}
+
+% Note that keywords are not normally used for peer-review papers.
+\begin{IEEEkeywords}
+Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
+\end{IEEEkeywords}
+
+
+
+
+
+
+
+% For peer review papers, you can put extra information on the cover
+% page as needed:
+% \ifCLASSOPTIONpeerreview
+% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
+% \fi
+%
+% For peerreview papers, this IEEEtran command inserts a page break and
+% creates the second title. It will be ignored for other modes.
+\IEEEpeerreviewmaketitle
+
+
+
+\section{Introduction}
+% The very first letter is a 2 line initial drop letter followed
+% by the rest of the first word in caps.
+%\IEEEPARstart{F}{irst} word
+
+Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
+Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed that mostly rely on the common technology of the day,
+starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mostly shifting to Python during the last decade.
+
+However, these technologies develop fast, e.g., code written in Python 2 \new{(which is no longer officially maintained)} often cannot run with Python 3.
+The cost of staying up to date within this rapidly-evolving landscape is high.
+Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools because it determines their results and interpretations.
+Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
+
+
+
+
+
+\section{Longevity of existing tools}
+\label{sec:longevityofexisting}
+\new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
+Longevity is defined as the length of time that a project remains \emph{functional} after its creation.
+Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary).
+Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data.
+Extracting these from execution outputs is not always possible.}
+A basic review of the longevity of commonly used tools is provided here \new{(for a more comprehensive review, please see
+ \ifdefined\separatesupplement
+ the supplementary appendices%
+ \else%
+ appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}%
+ \fi%
+ ).
+}
+
+To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011, but discontinued in 2019).
+However, containers (e.g., Docker or Singularity) are currently the most widely-used solution.
+We will focus on Docker here because it is currently the most common.
+
+\new{It is hypothetically possible to precisely identify the used Docker ``images'' with their checksums (or ``digest'') to re-create an identical OS image later.
+However, that is rarely done.}
+Usually images are imported with operating system (OS) names; e.g., \cite{mesnard20}
+\ifdefined\separatesupplement
+\new{(more examples in the \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices})}%
+\else%
\new{(more examples: see Appendix \ref{appendix:existingtools})}%
+\fi%
+{ }imports `\inlinecode{FROM ubuntu:16.04}'.
+The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly, and only the most recent five are archived there.
+Hence, if the image is built in different months, it will contain different OS components.
+% CentOS announcement: https://blog.centos.org/2020/12/future-is-centos-stream
+In the year 2024, when long-term support (LTS) for this version of Ubuntu expires, the image will be unavailable at the expected URL \new{(if not aborted earlier, like CentOS 8 which will be terminated 8 years early).}
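The digest-based identification mentioned above can be sketched in a Dockerfile as follows; the digest value shown is a deliberately invalid placeholder, not a real image checksum:

```dockerfile
# Tag-based import: the underlying image is rebuilt almost monthly,
# so the resulting OS components depend on the build date.
# FROM ubuntu:16.04

# Digest-based import: an immutable reference to one exact image build
# (this sha256 value is a hypothetical placeholder).
FROM ubuntu@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

Even with such pinning, longevity still depends on the pinned image remaining retrievable from a registry.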
+
+Generally, \new{pre-built} binary files (like Docker images) are large and expensive to maintain and archive.
+%% This URL: https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates}
+\new{Because of this, DockerHub (where many reproducible workflows are archived) announced that inactive images (older than 6 months) will be deleted in free accounts from mid 2021.}
+Furthermore, Docker requires root permissions, and only supports recent (LTS) versions of the host kernel: older Docker images may not be executable \new{(their longevity is determined by the host kernel, typically a decade).}
+
+Once the host OS is ready, package managers (PMs) are used to install the software or environment.
+Usually the OS's PM, such as `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
+The former has \new{the same longevity} as the OS, while some of the latter (such as Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation \new{with a typical longevity of a few years}.
+Nix and GNU Guix produce bit-wise identical programs \new{with considerably better longevity; that of their supported CPU architectures}.
+However, they need root permissions and are primarily targeted at the Linux kernel.
+Generally, in all the package managers, the exact version of each software (and its dependencies) is not precisely identified by default, although an advanced user can indeed fix them.
+Unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
+Furthermore, because third-party PMs introduce their own language, framework, and version history (the PM itself may evolve) and are maintained by an external team, they increase a project's complexity.
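As a sketch of how an ``advanced user'' can fix versions in a third-party PM, a Conda environment file can pin exact versions; the package names and versions below are illustrative, and note that the PM itself (Conda, and its host Python dependency) remains unpinned by this file:

```yaml
# environment.yml -- hypothetical example of exact version pinning.
name: paper-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - numpy=1.19.1
  - matplotlib=3.3.1
```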
+
+With the software environment built, job management is the next component of a workflow.
+Visual/GUI workflow tools like Apache Taverna, GenePattern \new{(deprecated)}, Kepler or VisTrails \new{(deprecated)}, which were mostly introduced in the 2000s and used Java or Python 2, encourage modularity and robust job management.
+\new{However, a GUI environment is tailored to specific applications and is hard to generalize, while being hard to reproduce once the required Java Virtual Machine (JVM) is deprecated.
+These tools' data formats are complex (designed for computers to read) and hard to read by humans without the GUI.}
+The more recent tools (mostly non-GUI, written in Python) leave this to the authors of the project.
+Designing a robust project needs to be encouraged and facilitated because scientists (who are not usually trained in project or data management) will rarely apply best practices.
+This includes automatic verification, which is possible in many solutions, but is rarely practiced.
+Besides non-reproducibility, weak project management leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding, or validating will be expensive).
+
+Finally, to blend narrative and analysis, computational notebooks \cite{rule18}, such as Jupyter, are currently gaining popularity.
+However, because of their complex dependency trees, their build is vulnerable to the passage of time; e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
+Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
+\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
+By default, cell dependencies (e.g., automatically running some cells only after certain others), parallel execution, and the use of more than one language are not supported.
+There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
+However, since they are not part of the core, their longevity can be assumed to be shorter.
+Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
+In summary, notebooks can rarely deliver their promised potential \cite{rule18} and may even hamper reproducibility \cite{pimentel19}.
+
+
+
+
+
+\section{Proposed criteria for longevity}
+\label{criteria}
+The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for researchers and the community, than imposing it at the end \cite{austin17,fineberg19}.
+In this context, researchers play a critical role \cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles).
+Simply archiving a project workflow in a repository after the project is finished is, on its own, insufficient, and maintenance by repository staff is often practically infeasible or unscalable.
+We argue and propose that workflows satisfying the following criteria can not only improve researcher flexibility during a research project, but can also increase the FAIRness of the deliverables for future researchers:
+
+\textbf{Criterion 1: Completeness.}
+A project that is complete (self-contained) has the following properties.
+(1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
+Fewer explicit execution requirements mean higher \emph{execution possibility} and consequently higher \emph{longevity}.}
+(2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
+(3) No impact on the host OS libraries, programs and \new{environment variables}.
+(4) Does not require root privileges to run (during development or post-publication).
+(5) Builds its own controlled software \new{with independent environment variables}.
+(6) Can run locally (without an internet connection).
+(7) Contains the full project's analysis, visualization \emph{and} narrative: including instructions to automatically access/download raw inputs, build necessary software, do the analysis, produce final data products \emph{and} final published report with figures \emph{as output}, e.g., PDF or HTML.
+(8) Runs automatically, \new{without} human interaction.
+
+\textbf{Criterion 2: Modularity.}
+A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
+\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., software binaries or figures), software source code, or data should be included.
+The latter two (developing low-level software, collecting data, or the publishing and archival of both) are separate projects in themselves because they can be used in other independent projects.
+This optimizes the storage, archival/mirroring and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\times100$ kilobytes and the version controlled history may become a few megabytes.}
+
+In terms of the analysis workflow, explicit communication between various modules enables optimizations on many levels:
+(1) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run).
+(2) Usage in other projects.
+(3) Debugging and adding improvements (possibly by future researchers).
+(4) Citation of specific parts.
+(5) Provenance extraction.
+
+\textbf{Criterion 3: Minimal complexity.}
+Minimal complexity can be interpreted as:
+(1) Avoiding the language or framework that is currently in vogue (for the workflow, not necessarily the high-level analysis).
+A popular framework typically falls out of fashion and requires significant resources to translate or rewrite every few years \new{(for example Python 2, which is no longer supported)}.
+More stable/basic tools can be used with less long-term maintenance costs.
+(2) Avoiding too many different languages and frameworks; e.g., when the workflow's PM and analysis are orchestrated in the same framework, it becomes easier to maintain in the long term.
+
+\textbf{Criterion 4: Scalability.}
+A scalable project can easily be used in arbitrarily large and/or complex projects.
+On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable.
+
+\textbf{Criterion 5: Verifiable inputs and outputs.}
+The project should automatically verify its inputs (software source code and data) \emph{and} outputs, not needing any expert knowledge.
+
+\textbf{Criterion 6: Recorded history.}
+No exploratory research is done in a single, first attempt.
+Projects evolve as they are being completed.
+It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
+Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
+The derivation ``history'' of a result is thus not any less valuable than the result itself.
+
+\textbf{Criterion 7: Including narrative that is linked to analysis.}
+A project is not just its computational analysis.
+A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
+A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
+This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
+
+\textbf{Criterion 8: Free and open source software.}
+Non-free or non-open-source software typically cannot be distributed, inspected, or modified by others.
+Such software is reliant on a single supplier (even without payments) \new{and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}}.
+A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
+When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
+
+\new{It may happen that proprietary software is necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
+In such cases, it is best to immediately convert the data to free formats upon collection, and archive (e.g., on Zenodo) or use the data in free formats.}
+
+
+
+
+
+
+
+
+
+
+\section{Proof of concept: Maneage}
+
+With the longevity problems of existing tools outlined above, a proof-of-concept tool is presented here via an implementation that has been tested in published papers \cite{akhlaghi19, infante20}.
+\new{Since the initial submission of this paper, it has also been used in \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} (on the COVID-19 pandemic) and \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
+It was also awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows \cite{austin17}, from the researchers' perspective.
+
+The tool is called Maneage, for \emph{Man}aging data Lin\emph{eage} (the ending is pronounced as in ``lineage''), hosted at \url{https://maneage.org}.
+It was developed as a parallel research project over five years of publishing reproducible workflows of our research.
+The original implementation was published in \cite{akhlaghi15}, and evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
+
+Technically, the hardest criterion to implement was the first (completeness); in particular \new{restricting execution requirements to only a minimal Unix-like operating system}.
+One solution we considered was GNU Guix and Guix Workflow Language (GWL).
+However, because Guix requires root access to install, and only works with the Linux kernel, it failed the completeness criterion.
+Inspired by GWL+Guix, a single job management tool was implemented for both installing software \emph{and} the analysis workflow: Make.
+
+Make is not an analysis language, it is a job manager.
+Make decides when and how to call analysis steps/programs (in any language like Python, R, Julia, Shell, or C).
+Make \new{has been available since 1977; it is still heavily used in almost all components of modern Unix-like OSs} and is standardized in POSIX.
+It is thus mature, actively maintained, highly optimized, efficient in managing provenance, and recommended by the pioneers of reproducible research \cite{claerbout1992,schwab2000}.
+Researchers using free software have also already had some exposure to it \new{(most free research software is built with Make).}
+
+Linking the analysis and narrative (criterion 7) was historically our first design element.
+To avoid the problems with computational notebooks mentioned above, our implementation follows a more abstract linkage, providing a more direct and precise, yet modular, connection.
+Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through automatically-created \LaTeX{} macros, during the analysis.
+For example, \cite{akhlaghi19} writes `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'.
+The \LaTeX{} source of the quote above is: `\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}'.
+The macro `\inlinecode{\small\textbackslash{}demosfoptimizedsn}' is generated during the analysis and expands to the value `\inlinecode{0.25}' when the PDF output is built.
+Since values like this depend on the analysis, they should \emph{also} be reproducible, along with figures and tables.
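The linkage described above can be sketched as follows; the macro value is the one from this example, and the comments describe the convention (the macro file is generated by the analysis and imported through \inlinecode{tex/build/macros/project.tex}):

```latex
% Written automatically by the analysis, not by hand:
\newcommand{\demosfoptimizedsn}{0.25}

% Used in the narrative of paper.tex; re-running the analysis
% updates the number wherever the macro is used:
... detect the outer wings of M51 down to S/N of $\demosfoptimizedsn$ ...
```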
+
+These macros act as a quantifiable link between the narrative and analysis, with the granularity of a word in a sentence and a particular analysis command.
+This allows automatic updates to the embedded numbers during the experimentation phase of a project \emph{and} accurate post-publication provenance.
+Through the former, manual updates by authors (which are prone to errors and discourage improvements or experimentation after writing the first draft) are by-passed.
+
+Acting as a link, the macro files build the core skeleton of Maneage.
+For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
+These are combined at the end to generate precise software \new{acknowledgements} and citations that are shown in the
+\new{
+ \ifdefined\separatesupplement
+ \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{appendices}.%
+ \else%
+ appendices (\ref{appendix:software}).%
+ \fi%
+}
+(for other examples, see \cite{akhlaghi19, infante20}).
+\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
+These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
+The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
+All software dependencies are built down to precise versions of every tool, including the shell,\new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
+\new{The source code of all the free software used in Maneage is archived in and downloaded from \href{https://doi.org/10.5281/zenodo.3883409}{zenodo.3883409}.
+Zenodo promises long-term archival and also provides a persistent identifier for the files, which are sometimes unavailable at a software package's webpage.}
+
+On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library (glibc) is being added (task \href{http://savannah.nongnu.org/task/?15390}{15390}).
+Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org/task/?15267}{15267}), but that is only for building the final PDF, not affecting the analysis or verification.
+\new{Finally, some software cannot be built on some CPU architectures, hence by default, the architecture is included in the final built paper automatically (see below).}
+
+\new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
+However, this is only necessary once per project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment.
+Hence the few hours of initial software building is negligible compared to a project's life span.
+To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
+The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has instructions on building in Docker.
+Through containers or VMs, users on non-Unix-like OSs (like Microsoft Windows) can use Maneage.
+For Windows-native software that can be run in batch-mode, evolving technologies like Windows Subsystem for Linux may be usable.}
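For reference, the high-level interface of a Maneage'd project is its \inlinecode{./project} script; a sketch of a full local build (following the project's \inlinecode{README.md}) is:

```shell
# Clone the project's version-controlled source.
git clone http://git.maneage.org/project.git
cd project

# Build the full software environment (the ~1.5 hour step; run once).
./project configure

# Run the analysis and build the final paper.pdf.
./project make
```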
+
+The analysis phase of the project, however, naturally differs from one project to another at a low level.
+It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability, and minimal complexity.
+This design is demonstrated with the example of Figure \ref{fig:datalineage} (left).
+It is an enhanced replication of the ``tool'' curve of Figure 1C in \cite{menke20}.
+Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
+
+\begin{figure*}[t]
+ \begin{center}
+ \includetikz{figure-tools-per-year}{width=0.95\linewidth}
+% \includetikz{figure-data-lineage}{width=0.85\linewidth}
+ \end{center}
+ \vspace{-3mm}
+ \caption{\label{fig:datalineage}
+ Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
+ It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale).
+ Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
+ Each colored box is a file in the project and \new{arrows show the operation of various software: linking input file(s) to output file(s)}.
+ Green files/boxes are plain-text files that are under version control and in the project source directory.
+ Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
+ For example, \inlinecode{paper.pdf} \new{is created by running \LaTeX{} on} \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
+ \new{Other software is used in other steps.}
+ The solid arrows and full-opacity built boxes correspond to the lineage of this paper.
+ The dotted arrows and built boxes show the scalability of Maneage (ease of adding hypothetical steps to the project as it evolves).
+ The underlying data of the left plot is available at
+ \href{https://zenodo.org/record/\projectzenodoid/files/tools-per-year.txt}{zenodo.\projectzenodoid/tools-per-year.txt}.
+ }
+\end{figure*}
+
+The analysis is orchestrated through a single point of entry (\inlinecode{top-make.mk}, which is a Makefile; see Listing \ref{code:topmake}).
+It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
+This is visualized in Figure \ref{fig:datalineage} (right) where no built (blue) file is placed directly over \inlinecode{top-make.mk}.
+A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
+A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
+
+\begin{lstlisting}[
+ label=code:topmake,
+ caption={This project's simplified \inlinecode{top-make.mk}, also see Figure \ref{fig:datalineage}}
+ ]
+# Default target/goal of project.
+all: paper.pdf
+
+# Define subMakefiles to load in order.
+makesrc = initialize \ # General
+ download \ # General
+ format \ # Project-specific
+ demo-plot \ # Project-specific
+ verify \ # General
+ paper # General
+
+# Load all the configuration files.
+include reproduce/analysis/config/*.conf
+
+# Load the subMakefiles in the defined order
+include $(foreach s,$(makesrc), \
+ reproduce/analysis/make/$(s).mk)
+\end{lstlisting}
+
+All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
+Project authors add their modular subMakefiles in between.
+Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
+Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
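+This convention can be sketched as the following hypothetical subMakefile fragment (all target, program, and macro names below are ours, for illustration only; they are not taken from the actual project): the final rule writes a \LaTeX{} macro file carrying the subMakefile's base-name.
+
+\begin{lstlisting}[
+    label=code:submakesketch,
+    caption={Hypothetical sketch of a subMakefile's macro-file
+      convention (all names are illustrative)},
+  ]
+# Intermediate target, cascading down to the macro file.
+out.txt: format.txt
+        ./process format.txt > out.txt
+
+# Macro file with this subMakefile's base-name (demo-plot.mk):
+# it exposes a result to the narrative as a LaTeX macro.
+demo-plot.tex: out.txt
+        echo "\newcommand{\demoresult}{$$(cat out.txt)}" \
+             > demo-plot.tex
+\end{lstlisting}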
+
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}).
+All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
+Where exact reproducibility is not possible \new{(for example, due to parallelization)}, values can be verified by the project authors.
+\new{For example, see \href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
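+Conceptually, the verification stage amounts to a checksum comparison before the final target is allowed to build. The rule below is a much-simplified, hypothetical sketch (the actual \inlinecode{verify.mk} is more elaborate, and the checksum file name here is illustrative):
+
+\begin{lstlisting}[
+    label=code:verifysketch,
+    caption={Much-simplified, hypothetical sketch of checksum
+      verification (the actual \inlinecode{verify.mk} is more
+      elaborate)},
+  ]
+# Halt the lineage if any deliverable's checksum has changed.
+verify.tex: demo-plot.tex tools-per-year.txt
+        for f in $^; do \
+          sum=$$(sha256sum $$f | awk '{print $$1}'); \
+          grep -q "$$sum" expected-checksums.conf \
+            || { echo "$$f changed!"; exit 1; }; \
+        done; echo "%% verified" > verify.tex
+\end{lstlisting}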
+
+\begin{figure*}[t]
+ \begin{center} \includetikz{figure-branching}{scale=1}\end{center}
+ \vspace{-3mm}
+ \caption{\label{fig:branching} Maneage is a Git branch.
+ Projects using Maneage are branched off it and apply their customizations.
+ (a) A hypothetical project's history prior to publication.
+ The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage.
+ (b) A finished/published project can be revitalized for new technologies by merging with the core branch.
+ Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
+ Briefly, Git is a version control system, allowing a structured backup of project files.
+ Each Git ``commit'' effectively contains a copy of all the project's files at the moment it was made.
+ The upward arrows at the branch-tops are therefore in the time direction.
+ }
+\end{figure*}
+
+To minimize complexity further, the low-level implementation is separated from the high-level execution through configuration files.
+By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings, or parameters.
+Parameters are set as Make variables in ``configuration files'' (with a \inlinecode{.conf} suffix) and passed to the respective program by Make.
+For example, in Figure \ref{fig:datalineage} (right), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, thereby enabling exact verification before usage.
+To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
+The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{tools-per-year.txt}.
+Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
+An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without knowing the underlying low-level implementation.
+Furthermore, the configuration files are a prerequisite of the targets that use them.
+If one is changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, without touching the project's source or other built products.
+This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
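+As a hypothetical sketch of this convention (the file contents below are illustrative, not copied from the project), the configuration file defines only a Make variable, and the subMakefile rule lists that file as a prerequisite:
+
+\begin{lstlisting}[
+    label=code:confsketch,
+    caption={Hypothetical sketch of the configuration-file
+      convention (contents are illustrative)},
+  ]
+# demo-year.conf: one parameter, no analysis logic.
+demo-year = 1996
+
+# In the subMakefile: the .conf file is a prerequisite, so
+# changing the year re-runs only this recipe and descendants.
+demo-out.txt: demo-year.conf tools-per-year.txt
+        awk '$$1 == $(demo-year)' tools-per-year.txt \
+            > demo-out.txt
+\end{lstlisting}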
+
+\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, not stored together in a single file, and do not require a unique editor.
+To satisfy the modularity criterion, the analysis steps and narrative are kept in their own files (in different languages, thus maximally benefiting from the unique features of each), and these files can be viewed or manipulated with any text editor the authors prefer.
+The analysis benefits from the powerful and portable job management features of Make, and communicates with the narrative text through \LaTeX{} macros, enabling well-formatted output that blends analysis results into the narrative sentences and enables direct provenance tracking.}
+
+To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
+Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and the interface script).
+\new{The core Maneage git repository is hosted at \href{http://git.maneage.org/project.git}{git.maneage.org/project.git} (archived at \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{Software Heritage}).}
+Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
+There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}.
+
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgements).
+These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgements)}.
+
+\begin{lstlisting}[
+ label=code:branching,
+ caption={Starting a new project with Maneage, and building it},
+ ]
+# Cloning main Maneage branch and branching off it.
+$ git clone https://git.maneage.org/project.git
+$ cd project
+$ git remote rename origin origin-maneage
+$ git checkout -b master
+
+# Build the raw Maneage skeleton in two phases.
+$ ./project configure # Build software environment.
+$ ./project make # Do analysis, build PDF paper.
+
+# Start editing, test-building and committing
+$ emacs paper.tex # Set your name as author.
+$ ./project make # Re-build to see effect.
+$ git add -u && git commit # Commit changes.
+\end{lstlisting}
+
+The branch-based design of Figure \ref{fig:branching} allows projects to re-import Maneage at a later time (technically: \emph{merge}), thus improving its low-level infrastructure: in (a) authors do the merge during an ongoing project;
+in (b) readers do it after publication; e.g., the project remains reproducible but the infrastructure is outdated, or a bug is fixed in Maneage.
+\new{Generally, any git flow (branching strategy) can be used by the high-level project authors or future readers.}
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
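+In practice, following the remote name set in Listing \ref{code:branching}, such a merge can be as simple as the commands below (a sketch; the core branch of the \inlinecode{origin-maneage} remote is named \inlinecode{maneage}):
+
+\begin{lstlisting}[
+    label=code:merging,
+    caption={Re-importing Maneage into a derived project (a
+      sketch, using the remote name of Listing
+      \ref{code:branching})},
+  ]
+# Fetch and merge the latest Maneage commits.
+$ git checkout master
+$ git fetch origin-maneage
+$ git merge origin-maneage/maneage
+
+# Re-configure and re-build to check the merged infrastructure.
+$ ./project configure
+$ ./project make
+\end{lstlisting}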
+
+Finally, the complete project source is usually only $\sim100$ kilobytes.
+It can thus easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source of \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo, and archived in SoftwareHeritage.
+
+
+
+
+
+
+
+
+
+
+
+\section{Discussion}
+\label{discussion}
+%% It should provide some insight or 'lessons learned', where 'lessons learned' is jargon for 'informal hypotheses motivated by experience', reworded to make the hypotheses sound more scientific (if it's a 'lesson', then it sounds like knowledge, when in fact it's really only a hypothesis).
+%% What is the message we should take from the experience?
+%% Are there clear demonstrated design principles that can be reapplied elsewhere?
+%% Are there roadblocks or bottlenecks that others might avoid?
+%% Are there suggested community or work practices that can make things smoother?
+%% Attempt to generalise the significance.
+%% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
+
+We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool.
+The Maneage user base grew with the support of the RDA, underscoring some difficulties for widespread adoption.
+
+Firstly, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make), while generally familiar to most researchers, are not widely used.
+Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
+Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
+Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology.
+
+Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
+Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template that is functional immediately out of the box: it produces a paper with an example calculation that authors only need to start customizing.
+The template uses mature and time-tested tools to blend version control, the research paper's narrative, software management, \emph{and} a robust data management strategy.
+We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
+
+Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various \new{Unix-like OSs} (Maneage is actively tested on different GNU/Linux distributions and on macOS, and is being ported to FreeBSD).
+This requires maintenance by our core team and consumes time and energy.
+However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding or fixing their required software on their own.
+They later share these low-level commits on the core branch, thus propagating them to all derived projects.
+
+\new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.}
+However, \new{our focus is on reproducing results (the output of the software), not the software itself.}
+Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
+On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
+However, glibc is not installable on some \new{Unix-like} OSs (e.g., macOS), and all programs link with the C library.
+This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
+With everything else under precise control in Maneage, the effect of differing hardware, Kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
+Using continuous integration (CI) is one way to precisely identify breaking points on multiple systems.}
+
+% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
+%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.
+%We propose these solutions:
+%1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
+%This is a long-term goal and would require major changes to academic value systems.
+%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
+
+Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows their many advantages.
+For example, publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
+The completeness criterion implies that algorithms and data selection can be included in the optimizations.
+
+Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis, the resulting narrative, \emph{and} the history of that analysis+narrative.
+Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
+Likewise, when a bug is found in one scientific software package, affected projects can be detected and the scale of the effect can be measured.
+Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
+Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
+
+From the data repository perspective, these criteria can also be useful, e.g., the challenges mentioned in \cite{austin17}:
+(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, thereby improving sustainability.
+(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
+(3) Software management: with these criteria, each project comes with its unique and complete software management.
+It does not use a third-party PM that needs to be maintained by the data center (and the many versions of the PM), hence enabling robust software management, preservation, publishing, and citation.
+For example, see \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3951151}{zenodo.3951151} or \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460} where we have exploited the free-software criterion to distribute the source code of all software used in each project as deliverables.
+(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria.
+
+
+
+
+
+% use section* for acknowledgment
+\section*{Acknowledgment}
+
+The authors wish to thank (sorted alphabetically)
+Julia Aguilar-Cabello,
+Dylan A\"issi,
+Marjan Akbari,
+Alice Allen,
+Pedram Ashofteh Ardakani,
+Roland Bacon,
+Michael R. Crusoe,
+Antonio D\'iaz D\'iaz,
+Surena Fatemi,
+Fabrizio Gagliardi,
+Konrad Hinsen,
+Marios Karouzos,
+Johan Knapen,
+Tamara Kovazh,
+Terry Mahoney,
+Ryan O'Connor,
+Mervyn O'Luing,
+Simon Portegies Zwart,
+Idafen Santana-P\'erez,
+Elham Saremi,
+Yahya Sefidbakht,
+Zahra Sharbaf,
+Nadia Tonello,
+Ignacio Trujillo and
+the AMIGA team at the Instituto de Astrof\'isica de Andaluc\'ia for their useful help, suggestions, and feedback on Maneage and this paper.
+\new{The five referees and editors of CiSE (Lorena Barba and George Thiruvathukal) provided many points that greatly helped to clarify this paper.}
+
+This project (commit \inlinecode{\projectversion}) is maintained in Maneage (\emph{Man}aging data lin\emph{eage}).
+\new{The latest merged Maneage branch commit was \inlinecode{\maneageversion} (\maneagedate).
+This project was built on an \inlinecode{\machinearchitecture} machine with {\machinebyteorder} byte-order and address sizes {\machineaddresssizes}}.
+
+Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
+The Japanese MEXT PhD scholarship to M.A. and its Grant-in-Aid for Scientific Research (21244012, 24253003).
+The European Research Council (ERC) advanced grant 339659-MUSICOS.
+The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
+The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European
+Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
+The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
+The ``A next-generation worldwide quantum sensor network with optical atomic clocks'' project of the TEAM IV programme of the
+Foundation for Polish Science co-financed by the EU under ERDF.
+The Polish MNiSW grant DIR/WK/2018/12.
+The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314.
-%% Start creating the paper.
-\begin{document}
-%% Project abstract and keywords.
-\includeabstract{
-  Welcome to Maneage (\emph{Man}aging data lin\emph{eage}) and reproducible papers/projects; for a review of the basics of this system, please see \citet{maneage}.
- You are now ready to configure Maneage and implement your own research in this framework.
- Maneage contains almost all the elements that you will need in a research project, and adding any missing parts is very easy once you become familiar with it.
-  For example, it already has steps for downloading the raw data and necessary software (while verifying them with their checksums), building the software, and processing the data with the software in a highly controlled environment.
- But Maneage is not just for the analysis of your project, you will also write your paper in it (by replacing this text in \texttt{paper.tex}): including this abstract, figures and bibliography.
-  If you design your project with Maneage's infrastructure, don't forget to add a notice clearly letting readers know that your work is reproducible; we should spread the word and show the world how useful reproducible research is for the sciences. Also, don't forget to cite and acknowledge Maneage so we can continue developing it.
- This PDF was made with Maneage, commit \projectversion{}.
- \vspace{0.25cm}
- \textsl{Keywords}: Add some keywords for your research here.
-  \textsl{Reproducible paper}: All quantitative results (numbers and plots)
- in this paper are exactly reproducible with Maneage
- (\url{https://maneage.org}). }
-%% To add the first page's headers.
-\thispagestyle{firststyle}
+%% Bibliography of main body
+\bibliographystyle{IEEEtran_openaccess}
+\bibliography{IEEEabrv,references}
+%% Biography
+\begin{IEEEbiographynophoto}{Mohammad Akhlaghi}
+ is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias (IAC), Spain.
+ He has a PhD from Tohoku University (Japan) and was previously a CNRS postdoc in Lyon (France).
+ Email: mohammad-AT-akhlaghi.org; Website: \url{https://akhlaghi.org}.
+\end{IEEEbiographynophoto}
+\begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
+ is a doctoral student at IAC, Spain.
+  He has an M.Sc. from the University of Granada (Spain).
+ Email: infantesainz-AT-gmail.com; Website: \url{https://infantesainz.org}.
+\end{IEEEbiographynophoto}
+\begin{IEEEbiographynophoto}{Boudewijn F. Roukema}
+ is a professor of cosmology at the Institute of Astronomy, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University in Toru\'n, Grudziadzka 5, Poland.
+ He has a PhD from Australian National University. Email: boud-AT-astro.uni.torun.pl.
+\end{IEEEbiographynophoto}
-%% Start of main body.
-\section{Congratulations!}
-Congratulations on running the raw template project! You can now follow the ``Customization checklist'' in the \texttt{README-hacking.md} file, customize this template and start your exciting research project over it.
-You can always merge Maneage back into your project to improve its infrastructure while leaving your own project intact.
-If you haven't already read \citet{maneage}, please do so before continuing, it isn't long (just 7 pages).
+\begin{IEEEbiographynophoto}{Mohammadreza Khellat}
+ is the backend technical services manager at Ideal-Information, Oman.
+ He has an M.Sc in theoretical particle physics from Yazd University (Iran).
+ Email: mkhellat-AT-ideal-information.com.
+\end{IEEEbiographynophoto}
-While you are writing your paper, just don't forget to \emph{not} use numbers or fixed strings (for example database urls like \url{\wfpctwourl}) directly within your \LaTeX{} source.
-Put them in configuration files and after using them in the analysis, pass them into the \LaTeX{} source through macros in the same subMakefile that used them.
-For some already published examples, please see \citet{maneage}\footnote{\url{https://gitlab.com/makhlaghi/maneage-paper}}, \citet{infantesainz20}\footnote{\url{https://gitlab.com/infantesainz/sdss-extended-psfs-paper}} and \citet{akhlaghi19}\footnote{\url{https://gitlab.com/makhlaghi/iau-symposium-355}}.
-Working in this way will let you focus clearly on your science and not have to worry about fixing this or that number/name in the text.
+\begin{IEEEbiographynophoto}{David Valls-Gabaud}
+ is a CNRS Research Director at LERMA, Observatoire de Paris, France.
+ Educated at the universities of Madrid, Paris, and Cambridge, he obtained his PhD in 1991.
+ Email: david.valls-gabaud-AT-obspm.fr.
+\end{IEEEbiographynophoto}
-Once your project is ready for publication, there is also a ``Publication checklist'' in \texttt{README-hacking.md} that will guide you through the steps for making your project as FAIR as possible (Findable, Accessible, Interoperable, and Reusable).
+\begin{IEEEbiographynophoto}{Roberto Baena-Gall\'e}
+  held a postdoctoral position at IAC, and obtained a degree in Telecommunication and Electronics from the University of Seville, with a PhD from the University of Barcelona.
+ Email: rbaena-AT-iac.es
+\end{IEEEbiographynophoto}
+\vfill
-The default \LaTeX{} structure within Maneage also has two \LaTeX{} macros for easy marking of text within your document as \emph{new} and \emph{notes}.
-To activate them, please use the \texttt{--highlight-new} or \texttt{--highlight-notes} options with \texttt{./project make}.
-For example if you run \texttt{./project make --highlight-new}, then \new{this text (that has been marked as \texttt{new}) will show up as green in the final PDF}.
-If you run \texttt{./project make --highlight-notes} then you will see a note following this sentence that is written in red and has square brackets around it (it is invisible without this option).
-\tonote{This text is written within a \texttt{tonote} and is invisible without \texttt{--highlight-notes}.}
-You can also use these two options together to both highlight the new parts and add notes within the text.
-Another thing you may notice from the \LaTeX{} source of this default paper is that there is one line per sentence (and one sentence per line).
-Of course, as with everything else in Maneage, you are free to use any format that you are most comfortable with.
-The reason behind this choice is that this source is under Git version control and that Git also primarily works on lines.
-In this way, when a change is made in a sentence, Git will only highlight/color that line/sentence; we have found that this helps a lot in viewing the changes.
-Also, this format helps in reminding the author when the sentence is getting too long!
-Here is a tip when looking at the changes of narrative document in Git: use the \texttt{--word-diff} option (for example \texttt{git diff --word-diff}, you can also use it with \texttt{git log}).
-Figure \ref{squared} shows a simple plot as a demonstration of creating plots within \LaTeX{} (using the {\small PGFP}lots package).
-The minimum value in this distribution is $\deletememin$, and $\deletememax$ is the maximum.
-Take a look into the \LaTeX{} source and you'll see these numbers are actually macros that were calculated from the same dataset (they will change if the dataset, or function that produced it, changes).
-The individual {\small PDF} file of Figure \ref{squared} is available under the \texttt{tex/tikz/} directory of your build directory.
-You can use this PDF file in other contexts (for example in slides showing your progress or after publishing the work).
-If you want to directly use the {\small PDF} file in the figure without having to let {\small T}i{\small KZ} decide if it should be remade or not, you can also comment the \texttt{makepdf} macro at the top of this \LaTeX{} source file.
-\begin{figure}[t]
- \includetikz{delete-me-squared}{width=\linewidth}
- \captionof{figure}{\label{squared} A very basic $X^2$ plot for
- demonstration.}
-\end{figure}
-Figure \ref{image-histogram} is another demonstration of showing images (datasets) using PGFPlots.
-It shows a small crop of an image from the Wide-Field Planetary Camera 2 (that was installed on the Hubble Space Telescope from 1993 to 2009).
-As another more realistic demonstration of reporting results with Maneage, here we report that the mean pixel value in that image is $\deletemewfpctwomean$ and the median is $\deletemewfpctwomedian$.
-The skewness in the histogram of Figure \ref{image-histogram}(b) explains this difference between the mean and median.
-The dataset is visualized here as a black and white image using the \textsf{Convert\-Type} program of GNU Astronomy Utilities (Gnuastro).
-The histogram and basic statistics were generated with Gnuastro's \textsf{Statistics} program.
-{\small PGFP}lots\footnote{\url{https://ctan.org/pkg/pgfplots}} is a great tool to build the plots within \LaTeX{} and removes the necessity to add further dependencies, just to create the plots.
-There are high-level libraries like Matplotlib which also generate plots.
-However, the problem is that they require \emph{many} dependencies, for example see Figure 1 of \citet{alliez19}.
-Installing these dependencies from source is not easy and will harm the reproducibility of your paper in the future.
-Furthermore, since {\small PGFP}lots builds the plots within \LaTeX{}, it respects all the properties of your text (for example, line width and fonts).
-Therefore the final plot blends in your paper much more nicely.
-It also has a wonderful manual\footnote{\url{http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pgfplots.pdf}}.
-\begin{figure}[t]
- \includetikz{delete-me-image-histogram}{width=\linewidth}
- \captionof{figure}{\label{image-histogram} (a) An example image of the Wide-Field Planetary Camera 2, on board the Hubble Space Telescope from 1993 to 2009.
- This is one of the sample images from the FITS standard webpage, kept as examples for this file format.
- (b) Histogram of pixel values in (a).}
-\end{figure}
-\section{Notice and citations}
-To encourage other scientists to publish similarly reproducible papers, please add a notice close to the start of your paper or in the end of the abstract clearly mentioning that your work is fully reproducible.
-One convention we have adopted until now is to put the Git checksum of the project as the last word of the abstract, for example see \citet{akhlaghi19}, \citet{infantesainz20} and \citet{maneage}.
-Finally, don't forget to cite \citet{maneage} and acknowledge the funders mentioned below.
-Otherwise we won't be able to continue working on Maneage.
-Also, just as another reminder, before publication, don't forget to follow the ``Publication checklist'' of \texttt{README-hacking.md}.
-%% End of main body.
+%% Appendix (only build if 'separatesupplement' has not been given): by
+%% default, the appendices are built.
+\ifdefined\separatesupplement
+\else
+\clearpage
+\appendices
+\input{tex/src/appendix-existing-tools.tex}
+\input{tex/src/appendix-existing-solutions.tex}
+\input{tex/src/appendix-used-software.tex}
+%\input{tex/src/appendix-necessity.tex}
+\bibliographystyleappendix{IEEEtran_openaccess}
+\bibliographyappendix{IEEEabrv,references}
+\fi
+\end{document}
-\section{Acknowledgments}
-\new{Please include the following paragraph in the Acknowledgement section of your paper.
- In order to get more funding to continue working on Maneage, we need you to cite it and its funding institutions in your papers.
- Also note that at the start, it includes version and date information for the most recent Maneage commit you merged with (which can be very helpful for others) as well as very basic information about your CPU architecture (which was extracted during configuration).
- This CPU information is very important for reproducibility because some software may not be buildable on other CPU architectures, so it is necessary to publish CPU information with the results and software versions.}
-This project was developed in the reproducible framework of Maneage \citep[\emph{Man}aging data lin\emph{eage},][latest Maneage commit \maneageversion{}, from \maneagedate]{maneage}.
-The project was built on an {\machinearchitecture} machine with {\machinebyteorder} byte-order; see Appendix \ref{appendix:software} for the software used and their versions.
-Maneage has been funded partially by the following grants: Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhlaghi and its Grant-in-Aid for Scientific Research (21244012, 24253003).
-The European Research Council (ERC) advanced grant 339659-MUSICOS.
-The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
-The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
-The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
-%% Tell BibLaTeX to put the bibliography list here.
-\printbibliography
-%% Start appendix.
-\appendix
-%% Mention all used software in an appendix.
-\section{Software acknowledgement}
-\label{appendix:software}
-\input{tex/build/macros/dependencies.tex}
-%% Finish LaTeX
-\end{document}
-%% This file is part of Maneage (https://maneage.org).
-%
-%% This file is free software: you can redistribute it and/or modify it
-%% under the terms of the GNU General Public License as published by the
-%% Free Software Foundation, either version 3 of the License, or (at your
-%% option) any later version.
-%
-%% This file is distributed in the hope that it will be useful, but WITHOUT
-%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
-%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
-%% for more details.
-%
-%% You should have received a copy of the GNU General Public License along
-%% with this file. If not, see <http://www.gnu.org/licenses/>.
+
+
+
+%%\newpage
+%%\section{Things remaining to add}
+%%\begin{itemize}
+%%\item \url{https://sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_180684.pdf}, does the following classification of tools:
+%% \begin{itemize}
+%% \item Research environments: \href{http://vcr.stanford.edu}{Verifiable computational research} (discussed above), \href{http://www.sciencedirect.com/science/article/pii/S1877050911001207}{SHARE} (a Virtual Machine), \href{http://www.codeocean.com}{Code Ocean} (discussed above), \href{http://jupyter.org}{Jupyter} (discussed above), \href{https://yihui.name/knitr}{knitR} (based on Sweave, dynamic report generation with R), \href{https://cran.r-project.org}{Sweave} (Function in R, for putting R code within \LaTeX), \href{http://www.cyverse.org}{Cyverse} (proprietary web tool with servers for bioinformatics), \href{https://nanohub.org}{NanoHUB} (collection of Simulation Programs for nanoscale phenomena that run in the cloud), \href{https://www.elsevier.com/about/press-releases/research-and-journals/special-issue-computers-and-graphics-incorporates-executable-paper-grand-challenge-winner-collage-authoring-environment}{Collage Authoring Environment} (discussed above), \href{https://osf.io/ns2m3}{SOLE} (discussed above), \href{https://osf.io}{Open Science framework} (a hosting webpage), \href{https://www.vistrails.org}{VisTrails} (discussed above), \href{https://pypi.python.org/pypi/Sumatra}{Sumatra} (discussed above), \href{http://software.broadinstitute.org/cancer/software/genepattern}{GenePattern} (reviewed above), Image Processing On Line (\href{http://www.ipol.im}{IPOL}) journal (publishes full analysis scripts, but does not deal with dependencies), \href{https://github.com/systemslab/popper}{Popper} (reviewed above), \href{https://galaxyproject.org}{Galaxy} (reviewed above), \href{http://torch.ch}{Torch.ch} (finished project for neural networks on images), \href{http://wholetale.org/}{Whole Tale} (discussed above).
+%% \item Workflow systems: \href{http://www.taverna.org.uk}{Taverna}, \href{http://www.wings-workflows.org}{Wings}, \href{https://pegasus.isi.edu}{Pegasus}, \href{http://www.pgbovine.net/cde.html}{CDE}, \href{http://binder.org}{Binder}, \href{http://wiki.datakurator.org/wiki}{Kurator}, \href{https://kepler-project.org}{Kepler}, \href{https://github.com/everware}{Everware}, \href{http://cds.nyu.edu/projects/reprozip}{Reprozip}.
+%% \item Dissemination platforms: \href{http://researchcompendia.org}{ResearchCompendia}, \href{https://datacenterhub.org/about}{DataCenterHub}, \href{http://runmycode.org}{RunMyCode}, \href{https://www.chameleoncloud.org}{ChameleonCloud}, \href{https://occam.cs.pitt.edu}{Occam}, \href{http://rcloud.social/index.html}{RCloud}, \href{http://thedatahub.org}{TheDataHub}, \href{http://www.ahay.org/wiki/Package_overview}{Madagascar}.
+%% \end{itemize}
+%%\item Special volume on ``Reproducible research'' in the Computing in Science Engineering \citeappendix{fomel09}.
+%%\item ``I’ve learned that interactive programs are slavery (unless they include the ability to arrive in any previous state by means of a script).'' \citeappendix{fomel09}.
+%%\item \citeappendix{fomel09} discuss the ``versioning problem'': on different systems, programs have different versions.
+%%\item \citeappendix{fomel09}: a C program written 20 years ago was still usable.
+%%\item \citeappendix{fomel09}: ``in an attempt to increase the size of the community, Matthias Schwab and I submitted a paper to Computers in Physics, one of CiSE’s forerunners. It was rejected. The editors said if everyone used Microsoft computers, everything would be easily reproducible. They also predicted the imminent demise of Fortran''.
+%%\item \citeappendix{alliez19}: Software citation, with a nice dependency plot for matplotlib.
+%% \item SC \href{https://sc19.supercomputing.org/submit/reproducibility-initiative}{Reproducibility Initiative} for mandatory Artifact Description (AD).
+%% \item \href{https://www.acm.org/publications/policies/artifact-review-badging}{Artifact review badging} by the Association of computing machinery (ACM).
+%% \item eLife journal \href{https://elifesciences.org/labs/b521cf4d/reproducible-document-stack-towards-a-scalable-solution-for-reproducible-articles}{announcement} on reproducible papers. \citeappendix{lewis18} is their first reproducible paper.
+%% \item The \href{https://www.scientificpaperofthefuture.org}{Scientific paper of the future initiative} encourages geoscientists to include associated metadata with scientific papers \citeappendix{gil16}.
+%% \item Digital objects: \url{http://doi.org/10.23728/b2share.b605d85809ca45679b110719b6c6cb11} and \url{http://doi.org/10.23728/b2share.4e8ac36c0dd343da81fd9e83e72805a0}
+%% \item \citeappendix{mesirov10}, \citeappendix{casadevall10}, \citeappendix{peng11}: Importance of reproducible research.
+%% \item \citeappendix{sandve13} is an editorial recommendation to publish reproducible results.
+%% \item \citeappendix{easterbrook14} Free/open software for open science.
+%% \item \citeappendix{peng15}: Importance of better statistical education.
+%% \item \citeappendix{topalidou16}: Failed attempt to reproduce a result.
+%% \item \citeappendix{hutton16} reproducibility in hydrology, criticized in \citeappendix{melson17}.
+%% \item \citeappendix{fomel09}: Editorial on reproducible research.
+%% \item \citeappendix{munafo17}: Reproducibility in social sciences.
+%% \item \citeappendix{stodden18}: Effectiveness of journal policy on computational reproducibility.
+%% \item \citeappendix{fanelli18} is critical of the narrative that there is a ``reproducibility crisis'', and argues that it is important to empower scientists.
+%% \item \citeappendix{burrell18} open software (in particular Python) in heliophysics.
+%% \item \citeappendix{allen18} show that many papers do not cite software.
+%% \item \citeappendix{zhang18} explicitly say that they won't release their code: ``We opt not to make the code used for the chemical evolution modeling publicly available because it is an important asset of the researchers’ toolkits''.
+%% \item \citeappendix{jones19} make a genuine effort at reproducing every number in the paper (using Docker, Conda, CGAT-core, and Binder), but they can ultimately only release scripts. They claim it is not possible to achieve that level of reproducibility, but here we show it is.
+%% \item LSST uses Kubernetes and docker for reproducibility \citeappendix{banek19}.
+%% \item Interesting survey/paper on the importance of coding in science \citeappendix{merali10}.
+%% \item Discuss the Provenance challenge \citeappendix{moreau08}, showing the importance of meta data and provenance tracking.
+%% Especially as it is organized by medical scientists.
+%% Its webpage (for latest challenge) has a nice intro: \url{https://www.cccinnovationcenter.com/challenges/provenance-challenge}.
+%% \item In discussion: The XML provenance system is very interesting; scripts can be written to parse the Makefiles within this template to generate such XML outputs for easy standard metadata parsing.
+%% The XML that contains a log of the outputs is also interesting.
+%% \item \citeappendix{becker17} Discuss reproducibility methods in R.
+%% \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citeappendix{gabriel11}.
+%% \item \citeappendix{menke20} show how software identifiability has seen the best improvement, so there is hope!
+%% \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}.
+%% \item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials}
+%% \item Jupyter Notebooks and problems with reproducibility: \citeappendix{rule18} and \citeappendix{pimentel19}.
+%% \item Reproducibility certification \url{https://www.cascad.tech}.
+%% \item \url{https://plato.stanford.edu/entries/scientific-reproducibility}.
+%% \item
+%%Modern analysis tools are almost entirely implemented as software packages.
+%%This has led many scientists to adopt solutions that software developers use for reproducing software (for example, to fix bugs or avoid security issues).
+%%These tools and how they are used are thoroughly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}.
+%%However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering.
+%%This difference can be broken up into the following categories, which are described more fully below:
+%%1) Reading vs. executing, 2) Archiving how software is used and 3) Citation of the software/methods used for scientific credit.
+%%
+%%The first difference arises because in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient).
+%%For a scientist it is more important to read/study the method of a paper that is 1, 10, or 100 years old.
+%%The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another scientist to execute.
+%%Another scientist just needs to be assured that the commands they are reading are exactly what was (and can potentially be) executed.
+%%
+%%On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because these are increasingly done by pre-written software programs and libraries.
+%%Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard-gained experience.
+%%Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citeappendix{dicosmo18} do a wonderful job at long-term preservation and archival of software source code.
+%%However, preservation of the software's raw code is only part of the process; it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, in conjunction with which other software tools, etc.
+%%
+%%The third major difference is scientific credit, which is measured in units of citations, not dollars.
+%%As described above, scientific software is playing an increasingly important role in modern science.
+%%Because of the domain-specific knowledge necessary to produce such software, it is mostly written by scientists for scientists.
+%%Therefore a significant amount of effort and research funding has gone into producing scientific software.
+%%At least for the software that does have an accompanying paper, it is thus important that the paper be cited when the software is used.
+%%\end{itemize}