%% Main LaTeX source of project's paper, license is printed at the end.
%
%% Copyright (C) 2020 Mohammad Akhlaghi
%% Copyright (C) 2020 Raúl Infante-Sainz
%% Copyright (C) 2020 Boudewijn F. Roukema
%% Copyright (C) 2020 David Valls-Gabaud
%% Copyright (C) 2020 Roberto Baena-Gallé
\documentclass[journal]{IEEEtran}

%% This is a convenience variable if you are using PGFPlots to build plots
%% within LaTeX. If you want to import PDF files for figures directly, you
%% can use the standard `\includegraphics' command. See the definition of
%% `\includetikz' in `tex/src/preamble-pgfplots.tex' for where the files are
%% assumed to be if you use `\includetikz' when `\makepdf' is not defined.
\newcommand{\makepdf}{}

%% When defined (value is irrelevant), `\highlightchanges' will cause text
%% in `\tonote' and `\new' to become colored. This is useful when you
%% need to distribute drafts that are undergoing revision and you want
%% to highlight to your colleagues which parts are new and which parts are
%% only for discussion.
%\newcommand{\highlightchanges}{}

%% Import necessary packages
\input{tex/build/macros/project.tex}
\input{tex/src/preamble-project.tex}
\input{tex/src/preamble-pgfplots.tex}

%% Title and author names.
\title{Towards Long-term and Archivable Reproducibility}
\author{
  Mohammad~Akhlaghi,
  Ra\'ul Infante-Sainz,
  Boudewijn F. Roukema,
  David Valls-Gabaud,
  Roberto Baena-Gall\'e
  \thanks{Manuscript received MM DD, YYYY; revised MM DD, YYYY.}
}

%% The paper headers
\markboth{Computing in Science and Engineering, Vol. X, No. X, MM YYYY}%
{Akhlaghi \MakeLowercase{\textit{et al.}}: Towards Long-term and Archivable Reproducibility}

%% Start the paper.
\begin{document}

% make the title area
\maketitle

% As a general rule, do not put math, special symbols or citations
% in the abstract or keywords.
\begin{abstract}
%% CONTEXT
Many reproducible workflow solutions have been proposed over recent decades.
Most use the high-level technologies that were popular when they were created, providing an immediate solution that is unlikely to be sustainable in the long term.
Decades later, scientists lack the resources to rewrite their projects, while still being accountable for their results.
This creates generational gaps, which, together with technological obsolescence, impede reproducibility and building upon previous work.
%% AIM
We aim to introduce a set of criteria to address this problem and to demonstrate their practicality.
%% METHOD
The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain-text); modular design; linking analysis with narrative; temporal provenance; scalability; and free-and-open-source software.
%% RESULTS
Through an implementation, called ``Maneage'' (managing+lineage), we find that storing the project in machine-actionable and human-readable plain-text enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
Furthermore, we show that these criteria are not limited to long-term reproducibility but also provide immediate, fast short-term reproducibility.
%% CONCLUSION
We conclude that requiring longevity from solutions is realistic.
We discuss the benefits of these criteria for scientific progress.
\end{abstract}

% Note that keywords are not normally used for peerreview papers.
\begin{IEEEkeywords}
Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
\end{IEEEkeywords}

% For peer review papers, you can put extra information on the cover
% page as needed:
% \ifCLASSOPTIONpeerreview
% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
% \fi
%
% For peerreview papers, this IEEEtran command inserts a page break and
% creates the second title.
% It will be ignored for other modes.
\IEEEpeerreviewmaketitle

\section{Introduction}
% The very first letter is a 2 line initial drop letter followed
% by the rest of the first word in caps.
\IEEEPARstart{T}{his} demo file is intended to serve as a ``starter file'' for IEEE journal papers produced under \LaTeX\ using IEEEtran.cls version 1.8b and later.
% You must have at least 2 lines in the paragraph with the drop letter
% (should never be an issue)
Here is an example citation \cite{akhlaghi19}.

\section{Principles}
\label{sec:principles}
The core principle of Maneage is simple: science is defined primarily by its method, not its result.
As \cite{buckheit1995} describe it, modern scientific papers are merely advertisements of scholarship, while the actual scholarship is the coding behind the plots/results.
Many solutions have been proposed in recent decades, including (but not limited to)
1992: \href{https://sep.stanford.edu/doku.php?id=sep:research:reproducible}{RED};
2003: \href{https://taverna.incubator.apache.org}{Apache Taverna};
2004: \href{https://www.genepattern.org}{GenePattern};
2010: \href{https://wings-workflows.org}{WINGS};
2011: \href{https://www.ipol.im}{Image Processing On Line journal} (IPOL),
\href{https://www.activepapers.org}{Active papers},
\href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE};
2015: \href{https://sciunit.run}{Sciunit};
2017: \href{https://falsifiable.us}{Popper};
2019: \href{https://wholetale.org}{WholeTale}.
To help in the comparison, the founding principles of Maneage are listed below.

\begin{enumerate}%[label={\bf P\arabic*]
\item \label{principle:complete}\textbf{Completeness:}
A project that is complete, or self-contained,
(P1.1) has no dependency beyond the Port\-able Operating System (OS) Interface, or POSIX, or a minimal Unix-like environment.
A consequence of this is that the project itself must be stored in plain-text: not needing any specialized software to open, parse or execute.
(P1.2) does not affect the host,
(P1.3) does not require root, or administrator, privileges,
(P1.4) builds its software for an independent environment,
(P1.5) can be run locally (without internet connection),
(P1.6) contains the full project's analysis, visualization \emph{and} narrative, from access to raw inputs to producing final published format (e.g., PDF or HTML),
(P1.7) requires no manual/human interaction and can run automatically \cite[according to][``\emph{a clerk can do it}'']{claerbout1992}.
\emph{Comparison with existing:} Except for IPOL, none of the tools above are complete: they have many dependencies beyond POSIX.
For example, the workflow of most recent solutions needs Python or Jupyter notebooks.
Because of their complexity (see \ref{principle:complexity}), pre-built binary blobs like containers or virtual machines are the chosen storage format; these are large (gigabytes) and expensive to archive.
Furthermore, the environment is set up by third-party package managers, like Conda, or by those of the OS, like apt or yum.
However, the exact versions of \emph{all software} are rarely included, and the servers remove old binaries, hence blobs are hard to recreate.
Blobs also have a short lifespan, e.g., Docker containers made today may not be operable with future versions of Docker or Linux (currently Linux 3.2.x is the earliest supported version, released in 2012).
In general, they mostly aim for short-term reproducibility.
A plain-text project is readable by humans and machines (even if it cannot be executed) and consumes no more than a megabyte.

\item \label{principle:modularity}\textbf{Modularity:}
A project should be compartmentalized into independent modules with well-defined inputs/outputs having no side effects.
Communication between the independent modules should be explicit, providing several optimizations:
(1) Independent modules can run in parallel.
Modules that do not need to be run (because their dependencies have not changed) will not be re-run.
(2) Data provenance extraction (recording any dataset's origins).
(3) Citation: others can credit specific parts of a project.
(4) Usage in other projects.
(5) Most importantly: they are easy to debug and improve.
\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools (mostly written in Python) leave this to project authors.
However, designing a modular project needs to be encouraged and facilitated.
Otherwise, scientists, who are not usually trained in data management, will rarely design a modular project, leading to great inefficiencies in terms of project cost and/or scientific accuracy (testing/validating will be expensive).

\item \label{principle:complexity}\textbf{Minimal complexity:}
This is Ockham's razor extrapolated to project management \cite[``\emph{Never posit pluralities without necessity}''][]{schaffer15}:
(1) Avoid complex relations between analysis steps (related to \ref{principle:modularity}).
(2) Avoid the programming language that is currently in vogue, because it will soon fall out of fashion, and significant resources will then be needed to translate or rewrite the project every few years (to stay fashionable).
The same job can be done with more stable/basic tools, requiring less long-term effort.
\emph{Comparison with existing:} IPOL stands out here too (requiring only ISO C); however, most others are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
Besides being incomplete (\ref{principle:complete}), these tools have short lifespans and evolve fast (e.g., Python 2 code often cannot run with Python 3, causing disruption in many projects).
Their complex dependency trees also make them hard to maintain; for example, see the dependency tree of Matplotlib in \cite[][Figure 1]{alliez19}, which is one of the simpler Jupyter dependencies.
The longevity of a workflow is determined by its shortest-lived dependency.
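As a rough formalization of this last statement (only a sketch: the symbols $T_i$ and $T_{\mathrm{workflow}}$ are introduced here for illustration and are not part of the criteria), suppose a workflow depends on $n$ components with usable lifetimes $T_1,\dots,T_n$.
Its own usable lifetime can then be written as
\begin{equation}
  T_{\mathrm{workflow}} = \min_{1\le i\le n} T_i ,
\end{equation}
so adding a dependency can never lengthen the lifetime of the workflow, and shortens it whenever the new component is shorter-lived than all the existing ones.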

\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
The project should automatically verify its inputs (software source code and data) \emph{and} outputs, not needing expert knowledge to confirm a reproduction.
\emph{Comparison with existing:} Such verification is possible in most systems, but it is left as a responsibility of the project authors.
As with \ref{principle:modularity}, due to lack of training, if not actively encouraged and facilitated, it will not be implemented.

\item \label{principle:history}\textbf{History and temporal provenance:}
No project is done in a single/first attempt.
Projects evolve as they are being completed.
It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
This is often seen in exploratory research papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
A project's ``history'' is thus as scientifically relevant as the final, or published, version.
\emph{Comparison with existing:} The solutions above that implement version control usually support this principle.
However, because the systems as a whole are rarely complete (see \ref{principle:complete}), their histories are also incomplete.
IPOL fails here, because only the final snapshot is published.

\item \label{principle:scalable}\textbf{Scalability:}
A project should be scalable to arbitrarily large and/or complex projects.
\emph{Comparison with existing:} Most of the more recent solutions above are scalable.
However, IPOL, which uniquely stands out in satisfying most principles, fails here: IPOL is devoted to low-level image processing algorithms that \emph{can be} done with no dependencies beyond an ISO C compiler.
IPOL is thus not scalable to large projects, which commonly involve dozens of high-level dependencies, with complex data formats and analysis.

\begin{figure*}[t]
  \begin{center} \includetikz{figure-branching}\end{center}
  \vspace{-3mm}
  \caption{\label{fig:branching} Harvesting the power of version-control in project management with Maneage.
    Maneage is maintained as a core branch, with projects created by branching off it.
    (a) shows how projects evolve on their own branch, but can always update their low-level structure by merging with the core branch;
    (b) shows how a finished/published project can be revitalized for new technologies simply by merging with the core branch.
    Each Git ``commit'' is shown on its branch as a colored ellipse, with its hash printed inside.
    The commits are colored based on the team that is working on that branch.
    The collaboration and paper icons are respectively made by `mynamepong' and `iconixar' and downloaded from \url{www.flaticon.com}.
  }
\end{figure*}

\item \label{principle:freesoftware}\textbf{Free and open source software:}
Technically, reproducibility \cite{fineberg19} is possible with non-free or non-open-source software (a black box).
This principle is thus necessary to complement that definition (nature is already a black box; we do not need another one):
(1) As free software, others can learn from, modify, and build upon it.
(2) The lineage can be traced to free software's implemented algorithms, also enabling optimizations on that level.
(3) A free-software package that does not execute on particular hardware can be modified to work on it.
(4) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments).
\emph{Comparison with existing:} The existing solutions listed above are all free software.
Based on this principle, we do not consider non-free solutions.
\end{enumerate}

% use section* for acknowledgment
\section*{Acknowledgment}
The authors wish to thank (sorted alphabetically) Julia Aguilar-Cabello, Alice Allen, Pedram Ashofteh Ardakani, Roland Bacon, Surena Fatemi, Fabrizio Gagliardi, Konrad Hinsen, Mohammad-reza Khellat, Johan Knapen, Tamara Kovazh, Ryan O'Connor, Simon Portegies Zwart, Idafen Santana-P\'erez, Elham Saremi, Yahya Sefidbakht, Zahra Sharbaf, Nadia Tonello, and Ignacio Trujillo for their useful help, suggestions and feedback on Maneage and this paper.
Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
The Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) PhD scholarship to M. Akhl\-aghi and its Grant-in-Aid for Scientific Research (21244012, 24253003).
The European Research Council (ERC) advanced grant 339659-MUSICOS.
The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (ERDF) under the grant AYA2016-76219-P.
The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project ``Using machine-learning techniques to drag galaxies from the noise in deep imaging''.
The ``A next-generation worldwide quantum sensor network with optical atomic clocks'' project of the TEAM IV programme of the Foundation for Polish Science co-financed by the EU under ERDF.
The Polish MNiSW grant DIR/WK/2018/12.
The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314.

%% Bibliography
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,/home/mohammad/documents/personal/professional/data-science/maneage/paper/source/tex/src/references}

%% Biography
\begin{IEEEbiographynophoto}{Mohammad Akhlaghi}
  is currently a big data postdoctoral researcher at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain.
  His main scientific interest is in early galaxy evolution, but to extract information from modern complex datasets, he has been involved in image processing and reproducible workflow management, where he founded GNU Astronomy Utilities (Gnuastro) and Maneage.
  He received his PhD in astronomy from Tohoku University, Sendai, Japan, and also held a postdoc position at the Centre de Recherche Astrophysique de Lyon (CRAL).
  Contact him at mohammad@akhlaghi.org and find his website at https://akhlaghi.org.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
  is currently a doctoral student at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain.
  Contact him at infantesainz@gmail.com.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Boudewijn F. Roukema}
  is currently a professor at the Astronomy and Informatics department of Nicolaus Copernicus University in Toru\'n, Poland.
  Contact him at boud@astro.uni.torun.pl.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{David Valls-Gabaud}
  is currently a professor at the Observatoire de Paris, France.
  Contact him at david.valls-gabaud@obspm.fr.
\end{IEEEbiographynophoto}

\begin{IEEEbiographynophoto}{Roberto Baena-Gall\'e}
  is currently a postdoctoral fellow at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain.
  Contact him at roberto.baena@gmail.com.
\end{IEEEbiographynophoto}

\end{document}

%% This file is free software: you can redistribute it and/or modify it
%% under the terms of the GNU General Public License as published by the
%% Free Software Foundation, either version 3 of the License, or (at your
%% option) any later version.
%
%% This file is distributed in the hope that it will be useful, but WITHOUT
%% ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
%% for more details. See .