diff options
-rw-r--r-- | paper.tex | 78 |
1 file changed, 41 insertions, 37 deletions
@@ -60,7 +60,7 @@ % in the abstract or keywords. \begin{abstract} %% CONTEXT - Reproducible workflow solutions commonly use the high-level technologies that were popular when they were created, providing an immediate solution that are unlikely to be sustainable in the long term. + Reproducible workflow solutions commonly use the high-level technologies that were popular when they were created, providing an immediate solution that is unlikely to be sustainable in the long term. %% AIM We aim to introduce a set of criteria to address this problem and to demonstrate their practicality. %% METHOD @@ -110,7 +110,7 @@ Also, once the binary format is obsolete, reading or parsing the project is not The cost of staying up to date with this evolving landscape is high. Scientific projects in particular suffer the most: scientists have to focus on their own research domain, but they also need to understand the technology used to a certain level, because it determines their results and interpretations. Decades later, they are also still held accountable for their results. -Hence the evolving technology creates generational gaps in the scientific community, not allowing the previous generations to share valuable lessons which are too low-level to be published in a traditional scientific paper. +Hence, the evolving technology creates generational gaps in the scientific community, preventing previous generations from sharing valuable lessons that are too low-level to be published in a traditional scientific paper. As a solution to this problem, here we introduce criteria that can guarantee the longevity of a project, based on our experiences with existing solutions. @@ -119,7 +119,7 @@ As a solution to this problem, here we introduce criteria that can guarantee t \section{Commonly used tools and their longevity} To highlight the proposed criteria, some of the most commonly used tools are reviewed from the long-term usability perspective.
-We recall that while longevity is important in some fields (like the sciences), it isn't necessarilyy of interest in others (e.g., short term commercial projects), hence the wide usage of tools the evolve very fast. +We recall that while longevity is important in some fields (like the sciences), it is not necessarily of interest in others (e.g., short-term commercial projects), hence the wide usage of tools that evolve very fast. Most existing reproducible workflows use a common set of third-party tools that can be categorized as: (1) Environment isolators like virtual machines, containers, etc. (2) PMs like Conda, Nix, or Spack, @@ -127,25 +127,25 @@ Most existing reproducible workflows use a common set of third-party tools that (4) Notebooks like Jupyter. To isolate the environment, virtual machines (VMs) have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded 2nd prize in the Elsevier Executable Paper Grand Challenge of 2011 and discontinued in 2019). -However, containers (in particular Docker and lesser, Singularity) are by far the most used solution today, so we'll focus on Docker here. +However, containers (in particular Docker and, to a lesser extent, Singularity) are by far the most used solution today, so we will focus on Docker here. %% Note that L. Barba (second author of this paper) is the editor in chief of CiSE. -Ideally, is possible to precisely version/tag the images that are imported into a Docker container. +Ideally, it is possible to precisely version/tag the images that are imported into a Docker container. But that is rarely practiced in most solutions that we have studied. -Usually images are imported with generic operating system names e.g., `\inlinecode{FROM ubuntu:16.04}'\cite{mesnard20}. +Usually, images are imported with generic operating system names, e.g., `\inlinecode{FROM ubuntu:16.04}'\cite{mesnard20}.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated with different software versions almost monthly and only archives the last 5 images. Hence, if the Dockerfile is run in different months, it will contain different core operating system components. Furthermore, in the year 2024, when the long-term support for this version of Ubuntu expires, it will be totally removed. This is similar in other OSs: pre-built binary files are large and expensive to maintain and archive. -Furthermore Docker requires root permissions, and only supports recent (in ``long-term-support'') versions of the host kernel, hence older Docker images may not be executable. +Furthermore, Docker requires root permissions and only supports recent (``long-term-support'') versions of the host kernel, hence older Docker images may not be executable. Once the host OS is ready, PMs are used to install the software, or environment. Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with more generic PMs like Conda, Nix, GNU Guix or Spack. The OS PM suffers from the same longevity problem as the OS. Some third-party tools like Conda and Spack are written in high-level languages like Python, so the PM itself depends on the host's Python installation. -Nix and GNU Guix don't have any dependencies and produce bit-wise identical programs, however, they need root permissions. -Generally the exact version of each software's dependencies isn't precisely identified in the build instructions (although it is possible). -Therefore unless precise versions of \emph{every software} are stored, they will use the most recent version. +Nix and GNU Guix do not have any dependencies and produce bit-wise identical programs; however, they need root permissions. +Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although it is possible).
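A more defensive pattern, as a sketch only, is to pin the base image by its immutable content digest rather than a mutable tag; this still depends on the registry keeping the image available, and the digest below is a placeholder, not a real value:

```dockerfile
# Sketch: pin the base image by content digest instead of a mutable tag.
# "<digest>" is a placeholder; the real value is reported by
# `docker images --digests` after pulling the image once.
FROM ubuntu@sha256:<digest>
```

Unlike `ubuntu:16.04`, a digest reference can never silently resolve to a different image, which addresses the month-to-month drift described above (though not the eventual removal of the image itself).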
+Therefore, unless precise versions of \emph{every software} are stored, they will use the most recent version. Furthermore, because each third-party PM introduces its own language and framework, they increase the project's complexity. With the software environment built, job management is the next component of a workflow. @@ -161,8 +161,8 @@ Furthermore, similar to the point above on job management, by not actively encou An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}). Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on all operating systems) with example images/data that can also be executed on their webpage. -This is possible due to the focus on low-level algorithms that don't need any dependencies beyond an ISO C compiler. -Many data-intensive projects, commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, hence this solution isn't scalable. +This is possible due to the focus on low-level algorithms that do not need any dependencies beyond an ISO C compiler. +Many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, hence this solution is not scalable. @@ -173,7 +173,7 @@ Many data-intensive projects, commonly involve dozens of high-level dependencies The main premise is that starting a project with a robust data management strategy (or tools that provide it) is much more effective, for the researchers and community, than imposing it in the end \cite{austin17,fineberg19}. Researchers play a critical role\cite{austin17} in making their research more Findable, Accessible, Interoperable, and Reusable (the FAIR principles). Actively curating workflows for evolving technologies by repositories alone is not practically feasible, or scalable.
-In this paper we argue that workflows that satisfy the criteria below can reduce the cost of curation for repositories, while maximizing the FAIRness of the deliverables for future researchers. +In this paper, we argue that workflows satisfying the criteria below can reduce the cost of curation for repositories, while maximizing the FAIRness of the deliverables for future researchers. \textbf{Criteria 1: Completeness.} A project that is complete, or self-contained, has the following properties: @@ -249,7 +249,7 @@ When the used software are also free, Given the limitations of existing tools with the proposed criteria, it is necessary to show a proof of concept. The proof presented here has already been tested in previously published papers \cite{akhlaghi19, infante20} and was recently awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the joint RDA and World Data System (WDS) working group on Publishing Data Workflows\cite{austin17} from the researcher perspective to ensure longevity. -The proof of concept is called Maneage (Managing+Lineage, ending is pronounced like ``Lineage''). +The proof of concept is called Maneage (MANaging+LinEAGE; the ending is pronounced like ``Lineage''). It was developed along with the criteria, as a parallel research project over 5 years, for publishing our reproducible research workflows with our research. Its primordial form was implemented in \cite{akhlaghi15} and later evolved in \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. @@ -257,10 +257,10 @@ Technically, the hardest criteria to implement was the completeness criteria (an One proposed solution was the Guix Workflow Language (GWL), which is written in the same framework (GNU Guile, an implementation of Scheme) as GNU Guix (a PM). But as natural scientists (astronomers), our background was with languages like Shell, Python, C or Fortran.
Not having any exposure to Lisp/Scheme and their fundamentally different style made it very hard for us to adopt GWL. -Furthermore, the desired solution was meant to be easily understandable/usable by fellow scientists, which generally also haven't had exposure to Lisp/Scheme. +Furthermore, the desired solution was meant to be easily understandable/usable by fellow scientists, who generally have not had exposure to Lisp/Scheme. Inspired by GWL+Guix, a single job management tool was used for both installing software \emph{and} running the analysis workflow: Make. -Make is not an analysis language, it is a job manager, deciding when to call analysis programs (written in any languge like Shell, Python, Julia or C). +Make is not an analysis language; it is a job manager, deciding when to call analysis programs (written in any language like Shell, Python, Julia or C). Make is standardized in POSIX and is used in almost all core OS components. It is thus mature, actively maintained and highly optimized. Make was recommended by the pioneers of reproducible research\cite{claerbout1992,schwab2000} and many researchers have already had a minimal exposure to it (when building research software). @@ -268,7 +268,7 @@ Make was recommended by the pioneers of reproducible research\cite{claerbout1992 Linking the analysis and narrative was another major design choice. Literate programming, implemented as Computational Notebooks like Jupyter, is a common solution these days. -However, due to the problems above, we our implementation follows a more abstract design: providing a more direct and precise, but modular (not in the same file) connection. +However, due to the problems above, our implementation follows a more abstract design: providing a more direct and precise, but modular (not in the same file), connection.
Assuming that the narrative is typeset in \LaTeX{}, the connection between the analysis and narrative (usually as numbers) is through \LaTeX{} macros that are automatically defined during the analysis. For example, in the abstract of \cite{akhlaghi19} we say `\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}'. @@ -282,16 +282,16 @@ Manually typing such numbers in the narrative is prone to errors and discourages The ultimate aim of any project is to produce a report accompanying a dataset with some visualizations, or a research article in a journal; let's call it \inlinecode{paper.pdf}. Hence the files with the relevant macros of each (modular) step build the core structure (skeleton) of Maneage. During the software building (configuration) phase, each package is identified by a \LaTeX{} file, containing its official name, version and possible citation. -In the end, they are combined to enable precise software acknowledgement and citation (see the appendices of \cite{akhlaghi19, infante20}, not included here due to the word-limit). +In the end, they are combined to enable precise software acknowledgement and citation (see the appendices of \cite{akhlaghi19, infante20}, not included here due to the strict word-limit). Simultaneously, they act as Make \emph{targets} and \emph{prerequisite}s to allow accurate dependency tracking and optimized execution (parallel, no redundancies), for any complexity (e.g., Maneage also builds Matplotlib if requested, see Figure 1 of \cite{alliez19}). Dependencies go down to precise versions of the shell, C compiler, and the C library (task 15390) for an exactly reproducible environment. To enable easy and fast relocation of the project without building from source, it is possible to build it in any existing container/VM. The important factor is that the precise environment isolator is irrelevant; it can always be rebuilt.
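The macro mechanism can be sketched as follows; the macro and file names here are illustrative, not Maneage's actual output:

```latex
% During the analysis, a recipe appends definitions to a macro file
% (e.g., a hypothetical tex/macros/demo.tex) that the paper inputs:
\newcommand{\demosnlimit}{0.25}

% In the narrative, the number is then never typed by hand:
... detect the outer wings of M51 down to S/N of \demosnlimit{} ...
```

Because the macro file is itself a build product, any change in the analysis automatically propagates into the typeset text on the next build.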
During configuration, only the very high-level choice of which software to build differs between projects. -The Makefiles containig build recipes of each software don't generally change. +The Makefiles containing the build recipes of each software do not generally change. However, the analysis will naturally be different from one project to another. -Therefore a design was necessary to satisfy the modularity, scalability and minimal complexity criteria. +Therefore, a design was necessary to satisfy the modularity, scalability and minimal complexity criteria. To avoid getting too abstract, we will demonstrate it by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top). Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (with this whole paper). @@ -322,11 +322,11 @@ A human-friendly design (that is also optimized for execution) is a critical com In all projects, \inlinecode{top-make.mk} will first load the subMakefiles \inlinecode{initialize.mk} and \inlinecode{download.mk}, while concluding with \inlinecode{verify.mk} and \inlinecode{paper.mk}. Project authors add their modular subMakefiles in between (after \inlinecode{download.mk} and before \inlinecode{verify.mk}); in Figure \ref{fig:datalineage} (bottom), the project-specific subMakefiles are \inlinecode{format.mk} \& \inlinecode{demo-plot.mk}. -Except for \inlinecode{paper.mk} which builds the ultimate target \inlinecode{paper.pdf}, all subMakefiles build atleast one file: a \LaTeX{} macro file with the same base-name, see the \inlinecode{.tex} files in each subMakefile of Figure \ref{fig:datalineage}. +Except for \inlinecode{paper.mk}, which builds the ultimate target \inlinecode{paper.pdf}, all subMakefiles build at least one file: a \LaTeX{} macro file with the same base-name; see the \inlinecode{.tex} files in each subMakefile of Figure \ref{fig:datalineage}. The other built files will ultimately (through other files) lead to one of the macro files.
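The subMakefile convention can be sketched as follows; all file, target and macro names here are hypothetical, not Maneage's actual files:

```make
# Hypothetical sketch of a project-specific subMakefile.
# An analysis target is built first; the subMakefile's final target is a
# LaTeX macro file with the same base-name, so the paper can expand the
# result as a macro instead of a hand-typed number.
out/columns.txt: reproduce/analysis/format.sh
	bash reproduce/analysis/format.sh > $@

tex/macros/format.tex: out/columns.txt
	printf '\\newcommand{\\demorowcount}{%s}\n' \
	       "$$(wc -l < out/columns.txt)" > $@
```

Because each macro file depends on the analysis outputs, Make rebuilds only what changed, in parallel where possible.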
-Irrespective of the number of subMakefiles, there lineaege reaches a bottle-neck in \inlinecode{verify.mk} to satisfy the verification criteria. -All the macro files, plot information and published datasets of the project are verfied with their checksums here to automatically ensure exact reproducibility. +Irrespective of the number of subMakefiles, their lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria. +All the macro files, plot information and published datasets of the project are verified with their checksums here to automatically ensure exact reproducibility. Where exact reproducibility is not possible, values can be verified by any statistical means (specified by the project authors). Finally, having verified quantitative results, the project builds the ultimate target in \inlinecode{paper.mk}. @@ -343,11 +343,11 @@ Finally, having verified quantitative results, the project builds the ultimate t } \end{figure*} -To further minimize complexity, the low-level implementation can be further separated from from the high-level execution through configuration files. -By convention in Maneage, the subMakefiles (and the Python, Julia, C, Fortran, or etc, programs that they call for doing the number crunching), only organize the analysis, they don't contain any fixed numbers, settings or parameters. +To further minimize complexity, the low-level implementation can be separated from the high-level execution through configuration files. +By convention in Maneage, the subMakefiles (and the Python, Julia, C, Fortran, etc. programs that they call for the number crunching) only organize the analysis; they do not contain any fixed numbers, settings or parameters. Parameters are set as Make variables in ``configuration files'' and passed to the respective program (\inlinecode{.conf} files in Figure \ref{fig:datalineage}).
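The checksum verification step can be sketched with POSIX tools (the file names are illustrative; Maneage's actual \inlinecode{verify.mk} recipes differ): each output is checked against a checksum recorded at publication time, so any change in the results stops the build.

```shell
# Sketch: record a checksum for a project output, then verify it.
printf 'demo output\n' > columns.txt          # stand-in for a real output
sha256sum columns.txt > verify.sha256         # recorded at publication time
sha256sum --quiet -c verify.sha256 \
  && echo "all checksums verified"            # non-zero exit stops Make
```

Run as a Make recipe, a failed check makes the target fail, so \inlinecode{paper.pdf} can only be built from verified results.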
In the demo lineage, \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage. -As another demo, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which isn't in their original plot). +As another demo, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot). The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf}. The result \inlinecode{\menkenumpapersdemocount} was calculated after generating \inlinecode{columns.txt}. Both are expanded in the PDF as \LaTeX{} macros. @@ -374,9 +374,9 @@ $ ./project configure # Build software environment. $ ./project make # Do analysis, build PDF paper. \end{lstlisting} -As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import, or merge, Maneage into the project to improve the low-level infrastructure: +As Figure \ref{fig:branching} shows, due to this architecture, it is always possible to import or merge Maneage into the project to improve the low-level infrastructure: in (a) the authors merge into Maneage during an ongoing project, -in (b) readers can do it after the paper's publication, even when authors can't be accessed, and the project's infrastructure is outdated, or doesn't build. +in (b) readers can do it after the paper's publication, even when the authors cannot be reached and the project's infrastructure is outdated or does not build. Low-level improvements in Maneage are thus automatically propagated to all projects. This greatly reduces the cost of curation, or maintenance, of each individual project, before and after publication. @@ -400,17 +400,17 @@ Here we will review the lessons learnt and insights gained, while sharing the ex We will also discuss the design principles, and how they may be generalized and used in other projects.
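The configuration-file convention can be sketched like this; the variable names and the URL are illustrative placeholders, not Maneage's actual naming:

```make
# Sketch of INPUTS.conf: every imported dataset is declared with a URL
# and a checksum, so it can be verified before use.
demo-dataset-url = https://example.org/menke20-data.xlsx
demo-dataset-sha256 = <checksum recorded when the dataset was imported>

# Sketch of demo-year.conf: a single analysis parameter, stored here
# once and expanded in the PDF as a LaTeX macro, never typed by hand.
demo-year = <value elided in this excerpt>
```

Keeping parameters out of the recipes means a changed value touches exactly one file, and Make's dependency tracking rebuilds only the affected results.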
With the support of RDA, the user base and development of the criteria and Maneage grew phenomenally, highlighting some difficulties for widespread adoption of these criteria. -Firstly, the low-level tools are not widely used by by many scientists, e.g., Git, \LaTeX, the command-line and Make. +Firstly, the low-level tools are not widely used by many scientists, e.g., Git, \LaTeX, the command-line and Make. This is primarily because of a lack of exposure; we noticed that, after witnessing the improvements in their research, many (especially early career researchers) have started mastering these tools. -Fortunately many research institutes are having courses on these generic tools and we will also be adding more tutorials and demonstration videos in its documentation. +Fortunately, many research institutes offer courses on these generic tools, and we will also be adding more tutorials and demonstration videos to Maneage's documentation. Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several GNU/Linux distributions and macOS). This requires maintenance by our core team and consumes time and energy. However, due to the complexity criteria, the PM and analysis share the same job manager. -Our experience has shown that users' experience in the analysis empowers some of them them to add/fix their required software on their own systems, and share that commits on the core branch, thus propagating to all derived projects. +Our experience has shown that users' experience with the analysis empowers some of them to add/fix their required software on their own systems and share those commits on the core branch, thus propagating the fixes to all derived projects. This has already happened in multiple cases.
-Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with followup papers in competition with the original authors. +Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers in competition with the original authors. We propose these solutions: 1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote. This is a long-term goal and requires major changes to academic value systems. @@ -418,7 +418,7 @@ This is a long-term goal and requires major changes to academic value systems. Other implementations of the criteria, or future improvements in Maneage, may solve the caveats above. However, the proof of concept already shows many advantages to adopting the criteria. -Above, the benefits for researchers was the main focus, but the these criteria also help in data centers, for example with regard to th challenges mentioned in \cite{austin17}: +Above, the benefits for researchers were the main focus, but these criteria also help data centers, for example with regard to the challenges mentioned in \cite{austin17}: (1) The burden of curation is shared among all project authors and/or readers (who may find a bug and fix it), not just by database curators, improving the sustainability of data centers. (2) Automated and persistent bi-directional linking of data and publication can be established through the published \& \emph{complete} data lineage that is version controlled. (3) Software management. @@ -438,7 +438,7 @@ Furthermore, through elements like the macros, natural language processing can a Parsers can be written over projects for meta-research and data provenance studies, for example to generate ``research objects''.
As another example, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured. Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point). -Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and and grant funders. +Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders. @@ -447,7 +447,7 @@ Many components of ``machine-actionable'' data management plans can be automatic % use section* for acknowledgment -\section*{Acknowledgment} +\section*{Acknowledgement} The authors wish to thank (sorted alphabetically) Julia Aguilar-Cabello, @@ -517,8 +517,12 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314 \end{IEEEbiographynophoto} \begin{IEEEbiographynophoto}{Roberto Baena-Gall\'e} - is currently a postdoctoral fellow at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. - Contact him at roberto.baena@gmail.com. + is a postdoctoral researcher at the Instituto de Astrof\'isica de Canarias, Tenerife, Spain. + Before joining the IAC, he worked at the University of Barcelona, the Reial Acad\`emia de Ci\`encias i Arts de Barcelona, l'Universit\'e Pierre et Marie Curie and ONERA, the French Aerospace Lab. + His research interests are image processing and the resolution of inverse problems, with applications to AO-corrected FOVs, satellite identification under atmospheric turbulence, and retinal images. + He is currently involved in projects related to PSF estimation in large astronomical surveys and machine learning. + Baena-Gall\'e holds an MS in Telecommunication and Electronic Engineering from the University of Seville (Spain) and a PhD in astronomy from the University of Barcelona (Spain).
+ Contact him at rbaena@iac.es. \end{IEEEbiographynophoto} \end{document}