-rw-r--r-- | paper.tex | 52 |
1 files changed, 26 insertions, 26 deletions
@@ -133,7 +133,7 @@ starting with Make and Matlab libraries in the 1990s, Java in the 2000s, and mos
 However, these technologies develop fast, e.g., code written in Python 2 \new{(which is no longer officially maintained)} often cannot run with Python 3.
 The cost of staying up to date within this rapidly-evolving landscape is high.
-Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools because it determines their results and interpretations.
+Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree, they need to understand the technology of their tools because it determines their results and interpretations.
 Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
@@ -202,7 +202,7 @@ However, because of their complex dependency trees, their build is vulnerable to
 It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
 Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
 \new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
-By default cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language.
+By default, cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language.
 There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
 However, since they are not part of the core, their longevity can be assumed to be shorter.
 Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
@@ -224,7 +224,7 @@ A project that is complete (self-contained) has the following properties.
 (1) \new{No \emph{execution requirements} apart from a minimal Unix-like operating system.
 Fewer explicit execution requirements would mean higher \emph{execution possibility} and consequently higher \emph{longetivity}.}
 (2) Primarily stored as plain text \new{(encoded in ASCII/Unicode)}, not needing specialized software to open, parse, or execute.
-(3) No impact on the host OS libraries, programs and \new{environment variables}.
+(3) No impact on the host OS libraries, programs, and \new{environment variables}.
 (4) Does not require root privileges to run (during development or post-publication).
 (5) Builds its own controlled software \new{with independent environment variables}.
 (6) Can run locally (without an internet connection).
@@ -235,7 +235,7 @@ Fewer explicit execution requirements would mean higher \emph{execution possibil
 A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
 \new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., software binaries or figures), software source code, or data should be included.
 The latter two (developing low-level software, collecting data, or the publishing and archival of both) are separate projects in themselves because they can be used in other independent projects.
-This optimizes the storage, archival/mirroring and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\times100$ kilobytes and the version controlled history may become a few megabytes.}
+This optimizes the storage, archival/mirroring, and publication costs (which are critical to longevity): a snapshot of a project's hand-written source will usually be on the scale of $\times100$ kilobytes, and the version-controlled history may become a few megabytes.}
 In terms of the analysis workflow, explicit communication between various modules enables optimizations on many levels:
 (1) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run).
@@ -261,23 +261,23 @@ The project should automatically verify its inputs (software source code and dat
 \textbf{Criterion 6: Recorded history.}
 No exploratory research is done in a single, first attempt.
 Projects evolve as they are being completed.
-It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
+Naturally, earlier phases of a project are redesigned/optimized only after later phases have been completed.
 Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
 The derivation ``history'' of a result is thus not any the less valuable as itself.
 \textbf{Criterion 7: Including narrative that is linked to analysis.}
 A project is not just its computational analysis.
-A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
-A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
+A raw plot, figure, or table is hardly meaningful alone, even when accompanied by the code that generated it.
+A narrative description is also a deliverable (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, interpretations of the result, and the context concerning other projects/papers.
 This is related to longevity, because if a workflow contains only the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
-\textbf{Criterion 8: Free and open source software:}
-Non-free or non-open-source software typically cannot be distributed, inspected or modified by others.
+\textbf{Criterion 8: Free and open-source software:}
+Non-free or non-open-source software typically cannot be distributed, inspected, or modified by others.
 They are reliant on a single supplier (even without payments) \new{and prone to \href{https://www.gnu.org/proprietary/proprietary-obsolescence.html}{proprietary obsolescence}}.
 A project that is \href{https://www.gnu.org/philosophy/free-sw.en.html}{free software} (as formally defined by GNU), allows others to run, learn from, \new{distribute, build upon (modify), and publish their modified versions}.
 When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level and it can be modified for future hardware.
-\new{It may happen that proprietary software is necessary to read proprietary data formats produced by data collection hardware (for example micro-arrays in genetics).
+\new{Propietary software may be necessary to read private data formats produced by data collection hardware (for example micro-arrays in genetics).
 In such cases, it is best to immediately convert the data to free formats upon collection, and archive (e.g., on Zenodo) or use the data in free formats.}
@@ -323,7 +323,7 @@ This allows automatic updates to the embedded numbers during the experimentation
 Through the former, manual updates by authors (which are prone to errors and discourage improvements or experimentation after writing the first draft) are by-passed.
 Acting as a link, the macro files build the core skeleton of Maneage.
-For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version and possible citation.
+For example, during the software building phase, each software package is identified by a \LaTeX{} file, containing its official name, version, and possible citation.
 These are combined at the end to generate precise software \new{acknowledgement} and citation that is shown in the \new{
 \ifdefined\separatesupplement
@@ -333,7 +333,7 @@ These are combined at the end to generate precise software \new{acknowledgement}
 \fi%
 }
 (for other examples, see \cite{akhlaghi19, infante20})
-\new{Furthermore, the machine related specifications of the running system (including hardware name and byte-order) are also collected and cited.
+\new{Furthermore, the machine-related specifications of the running system (including hardware name and byte-order) are also collected and cited.
 These can help in \emph{root cause analysis} of observed differences/issues in the execution of the workflow on different machines.}
 The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
 All software dependencies are built down to precise versions of every tool, including the shell,\new{important low-level application programs} (e.g., GNU Coreutils) and of course, the high-level science software.
@@ -345,7 +345,7 @@ Currently, {\TeX}Live is also being added (task \href{http://savannah.nongnu.org
 \new{Finally, some software cannot be built on some CPU architectures, hence by default, the architecture is included in the final built paper automatically (see below).}
 \new{Building the core Maneage software environment on an 8-core CPU takes about 1.5 hours (GCC consumes more than half of the time).
-However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use built environment.
+However, this is only necessary once in a project: the analysis (which usually takes months to write/mature for a normal project) will only use the built environment.
 Hence the few hours of initial software building is negligible compared to a project's life span.
 To facilitate moving to another computer in the short term, Maneage'd projects can be built in a container or VM.
 The \href{https://archive.softwareheritage.org/browse/origin/directory/?origin_url=http://git.maneage.org/project.git}{\inlinecode{README.md}} file has instructions on building in Docker.
@@ -368,7 +368,7 @@ Figure \ref{fig:datalineage} (right) is the data lineage that produced it.
 Left: an enhanced replica of Figure 1C in \cite{menke20}, shown here for demonstrating Maneage.
 It shows the fraction of the number of papers mentioning software tools (green line, left vertical axis) in each year (red bars, right vertical axis on a log scale).
 Right: Schematic representation of the data lineage, or workflow, to generate the plot on the left.
- Each colored box is a file in the project and \new{arrows show the operation of various software: linking input file(s) to output file(s)}.
+ Each colored box is a file in the project and \new{arrows show the operation of various software: linking input file(s) to the output file(s)}.
 Green files/boxes are plain-text files that are under version control and in the project source directory.
 Blue files/boxes are output files in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}.
 For example, \inlinecode{paper.pdf} \new{is created by running \LaTeX{} on} \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually).
@@ -415,7 +415,7 @@ Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{
 Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
 Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk} to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}).
-All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
+All project deliverables (macro files, plot or table data, and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
 Where exact reproducibility is not possible \new{(for example, due to parallelization)}, values can be verified by the project authors.
 \new{For example see \new{\href{https://archive.softwareheritage.org/browse/origin/content/?branch=refs/heads/postreferee_corrections&origin_url=https://codeberg.org/boud/elaphrocentre.git&path=reproduce/analysis/bash/verify-parameter-statistically.sh}{verify-parameter-statistically.sh}} of \href{https://doi.org/10.5281/zenodo.4062460}{zenodo.4062460}.}
@@ -424,7 +424,7 @@ Where exact reproducibility is not possible \new{(for example, due to paralleliz
 \vspace{-3mm}
 \caption{\label{fig:branching} Maneage is a Git branch.
 Projects using Maneage are branched off it and apply their customizations.
- (a) A hypothetical project's history prior to publication.
+ (a) A hypothetical project's history before publication.
 The low-level structure (in Maneage, shared between all projects) can be updated by merging with Maneage.
 (b) A finished/published project can be revitalized for new technologies by merging with the core branch.
 Each Git ``commit'' is shown on its branch as a colored ellipse, with its commit hash shown and colored to identify the team that is/was working on the branch.
@@ -446,9 +446,9 @@ Furthermore, the configuration files are a prerequisite of the targets that use
 If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
 This fast and cheap testing encourages experimentation (without necessarily knowing the implementation details; e.g., by co-authors or future readers), and ensures self-consistency.
-\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, are not stored together in a single file and do not require a unique editor.
+\new{In contrast to notebooks like Jupyter, the analysis scripts and configuration parameters are therefore not blended into the running code, are not stored together in a single file, and do not require a unique editor.
 To satisfy the modularity criterion, the analysis steps and narrative are run in their own files (in different languages, thus maximally benefiting from the unique features of each) and the files can be viewed or manipulated with any text editor that the authors prefer.
-The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.}
+The analysis can benefit from the powerful and portable job management features of Make and communicates with the narrative text through \LaTeX{} macros, enabling much better-formatted output that blends analysis outputs in the narrative sentences and enables direct provenance tracking.}
 To satisfy the recorded history criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
 Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles, and interface script).
@@ -456,8 +456,8 @@ Maneage is a Git branch that contains the shared components (infrastructure) of
 Derived projects start by creating a branch and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for its particular analysis, see Listing \ref{code:branching}).
 There is a \new{thoroughly elaborated} customization checklist in \inlinecode{README-hacking.md}).
-The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgements).
-These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgements)}.
+The current project's Git hash is provided to the authors as a \LaTeX{} macro (shown here at the end of the abstract), as well as the Git hash of the last commit in the Maneage branch (shown here in the acknowledgments).
+These macros are created in \inlinecode{initialize.mk}, with \new{other basic information from the running system like the CPU architecture, byte order or address sizes (shown here in the acknowledgments)}.
 \begin{lstlisting}[
 label=code:branching,
@@ -485,7 +485,7 @@ in (b) readers do it after publication; e.g., the project remains reproducible b
 Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
 Finally, the complete project source is usually $\sim100$ kilo-bytes.
-It can thus easily be published or archived in many servers, for example it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
+It can thus easily be published or archived in many servers, for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
@@ -508,14 +508,14 @@ It can thus easily be published or archived in many servers, for example it can
 %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
 We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool.
-Maneage user-base grew with the support of RDA, underscoring some difficulties for a widespread adoption.
+Maneage user-base grew with the support of RDA, underscoring some difficulties for widespread adoption.
 Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
 Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
 Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
 Indeed the fast-evolving tools are primarily targeted at software developers, who are paid to learn and use them effectively for short-term projects before moving on to the next technology.
-Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
+Scientists, on the other hand, need to focus on their own research fields and need to consider longevity.
 Hence, arguably the most important feature of these criteria (as implemented in Maneage) is that they provide a fully working template or bundle that works immediately out of the box by producing a paper with an example calculation that they just need to start customizing.
 Using mature and time-tested tools, for blending version control, the research paper's narrative, the software management \emph{and} a robust data management strategies.
 We have noticed that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these modern scientific tools than having abstract, isolated tutorials on each tool individually.
@@ -528,7 +528,7 @@ They later share their low-level commits on the core branch, thus propagating it
 \new{Unix-like OSs are a very large and diverse group (mostly conforming with POSIX), so this condition does not guarantee bitwise reproducibility of the software, even when built on the same hardware.}.
 However \new{our focus is on reproducing results (output of software), not the software itself.}
 Well written software internally corrects for differences in OS or hardware that may affect its output (through tools like the GNU portability library).
-On GNU/Linux hosts, Maneage builds precise versions of the compilation tool chain.
+On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain.
 However, glibc is not install-able on some \new{Unix-like} OSs (e.g., macOS) and all programs link with the C library.
 This may hypothetically hinder the exact reproducibility \emph{of results} on non-GNU/Linux systems, but we have not encountered this in our research so far.
 With everything else under precise control in Maneage, the effect of differing hardware, Kernel and C libraries on high-level science can now be systematically studied in follow-up research \new{(including floating-point arithmetic or optimization differences).
@@ -542,7 +542,7 @@ Using continuous integration (CI) is one way to precisely identify breaking poin
 %2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
 Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats, but this proof of concept already shows their many advantages.
-For example, publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
+For example, the publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
 The completeness criterion implies that algorithms and data selection can be included in the optimizations.
 Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis+narrative.
@@ -631,7 +631,7 @@ The Pozna\'n Supercomputing and Networking Center (PSNC) computational grant 314
 \begin{IEEEbiographynophoto}{Ra\'ul Infante-Sainz}
 is a doctoral student at IAC, Spain.
- He has an M.Sc in University of Granada (Spain).
+ He has an M.Sc from the University of Granada (Spain).
 Email: infantesainz-AT-gmail.com; Website: \url{https://infantesainz.org}.
 \end{IEEEbiographynophoto}