author    Mohammad Akhlaghi <mohammad@akhlaghi.org>    2020-06-04 16:49:00 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>    2020-06-04 16:49:00 +0100
commit    6e26690be7b5f073105baaca088bbfb14d454f63 (patch)
tree      411a13ba7c7e1ca060922dc9a420aacf2cdced96 /paper.tex
parent    83063f62859287defc6da525a8e7cb5b728e4fbe (diff)
Final full reading, and minor edits to submit to Zenodo and arXiv
Everything else regarding the submission to arXiv and Zenodo has been completed, so I did a final read, making some minor edits to hopefully make the text easier to read.
Diffstat (limited to 'paper.tex')
-rw-r--r--  paper.tex  115
1 files changed, 57 insertions, 58 deletions
diff --git a/paper.tex b/paper.tex
index fff0e41..2f3300a 100644
--- a/paper.tex
+++ b/paper.tex
@@ -66,13 +66,13 @@
%% METHOD
The criteria have been tested in several research publications and can be summarized as: completeness (no dependency beyond a POSIX-compatible operating system, no administrator privileges, no network connection and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; temporal provenance; linking analysis with narrative; and free-and-open-source software.
%% RESULTS
- They are implemented in a tool, called ``Maneage'', which stores the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
+ As a proof of concept, we have implemented ``Maneage'', a solution which stores the project in machine-actionable and human-readable plain-text, enables version-control, cheap archiving, automatic parsing to extract data provenance, and peer-reviewable verification.
%% CONCLUSION
- We show that requiring longevity of a reproducible workflow solution is realistic, and discuss the benefits of the criteria for scientific progress, but also immediate benefits for short-term reproducibility.
+ We show that requiring longevity of a reproducible workflow solution is realistic, without sacrificing immediate or short-term reproducibility, and discuss the benefits of the criteria for scientific progress.
This paper has itself been written in Maneage, with snapshot \projectversion.
\vspace{3mm}
- \emph{Reproducible supplement} --- \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{Zenodo.3872248}}.
+ \emph{Reproducible supplement} --- Necessary software, workflow and output data are published in \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{Zenodo.3872248}}.
\end{abstract}
% Note that keywords are not normally used for peer-review papers.
@@ -107,7 +107,8 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows
Reproducible research has been discussed in the sciences for at least 30 years \cite{claerbout1992, fineberg19}.
Many reproducible workflow solutions (hereafter, ``solutions'') have been proposed, mostly relying on the common technology of the day: starting with Make and Matlab libraries in the 1990s, to Java in the 2000s and mostly shifting to Python during the last decade.
-However, these technologies develop very fast, and the cost of staying up to date within this rapidly-evolving landscape is high; e.g., Python 2 code often cannot run with Python 3.
+However, these technologies develop fast, e.g., Python 2 code often cannot run with Python 3.
+The cost of staying up to date within this rapidly-evolving landscape is high.
Scientific projects, in particular, suffer the most: scientists have to focus on their own research domain, but to some degree they need to understand the technology of their tools, because it determines their results and interpretations.
Decades later, scientists are still held accountable for their results and therefore the evolving technology landscape creates generational gaps in the scientific community, preventing previous generations from sharing valuable experience.
@@ -127,20 +128,20 @@ A common set of third-party tools that are commonly used can be categorized as:
To isolate the environment, VMs have sometimes been used, e.g., in \href{https://is.ieis.tue.nl/staff/pvgorp/share}{SHARE} (which was awarded second prize in the Elsevier Executable Paper Grand Challenge of 2011 but was discontinued in 2019).
However, containers (in particular, Docker, and to a lesser degree, Singularity) are currently the most widely-used solution, we will thus focus on Docker here.
-Ideally, it is possible to precisely identify the Docker ``images'' that are imported by their checksums, but that is rarely practiced in most solutions that we have surveyed.
+Ideally, it is possible to precisely identify the Docker ``images'' that are imported with their checksums, but that is rarely practiced in most solutions that we have surveyed.
Usually, images are imported with generic operating system (OS) names e.g. \cite{mesnard20} uses `\inlinecode{FROM ubuntu:16.04}'.
The extracted tarball (from \url{https://partner-images.canonical.com/core/xenial}) is updated almost monthly and only the most recent five are archived.
-Hence, if the Dockerfile is run in different months, it will contain different core OS components.
+Hence, if the Dockerfile is run in different months, its output image will contain different OS components.
In the year 2024, when long-term support for this version of Ubuntu will expire, the image will be unavailable at the expected URL.
-This is entirely similar in other OSes: pre-built binary files are large and expensive to maintain and archive.
+Other OSes have similar issues because pre-built binary files are large and expensive to maintain and archive.
Furthermore, Docker requires root permissions, and only supports recent (``long-term-support'') versions of the host kernel, so older Docker images may not be executable.
Once the host OS is ready, PMs are used to install the software, or environment.
Usually the OS's PM, like `\inlinecode{apt}' or `\inlinecode{yum}', is used first and higher-level software are built with generic PMs.
The former suffers from the same longevity problem as the OS, while some of the latter (like Conda and Spack) are written in high-level languages like Python, so the PM itself depends on the host's Python installation.
Nix and GNU Guix produce bit-wise identical programs, but they need root permissions and are primarily targeted at the Linux kernel.
-Generally, the exact version of each software's dependencies is not precisely identified in the build instructions (although this could be implemented).
-Therefore, unless precise version identifiers of \emph{every software package} are stored, a PM will use the most recent version.
+Generally, the exact version of each software's dependencies is not precisely identified in the PM build instructions (although this could be implemented).
+Therefore, unless precise version identifiers of \emph{every software package} are stored by project authors, a PM will use the most recent version.
Furthermore, because each third-party PM introduces its own language and framework, this increases the project's complexity.
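The need to record exact versions can be sketched independently of any particular PM; the following Make-syntax fragment is only an illustration (hypothetical variable names, URL and placeholder checksum, not taken from Maneage or from any surveyed solution): versions and checksums are kept as plain text, and the download rule fails if the archived tarball ever changes.
\begin{lstlisting}
# Illustrative sketch only (recipe lines must start with a tab).
gcc-version  = 9.3.0
gcc-checksum = <sha256-of-verified-gcc-tarball>   # placeholder

tarballs/gcc-$(gcc-version).tar.gz:
	mkdir -p tarballs
	wget -O $@ https://example.org/gcc-$(gcc-version).tar.gz
	echo "$(gcc-checksum)  $@" | sha256sum -c -
\end{lstlisting}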
With the software environment built, job management is the next component of a workflow.
@@ -149,13 +150,13 @@ Designing a modular project needs to be encouraged and facilitated because scien
This includes automatic verification: while it is possible in many solutions, it is rarely practiced, which leads to many inefficiencies in project cost and/or scientific accuracy (reusing, expanding or validating will be expensive).
Finally, to add narrative, computational notebooks \cite{rule18}, like Jupyter, are currently gaining popularity.
-However, due to their complex dependency trees, they are very vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
+However, due to their complex dependency trees, they are vulnerable to the passage of time, e.g., see Figure 1 of \cite{alliez19} for the dependencies of Matplotlib, one of the simpler Jupyter dependencies.
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Further, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
Hence they can rarely deliver their promised potential \cite{rule18} and can even hamper reproducibility \cite{pimentel19}.
An exceptional solution we encountered was the Image Processing Online Journal (IPOL, \href{https://www.ipol.im}{ipol.im}).
-Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is buildable on any widely used OS) with example images/data that can also be executed on their webpage.
+Submitted papers must be accompanied by an ISO C implementation of their algorithm (which is build-able on any widely used OS) with example images/data that can also be executed on their webpage.
This is possible due to the focus on low-level algorithms with no dependencies beyond an ISO C compiler.
However, many data-intensive projects commonly involve dozens of high-level dependencies, with large and complex data formats and analysis, and hence this solution is not scalable.
@@ -199,7 +200,7 @@ More stable/basic tools can be used with less long-term maintenance.
\textbf{Criterion 4: Scalability.}
A scalable project can easily be used in arbitrarily large and/or complex projects.
-On a small scale, the criteria here are trivial to implement, but can become unsustainable very rapidly.
+On a small scale, the criteria here are trivial to implement, but can rapidly become unsustainable.
\textbf{Criterion 5: Verifiable inputs and outputs.}
The project should verify its inputs (software source code and data) \emph{and} outputs.
@@ -210,22 +211,19 @@ No exploratory research project is done in a single, first attempt.
Projects evolve as they are being completed.
It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed.
Research papers often report this with statements such as ``\emph{we [first] tried method [or parameter] X, but Y is used here because it gave lower random error}''.
-The ``history'' is thus as valuable as the final, published version.
+The derivation ``history'' of a result is thus no less valuable than the result itself.
\textbf{Criterion 7: Including narrative, linked to analysis.}
A project is not just its computational analysis.
A raw plot, figure or table is hardly meaningful alone, even when accompanied by the code that generated it.
A narrative description must also be part of the deliverables (defined as ``data article'' in \cite{austin17}): describing the purpose of the computations, and interpretations of the result, and the context in relation to other projects/papers.
-This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, it may get separated from its accompanying published paper.
+This is related to longevity, because if a workflow only contains the steps to do the analysis or generate the plots, in time it may get separated from its accompanying published paper.
\textbf{Criterion 8: Free and open source software:}
-Technically, reproducibility (as defined in \cite{fineberg19}) is possible with non-free or non-open-source software (a black box).
-This criterion is necessary to complement that definition (nature is already a black box!).
-If a project is free software (as formally defined), then others can learn from, modify, and build on it.
-When the software used by the project is itself also free:
-(1) The lineage can be traced to the implemented algorithms, possibly enabling optimizations on that level.
-(2) The source can be modified to work on future hardware.
-In contrast, a non-free software package typically cannot be distributed by others, making it reliant on a single supplier (even without payments).
+Reproducibility (as defined in \cite{fineberg19}) can be achieved with a black box (non-free or non-open-source software); this criterion is thus necessary because nature is already a black box.
+A project that is free software (as formally defined) allows others to learn from, modify, and build upon it.
+When the software used by the project is itself also free, the lineage can be traced to the core algorithms, possibly enabling optimizations on that level, and it can be modified for future hardware.
+In contrast, non-free tools typically cannot be distributed or modified by others, making them reliant on a single supplier (even without payments).
@@ -274,10 +272,10 @@ These are combined at the end to generate precise software acknowledgment and ci
The macro files also act as Make \emph{targets} and \emph{prerequisites} to allow accurate dependency tracking and optimized execution (in parallel, no redundancies), for any level of complexity (e.g., Maneage builds Matplotlib if requested; see Figure~1 of \cite{alliez19}).
All software dependencies are built down to precise versions of every tool, including the shell, POSIX tools (e.g., GNU Coreutils) or \TeX{}Live, providing the same environment.
-On GNU/Linux, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library is being added (task 15390).
-Relocation of a project without building from source, can be done by building the project in a container or VM.
+On GNU/Linux distributions, even the GNU Compiler Collection (GCC) and GNU Binutils are built from source and the GNU C library is being added (task 15390).
+Temporary relocation of a project, without building from source, can be done by building the project in a container or VM.
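As a rough illustration of the macro-file mechanism mentioned above (a sketch with hypothetical file and script names, not the project's actual rules): each analysis step ends in a small macro file, the paper depends on those files, and Make therefore runs independent steps in parallel and rebuilds only what a changed input affects.
\begin{lstlisting}
mtexdir = build/tex/macros

$(mtexdir)/step-a.tex: data-a.txt
	mkdir -p $(mtexdir)
	./analyze-a.sh data-a.txt > $@
$(mtexdir)/step-b.tex: data-b.txt
	mkdir -p $(mtexdir)
	./analyze-b.sh data-b.txt > $@

# `make -j2' builds the two steps in parallel; editing data-a.txt
# re-runs only step-a and then the paper.
paper.pdf: paper.tex $(mtexdir)/step-a.tex $(mtexdir)/step-b.tex
	pdflatex paper.tex
\end{lstlisting}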
-The analysis phase of the project however is naturally very different from one project to another at a low-level.
+The analysis phase of the project however is naturally different from one project to another at a low-level.
It was thus necessary to design a generic framework to comfortably host any project, while still satisfying the criteria of modularity, scalability and minimal complexity.
We demonstrate this design by replicating Figure 1C of \cite{menke20} in Figure \ref{fig:datalineage} (top).
Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it (including this complete paper).
@@ -329,16 +327,16 @@ The analysis is orchestrated through a single point of entry (\inlinecode{top-ma
It is only responsible for \inlinecode{include}-ing the modular \emph{subMakefiles} of the analysis, in the desired order, without doing any analysis itself.
This is visualized in Figure \ref{fig:datalineage} (bottom) where no built (blue) file is placed directly over \inlinecode{top-make.mk} (they are produced by the subMakefiles under them).
A visual inspection of this file is sufficient for a non-expert to understand the high-level steps of the project (irrespective of the low-level implementation details), provided that the subMakefile names are descriptive (thus encouraging good practice).
-A human-friendly design that is also optimized for execution is a critical component for reproducible research workflows.
+A human-friendly design that is also optimized for execution is a critical component for the FAIRness of reproducible research.
All projects first load \inlinecode{initialize.mk} and \inlinecode{download.mk}, and finish with \inlinecode{verify.mk} and \inlinecode{paper.mk} (Listing \ref{code:topmake}).
Project authors add their modular subMakefiles in between.
-Except for \inlinecode{paper.mk} (which builds the ultimate target \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same basename (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
+Except for \inlinecode{paper.mk} (which builds the ultimate target: \inlinecode{paper.pdf}), all subMakefiles build a macro file with the same base-name (the \inlinecode{.tex} file in each subMakefile of Figure \ref{fig:datalineage}).
Other built files (intermediate analysis steps) cascade down in the lineage to one of these macro files, possibly through other files.
-Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criteria.
-All project deliverables (macro files, plot or table data and other datasets) are thus verified at this stage, with their checksums, to automatically ensure exact reproducibility.
-Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors (this step was not available in \cite{akhlaghi19, infante20}).
+Just before reaching the ultimate target (\inlinecode{paper.pdf}), the lineage reaches a bottleneck in \inlinecode{verify.mk}, to satisfy the verification criteria (this step was not yet available in \cite{akhlaghi19, infante20}).
+All project deliverables (macro files, plot or table data and other datasets) are verified at this stage, with their checksums, to automatically ensure exact reproducibility.
+Where exact reproducibility is not possible, values can be verified by any statistical means, specified by the project authors.
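The orchestration and verification just described can be sketched as follows (hypothetical subMakefile and file names, with a placeholder checksum; the actual include order used by this paper is the one shown in Listing code:topmake).
\begin{lstlisting}
# Sketch of a top-level Makefile: it only includes subMakefiles, in order.
include initialize.mk
include download.mk
include format.mk        # project-specific (hypothetical)
include demo-plot.mk     # project-specific (hypothetical)
include verify.mk
include paper.mk

# Sketch of a rule inside verify.mk: deliverables must match their
# recorded checksums before paper.pdf can be built.
build/verify.tex: build/columns.txt build/tex/macros/demo-plot.tex
	echo "<expected-sha256>  build/columns.txt" | sha256sum -c -
	echo "% verification passed" > $@
\end{lstlisting}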
To further minimize complexity, the low-level implementation can be further separated from the high-level execution through configuration files.
By convention in Maneage, the subMakefiles (and the programs they call for number crunching) do not contain any fixed numbers, settings or parameters.
@@ -346,7 +344,7 @@ Parameters are set as Make variables in ``configuration files'' (with a \inlinec
For example, in Figure \ref{fig:datalineage} (bottom), \inlinecode{INPUTS.conf} contains URLs and checksums for all imported datasets, enabling exact verification before usage.
To illustrate this, we report that \cite{menke20} studied $\menkenumpapersdemocount$ papers in $\menkenumpapersdemoyear$ (which is not in their original plot).
The number \inlinecode{\menkenumpapersdemoyear} is stored in \inlinecode{demo-year.conf} and the result (\inlinecode{\menkenumpapersdemocount}) was calculated after generating \inlinecode{columns.txt}.
-Both are expanded as \LaTeX{} macros when creating this PDF file.
+Both numbers are expanded as \LaTeX{} macros when creating this PDF file.
An interested reader can change the value in \inlinecode{demo-year.conf} to automatically update the result in the PDF, without necessarily knowing the underlying low-level implementation.
Furthermore, the configuration files are a prerequisite of the targets that use them.
If changed, Make will \emph{only} re-execute the dependent recipe and all its descendants, with no modification to the project's source or other built products.
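A minimal sketch of this configuration-file mechanism (hypothetical names, paths and values; the project's real rules differ in detail): the parameter lives in a Make-syntax \inlinecode{.conf} file, the rule that uses it lists that file as a prerequisite, and the value reaches the PDF as a \LaTeX{} macro, so editing the \inlinecode{.conf} file re-runs only this rule and its descendants.
\begin{lstlisting}
# demo-year.conf (hypothetical name, arbitrary value) contains one line:
#   menke-demo-year = 1996
include demo-year.conf

# Assumes columns.txt lists "year count" per line (hypothetical format).
build/tex/macros/demo-plot.tex: demo-year.conf build/columns.txt
	mkdir -p build/tex/macros; \
	n=$$(awk '$$1=='$(menke-demo-year)' {print $$2}' build/columns.txt); \
	printf '\\newcommand{\\menkenumpapersdemoyear}{%s}\n' "$(menke-demo-year)" > $@; \
	printf '\\newcommand{\\menkenumpapersdemocount}{%s}\n' "$$n" >> $@
\end{lstlisting}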
@@ -365,9 +363,9 @@ This fast and cheap testing encourages experimentation (without necessarily know
}
\end{figure*}
-Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git), plays a crucial role in Maneage, as shown in Figure \ref{fig:branching}.
-In practice, Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
-Every project starts by branching off the Maneage branch and customizing it (e.g., replacing the title, data links, and narrative, and adding subMakefiles for its particular analysis, see Listing \ref{code:branching}).
+Finally, to satisfy the temporal provenance criterion, version control (currently implemented in Git) is another component of Maneage (see Figure \ref{fig:branching}).
+Maneage is a Git branch that contains the shared components (infrastructure) of all projects (e.g., software tarball URLs, build recipes, common subMakefiles and interface script).
+Derived projects start by branching off it and customizing it (e.g., adding a title, data links, narrative, and subMakefiles for their particular analysis; see Listing \ref{code:branching}; a customization checklist is provided in \inlinecode{README-hacking.md}).
\begin{lstlisting}[
label=code:branching,
@@ -389,13 +387,13 @@ $ ./project make # Re-build to see effect.
$ git add -u && git commit # Commit changes
\end{lstlisting}
-Thanks to this architecture (Figure \ref{fig:branching}), it is always possible to import Maneage (technically: \emph{merge}) into a project and improve the low-level infrastructure:
-in (a) the authors merge Maneage during an ongoing project;
-in (b) readers do it after the paper's publication, e.g., when the project remains reproducible but the infrastructure is outdated, or a bug is found in Maneage.
-In this way, low-level improvements in Maneage can easily propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
+The branch-based design (Figure \ref{fig:branching}) allows projects to re-import Maneage at a later time (technically: to \emph{merge}), thus improving their low-level infrastructure:
+in (a) authors do the merge during an ongoing project;
+in (b) readers do it after publication, e.g., when the project remains reproducible but its infrastructure is outdated, or a bug is fixed in Maneage.
+Low-level improvements in Maneage can thus propagate to all projects, greatly reducing the cost of curation and maintenance of each individual project, before \emph{and} after publication.
Finally, the complete project source is usually $\sim100$ kilo-bytes.
-It can thus easily be published or archived in many servers, for example it can be uploaded to arXiv (with the LaTeX source, see \cite{akhlaghi19, infante20}), published on Zenodo and archived in SoftwareHeritage.
+It can thus easily be published or archived on many servers; for example, it can be uploaded to arXiv (with the \LaTeX{} source, see the arXiv source in \cite{akhlaghi19, infante20, akhlaghi15}), published on Zenodo and archived in SoftwareHeritage.
@@ -411,7 +409,7 @@ It can thus easily be published or archived in many servers, for example it can
%% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
We have shown that it is possible to build workflows satisfying all the proposed criteria, and we comment here on our experience in testing them through this proof-of-concept tool.
-With the support of RDA, our user-base grew phenomenally, underscoring some difficulties for a widespread adoption.
+With the support of RDA, the Maneage user base grew, underscoring some difficulties for widespread adoption.
Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
@@ -424,15 +422,15 @@ We have noticed that providing a complete \emph{and} customizable template with
Secondly, to satisfy the completeness criterion, all the required software of the project must be built on various POSIX-compatible systems (Maneage is actively tested on different GNU/Linux distributions, macOS, and is being ported to FreeBSD also).
This requires maintenance by our core team and consumes time and energy.
-However, because the PM and analysis components share the same job manager (Make), design principles and conventions, we have already noticed some early users adding, or fixing, their required software alone.
+However, because the PM and analysis components share the same job manager (Make) and design principles, we have already noticed some early users adding, or fixing, their required software alone.
They later share their low-level commits on the core branch, thus propagating it to all derived projects.
-On a related note, POSIX is a fuzzy standard that does not guarantee the bit-wise reproducibility of programs.
-It has been chosen here, however, as the underlying platform because we focus on the results (data), not on the compiled software.
-POSIX is ubiquitous and fixed versions of low-level software (e.g., core GNU tools) are installable on most; each internally corrects for differences affecting its functionality (partly as part of the GNU portability library).
-On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain, but glibc is not installable on some POSIX OSs (e.g., macOS).
+A related caveat is that POSIX is a fuzzy standard that does not guarantee the bit-wise reproducibility of programs.
+It has been chosen here, however, as the underlying platform because our focus is on reproducing results (data), which does not always require bit-wise identical software.
+POSIX is ubiquitous and low-level software (e.g., core GNU tools) is installable on most POSIX systems; each tool internally corrects for differences affecting its functionality (partly through the GNU portability library).
+On GNU/Linux hosts, Maneage builds precise versions of the compilation toolchain, but glibc is not installable on some POSIX OSs (e.g., macOS).
The C library is linked with all programs, and this dependence can hypothetically hinder exact reproducibility \emph{of results}, but we have not encountered this so far.
-With everything else under precise control, the effect of differing Kernel and C libraries on high-level science results can now be systematically studied with Maneage.
+With everything else under precise control, the effect of differing kernels and C libraries on high-level science results can now be systematically studied with Maneage in follow-up research.
% DVG: It is a pity that the following paragraph cannot be included, as it is really important but perhaps goes beyond the intended goal.
%Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.
@@ -441,24 +439,25 @@ With everything else under precise control, the effect of differing Kernel and C
%This is a long-term goal and would require major changes to academic value systems.
%2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
-Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but this proof of concept has shown many advantages in adopting them.
-For example, the publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
-The completeness criteria implies that algorithms and data selection can be similarly optimized.
-Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative.
+Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above, but this proof of concept already shows their many advantages.
+For example, publication of projects meeting these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (e.g., via machine learning).
+The completeness criterion implies that algorithms and data selection can be included in the optimizations.
+
+Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis+narrative.
Parsers can be written over projects for meta-research and provenance studies, e.g., to generate ``research objects''.
-Likewise, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured.
-Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point).
-Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders.
+Likewise, when a bug is found in one scientific software package, affected projects can be detected and the scale of the effect can be measured.
+Combined with SoftwareHeritage, precise high-level science components of the analysis can be accurately cited (e.g., even failed/abandoned tests at any historical point).
+Many components of ``machine-actionable'' data management plans can also be automatically completed as a byproduct, useful for project PIs and grant funders.
-From the data repository perspective, these criteria can also be very useful with regard to the challenges mentioned in \cite{austin17}:
+From the data repository perspective, these criteria can also be useful, e.g., with regard to the challenges mentioned in \cite{austin17}:
(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, improving sustainability.
-(2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control.
-(3) Software management.
-With these criteria, we ensure that each project's unique and complete software management is included.
-It is not a third-party PM that needs to be maintained by the data center.
+(2) Automated and persistent bidirectional linking of data and publication can be established through the published \emph{and complete} data lineage that is under version control.
+(3) Software management:
+with these criteria, each project comes with its own unique and complete software management.
+It does not use a third-party PM that would need to be maintained by the data center (in all the PM's versions).
Hence enabling robust software management, preservation, publishing and citation.
For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to distribute the tarballs of all the software used with each project's source as deliverables.
-(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarises the whole purpose of these criteria.
+(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarizes the whole purpose of these criteria.