author Boud Roukema <boud@cosmo.torun.pl> 2020-12-28 19:22:22 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2020-12-29 10:46:54 +0000
commit 10a46908c69db8783063385457f62e77ff0429db
tree 0bc7ca42ad5422f1d0f10d9bdf056114073ba980 /paper.tex
parent a695e719ec316a8a26e7f360e4eb36f0b43730e3
Copyedit on Appendix A
This commit makes many small wording fixes, mainly to Appendix A. It also inserts "quotes" around some of the title fields in 'tex/src/references.tex', since otherwise capitalisation is lost (DNA becomes Dna; 'of Reinhart and Rogoff' becomes 'of reinhart and rogoff'; and so on). I didn't do this for all titles, because some Have All Words Capitalised, which blocks the .bib file from choosing a consistent style.
Diffstat (limited to 'paper.tex')
-rw-r--r-- paper.tex 90
1 files changed, 45 insertions, 45 deletions
diff --git a/paper.tex b/paper.tex
index a609ed7..c43fda0 100644
--- a/paper.tex
+++ b/paper.tex
@@ -61,7 +61,7 @@
%% CONTEXT
Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term.
%% AIM
- A set of criteria are introduced to address this problem.
+ A set of criteria is introduced to address this problem.
%% METHOD
Completeness (no \new{execution requirement} beyond \new{a minimal Unix-like operating system}, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free software.
They have been tested in several research publications in various fields.
@@ -134,7 +134,7 @@ Decades later, scientists are still held accountable for their results and there
\new{Reproducibility is defined as ``obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis'' \cite{fineberg19}.
Longevity is defined as the length of time that a project remains \emph{functional} after its creation.
Functionality is defined as \emph{human readability} of the source and its \emph{execution possibility} (when necessary).
-Many usage contexts of a project don't involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data.
+Many usage contexts of a project do not involve execution: for example, checking the configuration parameter of a single step of the analysis to re-\emph{use} in another project, or checking the version of used software, or the source of the input data.
Extracting these from execution outputs is not always possible.}
Longevity is as important in science as in some fields of industry, but not all; e.g., fast-evolving tools can be appropriate in short-term commercial projects.
@@ -197,7 +197,7 @@ However, because of their complex dependency trees, their build is vulnerable to
It is important to remember that the longevity of a project is determined by its shortest-lived dependency.
Furthermore, as with job management, computational notebooks do not actively encourage good practices in programming or project management.
\new{The ``cells'' in a Jupyter notebook can either be run sequentially (from top to bottom, one after the other) or by manually selecting the cell to run.
-By default cell dependencies aren't included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language.
+By default cell dependencies are not included (e.g., automatically running some cells only after certain others), parallel execution, or usage of more than one language.
There are third party add-ons like \inlinecode{sos} or \inlinecode{nbextensions} (both written in Python) for some of these.
However, since they are not part of the core, their longevity can be assumed to be shorter.
Therefore, the core Jupyter framework leaves very few options for project management, especially as the project grows beyond a small test or tutorial.}
@@ -691,73 +691,73 @@ Questions such as these commonly follow any such result:
What inputs were used?
What operations were done on those inputs? How were the configurations or training data chosen?
How did the quantitative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation?
-May there be a bias in the visualization?
+Could there be a bias in the visualization?
See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
-In data science and database management, this type of metadata are commonly known as \emph{data provenance} or \emph{data lineage}.
-Data lineage is being increasingly demanded for integrity checking from both the scientific and industrial/legal domains.
-Notable examples in each domain are respectively the ``Reproducibility crisis'' in the sciences that was reported by the Nature journal after a large survey \citeappendix{baker16}, and the General Data Protection Regulation (GDPR) by the European Parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively.
+In data science and database management, this type of metadata is commonly known as \emph{data provenance} or \emph{data lineage}.
+Data lineage is being increasingly demanded for integrity checking from both the scientific, industrial and legal domains.
+Notable examples in each domain are respectively the ``Reproducibility crisis'' in the sciences that was reported by \emph{Nature} after a large survey \citeappendix{baker16}, and the General Data Protection Regulation (GDPR) by the European Parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020, respectively.
The former argues that reproducibility (as a test on sufficiently conveying the data lineage) is necessary for other scientists to study, check and build-upon each other's work.
-The latter requires the data intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage.
-Besides regulation and integrity checks, having a robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation on alternative methods, or optimization of the workflow.
+The latter requires the data intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data lineage.
+Besides regulation and integrity checks, having robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation on alternative methods, or optimization of the workflow.
-In the sciences, the results of a project's analysis are published as scientific papers which have also been the primary conveyor of the result's lineage: usually in narrative form, within the ``Methods'' section of the paper.
-From our own experiences, this section is usually most discussed during peer review and conference presentations, showing its importance.
+In the sciences, the results of a project's analysis are published as scientific papers, which have traditionally been the primary conveyor of the lineage of the results: usually in narrative form, especially within the ``Methods'' section of the paper.
+From our own experience, this section is often that which is the most intensively discussed during peer review and conference presentations, showing its importance.
After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology.
-In the industry however, data governance is usually kept as a trade secret and isn't publicly published or scrutinized.
-Therefore the main practical focus here will be in the scientific front which has traditionally been more open to publishing the methods and anonymous peer scrutiny.
+In industry, however, data governance is usually kept as a trade secret and is not published openly or widely scrutinized.
+Therefore, the main practical focus here will be in the scientific front, which has traditionally been more open to the publication of methods and anonymous peer scrutiny.
\begin{figure*}[t]
\begin{center}
\includetikz{figure-project-outline}{width=\linewidth}
\end{center}
- \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step.
+ \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues and questions on each step.
The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs.
- The red boxes with dashed edges highlight the main questions on the respective stage.
- The orange box surrounding the software download and build phases marks shows the various commonly recognized solutions to the questions in it, for more see Appendix \ref{appendix:independentenvironment}.
+ The red boxes with dashed edges highlight the main questions at various stages in the workchain.
+ The orange box, surrounding the software download and build phases, lists some commonly recognized solutions to the questions in it; for more discussion, see Appendix \ref{appendix:independentenvironment}.
}
\end{figure*}
-The traditional format of a scientific paper has been very successful in conveying the method with the result in the last centuries.
+The traditional format of a scientific paper has been very successful in conveying the method and the results during recent centuries.
However, the complexity mentioned above has made it impossible to describe all the analytical steps of most modern projects to a sufficient level of detail.
-Citing this difficulty, many authors suffice to describing the very high-level generalities of their analysis, while even the most basic calculations (like the mean of a distribution) can depend on the software implementation.
+Citing this difficulty, many authors limit themselves to describing the very high-level generalities of their analysis, while even the most basic calculations (such as the mean of a distribution) can depend on the software implementation.
-Due to the complexity of modern scientific analysis, a small deviation in the final result can be due to many different steps, which may be significant.
-Publishing the precise codes of the analysis is the only guarantee.
-For example, \citeappendix{smart18} describes how a 7-year old conflict in theoretical condensed matter physics was only identified after the relative codes were shared.
-Nature is already a black box which we are trying hard to unlock, or understand.
-Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers.
+Due to the complexity of modern scientific analysis, a small deviation in some of the different steps involved can lead to significant differences in the final result.
+Publishing the precise codes of the analysis is the only guarantee of allowing this to be investigated.
+For example, \citeappendix{smart18} describes how a 7-year old conflict in theoretical condensed matter physics was only identified after the different groups' codes were shared.
+Nature is already a black box that we are trying hard to unlock and understand.
+Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and wasting much of researchers' time and energy.
-An example showing the importance of sharing code is \citeappendix{miller06}, that found a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science.
-\citeappendix{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
-\citeappendix{herndon14} and \citeappendix{horvath15} also reported similar situations and \citeappendix{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists, contain erroneous gene name conversions.
-Such integrity checks are a critical component of the scientific method, but are only possible with access to the data and codes and \emph{cannot be resolved by the published paper alone}.
+A dramatic example showing the importance of sharing code is \citeappendix{miller06}, in which a mistaken flipping of a column was discovered, leading to the retraction of five papers in major journals, including \emph{Science}.
+Ref.\/ \citeappendix{baggerly09} highlighted the inadequate narrative description of analysis in several papers and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
+References \citeappendix{herndon14} and \citeappendix{horvath15} also reported similar situations and \citeappendix{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions.
+Such integrity checks are a critical component of the scientific method, but are only possible with access to the data and codes and \emph{cannot be resolved from analysing the published paper alone}.
The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (supposedly on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors?
Several studies have attempted to answer this with different levels of detail.
-For example \citeappendix{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \cite{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved in medical journals over the last two decades.
+For example \citeappendix{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they have used, while \cite{menke20} found that the fraction of papers explicitly mentioning their software tools has greatly improved in medical journals over the last two decades.
-\citeappendix{ioannidis2009} attempted to reproduce 18 published results by two independent groups but, only fully succeeded in 2 of them and partially in 6.
-\citeappendix{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all.
-\citeappendix{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
+Ref.\/ \citeappendix{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but only fully succeeded in two of them and partially in six.
+Ref.\/ \citeappendix{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all.
+Ref.\/ \citeappendix{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
Even though the authors were contacted, the success rate was $26\%$.
Generally, this problem is unambiguously felt in the community: \citeappendix{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``reproducibility crisis''.
This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citeappendix{gabriel11}.
The proposed solutions were published in a special edition.
Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then.
-Before that, \citeappendix{ioannidis05} proved that ``most claimed research findings are false''.
-In the 1990s, \cite{schwab2000}, \citeappendix{buckheit1995} and \cite{claerbout1992} describe this same problem very eloquently and also provided some solutions that they used.
+In 2005, Ref.\/ \citeappendix{ioannidis05} argued that ``most claimed research findings are false''.
+Even earlier, in the 1990s, Refs \cite{schwab2000}, \citeappendix{buckheit1995} and \cite{claerbout1992} described this same problem very eloquently and provided some of the solutions that they adopted.
While the situation has improved since the early 1990s, the problems mentioned in these papers will resonate strongly with the frustrations of today's scientists.
-Even earlier, through his famous quartet, \citeappendix{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
-One of the earliest such efforts we found was \citeappendix{roberts69} who discussed conventions in FORTRAN programming and documentation to help in publishing research codes.
+Even earlier yet, through his famous quartet, Anscombe \citeappendix{anscombe73} qualitatively showed how the distancing of researchers from the intricacies of algorithms and methods can lead to misinterpretation of the results.
+One of the earliest such efforts we found was \citeappendix{roberts69}, who discussed conventions in FORTRAN programming and documentation to help in publishing research codes.
From a practical point of view, for those who publish the data lineage, a major problem is the fast evolving and diverse software technologies and methodologies that are used by different teams in different epochs.
-\citeappendix{zhao12} describe it as ``workflow decay'' and recommend preserving these auxiliary resources.
-But in the case of software its not as straightforward as data: if preserved in binary form, software can only be run on certain hardware and if kept as source-code, their build dependencies and build configuration must also be preserved.
-\citeappendix{gronenschild12} specifically study the effect of software version and environment and encourage researchers to not update their software environment.
+Ref.\/ \citeappendix{zhao12} describes it as ``workflow decay'' and recommends preserving these auxiliary resources.
+But in the case of software, this is not as straightforward as for data: if preserved in binary form, software can only be run on a certain operating system on particular hardware, and if kept as source code, its build dependencies and build configuration must also be preserved.
+Ref.\/ \citeappendix{gronenschild12} specifically studies the effect of software version and environment and encourages researchers to not update their software environment.
However, this is not a practical solution because software updates are necessary, at least to fix bugs in the same research software.
-Generally, software is not a secular component of projects, where one software can easily be swapped with another.
+Generally, software is not a secular component of projects, where one software package can easily be swapped with another.
Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citeappendix{dicosmo19}.
@@ -773,7 +773,7 @@ Projects are built around specific software technologies, and research in softwa
\label{appendix:existingtools}
Data analysis workflows (including those that aim for reproducibility) are commonly high-level frameworks which employ various lower-level components.
-To help in reviewing existing reproducible workflow solutions in light of the proposed criteria in Appendix \ref{appendix:existingsolutions}, we first need survey the most commonly employed lower-level tools.
+To help in reviewing existing reproducible workflow solutions in light of the proposed criteria in Appendix \ref{appendix:existingsolutions}, we first need to survey the most commonly employed lower-level tools.
\subsection{Independent environment}
\label{appendix:independentenvironment}
@@ -808,7 +808,7 @@ Otherwise, containers have their own independent software for everything else.
Therefore, they have much less overhead in hardware/CPU access.
Like VMs, users often choose an operating system for the container's independent operating system (most commonly GNU/Linux distributions which are free software).
-Below we'll review some of the most common container solutions: Docker and Singularity.
+Below we review some of the most common container solutions: Docker and Singularity.
\begin{itemize}
\item {\bf\small Docker containers:} Docker is one of the most popular tools today for keeping an independent analysis environment.
@@ -922,7 +922,7 @@ Moreover, these are designed for the Linux kernel (using its containerization fe
\label{appendix:nixguix}
Nix \citeappendix{dolstra04} and GNU Guix \citeappendix{courtes15} are independent package managers that can be installed and used on GNU/Linux operating systems, and macOS (only for Nix, prior to macOS Catalina).
Both also have a fully functioning operating system based on their packages: NixOS and ``Guix System''.
-GNU Guix is based on the same principles of Nix but implemented differencely, so we'll focus the review here on Nix.
+GNU Guix is based on the same principles of Nix but implemented differencely, so we focus the review here on Nix.
The Nix approach to package management is unique in that it allows exact dependency tracking of all the dependencies, and allows for multiple versions of a software, for more details see \citeappendix{dolstra04}.
In summary, a unique hash is created from all the components that go into the building of the package.
@@ -1061,7 +1061,7 @@ For example it is first necessary to download a dataset and do some preparations
Each one of these is a logically independent step, which needs to be run before/after the others in a specific order.
Hence job management is a critical component of a research project.
-There are many tools for managing the sequence of jobs, below we'll review the most common ones that are also used the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
+There are many tools for managing the sequence of jobs, below we review the most common ones that are also used the existing reproducibility solutions of Appendix \ref{appendix:existingsolutions}.
\subsubsection{Manual operation with narrative}
\label{appendix:manual}
@@ -1097,7 +1097,7 @@ This was found to greatly help in debugging software projects, and in speeding u
The most common implementation of Make, since the early 1990s, is GNU Make.
Make was also the framework used in the first attempts at reproducible scientific papers \cite{claerbout1992,schwab2000}.
Our proof-of-concept (Maneage) also uses Make to organize its workflow.
-Here, we'll complement that section with more technical details on Make.
+Here, we complement that section with more technical details on Make.
Usually, the top-level Make instructions are placed in a file called Makefile, but it is also common to use the \inlinecode{.mk} suffix for custom file names.
Each stage/step in the analysis is defined through a \emph{rule}.
@@ -1179,7 +1179,7 @@ These are primarily specifications/standards rather than software, so ideally tr
In order to later reproduce a project, the analysis steps must be stored in files.
For example Shell, Python or R scripts, Makefiles, Dockerfiles, or even the source files of compiled languages like C or FORTRAN.
Given that a scientific project does not evolve linearly and many edits are needed as it evolves, it is important to be able to actively test the analysis steps while writing the project's source files.
-Here we'll review some common methods that are currently used.
+Here we review some common methods that are currently used.
\subsubsection{Text editors}
The most basic way to edit text files is through simple text editors which just allow viewing and editing such files, for example \inlinecode{gedit} on the GNOME graphic user interface.