path: root/paper.tex
author	Raul Infante-Sainz <infantesainz@gmail.com>	2020-04-09 15:33:48 +0100
committer	Raul Infante-Sainz <infantesainz@gmail.com>	2020-04-09 15:33:48 +0100
commit	d97aaedb48a221e8bb5290acc71b76289df0904f (patch)
tree	1ec701e20922ebadfe14b25a072c7f3abcd653be /paper.tex
parent	1e07a59513b1774ca8b0561cc30f6fd2515afef9 (diff)
Minor typos and spelling corrections in Introduction
With this commit, minor typos have been corrected in the Introduction section. Most of them are small corrections; others avoid contractions ("did not" instead of "didn't", and so on). Other modifications remove small portions of some phrases to make them more focused.
Diffstat (limited to 'paper.tex')
-rw-r--r--	paper.tex	34
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/paper.tex b/paper.tex
index bc371a1..9e057cd 100644
--- a/paper.tex
+++ b/paper.tex
@@ -78,11 +78,11 @@
The increasing volume and complexity of data analysis have been highly productive, giving rise to the new branch of ``Big Data'' in many fields of science and industry.
However, given this inherent complexity, the results alone are barely useful.
-Questions such as these commonly follow any such result: What inputs were used? What operations were done on those inputs? How were the configurations or training data chosen? How did the quantiative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation (may there be a bias in the visualization)? See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
+Questions such as these commonly follow any such result: What inputs were used? What operations were done on those inputs? How were the configurations or training data chosen? How did the quantitative results get visualized into the final demonstration plots, figures or narrative/qualitative interpretation? See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow.
In data science and database management, this type of metadata is commonly known as \emph{data provenance} or \emph{data lineage}.
Their definitions, along with other basic concepts, are elaborated in Section \ref{sec:definitions}.
-Data lineage is being increasingly demaded for integrity checking from both the scientific and industrial/legal domains.
+Data lineage is being increasingly demanded for integrity checking from both the scientific and industrial/legal domains.
Notable examples are the ``Reproducibility crisis'' in the sciences, as claimed by the journal Nature \citep{baker16}, and, in the industrial/legal domain, the General Data Protection Regulation (GDPR) of the European Parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively.
The former argues that reproducibility (as a test of sufficiently conveying the data lineage) is necessary for other scientists to study, check and build upon each other's work.
The latter requires the data intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage.
@@ -92,7 +92,7 @@ In the sciences, the results of a project's analysis are published as scientific
In our own experience, this section is usually the most discussed during peer review and conference presentations, showing its importance.
After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology.
In industry, however, data governance is usually kept as a trade secret and is not publicly published or scrutinized.
-Therefore while the proposed approach introduced in this paper (Maneage) is also useful in industrial contexts, the main practical focus will be in the scientific front which has traditionally been more open to publishing the methods and anonymous peer scrutiny.
+Therefore, while the proposed approach introduced in this paper (Maneage) is also useful in industrial contexts, the main practical focus would be on the scientific front, which has traditionally been more open to publishing methods and to anonymous peer scrutiny.
\begin{figure}[t]
\begin{center}
@@ -107,29 +107,29 @@ Therefore while the proposed approach introduced in this paper (Maneage) is also
\end{figure}
The traditional format of a scientific paper has been very successful in conveying the method together with the result over the last centuries.
-However, the complexity mentioned above has made it impossible to describe all the analytical steps of a project to a sufficient level of detail, in the traditional format of a published paper.
+However, the complexity mentioned above has made it impossible to describe all the analytical steps of a project to a sufficient level of detail.
Citing this difficulty, many authors limit themselves to describing only the very high-level generalities of their analysis, even though the most basic calculations (like the mean of a distribution) can depend on the software implementation.
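As a minimal, hypothetical sketch of this implementation-dependence (not taken from any published analysis; the array and seed below are arbitrary), even the mean of a single array can come out differently from a naive single-precision loop, NumPy's built-in pairwise summation, and a double-precision computation:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(1_000_000).astype(np.float32)

# Naive left-to-right accumulation in single precision:
s = np.float32(0.0)
for v in x:
    s = np.float32(s + v)
print("naive float32 mean:", s / np.float32(x.size))

# np.mean on float32 input also accumulates in float32, but with
# pairwise summation, which loses less precision than the loop:
print("pairwise mean     :", x.mean())

# Promoting to double precision first gives yet another value:
print("float64 mean      :", x.astype(np.float64).mean())
\end{verbatim}
All three values are ``the mean'', and a narrative description would report them identically; only the published code reveals which one a result actually used.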
Due to the complexity of modern scientific analysis, a small deviation in the final result can originate from any of many different steps, and that deviation may be significant.
Publishing the precise codes of the analysis is the only guarantee.
For example, \citet{smart18} describes how a 7-year-old conflict in theoretical condensed matter physics was only identified after the relevant codes were shared.
Nature is already a black box that we are trying hard to unlock and understand.
-Not being able to experiment on the methods of other researchers is an artificial and self-imposed back box, wrapped over the original, and taking most of the energy of fellow researchers.
+Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers.
-\citet{miller06} found that a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science.
-\citet{baggerly09} highlighted the inadequet narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
+\citet{miller06} found that a mistaken column flipping caused the retraction of 5 papers in major journals, including Science.
+\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''.
\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions.
Such integrity checks are a critical component of the scientific method, but are only possible with access to the data and codes.
The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (for example, from a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors?
-Several studies have attempted to answer this with differnet levels of detail.
-For example \citet{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved over the last two decades.
+Several studies have attempted to answer this question with different levels of detail.
+For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved over the last two decades.
-\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but only fully succeeded in 2 of them and partially in 6.
-\citet{chang15} attempted to reproduce 67 papers in well-regarded economics journals with data and code: only 22 could be reproduced without contacting authors, more than half couldn't be replicated at all.
+\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but only fully succeeded in 2 of them and partially in 6.
+\citet{chang15} attempted to reproduce 67 papers in well-regarded economics journals with data and code: only 22 could be reproduced without contacting the authors, and more than half could not be replicated at all.
\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
Even though the authors were contacted, the success rate was $26\%$.
-Generally, this problem is unambiguously felt in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ didn't see a ``reproducibility crisis''.
+Generally, this problem is unambiguously felt in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``reproducibility crisis''.
This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}.
The proposed solutions were published in a special edition.
@@ -138,20 +138,20 @@ Before that, \citet{ioannidis05} proved that ``most claimed research findings ar
In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} described this same problem very eloquently and also presented the solutions they themselves used.
While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists.
Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how the distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
-One of the earliest such efforts we found was \citet{roberts69} who discussed conventions in Fortran programming and documentation to help in publishing research codes.
+One of the earliest such efforts we found was \citet{roberts69}, who discussed conventions in Fortran programming and documentation to help in publishing research codes.
From a practical point of view, for those who publish their data lineage, a major problem is the fast-evolving and diverse array of software technologies and methodologies used by different teams in different epochs.
\citet{zhao12} describe this as ``workflow decay'' and recommend preserving these auxiliary resources.
-But in the case of software its not as streightforward as data: if preserved in binary form, software can only be run on certain hardware and if kept as source-code, their build dependencies and build configuration must also be preserved.
+But in the case of software, it is not as straightforward as with data: if preserved in binary form, software can only be run on specific hardware, and if kept as source code, its build dependencies and build configuration must also be preserved.
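As a hypothetical sketch of what preserving source-code build instructions can involve (the URL and checksum below are placeholders, and this is not Maneage's actual mechanism), a project can record the exact source tarball of each dependency and verify it before building:
\begin{verbatim}
# Hypothetical example: URL and checksum are placeholders.
import hashlib, urllib.request

URL = "https://example.org/src/tool-1.2.3.tar.gz"
SHA256 = "0000...0000"  # checksum recorded when the project was built

data = urllib.request.urlopen(URL).read()
if hashlib.sha256(data).hexdigest() != SHA256:
    raise SystemExit("checksum mismatch: not the recorded source")
# ...unpack and build with the recorded configuration flags...
\end{verbatim}
Recording the recipe this way keeps a build reproducible in principle, but only for as long as the source remains retrievable and buildable, which is exactly the decay problem described above.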
\citet{gronenschild12} specifically study the effect of software version and environment, and encourage researchers not to update their software environment.
-However, this is not a practical solution because software updates are necessary, atleast to fix bugs in the same research software.
+However, this is not a practical solution because software updates are necessary, at least to fix bugs in the same research software.
Generally, software is not an interchangeable component of a project, in which one package can easily be swapped with another.
Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citep{dicosmo19}.
This paper introduces Maneage as a solution to these important issues.
-Section \ref{sec:definitions} defines the necessay concepts and terminology used in this paper leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}.
+Section \ref{sec:definitions} defines the necessary concepts and terminology used in this paper, leading to a discussion of the guiding principles in Section \ref{sec:principles}.
Section \ref{sec:maneage} introduces the implementation of Maneage, going into lower-level details in some cases.
-Finally in Section \ref{sec:discussion} the future prospects of using systems like this template are discussed.
+Finally, in Section \ref{sec:discussion}, the future prospects of using systems like this template are discussed.
After the main body, Appendix \ref{appendix:existingtools} reviews the lower-level technologies most commonly used today.
In light of the guiding principles, Appendix \ref{appendix:existingsolutions} then gives a critical review of many of the workflow management systems introduced over the last three decades.
Finally, in Appendix \ref{appendix:softwareacknowledge} we acknowledge the various software (with names and version numbers) used for this project.