-rw-r--r--  paper.tex               | 301
-rw-r--r--  tex/src/references.tex  |  36
2 files changed, 219 insertions, 118 deletions
@@ -73,127 +73,97 @@ \section{Introduction} \label{sec:introduction} -In the last two decades, several technological advancements have profoundly affected how science is being done: much improved processing power, storage capacity and internet connections, combined with larger public datasets and more robust free and open source software solutions. -Given these powerful tools, scientists are using ever more complex processing steps and datasets, often composed of mixing many different software components in a single project. -For example see the almost complete list of the necessary software in Appendix A of the following two research papers: \citet{akhlaghi19} and \citet{infante20}. - -The increased complexity of scientific analysis has made it impossible to describe all the analytical steps of a project in the traditional format of a published paper, to a sufficient level of detail. -Citing this difficulty, many authors suffice to describing the very high-level generalities of their analysis. -However, even the most basic calculations (like the mean of a distribution) can depend on the software implementation. -Therefore, even if the raw collected data are published with the paper, it is very hard, and expensitve, to study the validity/integrity of a result because of incomplete metadata. - -Due to the complexity of modern scientific analysis, a small deviation in the final result can be due to many different steps, which may be significant in their own intermediate steps. -This makes it critically important to share the research steps that go into the analysis to finest possible detail. -Attempts to reproduce an incomplete reporting is simply too expensive for anyone (even for the original authors, one year after publication). -For example, \citet{smart18} describes how a 7-year old conflict in theoretical condensed matter was only identified after the relative codes were shared. - -The energy/cost to independently repeat the mistakes of other researchers, is a waste of precious scientific funding, and public trust in the sciences. -It is therefore a critical component of the current Big data era. -Nature is already a black box which we are trying hard to unlock, or understand. -Not being able to experiment on the methods of other researchers is an artificial, self-imposed back box wrapped over the original. - -The completeness of a paper's metadata can be measured by a simple question: given the same input datasets, can another researcher reproduce the exact same result automatically (without needing to contact the authors)? -Several studies have actually attempted to answer this with differnet levels of detail. -For example \citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science after that journal adopted a policy of publishing the data and code associated with the papers. -Even though the authors were contacted, the success rate was $26\%$, concluding that policy change along is insufficient. -\citet{allen18} \tonote{Add a short summary of its results}. -\citet{zhao12} study ``workflow decay'' in papers using Taverna workflows (Appendix \ref{appendix:taverna}). \tonote{Review some of their major results} - -This problem is also generally felt in the community, \citet{baker16} found that $52\%$ and $38\%$ of the 1576 researchers surveyed, respectively acknowledged ``a significant crisis'' and ``a slight crisis'' regarding the reproducibility of scientific results. -Only $3\%$ believed that there is no reproducibility crisis. 
-It must be added that this is not a recent problem, it was also strongly felt in the previous decades. -For example \citet{baggerly09} complaining about inadequet narrative description of the analysis and showing the prevalence of simple errors, calling their work ``forensic bioinformatics''. -Even earlier, \citet{ioannidis05} prove that ``most claimed research findings are false''. - -Given the scale of the problem, a committee of the National Academy of Sciences was asked to assess its impact by the USA National Science Foundation (NSF, asked by the USA congress). -The results were recently published by \citet{fineberg19} and provide a good review of the status. -That committee doesn't recognize a ``crisis'', but the importance is stressed along with definitions (see Section \ref{sec:definitions}) and proposals. -Earlier in 2011, the Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}, and to the best working solutions (at that time) were recognized. -Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then. +The increasing volume and complexity of data analysis have been highly productive, giving rise to a new branch of ``Big Data'' in many fields of the sciences and industry. +However, given its inherent complexity, the mere results are barely useful alone. +Questions such as these commonly follow any such result: What inputs were used? What operations were done on those inputs? How were the configurations or training data chosen? How were the quantitative results visualized into the final demonstration plots, figures or narrative/qualitative interpretation (could there be a bias in the visualization)? See Figure \ref{fig:questions} for a more detailed visual representation of such questions for various stages of the workflow. + +In data science and database management, this type of metadata is commonly known as \emph{data provenance} or \emph{data lineage}. +Their definitions are elaborated with other basic concepts in Section \ref{sec:definitions}. +Data lineage is increasingly being demanded for integrity checking by both the scientific and industrial/legal domains. +Notable examples in each domain are, respectively, the ``reproducibility crisis'' in the sciences highlighted by the journal Nature \citep{baker16}, and the General Data Protection Regulation (GDPR) by the European parliament and the California Consumer Privacy Act (CCPA), implemented in 2018 and 2020 respectively. +The former argues that reproducibility (as a test of sufficiently conveying the data lineage) is necessary for other scientists to study, check and build upon each other's work. +The latter requires the data-intensive industry to give individual users control over their data, effectively requiring thorough management and knowledge of the data's lineage. +Besides regulation and integrity checks, having robust data governance (management of data lineage) in a project can be very productive: it enables easy debugging, experimentation on alternative methods, or optimization of the workflow. + +In the sciences, the results of a project's analysis are published as scientific papers, which have also been the primary conveyor of the result's lineage: usually in narrative form, within the ``Methods'' section of the paper. +From our own experience, this section is usually the most discussed during peer review and conference presentations, showing its importance.
+After all, a result is defined as ``scientific'' based on its \emph{method} (the ``scientific method''), or lineage in data-science terminology. +In industry, however, data governance is usually kept as a trade secret and isn't publicly published or scrutinized. +Therefore, while the proposed approach introduced in this paper (Maneage) is also useful in industrial contexts, the main practical focus will be on the scientific front, which has traditionally been more open to publishing methods and to anonymous peer scrutiny. \begin{figure}[t] \begin{center} \includetikz{figure-project-outline} \end{center} \vspace{-17mm} - \caption{Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step. + \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step. The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs. The red boxes with dashed edges highlight the main questions on the respective stage. The orange box surrounding the software download and build phases shows the various commonly recognized solutions to the questions in it; for more, see Appendix \ref{appendix:jobmanagement}. } \end{figure} -Modern analysis tools are almost entirely implemented as software packages. -This has lead many scientists to adopt solutions that software developers use for reproducing software (for example to fix bugs, or avoid security issues). -These tools and how they are used are thorougly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}. -However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering. -This difference can be broken up into the following categories, which are described more fully below: -1) Reading vs. executing, 2) Archiving how software is used and 3) Citation of the software/methods used for scientific credit. +The traditional format of a scientific paper has been very successful in conveying the method along with the result over the last centuries. +However, the complexity mentioned above has made it impossible to describe all the analytical steps of a project to a sufficient level of detail in the traditional format of a published paper. +Citing this difficulty, many authors limit themselves to describing the very high-level generalities of their analysis, while even the most basic calculations (like the mean of a distribution) can depend on the software implementation. + +Due to the complexity of modern scientific analysis, a small deviation in the final result can be due to many different steps, which may be significant. +Publishing the precise codes of the analysis is the only guarantee. +For example, \citet{smart18} describes how a 7-year-old conflict in theoretical condensed matter physics was only identified after the relevant codes were shared. +Nature is already a black box which we are trying hard to unlock, or understand. +Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and one that consumes much of the energy of fellow researchers. + +\citet{miller06} reported that a mistaken column flip led to the retraction of 5 papers in major journals, including Science.
+\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''. +\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations, and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions. +Such integrity checks are a critical component of the scientific method, but are only possible with access to the data and codes. + +The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (supposedly on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors? +Several studies have attempted to answer this with different levels of detail. +For example, \citet{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \citet{menke20} found that the fraction of papers explicitly mentioning their tools/software has greatly improved over the last two decades. + +In \citet{ioannidis2009}, two independent groups attempted to reproduce 18 published results, but only fully succeeded in 2 of them and partially in 6. +\citet{chang15} attempted to reproduce 67 papers in well-regarded economics journals with data and code: only 22 could be reproduced without contacting authors, and more than half couldn't be replicated at all. +\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers. +Even though the authors were contacted, the success rate was $26\%$. +Generally, this problem is unambiguously felt in the community: \citet{baker16} surveyed 1576 researchers and found that only $3\%$ didn't see a ``reproducibility crisis''. + +This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}. +The proposed solutions were published in a special edition. +Some of them are reviewed in Appendix \ref{appendix:existingsolutions}, but most have not been continued since then. +Before that, \citet{ioannidis05} argued that ``most claimed research findings are false''. +In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} described this same problem very eloquently and also provided some solutions that they used. +While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists. +Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how the distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results. +One of the earliest such efforts we found was \citet{roberts69}, who discussed conventions in Fortran programming and documentation to help in publishing research codes. + +From a practical point of view, for those who publish the data lineage, a major problem is the fast-evolving and diverse software technologies and methodologies that are used by different teams in different epochs. +\citet{zhao12} describe this as ``workflow decay'' and recommend preserving these auxiliary resources.
+But in the case of software it is not as straightforward as for data: if preserved in binary form, software can only be run on certain hardware, and if kept as source code, its build dependencies and build configuration must also be preserved. +\citet{gronenschild12} specifically study the effect of software version and environment and encourage researchers not to update their software environment. +However, this is not a practical solution because software updates are necessary, at least to fix bugs in the same research software. +Generally, software is not an interchangeable component of a project, where one package can easily be swapped with another. +Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citep{dicosmo19}. + +This paper introduces Maneage as a solution to these important issues. +Section \ref{sec:definitions} defines the necessary concepts and terminology used in this paper, leading to a discussion of the necessary guiding principles in Section \ref{sec:principles}. +Section \ref{sec:maneage} introduces the implementation of Maneage, going into lower-level details in some cases. +Finally, in Section \ref{sec:discussion}, the future prospects of using systems like this template are discussed. +After the main body, Appendix \ref{appendix:existingtools} reviews the lower-level technologies most commonly used today. +In light of the guiding principles, Appendix \ref{appendix:existingsolutions} gives a critical review of many workflow management systems that have been introduced over the last three decades. +Finally, in Appendix \ref{appendix:softwareacknowledge} we acknowledge the various software (with name and version number) used for this project. + + + + -The first difference is because in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient). -For a scientist it is more important to read/study a method of a paper that is 1, 10, or 100 years old. -The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another random scientist to execute. -Another scientist just needs to be assured that the commands they are reading is exactly what was (and can potentially be) executed. -On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because they are done increasingly by pre-written software programs and libraries. -Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard gained experience. -Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citep{dicosmo18} do a wonderful job at long term preservation and archival of the software source code. -However, preservation of the software's raw code is only part of the process, it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, in conjunction with which other software tools and etc. -The third major difference was scientific credit, which is measured in units of citations, not dollars. -As described above, scientific software are playing an increasingly important role in modern science.
-Because of the domain-specific knowledge necessary to produce such software, they are mostly written by scientists for scientists. -Therefore a significant amount of effort and research funding has gone into producing scientific software. -Atleast for the software that do have an accompanying paper, it is thus important that those papers be cited when they are used. -Similar community concerns on the importance of metadata in research products have lead to the wide adoption of the FAIR principles (Findable, Accessible, Interoperable, Reusable) for data management and stewardship \citep{wilkinson16}. -These are very good generic guidelines that don't go into any implementation details. -\tonote{Discuss this and other similar attempts in a little more detail.} - -The importance of publishing processing source code, and allowing for critical analysis of the methods, along with scientific paper is not a recent problem. -For example \citet{roberts69} discussed conventions in Fortran programming and documentation to help in publishing research codes. -%\citet{anscombe73} showed how the distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results. -\citet[Geophysicists]{claerbout1992} is the first paper we have found that discusses the issue directly in the same sense as this paper: a scientific paper must be accompanied by the code that generated it's results, they describe a model they had started from 1990 and used in a PhD dissertation and 6 other documents, with nearly a thousand reproducible plots. -The high-level analysis orchestration was organized through Cake ( which were distributed in CD-ROMs along with analysis code, text and all non-proprietary software (including \LaTeX{}) . -It later inspired \citet{buckheit1995} to publish a reproducible paper (in Matlab). -\tonote{Find some other historical examples.} - -In this paper, a solution to this problem is introduced that attemps to address the problems above and has already been used in scientific papers. -The primordial implementation of this system was in \citet{akhlaghi15} which described a new detection algorithm in astronomical image processing. -The detection algorithm was developed as the paper (initially a small report!) was being written. -An automated sequence of commands to build the figures, and update the paper/report was a practical necessity as the algorithm was evolving. -In particular, it didn't just reproduce figures, it also used \LaTeX{} macros to update numbers printed within the text. -Finally, since the full analysis pipeline was in plain-text and roughly 100kb (much less than a single figure), it was uploaded to arXiv with the paper's \LaTeX{} source, under a \inlinecode{reproduce/} directory, see \href{https://arxiv.org/abs/1505.01664}{arXiv:1505.01664}\footnote{ - To download the \LaTeX{} source of any arXiv paper, click on the ``Other formats'' link, containing necessary instructions and links.}. -The system later evolved in \citet{bacon17}, in particular the two sections of that paper that were done by M.A (first author of this paper): \citet[\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}]{akhlaghi18a} and \citet[\href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}]{akhlaghi18b}. -With these projects, the core skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects. 
-The template later matured by including installation of all necessary software from source and used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}. -The short historical review above highlights how this template was created by practicing scientists, and has evolved based on the needs of real scientific projects and working scenarios. -In Section \ref{sec:definitions}, the problem that is addressed by this template is clearly defined and Section \ref{appendix:existingsolutions} reviews some existing solutions and their pros and cons with respect to reproducibility in a scientific framework. -Section \ref{sec:template} introduces the template, and its approach to the problem, with full design details. -Finally in Section \ref{sec:discussion} the future prospects of using systems like this template are discussed. -\begin{itemize} -\item \citep{claerbout1992,schwab2000}: These papers describe the practical need very nicely. -\item \citet{herndon14}: 1) simple typos (from the original spreadsheets, they found typos in number of rows). 2) Importance of releasing data (by the study they reproduced). -\item \citet{ziemann16}: one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions. -\item \citet{ioannidis2009}: Two teams attempts to independently replicate results from 18 articles. - Two were successful, six were partial and ten weren't replicated. - The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. -\item \citet{miller06}: an incorrect column filliping in a custom analysis caused the retraction of 5 papers in major journals (including Science). -\item \citet{gronenschild12}: effect of software version and environment on scientific results: encouraging researchers to not update environment. -\item \citet{chang15}: 67 studies in well-regarded economics journals with data and code. Only 22 could be reproduced without contacting authors, more than half couldn't be replicated at all (they use ``replicate''). -\item \citet{horvath15}: errartum, describing the effect of a software mistake on result. -\item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}. -\item \citet{menke20} on the ``Rigor and Transparency Index'', in particular showing how practices have improved but not enough. - Also, how software identifability has seen the best improvement. -\item \citet{dicosmo19} summarize the special place of software in modern science very nicely: ``Software is a hybrid object in the world research as it is equally a driving force (as a tool), a result (as proof of the existence of a solution) and an object of study (as an artefact)''. -\item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials} -\item Nice paper about software citation: \url{https://doi.org/10.1109/MCSE.2019.2963148}. -\end{itemize} @@ -262,6 +232,30 @@ For example modules in Python, packages in R, or libraries/programs in C/C++ tha +\subsection{Definition: data provenance} +\label{definition:provenance} + +Data provenance is a very generic term which points to slightly different technical concepts in different fields like databases, storage systems and scientific workflows. 
+For example, within a relational database, an SQL query connects a subset of the database entries to the output (\emph{why-}provenance), their more detailed dependency (\emph{how-}provenance) and the precise location of the input sources (\emph{where-}provenance); for more, see \citet{cheney09}. +In scientific workflows, provenance goes beyond a single database and its datasets: it may include many databases that aren't directly linked, the higher-level project-specific analysis that is done on the data, and the link between the analysis and the text of the paper; for example, see \citet{bavoil05, moreau08, malik13}. + +Here, we define provenance to be the common factor of the usages above: a dataset's provenance is the set of metadata (in any ontology, standard or structure) that connects it to the components (other datasets or scripts) that produced it. +Data provenance thus provides a high-level view of the data's genealogy. + +\subsection{Definition: data lineage} +\label{definition:lineage} + +% This definition is inspired from https://stackoverflow.com/questions/43383197/what-are-the-differences-between-data-lineage-and-data-provenance: + +% "data provenance includes only high level view of the system for business users, so they can roughly navigate where their data come from. +% It's provided by variety of modeling tools or just simple custom tables and charts. +% Data lineage is a more specific term and includes two sides - business (data) lineage and technical (data) lineage. +% Business lineage pictures data flows on a business-term level and it's provided by solutions like Collibra, Alation and many others. +% Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level - actual tables, scripts and statements. +% Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. " +Data lineage is commonly used interchangeably with data provenance \citep[for example][\tonote{among many others, just search ``data lineage'' in scholar.google.com}]{cheney09}. +However, for clarity, in this paper we use the term data lineage for a low-level and fine-grained record of the data's genealogy, down to the exact command that produced each intermediate step. \subsection{Definition: reproducibility \& replicability} \label{definition:reproduction} @@ -315,6 +309,11 @@ Therfore our inputs are team-agnostic, allowing us to safely ignore ``repeatabil + + + + + \section{Principles of the proposed solution} \label{sec:principles} @@ -342,7 +341,7 @@ In terms of interfaces, wrappers can be written over this core skeleton for vari \subsection{Principle: Complete/Self-contained} \label{principle:complete} -A project should be self-contained, needing no particular features from the host operating system, and not affecting it. +A project should be self-contained, needing no particular features from the host operating system (OS), and not affecting the host OS. At build-time (when the project is building its necessary tools), the project shouldn't need anything beyond a minimal POSIX environment on the host, which is available in Unix-like operating systems like GNU/Linux, BSD-based systems or macOS. At run-time (after the environment/software are built), it should not use or affect any host operating system programs or libraries.
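To make the fine-grained sense of data lineage used above more concrete, the following is a minimal sketch in Make (the job-orchestration language of the system described later in this paper). All file names here (\inlinecode{data/input.txt}, \inlinecode{src/process.sh}, \inlinecode{out/stats.txt}, \inlinecode{out/fig-hist.pdf}) are hypothetical and are not part of Maneage itself; each rule simply records an output, the inputs it depends on, and the exact command that produces it.

    # Minimal sketch of fine-grained lineage (hypothetical file names,
    # not Maneage's actual rules). Recipe lines must start with a TAB.
    out/stats.txt: data/input.txt src/process.sh
    	sh src/process.sh data/input.txt > out/stats.txt

    out/fig-hist.pdf: out/stats.txt src/plot.sh
    	sh src/plot.sh out/stats.txt out/fig-hist.pdf

Because such rules are plain text, they can be version-controlled and read directly long after publication, even if they are never executed again; under the assumptions of this sketch, nothing beyond a minimal POSIX environment (here, \inlinecode{make} and \inlinecode{sh}) is needed from the host, in the spirit of the completeness principle above.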
@@ -368,6 +367,13 @@ This principle has several important consequences: The first two components are particularly important for high performance computing (HPC) facilities: for security reasons, HPC users commonly don't have privileged permissions or internet access. +A complete project as defined here is much less exposed to ``workflow decay'' as defined by \citet{zhao12} (in particular under their missing execution environment tests). +As recommended by \citet{zhao12}, a complete project automatically builds all its necessary third-party tools; it doesn't just assume their existence. +Ultimately, the executability of a project will decay once the host Linux kernel inevitably evolves such that the project's fixed version of the GNU C Library and GNU C Compiler can't be built. +This will happen on much longer time scales than for the high-level software mentioned in \citet{zhao12} and can be fixed by changing the project's GNU C Library and GNU C Compiler to versions that are buildable with the host kernel. +These are very low-level components and any possible change in the output should be minimal. +Ultimately, after multiple decades, even that may not be possible; but even at that point, thanks to the plain-text principle (Section \ref{principle:text}), the project can still be studied, without necessarily executing it. + @@ -408,7 +414,7 @@ Binary formats will complicate various aspects of the project: its usage, archiv This is a critical principle for long-term preservation and portability: when the software to read a binary format has been deprecated or become obsolete and isn't installable on the running system, the project will not be readable/usable any more. A project that is solely in plain text format can be put under version control as it evolves, with easy tracking of changed parts, using already available and mature tools in software development: software source code is also in plain text. -After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18}, enabling future projects to easily build ontop of old ones, or cite specific parts of a project. +After publication, independent modules of a plain-text project can be used and cited through services like Software Heritage \citep{dicosmo18,dicosmo20}, enabling future projects to easily build on top of old ones, or cite specific parts of a project. Archiving a binary version of the project is like archiving a well-cooked dish itself, which will be inedible with changes in hardware (temperature, humidity, and the natural world in general). But archiving the dish's recipe (which is also in plain text!) means you can re-cook it any time. @@ -535,21 +541,14 @@ This is because software freedom as an important pillar for the sciences as show -\section{Reproducible paper template} -\label{sec:template} +\section{Implementation of Maneage} +\label{sec:maneage} The proposed solution is an implementation of the principles discussed in Section \ref{sec:principles}: it is complete and automatic (Section \ref{principle:complete}), modular (Section \ref{principle:modularity}), fully in plain text (Section \ref{principle:text}), having minimal complexity (see Section \ref{principle:complexity}), with automatically verifiable inputs \& outputs (Section \ref{principle:verify}), preserving temporal provenance, or project evolution (Section \ref{principle:history}), and finally, it is free software (Section \ref{principle:freesoftware}).
In practice, it is a collection of plain-text files that are distributed in pre-defined sub-directories by context, and are all under version control (currently with Git). In its raw form (before customizing for different projects), it is a fully working skeleton of a project without much flesh: containing all the low-level infrastructure, with just a small demonstrative ``delete-me'' analysis. To start a new project, users will \emph{clone}\footnote{In Git, ``clone''ing is the process of copying all the project's files and their history into the host system.} the core skeleton, create their own Git branch, and start customizing the core files (adding their high-level analysis steps, scripts to generate figures and narrative) within their custom branch. -Because of this, we also refer to the proposed system as a ``template''. - -Before going into the details, it is important to note that as with any software, the template core architecture will inevitably evolve after the publication of this paper. -We already have roughly 30 tasks that are left for the future and will affect various high-level phases of the project as described here. -However, the core of the system has been used and become stable enough already and we don't see any major change in the core methodology in the near future. -A list of the notable changes after the publication of this paper will be kept in in the project's \inlinecode{README-hacking.md} file. -Once the improvements become substantial, new paper(s) will be written to complement or replace this one. In this section we will review the current implementation of the reproducible paper template. Generally, job orchestration is implemented in Make (a POSIX tool); this choice is elaborated in Section \ref{sec:usingmake}. @@ -1255,9 +1254,6 @@ This style of managing project parameters therefore produces a much more healthy -\subsubsection{The validation} -\label{sec:thevalidation} - \subsubsection{Building the paper} \label{sec:buildingpaper} @@ -1267,6 +1263,47 @@ This style of managing project parameters therefore produces a much more healthy \end{itemize} +\subsection{Future work and history} +\label{sec:futureworkx} +As with any software, the core architecture of Maneage will inevitably evolve after the publication of this paper. +The current version introduced here has already experienced 5 years of evolution and several reincarnations. +Its primordial implementation was written for \citet{akhlaghi15}. +That paper described a new detection algorithm in astronomical image processing. +The detection algorithm was developed as the paper was being written (initially a small report!). +An automated sequence of commands to build the figures and update the paper/report was a practical necessity as the algorithm was evolving. +In particular, it didn't just reproduce figures; it also used \LaTeX{} macros to update numbers printed within the text. +Finally, since the full analysis pipeline was in plain text and roughly 100 kilobytes (much less than a single figure), it was uploaded to arXiv with the paper's \LaTeX{} source, under a \inlinecode{reproduce/} directory, see \href{https://arxiv.org/abs/1505.01664}{arXiv:1505.01664}\footnote{ To download the \LaTeX{} source of any arXiv paper, click on the ``Other formats'' link, containing necessary instructions and links.}. + +The system later evolved in \citet{bacon17}, in particular the two sections of that paper that were done by M.
Akhlaghi (first author of this paper): \citet[\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}]{akhlaghi18a} and \citet[\href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}]{akhlaghi18b}. +With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects. +The template later matured by including installation of all necessary software from source, and was used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}. +The short historical review above highlights how this template was created by practicing scientists, and has evolved and matured significantly. + +We already have roughly 30 tasks that are left for the future and will affect various high-level phases of the project as described here. +However, the core of the system has been used and become stable enough already and we don't see any major change in the core methodology in the near future. +A list of the notable changes after the publication of this paper will be kept in the project's \inlinecode{README-hacking.md} file. +Once the improvements become substantial, new paper(s) will be written to complement or replace this one. + + + + + + + + + + + + + + + + + + + + \section{Discussion} \label{sec:discussion} @@ -1294,6 +1331,13 @@ This style of managing project parameters therefore produces a much more healthy In this system, because of automatic verification of inputs and outputs, no technical knowledge is necessary for the verification. \item \citet{miksa19b} Machine-actionable data management plans (maDMPs) embedded in workflows, allowing \item \citet{miksa19a} RDA recommendation on maDMPs. +\item FAIR Principles \citep{wilkinson16}. +\item \citet{cheney09}: ``In both data warehouses and curated databases, tremendous (\emph{and often manual}) effort is usually expended in the construction'' +\item \url{https://arxiv.org/pdf/2001.11506.pdf} +\item Apache NiFi for automated data flow. +\item \url{https://arxiv.org/pdf/2003.04915.pdf}: how data lineage can help machine learning. +\item Interesting patent on ``documenting data lineage'': \url{https://patentimages.storage.googleapis.com/c0/51/6e/1f3af366cd73b1/US10481961.pdf} +\item Automated data lineage extractor: \url{http://hdl.handle.net/20.500.11956/110206}. \end{itemize} @@ -1343,8 +1387,6 @@ Some existing solution to for managing the different parts of a reproducible wor - - \subsection{Independent environment} \label{appendix:independentenvironment} @@ -2292,6 +2334,31 @@ Furthermore, the fact that a Tale is stored as a binary Docker container causes \item \citet{becker17} discusses reproducibility methods in R. \item Elsevier Executable Paper Grand Challenge\footnote{\url{https://shar.es/a3dgl2}} \citep{gabriel11}. \item \citet{menke20} show how software identifiability has seen the best improvement, so there is hope! + \item Nature's collection on papers about reproducibility: \url{https://www.nature.com/collections/prbfkwmwvz}. + \item Nice links for applying FAIR principles in research software: \url{https://www.rd-alliance.org/group/software-source-code-ig/wiki/fair4software-reading-materials} + \item +Modern analysis tools are almost entirely implemented as software packages. +This has led many scientists to adopt solutions that software developers use for reproducing software (for example to fix bugs, or avoid security issues).
+These tools and how they are used are thoroughly reviewed in Appendices \ref{appendix:existingtools} and \ref{appendix:existingsolutions}. +However, the problem of reproducibility in the sciences is more complicated and subtle than that of software engineering. +This difference can be broken up into the following categories, which are described more fully below: +1) Reading vs. executing, 2) Archiving how software is used, and 3) Citation of the software/methods used for scientific credit. + +The first difference arises because, in the sciences, reproducibility is not merely a problem of re-running a research project (where a binary blob like a container or virtual machine is sufficient). +For a scientist, it is more important to read/study the method of a paper that is 1, 10, or 100 years old. +The hardware to execute the code may have become obsolete, or it may require too much processing power, storage, or time for another random scientist to execute. +Another scientist just needs to be assured that the commands they are reading are exactly what was (and can potentially be) executed. + +On the second point, scientists are devoting a smaller fraction of their papers to the technical aspects of the work because they are done increasingly by pre-written software programs and libraries. +Therefore, scientific papers are no longer a complete repository for preserving and archiving very important aspects of the scientific endeavor and hard-gained experience. +Attempts such as Software Heritage\footnote{\url{https://www.softwareheritage.org}} \citep{dicosmo18} do a wonderful job at long-term preservation and archival of the software source code. +However, preservation of the software's raw code is only part of the process; it is also critically important to preserve how the software was used: with what configuration or run-time options, for what kinds of problems, in conjunction with which other software tools, etc. + +The third major difference is scientific credit, which is measured in units of citations, not dollars. +As described above, scientific software is playing an increasingly important role in modern science. +Because of the domain-specific knowledge necessary to produce such software, it is mostly written by scientists for scientists. +Therefore, a significant amount of effort and research funding has gone into producing scientific software. +At least for the software that does have an accompanying paper, it is thus important that those papers be cited when the software is used.
\end{itemize} diff --git a/tex/src/references.tex b/tex/src/references.tex index f4dee1d..6e1de41 100644 --- a/tex/src/references.tex +++ b/tex/src/references.tex @@ -1,3 +1,23 @@ +@ARTICLE{dicosmo20, + author = {{Di Cosmo}, Roberto and {Gruenpeter}, Morane and {Zacchiroli}, Stefano}, + title = "{Referencing Source Code Artifacts: a Separate Concern in Software Citation}", + journal = {Computing in Science \& Engineering}, + year = 2020, + volume = 22, + eid = {arXiv:2001.08647}, + pages = {33}, +archivePrefix = {arXiv}, + eprint = {2001.08647}, + primaryClass = {cs.DL}, + doi = {10.1109/MCSE.2019.2963148}, + adsurl = {https://ui.adsabs.harvard.edu/abs/2020arXiv200108647D}, + adsnote = {Provided by the SAO/NASA Astrophysics Data System} +} + + + + @ARTICLE{menke20, author = {Joe Menke and Martijn Roelandse and Burak Ozyurt and Maryann Martone and Anita Bandrowski}, title = {Rigor and Transparency Index, a new metric of quality for assessing biological and medical science methods}, @@ -1324,6 +1344,20 @@ Reproducible Research in Image Processing}, +@ARTICLE{cheney09, + author = {James Cheney and Laura Chiticariu and Wang-Chiew Tan}, + title = {Provenance in Databases: Why, How, and Where}, + journal = {Foundations and Trends in Databases}, + year = {2009}, + volume = {1}, + pages = {379}, + doi = {10.1561/1900000006}, +} + + + + + @ARTICLE{ioannidis2009, author = {John P. A. Ioannidis and David B. Allison and Catherine A. Ball and Issa Coulibaly and Xiangqin Cui and Aedín C Culhane and Mario Falchi and Cesare Furlanello and Laurence Game and Giuseppe Jurman and Jon Mangion and Tapan Mehta and Michael Nitzberg and Grier P. Page and Enrico Petretto and Vera {van Noort}}, title = {Repeatability of published microarray gene expression analyses}, @@ -1387,7 +1421,7 @@ Reproducible Research in Image Processing}, year = {2008}, volume = {20}, pages = {473}, - doi = {10.1002/cpe.1237}, + doi = {10.1002/cpe.1233}, }