author | Roberto Baena Gallé <roberto.baena@gmail.com> | 2020-04-13 17:28:45 +0100 |
---|---|---|
committer | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-04-13 17:37:34 +0100 |
commit | 7ebc882a6157e7b39d0feb8a5fef7be1f7a42766 (patch) | |
tree | d4de0632e12625371e0ab2e50a97c1f22d34883b | |
parent | 2aa52db8108c3ba4b984b1d57a5c47d44de93d91 (diff) |
Minor corrections and thoughts
I corrected bugs, typos, doubled words, and punctuation throughout the
text. I added some comments, which are always highlighted with \hl{this is my
comment}, so you can identify them easily in the PDF. If you want to
remove them, you can do so easily with Ctrl+R, since I think you never used
\hl. Finally, I added my name as a coauthor but, please, feel free to remove
it if you want.
Note from Mohammad: since there were two other suggested commits before
this that were already merged, I rebased Roberto's commits and fixed a few
minor conflicts.
-rw-r--r-- | paper.tex | 200 |
1 files changed, 101 insertions, 99 deletions
@@ -26,7 +26,8 @@ \title{Maneage: Customizable framework for Managing Data Lineage} \author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2}, - \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2}\\ + \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2}, + \large\mpregular \authoraffil{Roberto Baena-Gall\'e}{1,2}\\ { \footnotesize\mplight \textsuperscript{1} Instituto de Astrof\'isica de Canarias, C/V\'ia L\'actea, 38200 La Laguna, Tenerife, ES.\\ @@ -51,10 +52,10 @@ Maneage (management + lineage) is introduced here as a host to the computational and narrative components of an analysis. Analysis steps are added to a new project with lineage in mind, thus facilitating the project's execution and testing as the project evolves, while being friendly to publishing and archival because it is wholly in machine\--action\-able, and human\--read\-able, plain-text. Maneage is founded on the principles of completeness (e.g., no dependency beyond a POSIX-compatible operating system, no administrator privileges, or no network connection), modular and straight-forward design, temporal lineage and free software. - The lineage isn't limited to downloading the inputs and processing them automatically, but also includes building the necessary software with fixed versions and build configurations. + The lineage is not limited to downloading the inputs and processing them automatically, but also includes building the necessary software with fixed versions and build configurations. Additionally, Maneage also builds the final PDF report of the project, establishing direct and automatic links between the data analysis and the narrative, with the precision of a word in a sentence. Maneage enables incremental projects, where a new project can branch off an existing one, with moderate changes to enable experimentation on published methods. 
- Once Maneage is implement in a sufficiently wide scale, it can aid in automatic and optimized workflow creation through machine learning, or automating data management plans. + Once Maneage is implemented in a sufficiently wide scale, it can aid in automatic and optimized workflow creation through machine learning, or automating data management plans. Maneage was a recipient of the research data alliance (RDA) Europe Adoption Grant in 2019. This paper is itself written in Maneage with snapshot \projectversion. \horizontalline @@ -111,18 +112,18 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical Nature is already a black box which we are trying hard to unlock, or understand. Not being able to experiment on the methods of other researchers is an artificial and self-imposed black box, wrapped over the original, and taking most of the energy of researchers. -\citet{miller06} found that a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science. +\citet{miller06} found that a mistaken column flipping, leading to retraction of 5 papers in major journals, including Science. \hl{I think this sentence is bad written, not sure}} \citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``forensic bioinformatics''. \citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions. 
The reason such reports are mostly from genomics and bioinformatics is because they have traditionally been more open to publishing workflows: for example \href{https://www.myexperiment.org}{myexperiment.org}, which mostly uses Apache Taverna \citep{oinn04}, or \href{https://www.genepattern.org}{genepattern.org} \citep{reich06}, \href{https://galaxyproject.org}{galaxy\-project.org} \citep{goecks10}, among others. Such integrity checks are a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows). -The status in other fields were workflows aren't commonly shared is probably (much) worse. +The status in other fields were workflows are not commonly shared is probably (much) worse. The completeness of a paper's published metadata (or ``Methods'' section) can be measured by a simple question: given the same input datasets (supposedly on a third-party database like \href{http://zenodo.org}{zenodo.org}), can another researcher reproduce the exact same result automatically, without needing to contact the authors? Several studies have attempted to answer this with different levels of detail. -For example \citet{allen18} found that roughly half of the papers in astrophysics don't even mention the names of any analysis software they have used, while \citet{menke20} found this fraction has greatly improved in medical/biological journals over the last two decades (currently above $80\%$). +For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software they have used, while \citet{menke20} found this fraction has greatly improved in medical/biological journals over the last two decades (currently above $80\%$). -\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but, only fully succeeded in 2 of them and partially in 6. 
+\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups but only fully succeeded in 2 of them and partially in 6. \citet{chang15} attempted to reproduce 67 papers in well-regarded economic journals with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all. \citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal Science \emph{after} that journal adopted a policy of publishing the data and code associated with the papers. Even though the authors were contacted, the success rate was $26\%$. @@ -131,7 +132,7 @@ Generally, this problem is unambiguously felt in the community: \citet{baker16} This is not a new problem in the sciences: in 2011, Elsevier conducted an ``Executable Paper Grand Challenge'' \citep{gabriel11}. The proposed solutions were published in a special edition. Before that, in an attempt to simulate research projects, \citet{ioannidis05} proved that ``most claimed research findings are false''. -In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} describe this same problem very eloquently and also provided some solutions that they used. +In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} describe this same problem very eloquently and also provided some solutions they used. While the situation has improved since the early 1990s, these papers still resonate strongly with the frustrations of today's scientists. Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results. One of the earliest such efforts we found was \citet{roberts69} who discussed conventions in FORTRAN programming and documentation to help in publishing research codes. 
@@ -141,10 +142,10 @@ Besides data availability, the ``decay'' in the software tools that the workflow Generally, software is not a secular component of projects, where one software can easily be swapped with another. Projects are built around specific software technologies, and research in software methods and implementations is itself a vibrant research topic in many domains \citep{dicosmo19}. -In this paper we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies. +In this paper, we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies. A project using Maneage will start by branching from the main Git branch of Maneage and starts customizing it: specifying the necessary software tools for that particular project, adding analysis steps and writing a narrative based on the analysis results. The temporal provenance of the project is fully preserved in Git, and allows merging of the project with the core branch to update the low-level infra-structure (common to all projects) without changing the high-level steps specific to this project. -In Section \ref{sec:d-and-p} the basic concepts are defined and the founding principles of Maneage are discussed. +In Section \ref{sec:d-and-p} \hl{the section label is missed} the basic concepts are defined and the founding principles of Maneage are discussed. Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects. @@ -152,31 +153,31 @@ Section \ref{sec:maneage} describes the internal structure of Maneage and Sectio \label{sec:definitions} The concepts and terminologies of reproducibility and project/workflow management and design are commonly used differently by different research communities or different solution provides. 
-As a consequence, before starting with the technical details it is important to clarify the specific terms used throughout this paper and its appendix. +As a consequence, before starting with the technical details it is important to clarify the specific terms used throughout this paper and its appendix. \hl{do you still have appendices (in pl if more than one)?} \begin{enumerate}[label={\bf D\arabic*}] \item \label{definition:input}\textbf{Input:} - Any computer file needed by a project that may be usable in other projects. + \hl{Here, we define as input a}ny computer file needed by a project that may be usable in other projects. The inputs of a project include data or software source code, see \citet{hinsen16} on the fundamental similarity of data and source code. Inputs may have initially been created/written (e.g., software source code) or collected (e.g., data) for one specific project. - However, they can, and most often will, be used in other/later projects also. + However, they can, and most often will, be used in other/later projects too. \item \label{definition:output}\textbf{Output:} - Any computer file that is published at the end of the project. + \hl{The output is a}ny computer file that is published at the end of the project. The output(s) of a project can be a published narrative paper, datasets (e.g., table(s), image(s), a number, or Boolean: confirming a hypothesis as true or false), automatically generated software source code, or any other computer file. \item \label{definition:project}\textbf{Project:} - The high-level series of operations that are done on input(s) to produce outputs. + \hl{The project is t}he high-level series of operations that are done on input(s) to produce outputs. 
This definition is therefore very similar to ``workflow'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, the project defined here also includes the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how the visualizations in it were created. The project is thus only in charge of managing the inputs and outputs of each analysis step (take the outputs of one step, and feed them as inputs to the next), not to do analysis by itself. A good project will follow the modularity principle: analysis scripts should be well-defined as an independently managed software source with clearly defined inputs and outputs. - For example modules in Python, packages in R, or libraries/programs in C/C++ that can be executed by the higher-level project source when necessary. + For example, modules in Python, packages in R, or libraries/programs in C/C++, which can be executed by the higher-level project source when necessary. Maintaining these lower-level components as independent software projects enables their easy usage in other projects. - Therefore here, they are defined as inputs (not the project). + Therefore, they are defined as inputs (not the project). \item \label{definition:provenance}\textbf{Data Provenance:} - A dataset's provenance is the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it. + A dataset's provenance is \hl{is defined as} the set of metadata (in any ontology, standard or structure) that connect it to the components (other datasets or scripts) that produced it. Data provenance thus provides a high-level \emph{and structured} view of a project's lineage. A good example of this is Research Objects \citep{belhajjame15}. 
@@ -191,13 +192,13 @@ As a consequence, before starting with the technical details it is important to \item \label{definition:lineage}\textbf{Data Lineage:} Data lineage is commonly used interchangeably with Data provenance \citep[for example][]{cheney09}. For clarity, we define term ``Data lineage'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands). -Therefore data lineage is synonymous with ``project'' as defined above. +Therefore, data lineage is synonymous with ``project'' as defined above. \item \label{definition:reproduction}\textbf{Reproducibility \& Replicability:} - These terms have been used in the literature with various meanings, sometimes in a contradictory way. + These two terms have been used in the literature with various meanings, sometimes in a contradictory way. It is important to highlight that in this paper we are only considering computational analysis: \emph{after} data has been collected and stored as a file. - Therefore, many of the definitions reviewed in \citet{plesser18}, that are about data collection do not apply here. + Therefore, many of the definitions reviewed in \citet{plesser18}, which are about data collection, do not apply here. We adopt the same definition of \citet{leek17,fineberg19}, among others. - Note that these definitions are contrary some others for example the ACM policy guidelines\footnote{\url{https://www.acm.org/publications/policies/artifact-review-badging}} (dated April 2018) and \citet{hinsen15}. + Note that these definitions are contrary some others, for example, the ACM policy guidelines\footnote{\url{https://www.acm.org/publications/policies/artifact-review-badging}} (dated April 2018) and \citet{hinsen15}. 
\citet{fineberg19} define reproducibility as \emph{obtaining consistent [not necessarily identical] results using the same input data; computational steps, methods, and code; and conditions of analysis}, or same inputs $\rightarrow$ consistent result. They define Replicability as \emph{obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data}, or different inputs $\rightarrow$ consistent result. @@ -225,7 +226,7 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora \begin{enumerate}[label={\bf P\arabic*}] \item \label{principle:complete}\textbf{Complete:} - A project that is complete, or self-contained, doesn't depend on anything beyond the Portable operating system Interface (POSIX), doesn't affect the host system, doesn't require root/administrator privileges, doesn't need an internet connection (when its inputs are on the file-system), and is fully recorded and executable in plain-text\footnote{Plain text format doesn't include document container formats like \inlinecode{.odf} or \inlinecode{.doc}, for software like LibreOffice or Microsoft Office.} format (e.g., ASCII or Unicode). + A project that is complete, or self-contained, does not depend on anything beyond the Portable operating system Interface (POSIX), does not affect the host system, does not require root/administrator privileges, does not need an internet connection (when its inputs are on the file-system), and is fully recorded and executable in plain-text\footnote{Plain text format does not include document container formats like \inlinecode{.odf} or \inlinecode{.doc}, for software like LibreOffice or Microsoft Office.} format (e.g., ASCII or Unicode). 
A complete project can automatically access to the inputs (see definition \ref{definition:input}), build its necessary software (instructions on configuring, building and installing those software in a fixed environment), do the analysis (run the software on the data) and create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML). No manual/human interaction is required within a complete project, as \citet{claerbout1992} put it: ``a clerk can do it''. @@ -233,42 +234,42 @@ To highlight the uniqueness of Maneage in this plethora of tools, a more elabora Finally, the plain-text format is particularly important because any other storage format will require higher-level software \emph{before} the project. \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete. - They all have many dependencies far beyond POSIX, for example the more recent ones are written in Python or use Jupyter notebooks \citep{kluyver16}. + They all have many dependencies far beyond POSIX, for example, the more recent ones are written in Python or use Jupyter notebooks \citep{kluyver16}. Such high-level tools have very short lifespan and evolve very fast, the most recent example was Python 3 that is not compatible with Python 2. - They also have a very complex dependency trees, making them extremely vulnerable to updates, for example see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies). + They also have a very complex dependency trees, making them extremely vulnerable to updates, for example, see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies). The longevity of a data lineage, or workflow (not the analysis itself), is determined by its shortest-lived dependency. 
- Many existing tools therefore don't attempt to store the project as plain text, but pre-built binary blobs (containers or virtual machines) that can rarely be recreated\footnote{Using the package manager of the container's OS, or Conda which are both highly dependent on the time they are created.} and also have a short lifespan\footnote{For example Docker only works on Linux kernels that are on long-term support, not older. - Currently this is Linux 3.2.x that was initially released 8 years ago in 2012. The current Docker images may not be usable in a similar time frame in the future.}. + Many existing tools therefore do not attempt to store the project as plain text, but pre-built binary blobs (containers or virtual machines) that can rarely be recreated\footnote{Using the package manager of the container's OS, or Conda which are both highly dependent on the time they are created.} and also have a short lifespan\footnote{For example Docker only works on Linux kernels that are on long-term support, not older. + Currently, this is Linux 3.2.x that was initially released 8 years ago in 2012. The current Docker images may not be usable in a similar time frame in the future.}. Once the lifespan of a binary workflow's dependency has passed, it is useless: that binary file cannot be opened to read or executed. But as plain-text, even if it is no longer executable due to much evolved technologies, it is still human read-able and parse-able by any machine. \item \label{principle:modularity}\textbf{Modularity:} A project should be compartmentalized or partitioned to independent modules or components with well-defined inputs/outputs having no side-effects. In a modular project, communication between the independent modules is explicit, providing optimizations on multiple levels: -1) Execution: independent modules can run in parallel, or modules that don't need to be run (because their dependencies haven't changed) won't be re-done. 
+1) Execution: independent modules can run in parallel, or modules that do not need to be run (because their dependencies have not changed) will not be re-done. 2) Data lineage and data provenance extraction (recording any dataset's origins). 3) Citation: allowing others to credit specific parts of a project. -This principle doesn't just apply to the analysis, it also applies to the whole project, for example see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:definitions}. +This principle does not just apply to the analysis, it also applies to the whole project, for example, see the definitions of ``input'', ``output'' and ``project'' in Section \ref{sec:definitions}. \emph{Comparison with existing:} Most are agnostic to modularity: leaving such design choices to the experience of project authors. -But designing a modular project needs to be encouraged and facilitated otherwise, scientists (that are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost or scientific accuracy. -Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools above (which usually require programming) don't. +But designing a modular project needs to be encouraged and facilitated, otherwise, scientists (that are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost or scientific accuracy. +Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails do encourage this, but the more recent tools above (which usually require programming) do not. 
\item \label{principle:complexity}\textbf{Minimal complexity:} This principle is essentially Occam's razor: ``Never posit pluralities without necessity'' \citep{schaffer15}, but extrapolated to project management: 1) avoid complex relations between analysis steps (which is not unrelated to the principle of modularity in \ref{principle:modularity}). 2) avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and significant resources are required to translate or rewrite it every few years (to stay in vogue). The same job can be done with more stable/basic tools, and less effort in the long-run. - Unlike software engineers who learn a new tool every two years and don't require long lifespan for their projects, scientists need to focus on their own research domain and are unable to stay up to date with the vogue, creating a generational gap when they do. - This is very bad for the scientists: valuable detailed experience can't pass through generations, or tools have to be re-written at a high cost that could have gone to actual research. + Unlike software engineers who learn a new tool every two years and do not require long lifespan for their projects, scientists need to focus on their own research domain and are unable to stay up to date with the vogue, creating a generational gap when they do. + This is very bad for the scientists: valuable detailed experience cannot pass through generations, or tools have to be re-written at a high cost that could have gone to actual research. - \emph{Comparison with existing:} Most of the existing tools use the language that was in vogue when they were created, for example a larger fraction of them are written in Python as we come closer to the present time. + \emph{Comparison with existing:} Most of the existing tools use the language that was in vogue when they were created, for example, a larger fraction of them are written in Python as we come closer to the present time. 
Again IPOL stands out from the rest in this principle also. \item \label{principle:verify}\textbf{Verifiable inputs and outputs:} The project should contain automatic verification checks on its inputs (software source code and data) and outputs. -When applied, expert knowledge won't be necessary to confirm the correct reproduction. +When applied, expert knowledge will not be necessary to confirm the correct reproduction. \emph{Comparison with existing:} Such verification is usually possible in most systems, but fully maintained by the user. Automatic verification of inputs is most commonly implemented in some cases, but rarely the outputs. @@ -289,10 +290,10 @@ IPOL, which uniquely stands out in other principles, fails at this one: only the This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and the industry: 1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source. 2) A non-free software may not be executable on a given/future hardware, if its free, the project members can modify it to work. - 3) A non-free software cannot be distributed by the authors, making the whole community reliant only on the proprietary owner's server (even if the proprietary software doesn't ask for payments), also see Section \ref{sec:publishing}. + 3) A non-free software cannot be distributed by the authors, making the whole community reliant only on the proprietary owner's server (even if the proprietary software does not ask for payments), also see Section \ref{sec:publishing}. \emph{Comparison with existing:} The existing solutions listed above are all free software. - There are non-free existing solutions also, but we do not consider them here because of this principle. + There are non-free existing solutions too, but we do not consider them here because of this principle. 
\end{enumerate} @@ -313,7 +314,7 @@ The main Maneage Branch is a fully working skeleton of a project without much fl To start a new project, the authors will \emph{clone}\footnote{In Git, the ``clone'' operation is the process of copying all the project's files and history from a repository onto the local system.} Maneage, create their own Git branch over the latest commit, and start their project by customizing that branch. Customization in their project branch is done by adding the names of the software they need, references to their input data, adding the analysis commands and commands to generate the visualizations, and write the narrative report which includes the visualizations. -Manages contains a file called \inlinecode{README-hacking.md} that has a complete checklist of steps to start a new project and remove demonstration parts. +Maneage contains a file called \inlinecode{README-hacking.md} that has a complete checklist of steps to start a new project and remove demonstration parts. In order to start using Maneage, there are hands-on tutorials for guiding the reader through research examples and show the workflow in a practical way. This will usually be done in multiple commits in the project's duration (maybe multiple years), thus preserving the project's history: the causes of all choices, the authors and times of each change, failed tests, and etc. @@ -340,12 +341,13 @@ Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} fil The \inlinecode{project} script is a high-level wrapper to interface with Maneage and in its current implementation has two main phases as shown below. As seen below, a project's operations are broken-up into two phases: 1) configuration, where the necessary software are built and the environment is setup. 2) analysis, where data are accessed and the software is run on them to create visualizations and the final report. +\hl{Below. first, I guess 2 hours depend on the machine. 
Second, shouldn't you include here the prepare step too?} \begin{lstlisting}[language=bash] ./project configure # Build software from source (takes around 2 hours for full build). ./project make # Do the analysis (download data, run software on data, build PDF). \end{lstlisting} -Here we will delve deeper into the implementation and some usage details of Maneage. +Here, we will delve deeper into the implementation and some usage details of Maneage. Section \ref{sec:usingmake} elaborates why Make (a POSIX tool) was chosen as the main job orchestrator in Maneage. Sections \ref{sec:projectconfigure} \& \ref{sec:projectanalysis} then discuss the operations done during the configuration and analysis phase. Afterwards, we describe how Maneage projects benefit from version control, during the project's lifetime and after its publication, or end-of-life in Section \ref{sec:projectgit}. @@ -366,17 +368,17 @@ The Make paradigm starts from the end: the final \emph{target}. In Make's syntax, the process is broken into atomic \emph{rules} where each rule has a single \emph{target} file which can depend on any number of \emph{prerequisite} files. To build the target from the prerequisites, each rule also has a \emph{recipe} (an atomic script). The plain-text files containing Make rules and their components are called Makefiles. -Hence Make doesn't replace scripting languages like the shell, Python or R, it is a higher-level structure enabling modular/atomic scripts (in any language) to be put in a workflow. +Hence, Make does not replace scripting languages like the shell, Python or R, it is a higher-level structure enabling modular/atomic scripts (in any language) to be put in a workflow. The formal connection of targets with prerequisites that is defined in Make, enables creation of a precise lineage as a formal/codified executable system that is very mature and has stood the test of time: Make is actively developed and used in the building of most OS components. 
Besides formalizing data lineage, Make also greatly encourages experimentation in a project because a recipe is executed only when at least one prerequisite is more recent than its target. -Therefore when only $5\%$ of a project's targets are affected by a change, only they will be recreated, the other $95\%$ remain untouched. +Therefore, when only $5\%$ of a project's targets are affected by a change, only they will be recreated, the other $95\%$ remain untouched. Furthermore, Make first examines the full lineage before starting the execution of recipes. It can thus execute independent rules in parallel, further improving the speed and encouraging experimentation. Make is well known by many outside of the software developing communities. -For example \citet{schwab2000} report how geophysics students have easily adopted it for the RED project management tool. -Because of its simplicity, we have also had very good feedback on using Make from the early adopters of Maneage during the last year, in particular graduate students and postdocs. +For example, \citet{schwab2000} report how geophysics students have easily adopted it for the RED project management tool. +Because of its simplicity, we have also had very good feedback on using Make from the early adopters of Maneage during the last year, in particular, graduate students and postdocs. \tonote{Mention the non-recursive Make here when discussing \inlinecode{top-make.mk}.} @@ -388,33 +390,33 @@ A more robust solution is to to build the software environment from scratch. Most existing tools reviewed in Section \ref{sec:principles}, use package managers like Conda, but since conda itself is written in Python, it violates our completeness principle \ref{principle:complete}. Highly robust solutions like Nix \citep{dolstra04} and GNU Guix \citep{courtes15} do exist, but they require root permissions which is also against that principle. 
Based on the principles of completeness (\ref{principle:complete}) and minimal complexity (\ref{principle:complexity}) Maneage orchestrates the building of its necessary software in the same language that it orchestrates the analysis: Make (see Section \ref{sec:usingmake}). -Therefore, a researcher already using Maneage easily understands and can customize the software environment also, without having to learn the intricacies of third-party tools. +Therefore, a researcher already using Maneage easily understands and can customize the software environment too, without having to learn the intricacies of third-party tools. Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure \ref{fig:files}. At the start of project configuration, Maneage needs a top-level directory to build itself on the host filesystem (software and analysis). -We call this the ``build directory'' (or \inlinecode{BDIR}) and it must not be under the source directory (see \ref{principle:modularity}). +We call this the ``build directory'' (or \hl{the so-called} \inlinecode{BDIR}) and it must not be under the source directory (see \ref{principle:modularity}). No other location on the running operating system will be affected by the project and it should not affect the result, so its value is not under version control. -Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally and don't need to be downloaded: 1) software tarball directory and 2) input data directory. +Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally and do not need to be downloaded: 1) software tarball directory and 2) input data directory. 
Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} elaborate more on the building of the necessary software and the important problem of software citation. \subsubsection{Verifying and building necessary software from source} \label{sec:buildsoftware} To compile the necessary software from source Maneage currently needs the host to have a C compiler (available on any POSIX-compliant OS). -This C compiler will be used by Maneage to install all the Maneage meta-software (software to build other software) with fixed versions, this includes GNU Bash, GNU AWK, GNU Coreutils, and many more on all supported operating systems (including macOS). -For example the full list of installed software for this paper is available in Acknowledgments of this paper. +This C compiler will be used by Maneage to install all necessary meta-software (software to build other software) with fixed versions, this includes GNU Bash, GNU AWK, GNU Coreutils, and many more on all supported operating systems (including macOS). +For example, the full list of installed software for this paper is available in Acknowledgments of this paper. On GNU/Linux OSs, a fixed version of the GNU Binutils and GNU C Compiler (GCC) is also included, and soon Maneage will also install its own fixed version of the GNU C Library to be fully independent of the host on such systems (Task 15390\footnote{\url{https://savannah.nongnu.org/task/?15390}}). In effect, except for the Kernel, Maneage builds all other components of the GNU OS on the host from source. -The software source code may already be present on the host filesystem, if not they can be downloaded. +The software source code may already be present on the host filesystem, if not, they can be downloaded. But before being used to build the software, their SHA-512 checksum \citep[part of the SHA-2 algorithms, see][]{romine15} will be checked with the expected checksum in Maneage. -If the checksums don't match, Maneage will stop and warn the user. 
+If the checksums do not match, Maneage will stop and warn the user. The core operating system components (mostly GNU tools) will be installed in any project, only their versions may differ from one project to another. But Maneage also includes a large collection of scientific software (and their dependencies) that are usually not necessary in all projects, each project has to identify its high-level software in its branch, specified in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory, see Figure \ref{fig:files}. Note that project configuration can be done in a container or virtual machine to avoid having to facilitate moving the project. -However the important factor is that such binary blobs are an optional output of Maneage, they are not the only way a project using Maneage can be archived. +However, the important factor is that such binary blobs are an optional output of Maneage, they are not the only way a project using Maneage can be archived. \subsubsection{Software citation} \label{sec:softwarecitation} @@ -426,11 +428,11 @@ Furthermore, when the software is associate with a published paper, that paper's This is particularly important in the case for research software, where the researcher has invested significant time in building the software, and requires official citation to justify continued work on it. One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts. -This doesn't cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs. +This does not cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs. Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros! 
-This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''. +This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] \hl{why in brackets?} if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''. In bug 905674\footnote{Debian bug on the citation notice of GNU Parallel: \url{https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905674}}, the Debian developers argued that because of this extra condition, GNU Parallel should not be considered as free software, and they are using a patch to remove that part of the code for its build under Debian-based OSs. -Most other research software don't resort to such drastic measures, however, citation is important for them. +Most other research software do not resort to such drastic measures, however, citation is important for them. Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper. For a review of the necessity and basic elements in software citation, see \citet{katz14} and \citet{smith16}. 
@@ -449,7 +451,7 @@ Once the project is configured (Section \ref{sec:projectconfigure}), a unique an All analysis operations run such that the host OS settings cannot penetrate it, enabling an isolated environment without the extra layer of containers or a virtual machine. In Maneage, a project's analysis is broken into two phases: data preparation and analysis. The former is mostly necessary in special situations where the datasets are extremely large and some initial preparation needs to be done on them to avoid slowing down the whole project in each run. -That phase is organized in an identical manner as the analysis phase, so we won't to into it any further here and refer the interested reader to the documentation of Maneage. +That phase is organized in an identical manner as the analysis phase, so we will not go into it any further here and refer the interested reader to the documentation of Maneage. A project consists of many steps, including data access (possibly by downloading), running various steps of the analysis on the obtained data, and creating the necessary plots, figures or tables for a published report, or output datasets for a database. If all of these steps are organized in a single Makefile, it will become very large, or long, and will be hard to maintain, extend/grow, read, reuse, and cite. @@ -471,24 +473,24 @@ Figure \ref{fig:datalineage} schematically shows these subMakefiles and their re Each colored box is a file in the project and the arrows show the dependencies between them. Green files/boxes are plain text files that are under version control and in the source-directory. Blue files/boxes are output files of various steps in the build-directory, located within the Makefile (\inlinecode{*.mk}) that generates them. - For example \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory and generated automatically) and \inlinecode{paper.tex} (in the source directory and written by hand). 
+ For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory and generated automatically) and \inlinecode{paper.tex} (in the source directory and written by hand). In turn, \inlinecode{project.tex} depends on all the \inlinecode{*.tex} files at the bottom of the Makefiles above it. The solid arrows and built boxes with full opacity are actually described in the context of a demonstration project in this paper. The dashed arrows and lower opacity built boxes, just shows how adding more elements to the lineage is also easily possible, making this a scalable tool. } \end{figure} -To avoid getting too abstract in the subsections below, where necessary, we'll do a basic analysis on the data of \citet[data were published as supplementary material on bioXriv]{menke20} and replicate one of their results. -Note that because we are not using the same software, this isn't a reproduction (\ref{definition:reproduction}). -We can't use the same software because they use Microsoft Excel for the analysis which violates several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}. +To avoid getting too abstract in the subsections below, where necessary, we will do a basic analysis on the data of \citet[data were published as supplementary material on bioXriv]{menke20} and replicate one of their results. +Note that because we are not using the same software, this is not a reproduction (\ref{definition:reproduction}). +We cannot use the same software because they use Microsoft Excel for the analysis which violates several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}. In the subsections below, this paper's analysis on that dataset is described using the data lineage graph of Figure \ref{fig:datalineage}. 
-We'll follow Make's paradigm (see Section \ref{sec:usingmake}) of starting form the ultimate target in Section \ref{sec:paperpdf}, and tracing backwards in its lineage to the configuration files \ref{sec:configfiles}. +We will follow Make's paradigm (see Section \ref{sec:usingmake}) of starting from the ultimate target in Section \ref{sec:paperpdf}, and tracing backwards in its lineage to the configuration files (Section \ref{sec:configfiles}). \subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})} \label{sec:paperpdf} The ultimate purpose of a project is to report the data analysis result, as raw visualizations of the results or blended in with a narrative description. -In Figure \ref{fig:datalineage} it is shown as \inlinecode{paper.pdf}, note that it is the only built file (blue box) with no arrows leaving it. +In Figure \ref{fig:datalineage}, it is shown as \inlinecode{paper.pdf}. Note that it is the only built file (blue box) with no arrows leaving it. The instructions to build \inlinecode{paper.pdf} are in the \inlinecode{paper.mk} subMakefile. Its prerequisites include \inlinecode{paper.tex} and \inlinecode{references.tex} (Bib\TeX{} entries for possible citations) in the project source and \inlinecode{project.tex} which is a built product. \inlinecode{references.tex} is also an important component of Maneage because it formalizes the connections of this project with previous projects on a high-level. @@ -496,13 +498,13 @@ Its prerequisites include \inlinecode{paper.tex} and \inlinecode{references.tex} \subsubsection{Values within text (\inlinecode{project.tex})} \label{sec:valuesintext} -Figures, plots, tables and narrative aren't the only analysis output that goes into the paper. +Figures, plots, tables and narrative are not the only analysis output that goes into the paper. In many cases, quantitative values from the analysis are also blended into the sentences of the report's narration. 
-For example note this sentence in the abstract of \citet[which is written in Maneage]{akhlaghi19}: ``... detect the outer wings of M51 down to S/N of 0.25 ...''. +For example, note this sentence in the abstract of \citet[which is written in Maneage]{akhlaghi19}: ``... detect the outer wings of M51 down to S/N of 0.25 ...''. The signal-to-noise ratio (S/N) value ``0.25'' depends on the analysis, and is an output of the analysis just like paper's figures and plots. Manually typing such numbers in the narrative is prone to very important bugs: the author may forget to check it after a change in an analysis (e.g., using a newer version of the software, or changing an analysis parameter for another part of the paper). Given the non-linear evolution of a scientific projects, this type of human error is very hard to avoid and can discourage experimentation. -Therefore such values must also be automatically generated. +Therefore, such values must also be automatically generated. To automatically generate and blend them in the text, Maneage uses \LaTeX{} macros. In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} uses this template to be reproducible, so its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481} links is also available on arXiv.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. @@ -511,7 +513,7 @@ The built \inlinecode{project.tex} file stores all such reported values. However, managing all the necessary \LaTeX{} macros in one file is against the modularity principle and can be frustrating and buggy. 
To address this problem, Maneage has the convention that all subMakefiles \emph{must} contain a fixed target with the same base-name, but with a \inlinecode{.tex} suffix to store reported values generated in that subMakefile. -In Figure \ref{fig:datalineage}, these macro files can be seen in every subMakefile, except for \inlinecode{paper.mk} (which doesn't need it). +In Figure \ref{fig:datalineage}, these macro files can be seen in every subMakefile, except for \inlinecode{paper.mk} (which does not need it). These \LaTeX{} macro files thus form the core skeleton of a Maneage project: as shown in Figure \ref{fig:datalineage}, the outward arrows of all built files of any subMakefile ultimately leads to one of these \LaTeX{} macro files, possibly in another subMakefile. \subsubsection{Verification of outputs (\inlinecode{verify.mk})} @@ -519,10 +521,10 @@ These \LaTeX{} macro files thus form the core skeleton of a Maneage project: as Before the modular \LaTeX{} macro files of Section \ref{sec:valuesintext} are merged into the single \inlinecode{project.tex} file, they need to pass through the verification filter, which is a core principle of Maneage (\ref{principle:verify}). Note that simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: many tools write the creation date into the produced files. -To avoid such cases the raw data (independent of their metadata like creation date) must be verified, some standards have such features for example For example the \inlinecode{DATASUM} keyword in the FITS format \citep{pence10}. +To avoid such cases the raw data (independent of their metadata like creation date) must be verified, some standards have such features, for example, the \inlinecode{DATASUM} keyword in the FITS format \citep{pence10}. To facilitate output verification, the project has a \inlinecode{verify.mk} subMakefile, see Figure \ref{fig:datalineage}. 
-It's \inlinecode{verify.tex} the only prerequisite of \inlinecode{project.tex} that was described in Section \ref{sec:valuesintext} and is the boundary between the analytical phase of the paper, and the production of the report. +Its \inlinecode{verify.tex} target is the only prerequisite of \inlinecode{project.tex} that was described in Section \ref{sec:valuesintext} and is the boundary between the analytical phase of the paper and the production of the report. It has some tests on pre-defined formats, and other formats can easily be added. Prior to publication, the project authors should add the MD5 checksums of all the \LaTeX{} macro files and output datasets in the recipe of \inlinecode{verify\-.tex} to enable automatic verification by readers afterwards. @@ -531,13 +533,13 @@ Prior to publication, the project authors should add the MD5 checksums of all th The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figure \ref{fig:datalineage}). Project authors rarely need to modify/edit this file, it is a low-level infrastructure of Maneage, but are encouraged to do so. -\inlinecode{initial\-ize\-.mk} doesn't contain any analysis or major processing steps, it just initializes the system by setting the necessary Make environment as well as other general jobs like defining the Git commit hash of the run as a \LaTeX{} (\inlinecode{\textbackslash{}projectversion}) macro that can be loaded into the narrative. -Papers using Maneage usually put this hash as the last word in their abstract, for example see \citet{akhlaghi19} and \citet{infante20}, for the current version of this paper, it expands to \projectversion. 
+\inlinecode{initial\-ize\-.mk} does not contain any analysis or major processing steps, it just initializes the system by setting the necessary Make environment as well as other general jobs like defining the Git commit hash of the run as a \LaTeX{} (\inlinecode{\textbackslash{}projectversion}) macro that can be loaded into the narrative. +Papers using Maneage usually put this hash as the last word in their abstract, for example, see \citet{akhlaghi19} and \citet{infante20}, for the current version of this paper, it expands to \projectversion. \subsubsection{The analysis} \label{sec:analysis} -The basic concepts behind organizing the analysis into modular subMakefiles has already been discussed above, we'll thus describe it here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure \ref{fig:toolsperyear}. +The basic concepts behind organizing the analysis into modular subMakefiles has already been discussed above, we will thus describe it here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure \ref{fig:toolsperyear}. As shown in Figure \ref{fig:datalineage}, in the customized branch of this project, we have broken this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}. The former is in charge of converting the Microsoft Excel formatted input into the simple comma-separated value (CSV) format, and the latter is in charge of generating the table to build Figure \ref{fig:toolsperyear}. In a real project, subMakefiles will be much more complex. @@ -550,7 +552,7 @@ Note that their location after the standard starting subMakefiles (initializatio \end{center} \vspace{-5mm} \caption{\label{fig:toolsperyear}Fraction of papers mentioning software tools (green line, left vertical axis) to total number of papers studied in that year (light red bars, right vertical axis in log-scale). - Data from \citet{menke20}. + Data from \citet{menke20}. 
\hl{Maybe say here also in the caption this is a replica of figure 1C from menke20 done with Maneage } The subMakefile archiving the executable lineage of figure's data is shown in Figure \ref{fig:demoplotsrc} and discussed in Section \ref{sec:analysis}. } \end{figure} @@ -564,30 +566,30 @@ Note that their location after the standard starting subMakefiles (initializatio \end{figure} Figure \ref{fig:toolsperyear} also shows the number of papers that were studied each year (that is not shown in the original plot). -Its horizontal axis also shows the full range of the data (starting from \menkefirstyear) while the original Figure 1C in \citet{menke20} starts from 1997. -Probably the reason \citet{menke20} decided to avoid earlier years was the small number of papers in earlier years. -For example in \menkenumpapersdemoyear, they had only studied \menkenumpapersdemocount{} papers. +Its horizontal axis shows the full range of the data (starting from \menkefirstyear) while the original Figure 1C in \citet{menke20} starts from 1997. +Probably, the reason \citet{menke20} decided to avoid earlier years was the small number of papers in earlier years. +For example, in \menkenumpapersdemoyear, they had only studied \menkenumpapersdemocount{} papers. Note that both the numbers of the previous sentence (\menkenumpapersdemoyear{} and \menkenumpapersdemocount), and the dataset's oldest year (mentioned above: \menkefirstyear) are automatically generated \LaTeX{} macros, see \ref{sec:valuesintext}. -We didn't typeset them in this narrative explanation manually. +We did not typeset them in this narrative explanation manually. This step (generating the macros) is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{tools-per-year.txt} to \inlinecode{demo-plot.tex}. -To create Figure \ref{fig:toolsperyear}, we used the \LaTeX{} package PGFPlots, therefore the final analysis output we needed was a simple plain-text table with 3 columns. 
+To create Figure \ref{fig:toolsperyear}, we used the \LaTeX{} package PGFPlots, therefore, the final analysis output we needed was a simple plain-text table with 3 columns. This table is shown in the lineage graph of Figure \ref{fig:datalineage} as \inlinecode{tools-per-year.txt}. If another plotting tool was desired (for example Python's Matplotlib, or Gnuplot), the built graphic file (for example \inlinecode{tools-per-year.pdf}) could be the target instead of the raw table. The \inlinecode{tools-per-year.txt} is a value-added table with only \menkenumyears{} rows (counting per year), the original dataset had \menkenumorigrows{} rows (one row for each year of each journal). -In the Make rule to build it is in \inlinecode{demo-plot.mk} and its the recipe is a simple GNU AWK command, with \inlinecode{menke20-table-3.txt} as its prerequisite, its is schematically shown by the arrow connecting the two \inlinecode{.txt} files. +In the Make rule to build, it is in \inlinecode{demo-plot.mk} and its recipe is a simple GNU AWK command, with \inlinecode{menke20-table-3.txt} as its prerequisite, it is schematically shown by the arrow connecting the two \inlinecode{.txt} files. Note that both the row numbers mentioned at the start of this paragraph are also macros. The latter (\menkenumorigrows{}) is schematically shown with the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{format.tex}. -Ultimately the first file we need to operate on is \inlinecode{menke20-table-3.txt} (which is defined as a target in \inlinecode{format.mk}). +Ultimately, the first file we need to operate on is \inlinecode{menke20-table-3.txt} (which is defined as a target in \inlinecode{format.mk}). As mentioned before, the main operation here is to convert the Microsoft Excel format of the downloaded dataset (\inlinecode{menke20.xlsx}, discussed below in Section \ref{sec:download}) to this simple plain-text format for the operations mentioned above. 
We do this job with the XLSX I/O program that has been specified as software to build during project configuration. This step is shown schematically in Figure \ref{fig:datalineage} with the arrow connecting these two lines. Having prepared the full dataset in a simple format, let's report the number of subjects (papers and journals) that were studied in \citet{menke20}. The necessary file for this measurement is \inlinecode{menke20-table-3.txt} that is a target in \inlinecode{format.mk}. -Therefore we do this calculation (with a simple AWK command) and write the results in (\inlinecode{format.tex}). +Therefore, we do this calculation (with a simple AWK command) and write the results in (\inlinecode{format.tex}). In the built PDF paper, the two macros expand to $\menkenumpapers$ (number of papers studied) and $\menkenumjournals$ (number of journals studied) respectively. This step is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{format.tex}. @@ -604,7 +606,7 @@ Irrespective of where the dataset is \emph{used} in the project's lineage, it he Each external dataset has some basic information, including its expected name on the local system (for offline access), the necessary checksum to validate it (either the whole file or just its main ``data'', as discussed in Section \ref{sec:outputverification}), and its URL/PID. In Maneage, such information regarding a project's input dataset(s) is in the \inlinecode{INPUTS.conf} file. -See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage respectively. +See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively. 
For demonstration, we are using the datasets of \citet{menke20} which are stored in one \inlinecode{.xlsx} file on bioRxiv\footnote{\label{footnote:dataurl}Full data URL: \url{\menketwentyurl}}. Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} where the necessary information is stored as Make variables and is automatically loaded into the full project when Make starts (it is most often used in \inlinecode{download.mk}). @@ -643,18 +645,18 @@ The configuration files greatly simplify project management from multiple perspe \item If an analysis parameter is used in multiple places within the project, simply changing the value in the configuration file will change it everywhere in the project. This is critical in more complex projects and, if not done this way, can lead to significant human error. \item Configuration files enable the logical separation between the low-level implementation and high-level running of a project. - For example after writing the project, the authors don't need to remember where the number/parameter was used, they can just modify the configuration file. + For example, after writing the project, the authors do not need to remember where the number/parameter was used, they can just modify the configuration file. Other co-authors, or readers, of the project also benefit: they just need to know that there is a unified place for high-level project settings, parameters, or numbers without necessarily having to know the low-level implementation. -\item A configuration file will be a prerequisite to any rule that uses it's value. +\item A configuration file will be a prerequisite to any rule that uses its value. If the configuration file is updated (the value/parameter is changed), Make will automatically detect the data lineage branch that is affected by it and re-execute only that branch, without any human interference. 
\end{itemize} \subsection{Projects as Git branches of Maneage} \label{sec:projectgit} -Maneage is fully composed of plain-text files, therefore it can be maintained under under version control systems like Git. -Every commit in the version controlled history contains \emph{a complete} snapshot of the data lineage, for more see the completeness principle (\ref{principle:complete}). -Maneage is maintained by its developers in a central branch, which we'll call \inlinecode{man\-eage} hereafter. +Maneage is fully composed of plain-text files, therefore, it can be maintained under version control systems like Git. +Every commit in the version controlled history contains \emph{a complete} snapshot of the data lineage, for more, see the completeness principle (\ref{principle:complete}). +Maneage is maintained by its developers in a central branch, which we will call \inlinecode{man\-eage} hereafter. The \inlinecode{man\-eage} branch contains all the low-level infrastructure, or skeleton, that is necessary for any project as described in the sections above. As mentioned in Section \ref{sec:maneage}, to start a new project, users simply clone it from its reference repository and build their own Git branch over the most recent commit. This is demonstrated in the first phase of Figure \ref{fig:branching} where a project has started by branching off of commit \inlinecode{0c120cb} (in the \inlinecode{maneage} branch). @@ -680,11 +682,11 @@ This is demonstrated in the first phase of Figure \ref{fig:branching} where a pr \end{figure} After a project starts, Maneage will evolve. -For example new features will be added, low-level bugs will be fixed that are useful for any project. +For example, new features will be added, low-level bugs will be fixed that are useful for any project. 
Because all the changes in Maneage are committed on the \inlinecode{maneage} branch, and all projects branch off from it, updating the project's low-level infrastructure is as easy as merging the \inlinecode{maneage} branch into the project's branch. -For example in Figure \ref{fig:branching} (phase 1), see how Maneage's \inlinecode{3c05235} commit has been merged into project's branch trough commit \inlinecode{2ed0c82} . +For example, in Figure \ref{fig:branching} (phase 1), see how Maneage's \inlinecode{3c05235} commit has been merged into the project's branch through commit \inlinecode{2ed0c82}. -This doesn't just apply to the pre-publication phase, when done in Maneage, a project can be revived at any later date by other researchers as shown in phase 2 of Figure \ref{fig:branching}. +This does not just apply to the pre-publication phase: when done in Maneage, a project can be revived at any later date by other researchers as shown in phase 2 of Figure \ref{fig:branching}. In that figure, a new team of researchers have decided to experiment on the results of the published paper and have merged it with the Maneage branch (commit \inlinecode{a92b25a}) to fix some possible portability problem for their operating system that was fixed as a bug in Maneage after the paper's publication. Other scenarios include a third project that can easily merge various high-level components from different projects into its own branch, thus adding a temporal dimension to their data lineage. @@ -727,21 +729,21 @@ Therefore, there are various scenarios for the publication of the project as des \item \textbf{Public Git repository:} This is the simplest publication method. The project will already be on a (private) Git repository prior to publication. In such cases, the private configuration can be removed so it becomes public. 
- \item \textbf{In journal or PDF-only preprint systems (e.g., bioRxiv):} If the journal or pre-print server allows publication of small supplement files to the paper, the commit that produced the final paper can be submitted as a compressed file, for example with the + \item \textbf{In journal or PDF-only preprint systems (e.g., bioRxiv):} If the journal or pre-print server allows publication of small supplement files to the paper, the commit that produced the final paper can be submitted as a compressed file, for example, with the \hl{Something is missed} \item \textbf{arXiv:} arXiv will run its own internal \LaTeX{} engine on the uploaded files and produce the PDF that is published. When the project is published, arXiv also allows users to anonymously download the \LaTeX{} source tarball that the authors uploaded. Therefore, simply uploading the tarball from the \inlinecode{./project make dist} command is sufficient. We have done this in \citet[arXiv:1909.11230]{akhlaghi19} and \citet[arXiv:1911.01430]{infante20}. Since arXiv is mirrored in many institutes over the planet, this is a robust way to preserve the reproducible lineage. \item \textbf{In output datasets:} Many data storage formats support an internal structure with the data file. - One commonly used example today is the Hierarchical Data Format (HDF), and in particular its HDF5 which can host a complex filesystem in POSIX syntax. + One commonly used example today is the Hierarchical Data Format (HDF) and, in particular, its HDF5 which can host a complex filesystem in POSIX syntax. \end{itemize} \item \textbf{Source and data:} The project inputs (including the software tarballs, or possible datasets) may have a large volume. Publishing them with the source is thus not always possible. However, based on the definition of inputs in Section \ref{definition:input}, they are usable in other projects: another project may use the same data or software source code, in a different way.
- Therefore even when published with the source, it is encouraged to publish them as separate files. + Therefore, even when published with the source, it is encouraged to publish them as separate files. - For example strategy was followed in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481} which supplements \citet{akhlaghi19} which contains the following files. + For example, this strategy was followed in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481} which supplements \citet{akhlaghi19} which contains the following files. \hl{Rewrite the sentence, I think something is missed and there are two \emph{which} very close each other} \begin{itemize} \item \textbf{Final PDF:} for easy understanding of the project. @@ -753,14 +755,14 @@ Therefore, there are various scenarios for the publication of the project as des \end{itemize} Note that \citet{akhlaghi19} used previously published datasets which are automatically accessed when necessary. - Also, that paper didn't produce any output datasets beyond the figures shown in the report, therefore the Zenodo upload doesn't contain any datasets. + Besides, that paper did not produce any output datasets beyond the figures shown in the report, therefore, the Zenodo upload does not contain any datasets. When a project involves data collection, or added-value data products, they can also be uploaded with the files above. \end{itemize} \subsubsection{Worries about getting scooped!} \label{sec:scooped} Publishing the project source with the paper can have many benefits for the researcher and the larger community. -For example if the source is published with a pre-print, others my help the authors find bugs, or improvements to the source that can affect the validity or precision of the result, or simply optimize it so it does the same work in half the time for example.
+For example, if the source is published with a pre-print, others may help the authors find bugs, or improvements to the source that can affect the validity or precision of the result, or simply optimize it, so it does the same work in half the time. However, one particular feedback raised by a minority of researchers is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue harvesting from all their hard work. Because others can easily reproduce the work, others may take the next follow-up project they originally intended to do. @@ -770,7 +772,7 @@ The level that this may happen is an interesting subject to be studied once many But it is a valid concern that must be addressed. Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways: 1) Through the Git history, it is clear how much extra work the other team has added. -In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify newton's famous ``standing on the shoulders of giants'' quote. +In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``standing on the shoulders of giants'' quote. However, this is a long-term goal and requires major changes to academic value systems. 2) Authors can be given a grace period where the journal, or some third authority, keeps the source and publishes it a certain time after publication. @@ -783,18 +785,18 @@ Its primordial implementation was written for \citet{akhlaghi15}. This paper described a new detection algorithm in astronomical image processing. The detection algorithm was developed as the paper was being written (initially a small report!). An automated sequence of commands to build the figures, and update the paper/report was a practical necessity as the algorithm was evolving.
-In particular, it didn't just reproduce figures, it also used \LaTeX{} macros to update numbers printed within the text. +In particular, it did not just reproduce figures, it also used \LaTeX{} macros to update numbers printed within the text. Finally, since the full analysis pipeline was in plain-text and roughly 100kb (much less than a single figure), it was uploaded to arXiv with the paper's \LaTeX{} source, under a \inlinecode{reproduce/} directory, see \href{https://arxiv.org/abs/1505.01664}{arXiv:1505.01664}\footnote{ To download the \LaTeX{} source of any arXiv paper, click on the ``Other formats'' link, containing necessary instructions and links.}. -The system later evolved in \citet{bacon17}, in particular the two sections of that paper that were done by M. Akhlaghi (first author of this paper): \citet[\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}]{akhlaghi18a} and \citet[\href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}]{akhlaghi18b}. +The system later evolved in \citet{bacon17}, in particular, the two sections of that paper that were done by M. Akhlaghi (first author of this paper): \citet[\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}]{akhlaghi18a} and \citet[\href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}]{akhlaghi18b}. With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects. The template later matured by including installation of all necessary software from source and used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}. The short historical review above highlights how this template was created by practicing scientists, and has evolved and matured significantly. 
We already have roughly 30 tasks that are left for the future and will affect various high-level phases of the project as described here. -However, the core of the system has been used and become stable enough already and we don't see any major change in the core methodology in the near future. -A list of the notable changes after the publication of this paper will be kept in in the project's \inlinecode{README-hacking.md} file. +However, the core of the system has been used and become stable enough already and we do not see any major change in the core methodology in the near future. +A list of the notable changes after the publication of this paper will be kept in the project's \inlinecode{README-hacking.md} file. Once the improvements become substantial, new paper(s) will be written to complement or replace this one. \section{Discussion} |