-rw-r--r-- | paper.tex                        | 377
-rw-r--r-- | tex/img/codata.pdf               | bin 0 -> 8798 bytes
-rw-r--r-- | tex/img/codata.png               | bin 112554 -> 0 bytes
-rw-r--r-- | tex/src/figure-branching.tex     |   4
-rw-r--r-- | tex/src/figure-src-inputconf.tex |   2
-rw-r--r-- | tex/src/preamble-style.tex       |  10
-rw-r--r-- | tex/src/references.tex           |  14
7 files changed, 227 insertions, 180 deletions
@@ -12,7 +12,7 @@ %% you need to distribute drafts that is undergoing revision and you want %% to highlight to your colleagues which parts are new and which parts are %% only for discussion. -\newcommand{\highlightchanges}{} +%\newcommand{\highlightchanges}{} %% Import the necessary preambles. \input{tex/src/preamble-style.tex} @@ -24,14 +24,16 @@ -\title{Maneage: A Customizable Framework for Managing Data Lineage} +\title{Maneage, a Customizable Framework for Managing Data Lineage} \author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2}, \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2}, + \large\mpregular \authoraffil{David Valls-Gabaud}{3}, \large\mpregular \authoraffil{Roberto Baena-Gall\'e}{1,2}\\ { \footnotesize\mplight \textsuperscript{1} Instituto de Astrof\'isica de Canarias, Calle V\'ia L\'actea s/n, 38205 La Laguna, Tenerife, Spain.\\ \textsuperscript{2} Departamento de Astrof\'isica, Universidad de La Laguna, Avenida Astrof\'isico Francisco S\'anchez s/n, 38200 La Laguna, Tenerife, Spain.\\ + \textsuperscript{3} LERMA, CNRS, Observatoire de Paris, 61 Avenue de l'Observatoire, 75014 Paris, France.\\ Corresponding author: Mohammad Akhlaghi (\href{mailto:mohammad@akhlaghi.org}{\textcolor{black}{mohammad@akhlaghi.org}}) }} @@ -47,12 +49,12 @@ %% Abstract {\noindent\mpregular - The era of big data has ushered an era of big responsibility. In particular, reproducibility includes a full understanding of data lineage, a key - feature without which the results of an analysis can be the subject of perpetual debate. + The era of big data has ushered in an era of big responsibility. + In the absence of reproducibility, which requires a full understanding of data lineage, the results of an analysis can become the subject of perpetual debate. To address this problem, we introduce Maneage (management + lineage), founded on the principles of completeness (e.g., no dependency beyond a POSIX-compatible operating system, no administrator privileges, and no network connection), modular and straightforward design, temporal lineage and free software. A project using Maneage is fully stored in machine\--action\-able, and human\--read\-able plain-text format, facilitating version-control, publication, archival, and automatic parsing to extract data provenance. The provided lineage is not limited to high-level processing, but also includes building the necessary software from source with fixed versions and build configurations. - Additionally, a project's final visualizations and narrative report are also included, establishing direct links between the data analysis and the narrative or visualizations, down to the precision of a word within a sentence or a point in a plot. + Additionally, a project's final visualizations and narrative report are also included, establishing direct links between the analysis and the narrative or visualizations, to the precision of a word within a sentence or each point in a plot. Maneage also enables incremental projects, where a new project can branch off an existing one, with moderate changes to enable experimentation on published methods. Once Maneage is implemented on a sufficiently wide scale, automatic and optimized workflow creation through machine learning, or automating data management plans, can easily be set up. Maneage was a recipient of a Research Data Alliance (RDA) Europe Adoption Grant in 2019, and has already been tested and used in several scientific papers, including the present one, with snapshot \projectversion.
@@ -81,16 +83,16 @@ However, given its inherent complexity, as the mere results are barely useful on What inputs were used? How were the configurations or training data chosen? What operations were done on those inputs, how were the plots made? -Figure~\ref{fig:questions} provides a more detailed visual representation of such questions at various stages of the workflow. +Figure \ref{fig:questions} provides a more detailed visual representation of such questions at various stages of the workflow. \tonote{Johan: add some general references.} Due to the complexity of modern data analysis, a small deviation in the final result can be due to any or many different steps, which may be significant for its interpretation, and hence publishing the \emph{complete} codes of the analysis is the only way to fully understand the results, and to avoid wasting time and resources. For example, \citet{smart18} describes how a 7-year old conflict in theoretical condensed matter physics was only identified after the relative codes were shared. -\citet{miller06} found a mistaken column flipping in a project's workflow, leading to the retraction of 5 papers in major journals, including \textsl{Science}. -\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``\textsl{forensic bioinformatics}''. -\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary \textsl{Microsoft Excel} gene lists contain erroneous gene name conversions. +\citet{miller06} found a mistaken column flipping in a project's workflow, leading to the retraction of 5 papers in major journals, including \emph{Science}. +\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``\emph{forensic bioinformatics}''. +\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions. \begin{figure}[t] \begin{center} @@ -100,34 +102,43 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step. The green boxes with sharp edges are inputs and the blue boxes with rounded corners are intermediate or final outputs. The red boxes with dashed edges highlight the main questions at each respective stage. - The box covering software download and build phases shows some common tools software developers use for this phase, but a scientific project is clearly much more involved. + The box covering software download and build phases shows some common tools software developers use for this phase, but a scientific project is clearly much more involved. } \end{figure} The reason such reports are mostly from genomics and bioinformatics is because they have traditionally been more open to publishing workflows: see for example \href{https://www.myexperiment.org}{myexperiment.org}, \href{https://www.genepattern.org}{genepattern.org}, and \href{https://galaxyproject.org}{galaxy\-project.org}. 
Integrity checks are hence a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows). The status in other fields, where workflows are not commonly shared, is highly likely to be even worse. +Nature is already a black box which we are trying hard to unlock. +Not being able to experiment on the methods of other researchers adds a self-imposed black box on top of it. The completeness of a paper's published metadata (usually within the ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors. Several studies have attempted to answer this with different levels of detail. For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in the medical/biological field and is currently above $80\%$. \citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but fully succeeded in only 2 of them and partially in 6. \citet{chang15} attempted to reproduce 67 papers in well-regarded journals in Economics with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all. \tonote{DVG: even after contacting the authors?} -\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal \textsl{Science} \emph{after} that journal adopted a policy of publishing the data and code associated with the papers. +\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal \emph{Science} \emph{after} that journal adopted a policy of publishing the data and code associated with the papers. Even though the authors were contacted, the success rate was an abysmal $26\%$. -Overall, this problem is unambiguously assessed as being very serious in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``\textsl{reproducibility crisis}''. +Overall, this problem is unambiguously assessed as being very serious in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``\emph{reproducibility crisis}''. -Yet, this is not a new problem in the sciences: back in 2011, Elsevier conducted an ``\textsl{Executable Paper Grand Challenge}'' \citep{gabriel11} and the proposed solutions were published in a special edition.\tonote{DVG: which were the results?} -Even before that, in an attempt to simulate research projects, \citet{ioannidis05} proved that ``\textsl{most claimed research findings are false}''. +Yet, this is not a new problem in the sciences: back in 2011, Elsevier conducted an ``\emph{Executable Paper Grand Challenge}'' \citep{gabriel11} and the proposed solutions were published in a special edition.\tonote{DVG: which were the results?} +Even before that, in an attempt to simulate research projects, \citet{ioannidis05} proved that ``\emph{most claimed research findings are false}''. In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} described the same problem very eloquently and also provided some solutions they used.\tonote{DVG: more details here, one is left wondering ...} Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
One of the earliest such efforts we are aware of is the work of \citet{roberts69}, who discussed conventions in \texttt{FORTRAN} programming and documentation to help in publishing research codes. While the situation has somewhat improved, all these papers still resonate strongly with the frustrations of today's scientists. In this paper, we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies. -A project using Maneage starts by branching from the main Git branch of Maneage and then customizes itself: specifying the necessary software tools for that particular project, adding analysis steps and writing a narrative based on the results. -The temporal provenance of the project is fully preserved in Git, and allows the merging of the project with the core branch to update the low-level infrastructure (common to all projects) without changing the high-level steps specific to the project. +A project using Maneage starts by branching from its main Git branch, allowing the authors to customize it: specifying the necessary software tools for that particular project, adding analysis steps, and adding visualizations and a narrative based on the results. In Sections \ref{sec:definitions} \& \ref{sec:principles} the basic concepts are defined and the founding principles of Maneage are discussed. -Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects. +Section \ref{sec:maneage} describes the internal structure of Maneage, Section \ref{sec:discussion} discusses its benefits, caveats and future prospects, and we conclude with a summary in Section \ref{sec:conclusion}. + + + + + + + + \section{Definitions} @@ -148,11 +159,11 @@ As a consequence, before starting with the technical details it is important to \item \label{definition:project}\textbf{Project:} A project is the series of operations that are done on input(s) to produce outputs. - This definition is therefore very similar to ``\textsl{workflow}'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, a project also includes both the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how + This definition is therefore very similar to ``\emph{workflow}'' \citep{oinn04, reich06, goecks10}, but because the published narrative paper/report is also an output, a project also includes both the source of the narrative (e.g., \LaTeX{} or MarkDown) \emph{and} how its visualizations were created. - In a good project, all analysis scripts (e.g., written in \textsl{Python{, packages in \textsl{R}, libraries/programs in \textsl{C/C++}, etc.) are well-defined as an independently-managed software with clearly-defined inputs, outputs and no side-effects. - This is crucial help for debugging and experimenting during the project, and also for their re-usability in later projects. + In a good project, all analysis scripts (e.g., written in Python, packages in R, libraries/programs in C/C++, etc.) are well-defined as independently managed software with clearly defined inputs, outputs and no side-effects. + This is crucial for debugging and experimenting during the project, and also for their re-usability in later projects. As a consequence, such analysis scripts/programs are defined above as ``inputs'' for the project.
A project hence does not include any analysis source code (to the extent this is possible), it only manages calls to them. @@ -170,9 +181,9 @@ As a consequence, before starting with the technical details it is important to % Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level - actual tables, scripts and statements. % Technical data lineage is being provided by solutions such as MANTA or Informatica Metadata Manager. " \item \label{definition:lineage}\textbf{Data Lineage:} -Data lineage is commonly used interchangeably with Data provenance \citep[for example][]{cheney09}. -For clarity, we define the term ``\textsl{Data lineage}'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands). -Therefore, data lineage is synonymous with ``\textsl{project}'' as defined above. + Data lineage is commonly used interchangeably with Data provenance \citep[for example][]{cheney09}. + For clarity, we define the term ``\emph{Data lineage}'' as a low-level and fine-grained recording of the data's trajectory in an analysis (not meta-data, but actual commands). + Therefore, data lineage is synonymous with ``\emph{project}'' as defined above. \item \label{definition:reproduction}\textbf{Reproducibility \& Replicability:} These two terms have been used in the literature with various meanings, sometimes in a contradictory way. It is important to highlight that in this paper we are only considering computational analysis: \emph{after} data has been collected and stored as a file. @@ -192,23 +203,30 @@ The core principle of Maneage is simple: science is defined by its method, not i \citet{buckheit1995} summarize this nicely by noting that modern scientific papers (narrative combined with plots, tables and figures) are merely advertisements of a scholarship, the actual scholarship is the scripts and software usage that went into doing the analysis. Maneage is not the first attempted solution to this fundamental problem. -Various solutions have indeed been proposed since the early 1990s, for example RED \citep{claerbout1992,schwab2000}, Apache Taverna \citep{oinn04}, Madagascar \citep{fomel13}, GenePattern \citep{reich06}, Kepler \citep{ludascher05}, VisTrails \citep{bavoil05}, Galaxy \citep{goecks10}, Image Processing On Line journal \citep[IPOL][]{limare11}, WINGS \citep{gil10}, Active papers \citep{hinsen11}, Collage Authoring Environment \citep{nowakowski11}, SHARE \citep{vangorp11}, Verifiable Computational Result \citep{gavish11}, SOLE \citep{pham12}, Sumatra \citep{davison12}, Sciunit \citep{meng15}, Popper \citep{jimenez17}, WholeTale \citep{brinckman19}, and many more. +Various solutions have been proposed since the early 1990s, for example RED (1992), Apache Taverna (2003), Madagascar (2003), GenePattern (2004), Kepler (2005), VisTrails (2005), Galaxy (2010), WINGS (2010), Image Processing On Line journal (IPOL, 2011), Active papers (2011), Collage Authoring Environment (2011), SHARE (2011), Verifiable Computational Result (2011), SOLE (2012), Sumatra (2012), Sciunit (2015), Binder (2017), Popper (2017), WholeTale (2019), and many more. 
To highlight the uniqueness of Maneage in this plethora of tools, a more elaborate list of principles is required: \begin{enumerate}[label={\bf P\arabic*}] \item \label{principle:complete}\textbf{Complete:} - A project that is complete, or self-contained, (i) does not depend on anything beyond the Portable operating system Interface (POSIX), - (ii) does not affect the host system, (iii) does not require root/administrator privileges, (iv) does not need an internet connection (when its inputs are on the file-system), and (v) is stored in a format that does not require any software beyond POSIX tools to open, parse or execute. - - A complete project can (i) automatically access the inputs (see definition \ref{definition:input}), (ii) build its necessary software (instructions on configuring, building and installing those software in a fixed environment), (iii) do the analysis (run the software on the data) and + A project that is complete, or self-contained, + (i) does not depend on anything beyond the Portable Operating System Interface (POSIX), + (ii) does not affect the host system, + (iii) does not require root/administrator privileges, + (iv) does not need an internet connection (when its inputs are on the file-system), and + (v) is stored in a format that does not require any software beyond POSIX tools to open, parse or execute. + + A complete project can + (i) automatically access the inputs (see definition \ref{definition:input}), + (ii) build its necessary software (instructions on configuring, building and installing those software in a fixed environment), + (iii) do the analysis (run the software on the data) and (iv) create the final narrative report/paper as well as its visualizations, in its final format (usually in PDF or HTML). - No manual/human interaction is required within a complete project, as \citet{claerbout1992} put it: ``\textsl{a clerk can do it}''. - Generally, a manual intervention in any of the steps above, or an interactive interface, constitutes an incompleteness. + No manual/human interaction is required to run a complete project, as \citet{claerbout1992} put it: ``\emph{a clerk can do it}''. + Generally, manual intervention in any of the steps above, or an interactive interface, constitutes an incompleteness. Lastly, the plain-text format is particularly important because any other storage format will require specialized software \emph{before} the project can be opened. - \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete as - they all have many dependencies far beyond POSIX. For example, the more recent ones (the project/workflow, not the analysis) are written in Python or rely on Jupyter notebooks. - Such high-level tools have very short lifespans and evolve very fast (e.g., Python 3 is not compatible with Python 2). \tonote{DVG: but fortran 77 can be used within fortran90 ...} + \emph{Comparison with existing:} Except for IPOL, none of the tools above are complete as they all have many dependencies far beyond POSIX. + For example, the more recent ones are written in Python (the project/workflow, not the analysis), or rely on Jupyter notebooks. + Such high-level tools have very short lifespans and evolve very fast (e.g., Python 2 code cannot run with Python 3). They also have very complex dependency trees, making them extremely vulnerable and hard to maintain. For example, see Figure 1 of \citet{alliez19} on the dependency tree of Matplotlib (one of the smaller Jupyter dependencies).
It is important to remember that the longevity of a workflow (not the analysis itself) is determined by its shortest-lived dependency. @@ -228,7 +246,7 @@ In a modular project, communication between the independent modules is explicit, However, designing a modular project needs to be encouraged and facilitated, otherwise scientists (who are not usually trained in data management) will not design their projects to be modular, leading to great inefficiencies in terms of project cost and/or scientific accuracy. \item \label{principle:complexity}\textbf{Minimal complexity:} - This principle is essentially Ockham's razor: ``\textsl{Never posit pluralities without necessity}'' \citep{schaffer15}, but extrapolated to project management: + This principle is essentially Ockham's razor: ``\emph{Never posit pluralities without necessity}'' \citep{schaffer15}, but extrapolated to project management: 1) avoid complex relations between analysis steps (which is related to the principle of modularity in \ref{principle:modularity}). 2) avoid the programming language that is currently in vogue because it is going to fall out of fashion soon and significant resources are required to translate or rewrite it every few years (to stay in vogue). The same job can be done with more stable/basic tools, and less effort in the long run. @@ -246,7 +264,7 @@ Automatic verification of inputs is most commonly implemented in some cases, but No project is done in a single/first attempt. Projects evolve as they are being completed. It is natural that earlier phases of a project are redesigned/optimized only after later phases have been completed. -This is often seen in scientific papers, with statements like ``\textsl{we [first] tried method [or parameter] X, but Y is used here because it showed to have better precision [or less bias, or etc]}''. +This is often seen in scientific papers, with statements like ``\emph{we [first] tried method [or parameter] X, but Y is used here because it showed to have better precision [or less bias, or etc]}''. A project's ``history'' is thus as scientifically relevant as the final, or published version. \emph{Comparison with existing:} The systems above that are implemented around version control usually support this principle. @@ -254,7 +272,7 @@ However, because they are rarely complete (as discussed in principle \ref{princi IPOL, which uniquely stands out in other principles, fails here: only the final snapshot is published. \item \label{principle:freesoftware}\textbf{Free and open source software:} - Technically, as defined in Section~\ref{definition:reproduction}, reproducibility is also possible with a non-free or non-open-source software (a black box). + Technically, reproducibility (defined in \ref{definition:reproduction}) is possible with a non-free or non-open-source software (a black box). This principle is thus necessary to complement the definition of reproducibility and has many advantages which are critical to the sciences and the industry: 1) The lineage, and its optimization, can be traced down to the internal algorithm in the software's source. 2) A free software that may not execute on a future hardware can be modified to work. 
@@ -275,18 +293,17 @@ IPOL, which uniquely stands out in other principles, fails here: only the final \section{Maneage} \label{sec:maneage} - -Maneage is an implementation of the principles of Section~\ref{sec:principles}: (i) it is complete (\ref{principle:complete}), (ii) modular (\ref{principle:modularity}), (iii) has minimal complexity (\ref{principle:complexity}), (iv) verifies its inputs \& outputs (\ref{principle:verify}), (v) preserves temporal provenance (\ref{principle:history}) and (vi) it is free software (\ref{principle:freesoftware}). +Maneage is an implementation of the principles of Section \ref{sec:principles}: it is complete (\ref{principle:complete}), modular (\ref{principle:modularity}), has minimal complexity (\ref{principle:complexity}), verifies its inputs \& outputs (\ref{principle:verify}), preserves temporal provenance (\ref{principle:history}) and finally, it is free software (\ref{principle:freesoftware}). In practice, Maneage is a collection of plain-text files that are distributed in pre-defined sub-directories by context (a modular source), and are all under version-control, currently with Git. -The main Maneage Branch is a fully-working skeleton of a project without much flesh: it contains all the low-level infrastructure, but without any actual high-level analysis operations\footnote{In the core Maneage branch, only a simple demo analysis is included to be complete, and - can easily be removed: all its files and steps have a \inlinecode{delete-me} prefix.}. -Maneage contains a file called \inlinecode{README-hacking.md}\footnote{Note that the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself.} that has a complete checklist of steps to start a new project and remove demonstration parts. There are also hands-on tutorials to help new users. +The main Maneage Branch is a fully-working skeleton of a project without much flesh: it contains all the low-level infrastructure, but without any actual high-level analysis operations\footnote{In the core Maneage branch, only a simple demo analysis is included to be complete, and can easily be removed: all its files and steps have a \inlinecode{delete-me} prefix.}. +Maneage contains a file called \inlinecode{README-hacking.md}\footnote{Note that the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself.} that has a complete checklist of steps to start a new project and remove demonstration parts. +There are also hands-on tutorials to help new users. To start a new project, the authors just \emph{clone}\footnote{In Git, the ``clone'' operation is the process of copying all the project's files and history from a repository onto the local system.} Maneage, create their own Git branch over the latest commit, and start their project by customizing that branch. Customization in their project branch is done by adding the names of the software they need, references to their input data, the analysis commands, visualization commands, and a narrative report which includes the visualizations. This will usually be done in multiple commits in the project's duration (perhaps multiple years), thus preserving the project's history: the causes of all choices, the authors and times of each change, failed tests, etc. -Figure~\ref{fig:files} shows this directory structure containing the modular plain-text files (classified by context in sub-directories) and some representative files in each directory. 
+Figure \ref{fig:files} shows this directory structure containing the modular plain-text files (classified by context in sub-directories) and some representative files in each directory. The top-level source has only very high-level components: the \inlinecode{project} shell script (POSIX-compliant) that is the main interface to the project, as well as the paper's \LaTeX{} source, documentation and a copyright statement. Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} files) and \inlinecode{reproduce/} (containing all other parts of the project). @@ -315,15 +332,15 @@ In practice, these steps are run with two commands: \end{lstlisting} We now delve deeper into the implementation and some usage details of Maneage. -Section~\ref{sec:usingmake} elaborates why Make (a POSIX tool) was chosen as the main job orchestrator in Maneage. -Sections~\ref{sec:projectconfigure} \& \ref{sec:projectanalysis} then discuss the operations done during the configuration and analysis phase. -Afterwards, we describe how Maneage projects benefit from version control in Section~\ref{sec:projectgit}. -Section~\ref{sec:collaborating} discusses the sharing of a built environment, and in Section \ref{sec:publishing} the publication/archival of Maneage projects is discussed. +Section \ref{sec:usingmake} elaborates why Make (a POSIX tool) was chosen as the main job manager in Maneage. +Sections \ref{sec:projectconfigure} \& \ref{sec:projectanalysis} then discuss the operations done during the configuration and analysis phase. +Afterwards, we describe how Maneage projects benefit from version control in Section \ref{sec:projectgit}. +Section \ref{sec:collaborating} discusses the sharing of a built environment, and in Section \ref{sec:publishing} the publication/archival of Maneage projects is discussed. \subsection{Job orchestration with Make} \label{sec:usingmake} -Scripts (in Shell, Python, or any other high-level language) are usually the first solution that come to mind when non-interactive, or batch, processing is needed (the completeness principle, see~\ref{principle:complete}), +Scripts (in Shell, Python, or any other high-level language) are usually the first solution that comes to mind when non-interactive, or batch, processing is needed (the completeness principle, see \ref{principle:complete}). However, the inherent complexity and non-linearity of progress, as a project evolves, makes it hard to manage such scripts. For example, if $90\%$ of a research project is done and only the newly-added final $10\%$ must be executed, a script will always start from the beginning. It is possible to manually ignore (with some conditionals), or manually comment, parts of a script to only do a special part. @@ -334,7 +351,7 @@ The Make paradigm starts from the end: the final \emph{target}. In Make's syntax, the process is broken into atomic \emph{rules} where each rule has a single \emph{target} file which can depend on any number of \emph{prerequisite} files. To build the target from the prerequisites, each rule also has a \emph{recipe} (an atomic script). The plain-text files containing Make rules and their components are called Makefiles. -Note that Make does not replace scripting languages like the shell, \textsl{Python} or \textsl{R}. +Note that Make does not replace scripting languages like the shell, Python or R. It is a higher-level structure enabling modular/atomic scripts (in any language) to be put into a workflow.
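As a minimal sketch of this target/prerequisite/recipe terminology (the file names anticipate the demonstration later in this paper, but the recipe itself is only illustrative; Maneage's real rule in \inlinecode{paper.mk} is more elaborate):

    # Rule: 'paper.pdf' (the target) is built from its prerequisites by the recipe.
    # If none of the prerequisites are newer than the target, Make re-runs nothing,
    # so only the out-of-date parts of a project are ever re-executed.
    # (Recipe lines in a Makefile must begin with a TAB character.)
    paper.pdf: paper.tex references.tex project.tex
            pdflatex paper.tex

With such rules, the $90\%$/$10\%$ scenario above is handled automatically: Make compares the timestamps of targets and prerequisites and only re-runs the recipes whose targets are out of date.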
The formal connection of targets with prerequisites, as defined in Make, enables the creation of an optimized workflow that is very mature and has withstood the test of time: almost all OSs rely on it. @@ -345,29 +362,32 @@ it can thus execute independent rules in parallel, further improving the speed a Make is well known by many outside of the software developing communities. For example, \citet{schwab2000} report how geophysics students have easily adopted it for the RED project management tool. -Because of its simplicity, we have also had very good feedback on using Make from the early adopters of Maneage, %since the RDA grant, -in particular with graduate students and postdocs. +Because of its simplicity, we have also had very good feedback on using Make from the early adopters of Maneage, in particular with graduate students and postdocs. + + + + \subsection{Project configuration} \label{sec:projectconfigure} -Maneage orchestrates the building of its necessary software in the same language that it orchestrates the analysis: Make (see Section~\ref{sec:usingmake}). +Maneage orchestrates the building of its necessary software in the same language that it orchestrates the analysis: Make (see Section \ref{sec:usingmake}). Therefore, a researcher already using Maneage for their high-level analysis easily understands, and can customize, the software environment too, without delving into the intricacies of third-party tools. -Most existing tools reviewed in Section~\ref{sec:principles} use package managers like Conda to maintain the software environment, but since Conda itself is written in \textsl{Python}, it does not fulfill our completeness principle \ref{principle:complete}. -Highly-robust solutions like Nix \citep{dolstra04} and GNU Guix \citep{courtes15} do exist, but they require root permissions which is also against this principle. +Most existing tools reviewed in Section \ref{sec:principles} use package managers like Conda to maintain the software environment, but since Conda itself is written in Python, it does not fulfill our completeness principle \ref{principle:complete}. +Highly-robust solutions like Nix and GNU Guix do exist, but they require root permissions which is also problematic for this principle. -Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure~\ref{fig:files}. +Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure \ref{fig:files}. At the start of project configuration, Maneage needs a top-level directory to build itself on the host filesystem (software and analysis). -We call this the ``build directory'' and it must not be under the source directory (see~\ref{principle:modularity}): by default Maneage will not touch any file in its source. +We call this the ``build directory'' and it must not be under the source directory (see \ref{principle:modularity}): by default Maneage will not touch any file in its source. No other location on the running operating system will be affected by the project and the build directory should not affect the result, so its value is not under version control. Two other local directories can optionally be specified by the project when inputs (\ref{definition:input}) are present locally and do not need to be downloaded: 1) software tarball directory and 2) input data directory. 
-Sections~\ref{sec:buildsoftware} and \ref{sec:softwarecitation} elaborate more on the building of the required software and the important problem of software citation. +Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} elaborate more on the building of the required software and the important problem of software citation. \subsubsection{Verifying and building necessary software from source} \label{sec:buildsoftware} -To compile the necessary software from source Maneage currently needs the host to have a \textsl{C} compiler (available on any POSIX-compliant OS). -This \textsl{C} compiler will be used by Maneage to build and install (in the build directory) all necessary software and their dependencies with fixed versions. +To compile the necessary software from source Maneage currently needs the host to have a \emph{C} compiler (available on any POSIX-compliant OS). +This \emph{C} compiler will be used by Maneage to build and install (in the build directory) all necessary software and their dependencies with fixed versions. The dependency tree goes all the way down to core operating system components like GNU Bash, GNU AWK, GNU Coreutils, and many more on all supported operating systems (including macOS, not just GNU/Linux). For example, the full list of installed software for this paper is automatically available in the Acknowledgments section of this paper. On GNU/Linux OSs, a fixed version of the GNU Binutils and GNU C Compiler (GCC) is also included, and soon Maneage will also install its own fixed version of the GNU C Library to be fully independent of the host on such systems (Task 15390\footnote{\url{https://savannah.nongnu.org/task/?15390}}). @@ -376,7 +396,7 @@ In effect, except for the Kernel, Maneage builds all other components of the GNU The software source code may already be present on the host filesystem, if not, it can be downloaded. Before being used to build the software, it will be validated by its SHA-512 checksum (which is already stored in the project). Maneage includes a large collection of scientific software (and their dependencies) that are usually not necessary in all projects. -Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory, see Figure~\ref{fig:files}. +Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory, see Figure \ref{fig:files}. All the high-level software dependencies are codified in Maneage as Make \emph{prerequisites}, and hence the specified software will be automatically built after its dependencies. Note that project configuration can be done in a container or virtual machine to facilitate moving the project. @@ -388,19 +408,18 @@ The important factor, however, is that such binary blobs are optional outputs of Maneage contains the full list of built software for each project, their versions and their configuration options, but this information is buried deep into each project's source. Maneage also prints a distilled fraction of this information in the project's final report, blended into the narrative, as seen in the Acknowledgments of this paper. -Furthermore, when the software is associated with a published paper, that paper's Bib\TeX{} entry is also added to the final report and is -duly cited with the software's name and version. 
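For illustration, requesting software in \inlinecode{TARGETS.conf} amounts to listing program names in a Make variable; the exact variable name below is an assumption (a sketch, not Maneage's verbatim file):

    # reproduce/software/config/TARGETS.conf (illustrative sketch).
    # Each listed program corresponds to a Make target, so its full dependency
    # tree is built first, from verified source tarballs.
    top-level-programs = gawk

Because the requested software are ordinary Make targets, adding a new tool to a project is a one-line change in this file.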
+Furthermore, when the software is associated with a published paper, that paper's Bib\TeX{} entry is also added to the final report and is duly cited with the software's name and version. For example\footnote{In this paper we have used very basic tools that are not accompanied by a paper}, see the software acknowledgement sections of \citet{akhlaghi19} and \citet{infante20}. This is particularly important in the case for research software, where citation is critical to justify continued development. One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts. -This does not cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs. Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros! -This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``\textsl{history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''. +This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``\emph{history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''. Most research software does not resort to such drastic measures, however, but proper citation is not only important but also ethical. -Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper. The necessity and basic elements in software citation are reviewed, inter alia, by \citet{katz14} and \citet{smith16}. -There are ongoing projects specifically tailored to software citation, including CodeMeta\footnote{\url{https://codemeta.github.io}} and Citation file format\footnote{\url{https://citation-file-format.github.io}} (CFF), a very robust approach is also provided by SoftwareHeritage \citep{dicosmo18}. +Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software. +The necessity and basic elements in software citation are reviewed, inter alia, by \citet{katz14} and \citet{smith16}. +There are ongoing projects specifically tailored to software citation, including CodeMeta and Citation file format (CFF), a very robust approach is also provided by SoftwareHeritage \citep{dicosmo18}. We plan to enable these tools in future versions of Maneage. @@ -410,7 +429,7 @@ We plan to enable these tools in future versions of Maneage. \subsection{Analysis of the Project} \label{sec:projectanalysis} -Once the project is configured (Section~\ref{sec:projectconfigure}), a unique and fully-controlled environment is available to execute the analysis. +Once the project is configured (Section \ref{sec:projectconfigure}), a unique and fully-controlled environment is available to execute the analysis. 
All analysis operations run such that the host's OS settings cannot penetrate it, enabling an isolated environment without the extra layer of containers or a virtual machine. In Maneage, a project's analysis is broken into two phases: 1) preparation, and 2) analysis. The former is mostly necessary to optimize extremely large datasets and is only useful for advanced users, while following an identical internal structure to the latter. @@ -424,38 +443,37 @@ Maneage is thus designed to encourage and facilitate modularity by distributing In the rest of this paper these modular, or lower-level, Makefiles will be called \emph{subMakefiles}. The subMakefiles are loaded into the special Makefile \inlinecode{top-make.mk} with a certain order and executed in one instance of Make\footnote{The subMakefiles are loaded into \inlinecode{top-make.mk} using Make's \inlinecode{include} directive. Hence no recursion is used (where one instance of Make calls Make within itself) because recursion is against the minimal complexity principle and can make the code very hard to read (see \ref{principle:complexity}).}. -When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section~\ref{sec:maneage}), calls \inlinecode{top-make.mk}. -All these Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make}, see Figure~\ref{fig:files}. -Figure~\ref{fig:datalineage} schematically shows these subMakefiles and their relation with each other with the targets they build. +When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}) calls \inlinecode{top-make.mk}. +All these Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make}, see Figure \ref{fig:files}. +Figure \ref{fig:datalineage} schematically shows these subMakefiles and their relation with each other with the targets they build. \begin{figure}[t] \begin{center} \includetikz{figure-data-lineage} \end{center} \vspace{-7mm} - \caption{\label{fig:datalineage}Schematic representation of data lineage in a hypothetical project/pipeline using Maneage. + \caption{\label{fig:datalineage}Schematic representation of a project's data lineage, or workflow, for the demonstration analysis of this paper. Each colored box is a file in the project and the arrows show the dependencies between them. Green files/boxes are plain text files that are under version control and in the source directory. - Blue files/boxes are output files of various steps in the build directory, located within the Makefile (\inlinecode{*.mk}) that generates them. + Blue files/boxes are output files of various steps in the build directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}. For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory and generated automatically) and \inlinecode{paper.tex} (in the source directory and written by hand). - In turn, \inlinecode{project.tex} depends on all the \inlinecode{*.tex} files at the bottom of the Makefiles above it. - The solid arrows and built boxes with full opacity are actually described in the context of a demonstration project in this paper. - The dashed arrows and lower opacity built boxes, just show how adding more elements to the lineage is also easily implemented, making it a scalable tool. + The solid arrows and built boxes with full opacity are described in Section \ref{sec:projectanalysis}.
+ The dashed arrows and lower-opacity built boxes illustrate scalability: hypothetical steps that can be added to the project. } \end{figure} To avoid getting too abstract in the subsections below, where necessary we will do a basic analysis on the data of \citet{menke20} and replicate one of the results. Note that because we are not using the same software\footnote{We cannot use the same software because \citet{menke20} use -\textsl{Microsoft Excel} for the analysis which violates several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}.}, this is not a reproduction (see~\ref{definition:reproduction}). -In the subsections below, this paper's analysis on that dataset is described using the data lineage graph of Figure~\ref{fig:datalineage}. -We will follow Make's paradigm (see Section~\ref{sec:usingmake}) of starting the lineage backwards form the ultimate target in Section ~\ref{sec:paperpdf} (bottom of Figure~\ref{fig:datalineage}) to the configuration files \ref{sec:configfiles} (top of Figure~\ref{fig:datalineage}). +Microsoft Excel for the analysis which violates several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}.}, this is not a reproduction (see \ref{definition:reproduction}). +In the subsections below, this paper's analysis on that dataset is described using the data lineage graph of Figure \ref{fig:datalineage}. +We will follow Make's paradigm (see Section \ref{sec:usingmake}) of starting the lineage backwards from the ultimate target in Section \ref{sec:paperpdf} (bottom of Figure \ref{fig:datalineage}) to the configuration files in Section \ref{sec:configfiles} (top of Figure \ref{fig:datalineage}). To better understand this project, we encourage looking into this paper's own Maneage source, published as a supplement. \subsubsection{Ultimate target: the project's paper or report (\inlinecode{paper.pdf})} \label{sec:paperpdf} The ultimate purpose of a project is to report the data analysis result, as raw visualizations, or numbers blended in with a narrative. -In Figure~\ref{fig:datalineage}, this is \inlinecode{paper.pdf}. Note that it is the only built file (blue box) with no outwards arrows leaving it. +In Figure \ref{fig:datalineage}, this is \inlinecode{paper.pdf}. Note that it is the only built file (blue box) with no outward arrows leaving it. The instructions to build \inlinecode{paper.pdf} are in the \inlinecode{paper.mk} subMakefile. Its prerequisites include \inlinecode{paper.tex} and \inlinecode{references.tex} (Bib\TeX{} entries for possible citations) in the project source and \inlinecode{project.tex} which is a built product. \inlinecode{references.tex} formalizes the connections of this project with previous projects on a higher level. @@ -465,26 +483,26 @@ Its prerequisites include \inlinecode{paper.tex} and \inlinecode{references.tex} Figures, plots, tables and narrative are not the only analysis products that are included in the paper/report. In many cases, quantitative values from the analysis are also blended into the sentences of the report's narration. -For example, this sentence in the abstract of \citet[which is written in Maneage]{akhlaghi19}: ``\textsl{... detect the outer wings of M51 down to S/N of 0.25 ...}''. +For example, this sentence in the abstract of \citet[which is written in Maneage]{akhlaghi19}: ``\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}''.
The value `0.25', for the signal-to-noise ratio (S/N), depends on the analysis, and is an output of the analysis just like the paper's figures and plots. Manually typing such numbers in the narrative is prone to serious errors and discourages testing in scientific papers. -Therefore, they must \textsl{also} be automatically generated. +Therefore, they must \emph{also} be automatically generated. To automatically generate and blend them in the text, Maneage uses \LaTeX{} macros. In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} is written in Maneage and its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, links are also available on arXiv's top page.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. -The ma\-cro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically created during in the project and expands to the value ``\inlinecode{0.25}'' when the PDF output is built. +The ma\-cro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically created during the project and expands to the value ``\inlinecode{0.25}'' when the PDF output is built. The built \inlinecode{project.tex} file stores all such reported values. However, managing all the necessary \LaTeX{} macros in one file is against the modularity principle and can be frustrating and buggy. To address this problem, Maneage adopts the convention that all subMakefiles \emph{must} contain a fixed target with the same base-name, but with a \inlinecode{.tex} suffix to store reported values generated in that subMakefile. If it does not need to report any values in the text, the file can indeed be empty. -In Figure~\ref{fig:datalineage}, these macro files can be seen in every subMakefile, except for \inlinecode{paper.mk} (which does not need it). -These \LaTeX{} macro files thus form the core skeleton of a Maneage project: as shown in Figure~\ref{fig:datalineage}, the outward arrows of all built files of any subMakefile ultimately leads to one of these \LaTeX{} macro files, possibly in another subMakefile. +In Figure \ref{fig:datalineage}, these macro files can be seen in every subMakefile, except for \inlinecode{paper.mk} (which does not need it). +These \LaTeX{} macro files thus form the core skeleton of a Maneage project: as shown in Figure \ref{fig:datalineage}, the outward arrows of all built files of any subMakefile ultimately lead to one of these \LaTeX{} macro files, possibly in another subMakefile. \subsubsection{Verification of outputs (\inlinecode{verify.mk})} \label{sec:outputverification} -Before the modular \LaTeX{} macro files of Section~\ref{sec:valuesintext} are merged into the single \inlinecode{project.tex} file, they need to pass through the verification filter, which is another core principle of Maneage (\ref{principle:verify}). +Before the modular \LaTeX{} macro files of Section \ref{sec:valuesintext} are merged into the single \inlinecode{project.tex} file, they need to pass through the verification filter, which is another core principle of Maneage (\ref{principle:verify}). Note that simply confirming the checksum of the final PDF, or figures and datasets is not generally possible: many tools write the creation date into the produced files.
To avoid such cases, the raw data (independently of their metadata-like creation date) must be verified. Some standards include such features, for example, the \inlinecode{DATASUM} keyword in the FITS format \citep{pence10}. To facilitate output verification, the project has a \inlinecode{verify.mk} subMakefile (see Figure \ref{fig:datalineage}) and \inlinecode{verify.tex}, the only prerequisite of \inlinecode{project.tex} that was described in Section \ref{sec:valuesintext}. @@ -494,11 +512,12 @@ It has some tests on pre-defined formats, and other formats can easily be added. \subsubsection{The analysis} \label{sec:analysis} -The basic concepts behind organizing the analysis into modular subMakefiles having already been discussed above, we describe them here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure~\ref{fig:toolsperyear}. -As shown in Figure~\ref{fig:datalineage}, in this project we have broken this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}. -The former is in charge of converting the \textsl{Microsoft Excel}-formatted input into the simple comma-separated value (CSV) format, and the latter is in charge of generating the table to build Figure~\ref{fig:toolsperyear}. -In a real project, subMakefiles could and will be much more complex. -Figure~\ref{fig:topmake} shows how the two subMakefiles are placed as values to the \inlinecode{makesrc} variable of \inlinecode{top-make.mk}, without their suffix (see Section \ref{sec:valuesintext}). +The basic concepts behind organizing the analysis into modular subMakefiles have already been discussed above. +We will thus describe them here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure \ref{fig:toolsperyear}. +As shown in Figure \ref{fig:datalineage}, in this project we have broken this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}. +The former is in charge of converting the Excel-formatted input into the simple comma-separated value (CSV) format, and the latter is in charge of generating the table to build Figure \ref{fig:toolsperyear}. +In a real project, subMakefiles could, and will, be much more complex. +Figure \ref{fig:topmake} shows how the two subMakefiles are placed as values to the \inlinecode{makesrc} variable of \inlinecode{top-make.mk}, without their suffix (see Section \ref{sec:valuesintext}). Note that their location after the standard starting subMakefiles (initialization and download) and before the standard ending subMakefiles (verification and final paper) is important, along with their order. \begin{figure}[t] \begin{center} \includetikz{figure-tools-per-year} \end{center} \vspace{-5mm} \caption{\label{fig:toolsperyear}Ratio of papers mentioning software tools (green line, left vertical axis) to total number of papers studied in that year (light red bars, right vertical axis in log-scale). - This is an enhanced replica of figure 1C \citet{menke20}, shown here for demonstrating Maneage, see Figure~\ref{fig:datalineage} for its lineage and Section \ref{sec:analysis} for how it was organized. + This is an enhanced replica of figure 1C of \citet{menke20}, shown here for demonstrating Maneage; see Figure \ref{fig:datalineage} for its lineage and Section \ref{sec:analysis} for how it was organized.
} \end{figure} \begin{figure}[t] \input{tex/src/figure-src-topmake.tex} \vspace{-3mm} \caption{\label{fig:topmake} General view of the high-level \inlinecode{top-make.mk} Makefile which manages the project's analysis that is in various subMakefiles. - See Figures~\ref{fig:files} \& \ref{fig:datalineage} for its location in the project's file structure and its data lineage, as well as the subMakefiles it includes. + See Figures \ref{fig:files} \& \ref{fig:datalineage} for its location in the project's file structure and its data lineage, as well as the subMakefiles it includes. } \end{figure} -To enhance the original plot, Figure~\ref{fig:toolsperyear} also shows the number of papers that were studied each year. +To enhance the original plot, Figure \ref{fig:toolsperyear} also shows the number of papers that were studied each year. Its horizontal axis shows the full range of the data (starting from \menkefirstyear, while the original Figure 1C in \citet{menke20} starts from 1997). The probable reason is that \citet{menke20} decided to avoid earlier years due to their small number of papers. For example, in \menkenumpapersdemoyear, they had only studied \menkenumpapersdemocount{} papers. Note that both the numbers of the previous sentence (\menkenumpapersdemoyear{} and \menkenumpapersdemocount), and the dataset's oldest year (mentioned above: \menkefirstyear) are automatically generated \LaTeX{} macros, see \ref{sec:valuesintext}. -They are \textsl{not} typeset manually in this narrative explanation. -This step (generating the macros) is shown schematically in Figure~\ref{fig:datalineage} with the arrow from \inlinecode{tools-per-year.txt} to \inlinecode{demo-plot.tex}. +They are \emph{not} typeset manually in this narrative explanation. +This step (generating the macros) is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{tools-per-year.txt} to \inlinecode{demo-plot.tex}. -To create Figure~\ref{fig:toolsperyear}, we used the \LaTeX{} package PGFPlots. The final analysis output we needed was therefore a simple plain-text table with 3 columns (year, paper per year, tool fraction per year). -This table is shown in the lineage graph of Figure~\ref{fig:datalineage} as \inlinecode{tools-per-year.txt} and The PGFPlots source to generate this figure is located in \inlinecode{tex\-/src\-/figure\--tools\--per\--year\-.tex}. -If another plotting tool was desired (for example \textsl{Python}'s Matplotlib, or Gnuplot), the built graphic file (for example \inlinecode{tools-per-year.pdf}) could be the target instead of the raw table. +To create Figure \ref{fig:toolsperyear}, we used the \LaTeX{} package PGFPlots. The final analysis output we needed was therefore a simple plain-text table with 3 columns (year, papers per year, tool fraction per year). +This table is shown in the lineage graph of Figure \ref{fig:datalineage} as \inlinecode{tools-per-year.txt} and the PGFPlots source to generate this figure is located in \inlinecode{tex\-/src\-/figure\--tools\--per\--year\-.tex}. +If another plotting tool were desired (for example \emph{Python}'s Matplotlib, or Gnuplot), the built graphic file (for example \inlinecode{tools-per-year.pdf}) could be the target instead of the raw table. The \inlinecode{tools-per-year.txt} is a value-added table with only \menkenumyears{} rows (counting per year); the original dataset had \menkenumorigrows{} rows (one row for each year of each journal).
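To make the target/prerequisite structure described around here concrete, a hypothetical, much-simplified version of the relevant rules in \inlinecode{demo-plot.mk} could look like the following; the directory variables (\inlinecode{adir}, \inlinecode{mtexdir}) and the column layout assumed by the AWK recipe are illustrative only, not Maneage's actual code:

    # Sketch: build the 3-column table (year, papers, tool fraction) from the
    # formatted input. '$<' is the first prerequisite, '$@' is the target.
    # Assumption: col.1=year, col.2=papers, col.3=papers mentioning tools.
    $(adir)/tools-per-year.txt: $(adir)/menke20-table-3.txt
            awk '{p[$$1]+=$$2; t[$$1]+=$$3} \
                 END{for(y in p) print y, p[y], t[y]/p[y]}' $< > $@

    # Sketch: the subMakefile's mandatory LaTeX macro file, reporting a value
    # (here, the number of rows/years) that is blended into the narrative.
    $(mtexdir)/demo-plot.tex: $(adir)/tools-per-year.txt
            printf '\\newcommand{\\menkenumyears}{%d}\n' $$(wc -l < $<) > $@

The second rule is what creates the arrow from \inlinecode{tools-per-year.txt} to \inlinecode{demo-plot.tex} in the lineage graph: the reported number is regenerated whenever the table changes.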
-We see in Figure~\ref{fig:datalineage} that it is defined as a Make \emph{target} in \inlinecode{demo-plot.mk} and that its prerequisite is \inlinecode{menke20-table-3.txt} (schematically shown by the arrow connecting them).
+We see in Figure \ref{fig:datalineage} that it is defined as a Make \emph{target} in \inlinecode{demo-plot.mk} and that its prerequisite is \inlinecode{menke20-table-3.txt} (schematically shown by the arrow connecting them).
Note that both the row numbers mentioned at the start of this paragraph are also macros.
-Again from Figure~\ref{fig:datalineage}, we see that \inlinecode{menke20-table-3.txt} is a target in \inlinecode{format.mk} and its prerequisite is the input file \inlinecode{menke20.xlsx}.
-The input files (which come from outside the project) are all \emph{targets} in \inlinecode{download.mk} and futher discussed in Section ~\ref{sec:download}.
+Again from Figure \ref{fig:datalineage}, we see that \inlinecode{menke20-table-3.txt} is a target in \inlinecode{format.mk} and its prerequisite is the input file \inlinecode{menke20.xlsx}.
+The input files (which come from outside the project) are all \emph{targets} in \inlinecode{download.mk} and are further discussed in Section \ref{sec:download}.

Having prepared the full dataset in a simple format, let's report the number of subjects (papers and journals) that were studied in \citet{menke20}.
The necessary file for this measurement is \inlinecode{menke20-table-3.txt}; the calculation is done with a simple AWK command, and the results are written in \inlinecode{format.tex}, which is automatically loaded into this paper's source along with the macro files.
In the built PDF paper, the two macros expand to $\menkenumpapers$ (number of papers studied) and $\menkenumjournals$ (number of journals studied) respectively.
-This step is shown schematically in Figure~\ref{fig:datalineage} with the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{format.tex}.
+This step is shown schematically in Figure \ref{fig:datalineage} with the arrow from \inlinecode{menke20-table-3.txt} to \inlinecode{format.tex}.

@@ -550,25 +569,19 @@ This step is shown schematically in Figure \ref{fig:datalineage} with the arrow
The \inlinecode{download.mk} subMakefile is present in all Maneage projects and contains the common steps for importing the input dataset(s) into the project.
All necessary input datasets for the project are imported through this subMakefile.
-This helps in modularity and minimal complexity (\ref{principle:modularity} \& \ref{principle:complexity}): to see which external datasets were used in a project, this is the only necessary file to manage/read.
-Note that a simple call to a downloader (for example \inlinecode{wget}) is usually not enough.
-Irrespective of where the dataset is \emph{used} in the project's lineage, it helps to maintain relation with the outside world (to the project) in one subMakefile.
+Irrespective of where the dataset is \emph{used} in the project's lineage, it helps to maintain the project's relation with the outside world in a single subMakefile (see the modularity and minimal complexity principles, \ref{principle:modularity} \& \ref{principle:complexity}).
Each external dataset has some basic information, including its expected name on the local system (for offline access), the necessary checksum to validate it (either the whole file or just its main ``data'', as discussed in Section \ref{sec:outputverification}), and its URL/PID.
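As a sketch of how this metadata can drive the import itself, a rule of roughly the following form could sit in \inlinecode{download.mk}: it reuses a local copy when one exists, downloads the file otherwise, and then validates the checksum.
The variable names (\inlinecode{INPUT1NAME}, \inlinecode{INPUT1MD5}, \inlinecode{INPUT1URL}), the directories and the exact commands are hypothetical, not the actual Maneage implementation.

# Hypothetical sketch of an input-import rule; all variable names,
# directories and commands below are illustrative assumptions.
$(indir)/$(INPUT1NAME):
	if [ -f $(LOCALCOPYDIR)/$(INPUT1NAME) ]; then \
	  cp $(LOCALCOPYDIR)/$(INPUT1NAME) $@; \
	else \
	  wget -O $@ "$(INPUT1URL)"; \
	fi
	echo "$(INPUT1MD5)  $@" | md5sum -c - \
	  || { echo "Checksum mismatch in $@"; rm -f $@; exit 1; }

Because the imported file is itself a Make target, any analysis step that needs it simply lists it as a prerequisite, so the download and its validation happen at most once per build directory.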
In Maneage, such information regarding a project's input dataset(s) is in the \inlinecode{INPUTS.conf} file.
-See Figures~\ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively.
+See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively.
For demonstration, we are using the datasets of \citet{menke20} which are stored in one \inlinecode{.xlsx} file on bioRxiv\footnote{\label{footnote:dataurl}Full data URL: \url{\menketwentyurl}}.
-Figure~\ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} where the necessary information are stored as Make variables and are automatically loaded into the full project when Make starts (and is most often used in \inlinecode{download.mk}).
+Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf}, where the necessary information is stored as Make variables, which are automatically loaded into the full project when Make starts (and are most often used in \inlinecode{download.mk}).

\begin{figure}[t]
  \input{tex/src/figure-src-inputconf.tex}
  \vspace{-3mm}
- \caption{\label{fig:inputconf} Contents of the \inlinecode{INPUTS.conf} file for the demonstration dataset of \citet{menke20}.
- This file contains the basic, or minimal, metadata for retrieving the required dataset(s) of a project: it can become arbitrarily long.
- Here \inlinecode{M20DATA} contains the name of this dataset within this project.
- \inlinecode{MK20MD5} contains the MD5 checksum of the dataset, in order to check the validity and integrity of the dataset before usage.
- \inlinecode{MK20SIZE} contains the size of the dataset in human-readable format.
- \inlinecode{MK20URL} is the URL which the dataset is automatically downloaded from (only when its not already present on the host).
+ \caption{\label{fig:inputconf} The \inlinecode{INPUTS.conf} configuration file keeps references to the external (input) datasets of a project, as well as their checksums for validation; see Sections \ref{sec:download} \& \ref{sec:configfiles}.
+ Shown here are the entries for the demonstration dataset of \citet{menke20}. Note that the original URL (footnote \ref{footnote:dataurl}) is too long to display properly here.
  }
\end{figure}

@@ -580,13 +593,13 @@ Figure~\ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} wher

The subMakefiles discussed above should only contain the organization of an analysis, they should not contain any fixed numbers, settings or parameters, as such elements should only be used as variables which are defined in configuration files.
Configuration files enable the logical separation between the low-level implementation and high-level running of a project.
-In the data lineage plot of Figure~\ref{fig:datalineage}, configuration files are shown as the sharp-edged, green \inlinecode{*.conf} files in the top row (for example, the \inlinecode{INPUTS.conf} file that was shown in Figure~\ref{fig:inputconf} and mentioned in Section~\ref{sec:download}).
-All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} (see Figure~\ref{fig:files}) subdirectory, and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles, see Figure~\ref{fig:topmake}.
+In the data lineage plot of Figure \ref{fig:datalineage}, configuration files are shown as the sharp-edged, green \inlinecode{*.conf} files in the top row (for example, the \inlinecode{INPUTS.conf} file that was shown in Figure \ref{fig:inputconf} and mentioned in Section \ref{sec:download}).
+All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} subdirectory (see Figure \ref{fig:files}), and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles; see Figure \ref{fig:topmake}.

-The demo analysis of Section~\ref{sec:analysis} is a good demonstration of their usage: during that discussion we reported the number of papers studied by \citet{menke20} in \menkenumpapersdemoyear.
+The demo analysis of Section \ref{sec:analysis} is a good demonstration of their usage: during that discussion we reported the number of papers studied by \citet{menke20} in \menkenumpapersdemoyear.
However, the year's number is not written by hand in \inlinecode{demo-plot.mk}.
It is referenced through the \inlinecode{menke-year-demo} variable, which is defined in \inlinecode{menke-demo-year.conf}, that is a prerequisite of the \inlinecode{demo-plot.tex} rule.
-This is also visible in the data lineage of Figure~\ref{fig:datalineage}.
+This is also visible in the data lineage of Figure \ref{fig:datalineage}.
If we later would decide to report the number in another year, we would simply have to change the value in \inlinecode{menke-demo-year.conf}.
A configuration file is a prerequisite of the target that uses it, hence its date will be newer than \inlinecode{demo-plot.tex}.
Therefore Make will re-execute the recipe to generate the macro file before this paper is re-built and the corresponding year and value will be updated in this paper, always in synchronization with each other and no matter how many times they are used.

@@ -608,8 +621,8 @@ Maneage contains only plain-text files, and therefore it can be maintained under
Every commit in the version-controlled history contains \emph{a complete} snapshot of the data lineage (for more, see the completeness principle \ref{principle:complete}).
Maneage is maintained by its developers in a central branch, which we will call \inlinecode{man\-eage} hereafter.
The \inlinecode{man\-eage} branch contains all the low-level infrastructure, or skeleton, that is necessary for any project as described in the sections above.
-As mentioned in Section ~ref{sec:maneage}, to start a new project users simply clone it from its reference repository and build their own Git branch over the most recent commit.
-This is demonstrated in the first phase of Figure~\ref{fig:branching} where a project has started by branching-off of commit \inlinecode{0c120cb} in the \inlinecode{maneage} branch.
+As mentioned in Section \ref{sec:maneage}, to start a new project, users simply clone it from its reference repository and build their own Git branch over it.
+This is demonstrated in Figure \ref{fig:branching}(a), where a project has started by branching off commit \inlinecode{0c120cb}.

%% Exact URLs of imported images.
%% Collaboration icon: https://www.flaticon.com/free-icon/collaboration_809522
@@ -618,14 +631,11 @@ This is demonstrated in the first phase of Figure~\ref{fig:branching} where a pr
\begin{figure}[t]
  \includetikz{figure-branching}
  \vspace{-3mm}
- \caption{\label{fig:branching} Projects start by branching off the main Maneage branch and developing their high-level analysis over the common low-level infrastructure: add flesh to a skeleton.
- The low-level infrastructure can always be updated (keeping the added high-level analysis intact), with a simple merge between branches.
- Two phases of a project's evolution shown here: in phase 1, a co-author has made two commits in parallel to the main project branch, which have later been merged.
- In phase 2, the project has finished: note the identical first project commit and the Maneage commits it branches from.
- The dashed parts of Scenario 2 can be any arbitrary history after those shown in phase 1.
- A second team now wants to build upon that published work in a derivate branch, or project.
- The second team applies two commits and merges their branch with Maneage to improve the skeleton and continue their research.
- The Git commits are shown on their branches as colored ellipses, with their hash printed in them.
+ \caption{\label{fig:branching} Harnessing the power of version control for project management with Maneage.
+ Maneage is maintained as a core branch, with projects created by branching off of it.
+ (a) shows how projects evolve on their own branch, but can always update their low-level structure by merging with the core branch.
+ (b) shows how a finished/published project can be revitalized for new technologies simply by merging with the core branch.
+ Each Git ``commit'' is shown on its branch as a colored ellipse, with its hash printed inside.
  The commits are colored based on the team that is working on that branch.
  The collaboration and paper icons are respectively made by `mynamepong' and `iconixar' and downloaded from \url{www.flaticon.com}.
  }
@@ -633,20 +643,18 @@ This is demonstrated in the first phase of Figure~\ref{fig:branching} where a pr
After a project starts, Maneage will evolve, for example, new features will be added or low-level bugs will be fixed.
Because all projects branch-off from the same branch that these infrastructure improvements are made, updating the project's low-level skeleton is as easy as merging the \inlinecode{maneage} branch into the project's branch.
-For example, in Figure~\ref{fig:branching} (phase 1), see how Maneage's \inlinecode{3c05235} commit has been merged into project's branch trough commit \inlinecode{2ed0c82}.
+For example, in Figure \ref{fig:branching}(a), see how Maneage's \inlinecode{3c05235} commit has been merged into the project's branch through commit \inlinecode{2ed0c82}.
+This allows infrastructure improvements and fixes to be easily propagated to all projects.

-Another useful scenario is reviving a finished/published project at later date, perhaps by other researchers, as shown in phase 2 of Figure ~\ref{fig:branching}, where
- a new team of researchers have decided to experiment on the results of the published paper and have merged it with the Maneage branch (commit \inlinecode{a92b25a}) to make it usable for their system (e.g., assuming the original project was completed years ago, and is no longer directly executable).
-
-Other possible scenarios include a third project that can easily merge various high-level components from different projects into its own branch, thus adding a temporal dimension to their data lineage.
-This structure also enables easy propagation of low-level fixes to all projects using Maneage.
+Another useful scenario is reviving a finished/published project at a later date, possibly by other researchers, as shown in Figure \ref{fig:branching}(b) (for example, when the original project was completed years ago and is no longer directly executable).
+Other scenarios include projects that are created by merging various other projects.
Modern version control systems provide many more capabilities that can be leveraged through Maneage in project management, thanks to the shared branch it has with \emph{all} derived projects, and that it is complete (\ref{principle:complete}).

\subsection{Multi-user collaboration on single build directory}
\label{sec:collaborating}
Because the project's source and build directories are separate, it is possible for different users to share a build directory, while working on their own separate project branches during a collaboration.
-Similar to the parallel branch that is later merged in phase 1 of Figure~\ref{fig:branching}.
+This is similar to the parallel branch that is later merged in Figure \ref{fig:branching}(a).
To enable this mode, the \inlinecode{./project} script has a special \inlinecode{--group} option which takes the name of a (POSIX) user group in the host operating system.
All files built in the build directory are then automatically assigned to this user group, with read and write permissions.
Of course, avoiding conflicts in the build directory, while members are working on different branches is up to the team.
@@ -666,7 +674,7 @@ Therefore, there are two scenarios for the publication of the project: 1) only p
In the former case, the output of \inlinecode{dist} (described above) can be submitted to the journal as a supplement, or uploaded to pre-print servers like arXiv that will actually compile the \LaTeX{} source and build their own PDFs.
The Git history can also be archived as a single ``bundle'' file and also submitted as a supplement.
When publishing with datasets, the project's outputs, and inputs (if necessary), can be published on servers like Zenodo.
-For example, \citet{akhlaghi19} uploaded all the project's necessary software and its final PDF to Zenodo in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481}, along with the source files mentioned above.
+For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} uploaded all the project's necessary software and its final PDF, along with the project's source tarball and Git ``bundle'', to Zenodo.



@@ -680,68 +688,82 @@ For example, \citet{akhlaghi19} uploaded all the project's necessary software a

\section{Discussion \& Caveats}
\label{sec:discussion}

-Maneage is the final product of various research projects carried out in astrophysics over the past 5 years, and
-its primordial implementation was written for the analysis of \citet{akhlaghi15}.
-Since the full analysis pipeline was in plain-text and consumed much less space than a single figure, it was uploaded to
-\texttt{arXiv} with the paper's \LaTeX{} source, see \href{https://arxiv.org/abs/1505.01664}{arXiv:1505.01664}\footnote{
- To download the \LaTeX{} source of any \texttt{arXiv} paper, click on the ``Other formats'' link, containing necessary instructions and links.}.
-The system later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi: \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
+Maneage was created and evolved during various research projects (in astrophysics) over the last 5 years.
+The primordial implementation was written for \citet{akhlaghi15}.
+It later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi: \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects.
That template later matured into Maneage by including the installation of all necessary software from source and it was used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
-In the last year and within the context of the Research Data Alliance (RDA) grant that was awarded to Maneage, its user base (and thus its development) grew phenomenally and it has evolved to become much more customizable, well-tested and well-documented.
-It is, however, far from complete: its core architecture will continue evolving after the publication of this paper, and a list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file.
+As a Git repository, a Maneage project can benefit from the archival and citation features of Software Heritage \citep{dicosmo18}, enabling easy citation of precise parts of other projects, at various points in their history.
+Once Maneage is adopted on a wide scale within a particular research topic, it is possible to feed the resulting projects into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result.
+Because Maneage is complete and also includes the project's history, even inputs (software and input data) or failed tests during a project can enter this optimization process.
+Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies.
+
+Maneage was awarded a Research Data Alliance (RDA) adoption grant for adopting the recommendations of its publishing data workflows working group \citep{austin17}.
+Its user base (and thus its development) grew phenomenally afterwards.
+It is, however, far from complete: bugs will be fixed and its core architecture will continue to evolve after the publication of this paper.
+A list of the notable changes will be kept in the \inlinecode{README-hacking.md} file.

-Based on early adopters, we have seen the following caveats for Maneage.
-The first caveat is related to its widespread adoption: by principle, Maneage uses very low-level tools like Git, \LaTeX, Make and command-line tools to run in non-interactive mode, but
- a large fraction of the scientific community is accustomed to interactive graphic user interface (GUI) tools.
-This is not often a final choice: some of our early users simply were simply not aware such tools existed.
-After seeing them in action together (as a \emph{complete} Maneage project) they have started using these tools effectively.
-Unfortunately by their low-level nature, the documentation of these tools alone discourages scientists. We are thus working on several tutorials and scientist-friendly documentation of such tools, hopefully by collaborating with efforts such as \href{http://software.ac.uk}{software.ac.uk} and \href{http://urssi.us}{urssi.us}.
-\citet{fineberg19} also note the importance that a project starts by following good practice, not to force it in the end.
+Based on feedback from early adopters, we have seen the following caveats for Maneage.
+The first caveat regards its widespread adoption: by design, Maneage uses very low-level tools such as Git, \LaTeX, Make and the command-line, which are not commonly used by scientists.
+We have found that this is mainly because scientists have not been exposed to these tools as useful components of their research.
+Once their usage was witnessed in practice, however, the tools were readily adopted as good practice.
+\citet{fineberg19} also note the importance of projects following good practice from the start, rather than having it imposed at the end.

A second caveat is the fact that Maneage is effectively an almost complete GNU operating system, tailored to each project.
-It is just built on top of an existing POSIX-compatible operating system, using its kernel.
-Maneage has many generic scripts for simplifying the software packaging, but
- maintaining them (updating versions or fixing bugs on some hosts) can take time for a small team.
-Because package management (Section~\ref{sec:projectconfigure}) is in the same language as the analysis, some users have learnt to package their necessary software or correct some bugs themselves.
-They later send those additions as merge-requests to the core Maneage branch, thus propagating the improvement to all projects using Maneage.
-With a larger users base we hope the fraction of such contributors to increase and hence decrease the burden on our core team.
-
-Another caveat that has been raised is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue harvesting from all their hard work.
+Maintaining the various software packages can be time-consuming for its core developers.
+In Maneage, package management (Section \ref{sec:projectconfigure}) is in the same language as the analysis; therefore, some users have already added their necessary software themselves and submitted the additions to the core Maneage branch, thus propagating the improvements to all projects using Maneage.
+With a larger user base, we look forward to an increasing number of such contributors, which will decrease the burden on our core team.
+
+Another caveat that has been raised is that publishing a project's reproducible data lineage immediately after publication may hamper the authors' ability to continue with follow-up papers, because others may publish them first.
Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways: 1) Through the Git history, it is clear how much extra work the other team has added.
-In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``\textsl{standing on the shoulders of giants}'' quote.
+In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
This is however a long-term goal and requires major changes to academic value systems.
2) Authors can be given a grace period where the journal, or some third authority, keeps the source and publishes it a certain interval after publication.

-Once Maneage is adopted on a wide scale in a special topic, it is possible to feed them into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result.
-Because Maneage is complete and also includes the project's history, even inputs (software and input data) or failed tests during the projects can enter this optimization process.
-Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies.
+
+
+
+
+
+
+
+

\section{Conclusion \& Summary}
\label{sec:conclusion}

To effectively leverage the power of big data, we need to have a complete view of its lineage.
-Scientists are however rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years does not help.
-Maneage is designed as a complete template, providing scientists with a built low-level skeleton that scientists can customize for any project and adopt modern, robust and efficient data management in practice on their own projects.
+Scientists are, however, rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years does not help.
+Such high-level tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects.
+Scientists, on the other hand, need to focus on their own research fields and to think about longevity.
+
+Maneage is designed as a complete template, providing scientists with a pre-built low-level skeleton, using simple and robust tools that have withstood the test of time while still being actively developed.
+Scientists can customize its existing data management for their own projects, enabling them to learn and master the lower-level tools in the meantime.
+This improves their efficiency and the robustness of their scientific results, while also enabling future scientists to reproduce and build upon their work.
+
+We discussed the founding principles of Maneage, which are completeness, modularity, minimal complexity, verifiable inputs and outputs, temporal provenance, and free software.
+We showed how these principles are implemented in an already built structure that is ready for customization, and discussed the caveats and advantages of this implementation.
+With a larger user base and wider adoption in scientific (and hopefully industrial) settings, Maneage will certainly grow and become even more robust, stable and user-friendly.
+ + + + + + + -Maneage is a solution built upon the principles of completeness, modularity, minimal complexity, verifiable inputs and outputs, temporal provenance, and free software. -These principles are implemented in an pre-built structure that users just have to customize for the high-level aspects of their projects. -While there are caveats and advantages of this implementation, -with a larger user-base and a wider application in scientific and industrial applications, Maneage will certainly grow and become even more stable and user friendly. -\tonote{One more paragraph will be added here: don't forget to review the caveats} -\tonote{DVG: perhaps also adding the plain text feature ... but only it there is any space left !} %% Acknowledgements -\section{Acknowledgments} -The authors wish to thank David Valls-Gabaud, Johan Knapen, Ignacio Trujillo, Roland Bacon, Konrad Hinsen, Yahya Sefidbakht, Simon Portegies Zwart, Pedram Ashofteh Ardakani, Elham Saremi, Zahra Sharbaf and Surena Fatemi for their useful suggestions and feedback on Maneage and this paper. +\section*{Acknowledgments} +The authors wish to thank David Valls-Gabaud, Alice Allen, Johan Knapen, Ignacio Trujillo, Roland Bacon, Konrad Hinsen, Yahya Sefidbakht, Simon Portegies Zwart, Pedram Ashofteh Ardakani, Elham Saremi, Zahra Sharbaf and Surena Fatemi for their useful suggestions and feedback on Maneage and this paper. We also thank Julia Aguilar-Cabello for designing the Maneage logo. During its development, Maneage has been partially funded (in historical order) by the following institutions: The Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) PhD scholarship to M.A and its Grant-in-Aid for Scientific Research (21244012, 24253003). The European Research Council (ERC) advanced grant 339659-MUSICOS. -The European Union’s Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN network. +The European Union’s Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN. The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (FEDER) under the grant with reference AYA2016-76219-P. The IAC project P/300724, financed by the Ministry of Science, Innovation and Universities, through the State Budget. The Canary Islands Department of Economy, Knowledge and Employment, through the Regional Budget of the Autonomous Community. @@ -749,6 +771,17 @@ The Fundaci\'on BBVA under its 2017 programme of assistance to scientific resear \input{tex/build/macros/dependencies.tex} +\section*{Competing Interests} +The authors have no competing interests to declare. + +\section*{Author Contributions} +\begin{enumerate} +\item Mohammad Akhlaghi: principal author of the Maneage source code and this paper, also principal investigator (PI) of the RDA Adoption grant awarded to Maneage. +\item Ra\'ul Infante-Sainz: contributed many patches/commits to the source of Maneage, also helped in early testing and writing this paper. +\item David Valls-Gabaud: involved in the Maneage project and its testing for 4 years and contributed to writing this paper. +\item Roberto Baena-Gall\'e: contributed to early testing of Maneage and in writing this paper. +\end{enumerate} + %% Tell BibLaTeX to put the bibliography list here. 
\printbibliography diff --git a/tex/img/codata.pdf b/tex/img/codata.pdf Binary files differnew file mode 100644 index 0000000..e00f2ca --- /dev/null +++ b/tex/img/codata.pdf diff --git a/tex/img/codata.png b/tex/img/codata.png Binary files differdeleted file mode 100644 index c78dbc3..0000000 --- a/tex/img/codata.png +++ /dev/null diff --git a/tex/src/figure-branching.tex b/tex/src/figure-branching.tex index e264746..a917987 100644 --- a/tex/src/figure-branching.tex +++ b/tex/src/figure-branching.tex @@ -81,7 +81,7 @@ %% Description of this scenario: \draw [rounded corners, fill=black!10!white] (3.1cm,0) rectangle (7.5cm,1.25cm); - \draw [anchor=west, black] (3.1cm,1.0cm) node {\small \textbf{Phase 1} (pre-publication):}; + \draw [anchor=west, black] (3.1cm,1.0cm) node {\textbf{(a)} pre-publication:}; \draw [anchor=west, black] (3.3cm,0.6cm) node {\footnotesize Collaborating on a project while}; \draw [anchor=west, black] (3.3cm,0.2cm) node {\footnotesize working in parallel, then merging.}; @@ -140,7 +140,7 @@ %% Description of this scenario: \draw [rounded corners, fill=black!10!white] (11.1cm,0) rectangle (15.3cm,1.25cm); - \draw [anchor=west, black] (11.1cm,1.0cm) node {\small \textbf{Phase 2} (post-publication):}; + \draw [anchor=west, black] (11.1cm,1.0cm) node {\textbf{(b)} post-publication:}; \draw [anchor=west, black] (11.3cm,0.6cm) node {\footnotesize Other researchers building upon}; \draw [anchor=west, black] (11.3cm,0.2cm) node {\footnotesize previously published work.}; \end{tikzpicture} diff --git a/tex/src/figure-src-inputconf.tex b/tex/src/figure-src-inputconf.tex index 1245dfb..fc3315d 100644 --- a/tex/src/figure-src-inputconf.tex +++ b/tex/src/figure-src-inputconf.tex @@ -3,6 +3,6 @@ \texttt{\mkvar{MK20DATA} = menke20.xlsx}\\ \texttt{\mkvar{MK20MD5}{ } = 8e4eee64791f351fec58680126d558a0}\\ \texttt{\mkvar{MK20SIZE} = 1.9MB}\\ - \texttt{\mkvar{MK20URL}{ } = https://the.full.url/is/too/large/for/here/media-1.xlsx}\\ + \texttt{\mkvar{MK20URL}{ } = https://the.full.url/is/too/large/to/show/here/media-1.xlsx}\\ \vspace{-3mm} \end{tcolorbox} diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex index e20c73c..26deac9 100644 --- a/tex/src/preamble-style.tex +++ b/tex/src/preamble-style.tex @@ -115,19 +115,19 @@ \pagestyle{fancy} \lhead{\mplight\footnotesize Art.XX, page {\thepage} of \pageref{LastPage}} \chead{} -\rhead{\mplight\footnotesize Akhlaghi et al; Reproducible paper template} +\rhead{\mplight\footnotesize Akhlaghi, et al: Maneage, a Customizable Framework for Managing Data Lineage} \lfoot{} \cfoot{} \rfoot{} \renewcommand\headrulewidth{0.0pt} \renewcommand\footrulewidth{0.0pt} \fancypagestyle{firstpage} { - \lhead{\includegraphics[width=3.5cm]{tex/img/codata.png}} + \lhead{\includegraphics[width=3.5cm]{tex/img/codata.pdf}} \chead{} \rhead{\mplight\footnotesize - Akhlaghi, M, et al. 2019. Reproducible paper template\\ - \emph{Data Science Journal}, VV, NN, pp.1-N,\\ - DOI: https://doi.org/10.5334/dsj-XXXX-XXX} + Akhlaghi, M, et al. 2020. Maneage, a Customizable Framework\\ + for Managing Data Lineage. \emph{Data Science Journal}, VV,\\ + NN, pp.1-\pageref*{LastPage}. 
DOI: \href{https://doi.org/10.5334/dsj-XXXX-XXX}{\textcolor{black}{https://doi.org/10.5334/dsj-XXXX-XXX}}} \lfoot{} \cfoot{} \rfoot{} diff --git a/tex/src/references.tex b/tex/src/references.tex index 510fc89..ef33d02 100644 --- a/tex/src/references.tex +++ b/tex/src/references.tex @@ -695,6 +695,20 @@ archivePrefix = {arXiv}, +@ARTICLE{austin17, + author = {{Claire C.} Austin and Theodora Bloom and Sünje Dallmeier-Tiessen and {Varsha K.} Khodiyar and Fiona Murphy and Amy Nurnberger and Lisa Raymond and Martina Stockhause and Jonathan Tedds and Mary Vardigan and Angus Whyte}, + title = {Key components of data publishing: using current best practices to develop a reference model for data publishing}, + journal = {International Journal on Digital Libraries}, + volume = {18}, + year = {2017}, + pages = {77}, + doi = {10.1007/s00799-016-0178-2}, +} + + + + + @ARTICLE{smith16, author = {Arfon M. Smith and Daniel S. Katz and Kyle E. Niemeyer}, title = {Software citation principles}, |