From 2cc4bc2a193bdb5816674266ad03c70c5da96ae6 Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Tue, 14 Apr 2020 18:35:34 +0100 Subject: Further text shrink, added Competing interest and Author contributions To make the text easier to read and further comply with the author guideline, the text was shrank a little more and the two final sections were also added on "Competing interest" and "Author contributions". I also found the CODATA logo on Wikipedia in SVG format (vector graphics), so I replaced the previous pixelated PNG format with the PDF (converted from SVG). --- paper.tex | 159 ++++++++++++++++++++++----------------- tex/img/codata.pdf | Bin 0 -> 8798 bytes tex/img/codata.png | Bin 112554 -> 0 bytes tex/src/figure-branching.tex | 4 +- tex/src/figure-src-inputconf.tex | 2 +- tex/src/preamble-style.tex | 10 +-- tex/src/references.tex | 14 ++++ 7 files changed, 113 insertions(+), 76 deletions(-) create mode 100644 tex/img/codata.pdf delete mode 100644 tex/img/codata.png diff --git a/paper.tex b/paper.tex index f71d9be..14eaaf6 100644 --- a/paper.tex +++ b/paper.tex @@ -24,7 +24,7 @@ -\title{Maneage: Customizable Framework for Managing Data Lineage} +\title{Maneage, a Customizable Framework for Managing Data Lineage} \author{\large\mpregular \authoraffil{Mohammad Akhlaghi}{1,2}, \large\mpregular \authoraffil{Ra\'ul Infante-Sainz}{1,2}, \large\mpregular \authoraffil{Roberto Baena-Gall\'e}{1,2}\\ @@ -100,13 +100,15 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical \caption{\label{fig:questions}Graph of a generic project's workflow (connected through arrows), highlighting the various issues/questions on each step. The green boxes with sharp edges are inputs and the blue boxes with rounded corners are the intermediate or final outputs. The red boxes with dashed edges highlight the main questions on the respective stage. 
- The box coverting the software download and build phases shows some common tools software developers use for that phase, but a scientific project is so much more than that. + The box covering the software download and build phases shows some common tools software developers use for that phase, but a scientific project is so much more than that. } \end{figure} The reason such reports are mostly from genomics and bioinformatics is because they have traditionally been more open to publishing workflows: for example \href{https://www.myexperiment.org}{myexperiment.org}, or \href{https://www.genepattern.org}{genepattern.org}, \href{https://galaxyproject.org}{galaxy\-project.org}, and others. Such integrity checks are a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows). The status in other fields, where workflows are not commonly shared, is probably (much) worse. +Nature is already a black box which we are trying hard to unlock. +Not being able to experiment on the methods of other researchers is a self-imposed black box over it. The completeness of a paper's published metadata (or ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors. Several studies have attempted to answer this with different levels of detail, for example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in medical/biological field and is currently above $80\%$. @@ -131,6 +133,14 @@ In Sections \ref{sec:definitions} \& \ref{sec:principles} the basic concepts are Section \ref{sec:maneage} describes the internal structure of Maneage and Section \ref{sec:discussion} is a discussion on its benefits, caveats and future prospects. 
+ + + + + + + + \section{Definitions} \label{sec:definitions} @@ -198,7 +208,7 @@ The core principle of Maneage is simple: science is defined by its method, not i \citet{buckheit1995} summarize this nicely by noting that modern scientific papers (narrative combined with plots, tables and figures) are merely advertisements of a scholarship, the actual scholarship is the scripts and software usage that went into doing the analysis. Maneage is not the first attempted solution to this fundamental problem. -Various solutions have been proposed since the early 1990s, for example RED \citep{claerbout1992,schwab2000}, Apache Taverna \citep{oinn04}, Madagascar \citep{fomel13}, GenePattern \citep{reich06}, Kepler \citep{ludascher05}, VisTrails \citep{bavoil05}, Galaxy \citep{goecks10}, Image Processing On Line journal \citep[IPOL][]{limare11}, WINGS \citep{gil10}, Active papers \citep{hinsen11}, Collage Authoring Environment \citep{nowakowski11}, SHARE \citep{vangorp11}, Verifiable Computational Result \citep{gavish11}, SOLE \citep{pham12}, Sumatra \citep{davison12}, Sciunit \citep{meng15}, Popper \citep{jimenez17}, WholeTale \citep{brinckman19}, and many more. +Various solutions have been proposed since the early 1990s, for example RED (1992), Apache Taverna (2003), Madagascar (2003), GenePattern (2004), Kepler (2005), VisTrails (2005), Galaxy (2010), WINGS (2010), Image Processing On Line journal (IPOL, 2011), Active papers (2011), Collage Authoring Environment (2011), SHARE (2011), Verifiable Computational Result (2011), SOLE (2012), Sumatra (2012), Sciunit (2015), Binder (2017), Popper (2017), WholeTale (2019), and many more. To highlight the uniqueness of Maneage in this plethora of tools, a more elaborate list of principles are necessary as described below. 
\begin{enumerate}[label={\bf P\arabic*}] @@ -284,7 +294,7 @@ Maneage is an implementation of the principles of Section \ref{sec:principles}: In practice, it is a collection of plain-text files, that are distributed in pre-defined sub-directories by context (a modular source), and are all under version-control, currently with Git. The main Maneage Branch is a fully working skeleton of a project without much flesh: containing all the low-level infrastructure, but without any actual high-level analysis operations\footnote{In the core Maneage branch, only a simple demo analysis is included to be complete. But it can easily be removed: all its files and steps have a \inlinecode{delete-me} prefix.}. -Maneage contains a file called \inlinecode{README-hacking.md}\footnote{Note that the \inlinecode{README.md} file is reserved for the project using Maneage, not maneage itself.} that has a complete checklist of steps to start a new project and remove demonstration parts, there are also hands-on tutorials to help new adopters. +Maneage contains a file called \inlinecode{README-hacking.md}\footnote{Note that the \inlinecode{README.md} file is reserved for the project using Maneage, not Maneage itself.} that has a complete checklist of steps to start a new project and remove demonstration parts, there are also hands-on tutorials to help new adopters. To start a new project, the authors will \emph{clone}\footnote{In Git, the ``clone'' operation is the process of copying all the project's files and history from a repository onto the local system.} Maneage, create their own Git branch over the latest commit, and start their project by customizing that branch. Customization in their project branch is done by adding the names of the software they need, references to their input data, the analysis commands, visualization commands, and a narrative report which includes the visualizations. 
@@ -319,7 +329,7 @@ In practice, these two steps are run with the following commands: \end{lstlisting} Here, we will delve deeper into the implementation and some usage details of Maneage. -Section \ref{sec:usingmake} elaborates why Make (a POSIX tool) was chosen as the main job orchestrator in Maneage. +Section \ref{sec:usingmake} elaborates why Make (a POSIX tool) was chosen as the main job manager in Maneage. Sections \ref{sec:projectconfigure} \& \ref{sec:projectanalysis} then discuss the operations done during the configuration and analysis phase. Afterwards, we describe how Maneage projects benefit from version control in Section \ref{sec:projectgit}. Section \ref{sec:collaborating} discusses sharing of a built environment and finally, in Section \ref{sec:publishing} the publication/archival of Maneage projects are discussed. @@ -356,8 +366,8 @@ Because of its simplicity, we have also had very good feedback on using Make fro Maneage orchestrates the building of its necessary software in the same language that it orchestrates the analysis: Make (see Section \ref{sec:usingmake}). Therefore, a researcher already using Maneage for their high-level analysis easily understands, and can customize, the software environment too, without delving into the intricacies of third-party tools. -Most existing tools reviewed in Section \ref{sec:principles}, use package managers like Conda to maintain the software environment, but since conda itself is written in Python, it does not fit in our completeness principle \ref{principle:complete}. -Highly robust solutions like Nix \citep{dolstra04} and GNU Guix \citep{courtes15} do exist, but they require root permissions which is also against that principle. +Most existing tools reviewed in Section \ref{sec:principles}, use package managers like Conda to maintain the software environment, but since Conda itself is written in Python, it does not fit in our completeness principle \ref{principle:complete}. 
+Highly robust solutions like Nix and GNU Guix do exist, but they require root permissions which is also against that principle. Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure \ref{fig:files}. At the start of project configuration, Maneage needs a top-level directory to build itself on the host filesystem (software and analysis). @@ -396,14 +406,13 @@ For example\footnote{In this paper we have used very basic tools that are not ac This is particularly important in the case for research software, where citation is critical to justify continued development. One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts. -This does not cause any problem in automatic scripts, but can be annoying when reading/debugging the outputs. -Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros! +Users can disable the notice with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros! This is justified by an uncomfortably true statement\footnote{GNU Parallel's FAQ on the need to cite software: \url{http://git.savannah.gnu.org/cgit/parallel.git/plain/doc/citation-notice-faq.txt}}: ``history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool''. Most other research software do not resort to such drastic measures, however, citation is important for them. Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software in the final paper. 
For a review of the necessity and basic elements in software citation, see \citet{katz14} and \citet{smith16}. -There are ongoing projects specifically tailored to software citation, including CodeMeta\footnote{\url{https://codemeta.github.io}} and Citation file format\footnote{\url{https://citation-file-format.github.io}} (CFF), a very robust approach is also provided by SoftwareHeritage \citep{dicosmo18}. +There are ongoing projects specifically tailored to software citation, including CodeMeta and Citation file format (CFF), a very robust approach is also provided by SoftwareHeritage \citep{dicosmo18}. We plan to enable these wonderful tools in Maneage. @@ -436,14 +445,13 @@ Figure \ref{fig:datalineage} schematically shows these subMakefiles and their re \includetikz{figure-data-lineage} \end{center} \vspace{-7mm} - \caption{\label{fig:datalineage}Schematic representation of data lineage in a hypothetical project/pipeline using Maneage. + \caption{\label{fig:datalineage}Schematic representation of a project's data lineage, or workflow, for the demonstration analysis of this paper. Each colored box is a file in the project and the arrows show the dependencies between them. Green files/boxes are plain text files that are under version control and in the source-directory. - Blue files/boxes are output files of various steps in the build-directory, located within the Makefile (\inlinecode{*.mk}) that generates them. - For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory and generated automatically) and \inlinecode{paper.tex} (in the source directory and written by hand). - In turn, \inlinecode{project.tex} depends on all the \inlinecode{*.tex} files at the bottom of the Makefiles above it. - The solid arrows and built boxes with full opacity are actually described in the context of a demonstration project in this paper. 
- The dashed arrows and lower opacity built boxes, just shows how adding more elements to the lineage is also easily possible, making this a scalable tool. + Blue files/boxes are output files of various steps in the build-directory, shown within the Makefile (\inlinecode{*.mk}) where they are defined as a \emph{target}. + For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory and generated automatically) and \inlinecode{paper.tex} (in the source directory, written by hand and under version control). + The solid arrows and built boxes with full opacity are described in Section \ref{sec:projectanalysis}. + The dashed arrows and lower opacity built boxes, show the scalability by adding hypothetical steps to the project. } \end{figure} @@ -565,12 +573,8 @@ Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} wher \begin{figure}[t] \input{tex/src/figure-src-inputconf.tex} \vspace{-3mm} - \caption{\label{fig:inputconf} Contents of the \inlinecode{INPUTS.conf} file for the demonstration dataset of \citet{menke20}. - This file contains the basic, or minimal, metadata for retrieving the required dataset(s) of a project: it can become arbitrarily long. - Here, \inlinecode{M20DATA} contains the name of this dataset within this project. - \inlinecode{MK20MD5} contains the MD5 checksum of the dataset, in order to check the validity and integrity of the dataset before usage. - \inlinecode{MK20SIZE} contains the size of the dataset in human readable format. - \inlinecode{MK20URL} is the URL which the dataset is automatically downloaded from (only when its not already present on the host). + \caption{\label{fig:inputconf} The \inlinecode{INPUTS.conf} configuration file keeps references to external (input) datasets of a project, as well as their checksums for validation, see Sections \ref{sec:download} \& \ref{sec:configfiles}. + Shown here are the entries for the demonstration dataset of \citet{menke20}. 
Note that the original URL (footnote \ref{footnote:dataurl}) was too long to display properly here. } \end{figure} @@ -607,11 +611,11 @@ For the current version of this paper, it expands to \projectversion. \label{sec:projectgit} Maneage is fully composed of plain-text files, therefore, it can be maintained under version control systems (currently using Git). -Every commit in the version controlled history contains \emph{a complete} snapshot of the data lineage, for more, see the completeness principle (\ref{principle:complete}). -Maneage is maintained by its developers in a central branch, which we will call \inlinecode{man\-eage} hereafter. +Every commit in the version controlled history contains \emph{a complete} snapshot of the data lineage (see the completeness principle, \ref{principle:complete}). +Maneage is maintained by its developers in a central branch (hereafter called \inlinecode{man\-eage}). The \inlinecode{man\-eage} branch contains all the low-level infrastructure, or skeleton, that is necessary for any project as described in the sections above. -As mentioned in Section \ref{sec:maneage}, to start a new project, users simply clone it from its reference repository and build their own Git branch over the most recent commit. -This is demonstrated in the first phase of Figure \ref{fig:branching} where a project has started by branching-off of commit \inlinecode{0c120cb} in the \inlinecode{maneage} branch. +As mentioned in Section \ref{sec:maneage}, to start a new project, users simply clone it from its reference repository and build their own Git branch over it. +This is demonstrated in the first phase of Figure \ref{fig:branching} where a project has started by branching-off of commit \inlinecode{0c120cb}. %% Exact URLs of imported images. 
%% Collaboration icon: https://www.flaticon.com/free-icon/collaboration_809522 @@ -620,14 +624,11 @@ This is demonstrated in the first phase of Figure \ref{fig:branching} where a pr \begin{figure}[t] \includetikz{figure-branching} \vspace{-3mm} - \caption{\label{fig:branching} Projects start by branching off the main Maneage branch and developing their high-level analysis over the common low-level infrastructure: add flesh to a skeleton. - The low-level infrastructure can always be updated (keeping the added high-level analysis intact), with a simple merge between branches. - Two phases of a project's evolution shown here: in phase 1, a co-author has made two commits in parallel to the main project branch, which have later been merged. - In phase 2, the project has finished: note the identical first project commit and the Maneage commits it branches from. - The dashed parts of Scenario 2 can be any arbitrary history after those shown in phase 1. - A second team now wants to build upon that published work in a derivate branch, or project. - The second team applies two commits and merges their branch with Maneage to improve the skeleton and continue their research. - The Git commits are shown on their branches as colored ellipses, with their hash printed in them. + \caption{\label{fig:branching} Harvesting the power of version-control in project management with Maneage. + Maneage is maintained as a core branch, with projects created by branching-off of it. + (a) shows how projects evolve on their own branch, but can always update their low-level structure by merging with the core branch. + (b) shows how a finished/published project can be revitalized for new technologies simply by merging with the core branch. + Git ``commits'' are shown on their branches as colored ellipses, with their hashes printed in them. The commits are colored based on the team that is working on that branch. 
The collaboration and paper icons are respectively made by `mynamepong' and `iconixar' and downloaded from \url{www.flaticon.com}. } @@ -636,12 +637,10 @@ This is demonstrated in the first phase of Figure \ref{fig:branching} where a pr After a project starts, Maneage will evolve, for example, new features will be added or low-level bugs will be fixed. Because all projects branch-off from the same branch that these infrastructure improvements are made, updating the project's low-level skeleton is as easy as merging the \inlinecode{maneage} branch into the project's branch. For example, in Figure \ref{fig:branching} (phase 1), see how Maneage's \inlinecode{3c05235} commit has been merged into project's branch trough commit \inlinecode{2ed0c82}. +This allows infrastructure improvements and fixes to be easily propagated to all projects. -Another useful scenario is reviving a finished/published project at later date, by other researchers as shown in phase 2 of Figure \ref{fig:branching}. -In that figure, a new team of researchers have decided to experiment on the results of the published paper and have merged it with the Maneage branch (commit \inlinecode{a92b25a}) to make it usable for their system (e.g., assuming the original project was completed years ago, and is no longer directly executable). - -Other scenarios include a third project that can easily merge various high-level components from different projects into its own branch, thus adding a temporal dimension to their data lineage. -This structure also enables easy propagation of low-level fixes to all projects using Maneage. +Another useful scenario is reviving a finished/published project at a later date, possibly by other researchers as shown in phase 2 of Figure \ref{fig:branching} (e.g., assuming the original project was completed years ago, and is no longer directly executable). +Other scenarios include projects that are created by merging various other projects. 
Modern version control systems provide many more capabilities that can be leveraged through Maneage in project management, thanks to the shared branch it has with \emph{all} derived projects, and that it is complete (\ref{principle:complete}). \subsection{Multi-user collaboration on single build directory} @@ -668,7 +667,7 @@ Therefore, there are various scenarios for the publication of the project: 1) on In the former case, the output of \inlinecode{dist} (described above) can be submitted to the journal as a supplement, or uploaded to pre-print servers like arXiv that will actually compile the \LaTeX{} source and build their own PDFs. The Git history can also be archived as a single ``bundle'' file and also submitted as a supplement. When publishing with datasets, the project's outputs, and inputs (if necessary), can be published on servers like Zenodo. -For example in \citet{akhlaghi19}, along with the source files mentioned above, we also uploaded all the project's necessary software and its final PDF to Zenodo in \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}\footnote{https://doi.org/10.5281/zenodo.3408481}. +For example in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19}, along with the source files mentioned above, we also uploaded all the project's necessary software and its final PDF. @@ -682,65 +681,79 @@ For example in \citet{akhlaghi19}, along with the source files mentioned above, \section{Discussion \& Caveats} \label{sec:discussion} -Maneage is the final product of various research projects (in astrophysics) over the last 5 years. +Maneage evolved during various research projects (in astrophysics) over the last 5 years. The primordial implementation was written for the analysis of \citet{akhlaghi15}. 
-Since the full analysis pipeline was in plain-text and consumed much less space than a single figure, it was uploaded to arXiv with the paper's \LaTeX{} source, see \href{https://arxiv.org/abs/1505.01664}{arXiv:1505.01664}\footnote{ - To download the \LaTeX{} source of any arXiv paper, click on the ``Other formats'' link, containing necessary instructions and links.}. The system later evolved in \citet{bacon17}, in particular, the two sections of that paper that were done by M. Akhlaghi: \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}. With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects. That template later matured into Maneage by including the installation of all necessary software from source and it was used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}. -In the last year and with the Research Data Alliance (RDA) grant that was awarded to Maneage, its user base (and thus its development) grew phenomenally and it has evolved to become much more customizable, well-tested and well-documented. -But it is far from complete: its core architecture will continue evolve after the publication of this paper, therefore a list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file. +As a Git repository, a Maneage project can benefit from the wonderful archival and citation features of Software Heritage \citep{dicosmo18}, enabling easy citation of precise parts of other projects, at various points in their history. +Once Maneage is adopted on a wide scale in a special topic, it is possible to feed them into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result. 
+Because Maneage is complete and also includes the project's history, even inputs (software and input data) or failed tests during the projects can enter this optimization process. +Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies. + +Maneage was awarded a Research Data Alliance (RDA) adoption grant adhering to the recommendations of the publishing data workflows working group \citep{austin17}. +Its user base (and thus its development) grew phenomenally afterwards. +But it is far from complete: bugs will be fixed and its core architecture will continue to evolve after the publication of this paper. +A list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file. Based on early adopters, we have seen the following caveats for Maneage. -The first caveat is regarding its widespread adoption: by principle, Maneage uses very low-level tools like Git, \LaTeX, Make and command-line tools to run in non-interactive mode. -However, a large fraction of the scientific community are accustomed to interactive graphic user interface (GUI) tools. -But this is not often a final choice: some of our early users simply did not know such tools existed. -After seeing them in action together (as a \emph{complete} Maneage project) they have started using these tools effectively. -Unfortunately by their low-level nature, the documentation of these tools alone discourages scientists, we are thus working on several tutorials and scientist-friendly documentation of such tools, hopefully by collaborating with efforts like \href{http://software.ac.uk}{software.ac.uk} and \href{http://urssi.us}{urssi.us}. -\citet{fineberg19} also note the importance that a project starts by following good practice, not to force it in the end. 
+The first caveat is regarding its widespread adoption: by principle, Maneage uses very low-level tools that are not commonly used by scientists like Git, \LaTeX, Make and the command-line. +We have discovered that this is because they have mainly not been exposed to them as useful components in their research. +Once the usage of these tools was witnessed in practice, these tools were adopted to follow best practices. +\citet{fineberg19} also note the importance of projects starting by following good practice, not to force it in the end. A second caveat is the fact that Maneage is effectively an almost complete GNU operating system, tailored to each project. -It is just built ontop of an existing POSIX-compatible operating system, using its kernel. -Maneage has many generic scripts for simplifying the software packaging. -However, maintaining them (updating versions or fixing bugs on some hosts) can take time for a small team. -Because package management (Section \ref{sec:projectconfigure}) is in the same language as the analysis, some users have learnt to package their necessary software or correct some bugs themselves. -They later send those additions as merge-requests to the core Maneage branch, thus propagating the improvement to all projects using Maneage. +Maintaining the various packages can consume time for its core developers. +In Maneage, package management (Section \ref{sec:projectconfigure}) is in the same language as the analysis, therefore some users are already adding their necessary software in it and sending them as merge-requests to the core Maneage branch, thus propagating the improvement to all projects using Maneage. With a larger user-base we hope the fraction of such contributors increases and decreases the burden on our core team. -Another caveat that has been raised by some people is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue harvesting from all their hard work. 
+Another caveat for some researchers is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue with followup papers because others may do it before them. Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways: 1) Through the Git history, it is clear how much extra work the other team has added. In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``standing on the shoulders of giants'' quote. However, this is a long term goal and requires major changes to academic value systems. 2) Authors can be given a grace period where the journal, or some third authority, keeps the source and publishes it a certain interval after publication. -Once Maneage is adopted on a wide scale in a special topic, it is possible to feed them into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result. -Because Maneage is complete and also includes the project's history, even inputs (software and input data) or failed tests during the projects can enter this optimization process. -Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies. + + + + + + + + \section{Conclusion \& Summary} \label{sec:conclusion} To effectively leaverage the power of big data, we need to have a complete view of its lineage. -However, scientists are rarely trained sufficiently in data management or software development, the plethora of high-level tools, that change every few years also does not help. -Maneage is desigend as a complete template, providing scientists with a built low-level skeleton that scientists can customize for any project and adopt modern, robust and efficient data management in practice on their own projects. 
+However, scientists are rarely trained sufficiently in data management or software development, the plethora of high-level tools for software engineering, that change every few years, also does not help. +Maneage is designed as a complete template, providing scientists with a built low-level skeleton, using simple and robust tools that have withstood the test of time while being actively developed. +Scientists can customize its existing data management for their own projects, enabling them to learn and master the lower-level tools in the meantime. +This improves their efficiency and the robustness of their scientific result, while also enabling future scientists to reproduce and build-upon their work. + +We discussed the founding principles of Maneage that are completeness, modularity, minimal complexity, verifiable inputs and outputs, temporal provenance, and free software. +We showed how these principles are implemented in an already built structure, ready for customization and discussed the caveats and advantages of this implementation. +With a larger user-base and wider application in scientific (and hopefully industrial) applications, Maneage will certainly grow and become even more robust, stable and user friendly. + + + + + + + -In this paper we introduced Maneage and how it is built upon the principles of completeness, modularity, minimal complexity, verifiable inputs and outputs, temporal provenance, and free software. -We showed how these principles are implemented in an already built structure that users just have to customize for the high-level aspects of their projects and discussed the caveats and advantages of this implementation. -With a larger user-base and wider application in scientific (and hopefully industrial) applications, Maneage will certainly grow and become even more stable user and friendly. 
-\tonote{One more paragraph will be added here: don't forget to review the caveats} %% Acknowledgements -\section{Acknowledgments} +\section*{Acknowledgments} The authors wish to thank David Valls-Gabaud, Johan Knapen, Ignacio Trujillo, Roland Bacon, Konrad Hinsen, Yahya Sefidbakht, Simon Portegies Zwart, Pedram Ashofteh Ardakani, Elham Saremi, Zahra Sharbaf and Surena Fatemi for their useful suggestions and feedback on Maneage and this paper. We also thank Julia Aguilar-Cabello for designing the Maneage logo. During its development, Maneage has been partially funded (in historical order) by the following institutions: The Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) PhD scholarship to M.A and its Grant-in-Aid for Scientific Research (21244012, 24253003). The European Research Council (ERC) advanced grant 339659-MUSICOS. -The European Union’s Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN network. +The European Union’s Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN. The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (FEDER) under the grant with reference AYA2016-76219-P. The IAC project P/300724, financed by the Ministry of Science, Innovation and Universities, through the State Budget. The Canary Islands Department of Economy, Knowledge and Employment, through the Regional Budget of the Autonomous Community. @@ -748,6 +761,16 @@ The Fundaci\'on BBVA under its 2017 programme of assistance to scientific resear \input{tex/build/macros/dependencies.tex} +\section*{Competing Interests} +The authors have no competing interests to declare. 
+ +\section*{Author Contributions} +\begin{enumerate} +\item Mohammad Akhlaghi: principal author of the Maneage source code and this paper, also principal investigator (PI) of the RDA Adoption grant awarded to Maneage. +\item Ra\'ul Infante-Sainz: contributed many patches/commits to the source of Maneage, also helped in early testing and writing this paper. +\item Roberto Baena-Gall\'e: contributed to early testing of Maneage and in writing this paper. +\end{enumerate} + %% Tell BibLaTeX to put the bibliography list here. \printbibliography diff --git a/tex/img/codata.pdf b/tex/img/codata.pdf new file mode 100644 index 0000000..e00f2ca Binary files /dev/null and b/tex/img/codata.pdf differ diff --git a/tex/img/codata.png b/tex/img/codata.png deleted file mode 100644 index c78dbc3..0000000 Binary files a/tex/img/codata.png and /dev/null differ diff --git a/tex/src/figure-branching.tex b/tex/src/figure-branching.tex index e264746..a917987 100644 --- a/tex/src/figure-branching.tex +++ b/tex/src/figure-branching.tex @@ -81,7 +81,7 @@ %% Description of this scenario: \draw [rounded corners, fill=black!10!white] (3.1cm,0) rectangle (7.5cm,1.25cm); - \draw [anchor=west, black] (3.1cm,1.0cm) node {\small \textbf{Phase 1} (pre-publication):}; + \draw [anchor=west, black] (3.1cm,1.0cm) node {\textbf{(a)} pre-publication:}; \draw [anchor=west, black] (3.3cm,0.6cm) node {\footnotesize Collaborating on a project while}; \draw [anchor=west, black] (3.3cm,0.2cm) node {\footnotesize working in parallel, then merging.}; @@ -140,7 +140,7 @@ %% Description of this scenario: \draw [rounded corners, fill=black!10!white] (11.1cm,0) rectangle (15.3cm,1.25cm); - \draw [anchor=west, black] (11.1cm,1.0cm) node {\small \textbf{Phase 2} (post-publication):}; + \draw [anchor=west, black] (11.1cm,1.0cm) node {\textbf{(b)} post-publication:}; \draw [anchor=west, black] (11.3cm,0.6cm) node {\footnotesize Other researchers building upon}; \draw [anchor=west, black] (11.3cm,0.2cm) node 
{\footnotesize previously published work.}; \end{tikzpicture} diff --git a/tex/src/figure-src-inputconf.tex b/tex/src/figure-src-inputconf.tex index 1245dfb..fc3315d 100644 --- a/tex/src/figure-src-inputconf.tex +++ b/tex/src/figure-src-inputconf.tex @@ -3,6 +3,6 @@ \texttt{\mkvar{MK20DATA} = menke20.xlsx}\\ \texttt{\mkvar{MK20MD5}{ } = 8e4eee64791f351fec58680126d558a0}\\ \texttt{\mkvar{MK20SIZE} = 1.9MB}\\ - \texttt{\mkvar{MK20URL}{ } = https://the.full.url/is/too/large/for/here/media-1.xlsx}\\ + \texttt{\mkvar{MK20URL}{ } = https://the.full.url/is/too/large/to/show/here/media-1.xlsx}\\ \vspace{-3mm} \end{tcolorbox} diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex index e20c73c..26deac9 100644 --- a/tex/src/preamble-style.tex +++ b/tex/src/preamble-style.tex @@ -115,19 +115,19 @@ \pagestyle{fancy} \lhead{\mplight\footnotesize Art.XX, page {\thepage} of \pageref{LastPage}} \chead{} -\rhead{\mplight\footnotesize Akhlaghi et al; Reproducible paper template} +\rhead{\mplight\footnotesize Akhlaghi, et al: Maneage, a Customizable Framework for Managing Data Lineage} \lfoot{} \cfoot{} \rfoot{} \renewcommand\headrulewidth{0.0pt} \renewcommand\footrulewidth{0.0pt} \fancypagestyle{firstpage} { - \lhead{\includegraphics[width=3.5cm]{tex/img/codata.png}} + \lhead{\includegraphics[width=3.5cm]{tex/img/codata.pdf}} \chead{} \rhead{\mplight\footnotesize - Akhlaghi, M, et al. 2019. Reproducible paper template\\ - \emph{Data Science Journal}, VV, NN, pp.1-N,\\ - DOI: https://doi.org/10.5334/dsj-XXXX-XXX} + Akhlaghi, M, et al. 2020. Maneage, a Customizable Framework\\ + for Managing Data Lineage. \emph{Data Science Journal}, VV,\\ + NN, pp.1-\pageref*{LastPage}. 
DOI: \href{https://doi.org/10.5334/dsj-XXXX-XXX}{\textcolor{black}{https://doi.org/10.5334/dsj-XXXX-XXX}}} \lfoot{} \cfoot{} \rfoot{} diff --git a/tex/src/references.tex b/tex/src/references.tex index 510fc89..ef33d02 100644 --- a/tex/src/references.tex +++ b/tex/src/references.tex @@ -695,6 +695,20 @@ archivePrefix = {arXiv}, +@ARTICLE{austin17, + author = {{Claire C.} Austin and Theodora Bloom and Sünje Dallmeier-Tiessen and {Varsha K.} Khodiyar and Fiona Murphy and Amy Nurnberger and Lisa Raymond and Martina Stockhause and Jonathan Tedds and Mary Vardigan and Angus Whyte}, + title = {Key components of data publishing: using current best practices to develop a reference model for data publishing}, + journal = {International Journal on Digital Libraries}, + volume = {18}, + year = {2017}, + pages = {77}, + doi = {10.1007/s00799-016-0178-2}, +} + + + + + @ARTICLE{smith16, author = {Arfon M. Smith and Daniel S. Katz and Kyle E. Niemeyer}, title = {Software citation principles}, -- cgit v1.2.1