-rw-r--r-- | paper.tex | 43 |
1 files changed, 22 insertions, 21 deletions
@@ -283,7 +283,7 @@ IPOL, which uniquely stands out in other principles, fails here: only the final \label{sec:maneage} Maneage is an implementation of the principles of Section \ref{sec:principles}: it is complete (\ref{principle:complete}), modular (\ref{principle:modularity}), has minimal complexity (\ref{principle:complexity}), verifies its inputs \& outputs (\ref{principle:verify}), preserves temporal provenance (\ref{principle:history}) and finally, it is free software (\ref{principle:freesoftware}). -In practice it is a collection of plain-text files, that are distributed in pre-defined sub-directories by context (a modular source), and are all under version-control, currently with Git. +In practice, it is a collection of plain-text files, that are distributed in pre-defined sub-directories by context (a modular source), and are all under version-control, currently with Git. The main Maneage Branch is a fully working skeleton of a project without much flesh: containing all the low-level infrastructure, but without any actual high-level analysis operations\footnote{In the core Maneage branch, only a simple demo analysis is included to be complete. But it can easily be removed: all its files and steps have a \inlinecode{delete-me} prefix.}. Maneage contains a file called \inlinecode{README-hacking.md}\footnote{Note that the \inlinecode{README.md} file is reserved for the project using Maneage, not maneage itself.} that has a complete checklist of steps to start a new project and remove demonstration parts, there are also hands-on tutorials to help new adopters. @@ -312,8 +312,8 @@ Two sub-directories are also present: \inlinecode{tex/} (containing \LaTeX{} fil } \end{figure} -The \inlinecode{project} script is a high-level wrapper to interface with Maneage and in its current implementation has two main phases as shown below. 
-As seen below, a project's operations are broken-up into two phases: 1) configuration, where the necessary software are built and the environment is setup. 2) analysis, where data are accessed and the software is run on them to create visualizations and the final report. +The \inlinecode{project} script is a high-level wrapper to interface with Maneage and in its current implementation has two main phases as shown below: 1) configuration, where the necessary software are built and the environment is setup. 2) analysis, where data are accessed and the software is run on them to create visualizations and the final report. +In practice, these two steps are run with the following commands: \begin{lstlisting}[language=bash] ./project configure # Build all necessary software from source. @@ -358,7 +358,7 @@ Because of its simplicity, we have also had very good feedback on using Make fro Maneage orchestrates the building of its necessary software in the same language that it orchestrates the analysis: Make (see Section \ref{sec:usingmake}). Therefore, a researcher already using Maneage for their high-level analysis easily understands, and can customize, the software environment too, without delving into the intricacies of third-party tools. -Most existing tools reviewed in Section \ref{sec:principles}, use package managers like Conda to maintain the software environment, but since conda itself is written in Python, it violates our completeness principle \ref{principle:complete}. +Most existing tools reviewed in Section \ref{sec:principles}, use package managers like Conda to maintain the software environment, but since conda itself is written in Python, it does not fit in our completeness principle \ref{principle:complete}. Highly robust solutions like Nix \citep{dolstra04} and GNU Guix \citep{courtes15} do exist, but they require root permissions which is also against that principle. 
Project configuration (building the software environment) is managed by the files under \inlinecode{reproduce\-/soft\-ware} of Maneage's source, see Figure \ref{fig:files}. @@ -382,7 +382,7 @@ The software source code may already be present on the host filesystem, if not, But before being used to build the software, they will be validated by their SHA-512 checksum (which is already stored in the project). Maneage includes a large collection of scientific software (and their dependencies) that are usually not necessary in all projects. Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under \inlinecode{re\-produce\-/soft\-ware\-/config} directory, see Figure \ref{fig:files}. -All the high-level software dependencies are codified in Maneage as Make \emph{prerequisites}, so the specified software will be automatically built after their dependencies, and their dependencies, and etc. +All the high-level software dependencies are codified in Maneage as Make \emph{prerequisites}, so the specified software will be automatically built after their dependencies. Note that project configuration can be done in a container or virtual machine to facilitate moving the project. However, the important factor is that such binary blobs are an optional output of Maneage, they are not the its primary storage/archival format. @@ -394,7 +394,7 @@ Maneage contains the full list of built software for each project, their version However, this information is buried deep into each project's source. Maneage also prints a distilled fraction of this information in the project's final report, blended into the narrative, as seen in the Acknowledgments of this paper. Furthermore, when the software is associate with a published paper, that paper's Bib\TeX{} entry is also added to the final report and is cited with the software's name and version. 
-For example\footnote{In this paper we have used very basic tools that aren't accompanied by a paper}, see the software acknowledgement sections of \citet{akhlaghi19} and \citet{infante20}. +For example\footnote{In this paper we have used very basic tools that are not accompanied by a paper}, see the software acknowledgement sections of \citet{akhlaghi19} and \citet{infante20}. This is particularly important in the case for research software, where citation is critical to justify continued development. One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts. @@ -475,13 +475,13 @@ Manually typing such numbers in the narrative is prone to very important errors Therefore, they must also be automatically generated. To automatically generate and blend them in the text, Maneage uses \LaTeX{} macros. -In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} is written in Maneage and its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, links is also available on arXiv's top page.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. +In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} is written in Maneage and its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, links are also available on arXiv's top page.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''. 
T\-he ma\-cro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically created during in the project and expands to the value ``\inlinecode{0.25}'' when the PDF output is built. The built \inlinecode{project.tex} file stores all such reported values. However, managing all the necessary \LaTeX{} macros in one file is against the modularity principle and can be frustrating and buggy. To address this problem, Maneage has the convention that all subMakefiles \emph{must} contain a fixed target with the same base-name, but with a \inlinecode{.tex} suffix to store reported values generated in that subMakefile. -If it doesn't need to report any values in text, the file can be empty. +If it does not need to report any values in text, the file can be empty. In Figure \ref{fig:datalineage}, these macro files can be seen in every subMakefile, except for \inlinecode{paper.mk} (which does not need it). These \LaTeX{} macro files thus form the core skeleton of a Maneage project: as shown in Figure \ref{fig:datalineage}, the outward arrows of all built files of any subMakefile ultimately leads to one of these \LaTeX{} macro files, possibly in another subMakefile. @@ -498,7 +498,7 @@ It has some tests on pre-defined formats, and other formats can easily be added. \subsubsection{The analysis} \label{sec:analysis} -The basic concepts behind organizing the analysis into modular subMakefiles has already been discussed above, we will thus describe it here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure \ref{fig:toolsperyear}. +The basic concepts behind organizing the analysis into modular subMakefiles have already been discussed above, we will thus describe it here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure \ref{fig:toolsperyear}. 
As shown in Figure \ref{fig:datalineage}, in this project, we have broken this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}. The former is in charge of converting the Microsoft Excel formatted input into the simple comma-separated value (CSV) format, and the latter is in charge of generating the table to build Figure \ref{fig:toolsperyear}. In a real project, subMakefiles will be much more complex. @@ -553,7 +553,7 @@ This step is shown schematically in Figure \ref{fig:datalineage} with the arrow \label{sec:download} The \inlinecode{download.mk} subMakefile is present in all Maneage projects and contains the common steps for importing the input dataset(s) into the project. -All necessary input datasets to the project are imported through this subMakefile. +All necessary input datasets for the project are imported through this subMakefile. This helps in modularity and minimal complexity (\ref{principle:modularity} \& \ref{principle:complexity}): to see what external datasets were used in a project, this is the only necessary file to manage/read. Also, a simple call to a downloader (for example \inlinecode{wget}) is not usually enough. Irrespective of where the dataset is \emph{used} in the project's lineage, it helps to maintain relation with the outside world (to the project) in one subMakefile. @@ -603,7 +603,7 @@ Combined with the fact that all source files in Maneage are under version-contro The \inlinecode{initial\-ize\-.mk} subMakefile is present in all projects and is the first subMakefile that is loaded into \inlinecode{top-make.mk} (see Figure \ref{fig:datalineage}). It does not contain any analysis or major processing steps, it just initializes the system by setting the necessary Make environment as well as other general jobs like defining the Git commit hash of the run as a \LaTeX{} (\inlinecode{\textbackslash{}projectversion}) macro that can be loaded into the narrative. 
Papers using Maneage usually put this hash as the last word in their abstract, for example, see \citet{akhlaghi19} and \citet{infante20}. -For this version of this paper, it expands to \projectversion. +For the current version of this paper, it expands to \projectversion. \subsection{Projects as Git branches of Maneage} \label{sec:projectgit} @@ -640,7 +640,7 @@ Because all projects branch-off from the same branch that these infrastructure i For example, in Figure \ref{fig:branching} (phase 1), see how Maneage's \inlinecode{3c05235} commit has been merged into project's branch trough commit \inlinecode{2ed0c82}. Another useful scenario is reviving a finished/published project at later date, by other researchers as shown in phase 2 of Figure \ref{fig:branching}. -In that figure, a new team of researchers have decided to experiment on the results of the published paper and have merged it with the Maneage branch (commit \inlinecode{a92b25a}) to make it usable for their system (e.g., assming the original project was completed years ago, and is no longer directly executable). +In that figure, a new team of researchers have decided to experiment on the results of the published paper and have merged it with the Maneage branch (commit \inlinecode{a92b25a}) to make it usable for their system (e.g., assuming the original project was completed years ago, and is no longer directly executable). Other scenarios include a third project that can easily merge various high-level components from different projects into its own branch, thus adding a temporal dimension to their data lineage. This structure also enables easy propagation of low-level fixes to all projects using Maneage. @@ -653,13 +653,13 @@ Because the project's source and build directories are separate, it is possible Similar to the parallel branch that is later merged in phase 1 of Figure \ref{fig:branching}. 
To enable this mode, \inlinecode{./project} script has a special \inlinecode{--group} option which takes the name of a (POSIX) user group in the host operating system. All files built in the build directory are then automatically assigned to this user group, with read and write permissions. -Ofcourse, avoiding conflicts in the build directory, while members are working on different branches is up to the team. +Of course, avoiding conflicts in the build directory, while members are working on different branches is up to the team. \subsection{Publishing the project} \label{sec:publishing} -Once the project is complete it needs to be published. -In a scientific scenario, it is submitted to a journal, and an industrial world, it is submitted to the customers or employers. +Once the project is complete, it needs to be published. +In a scientific scenario, it is submitted to a journal, while in an industrial world, it is submitted to the customers or employers. To facilitate the publication of the project's source, Maneage has a special \inlinecode{dist} target during the build process which is activated with the command \inlinecode{./project make dist}. In this mode, Maneage will not do any analysis, it will simply copy the full project's source (on the given commit) into a temporary directory and compress it into a \inlinecode{.tar.gz} file. If a Zip compression is necessary, the \inlinecode{dist-zip} target can be called instead \inlinecode{dist}. @@ -698,9 +698,9 @@ But it is far from complete: its core architecture will continue evolve after th Based on early adopters, we have seen the following caveats for Maneage. The first caveat is regarding its widespread adoption: by principle, Maneage uses very low-level tools like Git, \LaTeX, Make and command-line tools to run in non-interactive mode. However, a large fraction of the scientific community are accustomed to interactive graphic user interface (GUI) tools. 
-But this is not often a final choice: some of our early users simply didn't know such tools existed. +But this is not often a final choice: some of our early users simply did not know such tools existed. After seeing them in action together (as a \emph{complete} Maneage project) they have started using these tools effectively. -Unfortunately by their low-level nature, the documentation of these tools alone discourages scientists, we thus working on several tutorials and scientist-friendly documentation of such tools, hopefully by collaborating with efforts like \href{http://software.ac.uk}{software.ac.uk} and \href{http://urssi.us}{urssi.us}. +Unfortunately by their low-level nature, the documentation of these tools alone discourages scientists, we are thus working on several tutorials and scientist-friendly documentation of such tools, hopefully by collaborating with efforts like \href{http://software.ac.uk}{software.ac.uk} and \href{http://urssi.us}{urssi.us}. \citet{fineberg19} also note the importance that a project starts by following good practice, not to force it in the end. A second caveat is the fact that Maneage is effectively an almost complete GNU operating system, tailored to each project. @@ -709,9 +709,9 @@ Maneage has many generic scripts for simplifying the software packaging. However, maintaining them (updating versions or fixing bugs on some hosts) can take time for a small team. Because package management (Section \ref{sec:projectconfigure}) is in the same language as the analysis, some users have learnt to package their necessary software or correct some bugs themselves. They later send those additions as merge-requests to the core Maneage branch, thus propagating the improvement to all projects using Maneage. -With a larger user-base we hope the fraction of such volunteers increases and decreases the burden on our core team. +With a larger user-base we hope the fraction of such contributors increases and decreases the burden on our core team. 
-Another caveat that has been raised by some is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue harvesting from all their hard work. +Another caveat that has been raised by some people is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue harvesting from all their hard work. Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways: 1) Through the Git history, it is clear how much extra work the other team has added. In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``standing on the shoulders of giants'' quote. @@ -726,14 +726,15 @@ Furthermore, writing parsers of Maneage projects to generate Research Objects is \label{sec:conclusion} To effectively leaverage the power of big data, we need to have a complete view of its lineage. -However scientists are rarely trained sufficiently in data management or software development, the plethora of high-level tools, that change every few years also doesn't help. +However, scientists are rarely trained sufficiently in data management or software development, the plethora of high-level tools, that change every few years also does not help. Maneage is desigend as a complete template, providing scientists with a built low-level skeleton that scientists can customize for any project and adopt modern, robust and efficient data management in practice on their own projects. In this paper we introduced Maneage and how it is built upon the principles of completeness, modularity, minimal complexity, verifiable inputs and outputs, temporal provenance, and free software. 
We showed how these principles are implemented in an already built structure that users just have to customize for the high-level aspects of their projects and discussed the caveats and advantages of this implementation. -With a larger user-base and wider application in scientific (and hopefully industrial) applications, Maneage will certainly grow and become even more stable user friendly. +With a larger user-base and wider application in scientific (and hopefully industrial) applications, Maneage will certainly grow and become even more stable and user friendly. \tonote{One more paragraph will be added here.} +\tonote{Raul: We have said nothing negative. Maybe we can point out that adopting this template requires some effort from the user, for example learning Make (but in the end it is not much more effort than learning any other language/tool).} %% Acknowledgements \section{Acknowledgments} |