author     Mohammad Akhlaghi <mohammad@akhlaghi.org>   2020-04-22 04:49:16 +0100
committer  Mohammad Akhlaghi <mohammad@akhlaghi.org>   2020-04-22 04:58:12 +0100
commit     f0622d8a21b15c4c9374ffc50a20a14056d42e09 (patch)
tree       dec63aed41fe17b314a98ffe69f41a91a5c8e131
parent     1f8fca2815bd03828221521144b714ad11867549 (diff)
Implemented Konrad's suggestions, minor edits here and there
Today Konrad made the following suggestions after reading through the paper (created from Commit 1ac5c12). Thanks a lot Konrad ;-). I tried to address them all in this commit. Afterwards, while looking over the corrected parts, some minor edits occurred to me to remove redundant parts and add extra points where they help. In particular, to be able to print the International Phonetic Alphabet (IPA), I had to include the LaTeX `TIPA' package, but it was interesting to see that it was already available in the project as a dependency of another package we loaded.
-rw-r--r--   paper.tex                            129
-rw-r--r--   tex/src/figure-src-inputconf.tex       2
-rw-r--r--   tex/src/preamble-style.tex             3
3 files changed, 68 insertions, 66 deletions
diff --git a/paper.tex b/paper.tex
index a219791..54ac4f5 100644
--- a/paper.tex
+++ b/paper.tex
@@ -115,22 +115,23 @@ Not being able to experiment on the methods of other researchers is a self-impos
}
\end{figure}
-The completeness of a project's published workflow (usually within the ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors.
-Several studies have attempted to answer this with different levels of detail. For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in medical/biological field and is currently above $80\%$.
+The completeness of a project's published lineage (usually within the ``Methods'' section) can be measured by the ability to reproduce the result.
+Several studies have attempted to answer this with different levels of detail.
+For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in the medical/biological field and is currently above $80\%$.
\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but fully succeeded in only 2 of them and partially in 6.
-\citet{chang15} attempted to reproduce 67 papers in well-regarded journals in Economics with data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all. \tonote{DVG: even after contacting the authors?}
-\citet{stodden18} attempted to replicate the results of 204 scientific papers published in the journal \emph{Science} \emph{after} that journal adopted a policy of publishing the data and code associated with the papers.
+\citet{chang15} attempted to reproduce 67 papers in well-regarded Economics journals that published data and code: only 22 could be reproduced without contacting authors, and more than half could not be replicated at all. \tonote{DVG: even after contacting the authors?}
+\citet{stodden18} attempted it in 204 scientific papers published in the journal \emph{Science} \emph{after} it adopted a policy of publishing the data and code associated with the papers.
Even though the authors were contacted, the success rate was an abysmal $26\%$.
Overall, this problem is unambiguously assessed as being very serious in the community: \citet{baker16} surveyed 1574 researchers and found that only $3\%$ did not see a ``\emph{reproducibility crisis}''.
Yet, this is not a new problem in the sciences: back in 2011, Elsevier conducted an ``\emph{Executable Paper Grand Challenge}'' \citep{gabriel11} and the proposed solutions were published in a special edition.\tonote{DVG: which were the results?}
Even before that, in an attempt to simulate research projects, \citet{ioannidis05} proved that ``\emph{most claimed research findings are false}''.
-In the 1990s, \citet{schwab2000, buckheit1995, claerbout1992} described the same problem very eloquently and also provided some solutions they used.\tonote{DVG: more details here, one is left wondering ...}
+In the 1990s, \citet{buckheit1995, claerbout1992} described the same problem very eloquently and also provided some solutions they used.\tonote{DVG: more details here, one is left wondering ...}
Even earlier, through his famous quartet, \citet{anscombe73} qualitatively showed how distancing of researchers from the intricacies of algorithms/methods can lead to misinterpretation of the results.
One of the earliest such efforts we are aware of is the work of \citet{roberts69}, who discussed conventions in Fortran programming and documentation to help in publishing research codes.
While the situation has somewhat improved, all these papers still resonate strongly with the frustrations of today's scientists.
-In this paper, we introduce Maneage as a solution to the collective problem of preserving a project's data lineage and its software dependencies.
+To address the collective problem of preserving a project's data lineage as well as its software dependencies, we introduce Maneage (Managing+Lineage), pronounced man-ee-ij or \textipa{[m{\ae}ni{\textsci}d{\textyogh}]}, hosted at \url{http://maneage.org}.
A project using Maneage starts by branching from its main Git branch, allowing the authors to customize it: specifying the necessary software tools for that particular project, adding analysis steps, and adding visualizations and a narrative based on the results.
In Sections \ref{sec:definitions} \& \ref{sec:principles} the basic concepts are defined and the founding principles of Maneage are discussed.
Section \ref{sec:maneage} describes the internal structure of Maneage, Section \ref{sec:discussion} discusses its benefits, caveats and future prospects, and we conclude with a summary in Section \ref{sec:conclusion}.
@@ -220,27 +221,22 @@ To help in the comparison, the founding principles of Maneage are listed below.
\begin{enumerate}[label={\bf P\arabic*}]
\item \label{principle:complete}\textbf{Completeness:}
A project that is complete, or self-contained,
- (P1.1) has no dependency beyond the Port\-able Operating System (OS) Interface, or POSIX.
+ (P1.1) has no dependency beyond the Port\-able Operating System (OS) Interface, or POSIX (a minimal Unix-like environment).
+ A consequence of this is that the project itself must be stored in plain text, needing no specialized software to open, parse or execute it.
(P1.2) does not affect the host,
(P1.3) does not require root, or administrator, privileges,
(P1.4) builds its software for an independent environment,
(P1.5) can be run locally (without internet connection),
(P1.6) contains the full project's analysis, visualization \emph{and} narrative, from access to raw inputs to producing final published format (e.g., PDF or HTML),
(P1.7) requires no manual/human interaction and can run automatically \citep[according to][``\emph{a clerk can do it}'']{claerbout1992}.
- A consequence of P1.1 is that the project itself must be stored in plain-text: not needing any specialized software to open, parse or execute.
\emph{Comparison with existing:} with many dependencies beyond POSIX, except for IPOL, none of the tools above are complete.
For example, most recent solutions need Python (for the workflow, not the analysis), or rely on Jupyter notebooks.
- High-level tools have short lifespans and evolve fast (e.g., Python 2 code cannot run with Python 3).
- They also have complex dependency trees, making them hard to maintain.
- For example, see the dependency tree of Matlplotlib in \citet[][Figure 1]{alliez19}, its one of the simpler Jupyter dependencies.
- The longevity of a workflow is determined by its shortest-lived dependency.
-
- As a result the primary storage format of most recent solutions is pre-built binary blobs like containers or virtual machines.
- They are large (Giga-bytes) and expensive to archive, furthermore generic package managers (e.g., Conda), or OS's (e.g., \inlinecode{apt} or \inlinecode{yum}) are used to setup its environment.
- Because exact versions of \emph{every software} are rarely included, and the servers remove old binaries, recreating them is very hard.
- Blobs also have a short lifespan (e.g., Docker containers only run on long-term support OSs, in GNU/Linux systems, this corresponds to Linux 3.2.x, released in 2012).
- A plain-text project consumes below one megabyte, is human-readable and parsable by any machine, even if it can't be executed.
+ Because of the complexity of maintaining such high-level tools (see \ref{principle:complexity}), the primary storage format of most recent solutions is pre-built binary blobs like containers or virtual machines.
+ These are large (gigabytes) and expensive to archive; furthermore, third-party package managers (e.g., Conda) or the OS's (e.g., \inlinecode{apt} or \inlinecode{yum}) are used to set up their environment.
+ However, exact versions of \emph{every software} are rarely included, and the servers remove old binaries, hence recreating them is very hard.
+ Blobs also have a short lifespan; e.g., Docker containers made today can only run on Linux 3.2.x (after 2012), so they do not promise longevity.
+ A plain-text project is human-readable and parsable by any machine (even if it can't be executed) and consumes less than a megabyte.
\item \label{principle:modularity}\textbf{Modularity:}
A project should be compartmentalized into independent modules with well-defined inputs/outputs having no side effects.
@@ -252,7 +248,7 @@ Modules that do not need to be run (because their dependencies have not changed)
(4) Usage in other projects.
(5) Most importantly: they are easy to debug and improve.
-\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools leave this design choice to project authors.
+\emph{Comparison with existing:} Visual workflow tools like Apache Taverna, GenePattern, Kepler or VisTrails encourage this, but the more recent tools (mostly written in Python) leave this to project authors.
However, designing a modular project needs to be encouraged and facilitated.
Otherwise, scientists, who are not usually trained in data management, will rarely design a modular project, leading to great inefficiencies in terms of project cost and/or scientific accuracy (testing/validating will be expensive).
@@ -262,15 +258,16 @@ Otherwise, scientists, who are not usually trained in data management, will rare
2) avoid the programming language that is currently in vogue, because it is going to fall out of fashion soon and require significant resources to translate or rewrite it every few years (to stay fashionable).
The same job can be done with more stable/basic tools, requiring less long-term effort.
- \emph{Comparison with existing:} IPOL stands out here too (requiring only ISO C), however most existing solutions use tools that were most popular at their creation date.
- For example, as we approach the present, successively larger fractions of tools are written in Python, and use Conda or Jupyter (see \ref{principle:complete}).
+ \emph{Comparison with existing:} IPOL stands out here too (requiring only ISO C); however, most others are written in Python and use Conda or Jupyter (see \ref{principle:complete}).
+ Besides being incomplete (\ref{principle:complete}), these tools have short lifespans and evolve fast (e.g., Python 2 code cannot run with Python 3, causing disruption in many projects).
+ Their complex dependency trees also make them hard to maintain; for example, see the dependency tree of Matplotlib in \citet[][Figure 1]{alliez19}, which is one of the simpler Jupyter dependencies.
+ The longevity of a workflow is determined by its shortest-lived dependency.
\item \label{principle:verify}\textbf{Verifiable inputs and outputs:}
-The project should automatically verify its inputs (software source code and data) \emph{and} outputs.
-Thus not needing any expert knowledge to confirm a reproduction.
+The project should automatically verify its inputs (software source code and data) \emph{and} outputs, not needing expert knowledge to confirm a reproduction.
-\emph{Comparison with existing:} Such verification is usually possible in most systems, but is usually the responsibility of the project authors.
-As with \ref{principle:modularity}, due to lack of training, this must be actively encouraged and facilitated, otherwise most will not be able to implement it.
+\emph{Comparison with existing:} Such verification is possible in most systems, but it is left as the responsibility of the project authors.
+As with \ref{principle:modularity}, due to the lack of training, it will not be implemented unless actively encouraged and facilitated.
\item \label{principle:history}\textbf{History and temporal provenance:}
No project is done in a single/first attempt.
@@ -293,9 +290,9 @@ IPOL is thus not scalable to large projects, which commonly involve dozens of hi
\item \label{principle:freesoftware}\textbf{Free and open source software:}
Technically, reproducibility (see \ref{definition:reproduction}) is possible with non-free or non-open-source software (a black box).
- This principle is thus necessary to complement that definition to ensure that the project is reproducible \emph{and} free/open:
- (1) When the project itself is free software, others can learn from and build upon it.
- (2) The lineage can be traced to free software's implemented algorithms, enabling optimizations on that level.
+ This principle is thus necessary to complement that definition (nature is already a black box, we don't need another one):
+ (1) When the project is free software, others can learn from, modify, and build upon it.
+ (2) The lineage can be traced to free software's implemented algorithms, also enabling optimizations on that level.
(3) A free-software package that does not execute on particular hardware can be modified to work on it.
(4) A non-free software project typically cannot be distributed by others, making the whole community reliant on the owner's server (even if the owner does not ask for payments).
@@ -411,15 +408,17 @@ Sections \ref{sec:buildsoftware} and \ref{sec:softwarecitation} detail the build
To compile the necessary software from source, Maneage currently needs the host to have a C and C++ compiler (available on any POSIX-compliant OS).
Maneage will build and install (in the build directory) all necessary software and their dependencies, all with fixed versions and configurations.
-The dependency tree continues down to core OS components including GNU Bash, GNU AWK, GNU Coreutils on all supported OSs (including macOS, not just GNU/Linux).
-On GNU/Linux OSs, a fixed version of the GNU Binutils and GNU C Compiler (GCC) is also built from source, and a fixed version of the GNU C Library will soon be added to be fully independent of the host on such systems (task 15390).
+The dependency tree continues down to core OS components including GNU Bash, GNU AWK and GNU Coreutils on all supported OSs.
+On GNU/Linux OSs, fixed versions of GNU Binutils and the GNU C Compiler (GCC) are also built; soon a custom GNU C Library will also be included to be fully independent of the host (task 15390).
-Except for the Kernel, Maneage thus builds all other OS components necessary for the project.
-A Maneage project can be configured in a container or virtual machine to facilitate moving the project without rebuilding everything from source.
-However, such binary blobs are not Maneage's primary storage/archival format.
+Except for very low-level components like the kernel or the filesystem, Maneage thus builds all other components necessary for the project.
+Because there is no pure/raw POSIX OS, Maneage aims to run on existing POSIX-compatible OSs; failure to build on any one of them is treated as a bug that will be fixed.
+It is currently being actively tested on GNU/Linux and macOS.
+A Maneage project can be configured in a container or virtual machine to facilitate moving the project without rebuilding everything from source, or to use it on non-compatible OSs.
+However, such binary blobs are not the primary storage/archival format of Maneage.
Before building the software, their source codes are validated by their SHA-512 checksum (stored in the project).
-Maneage includes a large collection of scientific software (and its dependencies), much of which is superfluous for one project.
+Maneage includes a growing collection of scientific software (and its dependencies), much of which is superfluous for any single project.
Therefore, each project has to identify its high-level software in the \inlinecode{TARGETS.conf} file under the \inlinecode{re\-produce\-/soft\-ware\-/config} directory (Fig.~\ref{fig:files}).
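+For illustration, a minimal sketch of what such an entry in \inlinecode{TARGETS.conf} could look like is shown below (the variable names and package selections are only hypothetical examples, not this project's actual configuration):
+\begin{verbatim}
+# TARGETS.conf (illustrative sketch): the high-level software this
+# project needs; Maneage builds these and all of their dependencies.
+top-level-programs = gawk gnuastro
+top-level-python   = astropy
+\end{verbatim}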
\subsubsection{Software citation}
@@ -436,7 +435,7 @@ It provides a \inlinecode{--citation} option to disable the notice.
In \href{http://git.savannah.gnu.org/cgit/parallel.git/tree/doc/citation-notice-faq.txt?h=master}{its FAQ} this is justified by ``\emph{If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''.
Most software does not resort to such drastic measures. However, proper citation is not only useful practically, it is also an ethical imperative.
-Given the increasing role of software in research \citep{clement19}, automatic citation (as presented here) is a step forward.
+Given the increasing role of software in research \citep{clement19}, automatic citation is a robust solution.
For a review of the necessity and basic elements of software citation, see \citet{katz14} and \citet{smith16}.
The CodeMeta and Citation file format (CFF) aim to expand software citation beyond Bib\TeX, while Software Heritage \citep{dicosmo18} also includes archival and citation abilities.
These will be tested and enabled in Maneage.
@@ -448,22 +447,20 @@ These will be tested and enabled in Maneage.
\subsection{Project analysis}
\label{sec:projectanalysis}
-Once the project is configured (Section \ref{sec:projectconfigure}), a unique and fully-controlled environment is available to execute the analysis.
-All analysis operations run with no influence from the host OS, enabling an isolated environment without the extra layer of containers or a virtual machine.
+The analysis operations run with no influence from the host OS, enabling an isolated environment without the extra layer of containers or a virtual machine.
In Maneage, a project's analysis is broken into two phases: 1) preparation, and 2) analysis.
The former is mostly necessary to optimize extremely large datasets and is only useful for advanced users, while following an identical internal structure to the latter.
We will therefore not go any further into it and refer the interested reader to the documentation.
A project consists of many steps, including data access (possibly by downloading), running various steps of the analysis on the raw inputs, and creating the necessary plots, figures or tables for a published report, or output datasets for a database.
-If all of these steps were organized in a single Makefile, it would become very large, or long, and would be hard to maintain, extend/grow, read, reuse, and cite.
-Large files are in general a bad practice and do not fulfil the modularity principle (\ref{principle:modularity}).
+If all of these steps are organized in a single Makefile, it will become very long and hard to maintain, extend/grow, read, reuse, and cite.
+Large files are in general a bad practice and against the modularity and minimal complexity principles (\ref{principle:modularity} \& \ref{principle:complexity}).
Maneage is thus designed to encourage and facilitate modularity by distributing the analysis into many Makefiles that contain contextually-similar analysis steps.
Hereafter, these modular or lower-level Makefiles will be called \emph{subMakefiles}.
-When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}), calls \inlinecode{top-make.mk} which loads the subMakefiles with a certain order.
-They are loaded using Make's \inlinecode{include} feature (so Make sees everything as one file in one instance of Make).
-By default Maneage does not use recursion (where one instance of Make, calls another instance of Make within itself) to comply with minimal complexity principle (\ref{principle:complexity}) and keep the code's logic clear and simple.
+When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}) calls \inlinecode{top-make.mk}, which loads the subMakefiles into itself in a specific order (see Section \ref{sec:analysis}).
All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make} (see Figure \ref{fig:files}) and Figure \ref{fig:datalineage} shows their inter-relation with the target/built files that they manage.
+To keep the project's logic clear and simple (the minimal complexity principle, \ref{principle:complexity}), by default recursion is not used (where one instance of Make calls Make within itself).
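+As a rough sketch (apart from \inlinecode{download.mk}, the subMakefile names here are hypothetical), this non-recursive loading can be pictured like this:
+\begin{verbatim}
+# top-make.mk (simplified sketch): the subMakefiles, in the order
+# they should be loaded into this single instance of Make.
+makesrc = initialize \
+          download \
+          analysis \
+          paper
+
+# Make's 'include' merges them here, so no recursive Make is called.
+include $(foreach m, $(makesrc), reproduce/analysis/make/$(m).mk)
+\end{verbatim}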
\begin{figure}[t]
\begin{center}
@@ -501,15 +498,15 @@ Its prerequisites include \inlinecode{paper.tex} and \inlinecode{references.tex}
Figures, plots, tables and narrative are not the only analysis products that are included in the paper/report.
In many cases, quantitative values from the analysis are also blended into the sentences of the report's narration.
-For example, this sentence in the abstract of \citet[which is written in Maneage]{akhlaghi19}: ``\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}''.
+For example, this sentence in the abstract of \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, written in Maneage]{akhlaghi19}: ``\emph{... detect the outer wings of M51 down to S/N of 0.25 ...}''.
The value `0.25', for the signal-to-noise ratio (S/N), depends on the analysis, and is an output of the analysis just like the paper's figures and plots.
Manually typing such numbers in the narrative is prone to serious errors and discourages testing in scientific papers.
Therefore, they must \emph{also} be automatically generated.
To automatically generate and blend them in the text, Maneage uses \LaTeX{} macros.
-In the quote above, the \LaTeX{} source\footnote{\citet{akhlaghi19} is written in Maneage and its \LaTeX{} source is available in multiple ways: 1) direct download from arXiv:\href{https://arxiv.org/abs/1909.11230}{1909.11230}, by clicking on ``other formats'', or 2) the Git or \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, links are also available on arXiv's top page.} looks like this: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''.
+For example, the \LaTeX{} source of the quote above is: ``\inlinecode{\small detect the outer wings of M51 down to S/N of \$\textbackslash{}demo\-sf\-optimized\-sn\$}''.
T\-he ma\-cro ``\inlinecode{\small\textbackslash{}demosfoptimizedsn}'' is automatically created during the project and expands to the value ``\inlinecode{0.25}'' when the PDF output is built.
-The built \inlinecode{project.tex} file stores all such reported values.
+All such values are referenced in \inlinecode{project.tex}.
However, managing all the necessary \LaTeX{} macros in one file is against the modularity principle and can be frustrating and buggy.
To address this problem, Maneage adopts the convention that all subMakefiles \emph{must} contain a fixed target with the same base-name, but with a \inlinecode{.tex} suffix to store reported values generated in that subMakefile.
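+As a hedged illustration (the file, directory and prerequisite names below are hypothetical; only the \inlinecode{\textbackslash{}demosfoptimizedsn} macro is taken from the example above), such a \inlinecode{.tex} target simply prints the \LaTeX{} macro definitions for the values reported by that subMakefile:
+\begin{verbatim}
+# Sketch of the fixed '.tex' target of a hypothetical 'demo.mk'
+# subMakefile: it writes this subMakefile's reported values as
+# LaTeX macros (recipe lines start with a tab).
+$(mtexdir)/demo.tex: $(outdir)/sn-optimized.txt
+	printf '\\newcommand{\\demosfoptimizedsn}{%s}\n' \
+	       "$$(cat $<)" > $@
+\end{verbatim}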
@@ -580,14 +577,14 @@ The input files (which come from outside the project) are all \emph{targets} in
\subsubsection{Importing and validating inputs (\inlinecode{download.mk})}
\label{sec:download}
-The \inlinecode{download.mk} subMakefile is present in all Maneage projects and contains the common steps for importing the input dataset(s) into the project.
+The \inlinecode{download.mk} subMakefile is present in all projects, containing common steps for importing the input dataset(s).
All necessary input datasets for the project are imported through this subMakefile.
Irrespective of where the dataset is \emph{used} in the project's lineage, it helps to keep the project's relations with the outside world in one subMakefile (see the modularity and minimal complexity principles \ref{principle:modularity} \& \ref{principle:complexity}).
Each external dataset has some basic information, including its expected name on the local system (for offline access), the necessary checksum to validate it (either the whole file or just its main ``data'', as discussed in Section \ref{sec:outputverification}), and its URL/PID.
In Maneage, such information regarding a project's input dataset(s) is in the \inlinecode{INPUTS.conf} file.
See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively.
-For demonstration, we are using the datasets of M20 which are stored in one \inlinecode{.xlsx} file on bioXriv\footnote{\label{footnote:dataurl}Full data URL: \url{\menketwentyurl}}.
+For demonstration, we are using the datasets of M20, which are stored in one \inlinecode{.xlsx} file on bioRxiv.
Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf}, where the necessary information is stored as Make variables that are automatically loaded into the full project when Make starts (and are most often used in \inlinecode{download.mk}).
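+For a rough sketch of how these variables can be used, consider the simplified rule below (the \inlinecode{MK20URL} variable and \inlinecode{indir} directory are assumed here for illustration; the actual rule in \inlinecode{download.mk} is more elaborate, for example it also handles a locally available copy):
+\begin{verbatim}
+# Sketch: import the demonstration dataset using the variables of
+# INPUTS.conf, then validate it against the stored MD5 checksum.
+$(indir)/$(MK20DATA):
+	mkdir -p $(indir)
+	wget -O $@ "$(MK20URL)"
+	echo "$(MK20MD5)  $@" | md5sum -c -
+\end{verbatim}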
\begin{figure}[t]
@@ -595,7 +592,7 @@ Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} wher
\vspace{-3mm}
\caption{\label{fig:inputconf} The \inlinecode{INPUTS.conf} configuration file keeps references to external (input) datasets of a project, as well as their checksums for validation, see Sections \ref{sec:download} \& \ref{sec:configfiles}.
Shown here are the entries for the demonstration dataset of \citet{menke20}.
- Note that the original URL (footnote \ref{footnote:dataurl}) was too long to display properly here.
+ The original URL is \url{\menketwentyurl}.
}
\end{figure}
@@ -703,28 +700,30 @@ For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481
Maneage was created, and has evolved during various research projects.
The primordial implementation was written for \citet{akhlaghi15}.
-It later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi: \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
-With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects.
-That template later matured into Maneage by including the installation of all necessary software from source and it was used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
-Bugs will still be found and Maneage will continue to evolve after this paper is published.
-A list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file.
-
-Once Maneage is adopted on a wide scale in a special topic, they can be fed them into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result.
-Because Maneage is complete, even inputs (software and input data), or failed tests during the projects can enter this optimization process.
-Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies.
-Combined with Software Heritage \citep{dicosmo18}, precise parts Maneage projects (high-level science) can be cited, at various points in its history (e.g., failed/abandoned tests).
-Many components of Machine actionable data management plans \citep{miksa19b} can also be automatically filled with Maneage, greatly helping the project PI and and grant organizations.
+It later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi (\href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}).
+With these, the customizable skeleton was separated from the flesh as a more abstract ``template''.
+Later, steps to build the necessary software were added, as used in \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}]{akhlaghi19} and \citet[\href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}]{infante20}.
+After this paper is published, bugs will still be found and Maneage will continue to evolve and improve; notable changes beyond this paper will be kept in \inlinecode{README-hacking.md}.
+
+Once adopted on a wide scale, Maneage projects can be fed into machine learning (ML) tools for automatic workflow generation, optimized for certain aspects of the result.
+Because Maneage is complete, even inputs (software algorithms and data selection) or failed tests can enter this optimization.
+Furthermore, since it connects the analysis directly to the narrative and history of a project, this can include natural language processing.
+On the other hand, parsers can be written over Maneage-derived projects for meta-research and data provenance studies, for example to generate Research Objects.
+For example, if a bug is found in a software package, all affected projects can be identified and the scale of its effect can be measured.
+Combined with Software Heritage, precise parts of Maneage projects (high-level science) can be cited at various points in their history (e.g., failed/abandoned tests).
+Many components of machine-actionable data management plans \citep{miksa19b} can also be automatically filled in with Maneage, which is useful for project PIs and grant organizations.
Maneage was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the Publishing Data Workflows working group \citep{austin17}.
Its user base, and thus its development, grew phenomenally afterwards and highlighted some caveats.
The first is that Maneage uses very low-level tools that are not widely used by scientists, e.g., Git, \LaTeX, Make and the command-line.
We have discovered that this is primarily because of a lack of exposure.
-Many (in particular early career researchers) have started mastering them as they adopt Maneage once they witness their advantages, but it does take time.
+Many (in particular early career researchers) have started mastering them as they adopt Maneage, but it does take time.
+We are thus working on several tutorials and improving the documentation.
-A second caveat is the fact that Maneage is effectively an almost complete GNU OS, tailored to each project.
-Maintaining the various packages is time consuming for us (Maneage maintainers).
-However, because software installation is also in Make, some users are already adding their necessary software to the core Maneage branch, thus propagating the improvements to all projects using Maneage.
-Another caveat that has been raised is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue with followup papers because others may do it before them.
+A second caveat is the maintenance of the various software packages on the many POSIX-compatible systems.
+However, because Maneage builds its software in the same framework as the analysis (in Make), users are empowered to add/fix their necessary software without learning anything new.
+This has already happened, with changes submitted to the core Maneage branch, which are propagated to all projects.
+Another caveat that has been raised is that publishing the project's reproducible data lineage immediately after publication enables others to continue with follow-up papers before the original authors can do so themselves.
We propose these solutions:
1) Through the Git history, the added work by another team, at any phase of the project, can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
This is however a long-term goal and requires major changes to academic value systems.
@@ -783,7 +782,7 @@ Zahra Sharbaf,
and Ignacio Trujillo
for their useful help, suggestions and feedback on Maneage and this paper.
-During its development, Maneage has been partially funded/supported by the following institutions:
+Work on Maneage, and this paper, has been partially funded/supported by the following institutions:
The Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) PhD scholarship to M. Akhl\-aghi and its Grant-in-Aid for Scientific Research (21244012, 24253003).
The European Research Council (ERC) advanced grant 339659-MUSICOS.
The European Union (EU) Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
diff --git a/tex/src/figure-src-inputconf.tex b/tex/src/figure-src-inputconf.tex
index fc3315d..1a3b31c 100644
--- a/tex/src/figure-src-inputconf.tex
+++ b/tex/src/figure-src-inputconf.tex
@@ -1,4 +1,4 @@
-\begin{tcolorbox}[title=\inlinecode{\textcolor{white}{INPUT.conf}}\hfill\textcolor{white}{(simplified)}]
+\begin{tcolorbox}[title=\inlinecode{\textcolor{white}{INPUT.conf}}]
\footnotesize
\texttt{\mkvar{MK20DATA} = menke20.xlsx}\\
\texttt{\mkvar{MK20MD5}{ } = 8e4eee64791f351fec58680126d558a0}\\
diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex
index 82f9714..592d97e 100644
--- a/tex/src/preamble-style.tex
+++ b/tex/src/preamble-style.tex
@@ -145,6 +145,9 @@
%% Custom macros
\newcommand{\inlinecode}[1]{\textcolor{blue!35!black}{\texttt{#1}}}
+%% To use International Phonetic Alphabet (IPA)
+\usepackage{tipa}
+
%% Example Makefile macros
\newcommand{\mkcomment}[1]{\textcolor{red!70!white}{\# #1}}
\newcommand{\mkvar}[1]{\textcolor{orange!40!black}{#1}}