-rw-r--r-- paper.tex 74
-rw-r--r-- tex/src/preamble-style.tex 6
-rw-r--r-- tex/src/references.tex 13
3 files changed, 53 insertions, 40 deletions
diff --git a/paper.tex b/paper.tex
index dfa1dc5..7937d16 100644
--- a/paper.tex
+++ b/paper.tex
@@ -90,11 +90,15 @@ See Figure \ref{fig:questions} for some similar questions, classified by their p
\tonote{Johan: add some general references.}
Due to the complexity of modern data analysis, a small deviation in the final result can be due to many different steps, which may be significant for its interpretation.
-Publishing the \emph{complete} set of operations is the only way to avoid ambiguity and wasting resources.
+Integrity checks are a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows).
For example, \citet{smart18} describes how a 7-year old conflict in theoretical condensed matter physics was only identified after the relative codes were shared.
\citet{miller06} found a mistaken column flipping in a project's workflow, leading to the retraction of 5 papers in major journals, including \emph{Science}.
\citet{baggerly09} highlighted the inadequate narrative description of the analysis and showed the prevalence of simple errors in published results, ultimately calling their work ``\emph{forensic bioinformatics}''.
-\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers with supplementary Microsoft Excel gene lists contain erroneous gene name conversions.
+\citet{herndon14} and \citet[a self-correction]{horvath15} also reported similar situations and \citet{ziemann16} concluded that one-fifth of papers contain erroneous gene name conversions.
+These are mostly from genomics and bioinformatics because publishing workflows is commonly practiced already (for example \href{https://www.myexperiment.org}{myexperiment.org}, \href{https://www.genepattern.org}{genepattern.org}, and \href{https://galaxyproject.org}{galaxy\-project.org}).
+The status in other fields, without a culture of publishing workflows, is highly likely to be worse.
+Nature is already a black box which we are trying hard to unlock.
+Not being able to experiment on the methods of other researchers is a self-imposed black box over it.
\begin{figure}[t]
\begin{center}
@@ -108,12 +112,6 @@ For example, \citet{smart18} describes how a 7-year old conflict in theoretical
}
\end{figure}
-The reason such reports are mostly from genomics and bioinformatics is because they have traditionally been more open to publishing workflows: see for example \href{https://www.myexperiment.org}{myexperiment.org}, \href{https://www.genepattern.org}{genepattern.org}, and \href{https://galaxyproject.org}{galaxy\-project.org}.
-Integrity checks are hence a critical component of the scientific method, but are only possible with access to the data \emph{and} its lineage (workflows).
-The status in other fields, where workflows are not commonly shared, is highly likely to be even worse.
-Nature is already a black box which we are trying hard to unlock.
-Not being able to experiment on the methods of other researchers is a self-imposed back box over it.
-
The completeness of a project's published workflow (usually within the ``Methods'' section) can be measured by the ability to reproduce the result without needing to contact the authors.
Several studies have attempted to answer this with different levels of detail. For example, \citet{allen18} found that roughly half of the papers in astrophysics do not even mention the names of any analysis software, while \citet{menke20} found this fraction has greatly improved in medical/biological field and is currently above $80\%$.
\citet{ioannidis2009} attempted to reproduce 18 published results by two independent groups, but fully succeeded in only 2 of them and partially in 6.
@@ -419,18 +417,18 @@ Therefore, each project has to identify its high-level software in the \inlineco
Maneage contains the full list of built software for each project, their versions and their configuration options, but this information is buried deep into each project's source.
Therefore Maneage also prints a distilled fraction of this information in the project's final report, blended into the narrative, as seen in the Acknowledgments of this paper.
Furthermore, when the software is associated with a published paper, that paper's Bib\TeX{} entry is also added to the final report and is duly cited with the software's name and version.
-This paper uses basic tools that do not have a paper, for software citation examples see \citet{akhlaghi19} and \citet{infante20}.
+This paper uses basic software without a paper; for software citation examples see \citet{akhlaghi19} and \citet{infante20}.
This is particularly important in the case for research software, where citation is critical to justify continued development.
-One notable example that nicely highlights this issue is GNU Parallel \citep{tange18}: every time it is run, it prints the citation information before it starts.
+One notable example is GNU Parallel \citep{tange18}: it prints the citation information every time it starts.
Users can disable the notice, with the \inlinecode{--citation} option and accept to cite its paper, or support its development directly by paying $10000$ euros!
This is justified by an uncomfortably true statement ``\emph{history has shown that researchers forget to [cite software] if they are not reminded explicitly. ... If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool}''.
-Most research software does not resort to such drastic measures, however, but proper citation is not only important but also ethical.
+Most software does not resort to such drastic measures; however, proper citation is not only important but also ethical.
-Given the increasing number of software used in scientific research, the only reliable solution is to automatically cite the used software.
-The necessity and basic elements in software citation are reviewed, inter alia, by \citet{katz14} and \citet{smith16}.
-There are ongoing projects specifically tailored to software citation, including CodeMeta and Citation file format (CFF), a very robust approach is also provided by SoftwareHeritage \citep{dicosmo18}.
-We plan to enable these tools in future versions of Maneage.
+Given the increasing number and role of software in research \citep{clement19}, automatic citation (as presented here) is a step forward.
+The necessity and basic elements in software citation are reviewed, inter alia, by \citet{katz14} and \citet{smith16}; CodeMeta and Citation file format (CFF) are projects specifically tailored to expand software citation beyond Bib\TeX{}.
+A very robust approach that also includes archival is Software Heritage \citep{dicosmo18}.
+They will be tested and enabled in Maneage.
@@ -450,7 +448,7 @@ If all of these steps were organized in a single Makefile, it would become very
Large files are in general a bad practice and do not fulfil the modularity principle (\ref{principle:modularity}).
Maneage is thus designed to encourage and facilitate modularity by distributing the analysis into many Makefiles that contain contextually-similar analysis steps.
-In the rest of this paper these modular, or lower-level, Makefiles will be called \emph{subMakefiles}.
+Hereafter, these modular or lower-level Makefiles will be called \emph{subMakefiles}.
When run with the \inlinecode{make} argument, the \inlinecode{project} script (Section \ref{sec:maneage}), calls \inlinecode{top-make.mk} which loads the subMakefiles with a certain order.
They are loaded using Make's \inlinecode{include} feature (so Make sees everything as one file in one instance of Make).
By default Maneage does not use recursion (where one instance of Make, calls another instance of Make within itself) to comply with minimal complexity principle (\ref{principle:complexity}) and keep the code's logic clear and simple.
@@ -471,9 +469,9 @@ All the analysis Makefiles are in \inlinecode{re\-produce\-/anal\-ysis\-/make} (
}
\end{figure}
-To avoid getting too abstract in the subsections below, where necessary we will do a basic analysis on the data of \citet{menke20} and replicate one of the results.
-Note that because we are not using the same software\footnote{We cannot use the same software because \citet{menke20} use
-Microsoft Excel for the analysis which violates several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}.}, this is not a reproduction (see \ref{definition:reproduction}).
+To avoid getting too abstract in the subsections below, where necessary we will do a basic analysis on the data of \citet{menke20} (hereafter M20) and replicate one of the results.
+Note that because we are not using the same software, this is not a reproduction (see \ref{definition:reproduction}).
+We cannot use the same software because M20 use Microsoft Excel for the analysis which violates several of our principles: \ref{principle:complete}, \ref{principle:complexity} and \ref{principle:freesoftware}.
In the subsections below, this paper's analysis on that dataset is described using the data lineage graph of Figure \ref{fig:datalineage}.
We will follow Make's paradigm (see Section \ref{sec:usingmake}) of starting the lineage backwards from the ultimate target in Section \ref{sec:paperpdf} (bottom of Figure \ref{fig:datalineage}) to the configuration files in Section \ref{sec:configfiles} (top of Figure \ref{fig:datalineage}).
To better understand this project, we encourage looking into this paper's own Maneage source, published as a supplement.
@@ -522,7 +520,7 @@ It has some tests on pre-defined formats, and other formats can easily be added.
\label{sec:analysis}
The basic concepts behind organizing the analysis into modular subMakefiles have already been discussed above.
-We will thus describe it here with the practical example of replicating Figure 1C of \citet{menke20}, with some enhancements in Figure \ref{fig:toolsperyear}.
+We will thus describe it here with the practical example of replicating Figure 1C of M20, with some enhancements in Figure \ref{fig:toolsperyear}.
As shown in Figure \ref{fig:datalineage}, in this project we have broken this goal into two subMakefiles: \inlinecode{format.mk} and \inlinecode{demo-plot.mk}.
The former is in charge of converting the Excel-formatted input into the simple comma-separated value (CSV) format, and the latter is in charge of generating the table to build Figure \ref{fig:toolsperyear}.
In a real project, subMakefiles could, and will, be much more complex.
@@ -578,7 +576,7 @@ Irrespective of where the dataset is \emph{used} in the project's lineage, it he
Each external dataset has some basic information, including its expected name on the local system (for offline access), the necessary checksum to validate it (either the whole file or just its main ``data'', as discussed in Section \ref{sec:outputverification}), and its URL/PID.
In Maneage, such information regarding a project's input dataset(s) is in the \inlinecode{INPUTS.conf} file.
See Figures \ref{fig:files} \& \ref{fig:datalineage} for the position of \inlinecode{INPUTS.conf} in the project's file structure and data lineage, respectively.
-For demonstration, we are using the datasets of \citet{menke20} which are stored in one \inlinecode{.xlsx} file on bioXriv\footnote{\label{footnote:dataurl}Full data URL: \url{\menketwentyurl}}.
+For demonstration, we are using the datasets of M20 which are stored in one \inlinecode{.xlsx} file on bioRxiv\footnote{\label{footnote:dataurl}Full data URL: \url{\menketwentyurl}}.
Figure \ref{fig:inputconf} shows the corresponding \inlinecode{INPUTS.conf} where the necessary information are stored as Make variables and are automatically loaded into the full project when Make starts (and is most often used in \inlinecode{download.mk}).
\begin{figure}[t]
@@ -600,7 +598,7 @@ Configuration files enable the logical separation between the low-level implemen
In the data lineage plot of Figure \ref{fig:datalineage}, configuration files are shown as the sharp-edged, green \inlinecode{*.conf} files in the top row (for example, the \inlinecode{INPUTS.conf} file that was shown in Figure \ref{fig:inputconf} and mentioned in Section \ref{sec:download}).
All the configuration files of a project are placed under the \inlinecode{reproduce/analysis/config} (see Figure \ref{fig:files}) subdirectory, and are loaded into \inlinecode{top-make.mk} before any of the subMakefiles, see Figure \ref{fig:topmake}.
-The demo analysis of Section \ref{sec:analysis} is a good demonstration of their usage: during that discussion we reported the number of papers studied by \citet{menke20} in \menkenumpapersdemoyear.
+The demo analysis of Section \ref{sec:analysis} is a good demonstration of their usage: during that discussion we reported the number of papers studied by M20 in \menkenumpapersdemoyear.
However, the year's number is not written by hand in \inlinecode{demo-plot.mk}.
It is referenced through the \inlinecode{menke-year-demo} variable, which is defined in \inlinecode{menke-demo-year.conf}, that is a prerequisite of the \inlinecode{demo-plot.tex} rule.
This is also visible in the data lineage of Figure \ref{fig:datalineage}.
@@ -692,7 +690,7 @@ For example, \citet[\href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481
\section{Discussion \& Caveats}
\label{sec:discussion}
-Maneage was created and evolved during various research projects (in astrophysics) over the last 5 years.
+Maneage was created, and has evolved, during various research projects.
The primordial implementation was written for \citet{akhlaghi15}.
It later evolved in \citet{bacon17}, and in particular the two sections of that paper that were done by M. Akhlaghi: \href{http://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} and \href{http://doi.org/10.5281/zenodo.1164774}{zenodo.1164774}.
With these projects, the skeleton of the system was written as a more abstract ``template'' that could be customized for separate projects.
@@ -700,25 +698,24 @@ That template later matured into Maneage by including the installation of all ne
Bugs will still be found and Maneage will continue to evolve after this paper is published.
A list of the notable changes after the publication of this paper will be kept in the \inlinecode{README-hacking.md} file.
-As Git repositories, a Maneage project can benefit from the wonderful archival and citation features of Software Heritage \citep{dicosmo18}, enabling easy citation of precise parts of other projects, at various points in their history.
-Once Maneage is adopted on a wide scale in a special topic, it is possible to feed them into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result.
-Because Maneage is complete and also includes the project's history, even inputs (software and input data) or failed tests during the projects can enter this optimization process.
+Once Maneage is adopted on a wide scale in a special topic, its projects can be fed into machine learning algorithms for automatic workflow generation, optimized for certain aspects of the result.
+Because Maneage is complete, even inputs (software and input data) or failed tests during the projects can enter this optimization process.
Furthermore, writing parsers of Maneage projects to generate Research Objects is trivial, and very useful for meta-research and data provenance studies.
+Combined with Software Heritage \citep{dicosmo18}, precise parts of Maneage projects (high-level science) can be cited at various points in their history (e.g., failed/abandoned tests).
+Many components of machine-actionable data management plans \citep{miksa19b} can also be automatically filled with Maneage, greatly helping the project PI and grant organizations.
-Maneage was awarded a Research Data Alliance (RDA) adoption grant adhering to the recommendations of the publishing data workflows working group \citep{austin17}.
+Maneage was awarded a Research Data Alliance (RDA) adoption grant for implementing the recommendations of the Publishing Data Workflows working group \citep{austin17}.
Its user base, and thus its development, grew phenomenally afterwards and highlighted some caveats.
-The first is that Maneage uses very low-level tools that are (unfortunately) not widely used by scientists, e.g., Git, \LaTeX, Make and the command-line.
+The first is that Maneage uses very low-level tools that are not widely used by scientists, e.g., Git, \LaTeX, Make and the command-line.
We have discovered that this is primarily because of a lack of exposure.
-Many (in particular early career researchers) have started mastering them as they adopt Maneage once they witness their advantages for their project.
+Many (in particular early career researchers) have started mastering them as they adopt Maneage once they witness their advantages, but it does take time.
A second caveat is the fact that Maneage is effectively an almost complete GNU operating system, tailored to each project.
-Maintaining the various packages is time consuming for Maneage maintainers, not derived projects.
-However, because software installation is also in Make, some users are already adding their necessary software to the core Maneage branch, thus propagating the improvement to all projects using Maneage.
-
+Maintaining the various packages is time consuming for us (Maneage maintainers).
+However, because software installation is also in Make, some users are already adding their necessary software to the core Maneage branch, thus propagating the improvements to all projects using Maneage.
Another caveat that has been raised is that publishing the project's reproducible data lineage immediately after publication may hamper their ability to continue with followup papers because others may do it before them.
-Given the strong integrity checks in Maneage, we believe it has features to address this problem in the following ways:
-1) Through the Git history, it is clear how much extra work the other team has added.
-In this way, Maneage can contribute to a new concept of authorship in scientific projects and help to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
+We propose these solutions:
+1) Through the Git history, the work added by another team, at any phase of the project, can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
This is however a long-term goal and requires major changes to academic value systems.
2) Authors can be given a grace period where the journal, or some third authority, keeps the source and publishes it a certain interval after publication.
@@ -764,10 +761,9 @@ During its development, Maneage has been partially funded (in historical order)
The Japanese Ministry of Education, Culture, Sports, Science, and Technology ({\small MEXT}) PhD scholarship to M.A and its Grant-in-Aid for Scientific Research (21244012, 24253003).
The European Research Council (ERC) advanced grant 339659-MUSICOS.
The European Union’s Horizon 2020 (H2020) research and innovation programmes No 777388 under RDA EU 4.0 project, and Marie Sk\l{}odowska-Curie grant agreement No 721463 to the SUNDIAL ITN.
-The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (FEDER) under the grant with reference AYA2016-76219-P.
-The IAC project P/300724, financed by the Ministry of Science, Innovation and Universities, through the State Budget.
-The Canary Islands Department of Economy, Knowledge and Employment, through the Regional Budget of the Autonomous Community.
-The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project "Using machine-learning techniques to drag galaxies from the noise in deep imaging".
+The State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (MCIU) and the European Regional Development Fund (FEDER) under the grant AYA2016-76219-P.
+The IAC project P/300724, financed by the MCIU, through the Canary Islands Department of Economy, Knowledge and Employment.
+The Fundaci\'on BBVA under its 2017 programme of assistance to scientific research groups, for the project ``Using machine-learning techniques to drag galaxies from the noise in deep imaging''.
\input{tex/build/macros/dependencies.tex}
diff --git a/tex/src/preamble-style.tex b/tex/src/preamble-style.tex
index 26deac9..82f9714 100644
--- a/tex/src/preamble-style.tex
+++ b/tex/src/preamble-style.tex
@@ -91,6 +91,9 @@
\newcommand{\tonote}[1]{{}}
\fi
+%% To print the creation date on the PDF.
+\usepackage{datetime}
+
%% To have links.
\usepackage[
colorlinks,
@@ -127,7 +130,8 @@
\rhead{\mplight\footnotesize
Akhlaghi, M, et al. 2020. Maneage, a Customizable Framework\\
for Managing Data Lineage. \emph{Data Science Journal}, VV,\\
- NN, pp.1-\pageref*{LastPage}. DOI: \href{https://doi.org/10.5334/dsj-XXXX-XXX}{\textcolor{black}{https://doi.org/10.5334/dsj-XXXX-XXX}}}
+ NN, pp.1-\pageref*{LastPage}. DOI: \href{https://doi.org/10.5334/dsj-XXXX-XXX}{\textcolor{black}{https://doi.org/10.5334/dsj-XXXX-XXX}}\\
+ PDF created on: \currenttime{}, \today}
\lfoot{}
\cfoot{}
\rfoot{}
diff --git a/tex/src/references.tex b/tex/src/references.tex
index ef33d02..73e3b20 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -1,3 +1,16 @@
+@ARTICLE{clement19,
+ author = {Cl\'ement-Fontaine, M\'elanie and Di Cosmo, Roberto and Guerry, Bastien and Moreau, Patrick and Pellegrini, Fran\c cois},
+ title = {Encouraging a wider usage of software derived from research},
+ year = {2019},
+ journal = {Archives ouvertes HAL},
+ volume = {},
+ pages = {\href{https://hal.archives-ouvertes.fr/hal-02545142}{hal-02545142}},
+}
+
+
+
+
+
@ARTICLE{dicosmo20,
author = {{Di Cosmo}, Roberto and {Gruenpeter}, Morane and {Zacchiroli}, Stefano},
title = "{Referencing Source Code Artifacts: a Separate Concern in Software Citation}",