From e3bdc607a7fca8ebd876e1fa6002e679ad32f2c4 Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Thu, 4 Jun 2020 04:09:21 +0100
Subject: Verification activated, README added, Proper metadata in plot data

All the steps following the to-be-added (in 'README-hacking.md')
publication checklist prior to the final check from a new clone have been
added:

- 'README.md' file has been set.

- "Reproducible supplement" was added just above the keywords, pointing
  to Zenodo.

- A link to the to-be-uploaded data underlying the plot was added in the
  caption of the tools-per-year plot.

- A new meta-data configuration file was added to store basic project
  metadata to be used throughout the project. This will later be taken
  into Maneage. For example, the project title is now stored here and
  written into the paper's LaTeX source and output datasets
  automatically.

- Verification was activated and the plot's data and LaTeX macro files
  are now automatically verified.

- Complete metadata was added for the data underlying the plot.

- A generic function was added in 'initialize.mk' that will automatically
  write project info and copyright in all plain-text outputs.
--- README.md | 38 ++++++++++++-------------- paper.tex | 10 +++++-- reproduce/analysis/config/metadata-common.conf | 16 +++++++++++ reproduce/analysis/config/verify-outputs.conf | 11 ++++++-- reproduce/analysis/make/demo-plot.mk | 35 ++++++++++++++++++++---- reproduce/analysis/make/initialize.mk | 37 ++++++++++++++++++++----- reproduce/analysis/make/verify.mk | 8 +++--- tex/src/figure-data-lineage.tex | 2 +- tex/src/figure-tools-per-year.tex | 4 +-- 9 files changed, 117 insertions(+), 44 deletions(-) create mode 100644 reproduce/analysis/config/metadata-common.conf diff --git a/README.md b/README.md index 7216f1f..91d5527 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,13 @@ -Reproducible source for XXXXXXXXXXXXXXXXX -========================================= +Reproducible source for paper introducing Maneage (MANaging data linEAGE) +------------------------------------------------------------------------- Copyright (C) 2018-2020 Mohammad Akhlaghi \ See the end of the file for license conditions. -This is the reproducible project source for the paper titled "**XXX XXXXX -XXXXXX**", by XXXXX XXXXXX, YYYYYY YYYYY and ZZZZZZ ZZZZZ that is published -in XXXXX XXXXX. +This is the reproducible project source for the paper titled "**Towards +Long-term and Archivable Reproducibility**", by Mohammad Akhlaghi, Raúl +Infante-Sainz, Boudewijn F. Roukema, David Valls-Gabaud, Roberto +Baena-Gallé. To reproduce the results and final paper, the only dependency is a minimal Unix-based building environment including a C compiler (already available @@ -18,8 +19,8 @@ button to download a compressed tarball of the project). If you have received this source from arXiv, please see the respective section below. ```shell -$ git clone XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX -$ cd XXXXXXXXXXXXXXXXXX +$ git clone https://gitlab.com/makhlaghi/maneage-paper +$ cd maneage-paper $ ./project configure $ ./project make ``` @@ -34,8 +35,7 @@ https://maneage.org. 
-Building the project --------------------- +### Building the project This project was designed to have as few dependencies as possible without requiring root/administrator permissions. @@ -52,14 +52,11 @@ requiring root/administrator permissions. a directory given at configuration time), they will be used. Otherwise, a downloader (`wget` or `curl`) will be necessary to download any necessary tarball. The necessary tarballs are also - collected in the archived project on Zenodo (link below) [[TO - AUTHORS: UPLOAD THE SOFTWARE TARBALLS WITH YOUR DATA AND PROJECT - SOURCE TO ZENODO OR OTHER SIMILAR SERVICES. THEN ADD THE DOI/LINK - HERE.DON'T FORGET THAT THE SOFTWARE ARE A CRITICAL PART OF YOUR - WORK.]]. Just unpack that tarball, and when `./project configure` - asks for the "software tarball directory", give the address of the - unpacked directory that has all the tarballs. - https://doi.org/10.5281/zenodo.3408481 + collected in the archived project on Zenodo (link below). Just + unpack that tarball, and when `./project configure` asks for the + "software tarball directory", give the address of the unpacked + directory that has all the tarballs. + https://doi.org/10.5281/zenodo.3872248 2. Configure the environment (top-level directories in particular) and build all the necessary software for use in the next step. It is @@ -86,8 +83,8 @@ requiring root/administrator permissions. -Source from arXiv ------------------ +### Source from arXiv + If the paper is also published on arXiv, it is highly likely that the authors also uploaded/published the full project there along with the LaTeX sources. 
If you have downloaded (or plan to download) this source from @@ -155,8 +152,7 @@ arXiv, some minor extra steps are necessary: -Copyright information ---------------------- +### Copyright information This file and `.file-metadata` (a binary file, used by Metastore to store file dates when doing Git checkouts) are part of the reproducible project diff --git a/paper.tex b/paper.tex index 685849f..fff0e41 100644 --- a/paper.tex +++ b/paper.tex @@ -27,7 +27,7 @@ \input{tex/src/preamble-pgfplots.tex} %% Title and author names. -\title{Towards Long-term and Archivable Reproducibility} +\title{\projecttitle} \author{ Mohammad~Akhlaghi, Ra\'ul Infante-Sainz, @@ -70,9 +70,12 @@ %% CONCLUSION We show that requiring longevity of a reproducible workflow solution is realistic, and discuss the benefits of the criteria for scientific progress, but also immediate benefits for short-term reproducibility. This paper has itself been written in Maneage, with snapshot \projectversion. + + \vspace{3mm} + \emph{Reproducible supplement} --- \href{https://doi.org/10.5281/zenodo.3872248}{\texttt{Zenodo.3872248}}. \end{abstract} -% Note that keywords are not normally used for peerreview papers. +% Note that keywords are not normally used for peer-review papers. \begin{IEEEkeywords} Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows \end{IEEEkeywords} @@ -82,6 +85,8 @@ Data Lineage, Provenance, Reproducibility, Scientific Pipelines, Workflows + + % For peer review papers, you can put extra information on the cover % page as needed: % \ifCLASSOPTIONpeerreview @@ -293,6 +298,7 @@ Figure \ref{fig:datalineage} (bottom) is the data lineage graph that produced it For example, \inlinecode{paper.pdf} depends on \inlinecode{project.tex} (in the build directory; generated automatically) and \inlinecode{paper.tex} (in the source directory; written manually). The solid arrows and full-opacity built boxes correspond to this paper. 
The dashed arrows and low-opacity built boxes show the scalability by adding hypothetical steps to the project. + The underlying data of the top plot is available at \href{https://zenodo.org/record/3872248/files/tools-per-year.txt}{zenodo.3872248/tools-per-year.txt}. } \end{figure*} diff --git a/reproduce/analysis/config/metadata-common.conf b/reproduce/analysis/config/metadata-common.conf new file mode 100644 index 0000000..7bc9fa5 --- /dev/null +++ b/reproduce/analysis/config/metadata-common.conf @@ -0,0 +1,16 @@ +# Metadata parameters that can be used in + +# Project information +metadata-title = Towards Long-term and Archivable Reproducibility + +# DOIs and identifiers. +metadata-arxiv = +metadata-doi-zenodo = https://doi.org/10.5281/zenodo.3872248 +metadata-doi-journal = +metadata-doi = $(metadata-doi-zenodo) +metadata-git-repository = https://gitlab.com/makhlaghi/maneage-paper + +# Copyright and identifier. +metadata-copyright-owner = Mohammad Akhlaghi +metadata-copyright = Creative Commons Attribution-ShareAlike (CC BY-SA) +metadata-copyright-url = https://creativecommons.org/licenses/by-sa/4.0 diff --git a/reproduce/analysis/config/verify-outputs.conf b/reproduce/analysis/config/verify-outputs.conf index e4ef479..c9287e8 100644 --- a/reproduce/analysis/config/verify-outputs.conf +++ b/reproduce/analysis/config/verify-outputs.conf @@ -1,2 +1,9 @@ -# To enable verification of output datasets set this variable to yes -verify-outputs = +# To enable verification of output datasets set this variable to 'yes'. +# +# Copyright (C) 2019-2020 Mohammad Akhlaghi +# +# Copying and distribution of this file, with or without modification, are +# permitted in any medium without royalty provided the copyright notice and +# this notice are preserved. This file is offered as-is, without any +# warranty. 
+verify-outputs = yes diff --git a/reproduce/analysis/make/demo-plot.mk b/reproduce/analysis/make/demo-plot.mk index c14b83d..a149040 100644 --- a/reproduce/analysis/make/demo-plot.mk +++ b/reproduce/analysis/make/demo-plot.mk @@ -18,7 +18,7 @@ # Directory to host outputs # ------------------------- -a2dir = $(texdir)/tools-per-year +a2dir = $(texdir)/to-publish $(a2dir):; mkdir $@ @@ -27,7 +27,7 @@ $(a2dir):; mkdir $@ # Table for Figure 1C of Menke+20 # ------------------------------- -a2mk20f1c = $(a2dir)/columns.txt +a2mk20f1c = $(a2dir)/tools-per-year.txt $(a2mk20f1c): $(mk20tab3) | $(a2dir) # Remove the (possibly) produced figure that is created from this @@ -35,12 +35,37 @@ $(a2mk20f1c): $(mk20tab3) | $(a2dir) # multiple files with a fixed prefix. rm -f $(tikzdir)/figure-tools-per-year* + # Write the column metadata in a temporary file name (appending + # '.tmp' to the actual target name). Once all steps are done, it is + # renamed to the final target. We do this because if there is an + # error in the middle, Make will not consider the job to be + # complete and will stop here. + echo "# Data of plot showing fraction of papers that mentioned software tools" > $@.tmp + echo "# per year to demonstrate the features of Maneage (MANaging data linEAGE)." >> $@.tmp + >> $@.tmp + echo "# Raw data taken from Menke+2020 (https://doi.org/10.1101/2020.01.15.908111)." \ + >> $@.tmp + echo "# " >> $@.tmp + echo "# Column 1: YEAR [count, u16] Publication year of papers." \ + >> $@.tmp + echo "# Column 2: WITH_TOOLS [frac, f32] Fraction of papers mentioning software tools." \ + >> $@.tmp + echo "# Column 3: NUM_PAPERS [count, u32] Total number of papers studied in that year." \ + >> $@.tmp + echo "# " >> $@.tmp + $(call print-copyright, $@.tmp) + + # Find the maximum number of papers. 
awk '!/^#/{all[$$1]+=$$2; id[$$1]+=$$3} \ END{ for(year in all) \ - print year, 100*id[year]/all[year], all[year] \ + printf("%-7d%-10.3f%d\n", year, 100*id[year]/all[year], \ + all[year]) \ }' $< \ - > $@ + >> $@.tmp + + # Write it into the final target + mv $@.tmp $@ @@ -50,7 +75,7 @@ $(a2mk20f1c): $(mk20tab3) | $(a2dir) $(mtexdir)/demo-plot.tex: $(a2mk20f1c) $(pconfdir)/demo-year.conf # Find the first year (first column of first row) of data. - v=$$(awk 'NR==1{print $$1}' $(a2mk20f1c)) + v=$$(awk '!/^#/ && c==0{c++; print $$1}' $(a2mk20f1c)) echo "\newcommand{\menkefirstyear}{$$v}" > $@ # Find the number of rows in the plotted table. diff --git a/reproduce/analysis/make/initialize.mk b/reproduce/analysis/make/initialize.mk index fe9c103..b0701f4 100644 --- a/reproduce/analysis/make/initialize.mk +++ b/reproduce/analysis/make/initialize.mk @@ -213,8 +213,9 @@ $(lockdir): | $(BDIR); mkdir $@ # we want to ensure that the file is always built in every run: it contains # the project version which may change between two separate runs, even when # no file actually differs. -packagebasename := $(shell if [ -d .git ]; then \ - echo paper-$$(git describe --dirty --always --long); else echo NOGIT; fi) +project-commit-hash := $(shell if [ -d .git ]; then \ + echo $$(git describe --dirty --always --long); else echo NOGIT; fi) +packagebasename := paper-$(project-commit-hash) packagecontents = $(texdir)/$(packagebasename) .PHONY: all clean dist dist-zip distclean clean-mmap $(packagecontents) \ $(mtexdir)/initialize.tex @@ -373,6 +374,31 @@ dist-zip: $(packagecontents) +# Print Copyright statement +# ------------------------- +# +# This statement can be used in published datasets that are in plain-text +# format. It assumes you have already put the data-specific statements in +# its first argument, it will supplement them with general project links. 
+print-copyright = \ + echo "\# Project title: $(metadata-title)" >> $(1); \ + echo "\# Git commit (that produced this dataset): $(packagebasename)" >> $(1); \ + echo "\# Project's Git repository: $(metadata-git-repository)" >> $(1); \ + if [ x$(metadata-arxiv) != x ]; then echo "\# arXiv:$(metadata-arxiv)" >> $(1); fi; \ + if [ x$(metadata-doi-journal) != x ]; then \ + echo "\# DOI (Journal): $(metadata-doi-journal)" >> $(1); fi; \ + if [ x$(metadata-doi-zenodo) != x ]; then \ + echo "\# DOI (Zenodo): $(metadata-doi-zenodo)" >> $(1); fi; \ + echo "\#" >> $(1); \ + echo "\# Copyright (C) $$(date +%Y) $(metadata-copyright-owner)" >> $(1); \ + echo "\# Dataset is available under $(metadata-copyright)." >> $(1); \ + echo "\# License URL: $(metadata-copyright-url)" >> $(1); + + + + + + # Project initialization results # ------------------------------ # @@ -381,8 +407,5 @@ dist-zip: $(packagecontents) # calculated everytime the project is run. So even though this file # actually exists, it is also aded as a `.PHONY' target above. $(mtexdir)/initialize.tex: | $(mtexdir) - - # Version of the project. - @if [ -d .git ]; then v=$$(git describe --dirty --always --long); - else v=NO-GIT; fi - echo "\newcommand{\projectversion}{$$v}" > $@ + echo "\newcommand{\projecttitle}{$(metadata-title)}" > $@ + echo "\newcommand{\projectversion}{$(project-commit-hash)}" >> $@ diff --git a/reproduce/analysis/make/verify.mk b/reproduce/analysis/make/verify.mk index 088b3b3..1573920 100644 --- a/reproduce/analysis/make/verify.mk +++ b/reproduce/analysis/make/verify.mk @@ -107,14 +107,14 @@ $(mtexdir)/verify.tex: $(foreach s, $(verify-dep), $(mtexdir)/$(s).tex) # Verify the figure datasets. 
$(call verify-txt-no-comments-leading-space, \ - $(delete-num), ad345e873e6af577f0e4e7c8942cdf08) - $(call verify-txt-no-comments-leading-space, \ - $(delete-histogram), 12a81c4c8c5f552e5ed5686453587fe8) + $(a2mk20f1c), 76fc5b13495c4d8e8e6f8d440304cf69) # Verify TeX macros (the values that go into the PDF text). for m in $(verify-check); do file=$(mtexdir)/$$m.tex - if [ $$m == download ]; then s=XXXXX + if [ $$m == download ]; then s=64da83ee3bfaa236849927cdc001f5d3 + elif [ $$m == format ]; then s=e04d95a539b5540c940bf48994d8d45f + elif [ $$m == demo-plot ]; then s=2504472bd2b3f60b5a26c5f2a3a67251 else echo; echo "'$$m' not recognized."; exit 1 fi $(call verify-txt-no-comments-leading-space, $$file, $$s) diff --git a/tex/src/figure-data-lineage.tex b/tex/src/figure-data-lineage.tex index fcc52d9..21c84f3 100644 --- a/tex/src/figure-data-lineage.tex +++ b/tex/src/figure-data-lineage.tex @@ -177,7 +177,7 @@ \ifdefined\outtwob \node (menkedemoyear) [node-nonterminal, at={(2.67cm,4.6cm)}] {demo-year.conf}; \node (a2tex-west) [node-point, at={(1.27cm,-0.8cm)}] {}; - \node (out2b) [node-terminal, at={(2.67cm,0.3cm)}] {columns.txt}; + \node (out2b) [node-terminal, at={(2.67cm,0.3cm)}] {tools-per-\\year.txt}; \draw [->] (out2b) -- (a2tex); \draw [->,rounded corners] (menkedemoyear.west) -| (a2tex-west) |- (a2tex); \fi diff --git a/tex/src/figure-tools-per-year.tex b/tex/src/figure-tools-per-year.tex index 240ac27..e235424 100644 --- a/tex/src/figure-tools-per-year.tex +++ b/tex/src/figure-tools-per-year.tex @@ -14,7 +14,7 @@ %% Linear plot, showing the number of papers mentioning tools. \addplot+ [mark=none, very thick, green!60!black] - table {tex/build/tools-per-year/columns.txt}; + table {tex/build/to-publish/tools-per-year.txt}; \end{axis} %% Add the right-side Y axis. 
@@ -29,6 +29,6 @@ max space between ticks=20, ] \addplot+ [ybar, mark=none, fill=red!50!white, red, opacity=0.25] - table [x index=0, y index=2] {tex/build/tools-per-year/columns.txt}; + table [x index=0, y index=2] {tex/build/to-publish/tools-per-year.txt}; \end{axis} \end{tikzpicture} -- cgit v1.2.1
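
Reviewer's note: two ideas in this patch work together. The plot data is written to `$@.tmp` and renamed only at the end. Its commented header carries values that change between runs (the Git commit and the year from `date +%Y`), so verification has to ignore comment lines. The sketch below illustrates both ideas in plain POSIX shell. It is not Maneage's implementation: the function names `write_table` and `data_checksum`, the column contents and the owner name are all hypothetical. It also uses `cksum` purely for portability, whereas the checksums recorded in `verify.mk` are 32-character hexadecimal strings produced by a different tool.

```shell
#!/bin/sh
# Illustrative sketch only; these helper names are hypothetical and are
# not part of Maneage.

# Write a small plain-text table with a commented metadata header.
# Everything goes into "$1.tmp" first and is renamed to "$1" only if all
# writes succeeded, so an interrupted run never leaves a half-written
# file under the final name.
write_table() {
    out=$1
    {
        echo "# Column 1: YEAR  [count, u16] Publication year (made-up data)."
        echo "# Column 2: VALUE [frac,  f32] Example value (made-up data)."
        echo "#"
        echo "# Copyright (C) $(date +%Y) Example Owner"
        printf '%-7d%-10.3f\n' 1991 0.5
        printf '%-7d%-10.3f\n' 1992 1.25
    } > "$out.tmp" && mv "$out.tmp" "$out"
}

# Print a checksum of the data rows only: leading whitespace is stripped,
# then comment lines and empty lines are dropped before hashing. A change
# in the header (year, commit, owner) therefore leaves the checksum
# unchanged, while a change in any data row alters it.
data_checksum() {
    sed -e 's/^[[:space:]]*//' -e '/^#/d' -e '/^$/d' "$1" \
        | cksum | cut -d' ' -f1
}
```

Under these assumptions, running `write_table out.txt` followed by `data_checksum out.txt` prints a number that stays the same across years and commits as long as the data rows are identical. That is the property a stored reference checksum needs.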