From e623102768c426e86b0ed73904168006dfea2af9 Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Sun, 25 Nov 2018 15:22:48 +0000 Subject: Pipeline now downloads and uses an input dataset In most analysis situations (except for simulations), an input dataset is necessary, but that part of the pipeline was just left out and a general `SURVEY' variable was set and never used. So with this commit, we actually use a sample FITS file from the FITS standard webpage, show it (as well as its histogram) and do some basic calculations on it. This preparation of the input datasets is done in a generic way to enable easy addition of more datasets if necessary. --- README-pipeline.md | 55 ++++------- configure | 108 +++++++++++++++++++-- paper.tex | 86 ++++++++++------ reproduce/config/gnuastro/astconvertt.conf | 31 ++++++ reproduce/config/gnuastro/aststatistics.conf | 34 +++++++ reproduce/config/pipeline/INPUTS.mk | 9 ++ reproduce/config/pipeline/LOCAL.mk.in | 4 + reproduce/config/pipeline/delete-me-wfpc2-quant.mk | 2 + reproduce/config/pipeline/dependency-versions.mk | 1 + reproduce/config/pipeline/web.mk | 6 -- reproduce/src/make/delete-me.mk | 71 +++++++++++++- reproduce/src/make/dependencies.mk | 7 +- reproduce/src/make/download.mk | 57 ++++++++--- reproduce/src/make/initialize.mk | 9 ++ tex/delete-me-wfpc2.tex | 34 +++++++ 15 files changed, 414 insertions(+), 100 deletions(-) create mode 100644 reproduce/config/gnuastro/astconvertt.conf create mode 100644 reproduce/config/gnuastro/aststatistics.conf create mode 100644 reproduce/config/pipeline/INPUTS.mk create mode 100644 reproduce/config/pipeline/delete-me-wfpc2-quant.mk delete mode 100644 reproduce/config/pipeline/web.mk create mode 100644 tex/delete-me-wfpc2.tex diff --git a/README-pipeline.md b/README-pipeline.md index ff15094..6effa30 100644 --- a/README-pipeline.md +++ b/README-pipeline.md @@ -516,6 +516,7 @@ advanced in later stages of your work. them. - Delete marked part(s) in `configure`. 
+ - Delete the `reproduce/config/gnuastro` directory. - Delete `astnoisechisel` from the value of `top-level-programs` in `reproduce/src/make/dependencies.mk`. You can keep the rule to build `astnoisechisel`, since its not in the `top-level-programs` list, it (and all the dependencies that are only needed by Gnuastro) will be ignored. - Delete marked parts in `reproduce/src/make/initialize.mk`. - Delete `and Gnuastro \gnuastroversion` from `tex/preamble-style.tex`. @@ -526,51 +527,31 @@ advanced in later stages of your work. commented thoroughly and reading over the comments should guide you on what to add/remove and where. - - **Input dataset (can be done later)**: The user manages the top-level - directory of the input data through the variables set in - `reproduce/config/pipeline/LOCAL.mk.in` (the user actually edits a - `LOCAL.mk` file that is created by `configure` from the `.mk.in` file, - but the `.mk` file is not under version control). Datasets are usually - large and the users might already have their copy don't need to - download them). So you can define a variable (all in capital letters) - in `reproduce/config/pipeline/LOCAL.mk.in`. For example if you are - working on data from the XDF survey, use `XDF`. You can use this - variable to identify the location of the raw inputs on the running - system. Here, we'll assume its name is `SURVEY`. Afterwards, change - any occurrence of `SURVEY` in the whole pipeline with the new - name. You can find the occurrences with a simple command like the ones - shown below. We follow the Make convention here that all - `ONLY-CAPITAL` variables are those directly set by the user and all - `small-caps` variables are set by the pipeline designer. All variables - that also depend on this survey have a `survey` in their name. Hence, - also correct all these occurrences to your new name in small-caps. Of - course, ignore/delete those occurrences that are irrelevant, like - those in this file. 
Note that in the raw version of this template no - target depends on these files, so they are ignored. Afterwards, set - the webpage and correct the filenames in - `reproduce/src/make/download.mk` if necessary. - - ```shell - $ grep -r SURVEY ./ - $ grep -r survey ./ - ``` - - - **Other input datasets (can be done later)**: Add any other input - datasets that may be necessary for your research to the pipeline based - on the example above. + - **Input dataset (can be done later)**: The input datasets are managed + through the `reproduce/config/pipeline/INPUTS.mk` file. It is best to + gather all the information regarding all the input datasets into this + one central file. To ensure that the proper dataset is being + downloaded and used by the pipeline, it's best to also get an MD5 + checksum (https://en.wikipedia.org/wiki/MD5) of the file and include + that in this file so you can check it in the pipeline. The preparation + of the input datasets is done in + `reproduce/src/make/download.mk`. Have a look there to see how these + values are to be used. This information about the input datasets is + also used in the initial `configure` script (to inform the users), so + also modify that file. - **Delete dummy parts (can be done later)**: The template pipeline - contains some parts that are only for the initial/test run, not for - any real analysis. The respective files to remove and parts to fix are - discussed here. + contains some parts that are only for the initial/test run, mainly as + a demonstration of important steps. They are not for any real + analysis. You can remove these parts in the files below - `paper.tex`: Delete the text of the abstract and the paper's main body, *except* the "Acknowledgments" section. This reproduction pipeline was designed by funding from many grants, so its necessary to acknowledge them in your final research. - - `Makefile`: Delete the two lines containing `delete-me` in the - `foreach` loops. 
Just make sure the other lines that end in `\` are + - `Makefile`: Delete the lines containing `delete-me` in the `foreach` + loops. Just make sure the other lines that end in `\` are immediately after each other. - Delete all `delete-me*` files in the following directories: diff --git a/configure b/configure index c33d646..2922365 100755 --- a/configure +++ b/configure @@ -42,6 +42,7 @@ topdir=$(pwd) installedlink=.local lbdir=reproduce/build cdir=reproduce/config +optionaldir="/optional/path" pdir=$cdir/pipeline pconf=$pdir/LOCAL.mk @@ -100,7 +101,7 @@ function create_file_with_notice() { # Since the build directory will go into a symbolic link, we want it to be # an absolute address. With this function we can make sure of that. function absolute_dir() { - echo "$(cd "$(dirname "$inbdir")" && pwd )/$(basename "$inbdir")" + echo "$(cd "$(dirname "$1")" && pwd )/$(basename "$1")" } @@ -179,7 +180,8 @@ fi # the web address. if [ $rewritepconfig = yes ]; then if type wget > /dev/null 2>/dev/null; then - downloader="wget --no-use-server-timestamps -O"; + wgetname=$(which wget) + downloader="$wgetname --no-use-server-timestamps -O"; else cat <> $pconf else # Read the values from existing configuration file. - inbdir=$(awk '$1=="BDIR" {print $NF}' $pconf) - ddir=$(awk '$1=="DEPENDENCIES-DIR" {print $NF}' $pconf) - downloader=$(awk '$1=="DOWNLOADER" {print $NF}' $pconf) + inbdir=$(awk '$1=="BDIR" {print $3}' $pconf) + downloader=$(awk '$1=="DOWNLOADER" {print $3}' $pconf) + + # Make sure all necessary variables have a value + err=0 + verr=0 + novalue="" + if [ x"$inbdir" = x ]; then novalue="BDIR, "; fi + if [ x"$downloader" = x ]; then novalue="$novalue"DOWNLOADER; fi + if [ x"$novalue" != x ]; then verr=1; err=1; fi # Make sure `bdir' is an absolute path and it exists. + berr=0 + ierr=0 bdir=$(absolute_dir $inbdir) - if ! [ -d $bdir ]; then mkdir $bdir; fi + + if ! [ -d $bdir ]; then if ! 
mkdir $bdir; then berr=1; err=1; fi; fi + if [ $err = 1 ]; then + cat < $@ + + + + + # TeX macros # ---------- # @@ -50,7 +103,7 @@ $(dm): $(pconfdir)/delete-me-num.mk | $(dmdir) # # NOTE: In LaTeX you cannot use any non-alphabetic character in a variable # name. -$(mtexdir)/delete-me.tex: $(dm) +$(mtexdir)/delete-me.tex: $(dm) $(wfpc2) $(wfpc2hist) $(wfpc2stats) # Write the number of random values used. echo "\newcommand{\deletemenum}{$(delete-me-num)}" > $@ @@ -67,6 +120,16 @@ $(mtexdir)/delete-me.tex: $(dm) {if($$2>max) max=$$2; if($$2> $@; + echo "\newcommand{\deletememin}{$$v}" >> $@ v=$$(echo "$$mm" | awk '{printf "%.3f", $$2}'); echo "\newcommand{\deletememax}{$$v}" >> $@ + + # Write the statistics of the WFPC2 image as a macro. + q=$(delete-me-wfpc2-quantile) + echo "\newcommand{\deletemewfpcquantile}{$$q}" >> $@ + mean=$$(awk '{printf("%.2f", $$1)}' $(wfpc2stats)) + echo "\newcommand{\deletemewfpctwomean}{$$mean}" >> $@ + median=$$(awk '{printf("%.2f", $$2)}' $(wfpc2stats)) + echo "\newcommand{\deletemewfpctwomedian}{$$median}" >> $@ + quantile=$$(awk '{printf("%.2f", $$3)}' $(wfpc2stats)) + echo "\newcommand{\deletemewfpctwoquantile}{$$quantile}" >> $@ diff --git a/reproduce/src/make/dependencies.mk b/reproduce/src/make/dependencies.mk index 8ed359b..a784883 100644 --- a/reproduce/src/make/dependencies.mk +++ b/reproduce/src/make/dependencies.mk @@ -43,7 +43,7 @@ ildir = $(BDIR)/dependencies/installed/lib ilidir = $(BDIR)/dependencies/installed/lib/built # Define the top-level programs to build (installed in `.local/bin'). 
-top-level-programs = gawk gs grep sed git astnoisechisel texlive-ready +top-level-programs = gawk gs grep sed git flock astnoisechisel texlive-ready all: $(foreach p, $(top-level-programs), $(ibdir)/$(p)) # Other basic environment settings: We are only including the host @@ -75,6 +75,7 @@ LD_LIBRARY_PATH := $(ildir) tarballs = $(foreach t, cfitsio-$(cfitsio-version).tar.gz \ cmake-$(cmake-version).tar.gz \ curl-$(curl-version).tar.gz \ + flock-$(flock-version).tar.xz \ gawk-$(gawk-version).tar.lz \ ghostscript-$(ghostscript-version).tar.gz \ git-$(git-version).tar.xz \ @@ -111,6 +112,7 @@ $(tarballs): $(tdir)/%: w=https://heasarc.gsfc.nasa.gov/FTP/software/fitsio/c/cfitsio$$v.tar.gz elif [ $$n = cmake ]; then w=https://cmake.org/files/v3.12 elif [ $$n = curl ]; then w=https://curl.haxx.se/download + elif [ $$n = flock ]; then w=https://github.com/discoteq/flock/releases/download/v$(flock-version) elif [ $$n = gawk ]; then w=http://ftp.gnu.org/gnu/gawk elif [ $$n = ghostscript ]; then w=https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs926 elif [ $$n = git ]; then w=https://mirrors.edge.kernel.org/pub/software/scm/git @@ -244,6 +246,9 @@ $(ibdir)/libtool: $(tdir)/libtool-$(libtool-version).tar.xz $(ibdir)/gs: $(tdir)/ghostscript-$(ghostscript-version).tar.gz $(call gbuild, $<, ghostscript-$(ghostscript-version)) +$(ibdir)/flock: $(tdir)/flock-$(flock-version).tar.xz + $(call gbuild, $<, flock-$(flock-version), static) + $(ibdir)/git: $(tdir)/git-$(git-version).tar.xz \ $(ilidir)/zlib $(call gbuild, $<, git-$(git-version), static) diff --git a/reproduce/src/make/download.mk b/reproduce/src/make/download.mk index 9617a45..180d2cf 100644 --- a/reproduce/src/make/download.mk +++ b/reproduce/src/make/download.mk @@ -25,20 +25,51 @@ -# Download SURVEY data +# Download input data # -------------------- # -# Data from a survey (for example an imaging survey) usually have a special -# file-name format which should be set here in the `foreach' loop. 
Note -# that the `foreach' function needs the backslash (`\') at the end of the -# line when it is broken into multiple lines. -all-survey = $(foreach f, $(filters-survey), \ - $(SURVEY)/a-special-format-$(f).fits \ - $(SURVEY)/a-possibly-additional-$(f)-format.fits ) -$(SURVEY):; mkdir $@ -$(all-survey): $(SURVEY)/%: | $(SURVEY) $(lockdir) - flock $(lockdir)/download -c "$(DOWNLOADER) $@ $(web-survey)/$*" +# The input dataset properties are defined in `$(pconfdir)/INPUTS.mk'. For +# this template pipeline we only have one dataset to enable easy +# processing, so all the extra checks in this rule may seem +# redundant. +# +# However, in a real project, you will need more than one dataset. In that +# case, just add them to the target list and add an `elif' statement to +# define it in the recipe. +# +# Download lock file: Most systems have a single connection to the +# internet, therefore downloading is inherently done in series. As a +# result, when more than one dataset is necessary for download, if they are +# done in parallel, the speed will be slower than downloading them in +# series. We thus use the `flock' program to tie/lock the downloading +# process with a file and make sure that only one downloading event is in +# progress at every moment. +$(indir):; mkdir $@ +inputdatasets = $(foreach i, $(WFPC2IMAGE), $(indir)/$(i)) +$(inputdatasets): $(indir)/%: | $(indir) $(lockdir) + + # Set the necessary parameters for this input file. + if [ $* = $(WFPC2IMAGE) ]; then url=$(WFPC2URL); mdf=$(WFPC2MD5); + else + echo; echo; echo "Not recognized input dataset: '$*'." + echo; echo; exit 1 + fi + + # Download (or make the link to) the input dataset. + if [ -f $(INDIR)/$* ]; then + ln -s $(INDIR)/$* $@ + else + flock $(lockdir)/download $(DOWNLOADER) $@ $$url/$* + fi + # Check the md5 sum to see if this is the proper dataset. 
+ sum=$$(md5sum $@ | awk '{print $$1}') + if [ $$sum != $$mdf ]; then + wrongname=$(dir $@)/wrong-$(notdir $@) + mv $@ $$wrongname + echo; echo; echo "Wrong MD5 checksum for '$*' in $$wrongname" + echo; echo; exit 1 + fi @@ -49,5 +80,5 @@ $(all-survey): $(SURVEY)/%: | $(SURVEY) $(lockdir) # # It is very important to mention the address where the data were # downloaded in the final report. -$(mtexdir)/download.tex: $(pconfdir)/web.mk | $(mtexdir) - @echo "\\newcommand{\\websurvey}{$(web-survey)}" > $@ +$(mtexdir)/download.tex: $(pconfdir)/INPUTS.mk | $(mtexdir) + echo "\\newcommand{\\wfpctwourl}{$(WFPC2URL)}" > $@ diff --git a/reproduce/src/make/initialize.mk b/reproduce/src/make/initialize.mk index 694aca0..41a5e05 100644 --- a/reproduce/src/make/initialize.mk +++ b/reproduce/src/make/initialize.mk @@ -34,6 +34,7 @@ # parallel. Also, some programs may not be thread-safe, therefore it will # be necessary to put a lock on them. This pipeline uses the `flock' # program to achieve this. +indir = $(BDIR)/inputs texdir = $(BDIR)/tex srcdir = reproduce/src lockdir = $(BDIR)/locks @@ -224,6 +225,14 @@ $(mtexdir)/initialize.tex: | $(mtexdir) fi; \ echo "\newcommand{\\bziptwoversion}{$(bzip2-version)}" >> $@ + # Unfortunately we couldn't find a way to retrieve the version of + # the discoteq `flock' that we are using here. So we'll just report + # the version we downloaded and installed. + echo "\newcommand{\\flockversion}{$(flock-version)}" >> $@ + + + + # Versions of libraries. $(call lvcheck, fitsio.h, $(cfitsio-version), CFITSIO, cfitsioversion) diff --git a/tex/delete-me-wfpc2.tex b/tex/delete-me-wfpc2.tex new file mode 100644 index 0000000..95b3105 --- /dev/null +++ b/tex/delete-me-wfpc2.tex @@ -0,0 +1,34 @@ +\begin{tikzpicture} + + %% The displayed WFPC2 image. 
+ \node[anchor=south west] (img) at (0,0) + {\includegraphics[width=0.5\linewidth] + {\bdir/tex/delete-me-wfpc2/wfpc2.pdf}}; + + %% Its label + \node[anchor=south west] at (0.45\linewidth,0.45\linewidth) + {\textcolor{white}{a}}; + + %% This histogram. + \begin{axis}[at={(0.52\linewidth,0.1\linewidth)}, + no markers, + axis on top, + xmode=normal, + ymode=normal, + yticklabels={}, + scale only axis, + xlabel=Pixel value, + width=0.5\linewidth, + height=0.412\linewidth, + enlarge y limits=false, + enlarge x limits=false, + ] + \addplot [const plot mark mid, fill=red] + table [x index=0, y index=1] + {\bdir/tex/delete-me-wfpc2/wfpc2-hist.txt} + \closedcycle; + \end{axis} + + %% The histogram's label + \node[anchor=south west] at (0.95\linewidth,0.45\linewidth) {b}; +\end{tikzpicture} -- cgit v1.2.1
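
Note on the checksum step: the MD5 check that this commit adds to `reproduce/src/make/download.mk` can be tried outside of Make with a small stand-alone sketch like the one below. It is only an illustration under stated assumptions, not the pipeline's own code: the function name `verify_md5` is hypothetical, and it assumes that `md5sum` (GNU Coreutils) and `awk` are available on the system. Like the recipe, it compares the computed checksum against the expected one and moves a mismatching file aside under a `wrong-` prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): check a file against an
# expected MD5 checksum. On a mismatch, the file is renamed with a
# `wrong-' prefix in the same directory and a non-zero status is returned.
verify_md5() {
    file=$1
    expected=$2

    # `md5sum' prints "<checksum>  <filename>"; keep only the first field.
    sum=$(md5sum "$file" | awk '{print $1}')

    if [ "$sum" != "$expected" ]; then
        wrongname="$(dirname "$file")/wrong-$(basename "$file")"
        mv "$file" "$wrongname"
        echo "Wrong MD5 checksum for '$file' (moved to '$wrongname')" >&2
        return 1
    fi
    return 0
}
```

A call such as `verify_md5 inputs/image.fits "$expected" || exit 1` (the file name here is a placeholder) would stop a script as soon as an unexpected file is found. Quoting the variables, unlike in the compact Make recipe, keeps the comparison well-defined even when a value is empty.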