IMPORTANT: more generic, robust and secure INPUTS.conf and download.mk

SUMMARY: it is necessary to update your 'INPUTS.conf' and 'download.mk'.

Until now, adding an input file involved several steps that needed manual (and inconvenient!) intervention: for every file, you needed to define four variables in 'INPUTS.conf', and in 'reproduce/analysis/make/download.mk' you had to use a shell 'if/elif/else' condition (complex for a large number of files) to link the names of the input files to those variables. Besides the inconvenience, this could cause bugs (typos!). Furthermore, a basic MD5 checksum was used for verifying the files.

With this commit, a new structure has been defined for 'INPUTS.conf' that (thanks to some pretty useful GNU Make features) removes the need for users to manually edit 'reproduce/analysis/make/download.mk', and reduces the number of variables necessary for each file to three (from four). Furthermore, we now use the SHA256 checksum for input data validation.

Regarding the trick used in 'INPUTS.conf' (from the newly added description in 'download.mk'): In GNU Make, '.VARIABLES' "... expands to a list of the names of all global variables defined so far" (from the "Other Special Variables" section of the GNU Make manual). Assuming that the pattern 'INPUT-%-sha256' is only used for input files, we find all the variables that contain the input file names (the '%' is the filename). Finally, using the pattern-substitution function ('patsubst'), we remove the fixed string at the start and end of the variable name.

Steps you need to take:

 - INPUTS.conf: translate your old format to the new format (after carefully reading the description in the comments at the start of the file). After applying the new standards, you don't need to use the variables of 'INPUTS.conf' directly in your Makefiles! For example, if one of your input datasets is called 'abc.fits', the checksum variable will be 'INPUT-abc.fits-sha256' and in your high-level Makefiles, you can simply set '$(indir)/abc.fits' as a prerequisite (like you probably did already).

 - reproduce/analysis/make/download.mk: for the definition and rule of 'inputdatasets', simply use the Maneage branch, and remove anything you had added in your project.

In the process, I also noticed that 'README-hacking.md' still referred to 'master' as the main project branch, while we have used 'main' in the paper (which is also the common convention with Git).
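
As a rough illustration of the '.VARIABLES' trick described above (not part of this commit), the stand-alone Makefile below can be used to try it in isolation. The 'indir' value, the two dataset names and the placeholder checksums are assumptions made only for this sketch; the 'inputdatasets' definition mirrors the one added to 'download.mk'.

```make
# Hypothetical stand-alone Makefile (NOT part of Maneage), only to try
# the variable-name trick. All values below are illustrative assumptions.
indir = inputs

# Two made-up datasets, following the 'INPUT-%-url'/'INPUT-%-sha256'
# naming convention. The checksums are placeholders, not real SHA256 sums.
INPUT-abc.fits-url    = https://this.file/is/not/yet/public
INPUT-abc.fits-sha256 = placeholder-checksum-1
INPUT-xyz.txt-url     = https://this.file/is/not/yet/public
INPUT-xyz.txt-sha256  = placeholder-checksum-2

# From all variable names defined so far, keep those that match
# 'INPUT-%-sha256', strip the fixed prefix and suffix to recover the file
# names, and prepend the input directory to each of them.
inputdatasets = $(foreach i, \
                  $(patsubst INPUT-%-sha256,%, \
                    $(filter INPUT-%-sha256,$(.VARIABLES))), \
                  $(indir)/$(i))

# Print the derived list while the Makefile is being read.
$(info Derived input files: $(inputdatasets))

# Empty default target, so that running 'make' does nothing else.
.PHONY: all
all: ;
```

Saving this as 'Makefile' and running 'make' with GNU Make should list the two derived paths under 'inputs/'. Their order follows '.VARIABLES' and should not be relied upon.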
author: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2022-04-15 04:57:54 +0200
committer: Mohammad Akhlaghi <mohammad@akhlaghi.org> 2022-04-15 05:22:19 +0200
commit: 91799fe4b6d62230e99a1520a23a0d30c3eb963e
tree: 7b04b6e7ec5901b2d7d3f4096ec8dac2b78da001
parent: c5d7f2adbea2038d240868e0192fb306256e3b92
3 files changed, 126 insertions, 112 deletions
diff --git a/README-hacking.md b/README-hacking.md
index 1888912..24a3cea 100644
--- a/README-hacking.md
+++ b/README-hacking.md
@@ -568,7 +568,7 @@ First custom commit
      the default `origin` remote server to specify that this is Maneage's
      remote server. This will allow you to use the conventional `origin`
      name for your own project as shown in the next steps. Second, you will
-     create and go into the conventional `master` branch to start
+     create and go into the conventional `main` branch to start
      committing in your project later.
 
      ```shell
@@ -576,7 +576,7 @@ First custom commit
      $ mv project my-project                            # Change the name to your project's name.
      $ cd my-project                                    # Go into the cloned directory.
      $ git remote rename origin origin-maneage          # Rename current/only remote to "origin-maneage".
-     $ git checkout -b master                           # Create and enter your own "master" branch.
+     $ git checkout -b main                             # Create and enter your own "main" branch.
      $ pwd                                              # Just to confirm where you are.
      ```
 
@@ -631,7 +631,7 @@ First custom commit
      a new project which is bad in this scenario, and will not allow you to
      push to it). It will give you a URL (usually starting with `git@` and
      ending in `.git`), put this URL in place of `XXXXXXXXXX` in the first
-     command below. With the second command, "push" your `master` branch to
+     command below. With the second command, "push" your `main` branch to
      your `origin` remote, and (with the `--set-upstream` option) set them
      to track/follow each other. However, the `maneage` branch is currently
      tracking/following your `origin-maneage` remote (automatically set
@@ -642,7 +642,7 @@ First custom commit
 
      ```shell
      git remote add origin XXXXXXXXXX        # Newly created repo is now called 'origin'.
-     git push --set-upstream origin master   # Push 'master' branch to 'origin' (with tracking).
+     git push --set-upstream origin main     # Push 'main' branch to 'origin' (with tracking).
      git push origin maneage                 # Push 'maneage' branch to 'origin' (no tracking).
      ```
 
@@ -650,7 +650,7 @@ First custom commit
      your name (with your possible coauthors) and tentative abstract in
      `paper.tex`. You should see the relevant place in the preamble (prior
      to `\begin{document}`. Just note that some core project metadata like
-     the project tile are actually set in
+     the project title are actually set in
      `reproduce/analysis/config/metadata.conf`. So set your project title
      in there. After you are done, run the `./project make` command again
      to see your changes in the final PDF and make sure that your changes
@@ -696,13 +696,14 @@ First custom commit
        $ rm reproduce/analysis/config/delete-me*
        ```
 
-     - Disable verification of outputs by removing the `yes` from
-       `reproduce/analysis/config/verify-outputs.conf`. Later, when you are
-       ready to submit your paper, or publish the dataset, activate
-       verification and make the proper corrections in this file (described
-       under the "Other basic customizations" section below). This is a
-       critical step and only takes a few minutes when your project is
-       finished. So DON'T FORGET to activate it in the end.
+     - `reproduce/analysis/config/verify-outputs.conf`: Disable
+       verification of outputs by changing the `yes` (the value of
+       `verify-outputs`) to `no`. Later, when you are ready to submit your
+       paper, or publish the dataset, activate verification and make the
+       proper corrections in this file (described under the "Other basic
+       customizations" section below). This is a critical step and only
+       takes a few minutes when your project is finished. So DON'T FORGET
+       to activate it in the end.
 
      - Re-make the project (after a cleaning) to see if you haven't
        introduced any errors.
@@ -714,7 +715,7 @@ First custom commit
 
  7. **Ignore changes in some Maneage files**: One of the main advantages of
      Maneage is that you can later update your infra-structure by merging
-     your `master` branch with the `maneage` branch. This is good for many
+     your `main` branch with the `maneage` branch. This is good for many
      low-level features that you will likely never modify yourself. But it
      is not desired for some files like `paper.tex` (you don't want changes
      in Maneage's default `paper.tex` to cause conflicts with all the text
@@ -758,7 +759,7 @@ First custom commit
      add a copyright notice in your name under the existing one(s), like
      the line with capital letters below. To start with, add this line with
      your name and email address to `paper.tex`,
-     `tex/src/preamble-header.tex`, `reproduce/analysis/make/top-make.mk`,
+     `tex/src/preamble-project.tex`, `reproduce/analysis/make/top-make.mk`,
      and generally, all the files you modified in the previous step.
 
      ```
@@ -781,7 +782,7 @@ First custom commit
      ```
 
  10. **Your first commit**: You have already made some small and basic
-     changes in the steps above and you are in your project's `master`
+     changes in the steps above and you are in your project's `main`
      branch. So, you can officially make your first commit in your
      project's history and push it. But before that, you need to make sure
      that there are no problems in the project. This is a good habit to
@@ -838,24 +839,12 @@ Other basic customizations
      Gnuastro, go through the analysis steps in `reproduce/analysis` and
      remove all its use cases (clearly marked).
 
- - **Input dataset**: The input datasets are managed through the
-     `reproduce/analysis/config/INPUTS.conf` file. It is best to gather all
-     the information regarding all the input datasets into this one central
-     file. To ensure that the proper dataset is being downloaded and used
-     by the project, it is also recommended get an [MD5
-     checksum](https://en.wikipedia.org/wiki/MD5) of the file and include
-     that in `INPUTS.conf` so the project can check it automatically. The
-     preparation/downloading of the input datasets is done in
-     `reproduce/analysis/make/download.mk`. Have a look there to see how
-     these values are to be used. This information about the input datasets
-     is also used in the initial `configure` script (to inform the users),
-     so also modify that file. You can find all occurrences of the demo
-     dataset with the command below and replace it with your input's
-     dataset.
-
-     ```shell
-     $ grep -ir wfpc2 ./*
-     ```
+ - **Input datasets**: The input datasets are managed through the
+     `reproduce/analysis/config/INPUTS.conf` file. It is best to gather the
+     following information regarding all the input datasets into this one
+     central file: 1) the SHA256 checksum of the file, 2) the URL where the
+     file can be downloaded online. Please read the comments at the start
+     of `reproduce/analysis/config/INPUTS.conf` carefully.
 
  - **`README.md`**: Correct all the `XXXXX` place holders (name of your
      project, your own name, address of your project's online/remote
@@ -1535,10 +1524,10 @@ for the benefit of others.
         # with your project.
         $ git log --oneline --graph --decorate --all # General view of branches.
 
-        # Go to your 'master' branch and import all the updates into
-        # 'master', don't worry about the printed outputs (in particular
+        # Go to your 'main' branch and import all the updates into
+        # 'main', don't worry about the printed outputs (in particular
         # the 'CONFLICT's), we'll clean them up in the next step.
-        $ git checkout master
+        $ git checkout main
         $ git merge maneage
 
         # Ignore conflicting Maneage files that you had previously deleted
@@ -1556,7 +1545,7 @@ for the benefit of others.
         git status
 
         # TIP: If you want the changes in one file to be only from a
-        # special branch ('maneage' or 'master', completely ignoring
+        # special branch ('maneage' or 'main', completely ignoring
         # changes in the other), use this command:
         # $ git checkout <BRANCH-NAME> -- <FILENAME>
 
@@ -1579,7 +1568,7 @@ for the benefit of others.
         ./project make
 
         # When everything is OK, before continuing with your project's
-        # work, don't forget to push both your 'master' branch and your
+        # work, don't forget to push both your 'main' branch and your
         # updated 'maneage' branch to your remote server.
         git push
         git push origin maneage
diff --git a/reproduce/analysis/config/INPUTS.conf b/reproduce/analysis/config/INPUTS.conf
index 936f5f9..5a58758 100644
--- a/reproduce/analysis/config/INPUTS.conf
+++ b/reproduce/analysis/config/INPUTS.conf
@@ -1,40 +1,68 @@
-# Input files necessary for this project, the variables defined in this
-# file are primarily used in 'reproduce/analysis/make/download.mk'. See
-# there for precise usage of the variables. But comments are also provided
-# here.
-#
-# Necessary variables for each input dataset are listed below. Its good
-# that all the variables of each file have the same base-name (in the
-# example below 'DEMO') with descriptive suffixes, also put a short comment
-# above each group of variables for each dataset, shortly explaining what
-# it is.
-#
-#  1) Local file name ('DEMO-DATA' below): this is the name of the dataset
-#     on the local system (in 'INDIR', given at configuration time). It is
-#     recommended that it be the same name as the online version of the
-#     file like the case here (note how this variable is used in 'DEMO-URL'
-#     for the dataset's full URL). However, this is not always possible, so
-#     the local and server filenames may be different. Ultimately, the file
-#     name is irrelevant, we check the integrity with the checksum.
-#
-#  2) The MD5 checksum of the file ('DEMO-MD5' below): this is very
-#     important for an automatic verification of the file. You can
-#     calculate it by running 'md5sum' on your desired file. You can also
-#     use any other checksum tool that you prefer, just be sure to correct
-#     the respective command in 'reproduce/analysis/make/download.mk'.
-#
-#  3) The human-readable size of the file ('DEMO-SIZE' below): this is an
-#     optional variable, mainly to help a reader of your project get a
-#     sense of the volume they need to download if they don't already have
-#     the dataset. So it is highly recommended to add it (future readers of
-#     your project's source will appreciate it!). You can get it from the
-#     output of 'ls -lh' command on the file. Optionally you can use it in
-#     messages during the configuration phase (when Maneage asks for the
-#     input data directory), along with other info about the file(s).
-#
-#  4) The full dataset URL ('DEMO-URL' below): this is the full URL
-#     (including the file-name) that can be used to download the dataset
-#     when necessary. Also, see the description above on local filename.
+# This project's input file information (metadata).
+#
+# For each input (external) data file that is used within the project,
+# three variables are suggested here (two of them are mandatory). These
+# variables will be used by 'reproduce/analysis/make/download.mk' to import
+# the dataset into the project (within the build directory):
+#
+#   - If the file already exists locally in '$(INDIR)' (the optional input
+#     directory that may have been specified at configuration time with
+#     '--input-dir'), a symbolic link will be added in '$(indir)' (in the
+#     build directory). A symbolic link is used to avoid extra storage when
+#     files are large.
+#
+#   - If the file doesn't exist in '$(INDIR)', or no input directory was
+#     specified at configuration time, then the file is downloaded from a
+#     specific URL.
+#
+# In both cases, before placing the file (or its link) in the build
+# directory, 'reproduce/analysis/make/download.mk' will check the SHA256
+# checksum of the dataset and if it differs from the pre-defined value (set
+# for that file, here), it will abort (since this is not the intended
+# dataset).
+#
+# Therefore, the two variables specifying the URL and SHA256 checksum of
+# the file are MANDATORY. The third variable (INPUT-%-size) showing the
+# human-readable size of the file (from 'ls -lh') is optional (but
+# recommended: because it gives future scientists to get a feeling of the
+# volume of data they need to input: will become important if the
+# size/number of files is large).
+#
+# The naming convension is critical for the input files to be properly
+# imported into the project. In the patterns below, the '%' is the full
+# file name (including its prefix): for example in the demo input of this
+# file in the 'maneage' branch, we have 'INPUT-wfpc2.fits-sha256':
+# therefore, the input file (within the project's '$(indir)') is called
+# 'wfpc2.fits'. This allows you to simply set '$(indir)/wfpc2.fits' as the
+# pre-requisite of any recipe that needs the input file: you will rarely
+# (if at all!) need to use these variables directly.
+#
+#   INPUT-%-sha256: The sha256 checksum of the file. You can generate the
+#                   SHA256 checksum of a file with the 'sha256sum FILENAME'
+#                   command (where 'FILENAME' is the name of your
+#                   file). this is very important for an automatic
+#                   verification of the file: that it hasn't changed
+#                   between different runs of the project (locally or in
+#                   the URL). There are more robust checksum algorithms
+#                   like the 'SHA' standards.
+#
+#   INPUT-%-url: The URL to download the file if it is not available
+#                locally. It can happen that during the first phases of
+#                your project the data aren't yet public. In this case, you
+#                set a phony URL like this (just as a clear place-holder):
+#                'https://this.file/is/not/yet/public'.
+#
+#   INPUT-%-size: The human-readable size of the file (output of 'ls
+#                 -lh'). This is not used by default but can help other
+#                 scientists who would like to run your project get a
+#                 good feeling of the necessary network and storage
+#                 capacity that is necessary to start the project.
+#
+# The input dataset's name (that goes into the '%') can be different from
+# the URL's file name (last component of the URL, after the last '/'). Just
+# note that it is assumed that the local copy (outside of your project) is
+# also called '%' (if your local copy of the input dataset and the only
+# repository names are the same, be sure to set '%' accordingly).
 #
 # Copyright (C) 2018-2022 Mohammad Akhlaghi <mohammad@akhlaghi.org>
 #
@@ -48,7 +76,6 @@
 
 
 # Demo dataset used in the histogram plot (remove when customizing).
-DEMO-DATA = WFPC2ASSNu5780205bx.fits
-DEMO-MD5  = a4791e42cd1045892f9c41f11b50bad8
-DEMO-SIZE = 62K
-DEMO-URL  = https://fits.gsfc.nasa.gov/samples/$(DEMO-DATA)
+INPUT-wfpc2.fits-size = 62K
+INPUT-wfpc2.fits-url  = https://fits.gsfc.nasa.gov/samples/WFPC2ASSNu5780205bx.fits
+INPUT-wfpc2.fits-sha256 = 9851bc2bf9a42008ea606ec532d04900b60865daaff2f233e5c8565dac56ad5f
diff --git a/reproduce/analysis/make/download.mk b/reproduce/analysis/make/download.mk
index e652c17..6e67962 100644
--- a/reproduce/analysis/make/download.mk
+++ b/reproduce/analysis/make/download.mk
@@ -27,22 +27,20 @@
 # Download input data
 # --------------------
 #
-# The input dataset properties are defined in
-# '$(pconfdir)/INPUTS.conf'. For this template we only have one dataset to
-# enable easy processing, so all the extra checks in this rule may seem
-# redundant.
+# 'reproduce/analysis/config/INPUTS.conf' contains the input dataset
+# properties. In most cases, you will not need to edit this rule (or
+# file!). Simply follow the instructions of 'INPUTS.conf' and set the
+# variables names according to the described standards.
 #
-# In a real project, you will need more than one dataset. In that case,
-# just add them to the target list and add an 'elif' statement to define it
-# in the recipe.
-#
-# Files in a server usually have very long names, which are mainly designed
-# for helping in data-base management and being generic. Since Make uses
-# file names to identify which rule to execute, and the scope of this
-# research project is much less than the generic survey/dataset, it is
-# easier to have a simple/short name for the input dataset and work with
-# that. In the first condition of the recipe below, we connect the short
-# name with the raw database name of the dataset.
+# TECHNICAL NOTE on the '$(foreach, n ...)' loop of 'inputdatasets': we are
+# using several (relatively complex!) features particular to Make: In GNU
+# Make, '.VARIABLES' "... expands to a list of the names of all global
+# variables defined so far" (from the "Other Special Variables" section of
+# the GNU Make manual). Assuming that the pattern 'INPUT-%-sha256' is only
+# used for input files, we find all the variables that contain the input
+# file name (the '%' is the filename). Finally, using the
+# pattern-substitution function ('patsubst'), we remove the fixed string at
+# the start and end of the variable name.
 #
 # Download lock file: Most systems have a single connection to the
 # internet, therefore downloading is inherently done in series. As a
@@ -53,16 +51,16 @@
 # progress at every moment.
 $(indir):; mkdir $@
 downloadwrapper = $(bashdir)/download-multi-try
-inputdatasets = $(foreach i, wfpc2, $(indir)/$(i).fits)
-$(inputdatasets): $(indir)/%.fits: | $(indir) $(lockdir)
+inputdatasets = $(foreach i, \
+                  $(patsubst INPUT-%-sha256,%, \
+                    $(filter INPUT-%-sha256,$(.VARIABLES))), \
+                  $(indir)/$(i))
+$(inputdatasets): $(indir)/%: | $(indir) $(lockdir)
 
-#	Set the necessary parameters for this input file.
-	if   [ $* = wfpc2 ]; then
-	  localname=$(DEMO-DATA); url=$(DEMO-URL); mdf=$(DEMO-MD5);
-	else
-	echo; echo; echo "Not recognized input dataset: '$*.fits'."
-	echo; echo; exit 1
-	fi
+#	Set the necessary parameters for this input file as shell variables
+#	(to help in readability).
+	url=$(INPUT-$*-url)
+	sha=$(INPUT-$*-sha256)
 
 #	Download (or make the link to) the input dataset. If the file
 #	exists in 'INDIR', it may be a symbolic link to some other place in
@@ -72,25 +70,25 @@ $(inputdatasets): $(indir)/%.fits: | $(indir) $(lockdir)
 #	GNU Coreutils). If its not a link, the 'readlink' part has no
 #	effect.
 	unchecked=$@.unchecked
-	if [ -f $(INDIR)/$$localname ]; then
-	  ln -fs $$(readlink -f $(INDIR)/$$localname) $$unchecked
+	if [ -f $(INDIR)/$* ]; then
+	  ln -fs $$(readlink -f $(INDIR)/$*) $$unchecked
 	else
 	  touch $(lockdir)/download
 	  $(downloadwrapper) "wget --no-use-server-timestamps -O" \
 	                     $(lockdir)/download $$url $$unchecked
 	fi
 
-#	Check the md5 sum to see if this is the proper dataset.
-	sum=$$(md5sum $$unchecked | awk '{print $$1}')
-	if [ $$sum = $$mdf ]; then
+#	Check the checksum to see if this is the proper dataset.
+	sum=$$(sha256sum $$unchecked | awk '{print $$1}')
+	if [ $$sum = $$sha ]; then
 	  mv $$unchecked $@
 	  echo "Integrity confirmed, using $@ in this project."
 	else
 	  echo; echo;
-	  echo "Wrong MD5 checksum for input file '$$localname':"
+	  echo "Wrong SHA256 checksum for input file '$*':"
 	  echo "  File location: $$unchecked"; \
-	  echo "  Expected MD5 checksum:   $$mdf"; \
-	  echo "  Calculated MD5 checksum: $$sum"; \
+	  echo "  Expected SHA256 checksum:   $$sha"; \
+	  echo "  Calculated SHA256 checksum: $$sum"; \
 	  echo; exit 1
 	fi
 
@@ -104,4 +102,4 @@ $(inputdatasets): $(indir)/%.fits: | $(indir) $(lockdir)
 # It is very important to mention the address where the data were
 # downloaded in the final report.
 $(mtexdir)/download.tex: $(pconfdir)/INPUTS.conf | $(mtexdir)
-	echo "\\newcommand{\\wfpctwourl}{$(DEMO-URL)}" > $@
+	echo "\\newcommand{\\wfpctwourl}{$(INPUT-wfpc2.fits-url)}" > $@