From 43186705c89e99fd4dbf5549ad031d27d77dfe6f Mon Sep 17 00:00:00 2001 From: Mohammad Akhlaghi Date: Thu, 25 Aug 2022 17:02:29 +0200 Subject: Added server authentication and FITS DATASUM for verficiation SUMMARY: Nothing special is necessary for your existing projects. This commit just addds two new features (read the commit description for more): 1. To provide a user and password to servers that need authentication before they allow downloading of proprietary data, 2. To use the FITS Standard's DATASUM for file verification (for cases where the file is not static on the server, and is generated upon receiving your download request). Until now, Maneage didn't have any infrastructure for databases that require authentication (through a user or password, when calling 'wget'). Furthermore, when the downloaded file is automatically generated by the server upon request, the server usually adds metadata (like file date, or query number and etc) in the header. Therefore the simple SHA256 checksum of the file would differ on every download! This made it very hard to verify if the data (not headers) are unchanged. With this commit, both these problems have been addressed: - Server authentication: the 'reproduce/software/config/LOCAL.conf' now contains three new variables for this purpose. With them, you can give your username and password, along with the authentication method of the server. The comments on top of these three variables give a full description of their usage. - Verifying only the data in a file (ignoring the headers): The 'reproduce/analysis/config/INPUTS.conf' now accepts two new optional variables for each input file using the FITS standard's DATASUM convention: 'INPUT-%-fitsdatasum' and 'INPUT-%-fitshdu'. If the SHA256 isn't specified for a file, Maneage will use these to verify the file. With the latter, you specify the HDU of the data you want to verify and with the former you give the DATASUM value for that HDU. As the name suggests, this is only valid for FITS files. If we find other formats that support a similar behavior, we can add this feature for those formats also. This is also thoroughly discussed in the comments of 'reproduce/analysis/config/INPUTS.conf'. This commit was done with the help of Pedram Ashofte Ardakani, Sepideh Eskandarlou and Mohammadreza Khellat. --- reproduce/analysis/config/INPUTS.conf | 115 +++++++++++++++++++++++----------- 1 file changed, 77 insertions(+), 38 deletions(-) (limited to 'reproduce/analysis/config') diff --git a/reproduce/analysis/config/INPUTS.conf b/reproduce/analysis/config/INPUTS.conf index 75e24de..5970ae3 100644 --- a/reproduce/analysis/config/INPUTS.conf +++ b/reproduce/analysis/config/INPUTS.conf @@ -1,9 +1,10 @@ # This project's input file information (metadata). # # For each input (external) data file that is used within the project, -# three variables are suggested here (two of them are mandatory). These -# variables will be used by 'reproduce/analysis/make/download.mk' to import -# the dataset into the project (within the build directory): +# three variables are suggested here (only the verification variable is +# strictly mandatory). These variables will be used by the download rule of +# 'reproduce/analysis/make/initialize.mk' to import the dataset into the +# project (within the build directory): # # - If the file already exists locally in '$(INDIR)' (the optional input # directory that may have been specified at configuration time with @@ -12,27 +13,53 @@ # files are large. # # - If the file doesn't exist in '$(INDIR)', or no input directory was -# specified at configuration time, then the file is downloaded from a -# specific URL. +# specified at configuration time, then the file is downloaded from the +# specified URL for that dataset. # # In both cases, before placing the file (or its link) in the build -# directory, 'reproduce/analysis/make/download.mk' will check the SHA256 -# checksum of the dataset and if it differs from the pre-defined value (set -# for that file, here), it will abort (since this is not the intended -# dataset). -# -# Therefore, the two variables specifying the URL and SHA256 checksum of -# the file are MANDATORY. The third variable (INPUT-%-size) showing the -# human-readable size of the file (from 'ls -lh') is optional (but -# recommended: because it gives future scientists to get a feeling of the -# volume of data they need to input: will become important if the -# size/number of files is large). +# directory, the download rule of 'reproduce/analysis/make/initialize.mk' +# will check the verification of the dataset and if it differs from the +# pre-defined value (set for that file, here), it will abort (since this is +# not the intended dataset). # +# Verification (two modes) +# ------------------------ +# - SHA256 checksum. This will check the full contents of the file, and +# is generic to any data format. However, if the server inserts custom +# headers like the query date or query code and etc, this form of +# validation is not useful: because every download will have different +# headers. In such cases, you should use the other verification methods +# below. In other words, this method is only good for files that are +# "static" on the server (and left there unchanged). If the file is +# generated at request time, the server usually inserts custom run-time +# dependent headers; making it impossible to verify with an SHA +# checksum of the whole file. +# - The FITS Standard's 'DATASUM' (which will only check the data, not +# the headers). According to the FITS standard, this sum ignores all +# headers, and is only calculated on a HDU's data. By default, this +# will require Gnuastro (which can easily calculate and return the +# value on the command-line), and it assumes HDU number 1 (counting +# from 0). You can modify the defaults by modifying the rule in +# 'reproduce/analysis/make/initialize.mk'. +# +# Automatic writing of verification +# --------------------------------- +# In case you would like Maneage to find the checksum upon downloading, put +# the string '--auto-replace--' instead of a checksum. This can be helpful +# for large datasets; where downloading only for adding the checksum is not +# easy/possible and can be buggy. In this scenario, upon downloading the +# file its checksum will be calculated and will be replaced with the +# '--auto-replace--' in this file. But since this file is under version +# control, be sure to commit all the updated checksums after your downloads +# are finished! +# +# Variable description +# -------------------- # The naming convension is critical for the input files to be properly -# imported into the project. In the patterns below, the '%' is the full -# file name (including its suffix): for example in the demo input of this -# file in the 'maneage' branch, we have 'INPUT-wfpc2.fits-sha256': -# therefore, the input file (within the project's '$(indir)') is called +# imported into Maneage. In the patterns below, the '%' is the full file +# name (including its suffix): for example in the demo input of this file +# in the 'maneage' branch, we have 'INPUT-wfpc2.fits-sha256': therefore, +# the input file (within the project's '$(indir)') is called # 'wfpc2.fits'. This allows you to simply set '$(indir)/wfpc2.fits' as the # pre-requisite of any recipe that needs the input file: you will rarely # (if at all!) need to use these variables directly. @@ -40,23 +67,17 @@ # INPUT-%-sha256: The sha256 checksum of the file. You can generate the # SHA256 checksum of a file with the 'sha256sum FILENAME' # command (where 'FILENAME' is the name of your -# file). this is very important for an automatic -# verification of the file: that it hasn't changed -# between different runs of the project (locally or in -# the URL). There are more robust checksum algorithms -# like the 'SHA' standards. -# -# AUTOMATIC CHEKSUM CALCULATION: In case you would like -# Maneage to find the checksum upon downloading, put the -# string '--auto-replace--' instead of a checksum. This -# can be helpful for large datasets; where downloading -# only for adding the checksum is not easy/possible and -# can be buggy. In this scenario, upon downloading the -# file its checksum will be calculated and will be -# replaced with the '--auto-replace--' in this file. But -# since this file is under version control, be sure to -# commit all the updated checksums after your downloads -# are finished! +# file). Don't use this if you give the 'fitsdatasum' +# keyvalue. +# +# INPUT-%-fitsdatasum: The FITS standard DATASUM value for HDU number 1 +# of the FITS file (counting from 0). Don't use this +# if you give the 'sha256' keyword. +# +# INPUT-%-fitshdu: The HDU identifier (counter from 0, or name) to use +# for the verification. This is only relevant in the +# 'fitsdatasum' verification method and optional (if not +# given, HDU number 1 is used; counting from 0). # # INPUT-%-url: The URL to download the file if it is not available # locally. It can happen that during the first phases of @@ -70,6 +91,13 @@ # good feeling of the necessary network and storage # capacity that is necessary to start the project. # +# Therefore, the the verification variable is MANDATORY in any case. The +# variable with a URL is only necessary if you do not have the file +# locally. However, The size variable is optional (but recommended: because +# it gives future scientists a feeling of the volume of data they need to +# input to run your project: will become important if the size/number of +# files is large). +# # The input dataset's name (that goes into the '%') can be different from # the URL's file name (last component of the URL, after the last '/'). Just # note that it is assumed that the local copy (outside of your project) is @@ -87,7 +115,18 @@ -# Demo dataset used in the histogram plot (remove when customizing). +# Demo dataset used in the histogram plot +# --------------------------------------- +# +# Remove this part while you are entering your project's datasets. +# +# Since the demonstration dataset is a FITS file, we have also added the +# two '$(INPUT-%-fits*)' variables as a demonstration. But they are +# commented because the SHA256 method is also possible for this file (its +# not generated on the server at query time; it is a static file on the +# server). INPUT-wfpc2.fits-size = 62K INPUT-wfpc2.fits-url = https://fits.gsfc.nasa.gov/samples/WFPC2ASSNu5780205bx.fits INPUT-wfpc2.fits-sha256 = 9851bc2bf9a42008ea606ec532d04900b60865daaff2f233e5c8565dac56ad5f +#INPUT-wfpc2.fits-fitshdu = 0 +#INPUT-wfpc2.fits-fitsdatasum = 2218330266 -- cgit v1.2.1