From 8463df97c6f26ec4d22cd5828bb0574fd5e450d2 Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Mon, 4 Oct 2021 02:51:45 +0200
Subject: IMPORTANT: Updates to almost all software

This commit primarily affects the configuration step of Maneage'd
projects, in particular through updated versions of much of the software
(see P.S.). So it shouldn't affect your high-level analysis other than the
version bumps of the software you use (and the software's possibly
improved/changed behavior).

The following software (and thus their dependencies) couldn't be updated,
as described below:

  - Cryptography: isn't building because it depends on a new
    setuptools-rust package that has problems
    (https://savannah.nongnu.org/bugs/index.php?61731), so it has been
    commented out in 'versions.conf'.

  - SecretStorage: because it depends on Cryptography.

  - Keyring: because it depends on SecretStorage.

  - Astroquery: because it depends on Keyring.

This is a "squashed" commit after rebasing a development branch of 60
commits corresponding to a roughly two-month time interval. The following
people contributed to this branch:

  - Boudewijn Roukema added all the R software infrastructure and the R
    packages, and greatly helped in fixing many bugs during the update.

  - Raul Infante-Sainz helped in testing and debugging the build.

  - Pedram Ashofteh Ardakani found and fixed a bug.

  - Zahra Sharbaf helped in testing and found several bugs.

The most noteworthy points are described below.

  - Software tarballs: all updated software now have a unified format
    tarball (ustar; if not possible, pax) and unified compression (Lzip)
    in Maneage's software repository in Zenodo
    (https://doi.org/10.5281/zenodo.3883409). For more on this, see
    https://savannah.nongnu.org/task/?15699 . This won't affect any extra
    software you would like to add; you can use any format recognized by
    GNU Tar, and all common compression algorithms.
    This new requirement is only for software that gets merged into the
    core Maneage branch.

  - Metastore (and thus libbsd and libmd) moved to highlevel: Metastore
    (and the packages it depends on) is a high-level product that is only
    relevant during project development (like Emacs!): when the user
    wants the file metadata (like dates) to be unchanged after checking
    out branches. So it should be considered high-level software, not
    basic. Metastore also usually causes many more headaches and error
    messages, so personally, I have stopped using it! Instead I simply
    merge my branches in a separate clone, then pull the merge commit: in
    this way, the files of my project aren't re-written during the
    checkout phase and therefore their dates are untouched (re-written
    dates can conflict with Make's dates on configuration files).

  - The un-official cloned version of Flex (2.6.4-91 until this commit)
    was causing problems in the building of Netpbm, so with this commit,
    it has been moved back to version 2.6.4.

  - Netpbm's official page had version 10.73.38 as the latest stable
    tarball, released in late 2021. But I couldn't find our
    previously-used version 10.86.99 anywhere (to see when it was
    released and why we used it! It's at least more than one year old!).
    So the official stable version is being used now.

  - Improved instructions in 'README.md' for building the software
    environment in a Docker container (while having the project source
    and output data products on the local system; including the usage of
    the host's '/dev/shm' to speed up temporary operations).

  - Until now, the convention in Maneage was to put eight SPACE
    characters before the comment lines within recipes. This was done
    because by default GNU Emacs (and many other editors) show a TAB as
    eight characters. However, in other text editors, online browsers, or
    even the Git diff, a TAB can correspond to a different number of
    characters.
    In such cases, the Maneage recipes wouldn't look too interesting (the
    comments and the recipe commands would show a different
    indentation!). With this commit, all the comment lines in the
    Makefiles within the core Maneage branch have a hash ('#') as their
    first character and a TAB as the second. This gives the comment lines
    in recipes the same indentation as the code, making the code much
    easier to read in a general scenario, including a 'git diff' (editor
    agnostic!).

P.S. List of updated software with their old and new versions
  - Software with no version update are not mentioned.
  - The old version of newly added software is shown with '--'.

    Name (Basic)             Old version    New version
    ------------             -----------    -----------
    Bzip2                    1.0.6          1.0.8
    CURL                     7.71.1         7.79.1
    Dash                     0.5.10.2       0.5.11.5
    File                     5.39           5.41
    Flock                    0.2.3          0.4.0
    GNU Bash                 5.0.18         5.1.8
    GNU Binutils             2.35           2.37
    GNU Coreutils            8.32           9.0
    GNU GCC                  10.2.0         11.2.0
    GNU M4                   1.4.18         1.4.19
    GNU Readline             8.0            8.1.1
    GNU Tar                  1.32           1.34
    GNU Texinfo              6.7            6.8
    GNU diffutils            3.7            3.8
    GNU findutils            4.7.0          4.8.0
    GNU gmp                  6.2.0          6.2.1
    GNU grep                 3.4            3.7
    GNU gzip                 1.10           1.11
    GNU libunistring         0.9.10         1.0
    GNU mpc                  1.1.0          1.2.1
    GNU mpfr                 4.0.2          4.1.0
    GNU nano                 5.2            6.0
    GNU ncurses              6.2            6.3
    GNU wget                 1.20.3         1.21.2
    Git                      2.28.0         2.34.0
    Less                     563            590
    Libxml2                  2.9.9          2.9.12
    Lzip                     1.22-rc2       1.22
    OpenSSL                  1.1.1a         3.0.0
    Patchelf                 0.10           0.13
    Perl                     5.32.0         5.34.0
    Podlators                --             4.14

    Name (Highlevel)         Old version    New version
    ----------------         -----------    -----------
    Apachelog4cxx            0.10.0-603     0.12.1
    Astrometry.net           0.80           0.85
    Boost                    1.73.0         1.77.0
    CFITSIO                  3.48           4.0.0
    Cmake                    3.18.1         3.21.4
    Eigen                    3.3.7          3.4.0
    Expat                    2.2.9          2.4.1
    FFTW                     3.3.8          3.3.10
    Flex                     2.6.4-91       2.6.4
    Fontconfig               2.13.1         2.13.94
    Freetype                 2.10.2         2.11.0
    GNU Astronomy Utilities  0.12           0.16.1-e0f1
    GNU Autoconf             2.69.200-babc  2.71
    GNU Automake             1.16.2         1.16.5
    GNU Bison                3.7            3.8.2
    GNU Emacs                27.1           27.2
    GNU GDB                  9.2            11.1
    GNU GSL                  2.6            2.7
    GNU Help2man             1.47.11        1.48.5
    Ghostscript              9.52           9.55.0
    ICU                      --             70.1
    ImageMagick              7.0.8-67       7.1.0-13
    Libbsd                   0.10.0         0.11.3
    Libffi                   3.2.1          3.4.2
    Libgit2                  1.0.1          1.3.0
    Libidn                   1.36           1.38
    Libjpeg                  9b             9d
    Libmd                    --             1.0.4
    Libtiff                  4.0.10         4.3.0
    Libx11                   1.6.9          1.7.2
    Libxt                    1.2.0          1.2.1
    Netpbm                   10.86.99       10.73.38
    OpenBLAS                 0.3.10         0.3.18
    OpenMPI                  4.0.4          4.1.1
    Pixman                   0.38.0         0.40.0
    Python                   3.8.5          3.10.0
    R                        4.0.2          4.1.2
    SWIG                     3.0.12         4.0.2
    Util-linux               2.35           2.37.2
    Util-macros              1.19.2         1.19.3
    Valgrind                 3.15.0         3.18.1
    WCSLIB                   7.3            7.7
    Xcb-proto                1.14           1.14.1
    Xorgproto                2020.1         2021.5

    Name (Python)            Old version    New version
    -------------            -----------    -----------
    Astropy                  4.0            5.0
    Beautifulsoup4           4.7.1          4.10.0
    Beniget                  --             0.4.1
    Cffi                     1.12.2         1.15.0
    Cryptography             2.6.1          36.0.1
    Cycler                   0.10.0         0.11.0
    Cython                   0.29.21        0.29.24
    Esutil                   0.6.4          0.6.9
    Extension-helpers        --             0.1
    Galsim                   2.2.1          2.3.3
    Gast                     --             0.5.3
    Jinja2                   --             3.0.3
    MPI4py                   3.0.3          3.1.3
    Markupsafe               --             2.0.1
    Numpy                    1.19.1         1.21.3
    Packaging                --             21.3
    Pillow                   --             8.4.0
    Ply                      --             3.11
    Pyerfa                   --             2.0.0.1
    Pyparsing                2.3.1          3.0.4
    Pythran                  --             0.11.0
    Scipy                    1.5.2          1.7.3
    Setuptools               41.6.0         58.3.0
    Six                      1.12.0         1.16.0
    Uncertainties            3.1.2          3.1.6
    Wheel                    --             0.37.0

    Name (R)                 Old version    New version
    --------                 -----------    -----------
    Cli                      --             2.5.0
    Colorspace               --             2.0-1
    Cowplot                  --             1.1.1
    Crayon                   --             1.4.1
    Digest                   --             0.6.27
    Ellipsis                 --             0.3.2
    Fansi                    --             0.5.0
    Farver                   --             2.1.0
    Ggplot2                  --             3.3.4
    Glue                     --             1.4.2
    GridExtra                --             2.3
    Gtable                   --             0.3.0
    Isoband                  --             0.2.4
    Labeling                 --             0.4.2
    Lifecycle                --             1.0.0
    Magrittr                 --             2.0.1
    MASS                     --             7.3-54
    Mgcv                     --             1.8-36
    Munsell                  --             0.5.0
    Pillar                   --             1.6.1
    R-Pkgconfig              --             2.0.3
    R6                       --             2.5.0
    RColorBrewer             --             1.1-2
    Rlang                    --             0.4.11
    Scales                   --             1.1.1
    Tibble                   --             3.1.2
    Utf8                     --             1.2.1
    Vctrs                    --             0.3.8
    ViridisLite              --             0.4.0
    Withr                    --             2.4.2
---
 reproduce/analysis/config/INPUTS.conf                | 2 +-
 reproduce/analysis/config/delete-me-squared-num.conf | 2 +-
 reproduce/analysis/config/metadata.conf              | 2 +-
 reproduce/analysis/config/pdf-build.conf             | 2 +-
 reproduce/analysis/config/verify-outputs.conf        | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

(limited to 'reproduce/analysis/config')

diff --git a/reproduce/analysis/config/INPUTS.conf b/reproduce/analysis/config/INPUTS.conf
index b969945..936f5f9 100644
--- a/reproduce/analysis/config/INPUTS.conf
+++ b/reproduce/analysis/config/INPUTS.conf
@@ -36,7 +36,7 @@
 # (including the file-name) that can be used to download the dataset
 # when necessary. Also, see the description above on local filename.
 #
-# Copyright (C) 2018-2021 Mohammad Akhlaghi
+# Copyright (C) 2018-2022 Mohammad Akhlaghi
 #
 # Copying and distribution of this file, with or without modification, are
 # permitted in any medium without royalty provided the copyright notice and
diff --git a/reproduce/analysis/config/delete-me-squared-num.conf b/reproduce/analysis/config/delete-me-squared-num.conf
index 13068e4..4df2101 100644
--- a/reproduce/analysis/config/delete-me-squared-num.conf
+++ b/reproduce/analysis/config/delete-me-squared-num.conf
@@ -1,6 +1,6 @@
 # Number of samples in the demonstration analysis (to be deleted).
 #
-# Copyright (C) 2019-2021 Mohammad Akhlaghi
+# Copyright (C) 2019-2022 Mohammad Akhlaghi
 #
 # Copying and distribution of this file, with or without modification, are
 # permitted in any medium without royalty provided the copyright notice and
diff --git a/reproduce/analysis/config/metadata.conf b/reproduce/analysis/config/metadata.conf
index aaf2ca0..0241136 100644
--- a/reproduce/analysis/config/metadata.conf
+++ b/reproduce/analysis/config/metadata.conf
@@ -15,7 +15,7 @@
 # and the copyright license name and standard link to the fully copyright
 # license.
 #
-# Copyright (C) 2020-2021 Mohammad Akhlaghi
+# Copyright (C) 2020-2022 Mohammad Akhlaghi
 #
 # Copying and distribution of this file, with or without modification, are
 # permitted in any medium without royalty provided the copyright notice and
diff --git a/reproduce/analysis/config/pdf-build.conf b/reproduce/analysis/config/pdf-build.conf
index 015bf2e..a57b529 100644
--- a/reproduce/analysis/config/pdf-build.conf
+++ b/reproduce/analysis/config/pdf-build.conf
@@ -12,7 +12,7 @@
 # LaTeX. Otherwise, a notice will just printed that, no PDF will be
 # created.
 #
-# Copyright (C) 2018-2021 Mohammad Akhlaghi
+# Copyright (C) 2018-2022 Mohammad Akhlaghi
 #
 # Copying and distribution of this file, with or without modification, are
 # permitted in any medium without royalty provided the copyright notice and
diff --git a/reproduce/analysis/config/verify-outputs.conf b/reproduce/analysis/config/verify-outputs.conf
index d96f293..37fc43c 100644
--- a/reproduce/analysis/config/verify-outputs.conf
+++ b/reproduce/analysis/config/verify-outputs.conf
@@ -1,6 +1,6 @@
 # To enable verification of output datasets set this variable to 'yes'.
 #
-# Copyright (C) 2019-2021 Mohammad Akhlaghi
+# Copyright (C) 2019-2022 Mohammad Akhlaghi
 #
 # Copying and distribution of this file, with or without modification, are
 # permitted in any medium without royalty provided the copyright notice and
-- 
cgit v1.2.1


From 91799fe4b6d62230e99a1520a23a0d30c3eb963e Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Fri, 15 Apr 2022 04:57:54 +0200
Subject: IMPORTANT: more generic, robust and secure INPUTS.conf and download.mk

SUMMARY: it is necessary to update your 'INPUTS.conf' and 'download.mk'.

Until now, adding an input file involved several steps that needed manual
(and inconvenient!) intervention: for every file, you needed to define
four variables in 'INPUTS.conf', and in
'reproduce/analysis/make/download.mk' you had to use a shell
'if/elif/else' condition (complex for a large number of files) to link the
names of the input files to those variables. Besides the inconvenience,
this could cause bugs (typos!). Furthermore, a basic MD5 checksum was used
for verifying the files.

With this commit, a new structure has been defined for 'INPUTS.conf' that
(thanks to some pretty useful GNU Make features) removes the need for
users to manually edit 'reproduce/analysis/make/download.mk', and reduces
the number of variables necessary for each file to three (from four).
Furthermore, we now use the SHA256 checksum for input data validation.

Regarding the trick used in 'INPUTS.conf' (from the newly added
description in 'download.mk'):

    In GNU Make, '.VARIABLES' "... expands to a list of the names of all
    global variables defined so far" (from the "Other Special Variables"
    section of the GNU Make manual). Assuming that the pattern
    'INPUT-%-sha256' is only used for input files, we find all the
    variables that contain the input file names (the '%' is the
    filename). Finally, using the pattern-substitution function
    ('patsubst'), we remove the fixed string at the start and end of the
    variable name.

Steps you need to take:

  - INPUTS.conf: translate your old format to the new format (after
    carefully reading the description in the comments at the start of the
    file). After applying the new standards, you don't need to use the
    variables of 'INPUTS.conf' directly in your Makefiles! For example,
    if one of your input datasets is called 'abc.fits', the checksum
    variable will be 'INPUT-abc.fits-sha256' and in your high-level
    Makefiles, you can simply set '$(indir)/abc.fits' as a prerequisite
    (like you probably did already).

  - reproduce/analysis/make/download.mk: for the definition and rule of
    'inputdatasets', simply use the Maneage branch, and remove anything
    you had added in your project.

In the process, I also noticed that 'README-hacking.md' still referred to
'master' as the main project branch, while we have used 'main' in the
paper (which is also the common convention with Git).
---
 reproduce/analysis/config/INPUTS.conf | 109 +++++++++++++++++++++-------------
 1 file changed, 68 insertions(+), 41 deletions(-)

(limited to 'reproduce/analysis/config')

diff --git a/reproduce/analysis/config/INPUTS.conf b/reproduce/analysis/config/INPUTS.conf
index 936f5f9..5a58758 100644
--- a/reproduce/analysis/config/INPUTS.conf
+++ b/reproduce/analysis/config/INPUTS.conf
@@ -1,40 +1,68 @@
-# Input files necessary for this project, the variables defined in this
-# file are primarily used in 'reproduce/analysis/make/download.mk'. See
-# there for precise usage of the variables. But comments are also provided
-# here.
-#
-# Necessary variables for each input dataset are listed below. Its good
-# that all the variables of each file have the same base-name (in the
-# example below 'DEMO') with descriptive suffixes, also put a short comment
-# above each group of variables for each dataset, shortly explaining what
-# it is.
-#
-# 1) Local file name ('DEMO-DATA' below): this is the name of the dataset
-# on the local system (in 'INDIR', given at configuration time). It is
-# recommended that it be the same name as the online version of the
-# file like the case here (note how this variable is used in 'DEMO-URL'
-# for the dataset's full URL). However, this is not always possible, so
-# the local and server filenames may be different. Ultimately, the file
-# name is irrelevant, we check the integrity with the checksum.
-#
-# 2) The MD5 checksum of the file ('DEMO-MD5' below): this is very
-# important for an automatic verification of the file. You can
-# calculate it by running 'md5sum' on your desired file. You can also
-# use any other checksum tool that you prefer, just be sure to correct
-# the respective command in 'reproduce/analysis/make/download.mk'.
-#
-# 3) The human-readable size of the file ('DEMO-SIZE' below): this is an
-# optional variable, mainly to help a reader of your project get a
-# sense of the volume they need to download if they don't already have
-# the dataset. So it is highly recommended to add it (future readers of
-# your project's source will appreciate it!). You can get it from the
-# output of 'ls -lh' command on the file. Optionally you can use it in
-# messages during the configuration phase (when Maneage asks for the
-# input data directory), along with other info about the file(s).
-#
-# 4) The full dataset URL ('DEMO-URL' below): this is the full URL
-# (including the file-name) that can be used to download the dataset
-# when necessary. Also, see the description above on local filename.
+# This project's input file information (metadata).
+#
+# For each input (external) data file that is used within the project,
+# three variables are suggested here (two of them are mandatory). These
+# variables will be used by 'reproduce/analysis/make/download.mk' to import
+# the dataset into the project (within the build directory):
+#
+# - If the file already exists locally in '$(INDIR)' (the optional input
+# directory that may have been specified at configuration time with
+# '--input-dir'), a symbolic link will be added in '$(indir)' (in the
+# build directory). A symbolic link is used to avoid extra storage when
+# files are large.
+#
+# - If the file doesn't exist in '$(INDIR)', or no input directory was
+# specified at configuration time, then the file is downloaded from a
+# specific URL.
+#
+# In both cases, before placing the file (or its link) in the build
+# directory, 'reproduce/analysis/make/download.mk' will check the SHA256
+# checksum of the dataset and if it differs from the pre-defined value (set
+# for that file, here), it will abort (since this is not the intended
+# dataset).
+#
+# Therefore, the two variables specifying the URL and SHA256 checksum of
+# the file are MANDATORY. The third variable (INPUT-%-size) showing the
+# human-readable size of the file (from 'ls -lh') is optional (but
+# recommended: because it gives future scientists to get a feeling of the
+# volume of data they need to input: will become important if the
+# size/number of files is large).
+#
+# The naming convension is critical for the input files to be properly
+# imported into the project. In the patterns below, the '%' is the full
+# file name (including its prefix): for example in the demo input of this
+# file in the 'maneage' branch, we have 'INPUT-wfpc2.fits-sha256':
+# therefore, the input file (within the project's '$(indir)') is called
+# 'wfpc2.fits'. This allows you to simply set '$(indir)/wfpc2.fits' as the
+# pre-requisite of any recipe that needs the input file: you will rarely
+# (if at all!) need to use these variables directly.
+#
+# INPUT-%-sha256: The sha256 checksum of the file. You can generate the
+# SHA256 checksum of a file with the 'sha256sum FILENAME'
+# command (where 'FILENAME' is the name of your
+# file). this is very important for an automatic
+# verification of the file: that it hasn't changed
+# between different runs of the project (locally or in
+# the URL). There are more robust checksum algorithms
+# like the 'SHA' standards.
+#
+# INPUT-%-url: The URL to download the file if it is not available
+# locally. It can happen that during the first phases of
+# your project the data aren't yet public. In this case, you
+# set a phony URL like this (just as a clear place-holder):
+# 'https://this.file/is/not/yet/public'.
+#
+# INPUT-%-size: The human-readable size of the file (output of 'ls
+# -lh'). This is not used by default but can help other
+# scientists who would like to run your project get a
+# good feeling of the necessary network and storage
+# capacity that is necessary to start the project.
+#
+# The input dataset's name (that goes into the '%') can be different from
+# the URL's file name (last component of the URL, after the last '/'). Just
+# note that it is assumed that the local copy (outside of your project) is
+# also called '%' (if your local copy of the input dataset and the only
+# repository names are the same, be sure to set '%' accordingly).
 #
 # Copyright (C) 2018-2022 Mohammad Akhlaghi
 #
@@ -48,7 +76,6 @@
 
 
 # Demo dataset used in the histogram plot (remove when customizing).
-DEMO-DATA = WFPC2ASSNu5780205bx.fits
-DEMO-MD5 = a4791e42cd1045892f9c41f11b50bad8
-DEMO-SIZE = 62K
-DEMO-URL = https://fits.gsfc.nasa.gov/samples/$(DEMO-DATA)
+INPUT-wfpc2.fits-size = 62K
+INPUT-wfpc2.fits-url = https://fits.gsfc.nasa.gov/samples/WFPC2ASSNu5780205bx.fits
+INPUT-wfpc2.fits-sha256 = 9851bc2bf9a42008ea606ec532d04900b60865daaff2f233e5c8565dac56ad5f
-- 
cgit v1.2.1
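
The '.VARIABLES' trick described in the second commit message can be illustrated with the stand-alone Makefile below. It is only a minimal sketch under stated assumptions, not the contents of Maneage's 'reproduce/analysis/make/download.mk': the names 'inputnames' and 'list-inputs' and the value given to 'indir' are hypothetical, the symbolic-link step for files already present in '$(INDIR)' is left out, and the download rule assumes that 'curl' and 'sha256sum' are available on the host. Recipe lines must start with a TAB.

```make
# Hypothetical stand-alone sketch (NOT Maneage's actual download.mk).
# 'indir' is set here only so the example can run on its own.
indir = inputs

# Same three-variable structure as the new 'INPUTS.conf'.
INPUT-wfpc2.fits-size   = 62K
INPUT-wfpc2.fits-url    = https://fits.gsfc.nasa.gov/samples/WFPC2ASSNu5780205bx.fits
INPUT-wfpc2.fits-sha256 = 9851bc2bf9a42008ea606ec532d04900b60865daaff2f233e5c8565dac56ad5f

# Keep only the variable names matching 'INPUT-%-sha256', then strip the
# fixed prefix and suffix so that only the file names remain.
inputnames = $(patsubst INPUT-%-sha256,%,$(filter INPUT-%-sha256,$(.VARIABLES)))

# Full paths of the inputs within the (assumed) input directory.
inputdatasets = $(addprefix $(indir)/,$(inputnames))

# Print the derived input paths, one per line.
.PHONY: list-inputs
list-inputs:
	@for f in $(inputdatasets); do echo "$$f"; done

# Static pattern rule: '$*' is the file name (the '%'), so the URL and
# checksum variables of each file are found from the target name alone.
$(inputdatasets): $(indir)/%:
	mkdir -p $(@D)
	curl -L -o $@.tmp "$(INPUT-$*-url)"
	echo "$(INPUT-$*-sha256)  $@.tmp" | sha256sum -c -
	mv $@.tmp $@
```

With only the demo variables defined, 'make list-inputs' would list 'inputs/wfpc2.fits', and 'make inputs/wfpc2.fits' would download the file and abort if its SHA256 checksum differs from the stored value. Supporting a new input in this sketch only needs its 'INPUT-%-url' and 'INPUT-%-sha256' variables; the rules themselves stay untouched, which is the point of the new 'INPUTS.conf' structure.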