paper-concept.git - Paper (Towards Long-term and Archivable Reproducibility)

Age	Commit message	Author	Lines
2020-01-27	Initial scripts compatible with Dash (minimalistic POSIX)	Mohammad Akhlaghi	-34/+46
	Until now, the initial project scripts were primarily tested with GNU Bash. But Bash is not generally available on all systems (it has many features beyond POSIX). Because of this, we were effectively imposing the requirement on the user that they must have Bash installed. We recently started addressing this by setting the shebang of `project' and `reproduce/software/bash/configure.sh' to `/bin/sh'. After doing so, Raul and Gaspar reported an error on their systems. To fix the problem, I installed Dash (a minimalist POSIX-compliant shell) on my computer and temporarily set the shebangs to `/bin/dash', ran the project configuration step and fixed all issues that came up. With this commit, it can go all the way to building GCC on my system's Dash. After this stage (when `high-level.mk' is called), there is no problem, because we have our own version of GNU Bash and that installed version is used. Probably some more issues still remain and will hopefully be found in the future. While doing this, I also noticed the following three minor issues: - The `./project configure' option `--input-dir' was not recognized because it was mistakenly checking `--inputdir'. It has been corrected. - The test C programs now use the `<<EOF' method instead of `echo'. - In `basic.mk', the extra space between `syspath' and `:=' was removed (it was an ancient relic!).
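	The `<<EOF' method mentioned above can be sketched as follows. This is only an illustration, not the project's actual test program: the temporary directory, the file name and the C source are all hypothetical, and the quoted delimiter ('EOF') is a choice made here so that the shell does not touch the backslashes in the C source.

```shell
# Hypothetical temporary directory and file name, for illustration only.
tmpdir=$(mktemp -d)
testsource="$tmpdir/test.c"

# A quoted delimiter ('EOF') stops the shell from expanding `$' or
# backslashes, so the C source is written exactly as typed. With `echo',
# the handling of `\n' differs between shells: Dash's `echo' interprets
# it, while Bash's built-in `echo' does not by default.
cat > "$testsource" <<'EOF'
#include <stdio.h>
int main(void) { printf("Good!\n"); return 0; }
EOF
```

	Because a here-document is part of POSIX, the same lines behave identically under Dash and Bash.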
2020-01-23	Hashbangs of project and configure.sh set to /bin/sh	Mohammad Akhlaghi	-1/+1
	Until now, the hashbang of these two shell scripts was set to `/bin/bash', hence assuming that GNU Bash exists on the host system! But this is an extra requirement on the host operating system and these two scripts should be written such that they operate on a POSIX shell (the generic `/bin/sh' which can point to any shell program). With this commit this has been implemented! We may confront some errors as the system is run on other systems, but we should fix such errors and work hard to make these two scripts as POSIX-compatible as possible (runnable on any shell, so as not to force users to install Bash before running the project). This completes Task #15525.
2020-01-23	IMPORTANT: Project preparation is now also done with project make	Mohammad Akhlaghi	-2/+2
	Until now, the main commands to run the project were these: `./project configure' (to build the software), `./project prepare' (to possibly arrange input datasets and build special configuration Makefiles) and finally `./project make' to run the project. The main logic behind the "prepare" phase (`top-prepare.mk') is to build configuration files that can be fed into the "make" step and optimize its operation, for example when the number of inputs needed for the majority of the analysis is much smaller than the total number of inputs. With "prepare" (when necessary), you go through the raw inputs and select the ones that are necessary for the rest of the project. The output of `top-prepare.mk' is a configuration file (a Make variable) that keeps the IDs (numbers, names, etc). That configuration file would then be used in `top-make.mk' to identify the lower-level targets and allow optimal project organization and management. But the last two are both part of the analysis, and while they indeed need different calls to Make to be executed, many projects don't actually need a preparation phase: ultimately, it's an implementation choice by the project developers and doesn't concern the project users (or the developers when they are running it). To avoid confusing the users, or simply annoying them when a project doesn't need it, with this commit, the top-level `top-prepare.mk' and `top-make.mk' Makefiles are called with the single `./project make' command and `./project prepare' has been dropped. I noticed this while writing the paper on this system.
2020-01-22	Better explanation for missing static C library	Mohammad Akhlaghi	-12/+22
	Until now, the explanation for a missing static C library didn't actually guide the users to look above and see the error message! So with this commit, I edited it a little to be more clear (and mention to look above). Also, I noticed that on Amazon AWS systems, the static C library is installed as a separate package, so to help the users, I added the necessary command and some better explanation.
2020-01-20	IMPORTANT!!! Configuration Makefiles now have a .conf suffix	Mohammad Akhlaghi	-10/+10
	Until now, the configuration Makefiles (in `reproduce/software/config/installation' and `reproduce/analysis/config') had a `.mk' suffix, similar to the workhorse Makefiles. Although they are indeed Makefiles, given their nature (to only keep configuration parameters), it is confusing (especially to early users) for them to also have a `.mk' suffix (similar to the analysis or software building Makefiles). To address this issue, with this commit, all the configuration Makefiles (in those directories) are now given a `.conf' suffix. This is also assumed for all the files that are loaded. The configuration (software building) and running of the template have been checked with this change from scratch, but please report any error that may not have been noticed. THIS IS AN IMPORTANT CHANGE AND WILL CAUSE CRASHES OR UNEXPECTED BEHAVIORS FOR PROJECTS THAT HAVE BRANCHED FROM THIS TEMPLATE. PLEASE CORRECT THE SUFFIX OF ALL YOUR PROJECT'S CONFIGURATION MAKEFILES (IN THE DIRECTORIES ABOVE), OTHERWISE THEY AREN'T AUTOMATICALLY LOADED ANYMORE.
2020-01-19	Corrected typo in last commit (forgetting \ at end of line)	Mohammad Akhlaghi	-1/+1
	In the previous commit, I had forgotten to add a `\' after the newly added `sys_library_path' variable to the `high-level.mk' call.
2020-01-19	LIBRARY_PATH is set accordingly based on the host	Mohammad Akhlaghi	-16/+40
	Until now, GCC wouldn't build properly on Debian-based operating systems because `ld' needed to link with several necessary C library features like `crti.o' and `crtn.o' (this is an `ld' issue, not GCC). The solution is to add the directory containing them to `LIBRARY_PATH'. In the previous commit, I actually searched for these files, but while testing on another system, I noticed that it can be problematic (other architectures may exist). With this commit, we are actually finding the build architecture of the running GCC (which is the same as that of `ld') and using that to add a fixed directory to `LIBRARY_PATH'.
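	A minimal sketch of this idea follows. The commit does not say how the architecture is obtained or which directory is used, so everything here is an assumption: `gcc -dumpmachine' is used to get the machine triplet, `/usr/lib/TRIPLET' is assumed to be the Debian-style directory, and the helper names `triplet_to_libdir' and `prepend_path' are hypothetical.

```shell
# Hypothetical helper: given a machine triplet (for example the output
# of `gcc -dumpmachine'), print the assumed multiarch library directory.
triplet_to_libdir() {
    printf '/usr/lib/%s\n' "$1"
}

# Hypothetical helper: prepend directory $1 to the colon-separated list
# $2 without leaving a stray colon when the list is empty.
prepend_path() {
    if [ -z "$2" ]; then
        printf '%s\n' "$1"
    else
        printf '%s:%s\n' "$1" "$2"
    fi
}

if command -v gcc >/dev/null 2>&1; then
    libdir=$(triplet_to_libdir "$(gcc -dumpmachine)")
    # Only use the directory if the start-up file `ld' needs is there.
    if [ -f "$libdir/crti.o" ]; then
        LIBRARY_PATH=$(prepend_path "$libdir" "${LIBRARY_PATH:-}")
        export LIBRARY_PATH
    fi
fi
```

	Deriving one directory from the compiler's own triplet avoids picking up the start-up files of a different architecture that happens to be installed on the same host.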
2020-01-19	Better search for static C library at start of configuration	Mohammad Akhlaghi	-108/+75
	Until now, to see if a working static C library and `sys/cdefs.h' exist, we were checking absolute locations like `/usr/include/sys/cdefs.h' or `/usr/lib/libc.a' and `/usr/lib64/libc.a'. But this is not robust because on different systems, they can be in different locations. With this commit, we actually use `find' to find the location of `libc.a' and use that to add elements to CPPFLAGS and LDFLAGS. This should fix the problem on systems that have them on non-standard locations.
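	The search described above might look roughly like the sketch below. The helper name `find_static_libc_dir' and the list of search roots are hypothetical, and only the `LDFLAGS' part is shown because the commit does not say how the matching include directory for `CPPFLAGS' is derived.

```shell
# Hypothetical helper: print the directory of the first `libc.a' found
# under the given search roots (prints nothing if none is found).
find_static_libc_dir() {
    found=$(find "$@" -name 'libc.a' 2>/dev/null | head -n 1)
    if [ -n "$found" ]; then
        dirname "$found"
    fi
}

# Assumed search roots, for illustration only.
libcdir=$(find_static_libc_dir /usr/lib /usr/lib64 /lib)
if [ -n "$libcdir" ]; then
    LDFLAGS="-L$libcdir ${LDFLAGS:-}"
fi
```

	Search roots that do not exist on the host are silently skipped, so the same call can be used on systems with different directory layouts.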
2020-01-01	Copyright statements updated to include 2020	Mohammad Akhlaghi	-4/+4
	Now that it's 2020, it's necessary to include this year in the copyright statements.
2019-11-08	Minor corrections applied in software builds	Mohammad Akhlaghi	-4/+4
	While working on a different branch to build the GNU C Library, I noticed a few places in the template that need corrections which are now applied: 1. A new-line character after the "C compiler works" notice at the start of the configure script. 2. Removing possible `::' in the `LD_LIBRARY_PATH' definition of `basic.mk'. Note that it's not necessary in the other steps because we don't use any outside-defined `LD_LIBRARY_PATH'. 3. Building GMP for C++ and also with `--enable-fat'. 4. Removing the unpacked Perl tarball directory after its installation.
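	Item 2 above can be illustrated with a small shell helper. This is a sketch only: the name `clean_path_list' is hypothetical and the actual correction lives in the Makefile `basic.mk', not in a shell function. The clean-up matters because, on GNU/Linux, the dynamic loader treats an empty element of `LD_LIBRARY_PATH' as the current directory.

```shell
# Hypothetical helper: collapse repeated colons and strip leading and
# trailing ones from a colon-separated path list.
clean_path_list() {
    printf '%s\n' "$1" | sed -e 's/::*/:/g' -e 's/^://' -e 's/:$//'
}

# Example: an outside-defined value that was empty leaves a `::' behind.
clean_path_list "/build/lib::/usr/lib:"
```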
2019-10-01	Minor corrections in configure and prepare phase	Mohammad Akhlaghi	-1/+0
	Since ImageMagick can take a long time to build, we are now building it in parallel. Also, the part where we replace an `_' with `\_' in the software version at the end of the configure script was removed. It is clearer and more readable for the actual rule that includes such a name to deal with the underscore (as is the case for `sip_tpv', which already dealt with it). Finally, I noticed that the checks at the start of `top-prepare' were missing new-lines. I had forgotten that the Make single-shell variable isn't activated in this stage yet.
2019-10-01	Preparation phase added before final building	Mohammad Akhlaghi	-3/+3
	In many real-world scenarios, `./project make' can really benefit from having some basic information about the data before being run, for example when querying a server. If we know how many datasets were downloaded and their general properties, it can greatly optimize the process when we are designing the solution to be run in `./project make'. Therefore with this commit, a new phase has been added to the template's design: `./project prepare'. In the raw template this is empty, because the simple analysis done in the template doesn't warrant it. But everything is ready for projects using the template to add preparation phases prior to the analysis.
2019-10-01	New versions for ImageMagick, Python, Numpy Scipy and Matplotlib	Mohammad Akhlaghi	-0/+0
	It has been some time since these software packages were last updated! With this commit the template now uses the most recent stable release of these packages. Also, the hosting server for ImageMagick was moved to my own webpage because unfortunately ImageMagick removes its tarballs from its own version.
2019-09-16	Configure script won't allow build directory to be under source	Mohammad Akhlaghi	-4/+28
	Users that are not familiar with the file structure of the project may specify the current directory (top-level source directory) as their build-directory. This will cause a crash right after answering the questions, where `rm' will complain about `tex/build' not being deleted because it exists as a directory. To avoid such confusing situations, the configure script now checks if the build directory is actually a sub-directory of the source. If it is, it will complain with a short message and abort. Also, a `CAUTION' statement has been put in the initial description, right on top of the question. This bug was reported by Carlos Allende Prieto and David Valls-Gabaud.
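	One way to make such a check in a POSIX shell is sketched below. The helper name `is_under_dir' is hypothetical and this is not necessarily how the configure script does it; the sketch assumes both directories already exist and resolves symbolic links with `pwd -P' before comparing.

```shell
# Hypothetical helper: succeed (status 0) when directory $1 is the same
# as, or located under, directory $2. Status 1 means it is elsewhere and
# status 2 means one of the two directories could not be entered.
is_under_dir() {
    child=$(cd "$1" 2>/dev/null && pwd -P) || return 2
    parent=$(cd "$2" 2>/dev/null && pwd -P) || return 2
    # The trailing slash makes sure `/src-old' is not seen as under `/src'.
    case "$child/" in
        "$parent"/*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

	A configure script could call this with the requested build directory and the source directory, print its short message and abort when the status is 0.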
2019-08-08	Static PatchELF only built when static C library exists	Mohammad Akhlaghi	-7/+59
	Until now, when building PatchELF, we would always require that it be done statically. However, some systems don't have a static C library available for linking. This causes a crash in the static building of PatchELF. But a static PatchELF is necessary for correcting RPATH in GCC's outputs. With this commit, in the configure script we check if a static C library is linkable for the compiler. If it isn't, then `host_cc' will be set to 1 and GCC won't be built. We also pass the result of this test to `basic.mk' (through `good_static_lib'), so if a static C library isn't available, it builds a dynamically linked PatchELF. This bug was reported by Elham Saremi.
2019-08-07	Removed temporary files to check Fortran compiler	Mohammad Akhlaghi	-2/+3
	Until now, the Fortran compiler check wouldn't delete the files it creates in the temporary software building directory. With this commit, the cleaning steps have been added.
2019-08-05	Configuration: Checking that C and Fortran compilers work	Mohammad Akhlaghi	-116/+199
	Until now, we were just checking for the existence of a C and Fortran compiler. But it can happen that even if they exist, they don't operate properly, for example see some errors that have been reported until now in the P.S. (both on different macOS systems). But discovering such a problem after the build has started is frustrating for the user. With this commit, before we start building anything, we'll check these two compilers with a simple program and see if they can indeed compile, and if their compiled program can run. If it doesn't work, an elaborate error message is printed to help the users navigate to a solution. Also, the building of `flock' within `configure.sh' has been moved just before calling `basic.mk'. This was done so any warning/error message is printed before actually building anything. This fixes bug #56715. P.S. The error messages: C compiler ---------- conftest.c:9:19: fatal error: stdio.h: No such file or directory ^ compilation terminated. ---------- Fortran compiler ---------------- dyld: Library not loaded: @rpath/libisl.10.dylib Referenced from: /path/to/anaconda2/gcc/libexec/gcc/x86_64-apple-darwin15.5.0/4.9.3/f951 Reason: image not found gfortran: internal compiler error: Abort trap: 6 ----------------
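	The "compile, then run" check for the C compiler could be sketched as below. The helper name `check_c_compiler', the file names and the test program are hypothetical, and the project's elaborate error message is left out; only the two-stage test itself is shown.

```shell
# Hypothetical helper: succeed only if the compiler command in $1 can
# compile a minimal C program AND the resulting executable runs.
check_c_compiler() {
    cc_cmd=$1
    workdir=$(mktemp -d) || return 1
    cat > "$workdir/test.c" <<'EOF'
#include <stdio.h>
int main(void) { printf("works\n"); return 0; }
EOF
    status=1
    if "$cc_cmd" "$workdir/test.c" -o "$workdir/test" >/dev/null 2>&1 \
       && "$workdir/test" >/dev/null 2>&1; then
        status=0
    fi
    rm -rf "$workdir"
    return $status
}

# Usage: report the result for the host's `cc' (if there is one).
if check_c_compiler cc; then
    echo "C compiler works"
else
    echo "C compiler is missing or broken" >&2
fi
```

	Running the compiled program, and not only compiling it, is what would catch a compiler whose run-time libraries cannot be loaded.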
2019-08-01	Bash startup script for every recipe	Mohammad Akhlaghi	-0/+45
	Until now the only way to define the environment of the Make recipes was through the exported Make variables (mostly in `initialize.mk' for the analysis steps for example). However, there is only so much you can do with environment variables! In some situations you want slightly more complicated environment control, like setting an alias or running of scripts (things that are commonly done in the `~/.bashrc' file of users to configure their interactive, non-login shells). With this commit, a `reproduce/software/bash/bashrc.sh' has been defined for this job (which is currently empty!). Every major Make step of the project adds this file as the `BASH_ENV' environment variable, so the shell that is created to execute a recipe first executes this file, then the recipe. Each top-level Makefile also defines a `PROJECT_STATUS' environment variable that enables users to limit their environment setup based on the condition in which it is being set up (in particular in the early phase of `basic.mk', where the user can't make any assumption about the programs and has to write a portable shell script).
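	The effect of `BASH_ENV' can be seen outside of Make with the small sketch below. The temporary file stands in for `bashrc.sh' and the function `project_greeting' is hypothetical; a shell function is used in place of an alias because Bash does not expand aliases in non-interactive shells by default. The sketch assumes `bash' is in the `PATH'.

```shell
# Hypothetical startup file standing in for `bashrc.sh'.
startup=$(mktemp)
cat > "$startup" <<'EOF'
project_greeting() { echo "environment ready"; }
EOF

# A non-interactive Bash reads the file named in BASH_ENV before running
# the command, so the function defined there is available to it.
result=$(BASH_ENV="$startup" bash -c 'project_greeting')
echo "$result"
rm -f "$startup"
```

	Each Make recipe is run by such a non-interactive shell, which is why exporting `BASH_ENV' from the Makefile is enough to have the file loaded before every recipe.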
2019-07-29	Checking software tarball checksums before building software	Mohammad Akhlaghi	-3/+16
	Until now, there was no check on the integrity of the contents of the downloaded/copied software tarballs; we only relied on the tarball name. This could be bad for reproducibility and security, for example on one server the name of a tarball may be the same but with different content. With this commit, the SHA512 checksums of all the software are stored in the newly created `checksums.mk' (similar to how the versions are stored in `versions.mk'). The resulting variable is then defined for each software and after downloading/copying the file we check to see if the new tarball has the same checksum as the stored value. If it doesn't, the script will crash with an error, informing the user of the problem. The only limitation now is a bootstrapping problem: if the host system doesn't already have an `sha512sum' executable, we will not do any checksum verification until we install our `sha512sum' (as part of GNU Coreutils). All the tarballs downloaded after GNU Coreutils are built will have their checksums validated. By default almost all GNU/Linux systems will have a usable `sha512sum' (it has been part of GNU Coreutils for a long time: according to the Coreutils Changelog file, at least since 2013). This completes task #15347.
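	The verification step, including the bootstrapping case, could be sketched as follows. The helper name `verify_sha512' is hypothetical and the way the expected value reaches the shell (from `checksums.mk') is not shown.

```shell
# Hypothetical helper: compare the SHA512 checksum of file $1 with the
# expected value $2. Returns 0 on a match, 1 on a mismatch, and 0 with a
# warning when no `sha512sum' is available yet (the bootstrapping case).
verify_sha512() {
    if ! command -v sha512sum >/dev/null 2>&1; then
        echo "warning: sha512sum not found, skipping check of $1" >&2
        return 0
    fi
    actual=$(sha512sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        return 0
    fi
    echo "error: checksum mismatch for $1" >&2
    return 1
}
```

	A download rule would call this right after fetching the tarball and stop the build on a non-zero status, so a tarball with the right name but different content never reaches the build step.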
2019-07-29	Tip added at initial configuration notice on how to see progress	Mohammad Akhlaghi	-0/+10
	The configuration step (building all the necessary software) can take some time. It is natural for the user to want to see how the build is going (which software is being built at every moment). So far, we have only put an "Inspecting status" section in `README-hacking.md' that describes a solution, but some early users may not have read it yet. With this commit a short tip was added in the initial installation notice to inform the user of this very useful command.
2019-07-28	Single wrapper instead of old ./configure, Makefile and ./for-group	Mohammad Akhlaghi	-0/+1224
	Until now, to work on a project, it was necessary to `./configure' it and build the software. Then we had to run `.local/bin/make' to run the project and do the analysis every time. If the project was a shared project between many users on a large server, it was necessary to call the `./for-group' script. This way of managing the project had a major problem: since the user directly called the lower-level `./configure' or `.local/bin/make' it was not possible to provide high-level control (for example limiting the environment variables). This was especially noticed recently with a bug that was related to environment variables (bug #56682). With this commit, this problem is solved using a single script called `project' in the top directory. To configure and build the project, users can now run these commands: $ ./project configure $ ./project make To work on the project with other users in a group these commands can be used: $ ./project configure --group=GROUPNAME $ ./project make --group=GROUPNAME The old options to both configure and make the project are still valid. Run `./project --help' to see a list. For example: $ ./project configure -e --host-cc $ ./project make -j8 The old `configure' script has been moved to `reproduce/software/bash/configure.sh' and is called by the new `./project' script. The `./project' script now just manages the options, then passes control to the `configure.sh' script. For the "make" step, it also reads the options, then calls Make. So in the lower-level nothing has changed. Only the `./project' script is now the single/direct user interface of the project. On a parallel note: as part of bug #56682, we also found out that on some macOS systems, the `DYLD_LIBRARY_PATH' environment variable has to be set to blank. This is no problem because RPATH is automatically set in macOS and the executables and libraries contain the absolute address of the libraries they should link with. 
But having `DYLD_LIBRARY_PATH' can conflict with some low-level system libraries and cause very hard to debug linking errors (like that reported in the bug report). This fixes bug #56682.
2019-04-15	New architecture to separate software-building and analysis steps	Mohammad Akhlaghi	-0/+149
	Until now, the software building and analysis steps of the pipeline were intertwined. However, these steps (of how to build a software, and how to use it) are logically completely independent. Therefore with this commit, the pipeline now has a new architecture (particularly in the `reproduce' directory) to emphasize this distinction: The `reproduce' directory now has the two `software' and `analysis' subdirectories and the respective parts of the previous architecture have been broken up between these two based on their function. There is also no more `src' directory. The `config' directory for software and analysis is now mixed with the language-specific directories. Some of the software versions were also updated after some checks with their webpages. This new architecture will allow much more focused work on each part of the pipeline (to install the software and to run them for an analysis).