Next: Future improvements, Previous: Customization checklist, Up: About
The following is a list of design points, tips, and recommendations learned from experience with this type of project management. Please don't hesitate to share with us any experience you gain from using it, so that we can add it here (giving full credit) for the benefit of others.
Modularity: Modularity is the key to easy and clean growth of a project. So it is always best to break up a job into as many sub-components as reasonable. Here are some tips to stay modular.
Short recipes: if you see the recipe of a rule becoming more than a handful of lines which involve significant processing, it is probably a good sign that you should break up the rule into its main components. Try to only have one major processing step per rule.
Context-based (many) Makefiles: For maximum modularity, this design
allows easy inclusion of many Makefiles: in
reproduce/analysis/make/*.mk for analysis steps, and
reproduce/software/make/*.mk for building software. So keep the rules
for closely related parts of the processing in separate Makefiles.
Descriptive names: Be very clear and descriptive with the naming of the files and the variables because a few months after the processing, it will be very hard to remember what each one was for. Also this helps others (your collaborators or other people reading the project source after it is published) to more easily understand your work and find their way around.
Naming convention: As the project grows, following a single standard
or convention in naming the files is very useful. Try your best to use
multiple-word filenames for anything that is non-trivial (separating
the words with a -). For example, if you have a Makefile for creating
a catalog and another two for processing it under models A and B, you
can name them like this: catalog-create.mk, catalog-model-a.mk and
catalog-model-b.mk. In this way, when listing the contents of
reproduce/analysis/make to see all the Makefiles, those related to the
catalog will all be close to each other and thus easily found. This
also helps auto-completion by the shell or by text editors like Emacs.
Source directories: If you need to add files in other languages (for
example shell, Python, AWK or C), keep the files of each language in
a separate directory under reproduce/analysis, with the appropriate
name.
Configuration files: If your research uses special programs as part
of the processing, put all their configuration files in a devoted
directory (with the program's name) within reproduce/software/config,
similar to the reproduce/software/config/gnuastro directory (which is
put in Maneage as a demo in case you use GNU Astronomy Utilities). It
is much cleaner and more readable (thus less buggy) to avoid mixing
the configuration files, even if there is no technical necessity.
Contents: It is good practice to follow the recommendations below on the contents of your files, whether they are source code for a program, Makefiles, scripts or configuration files (copyrights aren't necessary for the latter).
Copyright: Always start a file containing programming constructs
with a copyright statement like the ones that Maneage starts with
(for example in the top level Makefile).
Comments: Comments are vital for readability (by yourself in two months, or others). Describe everything you can about why you are doing something, how you are doing it, and what you expect the result to be. Write the comments as if you were describing the variable, recipe or rule to a friend sitting beside you. When writing the project it is very tempting to just steam ahead with commands and code, but be patient and write comments before the rules or recipes. This will also allow you to think more about what you should be doing. Also, in several months when you come back to the code, you will appreciate the effort of writing them. Just don't forget to also read and update the comment first if you later want to make changes to the code (variable, recipe or rule). As a general rule of thumb: first the comments, then the code.
File title: In general, it is good practice to start all files with a single-line description of what that particular file does. If further information about the file as a whole is necessary, add it after a blank line. This will help a fast inspection where you don't care about the details, but just want to remember/see what that file is (generally) for. This information must of course be commented (it's for a human), but it is kept separate from the general recommendation on comments, because this is a comment for the whole file, not each step within it.
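As a hedged illustration of the three recommendations above (title, copyright, comments), the sketch below shows how the top of a small shell script might look. The copyright holder, the license wording and the count_rows function are hypothetical placeholders, not text taken from Maneage; copy the actual notice from an existing Maneage file.

```shell
#!/bin/sh
# Count the data rows of a plain-text table.
#
# Lines that start with '#' are treated as comments and blank lines are
# ignored, so only rows that actually contain data are counted.
#
# Copyright (C) YEAR Your Name <your@email.address>   (placeholder)
# (Put the license notice of your project here, like the one at the
# top of Maneage's own files.)

# Why: later steps need the number of rows to check that none was lost.
# How: AWK skips comment lines and blank lines, and counts the rest.
# Expected result: a single integer printed on standard output.
count_rows() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

Note that the one-line title comes first, the longer description follows after a blank comment line, and the comment on the function was written before the function itself.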
Make programming: Here are some lessons that we have learned over the years of using Make and that are useful in research contexts.
Environment of each recipe: If you need to define a special
environment (or aliases, or scripts to run) for all the recipes in
your Makefiles, you can use the Bash startup file
reproduce/software/shell/bashrc.sh. This file is loaded before every
Make recipe is run, just like the .bashrc in your home directory is
loaded every time you start a new interactive, non-login terminal. See
the comments in that file for more.
Automatic variables: These are wonderful and very useful Make
constructs that greatly shrink the text, while helping readability,
robustness (fewer bugs from typos, for example) and generalization.
For example, even when a rule only has one target or one prerequisite,
always use $@ instead of the target's name, $< instead of the first
prerequisite, $^ instead of the full list of prerequisites, and so on.
You can see the full list of automatic variables in the GNU Make
manual. If you use GNU Make, you can also see this page on your
command-line:
info make "automatic variables"
Debug: Since Make doesn't follow the common top-down paradigm, it
can be a little hard to get accustomed to why you get an error or
unexpected behavior. In such cases, run Make with the -d option. With
this option, Make prints a full list of exactly which prerequisites
are being checked for which targets. Looking (patiently) through this
output and searching for the faulty file/step will clearly show you
any mistake you might have made in defining the targets or
prerequisites.
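The output of make -d can be thousands of lines long, so it may help to first reduce it to the lines that record Make's decisions. The sketch below is one possible filter, not part of Maneage: the function name make_decisions is hypothetical, and the matched phrases are the English messages printed by GNU Make, so treat the pattern as an assumption to adjust for your Make version and locale.

```shell
# Keep only the lines of 'make -d' output that record a decision about
# a target: which target is being considered, which prerequisite is
# newer than its target, and whether the target has to be remade.
make_decisions() {
    grep -E 'Considering target file|is newer than target|Must remake target|No need to remake target'
}

# Usage (the debug output is read from standard input):
#   make -d 2>&1 | make_decisions | less
```

Once the suspicious target is found in this shortened view, you can search for its name in the full output to see the surrounding details.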
Large files: If you are dealing with very large files (thus having
multiple copies of them for intermediate steps is not possible), one
solution is the following strategy (also see the next item on "Fast
access to temporary files"). Set a small plain text file as the
actual target and delete the large file when it is no longer needed
by the project (in the last rule that needs it). Below is a simple
demonstration of doing this. In it, we use Gnuastro's Arithmetic
program to add 2 to all pixels of the input image and create
large1.fits. We then subtract 2 from large1.fits to create large2.fits
and delete large1.fits in the same rule (when it is no longer needed).
We can later do the same with large2.fits when it is no longer needed,
and so on.
large1.fits.txt: input.fits
	astarithmetic $< 2 + --output=$(subst .txt,,$@)
	echo "done" > $@

large2.fits.txt: large1.fits.txt
	astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@)
	rm $(subst .txt,,$<)
	echo "done" > $@
A more advanced Make programmer will use Make's call function to
define a wrapper in reproduce/analysis/make/initialize.mk. This
wrapper will replace $(subst .txt,,XXXXX). Therefore, it will be
possible to greatly simplify this repetitive statement and make the
code even more readable throughout the whole project.
Fast access to temporary files: Most Unix-like operating systems
will give you a special shared-memory device (directory): on systems
using the GNU C Library (all GNU/Linux systems), it is /dev/shm. The
contents of this directory are actually in your RAM, not in your
persistent storage like the HDD or SSD. Reading and writing from/to
the RAM is much faster than persistent storage, so if you have enough
RAM available, it can be very beneficial for large temporary files to
be put there. You can use the mktemp program to give the temporary
files a randomly-set name, and use text files as targets to keep that
name (as described in the item above under "Large files") for later
deletion. For example, see the minimal working example Makefile below
(which you can actually put in a Makefile and run if you have an
input.fits in the same directory, and Gnuastro is installed).
.ONESHELL:
.SHELLFLAGS = -ec
all: mean-std.txt

shm-maneage := /dev/shm/$(shell whoami)-maneage-XXXXXXXXXX

large1.txt: input.fits
	out=$$(mktemp $(shm-maneage))
	astarithmetic $< 2 + --output=$$out.fits
	echo "$$out" > $@

large2.txt: large1.txt
	input=$$(cat $<)
	out=$$(mktemp $(shm-maneage))
	astarithmetic $$input.fits 2 - --output=$$out.fits
	rm $$input.fits $$input
	echo "$$out" > $@

mean-std.txt: large2.txt
	input=$$(cat $<)
	aststatistics $$input.fits --mean --std > $@
	rm $$input.fits $$input
The important point here is that the temporary name template
(shm-maneage) has no suffix. So you can add the suffix corresponding
to your desired format afterwards (for example $$out.fits, or
$$out.txt). But more importantly, when mktemp sets the random name, it
also checks that no file exists with that name and creates a file
with that exact name at that moment. So at the end of each recipe
above, you'll have two files in your /dev/shm: one empty file with no
suffix and one with a suffix. The role of the file without a suffix is
just to ensure that the randomly set name will not be used by other
calls to mktemp (when running in parallel), and it should be deleted
together with the file containing a suffix. This is the reason behind
the rm $$input.fits $$input command above: to make sure that first
the file with a suffix is deleted, then the core random file (note
that when working in parallel on powerful systems, in the time between
deleting two files of a single rm command, many things can happen!).
When using Maneage, you can put the definition of shm-maneage in
reproduce/analysis/make/initialize.mk to be usable in all the
different Makefiles of your analysis, and you won't need the three
lines above it. Finally, BE RESPONSIBLE: after you are finished, be
sure to clean up any possibly remaining files (due to crashes in the
processing while you are working), otherwise your RAM may fill up very
fast. You can do it easily with a command like this on your
command-line: rm -f /dev/shm/$(whoami)-*
Software tarballs and raw inputs: It is critically important to document the raw inputs to your project (software tarballs and raw input data):
Keep the source tarball of dependencies: After configuration
finishes, the .build/software/tarballs directory will contain all
the software tarballs that were necessary for your project. You can
mirror the contents of this directory to keep a backup of all the
software tarballs used in your project (possibly as another version
controlled repository) that is also published with your project. Note
that software web-pages are not written in stone and can suddenly go
offline or not be accessible in some conditions. This backup is thus
very important. If you intend to release your project in a place like
Zenodo, you can upload/keep all the necessary tarballs (and data)
there with your project. zenodo.1163746 is one example of how the
data, Gnuastro (main software used) and all of Gnuastro's major
dependencies have been uploaded with the project's source. Just note
that this is only possible for free and open-source software.
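When mirroring the tarballs, it can also help to store a checksum for each one, so that a later copy of the backup can be verified. The sketch below is not part of Maneage: the function name checksum_tarballs and the output file name checksums.sha256 are hypothetical, and it assumes that the sha256sum program of GNU Coreutils is available.

```shell
# Write the SHA-256 checksum of every regular file in a directory into
# 'checksums.sha256' inside that directory. The directory is given as
# the first argument, for example '.build/software/tarballs'. The body
# runs in a subshell, so the caller's working directory is unchanged.
checksum_tarballs() (
    cd "$1" || exit 1
    list=$(mktemp) || exit 1
    for f in *; do
        # Don't record the checksum file of a previous run.
        [ "$f" = checksums.sha256 ] && continue
        [ -f "$f" ] && sha256sum "$f"
    done > "$list"
    mv "$list" checksums.sha256
)
```

To verify a copy of the backup later, enter the mirrored directory and run sha256sum -c checksums.sha256 there.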
Keep your input data: The input data is also critical to the project's reproducibility, so like the above for software, make sure you have a backup of them, or their persistent identifiers (PIDs).
Version control: Version control is a critical component of Maneage. Here are some tips to help in effectively using it.
Regular commits: It is important (and extremely useful) to have the history of your project under version control. So try to make commits regularly (after any meaningful change/step/result).
Keep Maneage up-to-date: In time, Maneage is going to become more and more mature and robust (thanks to your feedback and the feedback of other users). Bugs will be fixed and new/improved features will be added. So every once in a while, you can run the commands below to pull new work that is done in Maneage. If the changes are useful for your work, you can merge them with your project to benefit from them. Just pay very close attention to resolving possible conflicts which might happen in the merge (for example in settings that you have customized and that Maneage has since updated).
git checkout maneage
git pull # Get recent work in Maneage
git log XXXXXX..XXXXXX --reverse # Inspect new work (replace XXXXXXs with hashes mentioned in output of previous command).
git log --oneline --graph --decorate --all # General view of branches.
git checkout master # Go to your top working branch.
git merge maneage # Import all the work into master.
Adding Maneage to a fork of your project: As you and your colleagues
continue your project, it will be necessary to have separate
forks/clones of it. But when you clone your own project on a
different system, or a colleague clones it to collaborate with you,
the clone won't have the origin-maneage remote that you started the
project with. As shown in the previous item above, you need this
remote to be able to pull recent updates from Maneage. The steps
below will set up the origin-maneage remote, and a local maneage
branch to track it, on the new clone.
git remote add origin-maneage https://git.maneage.org/project.git
git fetch origin-maneage
git checkout -b maneage --track origin-maneage/maneage
Commit message: The commit message is a very important and useful
aspect of version control. To make the commit message useful for
others (or yourself, one year later), it is good to follow a
consistent style. Maneage already has a consistent formatting
(described below), which you can also follow in your project if you
like. You can see many examples by running git log in the maneage
branch. If you intend to push commits to Maneage, for the consistency
of Maneage, it is necessary to follow these guidelines. 1) No line
should be more than 75 characters (to enable easy reading of the
message when you run git log on the standard 80-character terminal).
2) The first line is the title of the commit and should summarize it
(so git log --oneline can be useful). The title should also not end
with a point (., because it's a short single sentence, so a point is
not necessary and only wastes space). 3) After the title, leave an
empty line and start the body of your message (possibly containing
many paragraphs). 4) Describe the context of your commit (the problem
it is trying to solve) as much as possible, then go on to how you
solved it. One suggestion is to start the main body of your commit
with "Until now ...", and continue describing the problem in the
first paragraph(s). Afterwards, start the next paragraph with "With
this commit ...".
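The first three guidelines are mechanical, so they can be checked automatically before committing. The sketch below is an assumption about how such a check could look, not a tool shipped with Maneage; the function name check_commit_msg is hypothetical. It takes the file holding the commit message as its argument and reports every violation it finds.

```shell
# Check a commit message file against guidelines 1 to 3 above.
# Prints one line per problem; returns non-zero if any was found.
check_commit_msg() {
    awk '
        # 1) No line longer than 75 characters.
        length($0) > 75 {
            printf "line %d: longer than 75 characters\n", NR; bad = 1
        }
        # 2) The title (first line) must not end with a point.
        NR == 1 && /\.$/ { print "line 1: title ends with a point"; bad = 1 }
        # 3) The title must be followed by an empty line.
        NR == 2 && $0 != "" { print "line 2: must be empty"; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

One place to call it from is Git's commit-msg hook, which receives the name of the message file as its first argument. The fourth guideline (context first, then the solution) still needs a human reader.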
Project outputs: During your research, it is possible to checkout a
specific commit and reproduce its results. However, the processing
can be time consuming. Therefore, it is useful to also keep track of
the final outputs of your project (at minimum, the paper's PDF) at
important points of history. However, keeping a snapshot of these
(most probably large volume) outputs in the main history of the
project can unreasonably bloat it. It is thus recommended to make a
separate Git repo to keep those files and keep your project's source
as small as possible. For example, if your project is called
my-exciting-project, the name of the outputs repository can be
my-exciting-project-output. This enables easy sharing of the output
files with your co-authors (with necessary permissions) without
bloating your email archive with extra attachments (you can just share
the link to the online repo in your communications). After the
research is published, you can also release the outputs repository,
or you can just delete it if it is too large or unnecessary (it was
just for convenience, and fully reproducible after all). For example,
Maneage's output is available for demonstration in a separate
repository.
Full Git history in one file: When you are publishing your project
(for example to Zenodo for long term preservation), it is more
convenient to have the whole project's Git history in one file to
save with your datasets. After all, you can't be sure that your
current Git server (for example GitLab, GitHub, or Bitbucket) will be
active forever. While they are good for the immediate future, you
can't rely on them for archival purposes. Fortunately, keeping your
whole history in one file is easy with Git using the following
commands. To learn more about it, run git help bundle.
Bundle your project's history into one file (change
my-project-git.bundle to a descriptive name of your project):
git bundle create my-project-git.bundle --all
You can keep my-project-git.bundle anywhere. Later, if you need to
un-bundle it, you can use the following command.
git clone my-project-git.bundle
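Before archiving the bundle, it is prudent to confirm that it is complete and usable. The sketch below wraps the bundling command above together with git bundle verify; the function name bundle_history is hypothetical, and it is assumed to be run from inside the project's Git repository.

```shell
# Bundle the whole history of the current repository into the file
# given as the first argument, then check that the bundle is valid.
# Returns non-zero if either step fails.
bundle_history() {
    git bundle create "$1" --all || return 1
    git bundle verify "$1"
}

# Usage:
#   bundle_history my-project-git.bundle
```

A stricter check is to clone the bundle into a temporary directory and compare the output of git log in the clone with that of the original repository.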