Next: Future improvements, Previous: Customization checklist, Up: About
The following is a list of design points, tips, and recommendations that have been learned after some experience with this type of project management. Please don't hesitate to share with us any experience you gain after using it. In this way, we can add it here (with full credit) for the benefit of others.
Modularity: Modularity is the key to easy and clean growth of a project. So it is always best to break up a job into as many sub-components as reasonable. Here are some tips to stay modular.
Short recipes: If the recipe of a rule grows beyond a handful of lines and involves significant processing, that is probably a good sign that you should break the rule up into its main components. Try to have only one major processing step per rule.
Context-based (many) Makefiles: For maximum modularity, this design allows easy inclusion of many Makefiles: in reproduce/analysis/make/*.mk for analysis steps, and reproduce/software/make/*.mk for building software. So keep the rules for closely related parts of the processing in separate Makefiles.
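As a hedged, generic sketch (the subMakefile names and the foreach-based include below are illustrative assumptions, not necessarily Maneage's exact mechanism), a top-level Makefile can pull in such modular Makefiles like this:

```make
# Hypothetical list of subMakefiles; the order matters, since Make
# reads them sequentially.
makesrc = initialize download catalog-create catalog-model-a paper

# Include each one from the analysis Makefile directory.
include $(foreach s,$(makesrc),reproduce/analysis/make/$(s).mk)
```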
Descriptive names: Be very clear and descriptive with the naming of files and variables, because a few months after the processing it will be very hard to remember what each one was for. This also helps others (your collaborators, or other people reading the project source after it is published) to more easily understand your work and find their way around.
Naming convention: As the project grows, following a single standard or convention in naming the files is very useful. Try your best to use multiple-word filenames for anything that is non-trivial (separating the words with a '-'). For example, if you have a Makefile for creating a catalog and another two for processing it under models A and B, you can name them catalog-create.mk, catalog-model-a.mk and catalog-model-b.mk. In this way, when listing the contents of reproduce/analysis/make to see all the Makefiles, those related to the catalog will all be close to each other and thus easily found. This also helps with auto-completion by the shell or text editors like Emacs.
Source directories: If you need to add files in other languages (for example shell, Python, AWK or C), keep the files of each language in a separate directory under reproduce/analysis, with an appropriate name.
Configuration files: If your research uses special programs as part of the processing, put all their configuration files in a dedicated directory (named after the program) within reproduce/software/config, similar to the reproduce/software/config/gnuastro directory (which is included in Maneage as a demo, in case you use GNU Astronomy Utilities). It is much cleaner and more readable (thus less buggy) to avoid mixing the configuration files, even if there is no technical necessity.
Contents: It is good practice to follow these recommendations on the contents of your files, whether they are source code for a program, Makefiles, scripts or configuration files (copyright statements aren't necessary for the latter).
Copyright: Always start a file containing programming constructs with a copyright statement like the ones that Maneage starts with (for example in the top-level Makefile).
Comments: Comments are vital for readability (by yourself in two months, or by others). Describe everything you can about why you are doing something, how you are doing it, and what you expect the result to be. Write the comments as if they were what you would say to describe the variable, recipe or rule to a friend sitting beside you. When writing the project it is very tempting to just steam ahead with commands and code, but be patient and write comments before the rules or recipes. This will also allow you to think more about what you should be doing. Also, in several months when you come back to the code, you will appreciate the effort of writing them. Just don't forget to also read and update the comment first if you later want to make changes to the code (variable, recipe or rule). As a general rule of thumb: first the comments, then the code.
File title: In general, it is good practice to start all files with a single-line description of what that particular file does. If further information about the file as a whole is necessary, add it after a blank line. This will help a fast inspection where you don't care about the details, but just want to remember/see what that file is (generally) for. This information must of course be commented (it's for a human), but it is kept separate from the general recommendation on comments, because this is a comment for the whole file, not each step within it.
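Putting the two points above together, a Makefile might start with a header like the hedged sketch below (the title, years, name and license wording are all placeholders, not copied from any actual Maneage file):

```make
# Catalog creation rules (hypothetical file title).
#
# This Makefile builds the input catalog that later rules process
# under models A and B.
#
# Copyright (C) YYYY Your Name <you@example.org>
#
# This Makefile is free software: you can redistribute it and/or
# modify it under the terms of the GNU General Public License ...
```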
Make programming: Here are some lessons that we have learned over the years of using Make and that are useful/handy in research contexts.
Environment of each recipe: If you need to define a special environment (or aliases, or scripts to run) for all the recipes in your Makefiles, you can use the Bash startup file reproduce/software/shell/bashrc.sh. This file is loaded before every Make recipe is run, just like the .bashrc in your home directory is loaded every time you start a new interactive, non-login terminal. See the comments in that file for more.
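As a hedged illustration (the variable names and values below are hypothetical, not the actual contents of Maneage's bashrc.sh), such a startup file could simply export the environment that every recipe needs:

```shell
# Hypothetical contents for reproduce/software/shell/bashrc.sh:
# exported variables become part of every Make recipe's environment.
export OMP_NUM_THREADS=4        # example: cap threads used by recipes.

mydata="$HOME/data"             # example helper variable for recipes.
export mydata
```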
Automatic variables: These are wonderful and very useful Make constructs that greatly shrink the text, while helping readability, robustness (fewer bugs from typos, for example) and generalization. For example, even when a rule has only one target or one prerequisite, always use $@ instead of the target's name, $< instead of the first prerequisite, $^ instead of the full list of prerequisites, and so on. You can see the full list of automatic variables in the Make manual. If you use GNU Make, you can also see this page on your command line:

info make "automatic variables"
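For instance, a rule like the hedged sketch below (the file names and commands are purely illustrative) never repeats a file name inside its recipe:

```make
# Hypothetical rule: concatenate two tables into one target.
merged.txt: tab-a.txt tab-b.txt
	cat $^ > $@          # '$^': all prerequisites; '$@': the target.
	wc -l $< >> $@       # '$<': the first prerequisite (tab-a.txt).
```

If a prerequisite is later renamed, only the prerequisite list needs editing; the recipe stays untouched.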
Debug: Since Make doesn't follow the common top-down paradigm, it can be a little hard to get accustomed to why you get an error or unexpected behavior. In such cases, run Make with the -d option. With this option, Make prints a full list of exactly which prerequisites are being checked for which targets. Looking (patiently) through this output and searching for the faulty file/step will clearly show you any mistake you might have made in defining the targets or prerequisites.
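As a minimal, self-contained sketch (assuming GNU Make and standard shell tools are available; the Makefile contents are invented for the demo), the commands below create a throw-away Makefile and filter the -d output for the lines that usually matter most:

```shell
# Build a throw-away Makefile with one rule, and a fresh prerequisite.
dir=$(mktemp -d)
printf 'target.txt: source.txt\n\tcp source.txt target.txt\n' \
       > "$dir/Makefile"
touch "$dir/source.txt"

# 'Must remake' lines show which targets Make decided to rebuild.
make -C "$dir" -d 2>&1 | grep 'Must remake'
rm -r "$dir"
```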
Large files: If you are dealing with very large files (so keeping multiple copies of them for intermediate steps is not possible), one solution is the following strategy (also see the next item on "Fast access to temporary files"). Set a small plain-text file as the actual target and delete the large file when it is no longer needed by the project (in the last rule that needs it). Below is a simple demonstration of doing this. In it, we use Gnuastro's Arithmetic program to add 2 to all pixels of the input image and create large1.fits. We then subtract 2 from large1.fits to create large2.fits and delete large1.fits in the same rule (when it is no longer needed). We can later do the same with large2.fits when it is no longer needed, and so on.
large1.fits.txt: input.fits
	astarithmetic $< 2 + --output=$(subst .txt,,$@)
	echo "done" > $@

large2.fits.txt: large1.fits.txt
	astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@)
	rm $(subst .txt,,$<)
	echo "done" > $@
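One way to reduce these repetitive $(subst .txt,,...) statements is a call-based wrapper; here is a hedged sketch (the wrapper name 'nosuffix' is an assumption for illustration, and Maneage's actual wrapper may differ):

```make
# Hypothetical wrapper: strip the ".txt" suffix from one file name.
nosuffix = $(subst .txt,,$(1))

large2.fits.txt: large1.fits.txt
	astarithmetic $(call nosuffix,$<) 2 - --output=$(call nosuffix,$@)
	rm $(call nosuffix,$<)
	echo "done" > $@
```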
A more advanced Make programmer will use Make's call function to define a wrapper in reproduce/analysis/make/initialize.mk. This wrapper will replace $(subst .txt,,XXXXX). Therefore, it will be possible to greatly simplify this repetitive statement and make the code even more readable throughout the whole project.

Fast access to temporary files: Most Unix-like operating systems
will give you a special shared-memory device (directory): on systems using the GNU C Library (all GNU/Linux systems), it is /dev/shm. The contents of this directory are actually in your RAM, not on your persistent storage like the HDD or SSD. Reading and writing from/to the RAM is much faster than from/to persistent storage, so if you have enough RAM available, it can be very beneficial to put large temporary files there. You can use the mktemp program to give the temporary files a randomly-set name, and use text files as targets to keep that name (as described in the item above under "Large files") for later deletion. For example, see the minimal working example Makefile below (which you can actually put in a Makefile and run if you have an input.fits in the same directory, and Gnuastro is installed).
.ONESHELL:
.SHELLFLAGS = -ec
all: mean-std.txt
shm-maneage := /dev/shm/$(shell whoami)-maneage-XXXXXXXXXX

large1.txt: input.fits
	out=$$(mktemp $(shm-maneage))
	astarithmetic $< 2 + --output=$$out.fits
	echo "$$out" > $@

large2.txt: large1.txt
	input=$$(cat $<)
	out=$$(mktemp $(shm-maneage))
	astarithmetic $$input.fits 2 - --output=$$out.fits
	rm $$input.fits $$input
	echo "$$out" > $@

mean-std.txt: large2.txt
	input=$$(cat $<)
	aststatistics $$input.fits --mean --std > $@
	rm $$input.fits $$input
The important point here is that the temporary name template (shm-maneage) has no suffix, so you can add the suffix corresponding to your desired format afterwards (for example $$out.fits, or $$out.txt). But more importantly, when mktemp sets the random name, it also checks that no file exists with that name, and creates a file with that exact name at that moment. So at the end of each recipe above, you'll have two files in your /dev/shm: an empty file with no suffix, and one with a suffix. The role of the file without a suffix is just to ensure that the randomly-set name will not be used by other calls to mktemp (when running in parallel), and it should be deleted together with the file containing a suffix. This is the reason behind the rm $$input.fits $$input command above: to make sure that first the file with a suffix is deleted, then the core random file (note that when working in parallel on powerful systems, in the time between deleting the two files of a single rm command, many things can happen!). When using Maneage, you can put the definition of shm-maneage in reproduce/analysis/make/initialize.mk to make it usable in all the different Makefiles of your analysis; you then won't need the three lines above it. Finally, BE RESPONSIBLE: after you are finished, be sure to clean up any possibly remaining files (due to crashes in the processing while you were working), otherwise your RAM may fill up very fast. You can do this easily with a command like this on your command line: rm -f /dev/shm/$(whoami)-*
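The reservation-file behavior of mktemp can be demonstrated with the self-contained sketch below (it uses /tmp instead of /dev/shm so it also runs on systems without that device; the template and file names are illustrative):

```shell
# Reserve a random name: mktemp creates the empty file immediately,
# so no parallel mktemp call can pick the same name.
template=/tmp/$(whoami)-demo-XXXXXXXXXX
out=$(mktemp "$template")
ls -l "$out"                   # the empty reservation file exists.

# The real data goes into a suffixed sibling of that name.
echo "pretend data" > "$out.fits"

# Clean up: first the suffixed file, then the reservation file.
rm "$out.fits" "$out"
```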
Software tarballs and raw inputs: It is critically important to document the raw inputs to your project (software tarballs and raw input data):
Keep the source tarball of dependencies: After configuration finishes, the .build/software/tarballs directory will contain all the software tarballs that were necessary for your project. You can mirror the contents of this directory to keep a backup of all the software tarballs used in your project (possibly as another version-controlled repository) that is also published with your project. Note that software webpages are not set in stone and can suddenly go offline or become inaccessible in some conditions. This backup is thus very important. If you intend to release your project in a place like Zenodo, you can upload/keep all the necessary tarballs (and data) there with your project. zenodo.1163746 is one example of how the data, Gnuastro (the main software used) and all of Gnuastro's major dependencies have been uploaded with the project's source. Just note that this is only possible for free and open-source software.
Keep your input data: The input data is also critical to the project's reproducibility, so, as with the software above, make sure you have a backup of it, or its persistent identifiers (PIDs).
Version control: Version control is a critical component of Maneage. Here are some tips to help in effectively using it.
Regular commits: It is important (and extremely useful) to have the history of your project under version control. So try to make commits regularly (after any meaningful change/step/result).
Keep Maneage up-to-date: In time, Maneage is going to become more and more mature and robust (thanks to your feedback and the feedback of other users). Bugs will be fixed and new/improved features will be added. So every once in a while, you can run the commands below to pull new work that is done in Maneage. If the changes are useful for your work, you can merge them with your project to benefit from them. Just pay very close attention to resolving possible conflicts which might happen in the merge (updated settings that you have customized in Maneage).
git checkout maneage
git pull # Get recent work in Maneage
git log XXXXXX..XXXXXX --reverse # Inspect new work (replace XXXXXXs with hashes mentioned in output of previous command).
git log --oneline --graph --decorate --all # General view of branches.
git checkout master # Go to your top working branch.
git merge maneage # Import all the work into master.
Adding Maneage to a fork of your project: As you and your colleagues continue your project, it will be necessary to have separate forks/clones of it. But when you clone your own project on a different system, or a colleague clones it to collaborate with you, the clone won't have the origin-maneage remote that you started the project with. As shown in the previous item above, you need this remote to be able to pull recent updates from Maneage. The commands below will set up the origin-maneage remote, and a local maneage branch to track it, on the new clone.
git remote add origin-maneage https://git.maneage.org/project.git
git fetch origin-maneage
git checkout -b maneage --track origin-maneage/maneage
Commit message: The commit message is a very important and useful aspect of version control. To make the commit message useful for others (or yourself, one year later), it is good to follow a consistent style. Maneage already has a consistent format (described below), which you can also follow in your project if you like. You can see many examples by running git log in the maneage branch. If you intend to push commits to Maneage, it is necessary to follow these guidelines for the consistency of Maneage. 1) No line should be more than 75 characters (to enable easy reading of the message when you run git log on a standard 80-character terminal). 2) The first line is the title of the commit and should summarize it (so git log --oneline can be useful). The title should also not end with a point ('.'; because it is a short single sentence, a point is not necessary and only wastes space). 3) After the title, leave an empty line and start the body of your message (possibly containing many paragraphs). 4) Describe the context of your commit (the problem it is trying to solve) as much as possible, then go on to how you solved it. One suggestion is to start the main body of your commit with "Until now ...", and continue describing the problem in the first paragraph(s). Afterwards, start the next paragraph with "With this commit ...".
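A commit message following these conventions might look like the hypothetical example below (the content is invented purely for illustration):

```
Column units added to the output catalog

Until now, the catalog columns had no units, so readers of the
final table could easily misinterpret the measurements.

With this commit, the units of each column are written into the
table metadata within the rule that creates the catalog.
```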
Project outputs: During your research, it is possible to check out a specific commit and reproduce its results. However, the processing can be time consuming. Therefore, it is useful to also keep track of the final outputs of your project (at minimum, the paper's PDF) at important points of history. However, keeping a snapshot of these (most probably large-volume) outputs in the main history of the project can unreasonably bloat it. It is thus recommended to make a separate Git repository to keep those files, and to keep your project's source as small as possible. For example, if your project is called my-exciting-project, the name of the outputs repository can be my-exciting-project-output. This enables easy sharing of the output files with your co-authors (with the necessary permissions), without bloating your email archive with extra attachments (you can just share the link to the online repository in your communications). After the research is published, you can also release the outputs repository, or you can just delete it if it is too large or unnecessary (it was just for convenience, and is fully reproducible after all). For example, Maneage's output is available for demonstration in a separate repository.
Full Git history in one file: When you are publishing your project (for example to Zenodo for long-term preservation), it is more convenient to have the whole project's Git history in one file to save with your datasets. After all, you can't be sure that your current Git server (for example Codeberg, other community Git servers, GitLab, GitHub, or Bitbucket) will be active forever. While they are good for the immediate future, you can't rely on them for archival purposes. Fortunately, keeping your whole history in one file is easy with Git using the following commands. To learn more about it, run git help bundle.

The first command below puts your project's whole history into a single file (you can change my-project-git.bundle to a descriptive name for your project):

git bundle create my-project-git.bundle --all

You can then safely store my-project-git.bundle anywhere. Later, if you need to un-bundle it, you can use the following command:

git clone my-project-git.bundle
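The round trip can be checked with the self-contained sketch below (assuming git is installed; the repository contents, branch name and paths are illustrative):

```shell
# Create a throw-away repository with one commit.
dir=$(mktemp -d)
git -C "$dir" init -q -b master
echo "hello" > "$dir/README"
git -C "$dir" add README
git -C "$dir" -c user.name=me -c user.email=me@example.org \
    commit -qm "Initial commit"

# Bundle the whole history into one file, then clone from it.
git -C "$dir" bundle create "$dir/history.bundle" --all
git clone -q -b master "$dir/history.bundle" "$dir/restored"
ls "$dir/restored/README"      # the file is back, from the bundle alone.
rm -r "$dir"
```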