diff options
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 400 |
1 files changed, 400 insertions, 0 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..1a52571 --- /dev/null +++ b/README.md @@ -0,0 +1,400 @@ +Introduction +============ + +This description is for *creators* of the reproduction pipeline. See +`README` on instructions for running it. + +This project contains a **fully working template** for a high-level +research reproduction pipeline as defined in the link below. If this page +is inaccessible at the time of reading, please see the end of this file +which contains a portion of the introduction in this webpage. + + http://akhlaghi.org/reproducible-science.html + +This template is created with the aim of supporting reproducible research +by making it easy to start a project in this framework. It is very easy to +customize this template pipeline for any particular research/job and expand +it as it starts and evolves. It can be run with no modification (as +described in `README`) as a demonstration and customized by editing the +existing rules and adding new rules as well as adding new Makefiles as the +research/project grows. + +This file will continue with a discussion of why Make is the perfect +language/framework for a research reproduction pipeline and how to master +Make easily. Afterwards, a checklist of actions that are necessary to +customize this pipeline for your research is provided. The main body of +this text will finish with some tips and guidelines on how to manage or +extend it as your research grows. Please share your thoughts and +suggestions on this pipeline so we can implement them and make it even more +easier to use and more robust. + + +Why Make? +--------- + +When batch processing is necessary (no manual intervention, as in a +reproduction pipeline), shell scripts are usually the first solution that +comes to mind. However, the problem with scripts for a scientific +reproduction pipeline is the complexity and non-linearity. A script will +start from the top/start every time it is run. So if you have gone through +90% of a research project and want to run the remaining 10% that you have +newly added, you have to run the whole script from the start again and wait +until you see the effects of the last few steps (for the possible errors, +or better solutions and etc). It is possible to manually ignore/comment +parts of a script to only do a special part, but such checks/comments will +only add to the complexity of the script and they is prone to very serious +bugs in the end (when trying to reproduce from scratch). Such bugs are very +hard to notice when adding the checks or commenting-code in a script. + +The Make paradigm, on the other hand, starts from the end: the final +target. It builds a dependency tree internally, and finds where it should +actually start each time it is run. Therefore, in the scenario above, a +researcher that has just added the final 10% of steps of her research to +her Makefile, will only have run those extra steps. As commonly happens in +a research context, due to its paradigm Make, it is also trivial to change +the processing of any intermediate (already written) rule/step in the +middle of an already written pipeline: the next time Make is run, only +rules affected by the changes/additions will be re-run. + +This greatly speeds up the processing (enabling creative changes), while +keeping all the dependencies clearly documented (as part of the Make +language), and most importantly, enabling full reproducibility from scratch +with no changes in the pipeline code that was working during the +research. Since the dependencies are also clearly demarcated, Make can +identify independent steps and run them in parallel (further speeding up +the process). Make was designed for this purpose and it is how huge +projects like all Unix-like operating systems (including GNU/Linux or Mac +OS operating systems) and their core components are built. Therefore, Make +is a highly mature paradigm/system with robust and highly efficient +implementations in various operating systems perfectly suited for a complex +non-linear research project. + +Make is a small language with the aim of defining "rules" containing +"targets", "prerequisites" and "recipes". It comes with some cool features +like functions or automatic-variables to greatly facilitate the management +of any of those constructs. For a more detailed (yet still general) +introduction see Wikipedia: + + https://en.wikipedia.org/wiki/Make_(software) + +Many implementations of Make exist and all should be usable with this +pipeline. This pipeline has been created and tested mainly with GNU Make +which is the most common implementation. But if you see parts specific to +GNU Make, please inform us to correct it. + + +How can I learn Make? +--------------------- + +The best place to learn Make from scratch is the GNU Make manual. It is an +excellent and non-technical (in its first chapters) book to help get +started. It is freely available and always up to date with the current +release. It also clearly explains which features are specific to GNU Make +and which are general in all implementations. So the first few chapters +regarding the generalities are useful for all implementations. + +The first link below has links to the GNU Make manual in various formats +and in the second, you can get it in PDF (which may be easier to read in +the first time). + + https://www.gnu.org/software/make/manual/ + + https://www.gnu.org/software/make/manual/make.pdf + +If you use GNU Make, you also have the whole GNU Make manual on the +command-line with the following command (you can come out of the "info" +environment by pressing `q`, if you don't know "Info", we strongly +recommend running "$ info info" to learn it easily, it greatly simplifies +your access to many manuals that are installed on your system). + +```shell + $ info make +``` + +If you use the Emacs text editor, you will find the Info version of the +Make manual there also. + + + + +Checklist to customize the pipeline +=================================== + +Take the following steps to fully customize this pipeline for your research +project. After finishing the list, be sure to run `./configure` and `make` +to see if everything works correctly before expanding it. If you notice +anything missing or any in-correct part (probably a change that has not +been explained here), please let us know to correct it. + + - **Get this repository** (if you don't already have it): Arguably the + easiest way to start is to clone this repository as shown below: + + ```shell + $ git clone https://gitlab.com/makhlaghi/reproduction-pipeline-template.git + ``` + + - **Copyright**, **name** and **date**: Go over the following files and + correct the copyright, names and dates in their first few lines: + `README`, `Makefile` and `reproduce/src/make/*.mk`. When making new + files, always remember to add a similar copyright statement at the top + of the tile. + + - **Gnuastro**: GNU Astronomy Utilities (Gnuastro) is currently a + dependency of the pipeline and without it, the pipeline will complain + and abort. The main reason for this is to demonstrate how critically + important it is to version your software. If you don't want to install + Gnuastro please follow the instructions in the list below. If you do + have Gnuastro (or have installed it to check this pipeline), then + after an initial check, try un-commenting the `onlyversion` line and + running the pipeline to see the respective error. Such features in a + software makes it easy to design a robust pipeline like this. If you + have tried it and don't need Gnuastro in your pipeline, also follow + this list: + + - Delete the description about Gnuastro in `README`. + - Delete everything about Gnuastro in `reproduce/src/make/initialize.mk` + - Delete `and Gnuastro \gnuastrover` from `tex/preamble-style` + + - **Initiate a new Git repo**: You don't want to mix the history of this + template reproduction pipeline with your own reproduction + pipeline. You have already made some small changes in the previous + step, so let's re-initiate history before continuing. But before doing + that, keep the output of `git describe` in a place and write it in + your first commit message to document what point in this pipeline's + history you started from. Since the pipeline is highly integrated with + your particular research, it may not be easy to merge the changes + later. Having the commit in this history that you started from, will + allow you to check and manually apply any changes that don't interfere + with your implemented pipeline. After this step, you can commit your + changes into your newly initiated history as you like. + + ```shell + $ git describe # The point in this history you started from. + $ git clean -fxd # Remove any possibly created extra files. + $ rm -rf .git # Completely remove this history. + $ git init # Initiate a new history. + $ git add --all # Stage everything that is here. + $ git commit # Make your first commit (mention the first output) + ``` + + - **`README`**: Go through this top-level instruction file and make it fit + to your pipeline: update the text and etc. Don't forget that your + colleagues or anyone else, will first be drawn to read this file, so + make it as easy as possible for them to understand your + work. Therefore, also check and update `README` one last time when you + are ready to publish your work (and its reproduction pipeline). + + - **First input dataset**: The user manages the top-level directory of the + input data through the variables set in + `reproduce/config/pipeline/DIRECTORIES.mk.in` (the user actually edits + a `DIRECTORIES.mk` file that is copied from the `.mk.in` file, but the + `.mk` file is not under version control). So open this file and + replace `SURVEY` with the name of your input survey or dataset (all in + capital letters), for example if you are working on data from the XDF + survey, replace `SURVEY` with `XDF`. But don't change the value, just + the name. Afterwards, change any occurrence of `SURVEY` in the whole + pipeline with the new name. You can find the occurrences with a simple + command like the ones shown below. We follow the Make convention here + that all `ONLY-CAPITAL` variables are those directly set by the user + and all `small-caps` variables are set by the pipeline designer. All + variables that also depend on this survey have a `survey` in their + name. Hence, also correct all these occurrences to your new name in + small-caps. Of course, ignore those occurrences that are irrelevant, + like those in this file. Note that in the raw version of this template + no target depends on these files, so they are ignored. Afterwards, set + the webpage and correct the filenames in + `reproduce/src/make/download.mk' if necessary. + + ```shell + $ grep -r SURVEY ./ + $ grep -r survey ./ + ``` + + - **Other input datasets**: Add any other input datasets that may be + necessary for your research to the pipeline based on the example + above. + + +Tips on using the pipeline +========================== + +The following is a list of points, tips, or recommendations that have been +learned after some experience with this pipeline. Please don't hesitate to +share any experience you gain after using this pipeline with us. In this +way, we can add it here for others to also benefit. + + - **Modularity**: Modularity is the key to easy and clean growth of a + project. So it is always best to break up a job into as many + sub-components as reasonable. Here are some tips to stay modular. + + - *Short recipes*: if you see the recipe of a rule becoming more than a + few lines, it probably a good sign that you should break up the job. + + - *Context-based (many) Makefiles*: This pipeline is designed to allow + the easy inclusion of many Makefiles (in `reproduce/src/make/*.mk`) + for maximal modularity. So keep the rules for closely related parts + of the processing in separate Makefiles. + + - *Clear file names*: Be very clear and descriptive with the naming of + the files and the variables because a few months after the processing + it will be very hard to remember what each one does. Also this helps + others (your collaborators or other people reading the pipeline after + it is published) to more easily understand your work and find their + way around. + + - *Standard naming*: As the project grows, following a good standard in + naming the files is very useful. Try best to use multiple word + filenames for anything that is non-trivial (separating the words with + a `-`). For example if you have a Makefile for creating a catalog and + another two for processing it under models A and B, you can name them + like this: `catalog-create.mk`, `catalog-modela.mk` and + `catalog-modelb.mk`. In this way, when listing the contents of + `reproduce/src/make` to see all the Makefiles, those related to the + catalog will all be close to each other and thus easily found. This + also helps in auto-completions by the shell or text editors like + Emacs. + + - *Source directories*: If you need to add scripts (shell, Python, AWK + or any other language), keep them in a separate directory under + `reproduce/src`, with the appropriate name. + + - *Configuration files*: If your research uses special programs as part + of the processing, put all their configuration files in a devoted + directory (with the program's name) within + `reproduce/config`. Similar to the `reproduce/config/gnuastro` + directory (which is put in the template as a demo in case you use GNU + Astronomy Utilities). It is much cleaner and readable (thus less + buggy) to avoid mixing the configuration files, even if there is no + technical necessity. + + + - **Contents**: It is good practice to follow the following + recommendations on the contents of your files, whether they are source + code for a program, Makefiles, scripts or configuration files + (copyrights aren't necessary for the latter). + + - *Copyright*: Always start a file containing programming constructs + with a copyright statement like the ones that this template starts + with (for example in the top level `Makefile`). + + - *Comments*: Comments are vital for readability (by yourself in two + months or others). Describe everything you can about why you are + doing something, how you are doing it, and what you expect the result + to be. Write the comments as if it was what you would say to describe + the variable, recipe or rule to a friend sitting beside you. When + writing the pipeline it is very tempting to just steam ahead, but be + patient and write comments before the rules or recipes. This will + also allow you to think more about what you should be doing. Also, in + several months when you come back to the code, you will appreciate + the effort of writing them. Just don't forget to also read and update + the comment first if you later want to make changes to the variable, + recipe or rule. As a general rule of thumb: first the comments, then + the code. + + + - **Make programming**: Here are some experiences that we have come to + learn over the years in using Make and are useful/handy in research + contexts. + + - *Automatic variables*: These are wonderful and very useful Make + constructs that greatly shrink the text, while help in read-ability, + robustness (less bugs in typos for example) and generalization. For + example even when a rule only has one target or one prerequisite, + always use `$@` instead of the target's name, `$<` instead of the + first prerequisite, `$^` instead of the full list of prerequisites + and etc. You can see the full list of automatic variables + [here](https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html). If + you use GNU Make, you can also see this page on your command-line: + + ```shell + $ info make "automatic variables + ``` + + - *Large files*: If you are dealing with very large files (thus having + multiple copies of them for intermediate steps is not possible), one + solution is the following strategy. Set a small plain text file as + the actual target and delete the large file when it is no longer + needed by the pipeline (in the last rule that needs it). Below is a + simple demonstration of doing this, where we use Gnuastro's + Arithmetic program to add all pixels of the input image with 2 and + create `large1.fits`. We then subtract 2 from `large1.fits` to create + `large2.fits` and delete `large1.fits` in the same rule (when its no + longer needed). We can later do the same with `large2.fits` when it + is no longer needed and so on. + ``` + large1.fits.txt: input.fits + astarithmetic $< 2 + --output=$(subst .txt,,$@) + echo "done" > $@ + large2.fits.txt: large1.fits.txt + astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@) + rm $(subst .txt,,$<) + echo "done" > $@ + ``` + A more advanced Make programmer will use [Make's call + function](https://www.gnu.org/software/make/manual/html_node/Call-Function.html) + to define a wrapper in `reproduce/src/make/initialize.mk`. This + wrapper will replace `$(subst .txt,,XXXXX)`. Therefore, it will be + possible to greatly simplify this repetitive statement and make the + code even more readable throughout the whole pipeline. + + + - **Dependencies**: It is critically important to exactly document, keep + and check the versions of the programs you are using in the pipeline. + + - *Check versions*: In `reproduce/src/make/initialize.mk`, check the + versions of the programs you are using. + + - *Keep the source tarball of dependencies*: keep a tarball of the + necessary version of all your dependencies (and also a copy of the + higher-level libraries they depend on). Software evolves very fast + and only in a few years, a feature might be changed or removed from + the mainstream version or the software server might go down. To be + safe, keep a copy of the tarballs (they are hardly ever over a few + megabytes, very insignificant compared to the data). If you intend to + release the pipeline in a place like Zenodo, then you can create your + submission early (before public release) and upload/keep all the + necessary tarballs (and data) there. + + - *Keep your input data*: The input data is also critical to the + pipeline, so like the above for software, make sure you have a backup + of them + + + + + + + +Appendix: Introduction to this concept from link above +====================================================== + +In case [the link above](http://akhlaghi.org/reproducible-science.html) is +not accessible at the time of reading, here is a copy of the introduction +of that link, describing the necessity for a reproduction pipeline like +this (copied on February 7th, 2018): + +The most important element of a "scientific" statement/result is the fact +that others should be able to falsify it. The Tsunami of data that has +engulfed astronomers in the last two decades, combined with faster +processors and faster internet connections has made it much more easier to +obtain a result. However, these factors have also increased the complexity +of a scientific analysis, such that it is no longer possible to describe +all the steps of an analysis in the published paper. Citing this +difficulty, many authors suffice to describing the generalities of their +analysis in their papers. + +However, It is impossible to falsify (or even study) a result if you can't +exactly reproduce it. The complexity of modern science makes it vitally +important to exactly reproduce the final result. Because even a small +deviation can be due to many different parts of an analysis. Nature is +already a black box which we are trying so hard to comprehend. Not letting +other scientists see the exact steps taken to reach a result, or not +allowing them to modify it (do experiments on it) is a self-imposed black +box, which only exacerbates our ignorance. + +Other scientists should be able to reproduce, check and experiment on the +results of anything that is to carry the "scientific" label. Any result +that is not reproducible (due to incomplete information by the author) is +not scientific: the readers have to have faith in the subjective experience +of the authors in the very important choice of configuration values and +order of operations: this is contrary to the scientific spirit.
\ No newline at end of file |