diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1a52571
--- /dev/null
+++ b/README.md
@@ -0,0 +1,400 @@
+Introduction
+============
+
+This description is for *creators* of the reproduction pipeline. See
+`README` for instructions on running it.
+
+This project contains a **fully working template** for a high-level
+research reproduction pipeline as defined in the link below. If that page
+is inaccessible at the time of reading, please see the end of this file,
+which contains a portion of the introduction from that webpage.
+
+ http://akhlaghi.org/reproducible-science.html
+
+This template is created with the aim of supporting reproducible research
+by making it easy to start a project in this framework. It is very easy to
+customize this template pipeline for any particular research/job and to
+expand it as the project starts and evolves. It can be run with no
+modification (as described in `README`) as a demonstration, and customized
+by editing the existing rules, adding new rules, and adding new Makefiles
+as the research/project grows.
+
+This file continues with a discussion of why Make is the perfect
+language/framework for a research reproduction pipeline and how to master
+Make easily. Afterwards, a checklist of the actions necessary to customize
+this pipeline for your research is provided. The main body of this text
+finishes with some tips and guidelines on how to manage or extend the
+pipeline as your research grows. Please share your thoughts and
+suggestions on this pipeline so we can implement them and make it even
+easier to use and more robust.
+
+
+Why Make?
+---------
+
+When batch processing is necessary (no manual intervention, as in a
+reproduction pipeline), shell scripts are usually the first solution that
+comes to mind. However, the problem with scripts for a scientific
+reproduction pipeline is their complexity and non-linearity. A script
+starts from the top every time it is run. So if you have gone through 90%
+of a research project and want to run the remaining 10% that you have
+newly added, you have to run the whole script from the start again and
+wait until you see the effects of the last few steps (to check for
+possible errors, better solutions, and so on). It is possible to manually
+ignore/comment parts of a script to only run a special part, but such
+checks/comments only add to the complexity of the script and are prone to
+very serious bugs in the end (when trying to reproduce from scratch). Such
+bugs are very hard to notice when adding the checks or commenting out code
+in a script.
+
+The Make paradigm, on the other hand, starts from the end: the final
+target. It builds a dependency tree internally and finds where it should
+actually start each time it is run. Therefore, in the scenario above, a
+researcher who has just added the final 10% of the steps of her research
+to her Makefile will only have to run those extra steps. As commonly
+happens in a research context, thanks to Make's paradigm it is also
+trivial to change the processing of any intermediate (already written)
+rule/step in the middle of an already written pipeline: the next time Make
+is run, only the rules affected by the changes/additions will be re-run.
+
+This greatly speeds up the processing (enabling creative changes), while
+keeping all the dependencies clearly documented (as part of the Make
+language), and most importantly, enabling full reproducibility from
+scratch with no changes in the pipeline code that was working during the
+research. Since the dependencies are also clearly demarcated, Make can
+identify independent steps and run them in parallel (further speeding up
+the process). Make was designed for this purpose: it is how huge projects
+like all Unix-like operating systems (including GNU/Linux and Mac OS) and
+their core components are built. Therefore, Make is a highly mature
+paradigm/system with robust and highly efficient implementations in
+various operating systems, perfectly suited for a complex, non-linear
+research project.
+
+Make is a small language with the aim of defining "rules" containing
+"targets", "prerequisites" and "recipes". It comes with some useful
+features like functions and automatic variables to greatly facilitate the
+management of any of those constructs. For a more detailed (yet still
+general) introduction, see Wikipedia:
+
+ https://en.wikipedia.org/wiki/Make_(software)
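+
+As a minimal (hypothetical) example, the rule below tells Make that the
+target `paper.pdf` depends on the prerequisites `paper.tex` and
+`results.tex`, and gives the recipe to build it (recall that recipe lines
+must start with a TAB character):
+
+```
+paper.pdf: paper.tex results.tex
+        pdflatex paper.tex
+```
+
+With a full set of such rules, Make knows the whole dependency tree, so
+running a command like `make -j4` will also build independent targets in
+parallel (on 4 threads).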
+
+Many implementations of Make exist, and all should be usable with this
+pipeline. This pipeline has been created and tested mainly with GNU Make,
+which is the most common implementation. But if you see parts specific to
+GNU Make, please inform us so we can correct them.
+
+
+How can I learn Make?
+---------------------
+
+The best place to learn Make from scratch is the GNU Make manual. It is
+an excellent and non-technical (in its first chapters) book to help get
+started. It is freely available and always up to date with the current
+release. It also clearly explains which features are specific to GNU Make
+and which are general to all implementations. So the first few chapters
+regarding the generalities are useful for all implementations.
+
+The first link below has links to the GNU Make manual in various formats,
+and from the second you can get it as a PDF (which may be easier to read
+at first).
+
+ https://www.gnu.org/software/make/manual/
+
+ https://www.gnu.org/software/make/manual/make.pdf
+
+If you use GNU Make, you also have the whole GNU Make manual on the
+command-line with the following command (you can leave the "info"
+environment by pressing `q`). If you don't know "Info", we strongly
+recommend running `info info` to learn it easily; it greatly simplifies
+your access to the many manuals that are installed on your system.
+
+```shell
+ $ info make
+```
+
+If you use the Emacs text editor, you will find the Info version of the
+Make manual there also.
+
+
+
+
+Checklist to customize the pipeline
+===================================
+
+Take the following steps to fully customize this pipeline for your
+research project. After finishing the list, be sure to run `./configure`
+and `make` to see if everything works correctly before expanding it. If
+you notice anything missing or any incorrect part (probably a change that
+has not been explained here), please let us know so we can correct it.
+
+ - **Get this repository** (if you don't already have it): Arguably the
+ easiest way to start is to clone this repository as shown below:
+
+ ```shell
+ $ git clone https://gitlab.com/makhlaghi/reproduction-pipeline-template.git
+ ```
+
+  - **Copyright**, **name** and **date**: Go over the following files and
+    correct the copyright, names and dates in their first few lines:
+    `README`, `Makefile` and `reproduce/src/make/*.mk`. When making new
+    files, always remember to add a similar copyright statement at the top
+    of the file.
+
+  - **Gnuastro**: GNU Astronomy Utilities (Gnuastro) is currently a
+    dependency of the pipeline and without it, the pipeline will complain
+    and abort. The main reason for this is to demonstrate how critically
+    important it is to version your software. If you don't want to install
+    Gnuastro, please follow the instructions in the list below. If you do
+    have Gnuastro (or have installed it to check this pipeline), then
+    after an initial check, try un-commenting the `onlyversion` line and
+    running the pipeline to see the respective error. Such features in
+    software make it easy to design a robust pipeline like this. If you
+    have tried it and don't need Gnuastro in your pipeline, also follow
+    this list:
+
+      - Delete the description about Gnuastro in `README`.
+      - Delete everything about Gnuastro in `reproduce/src/make/initialize.mk`.
+      - Delete `and Gnuastro \gnuastrover` from `tex/preamble-style`.
+
+  - **Initiate a new Git repo**: You don't want to mix the history of
+    this template reproduction pipeline with your own reproduction
+    pipeline. You have already made some small changes in the previous
+    step, so let's re-initiate the history before continuing. But before
+    doing that, keep the output of `git describe` somewhere and write it
+    in your first commit message, to document what point in this
+    pipeline's history you started from. Since the pipeline is highly
+    integrated with your particular research, it may not be easy to merge
+    later changes. Knowing the commit in this history that you started
+    from will allow you to check and manually apply any changes that don't
+    interfere with your implemented pipeline. After this step, you can
+    commit your changes into your newly initiated history as you like.
+
+ ```shell
+ $ git describe # The point in this history you started from.
+ $ git clean -fxd # Remove any possibly created extra files.
+ $ rm -rf .git # Completely remove this history.
+ $ git init # Initiate a new history.
+ $ git add --all # Stage everything that is here.
+    $ git commit         # Make your first commit (mention the `git describe` output).
+ ```
+
+  - **`README`**: Go through this top-level instruction file and make it
+    fit your pipeline: update the text and so on. Don't forget that your
+    colleagues, or anyone else, will first be drawn to read this file, so
+    make it as easy as possible for them to understand your work.
+    Therefore, also check and update `README` one last time when you are
+    ready to publish your work (and its reproduction pipeline).
+
+ - **First input dataset**: The user manages the top-level directory of the
+ input data through the variables set in
+ `reproduce/config/pipeline/DIRECTORIES.mk.in` (the user actually edits
+ a `DIRECTORIES.mk` file that is copied from the `.mk.in` file, but the
+ `.mk` file is not under version control). So open this file and
+ replace `SURVEY` with the name of your input survey or dataset (all in
+    capital letters); for example, if you are working on data from the
+    XDF survey, replace `SURVEY` with `XDF`. But don't change the value,
+    just the name. Afterwards, change any occurrence of `SURVEY` in the
+    whole pipeline to the new name. You can find the occurrences with
+    simple commands like the ones shown below. We follow the Make
+    convention here that all `ONLY-CAPITAL` variables are those directly
+    set by the user and all `small-caps` variables are set by the pipeline
+    designer. All variables that also depend on this survey have a
+    `survey` in their name. Hence, also correct all these occurrences to
+    your new name in small-caps. Of course, ignore the occurrences that
+    are irrelevant, like those in this file. Note that in the raw version
+    of this template no target depends on these files, so they are
+    ignored. Afterwards, set the webpage and correct the filenames in
+    `reproduce/src/make/download.mk` if necessary.
+
+ ```shell
+ $ grep -r SURVEY ./
+ $ grep -r survey ./
+ ```
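+
+    If you feel comfortable with it, the rename can also be done
+    non-interactively with GNU sed (shown here with the hypothetical
+    `XDF` name from above). Since some occurrences are irrelevant (like
+    those in this file), review the result with `git diff` afterwards and
+    revert anything that shouldn't have changed:
+
+    ```shell
+    $ grep -rl SURVEY ./ | xargs sed -i 's/SURVEY/XDF/g'
+    $ grep -rl survey ./ | xargs sed -i 's/survey/xdf/g'
+    ```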
+
+ - **Other input datasets**: Add any other input datasets that may be
+ necessary for your research to the pipeline based on the example
+ above.
+
+
+Tips on using the pipeline
+==========================
+
+The following is a list of points, tips and recommendations that we have
+learned from some experience with this pipeline. Please don't hesitate to
+share with us any experience you gain after using it. In this way, we can
+add it here for others to also benefit.
+
+ - **Modularity**: Modularity is the key to easy and clean growth of a
+ project. So it is always best to break up a job into as many
+ sub-components as reasonable. Here are some tips to stay modular.
+
+    - *Short recipes*: If you see the recipe of a rule becoming more than
+      a few lines, it is probably a good sign that you should break up the
+      job.
+
+ - *Context-based (many) Makefiles*: This pipeline is designed to allow
+ the easy inclusion of many Makefiles (in `reproduce/src/make/*.mk`)
+ for maximal modularity. So keep the rules for closely related parts
+ of the processing in separate Makefiles.
+
+    - *Clear file names*: Be very clear and descriptive with the naming
+      of the files and the variables, because a few months after the
+      processing it will be very hard to remember what each one does. This
+      also helps others (your collaborators or other people reading the
+      pipeline after it is published) to more easily understand your work
+      and find their way around.
+
+ - *Standard naming*: As the project grows, following a good standard in
+      naming the files is very useful. Try your best to use multi-word
+      filenames for anything that is non-trivial (separating the words
+      with a `-`). For example, if you have a Makefile for creating a
+      catalog and
+ another two for processing it under models A and B, you can name them
+ like this: `catalog-create.mk`, `catalog-modela.mk` and
+ `catalog-modelb.mk`. In this way, when listing the contents of
+ `reproduce/src/make` to see all the Makefiles, those related to the
+ catalog will all be close to each other and thus easily found. This
+ also helps in auto-completions by the shell or text editors like
+ Emacs.
+
+ - *Source directories*: If you need to add scripts (shell, Python, AWK
+ or any other language), keep them in a separate directory under
+ `reproduce/src`, with the appropriate name.
+
+    - *Configuration files*: If your research uses special programs as
+      part of the processing, put all their configuration files in a
+      dedicated directory (with the program's name) within
+      `reproduce/config`, similar to the `reproduce/config/gnuastro`
+      directory (which is included in the template as a demo in case you
+      use GNU Astronomy Utilities). It is much cleaner and more readable
+      (thus less buggy) to avoid mixing the configuration files, even if
+      there is no technical necessity.
+
+
+  - **Contents**: It is good practice to follow the recommendations below
+    on the contents of your files, whether they are source code for a
+    program, Makefiles, scripts or configuration files (copyright
+    statements aren't necessary for the latter).
+
+ - *Copyright*: Always start a file containing programming constructs
+ with a copyright statement like the ones that this template starts
+ with (for example in the top level `Makefile`).
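+
+      As a hypothetical sketch (copy the exact wording and license notice
+      from the template's own top-level `Makefile` rather than this):
+
+      ```
+      # Reproduction pipeline for PROJECT.
+      #
+      # Copyright (C) YYYY Your Name <your@email>
+      #
+      # This Makefile is free software: you can redistribute it and/or
+      # modify it under the terms of the license stated below.
+      ```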
+
+    - *Comments*: Comments are vital for readability (by yourself in two
+      months, or by others). Describe everything you can about why you are
+      doing something, how you are doing it, and what you expect the
+      result to be. Write the comments as if they were what you would say
+      to describe the variable, recipe or rule to a friend sitting beside
+      you. When writing the pipeline it is very tempting to just steam
+      ahead, but be patient and write the comments before the rules or
+      recipes. This will also allow you to think more about what you
+      should be doing. Also, in several months when you come back to the
+      code, you will appreciate the effort of writing them. Just don't
+      forget to also read and update the comment first if you later want
+      to change the variable, recipe or rule. As a general rule of thumb:
+      first the comments, then the code.
+
+
+  - **Make programming**: Here are some lessons that we have learned over
+    the years of using Make that are useful/handy in research contexts.
+
+    - *Automatic variables*: These are wonderful and very useful Make
+      constructs that greatly shrink the text, while helping readability,
+      robustness (fewer bugs from typos, for example) and generalization.
+      For example, even when a rule only has one target or one
+      prerequisite, always use `$@` instead of the target's name, `$<`
+      instead of the first prerequisite, `$^` instead of the full list of
+      prerequisites, and so on. You can see the full list of automatic
+      variables
+      [here](https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html). If
+      you use GNU Make, you can also see this page on your command-line:
+
+ ```shell
+      $ info make "Automatic Variables"
+ ```
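+
+      For example, the two recipes below (for a hypothetical rule calling
+      Gnuastro's MakeCatalog) do the same thing, but only the second stays
+      correct if the target or prerequisite is later renamed:
+
+      ```
+      # Fragile: the file names are repeated inside the recipe, so
+      # renaming the target/prerequisite needs edits in several places:
+      #
+      #   catalog.txt: image.fits
+      #           astmkcatalog image.fits --output=catalog.txt
+
+      # Robust: the same rule written with automatic variables.
+      catalog.txt: image.fits
+              astmkcatalog $< --output=$@
+      ```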
+
+ - *Large files*: If you are dealing with very large files (thus having
+ multiple copies of them for intermediate steps is not possible), one
+ solution is the following strategy. Set a small plain text file as
+ the actual target and delete the large file when it is no longer
+ needed by the pipeline (in the last rule that needs it). Below is a
+ simple demonstration of doing this, where we use Gnuastro's
+      Arithmetic program to add 2 to all pixels of the input image and
+      create `large1.fits`. We then subtract 2 from `large1.fits` to
+      create `large2.fits` and delete `large1.fits` in the same rule
+      (when it is no longer needed). We can later do the same with
+      `large2.fits` when it is no longer needed, and so on.
+
+ ```
+ large1.fits.txt: input.fits
+ astarithmetic $< 2 + --output=$(subst .txt,,$@)
+ echo "done" > $@
+ large2.fits.txt: large1.fits.txt
+ astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@)
+ rm $(subst .txt,,$<)
+ echo "done" > $@
+ ```
+ A more advanced Make programmer will use [Make's call
+ function](https://www.gnu.org/software/make/manual/html_node/Call-Function.html)
+ to define a wrapper in `reproduce/src/make/initialize.mk`. This
+ wrapper will replace `$(subst .txt,,XXXXX)`. Therefore, it will be
+ possible to greatly simplify this repetitive statement and make the
+ code even more readable throughout the whole pipeline.
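+
+      For example, such a wrapper might look like the following
+      (`large-file` is a hypothetical name; it would be defined once in
+      `reproduce/src/make/initialize.mk`):
+
+      ```
+      large-file = $(subst .txt,,$(1))
+
+      large2.fits.txt: large1.fits.txt
+              astarithmetic $(call large-file,$<) 2 - --output=$(call large-file,$@)
+              rm $(call large-file,$<)
+              echo "done" > $@
+      ```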
+
+
+ - **Dependencies**: It is critically important to exactly document, keep
+ and check the versions of the programs you are using in the pipeline.
+
+ - *Check versions*: In `reproduce/src/make/initialize.mk`, check the
+ versions of the programs you are using.
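+
+      For example, a top-level check like the one below (a sketch with a
+      hypothetical variable name; the `--version` output format is an
+      assumption) will abort Make with a clear error when the installed
+      version differs:
+
+      ```
+      gnuastro-version = 0.5
+      ifneq ($(shell astnoisechisel --version | awk 'NR==1{print $$NF}'),$(gnuastro-version))
+      $(error Gnuastro version $(gnuastro-version) is required)
+      endif
+      ```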
+
+    - *Keep the source tarball of dependencies*: Keep a tarball of the
+      necessary version of all your dependencies (and also a copy of the
+      higher-level libraries they depend on). Software evolves very fast,
+      and in only a few years a feature might be changed or removed from
+      the mainstream version, or the software server might go down. To be
+      safe, keep a copy of the tarballs (they are hardly ever over a few
+      megabytes, insignificant compared to the data). If you intend to
+      release the pipeline in a place like Zenodo, then you can create
+      your submission early (before public release) and upload/keep all
+      the necessary tarballs (and data) there.
+
+    - *Keep your input data*: The input data is also critical to the
+      pipeline, so as with the software above, make sure you have a backup
+      of it.
+
+
+
+
+
+
+
+Appendix: Introduction to this concept from link above
+======================================================
+
+In case [the link above](http://akhlaghi.org/reproducible-science.html) is
+not accessible at the time of reading, here is a copy of the introduction
+of that page, describing the necessity of a reproduction pipeline like
+this (copied on February 7th, 2018):
+
+The most important element of a "scientific" statement/result is the fact
+that others should be able to falsify it. The tsunami of data that has
+engulfed astronomers in the last two decades, combined with faster
+processors and faster internet connections, has made it much easier to
+obtain a result. However, these factors have also increased the complexity
+of a scientific analysis, such that it is no longer possible to describe
+all the steps of an analysis in the published paper. Citing this
+difficulty, many authors settle for describing only the generalities of
+their analysis in their papers.
+
+However, it is impossible to falsify (or even study) a result if you
+can't exactly reproduce it. The complexity of modern science makes it
+vitally important to exactly reproduce the final result, because even a
+small deviation can be due to many different parts of an analysis. Nature
+is already a black box which we are trying so hard to comprehend. Not
+letting other scientists see the exact steps taken to reach a result, or
+not allowing them to modify it (do experiments on it), is a self-imposed
+black box, which only exacerbates our ignorance.
+
+Other scientists should be able to reproduce, check and experiment on the
+results of anything that carries the "scientific" label. Any result that
+is not reproducible (due to incomplete information from the authors) is
+not scientific: the readers have to have faith in the subjective
+experience of the authors in the very important choice of configuration
+values and order of operations. This is contrary to the scientific spirit.
\ No newline at end of file