author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2018-02-07 20:37:15 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2018-02-07 20:37:15 +0100
commit a16f22881841e57f2652f2a17b7f60b5106b2e60
tree 6e5a86c38e68cd9f9be546d17c69adad17483825 /README.md
First commit to the reproduction pipeline template
Let's start working on this pipeline independently with this first commit. It is based on my previous experiences, but I had never made a skeleton of a pipeline before: it was always within a working analysis. Now that the pipeline has a separate repository for itself, we will be able to work on it, use it as a base for future work and modify it to make it even better. Hopefully in time (and with the help of others), it will grow and become much more robust and useful.
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 400
1 files changed, 400 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1a52571
--- /dev/null
+++ b/README.md
@@ -0,0 +1,400 @@
+Introduction
+============
+
+This description is for *creators* of the reproduction pipeline. See
+`README` for instructions on running it.
+
+This project contains a **fully working template** for a high-level
+research reproduction pipeline as defined in the link below. If this page
+is inaccessible at the time of reading, please see the end of this file,
+which contains a portion of the introduction from that webpage.
+
+ http://akhlaghi.org/reproducible-science.html
+
+This template is created with the aim of supporting reproducible research
+by making it easy to start a project in this framework. It is very easy to
+customize this template pipeline for any particular research/job and expand
+it as it starts and evolves. It can be run with no modification (as
+described in `README`) as a demonstration and customized by editing the
+existing rules and adding new rules as well as adding new Makefiles as the
+research/project grows.
+
+This file will continue with a discussion of why Make is the perfect
+language/framework for a research reproduction pipeline and how to master
+Make easily. Afterwards, a checklist of actions that are necessary to
+customize this pipeline for your research is provided. The main body of
+this text will finish with some tips and guidelines on how to manage or
+extend it as your research grows. Please share your thoughts and
+suggestions on this pipeline so we can implement them and make it even
+easier to use and more robust.
+
+
+Why Make?
+---------
+
+When batch processing is necessary (no manual intervention, as in a
+reproduction pipeline), shell scripts are usually the first solution that
+comes to mind. However, scripts cope poorly with the complexity and
+non-linearity of a scientific reproduction pipeline. A script will
+start from the top every time it is run. So if you have gone through
+90% of a research project and want to run the remaining 10% that you have
+newly added, you have to run the whole script from the start again and wait
+until you see the effects of the last few steps (possible errors, better
+solutions, etc.). It is possible to manually ignore/comment
+parts of a script to only run a particular part, but such checks/comments
+only add to the complexity of the script and are prone to very serious
+bugs in the end (when trying to reproduce from scratch). Such bugs are very
+hard to notice when adding the checks or commenting out code in a script.
+
+The Make paradigm, on the other hand, starts from the end: the final
+target. It builds a dependency tree internally, and finds where it should
+actually start each time it is run. Therefore, in the scenario above, a
+researcher that has just added the final 10% of steps of her research to
+her Makefile will only have to run those extra steps. Because of this
+paradigm, it is also trivial to change the processing of any intermediate
+(already written) rule/step in the middle of an existing pipeline, as
+commonly happens in a research context: the next time Make is run, only
+rules affected by the changes/additions will be re-run.
+
+This greatly speeds up the processing (enabling creative changes), while
+keeping all the dependencies clearly documented (as part of the Make
+language), and most importantly, enabling full reproducibility from scratch
+with no changes in the pipeline code that was working during the
+research. Since the dependencies are also clearly demarcated, Make can
+identify independent steps and run them in parallel (further speeding up
+the process). Make was designed for this purpose and it is how huge
+projects like all Unix-like operating systems (including GNU/Linux or Mac
+OS operating systems) and their core components are built. Therefore, Make
+is a highly mature paradigm/system with robust and highly efficient
+implementations in various operating systems perfectly suited for a complex
+non-linear research project.
+
+Make is a small language with the aim of defining "rules" containing
+"targets", "prerequisites" and "recipes". It comes with some cool features
+like functions or automatic-variables to greatly facilitate the management
+of any of those constructs. For a more detailed (yet still general)
+introduction see Wikipedia:
+
+ https://en.wikipedia.org/wiki/Make_(software)
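+
+As a minimal sketch of these constructs (and of the scenario described
+earlier in this section), the hypothetical Makefile below defines two
+rules. The file names `input.txt`, `sorted.txt` and `report.txt` are
+assumptions made only for this illustration and are not part of the
+pipeline.
+
+```make
+# General form of a rule (each recipe line must start with a TAB):
+#
+#     target: prerequisite(s)
+#             recipe
+
+# Final target: Make starts from here and works backwards.
+report.txt: sorted.txt
+	wc -l sorted.txt > report.txt
+
+# Intermediate step: its recipe is only run when `sorted.txt` does not
+# exist or is older than `input.txt`.
+sorted.txt: input.txt
+	sort input.txt > sorted.txt
+```
+
+With an existing `input.txt`, the first call to `make` should run both
+recipes and a second call should run nothing. If `input.txt` is modified
+afterwards, both recipes are run again, but if only `sorted.txt` is
+newer than `report.txt`, only the final recipe is repeated.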
+
+Many implementations of Make exist and all should be usable with this
+pipeline. This pipeline has been created and tested mainly with GNU Make,
+which is the most common implementation. If you see parts that are
+specific to GNU Make, please let us know so we can correct them.
+
+
+How can I learn Make?
+---------------------
+
+The best place to learn Make from scratch is the GNU Make manual. It is an
+excellent and non-technical (in its first chapters) book to help get
+started. It is freely available and always up to date with the current
+release. It also clearly explains which features are specific to GNU Make
+and which are general in all implementations. So the first few chapters
+regarding the generalities are useful for all implementations.
+
+The first link below has links to the GNU Make manual in various formats
+and in the second, you can get it in PDF (which may be easier to read
+the first time).
+
+ https://www.gnu.org/software/make/manual/
+
+ https://www.gnu.org/software/make/manual/make.pdf
+
+If you use GNU Make, you also have the whole GNU Make manual on the
+command-line with the command below. You can leave the "info"
+environment by pressing `q`. If you don't know "Info", we strongly
+recommend running `$ info info` to learn it easily: it greatly simplifies
+your access to many manuals that are installed on your system.
+
+```shell
+ $ info make
+```
+
+If you use the Emacs text editor, you will find the Info version of the
+Make manual there also.
+
+
+
+
+Checklist to customize the pipeline
+===================================
+
+Take the following steps to fully customize this pipeline for your research
+project. After finishing the list, be sure to run `./configure` and `make`
+to see if everything works correctly before expanding it. If you notice
+anything missing or any incorrect part (probably a change that has not
+been explained here), please let us know so we can correct it.
+
+ - **Get this repository** (if you don't already have it): Arguably the
+ easiest way to start is to clone this repository as shown below:
+
+ ```shell
+ $ git clone https://gitlab.com/makhlaghi/reproduction-pipeline-template.git
+ ```
+
+ - **Copyright**, **name** and **date**: Go over the following files and
+ correct the copyright, names and dates in their first few lines:
+ `README`, `Makefile` and `reproduce/src/make/*.mk`. When making new
+ files, always remember to add a similar copyright statement at the top
+ of the file.
+
+ - **Gnuastro**: GNU Astronomy Utilities (Gnuastro) is currently a
+ dependency of the pipeline and without it, the pipeline will complain
+ and abort. The main reason for this is to demonstrate how critically
+ important it is to version your software. If you don't want to install
+ Gnuastro please follow the instructions in the list below. If you do
+ have Gnuastro (or have installed it to check this pipeline), then
+ after an initial check, try un-commenting the `onlyversion` line and
+ running the pipeline to see the respective error. Such features in
+ software make it easy to design a robust pipeline like this. If you
+ have tried it and don't need Gnuastro in your pipeline, also follow
+ this list:
+
+ - Delete the description about Gnuastro in `README`.
+ - Delete everything about Gnuastro in `reproduce/src/make/initialize.mk`
+ - Delete `and Gnuastro \gnuastrover` from `tex/preamble-style`
+
+ - **Initiate a new Git repo**: You don't want to mix the history of this
+ template reproduction pipeline with your own reproduction
+ pipeline. You have already made some small changes in the previous
+ step, so let's re-initiate history before continuing. But before doing
+ that, save the output of `git describe` somewhere and write it in
+ your first commit message to document what point in this pipeline's
+ history you started from. Since the pipeline is highly integrated with
+ your particular research, it may not be easy to merge the changes
+ later. Knowing the commit in this history that you started from will
+ allow you to check and manually apply any changes that don't interfere
+ with your implemented pipeline. After this step, you can commit your
+ changes into your newly initiated history as you like.
+
+ ```shell
+ $ git describe # The point in this history you started from.
+ $ git clean -fxd # Remove any possibly created extra files.
+ $ rm -rf .git # Completely remove this history.
+ $ git init # Initiate a new history.
+ $ git add --all # Stage everything that is here.
+ $ git commit # Make your first commit (mention the first output)
+ ```
+
+ - **`README`**: Go through this top-level instruction file and make it fit
+ your pipeline: update the text, etc. Don't forget that your
+ colleagues or anyone else will first be drawn to read this file, so
+ make it as easy as possible for them to understand your
+ work. Therefore, also check and update `README` one last time when you
+ are ready to publish your work (and its reproduction pipeline).
+
+ - **First input dataset**: The user manages the top-level directory of the
+ input data through the variables set in
+ `reproduce/config/pipeline/DIRECTORIES.mk.in` (the user actually edits
+ a `DIRECTORIES.mk` file that is copied from the `.mk.in` file, but the
+ `.mk` file is not under version control). So open this file and
+ replace `SURVEY` with the name of your input survey or dataset (all in
+ capital letters), for example if you are working on data from the XDF
+ survey, replace `SURVEY` with `XDF`. But don't change the value, just
+ the name. Afterwards, change any occurrence of `SURVEY` in the whole
+ pipeline with the new name. You can find the occurrences with a simple
+ command like the ones shown below. We follow the Make convention here
+ that all `ONLY-CAPITAL` variables are those directly set by the user
+ and all `lower-case` variables are set by the pipeline designer. All
+ variables that also depend on this survey have a `survey` in their
+ name. Hence, also correct all these occurrences to your new name in
+ lower-case. Of course, ignore those occurrences that are irrelevant,
+ like those in this file. Note that in the raw version of this template
+ no target depends on these files, so they are ignored. Afterwards, set
+ the webpage and correct the filenames in
+ `reproduce/src/make/download.mk` if necessary.
+
+ ```shell
+ $ grep -r SURVEY ./
+ $ grep -r survey ./
+ ```
+
+ - **Other input datasets**: Add any other input datasets that may be
+ necessary for your research to the pipeline based on the example
+ above.
+
+
+Tips on using the pipeline
+==========================
+
+The following is a list of points, tips, or recommendations that have been
+learned after some experience with this pipeline. Please don't hesitate to
+share any experience you gain after using this pipeline with us. In this
+way, we can add it here for others to also benefit.
+
+ - **Modularity**: Modularity is the key to easy and clean growth of a
+ project. So it is always best to break up a job into as many
+ sub-components as reasonable. Here are some tips to stay modular.
+
+ - *Short recipes*: if you see the recipe of a rule becoming more than a
+ few lines, it is probably a good sign that you should break up the job.
+
+ - *Context-based (many) Makefiles*: This pipeline is designed to allow
+ the easy inclusion of many Makefiles (in `reproduce/src/make/*.mk`)
+ for maximal modularity. So keep the rules for closely related parts
+ of the processing in separate Makefiles.
+
+ - *Clear file names*: Be very clear and descriptive with the naming of
+ the files and the variables because a few months after the processing
+ it will be very hard to remember what each one does. Also this helps
+ others (your collaborators or other people reading the pipeline after
+ it is published) to more easily understand your work and find their
+ way around.
+
+ - *Standard naming*: As the project grows, following a good standard in
+ naming the files is very useful. Try your best to use multiple word
+ filenames for anything that is non-trivial (separating the words with
+ a `-`). For example if you have a Makefile for creating a catalog and
+ another two for processing it under models A and B, you can name them
+ like this: `catalog-create.mk`, `catalog-modela.mk` and
+ `catalog-modelb.mk`. In this way, when listing the contents of
+ `reproduce/src/make` to see all the Makefiles, those related to the
+ catalog will all be close to each other and thus easily found. This
+ also helps in auto-completions by the shell or text editors like
+ Emacs.
+
+ - *Source directories*: If you need to add scripts (shell, Python, AWK
+ or any other language), keep them in a separate directory under
+ `reproduce/src`, with the appropriate name.
+
+ - *Configuration files*: If your research uses special programs as part
+ of the processing, put all their configuration files in a dedicated
+ directory (with the program's name) within
+ `reproduce/config`, similar to the `reproduce/config/gnuastro`
+ directory (which is put in the template as a demo in case you use GNU
+ Astronomy Utilities). It is much cleaner and more readable (thus less
+ buggy) to avoid mixing the configuration files, even if there is no
+ technical necessity.
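+
+ As a minimal sketch of the *Context-based Makefiles* and *Standard
+ naming* points above, a top-level `Makefile` could include the
+ per-context Makefiles as shown below. This is only an illustration
+ (using GNU Make's `foreach` function) and not necessarily how this
+ template does it: the variable name `makesrc` is hypothetical and the
+ file names are the examples given above.
+
+ ```make
+ # Hypothetical list of Makefiles to include (without their directory
+ # or `.mk` suffix), in the order they should be read.
+ makesrc = catalog-create catalog-modela catalog-modelb
+
+ # Expands to `reproduce/src/make/catalog-create.mk` and so on.
+ include $(foreach m,$(makesrc),reproduce/src/make/$(m).mk)
+ ```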
+
+
+ - **Contents**: It is good practice to follow these
+ recommendations on the contents of your files, whether they are source
+ code for a program, Makefiles, scripts or configuration files
+ (copyrights aren't necessary for the latter).
+
+ - *Copyright*: Always start a file containing programming constructs
+ with a copyright statement like the ones that this template starts
+ with (for example in the top level `Makefile`).
+
+ - *Comments*: Comments are vital for readability (by yourself in two
+ months or others). Describe everything you can about why you are
+ doing something, how you are doing it, and what you expect the result
+ to be. Write the comments as if it was what you would say to describe
+ the variable, recipe or rule to a friend sitting beside you. When
+ writing the pipeline it is very tempting to just steam ahead, but be
+ patient and write comments before the rules or recipes. This will
+ also allow you to think more about what you should be doing. Also, in
+ several months when you come back to the code, you will appreciate
+ the effort of writing them. Just don't forget to also read and update
+ the comment first if you later want to make changes to the variable,
+ recipe or rule. As a general rule of thumb: first the comments, then
+ the code.
+
+
+ - **Make programming**: Here are some experiences that we have come to
+ learn over the years in using Make and are useful/handy in research
+ contexts.
+
+ - *Automatic variables*: These are wonderful and very useful Make
+ constructs that greatly shrink the text, while helping readability,
+ robustness (fewer bugs from typos, for example) and generalization. For
+ example even when a rule only has one target or one prerequisite,
+ always use `$@` instead of the target's name, `$<` instead of the
+ first prerequisite, `$^` instead of the full list of prerequisites,
+ etc. You can see the full list of automatic variables
+ [here](https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html). If
+ you use GNU Make, you can also see this page on your command-line:
+
+ ```shell
+ $ info make "automatic variables"
+ ```
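+
+ As a minimal sketch of these automatic variables (the file names are
+ hypothetical and each recipe line must start with a TAB):
+
+ ```make
+ # `$^` is the full list of prerequisites and `$@` is the target.
+ merged.txt: part-a.txt part-b.txt
+ 	cat $^ > $@
+
+ # `$<` is the first (here, the only) prerequisite.
+ first-line.txt: merged.txt
+ 	head -n 1 $< > $@
+ ```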
+
+ - *Large files*: If you are dealing with very large files (thus having
+ multiple copies of them for intermediate steps is not possible), one
+ solution is the following strategy. Set a small plain text file as
+ the actual target and delete the large file when it is no longer
+ needed by the pipeline (in the last rule that needs it). Below is a
+ simple demonstration of doing this, where we use Gnuastro's
+ Arithmetic program to add 2 to all pixels of the input image and
+ create `large1.fits`. We then subtract 2 from `large1.fits` to create
+ `large2.fits` and delete `large1.fits` in the same rule (when it is no
+ longer needed). We can later do the same with `large2.fits` when it
+ is no longer needed and so on.
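+ ```make
+ # Create the large file, then write the small text file as the target.
+ large1.fits.txt: input.fits
+ 	astarithmetic $< 2 + --output=$(subst .txt,,$@)
+ 	echo "done" > $@
+
+ # Use the large file, then delete it since no later rule needs it.
+ large2.fits.txt: large1.fits.txt
+ 	astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@)
+ 	rm $(subst .txt,,$<)
+ 	echo "done" > $@
+ ```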
+ A more advanced Make programmer will use [Make's call
+ function](https://www.gnu.org/software/make/manual/html_node/Call-Function.html)
+ to define a wrapper in `reproduce/src/make/initialize.mk`. This
+ wrapper will replace `$(subst .txt,,XXXXX)`. Therefore, it will be
+ possible to greatly simplify this repetitive statement and make the
+ code even more readable throughout the whole pipeline.
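+
+ As a minimal sketch of such a wrapper, the rules above could be written
+ as shown below. The name `notxt` is hypothetical (it is not defined by
+ this template) and `call` is a GNU Make function.
+
+ ```make
+ # Hypothetical wrapper: remove `.txt` from its first argument.
+ notxt = $(subst .txt,,$(1))
+
+ large1.fits.txt: input.fits
+ 	astarithmetic $< 2 + --output=$(call notxt,$@)
+ 	echo "done" > $@
+
+ large2.fits.txt: large1.fits.txt
+ 	astarithmetic $(call notxt,$<) 2 - --output=$(call notxt,$@)
+ 	rm $(call notxt,$<)
+ 	echo "done" > $@
+ ```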
+
+
+ - **Dependencies**: It is critically important to exactly document, keep
+ and check the versions of the programs you are using in the pipeline.
+
+ - *Check versions*: In `reproduce/src/make/initialize.mk`, check the
+ versions of the programs you are using.
+
+ - *Keep the source tarball of dependencies*: keep a tarball of the
+ necessary version of all your dependencies (and also a copy of the
+ higher-level libraries they depend on). Software evolves very fast
+ and in only a few years, a feature might be changed or removed from
+ the mainstream version or the software server might go down. To be
+ safe, keep a copy of the tarballs (they are hardly ever over a few
+ megabytes, very insignificant compared to the data). If you intend to
+ release the pipeline in a place like Zenodo, then you can create your
+ submission early (before public release) and upload/keep all the
+ necessary tarballs (and data) there.
+
+ - *Keep your input data*: The input data is also critical to the
+ pipeline, so like the above for software, make sure you have a backup
+ of it.
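+
+ As a minimal sketch of the *Check versions* point above, a version
+ check in GNU Make could look like the example below. This is an
+ illustration and not necessarily how `initialize.mk` does it: the
+ variable names and the version number are hypothetical, and it assumes
+ the version is the last word on the first line of the program's
+ `--version` output (as in the GNU Coding Standards).
+
+ ```make
+ # Hypothetical version that the pipeline was written for.
+ expected-gnuastro = 0.5
+
+ # Last word of the first line of the `--version` output (empty if the
+ # program is not installed). `$$` passes a literal `$` to the shell.
+ found-gnuastro := $(shell astarithmetic --version | awk 'NR==1{print $$NF}')
+
+ # Abort before running any rule if the versions differ.
+ ifneq ($(found-gnuastro),$(expected-gnuastro))
+ $(error Gnuastro $(expected-gnuastro) is needed but found '$(found-gnuastro)')
+ endif
+ ```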
+
+
+
+
+
+
+
+Appendix: Introduction to this concept from link above
+======================================================
+
+In case [the link above](http://akhlaghi.org/reproducible-science.html) is
+not accessible at the time of reading, here is a copy of the introduction
+of that link, describing the necessity for a reproduction pipeline like
+this (copied on February 7th, 2018):
+
+The most important element of a "scientific" statement/result is the fact
+that others should be able to falsify it. The tsunami of data that has
+engulfed astronomers in the last two decades, combined with faster
+processors and faster internet connections, has made it much easier to
+obtain a result. However, these factors have also increased the complexity
+of a scientific analysis, such that it is no longer possible to describe
+all the steps of an analysis in the published paper. Citing this
+difficulty, many authors settle for describing the generalities of their
+analysis in their papers.
+
+However, it is impossible to falsify (or even study) a result if you can't
+exactly reproduce it. The complexity of modern science makes it vitally
+important to exactly reproduce the final result, because even a small
+deviation can be due to many different parts of an analysis. Nature is
+already a black box which we are trying so hard to comprehend. Not letting
+other scientists see the exact steps taken to reach a result, or not
+allowing them to modify it (do experiments on it) is a self-imposed black
+box, which only exacerbates our ignorance.
+
+Other scientists should be able to reproduce, check and experiment on the
+results of anything that is to carry the "scientific" label. Any result
+that is not reproducible (due to incomplete information by the author) is
+not scientific: the readers have to have faith in the subjective experience
+of the authors in the very important choice of configuration values and
+order of operations: this is contrary to the scientific spirit.
\ No newline at end of file