author Mohammad Akhlaghi <mohammad@akhlaghi.org> 2019-01-23 16:57:32 +0000
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2019-01-23 16:57:32 +0000
commit 8362ebf4d4efe5452323de31b11384f5b69e8fbf
tree 1923c69f2892719fec22e7f7769c92379ecc01ea /README-hacking.md
parent 464f088506987ab775f6b28687fe8155c006f8ad
README-pipeline.md is now called README-hacking.md
To be more generic and recognizable, the `README-pipeline.md' file was renamed to `README-hacking.md'. In essence, it is just that: a guide to hacking the existing pipeline for your own project. Many GNU software packages follow a similar naming convention.
Diffstat (limited to 'README-hacking.md')
-rw-r--r-- README-hacking.md 999
1 files changed, 999 insertions, 0 deletions
diff --git a/README-hacking.md b/README-hacking.md
new file mode 100644
index 0000000..7fb42a2
--- /dev/null
+++ b/README-hacking.md
@@ -0,0 +1,999 @@
+Reproducible paper template
+===========================
+
+This project contains a **fully working template** for a high-level
+research reproduction pipeline, or reproducible paper, as defined in the
+link below. If the link below is not accessible at the time of reading,
+please see the appendix at the end of this file for a portion of its
+introduction. Some [slides](http://akhlaghi.org/pdf/reproducible-paper.pdf)
+are also available to help demonstrate the concept implemented here.
+
+ http://akhlaghi.org/reproducible-science.html
+
+This template is created with the aim of supporting reproducible research
+by making it easy to start a project in this framework. As shown below, it
+is very easy to customize this template reproducible paper pipeline for any
+particular research/job and expand it as it starts and evolves. It can be
+run with no modification (as described in `README.md`) as a demonstration
+and customized for use in any project as fully described below.
+
+The pipeline will download and build all the necessary libraries and
+programs for working in a closed environment (highly independent of the
+host operating system) with fixed versions of the necessary
+dependencies. The tarballs for building the local environment are also
+collected in a [separate
+repository](https://gitlab.com/makhlaghi/reproducible-paper-dependencies). The
+[final reproducible paper
+output](https://gitlab.com/makhlaghi/reproducible-paper-output/raw/master/paper.pdf)
+of this pipeline is also present in [a separate
+repository](https://gitlab.com/makhlaghi/reproducible-paper-output). Notice
+the last paragraph of the Acknowledgments where all the dependencies are
+mentioned with their versions.
+
+Below, we start with a discussion of why Make was chosen as the high-level
+language/framework for this research reproduction pipeline and how to learn
+and master Make easily (and freely). The general architecture and design of
+the pipeline is then discussed to help you navigate the files and their
+contents. This is followed by a checklist for the easy/fast customization
+of this pipeline to your exciting research. We continue with some tips and
+guidelines on how to manage or extend the pipeline as your research grows
+based on our experiences with it so far. The main body concludes with a
+description of possible future improvements that are planned for the
+pipeline (but not yet implemented). As discussed above, we end with a short
+introduction on the necessity of reproducible science in the appendix.
+
+Please don't forget to share your thoughts, suggestions and criticisms on
+this pipeline. Maintaining and designing this pipeline is itself a separate
+project, so please join us if you are interested. Once it is mature enough,
+we will describe it in a paper (written by all contributors) for a formal
+introduction to the community.
+
+
+
+
+
+Why Make?
+---------
+
+When batch processing is necessary (no manual intervention, as in a
+reproduction pipeline), shell scripts are usually the first solution that
+comes to mind. However, the inherent complexity and non-linearity of
+progress in a scientific project (where experimentation is key) make it
+hard to manage the script(s) as the project evolves. For example, a script
+will start from the top/start every time it is run. So if you have already
+completed 90% of a research project and want to run the remaining 10% that
+you have newly added, you have to run the whole script from the start
+again. Only then will you see the effects of the last new steps (to find
+possible errors, better solutions, etc.).
+
+It is possible to manually ignore/comment parts of a script to only do a
+special part. However, such checks/comments will only add to the complexity
+of the script and will discourage you from playing with or changing an
+already completed part of the project when an idea suddenly comes up. It is
+also prone to very serious bugs in the end (when trying to reproduce from
+scratch). Such bugs are very hard to notice during the work and frustrating
+to find in the end.
+
+The Make paradigm, on the other hand, starts from the end: the final
+*target*. It builds a dependency tree internally, and finds where it should
+start each time the pipeline is run. Therefore, in the scenario above, a
+researcher who has just added the final 10% of the steps of her research to
+her Makefile will only have to run those extra steps. With Make, it is
+also trivial to change the processing of any intermediate (already written)
+*rule* (or step) in the middle of an already written analysis: the next
+time Make is run, only rules that are affected by the changes/additions
+will be re-run, not the whole analysis/pipeline.
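+
+As a minimal sketch of this behavior (assuming GNU Make 3.82 or later; the
+file names `input.txt`, `step1.txt` and `step2.txt` are hypothetical and
+are not part of this pipeline), consider the small Makefile below:
+
+```make
+# Hypothetical two-step analysis, only for illustration.
+# '.RECIPEPREFIX' lets recipe lines start with '>' instead of a literal
+# TAB, so the example survives copy-pasting.
+.RECIPEPREFIX = >
+
+# The final target comes first: Make works backwards from here.
+all: step2.txt
+
+# The newly added last step: it is the only recipe that runs when
+# 'step1.txt' already exists and is newer than 'input.txt'.
+step2.txt: step1.txt
+> awk 'END {print NR}' $< > $@
+
+# An already completed step: it is re-run only when 'input.txt' is
+# newer than 'step1.txt'.
+step1.txt: input.txt
+> sort $< > $@
+```
+
+With any text file saved as `input.txt` next to this Makefile, running
+`make` twice in a row re-runs nothing the second time. After `touch
+input.txt`, both steps are re-run in order.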
+
+This greatly speeds up the processing (enabling creative changes), while
+keeping all the dependencies clearly documented (as part of the Make
+language), and most importantly, enabling full reproducibility from scratch
+with no changes in the pipeline code that was working during the
+research. This will allow robust results and let the scientists get to what
+they do best: experiment and be critical of the methods/analysis without
+having to waste energy and time on technical problems that come up as a
+result of that experimentation in scripts.
+
+Since the dependencies are clearly demarcated in Make, it can identify
+independent steps and run them in parallel. This further speeds up the
+processing. Make was designed for this purpose. It is how huge projects
+like all Unix-like operating systems (including GNU/Linux or Mac OS
+operating systems) and their core components are built. Therefore, Make is
+a highly mature paradigm/system with robust and highly efficient
+implementations in various operating systems perfectly suited for a complex
+non-linear research project.
+
+Make is a small language with the aim of defining *rules* containing
+*targets*, *prerequisites* and *recipes*. It comes with some nice features
+like functions or automatic-variables to greatly facilitate the management
+of text (filenames for example) or any of those constructs. For a more
+detailed (yet still general) introduction see the article on Wikipedia:
+
+ https://en.wikipedia.org/wiki/Make_(software)
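+
+A short sketch of these constructs, assuming GNU Make 3.82 or later (the
+`.txt` inputs and the `.sorted` suffix are hypothetical names chosen only
+for this example):
+
+```make
+# Use '>' instead of a literal TAB to start recipe lines.
+.RECIPEPREFIX = >
+
+# Hypothetical input files (they must exist in the running directory).
+inputs = a.txt b.txt
+
+# The 'patsubst' function derives the output names from the inputs:
+# here it gives 'a.sorted b.sorted'.
+outputs = $(patsubst %.txt,%.sorted,$(inputs))
+
+# Target: 'all'. Prerequisites: every output file. No recipe.
+all: $(outputs)
+
+# A pattern rule with automatic variables: '$<' is the first
+# prerequisite and '$@' is the target of the rule being run.
+%.sorted: %.txt
+> sort $< > $@
+```
+
+Adding a third name to `inputs` is enough to have it processed: neither
+the rule nor its recipe has to be repeated.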
+
+Make is more than 40 years old and still evolving, so many implementations
+of it exist. They differ only in the extra features they add over the
+[standard
+definition](https://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html)
+(which all of them share). This pipeline has been created for GNU
+Make which is the most common, most actively developed, and most advanced
+implementation. Just note that this pipeline downloads, builds, internally
+installs, and uses its own dependencies (including GNU Make), so you don't
+have to have it installed before you try it out.
+
+
+
+
+
+How can I learn Make?
+---------------------
+
+The GNU Make book/manual (links below) is arguably the best place to learn
+Make. It is an excellent book whose first few chapters are non-technical to
+help you get started easily. It is freely available and always up to date
+with the current GNU Make
+release. It also clearly explains which features are specific to GNU Make
+and which are general in all implementations. So the first few chapters
+regarding the generalities are useful for all implementations.
+
+The first link below points to the GNU Make manual in various formats and
+the second is a direct link to its PDF (which may be easier for a first
+reading).
+
+ https://www.gnu.org/software/make/manual/
+
+ https://www.gnu.org/software/make/manual/make.pdf
+
+If you use GNU Make, you also have the whole GNU Make manual on the
+command-line with the following command (you can come out of the "Info"
+environment by pressing `q`).
+
+```shell
+ $ info make
+```
+
+If you aren't familiar with the Info documentation format, we strongly
+recommend running `$ info info` and reading along. In less than an hour,
+you will become highly proficient in it (it is very simple and has a great
+manual for itself). Info greatly simplifies your access (without taking
+your hands off the keyboard!) to many manuals that are installed on your
+system, allowing you to be much more efficient as you work. If you use the
+GNU Emacs text editor (or any of its variants), you also have access to all
+Info manuals while you are writing your projects (again, without taking
+your hands off the keyboard!).
+
+
+
+
+
+Published works using this pipeline
+-----------------------------------
+
+The links below will guide you to some of the works that have already been
+published using the method of this pipeline. Note that this pipeline is
+evolving, so some small details may be different in them, but they can be
+used as a good working model to build your own.
+
+ - Section 7.3 of Bacon et
+ al. ([2017](http://adsabs.harvard.edu/abs/2017A%26A...608A...1B), A&A
+ 608, A1): The version controlled reproduction pipeline is available [on
+ Gitlab](https://gitlab.com/makhlaghi/muse-udf-origin-only-hst-magnitudes)
+ and a snapshot of the pipeline along with all the necessary input
+ datasets and outputs is available in
+ [zenodo.1164774](https://doi.org/10.5281/zenodo.1164774).
+
+ - Section 4 of Bacon et
+ al. ([2017](http://adsabs.harvard.edu/abs/2017A%26A...608A...1B), A&A,
+ 608, A1): The version controlled reproduction pipeline is available [on
+ Gitlab](https://gitlab.com/makhlaghi/muse-udf-photometry-astrometry) and
+ a snapshot of the pipeline along with all the necessary input datasets
+ is available in
+ [zenodo.1163746](https://doi.org/10.5281/zenodo.1163746).
+
+ - Akhlaghi & Ichikawa
+ ([2015](http://adsabs.harvard.edu/abs/2015ApJS..220....1A), ApJS, 220,
+ 1): The version controlled reproduction pipeline is available [on
+ Gitlab](https://gitlab.com/makhlaghi/NoiseChisel-paper). This is the
+ very first (and much less mature) implementation of this pipeline: the
+ history of this template pipeline started more than two years after that
+ paper was published. It is a very rudimentary/initial implementation,
+ thus it is only included here for historical reasons. However, the
+ pipeline is complete and accurate and uploaded to arXiv along with the
+ paper. See the more recent implementations if you want to get ideas for
+ your version of this pipeline.
+
+
+
+
+
+Citation
+--------
+
+A paper will be published to fully describe this reproduction
+pipeline. Until then, if this pipeline is useful in your work, please cite
+the paper that implemented the first version of this pipeline: Akhlaghi &
+Ichikawa ([2015](http://adsabs.harvard.edu/abs/2015ApJS..220....1A), ApJS,
+220, 1).
+
+The experience gained with this template after several more implementations
+will be used to make this pipeline robust enough for a complete and useful
+paper to introduce to the community afterwards.
+
+Also, when your paper is published, don't forget to add a notice in your
+own paper (in coordination with the publishing editor) that the paper is
+fully reproducible and possibly add a sentence or paragraph at the end of
+the paper briefly describing the concept. This will help spread the word
+and encourage other scientists to also publish their reproduction
+pipelines.
+
+
+
+
+
+
+
+
+
+
+Reproduction pipeline architecture
+==================================
+
+In order to adapt this pipeline to your research, it is important to first
+understand its architecture so you can navigate your way in the directories
+and understand how to implement your research project within its
+framework. But before reading this theoretical discussion, please run the
+pipeline (described in `README.md`: first run `./configure`, then
+`.local/bin/make -j8`) without any change, just to see how it works.
+
+In order to obtain a reproducible result it is important to have an
+identical environment (for example same versions of the programs that it
+will use). Therefore, the pipeline builds its own dependencies during the
+`./configure` step. Building of the dependencies is managed by
+`reproduce/src/make/dependencies-basic.mk` and
+`reproduce/src/make/dependencies.mk`. These Makefiles are called by the
+`./configure` script and not used afterwards. The first is intended for
+downloading and building the most basic tools like GNU Bash, GNU Make, and
+GNU Tar. Therefore it must only contain very basic and portable Make and
+shell features. The second is called after the first, thus enabling usage
+of the modern and advanced features of GNU Bash and GNU Make, similar to
+the rest of the pipeline. Later, if you add a new program/library for your
+research, you will need to include a rule on how to download and build it
+(in `reproduce/src/make/dependencies.mk`).
+
+After it finishes, `./configure` will create a `Makefile` in the top
+directory (a symbolic link to `reproduce/src/make/top.mk`) and a `.local`
+directory (a link for easy access to the custom built software
+packages). The `.local/bin/make` command will then use our custom version
+of GNU Make to do the analysis. The first file that is read by Make is the
+top-level `Makefile`. Therefore, we'll start our navigation/discussion with
+this file. This file is relatively short and heavily commented so hopefully
+the descriptions in each comment will be enough to understand the general
+details. As you read this section, please also look at the contents of the
+mentioned files and directories to fully understand what is going on.
+
+Before starting to look into the top `Makefile`, it is important to recall
+that Make defines dependencies by files. Therefore, the input/prerequisite
+and output of every step/rule must be a file. Also recall that Make will
+use the modification date of the prerequisite and target files to see if
+the target must be re-built or not. Therefore during the processing, _many_
+intermediate files will be created (see the tips section below on a good
+strategy to deal with large/huge files).
+
+To keep the source and (intermediate) built files separate, you _must_
+define a top-level build directory variable (or `$(BDIR)`) to host all the
+intermediate files (it was defined in `./configure`). This directory
+doesn't need to be version controlled or even synchronized, or backed-up in
+other servers: its contents are all products of the pipeline, and can be
+easily re-created any time. As you define targets for your new rules, it is
+thus important to place them all under sub-directories of `$(BDIR)`.
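+
+The sketch below shows one way to keep a new target under `$(BDIR)`. It
+assumes GNU Make 3.82 or later. The `stats` sub-directory, `nlines.txt`
+and `input.txt` are hypothetical names, and the fallback value of `BDIR`
+is only there so the example can run outside the pipeline:
+
+```make
+# Use '>' instead of a literal TAB to start recipe lines.
+.RECIPEPREFIX = >
+
+# In the pipeline, 'BDIR' is set by './configure'. This fallback is an
+# assumption that lets the sketch run on its own.
+BDIR ?= build
+
+# Hypothetical sub-directory for the outputs of one analysis step.
+statsdir = $(BDIR)/stats
+
+all: $(statsdir)/nlines.txt
+
+# Create the sub-directory when it doesn't exist.
+$(statsdir):
+> mkdir -p $@
+
+# '| $(statsdir)' is an order-only prerequisite: the directory must
+# exist before the recipe runs, but its modification time is ignored.
+$(statsdir)/nlines.txt: input.txt | $(statsdir)
+> awk 'END {print NR}' $< > $@
+```
+
+Since every built file is under `$(BDIR)`, deleting that one directory
+is enough to start again from scratch without touching the source.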
+
+In this architecture, we have two types of Makefiles that are loaded into
+the top `Makefile`: _configuration-Makefiles_ (only independent
+variables/configurations) and _workhorse-Makefiles_ (Makefiles that
+actually contain rules).
+
+The configuration-Makefiles are those that satisfy this wildcard:
+`reproduce/config/pipeline/*.mk`. These Makefiles don't actually have any
+rules, they just have values for various free parameters throughout the
+analysis/processing. Open a few of them to see for yourself. These
+Makefiles must only contain raw Make variables (pipeline
+configurations). By raw we mean that the Make variables in these files must
+not depend on variables in any other configuration-Makefile. This is
+because we don't want to assume any order in reading them. It is also very
+important to *not* define any rule, or other Make construct in any of these
+configuration-Makefiles.
+
+These conditions will enable you to set these configuration-Makefiles as a
+prerequisite to any target that depends on their variable
+values. Therefore, if you change any of their values, all targets that
+depend on those values will be re-built. This is very convenient as your
+project scales up and gets more complex.
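+
+A minimal single-file sketch of this idea, assuming GNU Make 3.82 or
+later. The configuration-Makefile `min-value.mk`, its variable
+`min-value`, and the file names are all hypothetical. In the pipeline
+itself, the `include` is done once by the top-level Makefile and not by
+each workhorse-Makefile:
+
+```make
+# Use '>' instead of a literal TAB to start recipe lines.
+.RECIPEPREFIX = >
+
+# Hypothetical configuration-Makefile. It must already exist, and its
+# whole content is assumed to be the single line:  min-value = 5
+conf = reproduce/config/pipeline/min-value.mk
+include $(conf)
+
+# Fallback for running outside the pipeline (an assumption).
+BDIR ?= build
+
+# Listing '$(conf)' as a prerequisite re-builds the target whenever the
+# configuration file is edited. '$<' is still 'input.txt', because it
+# is the first prerequisite.
+$(BDIR)/selected.txt: input.txt $(conf)
+> mkdir -p $(BDIR)
+> awk '$$1 >= $(min-value)' $< > $@
+```
+
+Changing the value in `min-value.mk` and running `make` again re-creates
+`selected.txt`, while targets that don't list that file as a
+prerequisite are left alone.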
+
+The workhorse-Makefiles are those satisfying this wildcard
+`reproduce/src/make/*.mk`. They contain the details of the processing steps
+(Makefiles containing rules). Therefore, in this phase *order is
+important*, because the prerequisites of most rules will be the targets of
+other rules that will be defined prior to them (not a fixed name like
+`paper.pdf`). The lower-level rules must be imported into Make before the
+higher-level ones.
+
+All processing steps are assumed to ultimately (usually after many rules)
+end up in some number, image, figure, or table that is to be included in
+the paper. The writing of these results into the final report/paper is
+managed through separate LaTeX files that only contain macros (a name given
+to a number/string to be used in the LaTeX source, which will be replaced
+when compiling it to the final PDF). So the last target in a
+workhorse-Makefile is a `.tex` file (with the same base-name as the
+Makefile, but in `$(BDIR)/tex/macros`). As a result, if the targets in a
+workhorse-Makefile aren't directly a prerequisite of other
+workhorse-Makefile targets, they can be a pre-requisite of that
+intermediate LaTeX macro file and thus be called when necessary. Otherwise,
+they will be ignored by Make.
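+
+A hedged sketch of such a final target, assuming GNU Make 3.82 or later.
+The workhorse-Makefile name `demo.mk`, the macro `\demonlines`, the
+`mtexdir` variable name, and the intermediate file are hypothetical
+choices for this example:
+
+```make
+# Use '>' instead of a literal TAB to start recipe lines.
+.RECIPEPREFIX = >
+
+# Fallback for running outside the pipeline (an assumption).
+BDIR ?= build
+
+# Directory of the LaTeX macro files described above.
+mtexdir = $(BDIR)/tex/macros
+
+all: $(mtexdir)/demo.tex
+
+# A hypothetical analysis step: count the lines of 'input.txt'.
+$(BDIR)/demo-nlines.txt: input.txt
+> mkdir -p $(BDIR)
+> awk 'END {print NR}' $< > $@
+
+# Last target of the hypothetical 'demo.mk': write the result as a
+# LaTeX macro so the paper's text can use '\demonlines'.
+$(mtexdir)/demo.tex: $(BDIR)/demo-nlines.txt
+> mkdir -p $(mtexdir)
+> printf '\\newcommand{\\demonlines}{%s}\n' "$$(cat $<)" > $@
+```
+
+Any other target that `demo.mk` defines is only built when it is,
+directly or indirectly, a prerequisite of `demo.tex` or of a target in
+another workhorse-Makefile.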
+
+This pipeline also has a mode to share the build directory between several
+users of a Unix group (when working on large computer clusters). In this
+scenario, each user can have their own cloned pipeline source, but share
+the large built files between each other. To do this, it is necessary for
+all built files to give full permission to group members while not allowing
+any other users access to the contents. Therefore the `./configure` and
+Make steps must be called with special conditions which are managed in the
+`for-group` file.
+
+Let's see how this design is implemented. When `./configure` finishes,
+it makes a `Makefile` in the top directory. This Makefile is just a
+symbolic link to `reproduce/src/make/top.mk`. Please open and inspect it as
+we go along here. The first step (un-commented line) is to import the local
+configuration (answers to the questions `./configure` asked you). They are
+defined in the configuration-Makefile `reproduce/config/pipeline/LOCAL.mk`
+which was also built by `./configure` (based on the `LOCAL.mk.in`
+template).
+
+The next non-commented set of lines define the ultimate target of the whole
+pipeline (`paper.pdf`). But a sanity check is necessary for situations when
+the user is not careful (for example has configured the pipeline for group
+access but forgets to run the pipeline with `./for-group`, or the
+opposite). Therefore we use a Make conditional to define the `all` target
+based on the group permissions being consistent between the initial
+configuration and the current run.
+
+If there is a problem `all` will not depend on anything and will just print
+a warning to inform you of the problem. When the group conditions are fine,
+`all` will depend on `paper.pdf` (which is defined in
+`reproduce/src/make/paper.mk` and will be imported into this top Makefile
+later).
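+
+The logic of this check can be sketched as follows, assuming GNU Make
+3.82 or later. The two variable names and the `paper.pdf` stand-in rule
+are hypothetical: see `reproduce/src/make/top.mk` for the names that are
+actually used.
+
+```make
+# Use '>' instead of a literal TAB to start recipe lines.
+.RECIPEPREFIX = >
+
+# Hypothetical names: the group given at configure time and the group
+# of the current run (both empty when no group is used).
+configured-group =
+current-group =
+
+ifeq ($(configured-group),$(current-group))
+# Consistent settings: the ultimate target is the paper.
+all: paper.pdf
+else
+# Inconsistent settings: no prerequisites, only a warning.
+all:
+> @echo "Group settings differ between configuration and this run."
+> @echo "Nothing will be built."
+endif
+
+# Stand-in for the real rule in 'reproduce/src/make/paper.mk'.
+paper.pdf:
+> @echo "(paper.pdf would be built here)"
+```
+
+Running this sketch with `make current-group=staff` makes the two values
+differ, so only the warning is printed.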
+
+Having defined the top target, our next step is to include all the other
+necessary Makefiles. But order matters in the importing of
+workhorse-Makefiles and each must also have a TeX macro file with the same
+base name (without a suffix). Therefore, the next step in the top-level
+Makefile is to define a `makesrc` variable to keep the base names (without
+a `.mk` suffix) of the workhorse-Makefiles that must be imported, in the
+proper order.
+
+Finally, we'll just import all the configuration-Makefiles with a wildcard
+(while ignoring `LOCAL.mk` that was imported before). Also, all
+workhorse-Makefiles are imported in the proper order using a Make `foreach`
+loop. This finishes the general view of the pipeline's implementation.
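+
+The fragment below is a sketch of these last steps, assuming GNU Make
+and the directory layout described above, so it only works inside the
+pipeline's top directory. Only workhorse-Makefile names that are
+mentioned in this text are listed: the real list and its order in
+`reproduce/src/make/top.mk` may differ.
+
+```make
+# Base names (without '.mk') of the workhorse-Makefiles, in the order
+# in which they must be read.
+makesrc = initialize \
+          download \
+          paper
+
+# All configuration-Makefiles except 'LOCAL.mk', which was already
+# included. Their order doesn't matter.
+include $(filter-out %/LOCAL.mk,$(wildcard reproduce/config/pipeline/*.mk))
+
+# The workhorse-Makefiles, in the order given in 'makesrc'.
+include $(foreach s,$(makesrc),reproduce/src/make/$(s).mk)
+```
+
+Keeping the base names in one variable also allows the same list to be
+re-used for the names of the TeX macro files.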
+
+In short, to keep things modular, readable and manageable, follow these
+recommendations: 1) Set clear-to-understand names for the
+configuration-Makefiles and workhorse-Makefiles, 2) Only import other
+Makefiles from the top Makefile. These will let you know/remember generally
+which step you are taking before or after another. Projects will scale up
+very fast. Thus if you don't start and continue with a clean and robust
+convention like this, in the end it will become very dirty and hard to
+manage/understand (even for yourself). As a general rule of thumb, break
+your rules into as many logically-similar but independent steps as
+possible.
+
+The `reproduce/src/make/paper.mk` Makefile must be the final Makefile that
+is included. It ends with the rule to build `paper.pdf` (final target of
+the whole reproduction pipeline). If you look in it, you will notice that it
+starts with a rule to create `tex/pipeline.tex`. `tex/pipeline.tex` is the
+connection between the processing/analysis steps of the pipeline, and the
+steps to build the final PDF. As you see, `tex/pipeline.tex` is only a
+merging/concatenation of LaTeX macros defined as the output of each
+high-level processing step (the separate workhorse-Makefiles that you
+included).
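+
+A simplified, self-contained sketch of this concatenation, assuming GNU
+Make 3.82 or later. The pattern rule that writes the macro files is
+only a stand-in for the real workhorse-Makefiles, and the variable
+names are assumptions of this example:
+
+```make
+# Use '>' instead of a literal TAB to start recipe lines.
+.RECIPEPREFIX = >
+
+# Fallback for running outside the pipeline (an assumption).
+BDIR ?= build
+mtexdir = $(BDIR)/tex/macros
+
+# Base names of the workhorse-Makefiles that write a macro file.
+makesrc = initialize download
+
+all: tex/pipeline.tex
+
+# Stand-in for the real macro files: each one only contains a LaTeX
+# comment naming its workhorse-Makefile ('$*' is the pattern's stem).
+$(mtexdir)/%.tex:
+> mkdir -p $(mtexdir)
+> echo "% Macros written by $*.mk" > $@
+
+# Concatenate all macro files ('$^' is the full prerequisite list).
+tex/pipeline.tex: $(foreach s,$(makesrc),$(mtexdir)/$(s).tex)
+> mkdir -p tex
+> cat $^ > $@
+```
+
+Because every macro file is a prerequisite of `tex/pipeline.tex`, asking
+Make for the paper is enough to trigger all the analysis steps behind
+those macros.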
+
+One of the LaTeX macros created by `reproduce/src/make/initialize.mk` is
+`\bdir`. It is the location of the build directory. In some cases you want
+tables and images to also be included in the final PDF. To keep these
+necessary LaTeX inputs, you can define other directories under
+`$(BDIR)/tex` in the relevant workhorse-Makefile. You can then easily guide
+LaTeX to look into the proper directory to import an image for example
+through the `\bdir` macro.
+
+During the research, it often happens that you want to test a step that is
+not a prerequisite of any higher-level operation. In such cases, you can
+(temporarily) define that processing as a rule in the most relevant
+workhorse-Makefile and set its target as a prerequisite of its TeX
+macro. If your test gives a promising result and you want to include it in
+your research, set it as a prerequisite of other rules and remove it from
+the list of prerequisites of the TeX macro file. In fact, this is how a
+project is designed to grow in this framework.
+
+When working within a group, more than one person may want to work with the
+pipeline outputs (in the build directory). For example each person is
+developing part of the higher-level steps of the pipeline in their own Git
+branch of the pipeline, but using the same build directory. Therefore, the
+lower-level parts of the built outputs can be shared between them. In such
+scenarios, this pipeline comes with a `for-group` script (in the top
+directory) which is just a simple wrapper to run the configure and building
+steps. You can specify a group name within this file. Therefore, when you
+use it (fully described in the comments at the start of the file), it will
+ensure that all group members have write access to the created files.
+
+
+
+
+Summary
+-------
+
+Based on the explanation above, some major design points you should have in
+mind are listed below.
+
+ - Define new `reproduce/src/make/XXXXXX.mk` workhorse-Makefile(s) with
+ good and human-friendly name(s) replacing `XXXXXX`.
+
+ - Add `XXXXXX`, as a new line, to the values in `makesrc` of the top-level
+ `Makefile`.
+
+ - Do not use any constant numbers (or important names like filter names)
+ in the workhorse-Makefiles or paper's LaTeX source. Define such
+ constants as logically-grouped, separate configuration-Makefiles in
+ `reproduce/config/pipeline`. Then set the respective
+ configuration-Makefile as a prerequisite of any rule that uses
+ the variable defined in it.
+
+ - Through any number of intermediate prerequisites, all processing steps
+ should end in (be a prerequisite of) `tex/pipeline.tex` (defined in
+ `reproduce/src/make/paper.mk`). `tex/pipeline.tex` is the bridge between
+ the processing steps and PDF-building steps.
+
+
+
+
+
+
+
+
+
+
+Checklist to customize the pipeline
+===================================
+
+Take the following steps to fully customize this pipeline for your research
+project. After finishing the list, be sure to run `./configure` and `make`
+to see if everything works correctly before expanding it. If you notice
+anything missing or any incorrect part (probably a change that has not
+been explained here), please let us know so we can correct it.
+
+As described above, the concept of a reproduction pipeline heavily relies
+on [version
+control](https://en.wikipedia.org/wiki/Version_control). Currently this
+pipeline uses Git as its main version control system. If you are not already
+familiar with Git, please read the first three chapters of the [ProGit
+book](https://git-scm.com/book/en/v2) which provides a wonderful practical
+understanding of the basics. You can read later chapters as you get more
+advanced in later stages of your work.
+
+ - **Get this repository and its history** (if you don't already have it):
+ Arguably the easiest way to start is to clone this repository as shown
+ below. The main branch of this pipeline is called `pipeline`. This
+ allows you to use the common branch name `master` for your own
+ research, while keeping up to date with improvements in the pipeline.
+
+ ```shell
+ $ git clone https://gitlab.com/makhlaghi/reproducible-paper.git
+ $ mv reproducible-paper my-project-name # Your own directory name.
+ $ cd my-project-name # Go into the cloned directory.
+ $ git tag | xargs git tag -d # Delete all pipeline tags.
+ $ git config remote.origin.tagopt --no-tags # No tags in future fetch/pull from this pipeline.
+ $ git remote rename origin pipeline-origin # Rename the pipeline's remote.
+ $ git checkout -b master # Create, enter master branch.
+ ```
+
+ - **Test the pipeline**: Before making any changes, it is important to
+ test the pipeline and see if everything works properly with the
+ commands below. If there is any problem in the `./configure` or `make`
+ steps, please contact us to fix the problem before continuing. Since
+ the building of dependencies in `./configure` can take long, you can
+ take the next few steps (editing the files) while it's working (they
+ don't affect the configuration). After `make` is finished, open
+ `paper.pdf` and if it looks fine, you are ready to start customizing
+ the pipeline for your project. But before that, clean all the extra
+ pipeline outputs with `make clean` as shown below.
+
+ ```shell
+ $ ./configure # Set top directories and build dependencies.
+ $ .local/bin/make # Run the pipeline.
+
+ # Open 'paper.pdf' and see if everything is ok.
+ $ .local/bin/make clean # Delete high-level outputs.
+ ```
+
+ - **Setup the remote**: You can use any [hosting
+ facility](https://en.wikipedia.org/wiki/Comparison_of_source_code_hosting_facilities)
+ that supports Git to keep an online copy of your project's version
+ controlled history. We recommend [GitLab](https://gitlab.com) because
+ it allows any number of private repositories for free and because you
+ can host GitLab on your own server. Create an account in your favorite
+ hosting facility (if you don't already have one), and define a new
+ project there. It will give you a link (ending in `.git`) that you can
+ put in place of `XXXXXXXXXX` in the command below.
+
+ ```shell
+ $ git remote add origin XXXXXXXXXX
+ ```
+
+ - **Copyright**, **name** and **date**: Go over the existing scripting
+ files and add your name and email to the copyright notice. You can
+ find the files by searching for the placeholder email
+ `your@email.address` (which you should change) with the command below
+ (you can ignore this file, and any in the `tex/` directory). Don't
+ forget to add your name after the copyright year also. When making new
+ files, always remember to add a similar copyright statement at the top
+ of the file and also ask your colleagues to do so when they edit a
+ file. This is very important.
+
+ ```shell
+ $ grep -r your@email.address ./*
+ ```
+
+ - **Title**, **short description** and **author** in source files: In this
+ raw skeleton, the title or short description of your project should be
+ added in the following two files: `reproduce/src/make/top.mk` (the
+ first line), and `tex/preamble-header.tex`. In both cases, the texts
+ you should replace are all in capital letters to make them easier to
+ identify. Of course, if you use a different LaTeX method of managing
+ the title and authors, please feel free to use your own methods after
+ finishing this checklist and doing your first commit.
+
+ - **Gnuastro**: GNU Astronomy Utilities (Gnuastro) is currently a
+ dependency of the pipeline which will be built and used. The main
+ reason for this is to demonstrate how critically important it is to
+ version your scientific tools. If you don't need Gnuastro for your
+ research, you can simply remove the marked parts in the relevant
+ files of the list below. The marks are comments, which you can find
+ by searching for "Gnuastro". If you will be using Gnuastro, then
+ remove the comment marks and keep the code within them.
+
+ - Delete marked part(s) in `configure`.
+ - Delete the `reproduce/config/gnuastro` directory.
+ - Delete `astnoisechisel` from the value of `top-level-programs` in `reproduce/src/make/dependencies.mk`. You can keep the rule to build `astnoisechisel`: since it is not in the `top-level-programs` list, it (and all the dependencies that are only needed by Gnuastro) will be ignored.
+ - Delete marked parts in `reproduce/src/make/initialize.mk`.
+
+ - **Other dependencies**: If there are any more of the dependencies that
+ you don't use (or others that you need), then remove (or add) them in
+ the respective parts of `reproduce/src/make/dependencies.mk`. It is
+ commented thoroughly and reading over the comments should guide you on
+ what to add/remove and where.
+
+ - **Input dataset (can be done later)**: The input datasets are managed
+ through the `reproduce/config/pipeline/INPUTS.mk` file. It is best to
+ gather all the information regarding all the input datasets into this
+ one central file. To ensure that the proper dataset is being
+ downloaded and used by the pipeline, it is also recommended to get an
+ [MD5 checksum](https://en.wikipedia.org/wiki/MD5) of the file and
+ include that in `INPUTS.mk` so you can check it in the pipeline. The
+ preparation of the input datasets is done in
+ `reproduce/src/make/download.mk`. Have a look there to see how these
+ values are to be used. This information about the input datasets is
+ also used in the initial `configure` script (to inform the users), so
+ also modify that file. You can find all occurrences of the template
+ dataset with the command below and replace it with your input's
+ dataset.
+
+ ```shell
+ $ grep -ir wfpc2 ./*
+ ```
+
+ - **Delete dummy parts (can be done later)**: The template pipeline
+ contains some parts that are only for the initial/test run, mainly as
+ a demonstration of important steps. They are not for any real
+ analysis. You can remove these parts in the files below:
+
+ - `paper.tex`: Delete the text of the abstract and the paper's main
+ body, *except* the "Acknowledgments" section. This reproduction
+ pipeline was designed with funding from many grants, so it is necessary
+ to acknowledge them in your final research.
+
+ - `Makefile`: Delete the lines containing `delete-me` in the `foreach`
+ loop. Just make sure the other lines that end in `\` are immediately
+ after each other (except the last line).
+
+ - Delete all `delete-me*` files in the following directories:
+
+ ```shell
+ $ rm tex/delete-me*
+ $ rm reproduce/src/make/delete-me*
+ $ rm reproduce/config/pipeline/delete-me*
+ ```
+
+ - **`README.md`**: Correct all the `XXXXX` placeholders (name of your
+ project, your own name, address of pipeline's online/remote
+ repository, link to download dependencies, etc.). Generally, read
+ over the text and update it where necessary to fit your project. Don't
+ forget that this is the first file that is displayed on your online
+ repository and also your colleagues will first be drawn to read this
+ file. Therefore, make it as easy as possible for them to start
+ with. Also check and update this file one last time when you are ready
+ to publish your work (and its reproduction pipeline).
+
+ - **`for-group`**: If you will be working on this pipeline with
+ colleagues, and the build steps involve many files, or are slow, you
+ need to share the build directory. This script is designed for such
+ scenarios. So open this file and give the Unix name of your group
+ to the `thisgroup` variable. You can see the list of groups
+ you are a member of with the `groups` command. You can ask your system
+ administrator to define a group with specific members if necessary.
+
+ - **Your first commit**: You have already made some small and basic
+ changes in the steps above and you are in the `master` branch. So, you
+ can officially make your first commit in your project's history. But
+ before that you need to make sure that there are no problems in the
+ pipeline (it is a good habit to always re-build the system before a
+ commit, to be sure it works as expected).
+
+ ```shell
+ $ .local/bin/make clean # Delete outputs ('make distclean' for everything)
+ $ .local/bin/make # Build the pipeline to ensure everything is fine.
+ $ git add -u # Stage all the changes.
+ $ git status # Make sure everything is fine.
+ $ git commit # Your first commit, add a nice description.
+ $ git tag -a v0 # Tag this as the zero-th version of your pipeline.
+ ```
+
+ - **Push to the remote**: Push your first commit and its tag to the remote
+ repository with these commands:
+
+ ```shell
+ $ git push -u origin --all
+ $ git push -u origin --tags
+ ```
+
+ - **Start your exciting research**: You are now ready to add flesh and
+ blood to this raw skeleton by further modifying and adding your
+ exciting research steps. You can use the "published works" section in
+ the introduction (above) as some fully working models to learn
+ from. Also, don't hesitate to contact us if you have any
+ questions. Any time you are ready to push your commits to the remote
+ repository, you can simply use `git push`.
+
+ - **Feedback**: As you use the pipeline you will notice many things that
+ if implemented from the start would have been very useful for your
+ work. This can be in the actual scripting and architecture of the
+ pipeline or in useful implementation and usage tips, like those
+ below. In any case, please share your thoughts and suggestions with
+ us, so we can add them here for everyone's benefit.
+
+ - **Keep pipeline up-to-date**: In time, this pipeline is going to become
+ more and more mature and robust (thanks to your feedback and the
+ feedback of other users). Bugs will be fixed and new/improved features
+ will be added. So every once in a while, you can run the commands
+ below to pull new work that is done in this pipeline. If the changes
+ are useful for your work, you can merge them with your own customized
+ pipeline to benefit from them. Just pay **very close attention** to
+ resolving possible **conflicts** which might happen in the merge
+ (updated general pipeline settings that you have customized).
+
+ ```shell
+ $ git checkout pipeline
+ $ git pull pipeline-origin pipeline # Get recent work in this pipeline.
+ $ git log XXXXXX..XXXXXX --reverse # Inspect new work (replace XXXXXXs with hashes mentioned in output of previous command).
+ $ git log --oneline --graph --decorate --all # General view of branches.
+ $ git checkout master # Go to your top working branch.
+ $ git merge pipeline # Import all the work into master.
+ ```
+
+ - **Pre-publication: add notice on reproducibility**: Add a notice
+ somewhere prominent on the first page of your paper, informing the
+ reader that your research is fully reproducible. For example at the
+ end of the abstract, or under the keywords with a title like
+ "reproducible paper". This will encourage readers to publish their own
+ work in this manner and will also help spread the word.
+
+
+
+
+
+
+
+
+Usage tips: designing your pipeline/workflow
+============================================
+
+The following is a list of design points, tips, or recommendations that
+have been learned after some experience with this pipeline. Please don't
+hesitate to share any experience you gain after using this pipeline with
+us. In this way, we can add it here for the benefit of others.
+
+ - **Modularity**: Modularity is the key to easy and clean growth of a
+ project. So it is always best to break up a job into as many
+ sub-components as reasonable. Here are some tips to stay modular.
+
+ - *Short recipes*: if you see the recipe of a rule becoming more than a
+ handful of lines which involve significant processing, it is probably
+ a good sign that you should break up the rule into its main
+ components. Try to only have one major processing step per rule.
+
+ - *Context-based (many) Makefiles*: This pipeline is designed to allow
+ the easy inclusion of many Makefiles (in `reproduce/src/make/*.mk`)
+ for maximal modularity. So keep the rules for closely related parts
+ of the processing in separate Makefiles.
+
+ - *Descriptive names*: Be very clear and descriptive with the naming of
+ the files and the variables because a few months after the
+ processing, it will be very hard to remember what each one was
+ for. Also this helps others (your collaborators or other people
+ reading the pipeline after it is published) to more easily understand
+ your work and find their way around.
+
+ - *Naming convention*: As the project grows, following a single standard
+ or convention in naming the files is very useful. Try your best to use
+ multiple word filenames for anything that is non-trivial (separating
+ the words with a `-`). For example if you have a Makefile for
+ creating a catalog and another two for processing it under models A
+ and B, you can name them like this: `catalog-create.mk`,
+ `catalog-model-a.mk` and `catalog-model-b.mk`. In this way, when
+ listing the contents of `reproduce/src/make` to see all the
+ Makefiles, those related to the catalog will all be close to each
+ other and thus easily found. This also helps in auto-completions by
+ the shell or text editors like Emacs.
+
+ - *Source directories*: If you need to add files in other languages (for
+ example shell, Python, AWK or C), keep them in a separate directory
+ under `reproduce/src`, with the appropriate name.
+
+ - *Configuration files*: If your research uses special programs as part
+ of the processing, put all their configuration files in a devoted
+ directory (with the program's name) within `reproduce/config`,
+ similar to the `reproduce/config/gnuastro` directory (which is put in
+ the template as a demo in case you use GNU Astronomy Utilities). It
+ is much cleaner and more readable (thus less buggy) to avoid mixing
+ the configuration files, even if there is no
+ technical necessity.
+
+
+ - **Contents**: It is good practice to follow the following
+ recommendations on the contents of your files, whether they are source
+ code for a program, Makefiles, scripts or configuration files
+ (copyrights aren't necessary for the latter).
+
+ - *Copyright*: Always start a file containing programming constructs
+ with a copyright statement like the ones that this template starts
+ with (for example in the top level `Makefile`).
+
+ - *Comments*: Comments are vital for readability (by yourself in two
+ months, or others). Describe everything you can about why you are
+ doing something, how you are doing it, and what you expect the result
+ to be. Write the comments as if it was what you would say to describe
+ the variable, recipe or rule to a friend sitting beside you. When
+ writing the pipeline it is very tempting to just steam ahead with
+ commands and code, but be patient and write comments before the
+ rules or recipes. This will also allow you to think more about what
+ you should be doing. Also, in several months when you come back to
+ the code, you will appreciate the effort of writing them. Just don't
+ forget to also read and update the comment first if you later want to
+ make changes to the code (variable, recipe or rule). As a general
+ rule of thumb: first the comments, then the code.
+
+ - *File title*: In general, it is good practice to start all files with
+ a single line description of what that particular file does. If
+ further information about the totality of the file is necessary, add
+ it after a blank line. This will help a fast inspection where you
+ don't care about the details, but just want to remember/see what that
+ file is (generally) for. This information must of course be commented
+ (it is for a human), but this is kept separate from the general
+ recommendation on comments, because this is a comment for the whole
+ file, not each step within it.
+
+
+ - **Make programming**: Here are some lessons that we have learned
+ over the years of using Make and that are useful/handy in research
+ contexts.
+
+ - *Automatic variables*: These are wonderful and very useful Make
+ constructs that greatly shrink the text, while helping in
+ readability, robustness (fewer bugs from typos, for example) and
+ generalization. For example even when a rule only has one target or
+ one prerequisite, always use `$@` instead of the target's name, `$<`
+ instead of the first prerequisite, `$^` instead of the full list of
+ prerequisites, etc. You can see the full list of automatic
+ variables
+ [here](https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html). If
+ you use GNU Make, you can also see this page on your command-line:
+
+ ```shell
+ $ info make "automatic variables"
+ ```
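+
+ A minimal sketch of these three variables in one rule is shown
+ below. The file names are hypothetical and only chosen for
+ illustration; they are not part of the template.
+
+ ```make
+ # Hypothetical rule. Within the recipe:
+ #   $@ expands to the target           (all-tables.txt)
+ #   $< expands to the first prerequisite (table-a.txt)
+ #   $^ expands to all the prerequisites  (table-a.txt table-b.txt)
+ all-tables.txt: table-a.txt table-b.txt
+ 	echo "Building $@, starting with $<"
+ 	cat $^ > $@
+ ```
+
+ Because no file name is repeated in the recipe, renaming the target
+ or adding a prerequisite only needs a change on the first line of
+ the rule.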
+
+ - *Debug*: Since Make doesn't follow the common top-down paradigm, it
+ can be a little hard to get accustomed to why you get an error or
+ unexpected behavior. In such cases, run Make with the `-d`
+ option. With this option, Make prints a full list of exactly which
+ prerequisites are being checked for which targets. Looking
+ (patiently) through this output and searching for the faulty
+ file/step will clearly show you any mistake you might have made in
+ defining the targets or prerequisites.
+
+ - *Large files*: If you are dealing with very large files (thus having
+ multiple copies of them for intermediate steps is not possible), one
+ solution is the following strategy. Set a small plain text file as
+ the actual target and delete the large file when it is no longer
+ needed by the pipeline (in the last rule that needs it). Below is a
+ simple demonstration of doing this, where we use Gnuastro's
+ Arithmetic program to add 2 to all pixels of the input image and
+ create `large1.fits`. We then subtract 2 from `large1.fits` to create
+ `large2.fits` and delete `large1.fits` in the same rule (when it is no
+ longer needed). We can later do the same with `large2.fits` when it
+ is no longer needed and so on.
+ ```
+ large1.fits.txt: input.fits
+ 	astarithmetic $< 2 + --output=$(subst .txt,,$@)
+ 	echo "done" > $@
+
+ large2.fits.txt: large1.fits.txt
+ 	astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@)
+ 	rm $(subst .txt,,$<)
+ 	echo "done" > $@
+ ```
+ A more advanced Make programmer will use Make's [call
+ function](https://www.gnu.org/software/make/manual/html_node/Call-Function.html)
+ to define a wrapper in `reproduce/src/make/initialize.mk`. This
+ wrapper will replace `$(subst .txt,,XXXXX)`. Therefore, it will be
+ possible to greatly simplify this repetitive statement and make the
+ code even more readable throughout the whole pipeline.
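+
+ The wrapper described above might be sketched as follows. The name
+ `large-file` is only an assumption made for this illustration (the
+ template does not define it); the second rule of the demonstration
+ is repeated here with the wrapper in place.
+
+ ```make
+ # Hypothetical wrapper (for example in initialize.mk): given the name
+ # of the small text-file target, return the name of the large file.
+ large-file = $(subst .txt,,$(1))
+
+ large2.fits.txt: large1.fits.txt
+ 	astarithmetic $(call large-file,$<) 2 - --output=$(call large-file,$@)
+ 	rm $(call large-file,$<)
+ 	echo "done" > $@
+ ```
+
+ With this design, the naming convention that links the text file to
+ the large file is defined in one place, so it can be changed later
+ without editing every rule.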
+
+
+ - **Dependencies**: It is critically important to exactly document, keep
+ and check the versions of the programs you are using in the pipeline.
+
+ - *Check versions*: In `reproduce/src/make/initialize.mk`, check the
+ versions of the programs you are using.
+
+ - *Keep the source tarball of dependencies*: keep a tarball of the
+ necessary version of all your dependencies (and also a copy of the
+ higher-level libraries they depend on). Software evolves very fast
+ and within only a few years, a feature might be changed or removed from
+ the mainstream version or the software server might go down. To be
+ safe, keep a copy of the tarballs. Software tarballs are rarely over
+ a few megabytes, which is insignificant compared to the data. If you
+ intend to release the pipeline in a place like Zenodo, then you can
+ create your submission early (before public release) and upload/keep
+ all the necessary tarballs (and data)
+ there. [zenodo.1163746](https://doi.org/10.5281/zenodo.1163746) is
+ one example of how the data, Gnuastro (main software used) and all
+ of Gnuastro's major dependencies have been uploaded with the pipeline.
+
+ - *Keep your input data*: The input data is also critical to the
+ pipeline, so like the above for software, make sure you have a backup
+ of them.
+
+ - **Version control**: It is important (and extremely useful) to have the
+ history of your pipeline under version control. So try to make commits
+ regularly (after any meaningful change/step/result), while not
+ forgetting the following notes.
+
+ - *Tags*: To help manage the history, tag all major commits. This helps
+ make a more human-friendly output of `git describe`: for example
+ `v1-4-gaafdb04` states that we are on commit `aafdb04` which is 4
+ commits after tag `v1`. The output of `git describe` is included in
+ your final PDF as part of this pipeline. Also, if you use
+ reproducibility-friendly software like Gnuastro, this value will also
+ be included in all output files, see the description of `COMMIT` in
+ [Output
+ headers](https://www.gnu.org/software/gnuastro/manual/html_node/Output-headers.html).
+ In the checklist above, we tagged the first commit of your pipeline
+ with `v0`. Here is one suggestion on when to tag: when you have fully
+ adopted the pipeline and have got the first (initial) results, you
+ can make a `v1` tag. Subsequently when you first start reporting the
+ results to your colleagues, you can tag the commit as `v2`. Afterwards
+ when you submit the paper, it can be tagged `v3` and so on.
+
+ - *Pipeline outputs*: During your research, it is possible to checkout a
+ specific commit and reproduce its results. However, the processing
+ can be time consuming. Therefore, it is useful to also keep track of
+ the final outputs of your pipeline (at minimum, the paper's PDF) at
+ important points in its history. However, keeping a snapshot of these
+ (most probably large volume) outputs in the main history of the
+ pipeline can unreasonably bloat it. It is thus recommended to make a
+ separate Git repo to keep those files and keep this pipeline's volume
+ as small as possible. For example if your main pipeline is called
+ `my-exciting-project`, the name of the outputs pipeline can be
+ `my-exciting-project-output`. This enables easy sharing of the output
+ files with your co-authors (with necessary permissions) and not
+ having to bloat your email archive with extra attachments (you can
+ just share the link to the online repo in your communications). After
+ the research is published, you can also release the outputs pipeline,
+ or you can just delete it if it is too large or unnecessary (it was
+ just for convenience, and fully reproducible after all). This
+ pipeline's output is available for demonstration in the separate
+ [reproducible-paper-output](https://gitlab.com/makhlaghi/reproducible-paper-output)
+ repository.
+
+
+
+
+
+
+
+
+
+
+Future improvements
+===================
+
+This is an evolving project and, as time goes on, it will become more
+robust. Here is the list of features that we plan to add in the
+future.
+
+ - *Containers*: It is important to have better/full control of the
+ environment of the reproduction pipeline. Our current reproducible
+ paper pipeline builds the higher-level programs (for example GNU Bash,
+ GNU Make, GNU AWK, etc.) it needs and sets `PATH` to prefer its own
+ builds. It currently doesn't build and use its own version of
+ lower-level tools (like the C library and compiler). We plan to add the
+ build steps of these low level tools so the system's `PATH` can be
+ completely ignored within the pipeline and we are in full control of
+ the whole build process. Another solution is based on [an interesting
+ tutorial](https://mozillafoundation.github.io/2017-fellows-sf/re-papers/index.html)
+ by the Mozilla science lab to build reproducible papers. It suggests
+ using the [Nix package manager](https://nixos.org/nix/about.html) to
+ build the necessary software for the pipeline and run the pipeline in
+ its completely closed environment. This is an interesting solution
+ because using Nix or [Guix](https://www.gnu.org/software/guix/) (which
+ is based on Nix, but uses the [Scheme
+ language](https://en.wikipedia.org/wiki/Scheme_(programming_language)),
+ not a custom language for the management) will allow a fully working
+ closed environment on the host system which contains the instructions
+ on how to build the environment. The availability of the instructions
+ to build the programs and environment with Nix or Guix makes them a
+ better solution than binary containers like
+ [docker](https://www.docker.com/) which are essentially just a binary
+ (not human readable) black box and only usable on the given CPU
+ architecture. However, one limitation of using these is their own
+ installation (which usually requires root access).
+
+
+
+
+
+
+
+
+
+
+Appendix: Necessity of exact reproduction in scientific research
+================================================================
+
+In case [the link above](http://akhlaghi.org/reproducible-science.html) is
+not accessible at the time of reading, here is a copy of the introduction
+of that link, describing the necessity for a reproduction pipeline like
+this (copied on February 7th, 2018):
+
+The most important element of a "scientific" statement/result is the fact
+that others should be able to falsify it. The Tsunami of data that has
+engulfed astronomers in the last two decades, combined with faster
+processors and faster internet connections has made it much easier to
+obtain a result. However, these factors have also increased the complexity
+of a scientific analysis, such that it is no longer possible to describe
+all the steps of an analysis in the published paper. Citing this
+difficulty, many authors settle for describing the generalities of their
+analysis in their papers.
+
+However, it is impossible to falsify (or even study) a result if you can't
+exactly reproduce it. The complexity of modern science makes it vitally
+important to exactly reproduce the final result. Because even a small
+deviation can be due to many different parts of an analysis. Nature is
+already a black box which we are trying so hard to comprehend. Not letting
+other scientists see the exact steps taken to reach a result, or not
+allowing them to modify it (do experiments on it) is a self-imposed black
+box, which only exacerbates our ignorance.
+
+Other scientists should be able to reproduce, check and experiment on the
+results of anything that is to carry the "scientific" label. Any result
+that is not reproducible (due to incomplete information by the author) is
+not scientific: the readers have to have faith in the subjective experience
+of the authors in the very important choice of configuration values and
+order of operations: this is contrary to the scientific spirit.