-rw-r--r-- | about.html | 2508 |
1 files changed, 1278 insertions, 1230 deletions
@@ -1,1240 +1,1288 @@ -<h1>Maneage: managing data lineage</h1> - -<p>Copyright (C) 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a>\ -Copyright (C) 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a>\ -See the end of the file for license conditions.</p> - -<p>Maneage is a <strong>fully working template</strong> for doing reproducible research (or -writing a reproducible paper) as defined in the link below. If the link -below is not accessible at the time of reading, please see the appendix at -the end of this file for a portion of its introduction. Some -<a href="http://akhlaghi.org/pdf/reproducible-paper.pdf">slides</a> are also available -to help demonstrate the concept implemented here.</p> - -<p>http://akhlaghi.org/reproducible-science.html</p> - -<p>Maneage is created with the aim of supporting reproducible research by -making it easy to start a project in this framework. As shown below, it is -very easy to customize Maneage for any particular (research) project and -expand it as it starts and evolves. It can be run with no modification (as -described in <code>README.md</code>) as a demonstration and customized for use in any -project as fully described below.</p> - -<p>A project designed using Maneage will download and build all the necessary -libraries and programs for working in a closed environment (highly -independent of the host operating system) with fixed versions of the -necessary dependencies. The tarballs for building the local environment are -also collected in a <a href="http://git.maneage.org/tarballs-software.git/tree/">separate -repository</a>. The final -output of the project is <a href="http://git.maneage.org/output-raw.git/plain/paper.pdf">a -paper</a>. 
Notice the -last paragraph of the Acknowledgments where all the necessary software is -mentioned with its versions.</p> - -<p>Below, we start with a discussion of why Make was chosen as the high-level -language/framework for project management and how to learn and master Make -easily (and freely). The general architecture and design of the project are -then discussed to help you navigate the files and their contents. This is -followed by a checklist for the easy/fast customization of Maneage to your -exciting research. We continue with some tips and guidelines on how to -manage or extend your project as it grows based on our experiences with it -so far. The main body concludes with a description of possible future -improvements that are planned for Maneage (but not yet implemented). As -mentioned above, we end with a short introduction on the necessity of -reproducible science in the appendix.</p> - -<p>Please don't forget to share your thoughts, suggestions and -criticisms. Maintaining and designing Maneage is itself a separate project, -so please join us if you are interested. Once it is mature enough, we will -describe it in a paper (written by all contributors) for a formal -introduction to the community.</p> - -<h2>Why Make?</h2> - -<p>When batch processing is necessary (no manual intervention, as in a -reproducible project), shell scripts are usually the first solution that -comes to mind. However, the inherent complexity and non-linearity of -progress in a scientific project (where experimentation is key) make it -hard to manage the script(s) as the project evolves. For example, a script -will start from the top/start every time it is run. So if you have already -completed 90% of a research project and want to run the remaining 10% that -you have newly added, you have to run the whole script from the start -again. 
Only then will you see the effects of the last new steps (to find -possible errors, better solutions, and so on).</p> - -<p>It is possible to manually ignore/comment out parts of a script to only run a -specific part. However, such checks/comments will only add to the complexity -of the script and will discourage you from playing with or changing an already -completed part of the project when an idea suddenly comes up. It is also -prone to very serious bugs in the end (when trying to reproduce from -scratch). Such bugs are very hard to notice during the work and frustrating -to find in the end.</p> - -<p>The Make paradigm, on the other hand, starts from the end: the final -<em>target</em>. It builds a dependency tree internally, and finds where it should -start each time the project is run. Therefore, in the scenario above, a -researcher who has just added the final 10% of steps of her research to -her Makefile will only have to run those extra steps. With Make, it is -also trivial to change the processing of any intermediate (already written) -<em>rule</em> (or step) in the middle of an already written analysis: the next -time Make is run, only rules that are affected by the changes/additions -will be re-run, not the whole analysis/project.</p> - -<p>This greatly speeds up the processing (enabling creative changes), while -keeping all the dependencies clearly documented (as part of the Make -language), and most importantly, enabling full reproducibility from scratch -with no changes in the project code that was working during the -research. This will allow robust results and let the scientists get to what -they do best: experiment and be critical of the methods/analysis without -having to waste energy and time on technical problems that come up as a -result of that experimentation in scripts.</p> - -<p>Since the dependencies are clearly demarcated in Make, it can identify -independent steps and run them in parallel. This further speeds up the -processing. 
Make was designed for this purpose. It is how huge projects -like all Unix-like operating systems (including GNU/Linux or Mac OS -operating systems) and their core components are built. Therefore, Make is -a highly mature paradigm/system with robust and highly efficient -implementations in various operating systems perfectly suited for a complex -non-linear research project.</p> - -<p>Make is a small language with the aim of defining <em>rules</em> containing -<em>targets</em>, <em>prerequisites</em> and <em>recipes</em>. It comes with some nice features -like functions or automatic-variables to greatly facilitate the management -of text (filenames for example) or any of those constructs. For a more -detailed (yet still general) introduction see the article on Wikipedia:</p> - -<p>https://en.wikipedia.org/wiki/Make_(software)</p> - -<p>Make is more than 40 years old and still evolving, so many -implementations of Make exist. They differ only in the extra -features each adds over the <a href="https://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">standard -definition</a> -(which all of them share). Maneage is primarily written in GNU Make -(which it installs itself, you don't have to have it on your system). GNU -Make is the most common, most actively developed, and most advanced -implementation. Just note that Maneage downloads, builds, internally -installs, and uses its own dependencies (including GNU Make), so you don't -have to have it installed before you try it out.</p> - -<h2>How can I learn Make?</h2> - -<p>The GNU Make book/manual (links below) is arguably the best place to learn -Make. It is an excellent and non-technical book to help get started (it is -only non-technical in its first few chapters to get you started easily). It -is freely available and always up to date with the current GNU Make -release. It also clearly explains which features are specific to GNU Make -and which are general in all implementations. 
So the first few chapters -regarding the generalities are useful for all implementations.</p> - -<p>The first link below points to the GNU Make manual in various formats and -in the second, you can download it in PDF (which may be easier for a first -time reading).</p> - -<p>https://www.gnu.org/software/make/manual/</p> - -<p>https://www.gnu.org/software/make/manual/make.pdf</p> - -<p>If you use GNU Make, you also have the whole GNU Make manual on the -command-line with the following command (you can come out of the "Info" -environment by pressing <code>q</code>).</p> - -<p><code>shell - $ info make -</code></p> - -<p>If you aren't familiar with the Info documentation format, we strongly -recommend running <code>$ info info</code> and reading along. In less than an hour, -you will become highly proficient in it (it is very simple and has a great -manual for itself). Info greatly simplifies your access (without taking -your hands off the keyboard!) to many manuals that are installed on your -system, allowing you to be much more efficient as you work. If you use the -GNU Emacs text editor (or any of its variants), you also have access to all -Info manuals while you are writing your projects (again, without taking -your hands off the keyboard!).</p> - -<h2>Published works using Maneage</h2> - -<p>The list below shows some of the works that have already been published -with (earlier versions of) Maneage. Previously it was simply called -"Reproducible paper template". Note that Maneage is evolving, so some -details may be different in them. The more recent ones can be used as a -good working example.</p> - -<ul> -<li><p>Infante-Sainz et -al. 
(<a href="https://ui.adsabs.harvard.edu/abs/2020MNRAS.491.5317I">2020</a>, -MNRAS, 491, 5317): The version controlled project source is available -<a href="https://gitlab.com/infantesainz/sdss-extended-psfs-paper">on GitLab</a> -and is also archived on Zenodo with all the necessary software tarballs: -<a href="https://zenodo.org/record/3524937">zenodo.3524937</a>.</p></li> -<li><p>Akhlaghi (<a href="https://arxiv.org/abs/1909.11230">2019</a>, IAU Symposium -355). The version controlled project source is available <a href="https://gitlab.com/makhlaghi/iau-symposium-355">on -GitLab</a> and is also -archived on Zenodo with all the necessary software tarballs: -<a href="https://doi.org/10.5281/zenodo.3408481">zenodo.3408481</a>.</p></li> -<li><p>Section 7.3 of Bacon et -al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A -608, A1): The version controlled project source is available <a href="https://gitlab.com/makhlaghi/muse-udf-origin-only-hst-magnitudes">on -GitLab</a> -and a snapshot of the project along with all the necessary input -datasets and outputs is available in -<a href="https://doi.org/10.5281/zenodo.1164774">zenodo.1164774</a>.</p></li> -<li><p>Section 4 of Bacon et -al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A, -608, A1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/muse-udf-photometry-astrometry">on -GitLab</a> and -a snapshot of the project along with all the necessary input datasets is -available in <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a>.</p></li> -<li><p>Akhlaghi & Ichikawa -(<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, -1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/NoiseChisel-paper">on -GitLab</a>. This is the -very first (and much less mature!) 
incarnation of Maneage: the history -of Maneage started more than two years after this paper was -published. It is a very rudimentary/initial implementation, thus it is -only included here for historical reasons. However, the project source -is complete, accurate and uploaded to arXiv along with the paper.</p></li> -</ul> - -<h2>Citation</h2> - -<p>A paper to fully describe Maneage has been submitted. Until then, if you -use it in your work, please cite the paper that implemented its first -version: Akhlaghi & Ichikawa -(<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, 1).</p> - -<p>Also, when your paper is published, don't forget to add a notice in your -own paper (in coordination with the publishing editor) that the paper is -fully reproducible and possibly add a sentence or paragraph at the end of -the paper briefly describing the concept. This will help spread the word -and encourage other scientists to also manage and publish their projects in -a reproducible manner.</p> - -<h1>Project architecture</h1> - -<p>In order to customize Maneage to your research, it is important to first -understand its architecture so you can navigate your way in the directories -and understand how to implement your research project within its framework: -where to add new files and which existing files to modify for what -purpose. But if this is the first time you are using Maneage, before reading -this theoretical discussion, please run Maneage once from scratch without -any changes (described in <code>README.md</code>). You will see how it works (note that -the configure step builds all necessary software, so it can take a long time, but -you can continue reading while it's working).</p> - -<p>The project has two top-level directories: <code>reproduce</code> and -<code>tex</code>. <code>reproduce</code> hosts all the software building and analysis -steps. 
<code>tex</code> contains all the final paper's components to be compiled into -a PDF using LaTeX.</p> - -<p>The <code>reproduce</code> directory has two sub-directories: <code>software</code> and -<code>analysis</code>. As the names suggest, the former contains all the instructions to -download, build and install (independent of the host operating system) the -necessary software (these are called by the <code>./project configure</code> -command). The latter contains instructions on how to use that software to -do your project's analysis.</p> - -<p>After it finishes, <code>./project configure</code> will create the following symbolic -links in the project's top source directory: <code>.build</code> which points to the -top build directory and <code>.local</code> for easy access to the custom built -software installation directory. With these you can easily access the build -directory and project-specific software from your top source directory. For -example if you run <code>.local/bin/ls</code> you will be using the <code>ls</code> of Maneage, -which is probably different from your system's <code>ls</code> (run them both with -<code>--version</code> to check).</p> - -<p>Once the project is configured for your system, <code>./project make</code> will do -the basic preparations and run the project's analysis with the custom -version of software. The <code>project</code> script is just a wrapper, and with the -<code>make</code> argument, it will first call <code>top-prepare.mk</code> and then <code>top-make.mk</code> -(both are in the <code>reproduce/analysis/make</code> directory).</p> - -<p>In terms of organization, <code>top-prepare.mk</code> and <code>top-make.mk</code> have a -nearly identical design, with only minor differences. So, let's continue describing Maneage's -architecture with <code>top-make.mk</code>. Once you understand that, you'll clearly -understand <code>top-prepare.mk</code> also. 
These very high-level files are -relatively short and heavily commented so hopefully the descriptions in -each comment will be enough to understand the general details. As you read -this section, please also look at the contents of the mentioned files and -directories to fully understand what is going on.</p> - -<p>Before starting to look into the top <code>top-make.mk</code>, it is important to -recall that Make defines dependencies by files. Therefore, the -input/prerequisite and output of every step/rule must be a file. Also -recall that Make will use the modification date of the prerequisite(s) and -target files to see if the target must be re-built or not. Therefore during -the processing, <em>many</em> intermediate files will be created (see the tips -section below on a good strategy to deal with large/huge files).</p> - -<p>To keep the source and (intermediate) built files separate, the user <em>must</em> -define a top-level build directory variable (or <code>$(BDIR)</code>) to host all the -intermediate files (you defined it during <code>./project configure</code>). This -directory doesn't need to be version controlled or even synchronized, or -backed-up in other servers: its contents are all products, and can be -easily re-created any time. As you define targets for your new rules, it is -thus important to place them all under sub-directories of <code>$(BDIR)</code>. As -mentioned above, you always have fast access to this "build"-directory with -the <code>.build</code> symbolic link. 
Also, beware to <em>never</em> make any manual change -in the files of the build-directory, just delete them (so they are -re-built).</p> - -<p>In this architecture, we have two types of Makefiles that are loaded into -the top <code>Makefile</code>: <em>configuration-Makefiles</em> (only independent -variables/configurations) and <em>workhorse-Makefiles</em> (Makefiles that -actually contain analysis/processing rules).</p> - -<p>The configuration-Makefiles are those that satisfy these two wildcards: -<code>reproduce/software/config/*.conf</code> (for building the necessary software -when you run <code>./project configure</code>) and <code>reproduce/analysis/config/*.conf</code> -(for the high-level analysis, when you run <code>./project make</code>). These -Makefiles don't actually have any rules, they just have values for various -free parameters throughout the configuration or analysis. Open a few of -them to see for yourself. These Makefiles must only contain raw Make -variables (project configurations). By "raw" we mean that the Make -variables in these files must not depend on variables in any other -configuration-Makefile. This is because we don't want to assume any order -in reading them. It is also very important to <em>not</em> define any rule, or -other Make construct, in these configuration-Makefiles.</p> - -<p>Following this rule-of-thumb enables you to set these configure-Makefiles -as a prerequisite to any target that depends on their variable -values. Therefore, if you change any of their values, all targets that -depend on those values will be re-built. This is very convenient as your -project scales up and gets more complex.</p> - -<p>The workhorse-Makefiles are those satisfying this wildcard -<code>reproduce/software/make/*.mk</code> and <code>reproduce/analysis/make/*.mk</code>. They -contain the details of the processing steps (Makefiles containing -rules). 
Therefore, in this phase <em>order is important</em>, because the -prerequisites of most rules will be the targets of other rules that will be -defined prior to them (not a fixed name like <code>paper.pdf</code>). The lower-level -rules must be imported into Make before the higher-level ones.</p> - -<p>All processing steps are assumed to ultimately (usually after many rules) -end up in some number, image, figure, or table that will be included in the -paper. The writing of these results into the final report/paper is managed -through separate LaTeX files that only contain macros (a name given to a -number/string to be used in the LaTeX source, which will be replaced when -compiling it to the final PDF). So the last target in a workhorse-Makefile -is a <code>.tex</code> file (with the same base-name as the Makefile, but in -<code>$(BDIR)/tex/macros</code>). As a result, if the targets in a workhorse-Makefile -aren't directly a prerequisite of other workhorse-Makefile targets, they -can be a prerequisite of that intermediate LaTeX macro file and thus be -called when necessary. Otherwise, they will be ignored by Make.</p> - -<p>Maneage also has a mode to share the build directory between several -users of a Unix group (when working on large computer clusters). In this -scenario, each user can have their own cloned project source, but share the -large built files between each other. To do this, it is necessary for all -built files to give full permission to group members while not allowing any -other users access to the contents. Therefore the <code>./project configure</code> and -<code>./project make</code> steps must be called with special conditions which are -managed by the <code>--group</code> option.</p> - -<p>Let's see how this design is implemented. Please open and inspect -<code>top-make.mk</code> as we go along here. 
The first step (un-commented line) is -to import the local configuration (your answers to the questions of -<code>./project configure</code>). These are defined in the configuration-Makefile -<code>reproduce/software/config/LOCAL.conf</code> which was also built by <code>./project -configure</code> (based on the <code>LOCAL.conf.in</code> template of the same directory).</p> - -<p>The next non-commented set of lines in the top <code>Makefile</code> defines the ultimate -target of the whole project (<code>paper.pdf</code>). But to avoid mistakes, a sanity -check is necessary to see if Make is being run with the same group settings -as the configure script (for example when the project is configured for -group access using the <code>./for-group</code> script, but Make isn't). Therefore we -use a Make conditional to define the <code>all</code> target based on the group -permissions.</p> - -<p>Having defined the top/ultimate target, our next step is to include all the -other necessary Makefiles. However, order matters in the importing of -workhorse-Makefiles and each must also have a TeX macro file with the same -base name (without a suffix). Therefore, the next step in the top-level -Makefile is to define the <code>makesrc</code> variable to keep the base names -(without a <code>.mk</code> suffix) of the workhorse-Makefiles that must be imported, -in the proper order.</p> - -<p>Finally, we import all the necessary remaining Makefiles: 1) All the -analysis configuration-Makefiles with a wildcard. 2) The software -configuration-Makefile that contains the software versions (just in case it's -necessary). 3) All workhorse-Makefiles in the proper order using a Make -<code>foreach</code> loop.</p> - -<p>In short, to keep things modular, readable and manageable, follow these -recommendations: 1) Set clear-to-understand names for the -configuration-Makefiles, and workhorse-Makefiles, 2) Only import other -Makefiles from the top Makefile. 
These will let you know/remember generally -which step you are taking before or after another. Projects will scale up -very fast. Thus if you don't start and continue with a clean and robust -convention like this, in the end it will become very dirty and hard to -manage/understand (even for yourself). As a general rule of thumb, break -your rules into as many logically-similar but independent steps as -possible.</p> - -<p>The <code>reproduce/analysis/make/paper.mk</code> Makefile must be the final Makefile -that is included. This workhorse Makefile ends with the rule to build -<code>paper.pdf</code> (final target of the whole project). If you look in it, you -will notice that this Makefile starts with a rule to create -<code>$(mtexdir)/project.tex</code> (<code>mtexdir</code> is just a shorthand name for -<code>$(BDIR)/tex/macros</code> mentioned before). As you see, the only dependency of -<code>$(mtexdir)/project.tex</code> is <code>$(mtexdir)/verify.tex</code> (which is the last -analysis step: it verifies all the generated results). Therefore, -<code>$(mtexdir)/project.tex</code> is <em>the connection</em> between the -processing/analysis steps of the project, and the steps to build the final -PDF.</p> - -<p>During the research, it often happens that you want to test a step that is -not a prerequisite of any higher-level operation. In such cases, you can -(temporarily) define that processing as a rule in the most relevant -workhorse-Makefile and set its target as a prerequisite of its TeX -macro. If your test gives a promising result and you want to include it in -your research, set it as prerequisites to other rules and remove it from -the list of prerequisites for TeX macro file. 
In fact, this is how a -project is designed to grow in this framework.</p> - -<h2>File modification dates (meta data)</h2> - -<p>While Git does an excellent job at keeping a history of the contents of -files, it makes no effort in keeping the file meta data, and in particular -the dates of files. Therefore when you check out a different branch, -files that are re-written by Git will have a newer date than the other -project files. However, file dates are important in the current design of -Maneage: Make checks the dates of the prerequisite files and target files -to see if the target should be re-built.</p> - -<p>To fix this problem, for Maneage we use a forked version of -<a href="https://github.com/mohammad-akhlaghi/metastore">Metastore</a>. Metastore uses -a binary database file (which is called <code>.file-metadata</code>) to keep the -modification dates of all the files under version control. This file is -also under version control, but is hidden (because it shouldn't be modified -by hand). During the project's configuration, Maneage installs two Git hooks -to run Metastore 1) before making a commit to update its database with the -file dates in a branch, and 2) after doing a checkout, to set the file -dates back to what they were once the checkout is complete.</p> - -<p>In practice, Metastore should work almost fully invisibly within your -project. The only place you might notice its presence is that you'll see -<code>.file-metadata</code> in the list of modified/staged files (commonly after -merging your branches). Since it's a binary file, Git also won't show you -the changed contents. In a merge, you can simply accept any changes with -<code>git add -u</code>. 
But if Git is telling you that it has changed without a merge -(for example if you started a commit, but canceled it in the middle), you -can just run <code>git checkout .file-metadata</code> to set it back to its original -state.</p> - -<h2>Summary</h2> - -<p>Based on the explanation above, some major design points you should have in -mind are listed below.</p> - -<ul> -<li><p>Define new <code>reproduce/analysis/make/XXXXXX.mk</code> workhorse-Makefile(s) -with good and human-friendly name(s) replacing <code>XXXXXX</code>.</p></li> -<li><p>Add <code>XXXXXX</code>, as a new line, to the values in <code>makesrc</code> of the top-level -<code>Makefile</code>.</p></li> -<li><p>Do not use any constant numbers (or important names like filter names) -in the workhorse-Makefiles or paper's LaTeX source. Define such -constants as logically-grouped, separate configuration-Makefiles in -<code>reproduce/analysis/config/XXXXX.conf</code>. Then set this -configuration-Makefile as a prerequisite to any rule that uses -the variable defined in it.</p></li> -<li><p>Through any number of intermediate prerequisites, all processing steps -should end in (be a prerequisite of) <code>$(mtexdir)/verify.tex</code> (defined in -<code>reproduce/analysis/make/verify.mk</code>). <code>$(mtexdir)/verify.tex</code> is the sole -dependency of <code>$(mtexdir)/project.tex</code>, which is the bridge between the -processing steps and PDF-building steps of the project.</p></li> -</ul> - -<h1>Customization checklist</h1> - -<p>Take the following steps to fully customize Maneage for your research -project. After finishing the list, be sure to run <code>./project configure</code> and -<code>./project make</code> to see if everything works correctly. 
If you notice anything -missing or any incorrect part (probably a change that has not been -explained here), please let us know so we can correct it.</p> - -<p>As described above, the concept of reproducibility (during a project) -heavily relies on <a href="https://en.wikipedia.org/wiki/Version_control">version -control</a>. Currently Maneage -uses Git as its main version control system. If you are not already -familiar with Git, please read the first three chapters of the <a href="https://git-scm.com/book/en/v2">ProGit -book</a> which provides a wonderful practical -understanding of the basics. You can read later chapters as you get more -advanced in later stages of your work.</p> - -<h2>First custom commit</h2> - -<ol> -<li><p><strong>Get this repository and its history</strong> (if you don't already have it): - Arguably the easiest way to start is to clone Maneage and prepare for - your customizations as shown below. After cloning, you first rename - the default <code>origin</code> remote server to specify that this is Maneage's - remote server. This will allow you to use the conventional <code>origin</code> - name for your own project as shown in the next steps. Second, you will - create and go into the conventional <code>master</code> branch to start - committing in your project later.</p> - -<p><code>shell - $ git clone https://git.maneage.org/project.git # Clone/copy the project and its history. - $ mv project my-project # Change the name to your project's name. - $ cd my-project # Go into the cloned directory. - $ git remote rename origin origin-maneage # Rename current/only remote to "origin-maneage". - $ git checkout -b master # Create and enter your own "master" branch. - $ pwd # Just to confirm where you are. -</code></p></li> -<li><p><strong>Prepare to build project</strong>: The <code>./project configure</code> command of the - next step will build the different software packages within the - "build" directory (that you will specify). 
Nothing else on your system - will be touched. However, since it takes a long time, it is useful to see - what is being built at every instant (it's almost impossible to tell - from the torrent of commands that are produced!). So open another - terminal on your desktop and navigate to the same project directory - that you cloned (output of last command above). Then run the following - command. Once every second, this command will just print the date - (possibly followed by a non-existent directory notice). But as soon as - the next step starts building software, you'll see the names of - software get printed as they are being built. Once any software is - installed in the project build directory, its name will disappear from this output. Again, - don't worry, nothing will be installed outside the build directory.</p> - -<p><code>shell - # On another terminal (go to top project source directory, last command above) - $ ./project --check-config -</code></p></li> -<li><p><strong>Test Maneage</strong>: Before making any changes, it is important to test it - and see if everything works properly with the commands below. If there - is any problem in the <code>./project configure</code> or <code>./project make</code> steps, - please contact us to fix the problem before continuing. Since the - building of dependencies in configuration can take a long time, you can take - the next few steps (editing the files) while it's working (they don't - affect the configuration). After <code>./project make</code> is finished, open - <code>paper.pdf</code>. If it looks fine, you are ready to start customizing - Maneage for your project. But before that, clean all the extra Maneage - outputs with <code>make clean</code> as shown below.</p> - -<p>```shell - $ ./project configure # Build the project's software environment (can take an hour or so). - $ ./project make # Do the processing and build paper (just a simple demo).</p> - -<p># Open 'paper.pdf' and see if everything is ok. 
- ```</p></li> -<li><p><strong>Setup the remote</strong>: You can use any <a href="https://en.wikipedia.org/wiki/Comparison_of_source_code_hosting_facilities">hosting - facility</a> - that supports Git to keep an online copy of your project's version - controlled history. We recommend <a href="https://gitlab.com">GitLab</a> because - it is <a href="https://www.gnu.org/software/repo-criteria-evaluation.html">more ethical (although not - perfect)</a>, - and later you can also host GitLab on your own server. Anyway, create - an account in your favorite hosting facility (if you don't already - have one), and define a new project there. Please make sure <em>the newly - created project is empty</em> (some services ask to include a <code>README</code> in - a new project which is bad in this scenario, and will not allow you to - push to it). It will give you a URL (usually starting with <code>git@</code> and - ending in <code>.git</code>), put this URL in place of <code>XXXXXXXXXX</code> in the first - command below. With the second command, "push" your <code>master</code> branch to - your <code>origin</code> remote, and (with the <code>--set-upstream</code> option) set them - to track/follow each other. However, the <code>maneage</code> branch is currently - tracking/following your <code>origin-maneage</code> remote (automatically set - when you cloned Maneage). So when pushing the <code>maneage</code> branch to your - <code>origin</code> remote, you <em>shouldn't</em> use <code>--set-upstream</code>. With the last - command, you can actually check this (which local and remote branches - are tracking each other).</p> - -<p><code>shell - git remote add origin XXXXXXXXXX # Newly created repo is now called 'origin'. - git push --set-upstream origin master # Push 'master' branch to 'origin' (with tracking). - git push origin maneage # Push 'maneage' branch to 'origin' (no tracking). 
-</code></p></li> -<li><p><strong>Title</strong>, <strong>short description</strong> and <strong>author</strong>: The title and basic - information of your project's output PDF paper should be added in - <code>paper.tex</code>. You should see the relevant place in the preamble (prior - to <code>\begin{document}</code>). After you are done, run the <code>./project make</code> - command again to see your changes in the final PDF, and make sure that - your changes don't cause a crash in LaTeX. Of course, if you use a - different LaTeX package/style for managing the title and authors (in - particular a specific journal's style), please feel free to use - your own methods after finishing this checklist and doing your first - commit.</p></li> -<li><p><strong>Delete dummy parts</strong>: Maneage contains some parts that are only for - the initial/test run, mainly as a demonstration of important steps, - which you can use as a reference in your own project. But they - are not for any real analysis, so you should remove these parts as - described below:</p> - -<ul> -<li><p><code>paper.tex</code>: 1) Delete the text of the abstract (from -<code>\includeabstract{</code> to <code>\vspace{0.25cm}</code>) and write your own (a -single sentence can be enough now, you can complete it later). 2) -Add some keywords under it in the keywords part. 3) Delete -everything between <code>%% Start of main body.</code> and <code>%% End of main -body.</code>. 4) Remove the notice in the "Acknowledgments" section (in -<code>\new{}</code>) and acknowledge your funding sources (this can also be -done later). Just don't delete the existing acknowledgment -statement: Maneage is possible thanks to funding from several -grants. Since Maneage is being used in your work, it is necessary to -acknowledge them in your work also.</p></li> -<li><p><code>reproduce/analysis/make/top-make.mk</code>: Delete the <code>delete-me</code> line -in the <code>makesrc</code> definition. 
Just make sure there is no empty line -between the <code>download \</code> and <code>verify \</code> lines (they should be -directly under each other).</p></li> -<li><p><code>reproduce/analysis/make/verify.mk</code>: In the final recipe, under the -commented line <code>Verify TeX macros</code>, remove the full line that -contains <code>delete-me</code>, and set the value of <code>s</code> in the line for -<code>download</code> to <code>XXXXX</code> (any temporary string, you'll fix it in the -end of your project, when its complete).</p></li> -<li><p>Delete all <code>delete-me*</code> files in the following directories:</p> - -<p><code>shell -$ rm tex/src/delete-me* -$ rm reproduce/analysis/make/delete-me* -$ rm reproduce/analysis/config/delete-me* -</code></p></li> -<li><p>Disable verification of outputs by removing the <code>yes</code> from -<code>reproduce/analysis/config/verify-outputs.conf</code>. Later, when you are -ready to submit your paper, or publish the dataset, activate -verification and make the proper corrections in this file (described -under the "Other basic customizations" section below). This is a -critical step and only takes a few minutes when your project is -finished. So DON'T FORGET to activate it in the end.</p></li> -<li><p>Re-make the project (after a cleaning) to see if you haven't -introduced any errors.</p> - -<p><code>shell -$ ./project make clean -$ ./project make -</code></p></li> -</ul></li> -<li><p><strong>Don't merge some files in future updates</strong>: As described below, you - can later update your infra-structure (for example to fix bugs) by - merging your <code>master</code> branch with <code>maneage</code>. For files that you have - created in your own branch, there will be no problem. However if you - modify an existing Maneage file for your project, next time its - updated on <code>maneage</code> you'll have an annoying conflict. The commands - below show how to fix this future problem. 
With them, you can - configure Git to ignore the changes in <code>maneage</code> for some of the files - you have already edited and deleted above (and will edit below). Note - that only the first <code>echo</code> command has a <code>></code> (to write over the file), - the rest are <code>>></code> (to append to it). If you want to avoid any other - set of files to be imported from Maneage into your project's branch, - you can follow a similar strategy. We recommend only doing it when you - encounter the same conflict in more than one merge and there is no - other change in that file. Also, don't add core Maneage Makefiles, - otherwise Maneage can break on the next run.</p> - -<p><code>shell - $ echo "paper.tex merge=ours" > .gitattributes - $ echo "tex/src/delete-me.mk merge=ours" >> .gitattributes - $ echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes - $ echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes - $ echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes - $ echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes - $ git add .gitattributes -</code></p></li> -<li><p><strong>Copyright and License notice</strong>: It is necessary that <em>all</em> the - "copyright-able" files in your project (those larger than 10 lines) - have a copyright and license notice. Please take a moment to look at - several existing files to see a few examples. The copyright notice is - usually close to the start of the file, it is the line starting with - <code>Copyright (C)</code> and containing a year and the author's name (like the - examples below). The License notice is a short description of the - copyright license, usually one or two paragraphs with a URL to the - full license. Don't forget to add these <em>two</em> notices to <em>any new - file</em> you add in your project (you can just copy-and-paste). 
When you - modify an existing Maneage file (which already has the notices), just - add a copyright notice in your name under the existing one(s), like - the line with capital letters below. To start with, add this line with - your name and email address to <code>paper.tex</code>, - <code>tex/src/preamble-header.tex</code>, <code>reproduce/analysis/make/top-make.mk</code>, - and generally, all the files you modified in the previous step.</p> - -<p><code> - Copyright (C) 2018-2020 Existing Name <existing@email.address> - Copyright (C) 2020 YOUR NAME <YOUR@EMAIL.ADDRESS> -</code></p></li> -<li><p><strong>Configure Git for the first time</strong>: If this is the first time you are - running Git on this system, then you have to configure it with some - basic information in order to have essential information in the commit - messages (ignore this step if you have already done it). Git will - include your name and e-mail address information in each commit. You - can also specify your favorite text editor for making the commit - (<code>emacs</code>, <code>vim</code>, <code>nano</code>, etc.).</p> - -<p><code>shell - $ git config --global user.name "YourName YourSurname" - $ git config --global user.email your-email@example.com - $ git config --global core.editor nano -</code></p></li> -<li><p><strong>Your first commit</strong>: You have already made some small and basic - changes in the steps above and you are in your project's <code>master</code> - branch. So, you can officially make your first commit in your - project's history and push it. But before that, you need to make sure - that there are no problems in the project. It is a good habit to - always re-build the system before a commit to be sure it works as - expected.</p> - -<p><code>shell - $ git status # See which files you have changed. - $ git diff # Check the lines you have added/changed. - $ ./project make # Make sure everything builds successfully. - $ git add -u # Put all tracked changes in staging area. 
- $ git status # Make sure everything is fine. - $ git diff --cached # Confirm all the changes that will be committed. - $ git commit # Your first commit: put a good description! - $ git push # Push your commit to your remote. -</code></p></li> -<li><p><strong>Start your exciting research</strong>: You are now ready to add flesh and - blood to this raw skeleton by further modifying and adding your - exciting research steps. You can use the "published works" section in - the introduction (above) as some fully working models to learn - from. Also, don't hesitate to contact us if you have any - questions.</p></li> -</ol> - -<h2>Other basic customizations</h2> - -<ul> -<li><p><strong>High-level software</strong>: Maneage installs all the software that your - project needs. You can specify which software your project needs in - <code>reproduce/software/config/TARGETS.conf</code>. The necessary software are - classified into two classes: 1) programs or libraries (usually written - in C/C++) which are run directly by the operating system. 2) Python - modules/libraries that are run within Python. By default - <code>TARGETS.conf</code> only has GNU Astronomy Utilities (Gnuastro) as one - scientific program and Astropy as one scientific Python module. Both - have many dependencies which will be installed into your project - during the configuration step. To see a list of software that are - currently ready to be built in Maneage, see - <code>reproduce/software/config/versions.conf</code> (which has their versions - also), the comments in <code>TARGETS.conf</code> describe how to use the software - name from <code>versions.conf</code>. Currently the raw pipeline just uses - Gnuastro to make the demonstration plots. 
Therefore if you don't need - Gnuastro, go through the analysis steps in <code>reproduce/analysis</code> and - remove all its use cases (clearly marked).</p></li> -<li><p><strong>Input dataset</strong>: The input datasets are managed through the - <code>reproduce/analysis/config/INPUTS.conf</code> file. It is best to gather all - the information regarding all the input datasets into this one central - file. To ensure that the proper dataset is being downloaded and used - by the project, it is also recommended to get an <a href="https://en.wikipedia.org/wiki/MD5">MD5 - checksum</a> of the file and include - that in <code>INPUTS.conf</code> so the project can check it automatically. The - preparation/downloading of the input datasets is done in - <code>reproduce/analysis/make/download.mk</code>. Have a look there to see how - these values are to be used. This information about the input datasets - is also used in the initial <code>configure</code> script (to inform the users), - so also modify that file. You can find all occurrences of the demo - dataset with the command below and replace them with your own input - dataset.</p> - -<p><code>shell - $ grep -ir wfpc2 ./* -</code></p></li> -<li><p><strong><code>README.md</code></strong>: Correct all the <code>XXXXX</code> placeholders (name of your - project, your own name, address of your project's online/remote - repository, link to download dependencies, etc.). Generally, read - over the text and update it where necessary to fit your project. Don't - forget that this is the first file that is displayed on your online - repository and the first file your colleagues will be drawn to - read. Therefore, make it as easy as possible for them to start - with. Also check and update this file one last time when you are ready - to publish your project's paper/source.</p></li> -<li><p><strong>Verify outputs</strong>: During the initial customization checklist, you - disabled verification. 
This is natural because during the project you - need to make changes all the time and it's a waste of time to enable - verification every time. But at significant moments of the project - (for example before submission to a journal, or publication) it is - necessary. When you activate verification, before building the paper, - all the specified datasets will be compared with their respective - checksum and if any file's checksum is different from the one recorded - in the project, it will stop and print the problematic file and its - expected and calculated checksums. First set the value of the - <code>verify-outputs</code> variable in - <code>reproduce/analysis/config/verify-outputs.conf</code> to <code>yes</code>. Then go to - <code>reproduce/analysis/make/verify.mk</code>. The verification of all the files - is only done in one recipe. First the files that go into the - plots/figures are checked, then the LaTeX macros. Validation of the - former (inputs to plots/figures) should be done manually. If it's the - first time you are doing this, you can see two examples of the dummy - steps (with <code>delete-me</code>, you can use them if you like). These two - examples should be removed before you can run the project. For the - latter, you just have to update the checksums. The important thing to - consider is that a simple checksum can be problematic because some - file generators print their run-time date in the file (for example as - commented lines in a text table). When checking text files, this - Makefile already has this function: - <code>verify-txt-no-comments-leading-space</code>. As the name suggests, it will - remove comment lines and empty lines before calculating the MD5 - checksum. For FITS formats (common in astronomy), fortunately there is - a <code>DATASUM</code> definition which will return the checksum independent of - the headers. 
You can use the provided function(s), or define one for - your special formats.</p></li> -<li><p><strong>Feedback</strong>: As you use Maneage you will notice many things that if - implemented from the start would have been very useful for your - work. This can be in the actual scripting and architecture of Maneage, - or useful implementation and usage tips, like those below. In any - case, please share your thoughts and suggestions with us, so we can - add them here for everyone's benefit.</p></li> -<li><p><strong>Re-preparation</strong>: Automatic preparation is only run in the first run - of the project on a system; to re-do the preparation you have to use - the option below. Here is the reason for this: when it's necessary, the - preparation process can be slow and will unnecessarily slow down the - whole project while the project is under development (focus is on the - analysis that is done after preparation). Because of this, preparation - will be done automatically for the first time that the project is run - (when <code>.build/software/preparation-done.mk</code> doesn't exist). After the - preparation process completes once, future runs of <code>./project make</code> - will not do the preparation process anymore (will not call - <code>top-prepare.mk</code>). They will only call <code>top-make.mk</code> for the - analysis. To manually invoke the preparation process after the first - attempt, the <code>./project make</code> script should be run with the - <code>--prepare-redo</code> option, or you can delete the special file above.</p> - -<p><code>shell - $ ./project make --prepare-redo -</code></p></li> -<li><p><strong>Pre-publication: add notice on reproducibility</strong>: Add a notice - somewhere prominent on the first page of your paper, informing the - reader that your research is fully reproducible. For example at the - end of the abstract, or under the keywords with a title like - "reproducible paper". 
This will encourage them to publish their own - works in this manner and will also help spread the word.</p></li> -</ul> - -<h1>Tips for designing your project</h1> - -<p>The following is a list of design points, tips, or recommendations that -have been learned after some experience with this type of project -management. Please don't hesitate to share with us any experience you gain after -using it. In this way, we can add it here (giving full credit) -for the benefit of others.</p> - -<ul> -<li><p><strong>Modularity</strong>: Modularity is the key to easy and clean growth of a - project. So it is always best to break up a job into as many - sub-components as reasonable. Here are some tips to stay modular.</p> - -<ul> -<li><p><em>Short recipes</em>: if you see the recipe of a rule becoming more than a -handful of lines which involve significant processing, it is probably -a good sign that you should break up the rule into its main -components. Try to only have one major processing step per rule.</p></li> -<li><p><em>Context-based (many) Makefiles</em>: For maximum modularity, this design -allows easy inclusion of many Makefiles: in -<code>reproduce/analysis/make/*.mk</code> for analysis steps, and -<code>reproduce/software/make/*.mk</code> for building software. So keep the -rules for closely related parts of the processing in separate -Makefiles.</p></li> -<li><p><em>Descriptive names</em>: Be very clear and descriptive with the naming of -the files and the variables because a few months after the -processing, it will be very hard to remember what each one was -for. Also this helps others (your collaborators or other people -reading the project source after it is published) to more easily -understand your work and find their way around.</p></li> -<li><p><em>Naming convention</em>: As the project grows, following a single standard -or convention in naming the files is very useful. 
Try your best to use -multiple-word filenames for anything that is non-trivial (separating -the words with a <code>-</code>). For example if you have a Makefile for -creating a catalog and another two for processing it under models A -and B, you can name them like this: <code>catalog-create.mk</code>, -<code>catalog-model-a.mk</code> and <code>catalog-model-b.mk</code>. In this way, when -listing the contents of <code>reproduce/analysis/make</code> to see all the -Makefiles, those related to the catalog will all be close to each -other and thus easily found. This also helps in auto-completions by -the shell or text editors like Emacs.</p></li> -<li><p><em>Source directories</em>: If you need to add files in other languages (for -example shell, Python, AWK or C), keep the files in the same -language in a separate directory under <code>reproduce/analysis</code>, with the -appropriate name.</p></li> -<li><p><em>Configuration files</em>: If your research uses special programs as part -of the processing, put all their configuration files in a dedicated -directory (with the program's name) within -<code>reproduce/software/config</code>, similar to the -<code>reproduce/software/config/gnuastro</code> directory (which is put in -Maneage as a demo in case you use GNU Astronomy Utilities). 
It is -much cleaner and readable (thus less buggy) to avoid mixing the -configuration files, even if there is no technical necessity.</p></li> -</ul></li> -<li><p><strong>Contents</strong>: It is good practice to follow the following - recommendations on the contents of your files, whether they are source - code for a program, Makefiles, scripts or configuration files - (copyrights aren't necessary for the latter).</p> - -<ul> -<li><p><em>Copyright</em>: Always start a file containing programming constructs -with a copyright statement like the ones that Maneage starts with -(for example in the top level <code>Makefile</code>).</p></li> -<li><p><em>Comments</em>: Comments are vital for readability (by yourself in two -months, or others). Describe everything you can about why you are -doing something, how you are doing it, and what you expect the result -to be. Write the comments as if it was what you would say to describe -the variable, recipe or rule to a friend sitting beside you. When -writing the project it is very tempting to just steam ahead with -commands and codes, but be patient and write comments before the -rules or recipes. This will also allow you to think more about what -you should be doing. Also, in several months when you come back to -the code, you will appreciate the effort of writing them. Just don't -forget to also read and update the comment first if you later want to -make changes to the code (variable, recipe or rule). As a general -rule of thumb: first the comments, then the code.</p></li> -<li><p><em>File title</em>: In general, it is good practice to start all files with -a single line description of what that particular file does. If -further information about the totality of the file is necessary, add -it after a blank line. This will help a fast inspection where you -don't care about the details, but just want to remember/see what that -file is (generally) for. 
This information must of course be commented -(its for a human), but this is kept separate from the general -recommendation on comments, because this is a comment for the whole -file, not each step within it.</p></li> -</ul></li> -<li><p><strong>Make programming</strong>: Here are some experiences that we have come to - learn over the years in using Make and are useful/handy in research - contexts.</p> - -<ul> -<li><p><em>Environment of each recipe</em>: If you need to define a special -environment (or aliases, or scripts to run) for all the recipes in -your Makefiles, you can use a Bash startup file -<code>reproduce/software/shell/bashrc.sh</code>. This file is loaded before every -Make recipe is run, just like the <code>.bashrc</code> in your home directory is -loaded every time you start a new interactive, non-login terminal. See -the comments in that file for more.</p></li> -<li><p><em>Automatic variables</em>: These are wonderful and very useful Make -constructs that greatly shrink the text, while helping in -read-ability, robustness (less bugs in typos for example) and -generalization. For example even when a rule only has one target or -one prerequisite, always use <code>$@</code> instead of the target's name, <code>$<</code> -instead of the first prerequisite, <code>$^</code> instead of the full list of -prerequisites and etc. You can see the full list of automatic -variables -<a href="https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html">here</a>. If -you use GNU Make, you can also see this page on your command-line:</p> - -<p><code>shell -$ info make "automatic variables" -</code></p></li> -<li><p><em>Debug</em>: Since Make doesn't follow the common top-down paradigm, it -can be a little hard to get accustomed to why you get an error or -un-expected behavior. In such cases, run Make with the <code>-d</code> -option. With this option, Make prints a full list of exactly which -prerequisites are being checked for which targets. 
Looking -(patiently) through this output and searching for the faulty -file/step will clearly show you any mistake you might have made in -defining the targets or prerequisites.</p></li> -<li><p><em>Large files</em>: If you are dealing with very large files (thus having -multiple copies of them for intermediate steps is not possible), one -solution is the following strategy (also see the next item on "Fast -access to temporary files"). Set a small plain text file as the -actual target and delete the large file when it is no longer needed -by the project (in the last rule that needs it). Below is a simple -demonstration of doing this. In it, we use Gnuastro's Arithmetic -program to add 2 to all pixels of the input image and create -<code>large1.fits</code>. We then subtract 2 from <code>large1.fits</code> to create -<code>large2.fits</code> and delete <code>large1.fits</code> in the same rule (when it's no -longer needed). We can later do the same with <code>large2.fits</code> when it -is no longer needed and so on. -<code> +<!DOCTYPE html> +<!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. 
--> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII characters --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting + different widths for the page than the actual width of the device, + or zooming to different values). Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + + + + <!-- Start the main body. --> + <body> + <div id="container"> + + <h1>Maneage: managing data lineage</h1> + + <p>Copyright (C) 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright (C) 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + See the end of the file for license conditions.</p> + + <p>Maneage is a <strong>fully working template</strong> for doing reproducible research (or + writing a reproducible paper) as defined in the link below. If the link + below is not accessible at the time of reading, please see the appendix at + the end of this file for a portion of its introduction. Some + <a href="http://akhlaghi.org/pdf/reproducible-paper.pdf">slides</a> are also available + to help demonstrate the concept implemented here.</p> + + <p>http://akhlaghi.org/reproducible-science.html</p> + + <p>Maneage is created with the aim of supporting reproducible research by + making it easy to start a project in this framework. As shown below, it is + very easy to customize Maneage for any particular (research) project and + expand it as it starts and evolves. 
It can be run with no modification (as + described in <code>README.md</code>) as a demonstration and customized for use in any + project as fully described below.</p> + + <p>A project designed using Maneage will download and build all the necessary + libraries and programs for working in a closed environment (highly + independent of the host operating system) with fixed versions of the + necessary dependencies. The tarballs for building the local environment are + also collected in a <a href="http://git.maneage.org/tarballs-software.git/tree/">separate + repository</a>. The final + output of the project is <a href="http://git.maneage.org/output-raw.git/plain/paper.pdf">a + paper</a>. Notice the + last paragraph of the Acknowledgments where all the necessary software are + mentioned with their versions.</p> + + <p>Below, we start with a discussion of why Make was chosen as the high-level + language/framework for project management and how to learn and master Make + easily (and freely). The general architecture and design of the project is + then discussed to help you navigate the files and their contents. This is + followed by a checklist for the easy/fast customization of Maneage to your + exciting research. We continue with some tips and guidelines on how to + manage or extend your project as it grows based on our experiences with it + so far. The main body concludes with a description of possible future + improvements that are planned for Maneage (but not yet implemented). As + discussed above, we end with a short introduction on the necessity of + reproducible science in the appendix.</p> + + <p>Please don't forget to share your thoughts, suggestions and + criticisms. Maintaining and designing Maneage is itself a separate project, + so please join us if you are interested. 
Once it is mature enough, we will + describe it in a paper (written by all contributors) for a formal + introduction to the community.</p> + + <h2>Why Make?</h2> + + <p>When batch processing is necessary (no manual intervention, as in a + reproducible project), shell scripts are usually the first solution that + come to mind. However, the inherent complexity and non-linearity of + progress in a scientific project (where experimentation is key) make it + hard to manage the script(s) as the project evolves. For example, a script + will start from the top/start every time it is run. So if you have already + completed 90% of a research project and want to run the remaining 10% that + you have newly added, you have to run the whole script from the start + again. Only then will you see the effects of the last new steps (to find + possible errors, or better solutions and etc).</p> + + <p>It is possible to manually ignore/comment parts of a script to only do a + special part. However, such checks/comments will only add to the complexity + of the script and will discourage you to play-with/change an already + completed part of the project when an idea suddenly comes up. It is also + prone to very serious bugs in the end (when trying to reproduce from + scratch). Such bugs are very hard to notice during the work and frustrating + to find in the end.</p> + + <p>The Make paradigm, on the other hand, starts from the end: the final + <em>target</em>. It builds a dependency tree internally, and finds where it should + start each time the project is run. Therefore, in the scenario above, a + researcher that has just added the final 10% of steps of her research to + her Makefile, will only have to run those extra steps. 
With Make, it is + also trivial to change the processing of any intermediate (already written) + <em>rule</em> (or step) in the middle of an already written analysis: the next + time Make is run, only rules that are affected by the changes/additions + will be re-run, not the whole analysis/project.</p> + + <p>This greatly speeds up the processing (enabling creative changes), while + keeping all the dependencies clearly documented (as part of the Make + language), and most importantly, enabling full reproducibility from scratch + with no changes in the project code that was working during the + research. This will allow robust results and let the scientists get to what + they do best: experiment and be critical to the methods/analysis without + having to waste energy and time on technical problems that come up as a + result of that experimentation in scripts.</p> + + <p>Since the dependencies are clearly demarcated in Make, it can identify + independent steps and run them in parallel. This further speeds up the + processing. Make was designed for this purpose. It is how huge projects + like all Unix-like operating systems (including GNU/Linux or Mac OS + operating systems) and their core components are built. Therefore, Make is + a highly mature paradigm/system with robust and highly efficient + implementations in various operating systems perfectly suited for a complex + non-linear research project.</p> + + <p>Make is a small language with the aim of defining <em>rules</em> containing + <em>targets</em>, <em>prerequisites</em> and <em>recipes</em>. It comes with some nice features + like functions or automatic-variables to greatly facilitate the management + of text (filenames for example) or any of those constructs. 
For a more + detailed (yet still general) introduction see the article on Wikipedia:</p> + + <p>https://en.wikipedia.org/wiki/Make_(software)</p> + + <p>Make is more than 40 years old and is still evolving, so many + implementations of Make exist. They differ only in the extra + features they add to the <a href="https://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">standard + definition</a> + (which is shared by all of them). Maneage is primarily written in GNU Make + (which it installs itself, you don't have to have it on your system). GNU + Make is the most common, most actively developed, and most advanced + implementation. Just note that Maneage downloads, builds, internally + installs, and uses its own dependencies (including GNU Make), so you don't + have to have it installed before you try it out.</p> + + <h2>How can I learn Make?</h2> + + <p>The GNU Make book/manual (links below) is arguably the best place to learn + Make. It is an excellent and non-technical book to help get started (it is + only non-technical in its first few chapters to get you started easily). It + is freely available and always up to date with the current GNU Make + release. It also clearly explains which features are specific to GNU Make + and which are general in all implementations. 
So the first few chapters + regarding the generalities are useful for all implementations.</p> + + <p>The first link below points to the GNU Make manual in various formats and + in the second, you can download it in PDF (which may be easier for a first + time reading).</p> + + <p>https://www.gnu.org/software/make/manual/</p> + + <p>https://www.gnu.org/software/make/manual/make.pdf</p> + + <p>If you use GNU Make, you also have the whole GNU Make manual on the + command-line with the following command (you can come out of the "Info" + environment by pressing <code>q</code>).</p> + + <pre><code> +info make + </code></pre> + + <p>If you aren't familiar with the Info documentation format, we strongly + recommend running <code>$ info info</code> and reading along. In less than an hour, + you will become highly proficient in it (it is very simple and has a great + manual for itself). Info greatly simplifies your access (without taking + your hands off the keyboard!) to many manuals that are installed on your + system, allowing you to be much more efficient as you work. If you use the + GNU Emacs text editor (or any of its variants), you also have access to all + Info manuals while you are writing your projects (again, without taking + your hands off the keyboard!).</p> + + <h2>Published works using Maneage</h2> + + <p>The list below shows some of the works that have already been published + with (earlier versions of) Maneage. Previously it was simply called + "Reproducible paper template". Note that Maneage is evolving, so some + details may be different in them. The more recent ones can be used as a + good working example.</p> + + <ul> + <li><p>Infante-Sainz et + al. 
(<a href="https://ui.adsabs.harvard.edu/abs/2020MNRAS.491.5317I">2020</a>, + MNRAS, 491, 5317): The version controlled project source is available + <a href="https://gitlab.com/infantesainz/sdss-extended-psfs-paper">on GitLab</a> + and is also archived on Zenodo with all the necessary software tarballs: + <a href="https://zenodo.org/record/3524937">zenodo.3524937</a>.</p></li> + <li><p>Akhlaghi (<a href="https://arxiv.org/abs/1909.11230">2019</a>, IAU Symposium + 355). The version controlled project source is available <a href="https://gitlab.com/makhlaghi/iau-symposium-355">on + GitLab</a> and is also + archived on Zenodo with all the necessary software tarballs: + <a href="https://doi.org/10.5281/zenodo.3408481">zenodo.3408481</a>.</p></li> + <li><p>Section 7.3 of Bacon et + al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A + 608, A1): The version controlled project source is available <a href="https://gitlab.com/makhlaghi/muse-udf-origin-only-hst-magnitudes">on + GitLab</a> + and a snapshot of the project along with all the necessary input + datasets and outputs is available in + <a href="https://doi.org/10.5281/zenodo.1164774">zenodo.1164774</a>.</p></li> + <li><p>Section 4 of Bacon et + al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A, + 608, A1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/muse-udf-photometry-astrometry">on + GitLab</a> and + a snapshot of the project along with all the necessary input datasets is + available in <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a>.</p></li> + <li><p>Akhlaghi & Ichikawa + (<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, + 1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/NoiseChisel-paper">on + GitLab</a>. This is the + very first (and much less mature!) 
incarnation of Maneage: the history + of Maneage started more than two years after this paper was + published. It is a very rudimentary/initial implementation, thus it is + only included here for historical reasons. However, the project source + is complete, accurate and uploaded to arXiv along with the paper.</p></li> + </ul> + + <h2>Citation</h2> + + <p>A paper to fully describe Maneage has been submitted. Until it is published, if you + use Maneage in your work, please cite the paper that implemented its first + version: Akhlaghi & Ichikawa + (<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, 1).</p> + + <p>Also, when your paper is published, don't forget to add a notice in your + own paper (in coordination with the publishing editor) that the paper is + fully reproducible and possibly add a sentence or paragraph at the end of + the paper briefly describing the concept. This will help spread the word + and encourage other scientists to also manage and publish their projects in + a reproducible manner.</p> + + <h1>Project architecture</h1> + + <p>In order to customize Maneage to your research, it is important to first + understand its architecture so you can navigate your way in the directories + and understand how to implement your research project within its framework: + where to add new files and which existing files to modify for what + purpose. But if this is the first time you are using Maneage, before reading + this theoretical discussion, please run Maneage once from scratch without + any changes (described in <code>README.md</code>). You will see how it works (note that + the configure step builds all necessary software, so it can take a long time, but + you can continue reading while it's working).</p> + + <p>The project has two top-level directories: <code>reproduce</code> and + <code>tex</code>. <code>reproduce</code> hosts all the software building and analysis + steps. 
<code>tex</code> contains all the final paper's components to be compiled into + a PDF using LaTeX.</p> + + <p>The <code>reproduce</code> directory has two sub-directories: <code>software</code> and + <code>analysis</code>. As the names suggest, the former contains all the instructions to + download, build and install (independent of the host operating system) the + necessary software (these are called by the <code>./project configure</code> + command). The latter contains instructions on how to use that software to + do your project's analysis.</p> + + <p>After it finishes, <code>./project configure</code> will create the following symbolic + links in the project's top source directory: <code>.build</code> which points to the + top build directory and <code>.local</code> for easy access to the custom built + software installation directory. With these you can easily access the build + directory and project-specific software from your top source directory. For + example if you run <code>.local/bin/ls</code> you will be using the <code>ls</code> of Maneage, + which is probably different from your system's <code>ls</code> (run them both with + <code>--version</code> to check).</p> + + <p>Once the project is configured for your system, <code>./project make</code> will do + the basic preparations and run the project's analysis with the custom + version of software. The <code>project</code> script is just a wrapper, and with the + <code>make</code> argument, it will first call <code>top-prepare.mk</code> and then <code>top-make.mk</code> + (both are in the <code>reproduce/analysis/make</code> directory).</p> + + <p>In terms of organization, <code>top-prepare.mk</code> and <code>top-make.mk</code> have a + nearly identical design, with only minor differences. So, let's continue the discussion of Maneage's + architecture with <code>top-make.mk</code>. Once you understand that, you'll clearly + understand <code>top-prepare.mk</code> also. 
These very high-level files are + relatively short and heavily commented so hopefully the descriptions in + each comment will be enough to understand the general details. As you read + this section, please also look at the contents of the mentioned files and + directories to fully understand what is going on.</p> + + <p>Before starting to look into <code>top-make.mk</code>, it is important to + recall that Make defines dependencies by files. Therefore, the + input/prerequisite and output of every step/rule must be a file. Also + recall that Make will use the modification date of the prerequisite(s) and + target files to see if the target must be re-built or not. Therefore during + the processing, <em>many</em> intermediate files will be created (see the tips + section below on a good strategy to deal with large/huge files).</p> + + <p>To keep the source and (intermediate) built files separate, the user <em>must</em> + define a top-level build directory variable (or <code>$(BDIR)</code>) to host all the + intermediate files (you defined it during <code>./project configure</code>). This + directory doesn't need to be version controlled, synchronized, or + backed up on other servers: its contents are all products, and can be + easily re-created any time. As you define targets for your new rules, it is + thus important to place them all under sub-directories of <code>$(BDIR)</code>. As + mentioned above, you always have fast access to this "build"-directory with + the <code>.build</code> symbolic link. 
Also, be careful to <em>never</em> manually change + the files in the build-directory; just delete them (so they are + re-built).</p> + + <p>In this architecture, we have two types of Makefiles that are loaded into + the top <code>Makefile</code>: <em>configuration-Makefiles</em> (only independent + variables/configurations) and <em>workhorse-Makefiles</em> (Makefiles that + actually contain analysis/processing rules).</p> + + <p>The configuration-Makefiles are those that satisfy these two wildcards: + <code>reproduce/software/config/*.conf</code> (for building the necessary software + when you run <code>./project configure</code>) and <code>reproduce/analysis/config/*.conf</code> + (for the high-level analysis, when you run <code>./project make</code>). These + Makefiles don't actually have any rules, they just have values for various + free parameters throughout the configuration or analysis. Open a few of + them to see for yourself. These Makefiles must only contain raw Make + variables (project configurations). By "raw" we mean that the Make + variables in these files must not depend on variables in any other + configuration-Makefile. This is because we don't want to assume any order + in reading them. It is also very important to <em>not</em> define any rule, or + other Make construct, in these configuration-Makefiles.</p> + + <p>Following this rule-of-thumb enables you to set these configuration-Makefiles + as a prerequisite to any target that depends on their variable + values. Therefore, if you change any of their values, all targets that + depend on those values will be re-built. This is very convenient as your + project scales up and gets more complex.</p> + + <p>The workhorse-Makefiles are those satisfying these two wildcards: + <code>reproduce/software/make/*.mk</code> and <code>reproduce/analysis/make/*.mk</code>. They + contain the details of the processing steps (Makefiles containing + rules). 
Therefore, in this phase <em>order is important</em>, because the + prerequisites of most rules will be the targets of other rules that will be + defined prior to them (not a fixed name like <code>paper.pdf</code>). The lower-level + rules must be imported into Make before the higher-level ones.</p> + + <p>All processing steps are assumed to ultimately (usually after many rules) + end up in some number, image, figure, or table that will be included in the + paper. The writing of these results into the final report/paper is managed + through separate LaTeX files that only contain macros (a name given to a + number/string to be used in the LaTeX source, which will be replaced when + compiling it to the final PDF). So the last target in a workhorse-Makefile + is a <code>.tex</code> file (with the same base-name as the Makefile, but in + <code>$(BDIR)/tex/macros</code>). As a result, if the targets in a workhorse-Makefile + aren't directly a prerequisite of other workhorse-Makefile targets, they + can be a prerequisite of that intermediate LaTeX macro file and thus be + called when necessary. Otherwise, they will be ignored by Make.</p> + + <p>Maneage also has a mode to share the build directory between several + users of a Unix group (when working on large computer clusters). In this + scenario, each user can have their own cloned project source, but share the + large built files with each other. To do this, it is necessary for all + built files to give full permission to group members while not allowing any + other users access to the contents. Therefore the <code>./project configure</code> and + <code>./project make</code> steps must be called with special conditions which are + managed through the <code>--group</code> option.</p> + + <p>Let's see how this design is implemented. Please open and inspect + <code>top-make.mk</code> as we go along here. 
The first step (un-commented line) is + to import the local configuration (your answers to the questions of + <code>./project configure</code>). They are defined in the configuration-Makefile + <code>reproduce/software/config/LOCAL.conf</code> which was also built by <code>./project + configure</code> (based on the <code>LOCAL.conf.in</code> template of the same directory).</p> + + <p>The next non-commented set of lines in the top <code>Makefile</code> defines the ultimate + target of the whole project (<code>paper.pdf</code>). But to avoid mistakes, a sanity + check is necessary to see if Make is being run with the same group settings + as the configure script (for example when the project is configured for + group access using the <code>./for-group</code> script, but Make isn't). Therefore we + use a Make conditional to define the <code>all</code> target based on the group + permissions.</p> + + <p>Having defined the top/ultimate target, our next step is to include all the + other necessary Makefiles. However, order matters in the importing of + workhorse-Makefiles and each must also have a TeX macro file with the same + base name (without a suffix). Therefore, the next step in the top-level + Makefile is to define the <code>makesrc</code> variable to keep the base names + (without a <code>.mk</code> suffix) of the workhorse-Makefiles that must be imported, + in the proper order.</p> + + <p>Finally, we import all the necessary remaining Makefiles: 1) All the + analysis configuration-Makefiles with a wildcard. 2) The software + configuration-Makefile that contains the software versions (just in case it's + necessary). 3) All workhorse-Makefiles in the proper order using a Make + <code>foreach</code> loop.</p> + + <p>In short, to keep things modular, readable and manageable, follow these + recommendations: 1) Set clear-to-understand names for the + configuration-Makefiles and workhorse-Makefiles. 2) Only import other + Makefiles from the top Makefile. 
These will help you know/remember + which step comes before or after another. Projects will scale up + very fast. Thus if you don't start and continue with a clean and robust + convention like this, in the end your project will become very dirty and hard to + manage/understand (even for yourself). As a general rule of thumb, break + your rules into as many logically-similar but independent steps as + possible.</p> + + <p>The <code>reproduce/analysis/make/paper.mk</code> Makefile must be the final Makefile + that is included. This workhorse-Makefile ends with the rule to build + <code>paper.pdf</code> (final target of the whole project). If you look in it, you + will notice that this Makefile starts with a rule to create + <code>$(mtexdir)/project.tex</code> (<code>mtexdir</code> is just a shorthand name for + <code>$(BDIR)/tex/macros</code> mentioned before). As you see, the only dependency of + <code>$(mtexdir)/project.tex</code> is <code>$(mtexdir)/verify.tex</code> (which is the last + analysis step: it verifies all the generated results). Therefore, + <code>$(mtexdir)/project.tex</code> is <em>the connection</em> between the + processing/analysis steps of the project, and the steps to build the final + PDF.</p> + + <p>During the research, it often happens that you want to test a step that is + not a prerequisite of any higher-level operation. In such cases, you can + (temporarily) define that processing as a rule in the most relevant + workhorse-Makefile and set its target as a prerequisite of its TeX + macro file. If your test gives a promising result and you want to include it in + your research, set it as a prerequisite of other rules and remove it from + the list of prerequisites of the TeX macro file. 
In fact, this is how a + project is designed to grow in this framework.</p> + + <h2>File modification dates (meta data)</h2> + + <p>While Git does an excellent job at keeping a history of the contents of + files, it makes no effort to keep the file meta data, and in particular + the dates of files. Therefore when you check out a different branch, + files that are re-written by Git will have a newer date than the other + project files. However, file dates are important in the current design of + Maneage: Make checks the dates of the prerequisite files and target files + to see if the target should be re-built.</p> + + <p>To fix this problem, for Maneage we use a forked version of + <a href="https://github.com/mohammad-akhlaghi/metastore">Metastore</a>. Metastore uses + a binary database file (which is called <code>.file-metadata</code>) to keep the + modification dates of all the files under version control. This file is + also under version control, but is hidden (because it shouldn't be modified + by hand). During the project's configuration, Maneage installs two Git hooks + to run Metastore 1) before making a commit, to update its database with the + file dates in a branch, and 2) after doing a checkout, to reset the + file dates back to + what they were.</p> + + <p>In practice, Metastore should work almost fully invisibly within your + project. The only place you might notice its presence is that you'll see + <code>.file-metadata</code> in the list of modified/staged files (commonly after + merging your branches). Since it's a binary file, Git also won't show you + the changed contents. In a merge, you can simply accept any changes with + <code>git add -u</code>. 
But if Git is telling you that it has changed without a merge + (for example if you started a commit, but canceled it in the middle), you + can just run <code>git checkout .file-metadata</code> to set it back to its original + state.</p> + + <h2>Summary</h2> + + <p>Based on the explanation above, some major design points you should have in + mind are listed below.</p> + + <ul> + <li><p>Define new <code>reproduce/analysis/make/XXXXXX.mk</code> workhorse-Makefile(s) + with good and human-friendly name(s) replacing <code>XXXXXX</code>.</p></li> + <li><p>Add <code>XXXXXX</code>, as a new line, to the values in <code>makesrc</code> of the top-level + <code>Makefile</code>.</p></li> + <li><p>Do not use any constant numbers (or important names like filter names) + in the workhorse-Makefiles or paper's LaTeX source. Define such + constants as logically-grouped, separate configuration-Makefiles in + <code>reproduce/analysis/config/XXXXX.conf</code>. Then set this + configuration-Makefile as a prerequisite of any rule that uses + the variables defined in it.</p></li> + <li><p>Through any number of intermediate prerequisites, all processing steps + should end in (be a prerequisite of) <code>$(mtexdir)/verify.tex</code> (defined in + <code>reproduce/analysis/make/verify.mk</code>). <code>$(mtexdir)/verify.tex</code> is the sole + dependency of <code>$(mtexdir)/project.tex</code>, which is the bridge between the + processing steps and PDF-building steps of the project.</p></li> + </ul> + + <h1>Customization checklist</h1> + + <p>Take the following steps to fully customize Maneage for your research + project. After finishing the list, be sure to run <code>./project configure</code> and + <code>./project make</code> to see if everything works correctly. 
If you notice anything + missing or any incorrect part (probably a change that has not been + explained here), please let us know so we can correct it.</p> + + <p>As described above, the concept of reproducibility (during a project) + heavily relies on <a href="https://en.wikipedia.org/wiki/Version_control">version + control</a>. Currently Maneage + uses Git as its main version control system. If you are not already + familiar with Git, please read the first three chapters of the <a href="https://git-scm.com/book/en/v2">ProGit + book</a> which provides a wonderful practical + understanding of the basics. You can read later chapters as you get more + advanced in later stages of your work.</p> + + <h2>First custom commit</h2> + + <ol> + <li><p><strong>Get this repository and its history</strong> (if you don't already have it): + Arguably the easiest way to start is to clone Maneage and prepare for + your customizations as shown below. After cloning, you first rename + the default <code>origin</code> remote server to specify that this is Maneage's + remote server. This will allow you to use the conventional <code>origin</code> + name for your own project as shown in the next steps. 
Second, you will + create and go into the conventional <code>master</code> branch to start + committing in your project later.</p> + + <pre><code> +git clone https://git.maneage.org/project.git <span class="comment"># Clone/copy the project and its history.</span> +mv project my-project <span class="comment"># Change the name to your project's name.</span> +cd my-project <span class="comment"># Go into the cloned directory.</span> +git remote rename origin origin-maneage <span class="comment"># Rename current/only remote to "origin-maneage".</span> +git checkout -b master <span class="comment"># Create and enter your own "master" branch.</span> +pwd <span class="comment"># Just to confirm where you are.</span> + </code></pre></li> + <li><p><strong>Prepare to build project</strong>: The <code>./project configure</code> command of the + next step will build the different software packages within the + "build" directory (that you will specify). Nothing else on your system + will be touched. However, since it takes a long time, it is useful to see + what is being built at every instant (it's almost impossible to tell + from the torrent of commands that are produced!). So open another + terminal on your desktop and navigate to the same project directory + that you cloned (output of last command above). Then run the following + command. Once every second, this command will just print the date + (possibly followed by a non-existent directory notice). But as soon as + the next step starts building software, you'll see the names of + software get printed as they are being built. Once any software is + installed in the project build directory, its name will be removed from the printed list. 
Again, + don't worry, nothing will be installed outside the build directory.</p> + + <pre><code> +<span class="comment"># On another terminal (go to top project source directory, last command above)</span> +./project --check-config + </code></pre></li> + <li><p><strong>Test Maneage</strong>: Before making any changes, it is important to test it + and see if everything works properly with the commands below. If there + is any problem in the <code>./project configure</code> or <code>./project make</code> steps, + please contact us to fix the problem before continuing. Since the + building of dependencies in configuration can take a long time, you can take + the next few steps (editing the files) while it's working (they don't + affect the configuration). After <code>./project make</code> is finished, open + <code>paper.pdf</code>. If it looks fine, you are ready to start customizing + Maneage for your project. But before that, clean all the extra Maneage + outputs with <code>./project make clean</code>.</p> + + <pre><code> +./project configure <span class="comment"># Build the project's software environment (can take an hour or so).</span> +./project make <span class="comment"># Do the processing and build paper (just a simple demo).</span> + <span class="comment"># Open 'paper.pdf' and see if everything is ok.</span> + </code></pre></li> + <li><p><strong>Set up the remote</strong>: You can use any <a href="https://en.wikipedia.org/wiki/Comparison_of_source_code_hosting_facilities">hosting + facility</a> + that supports Git to keep an online copy of your project's version + controlled history. We recommend <a href="https://gitlab.com">GitLab</a> because + it is <a href="https://www.gnu.org/software/repo-criteria-evaluation.html">more ethical (although not + perfect)</a>, + and later you can also host GitLab on your own server. Anyway, create + an account in your favorite hosting facility (if you don't already + have one), and define a new project there. 
Please make sure <em>the newly + created project is empty</em> (some services ask to include a <code>README</code> in + a new project which is bad in this scenario, and will not allow you to + push to it). It will give you a URL (usually starting with <code>git@</code> and + ending in <code>.git</code>); put this URL in place of <code>XXXXXXXXXX</code> in the first + command below. With the second command, "push" your <code>master</code> branch to + your <code>origin</code> remote, and (with the <code>--set-upstream</code> option) set them + to track/follow each other. However, the <code>maneage</code> branch is currently + tracking/following your <code>origin-maneage</code> remote (automatically set + when you cloned Maneage). So when pushing the <code>maneage</code> branch to your + <code>origin</code> remote, you <em>shouldn't</em> use <code>--set-upstream</code>. With the last + command, you can actually check this (which local and remote branches + are tracking each other).</p> + + <pre><code> +git remote add origin XXXXXXXXXX <span class="comment"># Newly created repo is now called 'origin'.</span> +git push --set-upstream origin master <span class="comment"># Push 'master' branch to 'origin' (with tracking).</span> +git push origin maneage <span class="comment"># Push 'maneage' branch to 'origin' (no tracking).</span> + </code></pre></li> + <li><p><strong>Title</strong>, <strong>short description</strong> and <strong>author</strong>: The title and basic + information of your project's output PDF paper should be added in + <code>paper.tex</code>. You should see the relevant place in the preamble (prior + to <code>\begin{document}</code>). After you are done, run the <code>./project make</code> + command again to see your changes in the final PDF, and make sure that + your changes don't cause a crash in LaTeX. 
Of course, if you use a + different LaTeX package/style for managing the title and authors (in + particular a specific journal's style), please feel free to use + your own methods after finishing this checklist and doing your first + commit.</p></li> + <li><p><strong>Delete dummy parts</strong>: Maneage contains some parts that are only for + the initial/test run, mainly as a demonstration of important steps, + which you can use as a reference in your own project. But they + are not for any real analysis, so you should remove these parts as + described below:</p> + + <ul> + <li><p><code>paper.tex</code>: 1) Delete the text of the abstract (from + <code>\includeabstract{</code> to <code>\vspace{0.25cm}</code>) and write your own (a + single sentence can be enough now, you can complete it later). 2) + Add some keywords under it in the keywords part. 3) Delete + everything between <code>%% Start of main body.</code> and <code>%% End of main + body.</code>. 4) Remove the notice in the "Acknowledgments" section (in + <code>\new{}</code>) and acknowledge your funding sources (this can also be + done later). Just don't delete the existing acknowledgment + statement: Maneage is possible thanks to funding from several + grants. Since Maneage is being used in your work, it is necessary to + acknowledge them in your work also.</p></li> + <li><p><code>reproduce/analysis/make/top-make.mk</code>: Delete the <code>delete-me</code> line + in the <code>makesrc</code> definition. 
Just make sure there is no empty line + between the <code>download \</code> and <code>verify \</code> lines (they should be + directly under each other).</p></li> + <li><p><code>reproduce/analysis/make/verify.mk</code>: In the final recipe, under the + commented line <code>Verify TeX macros</code>, remove the full line that + contains <code>delete-me</code>, and set the value of <code>s</code> in the line for + <code>download</code> to <code>XXXXX</code> (any temporary string, you'll fix it at the + end of your project, when it's complete).</p></li> + <li><p>Delete all <code>delete-me*</code> files in the following directories:</p> + <pre><code> +rm tex/src/delete-me* +rm reproduce/analysis/make/delete-me* +rm reproduce/analysis/config/delete-me* + </code></pre></li> + <li><p>Disable verification of outputs by removing the <code>yes</code> from + <code>reproduce/analysis/config/verify-outputs.conf</code>. Later, when you are + ready to submit your paper, or publish the dataset, activate + verification and make the proper corrections in this file (described + under the "Other basic customizations" section below). This is a + critical step and only takes a few minutes when your project is + finished. So DON'T FORGET to activate it in the end.</p></li> + <li><p>Re-make the project (after a cleaning) to make sure you haven't + introduced any errors.</p> + + <pre><code> +./project make clean +./project make + </code></pre></li> + </ul></li> + <li><p><strong>Don't merge some files in future updates</strong>: As described below, you + can later update your infrastructure (for example to fix bugs) by + merging your <code>master</code> branch with <code>maneage</code>. For files that you have + created in your own branch, there will be no problem. However if you + modify an existing Maneage file for your project, the next time it's + updated on <code>maneage</code> you'll have an annoying conflict. The commands + below show how to avoid this future problem. 
With them, you can + configure Git to ignore the changes in <code>maneage</code> for some of the files + you have already edited and deleted above (and will edit below). Note + that only the first <code>echo</code> command has a <code>></code> (to write over the file), + the rest are <code>>></code> (to append to it). If you want to avoid any other + set of files to be imported from Maneage into your project's branch, + you can follow a similar strategy. We recommend only doing it when you + encounter the same conflict in more than one merge and there is no + other change in that file. Also, don't add core Maneage Makefiles, + otherwise Maneage can break on the next run.</p> + + <pre><code> +echo "paper.tex merge=ours" > .gitattributes +echo "tex/src/delete-me.mk merge=ours" >> .gitattributes +echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes +echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes +echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes +echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes +git add .gitattributes + </code></pre></li> + <li><p><strong>Copyright and License notice</strong>: It is necessary that <em>all</em> the + "copyright-able" files in your project (those larger than 10 lines) + have a copyright and license notice. Please take a moment to look at + several existing files to see a few examples. The copyright notice is + usually close to the start of the file, it is the line starting with + <code>Copyright (C)</code> and containing a year and the author's name (like the + examples below). The License notice is a short description of the + copyright license, usually one or two paragraphs with a URL to the + full license. Don't forget to add these <em>two</em> notices to <em>any new + file</em> you add in your project (you can just copy-and-paste). 
When you + modify an existing Maneage file (which already has the notices), just + add a copyright notice in your name under the existing one(s), like + the line with capital letters below. To start with, add this line with + your name and email address to <code>paper.tex</code>, + <code>tex/src/preamble-header.tex</code>, <code>reproduce/analysis/make/top-make.mk</code>, + and generally, all the files you modified in the previous step.</p> + + <pre><code> +Copyright (C) 2018-2020 Existing Name <existing@email.address> +Copyright (C) 2020 YOUR NAME <YOUR@EMAIL.ADDRESS> + </code></pre></li> + <li><p><strong>Configure Git for the first time</strong>: If this is the first time you are + running Git on this system, then you have to configure it with some + basic information in order to have essential information in the commit + messages (ignore this step if you have already done it). Git will + include your name and e-mail address in each commit. You + can also specify your favorite text editor for writing the commit message + (<code>emacs</code>, <code>vim</code>, <code>nano</code>, etc.).</p> + + <pre><code> +git config --global user.name "YourName YourSurname" +git config --global user.email your-email@example.com +git config --global core.editor nano + </code></pre></li> + <li><p><strong>Your first commit</strong>: You have already made some small and basic + changes in the steps above and you are in your project's <code>master</code> + branch. So, you can officially make your first commit in your + project's history and push it. But before that, you need to make sure + that there are no problems in the project. 
It is a good habit to + always re-build the project before a commit to be sure it works as + expected.</p> + + <pre><code> +git status <span class="comment"># See which files you have changed.</span> +git diff <span class="comment"># Check the lines you have added/changed.</span> +./project make <span class="comment"># Make sure everything builds successfully.</span> +git add -u <span class="comment"># Put all tracked changes in staging area.</span> +git status <span class="comment"># Make sure everything is fine.</span> +git diff --cached <span class="comment"># Confirm all the changes that will be committed.</span> +git commit <span class="comment"># Your first commit: put a good description!</span> +git push <span class="comment"># Push your commit to your remote.</span> + </code></pre></li> + <li><p><strong>Start your exciting research</strong>: You are now ready to add flesh and + blood to this raw skeleton by further modifying and adding your + exciting research steps. You can use the "published works" section in + the introduction (above) as some fully working models to learn + from. Also, don't hesitate to contact us if you have any + questions.</p></li> + </ol> + + <h2>Other basic customizations</h2> + + <ul> + <li><p><strong>High-level software</strong>: Maneage installs all the software that your + project needs. You can specify which software your project needs in + <code>reproduce/software/config/TARGETS.conf</code>. The necessary software is + classified into two classes: 1) programs or libraries (usually written + in C/C++) which are run directly by the operating system; 2) Python + modules/libraries that are run within Python. By default + <code>TARGETS.conf</code> only has GNU Astronomy Utilities (Gnuastro) as one + scientific program and Astropy as one scientific Python module. Both + have many dependencies which will be installed into your project + during the configuration step. 
To see the list of software that is + currently ready to be built in Maneage, see + <code>reproduce/software/config/versions.conf</code> (which has their versions + also). The comments in <code>TARGETS.conf</code> describe how to use the software + name from <code>versions.conf</code>. Currently the raw pipeline just uses + Gnuastro to make the demonstration plots. Therefore if you don't need + Gnuastro, go through the analysis steps in <code>reproduce/analysis</code> and + remove all its use cases (clearly marked).</p></li> + <li><p><strong>Input dataset</strong>: The input datasets are managed through the + <code>reproduce/analysis/config/INPUTS.conf</code> file. It is best to gather all + the information regarding all the input datasets into this one central + file. To ensure that the proper dataset is being downloaded and used + by the project, it is also recommended to get an <a href="https://en.wikipedia.org/wiki/MD5">MD5 + checksum</a> of the file and include + that in <code>INPUTS.conf</code> so the project can check it automatically. The + preparation/downloading of the input datasets is done in + <code>reproduce/analysis/make/download.mk</code>. Have a look there to see how + these values are to be used. This information about the input datasets + is also used in the initial <code>configure</code> script (to inform the users), + so also modify that file. You can find all occurrences of the demo + dataset with the command below and replace them with your own input + dataset.</p> + + <pre><code> +grep -ir wfpc2 ./* + </code></pre></li> + <li><p><strong><code>README.md</code></strong>: Correct all the <code>XXXXX</code> placeholders (name of your + project, your own name, address of your project's online/remote + repository, link to download dependencies, etc.). Generally, read + over the text and update it where necessary to fit your project. 
Don't + forget that this is the first file that is displayed on your online + repository and also the first file your colleagues will be drawn to + read. Therefore, make it as easy as possible for them to start + with. Also check and update this file one last time when you are ready + to publish your project's paper/source.</p></li> + <li><p><strong>Verify outputs</strong>: During the initial customization checklist, you + disabled verification. This is natural because during the project you + need to make changes all the time and it's a waste of time to enable + verification every time. But at significant moments of the project + (for example before submission to a journal, or publication) it is + necessary. When you activate verification, before building the paper, + all the specified datasets will be compared with their respective + checksum and if any file's checksum is different from the one recorded + in the project, the project will stop and print the problematic file and its + expected and calculated checksums. First set the value of the + <code>verify-outputs</code> variable in + <code>reproduce/analysis/config/verify-outputs.conf</code> to <code>yes</code>. Then go to + <code>reproduce/analysis/make/verify.mk</code>. The verification of all the files + is done in a single recipe. First the files that go into the + plots/figures are checked, then the LaTeX macros. Validation of the + former (inputs to plots/figures) should be done manually. If it's the + first time you are doing this, you can see two examples of the dummy + steps (with <code>delete-me</code>, you can use them if you like). These two + examples should be removed before you can run the project. For the + latter, you just have to update the checksums. The important thing to + consider is that a simple checksum can be problematic because some + file generators print their run-time date in the file (for example as + commented lines in a text table). 
When checking text files, this + Makefile already provides a function for that: + <code>verify-txt-no-comments-leading-space</code>. As the name suggests, it will + remove comment lines and empty lines before calculating the MD5 + checksum. For FITS formats (common in astronomy), fortunately there is + a <code>DATASUM</code> definition which will return the checksum independent of + the headers. You can use the provided function(s), or define one for + your special formats.</p></li> + <li><p><strong>Feedback</strong>: As you use Maneage you will notice many things that if + implemented from the start would have been very useful for your + work. This can be in the actual scripting and architecture of Maneage, + or useful implementation and usage tips, like those below. In any + case, please share your thoughts and suggestions with us, so we can + add them here for everyone's benefit.</p></li> + <li><p><strong>Re-preparation</strong>: Automatic preparation is only run in the first run + of the project on a system. To re-do the preparation you have to use + the option below. Here is the reason for this: when it is necessary, the + preparation process can be slow and will unnecessarily slow down the + whole project while the project is under development (focus is on the + analysis that is done after preparation). Because of this, preparation + will be done automatically for the first time that the project is run + (when <code>.build/software/preparation-done.mk</code> doesn't exist). After the + preparation process completes once, future runs of <code>./project make</code> + will not do the preparation process anymore (will not call + <code>top-prepare.mk</code>). They will only call <code>top-make.mk</code> for the + analysis. 
To manually invoke the preparation process after the first + attempt, <code>./project make</code> should be run with the + <code>--prepare-redo</code> option, or you can delete the special file above.</p> + + <pre><code> +./project make --prepare-redo + </code></pre></li> + <li><p><strong>Pre-publication: add notice on reproducibility</strong>: Add a notice + somewhere prominent on the first page of your paper, informing the + reader that your research is fully reproducible. For example at the + end of the abstract, or under the keywords with a title like + "reproducible paper". This will encourage them to publish their own + works in this manner and will also help spread the word.</p></li> + </ul> + + <h1>Tips for designing your project</h1> + + <p>The following is a list of design points, tips, or recommendations that + have been learned after some experience with this type of project + management. Please don't hesitate to share with us any experience you gain + after using it. In this way, we can add it here (giving full credit) + for the benefit of others.</p> + + <ul> + <li><p><strong>Modularity</strong>: Modularity is the key to easy and clean growth of a + project. So it is always best to break up a job into as many + sub-components as reasonable. Here are some tips to stay modular.</p> + + <ul> + <li><p><em>Short recipes</em>: If you see the recipe of a rule becoming more than a + handful of lines which involve significant processing, it is probably + a good sign that you should break up the rule into its main + components. Try to only have one major processing step per rule.</p></li> + <li><p><em>Context-based (many) Makefiles</em>: For maximum modularity, this design + allows easy inclusion of many Makefiles: in + <code>reproduce/analysis/make/*.mk</code> for analysis steps, and + <code>reproduce/software/make/*.mk</code> for building software. 
So keep the + rules for each closely related part of the processing in its own + Makefile.</p></li> + <li><p><em>Descriptive names</em>: Be very clear and descriptive with the naming of + the files and the variables because a few months after the + processing, it will be very hard to remember what each one was + for. Also this helps others (your collaborators or other people + reading the project source after it is published) to more easily + understand your work and find their way around.</p></li> + <li><p><em>Naming convention</em>: As the project grows, following a single standard + or convention in naming the files is very useful. Try your best to use + multi-word filenames for anything that is non-trivial (separating + the words with a <code>-</code>). For example if you have a Makefile for + creating a catalog and another two for processing it under models A + and B, you can name them like this: <code>catalog-create.mk</code>, + <code>catalog-model-a.mk</code> and <code>catalog-model-b.mk</code>. In this way, when + listing the contents of <code>reproduce/analysis/make</code> to see all the + Makefiles, those related to the catalog will all be close to each + other and thus easily found. This also helps in auto-completions by + the shell or text editors like Emacs.</p></li> + <li><p><em>Source directories</em>: If you need to add files in other languages (for + example shell, Python, AWK or C), keep the files of each + language in a separate directory under <code>reproduce/analysis</code>, with the + appropriate name.</p></li> + <li><p><em>Configuration files</em>: If your research uses special programs as part + of the processing, put all their configuration files in a dedicated + directory (with the program's name) within + <code>reproduce/software/config</code>, similar to the + <code>reproduce/software/config/gnuastro</code> directory (which is put in + Maneage as a demo in case you use GNU Astronomy Utilities). 
It is + much cleaner and readable (thus less buggy) to avoid mixing the + configuration files, even if there is no technical necessity.</p></li> + </ul></li> + <li><p><strong>Contents</strong>: It is good practice to follow the following + recommendations on the contents of your files, whether they are source + code for a program, Makefiles, scripts or configuration files + (copyrights aren't necessary for the latter).</p> + + <ul> + <li><p><em>Copyright</em>: Always start a file containing programming constructs + with a copyright statement like the ones that Maneage starts with + (for example in the top level <code>Makefile</code>).</p></li> + <li><p><em>Comments</em>: Comments are vital for readability (by yourself in two + months, or others). Describe everything you can about why you are + doing something, how you are doing it, and what you expect the result + to be. Write the comments as if it was what you would say to describe + the variable, recipe or rule to a friend sitting beside you. When + writing the project it is very tempting to just steam ahead with + commands and codes, but be patient and write comments before the + rules or recipes. This will also allow you to think more about what + you should be doing. Also, in several months when you come back to + the code, you will appreciate the effort of writing them. Just don't + forget to also read and update the comment first if you later want to + make changes to the code (variable, recipe or rule). As a general + rule of thumb: first the comments, then the code.</p></li> + <li><p><em>File title</em>: In general, it is good practice to start all files with + a single line description of what that particular file does. If + further information about the totality of the file is necessary, add + it after a blank line. This will help a fast inspection where you + don't care about the details, but just want to remember/see what that + file is (generally) for. 
This information must of course be commented + (it's for a human), but this is kept separate from the general + recommendation on comments, because this is a comment for the whole + file, not each step within it.</p></li> + </ul></li> + <li><p><strong>Make programming</strong>: Here are some lessons that we have + learned over the years of using Make and that are useful/handy in research + contexts.</p> + + <ul> + <li><p><em>Environment of each recipe</em>: If you need to define a special + environment (or aliases, or scripts to run) for all the recipes in + your Makefiles, you can use the Bash startup file + <code>reproduce/software/shell/bashrc.sh</code>. This file is loaded before every + Make recipe is run, just like the <code>.bashrc</code> in your home directory is + loaded every time you start a new interactive, non-login terminal. See + the comments in that file for more.</p></li> + <li><p><em>Automatic variables</em>: These are wonderful and very useful Make + constructs that greatly shrink the text, while helping in + readability, robustness (fewer bugs from typos for example) and + generalization. For example even when a rule only has one target or + one prerequisite, always use <code>$@</code> instead of the target's name, <code>$<</code> + instead of the first prerequisite, <code>$^</code> instead of the full list of + prerequisites, etc. You can see the full list of automatic + variables + <a href="https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html">here</a>. If + you use GNU Make, you can also see this page on your command-line:</p> + + <pre><code> +info make "automatic variables" + </code></pre></li> + <li><p><em>Debug</em>: Since Make doesn't follow the common top-down paradigm, it + can be a little hard to understand why you get an error or + unexpected behavior. In such cases, run Make with the <code>-d</code> + option. 
With this option, Make prints a full list of exactly which + prerequisites are being checked for which targets. Looking + (patiently) through this output and searching for the faulty + file/step will clearly show you any mistake you might have made in + defining the targets or prerequisites.</p></li> + <li><p><em>Large files</em>: If you are dealing with very large files (thus having + multiple copies of them for intermediate steps is not possible), one + solution is the following strategy (Also see the next item on "Fast + access to temporary files"). Set a small plain text file as the + actual target and delete the large file when it is no longer needed + by the project (in the last rule that needs it). Below is a simple + demonstration of doing this. In it, we use Gnuastro's Arithmetic + program to add all pixels of the input image with 2 and create + <code>large1.fits</code>. We then subtract 2 from <code>large1.fits</code> to create + <code>large2.fits</code> and delete <code>large1.fits</code> in the same rule (when its no + longer needed). We can later do the same with <code>large2.fits</code> when it + is no longer needed and so on. + <pre><code> large1.fits.txt: input.fits - astarithmetic $< 2 + --output=$(subst .txt,,$@) - echo "done" > $@ +astarithmetic $< 2 + --output=$(subst .txt,,$@) +echo "done" > $@ large2.fits.txt: large1.fits.txt - astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@) - rm $(subst .txt,,$<) - echo "done" > $@ -</code> -A more advanced Make programmer will use Make's <a href="https://www.gnu.org/software/make/manual/html_node/Call-Function.html">call -function</a> -to define a wrapper in <code>reproduce/analysis/make/initialize.mk</code>. This -wrapper will replace <code>$(subst .txt,,XXXXX)</code>. 
Therefore, it will be -possible to greatly simplify this repetitive statement and make the -code even more readable throughout the whole project.</p></li> -<li><p><em>Fast access to temporary files</em>: Most Unix-like operating systems -will give you a special shared-memory device (directory): on systems -using the GNU C Library (all GNU/Linux system), it is <code>/dev/shm</code>. The -contents of this directory are actually in your RAM, not in your -persistence storage like the HDD or SSD. Reading and writing from/to -the RAM is much faster than persistent storage, so if you have enough -RAM available, it can be very beneficial for large temporary files to -be put there. You can use the <code>mktemp</code> program to give the temporary -files a randomly-set name, and use text files as targets to keep that -name (as described in the item above under "Large files") for later -deletion. For example, see the minimal working example Makefile below -(which you can actually put in a <code>Makefile</code> and run if you have an -<code>input.fits</code> in the same directory, and Gnuastro is installed). -<code> +astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@) +rm $(subst .txt,,$<) +echo "done" > $@ + </code></pre> + A more advanced Make programmer will use Make's <a href="https://www.gnu.org/software/make/manual/html_node/Call-Function.html">call function</a> + to define a wrapper in <code>reproduce/analysis/make/initialize.mk</code>. This + wrapper will replace <code>$(subst .txt,,XXXXX)</code>. Therefore, it will be + possible to greatly simplify this repetitive statement and make the + code even more readable throughout the whole project.</p></li> + <li><p><em>Fast access to temporary files</em>: Most Unix-like operating systems + will give you a special shared-memory device (directory): on systems + using the GNU C Library (all GNU/Linux system), it is <code>/dev/shm</code>. 
The + contents of this directory are actually in your RAM, not in your + persistence storage like the HDD or SSD. Reading and writing from/to + the RAM is much faster than persistent storage, so if you have enough + RAM available, it can be very beneficial for large temporary files to + be put there. You can use the <code>mktemp</code> program to give the temporary + files a randomly-set name, and use text files as targets to keep that + name (as described in the item above under "Large files") for later + deletion. For example, see the minimal working example Makefile below + (which you can actually put in a <code>Makefile</code> and run if you have an + <code>input.fits</code> in the same directory, and Gnuastro is installed). + <pre><code> .ONESHELL: .SHELLFLAGS = -ec all: mean-std.txt shm-maneage := /dev/shm/$(shell whoami)-maneage-XXXXXXXXXX large1.txt: input.fits - out=$$(mktemp $(shm-maneage)) - astarithmetic $< 2 + --output=$$out.fits - echo "$$out" > $@ +out=$$(mktemp $(shm-maneage)) +astarithmetic $< 2 + --output=$$out.fits +echo "$$out" > $@ large2.txt: large1.txt - input=$$(cat $<) - out=$$(mktemp $(shm-maneage)) - astarithmetic $$input.fits 2 - --output=$$out.fits - rm $$input.fits $$input - echo "$$out" > $@ +input=$$(cat $<) +out=$$(mktemp $(shm-maneage)) +astarithmetic $$input.fits 2 - --output=$$out.fits +rm $$input.fits $$input +echo "$$out" > $@ mean-std.txt: large2.txt - input=$$(cat $<) - aststatistics $$input.fits --mean --std > $@ - rm $$input.fits $$input -</code> -The important point here is that the temporary name template -(<code>shm-maneage</code>) has no suffix. So you can add the suffix -corresponding to your desired format afterwards (for example -<code>$$out.fits</code>, or <code>$$out.txt</code>). But more importantly, when <code>mktemp</code> -sets the random name, it also checks if no file exists with that name -and creates a file with that exact name at that moment. 
So at the end -of each recipe above, you'll have two files in your <code>/dev/shm</code>, one -empty file with no suffix one with a suffix. The role of the file -without a suffix is just to ensure that the randomly set name will -not be used by other calls to <code>mktemp</code> (when running in parallel) and -it should be deleted with the file containing a suffix. This is the -reason behind the <code>rm $$input.fits $$input</code> command above: to make -sure that first the file with a suffix is deleted, then the core -random file (note that when working in parallel on powerful systems, -in the time between deleting two files of a single <code>rm</code> command, many -things can happen!). When using Maneage, you can put the definition -of <code>shm-maneage</code> in <code>reproduce/analysis/make/initialize.mk</code> to be -usable in all the different Makefiles of your analysis, and you won't -need the three lines above it. <strong>Finally, BE RESPONSIBLE:</strong> after you -are finished, be sure to clean up any possibly remaining files (due -to crashes in the processing while you are working), otherwise your -RAM may fill up very fast. You can do it easily with a command like -this on your command-line: <code>rm -f /dev/shm/$(whoami)-*</code>.</p></li> -</ul></li> -<li><p><strong>Software tarballs and raw inputs</strong>: It is critically important to - document the raw inputs to your project (software tarballs and raw - input data):</p> - -<ul> -<li><p><em>Keep the source tarball of dependencies</em>: After configuration -finishes, the <code>.build/software/tarballs</code> directory will contain all -the software tarballs that were necessary for your project. You can -mirror the contents of this directory to keep a backup of all the -software tarballs used in your project (possibly as another version -controlled repository) that is also published with your project. 
Note -that software web-pages are not written in stone and can suddenly go -offline or not be accessible in some conditions. This backup is thus -very important. If you intend to release your project in a place like -Zenodo, you can upload/keep all the necessary tarballs (and data) -there with your -project. <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a> is -one example of how the data, Gnuastro (main software used) and all -major Gnuastro's dependencies have been uploaded with the project's -source. Just note that this is only possible for free and open-source -software.</p></li> -<li><p><em>Keep your input data</em>: The input data is also critical to the -project's reproducibility, so like the above for software, make sure -you have a backup of them, or their persistent identifiers (PIDs).</p></li> -</ul></li> -<li><p><strong>Version control</strong>: Version control is a critical component of - Maneage. Here are some tips to help in effectively using it.</p> - -<ul> -<li><p><em>Regular commits</em>: It is important (and extremely useful) to have the -history of your project under version control. So try to make commits -regularly (after any meaningful change/step/result).</p></li> -<li><p><em>Keep Maneage up-to-date</em>: In time, Maneage is going to become more -and more mature and robust (thanks to your feedback and the feedback -of other users). Bugs will be fixed and new/improved features will be -added. So every once and a while, you can run the commands below to -pull new work that is done in Maneage. If the changes are useful for -your work, you can merge them with your project to benefit from -them. 
Just pay <strong>very close attention</strong> to resolving possible -<strong>conflicts</strong> which might happen in the merge (updated settings that -you have customized in Maneage).</p> - -<p><code>shell -$ git checkout maneage -$ git pull # Get recent work in Maneage -$ git log XXXXXX..XXXXXX --reverse # Inspect new work (replace XXXXXXs with hashs mentioned in output of previous command). -$ git log --oneline --graph --decorate --all # General view of branches. -$ git checkout master # Go to your top working branch. -$ git merge maneage # Import all the work into master. -</code></p></li> -<li><p><em>Adding Maneage to a fork of your project</em>: As you and your colleagues -continue your project, it will be necessary to have separate -forks/clones of it. But when you clone your own project on a -different system, or a colleague clones it to collaborate with you, -the clone won't have the <code>origin-maneage</code> remote that you started the -project with. As shown in the previous item above, you need this -remote to be able to pull recent updates from Maneage. The steps -below will setup the <code>origin-maneage</code> remote, and a local <code>maneage</code> -branch to track it, on the new clone.</p> - -<p><code>shell -$ git remote add origin-maneage https://git.maneage.org/project.git -$ git fetch origin-maneage -$ git checkout -b maneage --track origin-maneage/maneage -</code></p></li> -<li><p><em>Commit message</em>: The commit message is a very important and useful -aspect of version control. To make the commit message useful for -others (or yourself, one year later), it is good to follow a -consistent style. Maneage already has a consistent formatting -(described below), which you can also follow in your project if you -like. You can see many examples by running <code>git log</code> in the <code>maneage</code> -branch. If you intend to push commits to Maneage, for the consistency -of Maneage, it is necessary to follow these guidelines. 
1) No line -should be more than 75 characters (to enable easy reading of the -message when you run <code>git log</code> on the standard 80-character -terminal). 2) The first line is the title of the commit and should -summarize it (so <code>git log --oneline</code> can be useful). The title should -also not end with a point (<code>.</code>, because its a short single sentence, -so a point is not necessary and only wastes space). 3) After the -title, leave an empty line and start the body of your message -(possibly containing many paragraphs). 4) Describe the context of -your commit (the problem it is trying to solve) as much as possible, -then go onto how you solved it. One suggestion is to start the main -body of your commit with "Until now ...", and continue describing the -problem in the first paragraph(s). Afterwards, start the next -paragraph with "With this commit ...".</p></li> -<li><p><em>Project outputs</em>: During your research, it is possible to checkout a -specific commit and reproduce its results. However, the processing -can be time consuming. Therefore, it is useful to also keep track of -the final outputs of your project (at minimum, the paper's PDF) in -important points of history. However, keeping a snapshot of these -(most probably large volume) outputs in the main history of the -project can unreasonably bloat it. It is thus recommended to make a -separate Git repo to keep those files and keep your project's source -as small as possible. For example if your project is called -<code>my-exciting-project</code>, the name of the outputs repository can be -<code>my-exciting-project-output</code>. This enables easy sharing of the output -files with your co-authors (with necessary permissions) and not -having to bloat your email archive with extra attachments also (you -can just share the link to the online repo in your -communications). 
After the research is published, you can also -release the outputs repository, or you can just delete it if it is -too large or un-necessary (it was just for convenience, and fully -reproducible after all). For example Maneage's output is available -for demonstration in <a href="http://git.maneage.org/output-raw.git/">a -separate</a> repository.</p></li> -<li><p><em>Full Git history in one file</em>: When you are publishing your project -(for example to Zenodo for long term preservation), it is more -convenient to have the whole project's Git history into one file to -save with your datasets. After all, you can't be sure that your -current Git server (for example GitLab, Github, or Bitbucket) will be -active forever. While they are good for the immediate future, you -can't rely on them for archival purposes. Fortunately keeping your -whole history in one file is easy with Git using the following -commands. To learn more about it, run <code>git help bundle</code>.</p> - -<ul> -<li>"bundle" your project's history into one file (just don't forget to -change <code>my-project-git.bundle</code> to a descriptive name of your -project):</li> -</ul> - -<p><code>shell -$ git bundle create my-project-git.bundle --all -</code></p> - -<ul> -<li>You can easily upload <code>my-project-git.bundle</code> anywhere. Later, if -you need to un-bundle it, you can use the following command.</li> -</ul> - -<p><p><p><code>shell -$ git clone my-project-git.bundle -</code></p></li> -</ul></p></li> -</ul></p> - -<h1>Future improvements</h1> - -<p>This is an evolving project and as time goes on, it will evolve and become -more robust. Some of the most prominent issues we plan to implement in the -future are listed below, please join us if you are interested.</p> - -<h2>Package management</h2> - -<p>It is important to have control of the environment of the project. 
Maneage -currently builds the higher-level programs (for example GNU Bash, GNU Make, -GNU AWK and domain-specific software) it needs, then sets <code>PATH</code> so the -analysis is done only with the project's built software. But currently the -configuration of each program is in the Makefile rules that build it. This -is not good because a change in the build configuration does not -automatically cause a re-build. Also, each separate project on a system -needs to have its own built tools (that can waste a lot of space).</p> - -<p>A good solution is based on the <a href="https://nixos.org/nix/about.html">Nix package -manager</a>: a separate file is present for -each software, containing all the necessary info to build it (including its -URL, its tarball MD5 hash, dependencies, configuration parameters, build -steps and etc). Using this file, a script can automatically generate the -Make rules to download, build and install program and its dependencies -(along with the dependencies of those dependencies and etc).</p> - -<p>All the software are installed in a "store". Each installed file (library -or executable) is prefixed by a hash of this configuration (and the OS -architecture) and the standard program name. For example (from the Nix -webpage):</p> - -<p><code> +input=$$(cat $<) +aststatistics $$input.fits --mean --std > $@ +rm $$input.fits $$input + </code></pre> + The important point here is that the temporary name template + (<code>shm-maneage</code>) has no suffix. So you can add the suffix + corresponding to your desired format afterwards (for example + <code>$$out.fits</code>, or <code>$$out.txt</code>). But more importantly, when <code>mktemp</code> + sets the random name, it also checks if no file exists with that name + and creates a file with that exact name at that moment. So at the end + of each recipe above, you'll have two files in your <code>/dev/shm</code>, one + empty file with no suffix one with a suffix. 
The role of the file + without a suffix is just to ensure that the randomly set name will + not be used by other calls to <code>mktemp</code> (when running in parallel) and + it should be deleted with the file containing a suffix. This is the + reason behind the <code>rm $$input.fits $$input</code> command above: to make + sure that first the file with a suffix is deleted, then the core + random file (note that when working in parallel on powerful systems, + in the time between deleting two files of a single <code>rm</code> command, many + things can happen!). When using Maneage, you can put the definition + of <code>shm-maneage</code> in <code>reproduce/analysis/make/initialize.mk</code> to be + usable in all the different Makefiles of your analysis, and you won't + need the three lines above it. <strong>Finally, BE RESPONSIBLE:</strong> after you + are finished, be sure to clean up any possibly remaining files (due + to crashes in the processing while you are working), otherwise your + RAM may fill up very fast. You can do it easily with a command like + this on your command-line: <code>rm -f /dev/shm/$(whoami)-*</code>.</p></li> + </ul></li> + <li><p><strong>Software tarballs and raw inputs</strong>: It is critically important to + document the raw inputs to your project (software tarballs and raw + input data):</p> + + <ul> + <li><p><em>Keep the source tarball of dependencies</em>: After configuration + finishes, the <code>.build/software/tarballs</code> directory will contain all + the software tarballs that were necessary for your project. You can + mirror the contents of this directory to keep a backup of all the + software tarballs used in your project (possibly as another version + controlled repository) that is also published with your project. Note + that software web-pages are not written in stone and can suddenly go + offline or not be accessible in some conditions. This backup is thus + very important. 
If you intend to release your project in a place like + Zenodo, you can upload/keep all the necessary tarballs (and data) + there with your + project. <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a> is + one example of how the data, Gnuastro (main software used) and all + of Gnuastro's major dependencies have been uploaded with the project's + source. Just note that this is only possible for free and open-source + software.</p></li> + <li><p><em>Keep your input data</em>: The input data is also critical to the + project's reproducibility, so like the above for software, make sure + you have a backup of them, or their persistent identifiers (PIDs).</p></li> + </ul></li> + <li><p><strong>Version control</strong>: Version control is a critical component of + Maneage. Here are some tips to help in effectively using it.</p> + + <ul> + <li><p><em>Regular commits</em>: It is important (and extremely useful) to have the + history of your project under version control. So try to make commits + regularly (after any meaningful change/step/result).</p></li> + <li><p><em>Keep Maneage up-to-date</em>: In time, Maneage is going to become more + and more mature and robust (thanks to your feedback and the feedback + of other users). Bugs will be fixed and new/improved features will be + added. So every once in a while, you can run the commands below to + pull new work that is done in Maneage. If the changes are useful for + your work, you can merge them with your project to benefit from + them.
Just pay <strong>very close attention</strong> to resolving possible + <strong>conflicts</strong> that might happen in the merge (updated settings that + you have customized in Maneage).</p> + + <pre><code> +git checkout maneage +git pull <span class="comment"># Get recent work in Maneage</span> +git log XXXXXX..XXXXXX --reverse <span class="comment"># Inspect new work (replace XXXXXXs with hashes mentioned in the output of the previous command).</span> +git log --oneline --graph --decorate --all <span class="comment"># General view of branches.</span> +git checkout master <span class="comment"># Go to your top working branch.</span> +git merge maneage <span class="comment"># Import all the work into master.</span> + </code></pre></li> + <li><p><em>Adding Maneage to a fork of your project</em>: As you and your colleagues + continue your project, it will be necessary to have separate + forks/clones of it. But when you clone your own project on a + different system, or a colleague clones it to collaborate with you, + the clone won't have the <code>origin-maneage</code> remote that you started the + project with. As shown in the previous item above, you need this + remote to be able to pull recent updates from Maneage. The steps + below will set up the <code>origin-maneage</code> remote, and a local <code>maneage</code> + branch to track it, on the new clone.</p> + + <pre><code> +git remote add origin-maneage https://git.maneage.org/project.git +git fetch origin-maneage +git checkout -b maneage --track origin-maneage/maneage + </code></pre></li> + <li><p><em>Commit message</em>: The commit message is a very important and useful + aspect of version control. To make the commit message useful for + others (or yourself, one year later), it is good to follow a + consistent style. Maneage already has a consistent format + (described below), which you can also follow in your project if you + like.
You can see many examples by running <code>git log</code> in the <code>maneage</code> + branch. If you intend to push commits to Maneage, for the consistency + of Maneage, it is necessary to follow these guidelines. 1) No line + should be more than 75 characters (to enable easy reading of the + message when you run <code>git log</code> on the standard 80-character + terminal). 2) The first line is the title of the commit and should + summarize it (so <code>git log --oneline</code> can be useful). The title should + also not end with a point (<code>.</code>, because it's a short single sentence, + so a point is not necessary and only wastes space). 3) After the + title, leave an empty line and start the body of your message + (possibly containing many paragraphs). 4) Describe the context of + your commit (the problem it is trying to solve) as much as possible, + then go on to how you solved it. One suggestion is to start the main + body of your commit with "Until now ...", and continue describing the + problem in the first paragraph(s). Afterwards, start the next + paragraph with "With this commit ...".</p></li> + <li><p><em>Project outputs</em>: During your research, it is possible to check out a + specific commit and reproduce its results. However, the processing + can be time-consuming. Therefore, it is useful to also keep track of + the final outputs of your project (at minimum, the paper's PDF) at + important points of history. However, keeping a snapshot of these + (most probably large volume) outputs in the main history of the + project can unreasonably bloat it. It is thus recommended to make a + separate Git repo to keep those files and keep your project's source + as small as possible. For example if your project is called + <code>my-exciting-project</code>, the name of the outputs repository can be + <code>my-exciting-project-output</code>.
This enables easy sharing of the output + files with your co-authors (with necessary permissions) and not + having to bloat your email archive with extra attachments also (you + can just share the link to the online repo in your + communications). After the research is published, you can also + release the outputs repository, or you can just delete it if it is + too large or un-necessary (it was just for convenience, and fully + reproducible after all). For example Maneage's output is available + for demonstration in <a href="http://git.maneage.org/output-raw.git/">a + separate</a> repository.</p></li> + <li><p><em>Full Git history in one file</em>: When you are publishing your project + (for example to Zenodo for long term preservation), it is more + convenient to have the whole project's Git history into one file to + save with your datasets. After all, you can't be sure that your + current Git server (for example GitLab, Github, or Bitbucket) will be + active forever. While they are good for the immediate future, you + can't rely on them for archival purposes. Fortunately keeping your + whole history in one file is easy with Git using the following + commands. To learn more about it, run <code>git help bundle</code>.</p> + + <ul> + <li>"bundle" your project's history into one file (just don't forget to + change <code>my-project-git.bundle</code> to a descriptive name of your + project):</li> + </ul> + + <pre><code> +git bundle create my-project-git.bundle --all + </code></pre> + + <ul> + <li>You can easily upload <code>my-project-git.bundle</code> anywhere. Later, if + you need to un-bundle it, you can use the following command.</li> + </ul> + + <p><p><pre><code> +git clone my-project-git.bundle + </code></pre></li> + </ul></p></li> + </ul></p> + + <h1>Future improvements</h1> + + <p>This is an evolving project and as time goes on, it will evolve and become + more robust. 
Some of the most prominent issues we plan to implement in the + future are listed below, please join us if you are interested.</p> + + <h2>Package management</h2> + + <p>It is important to have control of the environment of the project. Maneage + currently builds the higher-level programs (for example GNU Bash, GNU Make, + GNU AWK and domain-specific software) it needs, then sets <code>PATH</code> so the + analysis is done only with the project's built software. But currently the + configuration of each program is in the Makefile rules that build it. This + is not good because a change in the build configuration does not + automatically cause a re-build. Also, each separate project on a system + needs to have its own built tools (that can waste a lot of space).</p> + + <p>A good solution is based on the <a href="https://nixos.org/nix/about.html">Nix package manager</a>: a separate file is present for + each software, containing all the necessary info to build it (including its + URL, its tarball MD5 hash, dependencies, configuration parameters, build + steps and etc). Using this file, a script can automatically generate the + Make rules to download, build and install program and its dependencies + (along with the dependencies of those dependencies and etc).</p> + + <p>All the software are installed in a "store". Each installed file (library + or executable) is prefixed by a hash of this configuration (and the OS + architecture) and the standard program name. For example (from the Nix + webpage):</p> + + <pre><code> /nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/ -</code></p> - -<p>The important thing is that the "store" is <em>not</em> in the project's search -path. After the complete installation of the software, symbolic links are -made to populate each project's program and library search paths without a -hash. This hash will be unique to that particular software and its -particular configuration. 
So simply by searching for this hash in the -installed directory, we can find the installed files of that software to -generate the links.</p> - -<p>This scenario has several advantages: 1) a change in a software's build -configuration triggers a rebuild. 2) a single "store" can be used in many -projects, thus saving space and configuration time for new projects (that -commonly have large overlaps in lower-level programs).</p> - -<h1>Appendix: Necessity of exact reproduction in scientific research</h1> - -<p>In case <a href="http://akhlaghi.org/reproducible-science.html">the link above</a> is -not accessible at the time of reading, here is a copy of the introduction -of that link, describing the necessity for a reproducible project like this -(copied on February 7th, 2018):</p> - -<p>The most important element of a "scientific" statement/result is the fact -that others should be able to falsify it. The Tsunami of data that has -engulfed astronomers in the last two decades, combined with faster -processors and faster internet connections has made it much more easier to -obtain a result. However, these factors have also increased the complexity -of a scientific analysis, such that it is no longer possible to describe -all the steps of an analysis in the published paper. Citing this -difficulty, many authors suffice to describing the generalities of their -analysis in their papers.</p> - -<p>However, It is impossible to falsify (or even study) a result if you can't -exactly reproduce it. The complexity of modern science makes it vitally -important to exactly reproduce the final result. Because even a small -deviation can be due to many different parts of an analysis. Nature is -already a black box which we are trying so hard to comprehend. 
Not letting -other scientists see the exact steps taken to reach a result, or not -allowing them to modify it (do experiments on it) is a self-imposed black -box, which only exacerbates our ignorance.</p> - -<p>Other scientists should be able to reproduce, check and experiment on the -results of anything that is to carry the "scientific" label. Any result -that is not reproducible (due to incomplete information by the author) is -not scientific: the readers have to have faith in the subjective experience -of the authors in the very important choice of configuration values and -order of operations: this is contrary to the scientific spirit.</p> - -<h2>Copyright information</h2> - -<p>This file is part of Maneage's core: https://git.maneage.org/project.git</p> - -<p>Maneage is free software: you can redistribute it and/or modify it under -the terms of the GNU General Public License as published by the Free -Software Foundation, either version 3 of the License, or (at your option) -any later version.</p> - -<p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY -WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS -FOR A PARTICULAR PURPOSE. See the GNU General Public License for more -details.</p> - -<p>You should have received a copy of the GNU General Public License along -with Maneage. If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + </code></pre> + + <p>The important thing is that the "store" is <em>not</em> in the project's search + path. After the complete installation of the software, symbolic links are + made to populate each project's program and library search paths without a + hash. This hash will be unique to that particular software and its + particular configuration. 
So simply by searching for this hash in the + installed directory, we can find the installed files of that software to + generate the links.</p> + + <p>This scenario has several advantages: 1) a change in a software's build + configuration triggers a rebuild. 2) a single "store" can be used in many + projects, thus saving space and configuration time for new projects (that + commonly have large overlaps in lower-level programs).</p> + + <h1>Appendix: Necessity of exact reproduction in scientific research</h1> + + <p>In case <a href="http://akhlaghi.org/reproducible-science.html">the link above</a> is + not accessible at the time of reading, here is a copy of the introduction + of that link, describing the necessity for a reproducible project like this + (copied on February 7th, 2018):</p> + + <p>The most important element of a "scientific" statement/result is the fact + that others should be able to falsify it. The Tsunami of data that has + engulfed astronomers in the last two decades, combined with faster + processors and faster internet connections has made it much more easier to + obtain a result. However, these factors have also increased the complexity + of a scientific analysis, such that it is no longer possible to describe + all the steps of an analysis in the published paper. Citing this + difficulty, many authors suffice to describing the generalities of their + analysis in their papers.</p> + + <p>However, It is impossible to falsify (or even study) a result if you can't + exactly reproduce it. The complexity of modern science makes it vitally + important to exactly reproduce the final result. Because even a small + deviation can be due to many different parts of an analysis. Nature is + already a black box which we are trying so hard to comprehend. 
Not letting + other scientists see the exact steps taken to reach a result, or not + allowing them to modify it (do experiments on it) is a self-imposed black + box, which only exacerbates our ignorance.</p> + + <p>Other scientists should be able to reproduce, check and experiment on the + results of anything that is to carry the "scientific" label. Any result + that is not reproducible (due to incomplete information by the author) is + not scientific: the readers have to have faith in the subjective experience + of the authors in the very important choice of configuration values and + order of operations: this is contrary to the scientific spirit.</p> + + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: https://git.maneage.org/project.git</p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + </div> + </body> |