author | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-11-26 03:45:54 +0000 |
---|---|---|
committer | Mohammad Akhlaghi <mohammad@akhlaghi.org> | 2020-11-26 03:45:54 +0000 |
commit | 6b87843fc38c1646615ab0342a703f7ab3caf1cb (patch) | |
tree | ea11daebe93d93f7e549fe9e3404248d850026f7 | |
parent | 56779683e1abd996f50d1e66055f4f5540a7d61c (diff) |
The long about.html is now broken up into smaller pages
The "About" page ('about.html') was effectively a full copy of
Maneage's 'README-hacking.md', so it was very long. To help in
readability it has now been broken down into smaller pages (one for
each section).
The indentation of Make recipes was also corrected, both in the about
pages and in the tutorial.
-rw-r--r-- | about-architecture.html | 353 |
-rw-r--r-- | about-citation.html | 208 |
-rw-r--r-- | about-customize.html | 444 |
-rw-r--r-- | about-future.html | 160 |
-rw-r--r-- | about-introduction.html | 177 |
-rw-r--r-- | about-make.html | 221 |
-rw-r--r-- | about-tips.html | 442 |
-rw-r--r-- | about.html | 1206 |
-rw-r--r-- | tutorial.html | 96 |
9 files changed, 2067 insertions, 1240 deletions
diff --git a/about-architecture.html b/about-architecture.html new file mode 100644 index 0000000..915b45f --- /dev/null +++ b/about-architecture.html @@ -0,0 +1,353 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + + + + <hr /> + <p align="right">Next: <a href="about-customize.html">Customization checklist</a>, Previous: <a href="about-make.html">Why Make?</a>, Up: <a href="about.html">About</a> </p> + + <h1>Project architecture</h1> + + <p>In order to customize Maneage to your research, it is important to first + understand its architecture so you can navigate your way in the directories + and understand how to implement your research project within its framework: + where to add new files and which existing files to modify for what + purpose. But if this the first time you are using Maneage, before reading + this theoretical discussion, please run Maneage once from scratch without + any changes (described in <code>README.md</code>). You will see how it works (note that + the configure step builds all necessary software, so it can take long, but + you can continue reading while its working).</p> + + <p>The project has two top-level directories: <code>reproduce</code> and + <code>tex</code>. 
<code>reproduce</code> hosts all the software building and analysis + steps. <code>tex</code> contains all the final paper's components to be compiled into + a PDF using LaTeX.</p> + + <p>The <code>reproduce</code> directory has two sub-directories: <code>software</code> and + <code>analysis</code>. As the name says, the former contains all the instructions to + download, build and install (independent of the host operating system) the + necessary software (these are called by the <code>./project configure</code> + command). The latter contains instructions on how to use those software to + do your project's analysis.</p> + + <p>After it finishes, <code>./project configure</code> will create the following symbolic + links in the project's top source directory: <code>.build</code> which points to the + top build directory and <code>.local</code> for easy access to the custom built + software installation directory. With these you can easily access the build + directory and project-specific software from your top source directory. For + example if you run <code>.local/bin/ls</code> you will be using the <code>ls</code> of Maneage, + which is probably different from your system's <code>ls</code> (run them both with + <code>--version</code> to check).</p> + + <p>Once the project is configured for your system, <code>./project make</code> will do + the basic preparations and run the project's analysis with the custom + version of software. The <code>project</code> script is just a wrapper, and with the + <code>make</code> argument, it will first call <code>top-prepare.mk</code> and <code>top-make.mk</code> + (both are in the <code>reproduce/analysis/make</code> directory).</p> + + <p>In terms of organization, <code>top-prepare.mk</code> and <code>top-make.mk</code> have an + identical design, only minor differences. So, let's continue Maneage's + architecture with <code>top-make.mk</code>. 
Once you understand that, you'll clearly + understand <code>top-prepare.mk</code> also. These very high-level files are + relatively short and heavily commented so hopefully the descriptions in + each comment will be enough to understand the general details. As you read + this section, please also look at the contents of the mentioned files and + directories to fully understand what is going on.</p> + + <p>Before starting to look into the top <code>top-make.mk</code>, it is important to + recall that Make defines dependencies by files. Therefore, the + input/prerequisite and output of every step/rule must be a file. Also + recall that Make will use the modification date of the prerequisite(s) and + target files to see if the target must be re-built or not. Therefore during + the processing, <em>many</em> intermediate files will be created (see the tips + section below on a good strategy to deal with large/huge files).</p> + + <p>To keep the source and (intermediate) built files separate, the user <em>must</em> + define a top-level build directory variable (or <code>$(BDIR)</code>) to host all the + intermediate files (you defined it during <code>./project configure</code>). This + directory doesn't need to be version controlled or even synchronized, or + backed-up in other servers: its contents are all products, and can be + easily re-created any time. As you define targets for your new rules, it is + thus important to place them all under sub-directories of <code>$(BDIR)</code>. As + mentioned above, you always have fast access to this "build"-directory with + the <code>.build</code> symbolic link. 
Also, beware to <em>never</em> make any manual change + in the files of the build-directory, just delete them (so they are + re-built).</p> + + <p>In this architecture, we have two types of Makefiles that are loaded into + the top <code>Makefile</code>: <em>configuration-Makefiles</em> (only independent + variables/configurations) and <em>workhorse-Makefiles</em> (Makefiles that + actually contain analysis/processing rules).</p> + + <p>The configuration-Makefiles are those that satisfy these two wildcards: + <code>reproduce/software/config/*.conf</code> (for building the necessary software + when you run <code>./project configure</code>) and <code>reproduce/analysis/config/*.conf</code> + (for the high-level analysis, when you run <code>./project make</code>). These + Makefiles don't actually have any rules, they just have values for various + free parameters throughout the configuration or analysis. Open a few of + them to see for yourself. These Makefiles must only contain raw Make + variables (project configurations). By "raw" we mean that the Make + variables in these files must not depend on variables in any other + configuration-Makefile. This is because we don't want to assume any order + in reading them. It is also very important to <em>not</em> define any rule, or + other Make construct, in these configuration-Makefiles.</p> + + <p>Following this rule-of-thumb enables you to set these configure-Makefiles + as a prerequisite to any target that depends on their variable + values. Therefore, if you change any of their values, all targets that + depend on those values will be re-built. This is very convenient as your + project scales up and gets more complex.</p> + + <p>The workhorse-Makefiles are those satisfying this wildcard + <code>reproduce/software/make/*.mk</code> and <code>reproduce/analysis/make/*.mk</code>. They + contain the details of the processing steps (Makefiles containing + rules). 
Therefore, in this phase <em>order is important</em>, because the + prerequisites of most rules will be the targets of other rules that will be + defined prior to them (not a fixed name like <code>paper.pdf</code>). The lower-level + rules must be imported into Make before the higher-level ones.</p> + + <p>All processing steps are assumed to ultimately (usually after many rules) + end up in some number, image, figure, or table that will be included in the + paper. The writing of these results into the final report/paper is managed + through separate LaTeX files that only contain macros (a name given to a + number/string to be used in the LaTeX source, which will be replaced when + compiling it to the final PDF). So the last target in a workhorse-Makefile + is a <code>.tex</code> file (with the same base-name as the Makefile, but in + <code>$(BDIR)/tex/macros</code>). As a result, if the targets in a workhorse-Makefile + aren't directly a prerequisite of other workhorse-Makefile targets, they + can be a prerequisite of that intermediate LaTeX macro file and thus be + called when necessary. Otherwise, they will be ignored by Make.</p> + + <p>Maneage also has a mode to share the build directory between several + users of a Unix group (when working on large computer clusters). In this + scenario, each user can have their own cloned project source, but share the + large built files between each other. To do this, it is necessary for all + built files to give full permission to group members while not allowing any + other users access to the contents. Therefore the <code>./project configure</code> and + <code>./project make</code> steps must be called with special conditions which are + managed in the <code>--group</code> option.</p> + + <p>Let's see how this design is implemented. Please open and inspect + <code>top-make.mk</code> it as we go along here. 
The first step (un-commented line) is + to import the local configuration (your answers to the questions of + <code>./project configure</code>). They are defined in the configuration-Makefile + <code>reproduce/software/config/LOCAL.conf</code> which was also built by <code>./project + configure</code> (based on the <code>LOCAL.conf.in</code> template of the same directory).</p> + + <p>The next non-commented set of the top <code>Makefile</code> defines the ultimate + target of the whole project (<code>paper.pdf</code>). But to avoid mistakes, a sanity + check is necessary to see if Make is being run with the same group settings + as the configure script (for example when the project is configured for + group access using the <code>./for-group</code> script, but Make isn't). Therefore we + use a Make conditional to define the <code>all</code> target based on the group + permissions.</p> + + <p>Having defined the top/ultimate target, our next step is to include all the + other necessary Makefiles. However, order matters in the importing of + workhorse-Makefiles and each must also have a TeX macro file with the same + base name (without a suffix). Therefore, the next step in the top-level + Makefile is to define the <code>makesrc</code> variable to keep the base names + (without a <code>.mk</code> suffix) of the workhorse-Makefiles that must be imported, + in the proper order.</p> + + <p>Finally, we import all the necessary remaining Makefiles: 1) All the + analysis configuration-Makefiles with a wildcard. 2) The software + configuration-Makefile that contains their version (just in case its + necessary). 3) All workhorse-Makefiles in the proper order using a Make + <code>foreach</code> loop.</p> + + <p>In short, to keep things modular, readable and manageable, follow these + recommendations: 1) Set clear-to-understand names for the + configuration-Makefiles, and workhorse-Makefiles, 2) Only import other + Makefiles from top Makefile. 
These will let you know/remember generally + which step you are taking before or after another. Projects will scale up + very fast. Thus if you don't start and continue with a clean and robust + convention like this, in the end it will become very dirty and hard to + manage/understand (even for yourself). As a general rule of thumb, break + your rules into as many logically-similar but independent steps as + possible.</p> + + <p>The <code>reproduce/analysis/make/paper.mk</code> Makefile must be the final Makefile + that is included. This workhorse Makefile ends with the rule to build + <code>paper.pdf</code> (final target of the whole project). If you look in it, you + will notice that this Makefile starts with a rule to create + <code>$(mtexdir)/project.tex</code> (<code>mtexdir</code> is just a shorthand name for + <code>$(BDIR)/tex/macros</code> mentioned before). As you see, the only dependency of + <code>$(mtexdir)/project.tex</code> is <code>$(mtexdir)/verify.tex</code> (which is the last + analysis step: it verifies all the generated results). Therefore, + <code>$(mtexdir)/project.tex</code> is <em>the connection</em> between the + processing/analysis steps of the project, and the steps to build the final + PDF.</p> + + <p>During the research, it often happens that you want to test a step that is + not a prerequisite of any higher-level operation. In such cases, you can + (temporarily) define that processing as a rule in the most relevant + workhorse-Makefile and set its target as a prerequisite of its TeX + macro. If your test gives a promising result and you want to include it in + your research, set it as prerequisites to other rules and remove it from + the list of prerequisites for TeX macro file. 
In fact, this is how a + project is designed to grow in this framework.</p> + + <h2>File modification dates (meta data)</h2> + + <p>While Git does an excellent job at keeping a history of the contents of + files, it makes no effort in keeping the file meta data, and in particular + the dates of files. Therefore when you checkout to a different branch, + files that are re-written by Git will have a newer date than the other + project files. However, file dates are important in the current design of + Maneage: Make checks the dates of the prerequisite files and target files + to see if the target should be re-built.</p> + + <p>To fix this problem, for Maneage we use a forked version of + <a href="https://github.com/mohammad-akhlaghi/metastore">Metastore</a>. Metastore use + a binary database file (which is called <code>.file-metadata</code>) to keep the + modification dates of all the files under version control. This file is + also under version control, but is hidden (because it shouldn't be modified + by hand). During the project's configuration, Maneage installs to Git hooks + to run Metastore 1) before making a commit to update its database with the + file dates in a branch, and 2) after doing a checkout, to reset the + file-dates after the checkout is complete and re-set the file dates back to + what they were.</p> + + <p>In practice, Metastore should work almost fully invisibly within your + project. The only place you might notice its presence is that you'll see + <code>.file-metadata</code> in the list of modified/staged files (commonly after + merging your branches). Since its a binary file, Git also won't show you + the changed contents. In a merge, you can simply accept any changes with + <code>git add -u</code>. 
But if Git is telling you that it has changed without a merge + (for example if you started a commit, but canceled it in the middle), you + can just do <code>git checkout .file-metadata</code> and set it back to its original + state.</p> + + <h2>Summary</h2> + + <p>Based on the explanation above, some major design points you should have in + mind are listed below.</p> + + <ul> + <li><p>Define new <code>reproduce/analysis/make/XXXXXX.mk</code> workhorse-Makefile(s) + with good and human-friendly name(s) replacing <code>XXXXXX</code>.</p></li> + <li><p>Add <code>XXXXXX</code>, as a new line, to the values in <code>makesrc</code> of the top-level + <code>Makefile</code>.</p></li> + <li><p>Do not use any constant numbers (or important names like filter names) + in the workhorse-Makefiles or paper's LaTeX source. Define such + constants as logically-grouped, separate configuration-Makefiles in + <code>reproduce/analysis/config/XXXXX.conf</code>. Then set this + configuration-Makefiles file as a prerequisite to any rule that uses + the variable defined in it.</p></li> + <li><p>Through any number of intermediate prerequisites, all processing steps + should end in (be a prerequisite of) <code>$(mtexdir)/verify.tex</code> (defined in + <code>reproduce/analysis/make/verify.mk</code>). 
<code>$(mtexdir)/verify.tex</code> is the sole + dependency of <code>$(mtexdir)/project.tex</code>, which is the bridge between the + processing steps and PDF-building steps of the project.</p></li> + </ul> + + <p align="right">Next: <a href="about-customize.html">Customization checklist</a>, Previous: <a href="about-make.html">Why Make?</a>, Up: <a href="about.html">About</a> </p> + + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> diff --git a/about-citation.html b/about-citation.html new file mode 100644 index 0000000..6129e14 --- /dev/null +++ b/about-citation.html @@ -0,0 +1,208 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. 
Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + <hr /> + <p align="right">Next: <a href="about-make.html">Why Make?</a>, Previous: <a href="about-introduction.html">Introduction</a> Up: <a href="about.html">About</a> </p> + + <h2>Citation</h2> + + <p>If you use Maneage in your project please cite + Akhlaghi et + al. (<a href="https://arxiv.org/abs/2006.03018">arXiv:2006.03018</a>).</p> + + <p>Also, when your paper is published, don't forget to add a notice in your + own paper (in coordination with the publishing editor) that the paper is + fully reproducible and possibly add a sentence or paragraph in the end of + the paper shortly describing the concept. This will help spread the word + and encourage other scientists to also manage and publish their projects in + a reproducible manner.</p> + + + + + + <h2>Published works using Maneage</h2> + + <p>The list below shows some of the works that have already been published + with (earlier versions of) Maneage. Previously it was simply called + "Reproducible paper template". 
Note that Maneage is evolving, so some + details may be different in them. The more recent ones can be used as a + good working example.</p> + + <ul> + + <li><p>Peper & Roukema + (<a href="https://arxiv.org/abs/2010.03742">arXiv:2010.03742</a>): + The live version of the controlled source + is <a href="https://codeberg.org/boud/elaphrocentre">at + Codeberg</a>; the main input dataset, a software + snapshot, the software tarballs, the project + outputs and editing history are available at + <a href="https://zenodo.org/record/4062461">zenodo.4062461</a>.</p></li> + + <li><p>Roukema + (<a href="https://arxiv.org/abs/2007.11779">arXiv:2007.11779</a>: + The live version of the controlled source + is <a href="https://codeberg.org/boud/subpoisson">at + Codeburg</a>; the main input dataset, a software + snapshot, the software tarballs, the project + outputs and editing history are available + at <a href="https://zenodo.org/record/3951152">zenodo.3951152</a>.</p></li> + + <li><p>Akhlaghi et + al. (<a href="https://arxiv.org/abs/2006.03018)">arXiv:2006.03018</a>): + The project's version controlled source + is <a href="https://gitlab.com/makhlaghi/maneage-paper">on + Gitlab</a>, necessary software, outputs and + backup of history is available in + <a href="https://doi.org/10.5281/zenodo.3872248">zenodo.3872248</a>. + + </p></li> + + <li><p>Infante-Sainz et + al. (<a href="https://ui.adsabs.harvard.edu/abs/2020MNRAS.491.5317I">2020</a>, + MNRAS, 491, 5317): The version controlled project source is available + <a href="https://gitlab.com/infantesainz/sdss-extended-psfs-paper">on GitLab</a> + and is also archived on Zenodo with all the necessary software tarballs: + <a href="https://zenodo.org/record/3524937">zenodo.3524937</a>.</p></li> + + <li><p>Akhlaghi (<a href="https://arxiv.org/abs/1909.11230">2019</a>, IAU Symposium + 355). 
The version controlled project source is available <a href="https://gitlab.com/makhlaghi/iau-symposium-355">on + GitLab</a> and is also + archived on Zenodo with all the necessary software tarballs: + <a href="https://doi.org/10.5281/zenodo.3408481">zenodo.3408481</a>.</p></li> + + <li><p>Section 7.3 of Bacon et + al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A + 608, A1): The version controlled project source is available <a href="https://gitlab.com/makhlaghi/muse-udf-origin-only-hst-magnitudes">on + GitLab</a> + and a snapshot of the project along with all the necessary input + datasets and outputs is available in + <a href="https://doi.org/10.5281/zenodo.1164774">zenodo.1164774</a>.</p></li> + + <li><p>Section 4 of Bacon et + al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A, + 608, A1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/muse-udf-photometry-astrometry">on + GitLab</a> and + a snapshot of the project along with all the necessary input datasets is + available in <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a>.</p></li> + + <li><p>Akhlaghi & Ichikawa + (<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, + 1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/NoiseChisel-paper">on + GitLab</a>. This is the + very first (and much less mature!) incarnation of Maneage: the history + of Maneage started more than two years after this paper was + published. It is a very rudimentary/initial implementation, thus it is + only included here for historical reasons. 
However, the project source + is complete, accurate and uploaded to arXiv along with the paper.</p></li> + </ul> + + <p align="right">Next: <a href="about-make.html">Why Make?</a>, Previous: <a href="about-introduction.html">Introduction</a> Up: <a href="about.html">About</a> </p> + + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> diff --git a/about-customize.html b/about-customize.html new file mode 100644 index 0000000..6f66dc2 --- /dev/null +++ b/about-customize.html @@ -0,0 +1,444 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. 
Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + + + <hr /> + <p align="right">Next: <a href="about-tips.html">Tips for designing your project</a>, Previous: <a href="about-architecture.html">Maneage architecture</a>, Up: <a href="about.html">About</a> </p> + + <h1>Customization checklist</h1> + + <p>Take the following steps to fully customize Maneage for your research + project. After finishing the list, be sure to run <code>./project configure</code> and + <code>./project make</code> to see if everything works correctly. If you notice anything + missing or any incorrect part (probably a change that has not been + explained here), please let us know so we can correct it.</p> + + <p>As described above, the concept of reproducibility (during a project) + heavily relies on <a href="https://en.wikipedia.org/wiki/Version_control">version + control</a>. Currently Maneage + uses Git as its main version control system. 
If you are not already + familiar with Git, please read the first three chapters of the <a href="https://git-scm.com/book/en/v2">ProGit + book</a>, which provides a wonderful practical + understanding of the basics. You can read later chapters as you get more + advanced in later stages of your work.</p> + + <h2>First custom commit</h2> + + <ol> + <li><p><strong>Get this repository and its history</strong> (if you don't already have it): + Arguably the easiest way to start is to clone Maneage and prepare for + your customizations as shown below. After cloning, first rename + the default <code>origin</code> remote server to specify that this is Maneage's + remote server. This will allow you to use the conventional <code>origin</code> + name for your own project as shown in the next steps. Second, + create and go into the conventional <code>master</code> branch to start + committing in your project later.</p> + + <pre><code>git clone https://git.maneage.org/project.git <span class="comment"># Clone/copy the project and its history.</span> +mv project my-project <span class="comment"># Change the name to your project's name.</span> +cd my-project <span class="comment"># Go into the cloned directory.</span> +git remote rename origin origin-maneage <span class="comment"># Rename current/only remote to "origin-maneage".</span> +git checkout -b master <span class="comment"># Create and enter your own "master" branch.</span> +pwd <span class="comment"># Just to confirm where you are.</span></code></pre></li> + <li><p><strong>Prepare to build project</strong>: The <code>./project configure</code> command of the + next step will build the different software packages within the + "build" directory (that you will specify). Nothing else on your system + will be touched. However, since it takes a long time, it is useful to see + what is being built at every instant (it's almost impossible to tell + from the torrent of commands that are produced!). 
So open another + terminal on your desktop and navigate to the same project directory + that you cloned (output of last command above). Then run the following + command. Once every second, this command will just print the date + (possibly followed by a non-existent directory notice). But as soon as + the next step starts building software, you'll see the names of + software get printed as they are being built. Once any software is + installed in the project build directory it will be removed from this listing. Again, + don't worry, nothing will be installed outside the build directory.</p> + + <pre><code><span class="comment"># On another terminal (go to top project source directory, last command above)</span> +./project --check-config</code></pre></li> + <li><p><strong>Test Maneage</strong>: Before making any changes, it is important to test it + and see if everything works properly with the commands below. If there + is any problem in the <code>./project configure</code> or <code>./project make</code> steps, + please contact us to fix the problem before continuing. Since the + building of dependencies in configuration can take a long time, you can take + the next few steps (editing the files) while it's working (they don't + affect the configuration). After <code>./project make</code> is finished, open + <code>paper.pdf</code>. If it looks fine, you are ready to start customizing + Maneage for your project. 
But before that, clean all the extra Maneage + outputs with <code>./project make clean</code>.</p> + + <pre><code>./project configure <span class="comment"># Build the project's software environment (can take an hour or so).</span> +./project make <span class="comment"># Do the processing and build paper (just a simple demo).</span> +<span class="comment"># Open 'paper.pdf' and see if everything is ok.</span></code></pre></li> + <li><p><strong>Set up the remote</strong>: You can use any <a href="https://en.wikipedia.org/wiki/Comparison_of_source_code_hosting_facilities">hosting + facility</a> + that supports Git to keep an online copy of your project's version + controlled history. We recommend <a href="https://gitlab.com">GitLab</a> because + it is <a href="https://www.gnu.org/software/repo-criteria-evaluation.html">more ethical (although not + perfect)</a>, + and later you can also host GitLab on your own server. Anyway, create + an account in your favorite hosting facility (if you don't already + have one), and define a new project there. Please make sure <em>the newly + created project is empty</em> (some services ask to include a <code>README</code> in + a new project which is bad in this scenario, and will not allow you to + push to it). It will give you a URL (usually starting with <code>git@</code> and + ending in <code>.git</code>); put this URL in place of <code>XXXXXXXXXX</code> in the first + command below. With the second command, "push" your <code>master</code> branch to + your <code>origin</code> remote, and (with the <code>--set-upstream</code> option) set them + to track/follow each other. However, the <code>maneage</code> branch is currently + tracking/following your <code>origin-maneage</code> remote (automatically set + when you cloned Maneage). So when pushing the <code>maneage</code> branch to your + <code>origin</code> remote, you <em>shouldn't</em> use <code>--set-upstream</code>. 
With the last + command, you can actually check this (which local and remote branches + are tracking each other).</p> + + <pre><code>git remote add origin XXXXXXXXXX <span class="comment"># Newly created repo is now called 'origin'.</span> +git push --set-upstream origin master <span class="comment"># Push 'master' branch to 'origin' (with tracking).</span> +git push origin maneage <span class="comment"># Push 'maneage' branch to 'origin' (no tracking).</span></code></pre></li> + <li><p><strong>Title</strong>, <strong>short description</strong> and <strong>author</strong>: The title and basic + information of your project's output PDF paper should be added in + <code>paper.tex</code>. You should see the relevant place in the preamble (prior + to <code>\begin{document}</code>). After you are done, run the <code>./project make</code> + command again to see your changes in the final PDF, and make sure that + your changes don't cause a crash in LaTeX. Of course, if you use a + different LaTeX package/style for managing the title and authors (in + particular a specific journal's style), please feel free to use + your own methods after finishing this checklist and doing your first + commit.</p></li> + <li><p><strong>Delete dummy parts</strong>: Maneage contains some parts that are only for + the initial/test run, mainly as a demonstration of important steps, + which you can use as a reference in your own project. But they + are not for any real analysis, so you should remove these parts as + described below:</p> + + <ul> + <li><p><code>paper.tex</code>: 1) Delete the text of the abstract (from + <code>\includeabstract{</code> to <code>\vspace{0.25cm}</code>) and write your own (a + single sentence can be enough now, you can complete it later). 2) + Add some keywords under it in the keywords part. 3) Delete + everything between <code>%% Start of main body.</code> and <code>%% End of main + body.</code>. 
4) Remove the notice in the "Acknowledgments" section (in + <code>\new{}</code>) and acknowledge your funding sources (this can also be + done later). Just don't delete the existing acknowledgment + statement: Maneage is possible thanks to funding from several + grants. Since Maneage is being used in your work, it is necessary to + acknowledge them in your work also.</p></li> + <li><p><code>reproduce/analysis/make/top-make.mk</code>: Delete the <code>delete-me</code> line + in the <code>makesrc</code> definition. Just make sure there is no empty line + between the <code>download \</code> and <code>verify \</code> lines (they should be + directly under each other).</p></li> + <li><p><code>reproduce/analysis/make/verify.mk</code>: In the final recipe, under the + commented line <code>Verify TeX macros</code>, remove the full line that + contains <code>delete-me</code>, and set the value of <code>s</code> in the line for + <code>download</code> to <code>XXXXX</code> (any temporary string; you'll fix it at the + end of your project, when it's complete).</p></li> + <li><p>Delete all <code>delete-me*</code> files in the following directories:</p> + <pre><code>rm tex/src/delete-me* +rm reproduce/analysis/make/delete-me* +rm reproduce/analysis/config/delete-me*</code></pre></li> + <li><p>Disable verification of outputs by removing the <code>yes</code> from + <code>reproduce/analysis/config/verify-outputs.conf</code>. Later, when you are + ready to submit your paper, or publish the dataset, activate + verification and make the proper corrections in this file (described + under the "Other basic customizations" section below). This is a + critical step and only takes a few minutes when your project is + finished. 
So DON'T FORGET to activate it in the end.</p></li> + <li><p>Re-make the project (after a cleaning) to check that you haven't + introduced any errors.</p> + + <pre><code>./project make clean +./project make</code></pre></li> + </ul></li> + <li><p><strong>Don't merge some files in future updates</strong>: As described below, you + can later update your infrastructure (for example to fix bugs) by + merging your <code>master</code> branch with <code>maneage</code>. For files that you have + created in your own branch, there will be no problem. However, if you + modify an existing Maneage file for your project, the next time it's + updated on <code>maneage</code> you'll have an annoying conflict. The commands + below show how to fix this future problem. With them, you can + configure Git to ignore the changes in <code>maneage</code> for some of the files + you have already edited and deleted above (and will edit below). Note + that only the first <code>echo</code> command has a <code>></code> (to write over the file), + the rest are <code>>></code> (to append to it). If you want to prevent any other + set of files from being imported from Maneage into your project's branch, + you can follow a similar strategy. We recommend only doing it when you + encounter the same conflict in more than one merge and there is no + other change in that file. 
Also, don't add core Maneage Makefiles, + otherwise Maneage can break on the next run.</p> + + <pre><code>echo "paper.tex merge=ours" > .gitattributes +echo "tex/src/delete-me.mk merge=ours" >> .gitattributes +echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes +echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes +echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes +echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes +git add .gitattributes</code></pre></li> + <li><p><strong>Copyright and License notice</strong>: It is necessary that <em>all</em> the + "copyright-able" files in your project (those larger than 10 lines) + have a copyright and license notice. Please take a moment to look at + several existing files to see a few examples. The copyright notice is + usually close to the start of the file, it is the line starting with + <code>Copyright (C)</code> and containing a year and the author's name (like the + examples below). The License notice is a short description of the + copyright license, usually one or two paragraphs with a URL to the + full license. Don't forget to add these <em>two</em> notices to <em>any new + file</em> you add in your project (you can just copy-and-paste). When you + modify an existing Maneage file (which already has the notices), just + add a copyright notice in your name under the existing one(s), like + the line with capital letters below. 
To start with, add this line with + your name and email address to <code>paper.tex</code>, + <code>tex/src/preamble-header.tex</code>, <code>reproduce/analysis/make/top-make.mk</code>, + and generally, all the files you modified in the previous step.</p> + + <pre><code>Copyright (C) 2018-2020 Existing Name <existing@email.address> +Copyright (C) 2020 YOUR NAME <YOUR@EMAIL.ADDRESS></code></pre></li> + <li><p><strong>Configure Git for the first time</strong>: If this is the first time you are + running Git on this system, then you have to configure it with some + basic information in order to have essential information in the commit + messages (ignore this step if you have already done it). Git will + include your name and e-mail address information in each commit. You + can also specify your favorite text editor for making the commit + (<code>emacs</code>, <code>vim</code>, <code>nano</code>, etc.).</p> + + <pre><code>git config --global user.name "YourName YourSurname" +git config --global user.email your-email@example.com +git config --global core.editor nano</code></pre></li> + <li><p><strong>Your first commit</strong>: You have already made some small and basic + changes in the steps above and you are in your project's <code>master</code> + branch. So, you can officially make your first commit in your + project's history and push it. But before that, you need to make sure + that there are no problems in the project. 
This is a good habit to + always re-build the system before a commit to be sure it works as + expected.</p> + + <pre><code>git status <span class="comment"># See which files you have changed.</span> +git diff <span class="comment"># Check the lines you have added/changed.</span> +./project make <span class="comment"># Make sure everything builds successfully.</span> +git add -u <span class="comment"># Put all tracked changes in staging area.</span> +git status <span class="comment"># Make sure everything is fine.</span> +git diff --cached <span class="comment"># Confirm all the changes that will be committed.</span> +git commit <span class="comment"># Your first commit: put a good description!</span> +git push <span class="comment"># Push your commit to your remote.</span></code></pre></li> + <li><p><strong>Start your exciting research</strong>: You are now ready to add flesh and + blood to this raw skeleton by further modifying and adding your + exciting research steps. You can use the "published works" section in + the introduction (above) as some fully working models to learn + from. Also, don't hesitate to contact us if you have any + questions.</p></li> + </ol> + + <h2>Other basic customizations</h2> + + <ul> + <li><p><strong>High-level software</strong>: Maneage installs all the software that your + project needs. You can specify which software your project needs in + <code>reproduce/software/config/TARGETS.conf</code>. The necessary software are + classified into two classes: 1) programs or libraries (usually written + in C/C++) which are run directly by the operating system. 2) Python + modules/libraries that are run within Python. By default + <code>TARGETS.conf</code> only has GNU Astronomy Utilities (Gnuastro) as one + scientific program and Astropy as one scientific Python module. Both + have many dependencies which will be installed into your project + during the configuration step. 
To see a list of software that are + currently ready to be built in Maneage, see + <code>reproduce/software/config/versions.conf</code> (which has their versions + also); the comments in <code>TARGETS.conf</code> describe how to use the software + name from <code>versions.conf</code>. Currently the raw pipeline just uses + Gnuastro to make the demonstration plots. Therefore if you don't need + Gnuastro, go through the analysis steps in <code>reproduce/analysis</code> and + remove all its use cases (clearly marked).</p></li> + <li><p><strong>Input dataset</strong>: The input datasets are managed through the + <code>reproduce/analysis/config/INPUTS.conf</code> file. It is best to gather all + the information regarding all the input datasets into this one central + file. To ensure that the proper dataset is being downloaded and used + by the project, it is also recommended to get an <a href="https://en.wikipedia.org/wiki/MD5">MD5 + checksum</a> of the file and include + that in <code>INPUTS.conf</code> so the project can check it automatically. The + preparation/downloading of the input datasets is done in + <code>reproduce/analysis/make/download.mk</code>. Have a look there to see how + these values are to be used. This information about the input datasets + is also used in the initial <code>configure</code> script (to inform the users), + so also modify that file. You can find all occurrences of the demo + dataset with the command below and replace it with your own input + dataset.</p> + + <pre><code>grep -ir wfpc2 ./*</code></pre></li> + <li><p><strong><code>README.md</code></strong>: Correct all the <code>XXXXX</code> placeholders (name of your + project, your own name, address of your project's online/remote + repository, link to download dependencies, etc.). Generally, read + over the text and update it where necessary to fit your project. 
Don't + forget that this is the first file that is displayed on your online + repository and also your colleagues will first be drawn to read this + file. Therefore, make it as easy as possible for them to start + with. Also check and update this file one last time when you are ready + to publish your project's paper/source.</p></li> + <li><p><strong>Verify outputs</strong>: During the initial customization checklist, you + disabled verification. This is natural because during the project you + need to make changes all the time and it's a waste of time to enable + verification every time. But at significant moments of the project + (for example before submission to a journal, or publication) it is + necessary. When you activate verification, before building the paper, + all the specified datasets will be compared with their respective + checksum and if any file's checksum is different from the one recorded + in the project, the project will stop and print the problematic file and its + expected and calculated checksums. First set the value of the + <code>verify-outputs</code> variable in + <code>reproduce/analysis/config/verify-outputs.conf</code> to <code>yes</code>. Then go to + <code>reproduce/analysis/make/verify.mk</code>. The verification of all the files + is only done in one recipe. First the files that go into the + plots/figures are checked, then the LaTeX macros. Validation of the + former (inputs to plots/figures) should be done manually. If it's the + first time you are doing this, you can see two examples of the dummy + steps (with <code>delete-me</code>, you can use them if you like). These two + examples should be removed before you can run the project. For the + latter, you just have to update the checksums. The important thing to + consider is that a simple checksum can be problematic because some + file generators print their run-time date in the file (for example as + commented lines in a text table). 
When checking text files, this + Makefile already has this function: + <code>verify-txt-no-comments-leading-space</code>. As the name suggests, it will + remove comment lines and empty lines before calculating the MD5 + checksum. For FITS formats (common in astronomy), fortunately there is + a <code>DATASUM</code> definition which will return the checksum independent of + the headers. You can use the provided function(s), or define one for + your special formats.</p></li> + <li><p><strong>Feedback</strong>: As you use Maneage you will notice many things that if + implemented from the start would have been very useful for your + work. This can be in the actual scripting and architecture of Maneage, + or useful implementation and usage tips, like those below. In any + case, please share your thoughts and suggestions with us, so we can + add them here for everyone's benefit.</p></li> + <li><p><strong>Re-preparation</strong>: Automatic preparation is only run in the first run + of the project on a system; to re-do the preparation you have to use + the option below. Here is the reason for this: when it's necessary, the + preparation process can be slow and will unnecessarily slow down the + whole project while the project is under development (focus is on the + analysis that is done after preparation). Because of this, preparation + will be done automatically for the first time that the project is run + (when <code>.build/software/preparation-done.mk</code> doesn't exist). After the + preparation process completes once, future runs of <code>./project make</code> + will not do the preparation process anymore (will not call + <code>top-prepare.mk</code>). They will only call <code>top-make.mk</code> for the + analysis. 
To manually invoke the preparation process after the first + attempt, the <code>./project make</code> command should be run with the + <code>--prepare-redo</code> option, or you can delete the special file above.</p> + + <pre><code>./project make --prepare-redo</code></pre></li> + <li><p><strong>Pre-publication: add notice on reproducibility</strong>: Add a notice + somewhere prominent on the first page of your paper, informing the + reader that your research is fully reproducible. For example at the + end of the abstract, or under the keywords with a title like + "reproducible paper". This will encourage them to publish their own + work in this manner and will also help spread the word.</p></li> + </ul> + + <p align="right">Next: <a href="about-tips.html">Tips for designing your project</a>, Previous: <a href="about-architecture.html">Maneage architecture</a>, Up: <a href="about.html">About</a> </p> + + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> diff --git a/about-future.html b/about-future.html new file mode 100644 index 0000000..a8ef98c --- /dev/null +++ b/about-future.html @@ -0,0 +1,160 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. 
Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + + + <hr /> + <p align="right">Previous: <a href="about-tips.html">Tips for designing your project</a>, Up: <a href="about.html">About</a> </p> + + <h1>Future improvements</h1> + + <p>This is an evolving project and as time goes on, it will evolve and become + more robust. Some of the most prominent issues we plan to implement in the + future are listed below, please join us if you are interested.</p> + + <h2>Package management</h2> + + <p>It is important to have control of the environment of the project. Maneage + currently builds the higher-level programs (for example GNU Bash, GNU Make, + GNU AWK and domain-specific software) it needs, then sets <code>PATH</code> so the + analysis is done only with the project's built software. But currently the + configuration of each program is in the Makefile rules that build it. This + is not good because a change in the build configuration does not + automatically cause a re-build. 
Also, each separate project on a system + needs to have its own built tools (that can waste a lot of space).</p> + + <p>A good solution is based on the <a href="https://nixos.org/nix/about.html">Nix package manager</a>: a separate file is present for + each software, containing all the necessary info to build it (including its + URL, its tarball MD5 hash, dependencies, configuration parameters, build + steps and etc). Using this file, a script can automatically generate the + Make rules to download, build and install program and its dependencies + (along with the dependencies of those dependencies and etc).</p> + + <p>All the software are installed in a "store". Each installed file (library + or executable) is prefixed by a hash of this configuration (and the OS + architecture) and the standard program name. For example (from the Nix + webpage):</p> + + <pre><code>/nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/</code></pre> + + <p>The important thing is that the "store" is <em>not</em> in the project's search + path. After the complete installation of the software, symbolic links are + made to populate each project's program and library search paths without a + hash. This hash will be unique to that particular software and its + particular configuration. So simply by searching for this hash in the + installed directory, we can find the installed files of that software to + generate the links.</p> + + <p>This scenario has several advantages: 1) a change in a software's build + configuration triggers a rebuild. 
2) a single "store" can be used in many + projects, thus saving space and configuration time for new projects (that + commonly have large overlaps in lower-level programs).</p> + + + <p align="right">Previous: <a href="about-tips.html">Tips for designing your project</a>, Up: <a href="about.html">About</a> </p> + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> diff --git a/about-introduction.html b/about-introduction.html new file mode 100644 index 0000000..2b7d431 --- /dev/null +++ b/about-introduction.html @@ -0,0 +1,177 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. 
Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + + <hr /> + <p align="right">Next: <a href="about-citation.html">Citation and published projects using Maneage</a>, Up: <a href="about.html">About</a> </p> + + <h2>Introduction to Maneage</h2> + <p>The most important element of a "scientific" statement/result is the fact + that others should be able to falsify it. The tsunami of data that has + engulfed astronomers in the last two decades, combined with faster + processors and faster internet connections, has made it much easier to + obtain a result. However, these factors have also increased the complexity + of a scientific analysis, such that it is no longer possible to describe + all the steps of an analysis in the published paper. Citing this + difficulty, many authors settle for describing only the generalities of their + analysis in their papers.</p> + + <p>However, it is impossible to falsify (or even study) a result if you can't + exactly reproduce it. The complexity of modern science makes it vitally + important to exactly reproduce the final result. 
Because even a small + deviation can be due to many different parts of an analysis. Nature is + already a black box which we are trying so hard to comprehend. Not letting + other scientists see the exact steps taken to reach a result, or not + allowing them to modify it (do experiments on it) is a self-imposed black + box, which only exacerbates our ignorance.</p> + + <p>Other scientists should be able to reproduce, check and experiment on the + results of anything that is to carry the "scientific" label. Any result + that is not reproducible (due to incomplete information by the author) is + not scientific: the readers have to have faith in the subjective experience + of the authors in the very important choice of configuration values and + order of operations: this is contrary to the scientific spirit.</p> + + <p>Maneage is created with the aim of supporting reproducible research by + making it easy to start a project in this framework. As shown below, it is + very easy to customize Maneage for any particular (research) project and + expand it as it starts and evolves. It can be run with no modification (as + described in <code>README.md</code>) as a demonstration and customized for use in any + project as fully described below.</p> + + <p>A project designed using Maneage will download and build all the necessary + libraries and programs for working in a closed environment (highly + independent of the host operating system) with fixed versions of the + necessary dependencies. The tarballs for building the local environment are + also collected in a <a href="http://git.maneage.org/tarballs-software.git/tree/">separate + repository</a>. The final + output of the project is <a href="http://git.maneage.org/output-raw.git/plain/paper.pdf">a + paper</a>. 
Notice the + last paragraph of the Acknowledgments where all the necessary software are + mentioned with their versions.</p> + + <p>Below, we start with a discussion of why Make was chosen as the high-level + language/framework for project management and how to learn and master Make + easily (and freely). The general architecture and design of the project is + then discussed to help you navigate the files and their contents. This is + followed by a checklist for the easy/fast customization of Maneage to your + exciting research. We continue with some tips and guidelines on how to + manage or extend your project as it grows based on our experiences with it + so far. The main body concludes with a description of possible future + improvements that are planned for Maneage (but not yet implemented). As + discussed above, we end with a short introduction on the necessity of + reproducible science in the appendix.</p> + + <p>Please don't forget to share your thoughts, suggestions and + criticisms. Maintaining and designing Maneage is itself a separate project, + so please join us if you are interested. 
Once it is mature enough, we will + describe it in a paper (written by all contributors) for a formal + introduction to the community.</p> + + <p align="right">Next: <a href="about-citation.html">Citation and published projects using Maneage</a>, Up: <a href="about.html">About</a> </p> + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> diff --git a/about-make.html b/about-make.html new file mode 100644 index 0000000..4474075 --- /dev/null +++ b/about-make.html @@ -0,0 +1,221 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. 
Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + + + <hr /> + <p align="right">Next: <a href="about-architecture.html">Maneage architecture</a>, Previous: <a href="about-citation.html">Citation and published papers</a>, Up: <a href="about.html">About</a> </p> + + <h2>Why Make?</h2> + + <p>When batch processing is necessary (no manual intervention, as in a + reproducible project), shell scripts are usually the first solution that + come to mind. However, the inherent complexity and non-linearity of + progress in a scientific project (where experimentation is key) make it + hard to manage the script(s) as the project evolves. For example, a script + will start from the top/start every time it is run. So if you have already + completed 90% of a research project and want to run the remaining 10% that + you have newly added, you have to run the whole script from the start + again. Only then will you see the effects of the last new steps (to find + possible errors, or better solutions and etc).</p> + + <p>It is possible to manually ignore/comment parts of a script to only do a + special part. 
However, such checks/comments will only add to the complexity + of the script and will discourage you to play-with/change an already + completed part of the project when an idea suddenly comes up. It is also + prone to very serious bugs in the end (when trying to reproduce from + scratch). Such bugs are very hard to notice during the work and frustrating + to find in the end.</p> + + <p>The Make paradigm, on the other hand, starts from the end: the final + <em>target</em>. It builds a dependency tree internally, and finds where it should + start each time the project is run. Therefore, in the scenario above, a + researcher that has just added the final 10% of steps of her research to + her Makefile, will only have to run those extra steps. With Make, it is + also trivial to change the processing of any intermediate (already written) + <em>rule</em> (or step) in the middle of an already written analysis: the next + time Make is run, only rules that are affected by the changes/additions + will be re-run, not the whole analysis/project.</p> + + <p>This greatly speeds up the processing (enabling creative changes), while + keeping all the dependencies clearly documented (as part of the Make + language), and most importantly, enabling full reproducibility from scratch + with no changes in the project code that was working during the + research. This will allow robust results and let the scientists get to what + they do best: experiment and be critical to the methods/analysis without + having to waste energy and time on technical problems that come up as a + result of that experimentation in scripts.</p> + + <p>Since the dependencies are clearly demarcated in Make, it can identify + independent steps and run them in parallel. This further speeds up the + processing. Make was designed for this purpose. It is how huge projects + like all Unix-like operating systems (including GNU/Linux or Mac OS + operating systems) and their core components are built. 
Therefore, Make is + a highly mature paradigm/system with robust and highly efficient + implementations in various operating systems perfectly suited for a complex + non-linear research project.</p> + + <p>Make is a small language with the aim of defining <em>rules</em> containing + <em>targets</em>, <em>prerequisites</em> and <em>recipes</em>. It comes with some nice features + like functions or automatic-variables to greatly facilitate the management + of text (filenames for example) or any of those constructs. For a more + detailed (yet still general) introduction see the article on Wikipedia:</p> + + <ul> + <li><a href="https://en.wikipedia.org/wiki/Make_(software)">https://en.wikipedia.org/wiki/Make_(software)</a></li> + </ul> + + <p>Make is a +40 year old software that is still evolving, therefore many + implementations of Make exist. The only difference in them is some extra + features over the <a href="https://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">standard + definition</a> + (which is shared in all of them). Maneage is primarily written in GNU Make + (which it installs itself, you don't have to have it on your system). GNU + Make is the most common, most actively developed, and most advanced + implementation. Just note that Maneage downloads, builds, internally + installs, and uses its own dependencies (including GNU Make), so you don't + have to have it installed before you try it out.</p> + + <h3>How can I learn Make?</h3> + + <p>The GNU Make book/manual (links below) is arguably the best place to learn + Make. It is an excellent and non-technical book to help get started (it is + only non-technical in its first few chapters to get you started easily). It + is freely available and always up to date with the current GNU Make + release. It also clearly explains which features are specific to GNU Make + and which are general in all implementations. 
So the first few chapters + regarding the generalities are useful for all implementations.</p> + + <p>The first link below points to the GNU Make manual in various formats and + in the second, you can download it in PDF (which may be easier for a first + time reading).</p> + + <ul> + <li><a href="https://www.gnu.org/software/make/manual/">https://www.gnu.org/software/make/manual/</a></li> + <li><a href="https://www.gnu.org/software/make/manual/make.pdf">https://www.gnu.org/software/make/manual/make.pdf</a></li> + </ul> + + <p>If you use GNU Make, you also have the whole GNU Make manual on the + command-line with the following command (you can come out of the "Info" + environment by pressing <code>q</code>).</p> + + <pre><code>info make</code></pre> + + <p>If you aren't familiar with the Info documentation format, we strongly + recommend running <code>$ info info</code> and reading along. In less than an hour, + you will become highly proficient in it (it is very simple and has a great + manual for itself). Info greatly simplifies your access (without taking + your hands off the keyboard!) to many manuals that are installed on your + system, allowing you to be much more efficient as you work. 
If you use the + GNU Emacs text editor (or any of its variants), you also have access to all + Info manuals while you are writing your projects (again, without taking + your hands off the keyboard!).</p> + + <p align="right">Next: <a href="about-architecture.html">Maneage architecture</a>, Previous: <a href="about-citation.html">Citation and published papers</a>, Up: <a href="about.html">About</a> </p> + + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> diff --git a/about-tips.html b/about-tips.html new file mode 100644 index 0000000..49ea896 --- /dev/null +++ b/about-tips.html @@ -0,0 +1,442 @@ +<!DOCTYPE html> +<!-- Copyright notes are just below the head and before body --> + + <html lang="en-US"> + + <!-- HTML Header --> + <head> + <!-- Title of the page. --> + <title>Maneage -- Managing data lineage</title> + + <!-- Enable UTF-8 encoding to easily use non-ASCII charactes --> + <meta charset="UTF-8"> + <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> + + <!-- Put logo beside the address bar --> + <link rel="shortcut icon" href="./img/favicon.svg" /> + + <!-- The viewport meta tag is placed mainly for mobile browsers + that are pre-configured in different ways (for example setting the + different widths for the page than the actual width of the device, + or zooming to different values. 
Without this the CSS media + solutions might not work properly on all mobile browsers.--> + <meta name="viewport" + content="width=device-width, initial-scale=1"> + + <!-- Basic styles --> + <link rel="stylesheet" href="css/base.css" /> + </head> + + <!-- + Webpage of Maneage: a framework for managing data lineage + + Copyright (C) 2020, Pedram Ashofteh Ardakani <pedramardakani@pm.me> + Copyright (C) 2020, Mohammad Akhlaghi <mohammad@akhlaghi.org> + + This file is part of Maneage. Maneage is free software: you can + redistribute it and/or modify it under the terms of the GNU General + Public License as published by the Free Software Foundation, either + version 3 of the License, or (at your option) any later version. + + Maneage is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. See + <http://www.gnu.org/licenses/>. --> + + <!-- Start the main body. 
--> + <body> + <div id="container"> + <header role="banner"> + <!-- global navigation --> + <nav role="navigation" id="nav-hamburger-wrapper"> + <input type="checkbox" id="nav-hamburger-input"/> + <label for="nav-hamburger-input">|||</label> + <div id="nav-hamburger-items" class="button"> + <a href="index.html">Home</a> + <a href="about.html">About</a> + <a href="http://git.maneage.org/project.git/">Git</a> + <a href="tutorial.html">Tutorial</a> + </div> + </nav> + </header> + <div class="banner"> + <div> + <a href="index.html"><img src="img/maneage-logo.svg" /></a> + </div> + <div> + <h1>Maneage</h1><h2>Managing Data Lineage</h2> + <p>Copyright © 2018-2020 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br /> + Copyright © 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br /> + <a href="#page-footer">License Conditions</a></p> + </div> + </div> + + + + + <hr /> + <p align="right">Next: <a href="about-future.html">Future improvements</a>, Previous: <a href="about-customize.html">Customization checklist</a>, Up: <a href="about.html">About</a> </p> + + <h1>Tips for designing your project</h1> + + <p>The following is a list of design points, tips, or recommendations that + have been learned after some experience with this type of project + management. Please don't hesitate to share with us any experience you gain + after using it. In this way, we can add it here (giving full credit) + for the benefit of others.</p> + + <ul> + <li><p><strong>Modularity</strong>: Modularity is the key to easy and clean growth of a + project. So it is always best to break up a job into as many + sub-components as reasonable. Here are some tips to stay modular.</p> + + <ul> + <li><p><em>Short recipes</em>: if you see the recipe of a rule becoming more than a + handful of lines which involve significant processing, it is probably + a good sign that you should break up the rule into its main + components. 
Try to only have one major processing step per rule.</p></li> + <li><p><em>Context-based (many) Makefiles</em>: For maximum modularity, this design + allows easy inclusion of many Makefiles: in + <code>reproduce/analysis/make/*.mk</code> for analysis steps, and + <code>reproduce/software/make/*.mk</code> for building software. So keep the + rules for closely related parts of the processing in separate + Makefiles.</p></li> + <li><p><em>Descriptive names</em>: Be very clear and descriptive with the naming of + the files and the variables because a few months after the + processing, it will be very hard to remember what each one was + for. Also this helps others (your collaborators or other people + reading the project source after it is published) to more easily + understand your work and find their way around.</p></li> + <li><p><em>Naming convention</em>: As the project grows, following a single standard + or convention in naming the files is very useful. Try best to use + multiple word filenames for anything that is non-trivial (separating + the words with a <code>-</code>). For example if you have a Makefile for + creating a catalog and another two for processing it under models A + and B, you can name them like this: <code>catalog-create.mk</code>, + <code>catalog-model-a.mk</code> and <code>catalog-model-b.mk</code>. In this way, when + listing the contents of <code>reproduce/analysis/make</code> to see all the + Makefiles, those related to the catalog will all be close to each + other and thus easily found. 
This also helps in auto-completions by + the shell or text editors like Emacs.</p></li> + <li><p><em>Source directories</em>: If you need to add files in other languages for + example in shell, Python, AWK or C, keep the files in the same + language in a separate directory under <code>reproduce/analysis</code>, with the + appropriate name.</p></li> + <li><p><em>Configuration files</em>: If your research uses special programs as part + of the processing, put all their configuration files in a devoted + directory (with the program's name) within + <code>reproduce/software/config</code>. Similar to the + <code>reproduce/software/config/gnuastro</code> directory (which is put in + Maneage as a demo in case you use GNU Astronomy Utilities). It is + much cleaner and readable (thus less buggy) to avoid mixing the + configuration files, even if there is no technical necessity.</p></li> + </ul></li> + <li><p><strong>Contents</strong>: It is good practice to follow the following + recommendations on the contents of your files, whether they are source + code for a program, Makefiles, scripts or configuration files + (copyrights aren't necessary for the latter).</p> + + <ul> + <li><p><em>Copyright</em>: Always start a file containing programming constructs + with a copyright statement like the ones that Maneage starts with + (for example in the top level <code>Makefile</code>).</p></li> + <li><p><em>Comments</em>: Comments are vital for readability (by yourself in two + months, or others). Describe everything you can about why you are + doing something, how you are doing it, and what you expect the result + to be. Write the comments as if it was what you would say to describe + the variable, recipe or rule to a friend sitting beside you. When + writing the project it is very tempting to just steam ahead with + commands and codes, but be patient and write comments before the + rules or recipes. This will also allow you to think more about what + you should be doing. 
Also, in several months when you come back to + the code, you will appreciate the effort of writing them. Just don't + forget to also read and update the comment first if you later want to + make changes to the code (variable, recipe or rule). As a general + rule of thumb: first the comments, then the code.</p></li> + <li><p><em>File title</em>: In general, it is good practice to start all files with + a single line description of what that particular file does. If + further information about the totality of the file is necessary, add + it after a blank line. This will help a fast inspection where you + don't care about the details, but just want to remember/see what that + file is (generally) for. This information must of course be commented + (it's for a human), but this is kept separate from the general + recommendation on comments, because this is a comment for the whole + file, not each step within it.</p></li> + </ul></li> + <li><p><strong>Make programming</strong>: Here are some lessons that we have + learned over the years of using Make and that are useful/handy in research + contexts.</p> + + <ul> + <li><p><em>Environment of each recipe</em>: If you need to define a special + environment (or aliases, or scripts to run) for all the recipes in + your Makefiles, you can use a Bash startup file + <code>reproduce/software/shell/bashrc.sh</code>. This file is loaded before every + Make recipe is run, just like the <code>.bashrc</code> in your home directory is + loaded every time you start a new interactive, non-login terminal. See + the comments in that file for more.</p></li> + <li><p><em>Automatic variables</em>: These are wonderful and very useful Make + constructs that greatly shrink the text, while helping in + readability, robustness (fewer bugs from typos, for example) and + generalization. 
For example, even when a rule only has one target or + one prerequisite, always use <code>$@</code> instead of the target's name, <code>$<</code> + instead of the first prerequisite, <code>$^</code> instead of the full list of + prerequisites, and so on. You can see the full list of automatic + variables + <a href="https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html">here</a>. If + you use GNU Make, you can also see this page on your command-line:</p> + + <pre><code>info make "automatic variables"</code></pre></li> + <li><p><em>Debug</em>: Since Make doesn't follow the common top-down paradigm, it + can be a little hard to understand why you get an error or + unexpected behavior. In such cases, run Make with the <code>-d</code> + option. With this option, Make prints a full list of exactly which + prerequisites are being checked for which targets. Looking + (patiently) through this output and searching for the faulty + file/step will clearly show you any mistake you might have made in + defining the targets or prerequisites.</p></li> + <li><p><em>Large files</em>: If you are dealing with very large files (thus having + multiple copies of them for intermediate steps is not possible), one + solution is the following strategy (also see the next item on "Fast + access to temporary files"). Set a small plain text file as the + actual target and delete the large file when it is no longer needed + by the project (in the last rule that needs it). Below is a simple + demonstration of doing this. In it, we use Gnuastro's Arithmetic + program to add 2 to all pixels of the input image and create + <code>large1.fits</code>. We then subtract 2 from <code>large1.fits</code> to create + <code>large2.fits</code> and delete <code>large1.fits</code> in the same rule (when it is no + longer needed). We can later do the same with <code>large2.fits</code> when it + is no longer needed and so on. 
+<pre><code>large1.fits.txt: input.fits
+	astarithmetic $< 2 + --output=$(subst .txt,,$@)
+	echo "done" > $@
+large2.fits.txt: large1.fits.txt
+	astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@)
+	rm $(subst .txt,,$<)
+	echo "done" > $@</code></pre>
+ A more advanced Make programmer will use Make's <a href="https://www.gnu.org/software/make/manual/html_node/Call-Function.html">call function</a> + to define a wrapper in <code>reproduce/analysis/make/initialize.mk</code>. This + wrapper will replace <code>$(subst .txt,,XXXXX)</code>. Therefore, it will be + possible to greatly simplify this repetitive statement and make the + code even more readable throughout the whole project.</p></li> + <li><p><em>Fast access to temporary files</em>: Most Unix-like operating systems + will give you a special shared-memory device (directory): on systems + using the GNU C Library (all GNU/Linux systems), it is <code>/dev/shm</code>. The + contents of this directory are actually in your RAM, not in your + persistent storage like the HDD or SSD. Reading and writing from/to + the RAM is much faster than persistent storage, so if you have enough + RAM available, it can be very beneficial for large temporary files to + be put there. You can use the <code>mktemp</code> program to give the temporary + files a randomly-set name, and use text files as targets to keep that + name (as described in the item above under "Large files") for later + deletion. For example, see the minimal working example Makefile below + (which you can actually put in a <code>Makefile</code> and run if you have an + <code>input.fits</code> in the same directory, and Gnuastro is installed). 
+<pre><code>.ONESHELL:
+.SHELLFLAGS = -ec
+all: mean-std.txt
+shm-maneage := /dev/shm/$(shell whoami)-maneage-XXXXXXXXXX
+large1.txt: input.fits
+	out=$$(mktemp $(shm-maneage))
+	astarithmetic $< 2 + --output=$$out.fits
+	echo "$$out" > $@
+large2.txt: large1.txt
+	input=$$(cat $<)
+	out=$$(mktemp $(shm-maneage))
+	astarithmetic $$input.fits 2 - --output=$$out.fits
+	rm $$input.fits $$input
+	echo "$$out" > $@
+mean-std.txt: large2.txt
+	input=$$(cat $<)
+	aststatistics $$input.fits --mean --std > $@
+	rm $$input.fits $$input</code></pre>
+ The important point here is that the temporary name template + (<code>shm-maneage</code>) has no suffix. So you can add the suffix + corresponding to your desired format afterwards (for example + <code>$$out.fits</code>, or <code>$$out.txt</code>). But more importantly, when <code>mktemp</code> + sets the random name, it also checks that no file exists with that name + and creates a file with that exact name at that moment. So at the end + of each recipe above, you'll have two files in your <code>/dev/shm</code>: one + empty file with no suffix and one with a suffix. The role of the file + without a suffix is just to ensure that the randomly set name will + not be used by other calls to <code>mktemp</code> (when running in parallel) and + it should be deleted with the file containing a suffix. This is the + reason behind the <code>rm $$input.fits $$input</code> command above: to make + sure that first the file with a suffix is deleted, then the core + random file (note that when working in parallel on powerful systems, + in the time between deleting two files of a single <code>rm</code> command, many + things can happen!). When using Maneage, you can put the definition + of <code>shm-maneage</code> in <code>reproduce/analysis/make/initialize.mk</code> to be + usable in all the different Makefiles of your analysis, and you won't + need the three lines above it. 
<strong>Finally, BE RESPONSIBLE:</strong> after you + are finished, be sure to clean up any files that may remain (due + to crashes in the processing while you are working), otherwise your + RAM may fill up very fast. You can do it easily with a command like + this on your command-line: <code>rm -f /dev/shm/$(whoami)-*</code>.</p></li> + </ul></li> + <li><p><strong>Software tarballs and raw inputs</strong>: It is critically important to + document the raw inputs to your project (software tarballs and raw + input data):</p> + + <ul> + <li><p><em>Keep the source tarball of dependencies</em>: After configuration + finishes, the <code>.build/software/tarballs</code> directory will contain all + the software tarballs that were necessary for your project. You can + mirror the contents of this directory to keep a backup of all the + software tarballs used in your project (possibly as another version + controlled repository) that is also published with your project. Note + that software web-pages are not written in stone and can suddenly go + offline or not be accessible in some conditions. This backup is thus + very important. If you intend to release your project in a place like + Zenodo, you can upload/keep all the necessary tarballs (and data) + there with your + project. <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a> is + one example of how the data, Gnuastro (main software used) and all + of Gnuastro's major dependencies have been uploaded with the project's + source. Just note that this is only possible for free and open-source + software.</p></li> + <li><p><em>Keep your input data</em>: The input data is also critical to the + project's reproducibility, so like the above for software, make sure + you have a backup of it, or its persistent identifiers (PIDs).</p></li> + </ul></li> + <li><p><strong>Version control</strong>: Version control is a critical component of + Maneage. 
Here are some tips to help you use it effectively.</p> + + <ul> + <li><p><em>Regular commits</em>: It is important (and extremely useful) to have the + history of your project under version control. So try to make commits + regularly (after any meaningful change/step/result).</p></li> + <li><p><em>Keep Maneage up-to-date</em>: In time, Maneage is going to become more + and more mature and robust (thanks to your feedback and the feedback + of other users). Bugs will be fixed and new/improved features will be + added. So every once in a while, you can run the commands below to + pull new work that is done in Maneage. If the changes are useful for + your work, you can merge them with your project to benefit from + them. Just pay <strong>very close attention</strong> to resolving possible + <strong>conflicts</strong> which might happen in the merge (updated settings that + you have customized in Maneage).</p> + + <pre><code>git checkout maneage +git pull <span class="comment"># Get recent work in Maneage</span> +git log XXXXXX..XXXXXX --reverse <span class="comment"># Inspect new work (replace XXXXXXs with hashes mentioned in output of previous command).</span> +git log --oneline --graph --decorate --all <span class="comment"># General view of branches.</span> +git checkout master <span class="comment"># Go to your top working branch.</span> +git merge maneage <span class="comment"># Import all the work into master.</span></code></pre></li> + <li><p><em>Adding Maneage to a fork of your project</em>: As you and your colleagues + continue your project, it will be necessary to have separate + forks/clones of it. But when you clone your own project on a + different system, or a colleague clones it to collaborate with you, + the clone won't have the <code>origin-maneage</code> remote that you started the + project with. As shown in the previous item above, you need this + remote to be able to pull recent updates from Maneage. 
The steps + below will set up the <code>origin-maneage</code> remote, and a local <code>maneage</code> + branch to track it, on the new clone.</p> + + <pre><code>git remote add origin-maneage https://git.maneage.org/project.git +git fetch origin-maneage +git checkout -b maneage --track origin-maneage/maneage</code></pre></li> + <li><p><em>Commit message</em>: The commit message is a very important and useful + aspect of version control. To make the commit message useful for + others (or yourself, one year later), it is good to follow a + consistent style. Maneage already has a consistent format + (described below), which you can also follow in your project if you + like. You can see many examples by running <code>git log</code> in the <code>maneage</code> + branch. If you intend to push commits to Maneage, for the consistency + of Maneage, it is necessary to follow these guidelines. 1) No line + should be more than 75 characters (to enable easy reading of the + message when you run <code>git log</code> on the standard 80-character + terminal). 2) The first line is the title of the commit and should + summarize it (so <code>git log --oneline</code> can be useful). The title should + also not end with a point (<code>.</code>, because it's a short single sentence, + so a point is not necessary and only wastes space). 3) After the + title, leave an empty line and start the body of your message + (possibly containing many paragraphs). 4) Describe the context of + your commit (the problem it is trying to solve) as much as possible, + then go on to how you solved it. One suggestion is to start the main + body of your commit with "Until now ...", and continue describing the + problem in the first paragraph(s). Afterwards, start the next + paragraph with "With this commit ...".</p></li> + <li><p><em>Project outputs</em>: During your research, it is possible to check out a + specific commit and reproduce its results. However, the processing + can be time consuming. 
Therefore, it is useful to also keep track of + the final outputs of your project (at minimum, the paper's PDF) at + important points in its history. However, keeping a snapshot of these + (most probably large volume) outputs in the main history of the + project can unreasonably bloat it. It is thus recommended to make a + separate Git repo to keep those files and keep your project's source + as small as possible. For example if your project is called + <code>my-exciting-project</code>, the name of the outputs repository can be + <code>my-exciting-project-output</code>. This enables easy sharing of the output + files with your co-authors (with necessary permissions) without + bloating your email archive with extra attachments (you + can just share the link to the online repo in your + communications). After the research is published, you can also + release the outputs repository, or you can just delete it if it is + too large or unnecessary (it was just for convenience, and fully + reproducible after all). For example Maneage's output is available + for demonstration in <a href="http://git.maneage.org/output-raw.git/">a + separate</a> repository.</p></li> + <li><p><em>Full Git history in one file</em>: When you are publishing your project + (for example to Zenodo for long term preservation), it is more + convenient to have the whole project's Git history in one file to + save with your datasets. After all, you can't be sure that your + current Git server (for example GitLab, GitHub, or Bitbucket) will be + active forever. While they are good for the immediate future, you + can't rely on them for archival purposes. Fortunately keeping your + whole history in one file is easy with Git using the following + commands. 
To learn more about it, run <code>git help bundle</code>.</p> + + <ul> + <li>"bundle" your project's history into one file (just don't forget to + change <code>my-project-git.bundle</code> to a descriptive name of your + project):</li> + </ul> + + <pre><code>git bundle create my-project-git.bundle --all</code></pre> + + <ul> + <li>You can easily upload <code>my-project-git.bundle</code> anywhere. Later, if + you need to un-bundle it, you can use the following command.</li> + </ul> + + <p><p><pre><code>git clone my-project-git.bundle</code></pre></li> + </ul></p></li> + </ul></p> + + <p align="right">Next: <a href="about-future.html">Future improvements</a>, Previous: <a href="about-customize.html">Customization checklist</a>, Up: <a href="about.html">About</a> </p> + + + + + + <footer role="contentinfo" id="page-footer"> + <h2>Copyright information</h2> + + <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> + + <p>Maneage is free software: you can redistribute it and/or modify it under + the terms of the GNU General Public License as published by the Free + Software Foundation, either version 3 of the License, or (at your option) + any later version.</p> + + <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for more + details.</p> + + <p>You should have received a copy of the GNU General Public License along + with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> + <ul> + <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> + <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> + <!-- The people page will be added later + <li><p>People [page will be added later]</p></li> + --> + <li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form.</a></p></li> + <li><p>Copyright © 2020 Maneage volunteers</p></li> + <li><p>All logos are copyrighted by the respective institutions</p></li> + </ul> + </footer> + </div> + </body> @@ -72,1200 +72,24 @@ </div> </div> - <p>Maneage is a <strong>fully working template</strong> for doing reproducible research (or - writing a reproducible paper) as defined in the link below. If the link - below is not accessible at the time of reading, please see the appendix at - the end of this file for a portion of its introduction. Some - <a href="http://akhlaghi.org/pdf/reproducible-paper.pdf">slides</a> are also available - to help demonstrate the concept implemented here.</p> + <p>Maneage is a <strong>fully working + template</strong> (ready to customize) for doing + reproducible research (or writing a reproducible + paper). In each of the sections below, it is discussed + in more detail. 
</p> + + <ol> + <li><a href="about-introduction.html">Introduction</a></li> + <li><a href="about-citation.html">Citation and published projects using Maneage</a></li> + <li><a href="about-make.html">Why Make?</a></li> + <li><a href="about-architecture.html">Maneage architecture</a></li> + <li><a href="about-customize.html">Customization checklist</a></li> + <li><a href="about-tips.html">Tips for designing your project</a></li> + <li><a href="about-future.html">Future improvements</a></li> + </ol> - <ul> - <li><a href="http://akhlaghi.org/reproducible-science.html">http://akhlaghi.org/reproducible-science.html</a></li> - </ul> - - <p>Maneage is created with the aim of supporting reproducible research by - making it easy to start a project in this framework. As shown below, it is - very easy to customize Maneage for any particular (research) project and - expand it as it starts and evolves. It can be run with no modification (as - described in <code>README.md</code>) as a demonstration and customized for use in any - project as fully described below.</p> - - <p>A project designed using Maneage will download and build all the necessary - libraries and programs for working in a closed environment (highly - independent of the host operating system) with fixed versions of the - necessary dependencies. The tarballs for building the local environment are - also collected in a <a href="http://git.maneage.org/tarballs-software.git/tree/">separate - repository</a>. The final - output of the project is <a href="http://git.maneage.org/output-raw.git/plain/paper.pdf">a - paper</a>. Notice the - last paragraph of the Acknowledgments where all the necessary software are - mentioned with their versions.</p> - - <p>Below, we start with a discussion of why Make was chosen as the high-level - language/framework for project management and how to learn and master Make - easily (and freely). 
The general architecture and design of the project are - then discussed to help you navigate the files and their contents. This is - followed by a checklist for the easy/fast customization of Maneage to your - exciting research. We continue with some tips and guidelines on how to - manage or extend your project as it grows based on our experiences with it - so far. The main body concludes with a description of possible future - improvements that are planned for Maneage (but not yet implemented). As - discussed above, we end with a short introduction on the necessity of - reproducible science in the appendix.</p> - - <p>Please don't forget to share your thoughts, suggestions and - criticisms. Maintaining and designing Maneage is itself a separate project, - so please join us if you are interested. Once it is mature enough, we will - describe it in a paper (written by all contributors) for a formal - introduction to the community.</p> - - <h2>Why Make?</h2> - - <p>When batch processing is necessary (no manual intervention, as in a - reproducible project), shell scripts are usually the first solution that - comes to mind. However, the inherent complexity and non-linearity of - progress in a scientific project (where experimentation is key) make it - hard to manage the script(s) as the project evolves. For example, a script - will start from the top/start every time it is run. So if you have already - completed 90% of a research project and want to run the remaining 10% that - you have newly added, you have to run the whole script from the start - again. Only then will you see the effects of the last new steps (to find - possible errors, better solutions, etc.).</p> - - <p>It is possible to manually ignore/comment parts of a script to only do a - specific part. However, such checks/comments will only add to the complexity - of the script and will discourage you from playing with/changing an already - completed part of the project when an idea suddenly comes up. 
It is also - prone to very serious bugs in the end (when trying to reproduce from - scratch). Such bugs are very hard to notice during the work and frustrating - to find in the end.</p> - - <p>The Make paradigm, on the other hand, starts from the end: the final - <em>target</em>. It builds a dependency tree internally, and finds where it should - start each time the project is run. Therefore, in the scenario above, a - researcher that has just added the final 10% of steps of her research to - her Makefile, will only have to run those extra steps. With Make, it is - also trivial to change the processing of any intermediate (already written) - <em>rule</em> (or step) in the middle of an already written analysis: the next - time Make is run, only rules that are affected by the changes/additions - will be re-run, not the whole analysis/project.</p> - - <p>This greatly speeds up the processing (enabling creative changes), while - keeping all the dependencies clearly documented (as part of the Make - language), and most importantly, enabling full reproducibility from scratch - with no changes in the project code that was working during the - research. This will allow robust results and let the scientists get to what - they do best: experiment and be critical to the methods/analysis without - having to waste energy and time on technical problems that come up as a - result of that experimentation in scripts.</p> - - <p>Since the dependencies are clearly demarcated in Make, it can identify - independent steps and run them in parallel. This further speeds up the - processing. Make was designed for this purpose. It is how huge projects - like all Unix-like operating systems (including GNU/Linux or Mac OS - operating systems) and their core components are built. 
Therefore, Make is - a highly mature paradigm/system with robust and highly efficient - implementations in various operating systems perfectly suited for a complex - non-linear research project.</p> - - <p>Make is a small language with the aim of defining <em>rules</em> containing - <em>targets</em>, <em>prerequisites</em> and <em>recipes</em>. It comes with some nice features - like functions or automatic-variables to greatly facilitate the management - of text (filenames for example) or any of those constructs. For a more - detailed (yet still general) introduction see the article on Wikipedia:</p> - - <ul> - <li><a href="https://en.wikipedia.org/wiki/Make_(software)">https://en.wikipedia.org/wiki/Make_(software)</a></li> - </ul> - - <p>Make is more than 40 years old and still evolving, so many - implementations of Make exist. They differ only in the extra - features they add over the <a href="https://pubs.opengroup.org/onlinepubs/009695399/utilities/make.html">standard - definition</a> - (which all of them share). Maneage is primarily written in GNU Make - (which it installs itself; you don't have to have it on your system). GNU - Make is the most common, most actively developed, and most advanced - implementation. Just note that Maneage downloads, builds, internally - installs, and uses its own dependencies (including GNU Make), so you don't - have to have it installed before you try it out.</p> - - <h2>How can I learn Make?</h2> - - <p>The GNU Make book/manual (links below) is arguably the best place to learn - Make. It is an excellent and non-technical book to help get started (it is - only non-technical in its first few chapters to get you started easily). It - is freely available and always up to date with the current GNU Make - release. It also clearly explains which features are specific to GNU Make - and which are general in all implementations. 
So the first few chapters - regarding the generalities are useful for all implementations.</p> - - <p>The first link below points to the GNU Make manual in various formats and - in the second, you can download it in PDF (which may be easier for a first - time reading).</p> - - <ul> - <li><a href="https://www.gnu.org/software/make/manual/">https://www.gnu.org/software/make/manual/</a></li> - <li><a href="https://www.gnu.org/software/make/manual/make.pdf">https://www.gnu.org/software/make/manual/make.pdf</a></li> - </ul> - - <p>If you use GNU Make, you also have the whole GNU Make manual on the - command-line with the following command (you can come out of the "Info" - environment by pressing <code>q</code>).</p> - - <pre><code>info make</code></pre> - - <p>If you aren't familiar with the Info documentation format, we strongly - recommend running <code>$ info info</code> and reading along. In less than an hour, - you will become highly proficient in it (it is very simple and has a great - manual for itself). Info greatly simplifies your access (without taking - your hands off the keyboard!) to many manuals that are installed on your - system, allowing you to be much more efficient as you work. If you use the - GNU Emacs text editor (or any of its variants), you also have access to all - Info manuals while you are writing your projects (again, without taking - your hands off the keyboard!).</p> - - <h2>Published works using Maneage</h2> - - <p>The list below shows some of the works that have already been published - with (earlier versions of) Maneage. Previously it was simply called - "Reproducible paper template". Note that Maneage is evolving, so some - details may be different in them. The more recent ones can be used as a - good working example.</p> - - <ul> - <li><p>Infante-Sainz et - al. 
(<a href="https://ui.adsabs.harvard.edu/abs/2020MNRAS.491.5317I">2020</a>, - MNRAS, 491, 5317): The version controlled project source is available - <a href="https://gitlab.com/infantesainz/sdss-extended-psfs-paper">on GitLab</a> - and is also archived on Zenodo with all the necessary software tarballs: - <a href="https://zenodo.org/record/3524937">zenodo.3524937</a>.</p></li> - <li><p>Akhlaghi (<a href="https://arxiv.org/abs/1909.11230">2019</a>, IAU Symposium - 355). The version controlled project source is available <a href="https://gitlab.com/makhlaghi/iau-symposium-355">on - GitLab</a> and is also - archived on Zenodo with all the necessary software tarballs: - <a href="https://doi.org/10.5281/zenodo.3408481">zenodo.3408481</a>.</p></li> - <li><p>Section 7.3 of Bacon et - al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A - 608, A1): The version controlled project source is available <a href="https://gitlab.com/makhlaghi/muse-udf-origin-only-hst-magnitudes">on - GitLab</a> - and a snapshot of the project along with all the necessary input - datasets and outputs is available in - <a href="https://doi.org/10.5281/zenodo.1164774">zenodo.1164774</a>.</p></li> - <li><p>Section 4 of Bacon et - al. (<a href="http://adsabs.harvard.edu/abs/2017A%26A...608A...1B">2017</a>, A&A, - 608, A1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/muse-udf-photometry-astrometry">on - GitLab</a> and - a snapshot of the project along with all the necessary input datasets is - available in <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a>.</p></li> - <li><p>Akhlaghi & Ichikawa - (<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, - 1): The version controlled project is available <a href="https://gitlab.com/makhlaghi/NoiseChisel-paper">on - GitLab</a>. This is the - very first (and much less mature!) 
incarnation of Maneage: the history - of Maneage started more than two years after this paper was - published. It is a very rudimentary/initial implementation, thus it is - only included here for historical reasons. However, the project source - is complete, accurate and uploaded to arXiv along with the paper.</p></li> - </ul> - - <h2>Citation</h2> - - <p>A paper to fully describe Maneage has been submitted. Until then, if you - used it in your work, please cite the paper that implemented its first - version: Akhlaghi & Ichikawa - (<a href="http://adsabs.harvard.edu/abs/2015ApJS..220....1A">2015</a>, ApJS, 220, 1).</p> - - <p>Also, when your paper is published, don't forget to add a notice in your - own paper (in coordination with the publishing editor) that the paper is - fully reproducible and possibly add a sentence or paragraph at the end of - the paper briefly describing the concept. This will help spread the word - and encourage other scientists to also manage and publish their projects in - a reproducible manner.</p> - - <h1>Project architecture</h1> - - <p>In order to customize Maneage to your research, it is important to first - understand its architecture so you can navigate your way in the directories - and understand how to implement your research project within its framework: - where to add new files and which existing files to modify for what - purpose. But if this is the first time you are using Maneage, before reading - this theoretical discussion, please run Maneage once from scratch without - any changes (described in <code>README.md</code>). You will see how it works (note that - the configure step builds all necessary software, so it can take a long time, but - you can continue reading while it's working).</p> - - <p>The project has two top-level directories: <code>reproduce</code> and - <code>tex</code>. <code>reproduce</code> hosts all the software building and analysis - steps. 
<code>tex</code> contains all the final paper's components to be compiled into - a PDF using LaTeX.</p> - - <p>The <code>reproduce</code> directory has two sub-directories: <code>software</code> and - <code>analysis</code>. As the names suggest, the former contains all the instructions to - download, build and install (independent of the host operating system) the - necessary software (these are called by the <code>./project configure</code> - command). The latter contains instructions on how to use that software to - do your project's analysis.</p> - - <p>After it finishes, <code>./project configure</code> will create the following symbolic - links in the project's top source directory: <code>.build</code> which points to the - top build directory and <code>.local</code> for easy access to the custom built - software installation directory. With these you can easily access the build - directory and project-specific software from your top source directory. For - example if you run <code>.local/bin/ls</code> you will be using the <code>ls</code> of Maneage, - which is probably different from your system's <code>ls</code> (run them both with - <code>--version</code> to check).</p> - - <p>Once the project is configured for your system, <code>./project make</code> will do - the basic preparations and run the project's analysis with the custom - version of software. The <code>project</code> script is just a wrapper, and with the - <code>make</code> argument, it will first call <code>top-prepare.mk</code> and then <code>top-make.mk</code> - (both are in the <code>reproduce/analysis/make</code> directory).</p> - - <p>In terms of organization, <code>top-prepare.mk</code> and <code>top-make.mk</code> have an - almost identical design, with only minor differences. So, let's continue the discussion of Maneage's - architecture with <code>top-make.mk</code>. Once you understand that, you'll clearly - understand <code>top-prepare.mk</code> also. 
These very high-level files are - relatively short and heavily commented so hopefully the descriptions in - each comment will be enough to understand the general details. As you read - this section, please also look at the contents of the mentioned files and - directories to fully understand what is going on.</p> - - <p>Before starting to look into the top <code>top-make.mk</code>, it is important to - recall that Make defines dependencies by files. Therefore, the - input/prerequisite and output of every step/rule must be a file. Also - recall that Make will use the modification date of the prerequisite(s) and - target files to see if the target must be re-built or not. Therefore during - the processing, <em>many</em> intermediate files will be created (see the tips - section below on a good strategy to deal with large/huge files).</p> - - <p>To keep the source and (intermediate) built files separate, the user <em>must</em> - define a top-level build directory variable (or <code>$(BDIR)</code>) to host all the - intermediate files (you defined it during <code>./project configure</code>). This - directory doesn't need to be version controlled or even synchronized, or - backed-up in other servers: its contents are all products, and can be - easily re-created any time. As you define targets for your new rules, it is - thus important to place them all under sub-directories of <code>$(BDIR)</code>. As - mentioned above, you always have fast access to this "build"-directory with - the <code>.build</code> symbolic link. 
Also, be careful to <em>never</em> make any manual change - in the files of the build-directory; just delete them (so they are - re-built).</p> - - <p>In this architecture, we have two types of Makefiles that are loaded into - the top <code>Makefile</code>: <em>configuration-Makefiles</em> (only independent - variables/configurations) and <em>workhorse-Makefiles</em> (Makefiles that - actually contain analysis/processing rules).</p> - - <p>The configuration-Makefiles are those that satisfy these two wildcards: - <code>reproduce/software/config/*.conf</code> (for building the necessary software - when you run <code>./project configure</code>) and <code>reproduce/analysis/config/*.conf</code> - (for the high-level analysis, when you run <code>./project make</code>). These - Makefiles don't actually have any rules; they just have values for various - free parameters throughout the configuration or analysis. Open a few of - them to see for yourself. These Makefiles must only contain raw Make - variables (project configurations). By "raw" we mean that the Make - variables in these files must not depend on variables in any other - configuration-Makefile. This is because we don't want to assume any order - in reading them. It is also very important to <em>not</em> define any rule, or - other Make construct, in these configuration-Makefiles.</p> - - <p>Following this rule-of-thumb enables you to set these configuration-Makefiles - as a prerequisite to any target that depends on their variable - values. Therefore, if you change any of their values, all targets that - depend on those values will be re-built. This is very convenient as your - project scales up and gets more complex.</p> - - <p>The workhorse-Makefiles are those satisfying these wildcards: - <code>reproduce/software/make/*.mk</code> and <code>reproduce/analysis/make/*.mk</code>. They - contain the details of the processing steps (Makefiles containing - rules). 
Therefore, in this phase <em>order is important</em>, because the - prerequisites of most rules will be the targets of other rules that will be - defined prior to them (not a fixed name like <code>paper.pdf</code>). The lower-level - rules must be imported into Make before the higher-level ones.</p> - - <p>All processing steps are assumed to ultimately (usually after many rules) - end up in some number, image, figure, or table that will be included in the - paper. The writing of these results into the final report/paper is managed - through separate LaTeX files that only contain macros (a name given to a - number/string to be used in the LaTeX source, which will be replaced when - compiling it to the final PDF). So the last target in a workhorse-Makefile - is a <code>.tex</code> file (with the same base-name as the Makefile, but in - <code>$(BDIR)/tex/macros</code>). As a result, if the targets in a workhorse-Makefile - aren't directly a prerequisite of other workhorse-Makefile targets, they - can be a prerequisite of that intermediate LaTeX macro file and thus be - called when necessary. Otherwise, they will be ignored by Make.</p> - - <p>Maneage also has a mode to share the build directory between several - users of a Unix group (when working on large computer clusters). In this - scenario, each user can have their own cloned project source, but share the - large built files with each other. To do this, it is necessary for all - built files to give full permission to group members while not allowing any - other users access to the contents. Therefore the <code>./project configure</code> and - <code>./project make</code> steps must be called with special conditions which are - managed by the <code>--group</code> option.</p> - - <p>Let's see how this design is implemented. Please open and inspect - <code>top-make.mk</code> as we go along here. 
The first step (un-commented line) is - to import the local configuration (your answers to the questions of - <code>./project configure</code>). They are defined in the configuration-Makefile - <code>reproduce/software/config/LOCAL.conf</code> which was also built by <code>./project - configure</code> (based on the <code>LOCAL.conf.in</code> template of the same directory).</p> - - <p>The next non-commented part of the top <code>Makefile</code> defines the ultimate - target of the whole project (<code>paper.pdf</code>). But to avoid mistakes, a sanity - check is necessary to see if Make is being run with the same group settings - as the configure script (for example when the project is configured for - group access using the <code>./for-group</code> script, but Make isn't). Therefore we - use a Make conditional to define the <code>all</code> target based on the group - permissions.</p> - - <p>Having defined the top/ultimate target, our next step is to include all the - other necessary Makefiles. However, order matters in the importing of - workhorse-Makefiles and each must also have a TeX macro file with the same - base name (without a suffix). Therefore, the next step in the top-level - Makefile is to define the <code>makesrc</code> variable to keep the base names - (without a <code>.mk</code> suffix) of the workhorse-Makefiles that must be imported, - in the proper order.</p> - - <p>Finally, we import all the necessary remaining Makefiles: 1) All the - analysis configuration-Makefiles with a wildcard. 2) The software - configuration-Makefile that contains the software versions (just in case it's - necessary). 3) All workhorse-Makefiles in the proper order using a Make - <code>foreach</code> loop.</p> - - <p>In short, to keep things modular, readable and manageable, follow these - recommendations: 1) Set clear-to-understand names for the - configuration-Makefiles, and workhorse-Makefiles, 2) Only import other - Makefiles from the top Makefile. 
These will let you know/remember generally - which step you are taking before or after another. Projects will scale up - very fast. Thus if you don't start and continue with a clean and robust - convention like this, in the end it will become very dirty and hard to - manage/understand (even for yourself). As a general rule of thumb, break - your rules into as many logically-similar but independent steps as - possible.</p> - - <p>The <code>reproduce/analysis/make/paper.mk</code> Makefile must be the final Makefile - that is included. This workhorse Makefile ends with the rule to build - <code>paper.pdf</code> (final target of the whole project). If you look in it, you - will notice that this Makefile starts with a rule to create - <code>$(mtexdir)/project.tex</code> (<code>mtexdir</code> is just a shorthand name for - <code>$(BDIR)/tex/macros</code> mentioned before). As you see, the only dependency of - <code>$(mtexdir)/project.tex</code> is <code>$(mtexdir)/verify.tex</code> (which is the last - analysis step: it verifies all the generated results). Therefore, - <code>$(mtexdir)/project.tex</code> is <em>the connection</em> between the - processing/analysis steps of the project, and the steps to build the final - PDF.</p> - - <p>During the research, it often happens that you want to test a step that is - not a prerequisite of any higher-level operation. In such cases, you can - (temporarily) define that processing as a rule in the most relevant - workhorse-Makefile and set its target as a prerequisite of its TeX - macro. If your test gives a promising result and you want to include it in - your research, set it as prerequisites to other rules and remove it from - the list of prerequisites for TeX macro file. 
In fact, this is how a - project is designed to grow in this framework.</p> - - <h2>File modification dates (meta data)</h2> - - <p>While Git does an excellent job at keeping a history of the contents of - files, it makes no effort to keep the file meta data, and in particular - the dates of files. Therefore when you check out a different branch, - files that are re-written by Git will have a newer date than the other - project files. However, file dates are important in the current design of - Maneage: Make checks the dates of the prerequisite files and target files - to see if the target should be re-built.</p> - - <p>To fix this problem, for Maneage we use a forked version of - <a href="https://github.com/mohammad-akhlaghi/metastore">Metastore</a>. Metastore uses - a binary database file (which is called <code>.file-metadata</code>) to keep the - modification dates of all the files under version control. This file is - also under version control, but is hidden (because it shouldn't be modified - by hand). During the project's configuration, Maneage installs two Git hooks - to run Metastore 1) before making a commit, to update its database with the - file dates in a branch, and 2) after doing a checkout, to set the - file dates back to what they were once the checkout is complete.</p> - - <p>In practice, Metastore should work almost fully invisibly within your - project. The only place you might notice its presence is that you'll see - <code>.file-metadata</code> in the list of modified/staged files (commonly after - merging your branches). Since it's a binary file, Git also won't show you - the changed contents. In a merge, you can simply accept any changes with - <code>git add -u</code>.
But if Git is telling you that it has changed without a merge - (for example if you started a commit, but canceled it in the middle), you - can just do <code>git checkout .file-metadata</code> and set it back to its original - state.</p> - - <h2>Summary</h2> - - <p>Based on the explanation above, some major design points you should have in - mind are listed below.</p> - - <ul> - <li><p>Define new <code>reproduce/analysis/make/XXXXXX.mk</code> workhorse-Makefile(s) - with good and human-friendly name(s) replacing <code>XXXXXX</code>.</p></li> - <li><p>Add <code>XXXXXX</code>, as a new line, to the values in <code>makesrc</code> of the top-level - <code>Makefile</code>.</p></li> - <li><p>Do not use any constant numbers (or important names like filter names) - in the workhorse-Makefiles or paper's LaTeX source. Define such - constants as logically-grouped, separate configuration-Makefiles in - <code>reproduce/analysis/config/XXXXX.conf</code>. Then set this - configuration-Makefiles file as a prerequisite to any rule that uses - the variable defined in it.</p></li> - <li><p>Through any number of intermediate prerequisites, all processing steps - should end in (be a prerequisite of) <code>$(mtexdir)/verify.tex</code> (defined in - <code>reproduce/analysis/make/verify.mk</code>). <code>$(mtexdir)/verify.tex</code> is the sole - dependency of <code>$(mtexdir)/project.tex</code>, which is the bridge between the - processing steps and PDF-building steps of the project.</p></li> - </ul> - - <h1>Customization checklist</h1> - - <p>Take the following steps to fully customize Maneage for your research - project. After finishing the list, be sure to run <code>./project configure</code> and - <code>project make</code> to see if everything works correctly. 
If you notice anything - missing or any in-correct part (probably a change that has not been - explained here), please let us know to correct it.</p> - - <p>As described above, the concept of reproducibility (during a project) - heavily relies on <a href="https://en.wikipedia.org/wiki/Version_control">version - control</a>. Currently Maneage - uses Git as its main version control system. If you are not already - familiar with Git, please read the first three chapters of the <a href="https://git-scm.com/book/en/v2">ProGit - book</a> which provides a wonderful practical - understanding of the basics. You can read later chapters as you get more - advanced in later stages of your work.</p> - - <h2>First custom commit</h2> - - <ol> - <li><p><strong>Get this repository and its history</strong> (if you don't already have it): - Arguably the easiest way to start is to clone Maneage and prepare for - your customizations as shown below. After the cloning first you rename - the default <code>origin</code> remote server to specify that this is Maneage's - remote server. This will allow you to use the conventional <code>origin</code> - name for your own project as shown in the next steps. 
Second, you will - create and go into the conventional <code>master</code> branch to start - committing in your project later.</p> - - <pre><code>git clone https://git.maneage.org/project.git <span class="comment"># Clone/copy the project and its history.</span> -mv project my-project <span class="comment"># Change the name to your project's name.</span> -cd my-project <span class="comment"># Go into the cloned directory.</span> -git remote rename origin origin-maneage <span class="comment"># Rename current/only remote to "origin-maneage".</span> -git checkout -b master <span class="comment"># Create and enter your own "master" branch.</span> -pwd <span class="comment"># Just to confirm where you are.</span></code></pre></li> - <li><p><strong>Prepare to build project</strong>: The <code>./project configure</code> command of the - next step will build the different software packages within the - "build" directory (that you will specify). Nothing else on your system - will be touched. However, since it takes long, it is useful to see - what it is being built at every instant (its almost impossible to tell - from the torrent of commands that are produced!). So open another - terminal on your desktop and navigate to the same project directory - that you cloned (output of last command above). Then run the following - command. Once every second, this command will just print the date - (possibly followed by a non-existent directory notice). But as soon as - the next step starts building software, you'll see the names of - software get printed as they are being built. Once any software is - installed in the project build directory it will be removed. 
Again, - don't worry, nothing will be installed outside the build directory.</p> - - <pre><code><span class="comment"># On another terminal (go to top project source directory, last command above)</span> -./project --check-config</code></pre></li> - <li><p><strong>Test Maneage</strong>: Before making any changes, it is important to test it - and see if everything works properly with the commands below. If there - is any problem in the <code>./project configure</code> or <code>./project make</code> steps, - please contact us to fix the problem before continuing. Since the - building of dependencies in configuration can take a long time, you can take - the next few steps (editing the files) while it's working (they don't - affect the configuration). After <code>./project make</code> is finished, open - <code>paper.pdf</code>. If it looks fine, you are ready to start customizing - Maneage for your project. But before that, clean all the extra Maneage - outputs with <code>make clean</code> as shown below.</p> - - <pre><code>./project configure <span class="comment"># Build the project's software environment (can take an hour or so).</span> -./project make <span class="comment"># Do the processing and build paper (just a simple demo).</span> -<span class="comment"># Open 'paper.pdf' and see if everything is ok.</span></code></pre></li> - <li><p><strong>Setup the remote</strong>: You can use any <a href="https://en.wikipedia.org/wiki/Comparison_of_source_code_hosting_facilities">hosting - facility</a> - that supports Git to keep an online copy of your project's version - controlled history. We recommend <a href="https://gitlab.com">GitLab</a> because - it is <a href="https://www.gnu.org/software/repo-criteria-evaluation.html">more ethical (although not - perfect)</a>, - and later you can also host GitLab on your own server. Anyway, create - an account in your favorite hosting facility (if you don't already - have one), and define a new project there.
Please make sure <em>the newly - created project is empty</em> (some services ask to include a <code>README</code> in - a new project which is bad in this scenario, and will not allow you to - push to it). It will give you a URL (usually starting with <code>git@</code> and - ending in <code>.git</code>); put this URL in place of <code>XXXXXXXXXX</code> in the first - command below. With the second command, "push" your <code>master</code> branch to - your <code>origin</code> remote, and (with the <code>--set-upstream</code> option) set them - to track/follow each other. However, the <code>maneage</code> branch is currently - tracking/following your <code>origin-maneage</code> remote (automatically set - when you cloned Maneage). So when pushing the <code>maneage</code> branch to your - <code>origin</code> remote, you <em>shouldn't</em> use <code>--set-upstream</code>. With the last - command, you can actually check this (which local and remote branches - are tracking each other).</p> - - <pre><code>git remote add origin XXXXXXXXXX <span class="comment"># Newly created repo is now called 'origin'.</span> -git push --set-upstream origin master <span class="comment"># Push 'master' branch to 'origin' (with tracking).</span> -git push origin maneage <span class="comment"># Push 'maneage' branch to 'origin' (no tracking).</span></code></pre></li> - <li><p><strong>Title</strong>, <strong>short description</strong> and <strong>author</strong>: The title and basic - information of your project's output PDF paper should be added in - <code>paper.tex</code>. You should see the relevant place in the preamble (prior - to <code>\begin{document}</code>). After you are done, run the <code>./project make</code> - command again to see your changes in the final PDF, and make sure that - your changes don't cause a crash in LaTeX.
Of course, if you use a - different LaTeX package/style for managing the title and authors (in - particular a specific journal's style), please feel free to use - your own methods after finishing this checklist and doing your first - commit.</p></li> - <li><p><strong>Delete dummy parts</strong>: Maneage contains some parts that are only for - the initial/test run, mainly as a demonstration of important steps, - which you can use as a reference in your own project. But they - are not for any real analysis, so you should remove these parts as - described below:</p> - - <ul> - <li><p><code>paper.tex</code>: 1) Delete the text of the abstract (from - <code>\includeabstract{</code> to <code>\vspace{0.25cm}</code>) and write your own (a - single sentence can be enough now, you can complete it later). 2) - Add some keywords under it in the keywords part. 3) Delete - everything between <code>%% Start of main body.</code> and <code>%% End of main - body.</code>. 4) Remove the notice in the "Acknowledgments" section (in - <code>\new{}</code>) and acknowledge your funding sources (this can also be - done later). Just don't delete the existing acknowledgment - statement: Maneage is possible thanks to funding from several - grants. Since Maneage is being used in your work, it is necessary to - acknowledge them in your work also.</p></li> - <li><p><code>reproduce/analysis/make/top-make.mk</code>: Delete the <code>delete-me</code> line - in the <code>makesrc</code> definition.
Just make sure there is no empty line - between the <code>download \</code> and <code>verify \</code> lines (they should be - directly under each other).</p></li> - <li><p><code>reproduce/analysis/make/verify.mk</code>: In the final recipe, under the - commented line <code>Verify TeX macros</code>, remove the full line that - contains <code>delete-me</code>, and set the value of <code>s</code> in the line for - <code>download</code> to <code>XXXXX</code> (any temporary string, you'll fix it in the - end of your project, when its complete).</p></li> - <li><p>Delete all <code>delete-me*</code> files in the following directories:</p> - <pre><code>rm tex/src/delete-me* -rm reproduce/analysis/make/delete-me* -rm reproduce/analysis/config/delete-me*</code></pre></li> - <li><p>Disable verification of outputs by removing the <code>yes</code> from - <code>reproduce/analysis/config/verify-outputs.conf</code>. Later, when you are - ready to submit your paper, or publish the dataset, activate - verification and make the proper corrections in this file (described - under the "Other basic customizations" section below). This is a - critical step and only takes a few minutes when your project is - finished. So DON'T FORGET to activate it in the end.</p></li> - <li><p>Re-make the project (after a cleaning) to see if you haven't - introduced any errors.</p> - - <pre><code>./project make clean -./project make</code></pre></li> - </ul></li> - <li><p><strong>Don't merge some files in future updates</strong>: As described below, you - can later update your infra-structure (for example to fix bugs) by - merging your <code>master</code> branch with <code>maneage</code>. For files that you have - created in your own branch, there will be no problem. However if you - modify an existing Maneage file for your project, next time its - updated on <code>maneage</code> you'll have an annoying conflict. The commands - below show how to fix this future problem. 
With them, you can - configure Git to ignore the changes in <code>maneage</code> for some of the files - you have already edited and deleted above (and will edit below). Note - that only the first <code>echo</code> command has a <code>></code> (to write over the file), - the rest are <code>>></code> (to append to it). If you want to avoid any other - set of files to be imported from Maneage into your project's branch, - you can follow a similar strategy. We recommend only doing it when you - encounter the same conflict in more than one merge and there is no - other change in that file. Also, don't add core Maneage Makefiles, - otherwise Maneage can break on the next run.</p> - - <pre><code>echo "paper.tex merge=ours" > .gitattributes -echo "tex/src/delete-me.mk merge=ours" >> .gitattributes -echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes -echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes -echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes -echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes -git add .gitattributes</code></pre></li> - <li><p><strong>Copyright and License notice</strong>: It is necessary that <em>all</em> the - "copyright-able" files in your project (those larger than 10 lines) - have a copyright and license notice. Please take a moment to look at - several existing files to see a few examples. The copyright notice is - usually close to the start of the file, it is the line starting with - <code>Copyright (C)</code> and containing a year and the author's name (like the - examples below). The License notice is a short description of the - copyright license, usually one or two paragraphs with a URL to the - full license. Don't forget to add these <em>two</em> notices to <em>any new - file</em> you add in your project (you can just copy-and-paste). 
When you - modify an existing Maneage file (which already has the notices), just - add a copyright notice in your name under the existing one(s), like - the line with capital letters below. To start with, add this line with - your name and email address to <code>paper.tex</code>, - <code>tex/src/preamble-header.tex</code>, <code>reproduce/analysis/make/top-make.mk</code>, - and generally, all the files you modified in the previous step.</p> - - <pre><code>Copyright (C) 2018-2020 Existing Name &lt;existing@email.address&gt; -Copyright (C) 2020 YOUR NAME &lt;YOUR@EMAIL.ADDRESS&gt;</code></pre></li> - <li><p><strong>Configure Git for first time</strong>: If this is the first time you are - running Git on this system, then you have to configure it with some - basic information in order to have essential information in the commit - messages (ignore this step if you have already done it). Git will - include your name and e-mail address information in each commit. You - can also specify your favorite text editor for making the commit - (<code>emacs</code>, <code>vim</code>, <code>nano</code>, etc.).</p> - - <pre><code>git config --global user.name "YourName YourSurname" -git config --global user.email your-email@example.com -git config --global core.editor nano</code></pre></li> - <li><p><strong>Your first commit</strong>: You have already made some small and basic - changes in the steps above and you are in your project's <code>master</code> - branch. So, you can officially make your first commit in your - project's history and push it. But before that, you need to make sure - that there are no problems in the project.
This is a good habit to - always re-build the system before a commit to be sure it works as - expected.</p> - - <pre><code>git status <span class="comment"># See which files you have changed.</span> -git diff <span class="comment"># Check the lines you have added/changed.</span> -./project make <span class="comment"># Make sure everything builds successfully.</span> -git add -u <span class="comment"># Put all tracked changes in staging area.</span> -git status <span class="comment"># Make sure everything is fine.</span> -git diff --cached <span class="comment"># Confirm all the changes that will be committed.</span> -git commit <span class="comment"># Your first commit: put a good description!</span> -git push <span class="comment"># Push your commit to your remote.</span></code></pre></li> - <li><p><strong>Start your exciting research</strong>: You are now ready to add flesh and - blood to this raw skeleton by further modifying and adding your - exciting research steps. You can use the "published works" section in - the introduction (above) as some fully working models to learn - from. Also, don't hesitate to contact us if you have any - questions.</p></li> - </ol> - - <h2>Other basic customizations</h2> - - <ul> - <li><p><strong>High-level software</strong>: Maneage installs all the software that your - project needs. You can specify which software your project needs in - <code>reproduce/software/config/TARGETS.conf</code>. The necessary software are - classified into two classes: 1) programs or libraries (usually written - in C/C++) which are run directly by the operating system. 2) Python - modules/libraries that are run within Python. By default - <code>TARGETS.conf</code> only has GNU Astronomy Utilities (Gnuastro) as one - scientific program and Astropy as one scientific Python module. Both - have many dependencies which will be installed into your project - during the configuration step. 
To see a list of software that are - currently ready to be built in Maneage, see - <code>reproduce/software/config/versions.conf</code> (which has their versions - also), the comments in <code>TARGETS.conf</code> describe how to use the software - name from <code>versions.conf</code>. Currently the raw pipeline just uses - Gnuastro to make the demonstration plots. Therefore if you don't need - Gnuastro, go through the analysis steps in <code>reproduce/analysis</code> and - remove all its use cases (clearly marked).</p></li> - <li><p><strong>Input dataset</strong>: The input datasets are managed through the - <code>reproduce/analysis/config/INPUTS.conf</code> file. It is best to gather all - the information regarding all the input datasets into this one central - file. To ensure that the proper dataset is being downloaded and used - by the project, it is also recommended get an <a href="https://en.wikipedia.org/wiki/MD5">MD5 - checksum</a> of the file and include - that in <code>INPUTS.conf</code> so the project can check it automatically. The - preparation/downloading of the input datasets is done in - <code>reproduce/analysis/make/download.mk</code>. Have a look there to see how - these values are to be used. This information about the input datasets - is also used in the initial <code>configure</code> script (to inform the users), - so also modify that file. You can find all occurrences of the demo - dataset with the command below and replace it with your input's - dataset.</p> - - <pre><code>grep -ir wfpc2 ./*</code></pre></li> - <li><p><strong><code>README.md</code></strong>: Correct all the <code>XXXXX</code> place holders (name of your - project, your own name, address of your project's online/remote - repository, link to download dependencies and etc). Generally, read - over the text and update it where necessary to fit your project. 
Don't - forget that this is the first file that is displayed on your online - repository and also your colleagues will first be drawn to read this - file. Therefore, make it as easy as possible for them to start - with. Also check and update this file one last time when you are ready - to publish your project's paper/source.</p></li> - <li><p><strong>Verify outputs</strong>: During the initial customization checklist, you - disabled verification. This is natural because during the project you - need to make changes all the time and its a waste of time to enable - verification every time. But at significant moments of the project - (for example before submission to a journal, or publication) it is - necessary. When you activate verification, before building the paper, - all the specified datasets will be compared with their respective - checksum and if any file's checksum is different from the one recorded - in the project, it will stop and print the problematic file and its - expected and calculated checksums. First set the value of - <code>verify-outputs</code> variable in - <code>reproduce/analysis/config/verify-outputs.conf</code> to <code>yes</code>. Then go to - <code>reproduce/analysis/make/verify.mk</code>. The verification of all the files - is only done in one recipe. First the files that go into the - plots/figures are checked, then the LaTeX macros. Validation of the - former (inputs to plots/figures) should be done manually. If its the - first time you are doing this, you can see two examples of the dummy - steps (with <code>delete-me</code>, you can use them if you like). These two - examples should be removed before you can run the project. For the - latter, you just have to update the checksums. The important thing to - consider is that a simple checksum can be problematic because some - file generators print their run-time date in the file (for example as - commented lines in a text table). 
When checking text files, this - Makefile already has this function: - <code>verify-txt-no-comments-leading-space</code>. As the name suggests, it will - remove comment lines and empty lines before calculating the MD5 - checksum. For FITS formats (common in astronomy), fortunately there is - a <code>DATASUM</code> definition which will return the checksum independent of - the headers. You can use the provided function(s), or define one for - your special formats.</p></li> - <li><p><strong>Feedback</strong>: As you use Maneage you will notice many things that if - implemented from the start would have been very useful for your - work. This can be in the actual scripting and architecture of Maneage, - or useful implementation and usage tips, like those below. In any - case, please share your thoughts and suggestions with us, so we can - add them here for everyone's benefit.</p></li> - <li><p><strong>Re-preparation</strong>: Automatic preparation is only run in the first run - of the project on a system. To re-do the preparation you have to use - the option below. Here is the reason for this: when it's necessary, the - preparation process can be slow and will unnecessarily slow down the - whole project while the project is under development (focus is on the - analysis that is done after preparation). Because of this, preparation - will be done automatically for the first time that the project is run - (when <code>.build/software/preparation-done.mk</code> doesn't exist). After the - preparation process completes once, future runs of <code>./project make</code> - will not do the preparation process anymore (will not call - <code>top-prepare.mk</code>). They will only call <code>top-make.mk</code> for the - analysis.
To manually invoke the preparation process after the first - attempt, the <code>./project make</code> script should be run with the - <code>--prepare-redo</code> option, or you can delete the special file above.</p> - - <pre><code>./project make --prepare-redo</code></pre></li> - <li><p><strong>Pre-publication: add notice on reproducibility</strong>: Add a notice - somewhere prominent in the first page within your paper, informing the - reader that your research is fully reproducible. For example in the - end of the abstract, or under the keywords with a title like - "reproducible paper". This will encourage them to publish their own - works in this manner and will also help spread the word.</p></li> - </ul> - - <h1>Tips for designing your project</h1> - - <p>The following is a list of design points, tips, or recommendations that - have been learned after some experience with this type of project - management. Please don't hesitate to share any experience you gain after - using it with us. In this way, we can add it here (giving full credit) - for the benefit of others.</p> - - <ul> - <li><p><strong>Modularity</strong>: Modularity is the key to easy and clean growth of a - project. So it is always best to break up a job into as many - sub-components as reasonable. Here are some tips to stay modular.</p> - - <ul> - <li><p><em>Short recipes</em>: if you see the recipe of a rule becoming more than a - handful of lines which involve significant processing, it is probably - a good sign that you should break up the rule into its main - components. Try to only have one major processing step per rule.</p></li> - <li><p><em>Context-based (many) Makefiles</em>: For maximum modularity, this design - allows easy inclusion of many Makefiles: in - <code>reproduce/analysis/make/*.mk</code> for analysis steps, and - <code>reproduce/software/make/*.mk</code> for building software.
So keep the - rules for closely related parts of the processing in separate - Makefiles.</p></li> - <li><p><em>Descriptive names</em>: Be very clear and descriptive with the naming of - the files and the variables because a few months after the - processing, it will be very hard to remember what each one was - for. Also this helps others (your collaborators or other people - reading the project source after it is published) to more easily - understand your work and find their way around.</p></li> - <li><p><em>Naming convention</em>: As the project grows, following a single standard - or convention in naming the files is very useful. Try best to use - multiple word filenames for anything that is non-trivial (separating - the words with a <code>-</code>). For example if you have a Makefile for - creating a catalog and another two for processing it under models A - and B, you can name them like this: <code>catalog-create.mk</code>, - <code>catalog-model-a.mk</code> and <code>catalog-model-b.mk</code>. In this way, when - listing the contents of <code>reproduce/analysis/make</code> to see all the - Makefiles, those related to the catalog will all be close to each - other and thus easily found. This also helps in auto-completions by - the shell or text editors like Emacs.</p></li> - <li><p><em>Source directories</em>: If you need to add files in other languages for - example in shell, Python, AWK or C, keep the files in the same - language in a separate directory under <code>reproduce/analysis</code>, with the - appropriate name.</p></li> - <li><p><em>Configuration files</em>: If your research uses special programs as part - of the processing, put all their configuration files in a devoted - directory (with the program's name) within - <code>reproduce/software/config</code>. Similar to the - <code>reproduce/software/config/gnuastro</code> directory (which is put in - Maneage as a demo in case you use GNU Astronomy Utilities). 
It is - much cleaner and readable (thus less buggy) to avoid mixing the - configuration files, even if there is no technical necessity.</p></li> - </ul></li> - <li><p><strong>Contents</strong>: It is good practice to follow the following - recommendations on the contents of your files, whether they are source - code for a program, Makefiles, scripts or configuration files - (copyrights aren't necessary for the latter).</p> - - <ul> - <li><p><em>Copyright</em>: Always start a file containing programming constructs - with a copyright statement like the ones that Maneage starts with - (for example in the top level <code>Makefile</code>).</p></li> - <li><p><em>Comments</em>: Comments are vital for readability (by yourself in two - months, or others). Describe everything you can about why you are - doing something, how you are doing it, and what you expect the result - to be. Write the comments as if it was what you would say to describe - the variable, recipe or rule to a friend sitting beside you. When - writing the project it is very tempting to just steam ahead with - commands and codes, but be patient and write comments before the - rules or recipes. This will also allow you to think more about what - you should be doing. Also, in several months when you come back to - the code, you will appreciate the effort of writing them. Just don't - forget to also read and update the comment first if you later want to - make changes to the code (variable, recipe or rule). As a general - rule of thumb: first the comments, then the code.</p></li> - <li><p><em>File title</em>: In general, it is good practice to start all files with - a single line description of what that particular file does. If - further information about the totality of the file is necessary, add - it after a blank line. This will help a fast inspection where you - don't care about the details, but just want to remember/see what that - file is (generally) for. 
This information must of course be commented - (its for a human), but this is kept separate from the general - recommendation on comments, because this is a comment for the whole - file, not each step within it.</p></li> - </ul></li> - <li><p><strong>Make programming</strong>: Here are some experiences that we have come to - learn over the years in using Make and are useful/handy in research - contexts.</p> - - <ul> - <li><p><em>Environment of each recipe</em>: If you need to define a special - environment (or aliases, or scripts to run) for all the recipes in - your Makefiles, you can use a Bash startup file - <code>reproduce/software/shell/bashrc.sh</code>. This file is loaded before every - Make recipe is run, just like the <code>.bashrc</code> in your home directory is - loaded every time you start a new interactive, non-login terminal. See - the comments in that file for more.</p></li> - <li><p><em>Automatic variables</em>: These are wonderful and very useful Make - constructs that greatly shrink the text, while helping in - read-ability, robustness (less bugs in typos for example) and - generalization. For example even when a rule only has one target or - one prerequisite, always use <code>$@</code> instead of the target's name, <code>$<</code> - instead of the first prerequisite, <code>$^</code> instead of the full list of - prerequisites and etc. You can see the full list of automatic - variables - <a href="https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html">here</a>. If - you use GNU Make, you can also see this page on your command-line:</p> - - <pre><code>info make "automatic variables"</code></pre></li> - <li><p><em>Debug</em>: Since Make doesn't follow the common top-down paradigm, it - can be a little hard to get accustomed to why you get an error or - un-expected behavior. In such cases, run Make with the <code>-d</code> - option. 
With this option, Make prints a full list of exactly which - prerequisites are being checked for which targets. Looking - (patiently) through this output and searching for the faulty - file/step will clearly show you any mistake you might have made in - defining the targets or prerequisites.</p></li> - <li><p><em>Large files</em>: If you are dealing with very large files (thus having - multiple copies of them for intermediate steps is not possible), one - solution is the following strategy (Also see the next item on "Fast - access to temporary files"). Set a small plain text file as the - actual target and delete the large file when it is no longer needed - by the project (in the last rule that needs it). Below is a simple - demonstration of doing this. In it, we use Gnuastro's Arithmetic - program to add all pixels of the input image with 2 and create - <code>large1.fits</code>. We then subtract 2 from <code>large1.fits</code> to create - <code>large2.fits</code> and delete <code>large1.fits</code> in the same rule (when its no - longer needed). We can later do the same with <code>large2.fits</code> when it - is no longer needed and so on. - <pre><code>large1.fits.txt: input.fits -astarithmetic $< 2 + --output=$(subst .txt,,$@) -echo "done" > $@ -large2.fits.txt: large1.fits.txt -astarithmetic $(subst .txt,,$<) 2 - --output=$(subst .txt,,$@) -rm $(subst .txt,,$<) -echo "done" > $@</code></pre> - A more advanced Make programmer will use Make's <a href="https://www.gnu.org/software/make/manual/html_node/Call-Function.html">call function</a> - to define a wrapper in <code>reproduce/analysis/make/initialize.mk</code>. This - wrapper will replace <code>$(subst .txt,,XXXXX)</code>. 
Therefore, it will be - possible to greatly simplify this repetitive statement and make the - code even more readable throughout the whole project.</p></li> - <li><p><em>Fast access to temporary files</em>: Most Unix-like operating systems - will give you a special shared-memory device (directory): on systems - using the GNU C Library (all GNU/Linux system), it is <code>/dev/shm</code>. The - contents of this directory are actually in your RAM, not in your - persistence storage like the HDD or SSD. Reading and writing from/to - the RAM is much faster than persistent storage, so if you have enough - RAM available, it can be very beneficial for large temporary files to - be put there. You can use the <code>mktemp</code> program to give the temporary - files a randomly-set name, and use text files as targets to keep that - name (as described in the item above under "Large files") for later - deletion. For example, see the minimal working example Makefile below - (which you can actually put in a <code>Makefile</code> and run if you have an - <code>input.fits</code> in the same directory, and Gnuastro is installed). - <pre><code>.ONESHELL: -.SHELLFLAGS = -ec -all: mean-std.txt -shm-maneage := /dev/shm/$(shell whoami)-maneage-XXXXXXXXXX -large1.txt: input.fits -out=$$(mktemp $(shm-maneage)) -astarithmetic $< 2 + --output=$$out.fits -echo "$$out" > $@ -large2.txt: large1.txt -input=$$(cat $<) -out=$$(mktemp $(shm-maneage)) -astarithmetic $$input.fits 2 - --output=$$out.fits -rm $$input.fits $$input -echo "$$out" > $@ -mean-std.txt: large2.txt -input=$$(cat $<) -aststatistics $$input.fits --mean --std > $@ -rm $$input.fits $$input</code></pre> - The important point here is that the temporary name template - (<code>shm-maneage</code>) has no suffix. So you can add the suffix - corresponding to your desired format afterwards (for example - <code>$$out.fits</code>, or <code>$$out.txt</code>). 
But more importantly, when <code>mktemp</code> - sets the random name, it also checks if no file exists with that name - and creates a file with that exact name at that moment. So at the end - of each recipe above, you'll have two files in your <code>/dev/shm</code>, one - empty file with no suffix one with a suffix. The role of the file - without a suffix is just to ensure that the randomly set name will - not be used by other calls to <code>mktemp</code> (when running in parallel) and - it should be deleted with the file containing a suffix. This is the - reason behind the <code>rm $$input.fits $$input</code> command above: to make - sure that first the file with a suffix is deleted, then the core - random file (note that when working in parallel on powerful systems, - in the time between deleting two files of a single <code>rm</code> command, many - things can happen!). When using Maneage, you can put the definition - of <code>shm-maneage</code> in <code>reproduce/analysis/make/initialize.mk</code> to be - usable in all the different Makefiles of your analysis, and you won't - need the three lines above it. <strong>Finally, BE RESPONSIBLE:</strong> after you - are finished, be sure to clean up any possibly remaining files (due - to crashes in the processing while you are working), otherwise your - RAM may fill up very fast. You can do it easily with a command like - this on your command-line: <code>rm -f /dev/shm/$(whoami)-*</code>.</p></li> - </ul></li> - <li><p><strong>Software tarballs and raw inputs</strong>: It is critically important to - document the raw inputs to your project (software tarballs and raw - input data):</p> - - <ul> - <li><p><em>Keep the source tarball of dependencies</em>: After configuration - finishes, the <code>.build/software/tarballs</code> directory will contain all - the software tarballs that were necessary for your project. 
You can - mirror the contents of this directory to keep a backup of all the - software tarballs used in your project (possibly as another version - controlled repository) that is also published with your project. Note - that software web-pages are not written in stone and can suddenly go - offline or not be accessible in some conditions. This backup is thus - very important. If you intend to release your project in a place like - Zenodo, you can upload/keep all the necessary tarballs (and data) - there with your - project. <a href="https://doi.org/10.5281/zenodo.1163746">zenodo.1163746</a> is - one example of how the data, Gnuastro (main software used) and all - major Gnuastro's dependencies have been uploaded with the project's - source. Just note that this is only possible for free and open-source - software.</p></li> - <li><p><em>Keep your input data</em>: The input data is also critical to the - project's reproducibility, so like the above for software, make sure - you have a backup of them, or their persistent identifiers (PIDs).</p></li> - </ul></li> - <li><p><strong>Version control</strong>: Version control is a critical component of - Maneage. Here are some tips to help in effectively using it.</p> - - <ul> - <li><p><em>Regular commits</em>: It is important (and extremely useful) to have the - history of your project under version control. So try to make commits - regularly (after any meaningful change/step/result).</p></li> - <li><p><em>Keep Maneage up-to-date</em>: In time, Maneage is going to become more - and more mature and robust (thanks to your feedback and the feedback - of other users). Bugs will be fixed and new/improved features will be - added. So every once and a while, you can run the commands below to - pull new work that is done in Maneage. If the changes are useful for - your work, you can merge them with your project to benefit from - them. 
Just pay <strong>very close attention</strong> to resolving possible - <strong>conflicts</strong> which might happen in the merge (updated settings that - you have customized in Maneage).</p> - - <pre><code>git checkout maneage -git pull <span class="comment"># Get recent work in Maneage</span> -git log XXXXXX..XXXXXX --reverse <span class="comment"># Inspect new work (replace XXXXXXs with hashs mentioned in output of previous command).</span> -git log --oneline --graph --decorate --all <span class="comment"># General view of branches.</span> -git checkout master <span class="comment"># Go to your top working branch.</span> -git merge maneage <span class="comment"># Import all the work into master.</span></code></pre></li> - <li><p><em>Adding Maneage to a fork of your project</em>: As you and your colleagues - continue your project, it will be necessary to have separate - forks/clones of it. But when you clone your own project on a - different system, or a colleague clones it to collaborate with you, - the clone won't have the <code>origin-maneage</code> remote that you started the - project with. As shown in the previous item above, you need this - remote to be able to pull recent updates from Maneage. The steps - below will setup the <code>origin-maneage</code> remote, and a local <code>maneage</code> - branch to track it, on the new clone.</p> - - <pre><code>git remote add origin-maneage https://git.maneage.org/project.git -git fetch origin-maneage -git checkout -b maneage --track origin-maneage/maneage</code></pre></li> - <li><p><em>Commit message</em>: The commit message is a very important and useful - aspect of version control. To make the commit message useful for - others (or yourself, one year later), it is good to follow a - consistent style. Maneage already has a consistent formatting - (described below), which you can also follow in your project if you - like. 
You can see many examples by running <code>git log</code> in the <code>maneage</code> - branch. If you intend to push commits to Maneage, for the consistency - of Maneage, it is necessary to follow these guidelines. 1) No line - should be more than 75 characters (to enable easy reading of the - message when you run <code>git log</code> on the standard 80-character - terminal). 2) The first line is the title of the commit and should - summarize it (so <code>git log --oneline</code> can be useful). The title should - also not end with a point (<code>.</code>, because its a short single sentence, - so a point is not necessary and only wastes space). 3) After the - title, leave an empty line and start the body of your message - (possibly containing many paragraphs). 4) Describe the context of - your commit (the problem it is trying to solve) as much as possible, - then go onto how you solved it. One suggestion is to start the main - body of your commit with "Until now ...", and continue describing the - problem in the first paragraph(s). Afterwards, start the next - paragraph with "With this commit ...".</p></li> - <li><p><em>Project outputs</em>: During your research, it is possible to checkout a - specific commit and reproduce its results. However, the processing - can be time consuming. Therefore, it is useful to also keep track of - the final outputs of your project (at minimum, the paper's PDF) in - important points of history. However, keeping a snapshot of these - (most probably large volume) outputs in the main history of the - project can unreasonably bloat it. It is thus recommended to make a - separate Git repo to keep those files and keep your project's source - as small as possible. For example if your project is called - <code>my-exciting-project</code>, the name of the outputs repository can be - <code>my-exciting-project-output</code>. 
This enables easy sharing of the output - files with your co-authors (with necessary permissions) and not - having to bloat your email archive with extra attachments also (you - can just share the link to the online repo in your - communications). After the research is published, you can also - release the outputs repository, or you can just delete it if it is - too large or un-necessary (it was just for convenience, and fully - reproducible after all). For example Maneage's output is available - for demonstration in <a href="http://git.maneage.org/output-raw.git/">a - separate</a> repository.</p></li> - <li><p><em>Full Git history in one file</em>: When you are publishing your project - (for example to Zenodo for long term preservation), it is more - convenient to have the whole project's Git history into one file to - save with your datasets. After all, you can't be sure that your - current Git server (for example GitLab, Github, or Bitbucket) will be - active forever. While they are good for the immediate future, you - can't rely on them for archival purposes. Fortunately keeping your - whole history in one file is easy with Git using the following - commands. To learn more about it, run <code>git help bundle</code>.</p> - - <ul> - <li>"bundle" your project's history into one file (just don't forget to - change <code>my-project-git.bundle</code> to a descriptive name of your - project):</li> - </ul> - - <pre><code>git bundle create my-project-git.bundle --all</code></pre> - - <ul> - <li>You can easily upload <code>my-project-git.bundle</code> anywhere. Later, if - you need to un-bundle it, you can use the following command.</li> - </ul> - - <p><p><pre><code>git clone my-project-git.bundle</code></pre></li> - </ul></p></li> - </ul></p> - - <h1>Future improvements</h1> - - <p>This is an evolving project and as time goes on, it will evolve and become - more robust. 
Some of the most prominent issues we plan to implement in the - future are listed below, please join us if you are interested.</p> - - <h2>Package management</h2> - - <p>It is important to have control of the environment of the project. Maneage - currently builds the higher-level programs (for example GNU Bash, GNU Make, - GNU AWK and domain-specific software) it needs, then sets <code>PATH</code> so the - analysis is done only with the project's built software. But currently the - configuration of each program is in the Makefile rules that build it. This - is not good because a change in the build configuration does not - automatically cause a re-build. Also, each separate project on a system - needs to have its own built tools (that can waste a lot of space).</p> - - <p>A good solution is based on the <a href="https://nixos.org/nix/about.html">Nix package manager</a>: a separate file is present for - each software, containing all the necessary info to build it (including its - URL, its tarball MD5 hash, dependencies, configuration parameters, build - steps and etc). Using this file, a script can automatically generate the - Make rules to download, build and install program and its dependencies - (along with the dependencies of those dependencies and etc).</p> - - <p>All the software are installed in a "store". Each installed file (library - or executable) is prefixed by a hash of this configuration (and the OS - architecture) and the standard program name. For example (from the Nix - webpage):</p> - - <pre><code>/nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/</code></pre> - - <p>The important thing is that the "store" is <em>not</em> in the project's search - path. After the complete installation of the software, symbolic links are - made to populate each project's program and library search paths without a - hash. This hash will be unique to that particular software and its - particular configuration. 
So simply by searching for this hash in the - installed directory, we can find the installed files of that software to - generate the links.</p> - - <p>This scenario has several advantages: 1) a change in a software's build - configuration triggers a rebuild. 2) a single "store" can be used in many - projects, thus saving space and configuration time for new projects (that - commonly have large overlaps in lower-level programs).</p> - - <h1>Appendix: Necessity of exact reproduction in scientific research</h1> - - <p>In case <a href="http://akhlaghi.org/reproducible-science.html">the link above</a> is - not accessible at the time of reading, here is a copy of the introduction - of that link, describing the necessity for a reproducible project like this - (copied on February 7th, 2018):</p> - - <p>The most important element of a "scientific" statement/result is the fact - that others should be able to falsify it. The Tsunami of data that has - engulfed astronomers in the last two decades, combined with faster - processors and faster internet connections has made it much more easier to - obtain a result. However, these factors have also increased the complexity - of a scientific analysis, such that it is no longer possible to describe - all the steps of an analysis in the published paper. Citing this - difficulty, many authors suffice to describing the generalities of their - analysis in their papers.</p> - - <p>However, It is impossible to falsify (or even study) a result if you can't - exactly reproduce it. The complexity of modern science makes it vitally - important to exactly reproduce the final result. Because even a small - deviation can be due to many different parts of an analysis. Nature is - already a black box which we are trying so hard to comprehend. 
Not letting - other scientists see the exact steps taken to reach a result, or not - allowing them to modify it (do experiments on it) is a self-imposed black - box, which only exacerbates our ignorance.</p> - - <p>Other scientists should be able to reproduce, check and experiment on the - results of anything that is to carry the "scientific" label. Any result - that is not reproducible (due to incomplete information by the author) is - not scientific: the readers have to have faith in the subjective experience - of the authors in the very important choice of configuration values and - order of operations: this is contrary to the scientific spirit.</p> <footer role="contentinfo" id="page-footer"> - <h2>Copyright information</h2> - - <p>This file is part of Maneage's core: <a href="https://git.maneage.org/project.git">https://git.maneage.org/project.git</a></p> - - <p>Maneage is free software: you can redistribute it and/or modify it under - the terms of the GNU General Public License as published by the Free - Software Foundation, either version 3 of the License, or (at your option) - any later version.</p> - - <p>Maneage is distributed in the hope that it will be useful, but WITHOUT ANY - WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS - FOR A PARTICULAR PURPOSE. See the GNU General Public License for more - details.</p> - - <p>You should have received a copy of the GNU General Public License along - with Maneage. 
If not, see <a href="https://www.gnu.org/licenses/">https://www.gnu.org/licenses/</a>.</p> <ul> <li><p>Maneage is currently based in the Instituto de Astrofísica de Canarias (IAC).</p></li> <li><p>Address: IAC, Calle Vía Láctea, s/n, E38205 - La Laguna (Tenerife), Spain.</p></li> diff --git a/tutorial.html b/tutorial.html index 0708408..1878ddd 100644 --- a/tutorial.html +++ b/tutorial.html @@ -208,13 +208,13 @@ git push --tags <span class="comment"># Push all ta (<code>reproduce/analysis/python</code>).</p> <pre><code><span class="comment"># Make a linear fit of an input data set +# # This Python script makes a linear fitting of a data consisting in time and # population. It generates a figure in which the original data and the # fitted curve is plotted. Finally, it saves the fitting parameters. -# Original author: -# Copyright (C) 2020, Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a> -# Contributing author(s): -# Copyright (C) YEAR, YourName YourSurname. +# +<span class="comment"># Copyright (C) 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a></span> +<span class="comment"># Copyright (C) YYYY Your Name <a href="mailto:your-email@example.xxx">your-email@example.xxx</a></span> # # This Python script is free software: you can redistribute it and/or modify it # under the terms of the GNU General Public License as published by the @@ -227,54 +227,45 @@ git push --tags <span class="comment"># Push all ta # Public License for more details. See <a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>. 
# Necessary packages</span> +<span class="comment"># Import necessary packages.</span> import sys import numpy as np import matplotlib.pyplot as plt from scipy.optimize import curve_fit <span class="comment"># Fitting function (linear fit)</span> - def func(x, a, b): return a * x + b <span class="comment"># Define input and output arguments</span> - ifile = sys.argv[1] <span class="comment"># Input file</span> ofile = sys.argv[2] <span class="comment"># Output file</span> ofig = sys.argv[3] <span class="comment"># Output figure</span> <span class="comment"># Read the data from the input file.</span> - data = np.loadtxt(ifile) <span class="comment"># Time and population:</span> - <span class="comment"># time ---------- x</span> - <span class="comment"># population ---- y</span> - x = data[:, 0] y = data[:, 1] <span class="comment"># Make the linear fit</span> - params, pcov = curve_fit(func, x, y) <span class="comment"># Make and save the figure</span> - plt.clf() plt.figure() - plt.plot(x, y, 'bo', label="Original data") plt.plot(x, func(x, *params), 'r-', label="Fitted curve") - plt.title('Population along time') plt.xlabel('Time (year)') plt.ylabel('Population (million people)') plt.legend() plt.grid() - plt.savefig(ofig, format='PDF', bbox_inches='tight') + <span class="comment"># Save the fitting parameters</span> np.savetxt(ofile, params, fmt='%.3f') </code></pre> @@ -354,19 +345,21 @@ git push <span class="comment"># Push th <span class="comment"># WITHOUT ANY WARRANTY; without even the implied warranty of</span> <span class="comment"># MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General</span> <span class="comment"># Public License for more details. 
See <a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>.</span> +<br /> <span class="comment"># Download data for the tutorial</span> <span class="comment"># ------------------------------</span> <span class="comment">#</span> pop-data = $(indir)/ESP.dat $(pop-data): | $(indir) -wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@ + wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@ +<br /> <span class="comment"># Final TeX macro</span> <span class="comment"># ---------------</span> <span class="comment">#</span> <span class="comment"># It is very important to mention the address where the data were</span> <span class="comment"># downloaded in the final report.</span> $(mtexdir)/getdata-analysis.tex: $(pop-data) | $(mtexdir) -echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@</code></pre> + echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@</code></pre> <p>Have a look at this Makefile and see the different parts. The first line is a descriptive title. Below, include your name, contact email, and finally, the copyright. Please, take your time in order to add all relevant @@ -378,7 +371,7 @@ echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@</co the general structure of a Make rule:</p> <pre><code>TARGETS: PREREQUISITES -RECIPE</code></pre> + RECIPE</code></pre> <p>In a rule, it is said how to construct the <code>TARGETS</code> from the <code>PREREQUISITES</code>, following the <code>RECIPE</code>. 
<strong>Note that the white space at the @@ -388,7 +381,7 @@ RECIPE</code></pre> <p>Now you can see this structure in our particular case:</p> <pre><code>(pop-data): | $(indir) -wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@</code></pre> + wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@</code></pre> <p>Here we have:</p> @@ -417,7 +410,7 @@ wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@</code></pre> included into the final paper are saved. Here, you are saving the <code>URL</code>.</p> <pre><code>(mtexdir)/getdata-analysis.tex: $(pop-data) | $(mtexdir) -echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@</code></pre> + echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@</code></pre> <p>In this final rule we have:</p> @@ -445,12 +438,12 @@ echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $ Makefiles have to be executed. You have to end up having:</p> <pre><code>makesrc = initialize \ -download \ -getdata-analyse \ -delete-me \ -paper</code></pre> + download \ + getdata-analyse \ + delete-me \ + paper</code></pre> - <p>As allways, read carefully all comments and information in order to know + <p>As always, read carefully all comments and information in order to know what is going on. Also, add your own comments and information in order to be clear and explain each step with enough level of detail. If everything is fine, now the project is ready to download the data in the make step. 
Try @@ -512,13 +505,13 @@ paper</code></pre> necessary because this directory is needed for saving the file <code>ESP.txt</code>.</p> <pre><code>(odir): -mkdir $@</code></pre> + mkdir $@</code></pre> <p>With all the previous definitions, now it is possible to set the rule for making the analysis:</p> <pre><code>(param-file): $(indir)/ESP.dat | $(odir) -python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf</code></pre> + python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf</code></pre> <p>In this rule you have:</p> @@ -562,15 +555,12 @@ echo "\newcommand{\bfitparam}{$$b}" >> $@</code></pre> <p>So, at the end you will have the final rule like this:</p> - <code>(mtexdir)/getdata-analysis.tex: $(param-file) | $(mtexdir)</code> - - <pre><code>echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@ - -a=$$(cat $< | awk 'NR==1{print $1}') -b=$$(cat $< | awk 'NR==2{print $1}') - -echo "\newcommand{\afitparam}{$$a}" >> $@ -echo "\newcommand{\bfitparam}{$$b}" >> $@</code></pre> + <pre><code>(mtexdir)/getdata-analysis.tex: $(param-file) | $(mtexdir) + echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@ + a=$$(cat $< | awk 'NR==1{print $1}') + b=$$(cat $< | awk 'NR==2{print $1}') + echo "\newcommand{\afitparam}{$$a}" >> $@ + echo "\newcommand{\bfitparam}{$$b}" >> $@</code></pre> <p><strong>Important notes: you have to use two <code>$</code> in order to use the bash <code>$</code> character inside of a Make rule. Also, note that you have to put <code>>></code> in @@ -582,56 +572,64 @@ echo "\newcommand{\bfitparam}{$$b}" >> $@</code></pre> parameters. If you add the necessary comments and information, the final Makefile would look similar to:</p> <pre><code><span class="comment"># Download data and linear fitting for the tutorial</span> +<span class="comment">#</span> <span class="comment"># In this Makefile, data for the tutorial is downloaded. 
Then, a Python</span> <span class="comment"># script is used to make a linear fitting. Finally, fitted parameters as</span> <span class="comment"># well as the URL is saved into a TeX macro.</span> +<span class="comment">#</span> <span class="comment"># Copyright (C) 2020 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a></span> <span class="comment"># Copyright (C) YYYY Your Name <a href="mailto:your-email@example.xxx">your-email@example.xxx</a></span> +<span class="comment">#</span> <span class="comment"># This Makefile is free software: you can redistribute it and/or modify it</span> <span class="comment"># under the terms of the GNU General Public License as published by the</span> <span class="comment"># Free Software Foundation, either version 3 of the License, or (at your</span> <span class="comment"># option) any later version.</span> +<span class="comment">#</span> <span class="comment"># This Makefile is distributed in the hope that it will be useful, but</span> <span class="comment"># WITHOUT ANY WARRANTY; without even the implied warranty of</span> <span class="comment"># MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General</span> <span class="comment"># Public License for more details. 
See <a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>.</span> +<br /> <span class="comment"># Download data for the tutorial</span> <span class="comment"># ------------------------------</span> <span class="comment"># The input file is defined and downloaded using the following rule</span> pop-data = $(indir)/ESP.dat $(pop-data): | $(indir) -<span class="comment"># Use wget to download the data</span> -wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@ +<span class="comment"> # Use wget to download the data</span> + wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@ +<br /> <span class="comment"># Output directory</span> <span class="comment"># ----------------</span> <span class="comment"># Small rule for constructing the output directory, previously defined</span> odir = $(BDIR)/fit-parameters $(odir): -<span class="comment"># Build the output directory</span> -mkdir $@ + mkdir $@ +<br /> <span class="comment"># Linear fitting of the data</span> <span class="comment"># --------------------------</span> <span class="comment"># The output file is defined into the output directory. The fitted</span> <span class="comment"># parameters will be saved into this directory by the Python script.</span> param-file = $(odir)/ESP.txt $(param-file): $(indir)/ESP.dat | $(odir) -<span class="comment"># Invoke Python to run the script with the input data</span> -python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf +<span class="comment"> # Invoke Python to run the script with the input data</span> + python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf +<br /> <span class="comment"># TeX macros final target</span> <span class="comment"># -----------------------</span> <span class="comment"># This is how we write the necessary parameters in the final PDF. 
In this</span> <span class="comment"># rule, new TeX parameters are defined from the URL, and the fitted</span> <span class="comment"># parameters.</span> $(mtexdir)/getdata-analysis.tex: $(param-file) | $(mtexdir) -<span class="comment"># Write the URL into the target</span> -echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@ +<span class="comment"> # Write the URL into the target</span> + echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@ -<span class="comment"># Read the fitted parameters and save them into the target</span> -a=$$(cat $< | awk 'NR==1{print $1}') -b=$$(cat $< | awk 'NR==2{print $1}') +<span class="comment"> # Read the fitted parameters into shell variables.</span> + a=$$(cat $< | awk 'NR==1{print $1}') + b=$$(cat $< | awk 'NR==2{print $1}') -echo "\newcommand{\afitparam}{$$a}" >> $@ -echo "\newcommand{\bfitparam}{$$b}" >> $@</code></pre> +<span class="comment"> # Write the parameters into the target as LaTeX macros.</span> + echo "\newcommand{\afitparam}{$$a}" >> $@ + echo "\newcommand{\bfitparam}{$$b}" >> $@</code></pre> <p>Have a look at this Makefile and note that it matches what has been described above. Take your time for making useful comments and modifying whatever you |