From 804463c799d96802c466f10fd5aa9037430f5ae1 Mon Sep 17 00:00:00 2001 From: Pedram Ashofteh Ardakani Date: Wed, 29 Apr 2020 14:31:10 +0430 Subject: about.html: initial Markdown to HTML conversion Used the https://daringfireball.net/projects/markdown/ project. --- about.html | 1240 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1240 insertions(+) create mode 100644 about.html diff --git a/about.html b/about.html new file mode 100644 index 0000000..fbfc83c --- /dev/null +++ b/about.html @@ -0,0 +1,1240 @@ +

Maneage: managing data lineage

+ +

Copyright (C) 2018-2020 Mohammad Akhlaghi mohammad@akhlaghi.org\ +Copyright (C) 2020 Raul Infante-Sainz infantesainz@gmail.com\ +See the end of the file for license conditions.

+ +

Maneage is a fully working template for doing reproducible research (or +writing a reproducible paper) as defined in the link below. If the link +below is not accessible at the time of reading, please see the appendix at +the end of this file for a portion of its introduction. Some +slides are also available +to help demonstrate the concept implemented here.

+ +

http://akhlaghi.org/reproducible-science.html

+ +

Maneage is created with the aim of supporting reproducible research by +making it easy to start a project in this framework. As shown below, it is +very easy to customize Maneage for any particular (research) project and +expand it as it starts and evolves. It can be run with no modification (as +described in README.md) as a demonstration and customized for use in any +project as fully described below.

+ +

A project designed using Maneage will download and build all the necessary +libraries and programs for working in a closed environment (highly +independent of the host operating system) with fixed versions of the +necessary dependencies. The tarballs for building the local environment are +also collected in a separate +repository. The final +output of the project is a +paper. Notice the +last paragraph of the Acknowledgments where all the necessary software are +mentioned with their versions.

+ +

Below, we start with a discussion of why Make was chosen as the high-level +language/framework for project management and how to learn and master Make +easily (and freely). The general architecture and design of the project is +then discussed to help you navigate the files and their contents. This is +followed by a checklist for the easy/fast customization of Maneage to your +exciting research. We continue with some tips and guidelines on how to +manage or extend your project as it grows based on our experiences with it +so far. The main body concludes with a description of possible future +improvements that are planned for Maneage (but not yet implemented). As +discussed above, we end with a short introduction on the necessity of +reproducible science in the appendix.

+ +

Please don't forget to share your thoughts, suggestions and +criticisms. Maintaining and designing Maneage is itself a separate project, +so please join us if you are interested. Once it is mature enough, we will +describe it in a paper (written by all contributors) for a formal +introduction to the community.

+ +

Why Make?

+ +

When batch processing is necessary (no manual intervention, as in a +reproducible project), shell scripts are usually the first solution that +comes to mind. However, the inherent complexity and non-linearity of +progress in a scientific project (where experimentation is key) make it +hard to manage the script(s) as the project evolves. For example, a script +will start from the top/start every time it is run. So if you have already +completed 90% of a research project and want to run the remaining 10% that +you have newly added, you have to run the whole script from the start +again. Only then will you see the effects of the last new steps (to find +possible errors, better solutions, etc.).

+ +

It is possible to manually ignore/comment parts of a script to only do a +special part. However, such checks/comments will only add to the complexity +of the script and will discourage you from playing with or changing an already +completed part of the project when an idea suddenly comes up. It is also +prone to very serious bugs in the end (when trying to reproduce from +scratch). Such bugs are very hard to notice during the work and frustrating +to find in the end.

+ +

The Make paradigm, on the other hand, starts from the end: the final +target. It builds a dependency tree internally, and finds where it should +start each time the project is run. Therefore, in the scenario above, a +researcher that has just added the final 10% of steps of her research to +her Makefile, will only have to run those extra steps. With Make, it is +also trivial to change the processing of any intermediate (already written) +rule (or step) in the middle of an already written analysis: the next +time Make is run, only rules that are affected by the changes/additions +will be re-run, not the whole analysis/project.

+ +

This greatly speeds up the processing (enabling creative changes), while +keeping all the dependencies clearly documented (as part of the Make +language), and most importantly, enabling full reproducibility from scratch +with no changes in the project code that was working during the +research. This will allow robust results and let the scientists get to what +they do best: experiment and be critical of the methods/analysis without +having to waste energy and time on technical problems that come up as a +result of that experimentation in scripts.

+ +

Since the dependencies are clearly demarcated in Make, it can identify +independent steps and run them in parallel. This further speeds up the +processing. Make was designed for this purpose. It is how huge projects +like all Unix-like operating systems (including GNU/Linux or Mac OS +operating systems) and their core components are built. Therefore, Make is +a highly mature paradigm/system with robust and highly efficient +implementations in various operating systems perfectly suited for a complex +non-linear research project.

+ +

Make is a small language with the aim of defining rules containing +targets, prerequisites and recipes. It comes with some nice features +like functions or automatic-variables to greatly facilitate the management +of text (filenames for example) or any of those constructs. For a more +detailed (yet still general) introduction see the article on Wikipedia:

+ +

https://en.wikipedia.org/wiki/Make_(software)

+ +

Make is more than 40 years old and still evolving, therefore many +implementations of Make exist. The only difference in them is some extra +features over the standard +definition +(which is shared in all of them). Maneage is primarily written in GNU Make, +which is the most common, most actively developed, and most advanced +implementation. Just note that Maneage downloads, builds, internally +installs, and uses its own dependencies (including GNU Make), so you don't +have to have it installed before you try it out.

+ +

How can I learn Make?

+ +

The GNU Make book/manual (links below) is arguably the best place to learn +Make. It is an excellent book, and its first few chapters are deliberately +non-technical to help you get started easily. It +is freely available and always up to date with the current GNU Make +release. It also clearly explains which features are specific to GNU Make +and which are common to all implementations. So the first few chapters +regarding the generalities are useful for all implementations.

+ +

The first link below points to the GNU Make manual in various formats and +in the second, you can download it in PDF (which may be easier for a first +time reading).

+ +

https://www.gnu.org/software/make/manual/

+ +

https://www.gnu.org/software/make/manual/make.pdf

+ +

If you use GNU Make, you also have the whole GNU Make manual on the +command-line with the following command (you can come out of the "Info" +environment by pressing q).

+ +

shell + $ info make +

+ +

If you aren't familiar with the Info documentation format, we strongly +recommend running $ info info and reading along. In less than an hour, +you will become highly proficient in it (it is very simple and has a great +manual for itself). Info greatly simplifies your access (without taking +your hands off the keyboard!) to many manuals that are installed on your +system, allowing you to be much more efficient as you work. If you use the +GNU Emacs text editor (or any of its variants), you also have access to all +Info manuals while you are writing your projects (again, without taking +your hands off the keyboard!).

+ +

Published works using Maneage

+ +

The list below shows some of the works that have already been published +with (earlier versions of) Maneage. Previously it was simply called +"Reproducible paper template". Note that Maneage is evolving, so some +details may be different in them. The more recent ones can be used as a +good working example.

+ + + +

Citation

+ +

A paper to fully describe Maneage has been submitted. Until then, if you +used it in your work, please cite the paper that implemented its first +version: Akhlaghi & Ichikawa +(2015, ApJS, 220, 1).

+ +

Also, when your paper is published, don't forget to add a notice in your +own paper (in coordination with the publishing editor) that the paper is +fully reproducible and possibly add a sentence or paragraph in the end of +the paper shortly describing the concept. This will help spread the word +and encourage other scientists to also manage and publish their projects in +a reproducible manner.

+ +

Project architecture

+ +

In order to customize Maneage to your research, it is important to first +understand its architecture so you can navigate your way in the directories +and understand how to implement your research project within its framework: +where to add new files and which existing files to modify for what +purpose. But if this is the first time you are using Maneage, before reading +this theoretical discussion, please run Maneage once from scratch without +any changes (described in README.md). You will see how it works (note that +the configure step builds all necessary software, so it can take long, but +you can continue reading while it's working).

+ +

The project has two top-level directories: reproduce and +tex. reproduce hosts all the software building and analysis +steps. tex contains all the final paper's components to be compiled into +a PDF using LaTeX.

+ +

The reproduce directory has two sub-directories: software and +analysis. As the name says, the former contains all the instructions to +download, build and install (independent of the host operating system) the +necessary software (these are called by the ./project configure +command). The latter contains instructions on how to use those software to +do your project's analysis.

+ +

After it finishes, ./project configure will create the following symbolic +links in the project's top source directory: .build which points to the +top build directory and .local for easy access to the custom built +software installation directory. With these you can easily access the build +directory and project-specific software from your top source directory. For +example if you run .local/bin/ls you will be using the ls of Maneage, +which is probably different from your system's ls (run them both with +--version to check).

+ +

Once the project is configured for your system, ./project make will do +the basic preparations and run the project's analysis with the custom +version of software. The project script is just a wrapper, and with the +make argument, it will first call top-prepare.mk and top-make.mk +(both are in the reproduce/analysis/make directory).

+ +

In terms of organization, top-prepare.mk and top-make.mk have an +identical design, with only minor differences. So, let's continue Maneage's +architecture with top-make.mk. Once you understand that, you'll clearly +understand top-prepare.mk also. These very high-level files are +relatively short and heavily commented so hopefully the descriptions in +each comment will be enough to understand the general details. As you read +this section, please also look at the contents of the mentioned files and +directories to fully understand what is going on.

+ +

Before looking into top-make.mk, it is important to +recall that Make defines dependencies by files. Therefore, the +input/prerequisite and output of every step/rule must be a file. Also +recall that Make will use the modification date of the prerequisite(s) and +target files to see if the target must be re-built or not. Therefore during +the processing, many intermediate files will be created (see the tips +section below on a good strategy to deal with large/huge files).
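As a rough illustration of the timestamp rule described above (not Make's actual implementation), the shell sketch below asks the same question Make asks before running a recipe: is the target missing, or older than any of its prerequisites? The function name `needs_rebuild` and the file names are made up for this example. It also assumes the `-nt` test operator is available; it is in Bash, Dash and most other shells, although POSIX does not require it.

```shell
#!/bin/sh
# needs_rebuild TARGET PREREQUISITE...
# Succeeds (exit status 0) when TARGET is missing or older than at least
# one prerequisite. "needs_rebuild" is a hypothetical helper for this demo.
needs_rebuild() {
    target=$1; shift
    [ -e "$target" ] || return 0
    for prereq in "$@"; do
        [ "$prereq" -nt "$target" ] && return 0
    done
    return 1
}

# Demonstration in a scratch directory, with explicit timestamps.
demo=$(mktemp -d)
touch -t 202001010000 "$demo/input.txt"     # Prerequisite.
touch -t 202001020000 "$demo/result.txt"    # Target, built one day later.
needs_rebuild "$demo/result.txt" "$demo/input.txt" || echo "up to date"

touch -t 202001030000 "$demo/input.txt"     # The prerequisite is edited.
needs_rebuild "$demo/result.txt" "$demo/input.txt" && echo "rebuild needed"
```

The sketch only compares dates, which is why files restored with a newer date (for example by a Git checkout) can trigger rebuilds even when their contents did not change.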

+ +

To keep the source and (intermediate) built files separate, the user must +define a top-level build directory variable (or $(BDIR)) to host all the +intermediate files (you defined it during ./project configure). This +directory doesn't need to be version controlled or even synchronized, or +backed up on other servers: its contents are all products, and can be +easily re-created any time. As you define targets for your new rules, it is +thus important to place them all under sub-directories of $(BDIR). As +mentioned above, you always have fast access to this "build"-directory with +the .build symbolic link. Also, never make any manual change +to the files of the build-directory; just delete them (so they are +re-built).

+ +

In this architecture, we have two types of Makefiles that are loaded into +the top Makefile: configuration-Makefiles (only independent +variables/configurations) and workhorse-Makefiles (Makefiles that +actually contain analysis/processing rules).

+ +

The configuration-Makefiles are those that satisfy these two wildcards: +reproduce/software/config/*.conf (for building the necessary software +when you run ./project configure) and reproduce/analysis/config/*.conf +(for the high-level analysis, when you run ./project make). These +Makefiles don't actually have any rules, they just have values for various +free parameters throughout the configuration or analysis. Open a few of +them to see for yourself. These Makefiles must only contain raw Make +variables (project configurations). By "raw" we mean that the Make +variables in these files must not depend on variables in any other +configuration-Makefile. This is because we don't want to assume any order +in reading them. It is also very important to not define any rule, or +other Make construct, in these configuration-Makefiles.

+ +

Following this rule-of-thumb enables you to set these configure-Makefiles +as a prerequisite to any target that depends on their variable +values. Therefore, if you change any of their values, all targets that +depend on those values will be re-built. This is very convenient as your +project scales up and gets more complex.
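The effect described above can be tried in a scratch directory. The sketch below is not taken from Maneage: `demo.conf`, `demo-threshold` and `result.txt` are made-up names, and it assumes a `make` program is on your `PATH`. It writes a configuration file holding one raw variable, lists that file as a prerequisite of a target, and shows that editing the value causes the target to be re-built.

```shell
#!/bin/sh
# Sketch only: "demo.conf", "demo-threshold" and "result.txt" are made-up names.
set -e
command -v make >/dev/null 2>&1 || { echo "make not found, skipping"; exit 0; }

work=$(mktemp -d)
cd "$work"

# A configuration-Makefile: one raw variable and no rules.
echo 'demo-threshold = 5' > demo.conf

# A rule that lists the configuration file as a prerequisite. The
# "target: prerequisite ; recipe" form keeps the recipe on the rule's own
# line, so this here-document does not depend on literal TAB characters.
cat > Makefile <<'EOF'
include demo.conf
result.txt: demo.conf ; echo "threshold=$(demo-threshold)" > $@
EOF

make -s result.txt                  # Builds result.txt with the current value.
touch -t 202001010000 result.txt    # Back-date the target so the edit below is clearly newer.

echo 'demo-threshold = 7' > demo.conf
make -s result.txt                  # demo.conf is newer now, so the recipe runs again.
cat result.txt
```

After the second `make` call, `result.txt` holds the new value, without any change to the rule itself.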

+ +

The workhorse-Makefiles are those satisfying these wildcards: +reproduce/software/make/*.mk and reproduce/analysis/make/*.mk. They +contain the details of the processing steps (Makefiles containing +rules). Therefore, in this phase order is important, because the +prerequisites of most rules will be the targets of other rules that will be +defined prior to them (not a fixed name like paper.pdf). The lower-level +rules must be imported into Make before the higher-level ones.

+ +

All processing steps are assumed to ultimately (usually after many rules) +end up in some number, image, figure, or table that will be included in the +paper. The writing of these results into the final report/paper is managed +through separate LaTeX files that only contain macros (a name given to a +number/string to be used in the LaTeX source, which will be replaced when +compiling it to the final PDF). So the last target in a workhorse-Makefile +is a .tex file (with the same base-name as the Makefile, but in +$(BDIR)/tex/macros). As a result, if the targets in a workhorse-Makefile +aren't directly a prerequisite of other workhorse-Makefile targets, they +can be a prerequisite of that intermediate LaTeX macro file and thus be +called when necessary. Otherwise, they will be ignored by Make.
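To make the idea of a macro file concrete, here is a minimal shell sketch of the kind of commands such a final recipe might run. It is an illustration under assumptions, not Maneage's own code: `demo-numbers.txt`, `demo.tex` and the macro name `\demosum` are all hypothetical, and a real project would write the file under `$(BDIR)/tex/macros`.

```shell
#!/bin/sh
# Sketch only: "demo-numbers.txt", "demo.tex" and "\demosum" are made-up names.
set -e
tmp=$(mktemp -d)

# Stand-in for an analysis product: three numbers, one per line.
printf '3\n5\n10\n' > "$tmp/demo-numbers.txt"

# Reduce the product to the single value that the paper will quote.
sum=$(awk '{ s += $1 } END { print s }' "$tmp/demo-numbers.txt")

# Write that value as a LaTeX macro. The paper's text would then use
# \demosum instead of a hand-typed number.
printf '\\newcommand{\\demosum}{%s}\n' "$sum" > "$tmp/demo.tex"
cat "$tmp/demo.tex"
```

Because the number in the paper comes from the macro, re-running the analysis with different inputs updates the text without any manual editing.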

+ +

Maneage also has a mode to share the build directory between several +users of a Unix group (when working on large computer clusters). In this +scenario, each user can have their own cloned project source, but share the +large built files between each other. To do this, it is necessary for all +built files to give full permission to group members while not allowing any +other users access to the contents. Therefore the ./project configure and +./project make steps must be called with special conditions which are +managed in the --group option.

+ +

Let's see how this design is implemented. Please open and inspect +top-make.mk as we go along here. The first step (un-commented line) is +to import the local configuration (your answers to the questions of +./project configure). They are defined in the configuration-Makefile +reproduce/software/config/LOCAL.conf which was also built by ./project +configure (based on the LOCAL.conf.in template of the same directory).

+ +

The next non-commented set of the top Makefile defines the ultimate +target of the whole project (paper.pdf). But to avoid mistakes, a sanity +check is necessary to see if Make is being run with the same group settings +as the configure script (for example when the project is configured for +group access using the --group option, but Make isn't). Therefore we +use a Make conditional to define the all target based on the group +permissions.

+ +

Having defined the top/ultimate target, our next step is to include all the +other necessary Makefiles. However, order matters in the importing of +workhorse-Makefiles and each must also have a TeX macro file with the same +base name (without a suffix). Therefore, the next step in the top-level +Makefile is to define the makesrc variable to keep the base names +(without a .mk suffix) of the workhorse-Makefiles that must be imported, +in the proper order.

+ +

Finally, we import all the necessary remaining Makefiles: 1) All the +analysis configuration-Makefiles with a wildcard. 2) The software +configuration-Makefile that contains their version (just in case it's +necessary). 3) All workhorse-Makefiles in the proper order using a Make +foreach loop.
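The ordered import can be previewed with a small experiment. The sketch below is an assumption-laden illustration, not a copy of top-make.mk: the three base names in `makesrc` are only examples, and it assumes GNU Make is on your `PATH` (the `$(info ...)` function is a GNU extension). It prints the list of file names that a `foreach` over `makesrc` expands to, in order, instead of actually including them.

```shell
#!/bin/sh
# Sketch only: the names in "makesrc" are illustrative; see top-make.mk
# in your project for the real list.
set -e
command -v make >/dev/null 2>&1 || { echo "make not found, skipping"; exit 0; }

work=$(mktemp -d)
cd "$work"

# $(foreach ...) expands to one file name per base name, keeping the order
# in which the base names were written. $(info ...) prints that expansion.
cat > Makefile <<'EOF'
makesrc = download verify paper
$(info $(foreach s,$(makesrc),reproduce/analysis/make/$(s).mk))
all: ;
EOF

order=$(make -s)
echo "$order"
```

In a real top-level Makefile the same expansion would be given to `include`, so changing the order of the names in `makesrc` changes the order in which the rules are read.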

+ +

In short, to keep things modular, readable and manageable, follow these +recommendations: 1) Set clear-to-understand names for the +configuration-Makefiles, and workhorse-Makefiles, 2) Only import other +Makefiles from the top Makefile. These will let you know/remember generally +which step you are taking before or after another. Projects will scale up +very fast. Thus if you don't start and continue with a clean and robust +convention like this, in the end it will become very dirty and hard to +manage/understand (even for yourself). As a general rule of thumb, break +your rules into as many logically-similar but independent steps as +possible.

+ +

The reproduce/analysis/make/paper.mk Makefile must be the final Makefile +that is included. This workhorse Makefile ends with the rule to build +paper.pdf (final target of the whole project). If you look in it, you +will notice that this Makefile starts with a rule to create +$(mtexdir)/project.tex (mtexdir is just a shorthand name for +$(BDIR)/tex/macros mentioned before). As you see, the only dependency of +$(mtexdir)/project.tex is $(mtexdir)/verify.tex (which is the last +analysis step: it verifies all the generated results). Therefore, +$(mtexdir)/project.tex is the connection between the +processing/analysis steps of the project, and the steps to build the final +PDF.

+ +

During the research, it often happens that you want to test a step that is +not a prerequisite of any higher-level operation. In such cases, you can +(temporarily) define that processing as a rule in the most relevant +workhorse-Makefile and set its target as a prerequisite of its TeX +macro. If your test gives a promising result and you want to include it in +your research, set it as a prerequisite of other rules and remove it from +the list of prerequisites for the TeX macro file. In fact, this is how a +project is designed to grow in this framework.

+ +

File modification dates (meta data)

+ +

While Git does an excellent job at keeping a history of the contents of +files, it makes no effort in keeping the file meta data, and in particular +the dates of files. Therefore when you check out a different branch, +files that are re-written by Git will have a newer date than the other +project files. However, file dates are important in the current design of +Maneage: Make checks the dates of the prerequisite files and target files +to see if the target should be re-built.

+ +

To fix this problem, for Maneage we use a forked version of +Metastore. Metastore uses +a binary database file (which is called .file-metadata) to keep the +modification dates of all the files under version control. This file is +also under version control, but is hidden (because it shouldn't be modified +by hand). During the project's configuration, Maneage installs two Git hooks +to run Metastore 1) before making a commit to update its database with the +file dates in a branch, and 2) after doing a checkout, to set the file dates back to +what they were.

+ +

In practice, Metastore should work almost fully invisibly within your +project. The only place you might notice its presence is that you'll see +.file-metadata in the list of modified/staged files (commonly after +merging your branches). Since it's a binary file, Git also won't show you +the changed contents. In a merge, you can simply accept any changes with +git add -u. But if Git is telling you that it has changed without a merge +(for example if you started a commit, but canceled it in the middle), you +can just do git checkout .file-metadata and set it back to its original +state.

+ +

Summary

+ +

Based on the explanation above, some major design points you should have in +mind are listed below.

+ + + +

Customization checklist

+ +

Take the following steps to fully customize Maneage for your research +project. After finishing the list, be sure to run ./project configure and +./project make to see if everything works correctly. If you notice anything +missing or any incorrect part (probably a change that has not been +explained here), please let us know to correct it.

+ +

As described above, the concept of reproducibility (during a project) +heavily relies on version +control. Currently Maneage +uses Git as its main version control system. If you are not already +familiar with Git, please read the first three chapters of the ProGit +book which provides a wonderful practical +understanding of the basics. You can read later chapters as you get more +advanced in later stages of your work.

+ +

First custom commit

+ +
    +
  1. Get this repository and its history (if you don't already have it): + Arguably the easiest way to start is to clone Maneage and prepare for + your customizations as shown below. After the cloning first you rename + the default origin remote server to specify that this is Maneage's + remote server. This will allow you to use the conventional origin + name for your own project as shown in the next steps. Second, you will + create and go into the conventional master branch to start + committing in your project later.

    + +

    shell + $ git clone https://git.maneage.org/project.git # Clone/copy the project and its history. + $ mv project my-project # Change the name to your project's name. + $ cd my-project # Go into the cloned directory. + $ git remote rename origin origin-maneage # Rename current/only remote to "origin-maneage". + $ git checkout -b master # Create and enter your own "master" branch. + $ pwd # Just to confirm where you are. +

  2. +
  3. Prepare to build project: The ./project configure command of the + next step will build the different software packages within the + "build" directory (that you will specify). Nothing else on your system + will be touched. However, since it takes long, it is useful to see + what is being built at every instant (it's almost impossible to tell + from the torrent of commands that are produced!). So open another + terminal on your desktop and navigate to the same project directory + that you cloned (output of last command above). Then run the following + command. Once every second, this command will just print the date + (possibly followed by a non-existent directory notice). But as soon as + the next step starts building software, you'll see the names of + software get printed as they are being built. Once any software is + installed in the project build directory it will be removed from this listing. Again, + don't worry, nothing will be installed outside the build directory.

    + +

    shell + # On another terminal (go to top project source directory, last command above) + $ ./project --check-config +

  4. +
  5. Test Maneage: Before making any changes, it is important to test it + and see if everything works properly with the commands below. If there + is any problem in the ./project configure or ./project make steps, + please contact us to fix the problem before continuing. Since the + building of dependencies in configuration can take long, you can take + the next few steps (editing the files) while it's working (they don't + affect the configuration). After ./project make is finished, open + paper.pdf. If it looks fine, you are ready to start customizing + Maneage for your project. But before that, clean all the extra Maneage + outputs with make clean as shown below.

    + +

    ```shell + $ ./project configure # Build the project's software environment (can take an hour or so). + $ ./project make # Do the processing and build paper (just a simple demo).

    + +

    # Open 'paper.pdf' and see if everything is ok. + ```

  6. +
  7. Setup the remote: You can use any hosting + facility + that supports Git to keep an online copy of your project's version + controlled history. We recommend GitLab because + it is more ethical (although not + perfect), + and later you can also host GitLab on your own server. Anyway, create + an account in your favorite hosting facility (if you don't already + have one), and define a new project there. Please make sure the newly + created project is empty (some services ask to include a README in + a new project which is bad in this scenario, and will not allow you to + push to it). It will give you a URL (usually starting with git@ and + ending in .git), put this URL in place of XXXXXXXXXX in the first + command below. With the second command, "push" your master branch to + your origin remote, and (with the --set-upstream option) set them + to track/follow each other. However, the maneage branch is currently + tracking/following your origin-maneage remote (automatically set + when you cloned Maneage). So when pushing the maneage branch to your + origin remote, you shouldn't use --set-upstream. With the last + command, you can actually check this (which local and remote branches + are tracking each other).

    + +

    shell + git remote add origin XXXXXXXXXX # Newly created repo is now called 'origin'. + git push --set-upstream origin master # Push 'master' branch to 'origin' (with tracking). + git push origin maneage # Push 'maneage' branch to 'origin' (no tracking). +

  8. +
  9. Title, short description and author: The title and basic + information of your project's output PDF paper should be added in + paper.tex. You should see the relevant place in the preamble (prior + to \begin{document}). After you are done, run the ./project make + command again to see your changes in the final PDF, and make sure that + your changes don't cause a crash in LaTeX. Of course, if you use a + different LaTeX package/style for managing the title and authors (in + particular a specific journal's style), please feel free to use + your own methods after finishing this checklist and doing your first + commit.

  10. +
  11. Delete dummy parts: Maneage contains some parts that are only for + the initial/test run, mainly as a demonstration of important steps, + which you can use as a reference in your own project. But they + are not for any real analysis, so you should remove these parts as + described below:

    + +
      +
    • paper.tex: 1) Delete the text of the abstract (from +\includeabstract{ to \vspace{0.25cm}) and write your own (a +single sentence can be enough now, you can complete it later). 2) +Add some keywords under it in the keywords part. 3) Delete +everything between %% Start of main body. and %% End of main +body.. 4) Remove the notice in the "Acknowledgments" section (in +\new{}) and acknowledge your funding sources (this can also be +done later). Just don't delete the existing acknowledgment +statement: Maneage is possible thanks to funding from several +grants. Since Maneage is being used in your work, it is necessary to +acknowledge them in your work also.

    • +
    • reproduce/analysis/make/top-make.mk: Delete the delete-me line +in the makesrc definition. Just make sure there is no empty line +between the download \ and verify \ lines (they should be +directly under each other).

    • +
    • reproduce/analysis/make/verify.mk: In the final recipe, under the +commented line Verify TeX macros, remove the full line that +contains delete-me, and set the value of s in the line for +download to XXXXX (any temporary string, you'll fix it in the +end of your project, when it's complete).

    • +
    • Delete all delete-me* files in the following directories:

      + +

      shell +$ rm tex/src/delete-me* +$ rm reproduce/analysis/make/delete-me* +$ rm reproduce/analysis/config/delete-me* +

    • +
    • Disable verification of outputs by removing the yes from +reproduce/analysis/config/verify-outputs.conf. Later, when you are +ready to submit your paper, or publish the dataset, activate +verification and make the proper corrections in this file (described +under the "Other basic customizations" section below). This is a +critical step and only takes a few minutes when your project is +finished. So DON'T FORGET to activate it in the end.

    • +
    • Re-make the project (after a cleaning) to make sure you haven't +introduced any errors.

      + +

      shell +$ ./project make clean +$ ./project make +

    • +
  12. +
  13. Don't merge some files in future updates: As described below, you + can later update your infrastructure (for example to fix bugs) by + merging your master branch with maneage. For files that you have + created in your own branch, there will be no problem. However if you + modify an existing Maneage file for your project, next time it's + updated on maneage you'll have an annoying conflict. The commands + below show how to fix this future problem. With them, you can + configure Git to ignore the changes in maneage for some of the files + you have already edited and deleted above (and will edit below). Note + that only the first echo command has a > (to write over the file), + the rest are >> (to append to it). If you want to avoid any other + set of files to be imported from Maneage into your project's branch, + you can follow a similar strategy. We recommend only doing it when you + encounter the same conflict in more than one merge and there is no + other change in that file. Also, don't add core Maneage Makefiles, + otherwise Maneage can break on the next run.

    + +

    shell + $ echo "paper.tex merge=ours" > .gitattributes + $ echo "tex/src/delete-me.mk merge=ours" >> .gitattributes + $ echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes + $ echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes + $ echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes + $ echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes + $ git add .gitattributes +
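One detail worth checking, based on the Pro Git book's chapter on Git attributes rather than on anything stated here: a `merge=ours` attribute names a merge driver called `ours`, and Git only uses it if such a driver is defined in the Git configuration. Whether your version of Maneage already defines it for you is an assumption to verify, for example with `git config --get merge.ours.driver` inside your project. The sketch below shows the idea in a throw-away repository, so it does not touch your real project.

```shell
#!/bin/sh
# Sketch only: run in a scratch repository so the real project is untouched.
set -e
command -v git >/dev/null 2>&1 || { echo "git not found, skipping"; exit 0; }

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Same kind of attribute line as in the commands above.
echo "paper.tex merge=ours" > .gitattributes

# Define a merge driver named "ours". The command "true" exits successfully
# without touching the file, so the current branch's version is kept.
git config merge.ours.driver true

# Confirm that Git resolves the attribute and the driver as intended.
attr=$(git check-attr merge -- paper.tex)
driver=$(git config --get merge.ours.driver)
echo "$attr"
echo "driver: $driver"
```

If the second query prints nothing in your own project, the attribute lines alone are unlikely to prevent the conflicts described in this step.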

  14. +
  15. Copyright and License notice: It is necessary that all the + "copyright-able" files in your project (those larger than 10 lines) + have a copyright and license notice. Please take a moment to look at + several existing files to see a few examples. The copyright notice is + usually close to the start of the file, it is the line starting with + Copyright (C) and containing a year and the author's name (like the + examples below). The License notice is a short description of the + copyright license, usually one or two paragraphs with a URL to the + full license. Don't forget to add these two notices to any new + file you add in your project (you can just copy-and-paste). When you + modify an existing Maneage file (which already has the notices), just + add a copyright notice in your name under the existing one(s), like + the line with capital letters below. To start with, add this line with + your name and email address to paper.tex, + tex/src/preamble-header.tex, reproduce/analysis/make/top-make.mk, + and generally, all the files you modified in the previous step.

    + +

    + Copyright (C) 2018-2020 Existing Name <existing@email.address> + Copyright (C) 2020 YOUR NAME <YOUR@EMAIL.ADDRESS> +

  16. +
  17. Configure Git for first time: If this is the first time you are + running Git on this system, then you have to configure it with some + basic information in order to have essential information in the commit + messages (ignore this step if you have already done it). Git will + include your name and e-mail address information in each commit. You + can also specify your favorite text editor for writing the commit + message (emacs, vim, nano, etc.).

    + +

    shell + $ git config --global user.name "YourName YourSurname" + $ git config --global user.email your-email@example.com + $ git config --global core.editor nano +

  18. +
  19. Your first commit: You have already made some small and basic + changes in the steps above and you are in your project's master + branch. So, you can officially make your first commit in your + project's history and push it. But before that, you need to make sure + that there are no problems in the project. It is a good habit to + always re-build the system before a commit to be sure it works as + expected.

    + +

    shell + $ git status # See which files you have changed. + $ git diff # Check the lines you have added/changed. + $ ./project make # Make sure everything builds successfully. + $ git add -u # Put all tracked changes in staging area. + $ git status # Make sure everything is fine. + $ git diff --cached # Confirm all the changes that will be committed. + $ git commit # Your first commit: put a good description! + $ git push # Push your commit to your remote. +

  20. +
  21. Start your exciting research: You are now ready to add flesh and + blood to this raw skeleton by further modifying and adding your + exciting research steps. You can use the "published works" section in + the introduction (above) as some fully working models to learn + from. Also, don't hesitate to contact us if you have any + questions.

  22. +
+ +
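The .gitattributes step in the checklist above can be sanity-checked with `git check-attr`, which reports the attribute Git will apply to a given path. The following is only a minimal sketch: it works in a throw-away repository created with `mktemp`, and the `merge.ours.driver` setting is an assumption that is not part of the checklist. Git's documentation describes `merge=NAME` as selecting a custom merge driver, so a driver named `ours` may need to be defined for the attribute to take effect in your setup.

```shell
# Work in a throw-away repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Same pattern as the checklist: '>' creates the file, '>>' appends to it.
echo "paper.tex merge=ours" > .gitattributes
echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes

# ASSUMPTION (not in the checklist): define a merge driver called "ours"
# that always succeeds and leaves the current branch's version in place.
git config merge.ours.driver true

# Ask Git which merge attribute applies to each path.
attr_paper=$(git check-attr merge -- paper.tex)
attr_other=$(git check-attr merge -- README.md)
echo "$attr_paper"    # paper.tex: merge: ours
echo "$attr_other"    # README.md: merge: unspecified
```

A path reported as `unspecified` will be merged normally, so this is a quick way to confirm that only the intended files are excluded from future updates.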

Other basic customizations

+ + + +

Tips for designing your project

+ +

The following is a list of design points, tips, or recommendations that +have been learned after some experience with this type of project +management. Please don't hesitate to share with us any experience you gain +after using it. In this way, we can add it here (giving full credit) +for the benefit of others.

+ +

+ +

Future improvements

+ +

This is an evolving project and as time goes on, it will evolve and become +more robust. Some of the most prominent improvements we plan to implement in the +future are listed below; please join us if you are interested.

+ +

Package management

+ +

It is important to have control of the environment of the project. Maneage +currently builds the higher-level programs (for example GNU Bash, GNU Make, +GNU AWK and domain-specific software) it needs, then sets PATH so the +analysis is done only with the project's built software. But currently the +configuration of each program is in the Makefile rules that build it. This +is not good because a change in the build configuration does not +automatically cause a re-build. Also, each separate project on a system +needs to have its own built tools (that can waste a lot of space).

+ +

A good solution is based on the Nix package +manager: a separate file is present for +each software package, containing all the necessary info to build it (including its +URL, its tarball MD5 hash, dependencies, configuration parameters, build +steps, etc.). Using this file, a script can automatically generate the +Make rules to download, build and install the program and its dependencies +(along with the dependencies of those dependencies, and so on).

+ +

All the software is installed in a "store". Each installed file (library +or executable) is prefixed by a hash of this configuration (and the OS +architecture) and the standard program name. For example (from the Nix +webpage):

+ +

+/nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/ +

+ +

The important thing is that the "store" is not in the project's search +path. After the complete installation of the software, symbolic links are +made to populate each project's program and library search paths without a +hash. This hash will be unique to that particular software and its +particular configuration. So simply by searching for this hash in the +installed directory, we can find the installed files of that software to +generate the links.

+ +

This scenario has several advantages: 1) a change in a software's build +configuration triggers a rebuild. 2) a single "store" can be used in many +projects, thus saving space and configuration time for new projects (that +commonly have large overlaps in lower-level programs).
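The store-and-links scheme described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not an existing Maneage feature: the `hello.conf` description file, its fields, the `store` and `project-a` directories and the fake "build" are all hypothetical, and `sha256sum` from GNU Coreutils is assumed to be available.

```shell
# Hypothetical sketch: none of these names or paths exist in Maneage.
work=$(mktemp -d)
store="$work/store"          # Shared "store": never put in a search path.
project="$work/project-a"    # One project's own installation directory.
mkdir -p "$store" "$project/bin"

# Hypothetical per-software description file (build configuration).
cat > "$work/hello.conf" <<'EOF'
name = hello
version = 1.0
configure = --disable-nls
EOF

# Hash the configuration together with the machine architecture.
hash=$( { cat "$work/hello.conf"; uname -m; } | sha256sum | cut -c1-32 )
prefix="$store/$hash-hello-1.0"

# "Build" only if this exact configuration is not in the store yet.
# A real build would run configure/make; here a tiny script stands in.
if [ ! -d "$prefix" ]; then
    mkdir -p "$prefix/bin"
    printf '#!/bin/sh\necho hello\n' > "$prefix/bin/hello"
    chmod +x "$prefix/bin/hello"
fi

# Populate the project's search path with hash-free symbolic links.
for f in "$prefix"/bin/*; do
    ln -sf "$f" "$project/bin/$(basename "$f")"
done

# The project only ever sees its own bin directory.
PATH="$project/bin:$PATH" hello
```

Because the store directory name contains the hash, editing `hello.conf` yields a different directory and therefore a rebuild, while a second project with the same configuration would find the existing directory and only create its own links.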

+ +

Appendix: Necessity of exact reproduction in scientific research

+ +

In case the link above is +not accessible at the time of reading, here is a copy of the introduction +of that link, describing the necessity for a reproducible project like this +(copied on February 7th, 2018):

+ +

The most important element of a "scientific" statement/result is the fact +that others should be able to falsify it. The tsunami of data that has +engulfed astronomers in the last two decades, combined with faster +processors and faster internet connections, has made it much easier to +obtain a result. However, these factors have also increased the complexity +of a scientific analysis, such that it is no longer possible to describe +all the steps of an analysis in the published paper. Citing this +difficulty, many authors settle for describing the generalities of their +analysis in their papers.

+ +

However, it is impossible to falsify (or even study) a result if you can't +exactly reproduce it. The complexity of modern science makes it vitally +important to exactly reproduce the final result, because even a small +deviation can be due to many different parts of an analysis. Nature is +already a black box which we are trying so hard to comprehend. Not letting +other scientists see the exact steps taken to reach a result, or not +allowing them to modify it (do experiments on it) is a self-imposed black +box, which only exacerbates our ignorance.

+ +

Other scientists should be able to reproduce, check and experiment on the +results of anything that is to carry the "scientific" label. Any result +that is not reproducible (due to incomplete information by the author) is +not scientific: the readers have to have faith in the subjective experience +of the authors in the very important choice of configuration values and +order of operations: this is contrary to the scientific spirit.

+ +

Copyright information

+ +

This file is part of Maneage's core: https://git.maneage.org/project.git

+ +

Maneage is free software: you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free +Software Foundation, either version 3 of the License, or (at your option) +any later version.

+ +

Maneage is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS +FOR A PARTICULAR PURPOSE. See the GNU General Public License for more +details.

+ +

You should have received a copy of the GNU General Public License along +with Maneage. If not, see https://www.gnu.org/licenses/.
