Maneage: managing data lineage

Copyright (C) 2018-2020 Mohammad Akhlaghi mohammad@akhlaghi.org
Copyright (C) 2020 Raul Infante-Sainz infantesainz@gmail.com
See the end of the file for license conditions.

Maneage is a fully working template for doing reproducible research (or writing a reproducible paper) as defined in the link below. If the link below is not accessible at the time of reading, please see the appendix at the end of this file for a portion of its introduction. Some slides are also available to help demonstrate the concept implemented here.

http://akhlaghi.org/reproducible-science.html

Maneage is created with the aim of supporting reproducible research by making it easy to start a project in this framework. As shown below, it is very easy to customize Maneage for any particular (research) project and expand it as it starts and evolves. It can be run with no modification (as described in README.md) as a demonstration and customized for use in any project as fully described below.

A project designed using Maneage will download and build all the necessary libraries and programs for working in a closed environment (highly independent of the host operating system) with fixed versions of the necessary dependencies. The tarballs for building the local environment are also collected in a separate repository. The final output of the project is a paper. Notice the last paragraph of the Acknowledgments, where all the necessary software packages are listed with their versions.

Below, we start with a discussion of why Make was chosen as the high-level language/framework for project management and how to learn and master Make easily (and freely). The general architecture and design of the project is then discussed to help you navigate the files and their contents. This is followed by a checklist for the easy/fast customization of Maneage to your exciting research. We continue with some tips and guidelines on how to manage or extend your project as it grows based on our experiences with it so far. The main body concludes with a description of possible future improvements that are planned for Maneage (but not yet implemented). As discussed above, we end with a short introduction on the necessity of reproducible science in the appendix.

Please don't forget to share your thoughts, suggestions and criticisms. Maintaining and designing Maneage is itself a separate project, so please join us if you are interested. Once it is mature enough, we will describe it in a paper (written by all contributors) for a formal introduction to the community.

Why Make?

When batch processing is necessary (no manual intervention, as in a reproducible project), shell scripts are usually the first solution that comes to mind. However, the inherent complexity and non-linearity of progress in a scientific project (where experimentation is key) make it hard to manage the script(s) as the project evolves. For example, a script will start from the top every time it is run. So if you have already completed 90% of a research project and want to run the remaining 10% that you have newly added, you have to run the whole script from the start again. Only then will you see the effects of the new steps (to find possible errors, better solutions, etc.).

It is possible to manually ignore/comment out parts of a script to only run a particular part. However, such checks/comments only add to the complexity of the script and discourage you from playing with or changing an already completed part of the project when an idea suddenly comes up. This approach is also prone to very serious bugs in the end (when trying to reproduce from scratch). Such bugs are very hard to notice during the work and frustrating to find in the end.

The Make paradigm, on the other hand, starts from the end: the final target. It builds a dependency tree internally, and finds where it should start each time the project is run. Therefore, in the scenario above, a researcher that has just added the final 10% of steps of her research to her Makefile, will only have to run those extra steps. With Make, it is also trivial to change the processing of any intermediate (already written) rule (or step) in the middle of an already written analysis: the next time Make is run, only rules that are affected by the changes/additions will be re-run, not the whole analysis/project.
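The behavior described above can be tried in a throwaway directory. The following is only a minimal sketch: the file names (raw.txt, step1.txt, final.txt, log.txt) are hypothetical and not part of Maneage, and it assumes GNU Make 3.82 or later is on your PATH (the .RECIPEPREFIX setting is used only to avoid literal TAB characters in this text).

```shell
# Throwaway sketch: all file names here are hypothetical, not Maneage files.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A two-step chain: raw.txt -> step1.txt -> final.txt. Each recipe also
# appends its name to 'log.txt' so we can see which steps actually ran.
cat > Makefile <<'EOF'
.RECIPEPREFIX = >
final.txt: step1.txt
> tr a-z A-Z < step1.txt > final.txt
> echo final >> log.txt
step1.txt: raw.txt
> sort raw.txt > step1.txt
> echo step1 >> log.txt
EOF

printf 'pear\napple\n' > raw.txt
make -s          # First run: both steps are executed.
make -s          # Second run: everything is up to date, no recipe runs.
rm final.txt     # Pretend only the last step needs to be redone.
make -s          # Third run: only the 'final' step is executed again.
cat log.txt      # One line per recipe that actually ran.
```

Make starts from the final target each time and only re-runs the recipes whose targets are missing or older than their prerequisites.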

This greatly speeds up the processing (enabling creative changes), while keeping all the dependencies clearly documented (as part of the Make language), and most importantly, enabling full reproducibility from scratch with no changes in the project code that was working during the research. This will allow robust results and let the scientists get to what they do best: experiment and be critical of the methods/analysis without having to waste energy and time on technical problems that come up as a result of that experimentation in scripts.

Since the dependencies are clearly demarcated in Make, it can identify independent steps and run them in parallel. This further speeds up the processing. Make was designed for this purpose. It is how huge projects like all Unix-like operating systems (including GNU/Linux or Mac OS operating systems) and their core components are built. Therefore, Make is a highly mature paradigm/system with robust and highly efficient implementations in various operating systems perfectly suited for a complex non-linear research project.

Make is a small language with the aim of defining rules containing targets, prerequisites and recipes. It comes with some nice features like functions or automatic-variables to greatly facilitate the management of text (filenames for example) or any of those constructs. For a more detailed (yet still general) introduction see the article on Wikipedia:

https://en.wikipedia.org/wiki/Make_(software)
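To make these terms concrete, here is a small sketch of a rule with a target, a prerequisite and a recipe, using one function (patsubst) and two automatic variables ($< and $@). The file names are hypothetical, and GNU Make 3.82 or later is assumed (.RECIPEPREFIX is used only to avoid literal TAB characters in this text).

```shell
# Throwaway sketch with hypothetical file names.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"

cat > Makefile <<'EOF'
.RECIPEPREFIX = >
# Function: derive the output names from the input names.
inputs  = a.txt b.txt
outputs = $(patsubst %.txt,%.count,$(inputs))

# Target 'all' has the outputs as prerequisites and no recipe of its own.
.PHONY: all
all: $(outputs)

# Pattern rule: '$<' is the first prerequisite, '$@' is the target.
%.count: %.txt
> wc -w < $< | tr -d ' ' > $@
EOF

printf 'one two three\n' > a.txt
printf 'four five\n'     > b.txt
make -s
cat a.count b.count      # Word counts of a.txt and b.txt.
```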

Make is more than 40 years old and still evolving, so many implementations of it exist. They differ only in the extra features they add over the standard definition (which all of them share). Maneage is primarily written in GNU Make, the most common, most actively developed, and most advanced implementation. Note that Maneage downloads, builds, internally installs, and uses its own dependencies (including GNU Make), so you don't need to have it installed before you try it out.

How can I learn Make?

The GNU Make book/manual (links below) is arguably the best place to learn Make. It is an excellent book, and its first few chapters are deliberately non-technical to help you get started easily. It is freely available and always up to date with the current GNU Make release. It also clearly explains which features are specific to GNU Make and which are general in all implementations. So the first few chapters regarding the generalities are useful for all implementations.

The first link below points to the GNU Make manual in various formats and in the second, you can download it in PDF (which may be easier for a first time reading).

https://www.gnu.org/software/make/manual/

https://www.gnu.org/software/make/manual/make.pdf

If you use GNU Make, you also have the whole GNU Make manual on the command-line with the following command (you can come out of the "Info" environment by pressing q).


info make
                

If you aren't familiar with the Info documentation format, we strongly recommend running $ info info and reading along. In less than an hour, you will become highly proficient in it (it is very simple and has a great manual for itself). Info greatly simplifies your access (without taking your hands off the keyboard!) to many manuals that are installed on your system, allowing you to be much more efficient as you work. If you use the GNU Emacs text editor (or any of its variants), you also have access to all Info manuals while you are writing your projects (again, without taking your hands off the keyboard!).

Published works using Maneage

The list below shows some of the works that have already been published with (earlier versions of) Maneage. Previously it was simply called "Reproducible paper template". Note that Maneage is evolving, so some details may be different in them. The more recent ones can be used as a good working example.

Citation

A paper to fully describe Maneage has been submitted. Until then, if you used it in your work, please cite the paper that implemented its first version: Akhlaghi & Ichikawa (2015, ApJS, 220, 1).

Also, when your paper is published, don't forget to add a notice in your own paper (in coordination with the publishing editor) that the paper is fully reproducible, and possibly add a sentence or paragraph at the end of the paper briefly describing the concept. This will help spread the word and encourage other scientists to also manage and publish their projects in a reproducible manner.

Project architecture

In order to customize Maneage to your research, it is important to first understand its architecture so you can navigate your way in the directories and understand how to implement your research project within its framework: where to add new files and which existing files to modify for what purpose. But if this is the first time you are using Maneage, before reading this theoretical discussion, please run Maneage once from scratch without any changes (described in README.md). You will see how it works (note that the configure step builds all necessary software, so it can take long, but you can continue reading while it's working).

The project has two top-level directories: reproduce and tex. reproduce hosts all the software building and analysis steps. tex contains all the final paper's components to be compiled into a PDF using LaTeX.

The reproduce directory has two sub-directories: software and analysis. As the names suggest, the former contains all the instructions to download, build and install (independent of the host operating system) the necessary software (these are called by the ./project configure command). The latter contains instructions on how to use that software to do your project's analysis.

After it finishes, ./project configure will create the following symbolic links in the project's top source directory: .build which points to the top build directory and .local for easy access to the custom built software installation directory. With these you can easily access the build directory and project-specific software from your top source directory. For example if you run .local/bin/ls you will be using the ls of Maneage, which is probably different from your system's ls (run them both with --version to check).

Once the project is configured for your system, ./project make will do the basic preparations and run the project's analysis with the custom versions of the software. The project script is just a wrapper, and with the make argument, it will call top-prepare.mk and then top-make.mk (both are in the reproduce/analysis/make directory).

In terms of organization, top-prepare.mk and top-make.mk have a nearly identical design, with only minor differences. So, let's continue the description of Maneage's architecture with top-make.mk. Once you understand that, you'll also understand top-prepare.mk. These very high-level files are relatively short and heavily commented, so hopefully the descriptions in each comment will be enough to understand the general details. As you read this section, please also look at the contents of the mentioned files and directories to fully understand what is going on.

Before starting to look into top-make.mk, it is important to recall that Make defines dependencies by files. Therefore, the input/prerequisite and output of every step/rule must be a file. Also recall that Make will use the modification date of the prerequisite(s) and target files to see if the target must be re-built or not. Therefore during the processing, many intermediate files will be created (see the tips section below on a good strategy to deal with large/huge files).

To keep the source and (intermediate) built files separate, the user must define a top-level build directory variable (or $(BDIR)) to host all the intermediate files (you defined it during ./project configure). This directory doesn't need to be version controlled, synchronized, or backed up on other servers: its contents are all products, and can be easily re-created any time. As you define targets for your new rules, it is thus important to place them all under sub-directories of $(BDIR). As mentioned above, you always have fast access to this "build"-directory with the .build symbolic link. Also, be careful never to make any manual change in the files of the build-directory; just delete them (so they are re-built).

In this architecture, we have two types of Makefiles that are loaded into the top Makefile: configuration-Makefiles (only independent variables/configurations) and workhorse-Makefiles (Makefiles that actually contain analysis/processing rules).

The configuration-Makefiles are those that satisfy these two wildcards: reproduce/software/config/*.conf (for building the necessary software when you run ./project configure) and reproduce/analysis/config/*.conf (for the high-level analysis, when you run ./project make). These Makefiles don't actually have any rules, they just have values for various free parameters throughout the configuration or analysis. Open a few of them to see for yourself. These Makefiles must only contain raw Make variables (project configurations). By "raw" we mean that the Make variables in these files must not depend on variables in any other configuration-Makefile. This is because we don't want to assume any order in reading them. It is also very important to not define any rule, or other Make construct, in these configuration-Makefiles.

Following this rule-of-thumb enables you to set these configuration-Makefiles as a prerequisite to any target that depends on their variable values. Therefore, if you change any of their values, all targets that depend on those values will be re-built. This is very convenient as your project scales up and gets more complex.
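As a rough illustration of this idea (not the actual Maneage files), the sketch below uses a hypothetical configuration-Makefile config/demo.conf that holds a single raw variable, and a rule that lists that file as a prerequisite of a target under a build directory. GNU Make 3.82 or later is assumed (.RECIPEPREFIX is used only to avoid literal TAB characters in this text).

```shell
# Throwaway sketch: 'config/demo.conf' and 'demo-multiple' are hypothetical.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p config build

# Configuration-Makefile: only a raw variable, no rules.
echo 'demo-multiple = 2' > config/demo.conf

cat > Makefile <<'EOF'
.RECIPEPREFIX = >
BDIR = build
include config/demo.conf

# The configuration file is a prerequisite, so editing it re-runs this rule.
$(BDIR)/result.txt: config/demo.conf
> echo $$(( 21 * $(demo-multiple) )) > $@
EOF

make -s; first=$(cat build/result.txt)
sleep 1      # Make sure the edited file gets a newer date on any filesystem.
echo 'demo-multiple = 3' > config/demo.conf
make -s; second=$(cat build/result.txt)
echo "$first $second"
```

Only the configuration value was edited between the two runs; the rule itself was not touched, yet the target was rebuilt with the new value.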

The workhorse-Makefiles are those satisfying these wildcards: reproduce/software/make/*.mk and reproduce/analysis/make/*.mk. They contain the details of the processing steps (Makefiles containing rules). Therefore, in this phase order is important, because the prerequisites of most rules will be the targets of other rules that are defined prior to them (not a fixed name like paper.pdf). The lower-level rules must be imported into Make before the higher-level ones.

All processing steps are assumed to ultimately (usually after many rules) end up in some number, image, figure, or table that will be included in the paper. The writing of these results into the final report/paper is managed through separate LaTeX files that only contain macros (a name given to a number/string to be used in the LaTeX source, which will be replaced when compiling it to the final PDF). So the last target in a workhorse-Makefile is a .tex file (with the same base-name as the Makefile, but in $(BDIR)/tex/macros). As a result, if the targets in a workhorse-Makefile aren't directly a prerequisite of other workhorse-Makefile targets, they can be a prerequisite of that intermediate LaTeX macro file and thus be called when necessary. Otherwise, they will be ignored by Make.
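The sketch below shows what such a final macro target could look like. It is an assumption-laden illustration: the macro name \demosum, the files input.txt and sum.txt, and the use of \newcommand are chosen here for the example and are not taken from Maneage's own Makefiles. GNU Make 3.82 or later is assumed (.RECIPEPREFIX is used only to avoid literal TAB characters in this text).

```shell
# Throwaway sketch: file and macro names are hypothetical.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '1\n2\n6\n' > input.txt

cat > Makefile <<'EOF'
.RECIPEPREFIX = >
BDIR    = build
mtexdir = $(BDIR)/tex/macros

# Final target of this (hypothetical) 'demo.mk': a LaTeX macro file.
$(mtexdir)/demo.tex: $(BDIR)/sum.txt
> mkdir -p $(mtexdir)
> printf '\\newcommand{\\demosum}{%s}\n' "$$(cat $<)" > $@

# Lower-level analysis step whose result ends up in the macro.
$(BDIR)/sum.txt: input.txt
> mkdir -p $(BDIR)
> awk '{s += $$1} END {print s}' $< > $@
EOF

make -s
cat build/tex/macros/demo.tex
```

Because the analysis result is a prerequisite of the macro file, asking Make for the macro file is enough to trigger the analysis step.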

Maneage also has a mode to share the build directory between several users of a Unix group (when working on large computer clusters). In this scenario, each user can have their own cloned project source, but share the large built files between each other. To do this, it is necessary for all built files to give full permission to group members while not allowing any other users access to the contents. Therefore the ./project configure and ./project make steps must be called with special conditions which are managed in the --group option.

Let's see how this design is implemented. Please open and inspect top-make.mk as we go along here. The first step (un-commented line) is to import the local configuration (your answers to the questions of ./project configure). They are defined in the configuration-Makefile reproduce/software/config/LOCAL.conf which was also built by ./project configure (based on the LOCAL.conf.in template of the same directory).

The next non-commented part of the top Makefile defines the ultimate target of the whole project (paper.pdf). But to avoid mistakes, a sanity check is necessary to see if Make is being run with the same group settings as the configure script (for example when the project is configured for group access with the --group option, but Make isn't run with it). Therefore we use a Make conditional to define the all target based on the group permissions.
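The general shape of such a conditional can be sketched as follows. This is not the code in top-make.mk: the variable names configured_group and run_group and the target paper.txt are hypothetical stand-ins. GNU Make 3.82 or later is assumed (.RECIPEPREFIX is used only to avoid literal TAB characters in this text).

```shell
# Throwaway sketch: variable and file names are hypothetical.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"

cat > Makefile <<'EOF'
.RECIPEPREFIX = >
# 'configured_group' stands in for the value saved at configure time and
# 'run_group' for the value given when running Make.
configured_group = astro
ifeq ($(run_group),$(configured_group))
all: paper.txt
else
all:
> @echo "Group mismatch: configured for '$(configured_group)'"; exit 1
endif

paper.txt:
> echo done > $@
EOF

make -s run_group=astro      # Same group: the real target is built.
if make -s run_group=other > mismatch.log 2>&1
then status=ok
else status=failed           # Different group: Make stops with an error.
fi
echo "$status"
```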

Having defined the top/ultimate target, our next step is to include all the other necessary Makefiles. However, order matters in the importing of workhorse-Makefiles and each must also have a TeX macro file with the same base name (without a suffix). Therefore, the next step in the top-level Makefile is to define the makesrc variable to keep the base names (without a .mk suffix) of the workhorse-Makefiles that must be imported, in the proper order.

Finally, we import all the necessary remaining Makefiles: 1) All the analysis configuration-Makefiles with a wildcard. 2) The software configuration-Makefile that contains their version (just in case it's necessary). 3) All workhorse-Makefiles in the proper order using a Make foreach loop.
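The ordered import with a foreach loop can be sketched as below. Only the makesrc variable name comes from the description above; the directory make/ and the files first.mk and second.mk are hypothetical, and the real top-make.mk may differ in its details. GNU Make 3.82 or later is assumed (.RECIPEPREFIX is used only to avoid literal TAB characters in this text).

```shell
# Throwaway sketch: directory and file names are hypothetical.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p make

# Two workhorse-Makefiles; 'second' uses the target of 'first'.
cat > make/first.mk <<'EOF'
first.txt:
> echo one > $@
EOF
cat > make/second.mk <<'EOF'
second.txt: first.txt
> sed 's/one/two/' $< > $@
EOF

cat > Makefile <<'EOF'
.RECIPEPREFIX = >
all: second.txt

# Base names (no '.mk' suffix) of the Makefiles to import, in order.
makesrc = first \
          second

include $(foreach s,$(makesrc),make/$(s).mk)
EOF

make -s
cat second.txt
```

Keeping the list of base names in one variable means the import order is visible in a single place, and the same names can be reused elsewhere (for example to derive the matching macro file names).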

In short, to keep things modular, readable and manageable, follow these recommendations: 1) Set clear-to-understand names for the configuration-Makefiles and workhorse-Makefiles. 2) Only import other Makefiles from the top Makefile. These will let you know/remember generally which step you are taking before or after another. Projects will scale up very fast. Thus if you don't start and continue with a clean and robust convention like this, in the end the project will become very dirty and hard to manage/understand (even for yourself). As a general rule of thumb, break your rules into as many logically-similar but independent steps as possible.

The reproduce/analysis/make/paper.mk Makefile must be the final Makefile that is included. This workhorse Makefile ends with the rule to build paper.pdf (final target of the whole project). If you look in it, you will notice that this Makefile starts with a rule to create $(mtexdir)/project.tex (mtexdir is just a shorthand name for $(BDIR)/tex/macros mentioned before). As you see, the only dependency of $(mtexdir)/project.tex is $(mtexdir)/verify.tex (which is the last analysis step: it verifies all the generated results). Therefore, $(mtexdir)/project.tex is the connection between the processing/analysis steps of the project, and the steps to build the final PDF.

During the research, it often happens that you want to test a step that is not a prerequisite of any higher-level operation. In such cases, you can (temporarily) define that processing as a rule in the most relevant workhorse-Makefile and set its target as a prerequisite of its TeX macro file. If your test gives a promising result and you want to include it in your research, set it as a prerequisite of other rules and remove it from the list of prerequisites of the TeX macro file. In fact, this is how a project is designed to grow in this framework.

File modification dates (meta data)

While Git does an excellent job at keeping a history of the contents of files, it makes no effort to keep the file meta data, and in particular the dates of files. Therefore when you check out a different branch, files that are re-written by Git will have a newer date than the other project files. However, file dates are important in the current design of Maneage: Make checks the dates of the prerequisite files and target files to see if the target should be re-built.

To fix this problem, Maneage uses a forked version of Metastore. Metastore uses a binary database file (called .file-metadata) to keep the modification dates of all the files under version control. This file is also under version control, but is hidden (because it shouldn't be modified by hand). During the project's configuration, Maneage installs two Git hooks to run Metastore: 1) before making a commit, to update its database with the file dates in a branch, and 2) after a checkout is complete, to reset the file dates to what they were.

In practice, Metastore should work almost fully invisibly within your project. The only place you might notice its presence is that you'll see .file-metadata in the list of modified/staged files (commonly after merging your branches). Since it's a binary file, Git also won't show you the changed contents. In a merge, you can simply accept any changes with git add -u. But if Git is telling you that it has changed without a merge (for example if you started a commit, but canceled it in the middle), you can just do git checkout .file-metadata and set it back to its original state.

Summary

Based on the explanation above, some major design points you should have in mind are listed below.

Customization checklist

Take the following steps to fully customize Maneage for your research project. After finishing the list, be sure to run ./project configure and ./project make to see if everything works correctly. If you notice anything missing or any incorrect part (probably a change that has not been explained here), please let us know so we can correct it.

As described above, the concept of reproducibility (during a project) heavily relies on version control. Currently Maneage uses Git as its main version control system. If you are not already familiar with Git, please read the first three chapters of the ProGit book which provides a wonderful practical understanding of the basics. You can read later chapters as you get more advanced in later stages of your work.

First custom commit

  1. Get this repository and its history (if you don't already have it): Arguably the easiest way to start is to clone Maneage and prepare for your customizations as shown below. After cloning, first rename the default origin remote to specify that it is Maneage's remote server. This will allow you to use the conventional origin name for your own project as shown in the next steps. Second, create and go into the conventional master branch to start committing in your project later.

    
    git clone https://git.maneage.org/project.git    # Clone/copy the project and its history.
    mv project my-project                            # Change the name to your project's name.
    cd my-project                                    # Go into the cloned directory.
    git remote rename origin origin-maneage          # Rename current/only remote to "origin-maneage".
    git checkout -b master                           # Create and enter your own "master" branch.
    pwd                                              # Just to confirm where you are.
                        
  2. Prepare to build project: The ./project configure command of the next step will build the different software packages within the "build" directory (that you will specify). Nothing else on your system will be touched. However, since it takes long, it is useful to see what is being built at every instant (it's almost impossible to tell from the torrent of commands that are produced!). So open another terminal on your desktop and navigate to the same project directory that you cloned (output of the last command above). Then run the following command. Once every second, this command will just print the date (possibly followed by a non-existent directory notice). But as soon as the next step starts building software, you'll see the names of the software packages get printed as they are being built. Once a package is installed in the project build directory, its name will disappear from this list. Again, don't worry, nothing will be installed outside the build directory.

    
    # On another terminal (go to top project source directory, last command above)
    ./project --check-config
                        
  3. Test Maneage: Before making any changes, it is important to test it and see if everything works properly with the commands below. If there is any problem in the ./project configure or ./project make steps, please contact us to fix the problem before continuing. Since the building of dependencies in configuration can take long, you can take the next few steps (editing the files) while it's working (they don't affect the configuration). After ./project make is finished, open paper.pdf. If it looks fine, you are ready to start customizing Maneage for your project. But before that, clean all the extra Maneage outputs with ./project make clean (the last command below).

    
    ./project configure     # Build the project's software environment (can take an hour or so).
    ./project make          # Do the processing and build paper (just a simple demo).
    # Open 'paper.pdf' and see if everything is ok.
    ./project make clean    # Delete the demo outputs before customizing.
                        
  4. Setup the remote: You can use any hosting facility that supports Git to keep an online copy of your project's version controlled history. We recommend GitLab because it is more ethical (although not perfect), and later you can also host GitLab on your own server. Create an account in your favorite hosting facility (if you don't already have one), and define a new project there. Please make sure the newly created project is empty (some services offer to include a README in a new project, which is bad in this scenario because it will not allow you to push to it). The service will give you a URL (usually starting with git@ and ending in .git). Put this URL in place of XXXXXXXXXX in the first command below. With the second command, "push" your master branch to your origin remote, and (with the --set-upstream option) set them to track/follow each other. However, the maneage branch is currently tracking/following your origin-maneage remote (automatically set when you cloned Maneage). So when pushing the maneage branch to your origin remote with the third command, you shouldn't use --set-upstream. Afterwards, you can check which local and remote branches are tracking each other with git branch -vv.

    
    git remote add origin XXXXXXXXXX        # Newly created repo is now called 'origin'.
    git push --set-upstream origin master   # Push 'master' branch to 'origin' (with tracking).
    git push origin maneage                 # Push 'maneage' branch to 'origin' (no tracking).
                        
  5. Title, short description and author: The title and basic information of your project's output PDF paper should be added in paper.tex. You should see the relevant place in the preamble (prior to \begin{document}). After you are done, run the ./project make command again to see your changes in the final PDF, and make sure that your changes don't cause a crash in LaTeX. Of course, if you use a different LaTeX package/style for managing the title and authors (in particular a specific journal's style), please feel free to use your own methods after finishing this checklist and doing your first commit.

  6. Delete dummy parts: Maneage contains some parts that are only for the initial/test run, mainly as a demonstration of important steps, which you can use as a reference in your own project. But they are not for any real analysis, so you should remove these parts as described below:

    • paper.tex: 1) Delete the text of the abstract (from \includeabstract{ to \vspace{0.25cm}) and write your own (a single sentence can be enough now, you can complete it later). 2) Add some keywords under it in the keywords part. 3) Delete everything between %% Start of main body. and %% End of main body.. 4) Remove the notice in the "Acknowledgments" section (in \new{}) and Acknowledge your funding sources (this can also be done later). Just don't delete the existing acknowledgment statement: Maneage is possible thanks to funding from several grants. Since Maneage is being used in your work, it is necessary to acknowledge them in your work also.

    • reproduce/analysis/make/top-make.mk: Delete the delete-me line in the makesrc definition. Just make sure there is no empty line between the download \ and verify \ lines (they should be directly under each other).

    • reproduce/analysis/make/verify.mk: In the final recipe, under the commented line Verify TeX macros, remove the full line that contains delete-me, and set the value of s in the line for download to XXXXX (any temporary string, you'll fix it at the end of your project, when it's complete).

    • Delete all delete-me* files in the following directories:

      
      rm tex/src/delete-me*
      rm reproduce/analysis/make/delete-me*
      rm reproduce/analysis/config/delete-me*
                                  
    • Disable verification of outputs by removing the yes from reproduce/analysis/config/verify-outputs.conf. Later, when you are ready to submit your paper, or publish the dataset, activate verification and make the proper corrections in this file (described under the "Other basic customizations" section below). This is a critical step and only takes a few minutes when your project is finished. So DON'T FORGET to activate it in the end.

    • Re-make the project (after a cleaning) to check that you haven't introduced any errors.

      
      ./project make clean
      ./project make
                                  
  7. Don't merge some files in future updates: As described below, you can later update your infrastructure (for example to fix bugs) by merging your master branch with maneage. For files that you have created in your own branch, there will be no problem. However, if you modify an existing Maneage file for your project, the next time it's updated on maneage you'll have an annoying conflict. The commands below show how to fix this future problem. With them, you can configure Git to ignore the changes in maneage for some of the files you have already edited and deleted above (and will edit below). Note that only the first echo command has a > (to write over the file), the rest are >> (to append to it). If you want to keep any other set of files from being imported from Maneage into your project's branch, you can follow a similar strategy. We recommend only doing it when you encounter the same conflict in more than one merge and there is no other change in that file. Also, don't add core Maneage Makefiles, otherwise Maneage can break on the next run.

    
    echo "paper.tex merge=ours" > .gitattributes
    echo "tex/src/delete-me.mk merge=ours" >> .gitattributes
    echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes
    echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes
    echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes
    echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes
    git add .gitattributes
                            
  8. Copyright and License notice: It is necessary that all the "copyright-able" files in your project (those larger than 10 lines) have a copyright and license notice. Please take a moment to look at several existing files to see a few examples. The copyright notice is usually close to the start of the file, it is the line starting with Copyright (C) and containing a year and the author's name (like the examples below). The License notice is a short description of the copyright license, usually one or two paragraphs with a URL to the full license. Don't forget to add these two notices to any new file you add in your project (you can just copy-and-paste). When you modify an existing Maneage file (which already has the notices), just add a copyright notice in your name under the existing one(s), like the line with capital letters below. To start with, add this line with your name and email address to paper.tex, tex/src/preamble-header.tex, reproduce/analysis/make/top-make.mk, and generally, all the files you modified in the previous step.

    
    Copyright (C) 2018-2020 Existing Name <existing@email.address>
    Copyright (C) 2020 YOUR NAME <YOUR@EMAIL.ADDRESS>
                            
  9. Configure Git for the first time: If this is the first time you are running Git on this system, you have to configure it with some basic information (ignore this step if you have already done it). Git will include your name and e-mail address in each commit. You can also specify your favorite text editor for writing the commit message (emacs, vim, nano, etc.).

    
    git config --global user.name "YourName YourSurname"
    git config --global user.email your-email@example.com
    git config --global core.editor nano
                            
  10. Your first commit: You have already made some small and basic changes in the steps above and you are in your project's master branch. So, you can officially make your first commit in your project's history and push it. But before that, you need to make sure that there are no problems in the project. It is a good habit to always re-build the system before a commit to be sure it works as expected.

    
    git status                 # See which files you have changed.
    git diff                   # Check the lines you have added/changed.
    ./project make             # Make sure everything builds successfully.
    git add -u                 # Put all tracked changes in staging area.
    git status                 # Make sure everything is fine.
    git diff --cached          # Confirm all the changes that will be committed.
    git commit                 # Your first commit: put a good description!
    git push                   # Push your commit to your remote.
                            
  11. Start your exciting research: You are now ready to add flesh and blood to this raw skeleton by further modifying and adding your exciting research steps. You can use the "published works" section in the introduction (above) as some fully working models to learn from. Also, don't hesitate to contact us if you have any questions.
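
A note on the merge=ours lines of step 7: as far as we know, Git has no built-in merge driver called "ours", so the attribute only takes effect once a driver with that name is defined in your Git configuration (otherwise Git falls back to its normal text merge and the conflict still appears). The sketch below shows the idea in a scratch repository; the branch name "upstream" and the file contents are hypothetical stand-ins for maneage and your own edits.

```shell
# Scratch repository: nothing in your real project is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"
git config user.email demo@example.com

# Define the 'ours' driver named in .gitattributes: 'true' always
# succeeds and leaves the current branch's version of the file in place.
git config merge.ours.driver true

echo "paper.tex merge=ours" > .gitattributes
echo "template text" > paper.tex
git add .gitattributes paper.tex
git commit -q -m "Start from the template"

git branch upstream                  # hypothetical stand-in for 'maneage'
echo "my own paper" > paper.tex
git commit -q -a -m "Customize paper.tex"

git checkout -q upstream
echo "template text, updated upstream" > paper.tex
git commit -q -a -m "Upstream edit of paper.tex"

git checkout -q -                    # back to the project branch
git merge -q -m "Merge upstream" upstream
cat paper.tex                        # prints: my own paper
```

In a real project only the git config merge.ours.driver true line is needed (run inside the project, or with --global). The setting lives in the local Git configuration, not in .gitattributes, so it has to be repeated in every clone.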

Other basic customizations

Tips for designing your project

The following is a list of design points, tips, or recommendations that have been learned after some experience with this type of project management. Please don't hesitate to share with us any experience you gain after using it. In this way, we can add it here (giving full credit) for the benefit of others.

Future improvements

This is an evolving project and, as time goes on, it will become more robust. Some of the most prominent features we plan to implement in the future are listed below. Please join us if you are interested.

Package management

It is important to have control of the environment of the project. Maneage currently builds the higher-level programs (for example GNU Bash, GNU Make, GNU AWK and domain-specific software) it needs, then sets PATH so the analysis is done only with the project's built software. But currently the configuration of each program is in the Makefile rules that build it. This is not good because a change in the build configuration does not automatically cause a re-build. Also, each separate project on a system needs to have its own built tools (which can waste a lot of space).

A good solution is based on the Nix package manager: a separate file is present for each software package, containing all the necessary information to build it (including its URL, its tarball MD5 hash, dependencies, configuration parameters, build steps, etc.). Using this file, a script can automatically generate the Make rules to download, build and install the program and its dependencies (along with the dependencies of those dependencies, and so on).
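
As a rough illustration of the idea above (not an existing Maneage feature), the sketch below writes one per-software description file and turns it into a Make rule. The file name hello.conf, its key names, the example.org URL and the gen_rule function are all hypothetical; the tarball checksum is left out to keep the sketch short.

```shell
workdir=$(mktemp -d)

# Hypothetical per-software description: simple key=value lines.
cat > "$workdir/hello.conf" <<'EOF'
name=hello
version=1.0
url=https://example.org/hello-1.0.tar.gz
depends=zlib
configure=--disable-nls
EOF

# Print a Make rule for the software described in the file given as $1.
# The file is sourced in a subshell so its variables do not leak out.
gen_rule() {
    (
        . "$1"
        tarball="$name-$version.tar.gz"
        printf '%s: %s\n' "$name" "$depends"
        printf '\tcurl -L -o %s %s\n' "$tarball" "$url"
        printf '\ttar -xf %s\n' "$tarball"
        printf '\tcd %s-%s && ./configure %s && make && make install\n' \
               "$name" "$version" "$configure"
    )
}

gen_rule "$workdir/hello.conf" > "$workdir/hello.mk"
cat "$workdir/hello.mk"
```

Because the dependencies are listed as prerequisites of the generated rule, Make would build them first, provided that similar rules are generated for each of them.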

All the software is installed in a "store". Each installed file (library or executable) is prefixed by a hash of this configuration (and the OS architecture) and the standard program name. For example (from the Nix webpage):


    /nix/store/b6gvzjyb2pg0kjfwrjmg1vfhh54ad73z-firefox-33.1/
                

The important thing is that the "store" is not in the project's search path. The hash is unique to that particular software and its particular configuration, so simply by searching for this hash in the store, we can find the installed files of that software. After the complete installation of the software, symbolic links (without the hash) are made to these files to populate each project's program and library search paths.

This scenario has several advantages: 1) a change in a software's build configuration triggers a rebuild. 2) a single "store" can be used in many projects, thus saving space and configuration time for new projects (that commonly have large overlaps in lower-level programs).
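
The store-and-links scheme described above can be sketched as follows. This is only a sketch under stated assumptions: the temporary directories stand in for the shared store and one project's installation directory, hello.conf and conf_hash are hypothetical names, sha256sum is assumed to be available (it is part of GNU Coreutils), and the 32-character truncated hash is a simplification, not the Nix hashing scheme.

```shell
store=$(mktemp -d)      # stand-in for the shared store
project=$(mktemp -d)    # stand-in for one project's install directory

# Hypothetical build description of one software package.
conf="$store/hello.conf"
printf 'name=hello\nversion=1.0\nconfigure=--disable-nls\n' > "$conf"

# Hash of the build configuration plus the machine architecture.
conf_hash() {
    { cat "$1"; uname -m; } | sha256sum | cut -c1-32
}

hash=$(conf_hash "$conf")
prefix="$store/$hash-hello-1.0"

# Pretend installation into the hashed store directory.
mkdir -p "$prefix/bin"
printf '#!/bin/sh\necho hello\n' > "$prefix/bin/hello"
chmod +x "$prefix/bin/hello"

# Find the installed files through the hash and link them, without the
# hash, into the project's own search path.
mkdir -p "$project/bin"
for f in "$store/$hash"-*/bin/*; do
    ln -s "$f" "$project/bin/$(basename "$f")"
done
readlink "$project/bin/hello"

# A change in the build configuration gives a different hash, so the
# old store directory no longer matches and a rebuild is needed.
printf 'configure=--enable-nls\n' >> "$conf"
new_hash=$(conf_hash "$conf")
[ "$hash" != "$new_hash" ] && echo "configuration changed: rebuild needed"
```

A second project with the same configuration would compute the same hash and only need to create its own links, which is where the saving in space and build time comes from.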

Appendix: Necessity of exact reproduction in scientific research

In case the link above is not accessible at the time of reading, here is a copy of the introduction of that link, describing the necessity for a reproducible project like this (copied on February 7th, 2018):

The most important element of a "scientific" statement/result is the fact that others should be able to falsify it. The tsunami of data that has engulfed astronomers in the last two decades, combined with faster processors and faster internet connections, has made it much easier to obtain a result. However, these factors have also increased the complexity of a scientific analysis, such that it is no longer possible to describe all the steps of an analysis in the published paper. Citing this difficulty, many authors settle for describing the generalities of their analysis in their papers.

However, it is impossible to falsify (or even study) a result if you can't exactly reproduce it. The complexity of modern science makes it vitally important to exactly reproduce the final result, because even a small deviation can be due to many different parts of an analysis. Nature is already a black box which we are trying so hard to comprehend. Not letting other scientists see the exact steps taken to reach a result, or not allowing them to modify it (do experiments on it) is a self-imposed black box, which only exacerbates our ignorance.

Other scientists should be able to reproduce, check and experiment on the results of anything that is to carry the "scientific" label. Any result that is not reproducible (due to incomplete information from the author) is not scientific: the readers have to have faith in the subjective experience of the authors in the very important choice of configuration values and order of operations, which is contrary to the scientific spirit.

Copyright information

This file is part of Maneage's core: https://git.maneage.org/project.git

Maneage is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Maneage is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Maneage. If not, see https://www.gnu.org/licenses/.