Maneage -- Managing data lineage

This tutorial describes how Maneage (management + lineage) works in practice. It is highly recommended to read README-hacking.md first in order to have a clear idea of what this project is about. The tutorial assumes that you already have the project set up and working properly. To get there, please read and follow all the steps described there, from the section Customization checklist up to and including the section Title, short description and author.

By following this tutorial, you will obtain a fully reproducible paper describing a small research example, carried out step by step. The research example is very simple: it consists of analysing a dataset with two columns (time and population). The analysis is just a linear fit of the data, whose results are then written in a small paragraph in the final paper.

In the following, the tutorial assumes you have three different directories, which you set up in the configure step:

input-directory: Necessary input data for the project is in this directory.
project-directory: This directory contains the project itself (source code). It is under Git control.
build-directory: Output directory of the project. It is where all the necessary software and the results of the project are saved.

IMPORTANT NOTE: the tutorial assumes you are always in project-directory when running the command lines.

In short: this hands-on tutorial will guide you through a simple research example in order to show the workflow in Maneage. The tutorial describes step by step how to download a small file containing data, analyse the data (by making a linear fit), and finally write a small paragraph with the fitting parameters into the final paper. All of this will be done in the same Makefile.

Installing available software: Matplotlib

If all the steps above have been done successfully, you are ready to start including your own analysis scripts. Before that, let's install the Matplotlib Python package, which will be used later in the analysis to make the figure of the linear fit. This package serves as an example of how to install programs that are already available in Maneage. Just open the Makefile reproduce/software/config/installation/TARGETS.mk and add the word matplotlib to the top-level-python line.

# Python libraries/modules.
top-level-python    = astropy matplotlib

After that, run the configure step again with the option -e to continue using the same configuration options given before (input and build directories). Also, run the prepare and make steps:

./project configure -e
./project prepare
./project make

Open paper.pdf and check that everything is fine. Note that Matplotlib now appears in the software appendix at the end of the document.

Once you have verified that Matplotlib has been properly installed and that it appears in the final paper.pdf, you are ready to make the first commit of the project. With the next commands, you will see which files have been modified and what the modifications are, prepare them to be committed, and make the commit. In the commit process, Git will open the text editor for writing the commit message. Take into account that all committed changes will be preserved in the history of your project, so it is good practice to take some time to properly describe what has been done, changed, or added. Finally, as this is the very first commit of the project, tag it as the zero-th version.

git status         # See which files have been changed.
git diff           # See the lines you have modified.
git add -u         # Put all tracked changes in staging area.
git status         # Make sure everything is fine.
git commit         # Your first commit, add a nice description.
git tag -a v0.0    # Tag this as the zero-th version of your project.

Now, have a look at the Git history of the project. Note that the local master branch is one commit ahead of the remote origin/master branch. After that, push your first commit and its tag to your remote repository with the next commands. Since you have set up your master branch to follow origin/master, you can just use git push.

git log --oneline --decorate --all --graph   # Have a look at the Git history.
git push                                     # Push the commit to the remote/origin.
git push --tags                              # Push all tags to the remote/origin.

Now it is time to start including your own scripts to download and analyse the data. Bear in mind that the goal of this tutorial is to give a general view of the workflow in Maneage, so only a few basic concepts about Make and how it is used in this project will be given. Maneage is much more powerful, and many more things can be done with it than the ones shown in this tutorial. So, read carefully all the documentation and comments already available in each file, be creative, and experiment with your own research.

In the following, the tutorial focuses on downloading the data, analysing the data, and finally writing the results into the final paper. As a consequence, the project still contains a lot of things that are not necessary here: for example, all the text of the final paper already written in the paper.tex file, some Makefiles to download images from the Hubble Space Telescope and analyse them, etc. In your own research, all of this would be removed. In this tutorial they are kept, because we only show how to do a simple analysis and include a small paragraph with the result of the linear fit.

In short: in this section you have learnt how to install available software in Maneage. In this particular case, you installed Matplotlib.

Including Python script to make the analysis

You are going to use a small Python script to analyse the data. This Python script will be invoked from a Makefile that will be set up later. For now, we are just going to create the Python script and put it in an appropriate location. All analysis scripts are kept in a subfolder of reproduce/analysis named after their file type. For example, the Makefiles are saved in the make directory, and bash scripts are saved in the bash directory. Since there is no python directory yet, create it with the following command.

mkdir reproduce/analysis/python

After that, you need the Python script itself. The code is very simple: it will take an input file containing two columns (year and population), the name of the output file in which the parameters of the linear fit will be saved, and the name of the figure showing the original data and the fitted curve. Paste the next Python script into a new file named linear-fit.py into the directory generated in the above step (reproduce/analysis/python).

# Make a linear fit of an input data set
#
# This Python script makes a linear fit of a data set consisting of time and
# population. It generates a figure in which the original data and the
# fitted curve are plotted.  Finally, it saves the fitting parameters.
#
# Copyright (C) 2020 Raul Infante-Sainz infantesainz@gmail.com
# Copyright (C) YYYY Your Name your-email@example.xxx
#
# This Python script is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at your
# option) any later version.
#
# This Python script is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
# Public License for more details. See http://www.gnu.org/licenses/.

# Import necessary packages.
import sys
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Fitting function (linear fit)
def func(x, a, b):
    return a * x + b

# Define input and output arguments
ifile = sys.argv[1]    # Input file
ofile = sys.argv[2]    # Output file
ofig  = sys.argv[3]    # Output figure

# Read the data from the input file.
data = np.loadtxt(ifile)

# Time and population:
# time ---------- x
# population ---- y
x = data[:, 0]
y = data[:, 1]

# Make the linear fit
params, pcov = curve_fit(func, x, y)

# Make and save the figure
plt.clf()
plt.figure()
plt.plot(x, y, 'bo', label="Original data")
plt.plot(x, func(x, *params), 'r-', label="Fitted curve")
plt.title('Population along time')
plt.xlabel('Time (year)')
plt.ylabel('Population (million people)')
plt.legend()
plt.grid()
plt.savefig(ofig, format='PDF', bbox_inches='tight')

# Save the fitting parameters
np.savetxt(ofile, params, fmt='%.3f')

Have a look at this Python script. At the very beginning, it has a block of commented lines with a descriptive title, a small paragraph describing the script, and the copyright with the contact information. It is very important to have this kind of meta-data in every file. Below these lines, there is the source code itself.

As it can be seen, this Python script (linear-fit.py) is designed to be invoked from the command line in the following way.

python /path/to/linear-fit.py /path/to/input.dat /path/to/output.dat /path/to/figure.pdf

/path/to/input.dat is the input data file, /path/to/output.dat is the output data file (with the fitted parameters), and /path/to/figure.pdf is the plotted figure.
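To build some intuition for what the script produces, the following is a minimal sketch of a linear fit written with the Python standard library only. It is not the method used by linear-fit.py (which calls scipy.optimize.curve_fit); it uses the closed-form least-squares formulas instead, and the function names and sample values are hypothetical, chosen only for illustration.

```python
# Closed-form least-squares fit of y = a*x + b (standard library only).
# Assumption: illustration only; linear-fit.py itself uses scipy's curve_fit.

def parse_two_columns(text):
    """Return (x, y) lists from whitespace-separated two-column text."""
    xs, ys = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    return xs, ys


def linear_fit(xs, ys):
    """Return slope a and intercept b minimising the squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical sample values (NOT the real contents of ESP.dat).
sample = """# year population
2000 40.0
2001 40.5
2002 41.0
"""
x, y = parse_two_columns(sample)
a, b = linear_fit(x, y)

# One parameter per row, with three decimals, like the script's output file.
print(f"{a:.3f}")
print(f"{b:.3f}")
```

The two printed rows mimic the layout of the output file written by the script: the slope on the first row and the intercept on the second one.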

You will do this invocation inside of a Make rule (that will be set up later). Now that you have included this Python script, make a commit in order to save this work. With the first command you will see the files with modifications. With the second command, you can check what the changes are. Correct, add, and modify whatever you want in order to include more information or comments, or to clarify any step. After that, add the files and commit the work. Finally, push the commit to the remote/origin.

git status                                       # See which files you have changed.
git diff                                         # See the lines you have added/changed.
git add reproduce/analysis/python/linear-fit.py  # Put the new script in the staging area.
git commit                                       # Commit, add a nice description.
git push                                         # Push the commit to the remote/origin.

Check that everything is fine by having a look at the Git history of the project. Note that the master branch has advanced by one commit, while the template branch stays behind.

git log --oneline --decorate --all --graph  # See the Git history.

In short: in this section you have included a Python script that will be used for making the linear fitting.

Downloading data

As said before, there are multiple things that are already included in the project. One of them is a dedicated Makefile to manage all the necessary downloads of input data (reproduce/analysis/make/download.mk). With appropriate modifications of this file, you would be able to download the necessary data. However, in order to keep this tutorial as simple as possible, we will describe how to download the data you need more explicitly.

The data needed by this tutorial consist of a simple plain text file containing two columns: time (year) and population (in millions of people). These data correspond to Spain, and they can be downloaded from this URL: http://akhlaghi.org/data/template-tutorial/ESP.dat. But don't do that using your browser, you have to do it within Maneage!

Let's create a Makefile for downloading the data. Later, you will also include (in the same Makefile) the necessary work in order to make the analysis. Save this Makefile in the dedicated directory (reproduce/analysis/make) with the name getdata-analysis.mk. In that Makefile, paste the following code.

# Download data for the tutorial
#
# In this Makefile, data for the tutorial is downloaded.
#
# Copyright (C) 2020 Raul Infante-Sainz infantesainz@gmail.com
# Copyright (C) YYYY Your Name your-email@example.xxx
#
# This Makefile is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at your
# option) any later version.
#
# This Makefile is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
# Public License for more details. See http://www.gnu.org/licenses/.


# Download data for the tutorial
# ------------------------------
#
pop-data = $(indir)/ESP.dat
$(pop-data): | $(indir)
        wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@


# Final TeX macro
# ---------------
#
# It is very important to mention the address where the data were
# downloaded in the final report.
$(mtexdir)/getdata-analysis.tex: $(pop-data) | $(mtexdir)
        echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@

Have a look at this Makefile and see the different parts. The first line is a descriptive title. Below, include your name, contact email, and finally, the copyright. Please, take your time in order to add all relevant information in each Makefile you modify. As you can see, these lines start with # because they are comments.

After that information, there are blank lines to separate the different parts. Then, you have the Make rule to download the data. Remember the general structure of a Make rule:

TARGETS: PREREQUISITES
        RECIPE

A rule states how to construct the TARGETS from the PREREQUISITES by following the RECIPE. Note that the white space at the beginning of the RECIPE is not made of spaces but of a single TAB. Take this into account if you copy/paste the code.

Now you can see this structure in our particular case:

$(pop-data): | $(indir)
        wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@

Here we have:

$(pop-data) is the TARGET. It is defined just one line above the rule: pop-data = $(indir)/ESP.dat. As can be seen, the target is just one file named ESP.dat in the indir directory.
$(indir) is the PREREQUISITE. In this case, nothing is needed for obtaining the TARGET, just the directory in which it is going to be saved. This is the reason for the pipe | at the beginning of the prerequisite (it indicates an order-only prerequisite).
wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@ is the RECIPE. It states how to construct the TARGET from the PREREQUISITE. In this case, it just uses wget to download the file specified in the URL (http://akhlaghi.org/data/template-tutorial/ESP.dat) and save it as the target: -O $@. Inside of a Make rule, $@ is the target. So, in this case, $@ is $(pop-data).
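The role of the order-only prerequisite can be sketched in a few lines of Python. This is only an approximation of the check Make does to decide whether a target has to be rebuilt, not Make's actual implementation, and the function name needs_rebuild is hypothetical. The point it illustrates is that the timestamp of an order-only prerequisite (here, the directory) never forces the target to be rebuilt.

```python
import os


def needs_rebuild(target, prerequisites, order_only=()):
    """Approximate Make's out-of-date check for a single target.

    Assumption: all prerequisites already exist (Make would build them
    first). 'order_only' is accepted to mirror the rule syntax, but its
    timestamps are deliberately ignored, as Make does.
    """
    # A missing target always has to be built.
    if not os.path.exists(target):
        return True

    # An existing target is out of date only when a normal prerequisite
    # is newer than it.
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)
```

Under this model, saving new files into the input directory changes the modification time of the directory, but the already downloaded ESP.dat is not downloaded again.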

With this, you have included the rule that will download the data. Now, to finish, you have to specify the final purpose of the Makefile: download that data! This is done by setting $(pop-data) as a prerequisite of the final rule. Remember that each Makefile builds a final target with the same name as the Makefile, but with the extension .tex. These targets are TeX macro files, in which the relevant information to be included in the final paper is saved. Here, you are saving the URL.

$(mtexdir)/getdata-analysis.tex: $(pop-data) | $(mtexdir)
        echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@

In this final rule we have:

$(mtexdir)/getdata-analysis.tex is the TARGET. It is the TeX macro file. Note that it has the same name as the Makefile itself, but it will be saved in the $(mtexdir) directory. What is needed for constructing this target? The prerequisites.
$(pop-data) | $(mtexdir) are the PREREQUISITES. In this case you have two prerequisites. The first one, $(pop-data), indicates that the final TeX macro has to be generated after this file has been obtained. The second one is an order-only prerequisite: the directory in which the target is saved, $(mtexdir).
echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@ is the RECIPE. Basically, it writes the text \newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial} into the TARGET ($@); inside the double quotes, the shell turns each \\ into a single \. As you can see, this is the definition of a new command in TeX. This new command, \popurl, will be used when writing the final paper.

Only one step remains before the data can be downloaded. You have to add the name of this Makefile (without the extension .mk) to the reproduce/analysis/make/top-make.mk Makefile, which defines which Makefiles have to be executed. You have to end up having:

makesrc = initialize \
          download \
          getdata-analysis \
          delete-me \
          paper
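The naming convention described above can be made explicit with a small Python sketch. It assumes, as the text states, that each name in makesrc corresponds to a Makefile NAME.mk in reproduce/analysis/make and to a final target NAME.tex in the macros directory (tex/macros under the build directory). The function names are hypothetical and this is not how Maneage itself resolves the names; it only shows why a name in makesrc has to match the file name exactly.

```python
import os


def expected_files(makesrc, makedir="reproduce/analysis/make",
                   mtexdir="tex/macros"):
    """Return (Makefile, TeX macro file) pairs implied by the names."""
    return [(f"{makedir}/{name}.mk", f"{mtexdir}/{name}.tex")
            for name in makesrc]


def missing_makefiles(makesrc, makedir="reproduce/analysis/make"):
    """Return the names that have no matching Makefile in 'makedir'."""
    return [name for name in makesrc
            if not os.path.isfile(os.path.join(makedir, f"{name}.mk"))]


# The name must be the file name of the Makefile without '.mk'.
for makefile, macro in expected_files(["getdata-analysis"]):
    print(makefile, "->", macro)
```

Running missing_makefiles on the list of names from the project-directory is a quick way to spot a misspelled entry before executing the project.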

As always, read carefully all the comments and information in order to know what is going on. Also, add your own comments and information in order to explain each step clearly and with enough detail. If everything is fine, the project is now ready to download the data in the make step. Try it!

./project make

Hopefully, it will download the file and save it in the folder called inputs under the build-directory. Check that it is there, and also have a look at the TeX macro file to see that the new command has been included. It is under the build directory: build-directory/tex/macros/getdata-analysis.tex.

Now that all of these changes have been included and work fine, it is time to check everything little by little and make a commit in order to save this work. Remember to write a good commit title and a nice commit message describing what you have done and why. Then, push the commit to the remote/origin.

Congratulations! You have included your first Makefile and the data is now ready to be analysed!

In short, to download the data you did the following:

Create a Makefile: reproduce/analysis/make/getdata-analysis.mk
Write meta-data at the beginning: title, your name, email, copyright, etc.
Define the file you want to download, and the rule to do it.
Write the rule to generate the TeX macro, putting the file you are downloading as a prerequisite.
Add the name of the Makefile (without the .mk) into reproduce/analysis/make/top-make.mk
Run ./project make in order to execute the project and download the data.
Check that everything worked fine by looking at the downloaded file and the TeX macro.
Commit and push all the work included.

Adding the analysis rule

Until this point, you have included the Python script that will do the linear fitting, and the rule for downloading the data. Now, it is necessary to construct the Make rule in which this Python script is invoked to do the analysis. This rule will be put in the same Makefile you have already generated for downloading the data. But, before this, define the directory in which the target is going to be saved.

odir = $(BDIR)/fit-parameters

This is a folder under the build-directory called fit-parameters. After that, define the target: a plain text file in which the linear fit parameters are saved (by the Python script). Put it into the previously defined directory. As the data is from Spain, name it ESP.txt.

param-file = $(odir)/ESP.txt

Now, include a rule to construct the output directory odir. This is necessary because this directory is needed for saving the file ESP.txt.

$(odir):
        mkdir $@

With all the previous definitions, now it is possible to set the rule for making the analysis:

$(param-file): $(indir)/ESP.dat | $(odir)
        python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf

In this rule you have:

$(param-file) is the TARGET. It is the file previously defined in which the fitting parameters will be saved.
$(indir)/ESP.dat | $(odir) are the PREREQUISITES. In this case you have two prerequisites. The first one, $(indir)/ESP.dat, is the input file previously downloaded by the rule above. This file contains the input data that the Python script will use for making the linear fit. $(odir) is the second prerequisite. It is an order-only prerequisite (indicated by the pipe |): the directory where the target is saved.
python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf is the RECIPE. Basically, it calls python to run the script reproduce/analysis/python/linear-fit.py with the necessary arguments: the input file $< (the first prerequisite), the target $@, and the name of the figure $(odir)/ESP.pdf (a PDF figure saved in the same directory as the target).
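How the automatic variables of this recipe are filled in can be sketched with a toy expander in Python. It only handles $@ (the target), $< (the first prerequisite) and $$ (a literal $ passed on to the shell); real Make also expands references like $(odir), which this sketch leaves untouched. The function name and the paths in the example are hypothetical.

```python
def expand_recipe(recipe, target, prerequisites):
    """Expand $@, $< and $$ in a recipe line (toy model of Make)."""
    out = []
    i = 0
    while i < len(recipe):
        pair = recipe[i:i + 2]
        if pair == "$$":            # escaped dollar: the shell sees one '$'
            out.append("$")
            i += 2
        elif pair == "$@":          # the target of the rule
            out.append(target)
            i += 2
        elif pair == "$<":          # the first prerequisite
            out.append(prerequisites[0] if prerequisites else "")
            i += 2
        else:                       # anything else is copied as it is
            out.append(recipe[i])
            i += 1
    return "".join(out)


# Hypothetical paths, only to show the substitution.
print(expand_recipe("python linear-fit.py $< $@",
                    "build/fit-parameters/ESP.txt",
                    ["build/inputs/ESP.dat"]))
```

The printed command line is what the shell would receive: the input file takes the place of $< and the parameter file takes the place of $@.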

Finally, in order to indicate that you want to obtain the target you have just included ($(param-file)), it is necessary to add it as a prerequisite of the final TARGET $(mtexdir)/getdata-analysis.tex. So, in the last rule (which creates the TeX macro), remove $(pop-data) and put $(param-file) instead. By doing this, you are telling the Makefile that you want to obtain the file in which the fitted parameters are saved. Inside of the rule, define a couple of bash variables (a and b) that hold the fitted parameters extracted from the prerequisite. For a (which is in the first row):

a=$$(cat $< | awk 'NR==1{print $$1}')

Similarly, for obtaining the parameter b (which is in the second row):

b=$$(cat $< | awk 'NR==2{print $$1}')

Then you have to specify the new TeX commands for these two parameters, just write them as it was done before for the URL:

echo "\newcommand{\afitparam}{$$a}" >> $@
echo "\newcommand{\bfitparam}{$$b}" >> $@

So, at the end you will have the final rule like this:

$(mtexdir)/getdata-analysis.tex: $(param-file) | $(mtexdir)
        echo "\\newcommand{\\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@
        a=$$(cat $< | awk 'NR==1{print $$1}')
        b=$$(cat $< | awk 'NR==2{print $$1}')
        echo "\newcommand{\afitparam}{$$a}" >> $@
        echo "\newcommand{\bfitparam}{$$b}" >> $@

Important notes: you have to use two $ characters ($$) in order to pass a single $ character to bash inside of a Make rule. Also, note that you have to use >> for the last two lines in order not to overwrite the target each time you write something into it. A single > creates the file again from scratch, while the double >> only adds the line at the end of the existing file.
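The logic of this final rule can also be written as a short Python sketch, which may help to see the difference between > and >>. Opening a file in mode "w" behaves like > (the file is created again from scratch) and mode "a" behaves like >> (lines are added at the end). The function name write_macros is hypothetical; the project itself does this step with echo and awk inside the Make rule, as shown above.

```python
def write_macros(param_path, macro_path, url):
    """Write the URL and the two fitted parameters as TeX macros."""
    # Read the first field of every non-empty row of the parameter file
    # (the equivalent of awk's 'NR==1{print $1}' and 'NR==2{print $1}').
    with open(param_path) as f:
        rows = [line.split()[0] for line in f if line.strip()]
    a, b = rows[0], rows[1]

    # Mode "w" truncates the file, like the shell's '>'.
    with open(macro_path, "w") as f:
        f.write("\\newcommand{\\popurl}{%s}\n" % url)

    # Mode "a" appends to the existing file, like the shell's '>>'.
    with open(macro_path, "a") as f:
        f.write("\\newcommand{\\afitparam}{%s}\n" % a)
        f.write("\\newcommand{\\bfitparam}{%s}\n" % b)
```

If the second block also used mode "w", the macro file would end up without the \popurl definition, which is exactly what would happen in the rule if > were used instead of >>.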

With all the above modifications, you are ready to obtain the fitting parameters. If you add the necessary comments and information, the final Makefile would look similar to:

# Download data and linear fitting for the tutorial
#
# In this Makefile, data for the tutorial is downloaded. Then, a Python
# script is used to make a linear fitting. Finally, the fitted parameters as
# well as the URL are saved into a TeX macro.
#
# Copyright (C) 2020 Raul Infante-Sainz infantesainz@gmail.com
# Copyright (C) YYYY Your Name your-email@example.xxx
#
# This Makefile is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at your
# option) any later version.
#
# This Makefile is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
# Public License for more details. See http://www.gnu.org/licenses/.


# Download data for the tutorial
# ------------------------------
# The input file is defined and downloaded using the following rule
pop-data = $(indir)/ESP.dat
$(pop-data): | $(indir)
        # Use wget to download the data
        wget http://akhlaghi.org/data/template-tutorial/ESP.dat -O $@


# Output directory
# ----------------
# Small rule for constructing the output directory, previously defined
odir = $(BDIR)/fit-parameters
$(odir):
        mkdir $@


# Linear fitting of the data
# --------------------------
# The output file is defined into the output directory. The fitted
# parameters will be saved into this directory by the Python script.
param-file = $(odir)/ESP.txt
$(param-file): $(indir)/ESP.dat | $(odir)
        # Invoke Python to run the script with the input data
        python reproduce/analysis/python/linear-fit.py $< $@ $(odir)/ESP.pdf


# TeX macros final target
# -----------------------
# This is how we write the necessary parameters in the final PDF. In this
# rule, new TeX parameters are defined from the URL, and the fitted
# parameters.
$(mtexdir)/getdata-analysis.tex: $(param-file) | $(mtexdir)
        # Write the URL into the target
        echo "\newcommand{\popurl}{http://akhlaghi.org/data/template-tutorial}" > $@

        # Read the fitted parameters into shell variables.
        a=$$(cat $< | awk 'NR==1{print $$1}')
        b=$$(cat $< | awk 'NR==2{print $$1}')

        # Write the parameters into the target as LaTeX macros.
        echo "\newcommand{\afitparam}{$$a}" >> $@
        echo "\newcommand{\bfitparam}{$$b}" >> $@

Have a look at this Makefile and note that it contains what has been described above. Take your time to write useful comments and modify whatever you think is necessary. If everything is fine, the project is now ready to download the data and make the linear fit. Try it!

./project make

Hopefully, you will now have the fitted parameters in the build-directory/fit-parameters/ESP.txt file, and the figure in the same directory. Do not pay too much attention to the quality of the fit, it is just an example. Also, check that the TeX macro has been created successfully by having a look at build-directory/tex/macros/getdata-analysis.tex. Finally, now that you have ensured that everything is fine, make a commit in order to keep the work safe. In the next step, you will see how to include these results in the final paper.

In short: with the work included in this section, the project is able to download and make the linear fitting of the data. The result is the fitted parameters that are also saved in a TeX macro, and the figure showing the data with the fitted curve.

Editing the final paper

With all the previous work, the project is able to download the file containing the data (two columns, year and population of Spain) and analyse them by making a linear fit (y=ax+b). The result is a TeX macro file containing the URL of the data and the linear fit parameters (a and b). Now, it is time to add a small paragraph to the paper, just to illustrate how to write the relevant parameters from the analysis.

First of all, make a copy of the current paper.pdf document that you have in the project-directory. This paper is an example that Maneage constructs by default. Now, you will modify it by adding a small paragraph including the fitting parameters and the URL. So, open project-directory/paper.tex and add the following paragraph just at the beginning of the abstract section.

By following the steps described in the tutorial, I have been able to obtain this reproducible paper!
The project is very simple: it consists of downloading a file (from \popurl) and making an easy linear fit using a Python script.
The linear fit is $y=a*x+b$, with the following parameters: $a=\afitparam$ and $b=\bfitparam$.

As you can see, the TeX macros defined before in the Makefile are now used in the paper: \popurl, \afitparam, and \bfitparam. If you run the make step again (./project make), the paper will be re-compiled including this paragraph. Check that this is the case and compare it with the previous version of the paper. Congratulations! You have completed this tutorial and now you are able to use Maneage for doing your exciting research in a reproducible way!
