<!DOCTYPE html>
<!-- Copyright notes are just below the head and before body -->
<html lang="en-US">
<!-- HTML Header -->
<head>
<!-- Title of the page. -->
<title>Maneage -- Managing data lineage</title>
<!-- Enable UTF-8 encoding to easily use non-ASCII characters -->
<meta charset="UTF-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
<!-- Put logo beside the address bar -->
<link rel="shortcut icon" href="./img/favicon.svg" />
<!-- The viewport meta tag is placed mainly for mobile browsers
that are pre-configured in different ways (for example setting
different widths for the page than the actual width of the device,
or zooming to different values). Without this the CSS media
solutions might not work properly on all mobile browsers. -->
<meta name="viewport"
content="width=device-width, initial-scale=1">
<!-- Basic styles -->
<link rel="stylesheet" href="css/base.css" />
</head>
<!--
Webpage of Maneage: a framework for managing data lineage
Copyright (C) 2020-2022 Pedram Ashofteh Ardakani <pedramardakani@pm.me>
Copyright (C) 2020-2022 Mohammad Akhlaghi <mohammad@akhlaghi.org>
This file is part of Maneage. Maneage is free software: you can
redistribute it and/or modify it under the terms of the GNU General
Public License as published by the Free Software Foundation, either
version 3 of the License, or (at your option) any later version.
Maneage is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details. See
<http://www.gnu.org/licenses/>. -->
<!-- Start the main body. -->
<body>
<div id="container">
<header role="banner">
<!-- global navigation -->
<nav role="navigation" id="nav-hamburger-wrapper">
<input type="checkbox" id="nav-hamburger-input"/>
<label for="nav-hamburger-input">|||</label>
<div id="nav-hamburger-items" class="button">
<a href="index.html">Home</a>
<a href="about.html">About</a>
<a href="http://git.maneage.org/project.git/">Git</a>
<a href="tutorial.html">Tutorial</a>
</div>
</nav>
</header>
<div class="banner">
<div>
<a href="index.html"><img src="img/maneage-logo.svg" /></a>
</div>
<div>
<h1>Maneage</h1><h2>Managing Data Lineage</h2>
<p>Copyright © 2018-2022 Mohammad Akhlaghi <a href="mailto:mohammad@akhlaghi.org">mohammad@akhlaghi.org</a><br />
Copyright © 2020-2022 Raul Infante-Sainz <a href="mailto:infantesainz@gmail.com">infantesainz@gmail.com</a><br />
<a href="#page-footer">License Conditions</a></p>
</div>
</div>
<hr />
<p align="right">Next: <a href="about-tips.html">Tips for designing your project</a>, Previous: <a href="about-architecture.html">Maneage architecture</a>, Up: <a href="about.html">About</a> </p>
<h2>Customization checklist</h2>
<p>Take the following steps to fully customize Maneage for your research
project. After finishing the list, be sure to run <code>./project configure</code> and
<code>./project make</code> to see if everything works correctly. If you notice anything
missing or incorrect (probably a change that has not been
explained here), please let us know so we can correct it.</p>
<p>As described above, the concept of reproducibility (during a project)
heavily relies on <a href="https://en.wikipedia.org/wiki/Version_control">version
control</a>. Currently Maneage
uses Git as its main version control system. If you are not already
familiar with Git, please read the first three chapters of the <a href="https://git-scm.com/book/en/v2">ProGit
book</a> which provides a wonderful practical
understanding of the basics. You can read later chapters as you get more
advanced in later stages of your work.</p>
<h2>First custom commit</h2>
<ol>
<li><p><strong>Get this repository and its history</strong> (if you don't already have it):
Arguably the easiest way to start is to clone Maneage and prepare for
your customizations as shown below. After cloning, you first rename
the default <code>origin</code> remote server to specify that this is Maneage's
remote server. This will allow you to use the conventional <code>origin</code>
name for your own project as shown in the next steps. Second, you
create and switch to the conventional <code>master</code> branch to start
committing in your project later.</p>
<pre><code>git clone https://git.maneage.org/project.git <span class="comment"># Clone/copy the project and its history.</span>
mv project my-project <span class="comment"># Change the name to your project's name.</span>
cd my-project <span class="comment"># Go into the cloned directory.</span>
git remote rename origin origin-maneage <span class="comment"># Rename current/only remote to "origin-maneage".</span>
git checkout -b master <span class="comment"># Create and enter your own "master" branch.</span>
pwd <span class="comment"># Just to confirm where you are.</span></code></pre></li>
<li><p><strong>Prepare to build project</strong>: The <code>./project configure</code> command of the
next step will build the different software packages within the
"build" directory (that you will specify). Nothing else on your system
will be touched. However, since it takes a long time, it is useful to see
what is being built at every instant (it's almost impossible to tell
from the torrent of commands that are produced!). So open another
terminal on your desktop and navigate to the same project directory
that you cloned (output of last command above). Then run the following
command. Once every second, this command will just print the date
(possibly followed by a non-existent directory notice). But as soon as
the next step starts building software, you'll see the names of
software get printed as they are being built. Once any software is
installed in the project build directory, its name will disappear from
this output. Again, don't worry, nothing will be installed outside the
build directory.</p>
<pre><code><span class="comment"># On another terminal (go to top project source directory, last command above)</span>
./project --check-config</code></pre></li>
<li><p><strong>Test Maneage</strong>: Before making any changes, it is important to test Maneage
and see if everything works properly with the commands below. If there
is any problem in the <code>./project configure</code> or <code>./project make</code> steps,
please contact us to fix the problem before continuing. Since
building the dependencies during configuration can take a long time, you can take
the next few steps (editing the files) while it's working (they don't
affect the configuration). After <code>./project make</code> is finished, open
<code>paper.pdf</code>. If it looks fine, you are ready to start customizing
Maneage for your project. But before that, clean all the extra Maneage
outputs with <code>./project make clean</code> as shown below.</p>
<pre><code>./project configure <span class="comment"># Build the project's software environment (can take an hour or so).</span>
./project make <span class="comment"># Do the processing and build paper (just a simple demo).</span>
<span class="comment"># Open 'paper.pdf' and see if everything is ok.</span>
./project make clean <span class="comment"># Delete the outputs of the test run.</span></code></pre></li>
<li><p><strong>Set up the remote</strong>: You can use any <a href="https://en.wikipedia.org/wiki/Comparison_of_source_code_hosting_facilities">hosting
facility</a>
that supports Git to keep an online copy of your project's version
controlled history. We recommend <a href="https://gitlab.com">GitLab</a> because
it is <a href="https://www.gnu.org/software/repo-criteria-evaluation.html">more ethical (although not
perfect)</a>,
and later you can also host GitLab on your own server. Create
an account in your favorite hosting facility (if you don't already
have one), and define a new project there. Please make sure <em>the newly
created project is empty</em> (some services offer to include a <code>README</code> in
a new project, which is bad in this scenario and will not allow you to
push to it). It will give you a URL (usually starting with <code>git@</code> and
ending in <code>.git</code>). Put this URL in place of <code>XXXXXXXXXX</code> in the first
command below. With the second command, "push" your <code>master</code> branch to
your <code>origin</code> remote, and (with the <code>--set-upstream</code> option) set them
to track/follow each other. However, the <code>maneage</code> branch is currently
tracking/following your <code>origin-maneage</code> remote (automatically set
when you cloned Maneage). So when pushing the <code>maneage</code> branch to your
<code>origin</code> remote, you <em>shouldn't</em> use <code>--set-upstream</code>. With the last
command, you can check which local and remote branches
are tracking each other.</p>
<pre><code>git remote add origin XXXXXXXXXX <span class="comment"># Newly created repo is now called 'origin'.</span>
git push --set-upstream origin master <span class="comment"># Push 'master' branch to 'origin' (with tracking).</span>
git push origin maneage <span class="comment"># Push 'maneage' branch to 'origin' (no tracking).</span>
git branch -vv <span class="comment"># Check which local and remote branches track each other.</span></code></pre></li>
<li><p><strong>Title</strong>, <strong>short description</strong> and <strong>author</strong>: The title and basic
information of your project's output PDF paper should be added in
<code>paper.tex</code>. You should see the relevant place in the preamble (prior
to <code>\begin{document}</code>). After you are done, run the <code>./project make</code>
command again to see your changes in the final PDF, and make sure that
your changes don't cause a crash in LaTeX. Of course, if you use a
different LaTeX package/style for managing the title and authors (in
particular a specific journal's style), please feel free to use
your own methods after finishing this checklist and doing your first
commit.</p></li>
<li><p><strong>Delete dummy parts</strong>: Maneage contains some parts that are only for
the initial/test run, mainly as a demonstration of important steps,
which you can use as a reference in your own project. But they
are not for any real analysis, so you should remove these parts as
described below:</p>
<ul>
<li><p><code>paper.tex</code>: 1) Delete the text of the abstract (from
<code>\includeabstract{</code> to <code>\vspace{0.25cm}</code>) and write your own (a
single sentence can be enough now, you can complete it later). 2)
Add some keywords under it in the keywords part. 3) Delete
everything between <code>%% Start of main body.</code> and <code>%% End of main
body.</code>. 4) Remove the notice in the "Acknowledgments" section (in
<code>\new{}</code>) and acknowledge your funding sources (this can also be
done later). Just don't delete the existing acknowledgment
statement: Maneage is possible thanks to funding from several
grants. Since Maneage is being used in your work, it is necessary to
acknowledge them in your work also.</p></li>
<li><p><code>reproduce/analysis/make/top-make.mk</code>: Delete the <code>delete-me</code> line
in the <code>makesrc</code> definition. Just make sure there is no empty line
between the <code>download \</code> and <code>verify \</code> lines (they should be
directly under each other).</p></li>
<li><p><code>reproduce/analysis/make/verify.mk</code>: In the final recipe, under the
commented line <code>Verify TeX macros</code>, remove the full line that
contains <code>delete-me</code>, and set the value of <code>s</code> in the line for
<code>download</code> to <code>XXXXX</code> (any temporary string; you'll fix it at the
end of your project, when it's complete).</p></li>
<li><p>Delete all <code>delete-me*</code> files in the following directories:</p>
<pre><code>rm tex/src/delete-me*
rm reproduce/analysis/make/delete-me*
rm reproduce/analysis/config/delete-me*</code></pre></li>
<li><p>Disable verification of outputs by removing the <code>yes</code> from
<code>reproduce/analysis/config/verify-outputs.conf</code>. Later, when you are
ready to submit your paper, or publish the dataset, activate
verification and make the proper corrections in this file (described
under the "Other basic customizations" section below). This is a
critical step and only takes a few minutes when your project is
finished. So DON'T FORGET to activate it at the end.</p></li>
<li><p>Re-make the project (after a cleaning) to make sure you haven't
introduced any errors.</p>
<pre><code>./project make clean
./project make</code></pre></li>
</ul></li>
<li><p><strong>Don't merge some files in future updates</strong>: As described below, you
can later update your infrastructure (for example to fix bugs) by
merging your <code>master</code> branch with <code>maneage</code>. For files that you have
created in your own branch, there will be no problem. However, if you
modify an existing Maneage file for your project, the next time it is
updated on <code>maneage</code> you'll have an annoying conflict. The commands
below show how to avoid this future problem. With them, you can
configure Git to ignore the changes in <code>maneage</code> for some of the files
you have already edited and deleted above (and will edit below). Note
that only the first <code>echo</code> command has a <code>&gt;</code> (to write over the file),
the rest have <code>&gt;&gt;</code> (to append to it). If you want to prevent any other
set of files from being imported from Maneage into your project's branch,
you can follow a similar strategy. We recommend only doing it when you
encounter the same conflict in more than one merge and there is no
other change in that file. Also, don't add core Maneage Makefiles,
otherwise Maneage can break on the next run.</p>
<pre><code>echo "paper.tex merge=ours" > .gitattributes
echo "tex/src/delete-me.mk merge=ours" >> .gitattributes
echo "tex/src/delete-me-demo.mk merge=ours" >> .gitattributes
echo "reproduce/analysis/make/delete-me.mk merge=ours" >> .gitattributes
echo "reproduce/software/config/TARGETS.conf merge=ours" >> .gitattributes
echo "reproduce/analysis/config/delete-me-num.conf merge=ours" >> .gitattributes
git add .gitattributes</code></pre></li>
<li><p><strong>Copyright and License notice</strong>: It is necessary that <em>all</em> the
"copyright-able" files in your project (those larger than 10 lines)
have a copyright and license notice. Please take a moment to look at
several existing files to see a few examples. The copyright notice is
usually close to the start of the file, it is the line starting with
<code>Copyright (C)</code> and containing a year and the author's name (like the
examples below). The License notice is a short description of the
copyright license, usually one or two paragraphs with a URL to the
full license. Don't forget to add these <em>two</em> notices to <em>any new
file</em> you add in your project (you can just copy-and-paste). When you
modify an existing Maneage file (which already has the notices), just
add a copyright notice in your name under the existing one(s), like
the line with capital letters below. To start with, add this line with
your name and email address to <code>paper.tex</code>,
<code>tex/src/preamble-header.tex</code>, <code>reproduce/analysis/make/top-make.mk</code>,
and generally, all the files you modified in the previous step.</p>
<pre><code>Copyright (C) 2018-2022 Existing Name &lt;existing@email.address&gt;
Copyright (C) 2022 YOUR NAME &lt;YOUR@EMAIL.ADDRESS&gt;</code></pre></li>
<li><p><strong>Configure Git for first time</strong>: If this is the first time you are
running Git on this system, then you have to configure it with some
basic information in order to have essential information in the commit
messages (ignore this step if you have already done it). Git will
include your name and e-mail address information in each commit. You
can also specify your favorite text editor for writing the commit
message (<code>emacs</code>, <code>vim</code>, <code>nano</code>, etc.).</p>
<pre><code>git config --global user.name "YourName YourSurname"
git config --global user.email your-email@example.com
git config --global core.editor nano</code></pre></li>
<li><p><strong>Your first commit</strong>: You have already made some small and basic
changes in the steps above and you are in your project's <code>master</code>
branch. So, you can officially make your first commit in your
project's history and push it. But before that, you need to make sure
that there are no problems in the project. It is a good habit to
always rebuild the project before a commit to be sure it works as
expected.</p>
<pre><code>git status <span class="comment"># See which files you have changed.</span>
git diff <span class="comment"># Check the lines you have added/changed.</span>
./project make <span class="comment"># Make sure everything builds successfully.</span>
git add -u <span class="comment"># Put all tracked changes in staging area.</span>
git status <span class="comment"># Make sure everything is fine.</span>
git diff --cached <span class="comment"># Confirm all the changes that will be committed.</span>
git commit <span class="comment"># Your first commit: put a good description!</span>
git push <span class="comment"># Push your commit to your remote.</span></code></pre></li>
<li><p><strong>Start your exciting research</strong>: You are now ready to add flesh and
blood to this raw skeleton by further modifying and adding your
exciting research steps. You can use the "published works" section in
the introduction (above) as some fully working models to learn
from. Also, don't hesitate to contact us if you have any
questions.</p></li>
</ol>
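<p>The <code>merge=ours</code> attribute used in the <code>.gitattributes</code> step above names a merge driver called <code>ours</code>. As far as we understand from Git's documentation, this driver has to be defined in your Git configuration (for example with <code>git config merge.ours.driver true</code>) before the attribute has an effect, and it only comes into play when both branches changed the same file. The following is a minimal sketch under these assumptions, not a part of Maneage: it builds a throw-away repository in a temporary directory (the <code>demo</code> and <code>upstream</code> names are hypothetical) and merges a conflicting change to <code>paper.tex</code>.</p>

```shell
#!/bin/sh
set -e

# Work in a throw-away repository (hypothetical names, not Maneage itself).
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

# Assumption: define the "ours" driver; 'true' keeps our version as-is.
git config merge.ours.driver true

# Common starting point for both branches.
echo "original" > paper.tex
git add paper.tex
git commit -q -m "Base commit"
base=$(git rev-parse --abbrev-ref HEAD)

# A stand-in for the upstream branch, which edits the same file.
git checkout -q -b upstream
echo "upstream edit" > paper.tex
git commit -q -a -m "Upstream change"

# Back on our branch: register the attribute and make our own edit.
git checkout -q "$base"
echo "paper.tex merge=ours" > .gitattributes
echo "my edit" > paper.tex
git add .gitattributes paper.tex
git commit -q -m "My change"

# Merge the upstream branch; both sides changed 'paper.tex'.
git merge -q -m "Merge upstream" upstream
cat paper.tex
```

<p>If the driver is picked up, the merge should finish without a conflict and the last command should show your own version of the file. To make the driver available in all your repositories, the same setting can be stored with <code>git config --global</code>.</p>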
<h2>Other basic customizations</h2>
<ul>
<li><p><strong>High-level software</strong>: Maneage installs all the software that your
project needs. You can specify which software your project needs in
<code>reproduce/software/config/TARGETS.conf</code>. The necessary software is
classified into two classes: 1) programs or libraries (usually written
in C/C++) which are run directly by the operating system. 2) Python
modules/libraries that are run within Python. By default
<code>TARGETS.conf</code> only has GNU Astronomy Utilities (Gnuastro) as one
scientific program and Astropy as one scientific Python module. Both
have many dependencies which will be installed into your project
during the configuration step. To see a list of software that is
currently ready to be built in Maneage, see
<code>reproduce/software/config/versions.conf</code> (which also has their
versions). The comments in <code>TARGETS.conf</code> describe how to use the software
name from <code>versions.conf</code>. Currently the raw pipeline just uses
Gnuastro to make the demonstration plots. Therefore if you don't need
Gnuastro, go through the analysis steps in <code>reproduce/analysis</code> and
remove all its use cases (clearly marked).</p></li>
<li><p><strong>Input dataset</strong>: The input datasets are managed through the
<code>reproduce/analysis/config/INPUTS.conf</code> file. It is best to gather all
the information regarding all the input datasets into this one central
file. To ensure that the proper dataset is being downloaded and used
by the project, it is also recommended to get an <a href="https://en.wikipedia.org/wiki/MD5">MD5
checksum</a> of the file and include
that in <code>INPUTS.conf</code> so the project can check it automatically. The
preparation/downloading of the input datasets is done in
<code>reproduce/analysis/make/download.mk</code>. Have a look there to see how
these values are to be used. This information about the input datasets
is also used in the initial <code>configure</code> script (to inform the users),
so also modify that file. You can find all occurrences of the demo
dataset with the command below and replace them with your input
dataset.</p>
<pre><code>grep -ir wfpc2 ./*</code></pre></li>
<li><p><strong><code>README.md</code></strong>: Correct all the <code>XXXXX</code> placeholders (name of your
project, your own name, address of your project's online/remote
repository, link to download dependencies, etc.). Generally, read
over the text and update it where necessary to fit your project. Don't
forget that this is the first file that is displayed on your online
repository and the first file that your colleagues will be drawn to
read. Therefore, make it as easy as possible for them to start
with. Also check and update this file one last time when you are ready
to publish your project's paper/source.</p></li>
<li><p><strong>Verify outputs</strong>: During the initial customization checklist, you
disabled verification. This is natural because during the project you
need to make changes all the time and it's a waste of time to update
the verification every time. But at significant moments of the project
(for example before submission to a journal, or publication) it is
necessary. When you activate verification, before building the paper,
all the specified datasets will be compared with their respective
checksum and if any file's checksum is different from the one recorded
in the project, the project will stop and print the problematic file and its
expected and calculated checksums. First set the value of the
<code>verify-outputs</code> variable in
<code>reproduce/analysis/config/verify-outputs.conf</code> to <code>yes</code>. Then go to
<code>reproduce/analysis/make/verify.mk</code>. The verification of all the files
is only done in one recipe. First the files that go into the
plots/figures are checked, then the LaTeX macros. Validation of the
former (inputs to plots/figures) should be done manually. If it's the
first time you are doing this, you can see two examples of the dummy
steps (with <code>delete-me</code>, you can use them if you like). These two
examples should be removed before you can run the project. For the
latter, you just have to update the checksums. The important thing to
consider is that a simple checksum can be problematic because some
file generators print their run-time date in the file (for example as
commented lines in a text table). For checking text files, this
Makefile already has a function for this:
<code>verify-txt-no-comments-leading-space</code>. As the name suggests, it will
remove comment lines and empty lines before calculating the MD5
checksum. For FITS formats (common in astronomy), fortunately there is
a <code>DATASUM</code> definition which will return the checksum independent of
the headers. You can use the provided function(s), or define one for
your special formats.</p></li>
<li><p><strong>Feedback</strong>: As you use Maneage you will notice many things that if
implemented from the start would have been very useful for your
work. This can be in the actual scripting and architecture of Maneage,
or useful implementation and usage tips, like those below. In any
case, please share your thoughts and suggestions with us, so we can
add them here for everyone's benefit.</p></li>
<li><p><strong>Re-preparation</strong>: Automatic preparation is only run in the first run
of the project on a system. To redo the preparation you have to use
the option below. Here is the reason for this: when it's necessary, the
preparation process can be slow and will unnecessarily slow down the
whole project while the project is under development (focus is on the
analysis that is done after preparation). Because of this, preparation
will be done automatically only the first time that the project is run
(when <code>.build/software/preparation-done.mk</code> doesn't exist). After the
preparation process completes once, future runs of <code>./project make</code>
will not do the preparation process anymore (will not call
<code>top-prepare.mk</code>). They will only call <code>top-make.mk</code> for the
analysis. To manually invoke the preparation process after the first
attempt, the <code>./project make</code> script should be run with the
<code>--prepare-redo</code> option, or you can delete the special file above.</p>
<pre><code>./project make --prepare-redo</code></pre></li>
<li><p><strong>Pre-publication: add notice on reproducibility</strong>: Add a notice
somewhere prominent on the first page of your paper, informing the
reader that your research is fully reproducible. For example at the
end of the abstract, or under the keywords with a title like
"reproducible paper". This will encourage readers to publish their own
works in this manner and will also help spread the word.</p></li>
</ul>
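<p>To illustrate why comment lines are excluded when verifying text files (the "Verify outputs" item above), here is a minimal sketch. The <code>checksum_no_comments</code> helper and the two sample tables are hypothetical; the helper only mimics the behavior described for <code>verify-txt-no-comments-leading-space</code> (it is not Maneage's implementation) and assumes that comment lines start with <code>#</code> and that GNU <code>md5sum</code> is available.</p>

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# Two tables with identical data; only the run-time date in the
# comment line and the position of the blank line differ.
printf '# Created on 2022-01-01\n\n  1 2.5\n  2 3.5\n' > "$tmp/a.txt"
printf '# Created on 2022-06-15\n  1 2.5\n\n  2 3.5\n' > "$tmp/b.txt"

# Hypothetical helper: strip leading white space, then drop comment
# lines and empty lines, and print the MD5 checksum of what remains.
checksum_no_comments () {
    sed -e 's/^[[:space:]]*//' -e '/^#/d' -e '/^$/d' "$1" \
        | md5sum | cut -d' ' -f1
}

# Checksums of the raw files and of the filtered contents.
raw_a=$(md5sum < "$tmp/a.txt" | cut -d' ' -f1)
raw_b=$(md5sum < "$tmp/b.txt" | cut -d' ' -f1)
sum_a=$(checksum_no_comments "$tmp/a.txt")
sum_b=$(checksum_no_comments "$tmp/b.txt")

echo "raw:      $raw_a $raw_b"
echo "filtered: $sum_a $sum_b"
```

<p>The two raw checksums differ because of the date in the comment line, while the two filtered checksums agree. A plain checksum would therefore report a failure on every re-run, even though the data has not changed.</p>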
<p align="right">Next: <a href="about-tips.html">Tips for designing your project</a>, Previous: <a href="about-architecture.html">Maneage architecture</a>, Up: <a href="about.html">About</a> </p>
<footer role="contentinfo" id="page-footer">
<ul>
<li><p>Maneage is currently based in the Centro de Estudios de Física del Cosmos de Aragón (CEFCA).</p></li>
<li><p>Address: CEFCA, Plaza San Juan 1, Planta 2, Teruel, Spain, 44001.</p></li>
<li><p>Contact: with <a href="https://savannah.nongnu.org/support/?func=additem&group=reproduce">this form</a>, or <a href="https://app.element.io/#/room/#maneage-general:matrix.org">#maneage-general:matrix.org</a>, or project PI (<a href="http://akhlaghi.org">Mohammad Akhlaghi</a>).</p></li>
<li><p>Copyright © 2020-2022 Maneage volunteers</p></li>
<li>This page is distributed under GNU General Public License (<a href="https://www.gnu.org/licenses/gpl-3.0.en.html">GPL</a>).</li>
</ul>
</footer>
</body>
</html>