Diffstat (limited to 'reproduce/analysis/config/INPUTS.conf')
-rw-r--r-- reproduce/analysis/config/INPUTS.conf 113
1 files changed, 70 insertions, 43 deletions
diff --git a/reproduce/analysis/config/INPUTS.conf b/reproduce/analysis/config/INPUTS.conf
index fd8ac53..f3d1cd4 100644
--- a/reproduce/analysis/config/INPUTS.conf
+++ b/reproduce/analysis/config/INPUTS.conf
@@ -1,42 +1,70 @@
-# Input files necessary for this project, the variables defined in this
-# file are primarily used in 'reproduce/analysis/make/download.mk'. See
-# there for precise usage of the variables. But comments are also provided
-# here.
-#
-# Necessary variables for each input dataset are listed below. Its good
-# that all the variables of each file have the same base-name (in the
-# example below 'DEMO') with descriptive suffixes, also put a short comment
-# above each group of variables for each dataset, shortly explaining what
-# it is.
-#
-# 1) Local file name ('DEMO-DATA' below): this is the name of the dataset
-# on the local system (in 'INDIR', given at configuration time). It is
-# recommended that it be the same name as the online version of the
-# file like the case here (note how this variable is used in 'DEMO-URL'
-# for the dataset's full URL). However, this is not always possible, so
-# the local and server filenames may be different. Ultimately, the file
-# name is irrelevant, we check the integrity with the checksum.
-#
-# 2) The MD5 checksum of the file ('DEMO-MD5' below): this is very
-# important for an automatic verification of the file. You can
-# calculate it by running 'md5sum' on your desired file. You can also
-# use any other checksum tool that you prefer, just be sure to correct
-# the respective command in 'reproduce/analysis/make/download.mk'.
-#
-# 3) The human-readable size of the file ('DEMO-SIZE' below): this is an
-# optional variable, mainly to help a reader of your project get a
-# sense of the volume they need to download if they don't already have
-# the dataset. So it is highly recommended to add it (future readers of
-# your project's source will appreciate it!). You can get it from the
-# output of 'ls -lh' command on the file. Optionally you can use it in
-# messages during the configuration phase (when Maneage asks for the
-# input data directory), along with other info about the file(s).
-#
-# 4) The full dataset URL ('DEMO-URL' below): this is the full URL
-# (including the file-name) that can be used to download the dataset
-# when necessary. Also, see the description above on local filename.
-#
-# Copyright (C) 2018-2021 Mohammad Akhlaghi <mohammad@akhlaghi.org>
+# This project's input file information (metadata).
+#
+# For each input (external) data file that is used within the project,
+# three variables are suggested here (two of them are mandatory). These
+# variables will be used by 'reproduce/analysis/make/download.mk' to import
+# the dataset into the project (within the build directory):
+#
+# - If the file already exists locally in '$(INDIR)' (the optional input
+# directory that may have been specified at configuration time with
+# '--input-dir'), a symbolic link will be added in '$(indir)' (in the
+# build directory). A symbolic link is used to avoid extra storage when
+# files are large.
+#
+# - If the file doesn't exist in '$(INDIR)', or no input directory was
+# specified at configuration time, then the file is downloaded from a
+# specific URL.
+#
+# In both cases, before placing the file (or its link) in the build
+# directory, 'reproduce/analysis/make/download.mk' will check the SHA256
+# checksum of the dataset and if it differs from the pre-defined value (set
+# for that file, here), it will abort (since this is not the intended
+# dataset).
+#
+# Therefore, the two variables specifying the URL and SHA256 checksum of
+# the file are MANDATORY. The third variable (INPUT-%-size) showing the
+# human-readable size of the file (from 'ls -lh') is optional (but
+# recommended: it gives future scientists a feeling for the volume of
+# data they need to input, which becomes important if the size/number
+# of files is large).
+#
+# The naming convention is critical for the input files to be properly
+# imported into the project. In the patterns below, the '%' is the full
+# file name (including its suffix): for example in the demo input of this
+# file in the 'maneage' branch, we have 'INPUT-wfpc2.fits-sha256':
+# therefore, the input file (within the project's '$(indir)') is called
+# 'wfpc2.fits'. This allows you to simply set '$(indir)/wfpc2.fits' as the
+# pre-requisite of any recipe that needs the input file: you will rarely
+# (if at all!) need to use these variables directly.
+#
+# INPUT-%-sha256: The sha256 checksum of the file. You can generate the
+# SHA256 checksum of a file with the 'sha256sum FILENAME'
+# command (where 'FILENAME' is the name of your
+# file). This is very important for an automatic
+# verification of the file: that it hasn't changed
+# between different runs of the project (locally or in
+# the URL). SHA256 belongs to the 'SHA' family of
+# checksum standards.
+#
+# INPUT-%-url: The URL to download the file if it is not available
+# locally. It can happen that during the first phases of
+# your project the data aren't yet public. In this case, you
+# can set a phony URL like this (just as a clear place-holder):
+# 'https://this.file/is/not/yet/public'.
+#
+# INPUT-%-size: The human-readable size of the file (output of 'ls
+# -lh'). This is not used by default but can help other
+# scientists who would like to run your project get a
+# good feeling of the network and storage capacity that is
+# necessary to start the project.
+#
+# The input dataset's name (that goes into the '%') can be different from
+# the URL's file name (last component of the URL, after the last '/'). Just
+# note that it is assumed that the local copy (outside of your project) is
+# also called '%' (if your local copy of the input dataset and the online
+# repository names are the same, be sure to set '%' accordingly).
+#
+# Copyright (C) 2018-2022 Mohammad Akhlaghi <mohammad@akhlaghi.org>
#
# Copying and distribution of this file, with or without modification, are
# permitted in any medium without royalty provided the copyright notice and
@@ -48,7 +76,6 @@
# Dataset used in this analysis and its checksum for integrity checking.
-MK20DATA = menke20.xlsx
-MK20MD5 = 8e4eee64791f351fec58680126d558a0
-MK20SIZE = 1.9MB
-MK20URL = https://www.biorxiv.org/content/biorxiv/early/2020/01/18/2020.01.15.908111/DC1/embed/media-1.xlsx
+INPUT-menke20.xlsx-size = 1.9M
+INPUT-menke20.xlsx-url = https://www.biorxiv.org/content/biorxiv/early/2020/01/18/2020.01.15.908111/DC1/embed/media-1.xlsx
+INPUT-menke20.xlsx-sha256 = 7839cdc2946134773ffc401cbcc78fb58fc489d2caad65375c85d605b2f8b13e