From afc7c57b3b8240aa84e7682272bf528615530ba2 Mon Sep 17 00:00:00 2001
From: Boud Roukema
Date: Sun, 27 Dec 2020 10:49:48 +0100
Subject: Fix typos; snapshot size

This commit fixes 'automaticly', 'mega byte', 'terra byte'.

It also changes 'will be far less than a mega byte' to 'should be less than a megabyte'. The reason for 'should' is that in some cases, providing a small data set in the package is useful, as in [1]. Of course, [1] would be only 0.9 Mb in size, including the data sets, instead of 1.3 Mb, if the author, whoever that may happen to be, had excluded the useless (produced) file 'paper-tmp.eps'. :P Case [2] is 0.4 Mb.

These two tar archives are for ArXiv, so they also contain produced .eps files. So maybe in principle 'far less than' is right. However, on neither [3] nor [4], trying to follow the recommendations :), are any of the "useful" versions of single file archives smaller than the ArXiv version. The git bundles are bigger because of the git history, and the 'software' archives are 0.5 to 0.6 Gb because they include almost everything.

However, stating something that is possible in principle but not done in practice would be misleading. So I would not include 'far less'.

[1] https://zenodo.org/record/3951152/files/subpoisson-252cf1c-arXiv.tar.gz
[2] https://zenodo.org/record/4062461/files/elaphrocentre-724a7c8-arXiv.tar.gz
[3] https://zenodo.org/record/3951152
[4] https://zenodo.org/record/4062461
---
 paper.tex | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/paper.tex b/paper.tex
index 20b463a..8c74578 100644
--- a/paper.tex
+++ b/paper.tex
@@ -230,10 +230,10 @@ Fewer explicit execution requirements would mean higher \emph{execution possibil
 \textbf{Criterion 2: Modularity.}
 A modular project enables and encourages independent modules with well-defined inputs/outputs and minimal side effects.
-\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automaticly generated files (e.g., built binaries), software source code (maintained separately), or data (archived separately) should be included.
+\new{In terms of file management, a modular project will \emph{only} contain the hand-written project source of that particular high-level project: no automatically generated files (e.g., built binaries), software source code (maintained separately), or data (archived separately) should be included.
 The latter two (developing low-level software or collecting data) are separate projects in themselves and can be used in other high-level projects.}
 Explicit communication between various modules enables optimizations on many levels:
-(1) Storage and archival cost (no duplicate software or data files): a snapshot of a project will be far less than a mega byte.
+(1) Storage and archival cost (no duplicate software or data files): a snapshot of a project should be less than a megabyte.
 (2) Modular analysis components can be executed in parallel and avoid redundancies (when a dependency of a module has not changed, it will not be re-run).
 (3) Usage in other projects.
 (4) Easy debugging and improvements.
@@ -857,7 +857,7 @@ The main reply they got in the discussion is to build the Conda environment in a
 However, as described in Appendix \ref{appendix:independentenvironment} containers just hide the reproducibility problem, they do not fix it: containers are not static and need to evolve (i.e., re-built) with the project.
 Given these limitations, \citeappendix{uhse19} are forced to host their conda-packaged software as tarballs on a separate repository.
-Conda installs with a shell script that contains a binary-blob (+500 mega bytes, embedded in the shell script).
+Conda installs with a shell script that contains a binary-blob (+500 megabytes, embedded in the shell script).
 This is the first major issue with Conda: from the shell script, it is not clear what is in this binary blob and what it does.
 After installing Conda in any location, users can easily activate that environment by loading a special shell script into their shell.
 However, the resulting environment is not fully independent of the host operating system as described below:
@@ -1585,7 +1585,7 @@ Furthermore, ReproZip just copies the binary/compiled files used in a project, i
 As mentioned in this paper, and also \citeappendix{oliveira18} the question of ``how'' the environment was built is critical for understanding the results and simply having the binaries cannot necessarily be useful.
 For the data, it is similarly not possible to extract which data server they came from.
-Hence two projects that each use a 1 terra byte dataset will need a full copy of that same 1 terra byte file in their bundle, making long term preservation extremely expensive.
+Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long term preservation extremely expensive.
--
cgit v1.2.1