From f88b104f6ed0f075e88d24163510cbb606e44b18 Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Mon, 7 Jun 2021 23:40:20 +0100
Subject: Clarifications added to ReproZip in the appendix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After Boud posted a notice about Maneage in an online forum [1], Rémi Rampin and Vicky Rampin (from the ReproZip project) replied with some notes about our review of ReproZip in Appendix B. We are very grateful to both Rémi and Vicky for looking into it and for their comments, their contribution has been gratefully acknowledged with this commit.

The relevant comments are listed below and have been addressed in this commit (see the 'diff' of this commit).

 - [Rémi Rampin] ReproZip can capture the build step if you want it to, it's just another command. So if you want to trace "make" and "pip install" etc before tracing your actual experiment, you will have all that build information.

 - [Rémi Rampin] Bundle size is easily fixed by not putting terabyte-sized data in the bundle, which is done by editing a simple configuration file.

 - [Vicky Rampin] Not all the files in the bundle are compiled/binary files [in relation to the old sentence "ReproZip just copies the binary/compiled files used in a project"].

[1] https://framapiaf.org/@boud/106296894758145705
---
 tex/src/appendix-existing-solutions.tex | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

(limited to 'tex')

diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 75285eb..9710df8 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -422,18 +422,25 @@ We could not find a URL to the source software of Umbrella (no source code repos
 \subsection{ReproZip (2016)}
-ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}}\citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables into a single bundle.
+ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}}\citeappendix{chirigati16} is a Python package that is designed to automatically track all the necessary data files, libraries, and environment variables of a process into a single bundle.
 The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
 The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
-ReproZip is therefore very good to take a ``snapshot'' of the running environment into a single file.
+ReproZip is therefore very good to take a ``snapshot'' of the running environment, at one moment into a single file.
 However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies).
-Since it copies the binary software libraries, it can only be run on systems with a similar CPU architecture to the original.
-Furthermore, ReproZip copies the binary/compiled files used in a project, without a way of knowing how the software was built.
-As mentioned in this paper, and also Oliveira et al. \citeappendix{oliveira18} the question of ``how'' the environment was built is critical for understanding the results, and simply having the binaries cannot necessarily be useful.
+Furthermore, since the binary software libraries are directly copied, it can only be re-run on a systems with a similar CPU architecture.
+Furthermore, ReproZip copies all files used in a project, without a way of knowing how the software was built (its provenance).
+
+As mentioned in this paper, and also Oliveira et al. \citeappendix{oliveira18} the question of ``how'' the environment was built is critical for understanding the results; simply having the binaries cannot necessarily be useful in many contexts.
+It is possible to include the build instructions of the used software within the project to be ReproZip'd, but that will again simply bloat the bundle due to the many temporary files that are created during the build of a software, add complexity and slow down the project's running time.
 For the data, it is similarly not possible to extract which data server they came from.
 Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
+Such files can be excluded from the bundle through modifications in the configuration file.
+But this will only add complexity since a higher-level script over ReproZip needs to be written to make sure that the data and bundle are used together or check the integrity of the data.
+
+Finally, because it is only a snapshot of one moment in a project's history, preserving the connection between the ReproZip'd bundles of various points in a project's history is not easily possible (for example when software or data are updated, or new analysis methods are used).
+In other words, a ReproZip user will have to define their own subjective archival method to preserve the various black-boxs of their project as it evolves, and tracking what has changed between them is not trivial.
-- 
cgit v1.2.1