author    Boud Roukema <boud@cosmo.torun.pl>  2021-06-08 19:18:19 +0200
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>  2021-06-08 18:33:57 +0100
commit    6f7f00fb3fb4c14c85890bff6bd89485ccc1ff92 (patch)
tree      5d0cc9429310f82011241d20c6656e5130ea1469
parent    c1bc4ebfa064f61fb53bc44d221fc059b7e4abfb (diff)
Several minor edits, removed exact value of arXiv's size-limit
This commit makes several copyediting changes to the appendices and to the supplement.tex introduction to the appendices. The claim that arXiv unofficially increased its upload limit to 50 MB comes from a tweet (https://nitter.fdn.fr/arxiv/status/1286381643893268483; archive: https://archive.today/PdxhT) but is not listed on official arXiv pages, so it seems safer not to quote a value. The very old limit was 0.5 MB, out of respect for people with low bandwidth, especially scientists in poor countries. Tweets are generally not acceptable as "reliable sources" on en.Wikipedia.
-rw-r--r--  tex/src/appendix-existing-solutions.tex  | 16
-rw-r--r--  tex/src/appendix-existing-tools.tex      | 41
-rw-r--r--  tex/src/supplement.tex                   |  2
3 files changed, 30 insertions, 29 deletions
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 9710df8..58a3518 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -426,21 +426,21 @@ ReproZip\footnote{\inlinecode{\url{https://www.reprozip.org}}}\citeappendix{chir
The tracking is done at the kernel system-call level, so any file that is accessed during the running of the project is identified.
The tracked files can be packaged into a \inlinecode{.rpz} bundle that can then be unpacked into another system.
-ReproZip is therefore very good to take a ``snapshot'' of the running environment, at one moment into a single file.
+ReproZip is therefore very good for storing a ``snapshot'' of the running environment, at a single moment, into a single file.
However, the bundle can become very large when many/large datasets are involved, or if the software environment is complex (many dependencies).
-Furthermore, since the binary software libraries are directly copied, it can only be re-run on a systems with a similar CPU architecture.
-Furthermore, ReproZip copies all files used in a project, without a way of knowing how the software was built (its provenance).
+Furthermore, since the binary software libraries are directly copied, it can only be re-run on systems with a compatible CPU architecture.
+Another problem is that ReproZip copies all files used in a project, without (by default) a way of knowing how the software was built (its provenance).
-As mentioned in this paper, and also Oliveira et al. \citeappendix{oliveira18} the question of ``how'' the environment was built is critical for understanding the results; simply having the binaries cannot necessarily be useful in many contexts.
-It is possible to include the build instructions of the used software within the project to be ReproZip'd, but that will again simply bloat the bundle due to the many temporary files that are created during the build of a software, add complexity and slow down the project's running time.
+As mentioned in this paper, and also Oliveira et al. \citeappendix{oliveira18}, the question of ``how'' the environment was built is critical to understanding the results; having only the binaries is not useful in many contexts.
+It is possible to include the build instructions of the software used within the project to be ReproZip'd, but this risks bloating the bundle with the many temporary files that are created during the build of the software, adding complexity and slowing down the project's running time.
For the data, it is similarly not possible to extract which data server they came from.
Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
Such files can be excluded from the bundle through modifications in the configuration file.
-But this will only add complexity since a higher-level script over ReproZip needs to be written to make sure that the data and bundle are used together or check the integrity of the data.
+However, this will add complexity if a higher-level script is written on top of ReproZip to make sure that the data and bundle are used together or to check the integrity of the data.
-Finally, because it is only a snapshot of one moment in a project's history, preserving the connection between the ReproZip'd bundles of various points in a project's history is not easily possible (for example when software or data are updated, or new analysis methods are used).
-In other words, a ReproZip user will have to define their own subjective archival method to preserve the various black-boxs of their project as it evolves, and tracking what has changed between them is not trivial.
+Finally, because it is only a snapshot of one moment in a project's history, preserving the connection between the ReproZip'd bundles of various points in a project's history is likely to be difficult (for example, when software or data are updated, or when analysis methods are modified).
+In other words, a ReproZip user will have to personally define an archival method to preserve the various black boxes of the project as it evolves, and tracking what has changed between the versions is not trivial.
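As a concrete illustration of the workflow described above, the following is a minimal sketch of how a project might be traced, packed and re-run with ReproZip; the script name and paths are hypothetical placeholders, not taken from the paper:

    # Trace the analysis at the kernel system-call level, recording every file it accesses.
    reprozip trace ./analysis.sh
    # Package the traced files into a single .rpz bundle.
    reprozip pack analysis.rpz
    # On another system (with a compatible CPU architecture), unpack and re-run the bundle.
    reprounzip directory setup analysis.rpz ./analysis-unpacked
    reprounzip directory run ./analysis-unpacked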
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index d636f36..f234ac0 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -334,45 +334,46 @@ Storing the changes in binary files is also possible in Git, however it is most
\subsection{Archiving}
\label{appendix:archiving}
-Long-term, bytewise, checksummed archiving of software research projects is necessary for a project to be reproducible for a wide community (in both time and space).
+Long-term, bytewise, checksummed archiving of software research projects is necessary for a project to be reproducible by a broad community (in both time and space).
Generally, archival includes either binary or plain-text source code files.
-In some cases, specific tools have their own archival systems like Docker Hub\footnote{\inlinecode{\url{https://hub.docker.com}}} for Docker containers (that were discussed above in Appendix \ref{appendix:containers}, hence they are not reviewed here).
+In some cases, specific tools have their own archival systems, such as Docker Hub\footnote{\inlinecode{\url{https://hub.docker.com}}} for Docker containers (that were discussed above in Appendix \ref{appendix:containers}, so they are not reviewed here).
We will focus on generic archival tools in this section.
The Wayback Machine (part of the Internet Archive)\footnote{\inlinecode{\url{https://archive.org}}} and similar services such as Archive Today\footnote{\inlinecode{\url{https://archive.today}}} provide on-demand long-term archiving of web pages, which is a critically important service for preserving the history of the World Wide Web.
-However, because these services are heavily tailored to the web format, they have many limitations for source code or data.
+However, because these services are heavily tailored to the web format, they have many limitations for scientific source code or data.
For example, the only way to archive the source code of a computational project is through its tarball\footnote{For example \inlinecode{\url{https://archive.org/details/gnuastro}}}.
-Through public research repositories such as Zenodo\footnote{\inlinecode{\url{https://zenodo.org}}} or Figshare\footnote{\inlinecode{\url{https://figshare.com}}} academic files (of any format: data or hand-written narrative or code) can be archived for the long-term.
+Through public research repositories such as Zenodo\footnote{\inlinecode{\url{https://zenodo.org}}} or Figshare\footnote{\inlinecode{\url{https://figshare.com}}} academic files (in any format and of any type of content: data, hand-written narrative or code) can be archived for the long term.
Since they are tailored to academic files, these services mint a DOI for each package of files, and provide convenient maintenance of metadata by the uploading user, while verifying the files with MD5 checksums.
-Since they allow large files, they are mostly useful for data (for example Zenodo currently allows a total size, for all files, of 50 GB in each upload).
+Since these services allow large files, they are mostly useful for data (for example Zenodo currently allows a total size, for all files, of 50 GB in each upload).
Universities now regularly provide their own repositories,\footnote{For example \inlinecode{\url{https://repozytorium.umk.pl}}} many of which are registered with the \emph{Open Archives Initiative} that aims at repository interoperability\footnote{\inlinecode{\url{https://www.openarchives.org/Register/BrowseSites}}}.
-However, a computational research project's source code (including instructions on how to do the research analysis, how to build the plots, blended with narrative, and how to prepare the software environment) are different from data (which is usually just a series of values, coming out of experiments and can get very large).
-Even though both source code and data are ultimately just a series of bits in a file, their creation and usage is significantly different within a project.
-Source code is often written by humans, for machines to execute \emph{and also} humans to read/modify; they are often composed of many files and thousands of lines of (modular) code, often the fine details of the history of the changes in those lines are also preserved though version control as mentioned in Appendix \ref{appendix:versioncontrol}.
+However, a computational research project's source code (including instructions on how to do the research analysis, how to build the plots, blended with narrative, and how to prepare the software environment) is different from the data to be analysed (which are usually just sequences of values resulting from experiments and whose volume can be very large).
+Even though both source code and data are ultimately just sequences of bits in a file, their creation and usage are fundamentally different within a project, from both the philosophy-of-science point of view and from a practical point of view.
+Source code is often written by humans, for machines to execute \emph{and also} for humans to read/modify; it is often composed of many files and thousands of lines of (modular) code.
+Often, the fine details of the history of the changes in those lines are preserved through version control, as mentioned in Appendix \ref{appendix:versioncontrol}.
-For example, the source-code files for a research paper can also be archived on a preprint server such as arXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research pre-prints.
+For example, the source-code files for a research paper can also be archived on a preprint server such as arXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research preprints.
ArXiv uses the {\LaTeX} source of a paper (and its plots) to build the paper internally and provide users with Postscript or PDF outputs: having access to the {\LaTeX} source allows it to extract metadata or contextual information, among other benefits\footnote{\inlinecode{\url{https://arxiv.org/help/faq/whytex}}}.
-However, along with the {\LaTeX} source, authors can also submit any type of plain-text file, including Shell or Python scripts for example (as long as the total volume of the upload doesn't exceed a certain limit, currently 50MB).
+However, along with the {\LaTeX} source, authors can also submit any type of plain-text file, including Shell or Python scripts for example (as long as the total volume of the upload doesn't exceed a certain limit).
This feature of arXiv is heavily used by Maneage'd papers.
For example, this paper is available at \href{https://arxiv.org/abs/2006.03018}{arXiv:2006.03018}; by clicking on ``Other formats'' and then ``Download source'', the full source file that we uploaded is available to any interested reader.
-This file includes a full snapshot of this Maneage'd project at the point the paper was submitted there, including all software and narrative source code.
-Infact the \inlinecode{./project make dist} command in Maneage will automatically create the arXiv-ready tarball for authors to upload easily.
-ArXiv provides long-term stable URIs, giving unique identifiers for each publication\footnote{\inlinecode{\url{https://arxiv.org/help/arxiv_identifier}}} and is mirrored on many severs across the globe.
+The file includes a full snapshot of this Maneage'd project, at the point the paper was submitted there, including all data and software download and build instructions, analysis commands and narrative source code.
+In fact the \inlinecode{./project make dist} command in Maneage will automatically create the arXiv-ready tarball for authors to upload easily.
+ArXiv provides long-term stable URIs, giving unique identifiers for each publication\footnote{\inlinecode{\url{https://arxiv.org/help/arxiv_identifier}}} and is mirrored on many servers across the globe.
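To make the above concrete, the arXiv-ready tarball mentioned here is produced from the top level of a Maneage'd project with a single command; the exact name and contents of the resulting tarball depend on the project's configuration:

    # Run from the top-level directory of a Maneage'd project.
    ./project make dist
    # The resulting tarball (LaTeX sources, plots and related inputs)
    # can then be uploaded to arXiv as the paper's source.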
The granularity offered by the archival systems above is a file (which is usually a compressed package of many files in the case of source code).
-It is therefore not possible to get any more precise in preserving or linking to the contents of a file, or to preserve the history of changes in the file (which are very important in hand-written source code).
-Commonly used Git repositories (like Codeberg, Gitlab or Github) do provide one way to access the fine details of a source file.
-However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may go bankrupt or be sold to another (thus changing the URL).
-Such repositories are therefore not reliable sources in view of longevity.
+It is thus not possible to be more precise when preserving or linking to the contents of a file, or to preserve the history of changes in the file (both of which are very important in hand-written source code).
+Commonly used Git repositories (like Codeberg, Gitlab or Github) do provide one way to access the fine details of the source files in a project.
+However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may become inactive (in the case of citizens'-association-run sites like Codeberg) or go bankrupt or be sold to another company (in the case of commercial sites like Gitlab or Github), thus changing the URL or conditions of access.
+Such repositories are thus not reliable sources in view of longevity.
-For preserving, and providing direct access to the fine details of a source-code file (with the granularity of a line within the file) Software Heritage is especially useful\citeappendix{abramatic18,dicosmo18}.
+For preserving, and providing direct access to the fine details of a source-code file (with the granularity of a line within the file), Software Heritage is especially useful\citeappendix{abramatic18,dicosmo18}.
Through Software Heritage, users can anonymously nominate the version-controlled repository of any publicly accessible project and request that it be archived.
The Software Heritage scripts (themselves free-licensed) download the repository and preserve it within Software Heritage's archival system.
This allows the repository as a whole, individual files, or even certain lines within the files to be accessed using a standard Software Heritage ID (SWHID); for more detail, see \citeappendix{dicosmo18}.
-In the main body of this paper, we use this feature multiple times.
-Software Heritage is also mirrored in international servers and is supported by major international institutions like UNESCO.
+In the main body of \emph{this} paper, we use this feature several times.
+Software Heritage is mirrored on international servers and is supported by major international institutions like UNESCO.
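To illustrate the line-level granularity mentioned above, SWHIDs follow a fixed syntax; the hash values below are placeholders, not identifiers of a real archive entry:

    # A whole directory (e.g. one snapshot of a repository):
    swh:1:dir:0123456789abcdef0123456789abcdef01234567
    # A single file (a "content" object):
    swh:1:cnt:89abcdef0123456789abcdef0123456789abcdef
    # The same file, restricted to specific lines and qualified with its origin:
    swh:1:cnt:89abcdef0123456789abcdef0123456789abcdef;lines=9-15;origin=https://example.org/repo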
An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether/how to preserve ``scholarly ephemera'' in scientific software development.
This refers to discussions about the software such as reports on bugs or proposals of adding features, which are usually referred to as ``Issues'', and ``pull requests'', which propose that a change be ``pulled'' into the main branch of a software development repository by the core developers.
diff --git a/tex/src/supplement.tex b/tex/src/supplement.tex
index 2624729..a047574 100644
--- a/tex/src/supplement.tex
+++ b/tex/src/supplement.tex
@@ -75,7 +75,7 @@
%% A short appendix describing this file.
\begin{abstract}
- This supplement contains appendices to the main body of Akhlaghi et al., published in CiSE (\href{https://doi.org/10.1109/MCSE.2021.3072860}{DOI:10.1109/MCSE.2021.3072860}, with pre-print published on \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}).
+ This supplement contains appendices to the main body of Akhlaghi et al., published in CiSE (\href{https://doi.org/10.1109/MCSE.2021.3072860}{DOI:10.1109/MCSE.2021.3072860}, available as a preprint at \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}).
In the paper's main body, we introduced criteria for the longevity of reproducible workflow solutions and presented a proof of concept that implements them, called Maneage (\emph{Man}aging data lin\emph{eage}).
This supplement provides an in-depth literature review of previous methods and compares them and their lower-level tools in detail with our criteria and with the proof of concept presented in this work.
Appendix \ref{appendix:existingtools} reviews the low-level tools that are used by many reproducible workflow solutions (including our proof of concept).