author    Mohammad Akhlaghi <mohammad@akhlaghi.org>    2021-06-08 02:21:35 +0100
committer Mohammad Akhlaghi <mohammad@akhlaghi.org>    2021-06-08 02:21:35 +0100
commit    54d994d4aedae5f0222ce2c967bb884bc19d1d64 (patch)
tree      c6b928d31e43980e5f23472995dbbad7846bc522
parent    f88b104f6ed0f075e88d24163510cbb606e44b18 (diff)
Improved appendix on archival
Until now, the appendix only touched upon the archival aspects of scholarly research products (data, code, narrative). To help clarity, the content of this section has been improved, giving more explanations and examples.
-rw-r--r--   tex/src/appendix-existing-tools.tex   57
-rw-r--r--   tex/src/references.tex                15
-rw-r--r--   tex/src/supplement.tex                 2
3 files changed, 57 insertions, 17 deletions
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index eff30b4..0e69be1 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -334,22 +334,47 @@ Storing the changes in binary files is also possible in Git, however it is most
\subsection{Archiving}
\label{appendix:archiving}
-Long-term, bytewise, checksummed archiving of software research projects is necessary for a project to be reproducible many decades later.
-The Wayback Machine\footnote{\inlinecode{\url{https://archive.org}}} and similar services such as Archive Today\footnote{\inlinecode{\url{https://archive.today}}} provide on-demand long-term archiving of web pages, which is a critically important service for preserving the history of the World Wide Web.
-However, research project software archiving requires the preservation of files and metadata about the files, not of web pages.
-This is commonly done in public research repositories such as Zenodo\footnote{\inlinecode{\url{https://zenodo.org}}}, which publishes md5sums of uploaded files, freezes them as a DOI-identified version of record, and provides convenient maintenance of metadata by the uploading user.
-Universities now regularly provide their own repositories,\footnote{E.g. \inlinecode{\url{https://repozytorium.umk.pl}}} many of which are registered with the \emph{Open Archives Initiative} that aims at repository interoperability.\footnote{\inlinecode{\url{https://www.openarchives.org/Register/BrowseSites}}}
-
-For preserving the full editing records of a software project, \emph{Software Heritage}\citeappendix{dicosmo18} is especially useful.
-Software Heritage allows a user to anonymously nominate the URL of a git (or cvs) commit history of any project and request that it be archived.
-The Software Heritage scripts (themselves free-licensed) download the repository and allow the repository as a whole or individual files to be accessed using a URI.
-
-The {\LaTeX} and figure source files for the final research paper itself are also best archived on a preprint server such as ArXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research papers.
-ArXiv recommends that the figures of a research paper are provided in postscript, a plain-text format, to maximise long-term longevity, and (normally) provides the source package and both postscript and pdf formats of the paper by email and on the web.
-ArXiv provides long-term stable URIs, allowing versions, for each accepted research preprint.\footnote{\inlinecode{\url{https://arxiv.org/help/arxiv_identifier}}}
-
-An open question in archiving the full sequence of steps that go into a quantitative scientific research project is how to or whether to preserve ``scholarly ephemera'' in scientific software development.
-This refers to discussion about the software such as reports on bugs or proposals of adding features, which are usually referred to as ``Issues'', and ``pull requests'', which propose that a change be ``pulled'' into the main branch of a software development repository by the core developers.
+Long-term, bytewise, checksummed archiving of software research projects is necessary for a project to be reproducible for a wide community (in both time and space).
+Generally, the archived files are either binary or plain-text source files.
+In some cases, specific tools have their own archival systems, like Docker Hub\footnote{\inlinecode{\url{https://hub.docker.com}}} for Docker containers (these were discussed in Appendix \ref{appendix:containers}, so they are not reviewed here).
+In this section, we focus on generic archival tools.
+
+The Wayback Machine (part of the Internet Archive)\footnote{\inlinecode{\url{https://archive.org}}} and similar services such as Archive Today\footnote{\inlinecode{\url{https://archive.today}}} provide on-demand long-term archiving of web pages, which is a critically important service for preserving the history of the World Wide Web.
+However, because these services are heavily tailored to the web format, they have many limitations for source code or data.
+For example, the only way to archive the source code of a computational project on these services is as a compressed tarball\footnote{For instance \inlinecode{\url{https://archive.org/details/gnuastro}}}.
+
+Public research repositories such as Zenodo\footnote{\inlinecode{\url{https://zenodo.org}}} or Figshare\footnote{\inlinecode{\url{https://figshare.com}}} allow long-term archival of academic files of any format (data, hand-written narrative, or code).
+Since they are tailored to academic products, these services mint a DOI for each package of files, let the uploading user conveniently maintain its metadata, and verify the files with MD5 checksums.
+Since they allow large files, they are mostly useful for data (for example Zenodo currently allows a total size, for all files, of 50 GB in each upload).
+Universities now regularly provide their own repositories,\footnote{For example \inlinecode{\url{https://repozytorium.umk.pl}}} many of which are registered with the \emph{Open Archives Initiative} that aims at repository interoperability\footnote{\inlinecode{\url{https://www.openarchives.org/Register/BrowseSites}}}.
+
+However, a computational research project's source code (the instructions on how to do the analysis, how to build the plots and how to prepare the software environment, blended with narrative) is different from data (which is usually just a series of values coming out of experiments, and can get very large).
+Even though both source code and data are ultimately just a series of bits in a file, their creation and usage within a project are significantly different.
+Source code is written by humans, for machines to execute \emph{and also} for humans to read and modify; it is often composed of many files and thousands of lines of (modular) code, and the fine details of the history of changes in those lines are usually preserved through version control, as mentioned in Appendix \ref{appendix:versioncontrol}.
+
+For example, the source-code files for a research paper can also be archived on a preprint server such as arXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research preprints.
+ArXiv only requires the {\LaTeX} source of a paper (and its plots) to build the paper internally and provide users with PostScript or PDF outputs; having access to the {\LaTeX} source allows it to extract metadata such as the bibliography.
+However, along with the {\LaTeX} source, authors can also submit any type of plain-text file, for example Shell or Python scripts (as long as the total volume of the upload does not exceed a certain limit, currently 50 MB).
+This feature of arXiv is heavily used by Maneage'd papers.
+For example, this paper is available at \href{https://arxiv.org/abs/2006.03018}{arXiv:2006.03018}; by clicking on ``Other formats'' and then ``Download source'', any interested reader can obtain the full uploaded source (which includes a complete snapshot of Maneage at the point the paper was built, with all software and narrative source code).
+In fact, the \inlinecode{./project make dist} command in Maneage automatically creates the arXiv-ready tarball for authors to upload.
+ArXiv provides long-term stable URIs, giving a unique identifier to each publication\footnote{\inlinecode{\url{https://arxiv.org/help/arxiv_identifier}}}, and is mirrored on many servers across the globe.
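+As a minimal sketch of the author-side workflow (the tarball name below is only illustrative; the actual name depends on the project's configuration and version), the commands are:
+\begin{verbatim}
+  # Build the project and package the LaTeX source, plots and
+  # project scripts into an arXiv-ready tarball.
+  ./project make dist
+
+  # Optionally inspect the contents before uploading to arXiv
+  # (illustrative file name).
+  tar tf paper-v1.tar.gz
+\end{verbatim}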
+
+The granularity offered by the archival systems above is a file (which is usually a compressed package of many files in the case of source code).
+It is therefore not possible to be any more precise than a whole file when preserving or linking to content, or to preserve the history of changes within a file (which is very important for hand-written source code).
+Commonly used Git repositories (like Codeberg, GitLab or GitHub) do provide access to the fine details and history of a source file.
+However, the Git history of projects on these repositories can easily be rewritten by their owners, and the hosting company may go bankrupt or be sold to another (thus changing the URLs).
+Such repositories are therefore not reliable sources in view of longevity.
+
+For preserving, and providing direct access to, the fine details of a source-code file (down to the granularity of a single line within the file), Software Heritage is especially useful\citeappendix{abramatic18,dicosmo18}.
+Through Software Heritage, users can anonymously nominate the version-controlled repository of any publicly accessible project and request that it be archived.
+The Software Heritage scripts (themselves free-licensed) download the repository and preserve it within Software Heritage's archival system.
+This allows the repository as a whole, individual files, or even specific lines within a file to be accessed using a standard Software Heritage identifier (SWHID); for more, see \citeappendix{dicosmo18}.
+In the main body of this paper, we use this feature multiple times.
+Software Heritage is also mirrored on servers across the world and is supported by major international institutions like UNESCO.
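+For illustration, the general form of an SWHID, together with the optional qualifiers that enable this line-level granularity (the origin URL and line numbers below are hypothetical), is:
+\begin{verbatim}
+  swh:1:<type>:<40-character SHA1 of the object>
+      ;origin=https://example.org/project.git
+      ;lines=12-24
+\end{verbatim}
+The type can be \inlinecode{cnt} (file content), \inlinecode{dir} (directory), \inlinecode{rev} (commit), \inlinecode{rel} (release) or \inlinecode{snp} (repository snapshot); the \inlinecode{origin} and \inlinecode{lines} qualifiers identify the archived repository and the exact lines being cited.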
+
+An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether/how to preserve ``scholarly ephemera'' in scientific software development.
+This refers to discussions about the software such as reports on bugs or proposals of adding features, which are usually referred to as ``Issues'', and ``pull requests'', which propose that a change be ``pulled'' into the main branch of a software development repository by the core developers.
These ephemera are not part of the git commit history of a software project, but add wider context and understanding beyond the commit history itself, and provide a record that could be used to allocate intellectual credit.
For these reasons, the \emph{Investigating \& Archiving the Scholarly Git Experience} (IASGE) project proposes that the ephemera should be archived as well as the git repositories themselves.\footnote{\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{https://investigating-archiving-git.gitlab.io/updates/}}\\\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{define-scholarly-ephemera}}}
diff --git a/tex/src/references.tex b/tex/src/references.tex
index 6b768ee..bf99d1d 100644
--- a/tex/src/references.tex
+++ b/tex/src/references.tex
@@ -642,6 +642,21 @@ archivePrefix = {arXiv},
+@ARTICLE{abramatic18,
+ author = {Jean-Francois Abramatic and Roberto {Di Cosmo} and Stefano Zacchiroli},
+ title = {Building the Universal Archive of Source Code},
+ journal = {Communications of the ACM},
+ year = 2018,
+ volume = 61,
+ number = 10,
+ pages = {29},
+ doi = {10.1145/3183558},
+}
+
+
+
+
+
@ARTICLE{dicosmo18,
author = {{Di Cosmo}, Roberto and {Gruenpeter}, Morane and {Zacchiroli}, Stefano},
title = {Identifiers for Digital Objects: The case of software source code preservation},
diff --git a/tex/src/supplement.tex b/tex/src/supplement.tex
index 6fb75b2..2624729 100644
--- a/tex/src/supplement.tex
+++ b/tex/src/supplement.tex
@@ -75,7 +75,7 @@
%% A short appendix describing this file.
\begin{abstract}
- This supplement contains appendices to the main body of a paper submitted to CiSE (pre-print published on \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}).
+ This supplement contains appendices to the main body of Akhlaghi et al., published in CiSE (\href{https://doi.org/10.1109/MCSE.2021.3072860}{DOI:10.1109/MCSE.2021.3072860}, with pre-print published on \href{https://arxiv.org/abs/\projectarxivid}{\texttt{arXiv:\projectarxivid}} or \href{https://doi.org/10.5281/zenodo.\projectzenodoid}{\texttt{zenodo.\projectzenodoid}}).
In the paper's main body we introduced criteria for longevity of reproducible workflow solutions and introduced a proof of concept that implements them, called Maneage (\emph{Man}aging data lin\emph{eage}).
This supplement provides an in-depth literature review of previous methods and compares them and their lower-level tools in detail with our criteria and with the proof of concept presented in this work.
Appendix \ref{appendix:existingtools} reviews the low-level tools that are used by many reproducible workflow solutions (including our proof of concept).