author Boud Roukema <boud@cosmo.torun.pl> 2021-06-13 01:56:45 +0200
committer Mohammad Akhlaghi <mohammad@akhlaghi.org> 2021-06-13 15:09:19 +0100
commit a77bba6249f03062eb9849997c77245ad403d027
tree 5579d2b4b0b7256eea4b0626d9bd76c48be56938
parent 313db0b04bd3499f83d9e79fd7e92578cd367c2b
Add GHTorrent, some https, notabug
This commit adds a few sentences about the first known attempt to store and make available Git repository hosting ephemera (GHTorrent, introduced to us by Roberto Di Cosmo). Since one of the two sponsors of GHTorrent is Microsoft, both the ethical and the practical aspects of this project, in the context of reproducibility and of scientific ethics as expressed by the international scientific community, are rather unclear, so a link to one of the well-known lists of practical and ethical issues with Github is included. A minor fix is also made in 'tex/src/appendix-existing-solutions.tex', since the word 'data' is plural (the singular is 'datum').
-rw-r--r-- tex/src/appendix-existing-solutions.tex 2
-rw-r--r-- tex/src/appendix-existing-tools.tex 14
2 files changed, 10 insertions, 6 deletions
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 578d6e9..1d515e4 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -437,7 +437,7 @@ It is possible to include the build instructions of the software used within the
For the data, it is similarly not possible to extract which data server they came from.
Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
Such files can be excluded from the bundle through modifications in the configuration file.
-However, this will add complexity: a higher-level script will be necessary with the ReproZip bundle, to make sure that the data and bundle are used together, or to check the integrity of the data (in case it may have changed).
+However, this will add complexity: a higher-level script will be necessary with the ReproZip bundle, to make sure that the data and bundle are used together, or to check the integrity of the data (in case they have changed).
Finally, because it is only a snapshot of one moment in a project's history, preserving the connection between the ReproZip'd bundles of various points in a project's history is likely to be difficult (for example, when software or data are updated, or when analysis methods are modified).
In other words, a ReproZip user will have to personally define an archival method to preserve the various black boxes of the project as it evolves, and tracking what has changed between the versions is not trivial.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index 7d76db6..c72f393 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -321,7 +321,7 @@ Git commits are commonly summarized by the checksum's first few characters, for
With Git, making parallel ``branches'' (in the project's history) is very easy and its distributed nature greatly helps in the parallel development of a project by a team.
The team can host the Git history on a web page and collaborate through that.
-There are several Git hosting services for example \href{http://codeberg.org}{codeberg.org}, \href{http://gitlab.com}{gitlab.com}, \href{http://bitbucket.org}{bitbucket.org} or \href{http://github.com}{github.com} (among many others).
+There are several Git hosting services, for example, \href{https://codeberg.org}{codeberg.org}, \href{https://notabug.org}{notabug.org}, \href{https://gitlab.com}{gitlab.com}, \href{https://bitbucket.com}{bitbucket.com} or \href{https://github.com}{github.com} (among many others).
Storing the changes in binary files is also possible in Git, however it is most useful for human-readable plain-text sources.
@@ -365,8 +365,8 @@ ArXiv provides long-term stable URIs, giving unique identifiers for each publica
The granularity offered by the archival systems above is a file (which is usually a compressed package of many files in the case of source code).
It is thus not possible to be more precise when preserving or linking to the contents of a file, or to preserve the history of changes in the file (both of which are very important in hand-written source code).
-Commonly used Git repositories (like Codeberg, Gitlab or Github) do provide one way to access the fine details of the source files in a project.
-However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may become inactive (for association-run sites, like Codeberg) or go bankrupt or be sold to another (commercial sites, like Gitlab or Github), thus changing the URL or conditions of access.
+Commonly used Git repositories (like Codeberg, Notabug, Gitlab or Github) do provide one way to access the fine details of the source files in a project.
+However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may become inactive (for association-run sites, like Codeberg or Notabug) or go bankrupt or be sold to another (commercial sites, like Gitlab or Github), thus changing the URL or conditions of access.
Such repositories are thus not reliable sources in view of longevity.
For preserving, and providing direct access to the fine details of a source-code file (with the granularity of a line within the file), Software Heritage is especially useful\citeappendix{abramatic18,dicosmo18}.
@@ -376,10 +376,14 @@ This allows the repository as a whole, or individual files, and certain lines wi
In the main body of \emph{this} paper, we use this feature several times.
Software Heritage is mirrored on international servers and is supported by major international institutions like UNESCO.
-An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether/how to preserve ``scholarly ephemera''.
-This refers to discussions about the project such as bug reports or proposals of adding new features: which are usually referred to as ``Issues'' or ``pull requests'' (also called ``merge requests'').
+An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether or how to preserve ``scholarly ephemera''.
+This refers to discussions about the project such as bug reports or proposals of adding new features: which are usually referred to as ``issues'' or ``pull requests'' (also called ``merge requests'').
These ephemera are not part of the Git commit history of a software project, but add wider context and understanding beyond the commit history itself, and provide a record that could be used to allocate intellectual credit.
For these reasons, the \emph{Investigating \& Archiving the Scholarly Git Experience} (IASGE) project proposes that the ephemera should be archived along with the Git repositories themselves\footnote{\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{https://investigating-archiving-git.gitlab.io/updates/}}\\\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{define-scholarly-ephemera}}}.
+While Github is controversial for practical and ethical reasons\footnote{\inlinecode{\href{https://git.sdf.org/humanacollaborator/humanacollabora/src/branch/master/github.md}{https://git.sdf.org/humanacollaborator/humanacollabora/src/}}\\\inlinecode{\href{https://git.sdf.org/humanacollaborator/humanacollabora/src/branch/master/github.md}{branch/master/github.md}} (archive: \inlinecode{\url{https://archive.today/5twQ6}})}, it is currently in wide use, and appears to be the first git repository hoster for which the ephemera are being preserved, by the GHTorrent project\footnote{\inlinecode{\url{https://ghtorrent.org}}}.
+The GHTorrent project tracks the public Github ``event timeline'', downloads all ``contents and their dependencies, exhaustively'', and provides database files of all the material.
+A particular complication that will need to be dealt with by projects such as GHTorrent is the copyright of the git hoster on the particular format and creative choices in style in which the ephemera are provided for downloading.
+