Diffstat (limited to 'tex')
-rw-r--r-- | tex/src/appendix-existing-solutions.tex | 2
-rw-r--r-- | tex/src/appendix-existing-tools.tex | 21
2 files changed, 12 insertions, 11 deletions
diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index 58a3518..578d6e9 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -437,7 +437,7 @@ It is possible to include the build instructions of the software used within the
 For the data, it is similarly not possible to extract which data server they came from.
 Hence two projects that each use a 1-terabyte dataset will need a full copy of that same 1-terabyte file in their bundle, making long-term preservation extremely expensive.
 Such files can be excluded from the bundle through modifications in the configuration file.
-However, this will add complexity if a higher-level script is written above ReproZip to make sure that the data and bundle are used together or to check the integrity of the data.
+However, this will add complexity: a higher-level script will be necessary with the ReproZip bundle, to make sure that the data and bundle are used together, or to check the integrity of the data (in case it may have changed).
 Finally, because it is only a snapshot of one moment in a project's history, preserving the connection between the ReproZip'd bundles of various points in a project's history is likely to be difficult (for example, when software or data are updated, or when analysis methods are modified).
 In other words, a ReproZip user will have to personally define an archival method to preserve the various black boxes of the project as it evolves, and tracking what has changed between the versions is not trivial.
diff --git a/tex/src/appendix-existing-tools.tex b/tex/src/appendix-existing-tools.tex
index f234ac0..7d76db6 100644
--- a/tex/src/appendix-existing-tools.tex
+++ b/tex/src/appendix-existing-tools.tex
@@ -348,37 +348,38 @@ Since they are tailored to academic files, these services mint a DOI for each pa
 Since these services allow large files, they are mostly useful for data (for example Zenodo currently allows a total size, for all files, of 50 GB in each upload).
 Universities now regularly provide their own repositories,\footnote{For example \inlinecode{\url{https://repozytorium.umk.pl}}} many of which are registered with the \emph{Open Archives Initiative} that aims at repository interoperability\footnote{\inlinecode{\url{https://www.openarchives.org/Register/BrowseSites}}}.
-However, a computational research project's source code (including instructions on how to do the research analysis, how to build the plots, blended with narrative, and how to prepare the software environment) are different from the data to be analysed (which are usually just a sequence of values resulting from experiments and whose volume can be very large).
+However, a computational research project's source code (including instructions on how to do the research analysis, how to build the plots, blended with narrative, how to access the data, and how to prepare the software environment) are different from the data to be analysed (which are usually just a sequence of values resulting from experiments and whose volume can be very large).
 Even though both source code and data are ultimately just sequences of bits in a file, their creation and usage are fundamentally different within a project, from both the philosophy-of-science point of view and from a practical point of view.
 Source code is often written by humans, for machines to execute \emph{and also} for humans to read/modify; it is often composed of many files and thousands of lines of (modular) code.
 Often, the fine details of the history of the changes in those lines are preserved through version control, as mentioned in Appendix \ref{appendix:versioncontrol}.
-For example, the source-code files for a research paper can also be archived on a preprint server such as arXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research preprints.
-ArXiv uses the {\LaTeX} source of a paper (and its plots) to build the paper internally and provide users with Postscript or PDF outputs: having access to the {\LaTeX} source, allows it to extract metadata or contextual information among other benefits\footnote{\inlinecode{\url{https://arxiv.org/help/faq/whytex}}}.
+Due to this fundamental difference, some services focus only on archiving the source code of a project.
+A prominent example is arXiv\footnote{\inlinecode{\url{https://arXiv.org}}}, which pioneered the archiving of research preprints.
+ArXiv uses the {\LaTeX} source of a paper (and its plots) to build the paper internally and provide users with in-house Postscript or PDF outputs: having access to the {\LaTeX} source, allows it to extract metadata or contextual information among other benefits\footnote{\inlinecode{\url{https://arxiv.org/help/faq/whytex}}}.
 However, along with the {\LaTeX} source, authors can also submit any type of plain-text file, including Shell or Python scripts for example (as long as the total volume of the upload doesn't exceed a certain limit).
 This feature of arXiv is heavily used by Maneage'd papers.
 For example this paper is available at \href{https://arxiv.org/abs/2006.03018}{arXiv:2006.03018}; by clicking on ``Other formats'', and then ``Download source'', the full source file that we uploaded is available to any interested reader.
 The file includes a full snapshot of this Maneage'd project, at the point the paper was submitted there, including all data and software download and build instructions, analysis commands and narrative source code.
-In fact the \inlinecode{./project make dist} command in Maneage will automatically create the arXiv-ready tarball for authors to upload easily.
+In fact the \inlinecode{./project make dist} command in Maneage will automatically create the arXiv-ready tarball to help authors upload their project to arXiv.
 ArXiv provides long-term stable URIs, giving unique identifiers for each publication\footnote{\inlinecode{\url{https://arxiv.org/help/arxiv_identifier}}} and is mirrored on many servers across the globe.
 The granularity offered by the archival systems above is a file (which is usually a compressed package of many files in the case of source code).
 It is thus not possible to be more precise when preserving or linking to the contents of a file, or to preserve the history of changes in the file (both of which are very important in hand-written source code).
 Commonly used Git repositories (like Codeberg, Gitlab or Github) do provide one way to access the fine details of the source files in a project.
-However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may become inactive (citizens'-association-run sites like Codeberg) or go bankrupt or be sold to another (commercial sites like Gitlab or Github), thus changing the URL or conditions of access.
+However, the Git history of projects on these repositories can easily be changed by the owners, or the whole site may become inactive (for association-run sites, like Codeberg) or go bankrupt or be sold to another (commercial sites, like Gitlab or Github), thus changing the URL or conditions of access.
 Such repositories are thus not reliable sources in view of longevity.
 For preserving, and providing direct access to the fine details of a source-code file (with the granularity of a line within the file), Software Heritage is especially useful\citeappendix{abramatic18,dicosmo18}.
 Through Software Heritage, users can anonymously nominate the version-controlled repository of any publicly accessible project and request that it be archived.
-The Software Heritage scripts (themselves free-licensed) download the repository and preserve it within Software Heritage's archival system.
+The Software Heritage scripts (themselves free-licensed) download the repository (including its full history) and preserve it.
 This allows the repository as a whole, or individual files, and certain lines within the files, to be accessed using a standard Software Heritage ID (SWHID), for more see \citeappendix{dicosmo18}.
 In the main body of \emph{this} paper, we use this feature several times.
 Software Heritage is mirrored on international servers and is supported by major international institutions like UNESCO.
-An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether/how to preserve ``scholarly ephemera'' in scientific software development.
-This refers to discussions about the software such as reports on bugs or proposals of adding features, which are usually referred to as ``Issues'', and ``pull requests'', which propose that a change be ``pulled'' into the main branch of a software development repository by the core developers.
-These ephemera are not part of the git commit history of a software project, but add wider context and understanding beyond the commit history itself, and provide a record that could be used to allocate intellectual credit.
-For these reasons, the \emph{Investigating \& Archiving the Scholarly Git Experience} (IASGE) project proposes that the ephemera should be archived as well as the git repositories themselves.\footnote{\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{https://investigating-archiving-git.gitlab.io/updates/}}\\\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{define-scholarly-ephemera}}}
+An open question in archiving the full sequence of steps that go into a quantitative scientific research project is whether/how to preserve ``scholarly ephemera''.
+This refers to discussions about the project such as bug reports or proposals of adding new features: which are usually referred to as ``Issues'' or ``pull requests'' (also called ``merge requests'').
+These ephemera are not part of the Git commit history of a software project, but add wider context and understanding beyond the commit history itself, and provide a record that could be used to allocate intellectual credit.
+For these reasons, the \emph{Investigating \& Archiving the Scholarly Git Experience} (IASGE) project proposes that the ephemera should be archived along with the Git repositories themselves\footnote{\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{https://investigating-archiving-git.gitlab.io/updates/}}\\\inlinecode{\href{https://investigating-archiving-git.gitlab.io/updates/define-scholarly-ephemera}{define-scholarly-ephemera}}}.