-rw-r--r-- | paper.tex | 51 |
1 files changed, 26 insertions, 25 deletions
@@ -408,7 +408,7 @@ This greatly reduces the cost of curation and maintenance of each individual pro
 \section{Discussion}
-%% It should provide some insight or lessons learned.
+%% It should provide some insight or 'lessons learned', where 'lessons learned' is jargon for 'informal hypotheses motivated by experience', reworded to make the hypotheses sound more scientific (if it's a 'lesson', then it sounds like knowledge, when in fact it's really only a hypothesis).
 %% What is the message we should take from the experience?
 %% Are there clear demonstrated design principles that can be reapplied elsewhere?
 %% Are there roadblocks or bottlenecks that others might avoid?
@@ -416,49 +416,50 @@ This greatly reduces the cost of curation and maintenance of each individual pro
 %% Attempt to generalise the significance.
 %% should not just present a solution or an enquiry into a unitary problem but make an effort to demonstrate wider significance and application and say something more about the ‘science of data’ more generally.
-Having shown that it is possible to build workflows satisfing the proposed criteria, here we will review the lessons leaned and insights gained, while sharing the experience of implementing the RDA/WDS recommendations.
-We will also discuss the design principles, an how they may be generalized and usable in other projects.
-In particular, with the support of RDA, the user base and development of the criteria and Maneage grew phenomenally, highlighting some difficulties for the wide-spread adoption of these criteria.
+We have shown that it is possible to build workflows satisfying the proposed criteria.
+Here, we comment on our experience in building this system and implementing the RDA/WDS recommendations.
+We will discuss the design principles, and how they may be generalized and usable in other projects.
+In particular, with the support of RDA, the user base, the development of the criteria and of Maneage grew phenomenally, highlighting some difficulties for the wide-spread adoption of these criteria.
-Firstly, while most researachers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
-But we have noticed that after witnessing the improvements in their research, many (especially early career researchers) have started mastering these tools.
-Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years, discourages them.
-Fast evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects and move to the next technology.
+Firstly, while most researchers are generally familiar with them, the necessary low-level tools (e.g., Git, \LaTeX, the command-line and Make) are not widely used.
+Fortunately, we have noticed that after witnessing the improvements in their research, many, especially early-career researchers, have started mastering these tools.
+Scientists are rarely trained sufficiently in data management or software development, and the plethora of high-level tools that change every few years discourages them.
+Fast-evolving tools are primarily targeted at software developers, who are paid to learn them and use them effectively for short-term projects before moving on to the next technology.
 Scientists, on the other hand, need to focus on their own research fields, and need to consider longevity.
-Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, paper's narrative, software management \emph{and} a modular lineage for analysis.
-We have seen that a complete \emph{and} customizable template with a clear checklist of first steps is much more effective in encouraging mastery of these essential tools for modern science.
-As opposed to having abstract/isolated tutorials on each tool individually.
+Hence, arguably the most important feature of these criteria is that they provide a fully working template, using mature and time-tested tools, for blending version control, the research paper's narrative, software management \emph{and} a modular lineage for analysis.
+We have seen that providing a complete \emph{and} customizable template with a clear checklist of the initial steps is much more effective in encouraging mastery of these essential tools for modern science than having abstract, isolated tutorials on each tool individually.
-Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several GNU/Linux distributions and macOS).
+Secondly, to satisfy the completeness criteria, all the necessary software of the project must be built on various POSIX-compatible systems (we actively test Maneage on several different GNU/Linux distributions and on macOS).
 This requires maintenance by our core team and consumes time and energy.
-However, the PM and analysis share the same job manager and our experience so far has shown that users' experience in the analysis, empowers some of them to add/fix their required software on their own systems.
+However, the PM and analysis share the same job manager.
+Our experience so far has shown that users' experience in the analysis empowers some of them to add or fix their required software on their own systems.
 Later, they share them as commits on the core branch, thus propagating it to all derived projects.
 This has already occurred multiple times.
-Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers in competition with the original authors.
+Thirdly, publishing a project's reproducible data lineage immediately after publication enables others to continue with follow-up papers, which may provide unwanted competition against the original authors.
 We propose these solutions:
 1) Through the Git history, the work added by another team at any phase of the project can be quantified, contributing to a new concept of authorship in scientific projects and helping to quantify Newton's famous ``\emph{standing on the shoulders of giants}'' quote.
-However, this is a long-term goal and requires major changes to academic value systems.
+This is a long-term goal and would require major changes to academic value systems.
 2) Authors can be given a grace period where the journal or a third party embargoes the source, keeping it private for the embargo period and then publishing it.
 Other implementations of the criteria, or future improvements in Maneage, may solve some of the caveats above.
-However, the proof of concept already shows many advantages to adopting the criteria.
-For example, publication of projects with these criteria on a wide scale allows automatic workflow generation, optimized for desired characteristics of the results (for example via machine learning).
+However, the proof of concept already shows many advantages in adopting the criteria.
+For example, publication of projects with these criteria on a wide scale will allow automatic workflow generation, optimized for desired characteristics of the results (for example, via machine learning).
 Because of the completeness criteria, algorithms and data selection can be similarly optimized.
-Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} history of that analysis/narrative.
-Parsers can be written over projects for meta-research and data provenance studies, for example to generate ``research objects''.
+Furthermore, through elements like the macros, natural language processing can also be included, automatically analyzing the connection between an analysis with the resulting narrative \emph{and} the history of that analysis/narrative.
+Parsers can be written over projects for meta-research and data provenance studies, for example, to generate ``research objects''.
 As another example, when a bug is found in one software package, all affected projects can be found and the scale of the effect can be measured.
 Combined with SoftwareHeritage, precise high-level science parts of Maneage projects can be accurately cited (e.g., failed/abandoned tests at any historical point).
 Many components of ``machine-actionable'' data management plans can be automatically filled out by Maneage, which is useful for project PIs and grant funders.
 From the data repository perspective these criteria can also be very useful, for example with regard to the challenges mentioned in \cite{austin17}:
-(1) The burden of curation is shared among all project authors and/or readers (who may find a bug and fix it), not just by data-base curators, improving their sustainability.
-(2) Automated and persistent bi-directional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control.
+(1) The burden of curation is shared among all project authors and readers (the latter may find a bug and fix it), not just by database curators, improving sustainability.
+(2) Automated and persistent bidirectional linking of data and publication can be established through the published \& \emph{complete} data lineage that is under version control.
 (3) Software management.
-With these criteria, each project's unique and complete software management is included: its not a third-party PM, that needs to be maintained by the data center employees.
-This enables easy management, preservation, publishing and citation of used software.
-For example see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746} where we have exploited the free software criteria to distribute all the used software tarballs with the project's source and deliverables.
-(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which is the whole purpose of these criteria.
+With these criteria, each project's unique and complete software management is included: it is not a third-party PM that needs to be maintained by the data center employees.
+These criteria enable easy management, preservation, publishing and citation of the software used.
+For example, see \href{https://doi.org/10.5281/zenodo.3524937}{zenodo.3524937}, \href{https://doi.org/10.5281/zenodo.3408481}{zenodo.3408481}, \href{https://doi.org/10.5281/zenodo.1163746}{zenodo.1163746}, where we have exploited the free-software criterion to distribute the tarballs of all the software used with the project's source and deliverables.
+(4) ``Linkages between documentation, code, data, and journal articles in an integrated environment'', which effectively summarises the whole purpose of these criteria.