From 624ccc1326a5b9e86561bedcb97a9f04851e7067 Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Mon, 4 Jan 2021 00:42:02 +0000
Subject: Edits on points raised by Raul

After his previous two commits, we discussed some of the points and I am making these edits following those. In particular the last statement about Madagascar "could have been more useful..." was changed to simply mention that mixing workflow with analysis is against the modularity principle. We should not judge its usefulness to the community (which is beyond our scope and would need an official survey). A few other minor edits were done here and there to clarify some of the points.
---
 tex/src/appendix-existing-solutions.tex | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/tex/src/appendix-existing-solutions.tex b/tex/src/appendix-existing-solutions.tex
index ccef295..e802644 100644
--- a/tex/src/appendix-existing-solutions.tex
+++ b/tex/src/appendix-existing-solutions.tex
@@ -134,9 +134,9 @@ Besides the location or contents of the data, RSF also contains name/value pairs
 Since RSF contains program options also, the inputs and outputs of Madagascar's analysis programs are read from, and written to, standard input and standard output.
 In terms of completeness, as long as the user only uses Madagascar's own analysis programs, it is fairly complete at a high level (not lower-level OS libraries).
-However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build).
+However, this comes at the expense of a large amount of bloatware (programs that one project may never need, but is forced to build), thus adding complexity.
 Also, the linking between the analysis programs (of a certain user at a certain time) and future versions of that program (that is updated in time) is not immediately obvious.
-Madagascar could have been more useful to a larger community if the workflow components were maintained as a separate project compared to the analysis components.
+Furthermore, the blending of the workflow component with the low-level analysis components fails the modularity criteria.
@@ -209,10 +209,11 @@ Besides the fact that it is no longer maintained, VisTrails does not control the
 \subsection{Galaxy (2010)}
 \label{appendix:galaxy}
 Galaxy\footnote{\inlinecode{\url{https://galaxyproject.org}}} is a web-based Genomics workbench \citeappendix{goecks10}.
-The main user interfaces are ``Galaxy Pages'', which do not require any programming: users simply use abstract ``tools'' which are wrappers over command-line programs.
+The main user interface is the ``Galaxy Pages'', which does not require any programming: users graphically manipulate abstract ``tools'' which are wrappers over command-line programs.
 Therefore the actual running version of the program can be hard to control across different Galaxy servers.
 Besides the automatically generated metadata of a project (which include version control, or its history), users can also tag/annotate each analysis step, describing its intent/purpose.
-Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too (including the very large cost of maintaining such a system).
+Besides some small differences, Galaxy seems very similar to GenePattern (Appendix \ref{appendix:genepattern}), so most of the same points there apply here too.
+For example the very large cost of maintaining such a system and being based on a graphic environment.
@@ -273,13 +274,13 @@ Furthermore, as a high-level data format HDF5 evolves very fast, for example HDF
 While data and code are indeed fundamentally similar concepts technically\citeappendix{hinsen16}, they are used by humans differently due to their volume: the code of a large project involving Terabytes of data can be less than a megabyte.
 Hence, storing code and data together becomes a burden when large datasets are used, this was also acknowledged in \citeappendix{hinsen15}.
 Also, if the data are proprietary (for example medical patient data), the data must not be released, but the methods that were applied to them can be published.
-Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary/reproducible files.
-These files are later removed, thus making the code more complicated and hard to read.
+Furthermore, since all reading and writing is done in the HDF5 file, it can easily bloat the file to very large sizes due to temporary files.
+These files can later be removed as part of the analysis, but this makes the code more complicated and hard to read/maintain.
 For example the Active Papers HDF5 file of \citeappendix[in \href{https://doi.org/10.5281/zenodo.2549987}{zenodo.2549987}]{kneller19} is 1.8 giga-bytes.
 In many scenarios, peers just want to inspect the processing by reading the code and checking a very special part of it (one or two lines, just to see the option values to one step for example).
-They do not necessarily need to run it or obtain the output the datasets.
-Hence the extra volume for data and obscure HDF5 format that needs special tools for reading plain text code is a major hindrance.
+They do not necessarily need to run it or obtain the output the datasets (which may be published elsewhere).
+Hence the extra volume for data and obscure HDF5 format that needs special tools for reading its plain-text internals is an issue.
@@ -302,7 +303,8 @@ SHARE\footnote{\inlinecode{\url{https://is.ieis.tue.nl/staff/pvgorp/share}}} \ci
 SHARE was recognized as the second position in the Elsevier Executable Paper Grand Challenge \citeappendix{gabriel11}.
 Simply put, SHARE was just a VM library that users could download or connect to, and run.
 The limitations of VMs for reproducibility were discussed in Appendix \ref{appendix:virtualmachines}, and the SHARE system does not specify any requirements on making the VM itself reproducible.
-While the top project web page still works, upon selecting any operation, a notice is printed that ``SHARE is offline'' since 2019 and the reason is not mentioned.
+As of January 2021, the top SHARE web page still works.
+However, upon selecting any operation, a notice is printed that ``SHARE is offline'' since 2019 and the reason is not mentioned.
@@ -339,7 +341,7 @@ For reproducibility, \citeappendix{pham12} suggest building a SOLE-based project
 However, as described in Appendices \ref{appendix:independentenvironment} and \ref{appendix:packagemanagement}, unless virtual machines are built with robust package managers, this is not a sustainable solution (the virtual machine itself is not reproducible).
 Also, hosting a large virtual machine server with fixed IP on a hosting service like Amazon (as suggested there) for every project in perpetuity will be very expensive.
 The manual/artificial definition of tags to connect parts of the paper with the analysis scripts is also a caveat due to human error and incompleteness (the authors may not consider tags as important things, but they may be useful later).
-In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text.
+In Maneage, instead of using artificial/commented tags, the analysis inputs and outputs are automatically linked into the paper's text through \LaTeX{} macros that are the backbone of the whole system (aren't artificial/extra features).
-- 
cgit v1.2.1