From ddd6dfd504bf2ce1bc6a4f6c3445daf2ad3a2a06 Mon Sep 17 00:00:00 2001
From: Mohammad Akhlaghi
Date: Fri, 9 Mar 2018 12:39:34 +0100
Subject: Added tip for pipeline outputs

While doing my own project (which has grown to a processing time of about
half an hour), I felt that it would be very convenient to have a record of
the outputs at major points also. But we don't want to bloat the pipeline
by committing PDF files or large datasets that get fully changed and are
just by-products.

So it occurred to me to have a separate pipeline only for outputs and
after trying it out, it indeed seems to be a good solution.
---
 README.md | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 9df72a7..ee60b4a 100644
--- a/README.md
+++ b/README.md
@@ -549,8 +549,8 @@ been explained here), please let us know to correct it.
 
 
 
-Tips on expanding this template (designing your pipeline)
-=========================================================
+Usage tips: designing your pipeline/workflow
+============================================
 
 The following is a list of design points, tips, or recommendations that
 have been learned after some experience with this pipeline. Please don't
@@ -716,7 +716,7 @@ us. In this way, we can add it here for the benefit of others.
 
    - *Keep your input data*: The input data is also critical to the
      pipeline, so like the above for software, make sure you have a backup
-     of them
+     of them.
 
  - **Version control**: It is important (and extremely useful) to have the
    history of your pipeline under version control. So try to make commits
@@ -739,6 +739,25 @@ us. In this way, we can add it here for the benefit of others.
      results to your colleagues, you can tag the commit as `v2`. Afterwards
      when you submit to a paper, it can be tagged `v3` and so on.
 
+   - *Pipeline outputs*: During your research, it is possible to check out a
+     specific commit and reproduce its results. However, the processing
+     can be time-consuming. Therefore, it is useful to also keep track of
+     the final outputs of your pipeline (at minimum, the paper's PDF) at
+     important points of history. However, keeping a snapshot of these
+     (most probably large-volume) outputs in the main history of the
+     pipeline can unreasonably bloat it. It is thus recommended to make a
+     separate Git repo to keep those files and keep this pipeline's volume
+     as small as possible. For example, if your main pipeline is called
+     `my-exciting-project`, the name of the outputs pipeline can be
+     `my-exciting-project-outputs`. This enables easy sharing of the
+     output files with your co-authors (with necessary permissions) and
+     not having to bloat your email archive with extra attachments (you
+     can just share the link to the online repo in your
+     communications). After the research is published, you can also
+     release the outputs pipeline, or you can just delete it if it is too
+     large or unnecessary (it was just for convenience, and fully
+     reproducible after all).
+
 
 
 
-- 
cgit v1.2.1
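
The *Pipeline outputs* tip added by this patch could be followed along the lines of the sketch below. It is only an illustration, not part of the pipeline: the function names `outputs_repo_name` and `snapshot_output` are hypothetical, the naming follows the `my-exciting-project-outputs` example in the tip, and it assumes a Git new enough to support `git -C` with a commit identity (`user.name`, `user.email`) already configured.

```shell
#!/bin/sh
# Hypothetical helper: derive the name of the outputs repository from the
# main pipeline's directory, following the "<project>-outputs" convention.
outputs_repo_name() {
    printf '%s-outputs\n' "$(basename "$1")"
}

# Hypothetical helper: copy one output file (for example the paper's PDF)
# into the outputs repository, commit it, and tag that commit.
# Arguments: <outputs-repo-dir> <output-file> <tag>
snapshot_output() {
    repo=$1; file=$2; tag=$3
    mkdir -p "$repo" &&
    git -C "$repo" init -q &&          # harmless if the repo already exists
    cp "$file" "$repo"/ &&
    git -C "$repo" add -- "$(basename "$file")" &&
    git -C "$repo" commit -q -m "Outputs at $tag" &&
    git -C "$repo" tag "$tag"
}

# Example usage (assumed paths; run from inside the main pipeline):
#   outdir=../$(outputs_repo_name "$PWD")
#   snapshot_output "$outdir" paper.pdf v2
```

Using the same tag name in both repositories (`v2` here, as in the *Tags* tip) keeps each snapshot traceable to the pipeline commit that produced it. Note that the commit step fails if the copied file is identical to the previous snapshot.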