--- title: "_systemPipeRdata_: NGS workflow templates and sample data" author: "Author: Daniela Cassol (danielac@ucr.edu) and Thomas Girke (thomas.girke@ucr.edu)" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: BiocStyle::html_document: toc_float: true code_folding: show BiocStyle::pdf_document: default package: systemPipeRdata vignette: | %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{systemPipeRdata: Workflow templates and sample data} %\VignetteEngine{knitr::rmarkdown} fontsize: 14pt bibliography: bibtex.bib --- ```{css, echo=FALSE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; } ``` ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=60, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), tidy.opts=list(width.cutoff=60), tidy=TRUE) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(systemPipeR) library(systemPipeRdata) library(BiocGenerics) }) ``` **Note:** the most recent version of this vignette can be found here and a short overview slide show [here](https://htmlpreview.github.io/?https://github.com/tgirke/systemPipeR/blob/master/inst/extdata/slides/systemPipeRslides.html). **Note:** if you use _`systemPipeR`_ and _`systemPipeRdata`_ in published research, please cite: Backman, T.W.H and Girke, T. (2016). *systemPipeR*: NGS Workflow and Report Generation Environment. *BMC Bioinformatics*, 17: 388. [10.1186/s12859-016-1241-0](https://doi.org/10.1186/s12859-016-1241-0). # Introduction [_`systemPipeRdata`_](https://github.com/tgirke/systemPipeRdata) is a helper package to generate with a single command NGS workflow templates that are intended to be used by its parent package [_`systemPipeR`_](http://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) [@H_Backman2016-bt]. The latter is an environment for building *end-to-end* analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, Ribo-Seq, ChIP-Seq, VAR-Seq and many others. The directory structure of the workflow templates and the sample data used by _`systemPipeRdata`_ are described [here](http://bioconductor.org/packages/release/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#load-sample-data-and-workflow-templates). # Getting Started ## Installation The R software for using _`systemPipeRdata`_ can be downloaded from [CRAN](http://cran.at.r-project.org). The _`systemPipeRdata`_ package can be installed from within R as follows: ```{r install, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("systemPipeRdata") # Installs from Bioconductor once # available there BiocManager::install("tgirke/systemPipeR", build_vignettes=TRUE, dependencies=TRUE) # Installs from github ``` ## Loading package and documentation ```{r load_systemPipeRdata, eval=TRUE} library("systemPipeRdata") # Loads the package ``` ```{r documentation_systemPipeRdata, eval=FALSE} library(help="systemPipeRdata") # Lists package info vignette("systemPipeRdata") # Opens vignette ``` ## Generate workflow template Load one of the available NGS workflows into your current working directory. The following does this for the _`varseq`_ template. The name of the resulting workflow directory can be specified under the _`mydirname`_ argument. The default _`NULL`_ uses the name of the chosen workflow. An error is issued if a directory of the same name and path exists already. Besides, it is possible to choose different version of the workflow template. Please check the available options [here](https://github.com/tgirke/systemPipeRdata/tree/master/inst/extdata/workflows), or provide the download URL to your template. The URL can be specified under _`url`_ argument and the file name in the _`urlname`_ argument. The default _`NULL`_ copies the current version available in the [systemPipeRdata](https://github.com/tgirke/systemPipeRdata/tree/master/inst/extdata/workflows). ```{r generate_workenvir, eval=FALSE} genWorkenvir(workflow="varseq", mydirname=NULL, url=NULL, urlname=NULL) setwd("varseq") ``` On Linux and OS X systems the same can be achieved from the command-line of a terminal with the following commands. ```{bash generate_workenvir_from_shell, eval=FALSE} $ Rscript -e "systemPipeRdata::genWorkenvir(workflow='varseq', mydirname=NULL, url=NULL, urlname=NULL)" ``` ### Directory Structure The workflow templates generated by _`genWorkenvir`_ contain the following preconfigured directory structure: * _**workflow/**_ (*e.g.* *rnaseq/*) + This is the root directory of the R session running the workflow. + Run script ( *\*.Rmd*) and sample annotation (*targets.txt*) files are located here. + Note, this directory can have any name (*e.g.* _**rnaseq**_, _**varseq**_). Changing its name does not require any modifications in the run script(s). + **Important subdirectories**: + _**param/**_ + Stores non-CWL parameter files such as: *\*.param*, *\*.tmpl* and *\*.run.sh*. These files are only required for backwards compatibility to run old workflows using the previous custom command-line interface. + _**param/cwl/**_: This subdirectory stores all the CWL parameter files. To organize workflows, each can have its own subdirectory, where all `CWL param` and `input.yml` files need to be in the same subdirectory. + _**data/**_ + FASTQ files + FASTA file of reference (*e.g.* reference genome) + Annotation files + etc. + _**results/**_ + Analysis results are usually written to this directory, including: alignment, variant and peak files (BAM, VCF, BED); tabular result files; and image/plot files + Note, the user has the option to organize results files for a given sample and analysis step in a separate subdirectory. **Note**: Directory names are indicated in ***green***. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.