--- title: "BLAST Workflow Template" author: "Author: FirstName LastName" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: BiocStyle::html_document: toc_float: true code_folding: show BiocStyle::pdf_document: default package: systemPipeR vignette: | %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{WF: BLAST Template} %\VignetteEngine{knitr::rmarkdown} fontsize: 14pt bibliography: bibtex.bib editor_options: chunk_output_type: console --- ```{css, echo=FALSE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; } ``` ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=60, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), tidy.opts=list(width.cutoff=60), tidy=TRUE) ``` ```{r setup, echo=FALSE, message=FALSE, warning=FALSE, eval=FALSE} suppressPackageStartupMessages({ library(systemPipeR) }) ``` # About this template This is the BLAST workflow template of the [systemPipeRdata](https://bioconductor.org/packages/devel/data/experiment/html/systemPipeRdata.html) package, a companion package to [systemPipeR](https://www.bioconductor.org/packages/devel/bioc/html/systemPipeR.html) [@H_Backman2016-bt]. Like other workflow templates, it can be loaded with a single command. Users have the flexibility to utilize the template as is or modify it as needed. More in-depth information can be found in the main vignette of [systemPipeRdata](https://bioconductor.org/packages/devel/data/experiment/vignettes/systemPipeRdata/inst/doc/systemPipeRdata.html). The BLAST workflow template serves as a starting point for conducting sequence similarity search routines. It employs [NCBI’s BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) software as an illustrative example, enabling users to search a sequence database for entries that share sequence similarity to one or multiple query sequences. The search results can be presented in a concise tabular summary format, or the corresponding pairwise alignments can be included. To utilize this workflow, users must download and install the BLAST software from NCBI’s website [here](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and ensure it is added to their system’s PATH environment variable. The `Rmd` file (`SPblast.Rmd`) associated with this vignette serves a dual purpose. It acts both as a template for executing the workflow and as a template for generating a reproducible scientific analysis report. Thus, users want to customize the text (and/or code) of this or other `systemPipeR` workflow vignettes to describe their experimental design and analysis results. This typically involves deleting the instructions how to work with this workflow, and customizing the text describing experimental designs, other metadata and analysis results. The following data analysis steps are included in this workflow template: 1. Validation of the BLAST installation 2. Creation of an indexed sequence database that can be searched with BLAST 3. BLAST search of indexed database with query sequence The topology graph of the BLAST workflow is shown in Figure 1. ```{r spblast-toplogy, eval=TRUE, warning= FALSE, echo=FALSE, out.width="100%", fig.align = "center", fig.cap= "Topology graph of BLAST workflow.", warning=FALSE} knitr::include_graphics("results/plotwf_spblast.jpg") ``` # Workflow environment The environment of the chosen workflow is generated with the `genWorenvir` function. After this, the user’s R session needs to be directed into the resulting directory (here `SPblast`). ```{r gen_spblast_wf, eval=FALSE} systemPipeRdata::genWorkenvir(workflow = "SPblast", mydirname = "SPblast") setwd("SPblast") ``` The `SPRproject` function initializes a new workflow project instance. This function call creates an empty `SAL` workflow container and at the same time a linked project log directory (default name `.SPRproject`) that acts as a flat-file database of a workflow. For additional details, please visit this [section](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#5_Detailed_tutorial) in `systemPipeR's` main vignette. ```{r create_workflow, message=FALSE, eval=FALSE} library(systemPipeR) sal <- SPRproject() sal ``` The `importWF` function allows to import all the workflow steps outlined in the source Rmd file of this vignette into a `SAL` (`SYSargsList`) workflow container. Once imported, the entire workflow can be executed from start to finish using the `runWF` function. More details regarding this process are provided in the following section [here](#importwf). ```{r load_workflow_default, eval=FALSE} sal <- importWF(sal, "SPblast.Rmd") sal <- runWF(sal) ``` ## Step 1: Load packages Next, the `systemPipeR` package needs to be loaded in a workflow. ```{r load_packages, eval=FALSE, spr=TRUE} appendStep(sal) <- LineWise( code = { library(systemPipeR) }, step_name = "load_packages" ) ``` ## Step 2: Test BLAST install The following step is optional. It tests the availability of the BLAST software on the user’s system. ```{r test_blast, eval=FALSE, spr=TRUE} appendStep(sal) <- LineWise( code = { # If you have a modular system, then enable the following line # moduleload("ncbi-blast") blast_check <- tryCMD("blastn", silent = TRUE) if(blast_check == "error") stop("Check your BLAST installation path.") }, step_name = "test_blast", dependency = "load_packages" ) ``` ## Step 3: BLASTable database This step creates an indexed sequence database that can be searched with BLAST. The sample sequences used for creating the databases are stored in a file named `tair10.fasta` under the `data` directory of the workflow environment. The exact command-line (CL) call used for creating the indexed database can be returned with `cmdlist(sal, step=3)`. ```{r build_genome_db, eval=FALSE, spr=TRUE} appendStep(sal) <- SYSargsList( step_name = "build_genome_db", dir = FALSE, targets=NULL, wf_file = "blast/makeblastdb.cwl", input_file="blast/makeblastdb.yml", dir_path="param/cwl", dependency = "test_blast" ) ``` ## Step 4: BLAST search Next, the BLASTable database is searched with a set of query sequences to find sequences in the database that share sequence similarity with the queries. As above, the exact CL call used in this step can be returned with `cmdlist(sal, step=4)`. ```{r blast_genome, eval=FALSE, spr=TRUE} appendStep(sal) <- SYSargsList( step_name = "blast_genome", dir = FALSE, targets="targets_blast.txt", wf_file = "blast/blastn.cwl", input_file="blast/blastn.yml", inputvars = c( FileName = "_query_file_", SampleName = "_SampleName_" ), dir_path="param/cwl", dependency = "build_genome_db" ) ``` ## Step 5: View top hits This step displays the top hits identified by the BLAST search in the previous step. The `e_value` and `bit_score` columns allow to rank the BLAST results by sequence similarity. ```{r display_hits, eval=FALSE, spr=TRUE} appendStep(sal) <- LineWise( code = { # get the output file path from a Sysargs step using `getColumn` tbl_tair10 <- read.delim(getColumn(sal, step = "blast_genome")[1], header = FALSE, stringsAsFactors = FALSE) names(tbl_tair10) <- c( "query", "subject", "identity", "alignment_length", "mismatches", "gap_openings", "q_start", "q_end", "s_start", "s_end", "e_value", "bit_score" ) print(head(tbl_tair10, n = 20)) }, step_name = "display_hits", dependency = "blast_genome" ) ``` ## Version Information ```{r wf_session, eval=FALSE, spr=TRUE} appendStep(sal) <- LineWise( code = { sessionInfo() }, step_name = "wf_session", dependency = "display_hits") ``` # Automated routine {#importwf} Once the above workflow steps have been loaded into `sal` from the source `Rmd` file of this vignette, the workflow can be executed from start to finish (or partially) with the `runWF` command. Subsequently, scientific and technical workflow reports can be generated with the `renderReport` and `renderLogs` functions, respectively. __Note:__ To demonstrate 'systemPipeR's' automation routines without regenerating a new workflow environment from scratch, the first line below uses the `overwrite=TRUE` option of the `SPRproject` function. This option is generally discouraged as it erases the existing workflow project and `sal` container. For information on resuming and restarting workflow runs, users want to consult the relevant section of the main vignette (see [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#10_Restarting_and_resetting_workflows).) ```{r , import_run_routine, eval=FALSE} sal <- SPRproject(overwrite = TRUE) # Avoid 'overwrite=TRUE' in real runs. sal <- importWF(sal, file_path = "SPblast.Rmd") # Imports above steps from new.Rmd. sal <- runWF(sal) # Runs ggworkflow. plotWF(sal) # Plot toplogy graph of workflow sal <- renderReport(sal) # Renders scientific report. sal <- renderLogs(sal) # Renders technical report from log files. ``` ## CL tools used The `listCmdTools` (and `listCmdModules`) return the CL tools that are used by a workflow. To include a CL tool list in a workflow report, one can use the following code. Additional details on this topic can be found in the main vignette [here](https://www.bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html#111_Accessor_methods). ```{r list_tools} if(file.exists(file.path(".SPRproject", "SYSargsList.yml"))) { local({ sal <- systemPipeR::SPRproject(resume = TRUE) systemPipeR::listCmdTools(sal) systemPipeR::listCmdModules(sal) }) } else { cat(crayon::blue$bold("Tools and modules required by this workflow are:\n")) cat(c("BLAST 2.16.0+"), sep = "\n") } ``` ## Session Info This is the session information for rendering this R Markdown report. To access the session information for the workflow run, generate the technical HTML report with `renderLogs`. ```{r report_session_info, eval=TRUE} sessionInfo() ```