--- title: "FLAMES" author: "Oliver Voogd, Changqing Wang" package: FLAMES output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{FLAMES} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ref.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # FLAMES The FLAMES package provides a framework for performing single-cell and bulk read full-length analysis of mutations and splicing. FLAMES performs cell barcode and UMI assignment from nanopore reads as well as semi-supervised isoform detection and quantification. FLAMES is designed to be an easy and quick to use, powerful workflow for isoform detection and quantification, splicing analysis and mutation detection, and seeks to overcome the limitations of other tools, such as an inability to process single cell data, and a focus on cell barcode and UMI assignment [@flames]. This R package was created to simplify installation and execution of the FLAMES python module, which supports the article *Comprehensive characterization of single cell full-length isoforms in human and mouse with long-read sequencing* by Tian et al. [-@flames]. ```{r workflow, echo=FALSE, fig.align='center', fig.cap='FLAMES pipeline workflow summary'} knitr::include_graphics(system.file("images/FLAMESpipeline-01.png", package = "FLAMES")) ``` Input to FLAMES are fastq files generated from the long-read platform. Using a cell barcode allow-list (e.g. barcode list obtained from short-read sequencing), the `find_barcode()` function locates the barcode sequencing in the long-reads and trims the cell barcodes and the flanking UMI sequence. The pipeline then calls `minimap_align()`, a `minimap2` wrapper, to align the demultiplex long-reads to the reference genome. Next, `find_isoform()` is called to identify novel isoforms using the alignment, creating an updated transcript assembly. The demultiplexed reads are then aligned to the transcript assembly by the function `minimap_realign()`. Finally, FLAMES calls `quantify()` to quantify the transcript counts using the (re-)alignment file. Figure \@ref(fig:workflow) summerise the steps of the pipeline described above. The pipeline functions (`sc_long_pipeline`, `bulk_long_pipeline`, `sc_long_multisample_pipline`) execute the steps sequentially and return `SingleCellExperiment` object, `SummarizedExperiment` object and a list of `SingleCellExperiment` objects respectivly. For read alignment and realignment, FLAMES uses [minimap2](https://github.com/lh3/minimap2), a versatile alignment program for aligning sequences against a reference database [@minimap2]. This software needs to be downloaded prior to using the FLAMES pipeline, and can be found at [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2). # FLAMES Pipeline Execution This vignette will detail the process of running the FLAMES pipeline. It details the execution of both the single cell pipeline (`sc_long_pipeline()`) and the bulk data pipeline (`bulk_long_pipeline()`). ## FLAMES Single Cell Pipeline ### Environment setup To get started, the pipeline needs access to a gene annotation file in GFF3 or GTF format, a directory containing one or more FASTQ files (which will be merged as pre-processing), a genome FASTA file, as well as the file path to minimap2, and the file path to the directory to hold output files. The single cell pipeline can demultiplex the input data before running, if required. This can be disabled by setting the `match_barcode` argument to `FALSE` when calling the pipeline. This example dataset has already been demultiplexed. For this example, the required files are downloaded from GitHub using [BiocFileCache](http://bioconductor.org/packages/release/bioc/html/BiocFileCache.html) [@biocfilecache]. ```{r eval=TRUE, echo=TRUE} temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" annot <- bfc[[names(BiocFileCache::bfcadd( bfc, "Annotation", file.path(file_url, "gencodeshortened.gtf") ))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd( bfc, "Genomefa", file.path(file_url, "GRCh38shortened.fa") ))]] fastq <- bfc[[names(BiocFileCache::bfcadd( bfc, "Fastq", file.path(file_url, "sc_align2genome.sample.fastq.gz")))]] # setup other environment variables outdir <- tempfile() dir.create(outdir) config_file <- FLAMES::create_config(outdir, type = "SIRV") ``` The optional argument `config_file` can be given to both `bulk_long_pipeline()` and `sc_long_pipeline()` in order to customise the execution of the pipelines. It is expected to be a JSON file, and an example can be found by calling `FLAMES::create_config`, which returns the path to a copy of the default JSON configuration file. The values from the default configs can be altered by either editing the JSON file manually, or passing additional arguments to `FLAMES::create_config`, for example, `FLAMES::create_config(outdir, type = "sc_5end", min_sup_cnt = 10)` with create a copy of the default config for 5' end nanopore single cell reads, with the `min_sup_cnt` value (minimal number of supporting reads for a novel isoform to pass filtering) changed to 10. This vignette uses the default configuration file created for SIRV reads. ### FLAMES execution Once the initial variables have been setup, the pipeline can be run using: ```{r, eval=TRUE, echo=TRUE} library(FLAMES) # do not run if minimap2 cannot be found if (is.character(locate_minimap2_dir())) { sce <- sc_long_pipeline( annotation = annot, fastq = fastq, genome_fa = genome_fa, outdir = outdir, config_file = config_file, match_barcode = FALSE ) } ``` As stated above, the example dataset has already been demultiplexed, so `match_barcode=FALSE` should be set to skip the preprocessing of the barcodes. If, however, the input fastq files need to be demultiplexed, then the `reference_csv` argument will need to be specified - `reference_csv` is the file path to a cell barcode allow-list, as a text file with one barcode per line. The `filtered_feature_bc_matrix/barcodes.tsv.gz` from `cellranger` outputs can be used to create such allow-list, with `zcat barcodes.tsv.gz | cut -f1 -d'-' > allow-list.csv`. The pipeline can alse be run by calling the consituent steps sequentially: ```{r, eval=TRUE, echo=TRUE} library(FLAMES) minimap2_dir <- locate_minimap2_dir() if (is.character(minimap2_dir)) { # do not run if minimap2 cannot be found config <- jsonlite::fromJSON(config_file) # find_barcode(...) genome_bam <- rownames(minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq, annot = annot, outdir = outdir, minimap2_dir = minimap2_dir )) find_isoform( annotation = annot, genome_fa = genome_fa, genome_bam = genome_bam, outdir = outdir, config = config ) minimap2_realign( config = config, fq_in = fastq, outdir = outdir, minimap2_dir = minimap2_dir ) quantify(annotation = annot, outdir = outdir, config = config) sce <- create_sce_from_dir(outdir = outdir, annotation = annot) } ``` ## Visulisation The `combine_sce()` and `sc_annotate_plots()` can be used for visulisation with experiment designs similar to `FLT-seq`. The `combine_sce()` function combines the `SingleCellExperiment` objects of 1) the short-read gene counts from the larger subsample, 2) the short-read gene counts from the smaller subsample and 3) the transcript counts from the larger subsample into a `MultiAssayExperiment` object. The `sc_annotate_plots()` function can then use it to produce annotated plots. ```{r, eval=TRUE, message=FALSE, warning=FALSE} library(FLAMES) library(MultiAssayExperiment) # load scm_lib80, scm_lib80 and scm_lib20 source(system.file("examples/scmixology1.R", package = "FLAMES")) scm_lib80 <- scuttle::addPerCellQC(scm_lib80) scm_lib20 <- scuttle::addPerCellQC(scm_lib20) qc_80 <- scuttle::quickPerCellQC(colData(scm_lib80)) qc_20 <- scuttle::quickPerCellQC(colData(scm_lib20)) qc_80$discard[scm_lib80$sum < 10000] <- TRUE qc_20$discard[scm_lib20$sum < 20000] <- TRUE combined_sce <- combine_sce( short_read_large = scm_lib80[,!qc_80$discard], short_read_small = scm_lib20[,!qc_20$discard], long_read_sce = scm_lib20_transcripts, remove_duplicates = FALSE) plots <- sc_annotate_plots(gene = "ENSG00000108107", multiAssay = combined_sce) scater::plotReducedDim( experiments(plots[["multiAssay"]])$gene_counts, dimred = "PCA", colour_by = colData(plots[["multiAssay"]] [,,"gene_counts"])[,"Lib_small",drop=FALSE]) plots[["plot_umap"]] plots[["combined_isoform_plot"]] ``` ### FLAMES termination The directory `outdir` now contains several output files returned from this pipeline. The output files generated by this pipeline are: * `transcript_count.csv.gz` - a transcript count matrix (also contained in the output [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) or [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)) * `isoform_annotated.filtered.gff3` - found isoform information in gff3 format * `transcript_assembly.fa` - transcript sequence from the isoforms * `align2genome.bam` sorted BAM file with reads aligned to genome (intermediate FLAMES step) * `realign2transcript.bam` - sorted realigned BAM file using the transcript_assembly.fa as reference (intermediate FLAMES step) * `tss_tes.bedgraph`- TSS TES enrichment for all reads (for QC) The pipeline also returns a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) or [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object, depending on the pipeline run, containing the data from `transcript_count.csv.gz`and `isoform_annotated.filtered.gff3` [@singlecellexperiment] [@summarizedexperiment]. This [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) (or [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)) object contains the same data as present in the `outdir` directory, and is given to simplify the process of reading the transcript count matrix and annotation data back into R, for further analysis. ## FLAMES Bulk Pipeline A basic example of the execution of the FLAMES bulk pipeline is given below. The process for this is essentially identical to the above example for single cell data. ### Environment setup To get started, the pipeline needs access to a gene annotation file in GFF3 or GTF format, a directory containing one or more FASTQ files (which will be merged as pre-processing), a genome FASTA file, as well as the file path to minimap2, and the file path to the directory to hold output files. For this example, these files are downloaded from GitHub using [BiocFileCache](http://bioconductor.org/packages/release/bioc/html/BiocFileCache.html) [@biocfilecache]. ```{r eval=TRUE, echo=TRUE} temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" annot <- bfc[[names(BiocFileCache::bfcadd( bfc, "Annotation", file.path(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf") ))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd( bfc, "Genomefa", file.path(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta") ))]] # download the two fastq files, move them to a folder to be merged together fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", file.path(file_url, "fastq", "sample1.fastq.gz")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", file.path(file_url, "fastq", "sample2.fastq.gz")))]] # the downloaded fastq files need to be in a directory to be merged together fastq_dir <- file.path(temp_path, "fastq_dir") dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted # setup other environment variables outdir <- tempfile() dir.create(outdir) config_file <- FLAMES::create_config(outdir) ``` ### FLAMES execution Once the environment has been setup, the pipeline can be executed by running: ```{r eval=TRUE, echo=TRUE} library(FLAMES) if (is.character(locate_minimap2_dir())) { summarizedExperiment <- bulk_long_pipeline( annot = annot, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, minimap2_dir = minimap2_dir, config_file = config_file ) } ``` ### FLAMES termination After the bulk pipeline has completed, the output directory contains the same files as the single cell pipeline produces. `bulk_long_pipeline` also returns a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) object, containing the same data as the [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) as above [@singlecellexperiment] [@summarizedexperiment]. ## Running with Slurm Since the barcode demultiplexing step, isoform identification step and quantification step is currently single threaded, job scheduler (such as Slurm) users may want to allocate different resources for each step. This can be achieved by calling the individual functions (`find_isoform`, `minimap2_realign` etc.) sequentially, or by changing the configuration file. The `Bash` script below demonstrates how this can be done with a single `R` script calling the pipeline function. ```{bash eval=FALSE} #!/bin/bash -x # set all steps to false sed -i '/"do_/s/true/false/' config_sclr_nanopore_5end.json # set do_genome_alignment to true sed -i '/do_genome_alignment/s/false/true/' \ config_sclr_nanopore_5end.json srun -c 20 --mem 64GB \ Rscript flames_pipeline.R && \ sed -i '/"do_/s/true/false/' \ config_sclr_nanopore_5end.json && \ sed -i '/do_isoform_identification/s/false/true/' \ config_sclr_nanopore_5end.json && \ srun -c 1 --mem 64GB --ntasks=1 \ Rscript flames_pipeline.R && \ sed -i '/"do_/s/true/false/' \ config_sclr_nanopore_5end.json && \ sed -i '/do_read_realignment/s/false/true/' \ config_sclr_nanopore_5end.json && \ srun -c 20 --mem 64GB --ntasks=1 \ Rscript flames_pipeline.R && \ sed -i '/"do_/s/true/false/' \ config_sclr_nanopore_5end.json && \ sed -i '/do_transcript_quantification/s/false/true/' \ config_sclr_nanopore_5end.json && \ srun -c 1 --mem 64GB --ntasks=1 \ Rscript flames_pipeline.R && \ echo "Pipeline finished" ``` The `Bash` script can then be executed inside a tmux or screen session with: ```{bash eval=FALSE} ./flames.sh | tee -a bash.log; tail bash.log \ | mail -s "flames pipeline ended" user@example.com ``` ## FLAMES on Windows Due to FLAMES requiring minimap2 and pysam, FLAMES is currently unavaliable on Windows. # Session Info ``` {r echo=FALSE} utils::sessionInfo() ``` # References