--- title: "Damsel-workflow" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Damsel-workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 7, fig.width = 12 ) ``` ```{r setup, include=FALSE} library(Damsel) ``` # 1. Introduction This document gives an introduction to the R package Damsel, for use in DamID analysis; from BAM file input to gene ontology analysis. Designed for use with DamID data, the Damsel methodology could be modified for use on any similar technology that seeks to identify enriched regions relative to a control sample. Utilising the power of edgeR for differential analysis and goseq for gene ontology bias correction, Damsel provides a unique end to end analysis for DamID. The DamID example data used in this vignette is available in the package and has been taken from Vissers et al., (2018), 'The Scalloped and Nerfin-1 Transcription Factors Cooperate to Maintain Neuronal Cell Fate'. The fastq files were downloaded from SRA, aligned using `Rsubread::index` and `Rsubread::align`, before sorting and making bai files with Samtools. # Installation ```{r eval=FALSE} if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("Damsel") library(Damsel) ``` # Processing the BAM files As a DamID analysis tool, Damsel requires a GATC region file for analysis. These regions serve as a guide to extract counts from the BAM files. ## Introducing the GATC region file It can be generated with `getGatcRegions()` using a `BSGenome` object or a FASTA file. It is a `GRangesList` with the consecutive GATC regions across the genome - representing the region (or the length) between GATC sites, as well as the positions of the sites themselves. If you have another species of DamID data or would prefer to make your own region file, you can use the following function, providing a BSgenome object or a FASTA file. ```{r} library(BSgenome.Dmelanogaster.UCSC.dm6) regions_and_sites <- getGatcRegions(BSgenome.Dmelanogaster.UCSC.dm6) regions <- regions_and_sites$regions knitr::kable(head(regions)) knitr::kable(head(regions_and_sites$sites)) ``` If you already have your own GATC region file, ensure that it has the same format with 6 columns: * Position: chromosome-start * seqnames: chromosome name * start: start of region * end: end of region * width: length of region (ensure that is is correct according to `[plyranges::as_granges()])` ## Extracting the counts within the GATC regions Note: Damsel requires BAM files that have been mapped to the reference genome. Provided the path to a folder of BAM files (and their .bai files) and the appropriate GATC region file, the function `countBamInGATC()` will extract the counts for each region for each available BAM and add them as columns to a data frame. The columns will be named by the BAM file name - please rename them before running the function if they do not make sense. ```{r eval=FALSE} path_to_bams <- system.file("extdata", package = "Damsel") counts.df <- countBamInGATC(path_to_bams, regions = regions ) ``` ```{r include=FALSE} data_env <- new.env(parent = emptyenv()) data("dros_counts", envir = data_env, package = "Damsel") counts.df <- data_env[["dros_counts"]] ``` * If necessary, at this stage please rearrange the BAM file columns so they are ordered in the following way: Dam_1, Fusion_1, Dam_2, Fusion_2 etc ```{r} knitr::kable(head(counts.df)) ``` This example data is also directly available as a counts file via `data`. ```{r eval=FALSE} data("dros_counts") ``` * Do not remove the .bam extension on the column names as this is used as a check in later functions to ensure only the BAM files are selected from the data frame. * The DamID data captures the ~75bp region extending from each GATC site, so although regions are of differing widths, there is a null to minimal length bias present on the data, and does not require length correction. ## Correlation analysis of samples At this stage, the similarities and differences between the samples can be analysed via correlation. `plotCorrHeatmap` plots the correlation of all available BAM files Dam and Fusion, to visualise the similarity between files. The default for all Damsel correlation analysis is the non-parametric "spearman's" correlation. The correlation between Dam_1 and Fusion_1 can be expected to reach ~ 0.7, whereas the correlation between Dam_1 & Dam_3 or Fusion_1 & Fusion_2 would be expected to be closer to ~0.9 ```{r heatmap} plotCorrHeatmap(df = counts.df, method = "spearman") ``` Two specific samples can also be compared using `ggscatter` which plots a scatterplot of the two samples, overlaid with the correlation results. [ggpubr::ggscatter()] ## Visualisation of coverage The overall coverage of different samples can be compared ```{r} plotCountsDistribution(counts.df, constant = 1) ``` A specific region can be selected to view the counts across samples. ```{r coverage3, fig.wide=TRUE} plotCounts(counts.df, seqnames = "chr2L", start_region = 1, end_region = 40000, layout = "spread" ) ``` As shown in the following plots, the default layout is `"stacked"`, where the replicates are overlaid. The counts can also be displayed in a log2 ratio with `log2_scale=TRUE` # Differential methylation analysis The goal with DamID analysis is to identify regions that are enriched in the fusion sample relative to the control. In Damsel, this step is referred to as differential methylation analysis, and makes use of [`edgeR`]. For ease of use, Damsel has four main edgeR based functions which compile different steps and functions from within edgeR. ## Setting up edgeR analysis `makeDGE` sets up the edgeR analysis for differential methylation testing. Taking the data frame of samples and regions as input, it conducts the following steps: * it extracts the sample data * groups the samples (Dam or Fusion) * filters the samples (remove regions with very low counts, the filtering parameters may be adjusted) * normalises the data * establishes the design matrix (this includes the sample group and pairing replicates together - Dam_1 & Fusion_1) * estimates the dispersion ```{r} dge <- makeDGE(counts.df) head(dge) ``` The output from this step is a DGEList containing all of the information from the steps. ## Examining the data - multidimensional scaling plot It's important to visualise the differences between the samples. You would expect the Dam samples to cluster together, and for the Fusion samples to cluster together. You would expect the majority of the variation to be within the 1st dimension (the x axis), and less variation in the 2nd dimension (y axis) ```{r mds, fig.height=6, fig.width=6} group <- dge$samples$group %>% as.character() limma::plotMDS(dge, col = as.numeric(factor(group))) ``` ## Identifying differentially methylated regions After exploring the data visually, it's time to identify the enriched regions. `testDmRegions` compiles the edgeR functions for differential testing with one key modification - it outputs the results with the adjusted p values as well as the raw p values. `testDmRegions` conducts the following key steps: * i. fits a QLF model - quasi likelihood * ii. tests the model * iii. conducts p value adjustment and summarises model results by setting regions as either (1,0) (log fold change and p value thresholds can be adjusted) These results are incorporated with the region data, providing a result for every region. The regions excluded from edgeR analysis are given logFC = 0, and adjust.p = 1 Setting plot=TRUE will plot an [`edgeR::plotSmear()`] alongside the results ```{r} dm_results <- testDmRegions(dge, p.value = 0.05, lfc = 1, regions = regions, plot = TRUE) dm_results %>% dplyr::group_by(meth_status) %>% dplyr::summarise(n = dplyr::n()) knitr::kable(head(dm_results), digits = 32) ``` The edgeR results can be plotted alongside the counts as well. ```{r, fig.wide=TRUE} plotCounts(counts.df, seqnames = "chr2L", start_region = 1, end_region = 40000, log2_scale = FALSE ) + geom_dm(dm_results.df = dm_results) ``` Only regions that are fully contained within the provided boundaries will be plotted. * Add GATC sites ```{r} gatc_sites <- regions_and_sites$sites knitr::kable(head(gatc_sites)) ``` ```{r, fig.wide=TRUE} plotCounts(counts.df, seqnames = "chr2L", start_region = 1, end_region = 40000, log2_scale = FALSE ) + geom_dm(dm_results) + geom_gatc(gatc_sites) ``` # Identifying peaks (bridges) As you could see from the plot of the differential methylation results, there are 10s of 1000s of enriched regions. To reduce the scale of this data to something that can be more biologically meaningul, enriched regions can be compiled into peaks. ## Aggregating the regions Damsel identifies peaks by aggregating regions of enrichment. As DamID sequencing generally sequences the 75 bp from the GATC site, regions smaller than 150 bp are mostly non-significant in statistical testing. Because of this, gaps between peaks of less than or equal to 150 bp are combined into one longer peak. The FDR and logFC for each peak is calculated via the theory of [csaw::getBestTest()] where the 'best' (smallest) p-value in the regions that make up the peak is selected as representative of the peak. The logFC is therefore the corresponding logFC from the selected region. ```{r} peaks <- identifyPeaks(dm_results) nrow(peaks) knitr::kable(head(peaks), digits = 32) ``` ## Plotting A peak plot layer can be added to our graph ```{r, fig.wide=TRUE} plotCounts(counts.df, seqnames = "chr2L", start_region = 1, end_region = 40000 ) + geom_dm(dm_results) + geom_peak(peaks, peak.label = TRUE) + geom_gatc(gatc_sites) ``` The default version will not plot the peak.label The distribution of counts per sample that have contributed to peaks can be compared. ```{r} plotCountsInPeaks(counts.df, dm_results, peaks, position = "stack") ``` # Identifying genes associated with peaks The peak information itself - while interesting, has no biological meaning. As the peaks represent a region that the Fusion protein interacted with on the DNA, likely as a transcription factor, we wish to identify the gene that is being affected. To do so, we need to associate the peaks with a potential "target" gene. Note: any gene identified here is only a potential target that must be validated in laboratory procedures. There is no method available that is able to accurately predict the location and target genes of enhancers, so a key and potentially incorrect assumption in this part of the analysis is that all peaks represent binding to a local enhancer or promoter - that it is close or overlapping to the target gene. It must also be noted that the Drosophila melanogaster genome and transcription factor interactions are different to that of mammals and using the same assumptions means results must be taken cautiously. While mammalian genes are generally spread out with little overlap, there is a large amount of overlap between Drosophila genes, requiring some intuitive interpretation of which gene the peak is potentially targeting. In the Damsel methodology, peaks are considered to associate with genes if they are within 5kb upstream or downstream. If multiple genes are within these criteria, they are all listed, with the closest gene given the primary position. ## Extract genes The function `collateGenes()` uses two different mechanisms to create a list of genes. It allows for the use of a TxDb object/annotation package, or can access biomaRt. ### A TxDb object The simplest approach is to use a TxDb annotation package. A TxDb package is available for most species and has information about the genes, exons, cds, promoters, etc - which can all be accessed using the GenomicFeatures package. This presentation has a lot of slides on how to use TxDb objects to access different data types: https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#10 However, TxDb libraries contain only the Ensembl gene Ids and not the gene symbol or name. Instead we need to access an org.Db package to transfer them over. org.Db packages contain information about model organisms genome annotation, and can be used to extract various information about the gene name etc. More information can be found here https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#46 ```{r eval=FALSE} BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene") BiocManager::install("org.Dm.eg.db") ``` ```{r} library("TxDb.Dmelanogaster.UCSC.dm6.ensGene") txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene library("org.Dm.eg.db") genes <- collateGenes(genes = txdb, regions = regions, org.Db = org.Dm.eg.db) knitr::kable(head(genes)) ``` Using the TxDb package, Damsel assumes that the TSS (transcription start site) is the same as the start site of the gene, taking the strand into account. ### Accessing the biomaRt resource Alternatively, the name of a species listed in biomaRt can be provided, and the version of the genome. The advantage of biomaRt is that a greater amount of information is able to be uncovered, including the canonical transcript. A guide to understanding more about how biomaRt functions is here: https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html ```{r eval=FALSE} BiocManager::install("biomaRt") ``` ```{r} library(biomaRt) collateGenes(genes = "dmelanogaster_gene_ensembl", regions = regions, version = 109) ``` * accesses biomaRt using the seqnames of the appropriate GATC region file as a guide * accesses biomaRt a second time to obtain only the Ensembl canonical sequence information for each gene * identifies the number of GATC regions that overlap with each gene ## Annotating genes to peaks As stated above, Damsel associates genes with peaks if they are within 5 kb upstream or downstream. This maximum distance is an adjustable parameter within the `annotatePeaksGenes()` function. If set to `NULL` it will output all possible combinations as defined by `plyranges::pair_nearest`. The nature of this function means that the closest gene will be found for every peak, even if that distance is in the millions. If the user sets `max_distance=NULL`, we recommend undergoing some filtering to remove those associations. To respect that some species have genes with more overlap than others, `annotatePeaksGenes` outputs a list of data frames. The first, closest, outputs information for every peak and it's closest gene. The second data frame, top_five, outputs a string of the top five genes (if available) and their distances from each peak. The final data frame, all, provides the raw results and all possible gene and peak associations, as well as all available statistical results. ```{r} annotation <- annotatePeaksGenes(peaks, genes, regions = regions, max_distance = 5000) knitr::kable(head(annotation$closest), digits = 32) knitr::kable(head(annotation$top_five), digits = 32) knitr::kable(str(annotation$all), digits = 32) ``` ## Interpreting results and plotting Now that we have the genes from `collateGenes()`, this can be added as a layer to the previous plots. This plot requires the gene positions as a guide for a `Txdb` or `EnsDb` object, building off the autoplot generic built by `ggbio`. ```{r, fig.wide=TRUE} plotCounts(counts.df, seqnames = "chr2L", start_region = 1, end_region = 40000 ) + geom_dm(dm_results) + geom_peak(peaks) + geom_gatc(gatc_sites) + geom_genes_tx(genes, txdb) ``` * If the scale of the gene plot is disproportional to the height of the overall plot - if it is too large or too squished, it can be adjusted using the `plot.height` argument. ```{r, fig.wide=TRUE} plotWrap( id = peaks[1, ]$peak_id, counts.df = counts.df, dm_results.df = dm_results, peaks.df = peaks, gatc_sites.df = gatc_sites, genes.df = genes, txdb = txdb ) ``` ```{r, fig.wide=TRUE} plotWrap( id = genes[1, ]$ensembl_gene_id, counts.df = counts.df, dm_results.df = dm_results, peaks.df = peaks, gatc_sites.df = gatc_sites, genes.df = genes, txdb = txdb ) ``` # Gene ontology One of the last steps in a typical DamID analysis is gene ontology analysis. However, a key mistake made in this analysis using any common data type - including RNA-seq, is the lack of bias correction. We correct for this by utilising the [goseq] package. Without bias correction, the ontology results would be biased towards longer genes - the longer the gene, the more likely it would be to have a peak associated with it, and therefore be called as significant. We can see this in the plot below. ## GO analysis with goseq `testGeneOntology` identifies the over-represented GO terms from the peak data, correcting for the number of GATC regions matching to each gene. 3 outputs Plot of goodness of fit of model signif_results: data.frame of significant GO category results, ordered by p-value. prob_weights: data.frame of probability weights for each gene ```{r} ontology <- testGeneOntology(annotation$all, genes, regions = regions, extend_by = 2000) ``` The goodness of fit plot above shows us that there is a length based bias to the data. The x axis shows the number of GATC regions in each gene. The y axis shows the proportion of genes that have that amount of GATC regions and have been identified as significant. And it shows that as the number of GATC regions in the gene increase, as does the proportion of genes that are significant. ```{r} knitr::kable(head(ontology$signif_results), digits = 32) knitr::kable(head(ontology$prob_weights), digits = 32) ``` As expected, significant gene ontology terms surround developmental processes, which is expected as the fusion gene in the example data (Scalloped) is well known to be involved in development. `plotGeneOntology` can be used to plot the top 10 results. ```{r, fig.height=9, fig.width=10} plotGeneOntology(ontology$signif_results) ``` As shown above, the plot has the category on the y-axis, the FDR on the x-axis, the size of the dot being the number of genes in the GO category, and the colour of the dot being the ontology (Biological Process, Cellular Component, and Molecular Function). # Appendix ```{r} sessionInfo() ```