--- title: "5. Advanced Lab - Explore the Data " author: "Sonali Arora" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{5. Advanced Lab - Explore the Data} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), error=FALSE) ``` Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2 ## Advanced lab for Bioconductor - Explore the Data In this lab, we will explore the dataset from `r Biocpkg("airway")`, and then subsequently use the same to run a quick RNA-seq analysis. Steps include - - Load the data package and explore the dataset using basic accessors - Performing a `rlog` transformation - Assessing which samples are similar to each other by - building a heatmap - using Principal Component Analysis (PCA) - using Multidimensional Scaling (MDS) - Summarizing the plots ## Data for the Lab The data used in this Lab is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is: Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665). GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). For our analysis, we wil use data from the data package `r Biocpkg("airway")`. ```{r load-data} library("airway") data(airway) ``` ## Solutions ### Answer 1 : Load the Data _Bioconductor_ software packages often define and use a custom class for their data object, which makes sure that all the needed data slots are consistently provided and fulfill the requirements. The data stored inside `r Biocpkg("airway")` is a *SummarizedExperiment* object. ```{r play} library("airway") data(airway) se <- airway se ``` For a *SummarizedExperiment* object, The `assay(s)` contains the matrix (or matrices) of summarized values, the `rowData` contains information about the genomic ranges, and the `colData` contains information about the samples or experiments. ```{r revise-se} head(assay(se)) colSums(assay(se)) colData(se) rowRanges(se) ``` In `r Biocpkg("DESeq2")`, the custom class is called *DESeqDataSet*. It is built on top of the *SummarizedExperiment* class. ```{r make-dds} library(DESeq2) dds <- DESeqDataSet(se, design = ~ cell + dex) ``` ### Answer 2 : rlog transformation Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean. In RNA-Seq data, the variance grows with the mean.If one performs PCA (principal components analysis) directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog for short. See the help for ?rlog for more information and options. The function rlog returns a SummarizedExperiment object which contains the rlog-transformed values in its assay slot: ```{r do-rlog} rld <- rlog(dds) head(assay(rld)) ``` ### Answer 3 : Visualize the Data To assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment's design? We use the R function dist to calculate the Euclidean distance between samples. To avoid that the distance measure is dominated by a few highly variable genes, and have a roughly equal contribution from all genes, we use it on the rlog-transformed data ```{r dist} sampleDists <- dist( t( assay(rld) ) ) sampleDists ``` Note the use of the function t to transpose the data matrix. We need this because dist calculates distances between data rows and our samples constitute the columns. #### Heatmaps We visualize the sample-to-sample distances in a heatmap, using the function heatmap.2 from the gplots package. Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap. ```{r message=FALSE} library("gplots") library("RColorBrewer") sampleDistMatrix <- as.matrix( sampleDists ) rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" ) colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) hc <- hclust(sampleDists) heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col=colors, margins=c(2,10), labCol=FALSE ) ``` #### PCA Another way to visualize sample-to-sample distances is a principal-components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out in the two directions which explain most of the differences in the data. The x-axis is the direction (or principal component) which separates the data points the most. The amount of the total variance which is contained in the direction is printed in the axis label. Here, we have used the function plotPCA which comes with DESeq2. The two terms specified by intgroup are the interesting groups for labelling the samples; they tell the function to use them to choose colors. ```{r pca} plotPCA(rld, intgroup = c("dex", "cell")) ``` #### MDS Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don't have the original data, but only a matrix of distances. Here we have the MDS plot for the distances calculated from the rlog transformed counts: ```{r mds} library(ggplot2) mds <- data.frame(cmdscale(sampleDistMatrix)) mds <- cbind(mds, colData(rld)) qplot(X1,X2,color=dex,shape=cell,data=as.data.frame(mds)) ``` ### Answer 4 : Summarize the plots We see that the differences between cells are considerable, though not stronger than the differences due to treatment with dexamethasone. This shows why it will be important to account for this in differential testing by using a paired design ('paired', because each dex treated sample is paired with one untreated sample from the same cell line). We are already set up for this by using the design formula ~ cell + dex when setting up the data object in the beginning. ## References For a much detailed analysis see - [Case Study- How to build a SummarizedExperiment - airway dataset](http://bioconductor.org/packages/devel/data/experiment/vignettes/airway/inst/doc/airway.html) - [Exploring the Data using Machine Learning Techniques](http://bioconductor.org/help/course-materials/2015/SeattleApr2015/D_MachineLearning.html) ## What to not miss at BioC2015 ! If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015 - _Differential expression, manipulation, and visualization of RNA-seq reads_ by Mike Love. (Session 1, Tuesday, 1:00pm -2:45pm) - _Automated NGS workflows with systemPipeR running on clusters or single machines, with a focus on VAR-seq_ by Thomas Girke. (Session 4, Tuesday, 3:15pm - 5:00pm) ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ```