---
title: "3. Basic Lab for R & Bioconductor "
author: "Sonali Arora"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{3. Basic Lab for R & Bioconductor}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
error=FALSE)
```
Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor
version 3.2
## Basic lab for Bioconductor
__Exercise 1__
- How do you download any package from Bioconductor ?
- How do you access the vignettes for a given package ?
- How do you find help for a function (say `sortSeqlevels`)
__Execise 2 : Hone your R skills__
The Broad Institute has the [EpigenomeRoadMap Project](http://www.roadmapepigenomics.org/) Metadata for the project has been made available for you.
- Read the file into R.
- What does R read it in as ?
- How do you see the first few or last few rows of the dataset ?
- How many rows and columns are there inside this file ?
- What are the column headers ?
- What are the data types of each column ?
- Can you summarize the whole data (Hint: ?summary) Summarize the number of males and females in this dataset.
- The column called `GROUP` contains the source of the sample. Can you subset the data.frame to get all samples belonging to `BRAIN` and `DIGESTIVE`
__Exercise 3 : Getting Comfortable with `GenomicRanges`__
Using the given GRanges , do the following
- Extract ranges only from chromosome 3
- Extract the first five ranges from the GRanges.
- Extract the score and gc column of the GRanges
- Keep only the standard chromosomes (i.e.) from chromosome 1 to 22, X,Y,M.
- Change the chromosome naming style i.e. this GRanges contains UCSC style of chromosome names, change them to NCBI style of chromosome names.
- How do you find out the ranges contained in the gaps of this GRanges object?
- How do you find out the degree of overlap for all the ranges in a GRanges object ? ( Hint: ?coverage)
```{r}
library(GenomicRanges)
gr <-
GRanges(seqnames = paste0("chr", c(1:22, tail(letters, 11))),
ranges = IRanges(start=1:33, width = 1000 ),
strand = c(rep("+", 10), rep("-", 23)),
score = 1:33,
GC = seq(1, 0, length=33))
```
__Exercise 4: Create and Manipulate a SummarizeExperiment Object__
In this small exercise, We have data for 20 genes from 9 highly talented individuals and we will create our first `SummarizedExperiment` object.
The data for 20 genes from 9 individuals results in a `matrix`
```{r se-data}
data <- matrix(1:180, ncol=9, byrow=TRUE)
```
The data from the 20 genes can be represented as a `GRanges`
```{r se-gr}
gr_20gene <-
GRanges(seqnames = paste0("gene", 1:20),
ranges = IRanges(start=1:20, width = 1000 ),
strand = c(rep("+", 10), rep("-", 10)),
score = 1:20,
GC = seq(1, 0, length=20))
```
The data about the 20 individuals is stored in a `data.frame`
```{r-mat}
sample_df <- data.frame( names=c("Martin", "Herve", "Dan",
"Marc", "Valerie", "Jim", "Nate","Paul", "Sonali"),
sex=c(rep("Male", 4), "Female", rep("Male", 3), "Female"))
```
- Create a `SummarizedExperiment` from the three objects listed above.
- Check the dimensions of the created `SummarizedExperiment`
- How would you access the data matrix containing numbers?
- How would you access the information about the genes ?
- How would you access the information about the samples?
- Subset the `SummarizedExperiment` to create a new one which contains information only about the female core team.
- Subset the `SummarizedExperiment` to create a new one containing only the first two genes.
## Solutions
__Answer 1__
If the package we are trying to download is called *GenomeInfoDb* then
```{r load-pkg, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("GenomeInfoDb")
vignette(package="GenomeInfoDb")
?sortSeqlevels
```
__Answer 2__
```{r basic-R}
## Reading the data
fname <- system.file("extdata", "epi_metadata.txt", package="BioC2015Introduction")
df <- read.delim(fname, stringsAsFactors=FALSE)
## Exploring the data
class(df)
head(df)
tail(df)
dim(df)
colnames(df)
sapply(df, class)
## Summarize the data
summary(df)
table(df$SEX)
## Subset the data
df[df$GROUP %in% c("Brain", "Digestive"),]
```
__Answer 3__
```{r gr-pkg}
library(GenomicRanges)
gr <-
GRanges(seqnames = paste0("chr", c(1:22, tail(letters, 11))),
ranges = IRanges(start=1:33, width = 1000 ),
strand = c(rep("+", 10), rep("-", 23)),
score = 1:33,
GC = seq(1, 0, length=33))
## extract ranges only from chromosome 3
gr[seqnames(gr) %in% "chr3",]
## extract the first five ranges from the GRanges.
gr[1:5, ]
## extract the score and sequence column from a GRanges
mcols(gr)
## keep only the standard chromosomes (i.e.) from chromosome 1 to 22, x, y,m
keepStandardChromosomes(gr)
## change the chromosome naming style to NCBI
seqlevelsStyle(gr) <- "NCBI"
gr
## gaps in the ranges
gaps(gr)
## find degree of overlap for ranges.
coverage(gr)
```
__Answer4__
![summarizedExperiment exercise](our_figures/summarizedExperiment_exercise.png)
```{r se-ans}
library(SummarizedExperiment)
## data for the SummarizedExperiment object
sample_df <- data.frame( names=c("Martin", "Herve", "Dan",
"Marc", "Valerie", "Jim", "Nate","Paul", "Sonali"),
sex=c(rep("Male", 4), "Female", rep("Male", 3), "Female"))
gr_20genes <-
GRanges(seqnames = paste0("gene", 1:20),
ranges = IRanges(start=1:20, width = 1000 ),
strand = c(rep("+", 10), rep("-", 10)),
score = 1:20,
GC = seq(1, 0, length=20))
data <- matrix(1:180, ncol=9, byrow=TRUE)
## create a SummarizedExperiment object
core_se <- SummarizedExperiment(assays=data,
rowRanges=gr_20genes,
colData=DataFrame(sample_df))
core_se
## exploring the SummarizedExperiment object
dim(core_se)
head(assay(core_se)) # data matrix
rowRanges(core_se) # information about the genes
colData(core_se) # sample information
## subset the SummarizedExperiment object
## subsetting the sample information
core_se[, core_se$sex == "Female"]
## subsetting the gene information
core_se[,1:2]
```
## `sessionInfo()`
```{r sessionInfo}
sessionInfo()
```