---
title: "4. Intermediate Lab for R & Bioconductor "
author: "Sonali Arora"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{4. Intermediate Lab for R & Bioconductor}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
error=FALSE)
```
Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor
version 3.2
## Intermediate lab for Bioconductor
__Exercise 1__
Find the package in Bioconductor which contains annotation databases generated from UCSC for Rat Norvegicus (assembly rn5), load it and save it in a variable called 'txdb'. Using this object , do the following -
a) find all the genes contained in this assembly and save it in a varible called 'ratGenes'.
b) How many sequences are contained in ratGenes ? (Hint : ?Seqinfo)
c) The 'ratGenes' contains scaffolds also - how do you subsetthe object to contain sequences only from the standard chromosomes ?
b) I am interested in gene 'Acsl5' (entrez gene id=94340). Does this exist in 'ratGenes'? what are its chromosome coordinates ?
__Exercise 2__
Find the package in Bioconductor thats stores the Full genome sequences for Rattus norvegicus (Rat) as provided by UCSC (rn5, Mar. 2012)
a) Load the package and save it in a object called 'ratSeq'
b) what object is this sequence information stored in ?
c) get the dna sequence information for acsl5 and store it in 'acsl5_sequence'
d) calculate the GC content from this sequnence.
__Exercise 3__
In the 'ratGenes' object above, you get only the entrez gene ids, can you get the gene names for each gene ?
__Exercise 4__
Get the annotation databases from NCBI for Homo sapiens (assembly build GRCh38.80), create a txdb object (similar to what we saw in question 3 above) and get the genes. ( Hint - involves starting from scratch with a GTF file)
__Exercise 5__
The liftOver facilities developed in conjunction with the UCSC browser track infrastructure are available for transforming data in GRanges formats. We want to transform data from rn4 to the latest asembly rn6.
a) The transformation to rn6 coordinates is defined by a chain file provided by UCSC. Get the chain file which contains transformations from rn5 to rn6.
b) Perform the liftover after getting the chain file.
## Solutions
__Answer 1__
```{r txdb}
suppressPackageStartupMessages({
library("TxDb.Rnorvegicus.UCSC.rn5.refGene")
})
txdb <- TxDb.Rnorvegicus.UCSC.rn5.refGene
## find all genes
ratGenes <- genes(txdb)
## list all sequences
seqinfo(ratGenes)
## subset to contain only standard chromosomes
ratGenes <- keepStandardChromosomes(ratGenes)
## find gene 'Acsl5'
acsl5 <- ratGenes[which(mcols(ratGenes)$gene_id==94340),]
acsl5
```
__Answer 2__
```{r bsgenome}
suppressPackageStartupMessages({
library(BSgenome.Rnorvegicus.UCSC.rn5)
})
ratSeq <- BSgenome.Rnorvegicus.UCSC.rn5
class(ratSeq)
## get the sequence
acsl5_sequence <- getSeq(ratSeq, acsl5)
## calculate the GC content
letterFrequency(acsl5_sequence, "GC", as.prob=TRUE)
```
__Answer 3__
```{r select-rat}
library("Rattus.norvegicus")
## get a mapping between all entrex id and gene names
ratGeneNames <- select(Rattus.norvegicus, ratGenes$gene_id,
columns=c("SYMBOL", 'GENEID'), keytype="GENEID")
## match the entrz id with result before subsetting
idx <- match(ratGeneNames$GENEID, ratGenes$gene_id)
## add mactched result to GRanges
mcols(ratGenes) <- ratGeneNames[idx,]
ratGenes
```
__Answer 4__
Steps include
a) Getting the GTF file from NCBI for a particular build of Homo
sapiens that you're interested in. ( AnnotationHub is a package inside
Bioconductor which automatically gets the file for you)
b) create a txdb object from this GTF file (which is read in as a GRanges)
c) extract genes from the txdb object as before.
These steps are beneficial if you cannot find pre-packaged genome annotations
for your favourite organism as a package inside Bioconductor.
```{r gtf-to-gr, eval=FALSE}
library(AnnotationHub)
ah = AnnotationHub()
## find the file
gtf_humans <- query(ah , c("gtf", "Homo sapiens", "grch38","80"))
gtf_humans
## download the file
gtfFile <- ah[["AH47066"]]
## create a txdb
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtfFile) #may take some time..
txdb
## get the genes from the taxdb object
humanGenes <- genes(txdb)
```
__Answer 5__
One way of getting a chain file would be to find the file
in ucsc, download it and read it in using `rtracklayer::import.chain()`.
An easier solution would be to find the files via `AnnotationHub`
```{r chainfile}
## a) get the chain file
## load the package and query the files to find the file we want
library(AnnotationHub)
ah = AnnotationHub()
query(ah , c("rattus", "rn5", "rn6"))
## learn more about the file you want
ah["AH14761"]
## download the file
ratChain <- ah[["AH14761"]]
ratChain
## b) perform the liftover
library(rtracklayer)
lft <- liftOver(acsl5, ratChain)
lft
```
## References
- [Basic Introduction to GenomicRanges](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf)
- [Manipulating GRanges](http://bioconductor.org/packages/devel/bioc/vignettes/GenomeInfoDb/inst/doc/GenomeInfoDb.pdf)
- [GenomicRangesHOWTOs (Advanced)](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
- [Introduction to Bioconductor for Sequence Data](http://www.bioconductor.org/help/workflows/sequencing/)
- [Working with BSgenome Packages](http://bioconductor.org/packages/devel/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf)
- [Utilizing TxDb objects with GenomicFeatures](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf)
- [AnnotationHub Introduction](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html), [Case study](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html)
- [liftOver workflow](http://bioconductor.org/help/workflows/liftOver/)
## What to not miss at BioC2015 !
If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015
- [_Bioconductor annotation resources_](https://github.com/mrjc42/BiocAnnotRes2015) by Marc Carlson,
Sonali Arora. (Wednesday, Session 3, 1:00pm-2:45pm)
- _Practical introduction to Bioconductor foundational data
structures for high throughput sequencing analysis_
by Herve Pages, Michael Lawrence. (Wednesday, Session 3, 3:15pm-5:00pm)
## `sessionInfo()`
```{r sessionInfo}
sessionInfo()
```