The material in this course requires R version 3.2.1 and Bioconductor version 3.2

0.1 Intermediate lab for Bioconductor

Exercise 1
Find the package in Bioconductor which contains annotation databases generated from UCSC for Rat Norvegicus (assembly rn5), load it and save it in a variable called ‘txdb’. Using this object , do the following -
a) find all the genes contained in this assembly and save it in a varible called ‘ratGenes’.
b) How many sequences are contained in ratGenes ? (Hint : ?Seqinfo)
c) The ‘ratGenes’ contains scaffolds also - how do you subsetthe object to contain sequences only from the standard chromosomes ?
b) I am interested in gene ‘Acsl5’ (entrez gene id=94340). Does this exist in ‘ratGenes’? what are its chromosome coordinates ?

Exercise 2
Find the package in Bioconductor thats stores the Full genome sequences for Rattus norvegicus (Rat) as provided by UCSC (rn5, Mar. 2012)
a) Load the package and save it in a object called ‘ratSeq’
b) what object is this sequence information stored in ?
c) get the dna sequence information for acsl5 and store it in ‘acsl5_sequence’
d) calculate the GC content from this sequnence.

Exercise 3
In the ‘ratGenes’ object above, you get only the entrez gene ids, can you get the gene names for each gene ?

Exercise 4
Get the annotation databases from NCBI for Homo sapiens (assembly build GRCh38.80), create a txdb object (similar to what we saw in question 3 above) and get the genes. ( Hint - involves starting from scratch with a GTF file)

Exercise 5
The liftOver facilities developed in conjunction with the UCSC browser track infrastructure are available for transforming data in GRanges formats. We want to transform data from rn4 to the latest asembly rn6.
a) The transformation to rn6 coordinates is defined by a chain file provided by UCSC. Get the chain file which contains transformations from rn5 to rn6.
b) Perform the liftover after getting the chain file.

0.2 Solutions

Answer 1

suppressPackageStartupMessages({
    library("TxDb.Rnorvegicus.UCSC.rn5.refGene")
})
txdb <- TxDb.Rnorvegicus.UCSC.rn5.refGene

## find all genes 
ratGenes <- genes(txdb)

## list all sequences 
seqinfo(ratGenes)

## Seqinfo object with 2739 sequences (1 circular) from rn5 genome:
##   seqnames       seqlengths isCircular genome
##   chr1            290094216      FALSE    rn5
##   chr2            285068071      FALSE    rn5
##   chr3            183740530      FALSE    rn5
##   chr4            248343840      FALSE    rn5
##   chr5            177180328      FALSE    rn5
##   ...                   ...        ...    ...
##   chrUn_JH620694       6347      FALSE    rn5
##   chrUn_JH620695       1669      FALSE    rn5
##   chrUn_JH620696       7236      FALSE    rn5
##   chrUn_JH620697       3488      FALSE    rn5
##   chrUn_JH620698       3129      FALSE    rn5

## subset to contain only standard chromosomes
ratGenes <- keepStandardChromosomes(ratGenes)

## find gene 'Acsl5'
acsl5 <- ratGenes[which(mcols(ratGenes)$gene_id==94340),]
acsl5

## GRanges object with 1 range and 1 metadata column:
##         seqnames                 ranges strand |     gene_id
##            <Rle>              <IRanges>  <Rle> | <character>
##   94340     chr1 [283637899, 283685361]      + |       94340
##   -------
##   seqinfo: 22 sequences (1 circular) from rn5 genome

Answer 2

suppressPackageStartupMessages({
    library(BSgenome.Rnorvegicus.UCSC.rn5)
})
ratSeq <- BSgenome.Rnorvegicus.UCSC.rn5
class(ratSeq)

## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"

## get the sequence 
acsl5_sequence <- getSeq(ratSeq, acsl5)

## calculate the GC content 
letterFrequency(acsl5_sequence, "GC", as.prob=TRUE)

##            G|C
## [1,] 0.4156501

Answer 3

library("Rattus.norvegicus")

## Loading required package: OrganismDbi
## Loading required package: GO.db
## 
## Loading required package: org.Rn.eg.db
## 
## Now getting the GODb Object directly
## Now getting the OrgDb Object directly
## Now getting the TxDb Object directly

## get a mapping between all entrex id and gene names
ratGeneNames <- select(Rattus.norvegicus, ratGenes$gene_id, 
                       columns=c("SYMBOL", 'GENEID'), keytype="GENEID")

## 'select()' returned 1:1 mapping between keys and columns

## match the entrz id with result before subsetting  
idx <- match(ratGeneNames$GENEID, ratGenes$gene_id)

## add mactched result to GRanges
mcols(ratGenes) <- ratGeneNames[idx,]
ratGenes

## GRanges object with 17165 ranges and 2 metadata columns:
##             seqnames                 ranges strand   |      GENEID      SYMBOL
##                <Rle>              <IRanges>  <Rle>   | <character> <character>
##   100034253     chrX [ 20785115,  20818062]      -   |   100034253       Gnl3l
##   100036582     chr8 [ 20639977,  20641201]      +   |   100036582     Olr1867
##   100036765    chr12 [ 39085314,  39111846]      +   |   100036765      Ccdc92
##   100049583     chr8 [117147872, 117149172]      -   |   100049583       Trex1
##   100124593     chr8 [132020812, 132021866]      +   |   100124593       Cxcr6
##         ...      ...                    ...    ... ...         ...         ...
##       94338    chr19 [ 49107658,  49191221]      -   |       94338       Smpd3
##       94339     chr5 [176554525, 176557154]      -   |       94339       Mmp23
##       94340     chr1 [283637899, 283685361]      +   |       94340       Acsl5
##       94341     chr9 [ 94208941,  94217050]      -   |       94341      Kcnj13
##       94342    chr20 [  7198625,   7211343]      +   |       94342        Bag6
##   -------
##   seqinfo: 22 sequences (1 circular) from rn5 genome

Answer 4
Steps include
a) Getting the GTF file from NCBI for a particular build of Homo
sapiens that you’re interested in. ( AnnotationHub is a package inside
Bioconductor which automatically gets the file for you)
b) create a txdb object from this GTF file (which is read in as a GRanges)
c) extract genes from the txdb object as before.

These steps are beneficial if you cannot find pre-packaged genome annotations
for your favourite organism as a package inside Bioconductor.

library(AnnotationHub)
ah = AnnotationHub()

## find the file 
gtf_humans <- query(ah , c("gtf", "Homo sapiens", "grch38","80"))
gtf_humans

## download the file
gtfFile <- ah[["AH47066"]]

## create a txdb 
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtfFile)  #may take some time..
txdb

## get the genes from the taxdb object
humanGenes <- genes(txdb)

Answer 5
One way of getting a chain file would be to find the file
in ucsc, download it and read it in using rtracklayer::import.chain().
An easier solution would be to find the files via AnnotationHub

## a) get the chain file 

## load the package and query the files to find the file we want
library(AnnotationHub)
ah = AnnotationHub()

## snapshotDate(): 2015-05-26

query(ah , c("rattus", "rn5", "rn6"))

## AnnotationHub with 2 records
## # snapshotDate(): 2015-05-26 
## # $dataprovider: UCSC
## # $species: Rattus norvegicus
## # $rdataclass: ChainFile
## # additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH14745"]]' 
## 
##             title                 
##   AH14745 | rn6ToRn5.over.chain.gz
##   AH14761 | rn5ToRn6.over.chain.gz

## learn more about the file you want
ah["AH14761"]

## AnnotationHub with 1 record
## # snapshotDate(): 2015-05-26 
## # names(): AH14761
## # $dataprovider: UCSC
## # $species: Rattus norvegicus
## # $rdataclass: ChainFile
## # $title: rn5ToRn6.over.chain.gz
## # $description: UCSC liftOver chain file from rn5 to rn6
## # $taxonomyid: 10116
## # $genome: rn5
## # $sourcetype: Chain
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/rn5/liftOver/rn5ToRn6.over.chain.gz
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: liftOver, chain, UCSC, genome, homology 
## # retrieve record with 'object[["AH14761"]]'

## download the file
ratChain <- ah[["AH14761"]]
ratChain

## Chain of length 22
## names(22): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr16 chr17 chr18 chr19 chr20 chrX chrM

## b) perform the liftover 

library(rtracklayer)
lft <- liftOver(acsl5, ratChain)
lft

## GRangesList object of length 1:
## $94340 
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames                 ranges strand |     gene_id
##          <Rle>              <IRanges>  <Rle> | <character>
##   [1]     chr1 [276240703, 276246818]      + |       94340
##   [2]     chr1 [276249487, 276251786]      + |       94340
##   [3]     chr1 [276253038, 276277131]      + |       94340
##   [4]     chr1 [276278664, 276288427]      + |       94340
##   [5]     chr1 [276288451, 276290006]      + |       94340
## 
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths

0.3 References

0.4 What to not miss at BioC2015 !

If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015

Bioconductor annotation resources by Marc Carlson, Sonali Arora. (Wednesday, Session 3, 1:00pm-2:45pm)
Practical introduction to Bioconductor foundational data structures for high throughput sequencing analysis by Herve Pages, Michael Lawrence. (Wednesday, Session 3, 3:15pm-5:00pm)

0.5 `sessionInfo()`

sessionInfo()

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Rattus.norvegicus_1.3.1                 org.Rn.eg.db_3.1.2                     
##  [3] GO.db_3.1.2                             OrganismDbi_1.11.42                    
##  [5] BSgenome.Rnorvegicus.UCSC.rn5_1.4.0     BSgenome_1.37.3                        
##  [7] rtracklayer_1.29.12                     TxDb.Rnorvegicus.UCSC.rn5.refGene_3.1.3
##  [9] org.Hs.eg.db_3.1.2                      RSQLite_1.0.0                          
## [11] DBI_0.3.1                               TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
## [13] GenomicFeatures_1.21.13                 AnnotationDbi_1.31.17                  
## [15] AnnotationHub_2.1.30                    RNAseqData.HNRNPC.bam.chr14_0.7.0      
## [17] GenomicAlignments_1.5.11                Rsamtools_1.21.14                      
## [19] Biostrings_2.37.2                       XVector_0.9.1                          
## [21] SummarizedExperiment_0.3.2              Biobase_2.29.1                         
## [23] GenomicRanges_1.21.16                   GenomeInfoDb_1.5.8                     
## [25] IRanges_2.3.14                          S4Vectors_0.7.10                       
## [27] BiocGenerics_0.15.3                     ggplot2_1.0.1                          
## [29] BiocStyle_1.7.4                        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6                  digest_0.6.8                 mime_0.3                    
##  [4] R6_2.1.0                     plyr_1.8.3                   futile.options_1.0.0        
##  [7] evaluate_0.7                 httr_1.0.0                   BiocInstaller_1.19.8        
## [10] zlibbioc_1.15.0              curl_0.9.1                   rmarkdown_0.7               
## [13] proto_0.3-10                 labeling_0.3                 BiocParallel_1.3.34         
## [16] stringr_1.0.0                RCurl_1.95-4.7               biomaRt_2.25.1              
## [19] munsell_0.4.2                shiny_0.12.1                 httpuv_1.3.2                
## [22] htmltools_0.2.6              interactiveDisplayBase_1.7.0 codetools_0.2-14            
## [25] XML_3.98-1.3                 MASS_7.3-43                  bitops_1.0-6                
## [28] RBGL_1.45.1                  grid_3.2.1                   xtable_1.7-4                
## [31] gtable_0.1.2                 magrittr_1.5                 formatR_1.2                 
## [34] scales_0.2.5                 graph_1.47.2                 stringi_0.5-5               
## [37] reshape2_1.4.1               futile.logger_1.4.1          lambda.r_1.1.7              
## [40] tools_3.2.1                  yaml_2.1.13                  colorspace_1.2-6            
## [43] knitr_1.10.5

Intermediate Lab for R & Bioconductor

Sonali Arora

Contents

0.1 Intermediate lab for Bioconductor

0.2 Solutions

0.3 References

0.4 What to not miss at BioC2015 !

0.5 `sessionInfo()`

Intermediate Lab for R & Bioconductor

Sonali Arora

Contents

0.1 Intermediate lab for Bioconductor

0.2 Solutions

0.3 References

0.4 What to not miss at BioC2015 !

0.5 sessionInfo()

0.5 `sessionInfo()`