Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2
Exercise 1
Find the package in Bioconductor which contains annotation databases generated from UCSC for Rat Norvegicus (assembly rn5), load it and save it in a variable called ‘txdb’. Using this object , do the following -
a) find all the genes contained in this assembly and save it in a varible called ‘ratGenes’.
b) How many sequences are contained in ratGenes ? (Hint : ?Seqinfo)
c) The ‘ratGenes’ contains scaffolds also - how do you subsetthe object to contain sequences only from the standard chromosomes ?
b) I am interested in gene ‘Acsl5’ (entrez gene id=94340). Does this exist in ‘ratGenes’? what are its chromosome coordinates ?
Exercise 2
Find the package in Bioconductor thats stores the Full genome sequences for Rattus norvegicus (Rat) as provided by UCSC (rn5, Mar. 2012)
a) Load the package and save it in a object called ‘ratSeq’
b) what object is this sequence information stored in ?
c) get the dna sequence information for acsl5 and store it in ‘acsl5_sequence’
d) calculate the GC content from this sequnence.
Exercise 3
In the ‘ratGenes’ object above, you get only the entrez gene ids, can you get the gene names for each gene ?
Exercise 4
Get the annotation databases from NCBI for Homo sapiens (assembly build GRCh38.80), create a txdb object (similar to what we saw in question 3 above) and get the genes. ( Hint - involves starting from scratch with a GTF file)
Exercise 5
The liftOver facilities developed in conjunction with the UCSC browser track infrastructure are available for transforming data in GRanges formats. We want to transform data from rn4 to the latest asembly rn6.
a) The transformation to rn6 coordinates is defined by a chain file provided by UCSC. Get the chain file which contains transformations from rn5 to rn6.
b) Perform the liftover after getting the chain file.
Answer 1
suppressPackageStartupMessages({
library("TxDb.Rnorvegicus.UCSC.rn5.refGene")
})
txdb <- TxDb.Rnorvegicus.UCSC.rn5.refGene
## find all genes
ratGenes <- genes(txdb)
## list all sequences
seqinfo(ratGenes)
## Seqinfo object with 2739 sequences (1 circular) from rn5 genome:
## seqnames seqlengths isCircular genome
## chr1 290094216 FALSE rn5
## chr2 285068071 FALSE rn5
## chr3 183740530 FALSE rn5
## chr4 248343840 FALSE rn5
## chr5 177180328 FALSE rn5
## ... ... ... ...
## chrUn_JH620694 6347 FALSE rn5
## chrUn_JH620695 1669 FALSE rn5
## chrUn_JH620696 7236 FALSE rn5
## chrUn_JH620697 3488 FALSE rn5
## chrUn_JH620698 3129 FALSE rn5
## subset to contain only standard chromosomes
ratGenes <- keepStandardChromosomes(ratGenes)
## find gene 'Acsl5'
acsl5 <- ratGenes[which(mcols(ratGenes)$gene_id==94340),]
acsl5
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 94340 chr1 [283637899, 283685361] + | 94340
## -------
## seqinfo: 22 sequences (1 circular) from rn5 genome
Answer 2
suppressPackageStartupMessages({
library(BSgenome.Rnorvegicus.UCSC.rn5)
})
ratSeq <- BSgenome.Rnorvegicus.UCSC.rn5
class(ratSeq)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
## get the sequence
acsl5_sequence <- getSeq(ratSeq, acsl5)
## calculate the GC content
letterFrequency(acsl5_sequence, "GC", as.prob=TRUE)
## G|C
## [1,] 0.4156501
Answer 3
library("Rattus.norvegicus")
## Loading required package: OrganismDbi
## Loading required package: GO.db
##
## Loading required package: org.Rn.eg.db
##
## Now getting the GODb Object directly
## Now getting the OrgDb Object directly
## Now getting the TxDb Object directly
## get a mapping between all entrex id and gene names
ratGeneNames <- select(Rattus.norvegicus, ratGenes$gene_id,
columns=c("SYMBOL", 'GENEID'), keytype="GENEID")
## 'select()' returned 1:1 mapping between keys and columns
## match the entrz id with result before subsetting
idx <- match(ratGeneNames$GENEID, ratGenes$gene_id)
## add mactched result to GRanges
mcols(ratGenes) <- ratGeneNames[idx,]
ratGenes
## GRanges object with 17165 ranges and 2 metadata columns:
## seqnames ranges strand | GENEID SYMBOL
## <Rle> <IRanges> <Rle> | <character> <character>
## 100034253 chrX [ 20785115, 20818062] - | 100034253 Gnl3l
## 100036582 chr8 [ 20639977, 20641201] + | 100036582 Olr1867
## 100036765 chr12 [ 39085314, 39111846] + | 100036765 Ccdc92
## 100049583 chr8 [117147872, 117149172] - | 100049583 Trex1
## 100124593 chr8 [132020812, 132021866] + | 100124593 Cxcr6
## ... ... ... ... ... ... ...
## 94338 chr19 [ 49107658, 49191221] - | 94338 Smpd3
## 94339 chr5 [176554525, 176557154] - | 94339 Mmp23
## 94340 chr1 [283637899, 283685361] + | 94340 Acsl5
## 94341 chr9 [ 94208941, 94217050] - | 94341 Kcnj13
## 94342 chr20 [ 7198625, 7211343] + | 94342 Bag6
## -------
## seqinfo: 22 sequences (1 circular) from rn5 genome
Answer 4
Steps include
a) Getting the GTF file from NCBI for a particular build of Homo
sapiens that you’re interested in. ( AnnotationHub is a package inside
Bioconductor which automatically gets the file for you)
b) create a txdb object from this GTF file (which is read in as a GRanges)
c) extract genes from the txdb object as before.
These steps are beneficial if you cannot find pre-packaged genome annotations
for your favourite organism as a package inside Bioconductor.
library(AnnotationHub)
ah = AnnotationHub()
## find the file
gtf_humans <- query(ah , c("gtf", "Homo sapiens", "grch38","80"))
gtf_humans
## download the file
gtfFile <- ah[["AH47066"]]
## create a txdb
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtfFile) #may take some time..
txdb
## get the genes from the taxdb object
humanGenes <- genes(txdb)
Answer 5
One way of getting a chain file would be to find the file
in ucsc, download it and read it in using rtracklayer::import.chain()
.
An easier solution would be to find the files via AnnotationHub
## a) get the chain file
## load the package and query the files to find the file we want
library(AnnotationHub)
ah = AnnotationHub()
## snapshotDate(): 2015-05-26
query(ah , c("rattus", "rn5", "rn6"))
## AnnotationHub with 2 records
## # snapshotDate(): 2015-05-26
## # $dataprovider: UCSC
## # $species: Rattus norvegicus
## # $rdataclass: ChainFile
## # additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH14745"]]'
##
## title
## AH14745 | rn6ToRn5.over.chain.gz
## AH14761 | rn5ToRn6.over.chain.gz
## learn more about the file you want
ah["AH14761"]
## AnnotationHub with 1 record
## # snapshotDate(): 2015-05-26
## # names(): AH14761
## # $dataprovider: UCSC
## # $species: Rattus norvegicus
## # $rdataclass: ChainFile
## # $title: rn5ToRn6.over.chain.gz
## # $description: UCSC liftOver chain file from rn5 to rn6
## # $taxonomyid: 10116
## # $genome: rn5
## # $sourcetype: Chain
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/rn5/liftOver/rn5ToRn6.over.chain.gz
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: liftOver, chain, UCSC, genome, homology
## # retrieve record with 'object[["AH14761"]]'
## download the file
ratChain <- ah[["AH14761"]]
ratChain
## Chain of length 22
## names(22): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr16 chr17 chr18 chr19 chr20 chrX chrM
## b) perform the liftover
library(rtracklayer)
lft <- liftOver(acsl5, ratChain)
lft
## GRangesList object of length 1:
## $94340
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 [276240703, 276246818] + | 94340
## [2] chr1 [276249487, 276251786] + | 94340
## [3] chr1 [276253038, 276277131] + | 94340
## [4] chr1 [276278664, 276288427] + | 94340
## [5] chr1 [276288451, 276290006] + | 94340
##
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015
sessionInfo()
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rattus.norvegicus_1.3.1 org.Rn.eg.db_3.1.2
## [3] GO.db_3.1.2 OrganismDbi_1.11.42
## [5] BSgenome.Rnorvegicus.UCSC.rn5_1.4.0 BSgenome_1.37.3
## [7] rtracklayer_1.29.12 TxDb.Rnorvegicus.UCSC.rn5.refGene_3.1.3
## [9] org.Hs.eg.db_3.1.2 RSQLite_1.0.0
## [11] DBI_0.3.1 TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
## [13] GenomicFeatures_1.21.13 AnnotationDbi_1.31.17
## [15] AnnotationHub_2.1.30 RNAseqData.HNRNPC.bam.chr14_0.7.0
## [17] GenomicAlignments_1.5.11 Rsamtools_1.21.14
## [19] Biostrings_2.37.2 XVector_0.9.1
## [21] SummarizedExperiment_0.3.2 Biobase_2.29.1
## [23] GenomicRanges_1.21.16 GenomeInfoDb_1.5.8
## [25] IRanges_2.3.14 S4Vectors_0.7.10
## [27] BiocGenerics_0.15.3 ggplot2_1.0.1
## [29] BiocStyle_1.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.6 digest_0.6.8 mime_0.3
## [4] R6_2.1.0 plyr_1.8.3 futile.options_1.0.0
## [7] evaluate_0.7 httr_1.0.0 BiocInstaller_1.19.8
## [10] zlibbioc_1.15.0 curl_0.9.1 rmarkdown_0.7
## [13] proto_0.3-10 labeling_0.3 BiocParallel_1.3.34
## [16] stringr_1.0.0 RCurl_1.95-4.7 biomaRt_2.25.1
## [19] munsell_0.4.2 shiny_0.12.1 httpuv_1.3.2
## [22] htmltools_0.2.6 interactiveDisplayBase_1.7.0 codetools_0.2-14
## [25] XML_3.98-1.3 MASS_7.3-43 bitops_1.0-6
## [28] RBGL_1.45.1 grid_3.2.1 xtable_1.7-4
## [31] gtable_0.1.2 magrittr_1.5 formatR_1.2
## [34] scales_0.2.5 graph_1.47.2 stringi_0.5-5
## [37] reshape2_1.4.1 futile.logger_1.4.1 lambda.r_1.1.7
## [40] tools_3.2.1 yaml_2.1.13 colorspace_1.2-6
## [43] knitr_1.10.5