Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2
Analysis and comprehension of high-throughput genomic data
Packages, vignettes, work flows
Helpful Links
A typical workflow consists of the following steps.
- Experimental design
- Wet-lab preparation
- High-throughput sequencing
+ Output: FASTQ files of reads and their quality scores
- Alignment + Many different aligners, some specialized for different purposes
+ Output: BAM files of aligned reads
- Summary
+ e.g., count of reads overlapping regions of interest (e.g., genes)
- Statistical analysis
- Comprehension
One of the biggest strengths of Bioconductor is the classes defined to make simple tasks extremely easy and streamlined.
Many biologically interesting questions represent operations on ranges
GenomicRanges::summarizeOverlaps()
GenomicRanges::nearest()
, [ChIPseeker][]GRanges Algebra
shift()
, narrow()
, flank()
, promoters()
, resize()
, restrict()
, trim()
?"intra-range-methods"
range()
, reduce()
, gaps()
, disjoin()
coverage()
(!)?"inter-range-methods"
findOverlaps()
, countOverlaps()
, …, %over%
, %within%
, %outside%
; union()
, intersect()
, setdiff()
, punion()
, pintersect()
, psetdiff()
The SummarizedExperiment class is a matrix-like container where rows represent ranges of interest (as a ‘GRanges or GRangesList-class’) and columns represent samples (with sample data summarized as a ‘DataFrame-class’)
Example - Reading in BAM files
The GenomicAlignments package is used to input reads aligned to a reference genome. In this next example, we will read in a BAM file and specifically read in reads supporting an apparent
exon splice junction spanning position 19653773 of chromosome 14.
The package RNAseqData.HNRNPC.bam.chr14_BAMFILES contains 8 BAM files. We will use only the first BAM file. We will load the software packages and the data package, construct a GRanges with our region of interest, and use summarizeJunctions()
to find reads in our region of interest.
## 1. load software packages
library(GenomicRanges)
library(GenomicAlignments)
## 2. load sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
## 3. define our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1))
## 4. alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]
## 5. supporting reads
paln[j_overlap$revmap[[1]]]
## GAlignmentsList object of length 8:
## [[1]]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## [1] chr14 - 66M120N6M 72 19653707 19653898 192 1
## [2] chr14 + 7M1270N65M 72 19652348 19653689 1342 1
##
## [[2]]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## [1] chr14 - 66M120N6M 72 19653707 19653898 192 1
## [2] chr14 + 72M 72 19653686 19653757 72 0
##
## [[3]]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## [1] chr14 + 72M 72 19653675 19653746 72 0
## [2] chr14 - 65M120N7M 72 19653708 19653899 192 1
##
## ...
## <5 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome
AnnotationHub is a web client with which one can browse and
download biological files from various databases such as UCSC, NCBI.
Using this package allows the user to directly get the file, without needing
to figure out where the file is located on UCSC, downloading it and managing
multiple files on their local machine.
library(AnnotationHub)
ah = AnnotationHub()
## data is available from the following sources
unique(ah$dataprovider)
## [1] "Ensembl" "EncodeDCC"
## [3] "UCSC" "Inparanoid8"
## [5] "NCBI" "NHLBI"
## [7] "ChEA" "Pazar"
## [9] "NIH Pathway Interaction Database" "RefNet"
## [11] "Haemcode" "GEO"
## [13] "BroadInstitute" "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
## [15] "dbSNP"
## following types of files can be retrieved from the hub
unique(ah$sourcetype)
## [1] "FASTA" "BED" "UCSC track" "GTF" "Inparanoid" "NCBI/blast2GO"
## [7] "TwoBit" "Chain" "GRASP" "Zip" "CSV" "BioPax"
## [13] "BioPaxLevel2" "RData" "BigWig" "tar.gz" "tab" "NCBI/ensembl"
## [19] "VCF"
## We will download all _Homo sapiens_ cDNA sequences from the FASTA file
## 'Homo_sapiens.GRCh38.cdna.all.fa' from Ensembl using
## `r Biocpkg("AnnotationHub")`.
ah2 <- query(ah, c("fasta", "homo sapiens", "Ensembl"))
fa <- ah2[["AH18522"]]
fa
## class: FaFile
## path: /home/ubuntu/.AnnotationHub/22617
## index: /home/ubuntu/.AnnotationHub/25666
## isOpen: FALSE
## yieldSize: NA
dbfile(txdb)
GenomicFeatures::makeTxDbFrom*()
exons()
, transcripts()
, genes()
, cds()
(coding sequence)promoters()
& friendsexonsBy()
& friends – exons by gene, transcript, …keytypes()
, columns()
, keys()
, select()
, mapIds()
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # TaxID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-05-12 10:59:39 -0700 (Tue, 12 May 2015)
## # GenomicFeatures version at creation time: 1.21.3
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
methods(class=class(txdb))
## [1] $ $<- ExpressionSet annotatedDataFrameFrom
## [5] as.list asBED asGFF assayData
## [9] assayData<- cds cdsBy cdsByOverlaps
## [13] coerce columns combine contents
## [17] dbInfo dbconn dbfile dbmeta
## [21] dbschema disjointExons distance exons
## [25] exonsBy exonsByOverlaps extractUpstreamSeqs featureNames
## [29] featureNames<- fiveUTRsByTranscript genes initialize
## [33] intronsByTranscript isActiveSeq isActiveSeq<- isNA
## [37] keys keytypes mapIds mapToTranscripts
## [41] mappedkeys metadata microRNAs nhit
## [45] organism promoters revmap sample
## [49] sampleNames sampleNames<- saveDb select
## [53] seqinfo seqinfo<- seqlevels0 show
## [57] species storageMode storageMode<- tRNAs
## [61] taxonomyId threeUTRsByTranscript transcripts transcriptsBy
## [65] transcriptsByOverlaps updateObject
## see '?methods' for accessing help and source code
genes(txdb)
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 [ 58858172, 58874214] - | 1
## 10 chr8 [ 18248755, 18258723] + | 10
## 100 chr20 [ 43248163, 43280376] - | 100
## 1000 chr18 [ 25530930, 25757445] - | 1000
## 10000 chr1 [243651535, 244006886] - | 10000
## ... ... ... ... ... ...
## 9991 chr9 [114979995, 115095944] - | 9991
## 9992 chr21 [ 35736323, 35743440] + | 9992
## 9993 chr22 [ 19023795, 19109967] - | 9993
## 9994 chr6 [ 90539619, 90584155] + | 9994
## 9997 chr22 [ 50961997, 50964905] - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
TxDb
keytypes()
, columns()
, keys()
, select()
, mapIds()
library(org.Hs.eg.db)
select(org.Hs.eg.db, c("BRCA1", "PTEN"), c("ENTREZID", "GENENAME"), "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENTREZID GENENAME
## 1 BRCA1 672 breast cancer 1, early onset
## 2 PTEN 5728 phosphatase and tensin homolog
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
## [7] "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [19] "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
## [7] "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [19] "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
Bioconductor packages are organized by biocViews. One can answer a number of Biological Questions using various packages. Some of the entries under Sequencing and other terms, and representative packages, include:
ChIPSeq, e.g.,DiffBind, csaw, ChIPseeker, ChIPQC.
SNPs and other variants, e.g., VariantAnnotation, VariantFiltering, h5vc.
CopyNumberVariation e.g., DNAcopy, crlmm, fastseg.
Microbiome and metagenome sequencing, e.g., metagenomeSeq, phyloseq, DirichletMultinomial.
sessionInfo()
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.1.2 RSQLite_1.0.0
## [3] DBI_0.3.1 TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
## [5] GenomicFeatures_1.21.13 AnnotationDbi_1.31.17
## [7] AnnotationHub_2.1.30 RNAseqData.HNRNPC.bam.chr14_0.7.0
## [9] GenomicAlignments_1.5.11 Rsamtools_1.21.14
## [11] Biostrings_2.37.2 XVector_0.9.1
## [13] SummarizedExperiment_0.3.2 Biobase_2.29.1
## [15] GenomicRanges_1.21.16 GenomeInfoDb_1.5.8
## [17] IRanges_2.3.14 S4Vectors_0.7.10
## [19] BiocGenerics_0.15.3 ggplot2_1.0.1
## [21] BiocStyle_1.7.4
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.1 colorspace_1.2-6 htmltools_0.2.6
## [4] rtracklayer_1.29.12 yaml_2.1.13 interactiveDisplayBase_1.7.0
## [7] XML_3.98-1.3 BiocParallel_1.3.34 lambda.r_1.1.7
## [10] plyr_1.8.3 stringr_1.0.0 zlibbioc_1.15.0
## [13] munsell_0.4.2 gtable_0.1.2 futile.logger_1.4.1
## [16] codetools_0.2-14 evaluate_0.7 labeling_0.3
## [19] knitr_1.10.5 biomaRt_2.25.1 httpuv_1.3.2
## [22] BiocInstaller_1.19.8 curl_0.9.1 proto_0.3-10
## [25] Rcpp_0.11.6 xtable_1.7-4 scales_0.2.5
## [28] formatR_1.2 mime_0.3 digest_0.6.8
## [31] stringi_0.5-5 shiny_0.12.1 grid_3.2.1
## [34] tools_3.2.1 bitops_1.0-6 magrittr_1.5
## [37] RCurl_1.95-4.7 futile.options_1.0.0 MASS_7.3-43
## [40] rmarkdown_0.7 httr_1.0.0 R6_2.1.0