Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2
Exercise 1
- How do you download any package from Bioconductor ?
- How do you access the vignettes for a given package ?
- How do you find help for a function (say sortSeqlevels
)
Execise 2 : Hone your R skills
The Broad Institute has the EpigenomeRoadMap Project Metadata for the project has been made available for you.
- Read the file into R.
- What does R read it in as ?
- How do you see the first few or last few rows of the dataset ?
- How many rows and columns are there inside this file ?
- What are the column headers ?
- What are the data types of each column ?
- Can you summarize the whole data (Hint: ?summary) Summarize the number of males and females in this dataset. - The column called GROUP
contains the source of the sample. Can you subset the data.frame to get all samples belonging to BRAIN
and DIGESTIVE
Exercise 3 : Getting Comfortable with GenomicRanges
Using the given GRanges , do the following
- Extract ranges only from chromosome 3
- Extract the first five ranges from the GRanges.
- Extract the score and gc column of the GRanges
- Keep only the standard chromosomes (i.e.) from chromosome 1 to 22, X,Y,M.
- Change the chromosome naming style i.e. this GRanges contains UCSC style of chromosome names, change them to NCBI style of chromosome names.
- How do you find out the ranges contained in the gaps of this GRanges object?
- How do you find out the degree of overlap for all the ranges in a GRanges object ? ( Hint: ?coverage)
library(GenomicRanges)
gr <-
GRanges(seqnames = paste0("chr", c(1:22, tail(letters, 11))),
ranges = IRanges(start=1:33, width = 1000 ),
strand = c(rep("+", 10), rep("-", 23)),
score = 1:33,
GC = seq(1, 0, length=33))
Exercise 4: Create and Manipulate a SummarizeExperiment Object
In this small exercise, We have data for 20 genes from 9 highly talented individuals and we will create our first SummarizedExperiment
object.
The data for 20 genes from 9 individuals results in a matrix
data <- matrix(1:180, ncol=9, byrow=TRUE)
The data from the 20 genes can be represented as a GRanges
gr_20gene <-
GRanges(seqnames = paste0("gene", 1:20),
ranges = IRanges(start=1:20, width = 1000 ),
strand = c(rep("+", 10), rep("-", 10)),
score = 1:20,
GC = seq(1, 0, length=20))
The data about the 20 individuals is stored in a data.frame
sample_df <- data.frame( names=c("Martin", "Herve", "Dan",
"Marc", "Valerie", "Jim", "Nate","Paul", "Sonali"),
sex=c(rep("Male", 4), "Female", rep("Male", 3), "Female"))
SummarizedExperiment
from the three objects listed above.SummarizedExperiment
SummarizedExperiment
to create a new one which contains information only about the female core team.SummarizedExperiment
to create a new one containing only the first two genes.Answer 1
If the package we are trying to download is called GenomeInfoDb then
source("http://bioconductor.org/biocLite.R")
biocLite("GenomeInfoDb")
vignette(package="GenomeInfoDb")
?sortSeqlevels
Answer 2
## Reading the data
fname <- system.file("extdata", "epi_metadata.txt", package="BioC2015Introduction")
df <- read.delim(fname, stringsAsFactors=FALSE)
## Exploring the data
class(df)
## [1] "data.frame"
head(df)
## EID GROUP COLOR MNEMONIC STD_NAME
## 1 E001 ESC #924965 ESC.I3 ES-I3 Cells
## 2 E002 ESC #924965 ESC.WA7 ES-WA7 Cells
## 3 E003 ESC #924965 ESC.H1 H1 Cells
## 4 E004 ES-deriv #4178AE ESDR.H1.BMP4.MESO H1 BMP4 Derived Mesendoderm Cultured Cells
## 5 E005 ES-deriv #4178AE ESDR.H1.BMP4.TROP H1 BMP4 Derived Trophoblast Cultured Cells
## 6 E006 ES-deriv #4178AE ESDR.H1.MSC H1 Derived Mesenchymal Stem Cells
## EDACC_NAME ANATOMY TYPE AGE SEX SOLID_LIQUID
## 1 ES-I3_Cell_Line ESC PrimaryCulture CL Female <NA>
## 2 ES-WA7_Cell_Line ESC PrimaryCulture CL Female <NA>
## 3 H1_Cell_Line ESC PrimaryCulture CL Male <NA>
## 4 H1_BMP4_Derived_Mesendoderm_Cultured_Cells ESC_DERIVED ESCDerived CL Male <NA>
## 5 H1_BMP4_Derived_Trophoblast_Cultured_Cells ESC_DERIVED ESCDerived CL Male <NA>
## 6 H1_Derived_Mesenchymal_Stem_Cells ESC_DERIVED ESCDerived CL Male <NA>
## ETHNICITY SINGLEDONOR_COMPOSITE
## 1 <NA> SD
## 2 <NA> SD
## 3 <NA> SD
## 4 <NA> SD
## 5 <NA> SD
## 6 <NA> SD
tail(df)
## EID GROUP COLOR MNEMONIC STD_NAME
## 122 E124 ENCODE2012 #000000 BLD.CD14.MONO Monocytes-CD14+ RO01746 Primary Cells
## 123 E125 ENCODE2012 #000000 BRN.NHA NH-A Astrocytes Primary Cells
## 124 E126 ENCODE2012 #000000 SKIN.NHDFAD NHDF-Ad Adult Dermal Fibroblast Primary Cells
## 125 E127 ENCODE2012 #000000 SKIN.NHEK NHEK-Epidermal Keratinocyte Primary Cells
## 126 E128 ENCODE2012 #000000 LNG.NHLF NHLF Lung Fibroblast Primary Cells
## 127 E129 ENCODE2012 #000000 BONE.OSTEO Osteoblast Primary Cells
## EDACC_NAME ANATOMY TYPE AGE SEX SOLID_LIQUID ETHNICITY
## 122 Monocytes-CD14+_RO01746 BLOOD PrimaryCell Female
## 123 NH-A_Astrocytes BRAIN PrimaryCulture Unknown
## 124 NHDF-Ad_Adult_Dermal_Fibroblasts SKIN PrimaryCulture Female
## 125 NHEK-Epidermal_Keratinocytes SKIN PrimaryCulture Unknown
## 126 NHLF_Lung_Fibroblasts LUNG PrimaryCulture Unknown
## 127 Osteoblasts BONE PrimaryCulture Unknown
## SINGLEDONOR_COMPOSITE
## 122 SD
## 123 SD
## 124 SD
## 125 SD
## 126 SD
## 127 SD
dim(df)
## [1] 127 13
colnames(df)
## [1] "EID" "GROUP" "COLOR" "MNEMONIC"
## [5] "STD_NAME" "EDACC_NAME" "ANATOMY" "TYPE"
## [9] "AGE" "SEX" "SOLID_LIQUID" "ETHNICITY"
## [13] "SINGLEDONOR_COMPOSITE"
sapply(df, class)
## EID GROUP COLOR MNEMONIC
## "character" "character" "character" "character"
## STD_NAME EDACC_NAME ANATOMY TYPE
## "character" "character" "character" "character"
## AGE SEX SOLID_LIQUID ETHNICITY
## "character" "character" "character" "character"
## SINGLEDONOR_COMPOSITE
## "character"
## Summarize the data
summary(df)
## EID GROUP COLOR MNEMONIC STD_NAME
## Length:127 Length:127 Length:127 Length:127 Length:127
## Class :character Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character Mode :character
## EDACC_NAME ANATOMY TYPE AGE SEX
## Length:127 Length:127 Length:127 Length:127 Length:127
## Class :character Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character Mode :character
## SOLID_LIQUID ETHNICITY SINGLEDONOR_COMPOSITE
## Length:127 Length:127 Length:127
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
table(df$SEX)
##
## Female Female/Unknown Male Male/Unknown Mixed Unknown
## 38 1 51 1 12 24
## Subset the data
df[df$GROUP %in% c("Brain", "Digestive"),]
## EID GROUP COLOR MNEMONIC STD_NAME
## 65 E067 Brain #C5912B BRN.ANG.GYR Brain Angular Gyrus
## 66 E068 Brain #C5912B BRN.ANT.CAUD Brain Anterior Caudate
## 67 E069 Brain #C5912B BRN.CING.GYR Brain Cingulate Gyrus
## 68 E070 Brain #C5912B BRN.GRM.MTRX Brain Germinal Matrix
## 69 E071 Brain #C5912B BRN.HIPP.MID Brain Hippocampus Middle
## 70 E072 Brain #C5912B BRN.INF.TMP Brain Inferior Temporal Lobe
## 71 E073 Brain #C5912B BRN.DL.PRFRNTL.CRTX Brain_Dorsolateral_Prefrontal_Cortex
## 72 E074 Brain #C5912B BRN.SUB.NIG Brain Substantia Nigra
## 73 E075 Digestive #C58DAA GI.CLN.MUC Colonic Mucosa
## 75 E077 Digestive #C58DAA GI.DUO.MUC Duodenum Mucosa
## 77 E079 Digestive #C58DAA GI.ESO Esophagus
## 79 E081 Brain #C5912B BRN.FET.M Fetal Brain Male
## 80 E082 Brain #C5912B BRN.FET.F Fetal Brain Female
## 82 E084 Digestive #C58DAA GI.L.INT.FET Fetal Intestine Large
## 83 E085 Digestive #C58DAA GI.S.INT.FET Fetal Intestine Small
## 90 E092 Digestive #C58DAA GI.STMC.FET Fetal Stomach
## 92 E094 Digestive #C58DAA GI.STMC.GAST Gastric
## 99 E101 Digestive #C58DAA GI.RECT.MUC.29 Rectal Mucosa Donor 29
## 100 E102 Digestive #C58DAA GI.RECT.MUC.31 Rectal Mucosa Donor 31
## 104 E106 Digestive #C58DAA GI.CLN.SIG Sigmoid Colon
## 107 E109 Digestive #C58DAA GI.S.INT Small Intestine
## 108 E110 Digestive #C58DAA GI.STMC.MUC Stomach Mucosa
## EDACC_NAME ANATOMY TYPE AGE SEX SOLID_LIQUID
## 65 Brain_Angular_Gyrus BRAIN PrimaryTissue 75Y, 81Y Mixed SOLID
## 66 Brain_Anterior_Caudate BRAIN PrimaryTissue 75Y, 81Y Mixed SOLID
## 67 Brain_Cingulate_Gyrus BRAIN PrimaryTissue 75Y, 81Y Mixed SOLID
## 68 Brain_Germinal_Matrix BRAIN PrimaryTissue 20GW Male SOLID
## 69 Brain_Hippocampus_Middle BRAIN PrimaryTissue 81Y, 73Y Male SOLID
## 70 Brain_Inferior_Temporal_Lobe BRAIN PrimaryTissue 75Y, 81Y Mixed SOLID
## 71 Brain_Mid_Frontal_Lobe BRAIN PrimaryTissue 75Y, 81Y Mixed SOLID
## 72 Brain_Substantia_Nigra BRAIN PrimaryTissue 75Y, 81Y Mixed SOLID
## 73 Colonic_Mucosa GI_COLON PrimaryTissue 73Y Female SOLID
## 75 Duodenum_Mucosa GI_DUODENUM PrimaryTissue 76Y Male SOLID
## 77 Esophagus GI_ESOPHAGUS PrimaryTissue 34Y Male SOLID
## 79 Fetal_Brain_Male BRAIN PrimaryTissue 17GW, 17GW Male/Unknown SOLID
## 80 Fetal_Brain_Female BRAIN PrimaryTissue 17GW. 17GW Female SOLID
## 82 Fetal_Intestine_Large GI_INTESTINE PrimaryTissue 15GW Male SOLID
## 83 Fetal_Intestine_Small GI_INTESTINE PrimaryTissue 15GW Male SOLID
## 90 Fetal_Stomach GI_STOMACH PrimaryTissue Female SOLID
## 92 Gastric GI_STOMACH PrimaryTissue 34Y Male SOLID
## 99 Rectal_Mucosa.Donor_29 GI_RECTUM PrimaryTissue 50Y Female SOLID
## 100 Rectal_Mucosa.Donor_31 GI_RECTUM PrimaryTissue 61Y Female SOLID
## 104 Sigmoid_Colon GI_COLON PrimaryTissue 3Y, 34Y Male SOLID
## 107 Small_Intestine GI_INTESTINE PrimaryTissue 3Y, 34Y Male SOLID
## 108 Stomach_Mucosa GI_STOMACH PrimaryTissue 59Y Male SOLID
## ETHNICITY SINGLEDONOR_COMPOSITE
## 65 Unknown C
## 66 Unknown C
## 67 Unknown C
## 68 Unknown SD
## 69 Unknown C
## 70 Unknown C
## 71 Unknown C
## 72 Unknown C
## 73 Caucasian SD
## 75 Caucasian SD
## 77 Caucasian SD
## 79 Unknown C
## 80 Unknown C
## 82 Unknown SD
## 83 Unknown SD
## 90 C
## 92 Caucasian SD
## 99 Caucasian SD
## 100 Caucasian SD
## 104 Caucasian/African American, Caucasian C
## 107 Caucasian/African American, Caucasian C
## 108 Caucasian SD
Answer 3
library(GenomicRanges)
gr <-
GRanges(seqnames = paste0("chr", c(1:22, tail(letters, 11))),
ranges = IRanges(start=1:33, width = 1000 ),
strand = c(rep("+", 10), rep("-", 23)),
score = 1:33,
GC = seq(1, 0, length=33))
## extract ranges only from chromosome 3
gr[seqnames(gr) %in% "chr3",]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr3 [3, 1002] + | 3 0.9375
## -------
## seqinfo: 33 sequences from an unspecified genome; no seqlengths
## extract the first five ranges from the GRanges.
gr[1:5, ]
## GRanges object with 5 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr1 [1, 1000] + | 1 1
## [2] chr2 [2, 1001] + | 2 0.96875
## [3] chr3 [3, 1002] + | 3 0.9375
## [4] chr4 [4, 1003] + | 4 0.90625
## [5] chr5 [5, 1004] + | 5 0.875
## -------
## seqinfo: 33 sequences from an unspecified genome; no seqlengths
## extract the score and sequence column from a GRanges
mcols(gr)
## DataFrame with 33 rows and 2 columns
## score GC
## <integer> <numeric>
## 1 1 1.00000
## 2 2 0.96875
## 3 3 0.93750
## 4 4 0.90625
## 5 5 0.87500
## ... ... ...
## 29 29 0.12500
## 30 30 0.09375
## 31 31 0.06250
## 32 32 0.03125
## 33 33 0.00000
## keep only the standard chromosomes (i.e.) from chromosome 1 to 22, x, y,m
keepStandardChromosomes(gr)
## GRanges object with 22 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr1 [1, 1000] + | 1 1
## [2] chr2 [2, 1001] + | 2 0.96875
## [3] chr3 [3, 1002] + | 3 0.9375
## [4] chr4 [4, 1003] + | 4 0.90625
## [5] chr5 [5, 1004] + | 5 0.875
## ... ... ... ... ... ... ...
## [18] chr18 [18, 1017] - | 18 0.46875
## [19] chr19 [19, 1018] - | 19 0.4375
## [20] chr20 [20, 1019] - | 20 0.40625
## [21] chr21 [21, 1020] - | 21 0.375
## [22] chr22 [22, 1021] - | 22 0.34375
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
## change the chromosome naming style to NCBI
seqlevelsStyle(gr) <- "NCBI"
gr
## GRanges object with 33 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] 1 [1, 1000] + | 1 1
## [2] 2 [2, 1001] + | 2 0.96875
## [3] 3 [3, 1002] + | 3 0.9375
## [4] 4 [4, 1003] + | 4 0.90625
## [5] 5 [5, 1004] + | 5 0.875
## ... ... ... ... ... ... ...
## [29] chrv [29, 1028] - | 29 0.125
## [30] chrw [30, 1029] - | 30 0.09375
## [31] chrx [31, 1030] - | 31 0.0625
## [32] chry [32, 1031] - | 32 0.03125
## [33] chrz [33, 1032] - | 33 0
## -------
## seqinfo: 33 sequences from an unspecified genome; no seqlengths
## gaps in the ranges
gaps(gr)
## GRanges object with 32 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 2 [1, 1] +
## [2] 3 [1, 2] +
## [3] 4 [1, 3] +
## [4] 5 [1, 4] +
## [5] 6 [1, 5] +
## ... ... ... ...
## [28] chrv [1, 28] -
## [29] chrw [1, 29] -
## [30] chrx [1, 30] -
## [31] chry [1, 31] -
## [32] chrz [1, 32] -
## -------
## seqinfo: 33 sequences from an unspecified genome; no seqlengths
## find degree of overlap for ranges.
coverage(gr)
## RleList of length 33
## $`1`
## integer-Rle of length 1000 with 1 run
## Lengths: 1000
## Values : 1
##
## $`2`
## integer-Rle of length 1001 with 2 runs
## Lengths: 1 1000
## Values : 0 1
##
## $`3`
## integer-Rle of length 1002 with 2 runs
## Lengths: 2 1000
## Values : 0 1
##
## $`4`
## integer-Rle of length 1003 with 2 runs
## Lengths: 3 1000
## Values : 0 1
##
## $`5`
## integer-Rle of length 1004 with 2 runs
## Lengths: 4 1000
## Values : 0 1
##
## ...
## <28 more elements>
Answer4
library(SummarizedExperiment)
## data for the SummarizedExperiment object
sample_df <- data.frame( names=c("Martin", "Herve", "Dan",
"Marc", "Valerie", "Jim", "Nate","Paul", "Sonali"),
sex=c(rep("Male", 4), "Female", rep("Male", 3), "Female"))
gr_20genes <-
GRanges(seqnames = paste0("gene", 1:20),
ranges = IRanges(start=1:20, width = 1000 ),
strand = c(rep("+", 10), rep("-", 10)),
score = 1:20,
GC = seq(1, 0, length=20))
data <- matrix(1:180, ncol=9, byrow=TRUE)
## create a SummarizedExperiment object
core_se <- SummarizedExperiment(assays=data,
rowRanges=gr_20genes,
colData=DataFrame(sample_df))
core_se
## class: RangedSummarizedExperiment
## dim: 20 9
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowRanges metadata column names(2): score GC
## colnames: NULL
## colData names(2): names sex
## exploring the SummarizedExperiment object
dim(core_se)
## [1] 20 9
head(assay(core_se)) # data matrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1 2 3 4 5 6 7 8 9
## [2,] 10 11 12 13 14 15 16 17 18
## [3,] 19 20 21 22 23 24 25 26 27
## [4,] 28 29 30 31 32 33 34 35 36
## [5,] 37 38 39 40 41 42 43 44 45
## [6,] 46 47 48 49 50 51 52 53 54
rowRanges(core_se) # information about the genes
## GRanges object with 20 ranges and 2 metadata columns:
## seqnames ranges strand | score GC
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] gene1 [1, 1000] + | 1 1
## [2] gene2 [2, 1001] + | 2 0.947368421052632
## [3] gene3 [3, 1002] + | 3 0.894736842105263
## [4] gene4 [4, 1003] + | 4 0.842105263157895
## [5] gene5 [5, 1004] + | 5 0.789473684210526
## ... ... ... ... ... ... ...
## [16] gene16 [16, 1015] - | 16 0.210526315789474
## [17] gene17 [17, 1016] - | 17 0.157894736842105
## [18] gene18 [18, 1017] - | 18 0.105263157894737
## [19] gene19 [19, 1018] - | 19 0.0526315789473685
## [20] gene20 [20, 1019] - | 20 0
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
colData(core_se) # sample information
## DataFrame with 9 rows and 2 columns
## names sex
## <factor> <factor>
## 1 Martin Male
## 2 Herve Male
## 3 Dan Male
## 4 Marc Male
## 5 Valerie Female
## 6 Jim Male
## 7 Nate Male
## 8 Paul Male
## 9 Sonali Female
## subset the SummarizedExperiment object
## subsetting the sample information
core_se[, core_se$sex == "Female"]
## class: RangedSummarizedExperiment
## dim: 20 2
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowRanges metadata column names(2): score GC
## colnames: NULL
## colData names(2): names sex
## subsetting the gene information
core_se[,1:2]
## class: RangedSummarizedExperiment
## dim: 20 2
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowRanges metadata column names(2): score GC
## colnames: NULL
## colData names(2): names sex
sessionInfo()
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.1.2 RSQLite_1.0.0
## [3] DBI_0.3.1 TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
## [5] GenomicFeatures_1.21.13 AnnotationDbi_1.31.17
## [7] AnnotationHub_2.1.30 RNAseqData.HNRNPC.bam.chr14_0.7.0
## [9] GenomicAlignments_1.5.11 Rsamtools_1.21.14
## [11] Biostrings_2.37.2 XVector_0.9.1
## [13] SummarizedExperiment_0.3.2 Biobase_2.29.1
## [15] GenomicRanges_1.21.16 GenomeInfoDb_1.5.8
## [17] IRanges_2.3.14 S4Vectors_0.7.10
## [19] BiocGenerics_0.15.3 ggplot2_1.0.1
## [21] BiocStyle_1.7.4
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.1 colorspace_1.2-6 htmltools_0.2.6
## [4] rtracklayer_1.29.12 yaml_2.1.13 interactiveDisplayBase_1.7.0
## [7] XML_3.98-1.3 BiocParallel_1.3.34 lambda.r_1.1.7
## [10] plyr_1.8.3 stringr_1.0.0 zlibbioc_1.15.0
## [13] munsell_0.4.2 gtable_0.1.2 futile.logger_1.4.1
## [16] codetools_0.2-14 evaluate_0.7 labeling_0.3
## [19] knitr_1.10.5 biomaRt_2.25.1 httpuv_1.3.2
## [22] BiocInstaller_1.19.8 curl_0.9.1 proto_0.3-10
## [25] Rcpp_0.11.6 xtable_1.7-4 scales_0.2.5
## [28] formatR_1.2 mime_0.3 digest_0.6.8
## [31] stringi_0.5-5 shiny_0.12.1 grid_3.2.1
## [34] tools_3.2.1 bitops_1.0-6 magrittr_1.5
## [37] RCurl_1.95-4.7 futile.options_1.0.0 MASS_7.3-43
## [40] rmarkdown_0.7 httr_1.0.0 R6_2.1.0