BiocIntro 0.0.8
Abstract: This two-hour workshop is meant to empower Cancer Moonshot research labs to tackle their bioinformatic analysis challenges. For the first hour and a half, we’ll work through a hands-on workflow for single cell assay analysis. We’ll introduce data import, management, and interactive visualization using Bioconductor tools like iSEE. After seeing how to work with one assay, we’ll briefly explore approaches to integrating different assays. In the final ½ hour we’ll go beyond Bioconductor tools for single-cell analysis. We’ll assemble a panel to discuss possible strategies for data analysis challenges submitted (before the workshop) to the organizers.
Goal: Empower Cancer Moonshot Research Labs to tackle their bioinformatic analysis challenges
Objectives, this workshop:
What we will learn
R
Vectors, variables, and functions
x = rnorm(100)
mean(x)
## [1] -0.1471021
var(x)
## [1] 0.7680572
hist(x)
Manageing data: classes and methods
y = x + rnorm(100)
df = data.frame(x, y)
plot(y ~ x, df)
Visualization
fit = lm(y ~ x, df)
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 69.876 69.876 59.719 9.617e-12 ***
## Residuals 98 114.668 1.170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(y ~ x, df)
abline(fit)
Extending base R: packages
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_point() +
geom_smooth(method="lm")
CRAN (Comprehensive R Archive Network)
Help!
?lm
browseVignettes("ggplot2")
/ vignette(package="ggplot2")
What we learned
data.frame
help manage data?
, browseVignettes()
and other meansWhat we will learn
SummarizedExperiment
for data managementBioconductor
Resources
Data management
Domain-specific work flows, e.g., bulk RNA-seq diffrential expression
library(airway)
data(airway) # load example data
airway
## class: RangedSummarizedExperiment
## dim: 64102 8
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData()
: description of samples, especially four cell lines exposed to
two dexamethasone treatmentscolData(airway)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
airway$cell
## [1] N61311 N61311 N052611 N052611 N080611 N080611 N061011 N061011
## Levels: N052611 N061011 N080611 N61311
assay()
data: RNA-seq reads overlapping genes (proxy for gene expression)head(assay(airway))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
library(DESeq2) # 'Software' package -- bulk RNA-seq differential expression
dds = DESeqDataSet(airway, ~ cell + dex)
fit = DESeq(dds)
toptable = results(fit)
toptable
## log2 fold change (MLE): dex untrt vs trt
## Wald test p-value: dex untrt vs trt
## DataFrame with 64102 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 708.602169691234 0.381253982523051 0.100654411865818
## ENSG00000000005 0 NA NA
## ENSG00000000419 520.297900552084 -0.206812601085049 0.112218646507257
## ENSG00000000457 237.163036796015 -0.0379204312050853 0.143444654934093
## ENSG00000000460 57.9326331250967 0.0881681758701333 0.287141822937771
## ... ... ... ...
## LRG_94 0 NA NA
## LRG_96 0 NA NA
## LRG_97 0 NA NA
## LRG_98 0 NA NA
## LRG_99 0 NA NA
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 3.78775232457072 0.000152016273044868 0.00128362968171095
## ENSG00000000005 NA NA NA
## ENSG00000000419 -1.84294328546971 0.0653372915097682 0.196546085656283
## ENSG00000000457 -0.26435583272523 0.791505741608343 0.911459479594897
## ENSG00000000460 0.307054454722331 0.758801923982932 0.895032784975418
## ... ... ... ...
## LRG_94 NA NA NA
## LRG_96 NA NA NA
## LRG_97 NA NA NA
## LRG_98 NA NA NA
## LRG_99 NA NA NA
library(AnnotationHub)
query(AnnotationHub(), c("EnsDb", "Homo sapiens", "98"))
## AnnotationHub with 1 record
## # snapshotDate(): 2019-10-29
## # names(): AH75011
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: EnsDb
## # $rdatadateadded: 2019-05-02
## # $title: Ensembl 98 EnsDb for Homo sapiens
## # $description: Gene and protein annotations for Homo sapiens based on En...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl",
## # "Gene", "Protein", "Transcript")
## # retrieve record with 'object[["AH75011"]]'
ensdb <- AnnotationHub()[["AH75011"]]
egids <- rownames(toptable)
toptable$SYMBOL <- mapIds(ensdb, egids, "SYMBOL", "GENEID")
rowData(airway)$SYMBOL <- toptable$SYMBOL
idx <- head( order( toptable$padj), 20 )
toptable[idx, c("SYMBOL", "padj")]
##
##
## DataFrame with 20 rows and 2 columns
## SYMBOL padj
## <character> <numeric>
## ENSG00000152583 SPARCL1 4.00389774546424e-132
## ENSG00000165995 CACNB2 7.06644398972879e-131
## ENSG00000120129 DUSP1 2.20357722858621e-126
## ENSG00000101347 SAMHD1 4.31942535034484e-126
## ENSG00000189221 MAOA 3.96239157817871e-120
## ... ... ...
## ENSG00000163884 KLF15 2.50751670671422e-77
## ENSG00000178695 KCTD12 7.07018634448484e-76
## ENSG00000198624 CCDC69 3.58309812975263e-70
## ENSG00000107562 CXCL12 4.54131694767991e-70
## ENSG00000148848 ADAM12 6.14170889471998e-70
heatmap( log1p( assay(airway)[idx,] ))
m <- assay(airway)[idx,]
rownames(m) <- rowData(airway)$SYMBOL[idx]
heatmap( log1p( m ))
What we learned
SummarizedExperiment
allows easy management of complicated
data. Use functions like colData()
, assay()
, and [
for
manipulation.query()
to discover common resources.What we will do
An excellent resource – osca –
For today’s workshop
Goal: reproduce parts of Figure 1.
Our part in the big picture (from Amezquita et al., Orchestrating Single-Cell Analysis with Bioconductor bioRxiv 590562)
Packages we’ll use
library(scRNAseq)
library(SingleCellExperiment)
library(scater)
library(AnnotationHub)
library(ensembldb)
library(destiny)
library(dplyr)
library(ggplot2)
Example data set
tcell = RichardTCellData()
tcell
## class: SingleCellExperiment
## dim: 46603 572
## metadata(0):
## assays(1): counts
## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ...
## ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(0):
## colnames(572): SLX-12611.N701_S502. SLX-12611.N702_S502. ...
## SLX-12612.i712_i522. SLX-12612.i714_i522.
## colData names(13): age individual ... stimulus time
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC
Filtering
## Expression in at least 5% of cells...
drop_filter <- rowSums(assay(tcell) > 0) >= (ncol(tcell) * 0.05)
tcell <- tcell[drop_filter,]
## ...and mean count at least 1
mean_filter <- rowMeans(assay(tcell)) >= 1
tcell <- tcell[mean_filter,]
Normalization
tcell <- logNormCounts(tcell)
Cells of interest – time course
as_tibble(colData(tcell)) %>%
dplyr::filter(post.analysis.well.quality == "pass") %>%
dplyr::count(individual, stimulus, time)
## # A tibble: 11 x 4
## individual stimulus time n
## <int> <chr> <int> <int>
## 1 1 OT-I high affinity peptide N4 (SIINFEKL) 6 45
## 2 1 OT-I non-binding peptide NP68 (ASNENMDAM) 6 47
## 3 1 OT-I reduced affinity peptide G4 (SIIGFEKL) 6 48
## 4 1 OT-I reduced affinity peptide T4 (SIITFEKL) 6 44
## 5 2 OT-I high affinity peptide N4 (SIINFEKL) 1 51
## 6 2 OT-I high affinity peptide N4 (SIINFEKL) 3 64
## 7 2 OT-I high affinity peptide N4 (SIINFEKL) 6 46
## 8 2 OT-I non-binding peptide NP68 (ASNENMDAM) 6 46
## 9 2 OT-I reduced affinity peptide G4 (SIIGFEKL) 6 46
## 10 2 OT-I reduced affinity peptide T4 (SIITFEKL) 6 47
## 11 2 unstimulated NA 44
tcell
to contain only the samples of interest, and
expressed genes with non-zero read counts across samples.stimulus <- c("OT-I high affinity peptide N4 (SIINFEKL)", "unstimulated")
cells_of_interest <-
( tcell$stimulus %in% stimulus ) &
( tcell$`post-analysis well quality` %in% "pass" )
expressed_genes <- rowSums(logcounts(tcell)[, cells_of_interest]) != 0
fig1 <- tcell[expressed_genes, cells_of_interest]
mean(logcounts(fig1) == 0) # proportion of 0's
## [1] 0.6136515
## update colData -- treat `time` as factor (discrete) rather than continuous
colData(fig1)$time <- factor(colData(fig1)$time)
## Save useful row-wise summaries
rowData(fig1)$ave_count <- rowMeans(assay(fig1))
rowData(fig1)$n_cells <- rowSums(assay(fig1) > 0)
Dimensionality reduction
fig1
fig1 <- runPCA(fig1, ncomponents = 4)
fig1 <- runTSNE(fig1)
fig1
## class: SingleCellExperiment
## dim: 9676 250
## metadata(0):
## assays(2): counts logcounts
## rownames(9676): ENSMUSG00000098104 ENSMUSG00000103922 ...
## ENSMUSG00000064370 ENSMUSG00000064372
## rowData names(2): ave_count n_cells
## colnames(250): SLX-12611.N701_S507. SLX-12611.N702_S507. ...
## SLX-12612.i711_i521. SLX-12612.i712_i521.
## colData names(13): age individual ... stimulus time
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
## altExpNames(1): ERCC
plotPCA(fig1, colour_by = "time", point_size = 4)
plotTSNE(fig1, colour_by = "time", point_size = 4)
Our favorite genes
EnsDb
(Ensembl data base)
for Mus musculus, Ensembl release 98query(AnnotationHub(), c("EnsDb", "Mus musculus", "98"))
## AnnotationHub with 1 record
## # snapshotDate(): 2019-10-29
## # names(): AH75036
## # $dataprovider: Ensembl
## # $species: Mus musculus
## # $rdataclass: EnsDb
## # $rdatadateadded: 2019-05-02
## # $title: Ensembl 98 EnsDb for Mus musculus
## # $description: Gene and protein annotations for Mus musculus based on En...
## # $taxonomyid: 10090
## # $genome: GRCm38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl",
## # "Gene", "Protein", "Transcript")
## # retrieve record with 'object[["AH75036"]]'
ensdb <- AnnotationHub()[["AH75036"]]
fig1
rowData(fig1)$SYMBOL <- mapIds(ensdb, rownames(fig1), "SYMBOL", "GENEID")
rows <- rowData(fig1)$SYMBOL %in% "Nr4a1"
tbl <- tibble(
time = fig1$time,
Nr4a1 = logcounts(fig1)[rows,]
)
ggplot(tbl, aes(time, Nr4a1, color = time)) +
geom_boxplot() +
geom_jitter(width = .1)
What we learned
SummarizedExperiment
help
substantially with these data managment tasks.library(MultiAssayExperiment)
library(curatedTCGAData)
curatedTCGAData()
## Available Cancer codes:
## ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
## KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
## PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM
## Available Data Types:
## CNACGH CNACGH_CGH_hg_244a
## CNACGH_CGH_hg_415k_g4124a CNASeq CNASNP
## CNVSNP GISTIC_AllByGene GISTIC_Peaks
## GISTIC_ThresholdedByGene Methylation
## Methylation_methyl27 Methylation_methyl450
## miRNAArray miRNASeqGene mRNAArray
## mRNAArray_huex mRNAArray_TX_g4502a
## mRNAArray_TX_g4502a_1
## mRNAArray_TX_ht_hg_u133a Mutation
## RNASeq2GeneNorm RNASeqGene RPPAArray
assays = c("Mutation", "RNASeqGene", "RPPAArray")
brca = curatedTCGAData("BRCA", assays, dry.run = FALSE)
brca
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] BRCA_Mutation-20160128: RaggedExperiment with 90490 rows and 993 columns
## [2] BRCA_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 878 columns
## [3] BRCA_RPPAArray-20160128: SummarizedExperiment with 226 rows and 937 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
brca[["BRCA_RNASeqGene-20160128"]]
## class: SummarizedExperiment
## dim: 20502 878
## metadata(0):
## assays(1): ''
## rownames(20502): A1BG A1CF ... psiTPTE22 tAKR
## rowData names(0):
## colnames(878): TCGA-A1-A0SB-01A-11R-A144-07
## TCGA-A1-A0SD-01A-11R-A115-07 ... TCGA-GM-A2DH-01A-11R-A180-07
## TCGA-GM-A2DK-01A-21R-A180-07
## colData names(0):
length(colnames(colData(brca)))
## [1] 2684
head(colnames(colData(brca)))
## [1] "patientID" "years_to_birth" "vital_status"
## [4] "days_to_death" "days_to_last_followup" "tumor_tissue_site"
Research reported in this presentation was supported by the NCIA and NHGRI of the National Institutes of Health under award numbers U24CA232979, U41HG004059, and U24CA180996. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
sessionInfo()
## R version 3.6.1 Patched (2019-10-25 r77333)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RaggedExperiment_1.10.0 curatedTCGAData_1.8.0
## [3] MultiAssayExperiment_1.12.0 dplyr_0.8.3
## [5] destiny_3.0.0 scater_1.14.3
## [7] scRNAseq_2.0.1 SingleCellExperiment_1.8.0
## [9] ensembldb_2.10.0 AnnotationFilter_1.10.0
## [11] GenomicFeatures_1.38.0 AnnotationDbi_1.48.0
## [13] AnnotationHub_2.18.0 BiocFileCache_1.10.2
## [15] dbplyr_1.4.2 DESeq2_1.26.0
## [17] airway_1.6.0 SummarizedExperiment_1.16.0
## [19] DelayedArray_0.12.0 BiocParallel_1.20.0
## [21] matrixStats_0.55.0 Biobase_2.46.0
## [23] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
## [25] IRanges_2.20.0 S4Vectors_0.24.0
## [27] BiocGenerics_0.32.0 ggplot2_3.2.1
## [29] BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.1.4 tidyselect_0.2.5
## [3] RSQLite_2.1.2 htmlwidgets_1.5.1
## [5] grid_3.6.1 ranger_0.11.2
## [7] Rtsne_0.15 munsell_0.5.0
## [9] codetools_0.2-16 withr_2.1.2
## [11] colorspace_1.4-1 knitr_1.25
## [13] rstudioapi_0.10 robustbase_0.93-5
## [15] vcd_1.4-4 VIM_4.8.0
## [17] TTR_0.23-5 labeling_0.3
## [19] GenomeInfoDbData_1.2.2 bit64_0.9-7
## [21] vctrs_0.2.0 xfun_0.10
## [23] ggthemes_4.2.0 R6_2.4.0
## [25] ggbeeswarm_0.6.0 rsvd_1.0.2
## [27] RcppEigen_0.3.3.5.0 locfit_1.5-9.1
## [29] bitops_1.0-6 assertthat_0.2.1
## [31] promises_1.1.0 scales_1.0.0
## [33] nnet_7.3-12 beeswarm_0.2.3
## [35] gtable_0.3.0 rlang_0.4.1
## [37] zeallot_0.1.0 genefilter_1.68.0
## [39] scatterplot3d_0.3-41 splines_3.6.1
## [41] rtracklayer_1.46.0 lazyeval_0.2.2
## [43] acepack_1.4.1 hexbin_1.27.3
## [45] checkmate_1.9.4 BiocManager_1.30.9
## [47] yaml_2.2.0 abind_1.4-5
## [49] backports_1.1.5 httpuv_1.5.2
## [51] Hmisc_4.3-0 tools_3.6.1
## [53] bookdown_0.14 RColorBrewer_1.1-2
## [55] proxy_0.4-23 Rcpp_1.0.3
## [57] base64enc_0.1-3 progress_1.2.2
## [59] zlibbioc_1.32.0 purrr_0.3.3
## [61] RCurl_1.95-4.12 prettyunits_1.0.2
## [63] rpart_4.1-15 openssl_1.4.1
## [65] viridis_0.5.1 cowplot_1.0.0
## [67] zoo_1.8-6 haven_2.2.0
## [69] cluster_2.1.0 magrittr_1.5
## [71] data.table_1.12.6 RSpectra_0.15-0
## [73] openxlsx_4.1.3 lmtest_0.9-37
## [75] pcaMethods_1.78.0 ProtGenerics_1.18.0
## [77] hms_0.5.2 mime_0.7
## [79] evaluate_0.14 xtable_1.8-4
## [81] smoother_1.1 XML_3.98-1.20
## [83] rio_0.5.16 readxl_1.3.1
## [85] gridExtra_2.3 compiler_3.6.1
## [87] biomaRt_2.42.0 tibble_2.1.3
## [89] crayon_1.3.4 htmltools_0.4.0
## [91] later_1.0.0 Formula_1.2-3
## [93] tidyr_1.0.0 geneplotter_1.64.0
## [95] DBI_1.0.0 ExperimentHub_1.12.0
## [97] MASS_7.3-51.4 rappdirs_0.3.1
## [99] boot_1.3-23 Matrix_1.2-17
## [101] car_3.0-4 cli_1.1.0
## [103] forcats_0.4.0 pkgconfig_2.0.3
## [105] GenomicAlignments_1.22.0 foreign_0.8-72
## [107] laeken_0.5.0 sp_1.3-2
## [109] annotate_1.64.0 vipor_0.4.5
## [111] XVector_0.26.0 stringr_1.4.0
## [113] digest_0.6.22 Biostrings_2.54.0
## [115] rmarkdown_1.16 cellranger_1.1.0
## [117] htmlTable_1.13.2 DelayedMatrixStats_1.8.0
## [119] curl_4.2 shiny_1.4.0
## [121] Rsamtools_2.2.1 ggplot.multistats_1.0.0
## [123] lifecycle_0.1.0 carData_3.0-2
## [125] BiocNeighbors_1.4.1 viridisLite_0.3.0
## [127] askpass_1.1 fansi_0.4.0
## [129] pillar_1.4.2 lattice_0.20-38
## [131] fastmap_1.0.1 httr_1.4.1
## [133] DEoptimR_1.0-8 survival_3.1-7
## [135] interactiveDisplayBase_1.24.0 glue_1.3.1
## [137] xts_0.11-2 zip_2.0.4
## [139] BiocVersion_3.10.1 bit_1.1-14
## [141] class_7.3-15 stringi_1.4.3
## [143] blob_1.2.0 RcppHNSW_0.2.0
## [145] BiocSingular_1.2.0 latticeExtra_0.6-28
## [147] memoise_1.1.0 irlba_2.3.3
## [149] e1071_1.7-2