In this section, we will learn to search for and download DNA methylation (epigenetic) and gene expression (transcription) data from the newly created NCI Genomic Data Commons (GDC) portal and prepare them as a SummarizedExperiment object.
The figure below highlights the part of the workflow that is covered in this section.
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(dplyr)
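# Search GDC for FPKM-UQ gene expression files of two TCGA-LUSC samples
query.exp <- GDCquery(project = "TCGA-LUSC",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - FPKM-UQ",
                      barcode = c("TCGA-34-5231-01", "TCGA-77-7138-01"))
# Download the files found by the query
GDCdownload(query.exp)
# Read the downloaded files into a SummarizedExperiment and save it to disk
exp <- GDCprepare(query = query.exp,
                  save = TRUE,
                  save.filename = "Exp_LUSC.rda",
                  summarizedExperiment = TRUE)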
exp
## class: RangedSummarizedExperiment
## dim: 57035 2
## metadata(0):
## assays(1): HTSeq - FPKM-UQ
## rownames(57035): ENSG00000000003 ENSG00000000005 ...
## ENSG00000281912 ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
## original_ensembl_gene_id
## colnames(2): TCGA-34-5231-01A-21R-1820-07
## TCGA-77-7138-01A-41R-2045-07
## colData names(69): patient barcode ...
## subtype_Homozygous.Deletions subtype_Expression.Subtype
colData(exp) %>% as.data.frame %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
assay(exp)[1:5, ] %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
rowRanges(exp)
## GRanges object with 57035 ranges and 3 metadata columns:
## seqnames ranges strand | ensembl_gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000000003 chrX [100627109, 100639991] - | ENSG00000000003
## ENSG00000000005 chrX [100584802, 100599885] + | ENSG00000000005
## ENSG00000000419 chr20 [ 50934867, 50958555] - | ENSG00000000419
## ENSG00000000457 chr1 [169849631, 169894267] - | ENSG00000000457
## ENSG00000000460 chr1 [169662007, 169854080] + | ENSG00000000460
## ... ... ... ... . ...
## ENSG00000281904 chr2 [90365737, 90367699] + | ENSG00000281904
## ENSG00000281909 chr15 [22480439, 22484840] - | ENSG00000281909
## ENSG00000281910 chr16 [58559796, 58559931] - | ENSG00000281910
## ENSG00000281912 chr1 [45303910, 45305619] + | ENSG00000281912
## ENSG00000281920 chr2 [65623272, 65628424] + | ENSG00000281920
## external_gene_name original_ensembl_gene_id
## <character> <character>
## ENSG00000000003 TSPAN6 ENSG00000000003.13
## ENSG00000000005 TNMD ENSG00000000005.5
## ENSG00000000419 DPM1 ENSG00000000419.11
## ENSG00000000457 SCYL3 ENSG00000000457.12
## ENSG00000000460 C1orf112 ENSG00000000460.15
## ... ... ...
## ENSG00000281904 AC233263.6 ENSG00000281904.1
## ENSG00000281909 AC100757.4 ENSG00000281909.1
## ENSG00000281910 SNORA50A ENSG00000281910.1
## ENSG00000281912 LINC01144 ENSG00000281912.1
## ENSG00000281920 AC007389.5 ENSG00000281920.1
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
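In the output above, the original_ensembl_gene_id column carries a version suffix (for example ENSG00000000003.13), while ensembl_gene_id and the row names do not. The following is a minimal sketch of how such a suffix can be stripped in base R when identifiers need to be compared. The helper name strip_ensembl_version is hypothetical and is not part of TCGAbiolinks.

```r
# Hypothetical helper (not a TCGAbiolinks function): drop the trailing
# ".<version>" from versioned Ensembl gene identifiers.
strip_ensembl_version <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

# Versioned IDs copied from the rowRanges(exp) output above
versioned <- c("ENSG00000000003.13", "ENSG00000000005.5", "ENSG00000281920.1")
unversioned <- strip_ensembl_version(versioned)
print(unversioned)
```

Identifiers without a version suffix pass through unchanged, so the helper can be applied to a mixed vector.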
This subsection describes how to download DNA methylation data from the NCI Genomic Data Commons (GDC) portal using the Bioconductor package TCGAbiolinks (Colaprico et al. 2016). In this example, we will download DNA methylation data (Infinium HumanMethylation450 platform) for two TCGA-LUSC (TCGA Lung Squamous Cell Carcinoma) samples. The GDCquery function searches the GDC database for the information required to download the data. This information is used by the GDCdownload function, which requests the files from GDC; the files are delivered compressed into a single 76 MB tar.gz file. After the download is completed, GDCdownload uncompresses the tar.gz file and moves its files to a folder. The default is GDCdata/(project)/(source)/(data.category)/(data.type); in our example, it will be GDCdata/TCGA-LUSC/harmonized/DNA_Methylation/Methylation_Beta_Value/.
Finally, GDCprepare transforms the downloaded data into a SummarizedExperiment object (Huber et al. 2015) or a data frame. If summarizedExperiment is set to TRUE, TCGAbiolinks will add to the object clinical information and molecular subtype information, the latter as defined in The Cancer Genome Atlas (TCGA) Research Network reports (the full list of papers can be seen in the TCGAquery_subtype section of the TCGAbiolinks vignette).
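The default folder layout described above can be sketched in base R. This is only an illustration under stated assumptions: gdc_data_dir is a hypothetical helper, not a TCGAbiolinks function; the replacement of spaces by underscores and the data type "Methylation Beta Value" are inferred from the example path in the text rather than taken from the package documentation.

```r
# Hypothetical helper: rebuild the folder in which downloaded files are
# expected, following directory/project/source/data.category/data.type.
# Assumption: spaces become underscores, as in the example path in the text.
gdc_data_dir <- function(project, source, data.category, data.type,
                         directory = "GDCdata") {
  file.path(directory, project, source,
            gsub(" ", "_", data.category),
            gsub(" ", "_", data.type))
}

met_dir <- gdc_data_dir(project = "TCGA-LUSC",
                        source = "harmonized",
                        data.category = "DNA Methylation",
                        data.type = "Methylation Beta Value")
print(met_dir)

# Once a download has finished, the folder can be inspected with base R
if (dir.exists(met_dir)) {
  print(head(list.files(met_dir, recursive = TRUE)))
}
```

Checking the folder this way is a quick means of confirming that a download completed before the files are read.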
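# Search GDC for 450K DNA methylation files of two TCGA-LUSC aliquots
query.met <- GDCquery(project = "TCGA-LUSC",
                      data.category = "DNA Methylation",
                      platform = "Illumina Human Methylation 450",
                      barcode = c("TCGA-34-5231-01A-21D-1818-05",
                                  "TCGA-77-7138-01A-41D-2043-05"))
# Download the files found by the query
GDCdownload(query.met)
# Read the downloaded files into a SummarizedExperiment and save it to disk
met <- GDCprepare(query = query.met,
                  save = TRUE,
                  save.filename = "DNAmethylation_LUSC.rda",
                  summarizedExperiment = TRUE)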
The object created is a SummarizedExperiment object.
met
## class: RangedSummarizedExperiment
## dim: 485577 2
## metadata(0):
## assays(1): ''
## rownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873
## rowData names(7): Composite.Element.REF Gene_Symbol ...
## CGI_Coordinate Feature_Type
## colnames(2): TCGA-34-5231-01A-21D-1818-05
## TCGA-77-7138-01A-41D-2043-05
## colData names(69): patient barcode ...
## subtype_Homozygous.Deletions subtype_Expression.Subtype
colData(met) %>% as.data.frame %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
assay(met)[1:5, ] %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
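The expression and methylation objects use different aliquot barcodes for the same tumor samples (for example TCGA-34-5231-01A-21R-1820-07 and TCGA-34-5231-01A-21D-1818-05). The sketch below shows one way the two sets of column names could be paired, assuming that the first 16 characters of a TCGA barcode (up to the sample type and vial) identify the same sample. The vectors copy the column names printed above instead of calling colnames(exp) and colnames(met), so the example runs on its own; sample_id is a hypothetical helper.

```r
# Column names copied from the printed exp and met objects; the methylation
# vector is deliberately given in a different order to show the matching.
exp_barcodes <- c("TCGA-34-5231-01A-21R-1820-07", "TCGA-77-7138-01A-41R-2045-07")
met_barcodes <- c("TCGA-77-7138-01A-41D-2043-05", "TCGA-34-5231-01A-21D-1818-05")

# Hypothetical helper: keep the first 16 characters (assumed sample-level ID)
sample_id <- function(barcodes) substr(barcodes, 1, 16)

# Position of each expression sample within the methylation barcodes
idx <- match(sample_id(exp_barcodes), sample_id(met_barcodes))

paired <- data.frame(sample = sample_id(exp_barcodes),
                     expression = exp_barcodes,
                     methylation = met_barcodes[idx],
                     stringsAsFactors = FALSE)
print(paired)
```

With the real objects, the same index could be used to reorder the columns of one object so that both follow the same sample order.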
rowRanges(met)
## GRanges object with 485577 ranges and 7 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## cg00000029 chr16 [ 53434200, 53434201] * |
## cg00000108 chr3 [ 37417715, 37417716] * |
## cg00000109 chr3 [172198247, 172198248] * |
## cg00000165 chr1 [ 90729117, 90729118] * |
## cg00000236 chr8 [ 42405776, 42405777] * |
## ... ... ... ... .
## rs9363764 chr6 [67522149, 67522149] * |
## rs939290 chr3 [14617359, 14617359] * |
## rs951295 chr15 [45707625, 45707625] * |
## rs966367 chr2 [12008094, 12008094] * |
## rs9839873 chr3 [86613005, 86613005] * |
## Composite.Element.REF
## <character>
## cg00000029 cg00000029
## cg00000108 cg00000108
## cg00000109 cg00000109
## cg00000165 cg00000165
## cg00000236 cg00000236
## ... ...
## rs9363764 rs9363764
## rs939290 rs939290
## rs951295 rs951295
## rs966367 rs966367
## rs9839873 rs9839873
## Gene_Symbol
## <character>
## cg00000029 RBL2;RBL2;RBL2
## cg00000108 C3orf35;C3orf35;C3orf35;C3orf35;C3orf35;C3orf35;C3orf35;C3orf35
## cg00000109 FNDC3B;FNDC3B;FNDC3B;FNDC3B;FNDC3B;FNDC3B
## cg00000165 .
## cg00000236 VDAC3
## ... ...
## rs9363764 .
## rs939290 .
## rs951295 RP11-718O11.1;RP11-718O11.1
## rs966367 AC096559.1;AC096559.1;AC096559.1;AC096559.1
## rs9839873 .
## Gene_Type
## <character>
## cg00000029 protein_coding;protein_coding;protein_coding
## cg00000108 lincRNA;lincRNA;lincRNA;lincRNA;lincRNA;lincRNA;lincRNA;lincRNA
## cg00000109 protein_coding;protein_coding;protein_coding;protein_coding;protein_coding;protein_coding
## cg00000165 .
## cg00000236 protein_coding
## ... ...
## rs9363764 .
## rs939290 .
## rs951295 lincRNA;lincRNA
## rs966367 lincRNA;lincRNA;lincRNA;lincRNA
## rs9839873 .
## Transcript_ID
## <character>
## cg00000029 ENST00000262133.9;ENST00000544405.5;ENST00000567964.5
## cg00000108 ENST00000328376.8;ENST00000332506.6;ENST00000425564.2;ENST00000425932.4;ENST00000426078.4;ENST00000452017.3;ENST00000466204.4;ENST00000481400.4
## cg00000109 ENST00000336824.7;ENST00000415807.5;ENST00000416957.4;ENST00000443501.1;ENST00000469491.4;ENST00000478016.1
## cg00000165 .
## cg00000236 ENST00000022615.7
## ... ...
## rs9363764 .
## rs939290 .
## rs951295 ENST00000559600.1;ENST00000560705.1
## rs966367 ENST00000412294.4;ENST00000438292.4;ENST00000450916.1;ENST00000451644.4
## rs9839873 .
## Position_to_TSS
## <character>
## cg00000029 -221;-1420;222
## cg00000108 18552;18552;6505;31445;18143;447;18552;18552
## cg00000109 157692;158618;151333;71272;158587;71273
## cg00000165 .
## cg00000236 13872
## ... ...
## rs9363764 .
## rs939290 .
## rs951295 2429;2546
## rs966367 965;142208;919;977
## rs9839873 .
## CGI_Coordinate Feature_Type
## <character> <character>
## cg00000029 CGI:chr16:53434489-53435297 N_Shore
## cg00000108 CGI:chr3:37451927-37453047 .
## cg00000109 CGI:chr3:172039703-172040934 .
## cg00000165 CGI:chr1:90724932-90727247 S_Shore
## cg00000236 CGI:chr8:42410918-42411241 .
## ... ... ...
## rs9363764 CGI:chr6:68634840-68635154 .
## rs939290 CGI:chr3:14602211-14603323 .
## rs951295 CGI:chr15:45704255-45705206 S_Shelf
## rs966367 CGI:chr2:11784857-11785127 .
## rs9839873 CGI:chr3:86990460-86991366 .
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
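The probe annotation above stores several values per probe as semicolon-separated strings, often with repeats (for example RBL2;RBL2;RBL2), one entry per overlapping transcript. The following is a minimal base R sketch of how the distinct gene symbols per probe could be extracted. It assumes that "." marks a probe without annotation, as the output suggests; unique_symbols is a hypothetical helper, and the input strings are copied from the output rather than read from rowRanges(met).

```r
# Annotation strings copied from the Gene_Symbol column shown above
gene_symbol <- c(cg00000029 = "RBL2;RBL2;RBL2",
                 cg00000165 = ".",
                 cg00000236 = "VDAC3",
                 rs951295   = "RP11-718O11.1;RP11-718O11.1")

# Hypothetical helper: split on ";", drop the "." placeholder (assumed to
# mean "no annotation") and keep each symbol once per probe.
unique_symbols <- function(x) {
  lapply(strsplit(x, ";", fixed = TRUE), function(s) unique(s[s != "."]))
}

symbols <- unique_symbols(gene_symbol)
print(symbols)
```

Probes without annotation end up with an empty character vector, which makes them easy to filter out with lengths(symbols) > 0.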
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SummarizedExperiment_1.7.5
## [2] DelayedArray_0.3.19
## [3] matrixStats_0.52.2
## [4] Biobase_2.37.2
## [5] GenomicRanges_1.29.12
## [6] GenomeInfoDb_1.13.4
## [7] IRanges_2.11.12
## [8] S4Vectors_0.15.5
## [9] BiocGenerics_0.23.0
## [10] TCGAbiolinks_2.5.6
## [11] Bioc2017.TCGAbiolinks.ELMER_0.0.0.9000
## [12] bindrcpp_0.2
## [13] MultiAssayExperiment_1.3.20
## [14] dplyr_0.7.2
## [15] DT_0.2
## [16] ELMER_2.0.1
## [17] ELMER.data_2.0.1
##
## loaded via a namespace (and not attached):
## [1] shinydashboard_0.6.1 R.utils_2.5.0
## [3] RSQLite_2.0 AnnotationDbi_1.39.2
## [5] htmlwidgets_0.9 grid_3.4.1
## [7] trimcluster_0.1-2 BiocParallel_1.11.4
## [9] devtools_1.13.2 DESeq_1.29.0
## [11] munsell_0.4.3 codetools_0.2-15
## [13] withr_2.0.0 colorspace_1.3-2
## [15] BiocInstaller_1.27.2 knitr_1.16
## [17] robustbase_0.92-7 labeling_0.3
## [19] GenomeInfoDbData_0.99.1 KMsurv_0.1-5
## [21] mnormt_1.5-5 hwriter_1.3.2
## [23] bit64_0.9-7 rprojroot_1.2
## [25] downloader_0.4 biovizBase_1.25.1
## [27] ggthemes_3.4.0 EDASeq_2.11.0
## [29] diptest_0.75-7 R6_2.2.2
## [31] doParallel_1.0.10 locfit_1.5-9.1
## [33] AnnotationFilter_1.1.3 flexmix_2.3-14
## [35] reshape_0.8.6 bitops_1.0-6
## [37] assertthat_0.2.0 scales_0.4.1
## [39] nnet_7.3-12 gtable_0.2.0
## [41] ensembldb_2.1.10 rlang_0.1.1
## [43] genefilter_1.59.0 cmprsk_2.2-7
## [45] GlobalOptions_0.0.12 splines_3.4.1
## [47] rtracklayer_1.37.3 lazyeval_0.2.0
## [49] acepack_1.4.1 dichromat_2.0-0
## [51] selectr_0.3-1 broom_0.4.2
## [53] checkmate_1.8.3 yaml_2.1.14
## [55] reshape2_1.4.2 GenomicFeatures_1.29.8
## [57] backports_1.1.0 httpuv_1.3.5
## [59] Hmisc_4.0-3 tools_3.4.1
## [61] psych_1.7.5 ggplot2_2.2.1
## [63] RColorBrewer_1.1-2 Rcpp_0.12.12
## [65] plyr_1.8.4 base64enc_0.1-3
## [67] progress_1.1.2 zlibbioc_1.23.0
## [69] purrr_0.2.2.2 RCurl_1.95-4.8
## [71] prettyunits_1.0.2 ggpubr_0.1.4
## [73] rpart_4.1-11 GetoptLong_0.1.6
## [75] viridis_0.4.0 zoo_1.8-0
## [77] ggrepel_0.6.5 cluster_2.0.6
## [79] magrittr_1.5 data.table_1.10.4
## [81] circlize_0.4.1 survminer_0.4.0
## [83] mvtnorm_1.0-6 whisker_0.3-2
## [85] ProtGenerics_1.9.0 aroma.light_3.7.0
## [87] hms_0.3 mime_0.5
## [89] evaluate_0.10.1 xtable_1.8-2
## [91] XML_3.98-1.9 mclust_5.3
## [93] gridExtra_2.2.1 shape_1.4.2
## [95] compiler_3.4.1 biomaRt_2.33.3
## [97] tibble_1.3.3 R.oo_1.21.0
## [99] htmltools_0.3.6 Formula_1.2-2
## [101] tidyr_0.6.3 geneplotter_1.55.0
## [103] DBI_0.7 matlab_1.0.2
## [105] ComplexHeatmap_1.15.0 MASS_7.3-47
## [107] fpc_2.1-10 BiocStyle_2.5.8
## [109] ShortRead_1.35.1 Matrix_1.2-10
## [111] readr_1.1.1 R.methodsS3_1.7.1
## [113] Gviz_1.21.1 bindr_0.1
## [115] km.ci_0.5-2 pkgconfig_2.0.1
## [117] GenomicAlignments_1.13.4 foreign_0.8-69
## [119] plotly_4.7.1 xml2_1.1.1
## [121] roxygen2_6.0.1 foreach_1.4.3
## [123] annotate_1.55.0 XVector_0.17.0
## [125] rvest_0.3.2 stringr_1.2.0
## [127] VariantAnnotation_1.23.6 digest_0.6.12
## [129] ConsensusClusterPlus_1.41.0 Biostrings_2.45.3
## [131] rmarkdown_1.6 survMisc_0.5.4
## [133] htmlTable_1.9 dendextend_1.5.2
## [135] edgeR_3.19.3 curl_2.8.1
## [137] kernlab_0.9-25 shiny_1.0.3
## [139] Rsamtools_1.29.0 commonmark_1.2
## [141] modeltools_0.2-21 rjson_0.2.15
## [143] nlme_3.1-131 jsonlite_1.5
## [145] viridisLite_0.2.0 limma_3.33.7
## [147] BSgenome_1.45.1 lattice_0.20-35
## [149] httr_1.2.1 DEoptimR_1.0-8
## [151] survival_2.41-3 interactiveDisplayBase_1.15.0
## [153] glue_1.1.1 prabclus_2.2-6
## [155] iterators_1.0.8 bit_1.1-12
## [157] class_7.3-14 stringi_1.1.5
## [159] blob_1.1.0 AnnotationHub_2.9.5
## [161] latticeExtra_0.6-28 memoise_1.1.0
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. 2016. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research 44 (8): e71. doi:10.1093/nar/gkv1507.
Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21.