---
title: "curatedTCGAData"
date: "`r BiocStyle::doc_date()`"
vignette: |
  %\VignetteIndexEntry{curatedTCGAData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc_float: true
package: curatedTCGAData
bibliography: ../inst/REFERENCES.bib
---

# Installation

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("curatedTCGAData")
```

Load packages:

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(curatedTCGAData)
library(MultiAssayExperiment)
library(TCGAutils)
```

# Citing `curatedTCGAData`

Your citations are important to the project and help us secure funding. Please
refer to the [References](#references) section to see the @Ramos2020-ya and
@Ramos2017-og citations. For the BibTeX entries, run the `citation` function
(after installation):

```{r,eval=FALSE}
citation("curatedTCGAData")
citation("MultiAssayExperiment")
```

`curatedTCGAData` uses `MultiAssayExperiment` to coordinate and represent the
data. Please cite the `MultiAssayExperiment` [Cancer Research][1] publication.
You can see the PDF of our public data publication on
[JCO Clinical Cancer Informatics][2].

[1]: https://cancerres.aacrjournals.org/content/77/21/e39
[2]: https://ascopubs.org/doi/pdf/10.1200/CCI.19.00119

# Data versions

`curatedTCGAData` now provides data version `2.0.1`, which includes a number of
improvements and bug fixes. To access the previous data release, please use
version `1.1.38`.
The version is set through the `version` argument of the `curatedTCGAData`
function:

```{r}
head(
    curatedTCGAData(
        diseaseCode = "COAD", assays = "*", version = "1.1.38"
    )
)
```

## Version 2.0.1

Here is a list of changes to the data provided in version `2.0.1`:

* `RNASeq2Gene` assays are provided with RSEM gene expression values
* Genomic information is present in `RaggedExperiment` objects as `GRCh37`
rather than `37`
* Assays coming from the same platform are now merged and provided as one
(e.g., in `OV` and `GBM`)
* `mRNAArray` data is now returned as `matrix` data instead of `DataFrame`

# Data source

These data were processed by Broad Firehose pipelines and accessed using
`r BiocStyle::Biocpkg("RTCGAToolbox")`. For details on the preprocessing
methods used, see the
[Broad GDAC documentation](https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844334346/Documentation).

# Downloading datasets

To get a table of the cancer data available in `curatedTCGAData`, see the
`diseaseCodes` dataset from `TCGAutils`. Availability is indicated by the
`Available` column in the dataset.

```{r}
data('diseaseCodes', package = "TCGAutils")
head(diseaseCodes)
```

Alternatively, you can get the full table of available data by using the
wildcard `'*'` in the `diseaseCode` argument of the main function:

```{r,eval=FALSE}
curatedTCGAData(
    diseaseCode = "*", assays = "*", version = "2.0.1"
)
```

To see what assays are available for a particular TCGA disease code, leave the
`assays` argument as a wildcard (`'*'`):

```{r}
head(
    curatedTCGAData(
        diseaseCode = "COAD", assays = "*", version = "2.0.1"
    )
)
```

# Caveats for working with TCGA data

Not all TCGA samples are cancer samples; each of the 33 cancer types contains a
mix of sample types. Use `sampleTables` on the `MultiAssayExperiment` object
along with `data(sampleTypes, package = "TCGAutils")` to see what samples are
present in the data. Some tumors may have been used to create multiple
contributions, leading to technical replicates.
These should be resolved using the appropriate helper functions such as
`mergeReplicates`. Primary tumors should be selected using
`TCGAutils::TCGAsampleSelect` and used as input to the subsetting mechanisms.
See the "Identifying samples in Assays" section of this vignette.

## ACC dataset example

```{r, message=FALSE}
(accmae <- curatedTCGAData(
    "ACC", c("CN*", "Mutation"), version = "2.0.1", dry.run = FALSE
))
```

**Note**. For more on how to use a `MultiAssayExperiment` please see the
`MultiAssayExperiment` vignette.

### Subtype information

Some cancer datasets contain associated subtype information within the
clinical datasets provided. This subtype information is included in the
metadata of `colData` of the `MultiAssayExperiment` object. To obtain these
variable names, use the `getSubtypeMap` function from `TCGAutils`:

```{r}
head(getSubtypeMap(accmae))
```

### Typical clinical variables

Another helper function provided by `TCGAutils` allows users to obtain a set of
consistent clinical variable names across several cancer types.
Use the `getClinicalNames` function to obtain a character vector of common
clinical variables such as vital status, years to birth, days to death, etc.

```{r}
head(getClinicalNames("ACC"))

colData(accmae)[, getClinicalNames("ACC")][1:5, 1:5]
```

### Identifying samples in Assays

The `sampleTables` function gives an overview of sample types / codes
present in the data:

```{r}
sampleTables(accmae)
```

You can use the reference dataset (`sampleTypes`) from the `TCGAutils` package
to interpret the TCGA sample codes above. The dataset provides clinically
meaningful descriptions:

```{r}
data(sampleTypes, package = "TCGAutils")
head(sampleTypes)
```

### Separating samples

Often, an analysis compares two groups of samples to each other.
To facilitate the separation of samples, the `TCGAsplitAssays` function from
`TCGAutils` identifies all sample types in the assays and moves each into its
own assay.
By default, each discoverable sample type is placed in its own experiment.
In this case, we request only primary solid tumor (`01`) and blood derived
normal (`10`) samples, as listed in the `sampleTypes` reference dataset:

```{r}
sampleTypes[sampleTypes[["Code"]] %in% c("01", "10"), ]

TCGAsplitAssays(accmae, c("01", "10"))
```

To obtain a logical vector that can be used for subsetting a
`MultiAssayExperiment`, refer to `TCGAsampleSelect`. To select only primary
tumors, use the function on the colnames of the `MultiAssayExperiment`:

```{r}
tums <- TCGAsampleSelect(colnames(accmae), "01")
```

### TCGAprimaryTumors convenience

If you are interested in only the primary tumor samples, `TCGAutils` provides a
convenient operation to extract primary tumors from the `MultiAssayExperiment`
representation. The `TCGAprimaryTumors` function returns only primary tumor
samples (either solid tissue or blood), using the above operations in the
background:

```{r}
(primaryTumors <- TCGAprimaryTumors(accmae))
```

To view the results, run `sampleTables` again on the output:

```{r}
sampleTables(primaryTumors)
```

### Keeping colData in an extracted Assay

When extracting a single assay from the `MultiAssayExperiment`, the user can
choose to keep the `colData` from the `MultiAssayExperiment` in the
extracted assay **given** that the class of the extracted assay supports
`colData` storage and operations. `SummarizedExperiment` and its derived data
representations support this operation. In this example, we extract the
mutation data as represented by a `RaggedExperiment`, which was designed to
have `colData` functionality. The default replacement method is to 'append' the
`MultiAssayExperiment` `colData` to the `RaggedExperiment` assay. The `mode`
argument can also completely replace the `colData` when set to 'replace'.
```{r}
(accmut <- getWithColData(accmae, "ACC_Mutation-20160128", mode = "append"))
head(colData(accmut)[, 1:4])
```

### Example use of RaggedExperiment

The `RaggedExperiment` representation provides a matrix view of a `GRangesList`
internal representation. Typical use of a `RaggedExperiment` involves a number
of functions to reshape 'ragged' measurements into a matrix-like format. These
include `sparseAssay`, `compactAssay`, `disjoinAssay`, and `qreduceAssay`.
See the `RaggedExperiment` vignette for details.

In this example, we convert Entrez gene identifiers to numeric in order to
show how we can create a sparse matrix representation of any numeric
metadata column in the `RaggedExperiment`.

```{r}
ragex <- accmae[["ACC_Mutation-20160128"]]
## convert Entrez gene identifiers to numeric
mcols(ragex)$Entrez_Gene_Id <- as.numeric(mcols(ragex)[["Entrez_Gene_Id"]])
sparseAssay(ragex, i = "Entrez_Gene_Id", sparse=TRUE)[1:6, 1:3]
```

Users who would like to use the internal `GRangesList` representation can
invoke the coercion method:

```{r}
as(ragex, "GRangesList")
```

## Exporting Data

`MultiAssayExperiment` provides users with an integrative representation of
multi-omic TCGA data. For users who wish to work in alternative environments,
we have provided an export function that extracts all the data from a
`MultiAssayExperiment` instance and writes them to a series of files:

```{r}
td <- tempdir()
tempd <- file.path(td, "ACCMAE")
if (!dir.exists(tempd))
    dir.create(tempd)

exportClass(accmae, dir = tempd, fmt = "csv", ext = ".csv")
```

This works for all data classes stored in the `MultiAssayExperiment` (e.g.,
`RaggedExperiment`, `HDF5Matrix`, `SummarizedExperiment`) via the `assays`
method, which converts classes to `matrix` format (using individual `assay`
methods).

# Session Information {.unnumbered}

```{r}
sessionInfo()
```

# References {.unnumbered}