---
title: "Structure and content of RAVmodel"
author: "Sehyun Oh"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Introduction on RAVmodel}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: yes
    toc_depth: 4
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE, comment = "#>"
)
```
# Setup
## Install and load package
```{r eval = FALSE}
# Install BiocManager only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicSuperSignature")
```
```{r results="hide", message=FALSE, warning=FALSE}
library(GenomicSuperSignature)
```
## Download RAVmodel
You can download a RAVmodel from the Google Cloud bucket using the
`GenomicSuperSignature::getModel` function. The currently available models are
built from the top 20 PCs of 536 studies (44,890 samples in total). They cover
13,934 genes, which are the genes common to the top 90% most variable genes of
each study, ranked by study-level standard deviation. There are two versions of
this RAVmodel, annotated with different gene sets for GSEA: MSigDB C2 (`C2`)
and three priors from the PLIER package (`PLIERpriors`). This vignette uses the
`C2`-annotated model.

Note that on the first interactive run of this code, you will be asked to allow
R to create a cache directory. The model file is stored there, and subsequent
calls to `getModel` read from the cache.
```{r load_model}
RAVmodel <- getModel("C2", load=TRUE)
```
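The gene filtering and PCA steps described above can be pictured with a small
base R sketch. This is only an illustration on simulated data for a single
hypothetical study, not the code used to build the RAVmodel: the object names
(`toy_expr`, `toy_loadings`) are made up, and 5 PCs are kept instead of 20
because the toy matrix has only 12 samples.

```{r}
set.seed(1)
# Toy expression matrix: 100 genes (rows) x 12 samples (columns)
toy_expr <- matrix(rnorm(100 * 12), nrow = 100,
                   dimnames = list(paste0("gene", 1:100),
                                   paste0("sample", 1:12)))

# Keep the top 90% most variable genes, ranked by standard deviation
gene_sd <- apply(toy_expr, 1, sd)
keep <- gene_sd >= quantile(gene_sd, probs = 0.10)
toy_filtered <- toy_expr[keep, ]

# PCA with samples in rows, then keep the loadings of the top 5 PCs
toy_pca <- prcomp(t(toy_filtered), center = TRUE)
toy_loadings <- toy_pca$rotation[, 1:5]
dim(toy_loadings)
```

In the actual models, this kind of filtering is followed by taking the genes
shared across all studies, which the single-study sketch does not show.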
# Content of RAVmodel
`RAVindex` is a matrix with genes in rows and RAVs in columns. The `colData`
slot provides information on each RAV, such as its GSEA annotation and the
studies involved in each cluster. The `metadata` slot stores model construction
information. The `trainingData` slot contains information on the individual
studies in the training dataset, such as the MeSH terms assigned to each study.
```{r}
RAVmodel
```
## RAVindex
The *R*eplicable *A*xis of *V*ariation (RAV) index is the main component of
GenomicSuperSignature. It serves as an index connecting new datasets and the
existing database. You can access it through `GenomicSuperSignature::RAVindex`
(the equivalent of `SummarizedExperiment::assay`). Rows are genes and columns
are RAVs.

Here, the RAVmodel consists of 13,934 genes and 4,764 RAVs.
```{r}
class(RAVindex(RAVmodel))
dim(RAVindex(RAVmodel))
RAVindex(RAVmodel)[1:4, 1:4]
```
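One way to picture how a genes-by-RAVs matrix can connect a new dataset to the
model is a matrix product over the shared genes. The sketch below uses
simulated data; `toy_index` and `toy_data` are hypothetical stand-ins, and the
calculation is an assumption for illustration, not necessarily how the package
scores datasets.

```{r}
set.seed(2)
# Hypothetical index: 6 genes (rows) x 3 RAV-like columns
toy_index <- matrix(rnorm(18), nrow = 6,
                    dimnames = list(paste0("gene", 1:6), paste0("RAV", 1:3)))
# Hypothetical new dataset: 5 genes (rows) x 4 samples, partly overlapping
toy_data <- matrix(rnorm(20), nrow = 5,
                   dimnames = list(paste0("gene", 3:7),
                                   paste0("sample", 1:4)))

# Restrict both matrices to the shared genes, in the same order
common <- intersect(rownames(toy_index), rownames(toy_data))
# One value per sample (rows) and RAV-like column (columns)
toy_score <- t(toy_data[common, ]) %*% toy_index[common, ]
dim(toy_score)
```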
## Metadata for RAVmodel
The `metadata` slot of RAVmodel contains information related to model building.
```{r}
names(metadata(RAVmodel))
```
* `cluster` : cluster membership of each PC from the training dataset
* `size` : an integer vector with one element per cluster, giving the number
  of PCs in that cluster
* `k` : the total number of clusters in the given RAVmodel
* `n` : the number of top PCs kept from each study in the training dataset
* `geneSets` : the name of the gene sets used for GSEA annotation
* `MeSH_freq` : the frequency of MeSH terms associated with the training
  dataset. MeSH terms like 'Humans' and 'RNA-seq' rank at the top, as expected,
  because the training dataset of this model is human RNA sequencing data.
* `updateNote` : a brief note on the given model's specification
* `version` : the version of the given model
```{r}
head(metadata(RAVmodel)$cluster)
head(metadata(RAVmodel)$size)
metadata(RAVmodel)$k
metadata(RAVmodel)$n
geneSets(RAVmodel)
head(metadata(RAVmodel)$MeSH_freq)
updateNote(RAVmodel)
metadata(RAVmodel)$version
```
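The relationship between `cluster`, `size`, and `k` can be checked on a toy
membership vector. The vector below is hypothetical, including its
`<study>.PC<n>` style names, and only mirrors the structure described in the
list above.

```{r}
# Hypothetical cluster membership for 8 PCs from 3 studies
toy_cluster <- c(studyA.PC1 = 1, studyA.PC2 = 2, studyA.PC3 = 3,
                 studyB.PC1 = 1, studyB.PC2 = 3,
                 studyC.PC1 = 1, studyC.PC2 = 2, studyC.PC3 = 3)

# Number of PCs in each cluster (analogous to `size`)
toy_size <- as.integer(table(toy_cluster))
# Number of clusters (analogous to `k`)
toy_k <- length(toy_size)

toy_size
toy_k
# Every PC belongs to exactly one cluster, so the sizes add up
sum(toy_size) == length(toy_cluster)
```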
## Studies in each RAV
You can find which studies are in each cluster using the `studies` method. The
output is a list with one element per cluster, where each element is a
character vector containing the names of the studies in that cluster.
```{r}
length(studies(RAVmodel))
studies(RAVmodel)[1:3]
```
You can check which PCs from which studies make up a given RAV using `PCinRAV`.
```{r}
PCinRAV(RAVmodel, 2)
```
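Conceptually, both outputs are groupings of a PC-level membership vector by
cluster. The following sketch shows that idea with base R; the membership
vector and the `<study>.PC<n>` naming pattern are assumptions made for
illustration.

```{r}
# Hypothetical membership vector; names follow an assumed "<study>.PC<n>" form
toy_membership <- c(studyA.PC1 = 1, studyB.PC1 = 1, studyA.PC2 = 2,
                    studyC.PC3 = 2, studyB.PC2 = 3)

# PCs grouped by cluster
toy_pcs <- split(names(toy_membership), toy_membership)
# Unique study names per cluster, obtained by dropping the ".PC<n>" suffix
toy_studies <- lapply(toy_pcs,
                      function(pc) unique(sub("\\.PC[0-9]+$", "", pc)))

toy_pcs[["2"]]
toy_studies[["2"]]
```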
## Silhouette width for each RAV
Silhouette width ranges from -1 to 1 for each cluster. Typically, it is
interpreted as follows:

- Values close to 1 suggest that the observation is well matched to the
  assigned cluster
- Values close to 0 suggest that the observation is borderline matched
  between two clusters
- Values close to -1 suggest that the observation may be assigned to the
  wrong cluster

For RAVmodel, the average silhouette width of each cluster is a quality control
measure. It is suggested as a secondary reference for choosing proper RAVs,
after the validation score.
```{r}
x <- silhouetteWidth(RAVmodel)
head(x) # average silhouette width of the first 6 RAVs
```
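To make the "average silhouette width of each cluster" concrete, here is a
minimal sketch on simulated points, assuming the `cluster` package is
installed. It illustrates the general definition only; the toy data and object
names are made up and do not reproduce the values stored in the RAVmodel.

```{r}
set.seed(3)
# Toy data: two well-separated groups of 10 points in two dimensions
toy_points <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                    matrix(rnorm(20, mean = 6), ncol = 2))
toy_groups <- rep(1:2, each = 10)

# Per-observation silhouette widths, then the average within each cluster
toy_sil <- cluster::silhouette(toy_groups, dist(toy_points))
toy_avg <- tapply(toy_sil[, "sil_width"], toy_sil[, "cluster"], mean)
toy_avg
```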
## GSEA on each RAV
Pre-processed GSEA results for each RAV are stored in RAVmodel and can be
accessed through the `gsea` function.
```{r}
class(gsea(RAVmodel))
class(gsea(RAVmodel)[[1]])
length(gsea(RAVmodel))
gsea(RAVmodel)[1]
```
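A list of per-RAV tables like this can be summarized with ordinary list
tools. The sketch below works on a hypothetical stand-in (`toy_gsea`); its
column names are assumptions for illustration and may differ from the columns
of the stored results.

```{r}
# Hypothetical stand-in for a list of per-RAV enrichment tables
toy_gsea <- list(
    RAV1 = data.frame(Description = c("setA", "setB", "setC"),
                      NES = c(2.1, -1.8, 1.5)),
    RAV2 = data.frame(Description = character(0), NES = numeric(0)),
    RAV3 = data.frame(Description = "setD", NES = 2.7)
)

# Number of enriched gene sets in each element
toy_n <- vapply(toy_gsea, nrow, integer(1))
toy_n

# Gene set with the largest absolute NES in the first element
toy_top <- toy_gsea$RAV1$Description[which.max(abs(toy_gsea$RAV1$NES))]
toy_top
```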
## MeSH terms for each study
You can find the MeSH terms associated with each study using the `mesh` method.
The output is a list with one element per study used for training. Each element
is a data frame containing the assigned MeSH terms and their details. The last
column, `bagOfWords`, is the frequency of the MeSH term in the whole training
dataset.
```{r}
length(mesh(RAVmodel))
mesh(RAVmodel)[1]
```
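The idea behind a dataset-wide term frequency such as `bagOfWords` can be
sketched with `table()`. The studies and terms below are invented, and the
counting rule (one count per study that carries the term) is an assumption,
not a description of how the stored values were computed.

```{r}
# Hypothetical MeSH term assignments for three studies
toy_mesh <- list(
    studyA = c("Humans", "RNA-seq", "Neoplasms"),
    studyB = c("Humans", "RNA-seq"),
    studyC = c("Humans", "Inflammation")
)

# Frequency of each term across all studies, most frequent first
toy_freq <- sort(table(unlist(toy_mesh)), decreasing = TRUE)
toy_freq

# Attach the dataset-wide frequency to the terms of one study
toy_studyA <- data.frame(term = toy_mesh$studyA,
                         bagOfWords = as.integer(toy_freq[toy_mesh$studyA]))
toy_studyA
```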
## PCA summary for each study
The PCA summary of each study can be accessed through the `PCAsummary` method.
The output is a list with one element per study, where each element is a matrix
of PCA summary results: standard deviation (SD), variance explained by each PC
(Variance), and the cumulative variance explained (Cumulative).
```{r}
length(PCAsummary(RAVmodel))
PCAsummary(RAVmodel)[1]
```
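The same three quantities can be obtained for any dataset from base R's
`prcomp`. The sketch below runs on a simulated matrix (`toy_mat` is made up);
the row labels come from `summary()` and differ from the shorter labels used in
the RAVmodel.

```{r}
set.seed(4)
# Toy expression matrix: 10 samples (rows) x 50 genes (columns)
toy_mat <- matrix(rnorm(10 * 50), nrow = 10)

# `importance` holds the SD, the proportion of variance explained,
# and the cumulative proportion, with one column per PC
toy_summary <- summary(prcomp(toy_mat))$importance
rownames(toy_summary)
toy_summary[, 1:3]
```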
# Session Info
```{r}
sessionInfo()
```