---
title: "Tokenizing Text of Gene Set Enrichment Analysis"
author: "Dongmin Jung"
date: December 24, 2020
output:
rmarkdown::html_document:
highlight: pygments
toc: true
toc_depth: 2
number_sections: true
fig_width: 5
fig_height: 4
vignette: >
%\VignetteIndexEntry{ttgsea}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
Functional enrichment analysis methods such as gene set enrichment analysis (GSEA) have been widely used for analyzing gene expression data. GSEA is a powerful method to infer results of gene expression data at a level of gene sets by calculating enrichment scores for predefined sets of genes. GSEA depends on the availability and accuracy of gene sets. There are overlaps between terms of gene sets or categories because multiple terms may exist for a single biological process, and it can thus lead to redundancy within enriched terms. In other words, the sets of related terms are overlapping. Using deep learning, this pakage is aimed to predict enrichment scores for unique tokens or words from text in names of gene sets to resolve this overlapping set issue. Furthermore, we can coin a new term by combining tokens and find its enrichment score by predicting such a combined tokens.
Text can be seen as sequential data, either as a sequence of characters or as a sequence of words. Recurrent Neural Network (RNN) operating on sequential data is a type of the neural network. RNN has been applied to a variety of tasks including natural language processing such as machine translation. However, RNN suffers from the problem of long-term dependencies which means that RNN struggles to remember information for long periods of time. An Long Short-Term Memory (LSTM) network is a special kind of RNN that is designed to solve the long-term dependency problem. The bidirectional LSTM network consists of two distinct LSTM networks, termed the forward LSTM and the backward LSTM, which process the sequences in opposite directions. Gated Recurrent Unit (GRU) is a simplified version of LSTM with less number of parameters, and thus, the total number of parameters can be greatly reduced for a large neural network. LSTM and GRU are known to be successful remedies to the long-term dependency problem. The above models take terms of gene sets as input and enrichment scores as output to predict enrichment scores of new terms.
# Example
## Terms of gene sets
### GSEA
Consider a simple example. Once GSEA is performed, the result calculated from GSEA is fed into the algorithm to train the deep learning models.
```{r, fig.align='center', message=FALSE, warning=FALSE, eval=TRUE}
library(ttgsea)
library(fgsea)
data(examplePathways)
data(exampleRanks)
names(examplePathways) <- gsub("_", " ", substr(names(examplePathways), 9, 1000))
set.seed(1)
fgseaRes <- fgseaSimple(examplePathways, exampleRanks, nperm = 10000)
data.table::data.table(fgseaRes[order(fgseaRes$NES, decreasing = TRUE),])
### convert from gene set defined by BiocSet::BiocSet to list
#library(BiocSet)
#genesets <- BiocSet(examplePathways)
#gsc_list <- as(genesets, "list")
# convert from gene set defined by GSEABase::GeneSetCollection to list
#library(GSEABase)
#genesets <- BiocSet(examplePathways)
#gsc <- as(genesets, "GeneSetCollection")
#gsc_list <- list()
#for (i in 1:length(gsc)) {
# gsc_list[[setName(gsc[[i]])]] <- geneIds(gsc[[i]])
#}
#set.seed(1)
#fgseaRes <- fgseaSimple(gsc_list, exampleRanks, nperm = 10000)
```
### deep learning and embedding
Since deep learning architectures are incapable of processing characters or words in their raw form, the text needs to be converted to numbers as inputs. Word embeddings are the texts converted into numbers. For tokenization, unigram and bigram sequences are used as default. An integer is assigned to each token, and then each term is converted to a sequence of integers. The sequences that are longer than the given maximum length are truncated, whereas shorter sequences are padded with zeros. Keras is a higher-level library built on top of TensorFlow. It is available in R through the keras package. The input to the Keras embedding are integers. These integers are of the tokens. This representation is passed to the embedding layer. The embedding layer acts as the first hidden layer of the neural network.
```{r, fig.align='center', message=FALSE, warning=FALSE, eval=TRUE}
if (keras::is_keras_available() & reticulate::py_available()) {
# model parameters
num_tokens <- 1000
length_seq <- 30
batch_size <- 32
embedding_dim <- 50
num_units <- 32
epochs <- 10
# algorithm
ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
model = bi_lstm(num_tokens, embedding_dim,
length_seq, num_units),
num_tokens = num_tokens,
length_seq = length_seq,
epochs = epochs,
batch_size = batch_size,
use_generator = FALSE,
callbacks = keras::callback_early_stopping(
monitor = "loss",
patience = 10,
restore_best_weights = TRUE))
# prediction for every token
ttgseaRes$token_pred
ttgseaRes$token_gsea[["TGF beta"]][,1:5]
}
```
### Monte Carlo p-value
Deep learning models predict only enrichment scores. The p-values of the scores are not provided by the model. So, the Monte Carlo p-value method is used within the algorithm. Computing the p-value for a statistical test can be easily accomplished via Monte Carlo. The ordinary Monte Carlo is a simulation technique for approximating the expectation of a function for a general random variable, when the exact expectation cannot be found analytically. The Monte Carlo p-value method simply simulates a lot of datasets under the null, computes a statistic for each generated dataset, and then computes the percentile rank of observed value among these sets of simulated values. The number of tokens used for each simulation is the same to the length of the sequence of the corresponding term. If a new text does not have any tokens, its p-value is not available.
```{r, fig.align='center', message=FALSE, warning=FALSE, eval=TRUE}
if (exists("ttgseaRes")) {
# prediction with MC p-value
set.seed(1)
new_text <- c("Cell Cycle DNA Replication",
"Cell Cycle",
"DNA Replication",
"Cycle Cell",
"Replication DNA",
"TGF-beta receptor")
print(predict_model(ttgseaRes, new_text))
print(predict_model(ttgseaRes, "data science"))
}
```
### visualization
You are allowed to create a visualization of your model architecture.
```{r, fig.align='center', message=FALSE, warning=FALSE, eval=TRUE}
if (exists("ttgseaRes")) {
summary(ttgseaRes$model)
plot_model(ttgseaRes$model)
}
```
## Leading edge genes
Take another exmaple. A set of names of ranked genes can be seen as sequential data. In the result of GSEA, names of leading edge genes for each gene set are given. The leading edge subset contains genes which contribute most to the enrichment score. Thus the scores of one or more genes of the leading edge subset can be predicted.
```{r, fig.align='center', message=FALSE, warning=FALSE, eval=FALSE}
if (keras::is_keras_available() & reticulate::py_available()) {
# leading edge
LE <- unlist(lapply(fgseaRes$leadingEdge, function(x) gsub(",", "", toString(x))))
fgseaRes <- cbind(fgseaRes, LE)
# model parameters
num_tokens <- 1000
length_seq <- 30
batch_size <- 32
embedding_dim <- 50
num_units <- 32
epochs <- 10
# algorithm
ttgseaRes <- fit_model(fgseaRes, "LE", "NES",
model = bi_lstm(num_tokens, embedding_dim,
length_seq, num_units),
num_tokens = num_tokens,
length_seq = length_seq,
epochs = epochs,
batch_size = batch_size,
verbose = 0,
callbacks = callback_early_stopping(
monitor = "loss",
patience = 5,
restore_best_weights = TRUE))
# prediction for every token
ttgseaRes$token_pred
# prediction with MC p-value
set.seed(1)
new_text <- c("107995 56150", "16952")
predict_model(ttgseaRes, new_text)
}
```
# Case Study
The "airway" dataset has four cell lines with two conditions, control and treatment with dexamethasone. By using the package "DESeq2", differntially expressed genes between controls and treated samples are identified from the gene expression data. Then the log2FC is used as a score for GSEA. For GSEA, GOBP for human is obtained from the package "org.Hs.eg.db", by using the package "BiocSet". GSEA is performed by the package "fgsea". Since "fgsea" can accept a list, the type of gene set is converted to a list. Finally, the result of GSEA is fitted to a deep learning model, and then enrichment scores of new terms can be predicted.
```{r, fig.align='center', message=FALSE, warning=FALSE, eval=FALSE}
if (keras::is_keras_available() & reticulate::py_available()) {
## data preparation
library(airway)
data(airway)
## differentially expressed genes
library(DESeq2)
des <- DESeqDataSet(airway, design = ~ dex)
des <- DESeq(des)
res <- results(des)
head(res)
# log2FC used for GSEA
statistic <- res$"log2FoldChange"
names(statistic) <- rownames(res)
statistic <- na.omit(statistic)
head(statistic)
## gene set
library(org.Hs.eg.db)
library(BiocSet)
go <- go_sets(org.Hs.eg.db, "ENSEMBL", ontology = "BP")
go <- as(go, "list")
# convert GO id to term name
library(GO.db)
names(go) <- Term(GOTERM)[names(go)]
## GSEA
library(fgsea)
set.seed(1)
fgseaRes <- fgsea(go, statistic)
head(fgseaRes)
## tokenizing text of GSEA
# model parameters
num_tokens <- 5000
length_seq <- 30
batch_size <- 64
embedding_dim <- 128
num_units <- 32
epochs <- 20
# algorithm
ttgseaRes <- fit_model(fgseaRes, "pathway", "NES",
model = bi_lstm(num_tokens, embedding_dim,
length_seq, num_units),
num_tokens = num_tokens,
length_seq = length_seq,
epochs = epochs,
batch_size = batch_size,
callbacks = keras::callback_early_stopping(
monitor = "loss",
patience = 5,
restore_best_weights = TRUE))
# prediction
ttgseaRes$token_pred
set.seed(1)
predict_model(ttgseaRes, c("translation response",
"cytokine activity",
"rhodopsin mediate",
"granzyme",
"histone deacetylation",
"helper T cell",
"Wnt"))
}
```
# Session information
```{r, eval=TRUE}
sessionInfo()
```
# References
Alterovitz, G., & Ramoni, M. (Eds.). (2011). Knowledge-Based Bioinformatics: from Analysis to Interpretation. John Wiley & Sons.
Consoli, S., Recupero, D. R., & Petkovic, M. (2019). Data Science for Healthcare: Methodologies and Applications. Springer.
DasGupta, A. (2011). Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer.
Ghatak, A. (2019). Deep Learning with R. Springer.
Hassanien, A. E., & Elhoseny, M. (2019). Cybersecurity and Secure Information Systems: Challenges and Solutions and Smart Environments. Springer.
Leong, H. S., & Kipling, D. (2009). Text-based over-representation analysis of microarray gene lists with annotation bias. Nucleic acids research, 37(11), e79.
Micheas, A. C. (2018). Theory of Stochastic Objects: Probability, Stochastic Processes and Inference. CRC Press.
Shaalan, K., Hassanien, A. E., & Tolba, F. (Eds.). (2017). Intelligent Natural Language Processing: Trends and Applications (Vol. 740). Springer.