---
title: "Tranferable Omics Pediction"
date: "`r BiocStyle::doc_date()`"
params:
  test: FALSE
author:
- name: Harry Robertson
  affiliation:  
  - Centre for Precision Data Science, University of Sydney, Australia
  - School of Mathematics and Statistics, University of Sydney, Australia
  - &WIMR Westmead Institute for Medical Research, University of Sydney, Australia
- name: Nicholas Robertson
  affiliation:
  - Centre for Precision Data Science, University of Sydney, Australia
  - School of Mathematics and Statistics, University of Sydney, Australia
- name: Ellis Patrick
  affiliation:
  - Centre for Precision Data Science, University of Sydney, Australia
  - School of Mathematics and Statistics, University of Sydney, Australia
  - &WIMR Westmead Institute for Medical Research, University of Sydney, Australia
vignette: >
  %\VignetteIndexEntry{"Introduction to TOP"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(BiocStyle)
```

```{r, warning = FALSE, message = FALSE}
suppressPackageStartupMessages({
  library(tidyverse)
  library(survival)
  library(dplyr)
  library(survminer)
  library(Biobase)
  library(ggsci)
  library(ggbeeswarm)
  library(TOP)
  library(curatedOvarianData)
})

theme_set(theme_bw())
```

# Installation
```{r eval = FALSE}
# Install the package from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("TOP")
```

# Overview 

The TOP R package provides a transfer learning approach for building predictive models across multiple omics datasets. With the increasing availability of omics data, there is a growing need for methods that can effectively integrate and analyze data from multiple sources. However, merging datasets can be challenging due to batch effects and other sources of variation.

TOP uses transfer learning strategies to build predictive models that can be applied across multiple datasets without the need for extensive batch correction methods. Specifically, TOP employs a lasso regression model to identify important features and construct a transferable predictive model. By leveraging information from multiple datasets, TOP can improve predictive accuracy and identify common biomarkers across different omics datasets.

The package provides several functions for building and evaluating transfer learning models, including options for visualizing results. Additionally, the package includes sample datasets and detailed documentation to help users get started with the package and understand the underlying methods.

Overall, TOP offers a flexible and powerful approach for integrating and analyzing omics data from multiple sources, making it a valuable tool for researchers in a variety of fields.

# Loading example data
The example data used in this vignette is the curatedOvarianData. Described in the paper from [Benjamin Frederick Ganzfried et al (2013)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3625954/) \n
```{r}
data("GSE12418_eset", package = "curatedOvarianData")
data("GSE30009_eset", package = "curatedOvarianData")
```

# Building a survival model.
The transferable omics prediction framework is also able to build survival models. In this section, we will demonstrate how to use TOP to build a survival model using example ovarian cancer datasets. First we utilise nine datasets that were preparred by Ganzfield et al. 

```{r japan_a}
data("GSE32062.GPL6480_eset")
japan_a <- GSE32062.GPL6480_eset

data("GSE9891_eset")
tothill <- GSE9891_eset

data("GSE32063_eset")
japan_b <- GSE32063_eset

data("TCGA.RNASeqV2_eset")
selection <- TCGA.RNASeqV2_eset$tumorstage %in% c(3, 4) & TCGA.RNASeqV2_eset$site_of_tumor_first_recurrence == "metastasis"
selection[is.na(selection)] <- FALSE
tcgarnaseq <- TCGA.RNASeqV2_eset[, selection]

data("E.MTAB.386_eset")
bentink <- E.MTAB.386_eset

data("GSE13876_eset")
crijns <- GSE13876_eset
crijns <- crijns[, crijns$grade %in% c(3, 4)]

data("GSE18520_eset")
mok <- GSE18520_eset

data("GSE17260_eset")
yoshihara2010 <- GSE17260_eset

data("GSE26712_eset")
bonome <- GSE26712_eset

list_ovarian_eset <- lst(
  japan_a, tothill, japan_b,
  tcgarnaseq, bonome, mok, yoshihara2010,
  bentink, crijns
)

list_ovarian_eset %>%
  sapply(dim)
```

# Common genes between datasets
In order to apply the TOP framework, it is important that the input matrices have identical feature names, such as gene names, across all datasets. In this example, we will identify the common genes present in all the datasets, as these will be the features used for transfer learning.

```{r}
raw_gene_list <- purrr::map(list_ovarian_eset, rownames)
common_genes <- Reduce(f = intersect, x = raw_gene_list)
length(common_genes)
```

# Survival samples
Next, we will prepare the survival data from each of the ovarian cancer datasets. The survival data includes the survival time and the event status (i.e., whether the event of interest, such as death, has occurred)
```{r}
ov_pdata <- purrr::map(list_ovarian_eset, pData)
list_pdata <- list_ovarian_eset %>%
  purrr::map(pData) %>%
  purrr::map(tibble::rownames_to_column, var = "sample_id")

ov_surv_raw <- purrr::map(
  .x = list_pdata,
  .f = ~ data.frame(
    sample_id = .x$sample_id,
    time = .x$days_to_death %>% as.integer(),
    dead = ifelse(.x$vital_status == "deceased", 1, 0)
  ) %>%
    na.omit() %>%
    dplyr::filter(
      time > 0,
      !is.nan(time),
      !is.nan(dead)
    )
)
ov_surv_raw %>% sapply(nrow)
ov_surv_y <- ov_surv_raw %>%
  purrr::map(~ .x %>%
    dplyr::select(-sample_id)) %>%
  purrr::map(~ Surv(time = .x$time, event = .x$dead))
```

# Preparing data for modelling.
In this section, we will prepare the gene expression data and survival data for visualization. We will subset the gene expression data to include only the common genes and samples with survival information. Then, we will create a combined survival data table, with a data_source column to identify the origin of each sample. Finally we plot the distribution of survival times across datasets.
```{r}
ov_surv_exprs <- purrr::map2(
  .x = list_ovarian_eset,
  .y = ov_surv_raw,
  .f = ~ exprs(.x[common_genes, .y$sample_id])
)

ov_surv_tbl <- ov_surv_raw %>%
  bind_rows(.id = "data_source")
ov_surv_tbl %>%
  ggplot(aes(
    x = time,
    y = ..density..,
    fill = data_source
  )) +
  geom_density(alpha = 0.25) +
  scale_fill_d3()
```
# Building a TOP survival model.
In this section, we will build the survival model using the TOP framework. To do this, we will first organize the survival data and gene expression data into appropriate data structures. Then, we will apply the `TOP_survival` function to build the model, specifying the number of features to be selected by the lasso regression.
```{r}
surv_list <- ov_surv_tbl %>%
  split(ov_surv_tbl$data_source)
surv_list <- lapply(surv_list, function(x) x[, 3:4])

surv_counts <- ov_surv_exprs %>% lapply(t)
surv_list <- surv_list[names(surv_counts)]
surv_model <- TOP_survival(
  x_list = surv_counts, y_list = surv_list, nFeatures = 10
)
```

# Visualising performance. 
Once we have built the survival model using the TOP framework, it is important to evaluate its performance. In this section, we will visualize the performance of the model by calculating the concordance index (C-index) for each dataset. The C-index measures the agreement between the predicted and observed survival times, with values ranging from 0 to 1,
```{r}
conf_results <- unlist(lapply(seq_along(surv_counts), function(x) {
  Surv_TOP_CI(
    surv_model,
    newx = surv_counts[[x]], newy = surv_list[[x]]
  )$concordance
}))

conf_results %>%
  tibble::enframe() %>%
  mutate(Metric = "C-index") %>%
  ggplot(aes(y = value, x = Metric)) +
  geom_boxplot(width = 0.5) +
  ylab("C-index") +
  geom_jitter(alpha = 0.7, width = 0.1) +
  theme(axis.text.x = element_blank()) +
  xlab("Survival Model")
```

# sessionInfo

```{r}
sessionInfo()
```