--- title: "Integrative Pathway Analysis with pathwayPCA" author: "Gabriel Odom, Lily Wang, Xi Chen" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 word_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Integrative Pathway Analysis with pathwayPCA} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, cache = FALSE, comment = "#>") ``` # 1. Introduction `pathwayPCA` is an integrative analysis tool that implements the principal component analysis (PCA) based pathway analysis approaches described in Chen et al. (2008), Chen et al. (2010), and Chen (2011). `pathwayPCA` allows users to: 1. Test pathway association with binary, continuous, or survival phenotypes. 2. Extract relevant genes in the pathways using the SuperPCA and AESPCA approaches. 3. Compute *principal components* (PCs) based on the *selected genes*. These estimated latent variables represent pathway activities for individual subjects, which can then be used to perform integrative pathway analysis, such as multi-omics analysis. 4. Extract relevant genes that drive pathway significance as well as data corresponding to these relevant genes for additional in-depth analysis. 5. Perform analyses with enhanced computational efficiency with parallel computing and enhanced data safety with S4-class data objects. 6. Analyze studies with complex experimental designs, with multiple covariates, and with interaction effects, e.g., testing whether pathway association with clinical phenotype is different between male and female subjects. ## Installing the Package `pathwayPCA` is a package for R, so you need [R](https://cloud.r-project.org/) first. We also strongly recommend the [RStudio](https://www.rstudio.com/products/rstudio/download/) integrated development environment as a user-friendly graphical wrapper for R. ### Stable Build The stable build of our package will be available on Bioconductor in May of 2019. To access Bioconductor packages, first install BiocManager, then use BiocManager to install this package: ``` install.packages("BiocManager") BiocManager::install("pathwayPCA") ``` ### Development Build Because we are currently in the development phase for version 2 of this package, you can install the package from GitHub. In order to install a package from GitHub, you will need the `devtools::` package (https://github.com/r-lib/devtools) and either [Rtools](https://cran.r-project.org/bin/windows/Rtools/) (for Windows) or [Xcode](https://developer.apple.com/xcode/) (for Mac). Then you can install the development version of the [`pathwayPCA` package](https://github.com/gabrielodom/pathwayPCA) from [GitHub](https://github.com/): ``` devtools::install_github("gabrielodom/pathwayPCA") ``` ## Loading Packages Throughout this vignette, we will make use of the `tidyverse` suite of utility packages (https://www.tidyverse.org/). The `tidyverse` and `pathwayPCA` can be loaded into R using: ```{r packLoad, message=FALSE} library(tidyverse) library(pathwayPCA) ```
*******************************************************************************
# 2. Case study: identifying significant pathways in protein expressions associated with survival outcome in ovarian cancer data ## 2.1. Ovarian cancer dataset For this example, we will use the mass spectrometry based global proteomics data for ovarian cancer recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The normalized protein abundance expression dataset can be obtained from the LinkedOmics database at http://linkedomics.org/data_download/TCGA-OV/. We used the dataset “Proteome (PNNL, Gene level)” which was generated by the Pacific Northwest National Laboratory (PNNL). One subject was removed due to missing survival outcome. Missing values in proteins expression data were imputed using the Bioconductor package `impute` under default settings. The final dataset consisted of 5162 protein expression values for 83 samples. ## 2.2. Creating an `Omics` data object for pathway analysis First, we need to create an `Omics`-class data object that stores (1) the expression dataset (2) phenotype information for the samples (3) a collection of pathways ### 2.2.1 Expression and Phenotype Data We can obtain datasets 1 and 2 for the ovarian cancer dataset by loading the `ovarian_PNNL_survival.RDS` data file included in the `pathwayPCAdata` supplement repository: https://github.com/lizhongliu1996/pathwayPCAdata. ```{r loadAssay} gitHubPath_char <- "https://raw.githubusercontent.com/lizhongliu1996/pathwayPCAdata/master/" ovSurv_df <- readRDS( url(paste0(gitHubPath_char, "ovarian_PNNL_survival.RDS")) ) ``` The `ovSurv_df` dataset is a data frame with protein expression levels and survival outcome matched by sample IDs. The variables (columns) include overall survival time and censoring status, as well as expression data for 5162 proteins for each of the 83 samples. ```{r printAssay, eval=TRUE} ovSurv_df [1:5, 1:5] ``` ### 2.2.2 Pathway Collections For the collection of pathways in (3), we need to specify a `.gmt` file, a text file with each row corresponding to one pathway. Each row contains an ID (column `TERMS`), an optional description (column `description`), and the genes in the pathway (all subsequent columns). Pathway collections in `.gmt` form can be downloaded from the MSigDB database at http://software.broadinstitute.org/gsea/msigdb/collections.jsp. For [WikiPathways](https://www.wikipathways.org/index.php/WikiPathways), one can download monthly data releases in `.gmt` format using the `dowloadPathwayArchive()` function in the `rWikiPathways` package [from Bioconductor](https://bioconductor.org/packages/release/bioc/html/rWikiPathways.html). For example, the [following commands](https://bioconductor.org/packages/release/bioc/vignettes/rWikiPathways/inst/doc/Overview.html#4_give_me_more) downloads the latest release of the human pathways from WikiPathways to your current directory: ```{r rWikiPathways, eval=FALSE} library(rWikiPathways) # library(XML) # necessary if you encounter an error with readHTMLTable downloadPathwayArchive( organism = "Homo sapiens", format = "gmt" ) ``` ``` trying URL 'http://data.wikipathways.org/current/gmt/wikipathways-20190110-gmt-Homo_sapiens.gmt' Content type '' length 174868 bytes (170 KB) downloaded 170 KB ``` ```{r printWPgmtpath, echo=FALSE} print("wikipathways-20190110-gmt-Homo_sapiens.gmt") ``` `pathwayPCA` includes the June 2018 Wikipathways collection for *homo sapiens*, which can be loaded using the `read_gmt` function: ```{r ourWikiPWC} dataDir_path <- system.file( "extdata", package = "pathwayPCA", mustWork = TRUE ) wikipathways_PC <- read_gmt( paste0(dataDir_path, "/wikipathways_human_symbol.gmt"), description = TRUE ) ``` ### 2.2.3 Create an `OmicsSurv` Data Container Now that we have these three data components, we create an `OmicsSurv` data container. Note that when `assayData_df` and `response` are supplied from two different files, the user must match and merge these data sets by sample IDs. ```{r createOmics, comment=""} ov_OmicsSurv <- CreateOmics( # protein expression data assayData_df = ovSurv_df[, -(2:3)], # pathway collection pathwayCollection_ls = wikipathways_PC, # survival phenotypes response = ovSurv_df[, 1:3], # phenotype is survival data respType = "survival", # retain pathways with > 5 proteins minPathSize = 5 ) ``` To see a summary of the `Omics` data object we just created, simply type the name of the object: ```{r inspectOmicsObj} ov_OmicsSurv ``` ## 2.3. Testing pathway association with phenotypes ### 2.3.1 Method Description Once we have a valid `Omics` object, we can perform pathway analysis using the AES-PCA (Adaptive, Elastic-net, Sparse PCA) or Supervised PCA (SuperPCA) methodology described in Chen et al. (2008), Chen et al. (2010), and Chen (2011). Briefly, in the AES-PCA method, we first extract PCs representing activities within each pathway using a dimension reduction approach based on adaptive, elastic-net, sparse principal component analysis (https://doi.org/10.2202/1544-6115.1697). The estimated latent variables are then tested against phenotypes using a permutation test that permutes sample labels. Note that the AESPCA approach does not use the response information to estimate pathway PCs, so it is an unsupervised approach. This is in contrast to the SuperPCA approach, where a selected subset of genes most associated with disease outcome are used to estimate the latent variable for a pathway (https://doi.org/10.1002/gepi.20532). Because of this gene selection step, the test statistics from the SuperPCA model can no longer be approximated well using the Student's $t$-distribution. To account for the gene selection step, `pathwayPCA` estimates $p$-values from a two-component mixture of Gumbel extreme value distributions instead. ### 2.3.2 Implementation Because the syntax for performing SuperPCA is nearly identical to the AES-PCA syntax, we will illustrate only the AES-PCA workflow below. Note that when the value supplied to the `numReps` argument is greater than 0, the `AESPCA_pvals()` function employs a parametric test when estimating pathway significance via the following model: "phenotype ~ intercept + PC1". Pathway $p$-values are estimated based on a likelihood ratio test that compares this model to a null model (with intercept only). ```{r aespcaCall, comment=""} ovarian_aespcOut <- AESPCA_pVals( # The Omics data container object = ov_OmicsSurv, # One principal component per pathway numPCs = 1, # Use parallel computing with 2 cores parallel = TRUE, numCores = 2, # # Use serial computing # parallel = FALSE, # Estimate the p-values parametrically numReps = 0, # Control FDR via Benjamini-Hochberg adjustment = "BH" ) ``` This `ovarian_aespcOut` object contains 3 components: a table of pathway $p$-values, AESPCA-estimated PCs of each sample from each pathway, and the loadings of each protein onto the AESPCs. ```{r inspectAESPCAout} names(ovarian_aespcOut) ``` ### 2.3.3 The pathway analysis results For this ovarian cancer dataset, the top 10 most significant pathways identified by AES-PCA are: ```{r headPVals} getPathpVals(ovarian_aespcOut, numPaths = 10) ``` Before constructing a graph of the $p$-values, we extract the top 20 pathways (the default value for `numPaths` is 20): ```{r subset_top_genes} ovOutGather_df <- getPathpVals(ovarian_aespcOut, score = TRUE) ``` Now we plot the pathway significance level for the top 20 pathways. In this figure, `score` indicates the negative natural logarithm of the unadjusted $p$-values for each pathway. ```{r pathway_bar_plots, fig.height = 5, fig.width = 8} ggplot(ovOutGather_df) + # set overall appearance of the plot theme_bw() + # Define the dependent and independent variables aes(x = reorder(terms, score), y = score) + # From the defined variables, create a vertical bar chart geom_col(position = "dodge", fill = "#66FFFF", width = 0.7) + # Add pathway labels geom_text( aes( x = reorder(terms, score), label = reorder(description, score), y = 0.1 ), color = "black", size = 2, hjust = 0 ) + # Set main and axis titles ggtitle("AES-PCA Significant Pathways: Ovarian Cancer") + xlab("Pathways") + ylab("Negative LN (p-Value)") + # Add a line showing the alpha = 0.01 level geom_hline(yintercept = -log(0.01), size = 1, color = "blue") + # Flip the x and y axes coord_flip() ``` ### 2.3.4 Extract relevant genes from significant pathways Because pathways are defined *a priori*, typically only a subset of genes within each pathway are relevant to the phenotype and contribute to pathway significance. In AESPCA, these relevant genes are the genes with nonzero loadings in the first PC of AESPCs. For example, for the "IL-1 signaling pathway" (Wikipathways [WP195](https://www.wikipathways.org/index.php/Pathway:WP195)), we can extract the **PC**s and their protein **L**oadings using the `getPathPCLs()` function: ```{r getWP195pcl} wp195PCLs_ls <- getPathPCLs(ovarian_aespcOut, "WP195") wp195PCLs_ls ``` The proteins with non-zero loadings can be extracted using the following lines: ```{r WP195loads} wp195Loadings_df <- wp195PCLs_ls$Loadings %>% filter(PC1 != 0) wp195Loadings_df ``` We can also prepare these loadings for graphics: ```{r loadDirections} wp195Loadings_df <- wp195Loadings_df %>% # Sort Loading from Strongest to Weakest arrange(desc(abs(PC1))) %>% # Order the Genes by Loading Strength mutate(featureID = factor(featureID, levels = featureID)) %>% # Add Directional Indicator (for Colour) mutate(Direction = factor(ifelse(PC1 > 0, "Up", "Down"))) ``` Now we will construct a **col**umn chart with `ggplot2`'s `geom_col()` function. ```{r graphLoads, fig.height = 3.6, fig.width = 6.5} ggplot(data = wp195Loadings_df) + # Set overall appearance theme_bw() + # Define the dependent and independent variables aes(x = featureID, y = PC1, fill = Direction) + # From the defined variables, create a vertical bar chart geom_col(width = 0.5, fill = "#005030", color = "#f47321") + # Set main and axis titles labs( title = "Gene Loadings on IL-1 Signaling Pathway", x = "Protein", y = "Loadings of PC1" ) + # Remove the legend guides(fill = FALSE) ``` Alternatively, we can also plot the correlation of each gene with first PC for each gene. These correlations can be computed by using the `TidyCorrelation()` function in [Section 3.3](https://gabrielodom.github.io/pathwayPCA/articles/Supplement5-Analyse_Results.html#gene-correlations-with-pcs) of additional vignette “Chapter 5 - Visualizing the Results”. ### 2.3.5 Subject-specific PCA estimates In the study of complex diseases, there is often considerable heterogeneity among different subjects with regard to underlying causes of disease and benefit of particular treatment. Therefore, in addition to identifying disease-relevant pathways for the entire patient group, successful (personalized) treatment regimens will also depend upon knowing if a particular pathway is dysregulated for an individual patient. To this end, we can also assess subject-specific pathway activity. As we saw earlier, the `getPathPCLs()` function also returns subject-specific estimates for the individual pathway PCs. ```{r wp195PCs, fig.height = 3.6, fig.width = 6.5} ggplot(data = wp195PCLs_ls$PCs) + # Set overall appearance theme_classic() + # Define the independent variable aes(x = V1) + # Add the histogram layer geom_histogram(bins = 10, color = "#005030", fill = "#f47321") + # Set main and axis titles labs( title = "Distribution of Sample-specific Estimate of Pathway Activities", subtitle = paste0(wp195PCLs_ls$pathway, ": ", wp195PCLs_ls$description), x = "PC1 Value for Each Sample", y = "Count" ) ``` This graph shows there can be considerable heterogeneity in pathway activities between the patients. ### 2.3.6 Extract analysis dataset for significant pathways Users are often also interested in examining the actual data used for analysis of the top pathways, especially for the relevant genes with the pathway. To extract this dataset, we can use the `SubsetPathwayData()` function. These commands extract data for the most significant pathway (IL-1 signaling): ```{r subsetWP195data} wp195Data_df <- SubsetPathwayData(ov_OmicsSurv, "WP195") wp195Data_df ``` In order to plot a heatmap of the proteins in this pathway, we need to "gather" the data frame into three columns. ```{r gatherWP195} wp195gather_df <- wp195Data_df %>% arrange(EventTime) %>% select(-EventTime, -EventObs) %>% gather(protein, value, -sampleID) ``` Now for the heatmap: ```{r heatmapWP195, fig.width=8, fig.asp=0.62} ggplot(wp195gather_df, aes(x = protein, y = sampleID)) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = "red", mid = "black", high = "green") + labs(x = "Proteins", y = "Subjects", fill = "Protein level") + theme(axis.text.x = element_text(angle = 90)) + coord_flip() ``` We can also perform analysis for individual genes belonging to the pathway: ```{r dataSubsetMod} library(survival) NFKB1_df <- wp195Data_df %>% select(EventTime, EventObs, NFKB1) wp195_mod <- coxph( Surv(EventTime, EventObs) ~ NFKB1, data = NFKB1_df ) summary(wp195_mod) ``` Additionally, we can estimate Kaplan-Meier survival curves for patients with high or low expression values for individual genes: ```{r NFKB1Surv} # Add the direction NFKB1_df <- NFKB1_df %>% # Group subjects by gene expression mutate(NFKB1_Expr = ifelse(NFKB1 > median(NFKB1), "High", "Low")) %>% # Re-code time to years mutate(EventTime = EventTime / 365.25) %>% # Ignore any events past 10 years filter(EventTime <= 10) # Fit the survival model NFKB1_fit <- survfit( Surv(EventTime, EventObs) ~ NFKB1_Expr, data = NFKB1_df ) ``` Finally, we can plot these K-M curves over NFKB1 protein expression. ```{r NFKB1SurvPlot, message=FALSE, fig.height = 4, fig.width = 6} library(survminer) ggsurvplot( NFKB1_fit, conf.int = FALSE, pval = TRUE, xlab = "Time in Years", palette = c("#f47321", "#005030"), xlim = c(0, 10) ) ```
*******************************************************************************
# 3. Case study: an integrative multi-omics pathway analysis of ovarian cancer data While copy number alterations are common genomic aberrations in ovarian carcer, recent studies have shown these changes do not necessarily lead to concordant changes in protein expression. In Section 2.3 above, we illustrated testing pathway activities in protein expression against survival outcome. In this section, we will additionally test pathway activities in copy number against survival outcome. Moreover, we will perform integrative analysis to identify those survival -associated protein pathways, genes, and samples driven by copy number alterations. ## 3.1 Creating copy number `Omics` data object for pathway analysis We can identify copy number (CNV) pathways significantly associated with survival in the same way as we did for protein expressions. This gene level CNV data was downloaded from UCSC Xena Functional Genomics Browser (http://xena.ucsc.edu/). ```{r CNVsetup} copyNumberClean_df <- readRDS( url(paste0(gitHubPath_char, "OV_surv_x_CNV2.RDS")) ) ``` And now we create an `Omics` data container. ```{r CNVomics, eval=FALSE} ovCNV_Surv <- CreateOmics( assayData_df = copyNumberClean_df[, -(2:3)], pathwayCollection_ls = wikipathways_PC, response = copyNumberClean_df[, 1:3], respType = "survival", minPathSize = 5 ) ``` ``` ====== Creating object of class OmicsSurv ======= The input pathway database included 5831 unique features. The input assay dataset included 24776 features. Only pathways with at least 5 or more features included in the assay dataset are tested (specified by minPathSize parameter). There are 424 pathways which meet this criterion. Because pathwayPCA is a self-contained test (PMID: 17303618), only features in both assay data and pathway database are considered for analysis. There are 5637 such features shared by the input assay and pathway database. ``` Finally, we can apply the AESPCA method to this copy-number data container. Due to the large sample size, this will take a few moments. ```{r CNV_aespca, eval=FALSE} ovCNV_aespcOut <- AESPCA_pVals( object = ovCNV_Surv, numPCs = 1, parallel = TRUE, numCores = 20, numReps = 0, adjustment = "BH" ) ``` ``` Part 1: Calculate Pathway AES-PCs Initializing Computing Cluster: DONE Extracting Pathway PCs in Parallel: DONE Part 2: Calculate Pathway p-Values Initializing Computing Cluster: DONE Extracting Pathway p-Values in Parallel: DONE Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame DONE ``` ```{r CNV_aespca_read, echo=FALSE} ovCNV_aespcOut <- readRDS( url(paste0(gitHubPath_char, "ovarian_copyNum_aespcOut.RDS")) ) ``` ## 3.2 Identifying significant pathways and relevant genes in both CNV and protein level Next, we identify the intersection of significant pathways based on both CNV and protein data. First, we will create a data frame of the pathway $p$-values from both CNV and proteomics. ```{r multiOmicsPvals} # Copy Number CNVpvals_df <- getPathpVals(ovCNV_aespcOut, alpha = 0.05) %>% mutate(rawp_CNV = rawp) %>% select(description, rawp_CNV) # Proteomics PROTpvals_df <- getPathpVals(ovarian_aespcOut, alpha = 0.05) %>% mutate(rawp_PROT = rawp) %>% select(description, rawp_PROT) # Intersection SigBoth_df <- inner_join(PROTpvals_df, CNVpvals_df, by = "description") # WnT Signaling Pathway is listed as WP363 and WP428 ``` The results showed there are 23 pathways significantly associated with survival in both CNV and protein data, which is significantly more than expected by chance (*p*-value = 0.02088; Fisher's Exact Test; shown in `multi_pathway_overlap_fishers.R`). Here are the top-10 most significant pathways (sorted by protein data significance): ```{r intersectPaths} SigBoth_df ``` Similar to the protein pathway analysis shown in Section 2.3.4, we can also identify relevant genes with nonzero loadings that drives pathway significance in CNV. The "IL-1 signaling pathway" (WP195) is significant in both CNV and protein data. ```{r wp195} # Copy Number Loadings CNVwp195_ls <- getPathPCLs(ovCNV_aespcOut, "WP195") CNV195load_df <- CNVwp195_ls$Loadings %>% filter(abs(PC1) > 0) %>% rename(PC1_CNV = PC1) # Protein Loadings PROTwp195_ls <- getPathPCLs(ovarian_aespcOut, "WP195") PROT195load_df <- PROTwp195_ls$Loadings %>% filter(abs(PC1) > 0) %>% rename(PC1_PROT = PC1) # Intersection inner_join(CNV195load_df, PROT195load_df) ``` The result showed the NFKB1 gene was selected by AESPCA when testing IL-1 signaling pathway (WP195) with survival outcome in both CNV and protein pathway analysis. Because the proteomics data were recorded on a subset of the subjects shown in the copy number data, we can further examine the relationship between CNV and protein expressions for this gene. ```{r mergeGene} NFKB1data_df <- inner_join( copyNumberClean_df %>% select(Sample, NFKB1) %>% rename(CNV_NFKB1 = NFKB1), ovSurv_df %>% select(Sample, NFKB1) %>% rename(PROT_NFKB1 = NFKB1) ) ``` This Pearson test shows that the correlation between CNV and protein expression is highly significant for this gene. ```{r NFKB1cor, comment=""} NFKB1_cor <- cor.test(NFKB1data_df$CNV_NFKB1, NFKB1data_df$PROT_NFKB1) NFKB1_cor ``` We can then visualize the multi-omics relationship for the NFKB1 gene using a scatterplot. ```{r NFKB1plot, fig.height = 4, fig.width = 4} ggplot(data = NFKB1data_df) + # Set overall appearance theme_bw() + # Define the dependent and independent variables aes(x = CNV_NFKB1, y = PROT_NFKB1) + # Add the scatterplot geom_point(size = 2) + # Add a trendline geom_smooth(method = lm, se = FALSE, size = 1) + # Set main and axis titles labs( title = "NFKB1 Expressions in Multi-Omics Data", x = "Copy Number Variation (Centred and Scaled)", y = "Protein Expression (Centred and Scaled)" ) + # Include the correlation on the graph annotate( geom = "text", x = 0.5, y = -0.6, label = paste("rho = ", round(NFKB1_cor$estimate, 3)) ) + annotate( geom = "text", x = 0.5, y = -0.7, label = "p-val < 10^-5" ) ``` ## 3.3 An integrative view on patient-specific pathway activities In Section 2.3.5, we have seen that there can be considerable heterogeneity in pathway activities between the patients. One possible reason could be that copy number changes might not result in changes in protein expression for some of the patients. `pathwayPCA` can be used to estimate pathway activities for each patient, both for protein expressions and copy number expressions separately. These estimates can then be viewed jointly using a Circos plot. ![](circlePlot.PNG){width=90%} The accompanying Circos plot shown normalized copy number (outer circle) and protein expression (inner circle) pathway activities for the IL-1 signaling pathway in the ovarian cancer dataset samples. Each bar corresponds to a patient sample. Red bars indicate higher expression values and more pathway activity for the sample, while blue color bars indicate lower expression values and lower pathway activity for the sample. Note that only some patients have concordant changes in copy number and protein expression.
*******************************************************************************
# 4. Case study: analysis of studies with complex designs `pathwayPCA` is capable of analyzing studies with multiple experimental factors. In this section, we illustrate using `pathwayPCA` to test differential association of pathway expression with survival outcome in male and female subjects. ## 4.1 Data setup and AESPCA analysis For this example, we used TCGA KIRP RNAseq dataset downloaded from Xena datahub. First, we load the KIRP data, create an `Omics` data container, and extract first AESPC from each pathway. Because the full data is large (19.4Mb), we have moved it to a supplemental data package. ```{r kidneyData} kidney_df <- readRDS( url(paste0(gitHubPath_char, "KIRP_Surv_RNAseq_inner.RDS")) ) ``` ```{r kidneyOmics} # Create Omics Container kidney_Omics <- CreateOmics( assayData_df = kidney_df[, -(2:4)], pathwayCollection_ls = wikipathways_PC, response = kidney_df[, 1:3], respType = "surv", minPathSize = 5 ) ``` ``` ====== Creating object of class OmicsSurv ======= The input pathway database included 5831 unique features. The input assay dataset included 20211 features. Only pathways with at least 5 or more features included in the assay dataset are tested (specified by minPathSize parameter). There are 423 pathways which meet this criterion. Because pathwayPCA is a self-contained test (PMID: 17303618), only features in both assay data and pathway database are considered for analysis. There are 5566 such features shared by the input assay and pathway database. ``` ```{r kidneyAESPCA} # AESPCA kidney_aespcOut <- AESPCA_pVals( object = kidney_Omics, numPCs = 1, parallel = TRUE, numCores = 2, numReps = 0, adjustment = "BH" ) ``` ## 4.2 Test for sex interaction with first PC Next we can test whether there is differential pathway association with survival outcome for males and females by the following model, \[ h(t) = h_0(t) \exp\left[ \beta_0 + \beta_1\text{PC}_1 + \beta_2\text{male} + \beta_3(\text{PC}_1 \times \text{male}) \right]. \] In this model, $h(t)$ is expected hazard at time $t$, $h_0 (t)$ is baseline hazard when all predictors are zero, variable $male$ is an indicator variable for male samples, and $PC_1$ is a pathway’s estimated first principal component based on AESPCA. In order to test the sex interaction effect for all pathways, we will write a function which tests the interaction effect for one pathway. ```{r sexInteractFun} TestIntxn <- function(pathway, pcaOut, resp_df){ # For the given pathway, extract the PCs and loadings from the pcaOut list PCL_ls <- getPathPCLs(pcaOut, pathway) # Select and rename the PC PC_df <- PCL_ls$PCs %>% select(PC1 = V1) # Bind this PC to the phenotype data data_df <- bind_cols(resp_df, PC_df) # Construct a survival model with sex interaction sex_mod <- coxph( Surv(time, status) ~ PC1 + male + PC1 * male, data = data_df ) # Extract the model fit statistics for the interaction modStats_mat <- t( coef(summary(sex_mod))["PC1:maleTRUE", ] ) # Return a data frame of the pathway and model statistics list( statistics = data.frame( terms = pathway, description = PCL_ls$description, modStats_mat, stringsAsFactors = FALSE ), model = sex_mod ) } ``` As an example, we can test it on patwhay WP195, ```{r testNewFun} TestIntxn("WP195", kidney_aespcOut, kidney_df[, 2:4])$model ``` We can also apply this function to a list of pathways and select the significant pathways: ```{r applyTestSexFun} paths_char <- kidney_aespcOut$pVals_df$terms interactions_ls <- lapply( paths_char, FUN = TestIntxn, pcaOut = kidney_aespcOut, resp_df = kidney_df[, 2:4] ) names(interactions_ls) <- paths_char interactions_df <- # Take list of interactions interactions_ls %>% # select the first element (the data frame of model stats) lapply(`[[`, 1) %>% # stack these data frames bind_rows() %>% as.tibble() %>% # sort the rows by significance arrange(`Pr...z..`) interactions_df %>% filter(`Pr...z..` < 0.05) ``` The results showed the most significant pathway is [WP1559](https://www.wikipathways.org/index.php/Pathway:WP1559) (“TFs Regulate miRNAs related to cardiac hypertrophy”). We can inspect the model results for this pathway directly. ```{r modelRes} summary(interactions_ls[["WP1559"]]$model) ``` The results showed that although sex is not significantly associated with survival outcome, the association of pathway gene expression (PC1) with survival is highly dependent on sex of the samples. ## 4.3 Survival curves by sex interaction For visualization, we can divide subjects according to sex and high or low PC-expression, and add a group indicator for the four groups. ```{r kidneySurvPlotData} # Bind the Pheno Data to WP1559's First PC kidneyWP1559_df <- bind_cols( kidney_df[, 2:4], getPathPCLs(kidney_aespcOut, "WP1559")$PCs %>% select(PC1 = V1) ) # Add Grouping Feature kidneySurvWP1559grouped_df <- kidneyWP1559_df %>% # add strength indicator mutate(direction = ifelse(PC1 > median(PC1), "High", "Low")) %>% # group by interaction of sex and strength on PC mutate(group = paste0(direction, ifelse(male, "_M", "_F"))) %>% # recode time in years mutate(time = time / 365.25) %>% # remove summarized columns select(-male, -PC1, -direction) ``` Now we can plot survival curves for the four groups. ```{r plotKidneyGroupedSurv, message=FALSE, fig.height = 5, fig.width = 7} # Fit the survival model fit <- survfit( Surv(time, status) ~ group, data = kidneySurvWP1559grouped_df ) ggsurvplot( fit, conf.int = FALSE, xlab = "Time in Years", xlim = c(0, 10), ylim = c(0.4, 1) ) ``` These Kaplan-Meier curves showed that while high or low pathway activities were not associated with survival in male subjects (green and purple curves, respectively), female subjects with high pathway activities (red) had significantly worse survival outcomes than those with low pathway activities (blue).
*******************************************************************************
# 5. Further reading For addtional information on `pathwayPCA`, please see each of our supplementary vignette chapters for detailed tutorials on each of the three topics discussed above. These vignettes are: - [Chapter 2: Import Data](https://gabrielodom.github.io/pathwayPCA/articles/Supplement2-Importing_Data.html) - [Chapter 3: Create `Omics` Data Objects](https://gabrielodom.github.io/pathwayPCA/articles/Supplement3-Create_Omics_Objects.html) - [Chapter 4: Test Pathway Significance](https://gabrielodom.github.io/pathwayPCA/articles/Supplement4-Methods_Walkthrough.html) - [Chapter 5: Visualizing the Results](https://gabrielodom.github.io/pathwayPCA/articles/Supplement5-Analyse_Results.html) # 6. References Chen, X., Wang, L., Smith, J.D. and Zhang, B. (2008) Supervised principal component analysis for gene set enrichment of microarray data with continuous or survival outcomes. Bioinformatics, 24, 2474-2481. Chen, X., Wang, L., Hu, B., Guo, M., Barnard, J. and Zhu, X. (2010) Pathway-based analysis for genome-wide association studies using supervised principal components. Genetic epidemiology, 34, 716-724. Chen, X. (2011) Adaptive elastic-net sparse principal component analysis for pathway association testing. Statistical applications in genetics and molecular biology, 10. The R session information for this vignette is: ```{r sessionDetails} sessionInfo() ```