---
title: "The discordant R Package: A Novel Approach to Differential Correlation"
shorttitle: "discordant"
author:
  - name: Charlotte Siska
    email: charlotte.siska@ucdenver.edu
    affiliation: University of Colorado Anschutz Medical Campus
  - name: Max McGrath
    email: max.mcgrath@ucdenver.edu
    affiliation: University of Colorado Anschutz Medical Campus
  - name: Katerina Kechris
    email: katerina.kechris@cuanschutz.edu
    affiliation: University of Colorado Anschutz Medical Campus
output:
  BiocStyle::html_document:
    highlight: "tango"
    code_folding: show
    toc: true
    toc_float: 
      collapsed: false
package: discordant
citation_package: natbib
bibliography: Discordant_bib_v3.bib
vignette: |
  %\VignetteIndexEntry{The discordant R Package: A Novel Approach to Differential Correlation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include=FALSE, echo=FALSE}
# date: "`r doc_date()`"
# "`r pkg_ver('BiocStyle')`"
# 
```
# Introduction
Discordant is an R package that identifies pairs of features that correlate 
differently between phenotypic groups, with application to -omics data sets. 
Discordant uses a mixture model that “bins” molecular feature pairs based on 
their type of coexpression or coabbundance. More information on the algorithm 
can be found in 
(@siska1, @siska2). The final output are posterior probabilities of differential
correlation. This package can be used to determine differential correlation 
within one –omics data set or between two –omics data sets (provided that both 
–omics data sets were taken from the same samples). Also, the type of data can 
be any type of –omics with normal or non-normal distributions. Some examples are
metabolomics, transcriptomic, proteomics, etc. 
The functions in the Discordant package provide a simple pipeline for 
intermediate R users to determine differentially correlated pairs. The final 
output is a table of molecular feature pairs and their respective posterior 
probabilities. Functions have been written to allow flexibility for users in how
they interpret results, which will be discussed further. Currently, the package 
only supports the comparison between two phenotypic groups (e.g., disease versus 
control, mutant versus wildtype).
```{r, echo=FALSE}
library(discordant)
```
# Discordant Algorithm
Discordant is originally derived from the Concordant algorithm written by 
(@lai1, @lai2). It was used to determine concordance between microarrays. We 
have applied it to determine differential correlation of features between groups
(@siska1, @siska2).
Using a three component mixture model and the Expectation Maximization (EM)
algorithm, the model predicts 
if the correlation coefficients in phenotypic groups 1 and 2 for a molecular 
feature pair are dissimilar (@siska1). The correlation coefficients are 
generated for all possible molecular feature pairs witin an -omics dataset or 
between two -omics data sets. The correlation coefficients are transformed into 
z scores using Fisher's transformation. The three components are -, + and 0 
which correspond respectively to a negative, positive or no correlation. 
Molecular features that have correlation coefficients in *different* components 
are considered *differentially* correlated, as opposed to when correlation 
coefficients are in the *same* component then they are *equivalently* 
correlated.
\[
\begin{array}{c|c c c}
  \text{} & \text{0} & \text{-}  & \text{+}\\ 
\hline
   0 & 1 & 2  & 3 \\
   - & 4 & 5  & 6 \\
   + & 7 & 8  & 9 \\ 
\end{array}
\]
The class matrix (above) contains the classes that represent all possible 
paired-correlation scenarios. These scenarios are based off the components in 
the mixture models. Molecular features that have correlation coefficients in 
different components are considered differentially correlated, as opposed to 
when correlation coefficients are in the same component they are equivalently 
correlated. This can be visualized in the class matrix, where the rows represent
the components for group 1 and the columns represent the components for group 2.
The classes on the diagonal represent equivalent correlation (1, 5 and 9), and 
classes in the off-diagonal represent differential correlation (2, 3, 4, 6, 8).
After running the EM algorithm, we have 9 posterior probabilities for each 
molecular feature pair that correspond to the 9 classes in the class matrix. 
Since we want to summarize the probability that the molecular feature pair is 
differentially correlated, we sum the posterior probabilities representing the 
off-diagonal classes in the class matrix.
# Example Data
The following data sets are provided by `r Biocpkg("discordant")` and will be 
used in the examples which follow. All data sets are originally from the Cancer 
Genome Atlas (TCGA) and can be found at 
[http://cancergenome.nih.gov/](http://cancergenome.nih.gov/). 
- TCGA\_GBM\_miRNA\_microarray
 
  - miRNA expression values from 10 control and 20 tumor samples for 
  a Glioblastoma multiforme (GBM) Agilent miRNA micorarray. The feature size was 
  originally 470, but after features with outliers were filtered out feature 
  size reduces to 331.  In this sample data set provided in the package, we 
  randomly selected 10 features.
  
 
  - TCGA\_GBM\_transcript\_microarray
 
  - Transcript (or mRNA) expression values from 10 control and 20 
  tumor samples in a GBM Agilent 244k micorarray. The feature size was 
  originally 90797, but after features with outliers were filtered out, feature
  size reduces to 72656. In this sample data set provided in the package, we 
  randomly selected 20 features.
  
 
  - TCGA\_Breast\_miRNASeq
 
  - miRNA counts from 15 control and 45 tumor samples in a Breast Cancer 
  Illumina HiSeq miRNASeq. The feature size was originally 212, but after 
  features with outliers were filtered out feature size reduces to 200. In this 
  sample data set provided in the package, we randomly selected 100 features.
  
 
  - TCGA\_Breast\_RNASeq
 
  - Transcript (or mRNA) counts from 15 control and 45 tumor samples in a 
  Breast Cancer Illumina HiSeq RNASeq. The feature size was originally 19414, 
  but after features with outliers were filtered out feature size reduces to 
  16656. In this sample data set provided in the package, we randomly selected 
  100 features.
 
  - TCGA\_Breast\_miRNASeq\_voom
 
  - voom-transformed TCGA\_Breast\_miRNASeq
 
  - TCGA\_Breast\_RNASeq\_voom
 
  - voom-transformed TCGA\_Breast\_RNASeq
 
The data sets are provided as described above with no other modifications.
Throughout this vignette we will use the
data sets `TCGA_GBM_miRNA_microarray` and `TCGA_GBM_transcript_microarray` to
demonstrate `r Biocpkg("discordant")`'s essential functionality. They are 
loaded as follows:
```{r}
# Load data
data(TCGA_GBM_miRNA_microarray)
data(TCGA_GBM_transcript_microarray)
```
```{r, echo=FALSE}
# Modify column names
# rownames(TCGA_GBM_miRNA_microarray) <- 
#   substr(rownames(TCGA_GBM_miRNA_microarray), 9,
#          nchar(rownames(TCGA_GBM_miRNA_microarray)))
```
# Before Starting
## Types of Analysis
"Within" –omics refers to when Discordant analysis is performed within one 
–omics dataset where all molecular features within a -omics dataset are paired 
to each other (e.g. transcript-transcript pairs in a transcriptomics 
experiment).
"Between" -omics refers to analysis of two -omics data sets. Molecular feature 
pairs analyzed are between the two -omics, (e.g. transcript-protein, protein-
metabolite) are paired.
`r Biocpkg("discordant")` provides tools for both within and between 
-omics analysis that will be described in the sections that follow.
## Outliers
In our work, we found that features with outliers would skew correlation and 
cause false positives. Our approach was to filter out features that had large 
outliers. With normal data, such as in gene expression data from microarrays, 
Grubbs' test can be used. 
The null hypothesis is that there are no outliers in the data, and so features 
with p-value $\ge$ 0.05 are kept. A simple R function is found in the 
`r CRANpkg("outliers")` R package as `grubbs.test()`.
Determining outliers in non-normal data is more complicated. We used the median 
absolute deviation (MAD). Normally, features are filtered if they are outside 2 
or 3 MADs from the median (@leys). This is not completely applicable to 
sequencing data, because sequencing data has large variance and a non-
symmetrical distribution. Therefore we used the 'split MAD' approach (@magwene).
A left MAD is determined based on data left to the median and a 
right MAD is determined based on data to the right of the median. If there are 
any feature outside a factor of the left or right MAD from the median, they are
filtered out.
`r Biocpkg("discordant")` provides `splitMADOutlier()` to identify features
with outliers using MAD. The 
number of MAD outside of the median can be changed with option 
`threshold`. Another option is `filter0` which if `TRUE`
will filter out any feature with at least one 0. Arguments returned are 
`mat.filtered`, which is the filtered matrix and `index` which 
is the index of features that are retained in `mat.filtered`.
```{r}
data(TCGA_Breast_miRNASeq)
mat.filtered <- splitMADOutlier(TCGA_Breast_miRNASeq,
                                filter0 = TRUE, 
                                threshold = 4)
```
# Correlation Vectors
To run the Discordant algorithm, correlation vectors respective to each group 
are necessary for input, which are easy to create using the function 
`createVectors()`. Each correlation coefficient represents the correlation 
between two molecular features. The type of molecular feature pairs depends if 
a within -omics or between -omics analysis is performed. Correlation between 
molecular features in the same -omics dataset is within -omics, and correlation 
between molecular features in two different -omics datasets is between -omics. 
Whether or not within -omics or between -omics analysis is performed depends on 
whether one or two matrices are parameters for this function. The arguments for
`createVectors()` are:
- x
 
  - $m$ by $n$ matrix where $m$ are features and $n$ are samples. If only this
  matrix is provided, a within -omics analysis is performed
 
- y
 
  - $m$ by $n$ matrix where $m$ are features and $n$ are samples. This is an
  optional argument which will induce between -omics analysis. Samples must be
  matched with those in x
 - group
 
  - vector containing 1s and 2s that correspond to the location of samples in
  the columns of x (and y if provided). For
  example, the control is group 1 and the experimental group 2, and the
  locations of samples corresponding to the two groups matches the locations of
  1s and 2s in the group vector
 
`createVectors()` is then run as follows:
```{r}
groups <- c(rep(1,10), rep(2,20))
# Within -omics
wthn_vectors <- createVectors(x = TCGA_GBM_transcript_microarray, 
                              groups = groups)
# Between -omics
btwn_vectors <- createVectors(x = TCGA_GBM_miRNA_microarray, 
                              y = TCGA_GBM_transcript_microarray, 
                              groups = groups)
```
`createVectors()` returns a list with two elements, `v1` and `v2`, which are the
correlation vectors of molecular feature pairs corresponding to
samples labeled `1` and `2` using the `groups` argument, respectively. Each
vector is a numeric named vector with names indicating each feature in the pair
separated by an underscore. Below are the first few correlations for each group,
first from the within -omics analysis and second from the between -omics 
analysis.
```{r}
# Within -omics
head(wthn_vectors$v1)
head(wthn_vectors$v2)
```
```{r}
# Between -omics
head(btwn_vectors$v1)
head(btwn_vectors$v2)
```
## Correlation Metrics
The function `createVectors()` provides several options for correlation metrics
using the argument `cor.method`. The methods provided include `"spearman"` (the
default metric), `"pearson"`, `"bwmc"`, and `"sparcc"`. For information and 
comparison of Spearman, Pearson and biweight midcorrelation (bwmc) see
[Song et al](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-328) 
(@song). We have also investigated correlation metrics in Discordant in relation
to sequencing data and found Spearman's correlation had the best performance 
(@siska2).
The algorithm for SparCC was introduced by Friedman et al. (@friedman), and we
use code provided by Huaying Fang (@fang).
# Calling Discordant
The Discordant Algorithm is implemented in the the function `discordantRun()` 
which requires two correlation vectors and the original data. If the user wishes
to generate their own correlation vector before inputting the data set, 
they can do so. However, the function will return an error message if the 
dimensions of the data sets inserted do not match the correlation vector.
`discordantRun()` is called as follows:
```{r}
# Within -omics
wthn_result <- discordantRun(v1 = wthn_vectors$v1, 
                             v2 = wthn_vectors$v2, 
                             x = TCGA_GBM_transcript_microarray)
# Between -omics
btwn_result <- discordantRun(v1 = btwn_vectors$v1, 
                             v2 = btwn_vectors$v2, 
                             x = TCGA_GBM_miRNA_microarray, 
                             y = TCGA_GBM_transcript_microarray)
```
## Output 
The posterior probability output of the Discordant algorithm are the 
differential correlation posterior probabilities (the sum of the off-diagonal of
the class matrix described above). If the user wishes to observe more 
detailed information, alternative outputs are available. `discordantRun()` has 
six outputs:
__discordPPMatrix__
Matrix of differential correlation posterior probabilities where rows and
columns reflect features. If only x was inputted, then the number of rows and 
columns are number of features in x. The rows and column names are the feature
names, and the upper diagonal of the matrix are NAs to avoid repeating 
results. If x and y are inputted, the number of rows is the feature size of x 
and the number of columns the feature size of y. The row names are features 
from x and the column names are features from y.
```{r}
# Within -omics
wthn_result$discordPPMatrix[1:5, 1:4]
# Between -omics
btwn_result$discordPPMatrix[1:5, 1:4]
```
__discordPPVector__
Vector of differential correlation posterior probabilities. The length is
the number of feature pairs. The names of the vector are the feature pairs.
```{r}
# Within -omics
head(wthn_result$discordPPVector)
# Between -omics
head(btwn_result$discordPPVector)
```
__classMatrix__
Matrix of classes with the highest posterior probability for each pair. 
Row and column names are the same as in `discordPPMatrix` and determined by 
whether only x is inputted or both x and y.
```{r}
# Within -omics
wthn_result$classMatrix[1:5,1:4]
# Between -omics
btwn_result$classMatrix[1:5,1:4]
```
__classVector__
Vector of class with the highest posterior probability for each pair. The 
length is the number of feature pairs. Names of vector correspond to the 
feature pairs, similar to `discordPPVector`.
```{r}
# Within -omics
head(wthn_result$classVector)
# Between -omics
head(btwn_result$classVector)
```
__probMatrix__
Matrix of all posterior probabilities, where the number of rows is
the number of feature pairs and the columns represent the class within the 
class matrix. The number of columns can be 9 or 25, depending on how many 
mixture components are chosen (discussed later). The values across each row 
add up to 1. Posterior probabilities in `discordPPMatrix` and 
`discordPPVector` are the summation of columns that correspond to 
differential correlation classes (described above). Each column corresponds
to the respectively numbered element from the class matrix above for three 
components or the class matrix described below for five components.
```{r}
# Within -omics
round(head(wthn_result$probMatrix), 2)
# Between -omics
round(head(btwn_result$probMatrix), 2)
```
__loglik__
The log likelihood from the model fit.
```{r}
# Within -omics
wthn_result$loglik
# Between -omics
btwn_result$loglik
```
## Subsampling
Subsampling is an option to run the EM algorithm with a random sample of 
independent feature pairs. This is repeated for a number of samplings, and then 
the average of these parameters are used to maximize posterior probabilities for
all feature pairs. This option was introduced to speed up Discordant method and 
to also address the independence assumption. 
The argument `subsampling` must be set to `TRUE` for subsampling to be used. By
default, the number of independent feature pairs to be subsampled is half the
total number of features divided by two for within -omics analysis and the 
number of features in the data set with fewer features for between -omics 
analysis. This number may be altered by users using the `subSize` argument, but
the value set by users cannot exceed the default value, as the default value
is the maximum number of independent sample pairs possible for a given analysis.
The number of random samplings to be run is set by the argument 
`iter` which has a default value of 100.
As discussed in the next section, the discordant method requires a sufficient
number of features to estimate components, and using subsampling reduces the
quantity of features used for analysis, so subsampling should be reserved for 
larger data sets. For some data sets, certain random samples will be sufficient,
while others may not be. For those data sets, the subsampling algorithm will
allow up to 10% of iterations to fail and be repeated. If more than 10% of
iterations fail, `discordantRun()` will throw an error with potential solutions,
as shown below.
```{r, error = TRUE}
# Between -omics
btwn_result <- discordantRun(v1 = btwn_vectors$v1, 
                             v2 = btwn_vectors$v2, 
                             x = TCGA_GBM_miRNA_microarray, 
                             y = TCGA_GBM_transcript_microarray,
                             components = 3,
                             subsampling = TRUE)
```
Given the limited number of features in the `TCGA_GBM` data sets, they are not
a suitable candidate for subsampling, so we will instead use the 
`TCGA_Breast_miRNASeq` and `TCGA_Breast_RNASeq` data sets to demonstrate
a multi -omics analysis with subsampling. Note that a seed is set prior to 
calling `discordantRun()` with `subsampling = TRUE`, as there is a randomness
involved in drawing subsamples, and results may differ using different seeds.
```{r}
# Load Data
data(TCGA_Breast_miRNASeq_voom)
data(TCGA_Breast_RNASeq_voom)
# Prepare groups
groups <- c(rep(1, 15), rep(2, 42))
# Create correlation vectors
sub_vectors <- createVectors(x = TCGA_Breast_miRNASeq_voom, 
                             y = TCGA_Breast_RNASeq_voom,
                             groups = groups)
# Run analysis with subsampling
set.seed(126)
sub_result <- discordantRun(sub_vectors$v1, sub_vectors$v2,
                            x = TCGA_Breast_miRNASeq_voom,
                            y = TCGA_Breast_RNASeq_voom,
                            components = 3, subsampling = TRUE)
# Results
round(head(sub_result$probMatrix), 2)
```
## Five Components
We also provide the option to increase component size from three to five in the 
mixture model. The number of classes in the class matrix increases, as seen in 
the table below. Incorporating the extra components means that it is possible to
identify elevated differential correlation, which is when there are associations
in both groups in the same direction but one is more extreme. Using this option
introduces more parameters, which does have an effect on run-time. We also found
that using the five mixture component mixture model reduces performance compared
to the three component mixture model(@siska2). However, the option is 
available if users wish to explore more types of differential correlation.
\[
\begin{array}{c|c c c c c}
  \text{} & \text{0} & \text{-}  & \text{--} & \text{+} & \text{++} \\ 
\hline
  0 & 1 & 2 & 3 & 4 & 5 \\
  - & 6 & 7 & 8 & 9 & 10  \\  
  -- & 11 & 12 & 13 & 14 & 15 \\
  + & 16 & 17 & 18 & 19 & 20 \\
  ++ & 21 & 22 & 23 & 24 & 25
\end{array}
\]
By default, `discordantRun()` uses a three component mixture model, but this may
be changed to a five component mixture model by setting the argument
`components = 5`. A greater amount of data (specifically a greater number of 
features) is necessary to accurately estimate 5 components compared to 3.
If an insufficient amount of data is used, `discordantRun()` will throw an error
suggesting the user increase the number of features or reduce the chosen
number of components. The data used for the within -omics analysis above does
not have enough features to estimate 5 components, so an error is thrown below.
```{r, error = TRUE}
# Within -omics
wthn_result <- discordantRun(v1 = wthn_vectors$v1, 
                             v2 = wthn_vectors$v2, 
                             x = TCGA_GBM_transcript_microarray,
                             components = 5)
```
However, the between -omics analysis above is a suitable candidate for further
analysis using five components, so an example of such an analysis is provided
below.
```{r}
# Between -omics
btwn_result <- discordantRun(v1 = btwn_vectors$v1, 
                             v2 = btwn_vectors$v2, 
                             x = TCGA_GBM_miRNA_microarray, 
                             y = TCGA_GBM_transcript_microarray,
                             components = 5)
# Between -omics
round(head(btwn_result$probMatrix), 2)
```
# Session Info
```{r}
sessionInfo()
```
# References