---
title: "mixOmics"
author:
- name: Kim-Anh Lê Cao
affiliation: Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Australia
email: mixomics@math.univ-toulouse.fr
- name: Sébastien Déjean
affiliation: Institut de Mathématiques de Toulouse, UMR 5219, CNRS and Université de Toulouse, France
package: mixOmics
output:
BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{mixOmics}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: ["mybib.bib"]
biblio-style: apalike
link-citations: true
header-includes:
- \usepackage{color}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
```{r setup, include = FALSE}
library(knitr)
# global options
knitr::opts_chunk$set(dpi = 100, echo=TRUE, warning=FALSE, message=FALSE, eval = TRUE,
fig.show = 'asis', fig.width = 7, fig.height = 6, fig.align = 'center', out.width = '80%'#,
#fig.path= 'Figures/'
)
# knitr::opts_chunk$set(
# collapse = TRUE,
# comment = "#>"
# )
```
# Bookdown version
This vignette is also available in bookdown format on our GitHub page: https://mixomicsteam.github.io/Bookdown/.
# Introduction {#intro}
`mixOmics` is an R toolkit dedicated to the exploration and integration of biological data sets with a specific focus on variable selection. The package currently includes nineteen multivariate methodologies, mostly developed by the `mixOmics` team (see some of our references in \@ref(intro:pubs)). Originally, all methods were designed for omics data; however, their application is not limited to biological data. Other applications where integration is required can also be considered, mostly when the predictor variables are continuous (see also \@ref(intro:datatypes)).
In `mixOmics`, a strong focus is given to graphical representation to better translate and understand the relationships between the different data types and visualize the correlation structure at both sample and variable levels.
## Input data {#intro:datatypes}
Note the data pre-processing requirements before analysing data with `mixOmics`:
- **Types of data**. Different types of biological data can be explored and integrated with `mixOmics`. Our methods can handle molecular features measured on a continuous scale (e.g. microarray, mass spectrometry-based proteomics and metabolomics) or sequence-based count data (RNA-seq, 16S, shotgun metagenomics) that become 'continuous' data after pre-processing and normalisation.
- **Normalisation**. The package does not handle normalisation, as it is platform specific and the variety of data types we cover is too wide. Prior to the analysis, we assume the data sets have been normalised using appropriate methods and pre-processed when applicable.
- **Prefiltering**. While `mixOmics` methods can handle large data sets (several tens of thousands of predictors), we recommend pre-filtering the data to fewer than 10,000 predictor variables per data set, for example by using the Median Absolute Deviation [@Ten16] for RNA-seq data, by removing consistently low counts in microbiome data sets [@Lec16], or by removing near-zero variance predictors. Such a step aims to lessen the computational time during the parameter tuning process (see the short filtering sketch after this list).
- **Data format**.
Our methods use matrix decomposition techniques. Therefore, the numeric data matrix or data frames have $n$ observations or samples in rows and $p$ predictors or variables (e.g. genes, proteins, OTUs) in columns.
- **Covariates**. In the current version of `mixOmics`, covariates that may confound the analysis are not included in the methods. We recommend correcting for those covariates beforehand using appropriate univariate or multivariate methods for batch effect removal. Contact us for more details, as we are currently working on this aspect.
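As an illustration of the pre-filtering step mentioned above, here is a minimal base R sketch (not a `mixOmics` function) that keeps the 10,000 features with the largest Median Absolute Deviation; `data` is assumed to be a hypothetical samples $\times$ features numeric matrix:
```{r, eval = FALSE}
# minimal pre-filtering sketch: keep the 10,000 most variable features by MAD
# (`data` is a hypothetical samples x features numeric matrix)
mads <- apply(data, 2, mad)
keep <- order(mads, decreasing = TRUE)[seq_len(min(10000, ncol(data)))]
data.filtered <- data[, keep]
```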
## Methods
### Some background knowledge {#intro:background}
We list here the main methodological or theoretical concepts you need to know to be able to efficiently apply `mixOmics`:
- **Individuals, observations or samples**: the experimental units on which information is collected, e.g. patients, cell lines, cells, faecal samples ...
- **Variables, predictors**: read-out measured on each sample, e.g. gene (expression), protein or OTU (abundance), weight ...
- **Variance**: measures the spread of one variable. In our methods, we estimate the variance of components rather than of the variable read-outs. A high variance indicates that the data points are very spread out from the mean, and from one another (scattered).
- **Covariance**: measures the strength of the relationship between two variables, i.e. whether they co-vary. A high covariance value indicates a strong relationship, e.g. weight and height in individuals frequently vary roughly in the same way: the heaviest people tend to be the tallest. A covariance value has no lower or upper bound.
- **Correlation**: a standardized version of the covariance that is bounded by -1 and 1.
- **Linear combination**: variables are combined by multiplying each of them by a coefficient and adding the results. A linear combination of height and weight could be 2 $*$ weight - 1.5 $*$ height, with the coefficients 2 and -1.5 assigned to weight and height respectively (see the numerical illustration after this list).
- **Component**: an artificial variable built from a linear combination of the observed variables in a given data set. Variable coefficients are optimally defined based on some statistical criterion. For example, in Principal Component Analysis, the coefficients of a (principal) component are defined so as to maximise the variance of that component.
- **Loadings**: variable coefficients used to define a component.
- **Sample plot**: representation of the samples projected into a small space spanned (defined) by the components. Sample coordinates are determined by their component values, or scores.
- **Correlation circle plot**: representation of the variables in a space spanned by the components. Each variable coordinate is defined as the correlation between the original variable values and each component. A correlation circle plot enables us to visualise the correlation between variables (negative or positive correlation, defined by the cosine of the angle between variable points, seen from the centre of the circle) and the contribution of each variable to each component (defined by the absolute value of its coordinate on each component). For this interpretation, the data need to be centred and scaled (the default in most of our methods, except PCA). For more details on this insightful graphic, see Figure 1 in [@Gon12].
- **Unsupervised analysis**: the method does not take into account any known sample groups and the analysis is exploratory. Examples of unsupervised methods covered in this vignette are Principal Component Analysis (PCA, Chapter \@ref(pca)), Projection to Latent Structures (PLS, Chapter \@ref(pls)), and also Canonical Correlation Analysis (CCA, not covered here).
- **Supervised analysis**: the method includes a vector indicating the class membership of each sample. The aim is to discriminate sample groups and perform sample class prediction. Examples of supervised methods covered in this vignette are PLS Discriminant Analysis (PLS-DA, Chapter \@ref(plsda)), DIABLO (Chapter \@ref(diablo)) and also MINT (not covered here [@Roh16]).
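To make the variance, covariance, correlation and linear combination concepts above more concrete, here is a toy numerical illustration in base R (the weight and height values are made up for the example):
```{r}
# toy data to illustrate the concepts above (values are made up)
weight <- c(60, 72, 85, 90)          # in kg
height <- c(1.60, 1.75, 1.80, 1.90)  # in m
var(weight)          # variance: spread of a single variable
cov(weight, height)  # covariance: unbounded measure of co-variation
cor(weight, height)  # correlation: standardised covariance, bounded by -1 and 1
2 * weight - 1.5 * height  # the linear combination described above
```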
### Overview {#intro:overview}
Here is an overview of the most widely used methods in `mixOmics` that will be further detailed in this vignette, with the exception of rCCA and MINT. We depict them along with the type of data set they can handle.
```{r overview, echo=FALSE, message=FALSE}
library(mixOmics)
coul <- color.mixo(1:3)
plot(0, type="n", xlim=c(0,100), ylim=c(0,100), axes=FALSE,
xlab="",ylab="", main="mixOmics overview")
box()
# PCA
rect(xleft = 20, ybottom = 75, xright = 40, ytop = 95, col=coul[1])
text(5, 85, "PCA")
# PLS-DA
rect(xleft = 20, ybottom = 50, xright = 40, ytop = 70, col=coul[1])
rect(xleft = 43, ybottom = 50, xright = 45, ytop = 70, col=coul[2])
text(5, 60, "PLS-DA")
# PLS
rect(xleft = 20, ybottom = 25, xright = 40, ytop = 45, col=coul[1])
rect(xleft = 43, ybottom = 25, xright = 60, ytop = 45, col=coul[1])
text(5, 35, "PLS")
# DIABLO
rect(xleft = 20, ybottom = 0, xright = 40, ytop = 20, col=coul[1])
rect(xleft = 43, ybottom = 0, xright = 60, ytop = 20, col=coul[1])
points(x=61, y=10, pch=16, col=coul[3], cex=0.5)
points(x=62.5, y=10, pch=16, col=coul[3], cex=0.5)
points(x=64, y=10, pch=16, col=coul[3], cex=0.5)
rect(xleft = 65, ybottom = 0, xright = 80, ytop = 20, col=coul[1])
rect(xleft = 85, ybottom = 0, xright = 88, ytop = 20, col=coul[2])
text(5, 10, "DIABLO")
# legend
rect(xleft = 75, ybottom = 95, xright = 77, ytop = 97, col=coul[1])
text(90, 96, "Quantitative", cex=0.75)
rect(xleft = 75, ybottom = 90, xright = 77, ytop = 92, col=coul[2])
text(90, 91, "Qualitative", cex=0.75)
```
```{r methods, echo=FALSE, fig.cap="List of methods in mixOmics, sparse indicates methods that perform variable selection", out.width='100%', fig.align='center'}
knitr::include_graphics("XtraFigs/Methods.png")
```
```{r cheatsheet, echo=FALSE, fig.cap="Main functions and parameters of each method", out.width= '100%', fig.align='center'}
knitr::include_graphics("XtraFigs/cheatsheet.png")
```
### Key publications {#intro:pubs}
#### Methods papers
The methods implemented in `mixOmics` are described in detail in the following publications. A more extensive list can be found at this [link](http://mixomics.org/a-propos/publications/).
- **DIABLO** (New): Singh A, Gautier B, Shannon C, Vacher M, Rohart F, Tebbutt S, K-A. Le Cao (2019). [DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/bty1054). *Bioinformatics* bty1054.
- **Overview and recent integrative methods**: Rohart F., Gautier B., Singh A. and Le Cao K.-A. (2017). [mixOmics: an R package for omics feature selection and multiple data integration](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005752). *PLoS Comput Biol* 13(11): e1005752.
- **Graphical outputs for integrative methods**: [@Gon12] Gonzalez I., Le Cao K.-A., Davis, M.D. and Dejean S. (2012) [Insightful graphical outputs to explore relationships between two omics data sets](https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-5-19). *BioData Mining* 5:19.
- **sparse PLS**: Le Cao K.-A., Martin P.G.P, Robert-Granie C. and Besse, P. (2009) [Sparse Canonical Methods for Biological Data Integration: application to a cross-platform study](http://www.biomedcentral.com/1471-2105/10/34/). *BMC Bioinformatics*, 10:34.
- **sparse PLS-DA**: Le Cao K.-A., Boitard S. and Besse P. (2011) [Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-253). *BMC Bioinformatics*, 12:253.
- **sPLS-DA for microbiome data**: Le Cao K-A$^*$, Costello ME$^*$, Lakis VA, Bartolo F, Chua XY, Brazeilles R and Rondeau P. (2016) [MixMC: Multivariate insights into Microbial Communities](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0160169). *PLoS ONE* 11(8): e0160169.
- **Multilevel approach for repeated measurements**: Liquet B, Le Cao K-A, Hocini H, Thiebaut R (2012). [A novel approach for biomarker selection and the integration of repeated measures experiments from two assays](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-325). *BMC Bioinformatics*, 13:325
#### Biological application papers
For those interested in how these methods have been applied to omics biological problems:
- Lee AH, Shannon CS, [...], some members of the mixOmics team and Kollmann T (2019). [Dynamic molecular changes during the first week of human life follow a robust developmental trajectory](https://www.nature.com/articles/s41467-019-08794-x). *Nature Communications* **10**: 1092. *We used DIABLO and other multi-omics integrative methods to integrate 3 - 5 omics data sets*.
- Gavin PG, [...], and Hamilton-Williams EE (2018). [Intestinal metaproteomics reveals host-microbiota interactions in subjects at risk for type 1 diabetes](https://www.nature.com/articles/s41467-019-08794-x). *Diabetes Care* **41**: 10. *We used DIABLO to integrate microbiome, proteomics and meta-proteomics data*.
## Outline of this Vignette
- **Chapter \@ref(start)** details some practical aspects to get started
- **Chapter \@ref(pca)**: Principal Components Analysis (PCA)
- **Chapter \@ref(plsda)**: Projection to Latent Structure - Discriminant Analysis (PLS-DA)
- **Chapter \@ref(pls)**: Projection to Latent Structures (PLS)
- **Chapter \@ref(diablo)**: Integrative analysis for multiple data sets (DIABLO)
Each methods chapter has the following outline:
1. Type of biological question to be answered
2. Brief description of an illustrative data set
3. Principle of the method
4. Quick start of the method with the main functions and arguments
5. To go further: customized plots, additional graphical outputs and tuning parameters
6. FAQ
## Other methods not covered in this vignette
Other methods not covered in this document are described on our website and the following references:
- [regularised Canonical Correlation Analysis](http://www.mixOmics.org), see the **Methods** and **Case study** tabs, and [@Gon08] that describes CCA for large data sets.
- [Microbiome (16S, shotgun metagenomics) data analysis](http://www.mixOmics.org/mixmc), see also [@Lec16], and [kernel integration for microbiome data](http://mixomics.org/mixkernel). The latter is a collaboration with Drs J. Mariette and N. Villa-Vialaneix (INRA Toulouse, France); an example is provided for the Tara Oceans metagenomics and environmental data, see also [@Mar17].
- [MINT](http://mixomics.org/mixmint/) or P-integration, to integrate independently generated transcriptomics data sets. An example in stem cell studies is provided, see also [@Roh16].
# Let's get started {#start}
## Installation
First, download the latest `mixOmics` version from Bioconductor:
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mixOmics")
```
Alternatively, you can install the *latest* development version of the package from GitHub:
```{r, eval = FALSE}
install.packages("devtools")
# then load
library(devtools)
install_github("mixOmicsTeam/mixOmics")
```
The `mixOmics` package will directly import the following packages:
`r Rpackage("igraph")`, `r Rpackage("rgl")`, `r Rpackage("ellipse")`, `r Rpackage("corpcor")`, `r Rpackage("RColorBrewer")`, `r Rpackage("parallel")`, `r Rpackage("dplyr")`, `r Rpackage("tidyr")`, `r Rpackage("reshape2")`, `r Rpackage("methods")`, `r Rpackage("matrixStats")`, `r Rpackage("rARPACK")`, `r Rpackage("gridExtra")`.
For *Apple macOS users*: if you are unable to install the imported library `rgl`, you will need to install the [XQuartz software](https://www.xquartz.org) first.
## Load the package {#start:upload}
```{r, message=FALSE}
library(mixOmics)
```
Check that there is no error when loading the package, especially for the `rgl` library (see above).
## Upload data
The examples we give in this vignette use data that are already part of the package, *which means that the objects are stored as a list. However, you do not need to store your data as a list; data frames will do*. To upload your own data, first check that your working directory is set, then read your data from a `.txt` or `.csv` file, either by using **File > Import Dataset** in RStudio (*although be mindful that this option may cause problems with row and column name headers*) or via one of these command lines:
```{r, eval = FALSE}
# from csv file
data <- read.csv("your_data.csv", row.names = 1, header = TRUE)
# from txt file
data <- read.table("your_data.txt", header = TRUE)
# then in the argument in the mixOmics functions, just fill with:
# X = data
```
For more details about the arguments of those functions, type `?read.csv` or `?read.table` in the R console. In this vignette, we assume you are sufficiently proficient in R to input your data and carry on with the mixOmics analyses!
## Quick start in `mixOmics` {#start:PCA}
Each analysis should follow this workflow:
1. Run the method
2. Graphical representation of the samples
3. Graphical representation of the variables
Then use your critical thinking, along with additional functions and visual tools, to make sense of your data! Some of these are listed in \@ref(intro:overview) and will be described in the next chapters.
For instance, for Principal Components Analysis, we first load the data:
```{r}
data(nutrimouse)
X <- nutrimouse$gene
```
Then use the following steps:
```{r}
MyResult.pca <- pca(X) # 1 Run the method
plotIndiv(MyResult.pca) # 2 Plot the samples
plotVar(MyResult.pca) # 3 Plot the variables
```
This is only a first quick start; there will be many avenues you can take to deepen your exploratory and integrative analyses. The package proposes several methods to perform variable (feature) selection to identify the relevant information from rather large omics data sets. The sparse methods are listed in the table in \@ref(intro:overview).
Following our example here, sparse PCA can be applied to select the variables contributing most to each of the two components in PCA. The user specifies the number of variables to select on each component; for example, here 5 variables are selected on each of the first two components (`keepX = c(5,5)`):
```{r}
MyResult.spca <- spca(X, keepX=c(5,5)) # 1 Run the method
plotIndiv(MyResult.spca) # 2 Plot the samples
plotVar(MyResult.spca) # 3 Plot the variables
```
You can see now that we have considerably reduced the number of genes in the `plotVar` correlation circle plot.
Do not stop here! We are not done yet. You can enhance your analyses with the following:
- Have a look at our manual and each of the functions and their examples, e.g. `?pca`, `?plotIndiv`, `?spca`, ...
- Run the examples from the help file using the `example` function: `example(pca)`, `example(plotIndiv)`, ...
- Have a look at our [website](http://www.mixomics.org) that features many tutorials and case studies,
- Keep reading this vignette, this is *just the beginning!*
# Principal Component Analysis (PCA) {#pca}
```{r overview-PCA, echo=FALSE, message=FALSE}
library(mixOmics)
coul <- color.mixo(1:3)
plot(0, type="n", xlim=c(0,100), ylim=c(83,97), axes=FALSE,
xlab="",ylab="", main="PCA overview")
box()
# PCA
rect(xleft = 20, ybottom = 85, xright = 60, ytop = 95, col=coul[1])
text(5, 90, "PCA")
# legend
rect(xleft = 85, ybottom = 90, xright = 87, ytop = 92, col=coul[1])
text(90, 94, "Quantitative", cex=0.75)
#rect(xleft = 75, ybottom = 90, xright = 77, ytop = 92, col=coul[2])
#text(90, 91, "Qualitative", cex=0.75)
```
## Biological question
*I would like to identify the major sources of variation in my data and identify whether such sources of variation correspond to biological conditions or experimental bias. I would like to visualise trends or patterns between samples, and whether they 'naturally' cluster according to known biological conditions.*
## The `liver.toxicity` study
The `liver.toxicity` data set is a list in the package that contains:
- `gene`: a data frame with 64 rows and 3116 columns, corresponding to the expression levels of 3,116 genes measured on 64 rats.
- `clinic`: a data frame with 64 rows and 10 columns, corresponding to the measurements of 10 clinical variables on the same 64 rats.
- `treatment`: a data frame with 64 rows and 4 columns, indicating the treatment information for the 64 rats, such as doses of acetaminophen and times of necropsy.
- `gene.ID`: a data frame with 3116 rows and 2 columns, indicating the GenBank IDs of the annotated genes.
More details are available at `?liver.toxicity`.
To illustrate PCA, we focus on the expression levels of the genes in the data frame `liver.toxicity$gene`. Some of the terms mentioned below are listed in \@ref(intro:background).
## Principle of PCA
The aim of PCA [@Jol05] is to reduce the dimensionality of the data whilst retaining as much information as possible. 'Information' here refers to *variance*. The idea is to create uncorrelated artificial variables called *principal components* (PCs) that combine in a linear manner the original (possibly correlated) variables (e.g. genes, metabolites, etc.).
Dimension reduction is achieved by projecting the data into the space spanned by the principal components (PCs). In practice, it means that each sample is assigned a score on each new PC dimension; this score is calculated as a linear combination of the original variables to which a weight is applied. The weights of each of the original variables are stored in the so-called *loading vectors* associated with each PC. Projecting the data into the smaller subspace spanned by the PCs thus reduces the dimension of the data while capturing the largest sources of variation between samples.
The principal components are obtained so that their variance is maximised. To that end, we calculate the eigenvectors/eigenvalues of the variance-covariance matrix, often via singular value decomposition when the number of variables is very large. The data are usually centred (`center = TRUE`), and sometimes scaled (`scale = TRUE`) in the method. The latter is especially advised in the case where the variance is not homogeneous across variables.
The first PC is defined as the linear combination of the original variables that explains the greatest amount of variation. The second PC is then defined as the linear combination of the original variables that accounts for the greatest amount of the remaining variation, subject to being orthogonal (uncorrelated) to the first component. Subsequent components are defined likewise. The user must therefore report how much information is explained by the first PCs, as these are used to graphically represent the PCA outputs.
## Load the data
We first load the data from the package. See \@ref(start:upload) to upload your own data.
```{r echo=TRUE, message=FALSE}
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
```
## Quick start
```{r, fig.show = 'hide'}
MyResult.pca <- pca(X) # 1 Run the method
plotIndiv(MyResult.pca) # 2 Plot the samples
plotVar(MyResult.pca) # 3 Plot the variables
```
If you were to run `pca` with this minimal code, you would be using the following default values:
- `ncomp = 2`: the first two principal components are calculated and are used for graphical outputs;
- `center = TRUE`: data are centred (mean = 0);
- `scale = FALSE`: data are not scaled; `scale = TRUE` standardises each variable (variance = 1).
Other arguments can also be chosen, see `?pca`.
This example was shown in Chapter \@ref(start:PCA). The two plots are not extremely meaningful: specific sample patterns should be further investigated, and the variable correlation circle plot contains too many variables to be easily interpreted. Let's customise those graphics as shown below to improve interpretation.
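As a sanity check of the principle described above, the sketch below verifies numerically that the PC scores are linear combinations of the centred data weighted by the loadings. The slot names `loadings$X` and `variates$X` are assumed from recent `mixOmics` versions:
```{r, eval = FALSE}
# numerical check (defaults center = TRUE, scale = FALSE assumed):
# each PC score is a linear combination of the centred data and the loadings
X.centred <- scale(X, center = TRUE, scale = FALSE)
scores.manual <- X.centred %*% MyResult.pca$loadings$X
max(abs(scores.manual - MyResult.pca$variates$X))  # expected to be close to 0
```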
## To go further
### Customize plots
Plots can be customised using numerous options in `plotIndiv` and `plotVar`. For instance, even though PCA does not take into account any information regarding the known group membership of each sample, we can include such information on the sample plot to visualise any 'natural' clusters that may correspond to biological conditions.
Here is an example where we include the sample groups information with the argument `group`:
```{r eval=TRUE, fig.show='hide'}
plotIndiv(MyResult.pca, group = liver.toxicity$treatment$Dose.Group,
legend = TRUE)
```
Additionally, two factors can be displayed using both colours (argument `group`) and symbols (argument `pch`). For example here we display both Dose and Time of exposure and improve the title and legend:
```{r}
plotIndiv(MyResult.pca, ind.names = FALSE,
group = liver.toxicity$treatment$Dose.Group,
pch = as.factor(liver.toxicity$treatment$Time.Group),
legend = TRUE, title = 'Liver toxicity: genes, PCA comp 1 - 2',
legend.title = 'Dose', legend.title.pch = 'Exposure')
```
Including the information related to the dose of acetaminophen and time of exposure enables us to see a cluster of low-dose samples (blue and orange, top left, at 50 and 100 mg respectively), whereas samples with high doses (1500 and 2000 mg, in grey and green respectively) are more scattered, and highlight an exposure effect.
To display the results on other components, we can change the `comp` argument provided we have requested enough components to be calculated. Here is our second PCA with 3 components:
```{r eval=TRUE, fig.show=TRUE}
MyResult.pca2 <- pca(X, ncomp = 3)
plotIndiv(MyResult.pca2, comp = c(1,3), legend = TRUE,
group = liver.toxicity$treatment$Time.Group,
          title = 'Liver toxicity: genes, PCA comp 1 - 3')
```
Here, the 3rd component on the y-axis clearly highlights a time of exposure effect.
### Amount of variance explained and choice of number of components
The amount of variance explained can be displayed with a screeplot, or extracted as the numerical proportions of explained variance and cumulative proportions:
```{r}
plot(MyResult.pca2)
MyResult.pca2
```
There are no clear guidelines on how many components should be included in PCA: it is data dependent and level of noise dependent. We often look for an 'elbow' on the screeplot above as an indicator that adding PCs does not drastically contribute to explaining the remaining variance.
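The proportions can also be extracted directly from the fitted object; as a hedged sketch, note that depending on your `mixOmics` version the slot may be named `prop_expl_var` or `explained_variance`:
```{r, eval = FALSE}
# extract the proportions numerically (slot names may vary between versions)
MyResult.pca2$prop_expl_var$X  # proportion of variance explained per PC
MyResult.pca2$cum.var          # cumulative proportion of explained variance
```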
### Other useful plots
We can also have a look at the variable coefficients in each component with the loading vectors. The loading weights are represented in decreasing order from bottom to top in `plotLoadings`. Their absolute value indicates the importance of each variable to define each PC, as represented by the length of each bar. See `?plotLoadings` to change the arguments.
```{r eval=TRUE, fig.show = 'hide'}
# a minimal example
plotLoadings(MyResult.pca)
```
```{r eval=TRUE, fig.show = 'hide'}
# a customized example to only show the top 100 genes
# and their gene name
plotLoadings(MyResult.pca, ndisplay = 100,
name.var = liver.toxicity$gene.ID[, "geneBank"],
size.name = rel(0.3))
```
Such a representation will be more informative once we select a few variables in the next section \@ref(sPCA).
Plots can also be displayed in 3 dimensions using the option `style="3d"`, and interactively (we use the `rgl` package for this).
```{r eval= FALSE, fig.show = 'hide'}
plotIndiv(MyResult.pca2,
group = liver.toxicity$treatment$Dose.Group, style="3d",
legend = TRUE, title = 'Liver toxicity: genes, PCA comp 1 - 2 - 3')
```
## Variable selection with sparse PCA {#sPCA}
### Biological question
*I would like to apply PCA but also be able to identify the key variables that contribute to the explanation of most variance in the data set.*
Variable selection can be performed using the sparse version of PCA implemented in `spca` [@She08]. The user needs to provide the number of variables to select on each PC. Here, for example, we select the top 15 genes contributing to the definition of PC1, the top 10 genes contributing to PC2, and the top 5 genes for PC3 (`keepX = c(15,10,5)`).
```{r}
MyResult.spca <- spca(X, ncomp = 3, keepX = c(15,10,5)) # 1 Run the method
plotIndiv(MyResult.spca, group = liver.toxicity$treatment$Dose.Group, # 2 Plot the samples
pch = as.factor(liver.toxicity$treatment$Time.Group),
legend = TRUE, title = 'Liver toxicity: genes, sPCA comp 1 - 2',
legend.title = 'Dose', legend.title.pch = 'Exposure')
plotVar(MyResult.spca, cex = 1) # 3 Plot the variables
# cex is used to reduce the size of the labels on the plot
```
Selected variables can be identified on each component with the `selectVar` function. Here the coefficient values are extracted, but there are other outputs, see `?selectVar`:
```{r}
selectVar(MyResult.spca, comp = 1)$value
```
Those values correspond to the loading weights that are used to define each component. A large absolute value indicates the importance of the variable in this PC. Selected variables are ranked from the most important (top) to the least important.
We can complement this output with `plotLoadings`. We can see here that all coefficients are negative.
```{r}
plotLoadings(MyResult.spca)
```
If we look at component two, we can see a mix of positive and negative weights (also seen in the `plotVar` output); those correspond to variables that oppose the low and high doses (see the `plotIndiv` output):
```{r, echo=TRUE, results='hide', fig.show='hide'}
selectVar(MyResult.spca, comp=2)$value
plotLoadings(MyResult.spca, comp = 2)
```
## Tuning parameters
For this set of methods, two parameters need to be chosen:
- The number of components to retain,
- The number of variables to select on each component for sparse PCA.
The function `tune.pca` calculates the percentage of variance explained for each component, up to the minimum of the number of rows and columns in the data set. The 'optimal' number of components can be identified if an elbow appears on the screeplot. In the example below the cut-off is not very clear; we could choose 2 components.
```{r eval=TRUE, echo=FALSE, results='hide'}
tune.pca(X)
```
Regarding the number of variables to select in sparse PCA, there is no clear criterion at this stage. As PCA is an exploratory method, we prefer to set arbitrary thresholds that will pinpoint the key variables to focus on during the interpretation stage.
## Additional resources
Additional examples are provided in `example(pca)` and in our case studies on our [website](http://www.mixomics.org) in the **Methods** and **Case studies** sections.
Additional reading in [@She08].
## FAQ
* Should I scale my data before running PCA? (`scale = TRUE` in `pca`)
+ Without scaling: a variable with high variance will solely drive the first principal component
+ With scaling: one noisy variable with low variability will be assigned the same variance as other meaningful variables
* Can I perform PCA with missing values?
+ NIPALS (Non-linear Iterative PArtial Least Squares, implemented in mixOmics) can impute missing values, but the model must be built on a sufficiently large number of components. The proportion of NAs should not exceed 20% of the total data. See the sketch at the end of this FAQ.
* When should I apply a multilevel approach in PCA? (`multilevel` argument in `pca`)
+ When the unique individuals are measured more than once (repeated measures)
+ When the individual variation is less than the treatment or time variation, i.e. when samples from each unique individual tend to cluster together rather than by treatment.
+ When a multilevel vs no multilevel seems to visually make a difference on a PCA plot
+ More details in this [case study](http://mixomics.org/case-studies/multilevel-vac18/)
* When should I apply a CLR transformation in PCA? (`logratio = 'CLR'` argument in `pca`)
+ When data are compositional, i.e. expressed as relative proportions. This is usually the case with microbiome studies as a result of pre-processing and normalisation, see more details [here](http://mixomics.org/mixmc/) and in our case studies in the same tab.
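To illustrate the missing-value answer above, here is a hedged sketch using `impute.nipals`, which is available in recent `mixOmics` versions (argument names may differ in older releases); the missing values are created artificially for the example:
```{r, eval = FALSE}
# sketch: NIPALS-based imputation before a final PCA
X.na <- as.matrix(X)
X.na[sample(length(X.na), 100)] <- NA         # artificially create missing values
X.imputed <- impute.nipals(X.na, ncomp = 10)  # reconstruct with many components
MyResult.pca.imputed <- pca(X.imputed, ncomp = 2)
```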
# PLS - Discriminant Analysis (PLS-DA) {#plsda}
```{r overview-PLSDA, echo=FALSE, message=FALSE}
library(mixOmics)
coul <- color.mixo(1:3)
plot(0, type="n", xlim=c(0,100), ylim=c(83,97), axes=FALSE,
xlab="",ylab="", main="PLSDA overview")
box()
# PLS-DA
rect(xleft = 20, ybottom = 85, xright = 60, ytop = 95, col=coul[1])
rect(xleft = 63, ybottom = 85, xright = 65, ytop = 95, col=coul[2])
text(5, 90, "PLS-DA")
# legend
rect(xleft = 85, ybottom = 92, xright = 87, ytop = 94, col=coul[1])
text(90, 95, "Quantitative", cex=0.75)
rect(xleft = 85, ybottom = 87, xright = 87, ytop = 89, col=coul[2])
text(90, 90, "Qualitative", cex=0.75)
```
## Biological question
*I am analysing a single data set (e.g. transcriptomics data) and I would like to classify my samples into known groups and predict the class of new samples. In addition, I am interested in identifying the key variables that drive such discrimination.*
## The `srbct` study
The data are directly available in a processed and normalised format from the package. The Small Round Blue Cell Tumours (SRBCT) dataset from [@Kha01] includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).
The srbct dataset contains the following:
- `gene`: a data frame with 63 rows and 2308 columns, containing the expression levels of the 2,308 genes measured on the 63 subjects.
- `class`: a factor indicating the tumour class of each individual (4 classes in total).
- `gene.name`: a data frame with 2,308 rows and 2 columns containing further information on the genes.
More details can be found in `?srbct`.
To illustrate PLS-DA, we will analyse the gene expression levels of `srbct$gene` to discriminate the 4 groups of tumours.
## Principle of sparse PLS-DA
Although Partial Least Squares was not originally designed for classification and discrimination problems, it has often been used for that purpose [@Ngu02a;@Tan04]. The response matrix `Y` is qualitative and is internally recoded as a dummy block matrix that records the membership of each observation, i.e. each response category is coded via an indicator variable (see [@mixomics] Suppl. Information S1 for an illustration). The PLS regression (now PLS-DA) is then run as if `Y` were a continuous matrix. This PLS classification trick works well in practice, as demonstrated in many references [@Bar03;@Ngu02a;@Bou07;@Chung10].
Sparse PLS-DA [@Lec11] performs variable selection and classification in a one-step procedure. sPLS-DA is a special case of sparse PLS, described later in \@ref(pls), where $\ell_1$ penalisation is applied on the loading vectors associated with the `X` data set.
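To see what the dummy block matrix looks like, here is a small illustration using the `unmap` helper from `mixOmics` on a made-up factor (the class labels are hypothetical):
```{r}
# illustrate the dummy (indicator) coding underlying PLS-DA
Y.toy <- factor(c("BL", "EWS", "EWS", "NB", "RMS"))  # made-up class labels
Y.dummy <- unmap(Y.toy)            # n x K indicator matrix
colnames(Y.dummy) <- levels(Y.toy)
Y.dummy
```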
## Inputs and outputs
We use the following data input matrices: `X` is a $n \times p$ data matrix, `Y` is a factor vector of length $n$ that indicates the class of each sample, and $Y^*$ is the associated dummy matrix ($n \times K$) with $n$ the number of samples (individuals), $p$ the number of variables and $K$ the number of classes. PLS-DA main outputs are:
- A **set of components**, also called latent variables. There are as many components as the chosen
*dimension* of the PLS-DA model.
- A **set of loading vectors**, which are coefficients assigned to each variable to define each component. Those coefficients indicate the importance of each variable in PLS-DA. Importantly, each loading vector is associated to a particular component. Loading vectors are obtained so that the covariance between a linear combination of the variables from X (the X-component) and the factor of interest Y (the $Y^*$-component) is maximised.
- A **list of selected variables** from `X` and associated to each component if sPLS-DA is applied.
## Set up the data
We first load the data from the package. See \@ref(start:upload) to upload your own data.
We will mainly focus on sparse PLS-DA, which is better suited for large biological data sets where the aim is to identify molecular signatures, as well as to classify samples. We first set up the data as the `X` expression matrix and `Y` as a factor indicating the class membership of each sample. We also check that the dimensions are correct and match:
```{r echo=TRUE, message=FALSE}
library(mixOmics)
data(srbct)
X <- srbct$gene
Y <- srbct$class
summary(Y)
dim(X); length(Y)
```
## Quick start
For a quick start, we arbitrarily set the number of variables to select to 50 on each of the 2 components of PLS-DA (see section \@ref(tuning:sPLSDA) for tuning these values).
```{r, echo=TRUE,results='hide', fig.show = 'hide'}
MyResult.splsda <- splsda(X, Y, keepX = c(50,50)) # 1 Run the method
plotIndiv(MyResult.splsda) # 2 Plot the samples
plotVar(MyResult.splsda) # 3 Plot the variables
selectVar(MyResult.splsda, comp=1)$name # Selected variables on component 1
```
As PLS-DA is a supervised method, the sample plot automatically displays the group membership of each sample. We can observe a clear discrimination between the BL samples and the others on the first component (x-axis), and EWS vs the others on the second component (y-axis). Remember that the discrimination displayed on the first two PLS-DA components is based on a subset of 100 variables (50 selected on each component).
In `plotIndiv`, the axis labels indicate the amount of variation explained per component. Note that the interpretation of this amount is *not* the same as in PCA: in PLS-DA, the aim is to maximise the covariance between `X` and `Y`, not just the variance of `X` as is the case in PCA!
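These amounts can also be extracted numerically from the fitted object; as a hedged sketch, the slot name `prop_expl_var` is assumed from recent `mixOmics` versions:
```{r, eval = FALSE}
# explained variance per component for the X block (slot name may vary)
MyResult.splsda$prop_expl_var$X
```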
If you were to run `splsda` with this minimal code, you would be using the following default values:
- `ncomp = 2`: the first two PLS components are calculated and are used for graphical outputs;
- `scale = TRUE`: data are scaled (variance = 1, strongly advised here);
It is important to note that all Discriminant Analyses (e.g. `plsda`, `mint.splsda`, etc.) use `mode = 'regression'` for deflation.
PLS-DA without variable selection can be performed as:
```{r fig.show='hide'}
MyResult.plsda <- plsda(X,Y) # 1 Run the method
plotIndiv(MyResult.plsda) # 2 Plot the samples
plotVar(MyResult.plsda) # 3 Plot the variables
```
## To go further {#plsda-tgf}
### Customize sample plots {#splsda:plotIndiv}
The sample plots can be improved in various ways. First, if the names of the samples are not meaningful at this stage, they can be replaced by symbols (`ind.names = FALSE`). Confidence ellipses can be plotted for each sample class (`ellipse = TRUE`, confidence level set to 95\% by default, see the argument `ellipse.level`). Additionally, a star plot displays arrows from each group centroid towards each individual sample (`star = TRUE`). A 3D plot is also available, see `plotIndiv` for more details.
```{r}
plotIndiv(MyResult.splsda, ind.names = FALSE, legend=TRUE,
ellipse = TRUE, star = TRUE, title = 'sPLS-DA on SRBCT',
X.label = 'PLS-DA 1', Y.label = 'PLS-DA 2')
```
### Customize variable plots
The display of the variable names can be turned off (`var.names = FALSE`):
```{r}
plotVar(MyResult.splsda, var.names=FALSE)
```
In addition, when using the non-sparse version of PLS-DA, a cut-off can be set to display only the variables that contribute most to the definition of each component. Those variables should be located towards the circle of radius 1, far from the centre.
```{r}
plotVar(MyResult.plsda, cutoff=0.7)
```
In this particular case, no variable selection was performed. Only the display was altered to show a subset of variables.
### Other useful plots
#### Background prediction
A 'prediction' background can be added to the sample plot by calculating a background surface first, before overlaying the sample plot. See `?background.predict` for more details. More details about prediction and prediction distances can be found in [@mixomics], Suppl. Information.
```{r}
background <- background.predict(MyResult.splsda, comp.predicted=2,
dist = "max.dist")
plotIndiv(MyResult.splsda, comp = 1:2, group = srbct$class,
ind.names = FALSE, title = "Maximum distance",
legend = TRUE, background = background)
```
#### ROC
As PLS-DA acts as a classifier, we can plot a ROC curve to complement the sPLS-DA classification performance results detailed in \@ref(tuning:sPLSDA). The AUC is calculated from training cross-validation sets and averaged. Note however that ROC and AUC criteria may not be particularly insightful, or may not be in full agreement with the PLS-DA performance, as the prediction threshold in PLS-DA is based on a specified distance, as described in [@mixomics].
```{r, echo=TRUE,results='hide', fig.keep='all'}
auc.plsda <- auroc(MyResult.splsda)
```
### Variable selection outputs
First, note that the number of variables to select does not need to be identical on each component, for example:
```{r}
MyResult.splsda2 <- splsda(X,Y, ncomp=3, keepX=c(15,10,5))
```
Selected variables are listed in the `selectVar` function:
```{r eval=TRUE}
selectVar(MyResult.splsda2, comp=1)$value
```
and can be visualised with `plotLoadings`, using the argument `contrib = 'max'`, which assigns to each variable bar the colour of the sample group in which the mean (`method = 'mean'`) is maximum. See `example(plotLoadings)` for other options (e.g. min, median):
```{r}
plotLoadings(MyResult.splsda2, contrib = 'max', method = 'mean')
```
Interestingly, from this plot we can see that all variables selected on component 1 are highly expressed in the BL (orange) class. Setting `contrib = 'min'` would highlight that those variables are lowly expressed in the NB (grey) class, which makes sense when we look at the sample plot.
Since 4 classes are being discriminated here, sample plots in 3D may help interpretation:
```{r eval=FALSE, fig.show='hide'}
plotIndiv(MyResult.splsda2, style="3d")
```
### Tuning parameters and numerical outputs {#tuning:sPLSDA}
For this set of methods, three parameters need to be chosen:
1 - The number of components to retain `ncomp`. The rule of thumb is usually $K - 1$ where $K$ is the number of classes, but it is worth testing a few extra components.
2 - The number of variables `keepX` to select on each component for sparse PLS-DA,
3 - The prediction distance to evaluate the classification and prediction performance of PLS-DA.
For **item 1**, the `perf` function evaluates the performance of PLS-DA for a large number of components, using repeated k-fold cross-validation. For example, here we use 3-fold CV repeated 10 times (note that we advise using at least 50 repeats, and choosing a number of folds appropriate for the sample size of the data set):
```{r eval= TRUE}
MyResult.plsda2 <- plsda(X,Y, ncomp=10)
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
MyPerf.plsda <- perf(MyResult.plsda2, validation = "Mfold", folds = 3,
progressBar = FALSE, nrepeat = 10) # we suggest nrepeat = 50
# type attributes(MyPerf.plsda) to see the different outputs
# slight bug in our output function currently, see the quick fix below
#plot(MyPerf.plsda, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
# quick fix
matplot(MyPerf.plsda$error.rate$BER, type = 'l', lty = 1,
        col = color.mixo(1:3),
        main = 'Balanced Error rate')
legend('topright',
       c('max.dist', 'centroids.dist', 'mahalanobis.dist'),
       lty = 1,
       col = color.mixo(1:3)) # colours match the lines drawn by matplot above
```
The plot outputs the classification error rate (or the *balanced* classification error rate, BER, when the number of samples per group is unbalanced) and its standard deviation, according to three prediction distances. Here we can see that for the BER and the maximum distance, the best performance (i.e. low error rate) seems to be achieved at `ncomp = 3`.
In addition (**item 3** for PLS-DA), the numerical outputs listed here can be reported as performance measures:
```{r}
MyPerf.plsda
```
Regarding **item 2**, we now use `tune.splsda` to assess the optimal number of variables to select on each component. We first set up a grid of `keepX` values that will be assessed on each component, one component at a time.
Similar to above, we run 3-fold CV repeated 10 times, with the maximum prediction distance defined as above.
```{r eval=TRUE}
list.keepX <- c(5:10, seq(20, 100, 10))
list.keepX # to output the grid of values tested
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3, # we suggest to push ncomp a bit more, e.g. 4
validation = 'Mfold',
folds = 3, dist = 'max.dist', progressBar = FALSE,
measure = "BER", test.keepX = list.keepX,
nrepeat = 10) # we suggest nrepeat = 50
```
We can then extract the classification error rate averaged across all folds and repeats for each tested `keepX` value, the optimal number of components (see `?tune.splsda` for more details), and the optimal number of variables to select per component, which is summarised in a plot where the diamond indicates the optimal `keepX` value:
```{r}
error <- tune.splsda.srbct$error.rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp # optimal number of components based on t-tests on the error rate
ncomp
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp] # optimal number of variables to select
select.keepX
plot(tune.splsda.srbct, col = color.jet(ncomp))
```
Based on those tuning results, we can run our final and tuned sPLS-DA model:
```{r, fig.show = 'hide'}
MyResult.splsda.final <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
plotIndiv(MyResult.splsda.final, ind.names = FALSE, legend=TRUE,
ellipse = TRUE, title="SPLS-DA, Final result")
```
Additionally, we can run `perf` for the final performance assessment of the sPLS-DA model. Also note that `perf` will output `features`, which lists the frequency of selection of the variables across the different folds and repeats. This is a useful output to assess the confidence in your final variable selection; see a more [detailed example here](http://mixomics.org/case-studies/splsda-srbct/).
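Here is a hedged sketch of that final assessment; the output slot names, such as `$features$stable`, are assumed from recent `mixOmics` versions:
```{r, eval = FALSE}
# assess the final model and the stability of the selected variables
set.seed(30)  # for reproducibility in this vignette, otherwise increase nrepeat
perf.splsda.final <- perf(MyResult.splsda.final, validation = "Mfold",
                          folds = 3, nrepeat = 10, progressBar = FALSE)
perf.splsda.final$error.rate                  # classification error rates
head(perf.splsda.final$features$stable[[1]])  # selection frequency on comp 1
```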
## Additional resources
Additional examples are provided in `example(splsda)` and in our case studies on our [website](http://www.mixomics.org) in the **Methods** and **Case studies** sections, in particular [here](http://mixomics.org/case-studies/splsda-srbct/). Also have a look at [@Lec11].
## FAQ
* Can I discriminate more than two groups of samples (multiclass classification)?
+ Yes, this is one of the advantages of PLS-DA; see the example above.
* Can I have a hierarchy between two factors (e.g. diet nested into genotype)?
+ Unfortunately no, sparse PLS-DA only allows all groups to be discriminated at once (i.e. 4 x 2 = 8 groups when there are 4 diets and 2 genotypes)
* Can I have missing values in my data?
+ Yes in the `X` data set, but you won't be able to do any prediction (i.e. `tune`, `perf`, `predict`)
+ No in the Y factor
# Projection to Latent Structure (PLS) {#pls}
```{r overview-PLS, echo=FALSE, message=FALSE}
library(mixOmics)
coul <- color.mixo(1:3)
plot(0, type="n", xlim=c(0,100), ylim=c(83,97), axes=FALSE,
xlab="",ylab="", main="PLS overview")
box()
# PLS
rect(xleft = 20, ybottom = 85, xright = 50, ytop = 95, col=coul[1])
rect(xleft = 52, ybottom = 85, xright = 73, ytop = 95, col=coul[1])
text(5, 90, "PLS")
# legend
rect(xleft = 85, ybottom = 92, xright = 87, ytop = 94, col=coul[1])
text(90, 95, "Quantitative", cex=0.75)
```
## Biological question
*I would like to integrate two data sets measured on the same samples by extracting correlated information, or by highlighting commonalities between the data sets.*
## The `nutrimouse` study
The `nutrimouse` study contains the expression levels of genes potentially involved in nutritional problems and the concentrations of hepatic fatty acids for forty mice. The data sets come from a nutrigenomic study in the mouse from our collaborator [@Mar07], in which the effects of five regimens with contrasted fatty acid compositions on liver lipids and hepatic gene expression in mice were considered. Two sets of variables were measured on 40 mice:
- `gene`: the expression levels of 120 genes measured in liver cells, selected from about 30,000 as potentially relevant in the context of the nutrition study. These expressions come from a nylon microarray with radioactive labelling.
- `lipid`: concentration (in percentage) of 21 hepatic fatty acids measured by gas chromatography.
- `diet`: a 5-level factor. Oils used for experimental diets preparation were corn and colza oils (50/50) for a reference diet (REF), hydrogenated coconut oil for a saturated fatty acid diet (COC), sunflower oil for an Omega6 fatty acid-rich diet (SUN), linseed oil for an Omega3-rich diet (LIN) and corn/colza/enriched fish oils for the FISH diet (43/43/14).
- `genotype`: a 2-level factor indicating either wild-type (WT) or PPAR$\alpha$ -/- (PPAR).
More details can be found in `?nutrimouse`.
To illustrate sparse PLS, we will integrate the gene expression levels (`gene`) with the concentrations of hepatic fatty acids (`lipid`).
## Principle of PLS
Partial Least Squares (PLS) regression [@Wol66; @Wol01] is a multivariate methodology which relates (*integrates*) two data matrices `X` (e.g. transcriptomics) and `Y` (e.g. lipids). PLS goes beyond traditional multiple regression by modelling the structure of both matrices. Unlike traditional multiple regression models, it is not limited to uncorrelated variables. One of the many advantages of PLS is that it can handle many noisy, collinear (correlated) and missing variables, and can also simultaneously model several response variables in `Y`.
PLS is a multivariate projection-based method that can address different types of integration problems. Its flexibility is the reason why it is the backbone of most methods in `mixOmics`. PLS is computationally very efficient when the number of variables greatly exceeds the number of samples ($p + q \gg n$). It performs successive local regressions that avoid the computational issues caused by the inversion of large singular covariance matrices. Unlike PCA, which maximises the variance of components from a single data set, PLS maximises the covariance between components from two data sets. The mathematical concepts of covariance and correlation are similar, but covariance is an unbounded measure whereas correlation is standardised and bounded between -1 and 1 (see \@ref(intro:background)). In PLS, the linear combinations of variables are called *latent variables* or *latent components*. The weight vectors used to calculate the linear combinations are called the *loading vectors*. Latent variables and loading vectors are thus associated, and come in pairs from each of the two data sets being integrated.
## Principle of sparse PLS
Even though PLS is highly efficient in high-dimensional contexts, the interpretability of PLS needed to be improved. sPLS was developed by our team to perform simultaneous variable selection in both the `X` and `Y` data sets, by including LASSO $\ell_1$ penalisations in PLS on each pair of loading vectors [@Lec08].
## Inputs and outputs
We consider the data input matrices: `X` is a $n \times p$ data matrix and `Y` a $n \times q$ data matrix, where $n$ is the number of samples (individuals), and $p$ and $q$ are the numbers of variables in each data set. The main PLS outputs are:
- A **set of components**, also called latent variables associated to each data set. There are as many components as the chosen dimension of the PLS.
- A **set of loading vectors**, which are coefficients assigned to each variable to define each component. Those coefficients indicate the importance of each variable in PLS. Importantly, each loading vector is associated to a particular component. Loading vectors are obtained so that the covariance between a linear combination of the variables from `X` (the X-component) and from `Y` (the $Y$-component) is maximised.
- A **list of selected variables** from both `X` and `Y` and associated to each component if sPLS is applied.
## Set up the data
We first set up the data as the `X` expression matrix and `Y` as the lipid abundance matrix. We also check that the dimensions are correct and match:
```{r echo=TRUE, message=FALSE}
library(mixOmics)
data(nutrimouse)
X <- nutrimouse$gene
Y <- nutrimouse$lipid
dim(X); dim(Y)
```
## Quick start
We will mainly focus on sparse PLS for large biological data sets, where variable selection can help the interpretation of the results. See `?pls` for a model with no variable selection. Here we arbitrarily set the number of variables to select to 25 in `X` and 5 in `Y`, on each of the 2 components of PLS (see section \@ref(tuning:PLS) for tuning these values).
```{r fig.show='hide'}
MyResult.spls <- spls(X,Y, keepX = c(25, 25), keepY = c(5,5))
plotIndiv(MyResult.spls)
plotVar(MyResult.spls)
```
If you were to run `spls` with this minimal code, you would be using the following default values:
- `ncomp = 2`: the first two PLS components are calculated and are used for graphical outputs;
- `scale = TRUE`: data are scaled (variance = 1, strongly advised here);
- `mode = "regression"`: by default, a PLS regression mode is used (see \@ref(PLS:details) for more details).
Because PLS generates a pair of components, each associated to each data set, the function `plotIndiv` produces 2 plots that represent the same samples projected in either the space spanned by the X-components, or the Y-components. A single plot can also be displayed, see section \@ref(pls:plotIndiv).
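To make the notion of paired components concrete, the sketch below extracts the latent variables of each block and their correlation; the `variates` slot is assumed from recent `mixOmics` versions:
```{r, eval = FALSE}
# the pair of latent variables (one per data set) for the first dimension
head(MyResult.spls$variates$X[, 1])  # first X-component (sample scores)
head(MyResult.spls$variates$Y[, 1])  # first Y-component (sample scores)
# their correlation reflects the agreement found between X and Y
cor(MyResult.spls$variates$X[, 1], MyResult.spls$variates$Y[, 1])
```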
## To go further {#pls-tgf}
### Customize sample plots {#pls:plotIndiv}
Some of the additional sample plot arguments were described in \@ref(splsda:plotIndiv). In addition, you can choose the representation space to be the components from the `X` data set, the `Y` data set, or an average of both sets of components (`rep.space = 'XY-variate'`). See more examples in `example(plotIndiv)` and on our [website](http://mixomics.org/graphics/sample-plots/). Here are two examples with colours indicating genotype or diet:
```{r}
plotIndiv(MyResult.spls, group = nutrimouse$genotype,
rep.space = "XY-variate", legend = TRUE,
legend.title = 'Genotype',
ind.names = nutrimouse$diet,
title = 'Nutrimouse: sPLS')
```
```{r}
plotIndiv(MyResult.spls, group=nutrimouse$diet,
pch = nutrimouse$genotype,
rep.space = "XY-variate", legend = TRUE,
legend.title = 'Diet', legend.title.pch = 'Genotype',
ind.names = FALSE,
title = 'Nutrimouse: sPLS')
```
### Customize variable plots {#pls:plotVar}
See `example(plotVar)` for more examples. Here we change the size of the labels. By default, the colours are assigned to each type of variable. The coordinates of the variables can also be saved as follows:
```{r}
plotVar(MyResult.spls, cex=c(3,2), legend = TRUE)
coordinates <- plotVar(MyResult.spls, plot = FALSE)
```
### Other useful plots for data integration
We extended other types of plots, based on clustered image maps and relevance networks, to ease the interpretation of the relationships between the two types of variables. A similarity matrix is calculated from the outputs of PLS and represented with those graphics; see [@Gon12] for more details, and our [website](http://mixomics.org/graphics/variable-plots/).
#### Clustered Image Maps
A clustered image map can be produced using the `cim` function. You may experience figure margin issues in RStudio; it is best to either use `X11()` or save the plot to an external file. For example, to show the correlation structure between the X and Y variables selected on component 1:
```{r eval= FALSE, fig.show = 'hide'}
X11()
cim(MyResult.spls, comp = 1)
cim(MyResult.spls, comp = 1, save = 'jpeg', name.save = 'PLScim')
```
#### Relevance networks {#pls:network}
Using the same similarity matrix input as in the CIM, we can also represent relevance bipartite networks. Those networks only represent edges between one type of variable from `X` and the other type from `Y`. Whilst we use sPLS to narrow down to a few key correlated variables, our `keepX` and `keepY` values might still be too high for this kind of output. A cut-off can be set based on the correlation coefficient between the different types of variables.
Other arguments, such as `interactive = TRUE`, enable a scrollbar to change the cut-off value interactively; see other options in `?network`. Additionally, the graph object can be saved and input into Cytoscape for an improved visualisation.
```{r eval=FALSE}
X11()
network(MyResult.spls, comp = 1)
network(MyResult.spls, comp = 1, cutoff = 0.6, save = 'jpeg', name.save = 'PLSnetwork')
# save as graph object for cytoscape
myNetwork <- network(MyResult.spls, comp = 1)$gR
```
#### Arrow plots
Instead of projecting the samples into the combined XY representation space, as shown in \@ref(pls:plotIndiv), we can overlay the X- and Y-representation plots. One arrow joins the same sample from the X-space to the Y-space. Short arrows indicate a good agreement found by the PLS between both data sets.
```{r}
plotArrow(MyResult.spls,group=nutrimouse$diet, legend = TRUE,
X.label = 'PLS comp 1', Y.label = 'PLS comp 2', legend.title = 'Diet')
```
### Variable selection outputs
The selected variables can be extracted using the `selectVar` function for further analysis.
```{r}
MySelectedVariables <- selectVar(MyResult.spls, comp = 1)
MySelectedVariables$X$name # Selected genes on component 1
MySelectedVariables$Y$name # Selected lipids on component 1
```
The loading plots help visualise the coefficients assigned to each selected variable on each component:
```{r}
plotLoadings(MyResult.spls, comp = 1, size.name = rel(0.5))
```
### Tuning parameters and numerical outputs {#tuning:PLS}
For PLS and sPLS, two types of parameters need to be chosen:
1 - The number of components to retain `ncomp`,
2 - The number of variables to select on each component and on each data set `keepX` and `keepY` for sparse PLS.
For **item 1**, we use the `perf` function and repeated k-fold cross-validation to calculate the Q$^2$ criterion used in the SIMCA-P software [@Ume96]. The rule of thumb is that a PLS component should be included in the model if its Q$^2$ value is $\geq 0.0975$. Here we use 5-fold CV repeated 10 times (note that we advise using at least 50 repeats, and choosing a number of folds appropriate for the sample size of the data set).
We run a PLS model with a sufficient number of components first, then run `perf` on the object.
```{r eval=TRUE}
MyResult.pls <- pls(X,Y, ncomp = 4)
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
perf.pls <- perf(MyResult.pls, validation = "Mfold", folds = 5,
progressBar = FALSE, nrepeat = 10)
plot(perf.pls, criterion = 'Q2.total')
```
We can see that the Q$^2$ values decrease as we add new components, which generally indicates overfitting; a model with a high number of components would therefore not be suitable for prediction.
**Item 2** can be quite difficult to tune. Here is a minimal example where we only tune `keepX`, based on the correlation between the cross-validated components and those from the full model fitted on all the data. Another measure proposed is the Residual Sum of Squares (RSS), see `?tune.spls`:
```{r, eval=TRUE, message=FALSE}
list.keepX <- c(2:10, 15, 20)
# tuning based on correlations
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
tune.spls.cor <- tune.spls(X, Y, ncomp = 3,
test.keepX = list.keepX,
validation = "Mfold", folds = 5,
nrepeat = 10, progressBar = FALSE,
measure = 'cor')
plot(tune.spls.cor, measure = 'cor')
```
Based on the correlation of the cross-validated components, the optimal number of variables to select in the `X` data set, while keeping all variables in the `Y` data set, would be:
```{r}
tune.spls.cor$choice.keepX
```
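These tuned values can then be fed into a final sPLS model; a short sketch, where all `Y` variables are kept as suggested by the tuning above:
```{r, eval = FALSE}
MyResult.spls.tuned <- spls(X, Y, ncomp = 3,
                            keepX = tune.spls.cor$choice.keepX)
```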
Tuning `keepX` and `keepY` jointly is also possible:
```{r}
# tuning both X and Y
set.seed(30) # for reproducibility in this vignette, otherwise increase nrepeat
tune.spls.cor.XY <- tune.spls(X, Y, ncomp = 3,
test.keepX = c(8, 20, 50),
test.keepY = c(4, 8, 16),
validation = "Mfold", folds = 5,
nrepeat = 10, progressBar = FALSE,
measure = 'cor')
## visualise correlations
plot(tune.spls.cor.XY, measure = 'cor')
## visualise RSS
plot(tune.spls.cor.XY, measure = 'RSS')
```
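As previously, the jointly tuned values could be extracted and used to fit the final model; a minimal sketch, assuming the optimal values are stored in `choice.keepX` and `choice.keepY` (see `?tune.spls`):
```{r, eval = FALSE}
tune.spls.cor.XY$choice.keepX
tune.spls.cor.XY$choice.keepY
MyResult.spls.final <- spls(X, Y, ncomp = 3,
                            keepX = tune.spls.cor.XY$choice.keepX,
                            keepY = tune.spls.cor.XY$choice.keepY)
```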
### PLS modes {#PLS:details}
You may have noticed the `mode` argument in PLS. The residual matrices can be calculated in different ways at each PLS iteration, depending on this argument. Note: this section is for **advanced** users.
**Regression mode**
The PLS regression mode models a *uni-directional* (or 'causal') relationship between two data sets: the `Y` matrix is deflated with respect to the information extracted or modelled from the local regression on `X`. Here the goal is to predict `Y` from `X` (`Y` and `X` play an *asymmetric role*). Consequently, the latent variables computed to predict `Y` from `X` are different from those computed to predict `X` from `Y`. More details about the model can be found in the Appendix of [@Lec08].
PLS regression mode, also called PLS2, is commonly applied for the analysis of biological data [@Bou05; @Byl07] due to the underlying biological assumptions or the biological dogma. In general, `Y` contains fewer variables to predict than there are predictors in `X`.
**Canonical mode**
Similar to a Canonical Correlation Analysis (CCA) framework, this mode is used to model a *bi-directional* (or *symmetrical*) relationship between the two data sets. The `Y` matrix is deflated with respect to the information extracted or modelled from the local regression on `Y`. Here `X` and `Y` play a symmetric role and the goal is similar to CCA. More details about the model can be found in [@Lec09a].
PLS canonical mode is not well known (yet), but is applicable when there is no *a priori* relationship between the two data sets, or in place of CCA when variable selection is required on large data sets. In [@Lec09a], we compared the measurements of the same biological samples on different types of microarrays, cDNA and Affymetrix arrays, to highlight complementary information at the transcript level. Note, however, that we do not provide any tuning function for this mode.
**Other modes**
The 'invariant' mode performs a redundancy analysis, where the `Y` matrix is not deflated. The 'classic' mode is similar to the regression mode: it gives identical results for the variates and loadings associated with the `X` data set, but differs for the loading vectors associated with the `Y` data set (different normalisations are used). Classic mode is the PLS2 model as defined by [@Ten98], Chap 9.
**Difference between PLS modes**
For the first PLS dimension, all PLS modes will output the same results in terms of latent variables and loading vectors. After the first dimension, these vectors will differ, as the matrices are deflated differently.
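This can be checked empirically; a small sketch comparing two modes on the current data:
```{r, eval = FALSE}
pls.reg <- pls(X, Y, ncomp = 2, mode = "regression")
pls.can <- pls(X, Y, ncomp = 2, mode = "canonical")
# identical on the first dimension...
all.equal(pls.reg$variates$X[, 1], pls.can$variates$X[, 1])
# ...but different afterwards, as the matrices are deflated differently
all.equal(pls.reg$variates$X[, 2], pls.can$variates$X[, 2])
```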
## Additional resources
Additional examples are provided in `example(spls)` and in our case studies on our [website](http://www.mixomics.org) in the **Methods** and **Case studies** sections, see also [@Lec08; @Lec09a].
## FAQ
* Can PLS handle missing values?
+ Yes it can, but only for the learning / training analysis. Prediction with `perf` or `tune` is not possible with missing values.
* Can PLS deal with more than 2 data sets?
+ sPLS can only deal with 2 data sets, but see `DIABLO` (Chapter \@ref(diablo)) for multi-block analyses.
* What are the differences between sPLS and Canonical Correlation Analysis (CCA, see `?rcca` in mixOmics)?
+ CCA maximises the correlation between components; PLS maximises the covariance
+ Both methods give similar results if the components are scaled, but the underlying algorithms are different:
- CCA calculates all components at once; there is no deflation
- PLS deflates the data matrices one component at a time, with different deflation modes
+ sparse PLS selects variables; CCA cannot perform variable selection
* Can I perform PLS with more variables than observations?
+ Yes, and sparse PLS is particularly useful to identify sets of variables that play a role in explaining the relationship between two data sets.
* Can I perform PLS with 2 data sets that are highly unbalanced (thousands of variables in one data set and fewer than 10 in the other)?
+ Yes! And if you perform sPLS to select variables in one data set (or both), you can still control the number of variables selected with `keepX` (and `keepY`).
# Multi-block Discriminant Analysis with DIABLO {#diablo}
```{r overview-DIABLO, echo=FALSE, message=FALSE}
library(mixOmics)
coul <- color.mixo(1:3)
plot(0, type="n", xlim=c(0,100), ylim=c(83,97), axes=FALSE,
xlab="",ylab="", main="DIABLO overview")
box()
rect(xleft = 12, ybottom = 85, xright = 30, ytop = 95, col=coul[1])
rect(xleft = 32, ybottom = 85, xright = 45, ytop = 95, col=coul[1])
points(x=47, y=90, pch=16, col='black', cex=0.5)
points(x=48.5, y=90, pch=16, col='black', cex=0.5)
points(x=50, y=90, pch=16, col='black', cex=0.5)
rect(xleft = 52, ybottom = 85, xright = 61, ytop = 95, col=coul[1])
rect(xleft = 63, ybottom = 85, xright = 65, ytop = 95, col=coul[2])
text(5, 90, "DIABLO")
# legend
rect(xleft = 85, ybottom = 92, xright = 87, ytop = 94, col=coul[1])
text(90, 95, "Quantitative", cex=0.75)
rect(xleft = 85, ybottom = 87, xright = 87, ytop = 89, col=coul[2])
text(90, 90, "Qualitative", cex=0.75)
```
DIABLO is our new framework in `mixOmics` that extends PLS to the integration of multiple data sets and PLS-discriminant analysis. The acronym stands for **D**ata **I**ntegration **A**nalysis for **B**iomarker discovery using **L**atent c**O**mponents [@Sin16].
## Biological question
*I would like to identify a highly correlated multi-omics signature discriminating known groups of samples.*
## The `breast.TCGA` study
Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumours can be classified into several subtypes, according to levels of mRNA expression [@Sor01]. Here we consider a subset of the data generated by The Cancer Genome Atlas Network [@TCGA12]. For the package, the data were normalised and drastically prefiltered for illustrative purposes, but DIABLO can handle larger data sets; see [@mixomics] Table 2.
The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set including 70 samples, but only with mRNA and miRNA data (proteomics missing). The aim of this integrative analysis is to identify a highly correlated multi-omics signature discriminating the breast cancer subtypes Basal, Her2 and LumA.
`breast.TCGA` is a list containing the training and test sets of omics data, `data.train` and `data.test`, which include:
- `miRNA`: a data frame with 150 (70) rows and 184 columns in the training (test) data set for the miRNA expression levels.
- `mRNA`: a data frame with 150 (70) rows and 520 columns in the training (test) data set for the mRNA expression levels.
- `protein`: a data frame with 150 rows and 142 columns in the training data set only for the protein abundance.
- `subtype`: a factor indicating the breast cancer subtypes in the training (length of 150) and test (length of 70) sets.
More details can be found in `?breast.TCGA`.
To illustrate DIABLO, we will integrate the expression levels of miRNA, mRNA and the abundance of proteins while discriminating the subtypes of breast cancer, then predict the subtypes of the test samples in the test set.
## Principle of DIABLO
The core DIABLO method extends Generalised Canonical Correlation Analysis [@Ten11] (which, contrary to what its name suggests, generalises PLS to multiple matching data sets) and the sparse sGCCA method [@Ten14]. Starting from the R package RGCCA, we extended these methods for different types of analyses, including unsupervised N-integration (`block.pls`, `block.spls`) and supervised analyses (`block.plsda`, `block.splsda`).
The aim of **N-integration** with our sparse methods is to identify correlated (or co-expressed) variables measured on heterogeneous data sets which also explain the categorical outcome of interest (supervised analysis). The multiple data integration task is not trivial, as the analysis can be strongly affected by the variation between manufacturers or omics technological platforms despite being measured on the same biological samples. Before you embark on data integration, we strongly suggest individual or paired analyses with sPLS-DA and PLS to first understand the major sources of variation in each data set and to guide the integration process.
More methodological details can be found in [@Sin16].
## Inputs and outputs
We consider as input a list `X` of data frames with $n$ rows (the number of samples) and a different number of variables in each data frame. `Y` is a factor vector of length $n$ that indicates the class of each sample. Internally, and similarly to PLS-DA in Chapter \@ref(plsda), it will be coded as a dummy matrix.
DIABLO main outputs are:
- A **set of components**, also called latent variables, associated with each data set. There are as many components as the chosen dimension of DIABLO.
- A **set of loading vectors**, which are coefficients assigned to each variable to define each component. These coefficients indicate the importance of each variable in DIABLO. Importantly, each loading vector is associated with a particular component. Loading vectors are obtained so that the covariance between a linear combination of the variables from `X` (the $X$-component) and from `Y` (the $Y$-component) is maximised.
- A **list of selected variables** from each data set and associated to each component if sparse DIABLO is applied.
## Set up the data
We first set up the input data as a list `X` of data frames (one per omics data set) and `Y` as a factor indicating the class membership of each sample. Each data frame in `X` should be **named consistently** to match with the `keepX` parameter.
We check that the dimensions are correct and match (see the sanity check after the code below). We then arbitrarily set the number of variables `keepX` that we wish to select in each data set and on each component.
```{r message=FALSE}
library(mixOmics)
data(breast.TCGA)
# extract training data and name each data frame
X <- list(mRNA = breast.TCGA$data.train$mrna,
miRNA = breast.TCGA$data.train$mirna,
protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype
summary(Y)
list.keepX <- list(mRNA = c(16, 17), miRNA = c(18,5), protein = c(5, 5))
```
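As mentioned above, a quick sanity check on the dimensions; a minimal sketch (each data frame must have the same number of rows as the length of `Y`):
```{r, eval = FALSE}
lapply(X, dim)  # number of samples and variables per data frame
length(Y)       # must match the number of rows of each data frame
```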
## Quick start
```{r, fig.show = 'hide'}
MyResult.diablo <- block.splsda(X, Y, keepX=list.keepX)
plotIndiv(MyResult.diablo)
plotVar(MyResult.diablo)
```
Similar to PLS (Chapter \@ref(pls)), DIABLO generates a pair of components, each associated with one data set. This is why we can visualise three sample plots here. As DIABLO is a supervised method, samples are represented with different colours depending on their known class.
The variable plot suggests some correlation structure between proteins, mRNA and miRNA. We will further customize these plots in Sections \@ref(diablo:plotIndiv) and \@ref(diablo:plotVar).
If you were to run `block.splsda` with this minimal code, you would be using the following default values:
- `ncomp = 2`: the first two PLS components are calculated and are used for graphical outputs;
- `scale = TRUE`: data are scaled (variance = 1, strongly advised here for data integration);
We focused here on the sparse version as we would like to identify a minimal multi-omics signature; however, the non-sparse version could also be run with `block.plsda`:
```{r eval= TRUE}
MyResult.diablo2 <- block.plsda(X, Y)
```
## To go further
### Customize sample plots {#diablo:plotIndiv}
Here is an example of an improved plot, see also Section \@ref(splsda:plotIndiv) for additional sources of inspiration.
```{r}
plotIndiv(MyResult.diablo,
ind.names = FALSE,
legend=TRUE, cex=c(1,2,3),
title = 'BRCA with DIABLO')
```
### Customize variable plots {#diablo:plotVar}
Labels can be omitted in some data sets to improve readability. For example, here we only show the names of the proteins:
```{r}
plotVar(MyResult.diablo, var.names = c(FALSE, FALSE, TRUE),
legend=TRUE, pch=c(16,16,1))
```
### Other useful plots for data integration
Several plots were added for the DIABLO framework.
#### `plotDiablo`
A global overview of the correlation structure at the component level can be represented with the `plotDiablo` function. It plots the components across the different data sets for a given dimension. Colours indicate the class of each sample.
```{r}
plotDiablo(MyResult.diablo, ncomp = 1)
```
Here, we can see that a strong correlation is extracted by DIABLO between the mRNA and protein data sets. Other dimensions can be plotted with the argument `ncomp`.
#### `circosPlot`
The circos plot represents the correlations between variables of different types, displayed on the side quadrants. Several display options are possible, to show within- and between-block connections, and the expression levels of each variable in each class (argument `line = TRUE`). The circos plot is built on a similarity matrix, extended to the case of multiple data sets from [@Gon12]. A `cutoff` argument can be included to only visualise correlation coefficients above this threshold in the multi-omics signature.
```{r}
circosPlot(MyResult.diablo, cutoff=0.7)
```
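For instance, the per-class expression lines mentioned above can be added; a variant sketch of the same plot:
```{r, eval = FALSE}
# display the average expression of each selected variable per class
circosPlot(MyResult.diablo, cutoff = 0.7, line = TRUE)
```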
#### `cimDiablo`
The `cimDiablo` function is a clustered image map specifically implemented to represent the multi-omics molecular signature expression for each sample. It is very similar to a classic clustered image map based on hierarchical clustering:
```{r eval=FALSE, fig.height=8, fig.width=8}
# minimal example with margins improved:
# cimDiablo(MyResult.diablo, margin=c(8,20))
# extended example:
cimDiablo(MyResult.diablo, color.blocks = c('darkorchid', 'brown1', 'lightgreen'), comp = 1, margin=c(8,20), legend.position = "right")
```
#### `plotLoadings`
The `plotLoadings` function visualises the loading weights of each selected variable on each component (default is `comp = 1`) and in each data set. The colour indicates the class in which the variable has the maximum level of expression (`contrib = "max"`) or minimum (`contrib = "min"`), on average (`method = "mean"`) or using the median (`method = "median"`). We only show the last plot here:
```{r}
#plotLoadings(MyResult.diablo, contrib = "max")
plotLoadings(MyResult.diablo, comp = 2, contrib = "max")
```
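The other contribution and summary options described above can be combined; a short sketch:
```{r, eval = FALSE}
# colour by the class with the minimum median expression instead
plotLoadings(MyResult.diablo, comp = 2, contrib = "min", method = "median")
```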
#### Relevance networks
Another visualisation of the correlation between the different types of variables is the relevance network, which is also built on the similarity matrix [@Gon12]. Each color represents a type of variable. A threshold can also be set using the argument `cutoff`.
See also Section \@ref(pls:network) to save the graph and the different options, or `?network`.
```{r, eval = FALSE, fig.show = 'hide'}
network(MyResult.diablo, blocks = c(1,2,3),
color.node = c('darkorchid', 'brown1', 'lightgreen'),
cutoff = 0.6, save = 'jpeg', name.save = 'DIABLOnetwork')
```
## Numerical outputs
### Classification performance
Similar to what is described in Section \@ref(tuning:sPLSDA), we use repeated cross-validation with `perf` to assess the predictive performance of the model. For such complex classification problems, a centroids distance is often suitable; see details in [@mixomics] Suppl. Material S1.
```{r, eval = TRUE}
set.seed(123) # for reproducibility in this vignette
MyPerf.diablo <- perf(MyResult.diablo, validation = 'Mfold', folds = 5,
nrepeat = 10,
dist = 'centroids.dist')
#MyPerf.diablo # lists the different outputs
# Performance with Majority vote
#MyPerf.diablo$MajorityVote.error.rate
```
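The performance object can also be plotted to display the error rates across components; a minimal sketch (see `?perf` for the full list of outputs):
```{r, eval = FALSE}
# classification error rates per component and per distance
plot(MyPerf.diablo)
```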
### AUC
An AUC plot per block can be obtained with the function `auroc`; see [@mixomics] for the interpretation of such outputs, as the ROC and AUC criteria are not particularly insightful for the performance evaluation of our methods, but they can complement the statistical analysis.
Here we evaluate the AUC for the model that includes 2 components in the miRNA data set.
```{r, echo=TRUE,results='hide',fig.keep='all'}
Myauc.diablo <- auroc(MyResult.diablo, roc.block = "miRNA", roc.comp = 2)
```
#### Prediction on an external test set
The `predict` function predicts the class of samples from a test set. In our specific case, one data set is missing in the test set but the method can still be applied. Make sure the names of the blocks correspond exactly to those used for training.
```{r}
# prepare test set data: here one block (proteins) is missing
X.test <- list(mRNA = breast.TCGA$data.test$mrna,
miRNA = breast.TCGA$data.test$mirna)
Mypredict.diablo <- predict(MyResult.diablo, newdata = X.test)
# the warning message will inform us that one block is missing
#Mypredict.diablo # list the different outputs
```
The confusion table compares the real subtypes with the predicted subtypes for a 2 component model, for the distance of interest:
```{r}
confusion.mat <- get.confusion_matrix(
truth = breast.TCGA$data.test$subtype,
predicted = Mypredict.diablo$MajorityVote$centroids.dist[,2])
kable(confusion.mat)
get.BER(confusion.mat)
```
### Tuning parameters
For DIABLO, the parameters to tune are:
1 - The design matrix `design` indicates which data sets, or blocks, should be connected to maximise the covariance between components, and to what extent. A compromise needs to be achieved between maximising the correlation between data sets (design values between 0.5 and 1) and maximising the discrimination with the outcome `Y` (design values between 0 and 0.5); see [@Sin16] for more details.
2 - The number of components to retain `ncomp`. The rule of thumb is usually $K-1$ where $K$ is the number of classes, but it is worth testing a few extra components.
3 - The number of variables to select on each component and on each data set in the list `keepX`.
For **item 1**, by default all data sets are linked as follows:
```{r}
MyResult.diablo$design
```
The `design` can be changed as follows (by default, each data set is also linked to the `Y` outcome):
```{r}
MyDesign <- matrix(c(0, 0.1, 0.3,
0.1, 0, 0.9,
0.3, 0.9, 0),
byrow=TRUE,
ncol = length(X), nrow = length(X),
dimnames = list(names(X), names(X)))
MyDesign
MyResult.diablo.design <- block.splsda(X, Y, keepX=list.keepX, design=MyDesign)
```
**Items 2 and 3** can be tuned using repeated cross-validation, as we described in Chapter \@ref(plsda). A detailed tutorial is provided on our [website](http://mixomics.org/mixdiablo/) in the different DIABLO tabs.
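As an indication only, a tuning call might look as follows; a sketch where the `keepX` grid, folds and repeats are arbitrary placeholders and the run can be lengthy:
```{r, eval = FALSE}
# hypothetical grid of keepX values per block; adapt to your own data
test.keepX <- list(mRNA = c(5, 10, 20),
                   miRNA = c(5, 10, 20),
                   protein = c(5, 10))
tune.diablo <- tune.block.splsda(X, Y, ncomp = 2, test.keepX = test.keepX,
                                 design = MyDesign,
                                 validation = 'Mfold', folds = 5, nrepeat = 10,
                                 dist = 'centroids.dist')
tune.diablo$choice.keepX
```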
## Additional resources
Additional examples are provided in `example(block.splsda)` and in the DIABLO tab on http://www.mixomics.org. Also have a look at [@Sin16].
## FAQ
* When performing a multi-block analysis, how do I choose my design?
+ We recommend first relying on any prior biological knowledge you may have on the relationships you expect to see between data sets. Then conduct a few trials with the non-sparse version `block.plsda`, and look at the classification performance with `perf` and at `plotDiablo`, before deciding on your final design.
* I have a small number of samples (n < 10), should I still tune `keepX`?
+ It is probably not worth it. Try a few `keepX` values and look at the graphical outputs to see if they make sense. With a small $n$ you can adopt an exploratory approach that does not require a performance assessment.
* During `tune` or `perf` the code broke down (`system computationally singular`).
+ Check that the $M$ value for your M-fold CV is not too high compared to $n$ (as a rule of thumb, you want $n/M > 6-8$). Try leave-one-out cross-validation instead with `validation = 'loo'`, and make sure `ncomp` is not too large, as you may be running the model on nearly empty deflated matrices!
* My tuning step indicated the selection of only 1 miRNA...
+ Choose a grid of `keepX` values that starts at a higher value (e.g. 5). The algorithm found an optimum with only one variable, either because it is highly discriminatory or because the data are noisy; nothing stops you from selecting more.
* My Y is continuous, what can I do?
+ You can perform a multi-omics regression with `block.spls`. We have not yet found a way to tune the results, so you will need to adopt an exploratory approach, or back yourself up with downstream analyses once you have identified a list of highly correlated features.
# Session Information
```{r}
sessionInfo()
```
# References