---
title: "**phenomis**: Postprocessing and univariate statistical analysis of omics data"
author: "Etienne A. Thevenot"
date: "`r doc_date()`"
package: "`r pkg_ver('phenomis')`"
vignette: >
  %\VignetteIndexEntry{phenomis-vignette}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{BatchEffect, Clustering, MassSpectrometry, Metabolomics, 
  Normalization, Proteomics, QualityControl, StatisticalMethod, Transcriptomics}  
  %\VignetteEngine{knitr::rmarkdown}
bibliography: "phenomis-vignette.bib"
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
editor_options: 
  markdown: 
    wrap: 72
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width = 5,
                      fig.height = 5)
```

# Introduction

## Context

Analysis of a molecular phenotyping (e.g. metabolomics) data sets (i.e.
samples times variables table of peak or bucket intensities generated by
preprocessing tools such as XCMS) is aimed at **mining the data** (e.g.
detecting trends and outliers) and selecting features of predictive
value (**biomarker discovery**). It comprises multiple steps including:

-   **Post-processing** (quality control, normalization and/or
    transformation of intensities)

-   **Statistical analysis** (univariate hypothesis testing,
    multivariate modeling, feature selection)

-   **Annotation** (physico-chemical properties, biological pathways)

The **`phenomis`** package focuses on the two first steps, and can be
combined with other packages for multivariate modeling, feature
selection and annotation, such as the
[**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls) and
[**`biosigner`**](https://doi.org/10.18129/B9.bioc.biosigner) and
[**`biodb`**](https://www.bioconductor.org/packages/biodb/) Bioconductor
packages. As an example, the tutorial below reproduces the whole
workflow described in the `sacurine` study [@thevenot_analysis_2015].
The `phenomis` package has also been used to post-processed the
preclinical, proteomics and metabolomics data from the ProMetIS study
[@Imbert_PrometisDeepPhenotyping_2021]. Proteomics and metabolomics data
from this study are provided as a second dataset, named `prometis`.
Examples of statistical analyses of this dataset are provided in the
"Working with multi-omics datasets" section.

## Methods

![Methods from the **phenomis**, **ropls**, and **biosigner** packages
for the analysis of omics datasets; specific parameter values used for
the **sacurine** metabolomics dataset described in the 'Hands-on' part
below are provided as examples.](figures/phenomis_methods.png)

## Formats

### Managing data and metadata: the `SummarizedExperiment` and `MultiAssayExperiment` formats

The standard `SummarizedExperiment` and `MultiAssayExperiment`
Bioconductor formats for single and multi-omics datasets are used by the
`phenomis` methods. The methods return updated `SummarizedExperiment`
and `MultiAssayExperiment` objects with either modified assay matrices
(e.g. when using the `transforming` method) or enriched `rowData` and
`colData` with new columns (e.g. fold changes and *p*-values when using
the `hypotesting` method).

Note that the `phenomis` package also supports the `ExpressionSet`
(respectively, `MultiDataSet`) formats, which are previous
(respectively, alternative) formats for the management of single
(respectively, multiple) datasets.

### Importing from/exporting to tabular files

Input (i.e. preprocessed) data consists of a 'variable times sample'
matrix of intensities (**dataMatrix** numeric matrix), in addition to
sample and variable metadata (**sampleMetadata** and
**variableMetadata** data frames). Importantly, the row names from the 
dataMatrix must be identical to the row names from the vaiableMetadata 
(feature names), and the column names from the dataMatrix must be identical to 
the row names from sampleMetadata (sample names).

Note that 3 sample metadata columns should be specified when working
with the `correcting` method, namely:

-   **sampleType** (character): the following types are handled by the
    algorithms:

    -   **sample**: biological sample of interest

    -   **blank**: e.g. solvent only in Liquid Chromatography coupled to
        Mass Spectrometry

    -   **pool**: quality control sample generated by pooling an equal
        volume of each of the sample of interest from the whole study
        (i.e. from all batches)

    -   **poolN**: (where *N* is an integer; e.g. pool2, pool4, ...):
        diluted quality control sample: *N* indicates the dilution
        factor

-   **injectionOrder** (integer): order of the sample injection (e.g. in
    the LC-MS instrument)

-   **batch** (character): name of each batch

## Text and graphical outputs

Text and graphics can be handled with the *phenomis* methods by setting
the two arguments:

-   **`report.c`**: if set to 'interactive' [default], messages are
    displayed interactively; if set to a character corresponding the a
    filename (with the '.txt' extension), messages are diverted to this
    file by using the *sink* function internally (the same file can be
    appended by successive methods); if set to 'none', messages are
    suppressed

-   **`figure.c`**: if set to 'interactive' [default], graphics are
    displayed interactively; if set to a character corresponding the a
    filename (with the '.pdf' extension), a pdf file with the plot is
    generated instead; if set to 'none', graphics are suppressed

# Hands-on

## The **sacurine** cohort study

As an example, we will use the **`phenomis`** package to study the
**sacurine** human cohort. The study is aimed at characterizing the
physiological variations of the human urine metabolome with age, body
mass index (BMI), and gender [@thevenot_analysis_2015]. Urine samples
from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh
performance liquid chromatography (UPLC) coupled to high-resolution mass
spectrometry (LTQ-Orbitrap). Raw data are publicly available on the
MetaboLights repository
([MTBLS404](https://www.ebi.ac.uk/metabolights/MTBLS404)).

This vignette describes the statistical analysis of the data set from
the negative ionization mode (113 identified metabolites at MSI levels 1
or 2):

-   **`correcting`**: Correction of the signal drift by local regression
    (loess) modeling of the intensity trend in pool samples
    [@dunn_procedures_2011]; Adjustment of offset differences between the
    two analytical batches by using the average of the pool intensities
    in each batch [@van_der_kloet_analytical_2009]

-   Variable quality control by discarding features with a coefficient
    of variation above 30% in *pool* samples

-   Normalization by the sample osmolality

-   **`transforming`**: log10 transformation

-   **`inspecting`**: Computing metrics to filter out outlier samples
    according to the Weighted Hotellings'T2 distance
    [@tenenhaus_approche_1999], the Z-score of one of the intensity
    distribution deciles [@alonso_astream_2011], and the Z-score of the
    number of missing values [@alonso_astream_2011]. A 0.001 threshold
    for all p-values results in the HU_096 sample being discarded

-   **`hypotesting`**: Univariate hypothesis testing of significant
    variations with age, BMI, or between genders (Student's T test with
    Benjamini Hochberg correction)

-   PCA exploration of the data set;
    [**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls) Bioconductor
    package [@thevenot_analysis_2015]

-   **`clustering`**: Hierarchical clustering

-   OPLS(-DA) modeling of age, BMI and gender;
    [**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls) Bioconductor
    package [@thevenot_analysis_2015]

-   Selection of the features which significantly contributes to the
    discrimination between gender with PLS-DA, Random Forest, or Support
    Vector Machines classifiers;
    [**`biosigner`**](https://doi.org/10.18129/B9.bioc.biosigner)
    Bioconductor package [@rinaudo_biosigner_2016]

A Galaxy version of this analysis is available [W4M00001
'Sacurine-statistics'](https://doi.org/10.15454/1.4811121736910142E12)
on the
[workflow4metabolomics.usegalaxy.fr](https://workflow4metabolomics.usegalaxy.fr/)
online infrastructure [@guitton_create_2017].

## **`reading`**: reading the data

The **`reading`** function reads the data sets and builds the
`SummarizedExperiment` object. Additional information on how to build
and handle `SummarizedExperiment` objects (as well as
`MultiAssayExperiment`, `ExpressionSet`, and `MultiDataSet`) are
provided in the **Appendix**.

```{r reading, message = FALSE}
library(phenomis)
sacurine.se <- reading(system.file("extdata/sacurine", package = "phenomis"))
```

## **`inspecting`**: looking at the data

The `inspecting` method provides a numerical and graphical overview of
the data. Furthermore, it computes quality metrics which may
subsequently be used to filter out some samples or variables.

-   Graphical overview. The data matrix is visualized with a color
    gradient (top right) and the score plot of the Principal Component
    Analysis is shown for the two first components (bottom right). The
    black ellipse corresponds to the area of 95% of the samples, as
    computed with the Hotelling test. For each sample the mean of
    variable intensities is shown as a function of the injection order
    to detect any signal drift and/or batch correction (bottom left).
    Note that for this coloring to be displayed, the **sampleType**,
    **injectionOrder** and **batch** columns should be provided in the
    `colData` of (each of) the dataset(s). Finally, some metrics are
    computed regarding the proportion of NAs, 0 values, intensity
    quantiles, and proportion of features with a coefficient of
    variation in pool intensities \< 30%.

-   Quality metrics (as additional column in the metadata):

    -   **samples** (added in the `colData`):

        -   hotel_pval: *p*-value from the Hotelling's T2 test in the
            first plane of the PC components

        -   miss_pval: p-value associated to the z-score of the
            proportion of missing values [@alonso_astream_2011]

        -   deci_pval: p-value associated to the z-score of intensity
            deciles [@alonso_astream_2011]

            For each test, low *p*-values highlight samples with extreme
            behavior.

    -   **features** (included in the `rowData`)

        -   When the type of samples is available (ie the **sampleType**
            column is included in the input `colData` from the
            individual datasets), variable metrics are computed: sample,
            pool, and blank mean, sd and coefficient of variation (if
            the corresponding types are present in the 'sampleType'
            column), as well as 'blank' mean / 'sample' mean, and 'pool'
            CV / 'sample' CV ratio
        -   If pool dilutions have been used and are indicated in the
            'sampleType' column as poolN where N is an integer
            indicating the dilution factor (eg pool2 for a two-fold
            dilution of the pool; note that the non-diluted pool remains
            indicated as 'pool') the Pearson correlation (and
            corresponding p-value) between the intensity and the
            dilution factor is computed for each variable

```{r inspecting}
sacurine.se <- inspecting(sacurine.se, report.c = "none")
```

The `filtering` method may be used to discard samples and/or features
with a high proportion of missing values and/or a variance of 0. When
filtering the features, sample class may be specified (as the name a
column from the sample metadata): in this case, features will be kept
when:

1.  the proportion of missing values is below the threshold in at least
    one class

2.  the variance is above 0 in both classes

## Post-processing

### **`correcting`**: Correcting signal drift and batch effect

For each feature, the signal drift and the batch effect are corrected by
using a normalization strategy which relies on the measurements of a
pooled (or QC) sample injected periodically: for each variable, a
regression model is fitted to the values of the pool and subsequently
used to adjust the intensities of the samples of interest
[@van_der_kloet_analytical_2009; @dunn_procedures_2011]. The sample
metadata of each datasets (e.g. `colData` Data Frames) must contain 3
columns:

1.  **sampleType** (character): either 'sample', 'blank', or 'pool'

2.  **injectionOrder** (integer): order of injection in the instrument

3.  **batch** (character): batch name

```{r correcting, fig.dim = c(6, 5)}
sacurine.se <- correcting(sacurine.se,
                          reference.vc = "pool",
                          report.c = "none")
```

### Variable filtering

-   using **`inspecting`** to compute the 'pool_CV' metric

```{r inspecting_variable_filtering}
sacurine.se <- inspecting(sacurine.se,
                          figure.c = "none", report.c = "none")
```

-   discarding features with 'pool_CV' \> 0.3

```{r discarding_pool_CV_sup_0.3}
sacurine.se <- sacurine.se[rowData(sacurine.se)[, "pool_CV"] <= 0.3, ]
```

-   discarding the 'pool' observations

```{r discarding_pools}
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "sampleType"] != "pool"]
print(sacurine.se)
```

### Normalization

The normalization of each sample by its 'osmolality' value does not
require any extra method and relies on the classical methods from the
`SummarizedExperiment` package.

```{r normalizing_osmolality}
assay(sacurine.se) <- sweep(assay(sacurine.se),
                            2,
                            colData(sacurine.se)[, "osmolality"],
                            "/")
```

Note that a `normalizing` method is available in the `phenomis` package
to perform Probabilistic Quotient Normalization
[@dieterleProbabilisticQuotientNormalization2006].

### **`transforming`**: transforming the data intensities

```{r transforming}
sacurine.se <- transforming(sacurine.se, method.vc = "log10")

```

### Sample filtering

```{r sample_filtering}
sacurine.se <- inspecting(sacurine.se, figure.c = "none", report.c = "none")
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "hotel_pval"] >= 0.001 &
                             colData(sacurine.se)[, "miss_pval"] >= 0.001 &
                             colData(sacurine.se)[, "deci_pval"] >= 0.001]
```

Final visual check of the data before performing the statistics:

```{r final_check}
inspecting(sacurine.se, report.c = "none")
```

## **`hypotesting`**: univariate hypothesis testing

The `hypotesting` method is a wrapper of the main R functions for
hypothesis testing and corrections for multiple testing. The list of
available tests include two sample tests (t-test and Wilcoxon rank test,
but also the limma test), analysis of variance (for one and two factors)
and Kruskal-Wallis rank test, and correlation tests (by using either the
pearson or the spearman correlation).

```{r hypotesting, warning=FALSE}
sacurine.se <- hypotesting(sacurine.se,
                           test.c = "ttest",
                           factor_names.vc = "gender",
                           adjust.c = "BH",
                           adjust_thresh.n = 0.05,
                           report.c = "none")
```

The `phenomis` package can be conveniently complemented with other
packages for comprehensive statistical analysis and annotation. In the
rest of this tutorial, we illustrate how the
[**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls),
[**`biosigner`**](https://doi.org/10.18129/B9.bioc.biosigner) and
[**`biodb`**](https://www.bioconductor.org/packages/biodb/) Bioconductor
packages can be included to perform multivariate modeling
[@thevenot_analysis_2015], feature selection [@rinaudo_biosigner_2016],
and feature annotation [@roger2022]. All these packages support the
`SummarizedExperiment` and `MultiAssayExperiment` (but also
`ExpressionSet` and `MultiDataSet`), to provide interoperability.

## Unsupervised analysis

We first focus on unsupervised methods for multivariate analysis, by
performing Principal Component Analysis and hierarchical clustering.

### Principal component analysis: PCA

PCA can be performed by using the `opls` function from the `ropls`
package. It returns an updated object with scores and loadings included
as new columns in the metadata. The PCA model is stored in the metadata
from the `SummarizedExperiment` and can be accessed with the `getOpls`
method:

```{r PCA}
sacurine.se <- ropls::opls(sacurine.se, info.txt = "none")
sacurine.pca <- ropls::getOpls(sacurine.se)[["PCA"]]
ropls::plot(sacurine.pca,
            parAsColFcVn = colData(sacurine.se)[, "gender"],
            typeVc = "x-score")
ropls::plot(sacurine.pca,
            parAsColFcVn = colData(sacurine.se)[, "age"],
            typeVc = "x-score")
```

### **`clustering`**: hierarchical clustering

The `clustering` method from the `phenomis` package performs the
hierarchical clustering of samples and variables. The default
dissimilarity is \$1-cor\_{pearson}\$.

```{r heatmap}
sacurine.se <- clustering(sacurine.se,
                          clusters.vi = c(5, 3))
```

## Supervised modeling

We now address the multivariate modeling of the response of interest
(either age, bmi, or gender), by using Partial Least Squares modeling
and feature selection.

### Partial Least Squares modeling: (O)PLS(-DA)

Partial Least-Squares (PLS), which is a latent variable regression
method based on covariance between the predictors and the response, has
been shown to efficiently handle datasets with multi-collinear
predictors, as in the case of spectrometry measurements [@Wold2001].

The `opls` function from the `ropls` package can be used to perform
Partial Least Squares modeling, by specifying the response to be modeled
as the second argument:

```{r plsda}
sacurine.se <- ropls::opls(sacurine.se, "gender")
```

The (O)PLS(-DA) can be obtained from the metadata, e.g. to perform specific plots:

```{r plsda_scoreplot, eval = FALSE}
sacurine_gender.plsda <- ropls::getOpls(sacurine.se)[["gender_PLSDA"]]
ropls::plot(sacurine_gender.plsda,
            parAsColFcVn = colData(sacurine.se)[, "gender"],
            typeVc = "x-score")
```


### Feature selection

The `biosigner` approach is a wrapper method to select the features that
significantly contribute to the performance of a binary classifier,
namely PLS-DA, random forest, or Support Vector Machine
[@rinaudo_biosigner_2016]. We refer the interested reader to the
vignette from the `biosigner` package.

```{r biosigner, fig.width = 5, fig.height = 5}
sacurine.biosign <- biosigner::biosign(sacurine.se, "gender", bootI = 5)
```

Please note that the `bootI` parameter is set to 5 to speed up the
vignette building, but that we advice to keep the default 50 number of
bootstraps to obtain more stable signatures.

## **`annotating`**: MS annotation

This method is based on the
[**biodb**](https://www.bioconductor.org/packages/biodb/) package (and
its companion
[**biodbChebi**](https://www.bioconductor.org/packages/biodbChebi/)),
which provides access to standard remote chemical and biological
databases (ChEBI, KEGG, HMDB, ...), as well as to in-house local
database files (CSV, SQLite), with easy retrieval of entries, access to
web services, search of compounds by mass and/or name [@roger2022].

Viewing the parameters from the **`annotating`** method and their
default values:

```{r annotating_parameters}
annotating_parameters()
```

Querying chemical annotation from the Chemical Entities of Biological
Interest [ChEBI](https://www.ebi.ac.uk/chebi/) database:

1)  by using *mz* values:

```{r chebi_annotation, eval = FALSE}
sacurine.se <- annotating(sacurine.se,
                          database.c = "chebi",
                          param.ls = list(query.type = "mz",
                                          query.col = "mass_to_charge",
                                          ms.mode = "neg",
                                          mz.tol = 10,
                                          mz.tol.unit = "ppm",
                                          max.results = 3,
                                          prefix = "chebiMZ."),
                          report.c = "none")
knitr::kable(head(rowData(sacurine.se)[, grep("chebiMZ", 
                                              colnames(rowData(sacurine.se)))]))
```

2)  by using ChEBI identifiers:

```{r chebi_identifier, eval = FALSE}
sacurine.se <- annotating(sacurine.se, database.c = "chebi",
                          param.ls = list(query.type = "chebi.id",
                                          query.col = "database_identifier",
                                          prefix = "chebiID."))
knitr::kable(head(rowData(sacurine.se)[, grep("chebiID", 
                                              colnames(rowData(sacurine.se)))]))

```

Adding chemical information from a local database:

1)  loading a local (example) MS database:

```{r localdbDF}
msdbDF <- read.table(system.file("extdata/local_ms_db.tsv", package = "phenomis"),
                     header = TRUE,
                     sep = "\t",
                     stringsAsFactors = FALSE)
```

2)  Querying the local database:

```{r localms_annotation, message=FALSE, warning=FALSE}
sacurine.se <- annotating(sacurine.se,
                          database.c = "local.ms",
                          param.ls = list(query.type = "mz",
                                          query.col = "mass_to_charge",
                                          ms.mode = "neg",
                                          mz.tol = 5,
                                          mz.tol.unit = "ppm",
                                          local.ms.db = msdbDF,
                                          prefix = "localMS."),
                          report.c = "none")
knitr::kable(rowData(sacurine.se)[!is.na(rowData(sacurine.se)[, "localMS.accession"]), grep("localMS.", colnames(rowData(sacurine.se)), fixed = TRUE)])
```

## **`writing`**: Exporting the results

Finally, the `writing` method can be used to export the
`SummarizedExperiment` object in the 3 tabular file format. In case of a
`MultiAssayExperiment`, one subfolder containing the 3 files will be
created for each dataset.

```{r writing, eval = FALSE}
writing(sacurine.se, dir.c = getwd(), prefix.c = "sacurine")
```

## Graphical User Interface

The main features from the `phenomis` package can also be accessed via a
graphical user interface in the **Quality Metrics**, **Batch
Correction**, **Univariate**, and **Heatmap** modules from the
[workflow4metabolomics.usegalaxy.fr](https://workflow4metabolomics.usegalaxy.fr/)
online resource for computational metabolomics, based on the Galaxy
environment.

## Working with multi-omics datasets

The `MultiAssayExperiment` format is useful to handle **multi-omics**
datasets [@Ramos_SoftwareIntegrationMultiomics_2017]. The `phenomis`
package (as well as the `ropls` and `biosigner` packages) can be used to
process the data sets in parallel, such as all methods described before
can be directly applied to the `MultiAssayExperiment` object.

In several methods, specific parameters for each dataset can be passed
as a vector: for instance, with two datasets, if one wishes to log2
transform the intensities of the first dataset but not the second one,
the `method.vc = c("log2", "none")` parameter can be used with the
`transforming` method (in contrast, if only one value is provided, it
will be used for all datasets).

Here we apply `phenomis` statistical methods to the ProMetIS study ,
which addresses the molecular phenotyping of knock-out mouse models by
combined proteomics and metabolomics analysis
[@Imbert_PrometisDeepPhenotyping_2021]. The `phenomis` dataset has been
selected from the original data (publicly available on GitHub:
[IFB-ElixirFr/ProMetIS](https://github.com/IFB-ElixirFr/ProMetIS)) as
follows:

-   **liver** data

-   **WT** and KO mice for the **LAT** gene (Linker for Activation of T
    cells) only

-   the **proteomics** dataset and the **metabolomics** dataset obtained
    with the ZIC-pHILIC chromatographic column

-   **post-processed** data only

-   **100 features** only

The multi-omics dataset is saved as text files and can be loaded with
the `reading` function as a `MultiAssayExperiment` (or as a
`MultiDataSet` with the `output.c = "set"` argument)

```{r pms_read}
prometis.mae <- reading(system.file("extdata/prometis",
                                    package = "phenomis"))
```

Let us have a look at the (common) sample metadata:

```{r}
head(colData(prometis.mae))
```

We first explore the datasets:

```{r pms_inspect}
prometis.mae <- inspecting(prometis.mae, report.c = "none")
```

We then perform univariate hypothesis testing with the `limma` approach
[@Ritchie_LimmaPowersDifferential_2015]:

```{r pms_limma}
prometis.mae <- hypotesting(prometis.mae, "limma", "gene",
                            report.c = "none")
```

As with single omics studies, the `ropls` and `biosigner` can be
directly applied to `MultiAssayExperiment` for multivariate analysis:

```{r pms_plsda}
prometis.mae <- ropls::opls(prometis.mae, "gene")
```

and feature selection:

```{r pms_biosign, fig.width = 5, fig.height = 5}
prometis.mae <- biosigner::biosign(prometis.mae, "gene", bootI = 5)
```

The final `MultiAssayExperiment` can be exported as text files with:

```{r pms_writing, eval = FALSE}
writing(prometis.mae, dir.c = getwd(), prefix.c = "prometis")
```

# Appendix

## Additional examples of application to single and multiple omics data sets

### CLL data set

The `CLL` package contains the chronic lymphocytic leukemia (CLL) gene
expression data from 24 samples, that were either classified as
progressive or stable in regards to disease progression.

We first load the data (which are stored as an `ExpressionSet` object),
and restrict to the first 1,000 features (to speed up computations):

```{r cll_load}
data(sCLLex, package = "CLL")
cll.eset <- sCLLex[seq_len(1000), ]
```

We explore the data:

```{r cll_inspect, eval = FALSE}
cll.eset <- inspecting(cll.eset, report.c = "none")
```

Investigate whether clusters can be observed (we first add the letter
'p' or 's' in the sample names to indicate the phenotype: p =
progressive, s = stable):

```{r cll_heatmap, eval = FALSE}
Biobase::sampleNames(cll.eset) <- paste0(Biobase::sampleNames(cll.eset),
                                         "_",
                                         substr(Biobase::pData(cll.eset)[, "Disease"], 1, 1))

cll.eset <- clustering(cll.eset)
```

and perform hypothesis testing with the limma test (the dots must be
removed from the 'progress.' phenotype):

```{r cll_limma, eval = FALSE}
Biobase::pData(cll.eset)[, "Disease"] <- gsub(".", "",
                                              Biobase::pData(cll.eset)[, "Disease"],
                                              fixed = TRUE)
cll.eset <- hypotesting(cll.eset, "limma", "Disease")
```

All the subsequent multivariate analysis and feature selection steps
described with the `sacurine` data set can be performed with the
`cll.eset` accordingly.

### NCI60_4arrays data set

We provide an example based on the `NCI60_4arrays` cancer data set from
the `omicade4` package [@mengMultivariateApproachIntegration2014], which
has been made available in the `MultiAssayExperiment` format in the
`ropls` package. The data consist of gene transcript expression analysis
from NCI-60 cell lines by using 4 microarray platforms
[@Reinhold_CellMinerWebBasedSuite_2012].

Getting the `NCI60` data set as a `MultiAssayExperiment`:

```{r mae}
data(NCI60, package = "ropls")
nci.mae <- NCI60[["mae"]]
```

Let's focus on the **LE**ukemia and **ME**lanoma cancer types, and the
**agilent** and **hgu95** arrays:

```{r mae_focus, message=FALSE, warning=FALSE}
library(MultiAssayExperiment)
table(nci.mae$cancer)
nci.mae <- nci.mae[,
                   nci.mae$cancer %in% c("LE", "ME"),
                   c("agilent", "hgu95")]
```

We can now e.g. perform clustering on each data set independently:

```{r mae_clustering, message=TRUE, warning=TRUE, eval = FALSE}
nci.mae <- clustering(nci.mae)
```

and/or perform hypothesis testing on each data set:

```{r mae_hypotesting, eval = FALSE}
nci.mae <- hypotesting(nci.mae, "limma", "cancer", report.c = "none")
```

as well as all the analyses described above for the `prometis` data set.

## Cheat sheets for Bioconductor (multi)omics containers

### `SummarizedExperiment`

The `SummarizedExperiment` format has been designed by the Bioconductor
team to handle (single) omics datasets [@Morgan2022].

An example of `SummarizedExperiment` creation and a summary of useful
methods are provided below.

Please refer to [package
vignettes](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
for a full description of `SummarizedExperiment` objects .

```{r se_build}
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata

# Creating the SummarizedExperiment (package SummarizedExperiment)
library(SummarizedExperiment)
sac.se <- SummarizedExperiment(assays = list(sacurine = t(data.mn)),
                               colData = samp.df,
                               rowData = feat.df)
# note that colData and rowData main format is DataFrame, but data frames are accepted when building the object
stopifnot(validObject(sac.se))

# Viewing the SummarizedExperiment
# ropls::view(sac.se)
```

| [Methods]{.smallcaps}                           | [Description]{.smallcaps}                           | [Returned class]{.smallcaps}      |
|--------------------------|---------------------------|-------------------|
| [**Constructors**]{.smallcaps}                  |                                                     |                                   |
| **`SummarizedExperiment`**                      | Create a SummarizedExperiment object                | `SummarizedExperiment`            |
| **`makeSummarizedExperimentFromExpressionSet`** |                                                     | `SummarizedExperiment`            |
| [**Accessors**]{.smallcaps}                     |                                                     |                                   |
| **`assayNames`**                                | Get or set the names of assay() elements            | `character`                       |
| **`assay`**                                     | Get or set the ith (default first) assay element    | `matrix`                          |
| **`assays`**                                    | Get the list of experimental data numeric matrices  | `SimpleList`                      |
| **`rowData`**                                   | Get or set the row data (features)                  | `DataFrame`                       |
| **`colData`**                                   | Get or set the column data (samples)                | `DataFrame`                       |
| **`metadata`**                                  | Get or set the experiment data                      | `list`                            |
| **`dim`**                                       | Get the dimensions (features of interest x samples) | `integer`                         |
| **`dimnames`**                                  | Get or set the dimension names                      | `list` of length 2 character/NULL |
| **`rownames`**                                  | Get the feature names                               | `character`                       |
| **`colnames`**                                  | Get the sample names                                | `character`                       |
| [**Conversion**]{.smallcaps}                    |                                                     |                                   |
| **`as(, "SummarizedExperiment")`**              | Convert an ExpressionSet to a SummarizedExperiment  | `SummarizedExperiment`            |

### `MultiAssayExperiment`

The `MultiAssayExperiment` format has been designed by the Bioconductor
team to handle multiomics datasets
[@Ramos_SoftwareIntegrationMultiomics_2017].

An example of `MultiAssayExperiment` creation and a summary of useful
methods are provided below. Please refer to [package
vignettes](https://bioconductor.org/packages/MultiAssayExperiment/) or
to the original publication for a full description of
`MultiAssayExperiment` objects
[@Ramos_SoftwareIntegrationMultiomics_2017].

Loading the `NCI60_4arrays` from the `omicade4` package:

```{r mae_build_load}
data("NCI60_4arrays", package = "omicade4")
```

Creating the `MultiAssayExperiment`:

```{r mae_build, message = FALSE, warning=FALSE}
library(MultiAssayExperiment)
# Building the individual SummarizedExperiment instances
experiment.ls <- list()
sampleMap.ls <- list()
for (set.c in names(NCI60_4arrays)) {
  # Getting the data and metadata
  assay.mn <- as.matrix(NCI60_4arrays[[set.c]])
  coldata.df <- data.frame(row.names = colnames(assay.mn),
                           .id = colnames(assay.mn),
                           stringsAsFactors = FALSE) # the 'cancer' information will be stored in the main colData of the mae, not the individual SummarizedExperiments
  rowdata.df <- data.frame(row.names = rownames(assay.mn),
                           name = rownames(assay.mn),
                           stringsAsFactors = FALSE)
  # Building the SummarizedExperiment
  assay.ls <- list(se = assay.mn)
  names(assay.ls) <- set.c
  se <- SummarizedExperiment(assays = assay.ls,
                             colData = coldata.df,
                             rowData = rowdata.df)
  experiment.ls[[set.c]] <- se
  sampleMap.ls[[set.c]] <- data.frame(primary = colnames(se),
                                      colname = colnames(se)) # both datasets use identical sample names
}
sampleMap <- listToMap(sampleMap.ls)

# The sample metadata are stored in the colData data frame (both datasets have the same samples)
stopifnot(identical(colnames(NCI60_4arrays[[1]]),
                    colnames(NCI60_4arrays[[2]])))
sample_names.vc <- colnames(NCI60_4arrays[[1]])
colData.df <- data.frame(row.names = sample_names.vc,
                         cancer = substr(sample_names.vc, 1, 2))

nci.mae <- MultiAssayExperiment(experiments = experiment.ls,
                                colData = colData.df,
                                sampleMap = sampleMap)

stopifnot(validObject(nci.mae))
```

| [Methods]{.smallcaps}          | [Description]{.smallcaps}                                            | [Returned class]{.smallcaps} |
|-------------------|----------------------------------|-------------------|
| [**Constructors**]{.smallcaps} |                                                                      |                              |
| **`MultiAssayExperiment`**     | Create a MultiAssayExperiment object                                 | `MultiAssayExperiment`       |
| **`ExperimentList`**           | Create an ExperimentList from a List or list                         | `ExperimentList`             |
| [**Accessors**]{.smallcaps}    |                                                                      |                              |
| **`colData`**                  | Get or set data that describe the patients/biological units          | `DataFrame`                  |
| **`experiments`**              | Get or set the list of experimental data objects as original classes | `experimentList`             |
| **`assays`**                   | Get the list of experimental data numeric matrices                   | `SimpleList`                 |
| **`assay`**                    | Get the first experimental data numeric matrix                       | `matrix`, matrix-like        |
| **`sampleMap`**                | Get or set the map relating observations to subjects                 | `DataFrame`                  |
| **`metadata`**                 | Get or set additional data descriptions                              | `list`                       |
| **`rownames`**                 | Get row names for all experiments                                    | `CharacterList`              |
| **`colnames`**                 | Get column names for all experiments                                 | `CharacterList`              |
| [**Subsetting**]{.smallcaps}   |                                                                      |                              |
| **`mae[i, j, k]`**             | Get rows, columns, and/or experiments                                | `MultiAssayExperiment`       |
| **`mae[i, ,]`**                | i: GRanges, character, integer, logical, List, list                  | `MultiAssayExperiment`       |
| **`mae[,j,]`**                 | j: character, integer, logical, List, list                           | `MultiAssayExperiment`       |
| **`mae[,,k]`**                 | k: character, integer, logical                                       | `MultiAssayExperiment`       |
| **`mae[[i]]`**                 | Get or set object of arbitrary class from experiments                | (Varies)                     |
| **`mae[[i]]`**                 | Character, integer, logical                                          |                              |
| **`mae$column`**               | Get or set colData column                                            | `vector` (varies)            |
| [**Management**]{.smallcaps}   |                                                                      |                              |
| **`complete.cases`**           | Identify subjects with complete data in all experiments              | `vector` (logical)           |
| **`duplicated`**               | Identify subjects with replicate observations per experiment         | `list` of `LogicalLists`     |
| **`mergeReplicates`**          | Merge replicate observations within each experiment                  | `MultiAssayExperiment`       |
| **`intersectRows`**            | Return features that are present for all experiments                 | `MultiAssayExperiment`       |
| **`intersectColumns`**         | Return subjects with data available for all experiments              | `MultiAssayExperiment`       |
| **`prepMultiAssay`**           | Troubleshoot common problems when constructing main class            | `list`                       |
| [**Reshaping**]{.smallcaps}    |                                                                      |                              |
| **`longFormat`**               | Return a long and tidy DataFrame with optional colData columns       | `DataFrame`                  |
| **`wideFormat`**               | Create a wide DataFrame, one row per subject                         | `DataFrame`                  |
| [**Combining**]{.smallcaps}    |                                                                      |                              |
| **`c`**                        | Concatenate an experiment                                            | `MultiAssayExperiment`       |
| [**Viewing**]{.smallcaps}      |                                                                      |                              |
| **`upsetSamples`**             | Generalized Venn Diagram analog for sample membership                | `upset`                      |
| [**Exporting**]{.smallcaps}    |                                                                      |                              |
| **`exportClass`**              | Exporting to flat files                                              | `csv` or `tsv` files         |

### `ExpressionSet`

The `ExpressionSet` format (`Biobase` package) was designed by the
Bioconductor team as the first format to handle (single) omics data. It
has now been supplanted by the `SummarizedExperiment` format.

An example of `ExpressionSet` creation and a summary of useful methods
are provided below. Please refer to ['An introduction to Biobase and
ExpressionSets'](https://bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf)
from the documentation from the
[`Biobase`](https://doi.org/doi:10.18129/B9.bioc.Biobase) package for a
full description of `ExpressionSet` objects.

```{r eset_build, message = FALSE, warning = FALSE}
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the ExpressionSet (package Biobase)
sac.set <- Biobase::ExpressionSet(assayData = t(data.mn))
Biobase::pData(sac.set) <- samp.df
Biobase::fData(sac.set) <- feat.df
stopifnot(validObject(sac.set))
# Viewing the ExpressionSet
# ropls::view(sac.set)
```

| [Methods]{.smallcaps} | [Description]{.smallcaps}                         | [Returned class]{.smallcaps} |
|------------------|----------------------------------|--------------------|
| **`exprs`**           | 'variable times samples' numeric matrix           | `matrix`                     |
| **`pData`**           | sample metadata                                   | `data.frame`                 |
| **`fData`**           | variable metadata                                 | `data.frame`                 |
| **`sampleNames`**     | sample names                                      | `character`                  |
| **`featureNames`**    | variable names                                    | `character`                  |
| **`dims`**            | 2x1 matrix of 'Features' and 'Samples' dimensions | `matrix`                     |
| **`varLabels`**       | Column names of the sample metadata data frame    | `character`                  |
| **`fvarLabels`**      | Column names of the variable metadata data frame  | `character`                  |

### `MultiDataSet`

Loading the `NCI60_4arrays` from the `omicade4` package:

```{r mset_build_load}
data("NCI60_4arrays", package = "omicade4")
```

Creating the `MultiDataSet`:

```{r mset_build, message = FALSE, warning=FALSE}
library(MultiDataSet)
# Creating the MultiDataSet instance
nci.mds <- MultiDataSet::createMultiDataSet()
# Adding the two datasets as ExpressionSet instances
for (set.c in names(NCI60_4arrays)) {
  # Getting the data
  expr.mn <- as.matrix(NCI60_4arrays[[set.c]])
  pdata.df <- data.frame(row.names = colnames(expr.mn),
                        cancer = substr(colnames(expr.mn), 1, 2),
                        stringsAsFactors = FALSE)
  fdata.df <- data.frame(row.names = rownames(expr.mn),
                        name = rownames(expr.mn),
                        stringsAsFactors = FALSE)
  # Building the ExpressionSet
  eset <- Biobase::ExpressionSet(assayData = expr.mn,
                                 phenoData = new("AnnotatedDataFrame",
                                                 data = pdata.df),
                                 featureData = new("AnnotatedDataFrame",
                                                   data = fdata.df),
                                 experimentData = new("MIAME",
                                                      title = set.c))
  # Adding to the MultiDataSet
  nci.mds <- MultiDataSet::add_eset(nci.mds, eset, dataset.type = set.c,
                                     GRanges = NA, warnings = FALSE)
}
stopifnot(validObject(nci.mds))
```

| [Methods]{.smallcaps}          | [Description]{.smallcaps}                        | [Returned class]{.smallcaps} |
|---------------------|-------------------------------|--------------------|
| [**Constructors**]{.smallcaps} |                                                  |                              |
| **`createMultiDataSet`**       | Create a MultiDataSet object                     | `MultiDataSet`               |
| **`add_eset`**                 | Create a MultiAssayExperiment object             | `MultiDataSet`               |
| [**Subsetting**]{.smallcaps}   |                                                  |                              |
| **`mset[i, ]`**                | i: character,logical (samples to select)         | `MultiDataSet`               |
| **`mset[, k]`**                | k: character (names of datasets to select)       | `MultiDataSet`               |
| **`mset[[k]]`**                | k: character (name of the datast to select)      | `ExpressionSet`              |
| [**Accessors**]{.smallcaps}    |                                                  |                              |
| **`as.list`**                  | Get the list of data matrices                    | `list`                       |
| **`pData`**                    | Get the list of sample metadata                  | `list`                       |
| **`fData`**                    | Get the list of variable metadata                | `list`                       |
| **`sampleNames`**              | Get the list of sample names                     | `list`                       |
| [**Management**]{.smallcaps}   |                                                  |                              |
| **`commonSamples`**            | Select samples that are present in all datasets  | `MultiDataSet`               |
| [**Conversion**]{.smallcaps}   |                                                  |                              |
| **`mds2mae`**                  | Convert a MultiDataSet to a MultiAssayExperiment | `MultiAssayExperiment`       |

# Session info

Here is the output of `sessionInfo()` on the system on which this
document was compiled:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References