---
title: An introduction to contrast decomposition and querying using orthos
author:
- name: Panagiotis Papasaikas
affiliation:
- &id3 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics
email: panagiotis.papasaikas@fmi.ch
- name: Charlotte Soneson
affiliation:
- &id3 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics
email: charlotte.soneson@fmi.ch
- name: Michael Stadler
affiliation:
- &id3 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics
email: michael.stadler@fmi.ch
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('orthos')`"
output:
BiocStyle::html_document:
toc_float: true
editor_options:
chunk_output_type: console
bibliography: orthos.bib
geometry: "left=0cm,right=3cm,top=2cm,bottom=2cm"
vignette: >
%\VignetteIndexEntry{1. Introduction to orthos}
%\VignetteEncoding{UTF-8}
%\VignettePackage{orthos}
%\VignetteKeywords{cVAE, VariationalAutoEncoders, DGE, DifferentialGeneExpression, RNASeq}
%\VignetteEngine{knitr::rmarkdown}
---
```{r setup, include = FALSE, echo=FALSE, results="hide", message=FALSE}
require(knitr)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
error = FALSE,
warning = FALSE,
message = FALSE,
crop = NULL
)
stopifnot(requireNamespace("htmltools"))
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```
![](orthos.png){width=50%}
```{css, echo = FALSE}
body {
margin: 0 auto;
max-width: 1600px;
padding: 2rem;
}
```
# Introduction
RNAseq-based **differential expression analysis upon cellular perturbations**,
such as gene knockouts, RNA knockdowns or compound treatment experiments, is
the most commonly used tool for probing molecular mechanisms of action due to
its simplicity and low cost.
However, interpretation of such gene expression contrasts is confounded by
the complex and nuanced impacts of experimental treatments on cellular processes.
For example, knockout or over-expression of a transcription factor will not
only alter the transcription of its direct target genes, but also cause many
secondary expression changes. In addition, treatments or treatment delivery
agents typically elicit a variety of unintended, systemic responses
(such as immune, toxic, metabolic) that cannot be well-controlled for by the
design of the study.
The final experimentally measured gene expression changes are a hard to assess
convolution of **specific** and **non-specific** secondary and lateral treatment
effects.
`orthos` is a generative modelling-based approach that disentangles the
experiment-specific from the non-specific effects of perturbations on gene
expression. It is trained on a large corpus of gene expression contrasts
(per organism >60K annotated, >0.5M augmented), compiled from the [ARCHS4](https://maayanlab.cloud/archs4/)
database of uniformly processed RNAseq experiments (@lachmann2018massive).
It accurately captures and isolates **non-specific** effects (effects
that are observed across multiple treatments) while accounting for context
(tissue or cell-line experimental background).
The residual **specific** component obtained from this decomposition offers a
more unequivocal experimental signature and is more closely related to the
direct molecular effects of the perturbation when compared to the raw signal.
In addition to providing a clearer understanding of the effects of experimental
treatments on gene expression, `orthos` also enables researchers to **query the
contrast database** with arbitrary contrasts and identify experiments with similar
specific effects, ultimately helping to **map treatments to mechanisms of action**.
# Installation and overview
`orthos` can be installed from from Bioconductor using `BiocManager::install()`:
```r
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("orthos")
# or also...
BiocManager::install("orthos", dependencies = TRUE)
```
After installation, the package can be loaded with:
```{r library}
library(orthos)
```
A typical analysis involves two steps:
1. **Decomposing** one or several contrasts into their corresponding specific and
non-specific components using the `decomposeVar()` function and
2. **Performing queries** with the original and decomposed specific and non-specific
contrasts against the contrast database using the `queryWithContrasts()`
function.
# Demonstration data
To demonstrate the functionality of `orthos` we use a dataset from the the GEO
series [GSE215150](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE215150).
This series was not part of the `orthos` training or the `orthosData` contrast
database; it was only publicly released on January 1st 2023 after the freeze of
the training data to the ARCHS4 v2.1.2 database.
The performed experiment involves over-expression (OE) of the MKL/megakaryoblastic
leukemia 1 gene (also termed MRTFA/myocardin related transcription factor A) and
a constitutively active mutant MKL1 (caMKL1, described in @hu2019mkl1).
Both OE experiments were performed in mouse LM2 and human 4T1 tumor-derived
breast cancer cell lines.
In addition to the MKL1/caMKL1 OE samples, the series also contains no-treatment
controls for each of the two cell lines.
For simplicity the three biological replicates of each available condition
have been collapsed in the data provided in the package.
In the provided form each of the two datasets (Mouse, Human) contains raw counts
for over 55,000 genes identified by gene symbols in three conditions:
Control (Ctrl), MKL1 OE (MKL1) and constitutively-active MKL1 OE (caMKL1).
Load the human dataset:
```{r}
MKL1_human <- readRDS(system.file("extdata", "GSE215150_MKL1_Human.rds",
package = "orthos"))
head(MKL1_human)
dim(MKL1_human)
```
Load the mouse dataset:
```{r}
MKL1_mouse <- readRDS(system.file("extdata", "GSE215150_MKL1_Mouse.rds",
package = "orthos"))
head(MKL1_mouse)
dim(MKL1_mouse)
```
# Decomposition of differential gene expression variance into specific and non-specific components using `decomposeVar()`
## Prelude: A short overview of the `orthos` models
The workhorse behind `orthos` are organism-specific conditional variational
autoencoder (cVAE) models that break down the variance of a given differential
expression experiment into a non-specific and an experiment-specific
component.
The non-specific component corresponds to gene variance that has
been observed across multiple instances during training, while the
experiment-specific variance is fairly unique to the experiment.
The inputs to the models are gene counts in the form of log2-transformed
counts per million (LCPMs) that are used to encode the **context** of the
performed experiment as well as the actual gene expression **contrasts** in the form
of log2 fold-changes (LFCs), i.e log2-transformed CPM ratios.
As we will see, calculation of those inputs is by default performed internally
given only **raw gene counts** and a specification of the contrasted conditions.
Given these inputs, the model maps the contrast to a concise latent
representation (z~D~) which retains its recurring -and therefore compressible-
traits. The compressed latent representation is then used to reconstruct a
decoded version of the contrast\*.
The **decoded** contrast corresponds directly to the **non-specific** component
of the observed effects; it subsumes regularities i.e gene variance that the model
can account for because it has been repeatedly encountered, in some form, during training.
The **residual** obtained after removing the decoded contrast from the original one
is the **specific** component; this encompasses the gene variance that the model
cannot account for (experiment-specific biological effects + random noise).
From now on the terms decoded and non-specific will be used in conjunction or interchangeably.
Ditto for the terms residual and specific.
![orthos architecture](orthos_arch.png){width=100%}
\* Notice that both the latent encoding and the decoded output are conditioned on context
(i.e they are context-specific). This means that decomposing a contrast on a different context
will produce a different output. An interesting ancillary application of this conditioning is that
one can "morph" decoded contrasts to **in-silico evaluate non-specific effects in new contexts**.
In essence, we can infer what the non-specific effects would look like had the experiment been performed in
e.g different cell-lines, tissues, batches, patients or using different library-preparation protocols.
The inferred decodings will reflect the new context in multiple ways. For example, genes that
were non-detected/non-expressed in the original contrast (and therefore had neutral LFCs )
will produce new (possibly non-neutral) decoded LFCs if present in the new context and vice versa.
This **"out-of-context"** type of inference is limited to the decoded contrast, as by definition, the residual is not part of
the model's generative capacity.
Mechanically, out-of-context inference of non-specific effects is a simple as evaluating the same contrast using multiple contexts when
calling `decomposeVar()` (see section below).
## Contrast decomposition with `decomposeVar()`
The function that performs this contrast decomposition into the non-specific and
specific components is `decomposeVar()`.
There are two available modes in which the required inputs can be fed into the
function:
- In the first mode the user passes the matrix `M` of raw counts (genes in rows,
conditions in columns) and two vectors `treatm` and `cntr` of equal length
specifying the column indices in `M` corresponding to the "treatments" and
their respective "controls". The same column indices can be repeated multiple
times in these vectors, for example in the case where multiple treatments are
paired to the same control.
- In the second mode the user passes the matrix `M` of raw counts and a second
matrix `MD` that contains pre-calculated log2 fold-changes for the contrasts to
be analyzed. In this mode, `M` specifies the contexts for the corresponding
columns of `MD`, and thus the two matrices need to have the same
dimensionality and identical row- and column-names.
This would be the mode of choice if e.g one wishes to produce the LFCs independently
or if one wants to evaluate the decoding of the same contrast(s) in multiple contexts
(e.g for "out-of-context" inference of non-specific effects described above). In the
latter case, copies of the same contrast in columns of `MD` will be paired with
columns of `M` specifying the different contexts.
In both modes the rownames of `M` (and `MD` if specified) need to correspond to
valid gene identifiers (currently `orthos` supports Entrez gene identifiers,
ENSEMBL gene identifiers, gene symbols or ARCHS4 gene identifiers).
By default the type of gene identifier is detected automatically.
The first time that `decomposeVar` is executed for a particular organism, the
models required for inference will be automatically downloaded from
`ExperimentHub` and cached in the user ExperimentHub directory
(see `ExperimentHub::getExperimentHubOption("CACHE")`) using the `orthosData`
companion package.
For the MKL1 data, which are stored as raw counts, it is more natural to
call `decomposeVar()` using the first mode.
Decomposing the human contrasts:
```{r, warning = FALSE, message = FALSE}
#Decompose MKL1-vs-Cntrl and caMKL1-vs-Cntrl contrasts for human:
dec_MKL1_human <- decomposeVar(M = MKL1_human, treatm = c(2, 3), cntr = c(1, 1),
organism = "Human", verbose = FALSE)
dec_MKL1_human
```
Decomposing the mouse contrasts:
```{r, warning = FALSE, message = FALSE}
#Decompose MKL1-vs-Cntrl and caMKL1-vs-Cntrl contrasts for mouse:
dec_MKL1_mouse <- decomposeVar(M = MKL1_mouse, treatm = c(2, 3), cntr = c(1, 1),
organism = "Mouse", verbose = FALSE)
dec_MKL1_mouse
```
The output of `decomposeVar()` is a `SummarizedExperiment` object with
dimensions `N` x `M`, where `N` is the number of `orthos`
genes\* for that organism and `M` is the number of contrasts specified during input.
The `SummarizedExperiment` output also has 4 assay slots corresponding to
the input contrasts, decoded (non-specific), and residual (specific) components,
as well as the gene context.
Contrasts are represented as log2 fold-changes (LFCs) and context is represented
as log2-transformed counts per million (log2 CPM).
We can use the returned object to produce an MA plot for the original contrast
or to check how the input and decomposed contrasts are related to each other.
For example for the mouse caMKL1 contrast:
```{r}
suppressPackageStartupMessages({
library(ggplot2)
library(SummarizedExperiment)
})
assays(dec_MKL1_mouse)
#MA plot of for the input contrasts:
DF <- data.frame(L2CPM= assay(dec_MKL1_mouse,"CONTEXT")[,2],
L2FC_INPUT=assay(dec_MKL1_mouse,"INPUT_CONTRASTS")[,2],
L2FC_DECODED=assay(dec_MKL1_mouse,"DECODED_CONTRASTS")[,2],
L2FC_RESIDUAL=assay(dec_MKL1_mouse,"RESIDUAL_CONTRASTS")[,2]
)
#MA plot of for the input contrast
P1 <- ggplot(data=DF, aes(x=L2CPM, y=L2FC_INPUT)) +
geom_point(alpha=0.4, size=1.8) +
geom_hline(aes(yintercept = 0), colour = "darkgray", linewidth = 0.5) +
xlab("Expression (Log2 CPMs)") +
ylab("Log2 Fold Change")
#Delta-delta plots for the input and decomposed contrast fractions
P2 <- ggplot(data=DF, aes(x=L2FC_INPUT, y=L2FC_DECODED)) +
geom_point(alpha=0.4, size=1.8) +
geom_hline(aes(yintercept = 0), colour = "darkgray", linewidth = 0.5) +
xlab("Log2 Fold Change INPUT") +
ylab("Log2 Fold Change DECODED")
P3 <- ggplot(data=DF, aes(x=L2FC_INPUT, y=L2FC_RESIDUAL)) +
geom_point(alpha=0.4, size=1.8) +
geom_hline(aes(yintercept = 0), colour = "darkgray", linewidth = 0.5) +
xlab("Log2 Fold Change INPUT") +
ylab("Log2 Fold Change RESIDUAL")
P4 <- ggplot(data=DF, aes(x=L2FC_DECODED, y=L2FC_RESIDUAL)) +
geom_point(alpha=0.4, size=1.8) +
geom_hline(aes(yintercept = 0), colour = "darkgray", linewidth = 0.5) +
xlab("Log2 Fold Change DECODED") +
ylab("Log2 Fold Change RESIDUAL")
cowplot::plot_grid(P1,P2,P3,P4)
```
As expected, both the decoded and residual components are correlated to the input contrast.
However, the residual and decoded components are largely uncorrelated.
The `colData` of the object summarizes the proportion of variance accounted for
in each decomposed component:
```{r}
colData(dec_MKL1_human)
colData(dec_MKL1_mouse)
```
\* Notice that, of the total gene features present in the input (over 55,000), only
~20,000 genes are part of the `orthos` model () and the `decomposeVar()` output.
These ~20,000 `orthos` genes are "sanctioned" according to several criteria
(located on canonical chromosomes, no pseudogenes, no ribosomal protein genes,
detected in at least a small fraction of the ARCHS4 database).
The model is highly robust to small fractions of `orthos` genes not being part
of the user input, even if those genes are expressed in the context under
consideration. That being noted, it is safer to feed-in inputs that are as
comprehensive as possible, i.e **not filtered in any way**, in terms of gene
features.
# Querying the database of gene contrasts using `queryWithContrasts()`
Typically, the next step of the analysis involves querying the contrasts
database (`orthosData`) to identify public experiments similar to the one(s)
under investigation, either in terms of the original or decomposed decoded (non-specific)
and residual (specific) contrasts. As we will see in the following examples the results'
of these queries can guide the interpretation of of the different contrast fractions.
`orthosData` contains over 100,000 differential gene expression experiments
compiled from the ARCHS4 database of publicly available expression data (@lachmann2018massive).
Each entry in `orthosData` corresponds to a pair of RNAseq samples contrasting
a treatment vs a control condition.
A combination of metadata, semantic and quantitative analyses was used to
determine the proper assignment of samples to such pairs in `orthosData`.
The function that performs the queries against `orthosData` is
`queryWithContrasts()`. The input to this function is the `SummarizedExperiment`
object obtained in the previous step from `decomposeVar()`, either the complete
object or one that has been column-subsetted, allowing to query the
contrast database with only a subset of the decomposed contrasts.
As was the case for the `orthos` models, a database will be automatically downloaded from
`ExperimentHub` and cached in the user ExperimentHub directory
(see `ExperimentHub::getExperimentHubOption("CACHE")`) using the `orthosData`
companion package, the first time `queryWithContrasts()` is called for that
database or the first time the user attempts to access the database directly
with `loadContrastDatabase()` (see [Accessing the contrast database]) .
The `queryWithContrasts()` function returns a list with three elements per query contrast:
- "pearson.rhos" is itself a list with each element containing the Pearson
correlation values against all the `orthosData` entries for a specific
component (input, decoded/non-specific, residual/specific).
- "zscores" is also a list with each element containing the z-score transformed
version of the "pearson.rhos" values.
- "TopHits" is also a list with detailed `orthosData` metadata for each of the
top `detailTopn` hits per component (default 10).
In the following examples, please note that the queries are run using
`mode = "DEMO"` in order to keep computations short. For actual analyses,
the default `mode = "ANALYSIS"` should be used.
Examples queries using the decomposed human MKL1 data:
```{r, fig.width = 12}
# parallelization parameters:
params <- BiocParallel::MulticoreParam(workers = 2)
# for demonstration purposes (for actual analyses, use 'mode = "ANALYSIS"'):
query.res.human <- queryWithContrasts(dec_MKL1_human, organism = "Human",
BPPARAM = params, verbose = FALSE,
mode = "DEMO")
names(query.res.human)
names(query.res.human$zscores)
# query contrasts in rows, `orthosData` entries in columns:
dim(query.res.human$zscores$RESIDUAL_CONTRASTS)
summary(t(query.res.human$zscores$RESIDUAL_CONTRASTS))
#Information on the top hits of the query using the residual human MKL1/caMKL1 contrasts:
query.res.human$TopHits$RESIDUAL_CONTRASTS
```
Example queries using the decomposed mouse MKL1 data:
```{r, fig.width = 12}
# query the database using only the "caMKL1" mouse contrast, suppress plotting:
# for demonstration purposes (for actual analyses, use 'mode = "ANALYSIS"'):
query.res.mouse <- queryWithContrasts(dec_MKL1_mouse[, "caMKL1"], organism = "Mouse",
BPPARAM = params, verbose = FALSE,
plotType = "none", mode = "DEMO")
# plot results for individual contrasts using violin plots:
ViolinPlots_mouse <- plotQueryResultsViolin(query.res.mouse, doPlot = FALSE)
ViolinPlots_mouse[["caMKL1"]]
# plot results for individual contrasts using composite Manhattan/Density plots:
ManhDensPlots_mouse <- plotQueryResultsManh(query.res.mouse, doPlot = FALSE)
ManhDensPlots_mouse[["caMKL1"]]
#Information on the top hits of the query using the residual mouse caMKL1 contrasts:
query.res.mouse$TopHits$RESIDUAL_CONTRASTS
```
The top hits obtained for the residual (specific) fractions of either MKL1 or caMKL1 contrasts both in human and mouse are more clearly separated
from the background compared to those obtained from the input or decoded (non-specific) fractions.
More importantly closer inspection of those top hits for the residual contrasts in both experiments (e.g hits from series GSE77120, GSE112277 or GSE140898 in human or GSE164860 in mouse) reveal that they correspond to treatments involving either MKL/MRTFA overexpression or overexpression of the MKL related transcription factor MYOCD.
Of note, these treatments were performed in various cell contexts, different from the ones of the MKL study under consideration (LM2 and 4T1 cell lines for mouse and human respectively).
In general, as in this example, the residual specific fraction of a DGE profile will be a better query proxy for **molecularly and mechanistically related** treatments as it is largely stripped of nuisance variance present in the original contrast.
On the other hand the decoded non-specific fraction and its corresponding query hits can also be of interest in some applications as they provide information on the extent and type of **downstream/secondary or lateral** treatment effects.
# Accessing the contrast database
The `orthos` package provides functionality for direct access to contrast databases of `orthosData` with the
`loadContrastDatabase()` function.
This can be used to retrieve contrast values for all or subsets of genes or metadata for specific datasets, e.g for hits identified with `queryWithContrasts()`.
The organism-specific databases are compiled as HDF5SummarizedExperiment objects.
As was the case for the `orthos` models, a database will be automatically downloaded from
`ExperimentHub` and cached in the user ExperimentHub directory
(see `ExperimentHub::getExperimentHubOption("CACHE")`) using the `orthosData`
companion package, the first time `loadContrastDatabase()` is called for that
database either directly or via `queryWithContrasts()`.
The HDF5SummarizedExperiment object contains pre-calculated INPUT, RESIDUAL and DECODED log2 fold change contrasts as well as
the corresponding expression CONTEXT in log2 CPM representation for all the datasets in `orthosData`.
Extensive gene and contrast annotation is available in the object's `rowData` and `colData` respectively.
```{r}
organism <- "Mouse"
orthosDB <- loadContrastDatabase(organism = "Mouse", mode = "DEMO")
orthosDB
#Available contrast annotations:
colnames(colData(orthosDB))
#Available gene annotations:
colnames(rowData(orthosDB))
#Retrieve partial annotation for a specific contrast
#returned as a top-hit in the mouse caMKL1 query above:
colData(orthosDB)["GSM5021181", c("title", "series_id", "CNTname")]
# Compare context and individual contrast fractions between
# the mouse caMKL1 contrast under consideration and the "GSM5021181"
# query hit:
par(mfrow = c(2, 2))
queryID <- "GSM5021181"
for (contrast in names(assays(dec_MKL1_mouse))[c(4, 1, 2, 3)]) {
unit <- "L2FC"
if (contrast == "CONTEXT") {unit <- "L2CPM"}
plot(assays(dec_MKL1_mouse)[[contrast]][, "caMKL1"],
assays(orthosDB)[[contrast]][, queryID],
pch = 16, cex = 0.5, col = "darkslategrey", main = contrast,
xlab = paste0(unit, " caMKL1"), ylab = paste0(unit, " ", queryID))
abline(0, 1, col = "darkred", lwd = 0.8, lty = 2)
}
```
# Advanced use cases: Directly accessing the orthos models
As (1) typical `orthos` use cases do not require direct access to the models and
(2) use of the models requires loading of a conda environment via `basilisk` this functionality
is by default not exposed to the user and is carried out transparently
by the non-exported functions `.predictEncoder()` and `.predictEncoderD()`.
However, as we envision cases where directly accessing the models might be of interest we provide here a brief overview and examples
for direct calls to these functions.
The `orthos` models are implemented in `Keras`. For each organism there are two types of models:
- A **context encoder** that produces a latent embedding for a specific input context (represented as L2CPMs) and
- A **contrast conditional variational autoencoder** that first produces latent embeddings of input contrasts
(represented as LFCs) conditioned on context embeddings and then generates decoded versions of those
contrasts, again conditioned on the context embeddings (see also figure in [Prelude: A short overview of the `orthos` models]).
As noted previously the first time these models are requested either by `decomposeVar` or directly
by `.predictEncoder()` and `.predictEncoderD()` they are downloaded and
cached in the user ExperimentHub directory (see `ExperimentHub::getExperimentHubOption("CACHE")`)
using the `orthosData` companion package.
When calling the `.predictEncoder()` and `.predictEncoderD()` methods directly be attentive to the following:
- Before the call the predefined conda environment `orthos:::orthosenv` needs to be activated using `basilisk::basiliskStart()`
- The inputs passed to the models need to have appropriate size and value representation (see examples below).
- After the calls it's good practice to deactivate the environment with `basilisk::basiliskStop()`.
We now demonstrate calls to the context encoder for generating a latent embedding of a specific context and to the
contrast conditional variational autoencoder for producing a contrast latent embedding and decoding.
```{r}
# mouse MKL1 context and contrast with the appropriate shape and representation.
#
# Shape of models input is M x N,
# where M is the number of conditions,
# N the number of features -i.e orthos Genes
#
# Representation is L2CPMs for contexts and L2FCs for contrasts.
#
CONTEXT <- t(assay(dec_MKL1_mouse,"CONTEXT")[,1])
CONTRAST <- t(assay(dec_MKL1_mouse,"INPUT_CONTRASTS")[,1])
# Activate the `basilisk` environment:
library(basilisk)
cl <- basiliskStart(orthos:::orthosenv,
testload = "tensorflow")
# Produce a latent embedding for the context with .predictEncoder:
LATC <- basilisk::basiliskRun(proc = cl,
fun = orthos:::.predictEncoder,
organism = "Mouse",
gene_input = CONTEXT)
# Produce a latent embedding and decoding for the contrast with .predictEncoderD:
res <- basilisk::basiliskRun(proc = cl,
fun = orthos:::.predictEncoderD,
organism = "Mouse",
delta_input = CONTRAST, context = LATC)
# Deactivate the `basilisk` environment:
basilisk::basiliskStop(cl)
# Access the contrast latent embedding and decoding from the .predictEncoderD returned result:
LATD <- res$LATD
DEC <- res$DEC
```
Calls similar to the ones above are carried out under the hood when
`decomposeVar()` is called.
# Session information {-}
```{r}
sessionInfo()
```
# References