---
title: "Cop1 role in pro-inflammatory response"
author:
  - name: Lan Huong Nguyen
    affiliation: ICME,  Stanford University, CA 94305
keywords: time-course, time series, PCA, clustering, differential expression.
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    df_print: kable
vignette: >
  %\VignetteIndexEntry{Gene expression time course data analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="asis"}
rm(list = ls())
library("rmarkdown")
options(width = 200)
```


Abstract {#abstract .unnumbered}
================================

The analysis of time-series data is non-trivial
as in involves dependencies between data points.
Our package helps researchers analyze RNA-seq
datasets which include gene expression measurements
taken over time. The methods are specifically
designed for at datasets with a small number of
replicates and time points -- a typical case for 
RNA-seq time course studies. Short time courses 
are more difficult to analyze, as many statistical 
methods designed for time-series data might require 
a minimum number of time points, e.g. functional 
data analysis (FDA) and goodness of fit methods might 
be ineffective. Our approach is non-parametric 
and gives the user a flexibility to incorporate 
different normalization techniques, and distance
metrics. `TimeSeriesExperiment` is a comprehensive 
time course analysis toolbox with methods for data 
visualization, clustering and differential expression
analysis. Additionally, the package can perform
enrichment analysis for DE genes.


Introduction {#introduction .unnumbered}
========================================

We will demonstrate the effectiveness of `TimeSeriesExperiment` package using
the in-house dataset exploring the role of Cop1 in pro-inflammatory response.
The experiments were designed to study the induction and repression kinetics
of genes in bone marrow derived macrophages (BMDMs).

The dataset includes cells from 6 mice. The cells were divided into two equal
groups. For one group the Cop1 gene was in vitro knock out with tamoxifen.
All samples where then subject to LPS treatment to induce an inflammatory
response. Bulk RNA-seq was performed at 6 time-points: one at time 0 before
LPS treatment, then at time 2.5, 4, 6, 9 and 13 hours after LPS was added.


Obtain and process data {#get-data .unnumbered}
===============================================


First we load the necessary packages.

```{r message=FALSE, warning=FALSE}
.packages <- c(
  "edgeR", "viridis", "Biobase", "BiocFileCache", "SummarizedExperiment", 
  "ggplot2", "dplyr", "tidyr", "tibble", "readr", 
  "TimeSeriesExperiment")
sapply(.packages, require, character.only = TRUE)
theme_set(theme_bw())
theme_update(
  text = element_text(size = 15),
  strip.background = element_rect(fill = "grey70"),
  strip.text = element_text(size = 10))
```

The dataset we will study is available from GEO repositories under
accession number: `GSE114762`. We can import the processed read counts
saved in a supplementary file 'GSE114762_raw_counts.csv.gz'.
Similarly both the phenotype and feature data are also available
for download as 'GSE114762_sample_info.csv.gz', and
'GSE114762_gene_data.csv.gz' files respectively.

```{r, message=FALSE, warning=FALSE}
# Remote location of the data
urls <- paste0(
    "https://www.ncbi.nlm.nih.gov/geo/download/",
    "?acc=GSE114762&format=file&file=",
    c(
        "GSE114762_raw_counts.csv.gz",
        "GSE114762_gene_data.csv.gz",
        "GSE114762_sample_info.csv.gz"
    )
)
```


We will save the files to cache using `BiocFileCache`
utilities to avoid unnecessary multiple downloading.

```{r, message=FALSE, warning=FALSE}
library(BiocFileCache)
bfc <- BiocFileCache(ask = FALSE)

cnts <- read_csv(bfcrpath(rnames = urls[1])) %>%
  remove_rownames() %>%
  as.matrix()
rownames(cnts) <- cnts[,1]
cnts <- cnts[,-1]

gene.data <- read_csv(bfcrpath(rnames = urls[2])) %>%
  as.data.frame() %>%
  remove_rownames() 
rownames(gene.data) <- gene.data[,1]
gene.data <- gene.data[,-1]

pheno.data <- read_csv(bfcrpath(rnames = urls[3])) %>%
  as.data.frame() %>%
  remove_rownames() 
rownames(pheno.data) <- pheno.data[,1]
pheno.data <- pheno.data[,-1]
```


Building `TimeSeriesExperiment` object {#build-object .unnumbered}
========================================================

We can now combine all the data, `cnts`, `gene.data` and `pheno.data`,
into a single `TimeSeriesExperiment` (S4) object. The data will store 
everything together in a way that is easier to perform further time course 
data analysis. The most important fields in the object are "assays",
"colData" which will contain information on group, replicate
and time associated with each sample. In `TimeSeriesExperiment` 
the time variable MUST be numeric.

```{r}
cop1.te <- TimeSeriesExperiment(
  assays = list(cnts),
  rowData = gene.data,
  colData = pheno.data,
  timepoint = "time",
  replicate = "replicate",
  group = "group"
)
cop1.te
```


Alternatively, we can build `TimeSeriesExperiment` object from a 
`ExpressionSet` or `SummarizedExperiment` objects.

To show how this is done, we first combine the three data tables into
an `ExpressionSet`  objects.

```{r}
cop1.es <- ExpressionSet(
  as.matrix(cnts),
  phenoData = AnnotatedDataFrame(pheno.data),
  featureData = AnnotatedDataFrame(gene.data))
```

Then, we can easily convert an `ExpressionSet` to a `TimeSeriesExperiment` 
object using `makeTimeSeriesExperimentFromExpressionSet()` function, and 
indicating columns names with relevant information.

```{r}
(cop1.te <- makeTimeSeriesExperimentFromExpressionSet(
  cop1.es, timepoint = "time", group = "group",
  replicate = "individual"))
```

Repeating the same for `SummarizedExperiment`:

```{r}
cop1.se <- SummarizedExperiment(
  assays = list(counts = cnts),
  rowData = gene.data, colData = pheno.data)

(cop1.te <- makeTimeSeriesExperimentFromSummarizedExperiment(
  cop1.se, time = "time", group = "group",
  replicate = "individual"))

# Remove raw data
rm(cnts, gene.data, pheno.data, cop1.se, cop1.es)
```


Data normalization and filtering {#norm-and-filter .unnumbered}
---------------------------------------------------------------

The raw read counts cannot be immediately used for analysis, as
sequencing data involves the issue of varying sample depths.
We can convert the raw counts to counts per million (CPM)
using `TimeSeriesExperiment` function, `normalizeData()`, which performs
data normalization by column. Currently, we only support scaling
sample counts by constant factors (size factors). If the argument
`column.scale.factor` is not specified, by default `TimeSeriesExperiment` 
divides by column sums and multiplies by 1 million to obtain CPMs. The 
normalized data is stored separately from the raw data in a slot "data".


```{r}
# Compute CPMs 
cop1.te <- normalizeData(cop1.te)
```


Since the dataset contains more than 36k genes, we will filter out the very
rare ones which we assume to be too noisy and not containing enough signal
for further analysis.

Here, we find and remove all genes which have the mean expression (CPM) below
`min_mean_cpm = 50` within either of the two groups of interest,
*wild type* or *knock-out*. We set a very large threshold of 100
for this vignette only because the graphics below would make the size of this
vignette too large. When running on the your own study you should pick
the threshold more carefully.

```{r}
# Fine genes with minimum mean count of 5 at least in one of the two groups
min_mean_cpm <- 5
group_cpm_means <- data.frame(row.names = rownames(cop1.te))
norm.cnts <- assays(cop1.te)$norm
for(g in unique(groups(cop1.te))) {
  g_cnts <- norm.cnts[ , which(groups(cop1.te) == g)]
  group_cpm_means[, g] <- apply(g_cnts, 1, mean)
}
group_cpm_max <- apply(as.vector(group_cpm_means), 1, max)
genes_expressed <- rownames(cop1.te)[group_cpm_max > min_mean_cpm]
```

```{r}
# Filter out the noisy genes
cop1.te <- filterFeatures(cop1.te, genes_expressed)
cop1.te
```


Collapse replicates {#collapse-replicates .unnumbered}
------------------------------------------------------

In parts of our later analysis, we will make comparisons between genes,
and therefore it is useful to aggregate gene expression across replicates
to obtain their mean behavior. To do this we can use `collapseReplicates()`
function for `TimeSeriesExperiment`. The function saves collapsed sample
and aggregated expression data in "sample.data.collapsed", and "data.collapsed"
respectively.

```{r}
# Collapsed sample data stored in cop1.te@sample.data.collapsed,
# and mean expression values in cop1.te@data.collapsed
cop1.te <- collapseReplicates(cop1.te, FUN = mean)
```


Time course format {#timecourse-format .unnumbered}
===================================================

The main focus on `TimeSeriesExperiment` is to analyze and visualize time-series
data efficiently. For this reason, we convert the expression data in
a form of a rectangular matrix into a "time-course format" where
each row stores a single time series corresponding to a specified
combination of group membership and and replicate id (here mouse id).
This data wrangling step can be performed with `makeTimeSeries()`
function, the "time-course" will be stored in a slot `timeSeries`.
This slot contains a list containing data.frames `tc` and `tc_collapsed`
(if `assayCollapsed` is defined).

Before converting data to "time-course" format, gene transformation
should be performed. Transformation is thus allows
for a fair gene-to-gene comparison. For example, when clustering genes,
one uses pairwise distances to estimate dissimilarities between genes.
Since, we are generally more interested in grouping genes based on similarities
in their trajectories rather than their absolute expression levels, the
read counts must be transformed before computing the distances.

Currently, gene transformation methods available in `TimeSeriesExperiment` are
"scale_feat_sum" (scaling by gene sum) or "var_stab" (variance stabilization).
The user can specify a variance stabilization method if "var_stab"
is used. VST methods supported are: "log1p" (log plus one), "asinh"
(inverse hyperbolic sine) or "deseq"
(`DESeq2::varianceStabilizingTransformation`).

Usually simply scaling by the gene sum, that is normalizing so that the
total abundance the same (and equal to 1) for all genes gives good clustering
of gene trajectories.

```{r}
# Before conversion, scales gene expression to an even sum, for a fair
# gene-to-gene comparison.
cop1.te <- makeTimeSeries(
    cop1.te, feature.trans.method = "scale_feat_sum")
```

```{r}
# untransfomed
head(timeSeries(cop1.te, "ts"))
```

```{r}
# transfomed
head(timeSeries(cop1.te, "ts_trans"))
```


Lag differences {#lags .unnumbered}
-----------------------------------

Time-series data have a dependency structure and are not standard multivariate
vectors. Many methods have been developed for representing time-series data.
A common technique is for example to fit functions e.g. polynomials
or splines to the data. A similar approach is taken in functional data analysis
(FDA) literature, where time series  are represented as linear combinations
of basis functions (e.g. principal functions). These methods seek to smooth
the data and to reduce of complexity of the inherent (infinite dimensional)
functional data. The fitted coefficients are often used as the time-series
representation then used for clustering or visualization.

Unfortunately, most of the biological time course studies are short, sometimes
containing as few as three or four time-points. Therefore, fitting functions
to sparse time points would be inefficient. Instead, here we propose a simpler
way to incorporate the dependency structure of the time-series. Our
method involves construction of additional data features, which are lag
differences between consecutive time-points. Lag of order 1 for time-point
$i$, $Lag_1(i)$, denotes the difference between the expression level at time
$i$ and time $i-1$, lag 2 is the difference between time $i$ and time $i-2$,
and so on. Intuitively, the lag 1 at time $i$ approximates the slope or the
first derivative of the time series curve at time-point $i$.

We can add these extra lag features to the "time-course" data using
`addLags()` function, which appends lag features to "tc" and "tc_collapsed"
data frames in the slot `timeSeries`. Additionally,the user can define
the weight for each lag feature by specifying the `lambda` argument.
The length of `lambda` indicates how many orders of lags you would like
to include, e.g. `lambda = c(0.5, 0.25)` means lag order 1 will be
added with multiplicative weight of $0.5$ and lag order 2 will be added with
weight $0.025$.

```{r}
# Add lags to time-course data
cop1.te <- addLags(cop1.te, lambda = c(0.5, 0.25))
head(timeSeries(cop1.te, "ts_trans"))
```


At this point, we completed all data pre-processing steps available in
`TimeSeriesExperiment`. In later sections we specify how visualization, 
clustering and differential expression test can be performed with the package.


Data Visualization {#datavis .unnumbered}
=========================================

In this section we show plotting utilities available in `TimeSeriesExperiment`.
Visualizations are data exploration tools and serve as the first step in our
data analysis. In the following subsections we will describe more in details
how heatmaps and PCA plots can be generated.

Heatmaps {#heatmap .unnumbered}
------------------------------

Here we will generate a heatmap of top 100 most variable genes.
The plot of the expression matrix for these most variable features will
give us some insight whether there is  a clear difference between the
two experimental groups and whether a strong temporal trend can be detected.

```{r heatmap, fig.height=9, fig.width=7}
plotHeatmap(cop1.te, num.feat = 100)
```

In the above heatmap the columns are ordered by experimental group, replicate
(mouse id) and time at which the sample was sequenced; the sample membership
is indicated in the color bars on top of the columns. The main heatmap
rectangle shows Z-scores of expression values represented by colors in
the red-and-blue palette corresponding to high-and-low respectively. Even this
first look at the data, shows us patterns present in the data -- within each
condition, i.e. each mouse and in each experimental group there are expression
levels seem to be dependent on time.


PCA {#pca .unnumbered}
======================

Another way to explore the dataset is through dimensionality reduction.
Here we will project the data into a space of principal components.
With PCA, you can visualize both samples and features in the same coordinates
space with a biplot. Here we will keep these two maps separately,
as the visualization can become overcrowded with points which obscure
the inherent structure. Even though, we are plotting the feature and sample
projections separately, they can be compared side by side to see
which groups of features are more correlated with which group of samples.


Visualizing Samples {#sample-vis .unnumbered}
---------------------------------------------

First, we project samples on a 2D map to check whether their relative location
reflects time at which the sequencing was performed. If the samples are ordered
in agreement with time in the PCA plot, there might be patterns in gene
expression levels changing over the course of the study.

A PCA plot can also be used to examine whether samples corresponding to the
same condition, here wild type or knockouts, tend to group together, i.e.
whether they are more similar to each other than to the ones in a different
condition. We use `prcomp()` function from `stats` (default) package to compute
PCA projection.

RNA-seq data is highly heteroskedastic, which means the features (genes)
included have vastly different variances. It is know that bulk expression count
data can be well modeled with a negative binomial distribution. In this
distribution, variance is a quadratic function of the mean, $\sigma^2 = \mu +
\alpha\mu^2$. That is the higher the mean expression level, the higher the
variance is. PCA projection maximizes the amount variance preserved in
consecutive principal components, which implies that computing PCA on raw
expression counts or even the RPKMs or CPMs would put too much weight on most
highly abundant genes.

This step is recommended because RNA-seq data is highly Heteroskedastic, which
means the features (genes) included have vastly different variances. It is know
that bulk expression count data can be well modeled with a negative binomial
distribution. In this distribution, variance is a quadratic function of the
mean, $\sigma^2 = \mu + \alpha\mu^2$. That is the higher the mean expression
level, the higher the variance is.

Thus, we recommend user to variance stabilize the data before computing
a PCA projection  (as described in *Time course format* section).
The variance stabilization method can be specified in `var.stabilize.method`
argument.

Additionally, the user might limit calculations to only specific group of
samples, e.g. in this case you might be interested in visualizing samples
only in the wild type group. A user can also indicate whether
PCA should be applied to sample resolved data or one with replicates
aggregated (stored in "data.collapsed" slot of a `TimeSeriesExperiment` object).

```{r}
cop1.te <- runPCA(cop1.te, var.stabilize.method = "asinh")
```

```{r pca-samples-group, fig.wide = TRUE, fig.height=3, fig.width=7}
plotSamplePCA(cop1.te, col.var = "group", size = 5)
```

```{r pca-samples-time, fig.wide = TRUE, fig.height=3, fig.width=7}
plotSamplePCA(cop1.te, col.var = "timepoint", size = 5)
```

In the plots above we see that the samples group mostly by time at which
they have been sequenced. For some time-points, we see also a clear separation
between sample from different experimental groups.


Visualizing Features {#feat-vis .unnumbered}
-------------------------------------------------

PCA also provides a projection of features to the same principal component
space. The coordinates of features (here genes) are commonly referred to
as PCA loadings. Since gene expression datasets usually includes thousands
of genes, it is not possible to include labels for all of them in
the same 2D PCA plot. Apart from that, in a time-course study one is usually
interested in trajectories of genes over time, and it is good to see groups
of genes with similar expression pattern clustered together on a visualization.

In order to make PCA plots for features more informative, we overlay
average (over replicates) gene expression trend over time for each experimental
group. Plotting trajectories for every gene would make plot overcrowded and
unreadable, therefore we divide the PCA plot into $m \times n$ grid and plot
a trajectory for a gene which PCA coordinates are closes to the grid center
point.

In the feature PCA plot below we see that genes exhibit different responses to
LPS treatment. The figure shows genes organized according to their trajectory.
Inhibited genes tend to gather on the left side of the plot, the primary
response (early spike) genes are at the bottom, and the late response
(late increase) clustered at the top. The plot doesn't show a
global difference between the wild type and knock-out group. Most genes
have trajectories that are exactly overlapping in both conditions.
There are, however, a few genes for which we do observe some difference
between groups.


```{r pca-genes, fig.width = 10, fig.height = 8}
plotTimeSeriesPCA(
    cop1.te,
    linecol = c("WT" = "#e31a1c", "Loxp" = "#1f78b4"),
    main = paste("Visualizing gene profiles with PCA"),
    m  = 15, n = 15, col = adjustcolor("grey77", alpha=0.7), 
    cex.main = 3, cex.axis = 2, cex.lab = 2)
```


Gene Clustering {#clustering .unnumbered}
=========================================

To cluster the gene trajectories we will use the data stored in
`timeSeries` slot of `cop1.te`. These are time-series
of transformed expression values together with appended time lags
which resolving differences between the temporal trends.

We use hierarchical clustering define gene groupings.
Either static or dynamic branch cutting (from `dynamicTreeCut` package)
algorithms can be used to assign clusters. Since hierarchical clustering
is computationally intensive (with $O(n^3)$ complexity for standard
implementations), we apply it only to a subset of genes.
Specifically, we pick `n.top.feat` with average (over replicates) most
variable expression over time in each of selected `groups` (here
we use both wild type and knock-out) to perform clustering. Remaining genes
are, then, assigned to a cluster with the closest centroid. An additional
advantage of using only a subset of most variable genes for clustering is that,
the core genes which exhibit negligible changes over time (which might be the
majority of genes) will not much effect on clustering results.

Here we pick 1000 genes for clustering, which is roughly 1/3 of the
number of genes after filtering.

```{r}
params_for_clustering <- list(
  dynamic = TRUE, 
  dynamic_cut_params = list(deepSplit = TRUE, minModuleSize = 150))

cop1.te <- clusterTimeSeries(
  cop1.te, n.top.feat = 3000, groups = "all",
  clust.params = params_for_clustering)
```

We can see the size of each of clusters computed

```{r}
# See the count of genes in each cluster
cluster_map <- clusterMap(cop1.te)
table(cluster_map$cluster)
```

We can plot the hclust dendrogram obtained from hierarchical clustering
performed.

```{r hclust-dendrogram, fig.width = 7, fig.height = 4}
# Plot the hierarchical clustering of genes2cluster
hclust_obj <- clusterAssignment(cop1.te, "hclust")
plot(x = hclust_obj, labels = FALSE, xlab = "genes", sub = "")
```

Here we plot average (over replicates) gene trajectories grouped into 10
clusters found using the above described approach. The expression profiles
for wild type and knock-out are plotted separately, side by side.


```{r gene-clusters, fig.width = 7, fig.height = 5.5, fig.wide = TRUE}
plotTimeSeriesClusters(cop1.te, transparency = 0.2) +
    scale_color_manual(values = c("WT" = "#1f78b4", "Loxp" = "#e31a1c")) +
    theme(strip.text = element_text(size = 10)) +
    ylim(NA, 0.55)
```


Timecourse plots above show genes clustered clearly into groups related
to their pattern of response to LPS treatment. We see cluster C1, C2, and C4
generally inhibited. Genes in C6 spike right after LPS was applied. C3
shows moderate secondary response, and C5 exhibit late increase in expression.


Differential Expression Ranking {#diff-expr .unnumbered}
========================================================

In the previous section we grouped the genes based on the trajectories
recorded for both experimental groups. In this section we will describe
how to find specific genes that exhibit different expression patterns
between two experimental groups over the time-course of the study.


Differential Point-wise Expression {#point-de .unnumbered}
-------------------------------------------------

In some cases, a user might be interested in differential expression (DE)
at specific timepoints between different experimental groups, here wild type
and knock-out. We can easily test differential expression at any timepoint
over the course of the study using standard DE approaches.

`TimeSeriesExperiment` provides a wrapper `timepointDE()` for differential 
expression testing functions (`voom()` + `limma()`) from `limma` package, which 
allows users to easily apply testing to `TimeSeriesExperiment` objects.


```{r}
cop1.te <- timepointDE(cop1.te, timepoints = "all", alpha = 0.05)
```

Information on DE genes e.g. at timepoint 2.5 can be access as follows:

```{r}
tmp_de <- differentialExpression(cop1.te, "timepoint_de")
# First 6 DE genes at timepoint = 2.5:
head(tmp_de$`2.5`)
```

To find genes with the highest log-fold change in each timepoint we can call
the following commands:

```{r}
top10_genes <- sapply(tmp_de, function(tab) {
    if(nrow(tab) == 0)
      return(rep(NA, 10))
    top10 <- tab %>% 
      arrange(desc(abs(logFC))) %>%
      .[["symbol"]] %>% .[1:10]
    if(length(top10) < 10)
      top10 <- c(top10, rep(NA, 10 - length(top10)))
    return(top10)
  }) %>% 
  as.data.frame()
top10_genes
```

We can also find a list of genes which were found differentially expressed
at any of the timepoints

```{r}
de_genes <- lapply(tmp_de, function(x) x$symbol)
de_any_tmp <- unique(unlist(de_genes))
cat("Out of all", nrow(cop1.te), ", there were",
    length(de_any_tmp), "were found differentially expressed at any timepoint.")
```

A useful diagram would show intersections of differentially
expressed genes at different time-points. You should expert more intersection
between consecutive time-points. You can use functions from `UpSetR`
package to show the overlap between the DE genes across timepoints:


```{r tmp-de-intersect, fig.width=7, fig.height=4}
library(UpSetR)
upset(fromList(de_genes),
      sets = rev(c("2.5", "4", "6", "9", "13")), keep.order = TRUE,
      number.angles = 30, #point.size = 3.5, line.size = 1,
      mainbar.y.label = "DE Gene Intersections",
      sets.x.label = "DE Genes Per Timepoint")
```


Differential Trajectories {#adonis .unnumbered}
-------------------------------------------------

Instead of looking at each time-point separately, it is often useful to
identifying genes which exhibit different expression trajectories,
i.e. ones with differential kinetics over time. We take a most natural approach
which can be applied to short time-course datasets, which is an analysis of
variance. In particular we use a method based on non-parametric permutation
multivariate analysis of variance (MANOVA).

To test each gene for differential trajectory under two conditions, we
use the time-course format, where each row is a transformed
time-series with lags corresponding to a single replicate within a particular
group. `adonis()` function from `vegan` package is applied to find genes with
differential trajectories. `adonis` approach is based on partitioning the sums
of squares of the multivariate (here time-series with lags) data to
within and between-class. The significance is determined with an F-test on
permutations of the data.

Using this procedure, here we will identify genes which have different
trajectories within the wild type and knock-out group. This difference
is determined when time series replicates (expression profiles) are more
different between the groups than within the same group.

With a small number of available replicates (3 wild type and 3 knockout),
a permutation based method does not yield high power. Combined with
multiple hypothesis testing correction (for testing thousands of genes),
we expect the method's p-values to be mostly below significance level
of $\alpha = 0.5$. However, this approach is still useful, as we can user
raw (unadjusted) p-values to filter out the genes that are with
high probability not significant, and use the $R^2$ value (the percentage
variance explained by groups) for ranking the genes in terms of the
difference in expression profiles between two groups.

Function `trajectoryDE()` can be used to find differential genes using
the above described method. Results of testing procedure can be accessed with:
`differentialExpression(cop1.te, "trajectory_de")`.

```{r, eval = FALSE}
cop1.te <- trajectoryDE(cop1.te, dist_method = "euclidean")
de_trajectory_res <- differentialExpression(cop1.te, "trajectory_de")
saveRDS(de_trajectory_res, "de_trajectory_cop1.rds")
```

```{r echo = FALSE}
de_trajectory_res <- readRDS("de_trajectory_cop1.rds")
```


```{r}
head(de_trajectory_res, 20)
```


You can filter out the genes based on the $R^2$ value (the percentage
variance explained by groups). There are `r sum(de_trajectory_res$R2 > 0.7)`
genes with $R^2 \ge 0.7$ which constitutes a fraction of
`r sum(de_trajectory_res$R2 > 0.7)/nrow(cop1.te)`
of all the genes in the dataset.

Here we print out 20 of the genes with highest $R^2$ value and
the pvalue equal 0.1. P-value of 0.1 is the minimum possible when
performing permutation test on distances between 6 observations split
evenly into 2 groups (0.1 = 2/(6 choose 3)).


```{r}
# Select top most different genes across two groups
n_genes <- 20
genes_with_de_trajectory <- de_trajectory_res %>%
    filter(pval <= max(0.05, min(pval)), R2 > 0.7) %>%
    arrange(-R2)
(genes_to_plot <- genes_with_de_trajectory$feature[1:n_genes])
```


```{r genes-de-trajectory, fig.width = 7, fig.height = 5.5, fig.wide = TRUE}
plotTimeSeries(cop1.te, features = genes_to_plot) +
    scale_color_manual(values = c("WT" =  "#1f78b4", "Loxp" = "#e31a1c"))
```


Note, that in this analysis of variance approach we can specify a distance
metric. Additionally, if you are interested in finding out genes DE in
specific time intervals e.g. beginning or end of an experiment, you can
choose to keep only the timepoints of interest are remove the rest from
the time-courses stored in `cop1.te@timecourse.data$tc`.
`trajectory_de()` would then compute the distance for the specified
time period and return DE genes at these specific time intervals.

```{r}
plotTimeSeries(cop1.te, features = genes_to_plot) +
    scale_color_manual(values = c("WT" =  "#1f78b4", "Loxp" = "#e31a1c"))
```


Functional pathways {#fun-path .unnumbered}
===========================================

In this section we will describe procedures to find functional pathways/gene
sets corresponding to genes exhibiting differential expression profiles.
We use publicly available reference databases to find relevant pathways.

You can select any list of genes for this step. However, the method
is intended for sets of genes found differentially expressed using methods
provided by `TimeSeriesExperiment`, discussed in the previous section.

Below, we use genes with DE trajectory, selected in the previous section.
There are `r length(genes_with_de_trajectory$feature)` in the set:

```{r}
length(genes_with_de_trajectory$feature)
```

Multiple functional pathways might be affected by knocking-out Cop1 gene.
Therefore, we expect that genes found differential expressed can be parts
of distinct pathways. Selected DE genes can be grouped by their cluster
membership found earlier.

Below we plot expression profiles of selected DE genes, separated according
to their cluster assignment.

```{r de-gene-clusters, fig.width = 8, fig.height = 6, fig.wide = TRUE}
plotTimeSeriesClusters(
  cop1.te, transparency = 0.5,
  features = genes_with_de_trajectory$feature) +
  scale_color_manual(values = c("WT" = "#1f78b4", "Loxp" = "#e31a1c"))
```


We can test sets  (clusters) of DE genes individually for pathway enrichment
using `pathwayEnrichment()` which is built on top of `goanna()` and `kegga()`
functions from `limma` package. The `features` argument provided
to `pathwayEnrichment()` must be Entrez Gene IDs. The user must also
specify the relevant species. Optionally, the user can modify the filtering
parameters applied to enrichment results: `fltr_DE`, `fltr_N`, and
`fltr_P.DE`. For details see the documentation `?pathwayEnrichment`.


```{r, eval = FALSE}
enrich_res <- pathwayEnrichment(
  object = cop1.te,
  features = genes_with_de_trajectory$feature,
  species = "Mm", ontology ="BP")
saveRDS(enrich_res, "enrich_res_cop1.rds")
```

```{r echo = FALSE}
enrich_res <- readRDS("enrich_res_cop1.rds")
```

We can plot enrichment results for genes with differential trajectory
in each cluster separately using `plotEnrichment()`.
Here we plot the top `n_max = 15` terms for each cluster.
Next to enrichment terms, in the brackets, we have included information on
the exact number of DE genes in a set (DE) and the size of the reference
enrichment term (N). The points are colored by this fraction. The size of
the points correspond the enrichment term size (N).

Below we print we print for example a plot for cluster C4 and C5

```{r go-enrich-clst-c4,  fig.width = 8, fig.height = 4, fig.wide = TRUE}
clst <- "C4"
plotEnrichment(
    enrich = enrich_res[[clst]], n_max = 15) +
  ggtitle(paste0("Cluster , ", clst, " enrichment"))
```

```{r go-enrich-clst-c5,  fig.width = 8, fig.height = 4, fig.wide = TRUE}
clst <- "C5"
plotEnrichment(
    enrich = enrich_res[[clst]], n_max = 15) +
  ggtitle(paste0("Cluster , ", clst, " enrichment"))
```


Session Information {#session-info .unnumbered}
===============================================

```{r}
sessionInfo()
```