--- title: "Clustering time series data with tscR" subtitle: "A R package to cluster time series data, base on slope and Fréchet distance" author: - Pérez-Sanz, Fernando. Murcian Institute of biomedical research - Riquelme-Pérez, Miriam. CNRS - CEA, Univ. Paris-Saclay. MIRCen output: #pdf_document: #toc: true prettydoc::html_pretty: theme: hpstr highlight: github toc: true css: tscR.css vignette: > %\VignetteIndexEntry{tscR} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(knitr.table.format = "html") ``` ```{r setup, include=FALSE, comment=FALSE, echo=FALSE} library(tscR) library(dplyr) library(grid) library(ggplot2) library(latex2exp) library(dtw) ``` # Overview Clustering is an unsupervised learning technique widely used in several disciplines, such as machine learning, bioinformatics, image analysis or pattern recognition. Gene expression data clustering -- as well as in other omic disciplines -- is useful to find groups of genes with similar behavior and may provide very useful information for the understanding of certain biological processes. This clustering of gene expression data can be performed onto genes, samples or over the time variable. In the last case, called 'time series genes expression', it is possible to identify genes with similar conduct and obtain sets of responses to certain situations. This has allowed to highlight environmental stress response, circadian rhythms pathway, treatment response, etc. Despite the fact that there are several generic R packages for the analysis of clustering time series -- _kmlShape_ , _dtwclust_, _clue_, _TSclust_ --, there is no a single algorithm that satisfactorily solves all problems and each clustering problem requires a specific approach. There are other specific packages to clustering time series in gene expression which seek minimize the high sensitivity to noise of other generic algorithms -- _TSMixclust_, _ctsGE_, _MFuzz_ -- . _MFuzz_ minimizes the high noise sensitivity of other algorithms through a fuzzy clustering, on the other hand _TSMixclust_ is a soft-clustering that uses mixed-effects models with non-parametric smoothing spline fitting. _ctsGE_ seeks to minimize noisy data in two steps, first define groups based on their expression profiles (expression index) and then, genes with same index are clustered by applying kmeans. Despite the fact that these methods attempt to solve the important task of minimizing the impact of the noise on clustering, genes with similar expression -- in magnitude -- typically end up in the same groups, even though they have different temporal evolutions. Occasionally, however, the researcher's main interest lies in finding genes with similar patterns (similar trajectories) although these occur at different levels of expression. Therefore, genes with a similar temporal evolution are therefore sought regardless of their magnitude of expression. Let us suppose that we subject an individual to a treatment and measure the expression of three (or ten thousand) of his genes at the time the treatment begins, at one week, two weeks and three weeks after (Fig. 1). The lines in figure 1 (`a`,`b`,`c`) represent 3 tracks (e.g. the expression of 3 genes at 4 different times). The trajectories $T_a$ and $T_c$ are identical in terms of slopes, they only differ in the magnitude of their expression. On the other hand the track $T_b$ would be closer to the track $T_a$ than to $T_c$. ```{r warning=F, fig.cap= "Figure 1. Three different trajectories at four times to exemplify the possible relationships or classifications of the data.", echo=FALSE, fig.align='left'} a <- c(10, 15, 17, 25) b <- c(5, 8, 6, 9 ) c <- c(-19, -14, -12, -4) x <- c(1, 2, 3, 4) df <- as.data.frame(cbind(x,a,b,c)) ggplot(df, aes(x=x))+ geom_line(aes(y=a), color = "red")+ geom_point(y=a, col="red")+ geom_line(aes(y=b))+ geom_point(y=b)+ geom_line(aes(y=c), color = "steelblue")+ geom_point(y=c, color="steelblue")+ geom_label(aes(x=1.5, y=14, label = TeX("$S_{a1}$", output="character")), parse=TRUE ) + geom_label(aes(x=2.5, y=17, label = TeX("$S_{a2}$", output="character")), parse=TRUE ) + geom_label(aes(x=3.5, y=23, label = TeX("$S_{a3}$", output="character")), parse=TRUE ) + geom_label(aes(x=1.5, y=7.5, label = TeX("$S_{b1}$", output="character")), parse=TRUE ) + geom_label(aes(x=2.5, y=8, label = TeX("$S_{b2}$", output="character")), parse=TRUE ) + geom_label(aes(x=3.5, y=9, label = TeX("$S_{b3}$", output="character")), parse=TRUE ) + geom_label(aes(x=1.5, y=-15, label = TeX("$S_{c1}$", output="character")), parse=TRUE ) + geom_label(aes(x=2.5, y=-12, label = TeX("$S_{c2}$", output="character")), parse=TRUE ) + geom_label(aes(x=3.5, y=-7, label = TeX("$S_{c3}$", output="character")), parse=TRUE ) + geom_text(aes(x=1, y=12, label = "Traject. a"))+ geom_text(aes(x=1, y=4, label = "Traject. b"))+ geom_text(aes(x=1, y=-17, label = "Traject. c"))+ theme(legend.position = "none")+ xlab("") + ylab("") ``` Intuitively, if we had to determine the distance between these trajectories, we would think that $T_a$ is closer to $T_b$ than to $T_c$. In addition, $T_c$ is closer to $T_b$ than to $T_a$. You could quantify those distances with different metrics: * Euclidean distance ```{r echo=FALSE} dfx <- as.matrix(rbind(a,b,c)) dsE <- round( dist(dfx, method = "euclidean", diag = FALSE, upper = FALSE), 3) ``` $$ \begin{matrix} & a &b \\ b & 21.24 & \\ c & 54.00 & 39.41 \end{matrix} $$ Other distance metrics based more specifically on time series such as Fréchet or DTW provide similar results: ```{r echo=FALSE} time <- c(1,2,3,4) dF <- tscR::frechetDistC(dfx, time) ``` * Fréchet Distance $$ \begin{matrix} & a &b \\ b & 16 & \\ c & 29 & 24 \end{matrix} $$ * DTW Distance ```{r echo=FALSE} dt1 <- dtwDist(matrix(c(a,b), byrow = TRUE, nrow = 2)) dt2 <- dtwDist(matrix(c(a,c), byrow = TRUE, nrow = 2)) dt3 <- dtwDist(matrix(c(b,c), byrow = TRUE, nrow = 2)) ``` $$ \begin{matrix} & a & b \\ b & 42 & \\ c & 158 & 104 \end{matrix} $$ However, if the main interest is to determine which trajectories have similar behaviour over time regardless of the magnitude of their expression, it is necessary to use purely geometric criteria and look for similarities based on their slopes. In this case, the distance matrix would look as follows: $$ \begin{matrix} & a & b \\ b & 6.71 & \\ c & 0.00 & 6.71 \end{matrix} $$ Where it can be seen that the distance between the $a$ and $c$ trajectory is zero because these are two lines with identical slopes and the distance from both to $b$ is the same. Therefore, we might find a researcher interested in identifying and grouping sets of genes with similar behaviours according to different points of view: * Genes clustered with similar levels of expression, i.e. genes whose trajectories are close in terms of physical distance (fig.2 B). * Genes grouped with similar evolution regardless of their physical distance. In this second scenario, we are dealing with genes that respond in a similar way, but with different intensity (fig.2 C). * It might also be interesting to group genes according to both factors (distance + tendency) (fig.2 D). ```{r echo=FALSE, fig.align='left', fig.cap="Figure 2. Different possibilities of classifying the tracks from the whole data set (a) according to its Fréchet distances (b), according to its slopes (c) or a combination of both (d)."} df <- data.frame(T1 = c(140,100,75,35), T2=c(120,120,50,48), T3 = c(100,140,35,70)) df1 <- matrix(NA, nrow=10, ncol=3) df2 <- matrix(NA, nrow=10, ncol=3) df3 <- matrix(NA, nrow=10, ncol=3) df4 <- matrix(NA, nrow=10, ncol=3) for(i in seq(1,10)){ df1[i,] <- jitter(as.numeric(df[1,]), factor = 1.5) df2[i,] <- jitter(as.numeric(df[2,]), factor = 1.5) df3[i,] <- jitter(as.numeric(df[3,]), factor = 1.5) df4[i,] <- jitter(as.numeric(df[4,]), factor = 1.5) } df <- as.data.frame(rbind(df1,df2,df3,df4)) names(df) <- c("T1","T2","T3") df <- as.data.frame.table(t(df)) df$Var3 <- rep(c("A","B","C","D"), each=30) p1 <- df %>% ggplot( aes_(~Var1, ~Freq, group=~Var2) ) + geom_line() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + xlab("(A)") + ylab("") + ggtitle(label = "Raw trajectories") p2 <- df %>% mutate(Var4 = recode(Var3, "B" = "A")) %>% mutate(Var4 = recode(Var4, "D" = "C")) %>% ggplot( aes_(~Var1, ~Freq, group=~Var2, colour=~Var4) ) + geom_line() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + xlab("(B)") + ylab("") + ggtitle(label = "Fréchet based cluster") p3 <- df %>% mutate(Var4 = recode(Var3, "C" = "A")) %>% mutate(Var4 = recode(Var4, "D" = "B")) %>% ggplot( aes_(~Var1, ~Freq, group=~Var2, colour=~Var4) ) + geom_line() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + xlab("(C)") + ylab("") + ggtitle(label = "Slope based cluster") p4 <- df %>% ggplot( aes_(~Var1, ~Freq, group=~Var2, colour=~Var3) ) + geom_line() + theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + xlab("(D)") + ylab("") + ggtitle(label = "Combined clusters") gridExtra::grid.arrange(p1,p2,p3,p4, nrow=2) ``` Through this package we propose a methodology that allows to group these paths based on physical distances by using distance metrics adapted to time series (Fréchet's distance) and a new approach based on similarity of slopes between trajectories, where the trajectories are grouped according to this proximity regardless of the distance between them. Furthermore, a combination of both classifications is available, so that the final result would be trajectories grouped according to their values and their evolutions. Since in many studies (especially in different comic disciplines), the number of trajectories can be very large (thousands or tens of thousands), a methodology has been developed based on the existing one in the _kmlShape_ package. By means of a pre-grouping, a series of "representatives" of the trajectories is obtained, called senators; these senators are then classified by means of some of the proposed algorithms. At last, the set of trajectories is assigned to the cluster to which its senator has been assigned. In this way the computational cost and the risk of overflowing the memory is reduced. # Getting started Install from Bioconductor ```{r eval=FALSE, echo=TRUE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("tscR") ``` Install the latest development version in `install_github`. ```{r eval = FALSE, echo=TRUE} devtools::install_github("fpsanz/tscR") ``` Read the vignette (this document): ```{r eval = TRUE} library(tscR) ``` ```{r eval =FALSE} browseVignettes("tscR") ``` # Simple clustering ## Based on slope distance The input data has to be a dataframe or matrix where the rows will be each of the trajectories and the columns the times (Table 1). Initially, we are going to load a set of example trajectories included in the library. ```{r eval = TRUE, echo = TRUE, results='asis'} data(tscR) df <- tscR ``` ```{r eval=TRUE, echo=FALSE, results='asis'} knitr::kable(head(round(df,3)), caption = "Table 1. First six rows from the example data matrix at three different times that have been studied for one sample.", align = c('c', 'c', 'c'), format="html") %>% gsub("