%\VignetteEngine{knitr::knitr} \documentclass[a4paper, 9pt]{article} <>= BiocStyle::latex() @ %% \VignetteIndexEntry{An R Package for todo} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage{placeins} \usepackage{url} \usepackage{tcolorbox} \begin{document} \title{Using the \Biocpkg{SIMLR} package} \author{ Bo Wang\footnote{Department of Computer Science, Stanford University, Stanford, CA , USA.} \and Daniele Ramazzotti\footnote{Department of Pathology, Stanford University, Stanford, CA , USA.} \and Luca De Sano\footnote{Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi Milano Bicocca Milano, Italy.} \and Junjie Zhu\footnote{Department of Electrical Engineering, Stanford University, Stanford, CA , USA.} \and Emma Pierson\footnote{Department of Computer Science, Stanford University, Stanford, CA , USA.} \and Serafim Batzoglou\footnote{Department of Computer Science, Stanford University, Stanford, CA , USA.} } \date{\today} \maketitle \begin{tcolorbox}{\bf Overview.} Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, \Biocpkg{SIMLR} (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. \Biocpkg{SIMLR} is capable of separating known subpopulations more accurately in single-cell data sets than do existing dimension reduction methods. Additionally, \Biocpkg{SIMLR} demonstrates high sensitivity and accuracy on high-throughput peripheral blood mononuclear cells (PBMC) data sets generated by the GemCode single-cell technology from 10x Genomics. \\ \vspace{1.0cm} {\em In this vignette, we give an overview of the package by presenting some of its main functions.} \vspace{1.0cm} \renewcommand{\arraystretch}{1.5} \end{tcolorbox} <>= library(knitr) opts_chunk$set( concordance = TRUE, background = "#f3f3ff" ) @ \newpage \tableofcontents \section{Changelog} \begin{itemize} \item[1.0.0] implements SIMLR and SIMLR feature ranking algorithms. \item[1.0.2] implements SIMLR large scale algorithms. \end{itemize} \section{Algorithms and useful links} \label{sec:stuff} \renewcommand{\arraystretch}{2} \begin{center} \begin{tabular}{| l | p{6.0cm} | l |} {\bf Acronym} & {\bf Extended name} & {\bf Reference}\\ \hline SIMLR & Single-cell Interpretation via Multi-kernel LeaRning & \href{http://biorxiv.org/content/early/2016/06/09/052225}{Paper}\\ \hline \end{tabular} \end{center} \section{Using SIMLR} We first load the data provided as an example in the package. The dataset BuettnerFlorian is used for an example of the standard SIMLR, while the dataset ZeiselAmit is used for an example of SIMLR large scale. <>= library(SIMLR) data(BuettnerFlorian) data(ZeiselAmit) @ The external R package igraph is required for the computation of the normalized mutual information to assess the results of the clustering. <>= library(igraph) @ We now run SIMLR as an example on an input dataset from Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160. For this dataset we have a ground true of 3 cell populations, i.e., clusters. <>= set.seed(11111) example = SIMLR(X = BuettnerFlorian$in_X, c = BuettnerFlorian$n_clust, cores.ratio = 0) @ We now compute the normalized mutual information between the inferred clusters by SIMLR and the true ones. This measure with values in [0,1], allows us to assess the performance of the clustering with higher values reflecting better performance. <>= nmi_1 = compare(BuettnerFlorian$true_labs[,1], example$y$cluster, method="nmi") print(nmi_1) @ As a further understanding of the results, we now visualize the cell populations in a plot. <>= plot(example$ydata, col = c(topo.colors(BuettnerFlorian$n_clust))[BuettnerFlorian$true_labs[,1]], xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch = 20, main="SIMILR 2D visualization for BuettnerFlorian") @ \begin{figure*}[ht] \begin{center} \includegraphics[width=0.5\textwidth]{figure/image-1} \end{center} \caption{Visualization of the 3 cell populations retrieved by SIMLR on the dataset by Florian, et al.} \end{figure*} We also run SIMLR feature ranking on the same inputs to get a rank of the key genes with the related pvalues. <>= set.seed(11111) ranks = SIMLR_Feature_Ranking(A=BuettnerFlorian$results$S,X=BuettnerFlorian$in_X) @ <>= head(ranks$pval) head(ranks$aggR) @ We finally show an example for SIMLR large scale on an input dataset being a reduced version of the dataset provided in Buettner, Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142. For this dataset we have a ground true of 9 cell populations, i.e., clusters. <>= set.seed(11111) example_large_scale = SIMLR_Large_Scale(X = ZeiselAmit$in_X, c = ZeiselAmit$n_clust, kk = 10) @ We compute the normalized mutual information between the inferred clusters by SIMLR large scale and the true ones. <>= nmi_2 = compare(ZeiselAmit$true_labs[,1], example_large_scale$y$cluster, method="nmi") print(nmi_2) @ As a further understanding of the results, also in this case we visualize the cell populations in a plot. <>= plot(example_large_scale$ydata, col = c(topo.colors(ZeiselAmit$n_clust))[ZeiselAmit$true_labs[,1]], xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch = 20, main="SIMILR 2D visualization for ZeiselAmit") @ \begin{figure*}[ht] \begin{center} \includegraphics[width=0.5\textwidth]{figure/image_large_scale-1} \end{center} \caption{Visualization of the 9 cell populations retrieved by SIMLR large scale on the dataset by Zeisel, Amit, et al.} \end{figure*} \section{\Rcode{sessionInfo()}} <>= toLatex(sessionInfo()) @ \end{document}