Skip to content.

bioconductor.org

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.

Sections

Lab3.Rnw

% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Seattle Lab 3} %\VignetteDepends{Biobase. golubEsets, ellipse,lattice, mva, MASS,cluster } %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\bibliographystyle{plainnat}

\title{Cluster Analysis Using R and Bioconductor}

\begin{document}

\maketitle

We continue our extended example involving the dataset from \citet{Golub99}. In this lab, we will compute distances between tumor samples and use multidimensional scaling to represent these distances. We will then consider different procedures for clustering tumor samples.

<>= library(Biobase) library(annotate) library(golubEsets) library(genefilter) library(ellipse) library(lattice) library(cluster) library(mva) @

We will set up the data from scratch once again.

<>= X<-exprs(golubTrain) X[X<100]<-100 X[X>16000]<-16000

mmfilt <- function(r=5, d=500, na.rm=TRUE) { function(x) { minval <- min(x, na.rm=na.rm) maxval <- max(x, na.rm=na.rm) (maxval/minval > r) && (maxval-minval > d) } }

mmfun <- mmfilt()

ffun <- filterfun(mmfun) sub <- genefilter(X, ffun ) sum(sub) ## Should get 3051

X <- X[sub,] X <- log2(X) golubTrainSub<-golubTrain[sub,] golubTrainSub@exprs <- X

Y <- golubTrainSub$ALL.AML

Y<-paste(golubTrain$ALL.AML,golubTrain$T.B.cell) Y<-sub("NA","",Y)

@

Now that the data are set up, we are ready to compute distances between tumor samples and apply clustering procedures to these samples.

\section{Distances}

We first compute the correlation distance between the 38 tumor samples from the training set, using all 3,051 genes retained after the above filtering procedure.

%%FIXME: other distances?

<>= r<-cor(X) dimnames(r)<-list(as.vector(Y),as.vector(Y)) d<-1-r @

We rely on the \texttt{plotcorr} function from the \textit{ellipse} package and the \texttt{levelplot} function from \textit{lattice} for displaying this 38 $\times 38$ distance matrix.

<>= plotcorr(r, main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes") @

<>= plotcorr(r,numbers=TRUE, main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes") @

<>= levelplot(r,col.region=heat.colors(50), main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes") @

\section{Multidimensional Scaling}

We now turn to {\em multidimensional scaling} (MDS) for representing the tumor sample distance matrix. MDS is a data reduction method that is appropriate for general distance matrices. It is much like principal component analysis (PCA), but it is not the same -- except for Euclidean distances. Given an $n \times n$ distance matrix, MDS is concerned with identifying $n$ point in Euclidean space with a {\em similar} distance structure. The purpose is to provide a graphical representation of the distances which conveys information on the relationships between objects, such as the existence of clusters or one-dimensional structure in the data (e.g., seriation).

<>= mds<- cmdscale(d, k=2, eig=TRUE) plot(mds$points, type="n", xlab="", ylab="", main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=2") text(mds$points[,1],mds$points[,2],Y, col=codes(factor(Y))+1, cex=0.8) @

<>= mds<- cmdscale(d, k=3, eig=TRUE) pairs(mds$points, main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=3", pch=c("B","T","M")[codes(factor(Y))], col = codes(factor(Y))+1) @

Note, if we are measuring distances between samples on the basis of $G=3,051$ probes then we are essentially looking at points in 3,051 dimensional space. As with PCA, the quality of the representation will depend on the magnitude of the first $k$ eigenvalues. To assess how many components there are, a \textit{scree} plot similar to that used for principal components can be created. This plot of the eigenvalues suggests that much of the information is contained in the first component. One might consider using either three or four components as well.

<>= mdsScree <- cmdscale(d, k=8, eig=TRUE) plot(mdsScree$eig, pch=18, col="blue") @

\section{Hierarchical Clustering}

We can use the \texttt{hclust} function in the \textit{mva} package for {\em agglomerative hierarchical clustering}. We consider the following agglomeration methods, i.e., methods for computing {\em between cluster distances}:

\begin{description} \item[Single linkage] The distance between two clusters is the minimum distance between any two objects, one from each cluster. \item[Average linkage] The distance between two clusters is the average of all pairwise distances between the members of both clusters. \item[Complete linkage] The distance between two clusters is the maximum distance between two objects, one from each cluster. \end{description}

The nested sequence of clusters resulting from hierarchical clustering methods can be represented graphically using a dendrogram. A {\em dendrogram} is a binary tree diagram in which the terminal nodes, or leaves, represent individual observations and in which the height of the nodes reflects the dissimilarity between the two clusters that are joined. While dendrograms are quite appealing because of their apparent ease of interpretation, they can be misleading.

First, the dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree should go on the left and which on the right. For $n$ observations, there are $n-1$ merges and hence $2^{(n-1)}$ possible orderings of the terminal nodes. Therefore, $2^{(n-1)}$ different dendrograms are consistent with any given sequence of hierarchical clustering operations. The ordering is of practical importance because it will result in different genes being placed next to each other in the heat-maps and possibly different interpretations of the same data. A number of heuristics have been suggested for ordering the terminal nodes in a dendrogram. The default in the R function \texttt{hclust} from the \textit{cluster} package is to order the subtrees so that the tighter cluster is on the left (i.e., the most recent merge of the left subtree is at a lower value than the last merge of the right subtree).

A second, and perhaps less recognized shortcoming of dendrograms, is that they {\em impose} structure on that data, instead of {\em revealing} structure in these data. Indeed, dendrograms are often viewed as graphical summaries of the data, rather than of the results of the clustering algorithm. Such a representation will be valid only to the extent that the pairwise dissimilarities possess the hierarchical structure imposed by the clustering algorithm. Note that in particular, the same matrix of pairwise distances between observations will be represented by a different dendrogram depending on the distance function that is used to compute between-cluster distances (e.g., complete or average linkage). The {\em cophenetic correlation coefficient} can be used to measure how well the hierarchical structure from the dendrogram represents the actual distances. This measure is defined as the correlation between the $n(n-1)/2$ pairwise dissimilarities between observations and their {\em cophenetic} dissimilarities from the dendrogram, i.e., the between cluster dissimilarities at which two observations are first joined together in the same cluster. \\

Next, we apply three agglomerative hierarchical clustering methods to cluster tumor samples. For each methods, we plot a dendrogram, compute the corresponding cophenetic correlation, and compare the three clusters obtained from cutting the tree into three branches to the actual tumor labels (ALL B-cell, ALL T-cell, and AML).

<>= hc1 <- hclust(as.dist(d), method="average") coph1<-cor(cophenetic(hc1),as.dist(d))

plot(hc1, main=paste("Dendrogram for ALL AML data: Coph = ", round(coph1,2)), sub="Average linkage, correlation matrix, G=3,051 genes")

cthc1 <- cutree(hc1, 3) table(Y, cthc1)

@

<>= hc2 <- hclust(as.dist(d), method="single") coph2<-cor(cophenetic(hc2),as.dist(d))

plot(hc2, main=paste("Dendrogram for ALL AML data: Coph = ", round(coph2,2)), sub="Single linkage, correlation matrix, G= 3,051 genes")

cthc2 <- cutree(hc2, 3) table(Y, cthc2)

@

<>= hc3 <- hclust(as.dist(d), method="complete") coph3<-cor(cophenetic(hc3),as.dist(d))

plot(hc3, main=paste("Dendrogram for ALL AML data: Coph = ", round(coph3,2)), sub="Complete linkage, correlation matrix, G= 3,051 genes")

cthc3 <- cutree(hc3, 3) table(Y, cthc3) @

Divisive hierarchical clustering is available using \texttt{diana} from the \textit{cluster} package.

<>= di1 <- diana(as.dist(d)) cophdi<- cor(cophenetic(di1), as.dist(d))

plot(di1, which.plots=2, #$ main=paste("Dendrogram for ALL AML data: Coph = ", round(cophdi,2)), sub="Divisive algorithm, correlation matrix, G= 3,051 genes")

ct.di <- cutree(di1, 3) table(Y, ct.di) @

\section{Partitioning Methods}

Here we apply the {\em partitioning around medoids} (PAM) method to cluster tumor mRNA samples.

<>= ##PAM set.seed(12345)

pm3 <- pam(as.dist(d), k=3, diss=TRUE) table(Y, pm3$clustering)

clusplot(d, pm3$clustering, diss=TRUE, labels=3, #$ col.p=1, col.txt=codes(factor(Y))+1, main="Bivariate cluster plot for ALL AML data\n Correlation matrix, K=3, G=3,051 genes")

@ <>= plot(pm3,which.plots=1)

@

<>= plot(pm3,which.plots=2) @

One of the more difficult questions that arises when clustering data is the assessment of how many clusters there are in the data. We now consider using PAM with four clusters.

<>= ##what happens if we try four? set.seed(12345) pm4 <- pam(as.dist(d), k=4, diss=TRUE)

clusplot(d, pm4$clustering, diss=TRUE, labels=3, #$ col.p=1, col.txt=codes(factor(Y))+1, main="Bivariate cluster plot for ALL AML data\n Correlation matrix, K=4, G=3,051 genes") @

\end{document}

News
2010-05-21

Advanced R Programming for Bioinformatics course material now available

2010-04-23

Bioconductor 2.6, consisting of 389 packages and designed to work with R version 2.11, was released today.