%
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{Machine Learning Lab}
%\VignetteDepends{golubEsets, Biobase, genefilter, e1071, geneplotter, hu6800}
%\VignetteKeywords{Expression Analysis, Reproducible Research}
%\VignettePackage{Milan}
\documentclass[11pt]{article}
\usepackage{times}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{comment}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
%\headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\bibliographystyle{plainnat}
\title{
Applications of Machine Learning to Microarray Data, SVM and friends}
\author{R. Gentleman}
\begin{document}
\section*{Preliminaries}
In this lab we explore the use of support vector machines (SVMs) for
classification in microarray experiments. The lab and some of the
experiments described here are based on ``Microarray Analysis and
Classification by SVM and PAM'' by R. Spang and F. Markowetz.

The data are those reported by \citet{Golub99} and contained in the
package \Rpackage{golubEsets}. The software for fitting the SVMs comes
from the package \Rpackage{e1071}.
<<loadlibs, echo=FALSE, results=hide>>=
library(golubEsets)
library(e1071)
library(Biobase)
library(genefilter)
@
The first step in analysing these data is to apply the basic
transformation reported in \citet{Golub99}: the data are Winsorized, so
that expression values below 100 are set to 100 and values above 16000
are set to 16000.
<<windsorize>>=
X <- exprs(golubTrain)
Wlow <- 100
Whigh <- 16000
X[X<Wlow] <- Wlow
X[X>Whigh] <- Whigh
@
We then filter the training set genes.
<<filter>>=
mmfilt <- function(r=5, d=500, na.rm=TRUE) {
    function(x) {
        minval <- min(x, na.rm=na.rm)
        maxval <- max(x, na.rm=na.rm)
        (maxval/minval > r) && (maxval - minval > d)
    }
}
mmfun <- mmfilt()
ffun <- filterfun(mmfun)
sub <- genefilter(X, ffun )
sum(sub) ## Should get 3051
@
The filtering process has selected \Sexpr{sum(sub)} genes that seem
worthy of further investigation (according to the criteria imposed).
The filter is not especially robust, in the sense that it is based on
the minimum and maximum expression values across samples; using extreme
deciles, or some other more robust summaries, would probably be better.

This is a non-specific filter: the genes were selected according to
their variability, not according to their ability to discriminate any
particular set of samples.
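
As an illustration of the more robust alternative mentioned above, the
following sketch (not evaluated here) filters on the 10th and 90th
percentiles rather than the minimum and maximum; the cutoffs are the
same illustrative values used above and would need to be reconsidered
for a real analysis.
<<robustFilter, eval=FALSE>>=
## Sketch only: filter on extreme deciles instead of the min and max.
qfilt <- function(r=5, d=500, na.rm=TRUE) {
    function(x) {
        qs <- quantile(x, probs=c(0.1, 0.9), na.rm=na.rm)
        (qs[2]/qs[1] > r) && (qs[2] - qs[1] > d)
    }
}
subQ <- genefilter(X, filterfun(qfilt()))
sum(subQ)
@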
<<subset>>=
X <- X[sub, ]
dim(X)
golubTrainSub <- golubTrain[sub, ]
exprs(golubTrainSub) <- X
Y <- golubTrainSub$ALL.AML
## combine the ALL/AML label with the T-/B-cell subtype, where available
Y <- paste(golubTrain$ALL.AML, golubTrain$T.B.cell)
Y <- sub(" NA", "", Y)
@
%$
In order to make the test set comparable we must select the same set
of genes and apply the same transformations to that data set.
<<testset>>=
Xt <- exprs(golubTest)
Xt[Xt < Wlow] <- Wlow
Xt[Xt > Whigh] <- Whigh
golubTestSub <- golubTest[sub, ]
exprs(golubTestSub) <- Xt[sub, ]
Ytest <- golubTestSub$ALL.AML
Ytest <- paste(golubTest$ALL.AML, golubTest$T.B.cell)
Ytest <- sub(" NA", "", Ytest)
@
%$
In this analysis the genes were selected according to their behavior in
the training set. If the same selection criteria were applied to the
test set, a different set of genes would be selected. Since we want to
compare and combine the analyses of the two data sets, that approach
cannot be used; the test set is therefore restricted to the genes
selected on the training data.
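
To get a feeling for how strongly the selected gene set depends on
which data are used for filtering, one could apply the same filter to
the Winsorized test set and compare the two selections. A minimal
sketch (not evaluated) follows; it reuses \Robject{ffun} and the
logical vector \Robject{sub} defined above.
<<filterCompare, eval=FALSE>>=
## Sketch only: apply the same non-specific filter to the test set and
## compare the selected genes with those chosen on the training set.
subTest <- genefilter(Xt, ffun)
table(training=sub, test=subTest)
sum(sub & subTest)   ## genes passing the filter in both data sets
@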
\section*{Support Vector Machines}
We will make use of the interface to \Rfunction{svm} that requires the
specification of a data matrix, \Robject{x}, and a response,
\Robject{y}. The labels in \Robject{y} correspond to the rows of
\Robject{x}.
<<fitsvm>>=
Xm <- t(exprs(golubTrainSub))
resp <- golubTrainSub$ALL
svm1 <- svm(Xm, resp, type="C-classification", kernel="linear")
@
%$
We can now examine the prediction error on the training set.
<<training-error>>=
trpred <- predict(svm1, Xm)
sum( trpred != resp)
table(trpred, resp)
@
As anticipated, there are no errors in the predictions on the training
set. We can also employ ten-fold cross-validation to obtain a more
honest estimate of the training set error rate.
<<trcv>>=
trcv <- svm(Xm, resp, type="C-classification", kernel="linear",
cross=10)
summary(trcv)
@
Of course we also have a test data set and we can use the svm based on
the training set to predict the class of samples in the test set.
<<comptr>>=
Xmtr <- t(exprs(golubTestSub))
tepred <- predict(svm1, Xmtr)
sum(tepred != golubTestSub$ALL)
table(tepred, golubTestSub$ALL)
@
We see that only one sample is incorrectly predicted, which suggests
that the SVM is a reasonably good classifier. The cross-validation
error rate was estimated at about 5\%, and on the test set the error
rate was 1 in 34, or about 3\%; these are in reasonable agreement.

As an exercise you could reverse the roles of the two data sets: treat
the test set as the training data and the training set as the test
data.

One group of participants should reverse the roles and repeat the
entire development given above, using the training set as the test set
and vice versa. Determine how much of the analysis changes.
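
A minimal starting point is sketched below (not evaluated). Note that
it keeps the genes selected on the original training set; a complete
reversal would also redo the non-specific filtering on the new training
data.
<<reverseRoles, eval=FALSE>>=
## Sketch only: fit the SVM on the original test set and predict the
## original training samples.
XmTest   <- t(exprs(golubTestSub))
respTest <- golubTestSub$ALL
svmRev   <- svm(XmTest, respTest, type="C-classification", kernel="linear")
revpred  <- predict(svmRev, Xm)
sum(revpred != resp)
table(revpred, resp)
@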
A second group of participants should follow this line of argument: if
both the test set and the training set are samples drawn from the same
population (which they must be if one is to be used to assess the
performance of a classifier built on the other), then we can consider a
larger experiment. The two data sets combined simply represent a larger
sample; the labels ``training set'' and ``test set'' are irrelevant,
and one should be able to draw (or create) very many pseudo training
and test sets.
An easy way to start exploring this setting is to combine the data into
one large data set; this has already been done to create
\Robject{golubMerge}. Take \Robject{golubMerge}, perform the various
processing steps on it, and determine what differences (if any)
result. Then use the merged data set and \Rfunction{svm}'s built-in
cross-validation procedure to explore the error rates obtained on the
combined data; a sketch of the processing is given below.
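
The processing of the merged data follows the same pattern as before; a
sketch (not evaluated) is shown below. It assumes the phenotype
variable of \Robject{golubMerge} is accessed in the same way as for
\Robject{golubTrain}.
<<mergedCV, eval=FALSE>>=
## Sketch only: Winsorize and filter the merged data, then use the
## built-in ten-fold cross-validation of svm on the combined samples.
Xa <- exprs(golubMerge)
Xa[Xa < Wlow]  <- Wlow
Xa[Xa > Whigh] <- Whigh
subA <- genefilter(Xa, ffun)
svmAll <- svm(t(Xa[subA, ]), golubMerge$ALL, type="C-classification",
              kernel="linear", cross=10)
summary(svmAll)
@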
Those who get through either (or both) of these exercises quickly
should consider the following.
\subsection*{How to efficiently do the exercises}
The easiest way to do the exercises given here is to use
\Rfunction{Sweave} and \Rfunction{Stangle}.
\begin{enumerate}
\item Set the working directory to the directory that contains the file
  \texttt{SVM.Rnw}.
\item Run the command \verb+Stangle("SVM.Rnw")+.
\end{enumerate}
This will create a file called \texttt{SVM.R} which contains all the R
commands. You can then open this file in your favorite editor and
change the names as appropriate.
\subsection*{How good is SVM really?}
Spang and Markowetz suggest the following experiment. Take the training
data set: there are 27 ALL patients and 11 AML patients. In the
previous exercises we selected genes that were good predictors of the
two classes. Suppose instead that we make up class labels that are not
associated with any biologically meaningful variable (that we know of).
<<exp1>>=
newlabs <- sample(c(rep("A",27), rep("B",11)), 38)
table(newlabs, golubTrain$ALL)
funnysvm <- svm(Xm, newlabs, type="C-classification", kernel="linear")
fpred <- predict(funnysvm, Xm)
sum( fpred != newlabs)
table(fpred, newlabs)
@
%$
Wow, we can predict perfectly! And as for cross-validation, well,
<<cv2>>=
trfcv <- svm(Xm, newlabs, type="C-classification", kernel="linear",
cross=10)
summary(trfcv)
@
now we see that the error rate is a bit higher than for the correct
categories. However, this does not always turn out to be the case.

So, what is going on? Classifiers such as SVMs, random forests and
neural networks are very flexible and very good at what they do. The
ability to classify the training set correctly is no evidence of
anything more than the capability of the classifier and the richness of
the feature space.
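
To see how variable this behaviour is, one could repeat the
relabelling experiment several times and look at the spread of
cross-validated accuracies. A sketch (not evaluated, and somewhat slow)
follows; it relies on the \Robject{tot.accuracy} component returned by
\Rfunction{svm} when \Robject{cross} is set.
<<permLabels, eval=FALSE>>=
## Sketch only: repeat the random-label experiment and record the
## ten-fold cross-validated accuracy for each relabelling.
accs <- replicate(10, {
    labs <- factor(sample(c(rep("A", 27), rep("B", 11)), 38))
    fit  <- svm(Xm, labs, type="C-classification", kernel="linear",
                cross=10)
    fit$tot.accuracy
})
summary(accs)
@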
\section*{Random Forests}
In this part of the laboratory exercise we will use random forests
\citep{RanForests}, via the \Rpackage{randomForest} package, to further
explore the data in the \Rpackage{golubEsets} package.
<<loadRF>>=
library(randomForest)
@
Basic use of random forests is fairly straightforward. The only
parameter that seems to be very important is \Robject{mtry}, which
controls the number of features considered at each split. The default
value is the square root of the number of features, but a smaller value
often performs better.
<<rf1>>=
set.seed(123)
rf1 <- randomForest(t(X), golubTrainSub$ALL, ntree=2000, mtry=55,
importance=TRUE)
rf1
rf2 <- randomForest(t(X), golubTrainSub$ALL, ntree=2000, mtry=35,
importance=TRUE)
rf2
@
Notice that the predictive capabilities of the random forests do not
seem to be as good as those of the SVM. Random forests seem to have
some difficulty when the group sizes are not approximately equal. The
\Robject{classwt} argument of \Rfunction{randomForest} can be used to
weight the classes, but it appears to have little or no effect here.
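
A sketch of supplying class weights is given below (not evaluated); the
weights shown are illustrative and assume the factor levels are ordered
ALL, AML.
<<classWeights, eval=FALSE>>=
## Sketch only: upweight the rarer AML class; in practice this appears
## to change the predictions very little for these data.
rfw <- randomForest(t(X), golubTrainSub$ALL, ntree=2000, mtry=55,
                    classwt=c(1, 27/11))  ## order follows the class levels
rfw
@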
We can use the prediction function to assess the ability of these two
forests to predict the class for the test set.
<<rfpred>>=
p1 <- predict(rf1, t(exprs(golubTestSub)), prox=TRUE)
table(p1$pred)
p2 <- predict(rf2, t(exprs(golubTestSub)), prox=TRUE)
table(p2$pred)
@
In both cases the predictions are poor: all samples in the test set are
classified as ALL. This seems to be an artifact of the imbalance
between the two classes. To check this, we refit the forests on a
subset of the test samples chosen so that the class sizes are more
nearly equal, and then predict the training samples.
<<evenup>>=
## a subset of the test samples with more nearly balanced class sizes
gT2 <- golubTestSub[, 7:34]
rfx <- randomForest(t(exprs(gT2)), gT2$ALL, ntree=2000, mtry=55,
                    importance=TRUE)
rfx
rfx2 <- randomForest(t(exprs(gT2)), gT2$ALL, ntree=2000, mtry=35,
                     importance=TRUE)
rfx2
px2 <- predict(rfx2, t(X))
table(px2, golubTrainSub$ALL)
@
%$
So the problem does not seem to be with the methodology per se, but
rather with this particular implementation. This is, however, worrying,
and one must be quite cautious with all machine learning software: in
general it has not been tested under a wide variety of conditions and
with a wide variety of inputs.
\subsection*{Feature Selection}
One of the nice things about the random forest technology is that it
provides some indication of which variables were most important in the
classification process. These features can be compared to those
selected by $t$-test or other means.
The current version of \Rpackage{randomForest} produces four different
variable importance statistics. Breiman has recently recommended that
only two of those be considered (the other two are too unstable). The
ones to concentrate on are measures two and four.
In the next code chunk a small function is defined that can be used to
extract the most important variables (those with the highest scores).
<<varimp>>=
varImpPlot(rf1, n.var=15)
varImpPlot(rf2, n.var=15)
impvars <- function(x, which=2, k=10) {
    v1 <- order(x$importance[, which])
    l1 <- length(v1)
    x$importance[v1[(l1-k+1):l1], which]
}
iv.rf1 <- impvars(rf1, k=25)
library(hu6800)
library(annotate)
isyms <- getSYMBOL(names(iv.rf1), data="hu6800")
@
In \citet{Golub99} the authors identified 50 genes that were most
highly related to the two classes. We could compare the genes selected
by random forests with those listed there; a sketch of one such
comparison, using ordinary $t$-statistics, is given below.
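
The sketch below (not evaluated) ranks the filtered training set genes
by an ordinary two-sample $t$-statistic and checks the overlap with the
25 most important variables from the random forest; the probe
identifiers could then be mapped to symbols as above.
<<impVsT, eval=FALSE>>=
## Sketch only: compare random forest importance with a t-statistic ranking.
tstats <- apply(X, 1, function(x)
    abs(t.test(x ~ golubTrainSub$ALL)$statistic))
topT <- names(sort(tstats, decreasing=TRUE))[1:25]
intersect(topT, names(iv.rf1))
@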
\subsection*{Exercises}
Again a number of interesting exercises present themselves.
\begin{enumerate}
\item Reverse the role of the test set and the training set and see
how the estimated prediction errors change.
\item Use the combined data set (\Robject{golubMerge}) to build a
random forest (probably select a subset with equal numbers of AML
and ALL). How well does it do?
\item Compare the variable importance measures obtained using the
balanced subset from the training set with those selected using the
whole data set. Do the important variables change very much?
\end{enumerate}
\end{document}