%
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{Lab 11}
%\VignetteDepends{Biobase,annotate,golubEsets,genefilter,Design,Hmisc}
%\VignetteKeywords{Microarray}
\documentclass[12pt]{article}
\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\bibliographystyle{plainnat}
\title{Lab 11: Penalized Logistic Regression}
\begin{document}
\maketitle
In this lab we introduce you to penalized logistic regression
algorithms that are available in R. The examples in this section rely
on the data reported in \citet{Golub99}. The basic comparison here is
between ALL and AML.
<<>>=
library(Biobase)
library(annotate)
library(golubEsets)
library(genefilter)
library(Design)
library(Hmisc)
@
We will follow steps similar to those in Lab 4 to prefilter the data, i.e.
we transform the data in much the same way that \citet{Golub99}
did. The sequence of commands needed is given below.
<<>>=
data(golubTrain)
data(golubTest)
LS<-exprs(golubTrain)
cl<-golubTrain$ALL.AML
TS<-exprs(golubTest)
clts<-golubTest$ALL.AML
LS[LS<100]<-100
TS[TS<100]<-100
LS[LS>16000]<-16000
TS[TS>16000]<-16000
mmfilt <- function(r=5, d=500, na.rm=TRUE) {
  function(x) {
    minval <- min(x, na.rm=na.rm)
    maxval <- max(x, na.rm=na.rm)
    (maxval/minval > r) && (maxval - minval > d)
  }
}
mmfun <- mmfilt()
ffun <- filterfun(mmfun)
good <- genefilter(cbind(LS, TS), ffun )
sum(good) ## should get 3571
LSsub <- log10(LS[good,]) # log transform of the data
TSsub <- log10(TS[good,]) # log transform of the data
LTsub <- cbind(LSsub, TSsub) # concatenating the two sets
@
Once we have subset the data to the appropriate cases we next perform
some specific filtering to remove any genes that show little
variation across samples. We explicitly remove the non-informative
genes before applying logistic regression, using either $t$-test
statistics or an ANOVA-like $F$-test on the learning sample (beware of
selection bias).
<<>>=
gf <- gapFilter(900, 1500, .1)
ff <- filterfun(gf)
xx <- 10^LSsub ## to get back to the original units
good <- genefilter(xx, ff)
sum(good) # should be 200
# Reducing the number of genes
LS2 <- LSsub[good,]
TS2 <- TSsub[good,]
@
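For example, a $t$-test based filter restricted to the learning sample
can be built with the \Rfunction{ttest} function from
\Rpackage{genefilter}. The sketch below is only illustrative and is not
used in the rest of the lab; the $p$-value cutoff of 0.01 is an
arbitrary choice.
<<>>=
## Illustrative sketch (not used below): keep genes whose two-sample
## t-test between ALL and AML on the learning sample has p < 0.01.
## The 0.01 cutoff is arbitrary.
tfun <- ttest(cl, p=0.01)
goodT <- genefilter(LSsub, filterfun(tfun))
sum(goodT)
@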
To illustrate how to fit a simple logistic regression model
we select one gene from the list and plot the resulting
fit in what follows. Notice the lack of discrimination power.
<<>>=
y <- as.integer(cl) - 1
x <- LS2[99,]
aux <- sort(x)
fit <- glm(y ~ x, family="binomial")
auxy <- plogis(fit$coef[1] + fit$coef[2]*aux)
plot(x, y)
lines(aux, auxy)
abline(h=0.5)
@
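As a rough numerical confirmation, the in-sample error rate of this
single-gene fit can be computed directly (the 0.5 cutoff is the usual
one):
<<>>=
## In-sample misclassification rate of the single-gene model;
## fitted(fit) returns the fitted probabilities of the glm.
mean((fitted(fit) > 0.5) != y)
@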
We will now perform a penalized logistic regression fit
on the training sample using the appropriate procedures
in the \Rpackage{Design} package. The penalty parameter is chosen
by cross-validation, and the next commands may take some
time to execute.
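For reference, with penalty factor $\lambda$ and penalty matrix $P$
(here the diagonal matrix of the predictor variances),
\Rfunction{lrm.fit} maximizes the penalized log-likelihood
\begin{equation*}
\log L(\beta) - \frac{\lambda}{2}\,\beta^{T} P \beta,
\end{equation*}
with the intercept left unpenalized. The quantity
$\lambda\,\beta^{T} P \beta$ (\Robject{Lpenalty} below) is thus the
contribution of the penalty on the $-2\log L$ scale.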
<<>>=
X <- t(LS2)
Y <- as.integer(cl) - 1
n <- nrow(X)
p <- ncol(X)
X.val <- t(TS2)
Y.val <- as.integer(clts) - 1
pm <- diag(diag(var(X))) # penalty matrix
Penalty <- seq(10, 10^5, length=10)
reps <- length(Penalty)
effective.df <- effective.df2 <- aic <- aic2 <- deviance.val <-
  Lpenalty <- numeric(reps)
n.t <- round(n^.75)
ncv <- c(10, 20, 30, 40) # try several numbers of reps in cross-val.
deviance <- matrix(NA, nrow=reps, ncol=length(ncv))
for(i in 1:reps) {
  pen <- log10(Penalty[i])
  cat(format(pen), "")
  f.full <- lrm.fit(X, Y, penalty.matrix=pen*pm)
  Lpenalty[i] <- pen * t(f.full$coef[-1]) %*% pm %*% f.full$coef[-1]
  f.full.nopenalty <- lrm.fit(X, Y, penalty.matrix=0.05*pm, maxit=1)
  info.matrix.unpenalized <- solve(f.full.nopenalty$var)
  effective.df[i] <- sum(diag(info.matrix.unpenalized %*% f.full$var)) - 1
  lrchisq <- f.full.nopenalty$stats["Model L.R."]
  # lrm does all this penalty adjustment automatically (for var, d.f.,
  # chi-square)
  aic[i] <- lrchisq - 2*effective.df[i]
  #
  pred <- plogis(f.full$linear.predictors)
  score.matrix <- cbind(1, X) * (Y - pred)
  sum.u.uprime <- t(score.matrix) %*% score.matrix
  effective.df2[i] <- sum(diag(f.full$var %*% sum.u.uprime))
  aic2[i] <- lrchisq - 2*effective.df2[i]
  #
  # Shao suggested averaging 2*n cross-validations, but let's do only 40
  # and stop along the way to see if fewer is OK
  dev <- 0
  for(j in 1:max(ncv)) {
    s <- sample(1:n, n.t)
    cof <- lrm.fit(X[s,], Y[s], penalty.matrix=pen*pm)$coef
    pred <- cof[1] + (X[-s,] %*% cof[-1])
    dev <- dev - 2*sum(Y[-s]*pred + log(1 - plogis(pred)))
    for(k in 1:length(ncv)) if(j==ncv[k]) deviance[i,k] <- dev/j
  }
  #
  pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1])
  prob.val <- plogis(pred.val)
  deviance.val[i] <- -2*sum(Y.val*pred.val + log(1 - prob.val))
}
par(mfrow=c(1,3)) # to be printed as they are finished
plot(Penalty, effective.df, type="l")
lines(Penalty, effective.df2, lty=2)
plot(Penalty, Lpenalty, type="l")
title("Penalty on -2 log L")
plot(Penalty, aic, type="l")
lines(Penalty, aic2, lty=2)
@
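Before fixing the penalty for the final fit, it is natural to read an
approximately optimal value off these summaries. A minimal sketch is
given below; the criterion (minimum cross-validated deviance) is one
option among several, and selecting on \Robject{deviance.val} would
amount to tuning on the test set.
<<>>=
## One possible rule: the penalty minimising the cross-validated
## deviance recorded at the largest number of replications.
Penalty[which.min(deviance[, ncol(deviance)])]
## For comparison only (this peeks at the test set):
Penalty[which.min(deviance.val)]
@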
We can now exploit the previous results and fit a penalized regression
model with the appropriate amount of penalization. The resulting fit is
then used to predict the samples in the test set.
<<>>=
par(mfrow=c(2,1))
pen=200
f.full <- lrm.fit(X, Y, penalty.matrix=pen*pm)
predLS <- plogis(f.full$linear.predictors)
yLS=as.integer(cl)-1
yTS=as.integer(clts)-1
val.prob(predLS,yLS,m=1,cex=.5) # plotting the result
pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1])
predTS <- plogis(pred.val)
val.prob(predTS,yTS,m=1,cex=.5) # plotting the result
@
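\Rfunction{val.prob} also returns its summary statistics invisibly, so
the test-set performance can be inspected numerically. The sketch below
assumes the statistic names used by \Rpackage{Design}
(e.g.\ \texttt{"C (ROC)"} for the area under the ROC curve).
<<>>=
## val.prob returns its statistics invisibly; pl=FALSE suppresses
## the plot.  The statistic names are assumed from the Design package.
vstats <- val.prob(predTS, yTS, pl=FALSE)
vstats[c("C (ROC)", "Brier")]
@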
We can summarize the results in the following concordance tables.
<<>>=
con <- function(x, y) {
  tab <- table(x, y)
  print(tab)
  diag(tab) <- 0
  cat("error rate = ", round(100*sum(tab)/length(x), 2), "%\n")
  invisible()
}
predProbLS=(predLS>0.5)
predProbTS=(predTS>0.5)
con(as.integer(predProbLS),yLS) # concordance table in the LS
con(as.integer(predProbTS),yTS) # concordance table in the TS
@
Finally, to get a feeling for how the procedure penalized the
coefficients, we display them in several plots.
<<>>=
par(mfrow=c(2,2))
plot.ts(abs(f.full$coef[-1]))
plot.ts(sqrt(diag(var(X))) * abs(f.full$coef[-1]))
hist(f.full$coef[-1], breaks=30)
hist(sqrt(diag(var(X))) * f.full$coef[-1], breaks=30)
@
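To see which genes dominate the penalized fit, one can also rank the
standardized coefficients; listing the top 10 below is an arbitrary
illustrative choice (the names are the gene identifiers carried over
from the columns of \Robject{X}).
<<>>=
## Rank genes by absolute standardized coefficient; the names are
## inherited from the columns of X.
stdcoef <- sqrt(diag(var(X))) * f.full$coef[-1]
rev(sort(abs(stdcoef)))[1:10]
@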
\end{document}