% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{Using 450k annotations for 27k chips}

\documentclass{article}
\usepackage{amsmath,pstricks}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}

\SweaveOpts{keep.source=TRUE}
\begin{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setkeys{Gin}{width=0.8\textwidth} 
\author{Tim Triche, Jr.}
\title{HumanMethylation27k probes in gene bodies}
\maketitle
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\part*{27k probes by design and color}

First we have to grab the probes that made it onto the 450k chip.

<<load library and identify 27k probes, eval=T>>=
library('IlluminaHumanMethylation450k.db')
probes.27k <- IlluminaHumanMethylation450k_get27k()
lapply(probes.27k, function(x) {
  if( class(x) == 'list') lapply(x, head)
  else head(x)
})
@

\clearpage 

\part*{Annotation}

How many probes align to UCSC gene bodies?  That's a bit complicated, because
each Illumina probe can be mapped to several transcripts, and each transcript
to several probes.  Normalizing the schema reduced the database size by 100MB.

<<list the 27k probe IDs, eval=T>>=
probes.27k <- unlist(probes.27k, recursive=T)
head(mget(probes.27k, IlluminaHumanMethylation450kPROBELOCATION, ifnotfound=NA))
@

\clearpage 

Location mapping is many-to-one and emerges as a list of concatenations, so we
cannot simply set up a simpleBimap object... UNLESS the concatenation is part  
of a VIEW (across four or so different tables).  So a VIEW is what we use. 

<<annotate the 27k probes, eval=T>>=

probeloc <- mget(probes.27k, IlluminaHumanMethylation450kPROBELOCATION, 
                 ifnotfound=NA)

body.or.exon <- function(x) length( grep('(Body|Exon)', x) ) > 0
length(which(unlist(lapply(probeloc, body.or.exon))))

in.body <- function(x) length( grep('Body', x) ) > 0 
gene.body.probes <- names(which(unlist(lapply(probeloc, in.body))))
length(which(unlist(lapply(probeloc, in.body))))
head(gene.body.probes)
@

They are not nearly as scarce as I initially thought. 

\part*{Versioning}

It's important to keep track of where information came from, and who is in 
charge of keeping it organized, when documenting phenomena like this. 

<<identify database and package versions, eval=T>>=
IlluminaHumanMethylation450k_dbInfo()[c(8:10,22:24,31,33),]
IlluminaHumanMethylation450kSVNID
IlluminaHumanMethylation450kBLAME
@

Writing to Bioconductor standards simply enforces this.

\clearpage

\part*{R session}
<<results=tex>>=
toLatex(sessionInfo())
@ 

\end{document}