% -*- mode: noweb;noweb-font-lock-mode: R-mode; -*-
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{timecourse manual}
% \VignetteDepends{timecourse}
% \VignetteKeywords{timecourse}
% \VignettePackage{timecourse}
\documentclass[11pt]{article}

\usepackage{amsmath,fullpage}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\SweaveOpts{echo=FALSE}
\usepackage{a4wide}

\parindent 0in

\bibliographystyle{abbrvnat}

\begin{document}

\title{\bf The timecourse Package}

\author{Yu Chuan Tai}

\maketitle

\begin{center}
Institute for Human Genetics, University of California, San Francisco \\ 
{\tt taiy@humgen.ucsf.edu}
\end{center}

\tableofcontents



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Overview}

The {\tt timecourse} package aims to serve as
a comprehensive library for the analysis of developmental microarray time course data. 
The current version includes functions for identifying genes of interest from 
longitudinal replicated developmental microarray time course experiments with one or more biological 
conditions.
The main functions are {\bf mb.long()} and {\bf mb.MANOVA()}. 
They calculate the $MB$-statistics and/or
$\widetilde{T}^2$ statistics derived in \cite{TaiSpeed2006} and 
\cite{Taithesis2005}, using multivariate 
empirical Bayes 
approaches. The input object for {\bf mb.long()} and {\bf mb.MANOVA()}
can be a {\tt matrix}, {\tt MAList}, {\tt marrayNorm}, 
or {\tt ExpressionSet}, containing properly preprocessed $log_2$ expression values or ratios.
\par
The multivariate empirical Bayes models proposed in \cite{TaiSpeed2006}
and \cite{Taithesis2005} 
have the advantage over the traditional $F$-statistic in that they 
incorporate
replicate variances, the correlations among gene expression time point samples from longitudinal data, 
and moderation borrowing
the information across genes into the analysis to reduce the numbers of false positives and false
negatives induced by those poorly estimated variance-covariance matrices.
Please see \cite{TaiSpeed2006}, \cite{Tai2005}, and \cite{Taithesis2005} for more details.  
The results from {\bf mb.long()} and {\bf mb.MANOVA()}
are stored in a list object of class {\tt MArrayTC}, 
which contains various useful information and is used as the input object
for the plotting function {\bf plotProfile()}. {\tt MArrayTC} is a subclass of the
virtual class {\tt LargeDataObject} defined in package {\tt limma} \citep{Smyth2004, Smyth2005}
and inherits a show method there.
For more information, type
<<echo=TRUE,print=FALSE>>=
library(timecourse) # load timecourse library
@
\begin{Schunk}
\begin{Sinput}
>help("MArrayTC-class") # look at the help files
>help(mb.long)
>help(mb.MANOVA)
\end{Sinput}
\end{Schunk}

\section{Longitudinal one-sample problem}
\subsection{Data}
The dataset used in the example was generated by \cite{Tomancak2002}. 
Samples from {\it Drosophila} embryos were collected and hybridized onto
Affymetrix chips at hours 1 to 12 post egg laying. The same procedure was 
conducted on 3 different days, yielding 3 replicates and 12 time points.
Since mRNA samples at different times were extracted from different
embryos, it is not a {\it true}
longitudinal design. However, since parallel and identically treated
embryos within
each experiment were very similar, sampling these embryos within each
experiment
at different time points may approximate a longitudinal study.
Therefore, we treat this dataset as longitudinal and proceed with the
one-sample statistic calculated from {\bf mb.long()}.

The dataset has been preprocessed using RMA algorithm \citep{Irizarry2003, Bolstad2003} 
and $log_2$ expression values
were extracted from the RMA output. To reduce the computational time, 
we only include the first 2000 probesets in our demonstration.  

\subsection{Genes of interest}
Suppose we are interested in genes which change over time, the following codes give it a quick start:

1. Assign time and replicate group to each array: 

The default of {\bf mb.long()} assumes the columns (arrays) of the expression 
matrix from the input object are arranged in the order of biological conditions,
replicates and times. 
For the one-sample problem, we only have 1 biological condition, so
the function assumes the columns are arranged in the order of replicates
and times.
Hence, if the 3 replicates are named as {\tt A}, {\tt B}, and {\tt C}, 
the following column arrangement of the matrix {\tt fruitfly} will be the same as the default.  

<<echo=TRUE,print=FALSE>>=
data(fruitfly)
dim(fruitfly)
colnames(fruitfly)
gnames <- rownames(fruitfly)
@

The number of time points and sample sizes are always required as inputs.
If the arrays are not in the default order, one needs to assign the time and 
replicate groups for each array. 
In our example, they are in the default order, so we make the assignments just for demo purpose.
<<echo=TRUE,print=FALSE>>=
assay <- rep(c("A", "B", "C"), each=12)
time.grp <- rep(c(1:12), 3)
size <- rep(3, 2000)
@ 

2. Calculate the $MB$-statistics and $\widetilde{T}^2$ statistics, and plot individual genes of interest.
<<echo=TRUE,print=FALSE>>=
out1 <- mb.long(fruitfly, times=12, reps=size)
out2 <- mb.long(fruitfly, times=12, reps=size, rep.grp=assay, time.grp=time.grp)
@

\section{Longitudinal two-sample problem}
\subsection{Data}
Now we simulate a dataset with two biological conditions.
There are 1000 genes in total, and 20 of them have the same expected time course profiles
between these two biological conditions.  
Each gene has 3 time course replicates and 5 time points for each condition.
Note that this simulation is for demonstration purpose only,
and does not necessarily reflect the real 
features of longitudinal time course data.

<<echo=TRUE,print=FALSE>>=
SS <- matrix(c( 1e-02, -8e-04, -0.003,  7e-03,  2e-03,
               -8e-04,  2e-02,  0.002, -4e-04, -1e-03,
               -3e-03,  2e-03,  0.030, -5e-03, -9e-03,
                7e-03, -4e-04, -0.005,  2e-02,  8e-04,
                2e-03, -1e-03, -0.009,  8e-04,  7e-02), ncol=5)

sim.Sigma <- function()
{
    S <- matrix(rep(0,25),ncol=5)
    x <- mvrnorm(n=10, mu=rep(0,5), Sigma=10*SS)
    for(i in 1:10)
       S <- S+crossprod(t(x[i,]))

       solve(S)

}


sim.data2 <- function(x, indx=1)
{
    mu <- rep(runif(1,8,x[1]),5)
    if(indx==1)
       res <- c(as.numeric(t(mvrnorm(n=3, mu=mu+rnorm(5,sd=5), Sigma=sim.Sigma()))),
                  as.numeric(t(mvrnorm(n=3, mu=mu+rnorm(5,sd=3.2), Sigma=sim.Sigma()))))

     if(indx==0) res <- as.numeric(t(mvrnorm(n=6, mu=mu+rnorm(5,sd=3), Sigma=sim.Sigma())))
       res 
}

M2 <- matrix(rep(14,1000*30), ncol=30)
M2[1:20,] <- t(apply(M2[1:20,],1,sim.data2))
M2[21:1000,] <- t(apply(M2[21:1000,],1,sim.data2, 0)) 
@ 

\subsection{Genes of interest: paired two-sample problem}
Assume it is a paired two-sample problem, we can do the following.
Note that, R is case-sensitive and since
{\tt mt} $<$ {\tt wt }, the first column of the input argument {\tt reps} should be the sample
sizes of all genes for {\tt mt}, while the second column is
those for {\tt wt}. The same rule applies for the input argument {\tt size} for {\bf mb.MANOVA()}.
 
<<echo=TRUE,print=FALSE>>=
trt <- rep(c("wt","mt"),each=15)
assay <- rep(rep(c("rep1","rep2","rep3"),each=5),2)
size <- matrix(3, nrow=1000, ncol=2)
MB.paired <- mb.long(M2, method="paired", times=5, reps=size, condition.grp=trt, rep.grp=assay)
genenames <- as.character(1:1000)
@ 

\subsection{Genes of interest: independent two-sample problem}
Now assume it is an independent two-sample problem
<<echo=TRUE,print=FALSE>>=
MB.2D <- mb.long(M2, method="2", times=5, reps=size, condition.grp=trt, rep.grp=assay)
@  

\section{Longitudinal multi-sample problem}
\subsection{Data}
Now let's simulate a dataset with three biological conditions
500 genes in total, 10 of them have different expected time course profiles
across biological conditions:
the first condition has 3 replicates, while the second condition has 4 replicates,
and the third condition has 2 replicates. 5 time points for each condition.

<<echo=TRUE,print=FALSE>>=
sim.data <- function(x, indx=1)
{
   mu <- rep(runif(1,8,x[1]),5)
   if(indx==1)
     res <- c(as.numeric(t(mvrnorm(n=3, mu=mu+rnorm(5,sd=5), Sigma=sim.Sigma()))),
             as.numeric(t(mvrnorm(n=4, mu=mu+rnorm(5,sd=3.2), Sigma=sim.Sigma()))),
             as.numeric(t(mvrnorm(n=2, mu=mu+rnorm(5,sd=2), Sigma=sim.Sigma()))))

   if(indx==0) res <- as.numeric(t(mvrnorm(n=9, mu=mu+rnorm(5,sd=3), Sigma=sim.Sigma())))
   res
}

M <- matrix(rep(14,500*45), ncol=45)
M[1:10,] <- t(apply(M[1:10,],1,sim.data))
M[11:500,] <- t(apply(M[11:500,],1,sim.data, 0))
@
\subsection{Genes of interest}
The following codes show an example of
identifying genes with different temporal profiles across biological conditions: 
<<echo=TRUE,print=FALSE>>=
assay <- rep(c("1.2.04","2.4.04","3.5.04","5.21.04","7.17.04","9.10.04","12.1.04","1.2.05","4.1.05"),each=5)
trt <- c(rep(c("wildtype","mutant1"),each=15),rep("mutant1",5), rep("mutant2", 10))

# Caution: since "mutant1" < "mutant2" < "wildtype", the sample sizes should be in the order of 4,2,3,
# but NOT 3,4,2.
size <- matrix(c(4,2,3), byrow=TRUE, nrow=500, ncol=3)
MB.multi <- mb.MANOVA(M, times=5, D=3, size=size, rep.grp=assay, condition.grp=trt)
@             
\section{The amount of moderation}
It is always a good idea to check the amount of moderation, which
is determined by the diagonal elements of the (transformed)
gene-specific (pooled) variance-covariance matrices.
The more homogenous these diagonal elements are across genes, the larger the amount of moderation will
occur.
Try different values of the input argument {\bf prior.df} to see what happens.
\section{Which Statistic?}
In the one- and two- sample longitudinal problems, when the sample size(s) are {\it different} across 
genes, the $MB$-statistics are the ones
you should use for ranking. 
Otherwise, the $MB$-statistics and $\widetilde{T}^2$ produce 
the same ranking. 
\section{Missing data}
For version 1.0.0, 
missing time point samples are not allowed in {\bf mb.long()} 
and {\bf mb.MANOVA()}, while missing
replicates are allowed for a subset of genes. In the latter case, the user still
need to input a complete data matrix with missing replicates indicated by NAs. 
<<echo=TRUE,print=FALSE>>=
fruitfly[6, 13:24] <- NA  # The 6th gene has 1 missing replicate
size <- rep(3, 2000)
size[6] <- 2
MB.missing <- mb.long(fruitfly, times=12, reps=size, HotellingT2.only=FALSE)
@ 
\section{Outliers}
When type$=$robust, the numerator of the $\widetilde{T}^2$ statistic is calculated using
the weighted average time course vector(s), where the weight for
each data point is determined by
Huber's weight function \citep{Huber1981, Fox2002} with the default tuning constant 1.345.
\par
When there are only 2 replicates within conditions,
type$=$robust produces the same rankings as type$=$none
since there is no concensus on gene expression values.
We recommend that at least 3 replicates should be performed
when designing an experiment so that
outliers can be determined by concensus.  
Check the output weights for these outliers.
%%\SweaveOpts{echo=true}
\section{Acknowledgements}
The {\tt timecourse} package has greatly benefited from many critical comments by Terry Speed.
The {\it Drosophila} data were provided by Pavel Tomancak and his colleagues.
Thanks are also due to Ben Bolstad for his advices on building up this package, and 
Karen Vranizan, Yun Zhou, Jun Li and their Pritzker colleagues, and Soyeon Ahn for using the test 
version of this package and their helpful feedbacks.
\bibliography{timecourse}
\section{Temporal profile plots}
The temporal profile(s) for any single gene can be plotted using the {\tt plotProfile()} with
the resulting {\tt MArrayTC} object as the input, see below 
for examples. The user can either specify the rank or the gene ID of the gene to be plotted.
\begin{figure}[htbp]
\begin{center}
<<fig=TRUE,echo=TRUE>>=
## plots the no. 1 gene
plotProfile(out2, type="b", gnames=gnames, legloc=c(2,15), pch=c("A","B","C"), xlab="Hour")
@
\end{center}
\caption{The no. 1 gene of {\it fruitfly}.}
\end{figure}

\begin{figure}[htbp]
\begin{center}
<<fig=TRUE,echo=TRUE>>=
## plots the no. 100 gene
plotProfile(out2, type="b", gnames=gnames, pch=c("A","B","C"), xlab="Hour", ranking=100)
@ 
\end{center}
\caption{The no. 100 gene of {\it fruitfly}}
\end{figure}
\begin{figure}[htbp]
\begin{center}
<<fig=TRUE,echo=TRUE>>=
## plots the gene 141404_at
plotProfile(out2, type="b", gnames=gnames, pch=c("A","B","C"), xlab="Hour", gid="141404_at")

@ 
\end{center}
\caption{Gene 141404\_at in {\it fruitfly}}
\end{figure}

\begin{figure}[htbp]
\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotProfile(MB.paired,type="b", gnames=genenames)
@ 
\end{center}
\caption{The no. 1 gene for paired two-sample problem.}
\end{figure}

\begin{figure}[htbp]
\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotProfile(MB.2D,type="b", gnames=genenames)
@ 
\end{center}
\caption{The no. 1 gene for independent two-sample problem.}
\end{figure}
\begin{figure}[htbp]   
\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotProfile(MB.multi, stats="MB", type="b")
@
\end{center}
\caption{The no. 1 gene for multi-sample problem.}
\end{figure}
\begin{figure}[h]
\begin{center}
<<fig=TRUE,echo=TRUE>>=
plotProfile(MB.missing, stats="MB", type="b", gid="141205_at", 
gnames=gnames,pch=c("A","B","C"))
@ 
\end{center}
\caption{The gene 141205\_at in fruitfly: one replicate is missing.}
\end{figure}
\end{document}