%\VignetteIndexEntry{Summary Tables}
%\VignetteDepends{GeneticsBase}
%\VignetteKeywords{Genetics}
%\VignettePackage{GeneticsBase}

\documentclass[10pt]{article}
\usepackage{url}
\usepackage{amsmath}
\usepackage{natbib}
\usepackage{isorot}
\usepackage{fullpage} % standard 1 inch margins 

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\email}[1]{\texttt{#1}}

%%% Hyperlinks for ``PDF Latex'' :
\ifx\pdfoutput\undefined%%--- usual ``latex'' :
  %% Stuff w/out hyperref
\else%%---------------------- `` pdflatex '' : -- still gives funny errors
  \RequirePackage{hyperref}
  %% The following is R's share/texmf/hyperref.cfg :
  %% Stuff __with__ hyperref :
  \hypersetup{%
    %default: hyperindex,%
    colorlinks,%
    %default: pagebackref,%
    linktocpage,%
    %%plainpages=false,%
    linkcolor=Green,%
    citecolor=Blue,%
    urlcolor=Red,%
    pdfstartview=Fit,%
    pdfview={XYZ null null null}%
    }
  \RequirePackage{color}
  \definecolor{Blue}{rgb}{0,0,0.8}
  \definecolor{Green}{rgb}{0.1,0.75,0.1}
  \definecolor{Red}{rgb}{0.7,0,0}
  %% ESS JCGS v2 :
  %%\hypersetup{backref,colorlinks=true,pagebackref=true,hyperindex=true}
  %%\hypersetup{backref,colorlinks=false,pagebackref=true,hyperindex=true}
\fi


\usepackage{Sweave}
\begin{document}

\title{Tutorial: Using \Rpackage{GeneticsBase}}
\author{
        Gregory Warnes \\
        \email{gregory\_warnes\@urmc.rochester.edu} \\
        University of Rochester \\
        \\
        Ross Lazarus \\
        \email{ross.lazarus@channing.harvard.edu}\\
        Channing Laboratory
        }       

\maketitle

%%
%% Set up some defaults
%%
<<show=F>>=
options(width=90)
@ 


\section{Introduction}

This vignette was created as a tutorial for the 2007 BioConductor
User's Conference held in Seattle, WA, USA during August 2007, and was
presented by Dr. Warnes and Dr. Lazarus.  The material is structured
as a tutorial with a small example data set (8184 Markers x 180
Subjects belonging to 50 Families) .


\section{Outline}
\begin{enumerate}
\item Preliminaries
  \begin{enumerate}
  \item  Install the necessary packages
  \item  Load the libraries
  \end{enumerate}
  
\item Loading Data
  \begin{enumerate}  
  \item Read data from files
  \item  Error check the loaded data
  \end{enumerate}
  
\item Descriptive Statistics
  \begin{enumerate}  
  \item  Summary of allele/genotyope frequency
  \item  HWE test
  \item  Visualize Disequilbrium
  \end{enumerate}
  
\item Hypothesis Testing 
  \begin{enumerate}  
  \item  Armitage test
  \item  Logistic with synthetic covariates   
  \item  GLM with synthetic covariates \& outcome
  \end{enumerate}
  
\item Subsetting Results and Formatting Output
  \begin{enumerate}  
  \item  Construct some output tables for top 100 significant markers
  \end{enumerate}      

\item Study planning tools (power, sample size)

\end{enumerate}

\section{Preliminaries}

\subsection{Install RGenetics packages and depenedencies}

For MS-Windows it is necessary to manually install dependencies from CRAN
<<eval=FALSE>>=
install.packages( c("xtable","combinat","gdata","gplots","mvtnorm"), dep=TRUE)
@ 

Now to install the necessary packages:
<<eval=FALSE>>=
repos <- c("http://www.warnes.net/bioc2007/", "http://cran.fhcrc.org")
install.packages(c("GeneticsBase", 
                   #"GeneticsQC", 
                   "GeneticsDesign", "fbat"),
                 repos=repos, type="source", dep=TRUE)
@ 

\subsection{Load the libraries}

<<>>=
library(GeneticsBase)
#library(GeneticsQC)
library(GeneticsDesign)
library(fbat)
@ 

\section{Loading Data}

\subsection{Read data from files}

Load the full data:
<<>>=
hm.a<-readGenes(gfile="hmCEU_YRI_chr22_ALLfbat.ped", gformat="fbat")
print(hm.a)
@ 

We'll also need just the founders later:
<<>>=
hm.f<-readGenes(gfile="hmCEU_YRI_chr22_Foundersfbat.ped", gformat="fbat")
@ 

For the purpose of speeding execution of examples, we'll also create a
smaller subset of 100 markers from the original file.   
<<>>=
hm.a2<-hm.a[1:100,]
@ 

\subsection{Error check the loaded data}

Count frequencies of missing genotypes  (requires 26 seconds on my MacBook Pro)

Number of missing genotypes per subject:
<<>>=
mG<-missGFreq(hm.a2, founderOnly=TRUE, quiet=FALSE)
head(mG$nMissSubjects)
@ 
Column headers:
\begin{description}
  \item{00}  missing both alleles 
  \item{0*} 1st allele missing while the 2nd allele is not missing
  \item{*0} 1st allele is not missing while the 2nd allele is missing
\end{description}

Number of missing genotypes per marker:
<<>>=
head(mG$nMissMarkers)
@ 

Plot counts of missing genotypes:
<<fig=TRUE>>=
par(mfrow=c(2,1))
hist(mG$nMissSubjects[,1], main="",
     xlab="number of missing genotypes")
plot(1:nrow(mG$nMissSubjects),mG$nMissSubjects[,1],
     xlab="subject ID",
     ylab="number of missing genotypes",
     type="h")
title("Counts of missing genotypes")
par(mfrow=c(1,1))
@ 

\section{Descriptive Statistics}
  \begin{enumerate}
  \item  Summary of allele/genotyope frequency
  \item  HWE test
  \item  Visualize Disequilbrium
  \end{enumerate}

Basic data quality checks for markers.  Column headings are:
\begin{description}
\item{ObsHET} observed proportion of heterozygous genotypes per marker
\item{PredHET} predicted proportion of heterozygous genotypes per marker
\item{HWpval} pvalues of Hardy-Weinberg test per marker
\item{pGeno} percentage of non-missing genotypes for markes
\item{MAF} minor allele frequencies. missing allele are excluded from calculation
\item{Rating} 1 if passes HW test; 0 if failed HW test.
\end{description}
<<>>=
################
cM<-checkMarkers(hm.a2)
head(cM)
@ 

Check Mendelian errors:
<<>>=
cMend<-checkMendelian(hm.a2, quiet = FALSE)
@ 

Number of Mendelian errors per marker:
<<>>=
head(cMend$nMerrMarker)
@ 

Number of Mendelian errors per family:
<<>>=
head(cMend$nMerrFamily)
@ 

Plot counts of Mendelian errors
<<fig=TRUE>>=
par(mfrow=c(2,1))
hist(cMend$nMerrMarker, main="",
     xlab="number of Mendelian errors")
plot(1:length(cMend$nMerrFamily),cMend$nMerrFamily, 
     xlab="family ID", ylab="number of Mendian Errors")
par(mfrow=c(1,1))
@ 

\subsection{Summary of allele/genotype frequency based on GeneticsBase functions}

Allele summary, including allele counts, allele frequencies, 95\% CI of allele frequencies:
<<>>=
t1<-alleleSummary(hm.a2[1:10])
t1
@ 

Same for genotypes:
<<>>=
t2<-genotypeSummary(hm.a2[1:10], founderOnly=TRUE)
t2
@ 

HWE  test:
<<>>=
hwe<-HWE(hm.a2[1:10])
hwe
@ 

Defatult graphical display of LD:
<<fig=TRUE>>=
ld.small<-LD(hm.a2[1:20])
plot(ld.small)
@ 

Alternative graphical display of LD:
<<fig=TRUE>>=
ld <- LD(hm.a2)
LDView(ld@"X^2")
@ 

\section{Hypothesis Testing}

\subsection{Armitage test}

For the following examples, suppose 'A' is the minor allele, and 'a'
is the major allele.

Armitage test using additive model to code genotype:
\begin{tabular}{cc}
  genotype &  coding \\
  AA       &  2      \\
  Aa       &  1      \\
  aa       &  0      \\
\end{tabular}
<<>>=
res.A<-Armitage(hm.a2, method="A")
head(res.A)
@ 

Armitage test using recessive model to code genotype:
\begin{tabular}{cc}
  genotype &  coding \\
  AA       &  1 \\
  Aa       &  0 \\
  aa       &  0 \\
\end{tabular}
<<>>=  
res.R<-Armitage(hm.a2, method="R")
head(res.R)
@ 

Armitage test using dominant model to code genotype:
\begin{tabular}{cc}
  genotype &  coding \\
  AA       &  1 \\
  Aa       &  1 \\
  aa       &  0 \\
\end{tabular}
<<>>=
res.D<-Armitage(hm.a2, method="D")
head(res.D)
@

\subsection{Logistic regression}

First, we need to construct some synthetic covariates on the founders

<<>>=
sampleInfo(hm.f)$race <- sampleInfo(hm.f)$affected

raceval <- sampleInfo(hm.f)$race-1

sampleInfo(hm.f)$Norm0.0 <- rnorm(nobs(hm.f), mean=0.0*raceval)
sampleInfo(hm.f)$Norm0.5 <- rnorm(nobs(hm.f), mean=0.5*raceval)
sampleInfo(hm.f)$Norm1.0 <- rnorm(nobs(hm.f), mean=1.0*raceval)
sampleInfo(hm.f)$Norm1.5 <- rnorm(nobs(hm.f), mean=1.5*sampleInfo(hm.f)$race)

doSample <- function(raceval, mult)
  {
    prob <-  c( 0.33-raceval*mult, 0.33+(raceval*mult)/2, 0.33+(raceval*mult)/2)
    factor(sample( x=c("Red","Green","Blue"), size=1, p=prob, rep=T))
  }
    
sampleInfo(hm.f)$Cat0.0  <- sapply( raceval, doSample, mult=0.0 )
sampleInfo(hm.f)$Cat0.1  <- sapply( raceval, doSample, mult=0.1 )
sampleInfo(hm.f)$Cat0.2  <- sapply( raceval, doSample, mult=0.2 )
sampleInfo(hm.f)$Cat0.3  <- sapply( raceval, doSample, mult=0.3 )
@ 

Now, construct a function to fit the regression model and return the
parameters and statistics that are of interest.

%%
%% Put model function here both in executable form and as pure 
%% verbatim so comments are preserved. 
%% 

\begin{verbatim}
model <- function( markerName )
  {
    # extract requested genetic marker
    genotype <- genotypes(hm.f,marker=markerName)

    # get data frame to use for fitting the model
    mframe <- model.frame(race ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + genotype,
                          data=sampleInfo(hm.f) )

    # To test significance of a term, best method is to do anova of
    # the full model against a submodel omitting the particular term.
    # This avoids issues with changes in names of factor levels,
    # presence or absence of covaraiates, etc.
    result <- try(
                  {
  fit.with    <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + as.factor(genotype),
                     data=mframe, family="binomial")
  fit.without <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5,
                     data=mframe, family="binomial")
                    anova(fit.with, fit.without, test="Chisq")$"P(>|Chi|)"[2]
                  }
                  )

    if(class(result)=="try-error")
      return(NA)  # or return(result) to see the error messages
    else

      result # full result.  Usually we want to specify exactly which
             # parameters and stats get returned so the format is consistent
             # across all markers.
  }
\end{verbatim}

<<show=FALSE>>=
model <- function( markerName )
  {
    # extract requested genetic marker
    genotype <- genotypes(hm.f,marker=markerName)

    # get data frame to use for fitting the model
    mframe <- model.frame(race ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + genotype,
                          data=sampleInfo(hm.f) )

    # To test significance of a term, best method is to do anova of
    # the full model against a submodel omitting the particular term.
    # This avoids issues with changes in names of factor levels,
    # presence or absence of covaraiates, etc.
    result <- try(
                  {
  fit.with    <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5 + as.factor(genotype),
                     data=mframe, family="binomial")
  fit.without <- glm( race==1 ~ sex + Norm0.0 + Norm0.5 + Norm1.0 + Norm1.5,
                     data=mframe, family="binomial")
                    anova(fit.with, fit.without, test="Chisq")$"P(>|Chi|)"[2]
                  }
                  )

    if(class(result)=="try-error")
      return(NA)  # or return(result) to see the error messages
    else

      result # full result.  Usually we want to specify exactly which
             # parameters and stats get returned so the format is consistent
             # across all markers.
  }
@ 

Fit to a subset of 50, then 100 and use this information to compute the expected run time for all markers:
<<>>=
t1 <- unix.time(
fits <- sapply( markerNames(hm.f)[1:50], model )
)[3]
t1
@ 
(This takes 5.365 seconds on my MacBook Pro, R 2.4.1)

<<>>=
t2 <- unix.time(
fits <- sapply( markerNames(hm.f)[1:100], model )
)[3]
t2
@ 
(This takes 11.878 seconds on my MacBook Pro, R 2.4.1)



Estimate total time to complete, in minutes
<<>>=
t1 + (t2 - t1)/50 * (nmarker(hm.f)-50) / 60
@ 
(This yields 23.02 minutes on my MacBook Pro, R 2.4.1)


Plot the p-values for the first 100 markers
<<fig=TRUE>>=
fits.sorted <- sort(fits)
plot(-log(fits),  type="h", col="blue")
labels <- c(0.05, 0.01, 0.001, 0.0001, 0.00001, 0.00001)
abline(h=-log(labels), lty=2, col="red")
mtext(text=as.character(labels), side=4, at=-log(labels), col="red", las=1)
title("Per-marker statistical significance")
@ 

\subsection{Family-Based Associaton Test ('fbat')}

Do the fbat calculations:
<<>>=
f<-fbat(hm.a2)
@ 

Show the p-values:
<<>>=
summaryPvalue(f)
@ 

Look at the fit details for a specific marker:
<<>>=
viewstat(f, "22_14884399_rs4911642")
@ 

\section{Study planning tools (GeneticsDesign package)}

\subsection{Power to detect a low-frequencty allele}

Compute the probability of missing an allele with
frequency 0.15 when 20 genotypes are sampled:
<<>>=
gregorius(freq=0.15, N=20)
@ 

Determine what sample size is required to observe all alleles with true
frequency 0.15 with probability 0.95
<<>>=
gregorius(freq=0.15, missprob=1-0.95)
@ 

\subsection{Power for a genetics study using a quantitative outcome}
Calculate power for genetics study using a quantitative outcome:
<<>>=
GeneticPower.Quantitative.Numeric(
                                  N=50,
                                  freq=0.1,
                                  minh="recessive",
                                  alpha=0.05
                                  )

GeneticPower.Quantitative.Factor(
                                  N=50,
                                  freq=0.1,
                                  minh="recessive",
                                  alpha=0.05
                                 )
@ 

For a range of sample sizes:
<<>>=
power.range <- function( N, ... )
  {
    sapply(
           N,
           function(n)
           GeneticPower.Quantitative.Numeric(
                                             N=n,
                                             ...
                                             )
           )
  }

power.range(
            N=c(25,50,100,200,500), 
            freq=0.1, 
            minh="recessive",
            alpha=0.05 
            )

@ 

Create a power table:
<<>>=
fun <- function(p)
  power.range(freq=p,
              N=seq(100,1000,by=100), 
              alpha=5e-2, 
              minh='recessive')

m <- sapply( X=seq(0.1,0.9, by=0.1), fun )
colnames(m) <- seq(0.1,0.9, by=0.1)
rownames(m) <- seq(100,1000,by=100)

print(round(m,2))
@ 

\subsection{Power calculator for genetic linear trend association studies.}

The power is for the test that disease is associated with a
marker, given high risk allele frequency ('A'), disease
prevalence, genotype relative risk ('Aa'), genotype relative risk
('AA'), LD measure ($D'$ or $R^2$), marker allele frequency ('B'),
number of cases, control:case ratio, and probability of the Type I
error. The linear trend test (Cochran 1954; Armitage 1955) is
used.

Using $R^2$ as the measuer of LD:
<<>>=     
res1<-GPC(pA=0.05, pD=0.1, RRAa=1.414, RRAA=2, r2=0.9, pB=0.06, 
                        nCase=500, ratio=1, alpha=0.05, quiet=FALSE)
@ 

Using $D'$ as the measure of LD:
<<>>=
res2<-GPC.default(pA=0.05, pD=0.1, RRAa=1.414, RRAA=2, Dprime=0.9, pB=0.06, 
                  nCase=500, ratio=1, alpha=0.05, quiet=FALSE)

@ 

\begin{center}
  \textbf{
    The End.
    }
\end{center}
  

%\clearpage
%
%\bibliographystyle{biometrics}
%\begin{thebibliography}{}
%
%\item [~\hspace{0.125in}~Warnes~GR.]{ ``The Genetics
%    Package: Utilities for handling genetic data'' \code{R News},
%    Volume 3, Issue 1, June 2003.}
%
%\item [~\hspace{0.125in}~Warnes~GR.] ``genetics'', a package for
%  handling marker-based genetic data within the open-source
%  statistical package ``R''.  The package includes function to compute
%  allele frequencies, use genetic markers in statistical models,
%  estimate disequilibrium, and test for departure from Hardy-Weinberg
%  equilibrium. \\
%  \verb+http://cran.us.r-project.org/src/contrib/PACKAGES.html#genetics+, 
%  2002-
%
%\end{thebibliography}


\end{document}