%\VignetteIndexEntry{CMST Tutorial}
%\VignetteKeywords{causal,threshold,QTL}

\documentclass[11pt]{article}

\usepackage{epsf}
\usepackage{epsfig}
\usepackage{amssymb,amsmath}
\usepackage{amsbsy}
\usepackage[all]{xy}
\usepackage{latexsym}
\usepackage{float}
\usepackage{amsthm}



\usepackage{graphicx}
\usepackage{Sweave}

%\topmargin 0.0cm
%\oddsidemargin 0.5cm
%\evensidemargin 0.5cm
%\textwidth 16cm
%\textheight 21cm

\topmargin 0.0cm
\oddsidemargin 0cm
\evensidemargin 0cm
\textwidth 16cm
\textheight 21cm

%\usepackage{setspace}
%\doublespacing

\date{\today}

%% ** EDIT HERE **
%% PLEASE INCLUDE ALL MACROS BELOW
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{result}{Result}
\newcommand{\x}{{\bf x}}
\newcommand{\z}{{\bf z}}
\newcommand{\Z}{{\bf Z}}
\newcommand{\m}{{\bf m}}
\newcommand{\M}{{\bf M}}
\newcommand{\y}{{\bf y}}
\newcommand{\Y}{{\bf Y}}
\newcommand{\C}{{\bf C}}
\newcommand{\V}{{\bf V}}
\newcommand{\bS}{{\bf S}}
\newcommand{\HH}{{\bf H}}
\newcommand{\mone}{M_{1}}
\newcommand{\mtwo}{M_{2}}
\newcommand{\mthree}{M_{3}}
\newcommand{\q}{{\bf q}}
\newcommand{\Q}{{\bf Q}}
\newcommand{\X}{{\bf X}}
\newcommand{\I}{{\bf I}}
\newcommand{\G}{{\bf G}}
\newcommand{\zero}{{\bf 0}}
\newcommand{\e}{\boldsymbol{\epsilon}}
\newcommand{\be}{\boldsymbol{\beta}}
\newcommand{\Sig}{\boldsymbol{\Sigma}}
\newcommand{\sig}{\boldsymbol{\sigma}}
\newcommand{\Om}{\boldsymbol{\Omega}}
\newcommand{\muu}{\boldsymbol{\mu}}
\newcommand{\Ga}{\boldsymbol{\Gamma}}
\newcommand{\Tht}{\boldsymbol{\Theta}}
\newcommand{\tht}{\boldsymbol{\theta}}
\newcommand{\lbd}{\boldsymbol{\lambda}}
\newcommand{\ga}{\boldsymbol{\gamma}}
\newcommand{\Ro}{\boldsymbol{\rho}}
\newcommand{\mM}{\mathcal{M}}
\newcommand{\mG}{\mathcal{G}}
\newcommand{\ci}{\perp\!\!\!\perp}
\newcommand{\nci}{\not\!\perp\!\!\!\perp}
\newcommand{\ind}{1\!\!1}
\newcommand{\nuM}{\;\not\!M}
\newcommand{\nM}{\not\!\!\!M}
%% END MACROS SECTION

%\renewcommand{\figurename}{Figure S}

%\newfloat{supplefig}{htbp}{lgr}
%\floatname{supplefig}{Figure S}

\title{Causal Model Selection Hypothesis Tests in Systems Genetics: \\ a tutorial}

\author{Elias Chaibub Neto\footnote{Department of Computational Biology, Sage Bionetworks, Seattle WA} and Brian S Yandell\footnote{Department of Statistics, University of Wisconsin-Madison, Madison WI}}



\begin{document}

\maketitle


\section{Motivation}

Current efforts in systems genetics have focused on the development of statistical approaches that aim to disentangle causal relationships among molecular phenotypes in segregating populations. Model selection criterions, such as the AIC and BIC, have been widely used for this purpose, in spite of being unable to quantify the uncertainty associated with the model selection call. In this tutorial we illustrate the use of software implementing the causal model selection hypothesis tests proposed by Chaibub Neto et al. (2012).

\section{Overview}

This tutorial illustrates the basic functionality of the CMST routines in the \texttt{qtlhot} R package using few simulated toy examples. The analysis of a yeast genetical genomics data-set presented in Chaibub Neto et al. (2012) is reproduced in a separate package, \texttt{R/qtlyeast}. The \texttt{R/qtlhot} package depends on \texttt{R/qtl} (Broman et al. 2003), and we assume the reader is familiar with it.

\section{Basic functionality}

Here, we illustrate the basic functionality of the CMST routines in the \texttt{R/qtlhot} package in a toy simulated example.

<<>>=
library(qtlhot)
@ 

We first use the \texttt{SimCrossCausal} function to simulate a \texttt{cross} object with 3 phenotypes, $y_1$, $y_2$ and $y_3$, where $y_1$ has a causal effect on both $y_2$ and $y_3$. The simulated cross data set, \texttt{Cross}, is composed of: 100 individuals (\texttt{n.ind = 100}); 3 chromosomes of length 100cM (\texttt{len = rep(100, 3)}); 101 unequally spaced markers per chromosome (\texttt{n.mar = 101} and \texttt{eq.spacing = FALSE}); additive genetic effect set to 1 (\texttt{add.eff = 1}); dominance genetic effect set to 0 (\texttt{dom.eff = 0}); residual variances for $y_1$ (\texttt{sig2.1}) and the other phenotypes (\texttt{sig2.2}) set to 0.4 and 0.1, respectively; backcross cross type (\texttt{cross.type = "bc"}); and phenotype data transformed to normal scores (\texttt{normalize = TRUE}). The argument \texttt{beta = rep(0.5, 2)}, represents the causal effect of $y_1$ on the other phenotypes (i.e., coefficients of the regressions of $y_2 = 0.5 \, y_1 + \epsilon$ and $y_3 = 0.5 \, y_1 + \epsilon$). The length of beta controls the number of phenotypes to be simulated.

<<>>=
set.seed(987654321)
CMSTCross <- SimCrossCausal(n.ind = 100,
                        len = rep(100, 3),
                        n.mar = 101,
                        beta = rep(0.5, 2),
                        add.eff = 1,
                        dom.eff = 0,
                        sig2.1 = 0.4,
                        sig2.2 = 0.1,
                        eq.spacing = FALSE,
                        cross.type = "bc",
                        normalize = TRUE)
@ 

We compute the genotype conditional probabilities using Haldane's map function, genotype error rate of 0.0001, and setting the maximum distance between positions at which genotype probabilities were calculated to 1cM.

<<>>=
CMSTCross <- calc.genoprob(CMSTCross, step = 1)
@ 

We perform QTL mapping using Haley-Knott regression (Haley and Knott 1992), and summarize the results for the 3 phenotypes. Figure \ref{lod.profiles} presents the LOD score profiles for all 3 phenotypes. The black, blue and red curves represent the LOD profiles of phenotypes $y_1$, $y_2$ and $y_3$, respectively.

<<>>=
Scan <- scanone(CMSTCross, pheno.col = 1 : 3, method = "hk")
summary(Scan[, c(1, 2, 3)], thr = 3)
summary(Scan[, c(1, 2, 4)], thr = 3)
summary(Scan[, c(1, 2, 5)], thr = 3)
@ 

\begin{figure}[!h]
\begin{center}
<<label=lodprofiles,width=6,height=6,fig=TRUE>>=
plot(Scan, lodcolumn = 1 : 3, ylab = "LOD")
@ 
\caption{LOD score profiles for phenotypes $y_1$ (black curve), $y_2$ (blue curve) and $y_3$ (red curve).}
\label{lod.profiles}
\end{center}
\end{figure}

Phenotypes $y_1$ and $y_2$ map to exactly same QTL at position 55 cM on chromosome 1. Phenotype $y_3$ maps to a QTL at position 55.5 cM. Whenever two phenotypes map to close, but not exactly identical, positions we are faced with the question of which QTL to use as causal anchor. Instead of making a (sometimes) arbitrary choice, our approach is to compute the joint LOD profile of both phenotypes and use the QTL detected by this joint mapping approach as the causal anchor. The function \texttt{GetCommonQtls} performs the joint QTL mapping for phenotypes whose marginal LOD peak positions are higher than a certain LOD threshold (\texttt{thr}), and are less than a fixed distance apart (\texttt{peak.dist}). The function can also handle separate additive and interacting covariates for each phenotype (\texttt{addcov1}, \texttt{intcov1}, \texttt{addcov2}, \texttt{intcov2}). In this simulated example the QTL detected by the joint analysis agreed with phenotype's $y_1$ QTL.

<<>>=
commqtls <- GetCommonQtls(CMSTCross,
                          pheno1 = "y1",
                          pheno2 = "y3",
                          thr = 3,
                          peak.dist = 5,
                          addcov1 = NULL,
                          addcov2 = NULL,
                          intcov1 = NULL,
                          intcov2 = NULL)
commqtls
@ 

Now, we fit our causal model selection tests for phenotypes $y_1$ and $y_2$ using the \texttt{CMSTtests} function. The \texttt{Q.chr} and \texttt{Q.pos} arguments specify the chromosome and position (in cM) of the QTL to be used as a causal anchor. The argument \texttt{method} specify which version of the CMST test should be used. The options \texttt{"par"}, \texttt{"non.par"} and \texttt{"joint"} represent, respectively, the parametric, non-parametric, joint parametric versions of the CMST test. The option \texttt{"all"} fits all three versions. The \texttt{penalty} argument specifies whether we should test statistics based on the AIC (\texttt{"aic"}), BIC (\texttt{"bic"}), or both (\texttt{"both"}) penalties. In this particular call we computed all 3 versions using both penalties fitting 6 separate CMST tests.

<<>>=
nms <- names(CMSTCross$pheno)
out1 <- CMSTtests(CMSTCross,
                  pheno1 = nms[1],
                  pheno2 = nms[2],
                  Q.chr = 1,
                  Q.pos = 55,
                  addcov1 = NULL,
                  addcov2 = NULL,
                  intcov1 = NULL,
                  intcov2 = NULL,
                  method = "all",
                  penalty = "both")
@ 

The output of the \texttt{CMSTtests} function is composed of a list with 17 elements. It returns the names of the phenotypes and number of individuals (\texttt{n.ind}):

<<>>=
out1[1:3]
@ 

\noindent The log-likelihood scores (\texttt{loglik}) of models $M_1$, $M_2$, $M_3$, and $M_4$ (see Chaibub Neto et al. 2012 for details):

<<>>=
out1[4]
@ 

\noindent The dimensions of the models (\texttt{model.dim}):

<<>>=
out1[5]
@ 

\noindent The $R^2$ values (\texttt{R2}) relative to the regression of phenotypes 1 and 2 on the causal anchor:

<<>>=
out1[6]
@ 

\noindent The covariance matrix (\texttt{S.hat}) with the variances and covariances of the penalized log-likelihood ratios of models $M_1 \times M_2$, $M_1 \times M_3$, $M_1 \times M_4$, $M_2 \times M_3$, $M_2 \times M_4$, and $M_3 \times M_4$:

<<>>=
out1[7]
@ 

\noindent The BIC scores (\texttt{BICs}):

<<>>=
out1[8]
@ 

\noindent The BIC-based penalized log-likelihood test statistics (\texttt{Z.bic}):

<<>>=
out1[9]
@ 

\noindent The BIC-based model selection p-values for the parametric CMST (\texttt{pvals.p.BIC}), non-parametric CMST (\texttt{pvals.np.BIC}) and joint parametric CMST (\texttt{pvals.j.BIC}):

%The small values of the first p-values across all 3 CMST test versions, suggests that model $M_1$ is significantly closer to the true model than models $M_2$, $M_3$ and $M_4$.

<<>>=
out1[10:12]
@ 

\noindent The analogous AIC-based quantities:

<<>>=
out1[13:17]
@ 

The function \texttt{CMSTtests} can also computes CMST tests of a single phenotype against a list of phenotypes. Its output is less detailed though. In this particular call we test $y_1$ against $y_2$ and $y_3$.

<<>>=
out2 <- CMSTtests(CMSTCross,
                  pheno1 = nms[1],
                  pheno2 = nms[-1],
                  Q.chr = 1,
                  Q.pos = 55.5,
                  addcov1 = NULL,
                  addcov2 = NULL,
                  intcov1 = NULL,
                  intcov2 = NULL,
                  method = "all",
                  penalty = "both")
out2
@ 

\section{Other Functions}

There are several other functions involved in simulation and in data analysis that are not well documented yet.
See \texttt{R/qtlyeast} available at \texttt{GITHUB} for further analysis.
Here we do scans for the three traits, and create a reduced object with only high LOD values.

<<>>=
CMSTscan <- scanone(CMSTCross, pheno.col = 1:3, method = "hk")
CMSThigh <- highlod(CMSTscan)
@ 

For our purposes, we place the three traits on chromosome 1 at some arbitrary positions, with trait \texttt{y1} having causal ``targets'' of the other two traits.

<<>>=
traits <- names(CMSTCross$pheno)
annot <- data.frame(name = traits, traits = traits, chr = rep(1, 3),
 Mb.pos = c(55,10,100))
annot$cM.pos <- annot$Mb.pos
annot
targets <- list(y1 = c("y2","y3"))
@ 

Now we used the scans (via \texttt{CMSThigh}) and the annotation to identify candidate regulators, the subset of cis-acting candidate regulators, and co-mapping targets.

<<>>=
cand.reg <- GetCandReg(CMSThigh, annot, traits)
cand.reg
cis.cand.reg <- GetCisCandReg(CMSThigh, cand.reg)
cis.cand.reg
comap.targets <- GetCoMappingTraits(CMSThigh, cand.reg)
comap.targets
@ 

Next, we perform tests to infer causal relationships.

<<>>=
tests <- list()
for(k in seq(names(comap.targets))) {
  tests[[k]] <- FitAllTests(CMSTCross, pheno1 = names(comap.targets)[k],
                      pheno2 = comap.targets[[k]],
                      Q.chr = cand.reg[k, 4],
                      Q.pos = cand.reg[k, 5])
}
names(tests) <- names(comap.targets)
tests <- JoinTestOutputs(comap.targets, tests)
tests
@ 

Finaly, we compare the inferred causal relationships to the known \texttt{targets} to assess precision, true positive rate and false positive rate.

<<>>=
PrecTpFpMatrix(alpha = seq(0.01, 0.10, by = 0.01),
  val.targets = targets, all.orfs = CMSThigh$names, tests = tests,
  cand.reg = cand.reg, cis.cand.reg = cis.cand.reg)
@

\section{References}

\begin{enumerate}

\item Brem R., L. Kruglyak, 2005 The landscape of genetic complexity across 5,700 gene expression trait in yeast. PNAS {\bf 102:} 1572-1577.

\item Broman K., H. Wu, S. Sen, G. A. Churchill, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics {\bf 19}: 889-890.

\item Chaibub Neto et al. (2012) Causal model selection hypothesis tests in systems genetics. Genetics (under review)

\item Churchill G. A., R. W. Doerge, 1994 Empirical threshold values for quantitative trait mapping. Genetics {\bf 138}: 963-971.

\item Haley C., S. Knott, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity {\bf 69}: 315-324.

\item Hughes T. R., M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, et al, 2000 Functional discovery via a compendium of expression profiles. Cell {\bf 102:} 109-116.

\item Manichaikul A., J. Dupuis, S. Sen, and K. W. Broman, 2006 Poor performance of bootstrap confidence intervals for the location of a quantitative trait locus. Genetics {\bf 174:} 481-489.

\item Schadt E. E., J. Lamb, X. Yang, J. Zhu, S. Edwards, et al., 2005 An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics {\bf 37}: 710-717.

\item Zhu J., B. Zhang, E. N. Smith, B. Drees, R. B. Brem, L. Kruglyak, R. E. Bumgarner, E. E. Schadt, 2008 Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics {\bf 40}: 854-861.

\end{enumerate}

\end{document}