\documentclass[12pt]{article} %\documentclass[epsf]{siamltex} \setlength{\topmargin}{0in} \setlength{\topskip}{0in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \setlength{\headheight}{0in} \setlength{\headsep}{0in} %\setlength{\footheight}{0in} %\setlength{\footskip}{0in} \setlength{\textheight}{9in} \setlength{\textwidth}{6.5in} \setlength{\baselineskip}{20pt} \setlength{\leftmargini}{1.5em} \setlength{\leftmarginii}{1.5em} \setlength{\leftmarginiii}{1.5em} \setlength{\leftmarginiv}{1.5em} \usepackage{epsfig,subfigure,lscape} \renewcommand\theequation{\thesection.\arabic{equation}} \def\hb{\hfil\break} \def\ib{$\bullet$} %\VignetteIndexEntry{networkBMA} \begin{document} \pagestyle{plain} %Goadrich&2004,KokDomingos2005,SinglaDomingos2005, \nocite{Brem&2002,BremKruglyak2005,Yeung&2005,Yeung&2011,Lo&2011,YEASTRACT,Hoeting&1999,BMApackage,iterativeBMApackage,networkBMApackage,SCPD,YPD,ArrayExpress} \begin{center} {\bf Uncovering gene regulatory relationships using {\tt networkBMA}}\\[5pt] Ka Yee Yeung, Chris Fraley, Adrian E. Raftery\\ Departments of Microbiology (KYY) and Statistics (CF and AER)\\ University of Washington \end{center} \bigskip This document illustrates the use of the {\tt networkBMA} R package (Fraley et~al. 2012) to uncover regulatory relationships in yeast ({\it Saccharomyces cerevisiae}) from microarray data measuring time-dependent gene-expression levels in 95 genotyped yeast segregants subjected to a drug (rapamycin) perturbation. \section{Data} The expression data for this vignette is provided in the {\tt networkBMA} package in the {\tt vignette} database, which consists of three R objects: \begin{itemize} \item[]{\tt timeSeries}: A 582 by 102 data frame in which the first two columns are factors identifying the replicate and time (in minutes) after drug perturbation, and the remaining 100 columns are the expression measurements for a subset of 100 genes from the yeast-rapamycin experiment described in Yeung et~al. (2011). There are 582/6 = 97 replicates (the 95 segregants plus two parental strains of the segregants), each with measurements at 6 time points. The complete time series data is available from Array Express (Parkinson et~al. 2007) with accession number E-MTAB-412 \\ (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-412). \item[] {\tt reg.known}: A 18 by 3 data frame giving known regulatory relationships among this subset of 100 genes. The first two columns give the regulator and target gene, respectively, while the third encodes the source of the regulatory information ({\tt `YPD'} for Yeast Proteome Database (Hodges et~al. 1999) and {\tt `SCPD'} for The Promoter Database of \emph{Saccharomyces cerevisiae} (Zhu and Zhang 1999). In this example, we constrained {\tt reg.known} to high-confidence experimental results obtained from biochemical (non-high-throughput) experiments. \item[] {\tt reg.prob}: A 100 by 100 matrix, giving probability estimates for regulatory relationships in which the $(i,j)$ entry gives the estimated probability that gene $i$ regulates gene $j$. These were computed using the supervised framework integrating multiple data sources of Lo et~al. (2011). \item[] {\tt referencePairs}: A 2-column data frame giving 287 regulator-gene pairs among the selected set of 100 genes reported from the literature. In this yeast example, the reference network was extracted from the documented evidence from the {\tt YEASTRACT} database (Teixeira et~al. 2006), which includes curated regulatory relationships from the literature inferred from high-throughput experiments. \item[] {\tt brem.data}: An 85 by 111 subset of the data used for network inference in yeast (Brem et~al. 2002, Brem and Kruglyak 2005). The rows correspond to genes and the columns to experiments. Provided courtesy of Rachel Brem. \end{itemize} <>= library(networkBMA) data(vignette) dim(timeSeries) dim(reg.prob) dim(brem.data) reg.known @ \section{Network Modeling} Given the yeast expression data from the Rapamycin experiments, the {\tt networkBMA} function can be invoked to estimate the probabilities of regulatory relationships using iterative Bayesian Model Averaging (Yeung et~al. 2005, 2011): <>= edges <- networkBMA(data = timeSeries[,-(1:2)], nTimePoints = length(unique(timeSeries$time)), prior.prob = reg.prob, known = reg.known, nvar = 50, ordering = "bic1+prior") edges[1:9,] @ For each gene $g$, the observed gene expression of genes at time $t-1$ serve as linear predictors for modeling the observed expression of gene $g$ at time $t$. BMA modeling results in a weighted average of models consisting of relatively small numbers of predictors. The probability of gene $h$ being a linear predictor in the model for gene $g$ is taken as the probability that gene $h$ regulates gene $g$ in the network. There are options for including known regulatory relationships and prior priobabilities in the modeling (see Lo et~al. 2011), as well as for ordering the variables, and for specifying the number of ordered variables to be included in the modeling. \section{Assessment of Network Models} Although, except for synthetic data, the true underlying network is unknown, the results can be assessed using a set of regulator-target gene network edges reported in the literature. The comparison is done as follows: \begin{itemize} \item Let $E$ be the set of regulator-target gene edges resulting from {\tt networkBMA}, possibly reduced using a probability threshold. In the context of the example in Section 2, $E$ corresponds to the set of edges represented in the object {\tt edges}. \item Let $K$ be the set of known regulator-target gene edges hardcoded in the modeling. In the example in Section 2, $K$ corresponds to {\tt reg.known}. \item Let $L$ be the set regulator-target gene edges reported in the literature. In the example in Section 2, $L$ corresponds to {\tt referencePairs}. \item Let $E \backslash K$ and $L \backslash K$ be the set of pairs in $E$ and $L$, respectively, with any hardcoded edges removed. In the example of Section 2, $E$ represented by {\tt edges} contains 483 pairs, and $L$ represented by {\tt referencePairs} contains 287 pairs. Both $E$ and $L$ include all 18 of the known hardcoded edges $K$ represented by {\tt reg.known}. Hence $E \backslash K$ contains 465 pairs, and $L \backslash K$ contains 269 pairs. \item Let $U$ be the set of all directed pairs $r$-$g$ such that $r$ is a regulator in $L \backslash K$ and $g$ is a target gene in $L \backslash K$. For the example of Section 2, $L \backslash K$ has 11 unique regulator genes and 99 unique target genes. So there are 11 $\times$ 99 or 1089 pairs in $U$. Assume further that the linked pairs in $U$ are precisely those pairs in $L \backslash K$, and that all other pairs are unlinked. \item Let $U \backslash K$ be the set of pairs in $U$ with any hardcoded eges removed (hardcoded edges may resurface in the unlinked pairs). For the example of Section 2, 17 of the 18 pairs in $K$ occur in $U$, so there are 1089 - 17 = 1072 edges in $U \backslash K$. \end{itemize} The assessment is done using the contingency table of $(E \backslash K) \cap (U \backslash K)$ relative to $U \backslash K$. For the example of Section 2, the assessment would be done with the 57 of the 465 pairs in $E \backslash K$ that also belong to $U \backslash K$. A function called {\tt contabs.netBMA} is provided to produce contingency tables from a reference network according the procedure described above. Here we compare the edges produced in Section 2 by {\tt networkBMA} modeling on the yeast data with the reference network {\tt referencePairs} made up of results reported in the literature: <>= ctables <- contabs.netwBMA( edges, referencePairs, reg.known, thresh=c(.5,.75,.9)) ctables @ Another function called `{\tt contabs}' is provided for computing contingency tables when the true underlying network is known. The {\tt scores} function can be used to obtain common assessment statistics from the contingency tables, including sensitivity, specificity, precision, recall, and false discovery rate among other measures. <>= scores( ctables, what = c("FDR", "precision", "recall")) @ Areas under the ROC and Precision-Recall curves covered by contingency tables can also be estimated using functions {\tt roc} and {\tt prc}, with the option to plot the associated curves. %Precision-Recall curves, %orginally developed in the context of information retrieval, %have been considered a better assessment measure than ROC curves when %there is considerable disparity in the sizes of the populations in question %(Goadrich et~al. 2004, Kok and Domingos 2004, Singla and Domingos 2005). %Gene regulatory networks are sparse, so the number of gene pairs that are not %edges in the network is much greater than those that are. The following gives the ROC and Precision-Recall curvers associated with the default contingency tables, in which the threshholds are all values for posterior probabilities that appear in {\tt edges}. <>= roc( contabs.netwBMA( edges, referencePairs), plotit = TRUE) title("ROC") prc( contabs.netwBMA( edges, referencePairs), plotit = TRUE) title("Precision-Recall") @ The resulting plots are shown in Figure \ref{fig:rocANDprc}. \begin{figure} \begin{center} \begin{tabular}{ccc} \includegraphics[width=.45\textwidth]{roc.pdf}&& \includegraphics[width=.45\textwidth]{prc.pdf} \end{tabular} \end{center} \caption{ROC and Precision-Recall curve sectors for a {\tt networkBMA} model of the yeast-rapamycin test data. The black lines delineate the estimated curves. The vertical red lines delineate the range of horizontal values covered by the contingency tables. The dotted black lines are linear interpolants outside this range. The diagonal blue line on the ROC plot indicates the line betwween (0,0) and (1,1). } \label{fig:rocANDprc} \end{figure} The output components are as follows: \begin{itemize} \item {\tt area}: The estimated area under the curve for the horizontal sector ranging from 0 to 1. This should be used with caution when the sector in which the data falls is small. \item {\tt sector}: The estimated area under the horizontal sector covered by the contingency tables. \item {\tt width}: The width of the horizontal sector covered by the contingency tables. \end{itemize} \section{Linear Modeling for Static Gene Expression Data} {\tt networkBMA} relies on sparse linear modeling via iterative Bayesian model averaging (BMA). BMA addresses uncertainty in model selection, and builds a weighted--average model from plausible models. The resulting model has better overall predictive ability than constituent models, and tends to use few variables from among a larger set. BMA has been iteratively extended to data with more variables that observations (Yeung at~al. 2005, 2009, 2011). The {\tt networkBMA} package includes a function, {\tt iterateBMAlm}, for linear modeling via iterative BMA. We illustrate its use on a static gene expression dataset (without any time points), {\tt brem.data}, to infer the regulators of a particular gene by regressing it on the expression levels of the other genes. Function {\tt iterateBMAlm} can be applied to each gene so as to infer all edges in the network. For one gene, the procedure is as follows: <>= gene <- "YNL037C" variables <- which(rownames(brem.data) != gene) control <- iBMAcontrolLM(OR = 50, nbest = 20, thresProbne0 = 5) iBMAmodel.YNL037C <- iterateBMAlm( x = t(brem.data[variables,]), y = unlist(brem.data[gene,]), control = control) @ Function {\tt iBMAcontrolLM} facilitates input of BMA control parameters, including {\tt nbest} for specifying the number of best models of each size to be initially retained, {\tt OR} for defining the width of `Occam's window' for model exclusion, and {\tt thresProbne0} for determining the cutoff for probability (in percent) of a variable being included in the modeling (Raftery et~al. 2005). See the R help documentation for {\tt iBMAcontrolLM} for a detailed description of these parameters, and Hoeting et~al. (1999) for a tutorial on the underlying BMA paradigm. The estimated posterior probabilities (in percent) for genes that regulate {\tt YBL103C} can be seen as follows: <>= iBMAmodel.YNL037C$probne0[iBMAmodel.YNL037C$probne0 > 0] @ \bibliographystyle{plain} \bibliography{networkBMA} \end{document} For Precision-Recall, the filled dots correspond to the distinct ({\tt recall}, {\tt precision}) pairs derived from the contingency tables.