% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{FABIA: Manual for the R package} %\VignetteDepends{fabia} %\VignettePackage{fabia} %\VignetteKeywords{fabia, fabiaData, biclustering, factor analysis, sparse coding, non-negative matrix factorization, latent variables, Laplace distribution, EM algorithm, multivariate analysis} %setwd("c:/sepp/work/fabia/fabia_210/fabia/inst/doc") %Sweave("fabia.Rnw") %tools::texi2dvi("fabia.tex",pdf=TRUE) %Stangle("fabia.Rnw") %source("fabia.R") \documentclass[article]{bioinf} \usepackage{amsmath,amssymb} \usepackage{bm} \usepackage{natbib} \usepackage{hyperref} \usepackage{fabia_defs} \hypersetup{colorlinks=false, pdfborder=0 0 0, pdftitle={FABIA: Factor Analysis for Bicluster Acquisition - Manual for the R package}, pdfauthor={Sepp Hochreiter}} \title{FABIA: Factor Analysis for Bicluster Acquisition\\ \textit{--- Manual for the R package ---}} \author{Sepp Hochreiter} \affiliation{Institute of Bioinformatics, Johannes Kepler University Linz\\Altenberger Str. 69, 4040 Linz, Austria\\ \email{hochreit@bioinf.jku.at}} \usepackage[noae]{Sweave} \SweaveOpts{eps=FALSE} \begin{document} <<echo=FALSE,results=hide>>= options(width=75) set.seed(0) library(fabia) fabiaVer<-packageDescription("fabia")$Version @ %$ \newcommand{\fabiaVer}{\Sexpr{fabiaVer}} \manualtitlepage[Version \fabiaVer, \today] \newlength{\auxparskip} \setlength{\auxparskip}{\parskip} \setlength{\parskip}{0pt} \tableofcontents \clearpage \setlength{\parskip}{\auxparskip} \section{Introduction} The \Rpackage{fabia} package is part of the Bioconductor (\href{http://www.bioconductor.org}{http://www.bioconductor.org}) project. The package allows to extract biclusters from data sets based on a generative model according to the FABIA method \citep{Hochreiter:10}. It has been designed especially for microarray data sets, but can be used for other kinds of data sets as well. Please visit for additional information the FABIA homepage\linebreak \href{http://www.bioinf.jku.at/software/fabia/fabia.html}{http://www.bioinf.jku.at/software/fabia/fabia.html}. \section{Getting Started: FABIA} \label{sec:started} First load the \Rpackage{fabia} package: \begin{Sinput} R> library(fabia) \end{Sinput} The \Rpackage{fabia} package imports the package \Rpackage{methods} and the function \Rfunction{plot} from the package \Rpackage{graphics}. \subsection{Quick start : without outputs and plots} Assume your data is in the file \texttt{datafile.csv} in a matrix like format then you can try out the following steps to extract biclusters. \noindent Examples with outputs and plot can be found at \href{http://www.bioinf.jku.at/software/fabia/fabia.html}{http://www.bioinf.jku.at/software/fabia/fabia.html}. \begin{enumerate} \item Create a working directory, e.g. \texttt{c:/fabia/data} in Windows or \texttt{/home/myself/fabia/data} in Unix. Move the data file \texttt{datafile.csv} to that directory, e.g. under Unix \begin{quote} \texttt{cp datafile.csv /home/myself/fabia/data/} \end{quote} or drag the file \texttt{datafile.csv} into that directory under Windows. \item Start \R and change to the working directory. Under Windows \begin{Sinput} R> setwd("c:/fabia/data") \end{Sinput} and under Unix \begin{Sinput} R> setwd("/home/myself/fabia/data") \end{Sinput} You can also start \R in that directory under Unix. \item Load the library: \begin{Sinput} R> library(fabia) \end{Sinput} \item Read the data file ``datafile.csv'': \begin{Sinput} R> X <- read.table("datafile.csv",header = TRUE, sep = ",") \end{Sinput} \item Select the model based on the data: 5 biclusters; sparseness 0.01; 500 cycles \begin{Sinput} R> res <- fabia(X,5,0.01,500) \end{Sinput} \item Give summary: \begin{Sinput} R> summary(res) \end{Sinput} \item Plot some statistics: \begin{Sinput} R> show(res) \end{Sinput} \item Plot the factorization results: \begin{Sinput} R> extractPlot(res,ti="FABIA") \end{Sinput} \item Plot the result as a biplot: \begin{Sinput} R> plot(res) \end{Sinput} \item Extract biclusters: \begin{Sinput} R> rb <- extractBic(res) \end{Sinput} \item Show information content of the biclusters: \begin{Sinput} R> res@avini \end{Sinput} \item List bicluster 1: \begin{Sinput} R> rb$bic[1,] \end{Sinput} \item List bicluster 2: \begin{Sinput} R> rb$bic[2,] \end{Sinput} \item Show bicluster 3: \begin{Sinput} R> rb$bic[3,] \end{Sinput} \item List bicluster 4: \begin{Sinput} R> rb$bic[4,] \end{Sinput} \item List bicluster 5: \begin{Sinput} R> rb$bic[5,] \end{Sinput} \item Plot bicluster 1: \begin{Sinput} R> plotBicluster(rb,1) \end{Sinput} \item Plot bicluster 2: \begin{Sinput} R> plotBicluster(rb,2) \end{Sinput} \item Plot bicluster 3: \begin{Sinput} R> plotBicluster(rb,3) \end{Sinput} \item Plot bicluster 4: \begin{Sinput} R> plotBicluster(rb,4) \end{Sinput} \item Plot bicluster 5: \begin{Sinput} R> plotBicluster(rb,5) \end{Sinput} \item List opposite bicluster 1: \begin{Sinput} R> rb$bicopp[1,] \end{Sinput} \item Plot opposite bicluster 1: \begin{Sinput} R> plotBicluster(rb,1,opp=TRUE) \end{Sinput} \end{enumerate} \subsection{Test on Toy Data Set} In the following, we describe how you can test the package \Rpackage{fabia} on a toy data set that is generated on-line. \begin{enumerate} \item generate bicluster data, where biclusters are in block format in order to obtain a better visualization of the results. 1000 observations, 100 samples, 10 biclusters: \begin{Sinput} R> n <- 1000 R> l <- 100 R> p <- 10 R> dat <- makeFabiaDataBlocks(n = n,l= l,p = p,f1 = 5, f2 = 5,of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2, mean_z = 2.0,sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) \end{Sinput} \item store the generated data in variables and annotate the data according to which biclusters the samples and genes belong: \begin{Sinput} R> X <- dat[[1]] R> Y <- dat[[2]] R> ZC <- dat[[3]] R> LC <- dat[[4]] R> gclab <- rep.int(0,l) R> gllab <- rep.int(0,n) R> clab <- as.character(1:l) R> llab <- as.character(1:n) R> for (i in 1:p){ + for (j in ZC[i]){ + clab[j] <- paste(as.character(i),"_",clab[j],sep="") + } + for (j in LC[i]){ + llab[j] <- paste(as.character(i),"_",llab[j],sep="") + } + gclab[unlist(ZC[i])] <- gclab[unlist(ZC[i])] + p^i + gllab[unlist(LC[i])] <- gllab[unlist(LC[i])] + p^i + } groups <- gclab \end{Sinput} \item {\bf \large FABIA:} perform \Rfunction{fabia} (sparseness by Laplace prior) to extract biclusters; 13 biclusters, sparseness 0.01 (Laplace), 400 cycles: \begin{Sinput} R> resToy1 <- fabia(X,13,0.01,400) \end{Sinput} \item Plot some results: \begin{Sinput} R> extractPlot(resToy1,ti="FABIA",Y=Y) \end{Sinput} \item Extract the biclusters: \begin{Sinput} R> raToy1 <- extractBic(resToy1) \end{Sinput} \item Plot bicluster 1: \begin{Sinput} R> if ((raToy1$bic[[1]][1]>1) && (raToy1$bic[[1]][2])>1) { + plotBicluster(raToy1,1) + } \end{Sinput} \item Plot bicluster 2: \begin{Sinput} R> if ((raToy1$bic[[2]][1]>1) && (raToy1$bic[[2]][2])>1) { + plotBicluster(raToy1,2) + } \end{Sinput} \item Plot bicluster 3: \begin{Sinput} R> if ((raToy1$bic[[3]][1]>1) && (raToy1$bic[[3]][2])>1) { + plotBicluster(raToy1,3) + } \end{Sinput} \item Plot bicluster 4: \begin{Sinput} R> if ((raToy1$bic[[4]][1]>1) && (raToy1$bic[[4]][2])>1) { + plotBicluster(raToy1,4) + } \end{Sinput} \item Prepare biplots: \begin{Sinput} R> colnames(resToy1@X) <- clab R> rownames(resToy1@X) <- llab \end{Sinput} \item Biplot of bicluster 1 and 2: \begin{Sinput} R> plot(resToy1,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \item Biplot of bicluster 1 and 3: \begin{Sinput} R> plot(resToy1,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \item Biplot of bicluster 2 and 3: \begin{Sinput} R> plot(resToy1,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \item {\bf \large FABIAS:} perform \Rfunction{fabias} (sparseness by projection) to extract biclusters; 13 biclusters, sparseness 0.6 (projection), 400 cycles: \begin{Sinput} R> resToy2 <- fabias(X,13,0.6,400) \end{Sinput} \item Plot some results: \begin{Sinput} R> extractPlot(resToy2,ti="FABIAS",Y=Y) \end{Sinput} \item Extract the biclusters: \begin{Sinput} R> raToy2 <- extractBic(resToy2) \end{Sinput} \item Plot bicluster 1: \begin{Sinput} R> if ((raToy2$bic[[1]][1]>1) && (raToy2$bic[[1]][2])>1) { + plotBicluster(raToy2,1) + } \end{Sinput} \item Plot bicluster 2: \begin{Sinput} R> if ((raToy2$bic[[2]][1]>1) && (raToy2$bic[[2]][2])>1) { + plotBicluster(raToy2,2) + } \end{Sinput} \item Plot bicluster 3: \begin{Sinput} R> if ((raToy2$bic[[3]][1]>1) && (raToy2$bic[[3]][2])>1) { + plotBicluster(raToy2,3) + } \end{Sinput} \item Plot bicluster 4: \begin{Sinput} R> if ((raToy2$bic[[4]][1]>1) && (raToy2$bic[[4]][2])>1) { + plotBicluster(raToy2,4) + } \end{Sinput} \item Prepare biplots: \begin{Sinput} R> colnames(resToy2@X) <- clab R> rownames(resToy2@X) <- llab \end{Sinput} \item Biplot of bicluster 1 and 2: \begin{Sinput} R> plot(resToy2,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \item Biplot of bicluster 1 and 3: \begin{Sinput} R> plot(resToy2,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \item Biplot of bicluster 2 and 3: \begin{Sinput} R> plot(resToy2,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \end{enumerate} \subsection{Demos} The package \Rpackage{fabia} has some demos which can be demonstrated by \Rfunction{fabiaDemo}. \begin{enumerate} \item demo1: toy data. \begin{Sinput} R> fabiaDemo() \end{Sinput} Choose ``1'' and you get above toy data demonstration. \item demo2: Microarray data set of \citep{Veer:02} on breast cancer. \begin{Sinput} R> fabiaDemo() \end{Sinput} Choose ``2'' to extract subclasses in the data set of van't Veer as biclusters. \item demo3: Microarray data set of \citep{Su:02} on different mammalian. \begin{Sinput} R> fabiaDemo() \end{Sinput} Choose ``3'' to check whether the different mouse and human tissue types can be extracted. \item demo4: Microarray data set of \citep{Rosenwald:02} diffuse large-B-cell lymphoma. \citep{Hoshida:07} divided the data set into three classes \begin{itemize} \item{OxPhos:} oxidative phosphorylation \item{BCR:} B-cell response \item{HR:} host response \end{itemize} \begin{Sinput} R> fabiaDemo() \end{Sinput} Choose ``4'' to check whether the different classes can be extracted. \end{enumerate} \newpage \section{The FABIA Model}\label{sec:model} For a detailed model description see the FABIA model \citep{Hochreiter:10}. \subsection{Model Assumptions} We define a {\em bicluster} as a pair of a row (gene) set and a column (sample) set for which the rows are similar to each other on the columns and vice versa. In a multiplicative model, two vectors are similar if one is a multiple of the other, that is the angle between them is zero or as realization of random variables their correlation coefficient is one. \begin{figure}[!t] \centerline{\includegraphics[width=0.9\textwidth]{./figures/vectors_1}} \caption{The outer product $\la \ \z^T$ of two sparse vectors results in a matrix with a bicluster. Note, that the non-zero entries in the vectors are adjacent to each other for visualization purposes only.}\label{fig:vectors} \end{figure} It is clear that such a linear dependency on subsets of rows and columns can be represented as an outer product $\la \ \z^T$ of two vectors $\la$ and $\z$. The vector $\la$ corresponds to a {\em prototype column vector} that contains zeros for genes not participating in the bicluster, whereas $\z$ is a vector of {\em factors} with which the prototype column vector is scaled for each sample; clearly $\z$ contains zeros for samples not participating in the bicluster. Vectors containing many zeros or values close to zero are called {\em sparse vectors}. Fig.\ \ref{fig:vectors} visualizes this representation by sparse vectors schematically. The overall model for $p$ biclusters and additive noise is \begin{align} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ = \ \La \ \Z \ + \ \Up \ ,\label{eq:expgen} \end{align} where $\Up \in \Rb^{n \times l}$ is additive noise and $\la_i \in \Rb^n$ and $\z_i \in \Rb^l$ are the sparse prototype vector and the sparse vector of factors of the $i$-th bicluster, respectively. The second formulation above holds if $\La \in \Rb^{n \times p}$ is the sparse prototype matrix containing the prototype vectors $\la_i$ as columns and $\Z \in \Rb^{p \times l}$ is the sparse factor matrix containing the transposed factors $\z_i^T$ as rows. Note that Eq.~\eqref{eq:expgen} formulates biclustering as sparse matrix factorization. According to Eq.~\eqref{eq:expgen}, the $j$-th sample $\x_j$, i.e., the $j$-th column of $\X$, is \begin{align} \x_j\ = \ \sum_{i=1}^{p} \la_i \ z_{ij} \ + \ \epp_j \ = \ \La \ \tilde\z_j\ + \ \epp_j \ ,\label{eq:expgen2} \end{align} where $\epp_j$ is the $j$-th column of the noise matrix $\Up$ and $\tilde\z_j=(z_{1j},\dots,z_{pj})^T$ denotes the $j$-th column of the matrix $\Z$. Recall that $\z_i^T=(z_{i1},\dots,z_{il})$ is the vector of values that constitutes the $i$-th bicluster (one value per sample), while $\tilde\z_j$ is the vector of values that contribute to the $j$-th sample (one value per bicluster). The formulation in Eq.~\eqref{eq:expgen2} facilitates a generative interpretation by a factor analysis model with $p$ factors \begin{align} \x \ = \ \sum_{i=1}^{p} \la_i \ \tilde z_i \ + \ \epp \ = \ \La \ \tilde\z \ + \ \epp \ ,\label{eq:factormodel} \end{align} where $\x$ are the observations, $\La$ is the loading matrix, $\tilde z_i$ is the value of the \mbox{$i$-th} factor, $\tilde\z=(\tilde z_1,\ldots,\tilde z_p)^T$ is the vector of factors, and $\epp\in \Rb^{n}$ is the additive noise. Standard factor analysis assumes that the noise is independent of $\tilde\z$, that $\tilde\z$ is $\Ncal(\0,\I)$-distributed, and that $\epp$ is $\Ncal(\0,\pp)$-distributed, where the covariance matrix $\pp \in \Rb^{n \times n}$ is a diagonal matrix expressing independent Gaussian noise. The parameter $\La$ explains the dependent (common) and $\pp$ the independent variance in the observations $\x$. Normality of the additive noise in gene expression is justified by the findings in \citep{Hochreiter:06}. The unity matrix as covariance matrix for $\tilde\z$ may be violated by overlapping biclusters, however we want to avoid to divide a real bicluster into two factors. Thus, we prefer uncorrelated factors over more sparseness. The factors can be decorrelated by setting $\hat{\z} := \A^{-1}\ \tilde\z$ and $\hat{\La} := \La \ \A$ with the symmetric invertible matrix $\A^2=\E\big(\tilde\z \ \tilde\z^T\big)$: \begin{align*} & \La \ \z \ = \ \La \ \A \ \A^{-1} \ \z \ \ = \ \hat{\La} \ \hat{\z} \ \ \ \ \text{~and}\\ & \E\big(\hat{\z} \ \hat{\z}^T\big) \ = \ \A^{-1}\ \E\big(\tilde\z \ \tilde\z^T\big) \ \A^{-1} \ = \ \A^{-1} \A^2 \A^{-1} \ = \I \ . \end{align*} Standard factor analysis does not consider sparse factors and sparse loadings which are essential in our formulation to represent biclusters. Sparseness is obtained by a component-wise independent {\em Laplace} distribution \citep{Hyvarinen:99a}, which is now used as a prior on the factors $\tilde\z$ instead of the Gaussian: \begin{align*} p(\tilde\z) \ = \ \left( {\textstyle\frac{1}{\sqrt{2}}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |\tilde z_i|} \end{align*} Sparse loadings $\la_i$ and, therefore sparse $\La$, are achieved by two alternative strategies. In the first model, called \textsf{FABIA}, we assume a component-wise independent {\em Laplace} prior for the loadings (like for the factors): \begin{align} p(\la_i) \ = \ \left( {\textstyle\frac{1}{\sqrt{2}}} \right)^n \prod_{k=1}^{n} e^{-\ \sqrt{2} \ |\lambda_{ki}|}\label{eq:laplaceprior} \end{align} The \textsf{FABIA} model contains the product of Laplacian variables which is distributed proportionally to the $0$-th order modified Bessel function of the second kind \citep{Bithas:07}. For large values, this Bessel function is a negative exponential function of the square root of the random variable. Therefore, the tails of the distribution are heavier than those of the Laplace distribution. The Gaussian noise, however, reduces the heaviness of the tails such that the heaviness is between Gaussian and Bessel function tails --- about as heavy as the tails of the Laplacian distribution. These {\em heavy tails} are exactly the desired model characteristics. The second model, called \textsf{FABIAS}, applies a prior with parameter spL on the loadings that has only support at sparse regions. Following \citep{Hoyer:04}, we define sparseness as \begin{align*} \mathrm{sp}(\la_i) \ = \ \frac{\sqrt{n} \ - \ \sum_{k=1}^{n} \left|\lambda_{ki} \right|\ / \ \sum_{k=1}^{n} \lambda_{ki}^2}{\sqrt{n} \ - \ 1} \end{align*} leading to the prior \begin{align} p(\la_i) \ = \ \begin{cases} c & \text{for~~~~} \mathrm{sp}(\la_i) \ \leq \ \mathrm{spL} \\ 0 & \text{for~~~~} \mathrm{sp}(\la_i) \ > \ \mathrm{spL} \end{cases} \ . \label{eq:sparseness} \end{align} \subsection{Sparse Coding and Laplace Prior} Laplace prior enforces sparse codes on the factors. \emph{Sparse coding} is the representation of items by the \emph{strong activation} of a relatively \emph{small set} of hidden factors while the factors are \emph{almost constant} if not activated. Laplace prior is suited for modeling strong activation for few samples while being otherwise almost constant. Fig.\ \ref{fig:lap1} left shows the Laplacian mode compared to a Gaussian mode. The Laplacian mode is higher and narrower. Fig.\ \ref{fig:lap1} right shows the tails of Gaussian and Laplace distribution, where the latter has higher values. This means for the Laplace distribution large values are more likely than for a Gaussian. \begin{figure} \begin{center} \includegraphics[width=0.45\textwidth]{./figures/fabia-lap1} \includegraphics[width=0.45\textwidth]{./figures/fabia-lap2} \end{center} \caption{Left: The mode of a Laplace (red, solid) vs. a Gaussian (dashed, blue) distribution. Right: The tails of a Laplace (red, solid) vs. a Gaussian (dashed, blue) distribution} \label{fig:lap1} \end{figure} % \begin{figure} % \begin{center} % <<label=lap1,echo=F,fig=TRUE>>= % t<-0.2 % s<-1 % curve(1/(sqrt(2)*t)*exp(-sqrt(2)/t*abs(x)),from=-6,to=6,ylab="",col="red",lty=1,lwd=3) % curve(1/(sqrt(2*pi)*s)*exp(-0.5*1/s^2*x^2),add=TRUE,col="blue",lty=2,lwd=3) % @ % \end{center} % \caption{Mode: Laplace (red, solid) vs. Gaussian (dashed, blue)} % \label{fig:lap1} % \end{figure} % Fig.\ \ref{fig:lap2} shows the tails % where it can be seen that Laplace has higher likelihood % for large values than Gaussian. % \begin{figure} % \begin{center} % <<label=lap2,echo=F,fig=TRUE>>= % t<-0.2 % s<-1 % curve(1/(sqrt(2)*t)*exp(-sqrt(2)/t*abs(x)),from=14.2,to=16,ylab="",col="red",lty=1,lwd=3) % curve(1/(sqrt(2*pi)*s)*exp(-0.5*1/s^2*x^2),add=TRUE,col="blue",lty=2,lwd=3) % @ % \end{center} % \caption{Tails: Laplace (red, solid) vs. Gaussian (dashed, blue)} % \label{fig:lap2} % \end{figure} \subsection{Model Selection} \label{sec:selection} The free parameters $\La$ and $\pp$ can be estimated by Expectation-Maximization (EM; \citealp{Dempster:77}). With a prior probability on the loadings, the a posteriori of the parameters is maximized like in \citep{Hochreiter:06,Talloen:07}. \subsubsection{Variational Approach for Sparse Factors} Model selection is not straightforward because the likelihood \begin{align*} p( \x \mid \La, \pp ) \ = \ \int p( \x \mid \tilde\z, \La, \pp) \ p( \tilde\z) \ d\tilde\z \end{align*} cannot be computed analytically for a Laplacian prior $p(\tilde\z)$. We employ a variational approach according to \citet{Girolami:01} and \citet{Palmer:06} for model selection. They introduce a model family that is parametrized by $\xii$, where the maximum over models in this family is the true likelihood: \begin{align*} \underset{\xii}{\arg\max} \ p(\x|\xii) \ = \ \log p(\x) \ . \end{align*} Using an EM algorithm, not only the likelihood with respect to the parameters $\La$ and $\pp$ is maximized, but also with respect to $\xii$. In the following, $\La$ and $\pp$ denote the actual parameter estimates. According to \citet{Girolami:01} and \citet{Palmer:06}, we obtain \begin{align*} \E \big(\tilde\z_j \mid \x_j \big) \ = \ & \big( \La^T \ \pp^{-1} \ \La \ + \ \Xii_j^{-1} \big)^{-1} \ \La^T \ \pp^{-1} \ \x_j \ \ \mbox{~~and} \\ \E \big(\tilde\z_j \ \tilde\z_j^T \mid \x_j \big) \ = \ & \big( \La^T \ \pp^{-1} \ \La \ + \ \Xii_j^{-1} \big)^{-1} \ + \\ & \E \big(\tilde\z_j \mid \x_j \big) \ \E (\tilde\z_j \mid \x_j)^T \ , \end{align*} where $\Xii_j$ stands for $\mathrm{diag} \left(\xii_j \right)$. The update for $\xii_j$ is \begin{align*} \xii_{j} \ = \ \mathrm{diag}\left( \sqrt{\E (\tilde\z_j \ \tilde\z_j^T \mid \x_j)} \right) \ . \end{align*} \subsubsection{New Update Rules for Sparse Loadings} The update rules for \textsf{FABIA} (Laplace prior on loadings) are \begin{align} \La^{\nn} \ = \ \frac{\frac{1}{l} \ \sum_{j=1}^{l} \x_j \ \E (\tilde\z_j \mid \x_j)^T \ - \ \frac{\alpha}{l} \ \pp \ \sign (\La )}{\frac{1}{l} \sum_{j=1}^{l} \E (\tilde\z_j \ \tilde\z_j^T \mid \x_j )} \label{eq:update}\\ \mathrm{diag} \left(\pp^{\nn} \right) \ = \ \pp^{\mathrm{EM}} \ + \ \mathrm{diag} \Big(\frac{\alpha}{l} \ \pp \ \sign (\La ) ( \La^{\nn} )^T \Big) \nonumber \end{align} where \begin{align*} \pp^{\mathrm{EM}} \ = \ \mathrm{diag} \bigg( \frac{1}{l} \sum_{j=1}^l \x_j \x_j^T \ - \ \La^{\nn} \frac{1}{l} \sum_{j=1}^l \E \left( \tilde\z_j \mid \x_j \right) \ \x_{j}^{T} \bigg) . \end{align*} The update rules for \textsf{FABIAS} must take into account that each $\la_i$ from $\La$ has a prior with restricted support. Therefore the sparseness constraints $\mathrm{sp}(\la_i) \leq \mathrm{spL}$ from Eq.\ \eqref{eq:sparseness} hold. These constraints are ensured by a projection of $\la_i$ after each $\La$ update according to \citet{Hoyer:04}. The projection is a convex quadratic problem which minimizes the Euclidean distance to the original vector subject to the constraints. The projection problem can be solved fast by an iterative procedure where the $l_2$-norm of the vectors is fixed to 1. We update $\mathrm{diag} (\pp^{\nn}) = \pp^{\mathrm{EM}}$ and project each updated prototype vector to a sparse vector with sparseness spL giving the overall projection: \begin{align*} \La^{\mathrm{new}} \ = \ \mathrm{proj} \left(\frac{\frac{1}{l} \ \sum_{j=1}^{l} \x_j \ \E \left(\tilde\z_j \mid \x_j \right)^T }{\frac{1}{l} \sum_{j=1}^{l} \E (\tilde\z_j \ \tilde\z_j^T \mid \x_j )}, \mathrm{spL} \right) \end{align*} \subsubsection{Extremely Sparse Priors} Some gene expression data sets are sparser than Laplacian. For example, during estimating DNA copy numbers with Affymetrix SNP 6 arrays, we observed a kurtosis larger than 30 (FABIA results shown at \href{http://www.bioinf.jku.at/software/fabia/fabia.html}{http://www.bioinf.jku.at/software/fabia/fabia.html}). We want to adapt our model class to deal with such sparse data sets. Toward this end, we define extremely sparse priors both on the factors and the loadings utilizing the following (pseudo) distributions: \begin{quote} \vspace{-0.2cm} \begin{tabular}{ll} Generalized Gaussians: & $p(z) \ \propto \ \exp \left( - \ |z|^{\beta} \right)$ \\ Jeffrey's prior: & $p(z) \ \propto \ \exp \left( - \ \ln|z| \right) \ = \ 1/|z|$ \\ Improper prior: & $p(z) \ \propto \ \exp \left( |z|^{-\beta} \right)$ \end{tabular} \end{quote} \vspace{-0.2cm} For the first distribution, we assume $0<\beta\leq 1$ and for the third $0<\beta$. Note, the third distribution may only exist on the interval $[\epsilon,a]$ with $0<\epsilon<a$. We assume that $\epsilon$ is sufficiently small. For the {\em loadings}, we need the derivatives of the negative log-distributions for optimizing the log-posterior. These derivatives are proportional to $|z|^{-\mathrm{spl}}$, where $\mathrm{spl}=0$ corresponds to the Laplace prior and $\mathrm{spl}>0$ to sparser priors. The update rule is as in Eq.~\eqref{eq:update}, where $\sign(\La)$ is replaced by $|\La|^{-\mathrm{spl}} \ \sign (\La)$ with element-wise operations (absolute value, sign, exponentiation, multiplication). For the {\em factors}, we represent the priors through a convex variational form according to \citet{Palmer:06}. That is possible because $g(z)=-\ln p(\sqrt{z})$ is increasing and concave for $z>0$ (first order derivatives are larger and second order smaller than zero). According to \citet{Palmer:06}, the update for $\xii_j$ is \begin{align*} \xii_{j} \ \propto \ \mathrm{diag} \Big( \E \big(\tilde\z_j \ \tilde\z_j^T \mid \x_j \big)^{\mathrm{spz}} \Big) \end{align*} for all $\mathrm{spz}\geq 1/2$, where $\mathrm{spz}=1/2$ ($\beta=1$) represents the Laplace prior and $\mathrm{spz}>1/2$ leads to sparser priors. \subsection{Data Preprocessing and Initialization} The data should be centered to zero mean, zero median, or zero mode (see supplementary). If the correlation of weak signals is of interest too, we recommend to normalize the data. The iterative model selection procedure requires initialization of the parameters $\La$, $\pp$, and $\xii_{j}$. We initialize the variational parameter vectors $\xii_{j}$ by ones, $\La$ randomly, and $\pp = \mathrm{diag}(\max(\delta,\mathrm{covar}(\x) - \La \La^T))$. Alternatively $\La$ can be initialized by the result of a singular value decomposition (SVD). The user may supply the result of independent component analysis (ICA) as a sparse initialization for $\La$. ICA can also determine first $\Z$ from which $\La$ can be obtained as the least square solution to $\X \ = \ \La \ \Z$. \subsection{Information Content of Biclusters} \label{sec:info} A highly desired property for biclustering algorithms is the ability to rank the extracted biclusters analogously to principal components which are ranked according to the data variance they explain. We rank biclusters according to the information they contain about the data. The information content of $\tilde\z_j$ for the $j$-th observation $\x_j$ is the mutual information between $\tilde\z_j$ and $\x_j$: \begin{align*} \mathrm{I} (\x_j;\tilde\z_j) \ = \ \mathrm{H} (\tilde\z_j) \ - \ \mathrm{H} (\tilde\z_j \mid \x_j) \ = \frac{1}{2} \ \ln\big| \I_p \ + \ \Xii_j \ \La^T \ \pp^{-1} \ \La \big| \end{align*} The independence of $\x_j$ and $\tilde\z_j$ across $j$ gives \begin{align*} \mathrm{I} (\X;\Z) \ = \ \frac{1}{2} \ \sum_{j=1}^{l} \ln\big| \I_p \ + \ \Xii_j \ \La^T \ \pp^{-1} \ \La\big|\ . \end{align*} For the FARMS summarization algorithm ($p=1$ and $\Xii_j=1$), this information is the negative logarithm of the I/NI call \citep{Talloen:07}. To assess the information content of one factor, we consider the case that factor $\tilde z_i$ is removed from the final model. This corresponds to setting $\xi_{ij}=0$ (by $\xi_{ij}$, we denote the $i$-th entry in $\xii_j$) and therefore the explained covariance $\xi_{ji} \ \la_i \ \la_i^T$ is removed: \begin{align*} \x_j \mid (\tilde\z_j \setminus z_{ij}) \ \sim \ \Ncal\big( \La \ \tilde\z_j|_{z_{ij}=0} \ , \ \pp \ + \ \xi_{ij} \ \la_i \ \la_i^T \big) \end{align*} The information of $z_{ij}$ given the other factors is \begin{align*} \mathrm{I}(\x_j;z_{ij} \mid (\tilde\z_j \setminus z_{ij}) ) \ &= \ \mathrm{H}(z_{ij} \mid (\tilde\z_j \setminus z_{ij})) - \mathrm{H}(z_{ij} \mid (\tilde\z_j \setminus z_{ij}),\x_j ) \\ & = \ \frac{1}{2} \ \ln\big(1 \ + \ \xi_{ij} \ \la_i^T \pp^{-1} \la_i \big) \ . \end{align*} Again independence across $j$ gives \begin{align*} \mathrm{I} (\X;\z_i^T \mid (\Z \setminus \z_{i}^T) ) \ = \ \frac{1}{2} \ \sum_{j=1}^{l} \ \ln\left( 1 \ + \ \xi_{ij} \ \la_i^T \pp^{-1} \la_i \right) \ . \end{align*} This information content gives that part of information in $\x$ that $\z_i^T$ conveys across all examples. Note that also the number of nonzero $\la_i$'s (size of the bicluster) enters the information content. \subsection{Extracting Members of Biclusters} \label{sec:extract} After model selection in Section \ref{sec:selection} and ranking of bicluster in Section \ref{sec:info}, the $i$-th bicluster has soft gene memberships given by the absolute values of $\la_i$ and soft sample memberships given by the absolute values of $\z_i^T$. However, applications may need hard memberships. We determine the members of bicluster $i$ by selecting absolute values $\lambda_{ki}$ and $z_{ij}$ above thresholds $\mathrm{thresL}$ and $\mathrm{thresZ}$, respectively. First, the second moment of each factor is normalized to 1 resulting in a factor matrix $\hat{\Z}$ (in accordance with $\E(\tilde\z\tilde\z^T)=\I$). Consequently, $\La$ is rescaled to $\hat{\La}$ such that $\La \Z = \hat{\La} \hat{\Z}$. Now the threshold $\mathrm{thresZ}$ can be chosen to determine which percentage of samples will on average belong to a bicluster. For a Laplace prior, this percentage can be computed by $\frac{1}{2} \exp(-\sqrt{2}/\mathrm{thresZ})$. In the default setting, for each factor $\hat\z_i$, only one bicluster is extracted. In gene expression, an expression pattern is either absent or present but not negatively present. Therefore, the $i$-th bicluster is either determined by the positive or negative values of $\hat z_{ij}$. Which one of these two possibilities is chosen is decided by whether the sum over $\big| \hat z_{ij} \big| > \mathrm{thresZ}$ is larger for the positive or negative $\hat z_{ij}$. The threshold $\mathrm{thresL}$ for the loadings is more difficult to determine, because normalization would lead to a rescaling of the already normalized factors. Since biclusters may overlap, the contribution of $\lambda_{ki} z_{ij}$ that are relevant must be estimated. Therefore, we first estimate the standard deviation of $\La\Z$ by \begin{align*} \mathrm{sdLZ} \ = \ \sqrt{\frac{1}{p \ l \ n} \sum_{(i,j,k)=(1,1,1)}^{(p,l,n)} \big(\hat{\lambda}_{ki} \ \hat{z}_{ij}\big)^2} \ . \end{align*} We set this standard deviation to the product of both thresholds which is solved for $\mathrm{thresL}$: $\mathrm{thresL}=\mathrm{sdLZ}\ /\ \mathrm{thresZ}$. However, an optimal $\mathrm{thresL}$ depends on the sparseness parameters and on the characteristics of the biclustering problem. \subsection{C implementation of FABIA} The functions \Rfunction{fabia}, \Rfunction{fabias}, and \Rfunction{spfabia} are implemented in C. It turned out that these implementations are not only faster, but also more precise. Especially we use an efficient Cholesky decomposition to compute the inverse of positive definite matrices. Some \R functions for computing the inverse like \Rfunction{solve} were inferior to that implementation. \appendix \section{Class Factorization} \texttt{Factorization} is the class structure for results of matrix factorization. Especially it is designed for factor analysis used for biclustering. The \texttt{summary} method shows information about biclusters. The \texttt{show} method plots information about biclusters. The \texttt{showSelected} method plots selected information about biclusters. The \texttt{plot} method produces biplots of biclusters. Objects can be created by \texttt{fabia}, \texttt{fabias}, \texttt{fabiap}, \texttt{fabiasp}, \texttt{mfsc}, \texttt{nmfsc}, \texttt{nmfdiv}, and \texttt{nmfeu}. Objects of class \texttt{Factorization} have the following slots: \begin{enumerate} \item{\texttt{parameters}:} Saves parameters of the factorization method in a list:\\ \noindent ``method'',\\ \noindent ``number of cycles'',\\ \noindent ``sparseness weight'',\\ \noindent ``sparseness prior for loadings'',\\ \noindent ``sparseness prior for factors'',\\ \noindent ``number biclusters'',\\ \noindent ``projection sparseness loadings'',\\ \noindent ``projection sparseness factors'',\\ \noindent ``initialization range'',\\ \noindent ``are loadings rescaled after each iterations'',\\ \noindent ``normalization = scaling of rows'',\\ \noindent ``centering method of rows'',\\ \noindent ``parameter for method''. \item{\texttt{n}:} number of rows, left dimension. \item{\texttt{p1}:} right dimension of left matrix. \item{\texttt{p2}:} left dimension of right matrix. \item{\texttt{l}:} number of columns, right dimension. \item{\texttt{center}:} vector of the centers. \item{\texttt{scaleData}:} vector of the scaling factors. \item{\texttt{X}:} centered and scaled data matrix $n \times l$. \item{\texttt{L}:} left matrix $n \times p1$. \item{\texttt{Z}:} right matrix $p2 \times l$. \item{\texttt{M}:} middle matrix $p1 \times p2$. \item{\texttt{LZ}:} matrix $\La \bf{M} \Z$ . \item{\texttt{U}:} noise matrix. \item{\texttt{avini}:} information of each bicluster, vector of length p2. \item{\texttt{xavini}:} information extracted from each sample, vector of length $l$. \item{\texttt{ini}:} information of each bicluster in each sample, matrix $p2 \times l$. \item{\texttt{Psi}:} noise variance per row, vector of length $n$. \item{\texttt{lapla}:} prior information for each sample, vector of length $l$. \end{enumerate} This class contains the result of different matrix factorization methods. The methods may be generative or not. Methods my be ``singular value decomposition'' ($\bf{M}$ contains singular values as well as avini, $\La$ and $\Z$ are orthonormal matrices), ``independent component analysis'' ($\Z$ contains the projection/sources, $\La$ is the mixing matrix, $\bf{M}$ is unity), ``factor analysis'' ($\Z$ contains factors, $\La$ the loadings, $\bf{M}$ is unity, $\bf{U}$ the noise, $\pp$ the noise covariance, lapla is a variational parameter for non-Gaussian factors, avini and ini are the information the factors convey about the observations). \section{Biclustering and Matrix Factorization Methods} \subsection{fabi} Factor Analysis for Bicluster Acquisition: Laplace Prior (FABI) \citep{Hochreiter:10}. \R implementation of \texttt{fabia}, therefore it is \emph{slow}. \begin{enumerate} \item Usage: \texttt{fabi(X,p=5,alpha=0.1,cyc=500,spl=0,spz=0.5,center=2,norm=1)} \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{alpha:} sparseness loadings (0 - 1.0); default = 0.1. \item{cyc:} number of iterations; default = 500. \item{spl:} sparseness prior loadings (0 - 2.0): default = 0 (Laplace). \item{spz:} sparseness factors (0.5 - 2.0); default = 0.5 (Laplace). \item{center:} data centering: 1 (mean), 2 (median), > 2 (mode), 0 (no); default = 2. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X- \La \Z$), \texttt{center} (centering vector), \texttt{scaleData} (scaling vector), \texttt{X} (centered and scaled data $\X$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ + \ \epp \ = \ \La \tilde\z \ + \ \epp\end{align*} The model assumptions are: \emph{Factor Prior is Independent Laplace:} \begin{align*}p(\tilde\z) \ = \ \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |z_i|}\end{align*} \emph{Loading Prior is Independent Laplace:} \begin{align*}p(\la_i) \ = \ \left( \frac{1}{\sqrt{2}} \right)^n \prod_{k=1}^{n} e^{-\ \sqrt{2} \ |\lambda_{ki}|}\end{align*} \emph{Noise: Gaussian independent} \begin{align*}p(\epp) \ = \ \left( \frac{1}{\sqrt{2 \ \pi}} \right)^n \prod_{k=1}^{n} \frac{1}{\sigma_k} e^{\sum_{k=1}^{n} \frac{\epsilon_k^2}{\sigma_k^2}} \end{align*} \emph{Data Mean:} \begin{align*}\E(\x) \ = \ \E(\La \ \tilde\z \ + \ \epp ) \ = \ \La \ \E( \tilde\z) \ + \ \E(\epp) \ = \ \0 \end{align*} \emph{Data Covariance:} \begin{align} \nonumber \E(\x \ \x^T) \ &= \ \La \E( \tilde\z \tilde\z^T) \La^T \ + \ \La \E( \tilde\z)\E(\epp^T) \ + \ \E( \tilde\z)\E(\epp) \La^T \ + \ \E(\epp \ \epp^T) \\\nonumber &= \ \La \La^T \ + \ \mathrm{diag}(\sigma_k^2) \end{align} Normalizing the data to variance one for each component gives \begin{align*}\sigma_k^2 \ + \ \left(\la^{k}\right)^T \la^{k} \ = \ 1\end{align*} Here $\la^{k}$ is the $k$-th row of $\La$ (which is a row vector of length $p$). We recommend to \emph{normalize the components to variance one} in order to make the signal and noise comparable across components. \emph{Estimated Parameters:} $\La$ and $\pp$ ($\sigma_k$) \emph{Estimated Latent Variables:} $\Z$ \emph{Estimated Noise Free Data:} $\La \ \Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a variational approach according to \citep{Girolami:01} and \citep{Palmer:06}. We included a prior on the parameters and minimize a lower bound on the posterior of the parameters given the data. The update of the loadings includes an additive term which pushes the loadings toward zero (Gaussian prior leads to an multiplicative factor). The code is implemented in \R, therefore it is \emph{slow}. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabi(X,3,0.01,20) #--------------- # DEMO1 #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resToy <- fabi(X,13,0.01,200) extractPlot(resToy,ti="FABI",Y=Y) #--------------- # DEMO2 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast <- fabi(X,5,0.1,200) extractPlot(resBreast,ti="FABI Breast cancer(Veer)") #sorting of predefined labels CBreast%*%rBreast$pmZ } #--------------- # DEMO3 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Multi_A) X <- as.matrix(XMulti) resMulti <- fabi(X,5,0.01,200) extractPlot(resMulti,ti="FABI Multiple tissues(Su)") #sorting of predefined labels CMulti%*%rMulti$pmZ } #--------------- # DEMO4 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(DLBCL_B) X <- as.matrix(XDLBCL) resDLBCL <- fabi(X,5,0.1,200) extractPlot(resDLBCL,ti="FABI Lymphoma(Rosenwald)") #sorting of predefined labels CDLBCL%*%rDLBCL$pmZ } \end{Sinput} \subsection{fabia} Factor Analysis for Bicluster Acquisition: Laplace Prior (FABIA) \citep{Hochreiter:10}. C implementation of \texttt{fabia}. \begin{enumerate} \item Usage: \verb+fabia(X,p=5,alpha=0.1,cyc=500,spl=0,spz=0.5,random=1.0,+\\ \hspace*{6em}\verb+ center=2,norm=1,scale=0.0,lap=1.0,nL=0,lL=0,bL=0)+ \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{alpha:} sparseness loadings (0 - 1.0); default = 0.1. \item{cyc:} number of iterations; default = 500. \item{spl:} sparseness prior loadings (0 - 2.0); default = 0 (Laplace). \item{spz:} sparseness factors (0.5 - 2.0); default = 0.5 (Laplace). \item{random} <=0: by SVD, >0: random initialization of loadings in [-random,random]; default = 1.0. \item{center:} data centering: 1 (mean), 2 (median), > 2 (mode), 0 (no); default = 2. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \item{scale:} loading vectors are scaled in each iteration to the given variance. 0.0 indicates non scaling; default = 0.0. \item{lap:} minimal value of the variational parameter, default = 1. \item{nL:} maximal number of biclusters at which a row element can participate; default = 0 (no limit). \item{lL:} maximal number of row elements per bicluster; default = 0 (no limit). \item{bL:} cycle at which the nL or lL maximum starts; default = 0 (start at the beginning). \end{itemize} \item Return values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X- \La \Z$), \texttt{center} (centering vector), \texttt{scaleData} (scaling vector), \texttt{X} (centered and scaled data $\X$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ + \ \epp \ = \ \La \tilde\z \ + \ \epp\end{align*} The model assumptions are: \emph{Factor Prior is Independent Laplace:} \begin{align*}p(\tilde\z) \ = \ \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |z_i|}\end{align*} \emph{Loading Prior is Independent Laplace:} \begin{align*}p(\la_i) \ = \ \left( \frac{1}{\sqrt{2}} \right)^n \prod_{k=1}^{n} e^{-\ \sqrt{2} \ |\lambda_{ki}|}\end{align*} \emph{Noise: Gaussian independent} \begin{align*}p(\epp) \ = \ \left( \frac{1}{\sqrt{2 \ \pi}} \right)^n \prod_{k=1}^{n} \frac{1}{\sigma_k} e^{\sum_{k=1}^{n} \frac{\epsilon_k^2}{\sigma_k^2}} \end{align*} \emph{Data Mean:} \begin{align*}\E(\x) \ = \ \E(\La \ \tilde\z \ + \ \epp ) \ = \ \La \ \E( \tilde\z) \ + \ \E(\epp) \ = \ \0\end{align*} \emph{Data Covariance:} \begin{align} \nonumber \E(\x \ \x^T) \ &= \ \La \E( \tilde\z \tilde\z^T) \La^T \ + \ \La \E( \tilde\z)\E(\epp^T) \ + \ \E( \tilde\z)\E(\epp) \La^T \ + \ \E(\epp \ \epp^T) \\\nonumber &= \ \La \La^T \ + \ \mathrm{diag}(\sigma_k^2) \end{align} Normalizing the data to variance one for each component gives \begin{align*}\sigma_k^2 \ + \ \left(\la^{k}\right)^T \la^{k} \ = \ 1\end{align*} Here $\la^{k}$ is the $k$-th row of $\La$ (which is a row vector of length $p$). We recommend to \emph{normalize the components to variance one} in order to make the signal and noise comparable across components. \emph{Estimated Parameters:} $\La$ and $\pp$ ($\sigma_k$) \emph{Estimated Latent Variables:} $\Z$ \emph{Estimated Noise Free Data:} $\La \ \Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a variational approach according to \citep{Girolami:01} and \citep{Palmer:06}. We included a prior on the parameters and minimize a lower bound on the posterior of the parameters given the data. The update of the loadings includes an additive term which pushes the loadings toward zero (Gaussian prior leads to an multiplicative factor). The code is implemented in C. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabia(X,3,0.01,50) #----------------- # DEMO1: Toy Data #----------------- n = 1000 l= 100 p = 10 dat <- makeFabiaDataBlocks(n = n,l= l,p = p,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] ZC <- dat[[3]] LC <- dat[[4]] gclab <- rep.int(0,l) gllab <- rep.int(0,n) clab <- as.character(1:l) llab <- as.character(1:n) for (i in 1:p){ for (j in ZC[i]){ clab[j] <- paste(as.character(i),"_",clab[j],sep="") } for (j in LC[i]){ llab[j] <- paste(as.character(i),"_",llab[j],sep="") } gclab[unlist(ZC[i])] <- gclab[unlist(ZC[i])] + p^i gllab[unlist(LC[i])] <- gllab[unlist(LC[i])] + p^i } groups <- gclab #### FABIA resToy1 <- fabia(X,13,0.01,400) extractPlot(resToy1,ti="FABIA",Y=Y) raToy1 <- extractBic(resToy1) if ((raToy1$bic[[1]][1]>1) && (raToy1$bic[[1]][2])>1) { plotBicluster(raToy1,1) } if ((raToy1$bic[[2]][1]>1) && (raToy1$bic[[2]][2])>1) { plotBicluster(raToy1,2) } if ((raToy1$bic[[3]][1]>1) && (raToy1$bic[[3]][2])>1) { plotBicluster(raToy1,3) } if ((raToy1$bic[[4]][1]>1) && (raToy1$bic[[4]][2])>1) { plotBicluster(raToy1,4) } colnames(resToy1@X) <- clab rownames(resToy1@X) <- llab plot(resToy1,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy1,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy1,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) #------------------------------------------ # DEMO2: Laura van't Veer's gene expression # data set for breast cancer #------------------------------------------ avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast1 <- fabia(X,5,0.1,400) extractPlot(resBreast1,ti="FABIA Breast cancer(Veer)") raBreast1 <- extractBic(resBreast1) if ((raBreast1$bic[[1]][1]>1) && (raBreast1$bic[[1]][2])>1) { plotBicluster(raBreast1,1) } if ((raBreast1$bic[[2]][1]>1) && (raBreast1$bic[[2]][2])>1) { plotBicluster(raBreast1,2) } if ((raBreast1$bic[[3]][1]>1) && (raBreast1$bic[[3]][2])>1) { plotBicluster(raBreast1,3) } if ((raBreast1$bic[[4]][1]>1) && (raBreast1$bic[[4]][2])>1) { plotBicluster(raBreast1,4) } plot(resBreast1,dim=c(1,2),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(1,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(1,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(1,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(2,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(2,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(2,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(3,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(3,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast1,dim=c(4,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) } #----------------------------------- # DEMO3: Su's multiple tissue types # gene expression data set #----------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Multi_A) X <- as.matrix(XMulti) resMulti1 <- fabia(X,5,0.06,300,norm=2) extractPlot(resMulti1,ti="FABIA Multiple tissues(Su)") raMulti1 <- extractBic(resMulti1) if ((raMulti1$bic[[1]][1]>1) && (raMulti1$bic[[1]][2])>1) { plotBicluster(raMulti1,1) } if ((raMulti1$bic[[2]][1]>1) && (raMulti1$bic[[2]][2])>1) { plotBicluster(raMulti1,2) } if ((raMulti1$bic[[3]][1]>1) && (raMulti1$bic[[3]][2])>1) { plotBicluster(raMulti1,3) } if ((raMulti1$bic[[4]][1]>1) && (raMulti1$bic[[4]][2])>1) { plotBicluster(raMulti1,4) } plot(resMulti1,dim=c(1,2),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(1,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(1,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(1,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(2,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(2,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(2,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(3,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(3,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti1,dim=c(4,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) } #----------------------------------------- # DEMO4: Rosenwald's diffuse large-B-cell # lymphoma gene expression data set #----------------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(DLBCL_B) X <- as.matrix(XDLBCL) resDLBCL1 <- fabia(X,5,0.1,400,norm=2) extractPlot(resDLBCL1,ti="FABIA Lymphoma(Rosenwald)") raDLBCL1 <- extractBic(resDLBCL1) if ((raDLBCL1$bic[[1]][1]>1) && (raDLBCL1$bic[[1]][2])>1) { plotBicluster(raDLBCL1,1) } if ((raDLBCL1$bic[[2]][1]>1) && (raDLBCL1$bic[[2]][2])>1) { plotBicluster(raDLBCL1,2) } if ((raDLBCL1$bic[[3]][1]>1) && (raDLBCL1$bic[[3]][2])>1) { plotBicluster(raDLBCL1,3) } if ((raDLBCL1$bic[[4]][1]>1) && (raDLBCL1$bic[[4]][2])>1) { plotBicluster(raDLBCL1,4) } plot(resDLBCL1,dim=c(1,2),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(1,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(1,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(1,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(2,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(2,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(2,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(3,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(3,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL1,dim=c(4,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) } \end{Sinput} \subsection{fabiap} Factor Analysis for Bicluster Acquisition: Post-Projection (FABIAP) \citep{Hochreiter:10}. \begin{enumerate} \item Usage: \verb+fabiap(X,p=5,alpha=0.1,cyc=500,spl=0,spz=0.5,sL=0.6,sZ=0.6,+\\ \hspace*{6em}\verb+ random=1.0,center=2,norm=1,scale=0.0,lap=1.0,nL=0,lL=0,bL=0)+ \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{alpha:} sparseness loadings (0 - 1.0); default = 0.1. \item{cyc:} number of iterations; default = 500. \item{spl:} sparseness prior loadings (0 - 2.0); default = 0 (Laplace). \item{spz:} sparseness factors (0.5 - 2.0); default = 0.5 (Laplace). \item{sL:} final sparseness loadings; default = 0.6. \item{sZ:} final sparseness factors; default = 0.6. \item{random} <=0: by SVD, >0: random initialization of loadings in [-random,random]; default = 1.0. \item{center:} data centering: 1 (mean), 2 (median), > 2 (mode), 0 (no); default = 2. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \item{scale:} loading vectors are scaled in each iteration to the given variance. 0.0 indicates non scaling; default = 0.0. \item{lap:} minimal value of the variational parameter, default = 1. \item{nL:} maximal number of biclusters at which a row element can participate; default = 0 (no limit). \item{lL:} maximal number of row elements per bicluster; default = 0 (no limit). \item{bL:} cycle at which the nL or lL maximum starts; default = 0 (start at the beginning). \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X- \La \Z$), \texttt{center} (centering vector), \texttt{scaleData} (scaling vector), \texttt{X} (centered and scaled data $\X$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Post-processing by projecting the final results to a given sparseness criterion. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ + \ \epp \ = \ \La \tilde\z \ + \ \epp\end{align*} The model assumptions are: \emph{Factor Prior is Independent Laplace:} \begin{align*}p(\tilde\z) \ = \ \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |z_i|}\end{align*} \emph{Loading Prior is Independent Laplace:} \begin{align*}p(\la_i) \ = \ \left( \frac{1}{\sqrt{2}} \right)^n \prod_{k=1}^{n} e^{-\ \sqrt{2} \ |\lambda_{ki}|}\end{align*} \emph{Noise: Gaussian independent} \begin{align*}p(\epp) \ = \ \left( \frac{1}{\sqrt{2 \ \pi}} \right)^n \prod_{k=1}^{n} \frac{1}{\sigma_k} e^{\sum_{k=1}^{n} \frac{\epsilon_k^2}{\sigma_k^2}} \end{align*} \emph{Data Mean:} \begin{align*}\E(\x) \ = \ \E(\La \ \tilde\z \ + \ \epp ) \ = \ \La \ \E( \tilde\z) \ + \ \E(\epp) \ = \ \0 \end{align*} \emph{Data Covariance:} \begin{align} \nonumber \E(\x \ \x^T) \ &= \ \La \E( \tilde\z \tilde\z^T) \La^T \ + \ \La \E( \tilde\z)\E(\epp^T) \ + \ \E( \tilde\z)\E(\epp) \La^T \ + \ \E(\epp \ \epp^T) \\\nonumber &= \ \La \La^T \ + \ \mathrm{diag}(\sigma_k^2) \end{align} Normalizing the data to variance one for each component gives \begin{align*}\sigma_k^2 \ + \ \left(\la^{k}\right)^T \la^{k} \ = \ 1\end{align*} Here $\la^{k}$ is the $k$-th row of $\La$ (which is a row vector of length $p$). We recommend to \emph{normalize the components to variance one} in order to make the signal and noise comparable across components. \emph{Estimated Parameters:} $\La$ and $\pp$ ($\sigma_k$) \emph{Estimated Latent Variables:} $\Z$ \emph{Estimated Noise Free Data:} $\La \ \Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a variational approach according to \citep{Girolami:01} and \citep{Palmer:06}. We included a prior on the parameters and minimize a lower bound on the posterior of the parameters given the data. The update of the loadings includes an additive term which pushes the loadings toward zero (Gaussian prior leads to an multiplicative factor). \emph{Post-processing:} The final results of the loadings and the factors are projected to a sparse vector according to Hoyer, 2004: given an $l_1$-norm and an $l_2$-norm minimize the Euclidean distance to the original vector (currently the $l_2$-norm is fixed to 1). The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. Instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: The code is implemented in C and t he projection in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabiap(X,3,0.01,50) #----------------- # DEMO1: Toy Data #----------------- n = 1000 l= 100 p = 10 dat <- makeFabiaDataBlocks(n = n,l= l,p = p,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] ZC <- dat[[3]] LC <- dat[[4]] gclab <- rep.int(0,l) gllab <- rep.int(0,n) clab <- as.character(1:l) llab <- as.character(1:n) for (i in 1:p){ for (j in ZC[i]){ clab[j] <- paste(as.character(i),"_",clab[j],sep="") } for (j in LC[i]){ llab[j] <- paste(as.character(i),"_",llab[j],sep="") } gclab[unlist(ZC[i])] <- gclab[unlist(ZC[i])] + p^i gllab[unlist(LC[i])] <- gllab[unlist(LC[i])] + p^i } groups <- gclab #### FABIAP resToy3 <- fabiap(X,13,0.01,400) extractPlot(resToy3,ti="FABIAP",Y=Y) raToy3 <- extractBic(resToy3) if ((raToy3$bic[[1]][1]>1) && (raToy3$bic[[1]][2])>1) { plotBicluster(raToy3,1) } if ((raToy3$bic[[2]][1]>1) && (raToy3$bic[[2]][2])>1) { plotBicluster(raToy3,2) } if ((raToy3$bic[[3]][1]>1) && (raToy3$bic[[3]][2])>1) { plotBicluster(raToy3,3) } if ((raToy3$bic[[4]][1]>1) && (raToy3$bic[[4]][2])>1) { plotBicluster(raToy3,4) } colnames(resToy3@X) <- clab rownames(resToy3@X) <- llab plot(resToy3,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy3,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy3,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) #------------------------------------------ # DEMO2: Laura van't Veer's gene expression # data set for breast cancer #------------------------------------------ avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast3 <- fabiap(X,5,0.1,400) extractPlot(resBreast3,ti="FABIAP Breast cancer(Veer)") raBreast3 <- extractBic(resBreast3) if ((raBreast3$bic[[1]][1]>1) && (raBreast3$bic[[1]][2])>1) { plotBicluster(raBreast3,1) } if ((raBreast3$bic[[2]][1]>1) && (raBreast3$bic[[2]][2])>1) { plotBicluster(raBreast3,2) } if ((raBreast3$bic[[3]][1]>1) && (raBreast3$bic[[3]][2])>1) { plotBicluster(raBreast3,3) } if ((raBreast3$bic[[4]][1]>1) && (raBreast3$bic[[4]][2])>1) { plotBicluster(raBreast3,4) } plot(resBreast3,dim=c(1,2),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(1,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(1,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(1,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(2,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(2,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(2,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(3,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(3,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast3,dim=c(4,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) } #----------------------------------- # DEMO3: Su's multiple tissue types # gene expression data set #----------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Multi_A) X <- as.matrix(XMulti) resMulti3 <- fabiap(X,5,0.01,300) extractPlot(resMulti3,ti="FABIAP Multiple tissues(Su)") raMulti3 <- extractBic(resMulti3) if ((raMulti3$bic[[1]][1]>1) && (raMulti3$bic[[1]][2])>1) { plotBicluster(raMulti3,1) } if ((raMulti3$bic[[2]][1]>1) && (raMulti3$bic[[2]][2])>1) { plotBicluster(raMulti3,2) } if ((raMulti3$bic[[3]][1]>1) && (raMulti3$bic[[3]][2])>1) { plotBicluster(raMulti3,3) } if ((raMulti3$bic[[4]][1]>1) && (raMulti3$bic[[4]][2])>1) { plotBicluster(raMulti3,4) } plot(resMulti3,dim=c(1,2),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(1,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(1,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(1,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(2,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(2,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(2,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(3,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(3,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti3,dim=c(4,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) } #----------------------------------------- # DEMO4: Rosenwald's diffuse large-B-cell # lymphoma gene expression data set #----------------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(DLBCL_B) X <- as.matrix(XDLBCL) resDLBCL3 <- fabiap(X,5,0.1,400) extractPlot(resDLBCL3,ti="FABIAP Lymphoma(Rosenwald)") raDLBCL3 <- extractBic(resDLBCL3) if ((raDLBCL3$bic[[1]][1]>1) && (raDLBCL3$bic[[1]][2])>1) { plotBicluster(raDLBCL3,1) } if ((raDLBCL3$bic[[2]][1]>1) && (raDLBCL3$bic[[2]][2])>1) { plotBicluster(raDLBCL3,2) } if ((raDLBCL3$bic[[3]][1]>1) && (raDLBCL3$bic[[3]][2])>1) { plotBicluster(raDLBCL3,3) } if ((raDLBCL3$bic[[4]][1]>1) && (raDLBCL3$bic[[4]][2])>1) { plotBicluster(raDLBCL3,4) } plot(resDLBCL3,dim=c(1,2),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(1,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(1,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(1,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(2,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(2,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(2,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(3,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(3,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL3,dim=c(4,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) } \end{Sinput} \subsection{fabias} Factor Analysis for Bicluster Acquisition: Sparseness Projection (FABIAS) \citep{Hochreiter:10}. C implementation of \texttt{fabias}. \begin{enumerate} \item Usage: \verb+fabias(X,p=5,alpha=0.6,cyc=500,spz=0.5,random=1.0,+\\ \hspace*{6em}\verb+ center=2,norm=1,lap=1.0,nL=0,lL=0,bL=0)+ \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{alpha:} sparseness loadings (0 - 1.0); default = 0.1. \item{cyc:} number of iterations; default = 500. \item{spz:} sparseness factors (0.5 - 2.0); default = 0.5 (Laplace). \item{random} <=0: by SVD, >0: random initialization of loadings in [-random,random]; default = 1.0. \item{center:} data centering: 1 (mean), 2 (median), > 2 (mode), 0 (no); default = 2. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \item{lap:} minimal value of the variational parameter, default = 1. \item{nL:} maximal number of biclusters at which a row element can participate; default = 0 (no limit). \item{lL:} maximal number of row elements per bicluster; default = 0 (no limit). \item{bL:} cycle at which the nL or lL maximum starts; default = 0 (start at the beginning). \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X- \La \Z$), \texttt{center} (centering vector), \texttt{scaleData} (scaling vector), \texttt{X} (centered and scaled data $\X$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ + \ \epp \ = \ \La \tilde\z \ + \ \epp\end{align*} The model assumptions are: \emph{Factor Prior is Independent Laplace:} \begin{align*}p(\tilde\z) \ = \ \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |z_i|}\end{align*} \emph{Loading Prior has Finite Support:} \begin{align*}p(\la_i) \ = \ c \ \ \ \ \mathrm{for} \ \ \ \|\la_i\|_1 \ \leq \ k\end{align*} \begin{align*}p(\la_i) \ = \ 0 \ \ \ \ \mathrm{for} \ \ \ \|\la_i\|_1 \ > \ k\end{align*} \emph{Noise: Gaussian independent} \begin{align*}p(\epp) \ = \ \left( \frac{1}{\sqrt{2 \ \pi}} \right)^n \prod_{k=1}^{n} \frac{1}{\sigma_k} e^{\sum_{k=1}^{n} \frac{\epsilon_k^2}{\sigma_k^2}} \end{align*} \emph{Data Mean:} \begin{align*}\E(\x) \ = \ \E(\La \ \tilde\z \ + \ \epp ) \ = \ \La \ \E( \tilde\z) \ + \ \E(\epp) \ = \ \0 \end{align*} \emph{Data Covariance:} \begin{align} \nonumber \E(\x \ \x^T) \ &= \ \La \E( \tilde\z \tilde\z^T) \La^T \ + \ \La \E( \tilde\z)\E(\epp^T) \ + \ \E( \tilde\z)\E(\epp) \La^T \ + \ \E(\epp \ \epp^T) \\\nonumber &= \ \La \La^T \ + \ \mathrm{diag}(\sigma_k^2) \end{align} Normalizing the data to variance one for each component gives \begin{align*}\sigma_k^2 \ + \ \left(\la^{k}\right)^T \la^{k} \ = \ 1\end{align*} Here $\la^{k}$ is the $k$-th row of $\La$ (which is a row vector of length $p$). We recommend to \emph{normalize the components to variance one} in order to make the signal and noise comparable across components. \emph{Estimated Parameters:} $\La$ and $\sigma_k$ \emph{Estimated Latent Variables:} $\Z$ \emph{Estimated Noise Free Data:} $\La \ \Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a variational approach according to \citep{Girolami:01} and \citep{Palmer:06}. The prior has finite support, therefore after each update of the loadings they are projected to the finite support. The projection is done according to \citep{Hoyer:04}: given an $l_1$-norm and an $l_2$-norm minimize the Euclidean distance to the original vector (currently the $l_2$-norm is fixed to 1). The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. Instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: \begin{align*}\mathrm{sparseness}(\la_i) \ = \ \frac{\sqrt{n} \ - \ \sum_{k=1}^{n} \left|\lambda_{ki} \right|\ / \ \sum_{k=1}^{n} \lambda_{ki}^2}{\sqrt{n} \ - \ 1}\end{align*} The code is implemented in C. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabias(X,3,0.6,50) #----------------- # DEMO1: Toy Data #----------------- n = 1000 l= 100 p = 10 dat <- makeFabiaDataBlocks(n = n,l= l,p = p,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] ZC <- dat[[3]] LC <- dat[[4]] gclab <- rep.int(0,l) gllab <- rep.int(0,n) clab <- as.character(1:l) llab <- as.character(1:n) for (i in 1:p){ for (j in ZC[i]){ clab[j] <- paste(as.character(i),"_",clab[j],sep="") } for (j in LC[i]){ llab[j] <- paste(as.character(i),"_",llab[j],sep="") } gclab[unlist(ZC[i])] <- gclab[unlist(ZC[i])] + p^i gllab[unlist(LC[i])] <- gllab[unlist(LC[i])] + p^i } groups <- gclab #### FABIAS resToy2 <- fabias(X,13,0.6,400) extractPlot(resToy2,ti="FABIAS",Y=Y) raToy2 <- extractBic(resToy2) if ((raToy2$bic[[1]][1]>1) && (raToy2$bic[[1]][2])>1) { plotBicluster(raToy2,1) } if ((raToy2$bic[[2]][1]>1) && (raToy2$bic[[2]][2])>1) { plotBicluster(raToy2,2) } if ((raToy2$bic[[3]][1]>1) && (raToy2$bic[[3]][2])>1) { plotBicluster(raToy2,3) } if ((raToy2$bic[[4]][1]>1) && (raToy2$bic[[4]][2])>1) { plotBicluster(raToy2,4) } colnames(resToy2@X) <- clab rownames(resToy2@X) <- llab plot(resToy2,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy2,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy2,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) #------------------------------------------ # DEMO2: Laura van't Veer's gene expression # data set for breast cancer #------------------------------------------ avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast2 <- fabias(X,5,0.6,300) extractPlot(resBreast2,ti="FABIAS Breast cancer(Veer)") raBreast2 <- extractBic(resBreast2) if ((raBreast2$bic[[1]][1]>1) && (raBreast2$bic[[1]][2])>1) { plotBicluster(raBreast2,1) } if ((raBreast2$bic[[2]][1]>1) && (raBreast2$bic[[2]][2])>1) { plotBicluster(raBreast2,2) } if ((raBreast2$bic[[3]][1]>1) && (raBreast2$bic[[3]][2])>1) { plotBicluster(raBreast2,3) } if ((raBreast2$bic[[4]][1]>1) && (raBreast2$bic[[4]][2])>1) { plotBicluster(raBreast2,4) } plot(resBreast2,dim=c(1,2),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(1,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(1,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(1,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(2,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(2,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(2,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(3,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(3,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast2,dim=c(4,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) } #----------------------------------- # DEMO3: Su's multiple tissue types # gene expression data set #----------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Multi_A) X <- as.matrix(XMulti) resMulti2 <- fabias(X,5,0.6,300) extractPlot(resMulti2,ti="FABIAS Multiple tissues(Su)") raMulti2 <- extractBic(resMulti2) if ((raMulti2$bic[[1]][1]>1) && (raMulti2$bic[[1]][2])>1) { plotBicluster(raMulti2,1) } if ((raMulti2$bic[[2]][1]>1) && (raMulti2$bic[[2]][2])>1) { plotBicluster(raMulti2,2) } if ((raMulti2$bic[[3]][1]>1) && (raMulti2$bic[[3]][2])>1) { plotBicluster(raMulti2,3) } if ((raMulti2$bic[[4]][1]>1) && (raMulti2$bic[[4]][2])>1) { plotBicluster(raMulti2,4) } plot(resMulti2,dim=c(1,2),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(1,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(1,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(1,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(2,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(2,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(2,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(3,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(3,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti2,dim=c(4,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) } #----------------------------------------- # DEMO4: Rosenwald's diffuse large-B-cell # lymphoma gene expression data set #----------------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(DLBCL_B) X <- as.matrix(XDLBCL) resDLBCL2 <- fabias(X,5,0.6,300) extractPlot(resDLBCL2,ti="FABIAS Lymphoma(Rosenwald)") raDLBCL2 <- extractBic(resDLBCL2) if ((raDLBCL2$bic[[1]][1]>1) && (raDLBCL2$bic[[1]][2])>1) { plotBicluster(raDLBCL2,1) } if ((raDLBCL2$bic[[2]][1]>1) && (raDLBCL2$bic[[2]][2])>1) { plotBicluster(raDLBCL2,2) } if ((raDLBCL2$bic[[3]][1]>1) && (raDLBCL2$bic[[3]][2])>1) { plotBicluster(raDLBCL2,3) } if ((raDLBCL2$bic[[4]][1]>1) && (raDLBCL2$bic[[4]][2])>1) { plotBicluster(raDLBCL2,4) } plot(resDLBCL2,dim=c(1,2),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(1,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(1,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(1,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(2,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(2,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(2,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(3,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(3,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL2,dim=c(4,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) } \end{Sinput} \subsection{fabiasp} Factor Analysis for Bicluster Acquisition: Sparseness Projection (FABIASP) \citep{Hochreiter:10}. \R implementation of \texttt{fabias}, therefore it is \emph{slow}. \begin{enumerate} \item Usage: \texttt{fabiasp(X,p=5,alpha=0.6,cyc=500,spz=0.5,center=2,norm=1)} \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{alpha:} sparseness loadings (0 - 1.0); default = 0.1. \item{cyc:} number of iterations; default = 500. \item{spz:} sparseness factors (0.5 - 2.0); default = 0.5 (Laplace). \item{center:} data centering: 1 (mean), 2 (median), > 2 (mode), 0 (no); default = 2. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X- \La \Z$), \texttt{center} (centering vector), \texttt{scaleData} (scaling vector), \texttt{X} (centered and scaled data $\X$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ + \ \epp \ = \ \La \tilde\z \ + \ \epp\end{align*} The model assumptions are: \emph{Factor Prior is Independent Laplace:} \begin{align*}p(\z) \ = \ \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |z_i|}\end{align*} \emph{Loading Prior has Finite Support:} \begin{align*}p(\la_i) \ = \ c \ \ \ \ \mathrm{for} \ \ \ \|\la_i\|_1 \ \leq \ k\end{align*} \begin{align*}p(\la_i) \ = \ 0 \ \ \ \ \mathrm{for} \ \ \ \|\la_i\|_1 \ > \ k\end{align*} \emph{Noise: Gaussian independent} \begin{align*}p(\epp) \ = \ \left( \frac{1}{\sqrt{2 \ \pi}} \right)^n \prod_{k=1}^{n} \frac{1}{\sigma_k} e^{\sum_{k=1}^{n} \frac{\epsilon_k^2}{\sigma_k^2}} \end{align*} \emph{Data Mean:} \begin{align*}\E(\x) \ = \ \E(\La \ \tilde\z \ + \ \epp ) \ = \ \La \ \E( \tilde\z) \ + \ \E(\epp) \ = \ \0 \end{align*} \emph{Data Covariance:} \begin{align} \nonumber \E(\x \ \x^T) \ &= \ \La \E( \tilde\z \tilde\z^T) \La^T \ + \ \La \E( \tilde\z)\E(\epp^T) \ + \ \E( \tilde\z)\E(\epp) \La^T \ + \ \E(\epp \ \epp^T) \\\nonumber &= \ \La \La^T \ + \ \mathrm{diag}(\sigma_k^2) \end{align} Normalizing the data to variance one for each component gives \begin{align*}\sigma_k^2 \ + \ \left(\la^{k}\right)^T \la^{k} \ = \ 1\end{align*} Here $\la^{k}$ is the $k$-th row of $\La$ (which is a row vector of length $p$). We recommend to \emph{normalize the components to variance one} in order to make the signal and noise comparable across components. \emph{Estimated Parameters:} $\La$ and $\pp$ ($\sigma_k$) \emph{Estimated Latent Variables:} $\Z$ \emph{Estimated Noise Free Data:} $\La \ \Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a variational approach according to \citep{Girolami:01} and \citep{Palmer:06}. The prior has finite support, therefore after each update of the loadings they are projected to the finite support. The projection is done according to \citep{Hoyer:04}: given an $l_1$-norm and an $l_2$-norm minimize the Euclidean distance to the original vector (currently the $l_2$-norm is fixed to 1). The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. Instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: \begin{align*}\mathrm{sparseness}(\la_i) \ = \ \frac{\sqrt{n} \ - \ \sum_{k=1}^{n} \left|\lambda_{ki} \right|\ / \ \sum_{k=1}^{n} \lambda_{ki}^2}{\sqrt{n} \ - \ 1}\end{align*} The code is implemented in \R, therefore it is \emph{slow}. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabiasp(X,3,0.6,50) \dontrun{ #--------------- # DEMO1 #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resToy <- fabiasp(X,13,0.6,200) extractPlot(resToy,ti="FABIASP",Y=Y) #--------------- # DEMO2 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast <- fabiasp(X,5,0.6,200) extractPlot(resBreast,ti="FABIASP Breast cancer(Veer)") #sorting of predefined labels CBreast%*%rBreast$pmZ } #--------------- # DEMO3 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Multi_A) X <- as.matrix(XMulti) resMulti <- fabiasp(X,5,0.6,200) extractPlot(resMulti,"ti=FABIASP Multiple tissues(Su)") #sorting of predefined labels CMulti%*%rMulti$pmZ } #--------------- # DEMO4 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(DLBCL_B) X <- as.matrix(XDLBCL) resDLBCL <- fabiasp(X,5,0.6,200) extractPlot(resDLBCL,ti="FABIASP Lymphoma(Rosenwald)") #sorting of predefined labels CDLBCL%*%rDLBCL$pmZ } \end{Sinput} \subsection{spfabia} Factor Analysis for Bicluster Acquisition: SPARSE FABIA \citep{Hochreiter:10}. C implementation of \texttt{spfabia}. \begin{enumerate} \item Usage: \verb+ spfabia(X,p=5,alpha=0.1,cyc=500,spl=0,spz=0.5,non_negative=0,random=1.0,+\\ \hspace*{6em}\verb+ write_file=1,norm=1,scale=0.0,lap=1.0,nL=0,lL=0,bL=0,samples=0,initL=0,+\\ \hspace*{6em}\verb+ iter=1,quant=0.001,dorescale=FALSE,doini=FALSE,eps=1e-3,eps1=1e-10)+ \item Arguments: \begin{itemize} \item{X:} the file name of the sparse matrix in sparse format. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{alpha:} sparseness loadings (0 - 1.0); default = 0.1. \item{cyc:} number of iterations; default = 500. \item{spl:} sparseness prior loadings (0 - 2.0); default = 0 (Laplace). \item{spz:} sparseness factors (0.5 - 2.0); default = 0.5 (Laplace). \item{non\_negative:} Non-negative factors and loadings if non\_negative > 0; default = 0. \item{random:} >0: random initialization of loadings in [0,random], <0: random initialization of loadings in [-random,random]; default = 1.0. \item{write\_file:} >0: results are written to files (L in sparse format), default = 1. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \item{scale:} loading vectors are scaled in each iteration to the given variance. 0.0 indicates non scaling; default = 0.0. \item{lap:} minimal value of the variational parameter, default = 1. \item{nL:} maximal number of biclusters at which a row element can participate; default = 0 (no limit). \item{lL:} maximal number of row elements per bicluster; default = 0 (no limit). \item{bL:} cycle at which the nL or lL maximum starts; default = 0 (start at the beginning). \item{samples:} vector of samples which should be included into the analysis; default = 0 (all samples). \item{initL:} vector of indices of the selected samples which are used to initialize L; default = 0 (random initialization). \item{iter:} number of iterations; default = 1. \item{quant:} qunatile of largest L values to remove in each iteration; default = 0.001. \item{lowerB:} lower bound for filtering the inputs columns, the minimal column sum; default = 0.0. \item{upperB:} upper bound for filtering the inputs columns, the maximal column sum; default = 1000.0. \item{dorescale:} rescale factors Z to variance 1 and consequently also L; logical; default: FALSE. \item{doini:} compute the information content of the biclusters and sort the biclusters according to their information content; logical, default: FALSE. \item{eps:} lower bound for variational parameter lapla; default: 1e-3. \item{eps1:} lower bound for divisions to avoid division by zero; default: 1e-10. \end{itemize} \item Return values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Version of fabia for a sparse data matrix. The data matrix is directly scanned by the C-code and must be in sparse matrix format. Sparse matrix format: \begin{itemize} \item first line: numer of rows (the samples). \item second line: number of columns (the features). \item following lines: for each sample (row) three lines with \begin{enumerate} \item number of nonzero row elements \item indices of the nonzero row elements (ATTENTION: starts with 0!!) \item values of the nonzero row elements (ATTENTION: floats with decimal point like 1.0 !!) \end{enumerate} \end{itemize} \vspace{0.5cm} Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ + \ \epp \ = \ \La \tilde\z \ + \ \epp\end{align*} The model assumptions are: \emph{Factor Prior is Independent Laplace:} \begin{align*}p(\tilde\z) \ = \ \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\ \sqrt{2} \ |z_i|}\end{align*} \emph{Loading Prior is Independent Laplace:} \begin{align*}p(\la_i) \ = \ \left( \frac{1}{\sqrt{2}} \right)^n \prod_{k=1}^{n} e^{-\ \sqrt{2} \ |\lambda_{ki}|}\end{align*} \emph{Noise: Gaussian independent} \begin{align*}p(\epp) \ = \ \left( \frac{1}{\sqrt{2 \ \pi}} \right)^n \prod_{k=1}^{n} \frac{1}{\sigma_k} e^{\sum_{k=1}^{n} \frac{\epsilon_k^2}{\sigma_k^2}} \end{align*} \emph{Data Mean:} \begin{align*}\E(\x) \ = \ \E(\La \ \tilde\z \ + \ \epp ) \ = \ \La \ \E( \tilde\z) \ + \ \E(\epp) \ = \ \0\end{align*} \emph{Data Covariance:} \begin{align} \nonumber \E(\x \ \x^T) \ &= \ \La \E( \tilde\z \tilde\z^T) \La^T \ + \ \La \E( \tilde\z)\E(\epp^T) \ + \ \E( \tilde\z)\E(\epp) \La^T \ + \ \E(\epp \ \epp^T) \\\nonumber &= \ \La \La^T \ + \ \mathrm{diag}(\sigma_k^2) \end{align} Normalizing the data to variance one for each component gives \begin{align*}\sigma_k^2 \ + \ \left(\la^{k}\right)^T \la^{k} \ = \ 1\end{align*} Here $\la^{k}$ is the $k$-th row of $\La$ (which is a row vector of length $p$). We recommend to \emph{normalize the components to variance one} in order to make the signal and noise comparable across components. \emph{Estimated Parameters:} $\La$ and $\pp$ ($\sigma_k$) \emph{Estimated Latent Variables:} $\Z$ \emph{Estimated Noise Free Data:} $\La \ \Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a variational approach according to \citep{Girolami:01} and \citep{Palmer:06}. We included a prior on the parameters and minimize a lower bound on the posterior of the parameters given the data. The update of the loadings includes an additive term which pushes the loadings toward zero (Gaussian prior leads to an multiplicative factor). The code is implemented in C. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- samples <- 20 features <- 200 sparseness <- 0.9 write(samples, file = "sparseFarmsTest.txt",ncolumns = features,append = FALSE, sep = " ") write(features, file = "sparseFarmsTest.txt",ncolumns = features,append = TRUE, sep = " ") for (i in 1:samples) { ind <- which(runif(features)>sparseness)-1 num <- length(ind) val <- abs(rnorm(num)) write(num, file = "sparseFarmsTest.txt",ncolumns = features,append = TRUE, sep = " ") write(ind, file = "sparseFarmsTest.txt",ncolumns = features,append = TRUE, sep = " ") write(val, file = "sparseFarmsTest.txt",ncolumns = features,append = TRUE, sep = " ") } res <- spfabia("sparseFarmsTest",p=3,alpha=0.03,cyc=50,non_negative=1,write_file=0,norm=0) unlink("sparseFarmsTest.txt") plot(res,dim=c(1,2)) plot(res,dim=c(1,3)) plot(res,dim=c(2,3)) \end{Sinput} \subsection{readSamplesSpfabia} Factor Analysis for Bicluster Acquisition: Read Samples of SpFabia \citep{Hochreiter:10}. C implementation of \texttt{readSamplesSpfabia}. \begin{enumerate} \item Usage: \texttt{readSamplesSpfabia(X,samples=0,lowerB=0.0,upperB=1000.0)} \item Arguments: \begin{itemize} \item{X:} the file name of the sparse matrix in sparse format. \item{samples:} vector of samples which should be read; default = 0 (all samples) \item{lowerB:} lower bound for filtering the inputs columns, the minimal column sum; default = 0.0. \item{upperB:} upper bound for filtering the inputs columns, the maximal column sum; default = 1000.0. \end{itemize} \item Return values: \begin{itemize} \item \texttt{X} (data of the given samples) \end{itemize} \end{enumerate} Read the samples to analyze results of spfabia. The code is implemented in C. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- # no test \end{Sinput} \subsection{readSpfabiaResult} Factor Analysis for Bicluster Acquisition: Read Results of SpFabia \citep{Hochreiter:10}. C implementation of \texttt{readSpfabiaResult}. \begin{enumerate} \item Usage: \texttt{readSpfabiaResult(X)} \item Arguments: \begin{itemize} \item{X:} the file prefix name of the result files of \texttt{spfabia}. \end{itemize} \item Return values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{Psi} (noise variance $\pp$), \texttt{lapla} (variational parameter), \texttt{avini} (the information which the factor $z_{ij}$ contains about $\x_j$ averaged over $j$) \texttt{xavini} (the information which the factor $\tilde\z_j$ contains about $\x_j$ averaged over $j$) \texttt{ini} (for each $j$ the information which the factor $z_{ij}$ contains about $\x_j$). \end{itemize} \end{enumerate} Read the results of spfabia. The code is implemented in C. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- # no test \end{Sinput} \subsection{mfsc} Sparse Matrix Factorization for bicluster analysis (MFSC) \citep{Hochreiter:10}. \begin{enumerate} \item Usage: \texttt{mfsc(X,p=5,cyc=100,sL=0.6,sZ=0.6,center=2,norm=1)} \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{cyc:} maximal number of iterations; default = 100. \item{sL:} final sparseness loadings; default = 0.6. \item{sZ:} final sparseness factors; default = 0.6. \item{center:} data centering: 1 (mean), 2 (median), > 2 (mode), 0 (no); default = 2. \item{norm:} data normalization: 1 (0.75-0.25 quantile), >1 (var=1), 0 (no); default = 1. \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X - \La \Z$), \texttt{center} (centering vector), \texttt{scaleData} (scaling vector), \texttt{X} (centered and scaled data $\X$). \end{itemize} \end{enumerate} Biclusters are found by sparse matrix factorization where \emph{both} factors are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ . \end{align*} Here $\X \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. \emph{No noise assumption:} In contrast to factor analysis there is no noise assumption. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*} \x \ = \ \sum_{i=1}^{p} \la_i z_i \ = \ \La \tilde\z \end{align*} \emph{Estimated Parameters:} $\La$ and $\Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a constraint optimization according to \citep{Hoyer:04}. The Euclidean distance (the Frobenius norm) is minimized subject to sparseness constraints: \begin{align*}\min_{ \La, \Z} \left\|\X \ - \ \La \ \Z\right\|_{F}^{2} \end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\La\right\|_{F}^{2} \ = \ 1\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\La\right\|_{1} \ = \ k_L\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\Z\right\|_{F}^{2} \ = \ 1\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\Z\right\|_{1} \ = \ k_Z\end{align*} Model selection is done by gradient descent on the Euclidean objective and thereafter projection of single vectors of $\La$ and single vectors of $\Z$ to fulfill the sparseness constraints. The projection minimize the Euclidean distance to the original vector given an $l_1$-norm and an $l_2$-norm. The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. Instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: \begin{align*}\mathrm{sparseness}(\la_i) \ = \ \frac{\sqrt{n} \ - \ \sum_{k=1}^{n} \left|\lambda_{ki} \right|\ / \ \sum_{k=1}^{n} \lambda_{ki}^2}{\sqrt{n} \ - \ 1}\end{align*} The code is implemented in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- mfsc(X,3,30,0.6,0.6) \dontrun{ #----------------- # DEMO1: Toy Data #----------------- n = 1000 l= 100 p = 10 dat <- makeFabiaDataBlocks(n = n,l= l,p = p,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] ZC <- dat[[3]] LC <- dat[[4]] gclab <- rep.int(0,l) gllab <- rep.int(0,n) clab <- as.character(1:l) llab <- as.character(1:n) for (i in 1:p){ for (j in ZC[i]){ clab[j] <- paste(as.character(i),"_",clab[j],sep="") } for (j in LC[i]){ llab[j] <- paste(as.character(i),"_",llab[j],sep="") } gclab[unlist(ZC[i])] <- gclab[unlist(ZC[i])] + p^i gllab[unlist(LC[i])] <- gllab[unlist(LC[i])] + p^i } groups <- gclab #### MFSC resToy4 <- mfsc(X,13,100,0.6,0.6) extractPlot(resToy4,ti="MFSC",Y=Y) raToy4 <- extractBic(resToy4,thresZ=0.01,thresL=0.05) if ((raToy4$bic[[1]][1]>1) && (raToy4$bic[[1]][2])>1) { plotBicluster(raToy4,1) } if ((raToy4$bic[[2]][1]>1) && (raToy4$bic[[2]][2])>1) { plotBicluster(raToy4,2) } if ((raToy4$bic[[3]][1]>1) && (raToy4$bic[[3]][2])>1) { plotBicluster(raToy4,3) } if ((raToy4$bic[[4]][1]>1) && (raToy4$bic[[4]][2])>1) { plotBicluster(raToy4,4) } colnames(resToy4@X) <- clab rownames(resToy4@X) <- llab plot(resToy4,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy4,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resToy4,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) #------------------------------------------ # DEMO2: Laura van't Veer's gene expression # data set for breast cancer #------------------------------------------ avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast4 <- mfsc(X,5,100,0.6,0.6) extractPlot(resBreast4,ti="MFSC Breast cancer(Veer)") raBreast4 <- extractBic(resBreast4,thresZ=0.01,thresL=0.05) if ((raBreast4$bic[[1]][1]>1) && (raBreast4$bic[[1]][2])>1) { plotBicluster(raBreast4,1) } if ((raBreast4$bic[[2]][1]>1) && (raBreast4$bic[[2]][2])>1) { plotBicluster(raBreast4,2) } if ((raBreast4$bic[[3]][1]>1) && (raBreast4$bic[[3]][2])>1) { plotBicluster(raBreast4,3) } if ((raBreast4$bic[[4]][1]>1) && (raBreast4$bic[[4]][2])>1) { plotBicluster(raBreast4,4) } plot(resBreast4,dim=c(1,2),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(1,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(1,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(1,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(2,3),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(2,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(2,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(3,4),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(3,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) plot(resBreast4,dim=c(4,5),label.tol=0.03,col.group=CBreast,lab.size=0.6) } #----------------------------------- # DEMO3: Su's multiple tissue types # gene expression data set #----------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Multi_A) X <- as.matrix(XMulti) resMulti4 <- mfsc(X,5,100,0.6,0.6) extractPlot(resMulti4,ti="MFSC Multiple tissues(Su)") raMulti4 <- extractBic(resMulti4,thresZ=0.01,thresL=0.05) if ((raMulti4$bic[[1]][1]>1) && (raMulti4$bic[[1]][2])>1) { plotBicluster(raMulti4,1) } if ((raMulti4$bic[[2]][1]>1) && (raMulti4$bic[[2]][2])>1) { plotBicluster(raMulti4,2) } if ((raMulti4$bic[[3]][1]>1) && (raMulti4$bic[[3]][2])>1) { plotBicluster(raMulti4,3) } if ((raMulti4$bic[[4]][1]>1) && (raMulti4$bic[[4]][2])>1) { plotBicluster(raMulti4,4) } plot(resMulti4,dim=c(1,2),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(1,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(1,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(1,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(2,3),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(2,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(2,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(3,4),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(3,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) plot(resMulti4,dim=c(4,5),label.tol=0.01,col.group=CMulti,lab.size=0.6) } #----------------------------------------- # DEMO4: Rosenwald's diffuse large-B-cell # lymphoma gene expression data set #----------------------------------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(DLBCL_B) X <- as.matrix(XDLBCL) resDLBCL4 <- mfsc(X,5,100,0.6,0.6) extractPlot(resDLBCL4,ti="MFSC Lymphoma(Rosenwald)") raDLBCL4 <- extractBic(resDLBCL4,thresZ=0.01,thresL=0.05) if ((raDLBCL4$bic[[1]][1]>1) && (raDLBCL4$bic[[1]][2])>1) { plotBicluster(raDLBCL4,1) } if ((raDLBCL4$bic[[2]][1]>1) && (raDLBCL4$bic[[2]][2])>1) { plotBicluster(raDLBCL4,2) } if ((raDLBCL4$bic[[3]][1]>1) && (raDLBCL4$bic[[3]][2])>1) { plotBicluster(raDLBCL4,3) } if ((raDLBCL4$bic[[4]][1]>1) && (raDLBCL4$bic[[4]][2])>1) { plotBicluster(raDLBCL4,4) } plot(resDLBCL4,dim=c(1,2),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(1,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(1,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(1,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(2,3),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(2,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(2,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(3,4),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(3,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) plot(resDLBCL4,dim=c(4,5),label.tol=0.03,col.group=CDLBCL,lab.size=0.6) } \end{Sinput} \subsection{nmfdiv} Non-negative Matrix Factorization with Kullaback-Leibler divergence as objective. \begin{enumerate} \item Usage: \texttt{nmfdiv(X,p=5,cyc=100)} \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{cyc:} maximal number of iterations; default = 100. \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X- \La \Z$), \texttt{X} (scaled data $\X$). \end{itemize} \end{enumerate} \begin{align*}\X \ = \ \La \ \Z \end{align*} \begin{align*}\X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T\end{align*} \emph{Estimated Parameters:} $\La$ and $\Z$ The model selection is performed according to \citep{Lee:99,Lee:01}. \emph{objective:} \begin{align*}\mathrm{D} ( \A \parallel \B ) \ = \ \sum_{ij} \left( A_{ij} \ \log \frac{ A_{ij}}{B_{ij}} \ + \ A_{ij} \ - \ B_{ij} \right)\end{align*} \emph{update:} \begin{align*}L_{ik} \ = \ L_{ik} \frac{\sum_{j=1}^{n} Z_{ji} \ V_{jk} \ / \ \left(\La \ \Z\right)_{jk}}{\sum_{j=1}^{n} Z_{ji}}\end{align*} \begin{align*}Z_{ji} \ = \ Z_{ji} \frac{\sum_{k=1}^{l} L_{ik} \ V_{jk} \ / \ \left(\La \ \Z\right)_{jk}}{\sum_{k=1}^{l} L_{ik}}\end{align*} or in matrix notation with ``$*$'' and ``$/$'' as element-wise operators: \begin{align*}\La \ = \ \La*((\X \ / \ (\La \ \Z))\ t(\Z)) \ / \ \mathrm{rowSums}(\Z)\end{align*} \begin{align*}\Z \ = \ \Z*(t(\La) \ (\X \ / \ (\La \ \Z))) \ / \ \mathrm{colSums}(\La)\end{align*} The code is implemented in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] X <- abs(X) resEx <- nmfdiv(X,3) #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] X <- abs(X) resToy <- nmfdiv(X,13) extractPlot(resToy,ti="NMFDIV",Y=Y) \end{Sinput} \subsection{nmfeu} Non-negative Matrix Factorization with Euclidean distance as objective. \begin{enumerate} \item Usage: \texttt{nmfeu(X,p=5,cyc=100)} \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{cyc:} maximal number of iterations; default = 100. \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X - \La \Z$), \texttt{X} (scaled data $\X$). \end{itemize} \end{enumerate} \begin{align*}\X \ = \ \La \ \Z \end{align*} \begin{align*}\X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T\end{align*} \emph{Estimated Parameters:} $\La$ and $\Z$ The model selection is performed according to \citep{Lee:01,Paatero:97}. \emph{objective:} \begin{align*}\| \A \ - \ \B \|_{F}^2 \ = \ \sum_{ij} \left( A_{ij} \ - \ B_{ij} \right)^2\end{align*} \emph{update:} \begin{align*}L_{ik} \ = \ L_{ik} \frac{\left( \La^T \ \X \right)_{ik}}{ \left( \La^T \ \La \ \Z\right)_{ik}} \end{align*} \begin{align*}Z_{ji} \ = \ Z_{ji} \frac{\left( \X \ \Z^T \right)_{ji}}{ \left( \La \ \Z \ \Z^T \right)_{ji}} \end{align*} or in matrix notation with ``$*$'' and ``$/$'' as element-wise operators: \begin{align*}\Z \ = \ \Z*(t(\La) \ \X)\ / \ (t(\La) \ \La \ \Z)\end{align*} \begin{align*}\La \ = \ \La*(\X \ t(\Z)) \ / \ (\La \ \Z \ t(\Z))\end{align*} The code is implemented in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] X <- abs(X) resEx <- nmfeu(X,3) #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] X <- abs(X) resToy <- nmfeu(X,13) extractPlot(resToy,ti="NMFEU",Y=Y) \end{Sinput} \subsection{nmfsc} Non-negative Sparse Matrix Factorization with sparseness constraints. \begin{enumerate} \item Usage: \texttt{nmfsc(X,p=5,cyc=100,sL=0.6,sZ=0.6)} \item Arguments: \begin{itemize} \item{X:} the data matrix. \item{p:} number of hidden factors = number of biclusters; default = 5. \item{cyc:} maximal number of iterations; default = 100. \item{sL:} sparseness loadings; default = 0.6. \item{sZ:} sparseness factors; default = 0.6. \end{itemize} \item Return Values: \begin{itemize} \item object of the class \texttt{Factorization}. Containing \texttt{LZ} (estimated noise free data $\La \Z$), \texttt{L} (loadings $\La$), \texttt{Z} (factors $\Z$), \texttt{U} (noise $\X - \La \Z$), \texttt{X} (data $\X$). \end{itemize} \end{enumerate} Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ . \end{align*} Here $\X \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. If the nonzero components of the sparse vectors are grouped together then the outer product results in a matrix with a nonzero block and zeros elsewhere. For a single data vector $\x$ that is \begin{align*}\x \ = \ \sum_{i=1}^{p} \la_i z_i \ = \ \La \z \end{align*} \emph{Estimated Parameters:} $\La$ and $\Z$ \emph{Estimated Biclusters:} $\la_i \ \z_i^T $ Larges values give the bicluster (ideal the nonzero values). The model selection is performed by a constraint optimization according to \citep{Hoyer:04}. The Euclidean distance (the Frobenius norm) is minimized subject to sparseness and non-negativity constraints: \begin{align*}\min_{ \La, \Z} \left\|\x \ - \ \La \ \Z\right\|_{F}^{2} \end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\La\right\|_{F}^{2} \ = \ 1\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\La\right\|_{1} \ = \ k_L\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \La \ \geq \ \0 \end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\Z\right\|_{F}^{2} \ = \ 1\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \left\|\Z\right\|_{1} \ = \ k_Z\end{align*} \begin{align*}\mathrm{subject} \ \mathrm{to} \ \ \Z \ \geq \ \0 \end{align*} Model selection is done by gradient descent on the Euclidean objective and thereafter projection of single vectors of $\La$ and single vectors of $\Z$ to fulfill the sparseness and non-negativity constraints. The projection minimize the Euclidean distance to the original vector given an $l_1$-norm and an $l_2$-norm and enforcing non-negativity. The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. Instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: \begin{align*}\mathrm{sparseness}(\la_i) \ = \ \frac{\sqrt{n} \ - \ \sum_{k=1}^{n} \left|\lambda_{ki} \right|\ / \ \sum_{k=1}^{n} \lambda_{ki}^2}{\sqrt{n} \ - \ 1} \end{align*} The code is implemented in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] X <- abs(X) resEx <- nmfsc(X,3,30,0.6,0.6) #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] X <- abs(X) resToy <- nmfsc(X,13,100,0.6,0.6) extractPlot(resToy,ti="NMFSC",Y=Y) \end{Sinput} \section{Analyzing the Results of Biclustering / Matrix Factorization} \subsection{plot} Plot of a Matrix Factorization. Produces a (biplot) of a Matrix Factorization result. \begin{enumerate} \item Usage: \texttt{plot(x,y, ...) } \item Arguments: \begin{itemize} \item{x:} object of the class \texttt{Factorization}. \item{y:} missing, not used. \item{...:} with following parameters \begin{itemize} \item{Rm:} row weighting vector. \item{Cm:} column weighting vector. \item{dim:} optional principal factors that are plotted along the horizontal and vertical axis. Defaults to \texttt{c(1,2)}. \item{zoom:} optional zoom factor for row and column items. Defaults to \texttt{c(1,1)}. \item{col.group:} optional vector (character or numeric) indicating the different groupings of the columns. Defaults to 1. \item{colors:} vector specifying the colors for the annotation of the plot; the first two elements concern the rows; the third till the last element concern the columns; the first element will be used to color the unlabeled rows; the second element for the labeled rows and the remaining elements to give different colors to different groups of columns. \item{col.areas:} logical value indicating whether columns should be plotted as squares with areas proportional to their marginal mean and colors representing the different groups (\texttt{TRUE}), or with symbols representing the groupings and identical size (\texttt{FALSE}). Defaults to \texttt{TRUE}. \item{col.symbols:} vector of symbols when \texttt{col.areas=FALSE} corresponds to the \texttt{pch} argument of the function \texttt{plot}. \item{sampleNames:} Either a logical vector of length one or a character vector of length equal to the number of samples in the dataset. If a logical is provided, sample names will be displayed on the plot (\texttt{TRUE}; default) or not (\texttt{FALSE}); if a character vector is provided, the names provided will be used to label the samples instead of the default column names. \item{rot:} rotation of plot. Defaults to \texttt{c(-1,-1)}. \item{labels:} character vector to be used for labeling points on the graph; if \texttt{NULL}, the row names of \texttt{x} are used instead. \item{label.tol:} numerical value specifying either the percentile (\texttt{label.tol<=1})of rows or the number of rows (\texttt{label.tol>1}) most distant from the plot-center \texttt{(0,0)} that are labeled and are plotted as circles with area proportional to the marginal means of the original data. \item{lab.size:} size of identifying labels for row- and column-items as \texttt{cex} parameter of the \texttt{text} function. \item{col.size:} size in mm of the column symbols. \item{row.size:} size in mm of the row symbols. \item{do.smoothScatter:} use smoothScatter or not instead of plotting individual points. \item{do.plot:} produce a plot or not. \item{...:} further arguments to \texttt{eqscaleplotLoc} which draws the canvasfor the plot; useful for adding a \texttt{main} or a custom \texttt{sub}. \end{itemize} and with the default values \begin{itemize} \item Rm=rep(1,nrow(X)), \item Cm=rep(1,ncol(X)), \item dim = c(1, 2), \item zoom = rep(1, 2), \item col.group = rep(1, ncol(X)), \item colors = c("orange1", "red", rainbow(length(unique(col.group)), start=2/6, end=4/6)), \item col.areas = TRUE, \item col.symbols = c(1, rep(2, length(unique(col.group)))), \item sampleNames = TRUE, \item rot = rep(-1, length(dim)), \item labels = NULL, \item label.tol = 1, \item lab.size = 0.725, \item col.size = 10, \item row.size = 10, \item do.smoothScatter = FALSE, \item do.plot = TRUE, \end{itemize} \end{itemize} \item Return Values: \begin{itemize} \item{Rows:} a list with the X and Y coordinates of the rows and an indication \texttt{Select} of whether the row was selected according to \texttt{label.tol}. \item{Columns:} a list with the X and Y coordinates of the columns. \end{itemize} \end{enumerate} The function \texttt{plot.fabia} is based on the function \texttt{plot.mpm} in the \R package \texttt{mpm} (Version: 1.0-16, Date: 2009-08-26, Title: Multivariate Projection Methods, Maintainer: Tobias Verbeke <tobias.verbeke@openanalytics.be>, Author: Luc Wouters <wouters\_luc@telenet.be>). Biclusters are found by sparse factor analysis where \emph{both} the factors and the loadings are sparse. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. For noise free projection like independent component analysis we set the noise term to zero: $\Up=\0$. The argument \texttt{label.tol} can be used to select the most informative rows, i.e.\ rows that are most distant from the center of the plot (smaller 1: percentage of rows, larger 1: number of rows). Only these row-items are then labeled and represented as circles with their areas proportional to the row weighting. If the column-items are grouped these groups can be visualized by colors given by \texttt{col.group}. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} n=200 l=100 p=4 dat <- makeFabiaDataBlocks(n = n,l= l,p = p,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] ZC <- dat[[3]] LC <- dat[[4]] resEx <- fabia(X,p,0.01,400) gclab <- rep.int(0,l) gllab <- rep.int(0,n) clab <- as.character(1:l) llab <- as.character(1:n) for (i in 1:p){ for (j in ZC[i]){ clab[j] <- paste(as.character(i),"_",clab[j],sep="") } for (j in LC[i]){ llab[j] <- paste(as.character(i),"_",llab[j],sep="") } gclab[unlist(ZC[i])] <- gclab[unlist(ZC[i])] + p^i gllab[unlist(LC[i])] <- gllab[unlist(LC[i])] + p^i } labels <- paste(as.character(clab) ,"_",as.character(1:l),sep="") groups <- gclab colnames(resEx@X) <- clab rlabels <- paste(as.character(llab) ,"_",as.character(1:l),sep="") rownames(resEx@X) <- llab plot(resEx,dim=c(1,2),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resEx,dim=c(1,3),label.tol=0.1,col.group = groups,lab.size=0.6) plot(resEx,dim=c(2,3),label.tol=0.1,col.group = groups,lab.size=0.6) \end{Sinput} \subsection{show} Plots Statistics of a Matrix Factorization. This function plots statistics on a matrix factorization which is stored as an instance of \texttt{Factorization-class}. \begin{enumerate} \item Usage: \texttt{show(object)} \item Arguments: \begin{itemize} \item{object:} An instance of \texttt{Factorization-class}. \end{itemize} \item Return Values: \begin{itemize} \item no value. \end{itemize} \end{enumerate} Following is plotted: 1) the information content of biclusters. 2) the information content of samples. 3) the loadings per bicluster. 4) the factors per bicluster. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] resEx <- fabia(X,3,0.01,100) show(resEx) \end{Sinput} \subsection{showSelected} Plots selected Statistics of a Matrix Factorization. This function plots selected statistics on a matrix factorization which is stored as an instance of \texttt{Factorization-class}. \begin{enumerate} \item Usage: \texttt{showSelected(object, which=c(1,2,3,4))} \item Arguments: \begin{itemize} \item{object:} An instance of \texttt{Factorization-class}. \item{which:} A list of which plots should be generated: 1=the information content of biclusters, 2=the information content of samples, 3=the loadings per bicluster, 4=the factors per bicluster, default c(1,2,3,4). \end{itemize} \item Return Values: \begin{itemize} \item no value. \end{itemize} \end{enumerate} Following is plotted depending on the plot selection variable "which": 1) the information content of biclusters. 2) the information content of samples. 3) the loadings per bicluster. 4) the factors per bicluster. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] resEx <- fabia(X,3,0.01,100) showSelected(resEx,which=1) showSelected(resEx,which=2) \end{Sinput} \subsection{summary} Summary of Matrix Factorization. This function gives information on a matrix factorization which is stored as an instance of \texttt{Factorization-class}. \begin{enumerate} \item Usage: \texttt{summary(object, ...)} \item Arguments: \begin{itemize} \item{object:} An instance of \texttt{Factorization-class}. \item{...:} further arguments. \end{itemize} \item Return Values: \begin{itemize} \item no value. \end{itemize} \end{enumerate} First, it gives the number or rows and columns of the original matrix. Then the number of clusters for rows and columns is given. Then for the row cluster the information content is given. Then for each column its information is given. Then for each column cluster a summary is given. Then for each row cluster a summary is given. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] resEx <- fabia(X,3,0.01,100) summary(resEx) \end{Sinput} \subsection{extractBic} Extraction of Biclusters. \begin{enumerate} \item Usage: \texttt{extractBic(fact,thresZ=0.5,thresL=NULL)} \item Arguments: \begin{itemize} \item{fact:} object of class \texttt{Factorization}. \item{thresZ:} threshold for sample belonging to bicluster; default 0.5. \item{thresL:} threshold for loading belonging to bicluster (if not given it is estimated). \end{itemize} \item Return Values: \begin{itemize} \item{bic:} extracted biclusters. \item{numn:} indexes for the extracted biclusters. \item{bicopp:} extracted opposite biclusters. \item{numnopp:} indexes for the extracted opposite biclusters. \item{X:} scaled and centered data matrix. \item{np:} number of biclusters. \end{itemize} \end{enumerate} Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. $\Up$ is the Gaussian noise with a diagonal covariance matrix which entries are given by \texttt{Psi}. The $\Z$ is locally approximated by a Gaussian with inverse variance given by \texttt{lapla}. The $\la_i $ and $\z_i $ are used to extract the bicluster $i$, where a threshold determines which observations and which samples belong the the bicluster. In \texttt{bic} the biclusters are extracted according to the largest absolute values of the component $i$, i.e. the largest values of $\la_i $ and the largest values of $\z_i $. The factors $\z_i $ are normalized to variance 1. The components of \texttt{bic} are \texttt{binp}, \texttt{bixv}, \texttt{bixn}, \texttt{biypv}, and \texttt{biypn}. \texttt{binp} give the size of the bicluster: number observations and number samples. \texttt{bixv} gives the values of the extracted observations that have absolute values above a threshold. They are sorted. \texttt{bixn} gives the extracted observation names (e.g. gene names). \texttt{biypv} gives the values of the extracted samples that have absolute values above a threshold. They are sorted. \texttt{biypn} gives the names of the extracted samples (e.g. sample names). In \texttt{bicopp} the opposite cluster to the biclusters are give. Opposite means that the negative pattern is present. The components of opposite clusters \texttt{bicopp} are \texttt{binn}, \texttt{bixv}, \texttt{bixn}, \texttt{biypnv}, and \texttt{biynn}. \texttt{binp} give the size of the opposite bicluster: number observations and number samples. \texttt{bixv} gives the values of the extracted observations that have absolute values above a threshold. They are sorted. \texttt{bixn} gives the extracted observation names (e.g. gene names). \texttt{biynv} gives the values of the opposite extracted samples that have absolute values above a threshold. They are sorted. \texttt{biynn} gives the names of the opposite extracted samples (e.g.\ sample names). That means the samples are divided into two groups where one group shows large positive values and the other group has negative values with large absolute values. That means a observation pattern can be switched on or switched off relative to the average value. \texttt{numn} gives the indexes of \texttt{bic} with components: \texttt{numng} = \texttt{bix} and \texttt{numnp} = \texttt{biypn}. \texttt{numn} gives the indexes of \texttt{bicopp} with components: \texttt{numng} = \texttt{bix} and \texttt{numnn} = \texttt{biynn}. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabia(X,3,0.01,20) rEx <- extractBic(resEx) rEx$bic[1,] rEx$bic[2,] rEx$bic[3,] #--------------- # DEMO1 #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resToy <- fabia(X,13,0.01,200) rToy <- extractBic(resToy) resToy@avini rToy$bic[1,] rToy$bic[2,] rToy$bic[3,] #--------------- # DEMO2 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast <- fabia(X,5,0.1,200) rBreast <- extractBic(resBreast) resBreast@avini rBreast$bic[1,] rBreast$bic[2,] rBreast$bic[3,] } \end{Sinput} \subsection{extractPlot} Plotting of Biclustering Results. \begin{enumerate} \item Usage: \texttt{extractPlot(fact,thresZ=0.5,ti="",thresL=NULL,Y=NULL,which=c(1,2,3,4,5,6,7,8))} \item Arguments: \begin{itemize} \item{fact:} object of class \texttt{Factorization}. \item{thresZ:} threshold for sample belonging to bicluster; default 0.5. \item{thresL:} threshold for loading belonging to bicluster (estimated if not given). \item{ti:} plot title; default "". \item{Y:} noise free data matrix. \item{which:} which plot is shown: 1=noise free data (if available), 2=data, 3=reconstructed data, 4=error, 5=absolute factors, 6=absolute loadings, 7=reconstructed matrix sorted,8=original matrix sorted; default c(1,2,3,4,5,6,7,8). \end{itemize} \item The method produces following plots depending what plots are chosen by the "which" variable: \begin{itemize} \item{$\Y$}: noise free data (if available), \item{$\X$}: data, \item{$\La \Z$}: reconstructed data, \item{$\La \Z - \X$}: error, \item{$\mathrm{abs}(\Z)$}: absolute factors, \item{$\mathrm{abs}(\L)$}: absolute loadings, \item{$\bm{\mathrm{pmL}} \L \Z \bm{\mathrm{pmZ}}$}: reconstructed matrix sorted, \item{$\bm{\mathrm{pmL}} \X \bm{\mathrm{pmZ}}$}: original matrix sorted. \end{itemize} \end{enumerate} Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. The $\la_i $ and $\z_i^T$ are used to extract the bicluster $i$, where a threshold determines which observations and which samples belong the the bicluster. The method produces the plots given above at ``Plots''. For sorting first \texttt{kmeans} is performed on the $p$ dimensional space and then the vectors which belong to the same cluster are put together. This sorting is in general not able to visualize all biclusters as blocks if they overlap. The kmeans clusters are given by \texttt{biclust} with components \texttt{biclustx} (the clustered observations) and \texttt{biclusty} (the clustered samples). Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabia(X,3,0.01,20) extractPlot(resEx,ti="FABIA",Y=Y) #--------------- # DEMO1 #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resToy <- fabia(X,13,0.01,200) extractPlot(resToy,ti="FABIA",Y=Y) #--------------- # DEMO2 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast <- fabia(X,5,0.1,200) extractPlot(resBreast,ti="FABIA Breast cancer(Veer)") #sorting of predefined labels CBreast%*%rBreast$pmZ } \end{Sinput} \subsection{plotBicluster} Plots a bicluster. \begin{enumerate} \item Usage: \texttt{plotBicluster(r,p,opp=FALSE,zlim=NULL,title=NULL,which=c(1, 2))} \item Arguments: \begin{itemize} \item{r:} the result of extractBic. \item{p:} the bicluster to plot. \item{opp:} plot opposite bicluster, default = FALSE. \item{zlim:} vector containing a low and high value to use for the color scale. \item{title:} title of the plot. \item{which:} which plots are shown: 1=data matrix with bicluster on upper left, 2=plot of the bicluster; default c(1, 2). \end{itemize} \end{enumerate} One bicluster is visualized by two plots. The variable "which" indicates which plots should be shown. Plot1 (which=1): The data matrix is sorted such that the bicluster appear at the upper left corner. The bicluster is marked by a rectangle. Plot2 (which=2): Only the bicluster is presented. Implementation in \R. \begin{Sinput} #--------------- # TEST #--------------- dat <- makeFabiaDataBlocks(n = 100,l= 50,p = 3,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resEx <- fabia(X,3,0.01,20) rEx <- extractBic(resEx) plotBicluster(rEx,p=1) \dontrun{ #--------------- # DEMO1 #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] resToy <- fabia(X,13,0.01,200) rToy <- extractBic(resToy) plotBicluster(rToy,p=1) #--------------- # DEMO2 #--------------- avail <- require(fabiaData) if (!avail) { message("") message("") message("#####################################################") message("Package 'fabiaData' is not available: please install.") message("#####################################################") } else { data(Breast_A) X <- as.matrix(XBreast) resBreast <- fabia(X,5,0.1,200) rBreast <- extractBic(resBreast) plotBicluster(rBreast,p=1) } \end{Sinput} \section{Data Generation Methods} \subsection{makeFabiaData} Generation of bicluster data. \begin{enumerate} \item Usage: \verb+makeFabiaData(n,l,p,f1,f2,of1,of2,sd_noise,sd_z_noise,mean_z,+\\ \hspace*{6em}\verb+ sd_z,sd_l_noise,mean_l,sd_l)+ \item Arguments: \begin{itemize} \item{n.} number of observations. \item{l:} number of samples. \item{p:} number of biclusters. \item{f1:} $l/f1$ max. additional samples are active in a bicluster. \item{f2:} $n/f2$ max. additional observations that form a pattern in a bicluster. \item{of1:} minimal active samples in a bicluster. \item{of2:} menial observations that form a pattern in a bicluster. \item{sd\_noise:} Gaussian zero mean noise std on data matrix. \item{sd\_z\_noise:} Gaussian zero mean noise std for deactivated hidden factors. \item{mean\_z:} Gaussian mean for activated factors. \item{sd\_z:} Gaussian std for activated factors. \item{sd\_l\_noise:} Gaussian zero mean noise std if no observation patterns are present. \item{mean\_l:} Gaussian mean for observation patterns. \item{sd\_l:} Gaussian std for observation patterns. \end{itemize} \item Return values: \begin{itemize} \item{X:} the noisy data $\X$ from $\mathbb{R}^{n \times l}$. \item{Y:} the noise free data $\Y$ from $\mathbb{R}^{n \times l}$. \item{ZC:} list where $i$th element gives samples belonging to $i$th bicluster. \item{LC:} list where $i$th element gives observations belonging to $i$th bicluster. \end{itemize} \end{enumerate} Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. Sequentially $\la_i$ are generated using \texttt{n}, \texttt{f2}, \texttt{of2}, \texttt{sd\_l\_noise}, \texttt{mean\_l}, \texttt{sd\_l}. \texttt{of2} gives the minimal observations participating in a bicluster to which between 0 and $n/f2$ observations are added, where the number is uniformly chosen. \texttt{sd\_l\_noise} gives the noise of observations not participating in the bicluster. \texttt{mean\_l} and \texttt{sd\_l} determines the Gaussian from which the values are drawn for the observations that participate in the bicluster. The sign of the mean is randomly chosen for each component. Sequentially $\z_i$ are generated using \texttt{l}, \texttt{f1}, \texttt{of1}, \texttt{sd\_z\_noise}, \texttt{mean\_z}, \texttt{sd\_z}. \texttt{of1} gives the minimal samples participating in a bicluster to which between 0 and $l/f1$ samples are added, where the number is uniformly chosen. \texttt{sd\_z\_noise} gives the noise of samples not participating in the bicluster. \texttt{mean\_z} and \texttt{sd\_z} determines the Gaussian from which the values are drawn for the samples that participate in the bicluster. $\Up$ is the overall Gaussian zero mean noise generated by \texttt{sd\_noise}. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- dat <- makeFabiaData(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] matrixImagePlot(Y) dev.new() matrixImagePlot(X) \end{Sinput} \subsection{makeFabiaDataPos} Generation of bicluster data. \begin{enumerate} \item Usage: \verb+makeFabiaDataPos(n,l,p,f1,f2,of1,of2,sd_noise,sd_z_noise,+\\ \hspace*{6em}\verb+ mean_z,sd_z,sd_l_noise,mean_l,sd_l)+ \item Arguments: \begin{itemize} \item{n.} number of observations. \item{l:} number of samples. \item{p:} number of biclusters. \item{f1:} $l/f1$ max. additional samples are active in a bicluster. \item{f2:} $n/f2$ max. additional observations that form a pattern in a bicluster. \item{of1:} minimal active samples in a bicluster. \item{of2:} menial observations that form a pattern in a bicluster. \item{sd\_noise:} Gaussian zero mean noise std on data matrix. \item{sd\_z\_noise:} Gaussian zero mean noise std for deactivated hidden factors. \item{mean\_z:} Gaussian mean for activated factors. \item{sd\_z:} Gaussian std for activated factors. \item{sd\_l\_noise:} Gaussian zero mean noise std if no observation patterns are present. \item{mean\_l:} Gaussian mean for observation patterns. \item{sd\_l:} Gaussian std for observation patterns. \end{itemize} \item Return values: \begin{itemize} \item{X:} the noisy data $\X$ from $\mathbb{R}^{n \times l}$. \item{Y:} the noise free data $\Y$ from $\mathbb{R}^{n \times l}$. \item{ZC:} list where $i$th element gives samples belonging to $i$th bicluster. \item{LC:} list where $i$th element gives observations belonging to $i$th bicluster. \end{itemize} \end{enumerate} Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. Sequentially $\la_i$ are generated using \texttt{n}, \texttt{f2}, \texttt{of2}, \texttt{sd\_l\_noise}, \texttt{mean\_l}, \texttt{sd\_l}. \texttt{of2} gives the minimal observations participating in a bicluster to which between 0 and $n/f2$ observations are added, where the number is uniformly chosen. \texttt{sd\_l\_noise} gives the noise of observations not participating in the bicluster. \texttt{mean\_l} and \texttt{sd\_l} determines the Gaussian from which the values are drawn for the observations that participate in the bicluster. "POS": The sign of the mean is fixed. Sequentially $\z_i$ are generated using \texttt{l}, \texttt{f1}, \texttt{of1}, \texttt{sd\_z\_noise}, \texttt{mean\_z}, \texttt{sd\_z}. \texttt{of1} gives the minimal samples participating in a bicluster to which between 0 and $l/f1$ samples are added, where the number is uniformly chosen. \texttt{sd\_z\_noise} gives the noise of samples not participating in the bicluster. \texttt{mean\_z} and \texttt{sd\_z} determines the Gaussian from which the values are drawn for the samples that participate in the bicluster. $\Up$ is the overall Gaussian zero mean noise generated by \texttt{sd\_noise}. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- dat <- makeFabiaDataPos(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] matrixImagePlot(Y) dev.new() matrixImagePlot(X) \end{Sinput} \subsection{makeFabiaDataBlocks} Generation of bicluster data with bicluster blocks. \begin{enumerate} \item Usage: \verb+makeFabiaDataBlocks(n,l,p,f1,f2,of1,of2,sd_noise,sd_z_noise,+\\ \hspace*{6em}\verb+ mean_z,sd_z,sd_l_noise,mean_l,sd_l)+ \item Arguments: \begin{itemize} \item{n:} number of observations. \item{l:} number of samples. \item{p:} number of biclusters. \item{f1:} $l/f1$ max. additional samples are active in a bicluster. \item{f2:} $n/f2$ max. additional observations that form a pattern in a bicluster. \item{of1:} minimal active samples in a bicluster. \item{of2:} minimal observations that form a pattern in a bicluster. \item{sd\_noise:} Gaussian zero mean noise std on data matrix. \item{sd\_z\_noise:} Gaussian zero mean noise std for deactivated hidden factors. \item{mean\_z:} Gaussian mean for activated factors. \item{sd\_z:} Gaussian std for activated factors. \item{sd\_l\_noise:} Gaussian zero mean noise std if no observation patterns are present. \item{mean\_l:} Gaussian mean for observation patterns. \item{sd\_l:} Gaussian std for observation patterns. \end{itemize} \item Return Values: \begin{itemize} \item{X:} the noisy data $\X$ from $\mathbb{R}^{n \times l}$. \item{Y:} the noise free data $\Y$ from $\mathbb{R}^{n \times l}$. \item{ZC:} list where $i$th element gives samples belonging to $i$th bicluster. \item{LC:} list where $i$th element gives observations belonging to $i$th bicluster. \end{itemize} \end{enumerate} Bicluster data is generated for visualization because the biclusters are now in block format. That means observations and samples that belong to a bicluster are consecutive. This allows visual inspection because the use can identify blocks and whether they have been found or reconstructed. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. Sequentially $\la_i$ are generated using \texttt{n}, \texttt{f2}, \texttt{of2}, \texttt{sd\_l\_noise}, \texttt{mean\_l}, \texttt{sd\_l}. \texttt{of2} gives the minimal observations participating in a bicluster to which between 0 and $n/f2$ observations are added, where the number is uniformly chosen. \texttt{sd\_l\_noise} gives the noise of observations not participating in the bicluster. \texttt{mean\_l} and \texttt{sd\_l} determines the Gaussian from which the values are drawn for the observations that participate in the bicluster. The sign of the mean is randomly chosen for each component. Sequentially $\z_i$ are generated using \texttt{l}, \texttt{f1}, \texttt{of1}, \texttt{sd\_z\_noise}, \texttt{mean\_z}, \texttt{sd\_z}. \texttt{of1} gives the minimal samples participating in a bicluster to which between 0 and $l/f1$ samples are added, where the number is uniformly chosen. \texttt{sd\_z\_noise} gives the noise of samples not participating in the bicluster. \texttt{mean\_z} and \texttt{sd\_z} determines the Gaussian from which the values are drawn for the samples that participate in the bicluster. $\Up$ is the overall Gaussian zero mean noise generated by \texttt{sd\_noise}. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) Y <- dat[[1]] X <- dat[[2]] matrixImagePlot(Y) dev.new() matrixImagePlot(X) \end{Sinput} \subsection{makeFabiaDataBlocksPos} Generation of bicluster data with bicluster blocks. \begin{enumerate} \item Usage: \verb+makeFabiaDataBlocksPos(n,l,p,f1,f2,of1,of2,sd_noise,+\\ \hspace*{6em}\verb+ sd_z_noise,mean_z,sd_z,sd_l_noise,mean_l,sd_l)+ \item Arguments: \begin{itemize} \item{n:} number of observations. \item{l:} number of samples. \item{p:} number of biclusters. \item{f1:} $l/f1$ max. additional samples are active in a bicluster. \item{f2:} $n/f2$ max. additional observations that form a pattern in a bicluster. \item{of1:} minimal active samples in a bicluster. \item{of2:} minimal observations that form a pattern in a bicluster. \item{sd\_noise:} Gaussian zero mean noise std on data matrix. \item{sd\_z\_noise:} Gaussian zero mean noise std for deactivated hidden factors. \item{mean\_z:} Gaussian mean for activated factors. \item{sd\_z:} Gaussian std for activated factors. \item{sd\_l\_noise:} Gaussian zero mean noise std if no observation patterns are present. \item{mean\_l:} Gaussian mean for observation patterns. \item{sd\_l:} Gaussian std for observation patterns. \end{itemize} \item Return Values: \begin{itemize} \item{X:} the noisy data $\X$ from $\mathbb{R}^{n \times l}$. \item{Y:} the noise free data $\Y$ from $\mathbb{R}^{n \times l}$. \item{ZC:} list where $i$th element gives samples belonging to $i$th bicluster. \item{LC:} list where $i$th element gives observations belonging to $i$th bicluster. \end{itemize} \end{enumerate} Bicluster data is generated for visualization because the biclusters are now in block format. That means observations and samples that belong to a bicluster are consecutive. This allows visual inspection because the use can identify blocks and whether they have been found or reconstructed. Essentially the model is the sum of outer products of vectors: \begin{align*} \X \ = \ \sum_{i=1}^{p} \la_i \ \z_i^T \ + \ \Up \ , \end{align*} where the number of summands $p$ is the number of biclusters. The matrix factorization is \begin{align*} \X \ = \ \La \ \Z \ + \ \Up \ . \end{align*} Here $\X \in \Rb^{n \times l}$ and $\Up \in \Rb^{n \times l}$, $\la_i \in \Rb^n$, $\z_i \in \Rb^l$, $\La \in \Rb^{n \times p}$, $\Z \in \Rb^{p \times l}$. Sequentially $\la_i$ are generated using \texttt{n}, \texttt{f2}, \texttt{of2}, \texttt{sd\_l\_noise}, \texttt{mean\_l}, \texttt{sd\_l}. \texttt{of2} gives the minimal observations participating in a bicluster to which between 0 and $n/f2$ observations are added, where the number is uniformly chosen. \texttt{sd\_l\_noise} gives the noise of observations not participating in the bicluster. \texttt{mean\_l} and \texttt{sd\_l} determines the Gaussian from which the values are drawn for the observations that participate in the bicluster. "POS": The sign of the mean is fixed. Sequentially $\z_i$ are generated using \texttt{l}, \texttt{f1}, \texttt{of1}, \texttt{sd\_z\_noise}, \texttt{mean\_z}, \texttt{sd\_z}. \texttt{of1} gives the minimal samples participating in a bicluster to which between 0 and $l/f1$ samples are added, where the number is uniformly chosen. \texttt{sd\_z\_noise} gives the noise of samples not participating in the bicluster. \texttt{mean\_z} and \texttt{sd\_z} determines the Gaussian from which the values are drawn for the samples that participate in the bicluster. $\Up$ is the overall Gaussian zero mean noise generated by \texttt{sd\_noise}. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocksPos(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) Y <- dat[[1]] X <- dat[[2]] matrixImagePlot(Y) dev.new() matrixImagePlot(X) \end{Sinput} \section{Utilities} \subsection{fabiaVersion} Display version info for package and for FABIA. \begin{enumerate} \item Usage: \texttt{fabiaVersion()} \end{enumerate} \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} fabiaVersion() \end{Sinput} \subsection{matrixImagePlot} Plotting of a matrix. \begin{enumerate} \item Usage: \texttt{matrixImagePlot(x,xLabels=NULL, yLabels=NULL, zlim=NULL, title=NULL)} \item Arguments: \begin{itemize} \item{x:} the matrix. \item{xLabels:} vector of strings to label the rows (default "rownames(x)"). \item{yLabels:} vector of strings to label the columns (default "colnames(x)"). \item{zlim:} vector containing a low and high value to use for the color scale. \item{title:} title of the plot. \end{itemize} \end{enumerate} Plotting a table of numbers as an image using \R. The color scale is based on the highest and lowest values in the matrix. Program has been obtained at \href{http://www.phaget4.org/R/myImagePlot.R}{http://www.phaget4.org/R/myImagePlot.R} \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocks(n = 1000,l= 100,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 3.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] Y <- dat[[2]] matrixImagePlot(X) \end{Sinput} \subsection{projFuncPos} Projection of a vector to a sparse non-negative vector with given sparseness and given $l_2$-norm. \begin{enumerate} \item Usage: \texttt{projFuncPos(s, k1, k2)} \item Arguments: \begin{itemize} \item{s:} data vector. \item{k1:} sparseness, $l_1$ norm constraint. \item{k2:} $l_2$ norm constraint. \end{itemize} \item Return Values: \begin{itemize} \item{v:} non-negative sparse projected vector. \end{itemize} \end{enumerate} The projection minimize the Euclidean distance to the original vector given an $l_1$-norm and an $l_2$-norm and enforcing non-negativity. The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. In the applications, instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: \begin{align*}\mathrm{sparseness}(\vb) \ = \ \frac{\sqrt{l} \ - \ \sum_{j=1}^{l} \left|v_{j} \right|\ / \ \sum_{j=1}^{l} v_{j}^2}{\sqrt{l} \ - \ 1} \end{align*} The code is implemented in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- size <- 30 sparseness <- 0.7 s <- as.vector(rnorm(size)) sp <- sqrt(1.0*size)-(sqrt(1.0*size)-1.0)*sparseness ss <- projFuncPos(s,k1=sp,k2=1) s ss \end{Sinput} \subsection{projFunc} Projection of a vector to a sparse vector with given sparseness and given $l_2$-norm. \begin{enumerate} \item Usage: \texttt{projFunc(s, k1, k2)} \item Arguments: \begin{itemize} \item{s:} data vector. \item{k1:} sparseness, $l_1$ norm constraint. \item{k2:} $l_2$ norm constraint. \end{itemize} \item Return Values: \begin{itemize} \item{v:} sparse projected vector. \end{itemize} \end{enumerate} The projection is done according to \citep{Hoyer:04}: given an $l_1$-norm and an $l_2$-norm minimize the Euclidean distance to the original vector. The projection is a convex quadratic problem which is solved iteratively where at each iteration at least one component is set to zero. In the applications, instead of the $l_1$-norm a sparseness measurement is used which relates the $l_1$-norm to the $l_2$-norm: \begin{align*}\mathrm{sparseness}(\vb) \ = \ \frac{\sqrt{l} \ - \ \sum_{j=1}^{l} \left|v_{j} \right|\ / \ \sum_{j=1}^{l} v_{j}^2}{\sqrt{l} \ - \ 1} \end{align*} The code is implemented in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- size <- 30 sparseness <- 0.7 s <- as.vector(rnorm(size)) sp <- sqrt(1.0*size)-(sqrt(1.0*size)-1.0)*sparseness ss <- projFunc(s,k1=sp,k2=1) s ss \end{Sinput} \subsection{estimateMode} Estimation of the modes of the rows of a matrix. \begin{enumerate} \item Usage: \texttt{estimateMode(X,maxiter=50,tol=0.001,alpha=0.1,a1=4.0,G1=FALSE)} \item Arguments: \begin{itemize} \item{X:} matrix of which the modes of the rows are estimated. \item{maxiter:} maximal number of iterations; default = 50. \item{tol:} tolerance for stopping; default = 0.001. \item{alpha:} learning rate; default = 0.1. \item{a1:} parameter of the width of the given distribution; default = 4. \item{G1:} kind of distribution, TRUE: $\frac{1}{a_1} \ln (\cosh (a1 x))$, FALSE: $- \frac{1}{a_1} \exp(- \frac{a_1}{2} x^2)$; default = FALSE. \end{itemize} \item Return Values: \begin{itemize} \item{u:} the vector of estimated modes. \item{xu:} $\X-\u\1^T$ the mode centered data. \end{itemize} \end{enumerate} The mode is estimated by contrast functions G1 $\frac{1}{a_1} \ln (\cosh (a1 x))$ or G2 $- \frac{1}{a_1} \exp(- \frac{a_1}{2} x^2)$. The estimation is performed by gradient descent initialized by the median. Implementation in \R. \vspace{0.5cm} \noindent EXAMPLE: \begin{Sinput} #--------------- # DEMO #--------------- dat <- makeFabiaDataBlocksPos(n = 100,l= 50,p = 10,f1 = 5,f2 = 5, of1 = 5,of2 = 10,sd_noise = 2.0,sd_z_noise = 0.2,mean_z = 2.0, sd_z = 1.0,sd_l_noise = 0.2,mean_l = 3.0,sd_l = 1.0) X <- dat[[1]] modes <- estimateMode(X) modes$u - apply(X, 1, median) \end{Sinput} \subsection{eqscplotLoc} Local Version of \texttt{eqscplot} from the package MASS. Plots with Geometrically Equal Scales: Version of a scatterplot with scales chosen to be equal on both axes, that is 1cm represents the same units on each. \begin{enumerate} \item Usage: \texttt{eqscplotLoc(x, y, ratio = 1, tol = 0.04, uin, ...)} \item Arguments: \begin{itemize} \item{x:} vector of x values, or a 2-column matrix, or a list with components ``x'' and ``y''. \item{y:} vector of y values. \item{ratio:} desired ratio of units on the axes. Units on the y axis are drawn at ``ratio'' times the size of units on the x axis. Ignored if ``uin'' is specified and of length 2. \item{tol:} proportion of white space at the margins of plot. \item{uin:} desired values for the units-per-inch parameter. If of length 1, the desired units per inch on the x axis. \item{...:} further arguments for ``plot'' and graphical parameters. Note that ``par(xaxs="i", yaxs="i")'' is enforced, and ``xlim'' and ``ylim'' will be adjusted accordingly. \end{itemize} \item Return Values: \begin{itemize} \item invisibly, the values of ``uin'' used for the plot. \end{itemize} \end{enumerate} Plots with Geometrically Equal Scales eqscplot from the package MASS (Version: 7.3-3, 2009-10-15 Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>). To make the package stand alone, this function is included as source. Limits for the x and y axes are chosen so that they include the data. One of the sets of limits is then stretched from the midpoint to make the units in the ratio given by ``ratio''. Finally both are stretched by ``1 + tol'' to move points away from the axes, and the points plotted. Implementation in \R. \vspace{0.5cm} \noindent REFERENCE: \begin{itemize} \item Venables, W. N. and Ripley, B. D., (2002), ``Modern Applied Statistics with S'', Fourth edition, Springer. \end{itemize} \section{Demo Data Sets from \Rpackage{fabiaData}} In above examples gene expression data sets from the \R package \Rpackage{fabiaData} have been used. In this section these data sets are described. \subsection{Breast\_A} Microarray data set of van't Veer breast cancer. Microarray data from Broad Institute ``Cancer Program Data Sets'' which was produced by \citep{Veer:02} (\href{http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi}{http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi}) Array S54 was removed because it is an outlier. Goal was to find a gene signature to predict the outcome of a cancer therapy that is to predict whether metastasis will occur. A 70 gene signature has been discovered. Here we want to find subclasses in the data set. \citep{Hoshida:07} found 3 subclasses and verified that 50/61 cases from class 1 and 2 were ER positive and only in 3/36 from class 3. \texttt{XBreast} is the data set with 97 samples and 1213 genes, \texttt{CBreast} give the three subclasses from \citep{Hoshida:07}. \subsection{DLBCL\_B} Microarray data set of Rosenwald diffuse large-B-cell lymphoma. Microarray data from Broad Institute ``Cancer Program Data Sets'' which was produced by \citep{Rosenwald:02} (\href{http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi}{http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi}) Goal was to predict the survival after chemotherapy \citep{Hoshida:07} divided the data set into three classes: \begin{itemize} \item{OxPhos:} oxidative phosphorylation \item{BCR:} B-cell response \item{HR:} host response \end{itemize} We want to identify these subclasses. The data has 180 samples and 661 probe sets (genes). \texttt{XDLBCL} is the data set with 180 samples and 661 genes, \texttt{CDLBCL} give the three subclasses from \citep{Hoshida:07}. \subsection{Multi\_A} Microarray data set of Su on different mammalian tissue types. Microarray data from Broad Institute ``Cancer Program Data Sets'' which was produced by \citep{Su:02} (\href{http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi}{http://www.brxsoadinstitute.org/cgi-bin/cancer/datasets.cgi}) s Gene expression from human and mouse samples across a diverse array of tissues, organs, and cell lines have been profiled. The goal was to have a reference for the normal mammalian transcriptome. Here we want to identify the subclasses which correspond to the tissue types. The data has 102 samples and 5565 probe sets (genes). \texttt{XMulti} is the data set with 102 samples and 5565 genes, \texttt{CMulti} give the four subclasses corresponding to the tissue types. \bibliographystyle{natbib} \bibliography{fabia} \end{document}