%\VignetteIndexEntry{BiRewire} %\VignetteDepends{} %\VignetteKeywords{bipartite graphs, rewiring} %\VignettePackage{BiRewire} \documentclass[a4paper,10pt]{article} \usepackage{cite} \usepackage{url} % \usepackage{graphicx} %\usepackage[colorlinks=true]{hyperref} % \hypersetup{ % bookmarksnumbered=true, % linkcolor=black, % citecolor=black, % pagecolor=black, % urlcolor=black, % } %opening \title{BiRewire} \author{Andrea Gobbi, Francesco Iorio} \date{} \begin{document} \SweaveOpts{concordance=FALSE} \maketitle \tableofcontents \section{Overview} BiRewire is an R package implementing high-performing routines for the randomisation of bipartite graphs preserving their node degrees (i.e. Network Rewiring), through the Switching Algorithm (SA)\cite{Milo03}.\\ This package is particularly useful for the randomisation of '0-1' tables (or presence-absence matrices) in which the distributions of non-null entries (i.e. presence distributions) must be preserved both across rows and columns. By considering these tables as incidence matrices of bipartite graphs then this problem reduces to bipartite network rewiring.\\ For example, by modeling a genomic dataset as a binary event matrix (BEM), in which rows correspond to samples, columns correspond to genes and the $(i,j)$ entry is non-null if the $i$-th sample harbours a mutation in the $j$-th gene, then with BiRewire is possible to randomise the dataset preserving its mutation rates both across samples and genes. This is crucial to preserve tumour specific alterations, dependencies between gene-mutations and heterogeneity in mutation/copy-number-alteration rates across patients.\\ Large collections of such randomised tables can be then used to approximate samples from the uniform distribution of all the possible genomic datasets with the same mutation-rates of the initial one. Finally this data can be used as null model to test the statistical significance of several combinatorial properties of the original dataset: for example the tendency of a group of genes to be co- or mutually-mutated \cite{Ciriello12}.\\ Similar procedures have been implemented in order to manage undirected networks. Specifically, with \textit{BiRewire} users can: \begin{enumerate} \item create bipartite graphs from genomic BEMs (or, generally, from any kind of \'presence-absence\' matrix); \item perform an analysis, which consists of studying the trend of Jaccard Similarity between the original network and its rewired versions across the switching steps (by using a user-defined sampling time), and analytically estimating the number of steps at which this similarity reaches a plateau (i.e. the maximal level of randomness is achieved) according to the lower bound derived in \cite{Gobbi}; \item generate rewired versions of a bipartite graph with the analytically derived bound as number of switching steps or a user-defined one; \item derive projections of the starting network and its rewired version and perform different graph-theory analysis on them; \item generate a set of netwoks correctly drawn from the suitable null-model starting from the initial BEM; \item monitoring the behaviour of the Markov chain underlying the SA \item perform the same analysis described in point 1,2,3,5 and 6 for undirected graphs. \end{enumerate} All the functions of the package are written in C-code and R-wrapped. \section{Installation} It is possible to download the package from \url{http://www.ebi.ac.uk/~iorio/BiRewire} and install it with the shell-command: \begin{verbatim} R CMD INSTALL BiRewire_xx.yy.zz.tar.gz \end{verbatim} or with biocLite() directly in R: \begin{verbatim} source("http://bioconductor.org/biocLite.R") biocLite("BiRewire") \end{verbatim} Moreover, the sources of the developement version are available here \url{http://www.bioconductor.org/packages/devel/bioc/html/BiRewire.html}. Alternatively, the source files can be cloned from the github repositories: \url{https://github.com/andreagobbi/BiRewire} and \url{https://github.com/andreagobbi/BiRewire--release} using the command \begin{verbatim} git clone git@github.com:andreagobbi/BiRewire--release.git git clone git@github.com:andreagobbi/BiRewire.git \end{verbatim} To load BiRewire use the following commands: <>= library(BiRewire) @ \section{Package Dependencies} BiRewire requires the R packages {\bf igraph}, {\bf{slam}} and {\bf tsne} (see \cite{igraph}, \cite{slam}, \cite{tsne} ) available at the CRAN repository. \section{Notation}\label{Notation} Let $\mathcal{G}$ be a bipartite graph, i.e. a graph containing two classes of nodes $V_r$ and $V_c$ such that every edge $e\in E$ connects one node in the first class to a node in the second class.\\ Let $\mathcal{B}$ be the incidence matrix of $\mathcal{G}$, i.e. the $|V_r|\times|V_c|$ binary matrix whose generic entry $m_{i,j}$ is not null if an only if $(i,j)\in E$.\\ The number of edges is indicated with $e=|E|$ and the edge density with $d=\frac{e}{|V_r||V_c|}$.\\ The SA performs $N$ Switching Steps (SSs), in which: \begin{enumerate} \item two edges $(a,b)$ and $(c,d)$ both $\in E$ are randomly selected, \item if $a\not=c$, $b\not=d$, $(a,d)\not\in E$ and $(b,d)\not\in E$ then: \begin{enumerate} \item the edges $(a,d)$ and $(b,d)$ are added to $E$ and \item the edges $(a,b)$ and $(c,d)$ are removed from $E$. \end{enumerate} \end{enumerate} Notice that we count a SS only if it is successfully performed. The \textit{Jaccard Index} (JI, \cite{Jaccard}) is used to quantify the similarity between the original graph and its rewired version at the $k$-th SS. Since the SA preserves the degree distribution and does not alter the number of nodes, the JI, indicated with $s^{(k)}$, can be computed as $$ s^{(k)}=\frac{x^{(k)}}{2e-x^{(k)}}, $$ where $x^{(k)}$ is the number of edges in common between the two graphs. Fixed a small error $\delta$, the number $N$ of SSs providing the rewired version of a network with the maximally achievable level of randomness (in terms of average dissimilarity from the original network) is asymptotically equal to $$ \frac{e(1-d)}{2}\ln{\frac{1-d}{\delta}}. $$ More detailed, we analitically derived the fixed point of the underlying Markov chain $\bar{x}$; fixed a $\delta$ we can extimate the distance of the current state of the chain respect to this fixed point in terms of fraction of edges $\delta$. For large netowrk we can assumne that a distance less than one edge is satisfiable (see \cite{gobbi}), so the bound reads: $$ \frac{e(1-d)}{2}\ln{e(1-d)}, $$ but in order to manage smaller network the bound with the parameter $\delta$ results to be more general. This bound is much lower than the empirical one proposed in \cite{Milo03} (see Reference for details). \section{Function Description} In this section all the functions implemented in BiRewire are described with a simple practical example in which a real breast cancer dataset is modeled as a bipartite network, and randomised preserving the mutation-rate both across samples and genes (i.e. the corresponding bipartite network is rewired). In each of the following functions it is possible to perform $N$ \textbf{successful} switching steps (see \cite{Gobbi} for more details about this more general bound) using the flag exact=TRUE. To prevent a possible indinite loop, the program performs at maximum MAXITER\_MUL*max.iter iterations. \subsection{{\bf birewire.analysis.bipartite} and{\bf .undirected} } First of all, we create a bipartite network modeling a genomic breast cancer dataset downloaded from the Cancer Genome Atlas (TCGA) projects data portal \url{http://tcga.cancer.gov/dataportal/}, used in \cite{Gobbi} From this dataset germline mutations were filtered out with state-of-the-art softwares; synonymous mutations and mutations identified as benign and tolerated were also removed. The resulting bipartite graph has $n_r=757$ nodes (corresponding to samples), $n_c=9,757$ nodes (corresponding to genes), and $e=19,758$ edges connecting a node in $n_r$ to a node in $n_c$ if the gene corresponding to the node in $n_r$ is mutated to the samples corresponding to the node in $n_C$. The edge density of this network is $0.27\%$.\\ The genomic dataset (in the form of a binary matrix in which rows correspond to samples, columns correspond to genes and the $(i,j)$ entry is non null if the $i$-th sample harbours a mutation in the $j$-th gene) can be loaded and modeled as a bipartite graph, with the following commands: <>= data(BRCA_binary_matrix)##loads an binary genomic event matrix for the ##breast cancer dataset g=birewire.bipartite.from.incidence(BRCA_binary_matrix)##models the dataset ## as igraph bipartite graph @ Once the bipartite graph is created it is possible to conduct the analysis by calling the {\bf birewire.analysis.bipartite } function, using the following commands: <>= step=5000 max=100*sum(BRCA_binary_matrix) scores<-birewire.analysis.bipartite(BRCA_binary_matrix,step, verbose=FALSE,max.iter=max,n.networks=5,display=F) @ The function \textbf{birewire.analysis.bipartite} returns the Jaccard similarity sampled every \textit{step} SSs (in the example above step is equal to 5000). The SA is independentyl applyed on the initial data for \textit{n.networks} times in order to extimate the mean value of thr JI and the relative CI (as $1.96\pm \sigma/\sqrt{n.networks}$). A plot such information is displayed if the paramenter \textit{ndisplay} is set to true. The routine returns a list of two element: \textbf{\$N} is the analitically derived bound and \textbf{\$data} the similarity score table. The same analysis can be performed on general undirected networks. <>= g.und<-erdos.renyi.game(directed=F,loops=F,n=1000,p.or.m=0.01) m.und<-get.adjacency(g.und,sparse=FALSE) step=100 max=100*length(E(g.und)) scores.und<-birewire.analysis.undirected(m.und,step=step, verbose=FALSE,max.iter=max,n.networks=5) @ \subsection{{\bf birewire.rewire.bipartite }} To rewire a bipartite graph two modalities are available. Both of them can be used with the analytical bound $N$ as number of switching steps or with a user defined value. The function takes in input an incidence matrix $\mathcal{B}$ or the an \textit{igraph} bipartite graph. <>= m2<-birewire.rewire.bipartite(BRCA_binary_matrix,verbose=FALSE) g2<-birewire.rewire.bipartite(g,verbose=FALSE) @ The first function recives in output the incidence matrix of the rewired graph while the second one a bipartite \textit{igraph} graph. See documentation for further details. \subsection{{\bf birewire.rewire.undirected }} To rewire a general undirected graph the following functions can be used: <>= m2.und<-birewire.rewire.undirected(m.und,verbose=FALSE) g2.und<-birewire.rewire.undirected(g.und,verbose=FALSE) @ \subsection{{\bf birewire.similarity }} This function computes the Jaccard index between two incidence matrices with same dimensions and node degrees. It is possible also to use directly two suitable graphs. <>= sc=birewire.similarity(BRCA_binary_matrix,m2) sc=birewire.similarity(BRCA_binary_matrix,t(m2))#also works @ \subsection{{\bf birewire.rewire.bipartite.and.projections }} The following functions execute the Switching Algorithm and computes similarity trends across its switching steps for the two natural projections of the starting bipartite graph <>= #use a smaller graph! gg <- graph.bipartite( rep(0:1,length=10), c(1:10)) result=birewire.rewire.bipartite.and.projections(gg,step=10, max.iter="n",accuracy=0.00005,verbose=FALSE) plot(result$similarity_scores.proj2,type='l',col='red',ylim=c(0,1)) lines(result$similarity_scores.proj1,type='l',col='blue') legend("top",1, c("Proj2","Proj1"), cex=0.9, col=c("blue","red"), lty=1:1,lwd=3) @ \subsection{{\bf birewire.sampler.bipartite }} This function uses the SA to generate a set of $K$ bipartite networks drawn from the null model given by an initial bipartite graph. The function creates a main folder (\textit{path} input parameter) and a set of subfolders in order to have maximum 1000 files per folder. Notice that the initial graph is used only for the first rewiring process, and the output of the fist process is used as inoput for the second and so on. <>= #use a smaller graph! gg <-graph.bipartite(rep(0:1,length=10), c(1:10)) ## NOT RUN ##birewire.sampler.bipartite(get.incidence(g),K=10,path='TESTBIREWIRE',verbose=F) ##unlink('TESTBIREWIRE',recursive = T) @ \subsection{{\bf birewire.visual.monitoring.bipartite } and {\bf .undirected}} These functions allow to visualize the Markov Chain underlying the SA. More in detail, given a sequence of steps to test, we sample from the SA each indicated step generating a brunch of networks. We compute the pairwise Jaccard distance among them, i.e. the Jaccad index is defined as 1 minus the Jaccad similarity. Then we perform a dimensional scaling using \emph{tsne}\cite{tsne} and plot the result. <>= ggg <- graph.bipartite( rep(0:1,length=10), c(1:10)) ## NOT RUN ##birewire.visual.monitoring.bipartite(ggg,display=F,n.networks=10) g <- erdos.renyi.game(1000,0.1) ##birewire.visual.monitoring.undirected(g,display=F,n.networks=10) @ \section{Example}\label{ex} Here we collect the functionalities of the package in a single example. The output plot of the analysis is showed in Fig.\ref{fig} on the left side and the output of the monitoring procedure is displayed in Fig.\ref{fig} on the right side. <>= ##NOT RUN #ggg <- bipartite.random.game(n1=100,n2=40,p=0.2) #For recovering quickly the bound N we can perform a short analysis #N=birewire.analysis.bipartite(get.incidence(ggg,sparse=F),max.iter=2,step=1)$N #Now we can perform the real analysis #res=birewire.analysis.bipartite(get.incidence(ggg,sparse=F),max.iter=10*N,n.networks=10) #and monitoring the markov chain #birewire.visual.monitoring.bipartite(ggg,display=T,n.networks=75,sequence=c(1,10,200,500,"n",10000),ncol=3) #Now we can generate a null model #birewire.sampler.bipartite(ggg,K=10000,path="/home/andrea/example") @ \begin{figure}[!h] \includegraphics[height=6cm,width=6cm]{analysis.pdf} \includegraphics[height=6cm,width=7cm]{monitoring.pdf} \caption{The output plots of \textbf{ birewire.analyis.bipartite}(left side) and of \textbf{ birewire.visual.monitoring.bipartite}(right side) relative to the example in section \ref{ex}. The gradient of the colour from blue to red indicates the position of the sampling network respect the others. The starting network (blue) is marked with the text \textit{start}. }\label{fig} \end{figure} \clearpage \begin{thebibliography}{} \bibitem{Gobbi}Gobbi, A. and Iorio, F. and Dawson, K. J. and Wedge, D. C. and Tamborero, D. and Alexandrov, L. B. and Lopez-Bigas, N. and Garnett, M. J. and Jurman, G. and Saez-Rodriguez, J. (2014) \emph{Fast randomization of large genomic datasets while preserving alteration counts} Bioinformatics 2014 30 (17): i617-i623 doi: 10.1093/bioinformatics/btu474. \bibitem{Iorio} Iorio, F. and Gobbi, A. (2015) \emph{In preparation} In preparation. \bibitem{GobbiJurman} Gobbi, A. and Jurman, G. \emph{Number of required Switching Steps in the Switching Algorithm for undirected graphs}.in preparation, . \bibitem{Jaccard} Jaccard, P. (1901), \emph{Etude comparative de la distribution florale dans une portion des Alpes et des Jura}, Bulletin de la Societe Vaudoise des Sciences Naturelles 37:547--579. \bibitem{Milo03} R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman, U. Alon (2003), \emph{On the uniform generation of random graphs with prescribed degree sequences}, eprint arXiv:cond-mat/0312028. \bibitem{igraph} Csardi, G. and Nepusz, T (2006)\emph{The igraph software package for complex network research}, InterJournal, Complex Systems \url{http://igraph.sf.net}. \bibitem{Ciriello12} Ciriello, G. and Cerami, E. and Sander, C. and Schultz, N.(2012) Mutual exclusivity analysis identifies oncogenic network modules, {\it Genome Research}, {\bf 22}, 398-406. \bibitem{miclos2004} Mikl\`{o}s I, Podani J. Randomization of presence-absence matrices: comments and new algorithms. Ecology. Eco Soc America; 2004;85(1):86--92. \bibitem{gotelli2000} Gotelli NJ. Null model analysis of species co-occurrence patterns. Ecology. 2000. \bibitem{Jaccard}Jaccard, Paul. \'{E}tude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin del la Soci\'{e}t\'{e} Vaudoise des Sciences Naturelles; 1901; 37:547--579. \bibitem{slam} Hornik, K. and Meyer, D, and Buchta, C. (2014)\emph{slam: Sparse Lightweight Arrays and Matrices}, R package \url{http://CRAN.R-project.org/package=slam}. \bibitem{tsne} Van der Maaten, L.J.P. and Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008 \end{thebibliography} \end{document}