% \VignetteIndexEntry{Part 4: overview layout(circular, karyogram, Manhattan)} % \VignetteDepends{} % \VignetteKeywords{visualization utilities} % \VignettePackage{ggbio} \documentclass[11pt]{report} % \usepackage{times} \usepackage{hyperref} \usepackage{verbatim} \usepackage{graphicx} \usepackage{fancybox} \usepackage{color} <>= opts_chunk$set(eval=FALSE) @ % \setkeys{Gin}{width=0.95\textwidth} \textwidth=6.5in \textheight=8.5in \parskip=.3cm \parindent = 0cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\gr}{\Rclass{GRanges}} \newcommand{\Bioc}{\software{Bioconductor}} \newcommand{\IRanges}{\Rpackage{IRanges}} \newcommand{\biovizBase}{\Rpackage{biovizBase}} \newcommand{\ggbio}{\Rpackage{ggbio}} \newcommand{\visnab}{\Rpackage{visnab}} \newcommand{\ggplot}{\Rpackage{ggplot2}} \newcommand{\grid}{\Rpackage{grid}} \newcommand{\gridExtra}{\Rpackage{gridExtra}} \newcommand{\qplot}{\Rfunction{qplot}} \newcommand{\autoplot}{\Rfunction{autoplot}} \newcommand{\knitr}{\Rpackage{knitr}} \newcommand{\tracks}{\Rfunction{tracks}} \newcommand{\chipseq}{\Rpackage{chipseq}} % my own frambox \newcommand{\sfbox}[2][Tips]{ \begin{center} \shadowbox{ \parbox{0.8\linewidth}{ \textcolor{blue}{#1:} #2 } } \end{center} } \title{Overview layout} \author{Tengfei Yin} \date{\today} \begin{document} % \setkeys{Gin}{width=0.6\textwidth} \maketitle \newpage \tableofcontents \newpage <>= library(knitr) opts_chunk$set(fig.path='./figures/ggbio-', fig.align='center', fig.show='asis', eval = TRUE, fig.width = 4, dev = 'png', fig.height = 4, dpi = 400) options(replace.assign=TRUE,width=90) @ <>= options(width=72) @ \chapter{Circular view}\label{chapter:circle} \section{Introduction} Layout "circle" is inspired by \textit{Circos} graphics and make it a general layout. Layout is generally more complex than a coordinate transformation, it's a combination of different components like coordinate transformation(genome and polar), and tracks-based layout, etc. Especially, circular view is very useful to show links between different locations. Since we are following the grammar of graphics, aesthetics mapping are fairly easy in \ggbio{}. In this tutorial, we will start from the raw data, if you are already familiar with how to process your data into the right format, which here I mean \Robject{GRanges},you can jump to \ref{sec:step3} directly. \section{Tutorial} \subsection{Step 1: understand the layout circle} We have discussed about the new coordinate "genome" in vignette about Manhattan plot before, now this time, it's one step further compared to genome coordinate transformation. We specify ring radius \Rfunarg{radius} and track width \Rfunarg{trackWidth} to help transform a linear genome coordinate system to a circular coordinate system. By using \Rfunction{layout\_circle} function which we will introduce later. Before we visualize our data, we need to have something in mind \begin{itemize} \item How many tracks we want? \item Can they be combined into the same data? \item Do I have chromosomes lengths information? \item Do I have interesting variables attached as one column? \end{itemize} \subsection{Step 2: get your data ready to plot} Ok, let's start to process some raw data to the format we want. The data used in this study is from this a paper\footnote{http://www.nature.com/ng/journal/v43/n10/full/ng.936.html}. In this example, We are going to \begin{enumerate} \item Visualize somatic mutation as segment. \item Visualize inter,intro-chromosome rearrangement as links. \item Visualize mutation score as point tracks with grid-background. \item Add scale and ticks and labels. \item To arrange multiple plots and legend. create multiple sample comparison. \end{enumerate} Notes: don't put too much tracks on it. I simply put script here to get mutation data as `GRanges` object. <>= crc1 <- system.file("extdata", "crc1-missense.csv", package = "biovizBase") crc1 <- read.csv(crc1) library(GenomicRanges) mut.gr <- with(crc1,GRanges(Chromosome, IRanges(Start_position, End_position), strand = Strand)) values(mut.gr) <- subset(crc1, select = -c(Start_position, End_position, Chromosome)) data("hg19Ideogram", package = "biovizBase") seqs <- seqlengths(hg19Ideogram) ## subset_chr chr.sub <- paste("chr", 1:22, sep = "") ## levels tweak seqlevels(mut.gr) <- c(chr.sub, "chrX") mut.gr <- keepSeqlevels(mut.gr, chr.sub) seqs.sub <- seqs[chr.sub] ## remove wrong position bidx <- end(mut.gr) <= seqs.sub[match(as.character(seqnames(mut.gr)), names(seqs.sub))] mut.gr <- mut.gr[which(bidx)] ## assign_seqlengths seqlengths(mut.gr) <- seqs.sub ## reanme to shorter names new.names <- as.character(1:22) names(new.names) <- paste("chr", new.names, sep = "") new.names mut.gr.new <- renameSeqlevels(mut.gr, new.names) head(mut.gr.new) @ %def To get ideogram track, we need to load human hg19 ideogram data, for details please check another vignette about getting ideogram. <>= hg19Ideo <- hg19Ideogram hg19Ideo <- keepSeqlevels(hg19Ideogram, chr.sub) hg19Ideo <- renameSeqlevels(hg19Ideo, new.names) head(hg19Ideo) @ %def \subsection{Step 3: low level API: \Rfunction{layout\_circle}}\label{sec:step3} \Rfunction{layout\_circle} is a lower level API for creating circular plot, it accepts \Robject{Granges} object, and users need to specify radius, track width, and other aesthetics, it's very flexible. But keep in mind, you \textbf{have to} pay attention rules when you make circular plots. \begin{itemize} \item For now, \Rfunction{seqlengths}, \Rfunction{seqlevels} and chromosomes names should be exactly the same, so you have to make sure data on all tracks have this uniform information to make a comparison. \item Set arguments \Rfunarg{space.skip} to the same value for all tracks, that matters for transformation, default is the same, so you don't have to change it, unless you want to add/remove space in between. \item \Rfunarg{direction} argument should be exactly the same, either "clockwise" or "counterclockwise". \item Tweak with your radius and tracks width to get best results. \end{itemize} Since low level API leave you as much flexibility as possible, this may looks hard to adjust, but it can produce various types of graphics which higher levels API like \autoplot{} hardly can, for instance, if you want to overlap multiple tracks or fine-tune your layout. Ok, let's start to add tracks one by one. First to add a "ideo" track \begin{figure}[!htpb] \centering <>= library(ggbio) p <- ggplot() + layout_circle(hg19Ideo, geom = "ideo", fill = "gray70", radius = 30, trackWidth = 4) @ %def \caption{Adding 'ideogram' track.} \label{fig:ideo} \end{figure} % \clearpage Then a "scale" track with ticks \begin{figure}[!htpb] \centering <>= p <- p + layout_circle(hg19Ideo, geom = "scale", size = 2, radius = 35, trackWidth = 2) p @ %def \caption{Adding a 'scale' track.} \label{fig:scale} \end{figure} % \clearpage Then a "text" track to label chromosomes. *NOTICE*, after genome coordinate transformation, original data will be stored in column ".ori", and for mapping, just use ".ori" prefix to it. Here we use `.ori.seqnames`, if you use `seqnames`, that is going to be just "genome" character. \begin{figure}[!htpb] \centering <>= p <- p + layout_circle(hg19Ideo, geom = "text", aes(label = seqnames), vjust = 0, radius = 38, trackWidth = 7) p @ %def \caption{Adding a 'text' track.} \label{fig:text} \end{figure} % \clearpage Then a "rectangle" track to show somatic mutation, this will looks like vertical segments. \begin{figure}[!htpb] \centering <>= p <- p + layout_circle(mut.gr, geom = "rect", color = "steelblue", radius = 23 ,trackWidth = 6) p @ %def \caption{Adding a segment track to show mutation.} \label{fig:mut} \end{figure} % \clearpage Next, we need to add some "links" to show the rearrangement, of course, links can be used to map any kind of association between two or more different locations to indicate relationships like copies or fusions. <>= rearr <- read.csv(system.file("extdata", "crc-rearrangment.csv", package = "biovizBase")) ## start position gr1 <- with(rearr, GRanges(chr1, IRanges(pos1, width = 1))) ## end position gr2 <- with(rearr, GRanges(chr2, IRanges(pos2, width = 1))) ## add extra column nms <- colnames(rearr) .extra.nms <- setdiff(nms, c("chr1", "chr2", "pos1", "pos2")) values(gr1) <- rearr[,.extra.nms] ## remove out-of-limits data seqs <- as.character(seqnames(gr1)) .mx <- seqlengths(hg19Ideo)[seqs] idx1 <- start(gr1) > .mx seqs <- as.character(seqnames(gr2)) .mx <- seqlengths(hg19Ideo)[seqs] idx2 <- start(gr2) > .mx idx <- !idx1 & !idx2 gr1 <- gr1[idx] seqlengths(gr1) <- seqlengths(hg19Ideo) gr2 <- gr2[idx] seqlengths(gr2) <- seqlengths(hg19Ideo) @ %def To create a suitable structure to plot, please use another `GRanges` to represent the end of the links, and stored as elementMetadata for the "start point" `GRanges`. Here we named it as "to.gr" and will be used later. <>= values(gr1)$to.gr <- gr2 ## rename to gr gr <- gr1 @ %def Here we show the flexibility of *ggbio*, for example, if you want to use color to indicate your links, make sure you add extra information in the data, used for mapping later. Here in this example, we use "intrachromosomal" to label rearrangement within the same chromosomes and use "interchromosomal" to label rearrangement in different chromosomes. <>= values(gr)$rearrangements <- ifelse(as.character(seqnames(gr)) == as.character(seqnames((values(gr)$to.gr))), "intrachromosomal", "interchromosomal") @ %def Get subset of links data for only one sample "CRC1" <>= gr.crc1 <- gr[values(gr)$individual == "CRC-1"] @ %def Ok, add a "point" track with grid background for rearrangement data and map `y` to variable "score", map `size` to variable "tumreads", rescale the size to a proper size range. <>= p <- p + layout_circle(gr.crc1, geom = "point", aes(y = score, size = tumreads), color = "red", radius = 12 ,trackWidth = 10, grid = TRUE) + scale_size(range = c(1, 2.5)) p @ %def % \clearpage Finally, let's add links and map color to rearrangement types. Remember you need to specify `linked.to` to the column that contain end point of the data. \begin{figure}[!htpb] \centering <>= p <- p + layout_circle(gr.crc1, geom = "link", linked.to = "to.gr", aes(color = rearrangements), radius = 10 ,trackWidth = 1) p @ %def \caption{A link track is added to the circular plot.} \label{fig:links} \end{figure} \subsection{Step 4: Complex arragnment of plots} In this step, we are going to make multiple sample comparison, this may require some knowledge about package \Rpackage{grid} and \Rpackage{gridExtra}. We will introduce a more easy way to combine your graphics later after this. We just want 9 single circular plots put together in one page, since we cannot keep too many tracks, we only keep ideogram and links. Here is one sample. \begin{figure}[!htpb] \centering <>= cols <- RColorBrewer::brewer.pal(3, "Set2")[2:1] names(cols) <- c("interchromosomal", "intrachromosomal") p0 <- ggplot() + layout_circle(gr.crc1, geom = "link", linked.to = "to.gr", aes(color = rearrangements), radius = 7.1) + layout_circle(hg19Ideo, geom = "ideo", trackWidth = 1.5, color = "gray70", fill = "gray70") + scale_color_manual(values = cols) p0 @ %def \caption{Just to show single individuals crc1.} \label{fig:single-arr} \end{figure} <>= grl <- split(gr, values(gr)$individual) ## need "unit", load grid library(grid) lst <- lapply(grl, function(gr.cur){ print(unique(as.character(values(gr.cur)$individual))) cols <- RColorBrewer::brewer.pal(3, "Set2")[2:1] names(cols) <- c("interchromosomal", "intrachromosomal") p <- ggplot() + layout_circle(gr.cur, geom = "link", linked.to = "to.gr", aes(color = rearrangements), radius = 7.1) + layout_circle(hg19Ideo, geom = "ideo", trackWidth = 1.5, color = "gray70", fill = "gray70") + scale_color_manual(values = cols) + labs(title = (unique(values(gr.cur)$individual))) + theme(plot.margin = unit(rep(0, 4), "lines")) }) @ %def We wrap the function in grid level to a more user-friendly high level function, called \Rfunction{arrangeGrobByParsingLegend}. You can pass your ggplot2 graphics to this function , specify the legend you want to keep on the right, you can also specify the column/row numbers. Here we assume all plots we have passed follows the same color scale and have the same legend, so we only have to keep one legend on the right. <>= arrangeGrobByParsingLegend(lst, widths = c(4, 1), legend.idx = 1, ncol = 2) @ %def \chapter{Manhattan plot}\label{chapter:man} \section{Introduction} In this tutorial, we introduce a new coordinate system called "genome" for genomic data. This transformation is to put all chromosomes on the same genome coordinates following specified orders and adding buffers in between. One may think about facet ability based on \textit{seqnames}, it can produce something similar to \textit{Manhattan plot}\footnote{http://en.wikipedia.org/wiki/Manhattan}, but the view will not be compact. What's more, genome transformation is previous step to form a circular view. In this tutorial, we will simulate some SNP data and use this special coordinate and a specialized function \Rfunction{plotGrandLinear} to make a Manhattan plot. \textit{Manhattan plot} is just a special use design with this coordinate system. \section{Understand the new coordinate} Let's load some packages and data first <<>>= library(ggbio) data(hg19IdeogramCyto, package = "biovizBase") data(hg19Ideogram, package = "biovizBase") library(GenomicRanges) @ %def Make a minimal example `GRanges`, and see what the default coordiante looks like, pay attention that, by default, the graphics are faceted by `seqnames` as shown in Figure \ref{fig:simul_gr} \begin{figure}[!htpb] \centering <>= library(biovizBase) gr <- GRanges(rep(c("chr1", "chr2"), each = 5), IRanges(start = rep(seq(1, 100, length = 5), times = 2), width = 50)) autoplot(gr, aes(fill = seqnames)) @ %def \caption{Default grahpics is faceted by seqnames} \label{fig:simul_gr} \end{figure} What if we specify the coordinate system to be "genome" in \autoplot{} function, there is no faceting anymore, the two plots are merged into one single genome space, and properly labeled as shown in Figure \ref{fig:coord-genome} % There is a limitation on integer in \R{}, so the % genome space cannot be too long, to overcome this limitation, a default argument % called `maxSize` is defined with this function, if the genome space is over % limits, it will rescale everything automatically, function `tranformToGenome` % with return a transformed `GRanges` object, with only one single `seqnames` % called "genome" and the `seqlengths` of it, is just genome space(with buffering % region). arguments called `space.ratio` control the skipped region between % chromosomes. \begin{figure}[!htpb] \centering <>= autoplot(gr, coord = "genome", aes(fill = seqnames)) @ %def \caption{Coordinate genome} \label{fig:coord-genome} \end{figure} The internal transformation are implemented into the function \Rfunction{transformToGenome}. And there is some simple way to test if a \Robject{GRanges} object is transformed to coordinate "genome" or not <>= gr.t <- transformToGenome(gr) head(gr.t) is_coord_genome(gr.t) metadata(gr.t)$coord @ %def \section{Step 2: Simulate a SNP data set} Let's use the real human genome space to simulate a SNP data set. <>= chrs <- as.character(levels(seqnames(hg19IdeogramCyto))) seqlths <- seqlengths(hg19Ideogram)[chrs] set.seed(1) nchr <- length(chrs) nsnps <- 100 gr.snp <- GRanges(rep(chrs,each=nsnps), IRanges(start = do.call(c, lapply(chrs, function(chr){ N <- seqlths[chr] runif(nsnps,1,N) })), width = 1), SNP=sapply(1:(nchr*nsnps), function(x) paste("rs",x,sep='')), pvalue = -log10(runif(nchr*nsnps)), group = sample(c("Normal", "Tumor"), size = nchr*nsnps, replace = TRUE) ) genome(gr.snp) <- "hg19" gr.snp @ %def We use the some trick to make a shorter names. <>= seqlengths(gr.snp) nms <- seqnames(seqinfo(gr.snp)) nms.new <- gsub("chr", "", nms) names(nms.new) <- nms gr.snp <- renameSeqlevels(gr.snp, nms.new) seqlengths(gr.snp) @ %def \section{Step 3: Start to make Manhattan plot by using \autoplot{}} wrapped basic functions into \autoplot{}, you can specify the coordinate. Figure \ref{fig:unorder} shows what the unordered object looks like. \begin{figure}[!htpb] \centering <>= autoplot(gr.snp, coord = "genome", geom = "point", aes(y = pvalue), space.skip = 0.01) @ %def \caption{Unordred Manhattan plot} \label{fig:unorder} \end{figure} That's probably not what you want, if you want to change to specific order, just sort them by hand and use `keepSeqlevels`. Figure \ref{fig:sort} shows a sorted plot. \begin{figure}[!htpb] \centering <>= gr.snp <- keepSeqlevels(gr.snp, c(1:22, "X", "Y")) values(gr.snp)$highlight <- FALSE idx <- sample(1:length(gr.snp), size = 15) values(gr.snp)$highlight[idx] <- TRUE values(gr.snp)$id <- 1:length(gr.snp) p <- autoplot(gr.snp, coord = "genome", geom = "point", aes(y = pvalue), space.skip = 0.01) @ %def \caption{Sorted data for Manhattan plot} \label{fig:sort} \end{figure} \textbf{NOTICE}: the data now doesn't have information about lengths of each chromosomes, this is allowed to be plotted, but it's misleading sometimes, without chromosomes lengths information, \ggbio{} use data space to make estimated lengths for you, this is not accurate! So let's just assign \Rfunction{seqlengths} to the object. Then you will find the data space now is distributed proportional to real space as shown in Figure \ref{fig:with-seql}. \begin{figure}[!htpb] \centering <>= names(seqlths) <- gsub("chr", "", names(seqlths)) seqlengths(gr.snp) <- seqlths[names(seqlengths(gr.snp))] autoplot(gr.snp, coord = "genome", geom = "point", aes(y = pvalue), space.skip = 0.01) @ %def \caption{Manhattan plot after setting seqlengths to the data, the data space now is distributed proportional to real chromosome space.} \label{fig:with-seql} \end{figure} In \autoplot{}, argument \Rfunarg{coord} is just used to transform the data, after that, you can use it as common \Robject{GRanges}, all other geom/stat works for it. Here just show a simple example for another geom "line" as shown in Figure \ref{fig:line} \begin{figure}[!htpb] \centering <>= autoplot(gr.snp, coord = "genome", geom = "line", aes(y = pvalue, group = seqnames, color = seqnames)) @ %def \caption{Use line to represent the data in typical Manhattan plot.} \label{fig:line} \end{figure} \section{Convenient \Rfunction{plotGrandLinear} function} In \ggbio{}, sometimes we develop specialized function for certain types of plots, it's basically a wrapper over lower level API and \autoplot{}, but more convenient to use. Here for \textit{Manhattan plot}, we have a function called \Rfunction{plotGrandLinear} used for it. aes(y = ) is required to indicate the y value, e.g. p-value. Figure \ref{fig:plotGl} shows a defalut graphic. \begin{figure}[!htpb] \centering <>= gro <- GRanges(c("1", "2"), IRanges(c(100, 200), width = 5e7)) plotGrandLinear(gr.snp, aes(y = pvalue), highlight.gr = gro) names(gro) <- c("id1", "id2") plotGrandLinear(gr.snp, aes(y = pvalue), highlight.gr = gro) + theme_clear() + theme(legend.position = "none") plotGrandLinear(gr.snp, aes(y = pvalue), highlight.gr = gro) + theme_clear() + theme(legend.position = "none") .trans <- 0.5 names(.trans) <- "1" plotGrandLinear(gr.snp, aes(y = pvalue), highlight.gr = gro, chr.weight = .trans) + theme_clear() + theme(legend.position = "none") @ %def \caption{Default Manhattan plot by calling plotGrandLinear function} \label{fig:plotGl} \end{figure} Color mapping is automatically figured out by *ggbio* following the rules \begin{itemize} \item if \Rfunarg{color} present in \Rcode{aes()}, like \Rcode{aes(color = seqnames)}, it will assume it's mapping to data column called 'seqnames'. \item if \Rfunarg{color} is not wrapped in \Rcode{aes()}, then this function will \textbf{recylcle} them to all chromosomes. \item if \Rfunarg{color} is single character representing color, then just use one arbitrary color. \end{itemize} Let's test some examples for controling colors. \begin{figure}[!htpb] \centering <>= plotGrandLinear(gr.snp, aes(y = pvalue, color = seqnames)) p = plotGrandLinear(gr.snp, aes(y = pvalue)) p + geom_point(data = as.data.frame(p$.data[values(p$.data)$highlight]), aes(x = start, y = pvalue), color = "red", size = 3) @ %def \caption{Color mapped to chromosome names.} \label{fig:more1} \end{figure} \begin{figure}[!htpb] \centering <>= plotGrandLinear(gr.snp, aes(y = pvalue), color = c("green", "deepskyblue")) @ %def \caption{Color follow 'green' and 'deepskyblue' order for all chromosome space.} \label{fig:more2} \end{figure} \begin{figure}[!htpb] \centering <>= plotGrandLinear(gr.snp, aes(y = pvalue), color = c("green", "deepskyblue", "red")) @ %def \caption{Color follow three colors pattern: 'green','deepskyblue', 'red'} \label{fig:more3} \end{figure} \begin{figure}[!htpb] \centering <>= plotGrandLinear(gr.snp, aes(y = pvalue), color = "red") @ %def \caption{Unique color for all.} \label{fig:more4} \end{figure} You can also add cutoff line as shown in Figure \ref{fig:cutoff}. \begin{figure}[!htpb] \centering <>= plotGrandLinear(gr.snp, aes(y = pvalue), cutoff = 3, cutoff.color = "blue", cutoff.size = 4) @ %def \caption{Set cutoff on the Manhattan plot. The 'blue' line shows cutoff at value 3.} \label{fig:cutoff} \end{figure} This is equivalent to \ggplot{} 's API. <>= plotGrandLinear(gr.snp, aes(y = pvalue)) + geom_hline(yintercept = 3, color = "blue", size = 4) @ %def Sometimes the names of chromosomes maybe very long, you may want to rotate them, let's make a longer name first <>= ## let's make a long name nms <- seqnames(seqinfo(gr.snp)) nms.new <- paste("chr00000", nms, sep = "") names(nms.new) <- nms gr.snp <- renameSeqlevels(gr.snp, nms.new) seqlengths(gr.snp) autoplot(gr) + coord_genome() @ %def Then rotate it! \begin{figure}[rotate] \centering <>= plotGrandLinear(gr.snp, aes(y = pvalue)) + theme(axis.text.x=theme_text(angle=-90, hjust=0)) @ %def \caption{Rotate the x lable to save space.} \label{fig:rotate} \end{figure} % \clearpage As you can tell from above examples, all utilities works for \ggplot{} will work for \ggbio{} too. \chapter{Karyogram overview} \section{Introduction} A karyotype is the number and appearance of chromosomes in the nucleus of a eukaryotic cell\footnote{http://en.wikipedia.org/wiki/Karyotype}. It's one kind of overview when we want to show distribution of certain events on the genome, for example, binding sites for certain protein, even compare them acroos samples as example shows in this section. \Robject{GRanges} and \Robject{Seqinfo} object are also an ideal container for storing data needed for karyogram plot. Here is the strategy we used for generating ideogram templates. \begin{itemize} \item Althouth \Robject{seqlengths} is not required, it's highly recommended for plotting karyogram. If a \Robject{GRanges} object contains \Robject{seqlengths}, we know exactly how long each chromosome is, and will use this information to plot genome space, particularly we plot all levels included in it, \textbf{NOT JUST} data space. \item If a \Robject{GRanges} has no \Robject{seqlengths}, we will issue a warning and try to estimate the chromosome lengths from data included. This is \textbf{NOT} accurate most time, so please pay attention to what you are going to visualize and make sure set \Robject{seqlengths} before hand. \end{itemize} \section{\Rfunction{autoplot}} Let's first introduce how to use \autoplot{} to generate karyogram graphic. The most easy one is to just plot Seqinfo by using \autoplot{}, if your \gr{} object has seqinfo with seqlengths information. <<>>= data(hg19Ideogram, package = "biovizBase") chrs <- paste0("chr", 1:20) p <- autoplot(seqinfo(hg19Ideogram)[chrs]) p @ % Then you could add additional more data(\Rclass{GRanges} object) on this % overview, for examle, if you have a set of cytoband information, and % \Rfunction{scale\_fill\_giemsa} did the trick to correct the color. %<>= %hg19c <- hg19IdeogramCyto %hg19c <- ggbio:::subsetByChrs(hg19c, chrs) %seqlevels(hg19c) <- seqlevels(hg19Ideogram) %seqlengths(hg19c) <- seqlengths(hg19Ideogram) %p + layout_karyogram(hg19c, aes(fill = gieStain)) %p + layout_karyogram(hg19c, aes(fill = gieStain)) + scale_fill_giemsa() %@ %def Even more typical karyogram overview with cytoband, this will even show the arms, two required columns are required 'name' and 'gieStain'. <>= data(hg19IdeogramCyto, package = "biovizBase") head(hg19IdeogramCyto) p <- autoplot(hg19IdeogramCyto, layout = "karyogram", cytoband = TRUE) @ %def \sfbox{Your turn: change the order of chromosomes.} % To understand why we call it kayogram, let's first visualize some cytoband. We % use \Rfunarg{layout} argument to specify this special layout "karyogram". And % under this layout, \Rfunarg{cytoband} argument is acceptable, default is % \Rcode{FALSE}, if set to \Rcode{TRUE}, we assume your have additional % information associated with the data, stored in column \Rcode{gieStain}, it will % try to fill colors based on this variable according to a pre-set staining % colors. You may notice, this data set doesn't contain seqlengths information, % but the data space actually cover the real space, so it's not going to be a % problem. % <>= % data(hg19IdeogramCyto, package = "biovizBase") % autoplot(hg19IdeogramCyto, layout = "karyogram", cytoband = TRUE) % @ %def % You may want to change the order of chromosomes, \Rfunction{keepSeqlevels} are % convenient for this purpose, it's defined in package \Rpackage{GenomicRanges}. % <>= % hg19 <- keepSeqlevels(hg19IdeogramCyto, paste0("chr", c(1:22, "X", "Y"))) % head(hg19) % autoplot(hg19, layout = "karyogram", cytoband = TRUE) % @ %def % This \Robject{GRanges} object is special, contains two required colomns, 'name' % and 'giestain', in this case, \Rfunarg{cytoband} argument could set to % \Rcode{TRUE}, and we draw special ideogram not just rectangles but show % \textbf{centromere} as possible. % If we set it to \Rcode{FALSE}, we treat it as a normal \Robject{GRanges}, % nothing special with cytoband and arm information. So to show the cytoband, we % need to specify which color column variable to fill as cytoband, function % \Rfunction{aes} use an unevaluated expression like \Rcode{fill = gieStain}, % \textit{gieStain} is column name which store cytoband color, notice that we % don't use quotes around it, this means it's not something defined globally, but % some column name defined in the data. The system will usually automatically % assign categorical colors to represent this variable. But instead, cytoband % already have some pre-defined colors which mimic the color you observed under % microscope. Function \Rfunction{scale\_fill\_giemsa} did this trick to correct % the color. If it's first time you observe usage by \Rcode{+}, it's a very % popular API in package \ggplot{}\footnote{http://had.co.nz/ggplot2/}, which % could add graphics layer by layer or revise a existing graphic. % \begin{figure}[!htpb] % \centering % <>= % library(GenomicRanges) % ## it's a 'ideogram' % biovizBase::isIdeogram(hg19) % ## set to FALSE % autoplot(hg19, layout = "karyogram", cytoband = FALSE, aes(fill = gieStain)) + % scale_fill_giemsa() % @ %def % \caption{Cytoband on karyogram layout. We treat it as normal \Robject{GRanges} % data set, so we fill with gieStain color, and use % \Rfunction{scale\_fill\_giemsa} to use customized color. Notice the difference % if it's not a 'ideogram' object. we don't draw centromere particularly.} % \label{fig:cytoband-custom} % \end{figure} % % \clearpage % Let's try a different data set which is not an 'ideogram', but a normal % \Robject{GRanges} object that most people will have, extra data such as % statistical values or categorical levels are stored in element data columns used % for aesthetics mapping. We use a default data in package \Rpackage{biovizBase}, which is a subset of RNA editing set in human. The data involved in this \Robject{GRanges} is sparse, so we cannot simply use it to make karyogram, otherwise, the estimated chromosome lengths will be very rough and inaccurate. So what we need to do is: \begin{enumerate} \item Adding seqlegnths to this \Robject{GRanges} object. If you adding seqlengths to object, we have two ways to show chromosome space as karyogram. \\\Rcode{autoplot(object, layout = 'karyogram')} or \\\Rcode{autoplot(seqinfo(object))}. \item Changing the order of chromosomes. \item Visualize it and map variable to different aesthetics. \end{enumerate} <>= data(darned_hg19_subset500, package = "biovizBase") dn <- darned_hg19_subset500 head(dn) ## add seqlengths ## we have seqlegnths information in another data set data(hg19Ideogram, package = "biovizBase") seqlengths(dn) <- seqlengths(hg19Ideogram)[names(seqlengths(dn))] ## now we have seqlengths head(dn) ## then we change order dn <- keepSeqlevels(dn, paste0("chr", c(1:22, "X"))) autoplot(dn, layout = "karyogram") ## this equivalent to ## autoplot(seqinfo(dn)) @ %def Then we take one step further, the power of \ggplot{} or \ggbio{} is the flexible multivariate data mapping ability in graphics, make data exploration much more convenient. In the following example, we are trying to map a categorical variable 'exReg' to color, this variable is included in the data, and have three levels, '3' indicate 3' utr, '5' means 5' utr and 'C' means coding region. We have some missing values indicated as \Rcode{NA}, in default, it's going to be shown in gray color, and keep in mind, since the basic geom(geometric object) is rectangle, and genome space is very large, so change both color/fill color of the rectangle to specify both border and filled color is necessary to get the data shown as different color, otherwise if the region is too small, border color is going to override the fill color. <>= ## since default is geom rectangle, even though it's looks like segment ## we still use both fill/color to map colors autoplot(dn, layout = "karyogram", aes(color = exReg, fill = exReg)) @ %def Or you can set the missing value to particular color you want. Note: NA values is not shown on the legend. <>= ## since default is geom rectangle, even though it's looks like segment ## we still use both fill/color to map colors autoplot(dn, layout = "karyogram", aes(color = exReg, fill = exReg)) + scale_color_discrete(na.value = "brown") @ %def % A test could be performed to demonstrate why 'seqlengths' of object % \Robject{GRanges} is important. Let's assume we set wrong chromosome lengths by % accident, lengths are all equal to chromosome 1. We arbitrarily set it to the % same number so that every chromosome are of equal length. From Figure % \ref{fig:exReg-NA-fake}, it's clear that this will affect what we see. So please % make sure % \begin{itemize} % \item You get data space cover exactly the same chromosome space for each % chromosome. or % \item You set the seqlengths to the right number. % \end{itemize} % Otherwise you will see weird pattern from your results, so actually it's a good % way to test your raw data too, if you raw data have something beyond chromosome % space, you need to dig into it to see what happened. %<>= %dn2 <- dn %seqlengths(dn2) <- rep(max(seqlengths(dn2)), length(seqlengths(dn2))) %autoplot(dn2, layout = "karyogram", aes(color = exReg, fill = exReg)) %@ %def \section{\Rfunction{plotKaryogram}} \Rfunction{plotKaryogram} (or \Rfunction{plotStackedOverview}) are specialized function to draw karyogram graphics. It's actually what function \autoplot{} calls inside. API is a littler simpler because layout 'karyogram' is default in these two functions. So equivalent usage is like <>= plotKaryogram(dn) plotKaryogram(dn, aes(color = exReg, fill = exReg)) @ %def \section{\Rfunction{layout\_karyogram}} In this section, a lower level function \Rfunction{layout\_karyogram} is going to be introduced. This is convenient API for constructing karyogram plot and adding more data layer by layer. Function \Rfunction{ggplot} is just to create blank object to add layer on. You need to pay attention to \begin{itemize} \item when you add plots layer by layer, seqnames of different data must be the same to make sure the data are mapped to the same chromosome. For example, if you name chromosome following schema like \textit{chr1} and use just number \textit{1} to name other data, they will be treated as different chromosomes. \item cannot use the same aesthetics mapping multiple time for different data. For example, if you have used aes(color = ), for one data, you cannot use aes(color = ) anymore for mapping variables from other add-on data, this is currently not allowed in \ggplot{}, even though you expect multiple color legend shows up, this is going to confuse people which is which. HOWEVER, \Rfunarg{color} or \Rfunarg{fill} without \Rcode{aes()} wrap around, is allowed for any track, it's set single arbitrary color. This is shown in Figure \ref{fig:low-default-addon}. \item Default rectangle y range is [0, 10], so when you add on more data layer by layer on existing graphics, you can use \Rfunarg{ylim} to control how to normalize your data and plot it relative to chromosome space. For example, with default, chromosome space is plotted between y [0, 10], if you use \Rcode{ylim = c(10 , 20)}, you will stack data right above each chromosomes and with equal width. For geom like 'point', which you need to specify 'y' value in \Rcode{aes()}, we will add 5\% margin on top and at bottom of that track. \end{itemize} <>= ## plot ideogram p <- ggplot(hg19) + layout_karyogram(cytoband = TRUE) p ## eqevelant autoplot(hg19, layout = "karyogram", cytoband = TRUE) @ %def <>= p <- p + layout_karyogram(dn, geom = "rect", ylim = c(11, 21), color = "red") ## commented line below won't work ## the cytoband fill color has been used already. ## p <- p + layout_karyogram(dn, aes(fill = exReg, color = exReg), geom = "rect") p @ %def Then we construct another multiple layer graphics for multiple data using different geom, suppose we want to show RNA-editing sites on chromosome space as rectangle(looks like segment in graphic) and stack a line for another track above. <>= ## plot chromosome space p <- autoplot(seqinfo(dn)) ## make sure you pass rect as geom ## otherwise you just get background p <- p + layout_karyogram(dn, aes(fill = exReg, color = exReg), geom = "rect") values(dn)$pvalue <- rnorm(length(dn)) p + layout_karyogram(dn, aes(x = start, y = pvalue), ylim = c(10, 30), geom = "line", color = "red") p @ %def \end{document}