%\VignetteIndexEntry{An Introduction to intansv} %\VignetteDepends{} %\VignetteKeywords{Structural variation} %\VignettePackage{intansv} \documentclass[10pt]{article} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\program}[1]{\textsf{#1}} \newcommand{\R}{\program{R}} \newcommand{\intansv}{\Rpackage{intansv}} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \bibliographystyle{plainnat} \title{An Introduction to \intansv{}} \author{Wen Yao} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents <>= options(width=72) @ \section{Introduction} Currently, dozens of programs have been developped to predict structural variations (SV) utilizing next-generation sequencing data. However, the prediction of multiple methods have to be integrated to get relatively reliable results \citep{Drosophila}. The \intansv{} package is designed for integrating results of different methods, annotating effects caused by SVs to genes and its elements, displaying the genomic distribution of SVs and visualizing SVs in specific genomic region. In this vignette, we will rely on a simple, illustrative example dataset to explain the usage of \intansv{}. The \intansv{} package is available at bioconductor.org and can be downloaded via \Rfunction{biocLite}: <>= source("http://bioconductor.org/biocLite.R") biocLite("intansv") @ <>= library(intansv) @ \section{Usage of \intansv{} with example dataset} The example dataset was obtained by running seven different programs with the alignment results of a public available NGS dataset (NCBI SRA accession number SRA012177) to the rice reference genome (\url{http://rice.plantbiology.msu.edu/}). These seven programs including BreakDancer \citep{BreakDancer}, Pindel \citep{Pindel}, CNVnator \citep{CNVnator}, DELLY \citep{DELLY}, Lumpy \citep{Lumpy}, softSearch \citep{softSearch}, and SVseq2 \citep{svseq} were run with default parameters. The predicted SVs of chromosomes, \emph{chr05} and \emph{chr10}, were provided in the example dataset. \subsection{Read in predictions of different programs} Most of the predictions of programs are tedious and need to be formatted for further analysis. \intansv{} provides functions to read in the predictions of five popular programs, BreakDancer, Pindel, CNVnator, DELLY and SVseq2. During this process, the SV predictions with low quality would be filtered out and overlapped predictions would be resolved. We begin our discussion by reading in some example data. <>= breakdancer <- readBreakDancer(system.file("extdata/ZS97.breakdancer.sv", package="intansv")) str(breakdancer) @ BreakDancer is able to predict deletions and inversions. The prediction of all kind of SVs were provided as a single file by BreakDancer. You can feed this file to \Rfunction{readBreakdancer} and it will do all the tedious work for you. <>= cnvnator <- readCnvnator(system.file("extdata/cnvnator",package="intansv")) str(cnvnator) @ CNVnator is able to predict deletions and duplications. The final output of CNVnator usually contains several files and each file is the output for a specific chromosome. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readCnvnator}. Then it will do all the jobs for you. However, the directory given to \Rfunction{readCnvnator} should only contain the final output files of CNVnator. See the example dataset for more details. <>= svseq <- readSvseq(system.file("extdata/svseq2",package="intansv")) str(svseq) @ SVseq2 is able to predict deletions. The final output of SVseq2 contains several files and each file is the output for a specific chromosome. What's more, different kind of SVs were written to different files. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readSvseq}. However, the files contain the predicted deletions in this directory should be named with suffix of \emph{.del}. See the example dataset for more details. <>= delly <- readDelly(system.file("extdata/delly",package="intansv")) str(delly) @ DELLY is able to predict deletions, inversions and duplications. The final prediction of different kind of SVs given by DELLY were written to different files. DELLY v0.6.1 outputs vcf files now. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readDelly}. See the example dataset for more details. <>= pindel <- readPindel(system.file("extdata/pindel",package="intansv")) str(pindel) @ Pindel is able to predict deletions, inversions and duplications. The final prediction for different chromosomes given by Pindel were written to different files. And different kind of SVs were written to different files. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readPindel}. However, the files contain the predicted deletions, inversions and duplication in this directory should be named with suffix of \emph{\_D}, \emph{\_INV} and \emph{\_TD} respectively. See the example dataset for more details. <>= lumpy <- readLumpy(system.file("extdata/ZS97.lumpy.pesr.bedpe", package="intansv")) str(lumpy) @ Lumpy is able to predict deletions, inversions and duplications. The prediction of all kind of SVs were provided as a single file by Lumpy. You can feed this file to \Rfunction{readLumpy} and it will do all the tedious work for you. <>= softSearch <- readSoftSearch(system.file("extdata/ZS97.softsearch",package="intansv")) str(softSearch) @ softSearch is able to predict deletions, inversions and duplications. The prediction of all kind of SVs were provided as a single file by softSearch. You can feed this file to \Rfunction{readSoftSearch} and it will do all the tedious work for you. The results of these seven programs have now been read into R and storaged as R data structure of lists with different compoents representing different types of SVs. We only care about three types of SVs, deletions, duplications and inversions. \subsection{Integrate predictions of different programs} To get more reliable results, we need to integrate the predictions of different programs. See our paper \citep{intansv} for more details on how \intansv{} integrate the predictions of different programs. <>= sv_all_methods <- methodsMerge(breakdancer,pindel,cnvnator,delly,svseq) str(sv_all_methods) @ The integrated results contain 897 deletions, 13 duplications and 12 inversions. However, it's not necessary for you to feed \intansv{} the output of all five programs. The following example shows the integration of four programs: BreakDancer, CNVnator, DELLY and SVseq2. <>= sv_all_methods.nopindel <- methodsMerge(breakdancer,cnvnator,delly,svseq) str(sv_all_methods.nopindel) @ \intansv{} is also able to integrate SV predictions of other programs. However, predictions of other programs should be provided as a data frame with six columns: \begin{itemize} \item{chromosome}{\qquad the chromosome ID of a SV} \item{pos1}{\qquad the start coordinate of a SV} \item{pos2}{\qquad the end coordinate of a SV} \item{size}{\qquad the size of a SV} \item{type}{\qquad the type of a SV} \item{methods}{\qquad the program used to get this SV prediction} \end{itemize} \subsection{Annotate the effects of SVs} To annotate the effects of SVs to genes, we need the genomic annotation file, which is usually a \emph{.gff3} file. This file could be read into R by the \Rpackage{rtracklayer} package and storaged as genomic ranges data. An example genomic annotation file has been storaged in the example dataset of \intansv{}. The follow code shows how to read a \emph{.gff3} file into R. <>= library(rtracklayer) msu_gff_v7 <- import.gff(system.file("extdata/chr05_chr10.gff3", package="intansv"), asRangedData=FALSE) head(msu_gff_v7) @ <>= load(system.file("extdata/genome.anno.RData",package="intansv")) sv_all_methods.anno <- llply(sv_all_methods,svAnnotation, genomeAnnotation=msu_gff_v7) names(sv_all_methods.anno) head(sv_all_methods.anno$del) @ Since the function \Rfunction{svAnnotation} only accept structural variations of the data frame format, we need to apply this function to each component of \emph{sv\_all\_methods}, which is a list. \subsection{Display the genomic distribution of SVs} Now, let's get a genomic view of SVs. We first split chromosomes into windows of 1 Mb and then display the number of SVs in each window as circular barplot. We also need the length of each chromosome, which was storaged in the example dataset of \intansv{}. % <>= genome plotChromosome(genome,sv_all_methods,1000000) @ \begin{figure}[!htb] \begin{center} \includegraphics[width=0.8\textwidth]{intansvOverview-genomic_distribution} \caption{\label{genomic_distribution}% Visualization of SVs distribution in the whole genome. Blue: deletions, Red: duplications, Green: inversions.} \end{center} \end{figure} \subsection{Visualize SVs in specific genomic region} We could also visualize SVs in specific genomic region. Here, we also need the genomic annotation file (named with suffix \emph{.gff3} or \emph{.gff}). % <>= head(msu_gff_v7,n=3) plotRegion(sv_all_methods,msu_gff_v7,"chr05",1,200000) @ \begin{figure}[!htb] \begin{center} \includegraphics[width=0.7\textwidth]{intansvOverview-region_visualization} \caption{\label{region_visualization}% SVs in genomic region chr05:1-200000. Blue: genes, Red: deletions, Green: duplications, Purple: inversions. } \end{center} \end{figure} This command showed the SVs in the genomic region \emph{chr05:1-200000}. The genes and SVs were shown as circular rectangles with different color. \pagebreak[4] \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \bibliography{intansvOverview} \end{document}