%\VignetteIndexEntry{PreV90EnsemblVEP}
%\VignetteDepends{GenomicRanges, VariantAnnotation, Biostrings}
%\VignetteKeywords{annotation, variants}
%\VignettePackage{ensemblVEP}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}

\usepackage[margin=0.65in]{geometry}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\Bioconductor}{\software{Bioconductor}}

\SweaveOpts{keep.source=TRUE}

\title{Overview of \Rpackage{ensemblVEP} Pre Ensembl 90}
\author{Valerie Obenchain and Lori Shepherd}

\begin{document}

\maketitle
\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Ensembl provides the facility to predict functional consequences 
of known and unknown variants using the Variant Effect Predictor 
(VEP). The \Rpackage{ensemblVEP} package wraps Ensembl VEP and 
returns the results as \R objects or a file on disk. To use this 
package the Ensembl VEP perl script must be installed in your path.
See the package README for details.

NOTE: As of Ensembl version 88 the VEP script has been renamed from
variant\_effect\_predictor.pl to vep. The ensemblVEP package code
and documentation have been updated to reflect this change.

\noindent Downloads:
\url{http://uswest.ensembl.org/info/docs/tools/vep/index.html}
\\
\noindent Complete documentation for runtime options:
\url{http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html}
\\
\noindent To test that Ensembl VEP is properly installed, enter the
name of the script from the command line:

{\it vep}
\\


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Results as \R{} objects}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<<setup>>=
library(ensemblVEP)
@

The \Rfunction{ensemblVEP} function can return variant consequences from
Ensembl VEP as \R objects (\Robject{GRanges} or \Robject{VCF}) or write
them to a file. The default behavior returns a \Robject{GRanges}. Runtime 
options are stored in a \Robject{VEPParam} object and allow a great deal 
of control over the content and format of the results. See the man pages 
for more details.
<<man_page, eval=FALSE>>=
?ensemblVEP
?VEPParam
@

The default runtime options can be inspected by creating a 
\Robject{VEPParam}.
<<default_VEPParam>>=
param <- VEPParam(version=88)
param
basic(param)
@

Using a vcf file from \Rpackage{VariantAnnotation} as input, we query 
Ensembl VEP with the default runtime parameters.
<<rtn_GRanges, eval=FALSE>>=
fl <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation")
gr <- ensemblVEP(fl)
@

Consequence data are parsed into the metadata columns of
the \Robject{GRanges}. To control the type and amount of data 
returned see the options in \Rcode{output(VEPParam())}.

\begin{verbatim}
> head(gr, 3)
GRanges object with 3 ranges and 23 metadata columns:
               seqnames             ranges strand |   Allele
                  <Rle>          <IRanges>  <Rle> | <factor>
     rs6054257       20 [  14370,   14370]      * |        A
  20:17330_T/A       20 [  17330,   17330]      * |        A
     rs6040355       20 [1110696, 1110696]      * |        G
                         Consequence   IMPACT   SYMBOL            Gene
                            <factor> <factor> <factor>        <factor>
     rs6054257    intergenic_variant MODIFIER     <NA>            <NA>
  20:17330_T/A    intergenic_variant MODIFIER     <NA>            <NA>
     rs6040355 upstream_gene_variant MODIFIER    PSMF1 ENSG00000125818
               Feature_type         Feature              BIOTYPE     EXON
                   <factor>        <factor>             <factor> <factor>
     rs6054257         <NA>            <NA>                 <NA>     <NA>
  20:17330_T/A         <NA>            <NA>                 <NA>     <NA>
     rs6040355   Transcript ENST00000479715 processed_transcript     <NA>
                 INTRON    HGVSc    HGVSp cDNA_position CDS_position
               <factor> <factor> <factor>      <factor>     <factor>
     rs6054257     <NA>     <NA>     <NA>          <NA>         <NA>
  20:17330_T/A     <NA>     <NA>     <NA>          <NA>         <NA>
     rs6040355     <NA>     <NA>     <NA>          <NA>         <NA>
               Protein_position Amino_acids   Codons Existing_variation
                       <factor>    <factor> <factor>           <factor>
     rs6054257             <NA>        <NA>     <NA>               <NA>
  20:17330_T/A             <NA>        <NA>     <NA>               <NA>
     rs6040355             <NA>        <NA>     <NA>               <NA>
               DISTANCE   STRAND    FLAGS SYMBOL_SOURCE   HGNC_ID
               <factor> <factor> <factor>      <factor>  <factor>
     rs6054257     <NA>     <NA>     <NA>          <NA>      <NA>
  20:17330_T/A     <NA>     <NA>     <NA>          <NA>      <NA>
     rs6040355     2610        1     <NA>          HGNC HGNC:9571
  -------
  seqinfo: 1 sequence from  genome
\end{verbatim}

Next we use a vcf of structural variants as input
<<structural_vcf>>=
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
@

and request that a \Robject{VCF} object be returned by setting
the {\it vcf} option in the {\it dataformat} slot to TRUE. 
<<set_vcf>>=
param <- VEPParam(dataformat=c(vcf=TRUE), version=88)
@

An call to \Rfunction{ensemblVEP} results in an error.
\begin{verbatim}
> vcf <- ensemblVEP(fl, param)
2012-12-03 16:40:55 - Starting...
ERROR: Could not detect input file format
\end{verbatim}

In most situations Ensembl VEP can auto-detect the input format. In 
this case, however, it cannot so we explicitly set the {\it format} 
option to 'vcf'.
<<set_format>>=
input(param)$format <- "vcf"
@

Try again.
<<rtn_VCF>>=
vep <- ensemblVEP(fl, param)
@

Success! When a \Robject{VCF} is returned, consequence data are 
included as an unparsed INFO column labeled {\it CSQ}.
<<rtn_VCF>>=
info(vep)$CSQ
@

The \Rfunction{parseCSQToGRanges} function parses these data
into a \Robject{GRanges}. When the rownames of the original
VCF are provided as \Rcode{VCFRowID} a metadata column of the
same name is included in the output.
<<parseCSQToGRanges>>=
vcf <- readVcf(fl, "hg19")
csq <- parseCSQToGRanges(vep, VCFRowID=rownames(vcf))
head(csq, 3)
@ 

The \Rcode{VCFRowID} columns maps the expanded {\it CSQ} data back
to the rows in the \Rclass{VCF} object. This index can be used to
subset the original VCF.
<<map_rownames>>=
vcf[csq$"VCFRowID"]
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Write results to a file}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In the previous section we saw Ensembl VEP results returned as 
\R{} objects in the workspace. Alternatively, these results can
be written directly to a file. The flag that controls how the
data are returned is the {\it output\_file} flag in the {\it input} 
options.

When {\it output\_file} is an empty character (default), the results
are returned as either a \Rclass{GRanges} or \Rclass{VCF} object.
<<output_file_default>>=
input(param)$output_file
@

To write results directly to a file, specify a file name for the 
{\it output\_file} flag.
<<output_file_filename>>=
input(param)$output_file <- "/mypath/myfile"
@

The file can be written as a {\it vcf} or {\it gvf} by setting the
options in the {\it dataformat} slot to TRUE. If neither of {\it vcf}
or {\it gvf} are TRUE the file is written out as tab delimited.
<<ouput_slot>>=
## Write a vcf file to myfile.vcf:
myparam <- VEPParam(dataformat=c(vcf=TRUE),
                    input=c(output_file="/path/myfile.vcf"), version=88)
## Write a gvf file to myfile.gvf:
myparam <- VEPParam(dataformat=c(gvf=TRUE),
                    input=c(output_file="/path/myfile.gvf"), version=88)
## Write a tab delimited file to myfile.txt:
myparam <- VEPParam(input=c(output_file="/path/myfile.txt"), version=88)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Configuring runtime options}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The Ensembl VEP web page has complete descriptions of all runtime
options.
\url{http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html}
Below are examples of how to configure the runtime options
in the \Rclass{VEPParam} for specific situations. Investigate the 
differences in results using a sample file from \Rpackage{VariantAnnotation}.
<<samplefile>>=
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
@

\begin{itemize}
  \item Add regulatory region consequences:
<<runtime1, eval=FALSE>>=
param <- VEPParam(output=c(regulatory=TRUE), version=88)
gr <- ensemblVEP(fl, param) 
@

  \item Specify input file format as VCF, add HGNC gene 
        identifiers, output SO consequence terms:
<<runtime2, eval=FALSE>>=
param <- VEPParam(input=c(format="vcf"),
                  output=c(terms="so"),
                  identifiers=c(symbol=TRUE), version=88)
gr <- ensemblVEP(fl, param) 
@

  \item Check for co-located variants, output only coding 
        sequence consequences, output HGVS names:
<<runtime3, eval=FALSE>>=
param <- VEPParam(filterqc=c(coding_only=TRUE),
                  colocatedVariants=c(check_existing=TRUE),
                  identifiers=c(symbol=TRUE), version=88)
gr <- ensemblVEP(fl, param) 
@

  \item Add SIFT score and prediction, PolyPhen prediction only, 
        output results as \Rcode{VCF}:
\begin{verbatim}
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
param <- VEPParam(output=c(sift="b", polyphen="p"), 
                  dataformat=c(vcf=TRUE), version=88)
vcf <- ensemblVEP(fl, param)
csq <- parseCSQToGRanges(vcf)

> head(levels(mcols(csq)$SIFT))
[1] "deleterious(0.01)" "deleterious(0.02)" "deleterious(0.03)"
[4] "deleterious(0.04)" "deleterious(0.05)" "deleterious(0)" 

> levels(mcols(csq)$PolyPhen)
[1] "benign"            "possibly_damaging" "probably_damaging"
[4] "unknown" 
\end{verbatim}
\end{itemize}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\Rcode{sessionInfo()}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<<sessionInfo>>=
sessionInfo()
@

\end{document}