%\VignetteIndexEntry{GenomeInfoDb: Introduction to GenomeInfoDb}
%\VignetteDepends{}
%\VignetteKeywords{organism, GenomeInfoDb}
%\VignettePackage{GenomeInfoDb}
%\VignetteEngine{knitr::knitr}
\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\title{An Introduction to \Biocpkg{GenomeInfoDb}}
\author{Martin Morgan, Herv\'e Pag\`es, Marc Carlson, Sonali Arora}
\date{Modified: 17 January, 2014. Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

<<preliminaries, echo=FALSE, message=FALSE>>=
library(GenomeInfoDb)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
@ 

\section{Introduction}

The \Biocpkg{GenomeInfoDb} provides an interface to access seqlevelsStyles 
(such as UCSC, NCBI, Ensembl) and their supported mappings for organisms. 
For instance, for Homo sapiens, seqlevelsStyle "UCSC" maps to "chr1", "chr2", 
..., "chrX","chrY". The section below introduces these functions with examples.

\section{Functionality for all existing organisms}
 

\subsection{genomeStyles}      
The \Rfunction{genomeStyles} lists out for each organism, the 
seqlevelsStyles and their mappings.
<<genomeStyles1>>=
seqmap <- genomeStyles()
head(seqmap,n=2)
@ 
%% 

Oragnism's supported by GenomeInfoDb can be found by :
<<name>>=
names(genomeStyles())
@

If one knows the organism one is interested in, then we can directly access 
the information for the given organism along. Each function accepts an 
argument called species which as "genus species", the default 
is "Homo sapiens". In the following example we list out only the first five 
entries returned by the code snippet.


<<genomeStyles2>>=
head(genomeStyles("Homo_sapiens"),5)
@ 
%% 

We can also check if a given style is supported by GenomeInfoDb for a given
species. For example, if we want to know if "UCSC" mapping is supported for 
"Homo sapiens" we can ask :

<<style-present>>=
"UCSC" %in% names(genomeStyles("Homo_sapiens"))
@


\subsection{extractSeqlevels}
We can also extract the desired seqlevelsStyle from a given organism using
the \Rfunction{extractSeqlevels}
<<extractSeqlevels>>=
extractSeqlevels(species="Arabidopsis_thaliana", style="NCBI")
@ 
%%

\subsection{extractSeqlevelsByGroup}
We can also extract the desired seqlevelsStyle from a given organism based on 
a group ( Group - 'auto' denotes autosomes, 'circular' denotes circular
chromosomes and 'sex' denotes sex chromosomes; the default is all chromosomes
are returned).
<<extractSeqlevelsgroup>>=
extractSeqlevelsByGroup(species="Arabidopsis_thaliana", style="NCBI",
                         group="auto")
@ 
%%

\subsection{seqlevelsStyle}
We can find the seqname Style for a given character vector by using the 
\Rfunction{seqlevelsStyle}
<<seqlevelsStyle>>=
seqlevelsStyle(paste0("chr",c(1:30)))
seqlevelsStyle(c("2L","2R","X","Xhet"))
@
%%

\subsection{seqlevelsInGroup}
We can also subset a given character vector containing seqnames  using the
\Rfunction{seqlevelsInGroup}.
We currently support 3 groups: 'auto' for autosomes, 'sex' for allosomes/sex 
chromosomes and circular for 'circular' chromosomes.
The user can also provide the style and species they are working with.
In the following examples, we extract the sex, auto and circular chromosomes 
for Homo sapiens :

<<keepChr-txdb>>=
newchr <- paste0("chr",c(1:22,"X","Y","M","1_gl000192_random","4_ctg9_hap1"))
seqlevelsInGroup(newchr, group="sex")
seqlevelsInGroup(newchr, group="auto")
seqlevelsInGroup(newchr, group="circular")
seqlevelsInGroup(newchr, group="sex","Homo_sapiens","UCSC")
@
%%

if we have a vector containing seqnames and we want to verify the
species and style for them , we can use:
<<check2>>=
seqnames <- c("chr1", "chr9", "chr2", "chr3", "chr10")
all(seqnames %in% extractSeqlevels("Homo_sapiens", "UCSC"))
@

\subsection{orderSeqlevels}
The \Rfunction{orderSeqlevels} can return the order of a given character vector 
which contains seqnames.In the following example, we show how you can find the 
order for a given seqnames character vector.
<<orderSeqlevels>>=
seqnames <- c("chr1","chr9", "chr2", "chr3", "chr10")
orderSeqlevels(seqnames)
seqnames[orderSeqlevels(seqnames)]
@
%%

\subsection{rankSeqlevels}
The \Rfunction{rankSeqlevels} can return the rank of a given character vector 
which contains seqnames.In the following example, we show how you can find the 
rank for a given seqnames character vector.
<<rankSeqlevels>>=
seqnames <- c("chr1","chr9", "chr2", "chr3", "chr10")
rankSeqlevels(seqnames)
@
%%

\subsection{mapSeqlevels}
Returns a matrix with 1 column per supplied sequence name and 1 row per 
sequence renaming map compatible with the specified style.  If \Rcode{best.only}
is \Rcode{TRUE} (the default), only the "best" renaming maps (i.e. the rows with
less NAs) are returned.
<<find>>=
mapSeqlevels(c("chrII", "chrIII", "chrM"), "NCBI")
@ 
%%

We also have several seqlevel utility functions.Let us construct a basic 
GRanges and show how these functions can be used. .

<<basic-gr>>=
gr <- GRanges(paste0("ch",1:35), IRanges(1:35, width=5))
gr
@

As you can see , we have "ch" instead of "chr" for chromosome names. 
We can use \Rfunction{renameSeqlevels} to change the "ch" to "chr"

\subsection{renameSeqlevels}
As the first argument - it takes the object whose seqlevels we need to change, 
and as the second argument it takes a named vector which has the changes.

<<renameseqlevels>>=
newnames <- paste0("chr",1:35)
names(newnames) <- paste0("ch",1:35)
head(newnames)
gr <- renameSeqlevels(gr,newnames)
gr
@

Humans have just 22 primary chromosomes - but here we have some extra seqlevels
which we want to remove - there are several ways we can achieve this:

\subsection{dropSeqlevels}  
Here the second argument is the seqlevels that you want to drop.
<<dropseqlevels>>=
dropSeqlevels(gr,paste0("chr",23:35))
@

\subsection{keepSeqlevels}  
Here the second argument is the seqlevels that you want to keep.
<<keepseqlevels>>=
keepSeqlevels(gr, paste0("chr",1:22))
@

\subsection{keepStandardChromosomes}  
This function internally uses the pre-defined tables inside GenomeInfoDb to 
find the correct seqlevels according to the style of the object. 
<<keepstdchr>>=
keepStandardChromosomes(gr)
@
One can also specify the optional species argument to bemore precise. 

<<keepstdchr-2>>=
plantgr <- GRanges(c(1:5,"MT","Pltd"), IRanges(1:7,width=5))
keepStandardChromosomes(plantgr, species="Arabidopsis thaliana")
@

\section{Classes inside GenomeInfoDb package}

\subsection{Genome-Description class}

We also provide a Genome Description class which can be used in the following
way:

<<genome-description-class, message=FALSE>>=
library(BSgenome.Celegans.UCSC.ce2)
class(Celegans)
is(Celegans, "GenomeDescription")
provider(Celegans)
seqinfo(Celegans)
gendesc <- as(Celegans, "GenomeDescription")
class(gendesc)
gendesc
provider(gendesc)
seqinfo(gendesc)
bsgenomeName(gendesc)
@

\subsection{SeqInfo class}

<<Seqinfo-egs>>=
## Note that all the arguments (except 'genome') must have the
## same length. 'genome' can be of length 1, whatever the lengths
## of the other arguments are.
x <- Seqinfo(seqnames=c("chr1", "chr2", "chr3", "chrM"),
             seqlengths=c(100, 200, NA, 15),
             isCircular=c(NA, FALSE, FALSE, TRUE),
             genome="toy")
length(x)
seqnames(x)
names(x)
seqlevels(x)
seqlengths(x)
isCircular(x)
genome(x)

x[c("chrY", "chr3", "chr1")]  # subset by names

## Rename, drop, add and/or reorder the sequence levels:
xx <- x
seqlevels(xx) <- sub("chr", "ch", seqlevels(xx))  # rename
xx
seqlevels(xx) <- rev(seqlevels(xx))  # reorder
xx
seqlevels(xx) <- c("ch1", "ch2", "chY")  # drop/add/reorder
xx
seqlevels(xx) <- c(chY="Y", ch1="1", "22")  # rename/reorder/drop/add
xx

y <- Seqinfo(seqnames=c("chr3", "chr4", "chrM"),
             seqlengths=c(300, NA, 15))
y
merge(x, y)  # rows for chr3 and chrM are merged
suppressWarnings(merge(x, y))

## Note that, strictly speaking, merging 2 Seqinfo objects is not
## a commutative operation, i.e., in general 'z1 <- merge(x, y)'
## is not identical to 'z2 <- merge(y, x)'. However 'z1' and 'z2'
## are guaranteed to contain the same information (i.e. the same
## rows, but typically not in the same order):
suppressWarnings(merge(y, x))

## This contradicts what 'x' says about circularity of chr3 and chrM:
isCircular(y)[c("chr3", "chrM")] <- c(TRUE, FALSE)
y
if (interactive()) {
  merge(x, y)  # raises an error
}
@

\section{Examples}

\subsection{converting seqlevel styles (eg:UCSC to NCBI)}

A quick example using Drosophila Melanogaster. The txdb object contains 
seqlevels in UCSC style, we want to convert them to NCBI 

<<quick-style>>=
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
seqlevels(txdb)
genomeStyles("Drosophila melanogaster")
mapSeqlevels(seqlevels(txdb), "NCBI")
@

\subsection{converting styles and removing unwanted seqlevels}

Suppose we read in a Bam file or a BED file and the resulting GRanges have a lot
of seqlevels which are not required by your analysis or you want to rename
the seqlevels from the current style to your own style (eg:USCS to NCBI), we can 
use the functionality provided by GenomeInfoDb to do that.

Let us say that we have extracted the seqlevels of the Seqinfo object(say 
GRanges from a BED file) in a variable called "sequence". 
<<sequence, eval=FALSE>>=
sequence <- seqlevels(x) 

## sequence is in UCSC format and we want NCBI style
newStyle <- mapSeqlevels(sequence,"NCBI")
newStyle <- newStyle[complete.cases(newStyle)] # removing NA cases.

## rename the seqlevels 
x <- renameSeqlevels(x,newStyle)

## keep only the seqlevels you want (say autosomes)
auto <- extractSeqlevelsByGroup(species="Homo sapiens", style="NCBI", 
                                group="auto")
x <- keepSeqlevels(x,auto)
@

\section{Session Information}
Here is the output of \Rfunction{sessionInfo} on the system on which
this document was compiled:
<<sessionInfo, results='asis', eval=TRUE>>=
toLatex(sessionInfo())
@

\end{document}