%\VignetteIndexEntry{GenomeInfoDb: Introduction to GenomeInfoDb} %\VignetteDepends{} %\VignetteKeywords{organism, GenomeInfoDb} %\VignettePackage{GenomeInfoDb} %\VignetteEngine{knitr::knitr} \documentclass{article} <<style, eval=TRUE, echo=FALSE, results="asis">>= BiocStyle::latex() @ \title{An Introduction to \Biocpkg{GenomeInfoDb}} \author{Martin Morgan, Herv\'e Pag\`es, Marc Carlson, Sonali Arora} \date{Modified: 17 January, 2014. Compiled: \today} \begin{document} \maketitle \tableofcontents <<preliminaries, echo=FALSE, message=FALSE>>= library(GenomeInfoDb) library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) @ \section{Introduction} The \Biocpkg{GenomeInfoDb} provides an interface to access seqlevelsStyles (such as UCSC, NCBI, Ensembl) and their supported mappings for organisms. For instance, for Homo sapiens, seqlevelsStyle "UCSC" maps to "chr1", "chr2", ..., "chrX","chrY". The section below introduces these functions with examples. \section{Functionality for all existing organisms} \subsection{genomeStyles} The \Rfunction{genomeStyles} lists out for each organism, the seqlevelsStyles and their mappings. <<genomeStyles1>>= seqmap <- genomeStyles() head(seqmap,n=2) @ %% Oragnism's supported by GenomeInfoDb can be found by : <<name>>= names(genomeStyles()) @ If one knows the organism one is interested in, then we can directly access the information for the given organism along. Each function accepts an argument called species which as "genus species", the default is "Homo sapiens". In the following example we list out only the first five entries returned by the code snippet. <<genomeStyles2>>= head(genomeStyles("Homo_sapiens"),5) @ %% We can also check if a given style is supported by GenomeInfoDb for a given species. For example, if we want to know if "UCSC" mapping is supported for "Homo sapiens" we can ask : <<style-present>>= "UCSC" %in% names(genomeStyles("Homo_sapiens")) @ \subsection{extractSeqlevels} We can also extract the desired seqlevelsStyle from a given organism using the \Rfunction{extractSeqlevels} <<extractSeqlevels>>= extractSeqlevels(species="Arabidopsis_thaliana", style="NCBI") @ %% \subsection{extractSeqlevelsByGroup} We can also extract the desired seqlevelsStyle from a given organism based on a group ( Group - 'auto' denotes autosomes, 'circular' denotes circular chromosomes and 'sex' denotes sex chromosomes; the default is all chromosomes are returned). <<extractSeqlevelsgroup>>= extractSeqlevelsByGroup(species="Arabidopsis_thaliana", style="NCBI", group="auto") @ %% \subsection{seqlevelsStyle} We can find the seqname Style for a given character vector by using the \Rfunction{seqlevelsStyle} <<seqlevelsStyle>>= seqlevelsStyle(paste0("chr",c(1:30))) seqlevelsStyle(c("2L","2R","X","Xhet")) @ %% \subsection{seqlevelsInGroup} We can also subset a given character vector containing seqnames using the \Rfunction{seqlevelsInGroup}. We currently support 3 groups: 'auto' for autosomes, 'sex' for allosomes/sex chromosomes and circular for 'circular' chromosomes. The user can also provide the style and species they are working with. In the following examples, we extract the sex, auto and circular chromosomes for Homo sapiens : <<keepChr-txdb>>= newchr <- paste0("chr",c(1:22,"X","Y","M","1_gl000192_random","4_ctg9_hap1")) seqlevelsInGroup(newchr, group="sex") seqlevelsInGroup(newchr, group="auto") seqlevelsInGroup(newchr, group="circular") seqlevelsInGroup(newchr, group="sex","Homo_sapiens","UCSC") @ %% if we have a vector containing seqnames and we want to verify the species and style for them , we can use: <<check2>>= seqnames <- c("chr1", "chr9", "chr2", "chr3", "chr10") all(seqnames %in% extractSeqlevels("Homo_sapiens", "UCSC")) @ \subsection{orderSeqlevels} The \Rfunction{orderSeqlevels} can return the order of a given character vector which contains seqnames.In the following example, we show how you can find the order for a given seqnames character vector. <<orderSeqlevels>>= seqnames <- c("chr1","chr9", "chr2", "chr3", "chr10") orderSeqlevels(seqnames) seqnames[orderSeqlevels(seqnames)] @ %% \subsection{rankSeqlevels} The \Rfunction{rankSeqlevels} can return the rank of a given character vector which contains seqnames.In the following example, we show how you can find the rank for a given seqnames character vector. <<rankSeqlevels>>= seqnames <- c("chr1","chr9", "chr2", "chr3", "chr10") rankSeqlevels(seqnames) @ %% \subsection{mapSeqlevels} Returns a matrix with 1 column per supplied sequence name and 1 row per sequence renaming map compatible with the specified style. If \Rcode{best.only} is \Rcode{TRUE} (the default), only the "best" renaming maps (i.e. the rows with less NAs) are returned. <<find>>= mapSeqlevels(c("chrII", "chrIII", "chrM"), "NCBI") @ %% We also have several seqlevel utility functions.Let us construct a basic GRanges and show how these functions can be used. . <<basic-gr>>= gr <- GRanges(paste0("ch",1:35), IRanges(1:35, width=5)) gr @ As you can see , we have "ch" instead of "chr" for chromosome names. We can use \Rfunction{renameSeqlevels} to change the "ch" to "chr" \subsection{renameSeqlevels} As the first argument - it takes the object whose seqlevels we need to change, and as the second argument it takes a named vector which has the changes. <<renameseqlevels>>= newnames <- paste0("chr",1:35) names(newnames) <- paste0("ch",1:35) head(newnames) gr <- renameSeqlevels(gr,newnames) gr @ Humans have just 22 primary chromosomes - but here we have some extra seqlevels which we want to remove - there are several ways we can achieve this: \subsection{dropSeqlevels} Here the second argument is the seqlevels that you want to drop. <<dropseqlevels>>= dropSeqlevels(gr,paste0("chr",23:35)) @ \subsection{keepSeqlevels} Here the second argument is the seqlevels that you want to keep. <<keepseqlevels>>= keepSeqlevels(gr, paste0("chr",1:22)) @ \subsection{keepStandardChromosomes} This function internally uses the pre-defined tables inside GenomeInfoDb to find the correct seqlevels according to the style of the object. <<keepstdchr>>= keepStandardChromosomes(gr) @ One can also specify the optional species argument to bemore precise. <<keepstdchr-2>>= plantgr <- GRanges(c(1:5,"MT","Pltd"), IRanges(1:7,width=5)) keepStandardChromosomes(plantgr, species="Arabidopsis thaliana") @ \section{Classes inside GenomeInfoDb package} \subsection{Genome-Description class} We also provide a Genome Description class which can be used in the following way: <<genome-description-class, message=FALSE>>= library(BSgenome.Celegans.UCSC.ce2) class(Celegans) is(Celegans, "GenomeDescription") provider(Celegans) seqinfo(Celegans) gendesc <- as(Celegans, "GenomeDescription") class(gendesc) gendesc provider(gendesc) seqinfo(gendesc) bsgenomeName(gendesc) @ \subsection{SeqInfo class} <<Seqinfo-egs>>= ## Note that all the arguments (except 'genome') must have the ## same length. 'genome' can be of length 1, whatever the lengths ## of the other arguments are. x <- Seqinfo(seqnames=c("chr1", "chr2", "chr3", "chrM"), seqlengths=c(100, 200, NA, 15), isCircular=c(NA, FALSE, FALSE, TRUE), genome="toy") length(x) seqnames(x) names(x) seqlevels(x) seqlengths(x) isCircular(x) genome(x) x[c("chrY", "chr3", "chr1")] # subset by names ## Rename, drop, add and/or reorder the sequence levels: xx <- x seqlevels(xx) <- sub("chr", "ch", seqlevels(xx)) # rename xx seqlevels(xx) <- rev(seqlevels(xx)) # reorder xx seqlevels(xx) <- c("ch1", "ch2", "chY") # drop/add/reorder xx seqlevels(xx) <- c(chY="Y", ch1="1", "22") # rename/reorder/drop/add xx y <- Seqinfo(seqnames=c("chr3", "chr4", "chrM"), seqlengths=c(300, NA, 15)) y merge(x, y) # rows for chr3 and chrM are merged suppressWarnings(merge(x, y)) ## Note that, strictly speaking, merging 2 Seqinfo objects is not ## a commutative operation, i.e., in general 'z1 <- merge(x, y)' ## is not identical to 'z2 <- merge(y, x)'. However 'z1' and 'z2' ## are guaranteed to contain the same information (i.e. the same ## rows, but typically not in the same order): suppressWarnings(merge(y, x)) ## This contradicts what 'x' says about circularity of chr3 and chrM: isCircular(y)[c("chr3", "chrM")] <- c(TRUE, FALSE) y if (interactive()) { merge(x, y) # raises an error } @ \section{Examples} \subsection{converting seqlevel styles (eg:UCSC to NCBI)} A quick example using Drosophila Melanogaster. The txdb object contains seqlevels in UCSC style, we want to convert them to NCBI <<quick-style>>= txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene seqlevels(txdb) genomeStyles("Drosophila melanogaster") mapSeqlevels(seqlevels(txdb), "NCBI") @ \subsection{converting styles and removing unwanted seqlevels} Suppose we read in a Bam file or a BED file and the resulting GRanges have a lot of seqlevels which are not required by your analysis or you want to rename the seqlevels from the current style to your own style (eg:USCS to NCBI), we can use the functionality provided by GenomeInfoDb to do that. Let us say that we have extracted the seqlevels of the Seqinfo object(say GRanges from a BED file) in a variable called "sequence". <<sequence, eval=FALSE>>= sequence <- seqlevels(x) ## sequence is in UCSC format and we want NCBI style newStyle <- mapSeqlevels(sequence,"NCBI") newStyle <- newStyle[complete.cases(newStyle)] # removing NA cases. ## rename the seqlevels x <- renameSeqlevels(x,newStyle) ## keep only the seqlevels you want (say autosomes) auto <- extractSeqlevelsByGroup(species="Homo sapiens", style="NCBI", group="auto") x <- keepSeqlevels(x,auto) @ \section{Session Information} Here is the output of \Rfunction{sessionInfo} on the system on which this document was compiled: <<sessionInfo, results='asis', eval=TRUE>>= toLatex(sessionInfo()) @ \end{document}