%\VignetteIndexEntry{InPAS Vignette} %\VignetteDepends{InPAS} %\VignetteKeywords{sequence 3UTR usage} %\VignettePackage{InPAS} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \usepackage{hyperref} \usepackage{url} \usepackage[numbers]{natbib} \usepackage{graphicx} \bibliographystyle{plainnat} \author{Jianhong Ou, Sung Mi Park, Michael R. Green and Lihua Julie Zhu} \begin{document} \SweaveOpts{concordance=TRUE} \title{InPAS guide} \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which occurs in most human genes. APA in a gene can result in altered expression of the gene, which can lead pathological effect to the cells, such as uncontrolled cell cycle and growth. However, there are only limited ways to identify and quantify APA in genes, and most of them suffers from complicated process for library construction and requires large amount of RNAs. RNA-seq has become one of the most commonly used methods for quantifying genome-wide gene expression. There are massive RNA-seq datasets available publicly such as GEO and TCGA. For this reason, we develop the InPAS algorithm for identifying APA from conventional RNA-seq data. % Recently, two groups reported the methods how to mine APA usage from % RNA-seq. % % Lu et al. developed the algorithm based on a Poisson hidden Markov % model (PHMM) to identify potential APA in human liver and brain % cortex tissues. This method is applicable to both single sample and % pooled samples. However, if there are two or more groups of samples % to compare, each group should be run separately. Moreover, as the % limitation of the algorithm, each run of different group can result % different APA prediction in a same gene, so it is hardly possible % to compare the results from different groups. Furthermore, this % method hardly can find the novel distal cleavage site. % % Recently, Xu et al. developed a novel algorithm, DaPars, which % is possible for \textit{de novo} identification of dynamic APAs % from RNA-seq data. This method can be used to find not only the % novel proximal site but also the novel distal site. The limitation % of this method is it can only handle two group dataset, but not % one group dataset. Moreover, as DaPars algorithm regards the % last exon as 3'UTR, it can lead biased detection of more proximal % site than distal site. % % Here, to overcome this limitations, we developed InPAS ( % \underline{I}dentification of \underline{N}ovel alternative % \underline{P}oly\underline{A}denylation \underline{S}ites), a % bioconductor tool for \textit{de novo} cleavage site discovery. % Though InPAS was developed form DaPars algorithm, there are several % improvements compared to DaPars. It uses the power of % \Rpackage{cleanUpdTSeq} to adjust cleavage site. Moreover, % it extracts annotation from \Rpackage{GenomicFeatures} to make % sure the correct 3'UTR ranges to prevent biased proximal % site detection. The workflow for InPAS is: \begin{enumerate} \item Calculate coverage from BEDGraph files \item Predict cleavage sites \item Estimate 3UTR usage \end{enumerate} \section{How to} To use InPAS, BSgenome and TxDb object have to be loaded before run. <>= library(InPAS) library(BSgenome.Mmusculus.UCSC.mm10) library(TxDb.Mmusculus.UCSC.mm10.knownGene) path <- file.path(find.package("InPAS"), "extdata") @ Users can prepare annotaiton by \Rfunction{utr3Annotation} with a TxDb and org annotation. The 3UTR annotation prepared by \Rfunction{utr3Annotation} includes the gaps to next annotation region. <>= library(org.Hs.eg.db) samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(samplefile) utr3.sample.anno <- utr3Annotation(txdb=txdb, orgDbSYMBOL="org.Hs.egSYMBOL") utr3.sample.anno @ Users can load mm10 and hg19 annotation from pre-prepared data. Here we loaded the prepared mm10 3UTR annotation file. <>= ##step1 annotation # load from dataset data(utr3.mm10) @ The coverage is calculated from BEDGraph file. The RNA-seq BAM files could be converted to BEDGraph files by bedtools genomecov tool (parameter: -bg -split).PWM and a classifier of polyA signal can be used for adjusting CP sites prediction. <>= load(file.path(path, "polyA.rds")) library(cleanUpdTSeq) data(classifier) bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), file.path(path, "UM15.extract.bedgraph")) hugeData <- FALSE ##step1 Calculate coverage coverage <- coverageFromBedGraph(bedgraphs, tags=c("Baf3", "UM15"), genome=BSgenome.Mmusculus.UCSC.mm10, hugeData=hugeData) ##step2 Predict cleavage sites CPsites <- CPsites(coverage=coverage, gp1="Baf3", gp2="UM15", genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, search_point_START=200, cutEnd=.2, long_coverage_threshold=3, PolyA_PWM=pwm, classifier=classifier, shift_range=100) ##step3 Estimate 3UTR usage res <- utr3UsageEstimation(CPsites=CPsites, coverage=coverage, utr3=utr3.mm10, genome=BSgenome.Mmusculus.UCSC.mm10, gp1="Baf3", gp2="UM15", short_coverage_threshold=15, long_coverage_threshold=3, adjusted.P_val.cutoff=0.05, dPDUI_cutoff=0.3, PDUI_logFC_cutoff=0.59) res[1] @ The process described above can be done in one step. <>= if(interactive()){ res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, gp1="Baf3", gp2="UM15", txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, search_point_START=200, short_coverage_threshold=15, long_coverage_threshold=3, cutStart=0, cutEnd=.2, hugeData=FALSE, shift_range=50, PolyA_PWM=pwm, classifier=classifier) } @ InPAS can handle single group data. <>= res <- inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), genome=BSgenome.Mmusculus.UCSC.mm10, utr3=utr3.mm10, gp1="Baf3", gp2=NULL, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, search_point_START=200, short_coverage_threshold=15, long_coverage_threshold=3, cutStart=0, cutEnd=.2, hugeData=FALSE, PolyA_PWM=pwm, classifier=classifier) res[1] @ % \section{References} % Lu J, Bushel PR. Dynamic expression of 3' UTRs revealed by % Poisson hidden Markov modeling of RNA-Seq: implications in % gene expression profiling. Gene. 2013 Sep 25;527(2):616-23. % doi: 10.1016/j.gene.2013.06.052. Epub 2013 Jul 9. % PubMed PMID: 23845781; PubMed Central PMCID: PMC3902974. % % Xia Z, Donehower LA, Cooper TA, Neilson JR, Wheeler DA, % Wagner EJ, Li W. Dynamic analyses of alternative % polyadenylation from RNA-seq reveal a 3'-UTRlandscape % across seven tumour types. Nat Commun. 2014 Nov 20;5:5274. % doi:10.1038/ncomms6274. PubMed PMID: 25409906. % % Sheppard S, Lawson ND, Zhu LJ. Accurate identification of % polyadenylation sites from 3' end deep sequencing using a % naive Bayes classifier. Bioinformatics. 2013 Oct 15;29(20):2564-71. % doi: 10.1093/bioinformatics/btt446. Epub 2013 Aug 20. % PubMed PMID: 23962617; PubMed Central PMCID: PMC3789547. \section{Session Info} <>= toLatex(sessionInfo()) @ \end{document}