%\VignetteIndexEntry{Methylation status calling with METHimpute} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= BiocStyle::latex() @ \author{Aaron Taudt\thanks{\email{aaron.taudt@gmail.com}}} \title{Methylation status calling with METHimpute} \begin{document} \maketitle \tableofcontents \clearpage <>= library(methimpute) options(width=110) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} \Rpackage{Methimpute} implements a powerful HMM-based binomial test for methylation status calling. Besides improved accuracy over the classical binomial test, the HMM allows imputation of the methylation status of \textbf{all cytosines} in the genome. It achieves this by borrowing information from neighboring covered cytosines. The confidence in the methylation status call is reported as well. Methimpute also outputs context-specific conversion rates, which might be used to optimize the experimental procedure. For the exact workings of \Rpackage{methimpute} we refer the interested reader to our publication at \url{https://doi.org/10.1101/190223}. \section{Methylation status calling} The following example explains the necessary steps for methylation status calling (and imputation). To keep the calculation time short, it uses only the first 200.000 bp of the Arabidopsis genome. The example consists of three steps: 1) Data import, 2) estimating the distance correlation and 3) methylation status calling. At the end of this example you will see that positions without counts are assigned a methylation status, but the confidence (column "posteriorMax") is generally quite low for those cytosines. Column "posteriorMeth" gives the HMM posterior probability for a cytosine being methylated, which can be interpreted as a methylation level for each site. Column "status" contains the imputed and non-imputed methylation status calls. \begin{small} <>= library(methimpute) # ===== Step 1: Importing the data ===== # # We load an example file in BSSeeker format that comes with the package file <- system.file("extdata","arabidopsis_bsseeker.txt.gz", package="methimpute") bsseeker.data <- importBSSeeker(file) print(bsseeker.data) # Because most methylation extractor programs report only covered cytosines, # we need to inflate the data to inlcude all cytosines (including non-covered sites) fasta.file <- system.file("extdata","arabidopsis_sequence.fa.gz", package="methimpute") cytosine.positions <- extractCytosinesFromFASTA(fasta.file, contexts = c('CG','CHG','CHH')) seqlengths(cytosine.positions) <- seqlengths(bsseeker.data) # only necessary for our toy example methylome <- inflateMethylome(bsseeker.data, cytosine.positions) print(methylome) # ===== Step 2: Obtain correlation parameters ===== # # The correlation of methylation levels between neighboring cytosines is an important # parameter for the methylation status calling, so we need to get it first. Keep in mind # that we only use the first 200.000 bp here, that's why the plot looks a bit meagre. distcor <- distanceCorrelation(methylome) fit <- estimateTransDist(distcor) print(fit) # ===== Step 3: Methylation status calling (and imputation) ===== # model <- callMethylation(data = methylome, transDist = fit$transDist) # The confidence in the methylation status call is given in the column "posteriorMax". # For further analysis one could split the results into high-confidence (posteriorMax >= 0.98) # and low-confidence calls (posteriorMax < 0.98) for instance. print(model) @ \end{small} \subsection{Description of columns} \begin{itemize} \item \textbf{context} The sequence context of the cytosine. \item \textbf{counts} Counts for methylated and total number of reads at each position. \item \textbf{distance} The distance in base-pairs from the previous to the current cytosine. \item \textbf{transitionContext} Transition context in the form "previous-current". \item \textbf{posteriorMax} Maximum posterior value of the methylation status call, can be interpreted as the confidence in the call. \item \textbf{posteriorMeth} Posterior value of the "methylated" component. \item \textbf{posteriorUnmeth} Posterior value of the "unmethylated" component. \item \textbf{status} Methylation status. \item \textbf{rc.meth.lvl} Recalibrated methylation level, calculated from the posteriors and the fitted parameters (see ?methimputeBinomialHMM for details). \item \textbf{rc.counts} Recalibrated counts for methylated and total number of reads at each position (see ?methimputeBinomialHMM for details). \end{itemize} \section{Plots and enrichment analysis} This package also offers plotting functions for a simple enrichment analysis. Let's say we are interested in the methylation level around genes and transposable elements. We would also like to see how the imputation works on cytosines with missing data compared to covered cytosines. \begin{small} <>= # Define categories to distinguish missing from covered cytosines model$data$category <- factor('covered', levels=c('missing', 'covered')) model$data$category[model$data$counts[,'total']>=1] <- 'covered' model$data$category[model$data$counts[,'total']==0] <- 'missing' # Note that the plots look a bit ugly because our toy data has only 200.000 datapoints. data(arabidopsis_genes) plotEnrichment(model$data, annotation=arabidopsis_genes, range = 2000, category.column = 'category') data(arabidopsis_TEs) plotEnrichment(model$data, annotation=arabidopsis_TEs, range = 2000, category.column = 'category') @ \end{small} \section{Session Info} \begin{scriptsize} <>= toLatex(sessionInfo()) @ \end{scriptsize} \end{document}