#+TITLE: Variant Calling with R/Bioconductor #+AUTHOR: Michael Lawrence #+OPTIONS: toc:t H:3 #+startup: beamer #+LaTeX_CLASS: beamer #+PROPERTY: exports both #+PROPERTY: results none #+PROPERTY: eval no-export #+PROPERTY: session *R:VariantToolsTutorial* #+PROPERTY: tangle yes #+BEGIN_LaTeX \AtBeginSubsection[] { \begin{frame}{Outline} \tableofcontents[currentsection,currentsubsection] \end{frame} } \AtBeginSection[] { \begin{frame}{Outline} \tableofcontents[currentsection] \end{frame} } #+END_LaTeX * Introduction to the dataset ** Experiment *** Goals and Scope * Determine the genotype of a sample * Call *single nucleotide variants vs. reference* from high-throughput sequencing data, including WGS, Exome-seq and (eventually) RNA-seq * Support users to filter the variant calls according to the biological context and questions of interest * Be sensitive to low frequency variants * Be robust to aneuploidy, cell mixtures, contamination * Permit estimation of sample heterogeneity *** Variant Calling Process **** Data Generation :B_block: :PROPERTIES: :BEAMER_env: block :END: 1. Library prep (PCR) 2. Sequencing 3. Alignment Each of these steps will introduce noise that requires filtering. **** Variant Calling :B_block: :PROPERTIES: :BEAMER_env: block :END: #+ATTR_LATEX: :width 10cm [[./fig/VariantCallingProcess.pdf]] *** Biological Considerations These generate a range of variant frequencies: * Aneuploidy * Heterogeneity * Contamination Thus, /there is no "one-p-fits-all" solution to variant calling/. *** Existing Solutions Other tools for calling variants vs. reference include: | =samtools mpileup= | Generates statitics useful for variant calling | | =vcfutils= | Perl script for filtering =mpileup= output | | =Varscan2= | Series of adhoc filters on =mpileup= output | | =GATK= | Oriented towards genotyping in diploid samples | There are also comparative (somatic mutation) callers (strelka, MuTect, etc), but we are focused on calling vs. reference. *** Benchmark Dataset **** Text :BMCOL: :PROPERTIES: :BEAMER_col: 0.5 :END: * To develop an algorithm, we need to benchmark its sensitivity and specificity, but no gold standard exists. * *Biochemically* mixed two HapMap daughter cell lines in different proportions to realistically simulate variant frequencies expected from complex samples. Sequenced each genome with 75bp reads. **** Allison's picture :BMCOL: :PROPERTIES: :BEAMER_col: 0.5 :END: #+attr_LaTeX: :height 8.5cm [[./fig/mix.pdf]] *** Sequencing Output: 23-24X average coverage | Sample | % CEU | % YRI | # Reads (analyzed) | Avg. Coverage | |--------+-------+-------+--------------------+---------------| | 1 | 90 | 10 | 461,449,560 | 22.3 | | 2 | 90 | 10 | 475,567,437 | 23.0 | | 3 | 90 | 10 | 460,196,498 | 22.3 | | 4 | 50 | 50 | 489,166,262 | 23.7 | | 5 | 50 | 50 | 442,737,941 | 21.4 | | 6 | 50 | 50 | 430,779,023 | 20.8 | | 7 | 10 | 90 | 496,958,600 | 24.0 | | 8 | 10 | 90 | 494,245,570 | 23.9 | | 9 | 10 | 90 | 534,458,340 | 25.8 | *** Genotypes | Cell Line | Trio | Source | Ref | Coverage | Total Het/Hom | |-----------+------+--------+--------+----------+-----------------| | NA12878 | CEU | Broad | hg19 | 64X | 2451814/1410358 | | NA12878 | CEU | 1000G | *hg18* | 61X | 1703706/1061942 | | CEU Union | CEU | Both | | | 2424095/1427209 | | NA19240 | YRI | 1000G | *hg18* | 66X | 2227251/1108784 | **** 10/90 combinations :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.5 :BEAMER_env: block :END: | 10/90 | 0 | 0.5 | 1 | |-------+------+------+------| | 0 | - | 0.45 | 0.90 | | 0.5 | 0.05 | 0.50 | 0.95 | | 1 | 0.10 | 0.55 | 1.0 | **** 50/50 combinations :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.5 :BEAMER_env: block :END: | 50/50 | 0 | 0.5 | 1 | |-------+------+------+------| | 0 | - | 0.25 | 0.50 | | 0.5 | 0.25 | 0.50 | 0.75 | | 1 | 0.50 | 0.75 | 1.0 | *** QC of mixture ratios #+attr_LaTeX: :width 11cm [[./fig/boxplot-obs-freq-sample-ceu-all.pdf]] *** QC of variant frequencies #+attr_LaTeX: :width 11cm [[./fig/boxplot-expected-obs-freq-source-merged-vt.pdf]] ** Algorithm *** Overview #+attr_LaTeX: :width 11cm [[./fig/workflow.pdf]] # *** Variant Calling Filters # #+attr_LaTeX: :width 8cm # [[./fig/variant-calling-filters.pdf]] # *** Post-filters # #+attr_LaTeX: :width 3.5cm # [[./fig/post-filters.pdf]] ** Performance *** Definitions #+attr_LaTeX: :height 8cm [[./fig/errors.pdf]] *** FNR high at low/high coverage #+attr_LaTeX: :width 11cm [[./fig/bar-fnr-cov-bin-merged-all.pdf]] # *** Errors at high coverage # #+ATTR_LaTeX: :width 11cm # [[./fig/igv_high_cov_neg.png]] *** Recovery rate (1 - FNR) vs. GATK #+attr_LaTeX: :width 11cm [[./fig/dot-percent-recovered-caller-merged-20-all.pdf]] *** FDR by coverage bin #+attr_LaTeX: :width 11cm [[./fig/bar-fdr-cov-bin-merged-all.pdf]] *** Evidence that some FP are real **** Columns :B_columns: :PROPERTIES: :BEAMER_env: columns :BEAMER_OPT: T :END: ***** Replication :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.4 :BEAMER_env: block :END: #+attr_LaTeX: :width 5cm [[./fig/hist-fdr-rep-count-all.pdf]] ***** dbSNP Concordance :B_block:BMCOL: :PROPERTIES: :BEAMER_col: 0.6 :BEAMER_env: block :END: | | NOT dbSNP | IN dbSNP | |---------+-----------+----------| | 1 Rep | 695468 | 120266 | | 2+ Reps | 391879 | 781940 | *** Selected FP: GATK vs. VariantTools Selected FPs at reasonable (45-85X) coverage, outside of structural variants and multi-mapping regions. #+ATTR_LaTeX: :width 7cm [[./fig/bar-caller-freq-summary-dbSNP-count-all.pdf]] *** Acknowledgements Leonard Goldstein \\ Melanie Huntley \\ Yi Cao \\ Robert Gentleman * Interactive demonstration ** Overview *** Overview **** Data Subset of the mixture data consisting only of the 50/50 samples, and only reads aligning within 1 Mb of p53. **** Strategy 1. Align sequences to the p53 region. 2. Generate tallies (pileup) from the alignments. 3. Call/filter variants. 4. Perform exploratory analysis on the calls and concordance with canonical genotypes. ** Alignment *** The *gmapR* package *gmapR* is an R interface to the GMAP/GSTRUCT suite of alignment tools, including: * GSNAP :: a short read aligner distinguished by its ability to generate spliced alignments from RNA-seq data (also handles DNA) * =bam_tally= :: summarizes alignments by counting A/C/G/T (and optionally indels) at each position and tabulating by strand, read position and quality *** Configure GSNAP parameters * GSNAP is a complex tool with a complex interface, consisting of many command-line parameters. * *gmapR* supports all parameters, while providing a high-level interface with reasonable defaults. * The parameters are stored in a =GsnapParams= object. * We construct a simple =GsnapParams= for generating unique DNA alignments to ~2Mb region around p53: #+begin_src R library(gmapR) param <- GsnapParam(TP53Genome(), unique_only = TRUE, molecule = "DNA") #+end_src *** Align with GSNAP We find our FASTQ files inside the *VariantToolsTutorial* package: #+begin_src R extdata.dir <- system.file("extdata", package="VariantToolsData") first.fastq <- dir(extdata.dir, "first.fastq", full.names=TRUE) last.fastq <- dir(extdata.dir, "last.fastq", full.names=TRUE) #+end_src And generate the GSNAP alignments (for the first sample), which *gmapR* automatically converts to indexed BAMs: #+begin_src R output <- gsnap(first.fastq[1], last.fastq[1], param) bam <- as(output, "BamFile") #+end_src ** Variant calling *** The *VariantTools* package *VariantTools* is a set of utilities for: * Tallying alignments (via *gmapR*) * Annotating tallies * Filtering tallies into variant calls * Exporting tallies to VCF (actually *VariantAnnotation*) * Wildtype calling (for a specific set of filters) * Sample ID verification via rudimentary genotyping *** Generate nucleotide tallies The underlying =bam_tally= from GSTRUCT accepts a number of parameters, which we specify as a =TallyVariantsParam= object. The genome is required; we also mask out the repeats. #+begin_src R library(VariantTools) data(repeats, package = "VariantToolsData") genome(repeats) <- genome(TP53Genome()) param <- TallyVariantsParam(TP53Genome(), mask = repeats) #+end_src Tallies are generated via the =tallyVariants= function: #+begin_src R tallies <- tallyVariants(bam, param) #+end_src *** Loading and combining three samples worth of tallies The alignments and tallies were generated for all three replicates of the 50/50 mixture and placed in the package. #+begin_src R data(tallies, package = "VariantToolsData") #+end_src We combine the samples in two different ways: stacked (long form) and merged (depths summed). #+begin_src R stacked.tallies <- stackSamples(tallies) merged.tallies <- sumDepths(tallies) sampleNames(merged.tallies) <- "merged" #+end_src *** Configure filters *VariantTools* implements its filters within the =FilterRules= framework from *IRanges*. The default variant calling filters are constructed by =VariantCallingFilters=: #+begin_src R calling.filters <- VariantCallingFilters() #+end_src Post-filters are filters that attempt to remove anomalies from the called variants: #+begin_src R post.filters <- VariantPostFilters() #+end_src *** Filter tallies into variant calls The filters are then passed to the =callVariants= function: #+begin_src R merged.variants <- callVariants(merged.tallies, calling.filters, post.filters) #+end_src Or more simply in this case: #+begin_src R merged.variants <- callVariants(merged.tallies) stacked.variants <- callVariants(stacked.tallies) #+end_src *** Or, call variants directly from a BAM #+begin_src R variants <- callVariants(bam, param) #+end_src **** Note :B_alertblock: :PROPERTIES: :BEAMER_env: alertblock :END: Convenient for simple exercises, but does not facilitate diagnostics ** Exploratory analysis *** Alternative allele frequencies Check the quality of our mixtures: #+begin_src R :results output graphics :file fig/density-altFraction.pdf stacked.variants$altFraction <- altDepth(stacked.variants) / totalDepth(stacked.variants) library(ggplot2) qplot(altFraction, geom = "density", color = sampleNames, data = as.data.frame(stacked.variants)) #+end_src *** Annotating variants with genotype concordance We want to see how well our calls recapitulate the genotypes from 1000G; we have these prepared as a dataset: #+begin_src R data(geno, package = "VariantToolsData") #+end_src Merge the expected frequencies of each alt with the variant calls: #+begin_src R naToZero <- function(x) ifelse(is.na(x), 0L, x) addExpectedFreqs <- function(x) { expected.freq <- geno$expected.freq[match(x, geno)] x$expected.freq <- naToZero(expected.freq) x } stacked.variants <- addExpectedFreqs(stacked.variants) merged.variants <- addExpectedFreqs(merged.variants) #+end_src *** Annotating the genotypes with merged variant calls # EXCERCISE? Annotate the genotypes for whether an alt allele was called in the merged data, and also add the alt and total depth: #+begin_src R :results replace softFilterMatrix(geno) <- cbind(in.merged = geno %in% merged.variants) mean(called(geno)) #+end_src #+RESULTS: : 0.710044395116537 #+begin_src R m <- match(geno, merged.tallies) altDepth(geno) <- naToZero(altDepth(merged.tallies)[m]) totalDepth(geno) <- naToZero(totalDepth(merged.tallies)[m]) #+end_src *** False negatives: which filter to blame? Apply the calling filters to our FN and summarize the results: #+begin_src R :results value replace :colnames yes fn.geno <- geno[!called(geno)] fn.geno <- resetFilter(fn.geno) filters <- hardFilters(merged.variants)[3:4] fn.geno <- softFilter(fn.geno, filters) t(summary(softFilterMatrix(fn.geno))) #+end_src #+RESULTS: | | readCount | likelihoodRatio | | |-----------+-----------+-----------------+---------| | 1021 | 0 | 9 | 0 | The default is to evaluate the filters in parallel, but serial evaluation is also supported: #+begin_src R :results value replace :colnames yes fn.geno <- resetFilter(fn.geno) fn.geno <- softFilter(fn.geno, filters, serial = TRUE) t(summary(softFilterMatrix(fn.geno))) #+end_src #+RESULTS: | | readCount | likelihoodRatio | | |-----------+-----------+-----------------+---------| | 1021 | 0 | 0 | 0 | *** dbSNP concordance Import a VRanges from (p53) dbSNP VCF: #+begin_src R vcfPath <- system.file("extdata", "dbsnp-p53.vcf.gz", package = "VariantToolsData") param <- ScanVcfParam(fixed = "ALT", info = NA, geno = NA) dbSNP <- as(readVcf(vcfPath, param, genome = "TP53_demo"), "VRanges") dbSNP <- dbSNP[!isIndel(dbSNP)] #+end_src And annotate the stacked variants for concordance: #+begin_src R :results value replace :colnames yes :rownames yes stacked.variants$dbSNP <- stacked.variants %in% dbSNP xtabs(~ dbSNP + expected.freq, mcols(stacked.variants)) #+end_src #+RESULTS: | | 0 | 0.25 | 0.5 | 0.75 | 1 | |-------+------+------+------+------+-----| | FALSE | 2233 | 25 | 0 | 0 | 0 | | TRUE | 917 | 3497 | 2023 | 891 | 924 | *** Replication over the samples Tabulate the stacked variants over the samples: #+begin_src R :results value replace :colnames yes :rownames yes tabulated.variants <- tabulate(stacked.variants) xtabs(~ dbSNP + sample.count, mcols(tabulated.variants)) #+end_src #+RESULTS: | | 1 | 2 | 3 | |-------+------+-----+------| | FALSE | 1473 | 241 | 101 | | TRUE | 116 | 435 | 2422 | *** Visualizing putative FPs: IGV IGV is an effective tool for exploring alignment issues and other variant calling anomalies; *SRAdb* drives IGV from R. To begin, we create a connection: #+begin_src R library(SRAdb) startIGV("lm") sock <- IGVsocket() #+end_src R *** Exporting our calls as VCF IGV will display variant calls as VCF: #+begin_src R mcols(merged.variants) <- NULL vcf <- writeVcf(sort(merged.variants), "merged.variants.vcf", index = TRUE) vcf <- tools::file_path_as_absolute(vcf) #+end_src *** Creating an IGV session Create an IGV session with our VCF, BAMs and custom p53 genome: #+begin_src R extdata <- system.file("extdata", package = "VariantToolsTutorial") bams <- tools::list_files_with_exts(extdata, "bam") p53fasta <- tempfile("p53", fileext = ".fasta") rtracklayer::export(TP53Genome(), p53fasta) session <- IGVsession(c(bams, vcf), "session.xml", p53fasta) #+end_src Load the session: #+begin_src R IGVload(sock, session) #+end_src *** Browsing regions of interest IGV will (manually) load BED files as a list of bookmarks: #+begin_src R rtracklayer::export(merged.variants, "roi.bed") #+end_src