---
title: "maftools : Summarize, Analyze and Visualize MAF files"
author: "Anand Mayakonda"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Summarize, Analyze and Visualize MAF files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction.
With advances in cancer genomics, [Mutation Annotation Format](https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification) (MAF) is being widely accepted and used to store detected somatic variants. [The Cancer Genome Atlas](http://cancergenome.nih.gov) Project has sequenced over 30 different cancers, with the sample size of each cancer type being over 200. [Resulting data](https://wiki.nci.nih.gov/display/TCGA/TCGA+MAF+Files) consisting of somatic variants are stored in the form of [Mutation Annotation Format](https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification). This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner, from either TCGA sources or any in-house studies, as long as the data is in MAF format.

## Citation
Please cite the article below if you find this tool useful.

Mayakonda, A. & Koeffler, H.P. Maftools: Efficient analysis, visualization and summarization of MAF files from large-scale cohort based cancer studies. bioRxiv (2016). doi: http://dx.doi.org/10.1101/052662

# MAF field requirements.
MAF files contain many fields ranging from chromosome names to COSMIC annotations. However, most of the analyses in maftools use the following fields.

* Mandatory fields: __Hugo_Symbol, Chromosome, Start_Position, End_Position, Variant_Classification, Variant_Type and Tumor_Sample_Barcode__.

* Recommended optional fields: non MAF specific fields containing vaf and amino acid change information.

Complete specification of MAF files can be found on the [NCI TCGA page](https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification).
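Before reading an in-house file, it can help to confirm that the mandatory columns listed above are present. The following is only a minimal sketch in base R: `missingMafFields` and the toy table are hypothetical illustrations, not part of maftools, and `read.maf` may perform its own checks.

```r
# Mandatory MAF columns, as listed in the field requirements above.
required.fields <- c("Hugo_Symbol", "Chromosome", "Start_Position", "End_Position",
                     "Variant_Classification", "Variant_Type", "Tumor_Sample_Barcode")

# Hypothetical helper (not a maftools function): returns the mandatory
# columns that are missing from a MAF-like data.frame.
missingMafFields <- function(maf.df, required = required.fields) {
  setdiff(required, colnames(maf.df))
}

# Toy one-row table with made-up values; it deliberately lacks Variant_Type.
toy.maf <- data.frame(
  Hugo_Symbol = "GENE_A",
  Chromosome = "1",
  Start_Position = 1000,
  End_Position = 1000,
  Variant_Classification = "Missense_Mutation",
  Tumor_Sample_Barcode = "SAMPLE_01",
  stringsAsFactors = FALSE
)

missing.fields <- missingMafFields(toy.maf)
print(missing.fields)
```

An empty result means all mandatory fields are available; any names returned would need to be added or renamed in the input table first.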
This vignette demonstrates the usage and application of maftools on an example MAF file from the TCGA LAML cohort [1](#references).

# Reading and summarizing maf files.
## Reading MAF files.
`read.maf` reads a MAF file, summarizes it in various ways and stores it as an MAF object.

```{r results='hide', message=FALSE}
suppressPackageStartupMessages(require(maftools))
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE)
```

## MAF object
The summarized MAF file is stored as an MAF object. The MAF object contains the main maf file, summarized data and an oncomatrix, which is useful for plotting oncoplots (aka waterfall plots). There are accessor methods to access the useful slots of the MAF object. However, all slots can be accessed using `@`, just like in most S4 objects.

```{r}
#Typing laml shows basic summary of MAF file.
laml
#Shows sample summary.
getSampleSummary(laml)
#Shows frequently mutated genes.
getGeneSummary(laml)
#Shows all fields in MAF
getFields(laml)
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')
```

# Visualization.
## Plotting MAF summary.
We can use `plotmafSummary` to plot the summary of the maf file, which displays the number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either a mean or a median line to the stacked barplot to display the average/median number of variants across the cohort.

```{r,fig.height=6, fig.width=8}
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE)
```

## Oncoplots (aka waterfall plots)
### Drawing oncoplots.
A better representation of a maf file can be shown as oncoplots, also known as waterfall plots. The `oncoplot` function uses [ComplexHeatmap](https://github.com/jokergoo/ComplexHeatmap) to draw oncoplots[2](#references).
To be specific, `oncoplot` is a wrapper around ComplexHeatmap's `OncoPrint` function, with a little modification and automation which makes plotting easier. Side and top barplots can be controlled by the `drawRowBar` and `drawColBar` arguments respectively.

```{r, fig.align='left',fig.height=6,fig.width=10, eval=T}
#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml, top = 10, removeNonMutated = TRUE)
```

NOTE: Variants annotated as `Multi_Hit` are those genes which are mutated more than once in the same sample.

### Including copy number data into oncoplots.
If we have copy number data along with the MAF file, we can include it in the oncoplot to show whether any genes are amplified or deleted. One of the most widely used tools for copy number analysis in large scale studies is GISTIC, and we can read GISTIC results together with the MAF[3](#references). GISTIC generates numerous files, but we mainly need three: `all_lesions.conf_XX.txt`, `amp_genes.conf_XX.txt` and `del_genes.conf_XX.txt`, where XX is the confidence level. These files contain significantly altered genomic regions along with amplified and deleted genes respectively.

```{r results='hide', message=FALSE}
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
laml.plus.gistic = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE, gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, isTCGA = TRUE)
```

```{r, fig.align='left',fig.height=6,fig.width=10, eval=T}
#We will draw oncoplots for top ten mutated genes.
#(Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml.plus.gistic, top = 10, removeNonMutated = TRUE, sortByMutation = TRUE)
```

This plot shows frequent deletions in the TP53 gene, which is located on one of the significantly deleted loci, 17p13.2.

### Changing colors and adding annotations to oncoplots.
We often want to include meta data to show sample characteristics such as gender, treatment, etc. We can include such meta data by passing it to the `annotation` argument of `oncoplot`. We can also change colors for Variant_Classification by providing a named vector of colors.

```{r, fig.height=7,fig.width=10, eval=T, fig.align='left'}
#Read FAB classification of TCGA LAML barcodes.
laml.fab.anno = system.file('extdata', 'tcga_laml_fab_annotation.txt', package = 'maftools')
laml.fab.anno = read.delim(laml.fab.anno, sep = '\t')
head(laml.fab.anno)
#Changing colors (You can use any colors, here in this example we will use a color palette from RColorBrewer)
col = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
names(col) = c('Frame_Shift_Del','Missense_Mutation', 'Nonsense_Mutation', 'Multi_Hit', 'Frame_Shift_Ins', 'In_Frame_Ins', 'Splice_Site', 'In_Frame_Del')
#We will plot same top ten mutated genes with FAB classification as annotation and using above defined colors.
oncoplot(maf = laml, top = 10, annotation = laml.fab.anno, removeNonMutated = TRUE, colors = col)
```

## Oncostrip
We can visualize any set of genes using the `oncostrip` function, which draws mutations in each sample similar to the [OncoPrinter tool](http://www.cbioportal.org/faq.jsp#what-are-oncoprints) on [cBioPortal](http://www.cbioportal.org/index.do). `oncostrip` can be used to draw any number of genes using the `top` or `genes` arguments.

```{r, fig.height=2,fig.width=8,fig.align='center'}
oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'), removeNonMutated = TRUE, showTumorSampleBarcodes = FALSE)
```

## Transition and Transversions.
The `titv` function classifies SNPs into [Transitions and Transversions](http://www.mun.ca/biology/scarr/Transitions_vs_Transversions.html) and returns a list of tables summarized in various ways. Summarized data can also be visualized as a boxplot showing the overall distribution of the six different conversions, and as a stacked barplot showing the fraction of conversions in each sample.

```{r, fig.align='default', fig.height=6, fig.width=8, eval = T}
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)
```

## Lollipop plots for amino acid changes.
Lollipop plots are a simple and effective way of showing mutation spots on protein structure. Many oncogenes have preferential sites which are mutated more often than any other locus. These spots are considered to be mutational hotspots, and lollipop plots can be used to display them along with the rest of the mutations. We can draw such plots using the function `lollipopPlot`. This function requires amino acid change information in the maf file. However, MAF files have no clear guidelines on naming the field for amino acid changes, and different studies use different field (or column) names for them. By default, `lollipopPlot` looks for the column `AAChange`, and if it is not found in the MAF file, it prints all available fields with a warning message. In the example below, the MAF file contains amino acid changes under the field/column name 'Protein_Change'. We will manually specify this using the argument `AACol`. This function also returns the plot as a ggplot object, which the user can later modify if needed.

```{r,fig.height=4.5,fig.width=8,fig.align='center'}
#Lets plot lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
dnmt3a.lpop = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE, domainLabelSize = 2.5)
```

Note that `lollipopPlot` warns the user about the availability of different transcripts for the given gene.
If we know the transcript id beforehand, we can specify it as `refSeqID` or `proteinID`. By default `lollipopPlot` uses the longer isoform.

### Labelling and repelling points.
We can also label points on the `lollipopPlot` using the argument `labelPos`. If `labelPos` is set to 'all', all of the points are highlighted. Sometimes, many mutations are clustered within a range of a few amino acid positions. In that case we can use the `repel` option, which tries to repel points for a clearer representation.

```{r,fig.height=4,fig.width=8,fig.align='center'}
#Lets plot mutations on KIT gene, without repel option.
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', domainLabelSize = 3)
#Same plot with repel=TRUE
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', repel = TRUE, domainLabelSize = 3)
```

### cBioPortal style annotations
MutationMapper on cBioPortal collapses Variant Classifications into truncating and others. It also includes the somatic mutation rate.

```{r, warning=FALSE, message=FALSE, fig.height=4.5,fig.width=8,fig.align='center'}
laml.dnmt3a = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', refSeqID = 'NM_175629', labelPos = 882, collapsePosLabel = TRUE, cBioPortal = TRUE, domainLabelSize = 3)
```

## Integrating somatic variants and copy number alterations
Many cancer genomic studies involve copy number data generated either from sequencing or SNP chip arrays. Copy number analysis provides a lot of information, from genome wide copy number aberrations to tumor purity. Most of the time copy number data is stored as segments, with each segment represented by a log ratio of copy number changes compared to matched normals.
There are many segmentation algorithms, with the most popular being Circular Binary Segmentation, implemented in the [DNAcopy Bioconductor package](https://www.bioconductor.org/packages/3.3/bioc/html/DNAcopy.html)[4](#references). We can plot such segmented copy number data and map all mutations onto it. This provides a quick way of knowing which variants (and hence which genes) are located in copy number altered genomic regions.

```{r, fig.height=4,fig.width=8,fig.align='center'}
tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg, maf = laml, labelAll = TRUE)
```

The above plot shows that two genes, NF1 and SUZ12, are located in a region which has a shallow deletion. Later we will also see how these variants in copy number altered regions affect variant allele frequencies and tumor heterogeneity estimation.

## Rainfall plots
Cancer genomes, especially those of solid tumors, are characterized by genomic loci with localized hypermutations[5](#references). Such hypermutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots are generally called rainfall plots, and we can draw them using `rainfallPlot`. If `detectChangePoints` is set to TRUE, `rainfallPlot` also highlights regions where potential changes in inter-event distances are located. Please be aware that detected change-points are only loci where the distribution of inter-event distance changes. Segments may have to be inferred manually from adjacent change-points. This will be improved in future updates.

```{r, results='hide', message=FALSE}
coad <- system.file("extdata", "coad.maf.gz", package = "maftools")
coad = read.maf(maf = coad)
```

```{r, fig.height=5,fig.width=12,fig.align='center'}
coad.rf = rainfallPlot(maf = coad, detectChangePoints = TRUE, fontSize = 12, pointSize = 0.6)
```

## Genecloud
We can draw a word cloud of mutated genes with the function `geneCloud`.
The size of each gene is proportional to the total number of samples in which it is mutated/altered.

```{r, fig.align='left',fig.width=7, fig.height=5, eval=T}
geneCloud(input = laml, minMut = 3)
```

# GISTIC files.
## Reading and summarizing gistic output files.
We can summarize output files generated by the GISTIC programme. As mentioned earlier, we need three files generated by GISTIC, i.e., all_lesions.conf_XX.txt, amp_genes.conf_XX.txt and del_genes.conf_XX.txt, where XX is the confidence level. See the [GISTIC documentation](ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm) for details.

```{r}
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, isTCGA = TRUE)
```

Similar to MAF objects, there are methods available to access slots of the GISTIC object: `getSampleSummary`, `getGeneSummary` and `getCytoBandSummary`. Summarized results can be written to output files using the function `write.GisticSummary`.

## Visualizing gistic results.
### Gistic plots.
Similar to oncoplots, we can draw copy number data.

```{r, fig.align='left',fig.width=7, fig.height=5, eval=T}
gisticPlot(gistic = laml.gistic)
```

### Plot gistic results.
A bubble plot to display summarized gistic results.

```{r, fig.align='left',fig.width=7, fig.height=5, eval=T}
plotGisticResults(gistic = laml.gistic)
```

# Analysis.
## Mutual exclusivity.
Many disease causing genes in cancer show strong exclusiveness in their mutation pattern. Such mutually exclusive sets of genes can be detected using the `mutExclusive` function, which performs an exact test to detect such significant pairs of genes.
`mutExclusive` uses `comet_exact_test` on a given set of genes to calculate the significance value. Please cite the [CoMET](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4531541/) article if you use this function [6](#references).

```{r}
#We will run mutExclusive on top 10 mutated genes.
laml.mut.excl = mutExclusive(maf = laml, top = 10)
head(laml.mut.excl)
```

We can visualize the above results using `oncostrip`. In the above `mutExclusive` analysis, many genes show exclusiveness. For example, NPM1 and RUNX1 show a strong exclusiveness with a p-value of 0.02.

```{r, fig.height=1.5,fig.width=8,fig.align='center'}
oncostrip(maf = laml, genes = c('NPM1', 'RUNX1'), sort = TRUE, removeNonMutated = TRUE)
```

## Detecting cancer driver genes based on positional clustering.
maftools has a function `oncodrive` which identifies cancer (driver) genes from a given MAF. `oncodrive` is based on the [oncodriveCLUST](http://bg.upf.edu/group/projects/oncodrive-clust.php) algorithm, which was originally implemented in Python. The concept is based on the fact that most of the variants in cancer causing genes are enriched at a few specific loci (aka hotspots). This method takes advantage of such positions to identify cancer genes. If you use this function, please cite the [OncodriveCLUST article](http://bioinformatics.oxfordjournals.org/content/early/2013/07/31/bioinformatics.btt395.full) [7](#references).

```{r, fig.align='default', fig.width=7,fig.height=5, message=F,results='hide', eval=T}
laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')
head(laml.sig)
```

We can plot the results using `plotOncodrive`.

```{r, fig.align='default', fig.width=7,fig.height=5, eval= T}
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)
```

`plotOncodrive` plots the results as a scatter plot, with the size of the points proportional to the number of clusters found in the gene.
The X-axis shows the number of mutations (or fraction of mutations) observed in these clusters. In the above example, IDH1 has a single cluster and all of its 18 mutations are accumulated within that cluster, giving it a cluster score of one. For details on the oncodrive algorithm, please refer to the [OncodriveCLUST article](http://bioinformatics.oxfordjournals.org/content/early/2013/07/31/bioinformatics.btt395.full) [7](#references).

## Adding and summarizing pfam domains.
maftools comes with the function `pfamDomains`, which adds pfam domain information to the amino acid changes. `pfamDomains` also summarizes amino acid changes according to the domains that are affected. This serves the purpose of knowing which domain is most frequently affected in a given cancer cohort. This function is inspired by the Pfam annotation module of the MuSiC tool [8](#references).

```{r, fig.align='left',fig.width=7, fig.height=5, eval=T}
laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE]
#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE]
```

The above plot and results show that the AdoMet_MTases domain is frequently mutated, but the number of genes with this domain is just one (DNMT3A), compared to other domains such as the 7tm_1 domain, which is mutated across 24 different genes. This shows the importance of mutations in methyl transfer domains in Leukemia.

## Comparing two cohorts (MAFs)
Cancers differ from each other in terms of their mutation pattern. We can compare two different cohorts to detect such differentially mutated genes. For example, a recent article by [Madan et al.](http://www.ncbi.nlm.nih.gov/pubmed/27063598) [9](#references) has shown that patients with relapsed APL (Acute Promyelocytic Leukemia) tend to have mutations in the PML and RARA genes, which were absent during the primary stage of the disease.
This difference between two cohorts (in this case primary and relapse APL) can be detected using the function `mafCompare`, which performs a Fisher test on all genes between the two cohorts to detect differentially mutated genes.

```{r results='hide', message=FALSE}
#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)
```

```{r}
#We will consider only genes which are mutated in at least 5 samples in one of the cohorts, to avoid bias due to single mutated genes.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', minMut = 5)
print(pt.vs.rt)
```

The above results show two genes, PML and RARA, which are highly mutated in Relapse APL compared to Primary APL. We can visualize these results as a [forestplot](https://en.wikipedia.org/wiki/Forest_plot).

```{r, fig.width=6, fig.height=5, fig.align='center'}
apl.pt.vs.rt.fp = forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.05, show = 'stat', color = c('royalblue', 'maroon'))
```

An alternative way of displaying the above results is by plotting two oncoplots side by side. The `coOncoplot` function takes two maf objects and plots them side by side for better comparison.

```{r, fig.height=3,fig.width=11, eval=T, fig.align='left'}
genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)
```

## Tumor heterogeneity and MATH scores.
### Heterogeneity in tumor samples.
Tumors are generally heterogeneous, i.e., they consist of multiple clones. This heterogeneity can be inferred by clustering variant allele frequencies. The `inferHeterogeneity` function uses vaf information to cluster variants (using `mclust`) and infer clonality.
By default, the `inferHeterogeneity` function looks for the column *t_vaf* containing vaf information. However, if the field name is different from *t_vaf*, we can manually specify it using the argument `vafCol`. For example, in this case study vaf is stored under the field name *i_TumorVAF_WU*.

```{r, echo = TRUE, fig.align='center', fig.height=5, fig.width=7, eval=T}
#We will run this for sample TCGA.AB.2972
tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA.AB.2972', vafCol = 'i_TumorVAF_WU')
print(tcga.ab.2972.het$clusterMeans)
#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)
```

The above figure shows a clear separation of two clones: a major clone clustered at a mean variant allele frequency of ~45% and a minor clone at a variant allele frequency of ~25%.

#### Ignoring variants in copy number altered regions.
We can use copy number information to ignore variants located in copy number altered regions. Copy number alterations result in abnormally high/low variant allele frequencies, which tend to affect clustering. Removing such variants improves clustering and density estimation while retaining biologically meaningful results. Copy number information can be provided as a segmented file generated by segmentation programmes, such as Circular Binary Segmentation from the DNAcopy Bioconductor package [4](#references).

```{r, fig.align='center', fig.height=5, fig.width=7, eval=T}
seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA.AB.3009', segFile = seg, vafCol = 'i_TumorVAF_WU')
#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)
```

The above figure shows two genes, NF1 and SUZ12, with high VAFs, which are due to copy number alterations (deletion). Those two genes are ignored in the analysis.
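The idea of setting aside variants in copy number altered segments can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper `flagCnAltered`, the toy tables and the log ratio cutoff of 0.3 are all hypothetical, and they are not meant to reproduce the thresholds or the overlap logic that `inferHeterogeneity` applies internally.

```r
# Toy segments with made-up coordinates and log ratios (seg.mean).
segs <- data.frame(
  chrom = c("17", "17", "2"),
  start = c(1, 29000000, 1),
  end = c(28999999, 31000000, 243000000),
  seg.mean = c(0.01, -0.65, 0.02),
  stringsAsFactors = FALSE
)

# Toy variants with made-up positions and variant allele frequencies.
variants <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C"),
  chrom = c("17", "17", "2"),
  pos = c(7500000, 29600000, 25400000),
  vaf = c(0.44, 0.85, 0.46),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for variants falling inside a segment whose
# absolute log ratio exceeds the (assumed) cutoff.
flagCnAltered <- function(vars, segs, cutoff = 0.3) {
  altered <- segs[abs(segs$seg.mean) > cutoff, ]
  vapply(seq_len(nrow(vars)), function(i) {
    any(altered$chrom == vars$chrom[i] &
          altered$start <= vars$pos[i] &
          altered$end >= vars$pos[i])
  }, logical(1))
}

variants$cn.altered <- flagCnAltered(variants, segs)

# Only variants outside altered segments would be passed on to clustering.
clustering.input <- variants[!variants$cn.altered, ]
print(clustering.input)
```

In this toy example the variant with the unusually high VAF sits in the deleted segment and is therefore left out of the clustering input, which mirrors the behaviour described above.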
### MATH (Mutant-Allele Tumor Heterogeneity) scores to infer extent of heterogeneity.
Although clustering of variant allele frequencies gives us a fair idea of heterogeneity, it is also possible to measure the extent of heterogeneity as a numerical value. The MATH score is a simple quantitative measure of intra-tumor heterogeneity, which calculates the width of the vaf distribution. Higher MATH scores are found to be associated with poor outcome. The MATH score can also be used as a proxy variable for survival analysis [10](#references).

```{r, results='hide', message=F, fig.align='center',fig.width=7, fig.height=6, eval = T}
#We will compute MATH scores for 4 randomly chosen patients.
laml.math = math.score(maf = laml, vafCol = 'i_TumorVAF_WU', sampleName = c('TCGA.AB.3009', 'TCGA.AB.2849', 'TCGA.AB.3002', 'TCGA.AB.2972'))
```

```{r, eval=T}
print(laml.math)
```

From the above results, sample TCGA.AB.2849 has the highest MATH score (20.58) compared to the other three samples. It is also evident from the density plot that the vaf distribution is wider for this sample, whereas the other three samples have sharp peaks with relatively low MATH scores, suggesting more homogeneity and less heterogeneity.

## Mutational Signatures.
Every cancer, as it progresses, leaves a signature characterised by a specific pattern of nucleotide substitutions. [Alexandrov et al.](http://www.nature.com/nature/journal/v500/n7463/full/nature12477.html) have shown such mutational signatures, derived from over 7000 cancer samples[5](#references). Such signatures can be extracted by decomposing a matrix of nucleotide substitutions, classified into 96 substitution classes based on the bases immediately surrounding the mutated base. Extracted signatures can also be compared to [validated signatures](http://cancer.sanger.ac.uk/cosmic/signatures).

`extractSignatures` uses non-negative matrix factorization to decompose an n x 96 dimension matrix into r signatures, where n is the number of samples from the input MAF [11](#references).
By default, the function runs NMF on 6 ranks and chooses the best possible value based on the maximum cophenetic correlation coefficient. It is also possible to manually specify r. Once decomposed, signatures are compared against known signatures derived from [Alexandrov et al.](http://www.nature.com/nature/journal/v500/n7463/full/nature12477.html), and cosine similarity is calculated to identify the best match.

NOTE: Even though reading the fasta file and extracting bases is fairly fast, it is a memory consuming process, as it occupies ~3 GB of memory while running.

```{r, eval=F}
#First we extract adjacent bases to the mutated locus and classify them into 96 substitution classes.
laml.tnm = trinucleotideMatrix(maf = laml, ref_genome = '/path/to/hg19.fa', prefix = 'chr', add = TRUE, ignoreChr = 'chr23', useSyn = TRUE)
```

```{r, fig.height=5, fig.width=5, eval=F, message=FALSE}
#Run main function with maximum 6 signatures.
require('NMF')
laml.sign = extractSignatures(mat = laml.tnm, nTry = 6, plotBestFitRes = FALSE)
# Warning : Found zero mutations for conversions A[T>G]C
# Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)
# Found Signature_1 most similar to validated Signature_1. CoSine-Similarity: 0.778739082321156
# Found Signature_2 most similar to validated Signature_1.
# CoSine-Similarity: 0.782748375695661
```

```{r, echo=F}
laml.sign = structure(list(signatures = structure(c(1.08748973712201e-18, 0.00839574301030403, 1.08748973712201e-18, 1.08748973712201e-18, 0.00240385240062567, 0.0077139799108362, 1.08748973712201e-18, 1.08748973712201e-18, 0.0162307404236774, 1.08748973712201e-18, 1.08748973712201e-18, 0.00375682293486133, 0.0149882188409168, 0.0166193873319139, 1.08748973712201e-18, 0.0340641337293563, 0.0177133495392653, 1.08748973712201e-18, 1.08748973712201e-18, 0.010900522793394, 1.08748973712201e-18, 1.08748973712201e-18, 0.0204384802376138, 1.08748973712201e-18, 0.016350784190091, 0.00817539209504552, 0.00817539209504552, 0.00510268819463668, 4.64672034428952e-06, 1.08748973712201e-18, 0.00226222985404151, 0.0136256534917425, 0.0305962431704599, 1.08748973712201e-18, 0.0644710774331736, 0.00640610796992356, 0.0940170090930234, 0.0580805923991668, 0.0849820094561238, 1.08748973712201e-18, 0.0397128561572591, 0.0074963059243485, 0.0626642796173211, 0.0351236523951451, 0.0201728180626018, 0.014196017550947, 0.0643882909888946, 1.08748973712201e-18, 1.08748973712201e-18, 0.00455330562748865, 0.00408769604752276, 1.08748973712201e-18, 0.00272513069834851, 0.00650167581744194, 0.016350784190091, 1.08748973712201e-18, 1.08748973712201e-18, 0.0122630881425683, 1.08748973712201e-18, 0.00334016875229922, 0.00136256534917425, 1.08748973712201e-18, 0.00545026139669701, 0.00545026139669701, 4.42324620317284e-15, 1.08748973712201e-18, 0.00737413388604955, 0.0269789492628818, 0.00988463958213221, 0.0100520699314415, 1.08748973712201e-18, 0.00316631020385699, 1.08748973712201e-18, 0.010900522793394, 1.08748973712201e-18, 0.010900522793394, 0.00272513069834851, 1.08748973712201e-18, 0.00394072268462079, 0.00852966039575841, 7.59566702198884e-17, 0.0115340943434354, 1.08748973712201e-18, 0.00237636221356833, 1.08748973712201e-18, 1.08748973712201e-18, 0.0136256534917425, 0.00181567829546557, 0.00408769604752276, 1.08748973712201e-18,
1.08748973712201e-18, 0.00545026139669701, 0.00408769604752276, 0.00499540259887787, 0.00272513069834851, 0.00353514720450814, 0.0134931114940269, 0.0108267734159331, 0.00758987521539011, 0.00674655574701343, 0.0094753598393619, 0.0070321331529894, 0.0109631530888968, 0.0109631530888968, 0.00344757540744538, 0.00674655574701343, 0.00505991681026007, 0.0128545761394095, 9.04499029878464e-19, 0.00742363125585919, 0.0101198336205201, 9.04499029878464e-19, 9.04499029878464e-19, 0.00505991681026007, 0.00505991681026007, 9.04499029878464e-19, 0.0118064725572735, 0.00421659734188339, 9.04499029878464e-19, 0.00758987521539011, 9.04499029878464e-19, 9.04499029878464e-19, 9.04499029878464e-19, 0.00190175907624695, 0.00421372139194615, 0.00590323627863675, 0.00703305451506787, 9.04499029878464e-19, 0.0131095012433463, 0.0480692096974707, 0.0857521368222207, 0.0171181159082965, 9.04499029878464e-19, 0.0526012816612792, 0.103417003771369, 0.108788211420591, 0.00746704363785783, 0.0358397179449789, 0.051450983147391, 0.0119940343267555, 0.020404090976265, 0.0173566988936855, 0.0225544149118068, 0.0312028203299371, 0.00337327787350671, 0.00645838048724634, 9.04499029878464e-19, 0.00252995840513004, 9.04499029878464e-19, 0.00187921658868539, 9.04499029878464e-19, 0.00590323627863675, 0.00337327787350671, 9.04499029878464e-19, 0.00421659734188339, 0.002149298816947, 9.04499029878464e-19, 0.00168663893675336, 9.04499029878464e-19, 9.04499029878464e-19, 0.0227696256461676, 0.0118064725572735, 0.0123023875952219, 0.00691512349188281, 0.00147207029502386, 0.00642836104928948, 0.0160230698991569, 0.0132200565053105, 0.00758987521539011, 9.04499029878464e-19, 0.0109631530888968, 9.04499029878464e-19, 9.04499029878464e-19, 0.00337327787350671, 0.00430756215200655, 0.0115872086847547, 0.00337327787350667, 0.0086313879649405, 0.00168663893675336, 0.00105917939270956, 0.000843319468376679, 0.00758987521539011, 9.04499029878464e-19, 0.00309283703504524, 9.04499029878464e-19, 
0.00252995840513004, 0.00421659734188339, 9.04499029878464e-19, 9.04499029878464e-19, 0.00112484084993519, 9.04499029878464e-19, 0.00287194214691947), .Dim = c(96L, 2L), .Dimnames = list(c("A[C>A]A", "A[C>A]C", "A[C>A]G", "A[C>A]T", "C[C>A]A", "C[C>A]C", "C[C>A]G", "C[C>A]T", "G[C>A]A", "G[C>A]C", "G[C>A]G", "G[C>A]T", "T[C>A]A", "T[C>A]C", "T[C>A]G", "T[C>A]T", "A[C>G]A", "A[C>G]C", "A[C>G]G", "A[C>G]T", "C[C>G]A", "C[C>G]C", "C[C>G]G", "C[C>G]T", "G[C>G]A", "G[C>G]C", "G[C>G]G", "G[C>G]T", "T[C>G]A", "T[C>G]C", "T[C>G]G", "T[C>G]T", "A[C>T]A", "A[C>T]C", "A[C>T]G", "A[C>T]T", "C[C>T]A", "C[C>T]C", "C[C>T]G", "C[C>T]T", "G[C>T]A", "G[C>T]C", "G[C>T]G", "G[C>T]T", "T[C>T]A", "T[C>T]C", "T[C>T]G", "T[C>T]T", "A[T>A]A", "A[T>A]C", "A[T>A]G", "A[T>A]T", "C[T>A]A", "C[T>A]C", "C[T>A]G", "C[T>A]T", "G[T>A]A", "G[T>A]C", "G[T>A]G", "G[T>A]T", "T[T>A]A", "T[T>A]C", "T[T>A]G", "T[T>A]T", "A[T>C]A", "A[T>C]C", "A[T>C]G", "A[T>C]T", "C[T>C]A", "C[T>C]C", "C[T>C]G", "C[T>C]T", "G[T>C]A", "G[T>C]C", "G[T>C]G", "G[T>C]T", "T[T>C]A", "T[T>C]C", "T[T>C]G", "T[T>C]T", "A[T>G]A", "A[T>G]C", "A[T>G]G", "A[T>G]T", "C[T>G]A", "C[T>G]C", "C[T>G]G", "C[T>G]T", "G[T>G]A", "G[T>G]C", "G[T>G]G", "G[T>G]T", "T[T>G]A", "T[T>G]C", "T[T>G]G", "T[T>G]T"), c("Signature_1", "Signature_2"))), coSineSimMat = structure(c(0.76789895507522, 0.757733629596811, 0.171803248681684, 0.199391522195904, 0.407671912943102, 0.372979035914154, 0.344078922420868, 0.319857408370786, 0.573357292983596, 0.562412460243176, 0.685700701802704, 0.686217358302521, 0.377725890462418, 0.386689478887272, 0.382312659188403, 0.407516946456442, 0.339149804914427, 0.305305965796845, 0.386629499233586, 0.15685755480318, 0.350678506033931, 0.562433289508901, 0.268840367435164, 0.322933955777266, 0.108666524311962, 0.0628339785033974, 0.49126593617209, 0.527932757462746, 0.47172512923794, 0.461711639647726, 0.362590079921887, 0.387794528034913, 0.154909499589746, 0.154613800740969, 0.303423806321064, 0.204479833568232, 
0.570031076792535, 0.740784602225925, 0.445644443725404, 0.510768207280784, 0.248807908838572, 0.28784224944225, 0.140154287925718, 0.0725523826114571, 0.407829024906403, 0.610157444568381, 0.256945337078229, 0.227615984891259, 0.43572741633734, 0.391864627867027, 0.296855754287958, 0.345602091793204, 0.105681572106723, 0.0918011629446175, 0.0955192240249301, 0.0892005087879189, 0.35734783741945, 0.351111836488432, 0.570462074592721, 0.532907409369077), .Dim = c(2L, 30L), .Dimnames = list(c("Signature_1", "Signature_2"), c("Signature_1", "Signature_2", "Signature_3", "Signature_4", "Signature_5", "Signature_6", "Signature_7", "Signature_8", "Signature_9", "Signature_10", "Signature_11", "Signature_12", "Signature_13", "Signature_14", "Signature_15", "Signature_16", "Signature_17", "Signature_18", "Signature_19", "Signature_20", "Signature_21", "Signature_22", "Signature_23", "Signature_24", "Signature_25", "Signature_26", "Signature_27", "Signature_28", "Signature_29", "Signature_30" )))), .Names = c("signatures", "coSineSimMat"))
```

```{r, fig.width=7, fig.height=5, fig.align='center', eval = T}
plotSignatures(laml.sign)
```

`extractSignatures` gives a warning that no mutations are found for class A[T>G]C conversions. This can happen when the number of samples is low or in tumors with a low mutation rate, as in this case of leukemia. In this scenario, a small positive value is added to avoid computational difficulties. It also prints other statistics for the range of values that was tried, and chooses the rank with the highest cophenetic metric (r = 2 in the example above). These statistics should give an estimate of the range of best possible r values; if the chosen r is an overestimate, `extractSignatures` can be re-run with r specified manually.

Once decomposed, signatures are compared against known and validated signatures from Sanger [11](#references). See [here](http://cancer.sanger.ac.uk/cosmic/signatures) for the list of validated signatures.
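
The comparison described above rests on cosine similarity between mutation profiles. The following is a minimal sketch of that measure in base R, not the code maftools runs internally; `cosine_sim`, `extracted` and `reference` are hypothetical names, and the four-class profiles are toy values rather than real signatures.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy profile standing in for one extracted signature (hypothetical values).
extracted <- c(0.10, 0.60, 0.20, 0.10)

# Toy reference profiles, one per row (hypothetical values).
reference <- rbind(
  Ref_A = c(0.10, 0.60, 0.20, 0.10),
  Ref_B = c(0.70, 0.10, 0.10, 0.10)
)

# Similarity of the extracted profile to every reference row.
sims <- apply(reference, 1, cosine_sim, b = extracted)

# Name of the closest reference profile.
best <- names(which.max(sims))
```

A value close to 1 means the two profiles have nearly the same shape, which is why the reference with the highest similarity is reported as the best match.
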
In the above example, 2 signatures are derived, which are similar to validated Signature-1. Signature_1 is the result of an elevated rate of spontaneous deamination of 5-methyl-cytosine, resulting in C>T transitions that occur predominantly at NpCpG trinucleotides, which is a very common process in AML [12](#references).

The full table of cosine similarities against validated signatures is also returned and can be analysed further. The plot below compares the detected signatures against the validated signatures.

```{r}
require('corrplot')
corrplot::corrplot(corr = laml.sign$coSineSimMat, col = RColorBrewer::brewer.pal(n = 9, name = 'Oranges'), is.corr = FALSE, tl.cex = 0.6, tl.col = 'black', cl.cex = 0.6)
```

NOTE: Should you receive an error while running `extractSignatures` complaining that `none of the packages are loaded`, please manually load the `NMF` library and re-run.

#Variant Annotations

##Annotating variants using Oncotator.

We can also annotate variants using the [oncotator](http://www.broadinstitute.org/oncotator/) API [13](#references). The `oncotate` function queries the oncotator web API to annotate a given set of variants and converts them into MAF format. Input should be a five-column file with chr, start, end, ref_allele, alt_allele. It can contain other information such as sample names (Tumor_Sample_Barcode), read counts, vaf information and so on, but only the first five columns are used; the remaining columns are attached at the end of the table.

```{r}
var.file = system.file('extdata', 'variants.tsv', package = 'maftools')
#This is what input looks like
var = read.delim(var.file, sep = '\t')
head(var)
```

```{r, results='hide', eval=F, message=F}
#Annotate
var.maf = oncotate(maflite = var.file, header = TRUE)
```

```{r, eval = F}
#Results from oncotate. First 20 columns.
var.maf[1:10, 1:20, with = FALSE]
```

NOTE: This is quite time consuming if the input is big.

##Converting annovar output to MAF.
Annovar is one of the most widely used variant annotation tools in genomics [14](#references). Annovar output is generally in a tabular format with various annotation columns. This function converts such annovar output files into MAF. It requires that annovar was run with gene based annotation as the first operation, before any filter or region based annotations.

e.g., `table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol (refGene),cytoBand,dbnsfp30a -operation (g),r,f -nastring NA`

`annovarToMaf` mainly uses gene based annotations for processing; the rest of the annotation columns from the input file are attached to the end of the resulting MAF.

As an example we will annotate the same file which was used above to run the `oncotate` function. We will annotate it using annovar with the following command. For simplicity, we include only gene based annotations here, but you can include as many annotations as you wish. Make sure the first operation is always a gene based annotation.

```{bash, eval = F}
perl table_annovar.pl variants.tsv ~/path/to/humandb/ -buildver hg19 -out variants --otherinfo -remove -protocol ensGene -operation g -nastring NA
```

The output generated is stored as a part of this package. We can convert this annovar output into MAF using `annovarToMaf`.

```{r, eval=T}
var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")
var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19', tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')
```

Annovar, when used with Ensembl as the gene annotation source, uses Ensembl gene IDs as gene names. In that case, use `annovarToMaf` with the argument `table` set to `ensGene`, which converts Ensembl gene IDs into HGNC symbols.

##Converting ICGC Simple Somatic Mutation Format to MAF.

Just like TCGA, the International Cancer Genome Consortium [ICGC](http://icgc.org) also makes its data publicly available.
But the data are stored in [Simple Somatic Mutation Format](http://docs.icgc.org/submission/guide/icgc-simple-somatic-mutation-format/), which is similar to MAF in its structure. However, field names and the classification of variants differ from those of MAF. `icgcSimpleMutationToMAF` is a function which reads ICGC data and converts it to MAF.

```{r}
#Read sample ICGC data for ESCA
esca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")
esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)
#Printing first 16 columns for display convenience.
print(esca.maf[1:5,1:16, with = FALSE])
```

Note that by default Simple Somatic Mutation Format contains all affected transcripts of a variant, resulting in multiple entries of the same variant in the same sample. It is hard to choose a single affected transcript based on annotations alone, so by default this program treats repeated variants as duplicated entries and removes them. If you wish to keep all of them, set `removeDuplicatedVariants` to FALSE. Another option is to convert the input file to MAF by removing duplicates and then use scripts like [vcf2maf](https://github.com/mskcc/vcf2maf) to re-annotate and prioritize affected transcripts.

#Other useful functions.

##Subsetting MAF

We can also subset a MAF using the function `subsetMaf`.

```{r}
##Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933' (Printing just 5 rows for display convenience)
subsetMaf(maf = laml, tsb = c('TCGA.AB.3009', 'TCGA.AB.2933'))[1:5]
##Same as above but return output as an MAF object
subsetMaf(maf = laml, tsb = c('TCGA.AB.3009', 'TCGA.AB.2933'), mafObj = TRUE)
```

###Specifying queries and controlling output fields.
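
The `query` argument used in this section is a character string holding a logical expression over MAF columns. As a rough illustration of that idea only, the sketch below applies the same kind of expression to a plain data.frame. It makes no claim about how `subsetMaf` is implemented; `toy`, `genes`, `keep` and `hits` are hypothetical names, the rows are made up, and evaluating the string with `eval(parse())` is an assumption chosen for the illustration.

```r
# Toy table with made-up rows; column names follow the MAF fields used in this vignette.
toy <- data.frame(
  Hugo_Symbol = c("DNMT3A", "NPM1", "DNMT3A", "TP53"),
  Variant_Classification = c("Splice_Site", "Frame_Shift_Ins",
                             "Missense_Mutation", "Splice_Site"),
  stringsAsFactors = FALSE
)

genes <- c("DNMT3A", "NPM1")
query <- "Variant_Classification == 'Splice_Site'"

# Keep rows that belong to the requested genes AND satisfy the query expression.
# The query string is evaluated with the table's columns in scope (illustration only).
keep <- toy$Hugo_Symbol %in% genes & eval(parse(text = query), envir = toy)
hits <- toy[keep, ]
```

Both conditions are combined, so the TP53 splice-site row is excluded by the gene filter and the remaining DNMT3A and NPM1 rows by the query.
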
```{r}
##Select all Splice_Site mutations from DNMT3A and NPM1
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), query = "Variant_Classification == 'Splice_Site'")
##Same as above but include only 'i_transcript_name' column in the output
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')
```

##Plotting VAF

maftools has a few other functions, such as `plotVaf` and `genesToBarcodes`, which plot vaf distributions and map the samples in which given genes are mutated, respectively.

#References

1. Cancer Genome Atlas Research, N. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-74 (2013).
2. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics (2016).
3. Mermel, C.H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 12, R41 (2011).
4. Olshen, A.B., Venkatraman, E.S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557-72 (2004).
5. Alexandrov, L.B. et al. Signatures of mutational processes in human cancer. Nature 500, 415-21 (2013).
6. Leiserson, M.D., Wu, H.T., Vandin, F. & Raphael, B.J. CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genome Biol 16, 160 (2015).
7. Tamborero, D., Gonzalez-Perez, A. & Lopez-Bigas, N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics 29, 2238-44 (2013).
8. Dees, N.D. et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res 22, 1589-98 (2012).
9. Madan, V. et al. Comprehensive mutational analysis of primary and relapse acute promyelocytic leukemia. Leukemia 30, 1672-81 (2016).
10. Mroz, E.A. & Rocco, J.W.
MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral Oncol 49, 211-5 (2013).
11. Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics 11, 367 (2010).
12. Welch, J.S. et al. The origin and evolution of mutations in acute myeloid leukemia. Cell 150, 264-78 (2012).
13. Ramos, A.H. et al. Oncotator: cancer variant annotation tool. Hum Mutat 36, E2423-9 (2015).
14. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010).

#Session Info

```{r}
sessionInfo()
```