\documentclass{article} \usepackage[utf8]{inputenc} \usepackage{color} \usepackage{graphicx} \usepackage{geometry} \usepackage{amsmath} \usepackage{amsfonts} \usepackage[absolute,verbose,overlay]{textpos} \usepackage{amssymb} \usepackage{tikz} \usepackage{colortbl} \usepackage[figurewithin=section]{caption} \usepackage{amsmath} \usepackage{listings} \usepackage{booktabs} \usepackage{xr} \usepackage{pdflscape} \usepackage{Sweave} %\VignetteIndexEntry{Using appreci8R} <<echo=FALSE>>= options(width=100) @ \title{appreci8R -- an R/Bioconductor package for filtering SNVs and short indels with high sensitivity and high PPV} \author{Sarah Sandmann} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \section{Introduction} For the use of next-generation sequencing in clinical routine valid variant calling results are crucial. However, numerous variant calling tools are available. These tools usually differ in the variant calling algorithsms, the characteristics reported along with the varaint calls, the recommended filtration strategies for the raw calls and thus, also in the output. Especially when calling variants with a low variant allele frequency (VAF), perfect results are hard to obtain. High sensitivity is usually accompanied by low positive predictive value (PPV). appreci8R is a package for combining and filterating the output of differen variant calling tools according to the 'appreci8'-algorithm \cite{appreci8}. vcf as well as txt files containing variant calls can be evaluated. The number of variant calling tools to consider is unlimited (for the user interface version it is limited to 13). The final output contains a list of variant calls, classified as "probably true", "polymophism" or "artifact". Important note: Currently, only hg19 is supported. \subsection{Loading the package} The package can be downloaded and installed with \medskip <<2installing,eval=FALSE>>= BiocManager::install("appreci8R") @ After installation, the package can be loaded into R by typing \medskip <<3loading>>= library(appreci8R) @ \medskip into the R console. \emph{appreci8R} requires the R-packages \emph{shiny}, \emph{shinyjs}, \emph{DT}, \emph{VariantAnnotation}, \emph{BSgenome}, \emph{BSgenome.Hsapiens.UCSC.hg19}, \emph{TxDb.Hsapiens.UCSC.hg19.knownGene}, \emph{Homo.sapiens}, \emph{SNPlocs.Hsapiens.dbSNP144.GRCh37}, \emph{XtraSNPlocs.Hsapiens.dbSNP144.GRCh37}, \emph{rsnps}, \emph{Biostrings}, \emph{MafDb.1Kgenomes.phase3.hs37d5}, \emph{MafDb.ExAC.r1.0.hs37d5}, \emph{MafDb.gnomADex.r2.1.hs37d5}, \emph{COSMIC.67}, \emph{rentrez}, \emph{PolyPhen.Hsapiens.dbSNP131}, \emph{SIFT.Hsapiens.dbSNP137}, \emph{seqinr}, \emph{openxlsx}, \emph{Rsamtools}, \emph{stringr}, \emph{utils}, \emph{stats}, \emph{GenomicRanges}, \emph{S4Vectors}, \emph{GenomicFeatures}, \emph{IRanges}, \emph{GenomicScores} and \emph{SummarizedExperiment}. All of them are loaded automatically when using \emph{appreci8R}. \section{appreci8R -- classical} \subsection{1st analysis step: Target filtration} In the 1st analysis step, all off-target calls are excluded from further analysis. The function \texttt{filterTarget} covers two steps: reading input and target filtration. Available arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{caller\_name} Name of the variant calling tool (only necessary if an output folder is provided). \item \texttt{caller\_folder} Folder containing the variant calling results. \item \texttt{caller\_file\_names\_add} Suffix for naming the variant calling files. If an empty string is provided, it is assumed that the files only contain the sample name, e.g. "Sample1.vcf". \item \texttt{caller\_file\_type} File type of the variant calling results (".vcf" or ".txt"). \item \texttt{caller\_snv\_indel} SNVs and indels are reported in the same file (\texttt{TRUE} or \texttt{FALSE}). \item \texttt{caller\_snv\_names\_add} Suffix for naming the variant calling files containing SNVs (only evaluated if \texttt{caller\_snv\_indel==TRUE}). \item \texttt{caller\_indel\_names\_add} Suffix for naming the variant calling files containing indels (only evaluated if \texttt{caller\_snv\_indel==TRUE}). \item \texttt{caller\_chr} Column of the variant calling input containing information on chr (default: 1). \item \texttt{caller\_pos} Column of the variant calling input containing information on pos (default: 2). \item \texttt{caller\_ref} Column of the variant calling input containing information on ref (default: 4). \item \texttt{caller\_alt} Column of the variant calling input containing information on alt (default: 5). \item \texttt{targetRegions} Data.frame object containing the target regions to be analyzed (bed-format: 1st column chr, 2nd column 0-based start pos, 3rd column 1-based end pos). Or: GRanges object containing the target regions to be analyzed. \end{itemize} First, all files in \texttt{caller\_folder} of the file type \texttt{caller\_file\_type} with the suffix \texttt{caller\_file\_names\_add} are read. Sample names are automatically derived from the file names (e.g. a sample name would be called "Sample1" if a file was called "Sample1.txt" and no suffix was defined; a sample would be called "Sample1.mutations" if a file was called "Sample1.mutations.vcf" and no suffix was defined, but "Sample1" if the suffix ".mutations" was defined). If SNVs and indels are reported in separated files (in the same folder), \texttt{caller\_snv\_indel==TRUE} and \texttt{caller\_snv\_names\_add} and \texttt{caller\_indel\_names\_add} are defined, input from two files per sample is read and automatically combined (e.g. a sample would be called "Sample1" if files "Sample1.SNV.vcf" and "Sample1.indel.vcf" are read and \texttt{caller\_snv\_indel==TRUE}, \texttt{caller\_snv\_names\_add} was defined as ".SNV" and \texttt{caller\_indel\_names\_add} was defined as ".indel"). Subsequently, the read variant calling results are filtered according to the defined target region. All off-target calls are excluded from further analysis. A list of data.frames (one list per sample) with all on-target calls is returned. Every data.frame contains the columns: the SampleID (taken from the input file names), Chr, Pos, Ref and Alt. Exemplary filtration of two vcf-files \texttt{Sample1.rawMutations.vcf} and \texttt{Sample2.rawMutations.vcf}: <<4>>= output_folder<-"" target<-data.frame(chr = c("2","4","12","17","21","X"), start = c(25469500,106196950,12046280,7579470,36164400,15838363), end = c(25469510,106196960,12046350,7579475,36164410,15838366)) target caller_folder <- system.file("extdata", package = "appreci8R") #Input data Sample1: read.table(paste(caller_folder, "/Sample1.rawMutations.vcf", sep=""), header = FALSE, sep = "\t", stringsAsFactors = FALSE) #Input data Sample2: read.table(paste(caller_folder,"/Sample2.rawMutations.vcf",sep=""), header = FALSE, sep = "\t", stringsAsFactors = FALSE) filterTarget(output_folder, "GATK", caller_folder, ".rawMutations", ".vcf", TRUE, "", "", targetRegions = target) @ \subsection{2nd analysis step: Normalization} In the 2nd analysis step, all calls are normalized with respect to reporting indels, MNVs, reporting of several alternate alleles and reporting of complex indels. Available arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{caller\_name} Name of the variant calling tool (only necessary if an output folder is provided). \item \texttt{target\_calls} List of data.frames. One list element per sample. FilterTarget()-output can directly be taken as input. \item \texttt{caller\_indels\_pm} Deletions are reported with a ``minus'' (e.g. C $>$ -A), insertions are reported with a ``plus'' (e.g. C $>$ +A) (\texttt{TRUE} or \texttt{FALSE}). \item \texttt{caller\_mnvs} MNVs are reported (e.g. CA $>$ GT instead of C $>$ G and A $>$ T; or CGAG $>$ TGAT instead of C $>$ T and G $>$ T) (\texttt{TRUE} or \texttt{FALSE}). \end{itemize} The function \texttt{normalize} covers two to four normalization steps: \begin{enumerate} \item Check alternative bases: Calls containing a ``comma'' are split up. A call like C $>$ A,G is converted to C $>$ A and an additional C $>$ G call. This enables evaluation of the output of different callers while caller1 reports C $>$ A,G, caller2 only reports C $>$ A and caller3 only reports C $>$ G. This normalization step is always performed. \item Find string differences: Calls are checked for un-mutated bases. The smallest option of reporting a variant at the left-most position is chosen. For example, CAAAC $>$ CAAC is converted to CA $>$ C. This normalization step is always performed. \item Convert indels: If deletions are reported with a ``minus'' and insertions are reported with a ``plus'', these are converted. An deletion like C $>$ -G is converted to CG $>$ C, while an insertion like C $>$ +G is converted to C $>$ CG. This normalization step is only performed if \texttt{caller\_indels\_pm} is \texttt{TRUE}. \item Convert MNVs: If MNVs are reported, these are converted. This enables evaluation of the output of different callers if not all callers report all mutations being part of an MNV. A call like CA $>$ GT is split up to a C $>$ G and an A $>$ T variant. But also a call like CGAG $>$ TGAT is split up to C $>$ T and G $>$ T (G $>$ G and A $>$ A are not reported as they do not pass the normalization step ``Find string differences''). This normalization step is only performed if \texttt{caller\_mnvs} is \texttt{TRUE}. \end{enumerate} A GRanges object with all normalized calls is returned (metadata columns: SampleID, Ref, Alt). If an output folder is provided, the output is saved as <caller\_name>.normalized.txt. Exemplary normalization of a set of raw calls: <<5>>= output_folder <- "" caller_name <- "" sample1 <- data.frame(SampleID = c("Sample1","Sample1","Sample1"), Chr = c("2","17","X"), Pos = c(25469502,7579472,15838366), Ref = c("CAG","G","C"), Alt = c("TAT","C","T,A"), stringsAsFactors = FALSE) sample2 <- data.frame(SampleID = c("Sample2","Sample2","Sample2","sample2"), Chr = c("4","12","12","21"), Pos = c(106196951,12046289,12046341,36164405), Ref = c("A","C","A","GGG"), Alt = c("G","+AAAG","G","TGG"), stringsAsFactors = FALSE) #Input: input <- list(sample1, sample2) input normalized <- appreci8R::normalize(output_folder, caller_name, input, TRUE, TRUE) normalized @ \subsection{3rd analysis step: Annotation} In the 3rd analysis step, all calls are annotated (using VariantAnnotation) and filtered according to the selected locations and consequences of interest. The function \texttt{annotate} performs annotations of all variants in the input GRanges object (according to VariantAnnotation \cite{va}) and filteres the annotated calls with respect to the selected locations and consequences of interest (based on the functions \texttt{locateVariants} and \texttt{predictCoding} of the package VariantAnnotation). At least one location of interest has to be chosen. If - among others - ``coding'' is selected, at least one consequence of interest has to be chosen as well. Avilable input arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{caller\_name} Name of the variant calling tool (only necessary if an output folder is provided). \item \texttt{normalized\_calls} A GRanges object. One normalized variant call per line. Necessary metadata columns are: SampleID, Ref, Alt. Normalize()-output can directly be taken as input. \item \texttt{locations} A vector containing the locations of interest. Accepted values are: coding, intron, threeUTR, fiveUTR, intergenic, spliceSite, promoter. \item \texttt{consequences} A vector containing the consequences of interest. Accepted values are: synonymous, nonsynonymous, frameshift, nonsense, not translated. \end{itemize} All possible transcripts are investigated. If ``coding'' is chosen and a variant is annotated as ``coding'' in only 1 out of many transcripts, the matching transcript is, together with the annotation information, reported. All the other non-matching transcripts are not reported. A GRanges object is returned containing only the calls at the selected locations of interest and with the selected consequences of interest. Reported metadata columns are: SampleID, Ref, Alt, Location, c. (position of variant on cDNA level), p. (position of variant on protein level), AA\_ref, AA\_alt, Codon\_ref, Codon\_alt, Consequence, Gene, GeneID, TranscriptID. TxDb.Hsapiens.UCSC.hg19.knownGene is used for annotation. Whenever there is more than one transcriptID, all of them are reported, separated by commas. The first transcriptID matches the first entry under Location, c., p. etc. If an output folder is provided, the output is saved as <caller\_name>.annotated.txt. Exemplary annotation of a set of normalized calls: <<6>>= library(GenomicRanges) output_folder <- "" caller_name <- "" input <- GRanges(seqnames = c("2","X","4","21"), ranges = IRanges(start = c (25469504,15838366,106196951,36164405), end = c (25469504,15838366,106196951,36164405)), SampleID = c("Sample1","Sample1","Sample2","Sample2"), Ref = c("G","C","A","G"), Alt = c("T","A","G","T")) #Input: input library(GO.db) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(Biostrings) annotated <- annotate(output_folder, caller_name, input, locations = c("coding","spliceSite"), consequences = c("nonsynonymous","frameshift","nonsense")) annotated @ \subsection{4th analysis step: Output combination} In the 4th analysis step, all normalized and annotated calls of the different variant calling tools are combined. The function \texttt{combineOutput} performs a combination of the normalized and annotated variant calls from different variant calling tools. The results are sorted according to Chr, Pos, Ref, Alt and SampleID. If two callers - caller1 and caller2 - report the same variant for the same sample, it is only reported once in the output file with a ``1'' in the column ``caller1'' and another ``1'' in the column ``caller2''. Available arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{caller\_names} Vector containing the name of the variant calling tools. \item \texttt{annotated\_calls} A GRangesList object. Every list element contains the variant calls of one variant calling tool. Annotate()-output can directly be taken as input. Providing a list with only one list element, i.e. evaluating only one variant calling tool, is possible, but not assumed to be useful. The point of the appreci8-algorithm lies in the combined evaluation and filtration of the output of several variant calling tools. \end{itemize} A GRanges object is returned containing all combined calls. Reported metadata columns are: SampleID, Ref, Alt, Location, c. (position of variant on cDNA level), p. (position of variant on protein level), AA\_ref, AA\_alt, Codon\_ref, Codon\_alt, Consequence, Gene, GeneID, TranscriptID. In addition, one column for every variant calling tool is reported, containing a ``1'' for every call detected by that tool and ``NA'' for every call not detected by that tool. If an output folder is provided, the output is saved as Results\_Raw.txt. Exemplary combination of a set of annotated calls: <<7>>= output_folder <- "" caller_name <- "" #Input: gatk <- annotated[c(1,3),] varscan <- annotated[c(2,3),] annotated<-GRangesList() annotated[[1]]<-gatk annotated[[2]]<-varscan annotated combined<-combineOutput("", c("GATK","VarScan"), annotated) combined @ \subsection{5th analysis step: Coverage and BQ determination} In the 5th analysis step, all calls are evaluated with respect to coverage and basequality (using Rsamtools \cite{rsamtools}). Calls with insufficient coverage and/or basequality are filtered from further analysis. The function \texttt{evaluateCovAndBQ} evaluates coverage and base quality of all calls and filteres them according to user-definable thresholds. Only those calls that feature sufficient coverage and base quality are reported. Filtration is performed in a 2-step procedure. Available arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{combined\_calls} A GRanges object; necessary input object. The GRanges object contains one call per line. CombineOutput()-output can directly be taken as input. \item \texttt{bam\_folder} The folder containing the bam- and bai-files for every sample. The file names have to match the sample names, i.e. Sample1.bam and Sample1.bai if Sample1.vcf (without any defined suffixes) was analyzed for caller1. \item \texttt{dp} Minimum depth that is required for a call to pass the filter (default: 50). \item \texttt{nr\_alt} Minimum number of reads carrying the alternate allele that is required for a call to pass the filter (default: 20). \item \texttt{vaf} Minimum VAF that is required for a call to pass the filter (default: 0.01). \item \texttt{bq} Minimum base quality that is required for a call to pass the filter (default: 15; values above 44 not recommended). \item \texttt{bq\_diff} Maximum difference in basequality between reference and alternative that is allowed for a call to pass the filter (default: 7). \end{itemize} First, coverage is evaluated using Rsamtools \texttt{pileup} and a threshold of \texttt{min\_base\_quality=0}. If depth is below the defined threshold \texttt{dp} and/or the number of reads carrying the alternate allele is below the defined threshold \texttt{nr\_alt} and/or VAF is below the defined threshold \texttt{VAF}, a call is excluded. However, if coverage is sufficient, base quality - which is more time-consuming - is evaluated. To evaluate base quality, the threshold for \texttt{min\_base\_quality} is successively increased from 0 to 44 (thus, no values above 44 are recommended for \texttt{bq}). The average base quality for all reads carrying the reference and the alternate allele is calculated. For indels, which are always summed up as ``+'' by Rsamtools, no base quality can be determined. If the average base quality of the alternate allele is lower than \texttt{bq}, the call is filtered. If ``average base quality reference allele'' - ``average base quality alternate allele'' is higher than \texttt{bq\_diff}, the call is filtered. Coverage is reported with respect to the forward- and reverse reads separately. It is not yet evaluated. This is done in the 7th analysis step (\texttt{finalFiltration}). A GRanges object is returned containing all calls with sufficient coverage and base quality. Reported metadata columns are: SampleID, Ref, Alt, Location, c. (position of variant on cDNA level), p. (position of variant on protein level), AA\_ref, AA\_alt, Codon\_ref, Codon\_alt, Consequence, Gene, GeneID, TranscriptID, Caller1 to CallerX (dependent on the number of callers that is evaluated), Nr\_Ref, Nr\_Alt, DP, VAF, BQ\_REF, BQ\_ALT, Nr\_Ref\_fwd, Nr\_Alt\_fwd, DP\_fwd, VAF\_rev, Nr\_Ref\_rev, Nr\_Alt\_rev, DP\_rev, VAF\_rev. If an output folder is provided, the output is saved as Results\_Frequency.txt. Exemplary filtration of a set of combined calls: <<8>>= output_folder <- "" bam_folder <- system.file("extdata", package = "appreci8R") bam_folder <- paste(bam_folder, "/", sep="") #Input: combined filtered<-evaluateCovAndBQ(output_folder, combined, bam_folder, bq_diff = 20, vaf = 0.001) filtered @ \subsection{6th analysis step: Characteristics determination} In the 6th analysis step, an extended set of characteristics for all calls is determined. This includes database check-ups and impact prediction on protein level. The function \texttt{evaluateCharacteristics} determines an extended set of characteristics for the normalized, annotated, coverage and base quality filtered calls. This includes database check-ups and impact prediction on protein level. Available arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{frequency\_calls} A GRanges object. The GRanges object contains one call per line. EvaluateCovAndBQ()-output can directly be taken as input. \item \texttt{predict} String defining the prediction tool to use (accepted strings are: SIFT, Provean, PolyPhen2) \item \texttt{dbSNP} Query dbSNP (144.GRCh37) for the presence of your variants (\texttt{TRUE} or \texttt{FALSE}; default \texttt{TRUE}). \item \texttt{'1kgenomes'} Query 1000 Genomes (phase3) for the presence of your variants (\texttt{TRUE} or \texttt{FALSE}; default \texttt{TRUE}). \item \texttt{exacDB} Query ExAC (r1.0) for the presence of your variants (\texttt{TRUE} or \texttt{FALSE}; default \texttt{TRUE}). \item \texttt{gadDB} Query Genome Aggregation Database (r2.0.1) for the presence of your variants (\texttt{TRUE} or \texttt{FALSE}; default \texttt{TRUE}). \item \texttt{cosmicDB} Query COSMIC (67) for the presence of your variants (\texttt{TRUE} or \texttt{FALSE}; default \texttt{TRUE}). \item \texttt{clinvarDB} Query ClinVar (via rentrez) for the presence of your variants (\texttt{TRUE} or \texttt{FALSE}; default \texttt{TRUE}). \end{itemize} First, the database check-up is performed. At a maximum, dbSNP, 1000Genomes, ExAC, Genome Aggregation Database, COSMIC and ClinVar can be checked. If all the variables are set to \texttt{FALSE}, no database is checked and none will be reported. Thus, the check-up can be disabled. Second, the impact perdiction on protein level is performed. Impact can be predicted using either SIFT, Provean or PolyPhen2. In any case, the prediction is performed on the basis of rs-IDs. For every call, the predicted effect as well as the score is reported. A GRanges Object is returned containing all calls, information on their presence in the selected databases and information on the predicted effect. Reported metadata columns are (if all databases are selected): SampleID, Ref, Alt, dbSNP (containing the rs-ID), dbSNP\_MAF, G1000\_AF, ExAC\_AF, GAD\_AF, CosmicID, Cosmic\_Counts (number of Cosmic entries), ClinVar, Prediction (damaging or neutral), Score (on the basis of the selected prediction tool), c. (variant on cDNA level containing position and variant), c.complement (complement of the variant; if c. is c.5284A$>$G, then c.complement is c.5284T$>$C), p. (variant on protein level containing position and amino acids). Exemplary characteristics of a set of filtered calls: <<9>>= output_folder <- "" #Input: filtered characteristics <- determineCharacteristics(output_folder, filtered, predict = "Provean") characteristics @ \subsection{7th analysis step: Final filtration} In the 7th analysis step, the final filtration according to the appreci8-algorithm is performed. The function \texttt{finalFiltration} performs the final filtration according to the appreci8-algorithm. The previously determined characteristics of the calls are evaluated and an automatic categorization of the calls is performed. Possible categories are: Probably true or Hotspot, Polymorphism and Artifact. Available arguments are: \begin{itemize} \item \texttt{output\_folder} The folder to write the output files into. If an empty string is provided, no files are written out. \item \texttt{frequency\_calls} A GRanges object. The GRanges object contains one call per line. EvaluateCovAndBQ()-output can directly be taken as input. \item \texttt{database\_calls} A GRanges object. The GRanges object contains one call per line. EvaluateCharacteristics()-output can directly be taken as input. \item \texttt{combined\_calls} A GRanges object. The GRanges object contains one call per line. CombineOutput()-output can directly be taken as input. \item \texttt{damaging\_safe} Threshold for the impact prediction score, identifying reliably damaging calls. For example, if Provean has been selected as the impact prediction tool, the internal threshold of the tool is -2.5. However, calls with a score of -2.51 are less likely to have a correct ``Damaging'' prediction compared to calls with a score of -12.0. A threshold of -5.0 for \texttt{damaging\_safe} marks all calls with a score below -5.0 as reliably damaging. \item \texttt{tolerated\_safe} Threshold for the impact prediction score, identifying reliably tolerated calls. For example, if Provean has been selected as the impact prediction tool, the internal threshold of the tool is -2.5. However, calls with a score of -2.49 are less likely to have a correct ``Neutral'' prediction compared to calls with a score of +7.0. A threshold of -1.0 for \texttt{tolerated\_safe} marks all calls with a score above -1.0 as reliably tolerated. \item \texttt{primer} Optional: Data.frame containing primer positions in bed-format, i.e. 1st column chromosome, 2nd column 0-based start position, 3rd column 1-based end position (default: NA). \item{hotspots} Optional: Data.frame containing expected/hotspot mutations: 1st column gene name (e.g. ASXL1), 2nd column mutation on AA level (e.g. G12S or E628fs or I836del or G12 if any AA change at this position is considered an expected/hotspot mutation), 3rd column minimum VAF or NA (default: NA). \item \texttt{overlapTools} Vector of strings containing the names of variant calling tools that, should a call be overlappingly reported by these tools, will be rewarded. \item \texttt{nrsamples} Threshold for the number of samples. Shold a variant be detected in more than the threshold defined by \texttt{nrsamples}, it will be interpreted as possible evidence for an artifact (default: 3). \item \texttt{dp} Minimum depth that is required for a call (default: 50). \item \texttt{nr\_alt} Minimum number of reads carrying the alternate allele that is required for a call (default: 20). \item \texttt{vaf} Minimum VAF that is required for a call (default: 0.01). \item \texttt{bq} Minimum base quality that is required for a call (default: 15; values above 44 not recommended). \item \texttt{bq\_diff} Maximum difference in basequality between reference and alternative that is allowed for a call (default: 7). \item \texttt{detectedLow} Value added to the artifact score if more than \texttt{nrsamples} number of samples feature the same call (default: 2). \item \texttt{detectedHigh} Value added to the artifact score if more than 50\% of the samples analyzed (if more than 1 sample is analyzed) feature the same call, which is not an expected/hotspot mutation (default: 2). \item \texttt{isIndel} Value added to the artifact score if the call is an indel (default: 1). \item \texttt{isIndelVAF} Value added to the artifact score if the call is an indel and the VAF is below 0.05 (default: 1). \item \texttt{detectedLowVAF} Value added to the artifact score if more than \texttt{nrsamples} number of samples feature the same call and the VAF is above 0.85 (default: 2). \item \texttt{noPrimerP} Value added to the artifact score if no primer information is provided or no primer is located at this position and the p value (Fisher's Exact Test evaluating forward and reverse reads to determine strandbias) is below 0.001 (default: 1). \item \texttt{primerPAlt} Value added to the artifact score if no primer information is provided or no primer is located at this position and the p value (Fisher's Exact Test evaluating forward and reverse reads to determine strandbias) is below 0.001 and Nr\_Alt\_fwd is at least \texttt{Nr\_Alt/2} and Nr\_Alt\_rev is at least \texttt{Nr\_Alt/1} (default: -1). \item \texttt{noPrimerPFwd} Value added to the artifact score if no primer information is provided or no primer is located at this position and the p value (Fisher's Exact Test evaluating forward and reverse reads to determine strandbias) is at least 0.001 and Nr\_Ref\_fwd is at least \texttt{(DP-Nr\_Alt)/2} and Nr\_Alt\_fwd is below 3 (default: 1). \item \texttt{primerPFwd} Value added to the artifact score if no primer information is provided or no primer is located at this position and the p value (Fisher's Exact Test evaluating forward and reverse reads to determine strandbias) is below 0.001 and Nr\_Ref\_fwd is below \texttt{(DP-Nr\_Alt)/2} and Nr\_Alt\_fwd is below 3 (default: -1). \item \texttt{noPrimerPRev} Value added to the artifact score if no primer information is provided or no primer is located at this position and the p value (Fisher's Exact Test evaluating forward and reverse reads to determine strandbias) is at least 0.001 and Nr\_Ref\_rev is at least \texttt{(DP-Nr\_Alt)/2} and Nr\_Alt\_rev is below 3 (default: 1). \item \texttt{primerPRev} Value added to the artifact score if no primer information is provided or no primer is located at this position and the p value (Fisher's Exact Test evaluating forward and reverse reads to determine strandbias) is below 0.001 and Nr\_Ref\_rev is below \texttt{(DP-Nr\_Alt)/2} and Nr\_Alt\_rev is below 3 (default: -1). \item \texttt{primerLocation} Value added to the artifact score if a call is located at a position where a primer is located (default: -1). \item \texttt{vafLow} Value added to the artifact score if the VAF is below 0.02 (default: 2). \item \texttt{databaseVAF} Value added to the artifact score if the VAF is below 0.10 and the variant was not found in any database (default: 1). \item \texttt{databaseHigh} Value added to the artifact score if the variant was detected in at least 50\% of all samples, but not found in any database (default: 1). \item \texttt{predictionSafe} Value added to the artifact score if the variant has a reliable damaging prediction (default: -1). \item \texttt{predictionVAF} Value added to the artifact score if the variant has a reliable tolerated prediction and a VAF below 0.35 or between 0.65 and 0.85 (default: 1). \item \texttt{nrcaller4} Intermediate number of callers that overlappingly reported a variant (default: 4). \item \texttt{reward4} Value added to the artifact score if a variant is overlappingly reported by \texttt{nrcaller4} callers (default: -1). \item \texttt{nrcaller5} High number of callers that overlappingly reported a variant (default: 5). \item \texttt{reward5} Value added to the artifact score if a variant is overlappingly reported by \texttt{nrcaller5} callers (default: -1). \item \texttt{nrcaller6} Very high number of callers that overlappingly reported a variant (default: 6). \item \texttt{reward6} Value added to the artifact score if a variant is overlappingly reported by \texttt{nrcaller6} callers (default: -1). \item \texttt{oneCaller} Value added to the artifact score if a variant is only reported by one caller (default: 1). \item \texttt{BQ\_AltMean} Value added to the artifact score if the base quality of a variant is below \texttt{mean(BQ\_Alt)-3*(BQ\_Alt)} over all variants (default: 4). \item \texttt{knownHotspot} Value added to the artifact score if a variant is an expected/hotspot mutation (default: -3). \item \texttt{overlapReward} Value added to the artifact score if a variant is overlappingly reported by the tools defined in \texttt{overlapTools} (default: -3). \item \texttt{artifactThreshold} Threshold for the artifact score, i.e. variants with a score below this threshold will be categorized as ``probably true'' (default: 0). \item \texttt{polyDetected} Value added to the polymorphism score if more than \texttt{nrsamples} number of samples feature the same call (default: 1). \item \texttt{polyDetectedOnce} Value added to the polymorphism score if a variant is only detected in one sample (default: -1). \item \texttt{polyDatabasesPolyLow} Intermediate number of polymorphism databases that have information on a variant (default: 2). \item \texttt{polyDatabasesPolyLowReward} Value added to the polymorphism score if a variant is present in at least \texttt{polyDatabasesPolyLow} databases (default: 1). \item \texttt{polyDatabasesPolyHigh} High number of polymorphism databases that have information on a variant (default: 4). \item \texttt{polyDatabasesPolyHighReward} Value added to the polymorphism score if a variant is present in at least \texttt{polyDatabasesPolyHigh} databases (default: 1). \item \texttt{polyDatabasesMut} Critical number of mutation databases that have information on a variant (default: 2). \item \texttt{polyDatabasesMutReward} Value added to the polymorphism score if a variant is present in at least \texttt{polyDatabasesMut} databases (default: -1). \item \texttt{polyNoDatabase} Value added to the polymorphism score if a variant is not present in any polymorphism database (default: -1). \item \texttt{polyDatabases} High number of databases that have information on a variant (default: 6). \item \texttt{polyDatabasesReward} Value added to the polymorphism score if a variant is present in at least \texttt{polyDatabases} databases (default: 1). \item \texttt{polyEffect} Value added to the polymorphism score if a variant is not a frameshift variant, not a stop gain variant and not a stop lost variant (default: 1). \item \texttt{polyVAF} Value added to the polymorphism score if the VAF of a variant is between 0.65 and 0.35 or above 0.85 (default: 1). \item \texttt{polyPrediction} Value added to the polymorphism score if the variant has a reliable tolerated prediction (default: 1). \item \texttt{polyPredictionEffect} Value added to the polymorphism score if a variant has a reliable damaging prediction or is a stop gain variant or is a stop lost variant (default: -1). \item \texttt{polyCosmic} Critical number of COSMIC entries (default: 100). \item \texttt{polyThresholdCritical} Threshold for the polymorphism score if the number of COSMIC entries is not critical, i.e. variants with at least this score will be categorized as ``polymorphism'' (default: 2). \item \texttt{polyThreshold} Threshold for the polymorphism score if the number of COSMIC entries is critical, i.e. variants with at least this score will be categorized as ``polymorphism'' (default: 3). \item \texttt{PolymorphismVAF10} Value added to the artifact score if a variant is a ``polymorphism'' based on the polymorphism score, but the VAF is below 0.10 (default: 5). \item \texttt{PolymorphismVAF20} Value added to the artifact score if a variant is a ``polymorphism'' based on the polymorphism score, but the VAF is below 0.20 (default: 2). \item \texttt{PolymorphismFrame} Value added to the artifact score if a variant is a ``polymorphism'' based on the polymorphism score, but it is a frameshift variant (default: 2). \end{itemize} Final filtration consists of several steps: \begin{enumerate} \item Frequency and base quality are re-considered. Stricter thresholds compared to \texttt{evaluateCovAndBQ} can be defined. \item Samples with the same call are considered. Counting is based on the normalized and annotated calls, not on the coverage- and base quality filtered calls. \item Samples with a call at the same position are considered (e.g. A>AG at pos 1 in sample 1 and A>AGG at pos 1 in sample 2 are reported as 2 calls at the same position). Counting is based on the normalized and annotated calls, not on the coverage- and base quality filtered calls. \item Background information is considered. The number of samples with the same variant in the coverage- and base quality filtered data are considered. \item Number of databases is considered. Dependent on the previously selected number of databases. The presence of a variant in mutation- and polymorphism databases is evaluated. \item VAF in relation to a predicted effect is considered. Variants are marked if their VAF is typical of polymorphisms and their predicted effect is ``tolerated''. \item VAF in relation to the number of samples is considered. Variants are marked if they are detected in more than \texttt{nrsamples} samples and the VAF is at least 0.85 in more than 90\% of these samples. \item Strandbias is considered. Fisher's Exact Test is performed to analyze the relation between forward-reverse and reference-alternate allele. \item Hotspot list - if any is provided - is considered. Variants are marked if they are present on the hotspot list. \item Final filtration is performed. The artifact- and the polymorphism score are calculated. On the basis of these two scores, a call is either classified as a probably true/hotspot call, polymorphism or artifact. \end{enumerate} A GRanges object is returned containing all calls with a predicted category. Categories can be: probably true, hotspot, polymorphism or artifact. Reported metadata columns are: SampleID, Ref, Alt, Gene, GeneID, TranscriptID, Location, Consequence, c. (variant on cDNA level containing position and variant), c.complement (complement of the variant; if c. is c.5284A>G, then c.complement is c.5284T>C), p. (variant on protein level containing position and amino acids), Codon\_ref, Codon\_alt, Nr\_Ref, Nr\_Alt, DP, VAF, Caller1 to CallerX (dependent on the number of callers that is evaluated), dbSNP (containing the rs-ID), dbSNP\_MAF, G1000\_AF, ExAC\_AF, GAD\_AF, CosmicID, Cosmic\_Counts (number of Cosmic entries), ClinVar, Prediction (damaging or neutral), Score (on the basis of the selected prediction tool), BQ\_REF, BQ\_ALT, Nr\_Ref\_fwd, Nr\_Alt\_fwd, DP\_fwd, VAF\_rev, Nr\_Ref\_rev, Nr\_Alt\_rev, DP\_rev, VAF\_rev, strandbias, nr\_samples (number of samples with the same variant), nr\_samples\_similar (number of samples with a variant at the same position), Category. If an output folder is provided, the output is saved as Results\_Final.txt. Additionally, Results\_Final.xlsx is saved, containing three sheets: Mutations, Polymorphisms and Artifacts. <<10>>= output_folder <- "" #Input: filtered characteristics combined final<-finalFiltration(output_folder, frequency_calls = filtered, database_calls = characteristics, combined_calls = combined, damaging_safe = -3, tolerated_safe = -1.5, overlapTools = c("VarScan"), bq_diff = 20,vaf = 0.001) final @ \section{appreci8R - GUI} \emph{appreci8R} combines and filters the output of different variant calling tools according to the 'appreci8'-algorithm. The whole analysis-pipeline consists of seven steps. Individual functions for executing these steps are available. However, an additional user-interface is available that enables easy execution of the analysis-pipeline, restarting parts of the pipeline from checkpoints, changing parameters, exporting and importing the configuration and evaluation of the results. The user-interface is opend by executing the function \texttt{appreci8Rshiny} (no arguments have to be provided). The user-interface is structured by tabs: \emph{Analysis}, \emph{Overview results} and \emph{Detailed results}. \begin{itemize} \item \emph{Analysis}: The tab is essential for performing the whole analysis. It is further separated into 10 additional tabs: \begin{itemize} \item \texttt{Important notes}: General information on prerequisites for performing the analysis and current limitations (only hg19). \item \texttt{0. Preparations}: Defining the general output folder and the varaint caller input. GATK, Platypus, VarScan, FreeBayes, LoFreq, SNVer, SamTools and VarDict are available as these make up the classical appreci8-pipeline. The tools can be \texttt{added} or \texttt{removed}. For every tool the following parameters have to be defined: the folder containing the variant calling results, the output file type (vcf or txt; as vcf is a standardized format, the columns containing chr, pos, ref and alt are assumed to be fixed; when selecting txt, the columns containing chr, pos, ref and alt have to be defined), a possible suffix for naming the output, if MNVs are reported (AG>CT instead of A>C and G>T), if SNVs and indels are reported in one file (if not, what is the suffix for the SNV files and what is the suffix for the indel files?), and how are indels reported. In addition to the eight tools, up to 5 additional tools can be added (select the number and click \texttt{Go}). In addition to the previous parameters, a name can be defined for each additionally added tool. \item \texttt{1. Target filtration}: Upload your target region (bed file; chromosome definition without 'chr'). \item \texttt{2. Normalization} \item \texttt{3. Annotation}: Annotates the variants and filters them according to the selected locations (coding, intron, threeUTR, fiveUTR, intergenic, spliceSite, promoter) and consequences (synonymous, nonsynonymous, frameshift, nonsense, not translated) of interest. \item \texttt{4. Combine output} \item \texttt{5. Evaluate Coverage and BQ}: Evaluate coverage and base quality by analyzing the bam files. Calls with insufficient coverage and or base quality are filtered. The following parameters have to be defined: the folder containing the bam- and bai files, minimum coverage, minimum number of reads carrying the alternate allele, minimum VAF, minimum base quality and the maximum difference in base quality between reference and alternative. \item \texttt{6. Extended Set of Characteristics}: Select the databases you would like to query (dbSNP, 1000Genomes, ExAC, Genome Aggregation Database, COSMIC, ClinVar) and the source you would like to use for impact prediction (SIFT, Provean, PolyPhen2). \item \texttt{7. Final Filtration}: Adjust parameters and define input for the final filtration. A bed file containing the primer positions can be uploaded (optional). A txt file containing expected/hotspot mutations (defined via gene, mutation on AA level, and an optional minimum VAF) can be uploaded (optional). Stricter thresholds for coverage and base quality can be defined (see parameters for \texttt{5. Evaluate Coverage and BQ}). The number of samples making a call striking can be defined. Dependent on the previously selected source for impact prediction (SIFT, Provean, PolyPhen2), a threshold can be defined to identify reliable damaging predictions and reliable tolerated prediction. Furthermore, the default scoring for the artifact- and the polymorphism score can be changed (just recommended for experts). \item \texttt{Action}: The complete analysis can be executed (\texttt{Start complete analysis}). The availability of checkpoints can be checked (\texttt{Check for possible checkpoints}). If checkpoints are available, one can be selected and the analysis can be started from the selected checkpoint. The defined configuration can be exported (\texttt{Export current configuration}) as well as imported (\texttt{Import configuration}). Press \texttt{Done} if you are done with your analysis. \end{itemize} \item \emph{Overview results}: The tab provides an overview of the results. It is further separated into 5 additional tabs: \begin{itemize} \item \texttt{Log}: The log of the analysis is displayed here. In case of errors, e.g. input not being available, these messages are also reported in the log. \item \texttt{Raw Calls}: The raw number of calls per sample and per caller are reported. \item \texttt{Calls On Target}: The number of calls on target per sample and per caller are reported. \item \texttt{Annotaed Calls}: The number of annotated (and filtered according to annotation) calls per sample and per caller are reported. \item \texttt{Filtered Calls}: The number of coverage and base quality filtered calls per sample is reported. \end{itemize} \item \emph{Detailed results}: The tab provides detailed information on the results. It is further separated into 3 additional tabs: \begin{itemize} \item \texttt{Mutations}: Variant calls classified as likely true mutations. \item \texttt{Polymorphisms}: Variant calls classified as likely polymorphisms. \item \texttt{Artifacts}: Variant calls classified as likely artifacts. \end{itemize} \end{itemize} No value is returned. Output files for the different analysis steps are automatically saved. <<11,eval=FALSE>>= appreci8Rshiny() @ \newpage \bibliographystyle{plain} \bibliography{bibpaper} \end{document}