\documentclass[a4paper]{article}

%\VignetteIndexEntry{KCsmart example session}
%\VignetteDepends{}
%\VignetteKeywords{DNA copy number variation}
%\VignettePackage{KCsmart}
\title{KC-SMART Vignette}
\author{Jorma de Ronde, Christiaan Klijn}
\begin{document}
\maketitle

\section{Introduction}
In this Vignette we demonstrate the usage of the
\texttt{KCsmart} R package. This R implementation is based on the original implementation coded in Matlab and first published in January 2008 [1]. 

KC-SMART is a method for finding recurrent gains and losses from a set of tumor samples measured on an array Comparative Genome Hybridization (aCGH) platform. Instead of single tumor aberration calling, and subsequent minimal common region definition, KC-SMART takes an approach based on the continuous raw log2 values.

Briefly: we apply Gaussian locally weighted regression to the summed totals positive and negative log2 ratios separately. This result is corrected for uneven probe spacing as present on the given array. Peaks in the resulting normalized KC score are tested against a randomly permuted background. Regions of significant, recurrent gains and losses are determined using the resulting, multiple testing corrected, significance threshold.

\section{Basic Functionality}
After installing the \texttt{KCsmart} package via Bioconductor, load the KCsmart code into R.
<<>>=
library(KCsmart)
@
Load some sample aCGH data. We supply an example dataset of an old 3000 probe BAC clone experiment. KC-SMART has no problem running on modern 300k+ probe datasets, but this dataset is supplied to keep the package size low. This dataset can be a simple dataframe, \texttt{DNAcopy} object or a \texttt{CGHbase} object, and can not contain any missing values. If the data contains missing values, we advise the user to either impute the missing values using the mean of the two neighboring probes or to discard probes containing missing values. 
<<>>=
data(hsSampleData)
@
This data is the simplest form available to use in KCSMART. As can be observed using:
<<>>=
str(hsSampleData)
@
The user can easily model their data in this format, basically only requiring a column named \texttt{\$chrom} containing the chromosome on which the probe is located and a column named \texttt{\$maploc} containing the chromosomal location of the probe. Alternatively, both \texttt{DNAcopy} and \texttt{CGHbase} objects can be used as direct input for the \texttt{calcSpm} function discussed below.

KC-SMART uses mirroring of the probes at the ends of chromosomes and near centromeres to prevent signal decay by the convolution. The locations of the centromeres and the lengths of the chromosomes are therefore necessary information. Hard-coded information about the human and mouse genome is supplied, but the user can easily provide the coordinates themselves in R. The 'mirrorLocs' object is a list with vectors containing the start, centromere (optional) and end of each chromosome as the list elements. Additionally it should  contain an attribute 'chromNames' listing the chromosome names of each respective list element. The provided 'hsMirrorLocs' can serve as an example of such a list.

To load the presupplied information use:
<<>>=
data(hsMirrorLocs)
@
For an analysis in mouse \texttt{data(mmMirrorLocs)} is used. Using the \texttt{calcSpm} command the convolution is performed at the given parameters and stored in a samplepoint matrix. Here we perform the convolution at two different kernelwidths (first the default sigma = 1Mb, and a second 4Mb sigma). We apply the default parameters to run the convolution.
<<results=hide>>=
spm1mb <- calcSpm(hsSampleData, hsMirrorLocs)
spm4mb <- calcSpm(hsSampleData, hsMirrorLocs, sigma=4000000)
@
Now we plot the chromosome-wide view of the convolution for both the gains and the losses for the 1Mb kernelwidth analysis. Since the samplepoint matrix we just created is an S4 object we can just apply the plot command to it to visualize the results. To keep the size of the vignette down we use an interpolation of 10 here, resulting in every 10th point being plotted.


\begin{center}
<<fig = TRUE>>=
plot(spm1mb, interpolation=10)
@
\end{center}

It is possible to select only particular chromosomes in the plot command, and to only display the gains or only the losses. We will now visualize the KCsmart output for gains of chromosomes 1, 12 and X.

\begin{center}
<<fig = TRUE>>=
plot(spm1mb, chromosomes=c(1, 12, "X"), type="g")
@
\end{center}

As can be seen, in this particular (synthetic) dataset chromosome 1 shows a large recurrent gain compared to other chromosomes. How significant is this gain? We can determine a significance level based on permutation to determine this. In this example we will permute the data 10 times and determine a threshold at $p < 0.05$. We recommend 1000 permutations if you want to do a definitive analysis of a dataset.

<<>>=
sigLevel1mb <- findSigLevelTrad(hsSampleData, spm1mb, n=3, p=0.05)
@

We can now plot the significance level we just calculated in the genomewide view of the KCsmart result using the \texttt{sigLevels=} parameter of the \texttt{plot} function. This time, we'll use the \texttt{type=1} parameter. This will plot the gains and losses in a single frame.

\begin{center}
<<fig = TRUE>>=
plot(spm1mb, sigLevels=sigLevel1mb, type=1, interpolation=10)
@
\end{center}

Let's take a closer look at the gains of chromosome 1, 12 and X again, now with the added significance level.

\begin{center}
<<fig = TRUE>>=
plot(spm1mb, chromosomes=c(1, 12, "X"), type="g", sigLevels=sigLevel1mb)
@
\end{center}

We've also implemented a less stringently corrected version of the permutation algorithm that was shown above. This approach doesn't calculate the Bonferroni corrected significance level, but the False Discovery Rate (FDR). The command to find the FDR for this particular dataset would be \texttt{FDR1mb <- findSigLevelFdr(hsSampleData, spm1mb, n=10, fdrTarget = 0.05)}. In general the FDR is less conservative, and caution must be taken when using this as a threshold.

Now we have our significance level we would like to get usable information back about which regions are found to be significantly abberrant and which probes of the original dataset are contained in these regions. Using the \texttt{getSigSegments} function we can input a samplepoint matrix and our calculated significance level to get back the relevant information.

<<>>=
sigRegions1mb <- getSigSegments(spm1mb,sigLevel1mb)
@

Now let's have a look at the information contained in \texttt{sigRegions1mb}, it is again an S4 object and will display its contents if you just type its name.

<<>>=
sigRegions1mb
@

You could write the infomation contained in \texttt{sigRegions1mb} to a file using \texttt{write.table(write.table(sigRegions1mb, file='sig.txt')}. You can also access the information in R using subsetting. For example, if you wanted to find the probe-names of the probes contained in the first significantly gained region you can get them like this:

<<>>=
sigRegions1mb@gains[[1]]$probes
@

Using \texttt{str(sig.regions.1mb)} you can find out more about the way the significant regions are stored in the sigRegions object.

\section{Multi-scale analysis}

One of the powerful features of KC-SMART is the ability to perform multi-scale analysis. The algorithm can be run using different kernel widths, where small kernel widths will allow you to detect small regions of aberration and vice versa using large kernel widths large regions of aberration can be detected. Remember our convolution step? We calculated the convolution at two sigma's: 1Mb and 4Mb. We will now calculate a significance level for the 4Mb samplepoint matrix and plot the results of both kernelwidths in a scale space figure.

<<>>=
sigLevel4mb <- findSigLevelTrad(hsSampleData, spm4mb, n=3)
@

Now you can plot the scalespace of the two different kernel widths using \texttt{plotScaleSpace(list(spm1mb, spm4mb), list(sigLevel1mb, sigLevel4mb))}. A scale space can show you small and large significant regions in one plot, as well as give you insight within a single significant region. Darker red means more significant. As an option you can plot either the gains or the losses or both. When plotting both, two x11 devices will be opened. Using the 'chromosomes' parameter the chromosomes to be plotted can be set.

\begin{center}
<<fig = TRUE>>=
plotScaleSpace(list(spm1mb, spm4mb), list(sigLevel1mb, sigLevel4mb), type='g')
@
\end{center}


\section{References}

1. Klijn \emph{et. al.} Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data. \emph{Nucleic Acids Research}. \textbf{36} No. 2, e13.

\end{document}