---
title: "An introduction to *chromswitch* for detecting chromatin state switches"
shorttitle: "An introduction to *chromswitch*"
author: Selin Jessa and Claudia L. Kleinman
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
    df_print: paged
abstract: >
    An important question in comparative analysis of epigenomic data is how 
    chromatin state differs between biological conditions. The package
    *chromswitch* implements a method for integrating epigenomic data to
    identify chromatin state switches in genomic regions of interest between
    samples in two biological conditions.
    Chromswitch is flexible in its input; possible data types include ChIP-seq
    peaks for histone modification, DNase-seq peaks, or previously-learned
    chromatin state segmentations. Chromswitch transforms epigenomic data in the
    query region into a sample-by-feature matrix using one of two strategies,
    clusters samples hierarchically, and then uses external cluster validity
    measures to predict a switch in chromatin state. Chromswitch is robust
    to small sample sizes and high class imbalance, free from data-intensive
    training steps, and suitable for analyzing marks with diverse binding
    profiles.
vignette: >
    %\VignetteIndexEntry{An introduction to `chromswitch` for detecting chromatin state switches}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


```{r echo = FALSE, warning = FALSE, message = FALSE}
library(BiocParallel)
register(bpstart(SerialParam()))
```

# Overview of this vignette

This vignette is organized hierarchically in terms of level of detail:

* In the Quickstart section, we show a basic analysis with
chromswitch using a small dataset included in the package.

* In the next section, we give a brief overview of the method for detecting
chromatin state switches, referencing specific
functions in chromswitch which implement the different steps of the method.

* In the Walkthrough, we demonstrate a basic analysis
chromswitch with a discussion of data import, input, the most important 
parameters available to the user, and interpretation of chromswitch output.
In this section we use the wrapper functions which
provide one-line commands to call chromatin state switches
based on a single mark or type of input data.

* In last section, Step-by-step, we demonstrate how the steps of the 
method can be run individually to gain finer control over the analysis and
show the intermediate results from chromswitch available to the user.

# Quickstart

Load `r Biocpkg("chromswitch")`:
```{r setup, warning = FALSE, message = FALSE}
library(chromswitch)
```

We will use the package `r Biocpkg("rtracklayer")` to import data from BED files:
```{r message = FALSE, warning = FALSE}
library(rtracklayer)
```

We'll start with a toy dataset containing MACS2 narrow peak calls for H3K4me3
ChIP-seq in 3 brain tissues and 3 other adult tissues from the Roadmap
Epigenomics Project, restricted to a short region on chromosome 19. In the code below, we import the input to chromswitch and run chromswitch on our dataset. This
involves constructing a feature matrix, and we will describe two ways of
doing so. We can then call chromatin state switches by thresholding on the value of
the Consensus score, which scores the similarity between the cluster assignments
and the biological condition labels.

Chromswitch essentially requires 3 inputs:

1. **Genomic query region(s)**: a `GRanges` object storing one or more regions of
interest
2. **Reference metadata dataframe**: a dataframe with at least two columns: *Sample* which stores
sample IDs (these can be any strings), and *Condition*, which stores the
biological condition labels of the samples (the conditions must be strings with only two different values in the column, *e.g.* 'Fetal'/'Adult' or 'WT'/'Mut'). Additional columns are not used.
3. **Reference peaks or epigenomic features**: a list of `GRanges` objects, each of which stores
peaks or features for one sample, with elements named according to the sample IDs as
specified in the metadata

The latter two inputs define the dataset on which the query is performed.

Each of these inputs can be imported from TSV or BED files. Learn more about `GRanges` objects by checking out `r Biocpkg("GenomicRanges")`.

Here we use the `import()` function from `r Biocpkg("rtracklayer")` to import
query regions stored in a BED file.
```{r qs_query}
# Path to BED file in chromswitch installation
query_path <- system.file("extdata/query.bed", package = "chromswitch")

# Read in BED file, creating a GRanges object
query <- rtracklayer::import(con = query_path, format = "BED")
query
```

```{r qs_meta}
# Path to TSV in chromswitch
meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch")

# Read in the table from a 2-column TSV file
metadata <- read.delim(meta_path, sep = "\t", header = TRUE)
metadata
```

Here we assume that peaks are in the [ENCODE narrowPeak format](https://genome.ucsc.edu/FAQ/FAQformat.html#format12).

*Note:* If the metadata file has an additional column containing the path for each sample, then that column can be passed to this function,
*e.g.* `readNarrowPeak(paths = metadata$path, metadata = metadata)`.
```{r qs_pks}
# Paths to the BED files containing peak calls for each sample
peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"),
                     package = "chromswitch")

# Import BED files containing MACS2 narrow peak calls using rtracklayer
peaks <- readNarrowPeak(paths = peak_paths,  # Paths to files,
                        metadata = metadata) # Metadata dataframe
```

## Using the summary strategy

Run chromswitch using the summary strategy:
```{r quickstart_1, warning = FALSE}
callSummary(query = query,       # Input 1: Query regions
            metadata = metadata, # Input 2: Metadata dataframe
            peaks = peaks,       # Input 3: Peaks
            mark = "H3K4me3")    # Arbitrary string describing the data type
```

Chromswitch outputs a measure of cluster quality (*Average_Silhouette*), the score predicting a chromatin state switch (*Consensus*), and the cluster assignment for each sample.

Looking at the Consensus score in each case, which represents the similarity
between the cluster assignments and the biological groups of the samples,
here we see a good agreement in the first region, indicating a switch, and
poor agreement in the second, indicating the absence of a switch. This score takes on values between -1 and 1, where 1 represents a perfect agreement between cluster assignments and the biological conditions of the sample

## Using the binary strategy

Run chromswitch using the binary strategy:
```{r quickstart_2, warning = FALSE}
callBinary(query = query,       # Input 1: Query regions
           metadata = metadata, # Input 2: Metadata dataframe
           peaks = peaks)       # Input 3: Peaks
```

The output has the same format for both strategies. In this case, both strategies predict a switch for the first region, and a non-switch for the second region. 

Both of these wrapper functions have additional parameters which allow for greater
sensitivity and finer control in the analysis. These are explored in the
rest of the vignette.

# Overview of the method

Our method for detecting chromatin state switches involves three steps.
These are illustrated in the figure below, which is a schematic for the analysis
performed on one query region, based on the reference metadata and peaks or features.

As input,
chromswitch requires epigenetic features represented by their genomic
coordinates and optionally, some associated statistics. Possible examples
include ChIP-seq or DNase-seq peaks, or previously-learned chromatin state
segmentations such as from ChromHMM. Here we'll refer to peaks for simplicity, but the analysis 
is the same for other types of epigenetic features given as intervals.

In the pre-processing phase, the user can set thresholds on any statistics
associated with peaks and filter out peaks below these thresholds
(`filterPeaks()`). These statistics can then be normalized genome-wide for each
sample (`normalizePeaks()`), which is strongly recommended. More detailed discussion of the normalization process
can be found in the documentation of that function.
Both these steps are optional. We then retrieve all the peaks in each sample
which overlap the query region (`retrievePeaks()`).

Next, chromswitch transforms the peaks in the query region into a
sample-by-feature matrix using one of two strategies. In the summary strategy
(`summarizePeaks()`), we compute a set of summary statistics from the peaks 
in each sample in the query region. These can include the mean, median, and max
of the statistics associated with the peaks in the input, as well as the 
fraction of the region overlapped by peak and the number of peaks. Genome-wide normalization of the data is therefore extremely
important if choosing this strategy.

In the binary strategy (`binarizePeaks()`) we
construct a binary matrix where each feature corresponds to a unique peak in the
region, and the matrix holds the binary presence or absence calls of each
unique peak in each sample. We obtain the unique peaks by collapsing the union
of all peaks in the region observed in all samples using a parameter `p`
which specifies how much reciprocal overlap is required between two peaks to
call them the same. Since regions corresponding to the same biological event
can occasionally result in separate peaks during the process of interpretation
of raw signal, peak calling, *etc.*, we also introduce an option to combine
peaks which are separated by less than `gap` base pairs (`reducePeaks()`).

Finally, chromswitch clusters samples hierarchically
and then scores the similarity between the inferred cluster assignments and the
known biological condition labels of the samples (`cluster()`).

![](figures/flowchart.png)
*Schematic of the method outlined above.*

# Walkthrough of a basic chromswitch analysis

## The H3K4me3 dataset

The package ships with a small dataset that we will analyze to detect brain-
specific chromatin state switches. The dataset contains MACS2 narrow
peak calls for H3K4me3 in a short section of chromosome 19 for six samples,
3 adult brain tissues, and 3 other adult tissues. The peaks are available
as BED files, stored in the *extdata* folder of chromswitch, as well as in the 
object `H3K4me3`, which is a list of length 6. Each element of the list stores the peak calls as `GRanges` objects. `GRanges` objects are sets of genomic ranges,
and here, one range describes one peak. The data packaged with chromswitch is described in the manual, and the documentation can be accessed by running `??chromswitch::H3K4me3` in the console.

![](figures/igv_screenshot.png)
*Genome browser screenshot of H3K4me3 dataset included in the chromswitch package. Red boxes indicate the genes studied in the demo analysis in this section.*

## Input

Chromswitch essentially requires 3 inputs:

1. **Genomic query region(s)**: the genomic windows in which chromswitch will be applied
(indpendently to each region) to call chromatin state switches.
2. **Metadata dataframe**: specifying the sample IDs and biological conditions,
used to score the similarity between the clusters of samples inferred by
chromswitch, and the biological conditions of the samples.
3. **Epigenomic features**: for example, ChIP-seq peaks for a histone mark, DNase-seq peaks, or previously-learned chromatin state segmentations such as from ChromHMM.

The specification for the inputs and examples of how to import them are described below.

### Query regions

Chromswitch expects query regions in the form of a `GRanges` object storing one or more regions of interest. GRanges objects are containers for genomic regions, implemented
in `r Biocpkg("GenomicRanges")`. An introduction to these objects can be obtained by running `??GenomicRanges::GRanges_and_GRangesList_slides` in the console.

We will apply chromswitch to 5kbp windows surrounding chromatin state
switches in three genes on chromosome 19. Here, we will read in the query regions
from a BED file using the `import()` function from `r Biocpkg("rtracklayer")`. 

```{r regions}
# Path to BED file in chromswitch installation
query_path <- system.file("extdata/query.bed", package = "chromswitch")

# Read in BED file, creating a GRanges object
query <- rtracklayer::import(con = query_path, format = "BED")
query
```

If your query regions are stored in another tabular format, this table can be read
in using `read.delim()` and passed to the `makeGRangesFromDataFrame()` function from `r Biocpkg("GenomicRanges")`, which converts query regions stored in a dataframe with at least 3 columns, *chr*, *start*, and *end*, into a `GRanges`
object (remember the `keep.extra.columns = TRUE` argument to preserve any
additional data associated with the regions, such as gene symbols). Any metadata
columns in the `GRanges` object passed to the `query` argument will be included
in the chromswitch output (but not used for the analysis).

### Metadata

Chromswitch accepts metadata in the form of a dataframe with at least two columns: *Sample* which stores
sample IDs (these can be any strings), and *Condition*, which stores the
biological condition labels of the samples (these must be strings, with only two
possible values in the column). Additional columns are not used. 

In the code below, we read in the metadata from a TSV file.
The resulting dataframe can be passed to the `metadata` argument of any chromswitch
functions that require it.

```{r meta}
# Path to TSV in chromswitch
meta_path <- system.file("extdata/metadata.tsv", package = "chromswitch")

# Read in the table from a 2-column TSV file
metadata <- read.delim(meta_path, sep = "\t", header = TRUE)
metadata
```

### Epigenomic features

Chromswitch expects epigenomic features in the form of a named list of `GRanges` objects,
each of which stores
peaks or features for one sample, with elements named according to the sample IDs as
specified in the metadata. The peaks should be in the same order (with respect
to samples) as in the metadata.

The nature of these features are flexible: for example, they can include peak calls
from a ChIP-seq or DNase-seq experiment, or assignments of a certain state obtained
from a previously learned chromatin state segmentation such as from ChromHMM.
These features may be attached to certain metrics quantifying enrichment,
significance, or probabilities, and the format of the data
will control how we import it. Here we focus on narrow peaks for H3K4me3 ChIP-seq, but demonstrate three ways of importing data into
`GRanges` objects, each suitable for different formats.

**Option 1**: To import BED files containing peak MACS2 narow peak calls, we can use a helper
function implemented in chromswitch, which processes peaks which follow
exactly the [ENCODE narrowPeak format](https://genome.ucsc.edu/FAQ/FAQformat.html#format12).

```{r paths}
# Paths to the BED files containing peak calls for each sample
peak_paths <- system.file("extdata", paste0(metadata$Sample, ".H3K4me3.bed"),
                     package = "chromswitch")

# Import BED files containing MACS2 narrow peak calls using 
#  a helper function from chromswitch
peaks <- readNarrowPeak(paths = peak_paths,  # Paths to files,
                        metadata = metadata)

# Ensure the list is named by sample
names(peaks) <- metadata$Sample
```

**Option 2**: To read in features in other formats where the first three columns are *chr*, 
*start*, and *end*, use `r Biocpkg("rtracklayer")` and specify the identity and
type of the other columns. For example, we can read in the narrow peak calls
manually. 
The same process can be applied to BED files containing epigenomic features
other than peaks (for example, chromatin state segmentations); the `extraCols`
argument to `rtracklayer::import` should be modified to fit the data. More information about importing BED files can be obtained by
running `??rtracklayer::BEDFile` in the console to access the `rtracklayer`
documentation.

```{r read}
extra_cols <- c("signalValue" = "numeric",
                "pValue" = "numeric",
                "qValue" = "numeric",
                "peak" = "numeric")

# Obtain a list of GRanges objects containing peak calls
peaks <- lapply(peak_paths, rtracklayer::import, format = "bed",
                   extraCols = extra_cols)

# Ensure the list is named by sample
names(peaks) <- metadata$Sample
```

**Option 3**: Alternatively, if your epigenomic features are stored in another tabular format,
read in the files using `read.delim()` and convert them to `GRanges` objects
afterwards. We demonstrate on the same narrow peak calls:

```{r manual}
# Read in all files into dataframes
df <- lapply(peak_paths, read.delim, sep = "\t",
             header = FALSE,
             col.names = c("chr", "start", "end", "name", "score", 
                           "strand", "signalValue", "pValue",
                           "qValue", "peak"))

# Convert the dataframes into GRanges objects, retaining
# additional columns besides chr, start, end
peaks <- lapply(df, makeGRangesFromDataFrame, keep.extra.columns = TRUE)

# Ensure the list is named by sample
names(peaks) <- metadata$Sample
```

All three methods described above produce an identical `peaks` object.

## Applying the summary strategy to detect brain-specific switches

We will first run a basic analysis using the summary strategy
for constructing 
feature matrices, using the default features under the summary matrix
construction strategy: the number of peaks in the region, and the fraction
of the region overlapped by peaks. Note that whne the number of peaks is large
($n >> 1$), the output, particularly the heatmap, may be difficult to interpret.
All the computations described in the method are 
wrapped in one command; in later sections we'll explore running each 
step of the method (preprocessing, feature matrix construction, clustering)
individually.

Note that the column names passed to arguments in the wrappers (*e.g.*
`normalize_columns`, `summarize_columns`, *etc.*) must match exactly the column
names in the BED files.

```{r summary_basic, warning = FALSE}

out <- callSummary(query = query, # Input 1: Query regions
            metadata = metadata,  # Input 2: Metadata dataframe
            peaks = peaks,        # Input 3: Peaks
            mark = "H3K4me3")     # Arbitrary string describing the data type

out

```

We can threshold on the consensus score to subset the query regions to those
containing putative chromatin state switches:

```{r threshold}
out[out$Consensus >= 0.75, ]
```

Now, we'll construct the feature matrix by summarizing on the `qValue` and
`signalValue` statistics. This means that for each sample, the features used to
the cluster samples in the
region will be the mean, median, and maximum of these two statistics across
peaks. We will apply genome-wide normalization to the same columns we will use
in the feature matrices. Normalization is an optional step, but
strongly recommended.

Let's explore some more options for detecting chromatin state switches
with chromswitch. The options are briefly described in the comments,
but you can obtain additional explanation of arguments and explore others
not covered here by running `??chromswitch::callSummary`. Note that wherever
column names are passed to chromswitch functions, these *must* match the columns
in the peaks/features data exactly (this is case sensitive).
```{r summary2, warning = FALSE}

out2 <- callSummary(
                # Standard arguments of the function
                query = query,
                metadata = metadata,
                peaks = peaks,
                
                # Arbitrary string describing data type
                mark = "H3K4me3",

                # For quality control, filter peaks based on associated stats
                # prior to constructing feature matrices
                filter = TRUE,
                # Provide column names and thresholds to use in the same order
                filter_columns = c("qValue", "signalValue"),
                filter_thresholds = c(10, 4),

                # Normalization options
                normalize = TRUE, # Strongly recommended
                # By default, set to equal summarize_columns, below
                normalize_columns = c("qValue", "signalValue"),

                # Columns to use for for feature matrix construction
                summarize_columns = c("qValue", "signalValue"),

                # In addition to summarizing peak statistics,
                # we can also optionally compute the
                # fraction of the region overlapped by peaks
                # and the number of peaks
                fraction = TRUE,
                n = FALSE,
                
                # TRUE by default, return the optimal number
                # of clusters, otherwise require k = 2
                optimal_clusters = TRUE,
                
                # Set this to TRUE to save a PDF of the heatmap
                # for each region to the current working directory
                heatmap = FALSE,
                
                # Chromswitch uses BiocParallel as a backend for
                # parallel computations. Analysis is parallelized at the
                # level of query regions.
                BPPARAM = BiocParallel::SerialParam())

out2

```

The summary approach can be applied to epigenomic data where there
are no statistics associated with the features (peaks, states, *etc.*). In this
case, set `summarize_columns = NULL`, `filter = FALSE`, `normalize = FALSE`,
and ensure that either `n` or `fraction` (or both) are set to `TRUE`.

## Applying the binary strategy to detect brain-specific switches

The binary strategy requires approximately the same basic
input as the summary strategy. It also uses two tuning parameters:

1. `gap` which is the distance between peaks in the same sample below which
two peaks should be merged. This preprocessing step is optional, and is 
controlled by the option `reduce`
2. `p` which is the fraction of reciprocal overlap required to call two
peaks the same. This rule is used to obtain a set of unique peaks observed
across samples in the query region, and to assign binary presence or absence
of each peak in each sample to construct the feature matrix.

We use default values of `gap = 300` and `p = 0.4`, but the method is robust
to changes in these parameters within reasonable ranges.

The other option unique to this strategy is `n_features`. The number of features
in the matrix used for clustering corresponds to the number of unique peaks
that chromswitch identifies for these samples in the query region, and
this option controls whether to include an additional column recording 
the number of features in the output.

Additional parameters can be explored by running `??chromswitch::callBinary`
in the console.

```{r binary_basic, warning = FALSE}
out3 <- callBinary(
                # Standard arguments of the function
                query = query,
                metadata = metadata,
                peaks = peaks,
                
                # Logical, controls whether to
                # reduce gaps between peaks to eliminate noise
                reduce = TRUE,
                # Peaks in the same sample which are within this many bp
                # of each other will be merged
                gap = 300,
                
                # The fraction of reciprocal overlap required to define
                # two peaks as non unique, used to construct a binary ft matrix
                p = 0.4,
                
                # Include in output the number of features obtained in 
                # each query region
                n_features = TRUE)

out3
```

Again, threshold the output to obtain putative switches: 
```{r threshold2}

out3[out3$Consensus >= 0.75, ]

```

## Output

The basic output of chromswitch is a tidy dataframe which includes:

* The query regions and any additional data associated with the query (here, 
the name of the gene)
* The number of clusters inferred in the region ($k = 2$ if
`optimal_clusters = FALSE`,
otherwise, the optimal set of clusters is obtained by selecting the clusters
with the highest average Silhouette width, displayed in the next column)
* The Average Silhouette score, which measures cluster compactness and
separation, and can be interpreted as assessing the internal consistency
of the clustering
* A score of the similarity between the inferred clusters and the
biological condition labels of the samples (here, Brain and Other), labeled
as "Consensus", which is an average of the Adjusted Rand Index (ARI),
the Normalized Mutual Information (NMI), and the V measure. This is the score
we recommend to use for later filtering, and thresholding. This score takes
on values between -1 and 1, where 1 represents a perfect agreement between
cluster assignments and the biological conditions of the sample
* The cluster assignments for each sample, one sample per column


## Saving chromswitch heatmaps

If chromswitch performs hierarchical clustering on a feature matrix with more
than one feature, both wrapper functions, `callSummary()` and `callBinary()`
as well as `cluster()` can optionally
produce a heatmap with the resulting dendrogram as a PDF. Whether this 
heatmap is produced or not is passed as a logical value to the argument `heatmap`.
The title of the heatmap and prefix of the file name can be passed as a string
to `title`, while the path to the output directory can be passed to `outdir`.

# Step-by-step analysis for finer control

Chromswitch is implemented in a relatively modular way, with functions for each
step of the method. In this section, we repeat the analysis we performed in 
the previous sections, except that instead of using the wrapper functions,
`callBinary()` and `callSummary()`, we perform each step individually, and 
allocate some more discussion to options and parameters at each step.

This section leverages the modularity of chromswitch, so the pipe operator `%>%`
from `r CRANpkg("magrittr")` is helpful here.

```{r pipe}
library(magrittr)
```

We will also inspect some of the intermediate objects returned by chromswitch
functions, so we load the data manipulation package, `r CRANpkg("dplyr")`.

```{r dplyr, warning = FALSE, message = FALSE}
library(dplyr)
```

## Filter peaks
After preparing the metadata dataframe (`meta`) and importing data into GenomicRanges
objects (`H3K4me3`), we can pre-process the data. First, since the peak calls
from MACS2 are associated with a fold change of enrichment of ChIP-seq
signal, a *q* value for enrichment, *etc.*, we'll set
some thresholds on these statistics and filter out peaks which do not meet them.
The user can pass the names of any numeric columns in the data to the `columns`
argument, ensuring that a numeric threshold for each is passed to `thresholds`,
in the same order. If too few threshold values provided, they will not be
recycled; chromswitch will return an error.

```{r filter}
# Number of peaks in each sample prior to filtering
lapply(H3K4me3, length) %>% as.data.frame()

H3K4me3_filt <- filterPeaks(peaks = H3K4me3,
                            columns = c("qValue", "signalValue"),
                            thresholds = c(10, 4))

# Number of peaks in each sample after filtering
lapply(H3K4me3_filt, length) %>% as.data.frame()
```

There are some additional pre-processing steps which are specific to the approach
used to construct feature matrices from the data, and we explain these below.

## Summary approach for constructing feature matrices

### Normalization
In the summary approach, the features for each sample are a set of summary
statistics which represent the peaks observed in that sample in the query region.
It's important, therefore, that the features have comparable ranges. We
normalize each statistic genome-wide for each sample. The normalization process
essentially involves rescaling the central part of the data to the range [0, 1]
and bounding the lower and upper outliers to 0 and 1 respectively. The amount
of outliers in each tail to bound can be specified by the user in the `tail`
parameter, which expects a fraction in [0, 1] (for example, `tail = 0.005` results
in bounding the upper and lower 0.5% of the data, the default).

This normalization is optional, but is strongly recommended. The effect of *not*
normalizing is that the hierarchical clustering algorithm will be influenced
by small changes between samples, which may lead to false positive chromatin
state switch calls.

```{r normalize}
# Summary of the two statistics we will use downstream in raw data in one sample
H3K4me3_filt %>% lapply(as.data.frame) %>%
    lapply(select, signalValue, qValue) %>%
    lapply(summary) %>% 
    `[`(1)

H3K4me3_norm <- normalizePeaks(H3K4me3_filt,
                               columns = c("qValue", "signalValue"),
                               tail = 0.005)

# Summary after normalization
H3K4me3_norm %>% lapply(as.data.frame) %>%
    lapply(select, signalValue, qValue) %>%
    lapply(summary) %>% 
    `[`(1)
```

### Retrieve peaks in the query region
Next, we'll retrieve the peaks in the query region. Here we'll consider the 
5 kbp window around the transcription start site of *TTYH1*, which is a known
brain-specific gene. This returns an object of the `LocalPeaks` class.
A `LocalPeaks` object is a container for the peaks for a set
of samples in a specific genomic region of interest, as well as the genomic region
itself, and the sample IDs. These components are needed to convert sets of peaks
into rectangular feature-by-sample matrices which we can then use for downstream
analysis - and in particular, as input to a clustering algorithm in order to
call a chromatin state switch.

```{r retrieve}
# TTYH1 
ttyh1 <- query[2]
ttyh1

ttyh1_pk <- retrievePeaks(peaks = H3K4me3_norm,
                          metadata = metadata,
                          region = ttyh1)

ttyh1_pk
```


### Selecting features for the feature matrix in the summary approach
We can now construct a sample-by-feature matrix from the filtered and normalized
data in the query region.
To do so, we need to select some summary statistics to be the features
in the matrix. If there are any values associated with the peaks or data,
specifying these in the `cols` parameter to `summarizePeaks()` results in taking the mean, median, and maximum of each statistic as a separate feature. There are also two
measures which can be calculated from genomic ranges alone: the number of peaks
overlapping the query region (argument `n`), and the fraction of the region
overlapped by peaks (argument `fraction`). These parameters take logical
values specifying whether or not they should be included.

Here, we take the mean, median, and max of the *q* value and fold
change of peaks in the query region, as well as the fraction of the region
overlapped by peaks. When selecting which peak statistics to use in feature
construction, it's important to consider whether statistics are redundant,
for example, here we avoid using both the *p* and *q* values.

```{r summarizePeaks}
summary_mat <- summarizePeaks(localpeaks = ttyh1_pk,
                              mark = "H3K4me3",
                              cols = c("qValue", "signalValue"),
                              fraction = TRUE,
                              n = FALSE)

# The sample-by-feature matrix
summary_mat
```

### Cluster
Finally, we can cluster over samples using this matrix, and call a chromatin
state switch by assessing the agreement between the inferred cluster assignments
and the known biological condition labels of the samples. Since hierarchical
clustering results in a dendrogram, we choose the partition of the samples which
maximizes the average Silhouette width, which is an internal measure of cluster
goodness based on cluster compactness and separation.

```{r cluster}

cluster(ft_mat = summary_mat,
        query = ttyh1,
        metadata = metadata,
        heatmap = TRUE,
        title = "TTYH1 - summary",
        optimal_clusters = TRUE)

```

The consensus score is equal to 1, indicating a perfect agreement between the
inferred clusters and the biological groups (Brain/Other), so we can infer
a chromatin state switch around the TSS of *TTYH1*.

The `cluster` function has a `heatmap` argument which controls whether a heatmap
is produced as a PDF file in the current working directory or at a path specified by `outdir`.

![](figures/ttyh1_summary.png)

*Heatmap showing hierarchical clustering result from chromswitch applied to
TTYH1 using the summary strategy*

## Binary approach

### Reduce peaks
In the binary approach, the features correspond to unique peaks observed in the
region across samples. This can be sensitive to noisy data in the region, so
we propose an additional pre-processing step prior to this approach. When
considering peaks, often, we observe two scenarios:

1. Two peaks may be called as separate
peaks but result from the same biological event
2. Two overlapping peaks may not necessarily correspond to the same biological signal

The way chromswitch handles these is controlled by two tuning parameters. The first
is the `gap` parameter, and is used to decide when nearby
peaks should be joined and replaced by one peak. The `gap` parameter is the distance
between two peaks below which they should be joined.

![](figures/reduce.png)
*Example of a transformation on peaks to join nearby peaks which are likely due
to the same biological event, implemented in chromswitch as `reducePeaks()`.*

The `reducePeaks()` function takes a `LocalPeaks` object as input, so we can
use the peaks in the *TTYH1* window that we've already obtained.
```{r reduce}

ttyh1_pk_red <- reducePeaks(localpeaks = ttyh1_pk,
                            gap = 300)

```

### Construct feature matrix
In this approach, we convert the data in the query region into a binary representation
by modelling the presence or absence of each unique peak. In this function,
first the unique peaks are obtained by collapsing down the *union* of all peaks
supplied in the region, and then we look for each unique peak in each sample
to construct the binary matrix. This involves using the second tuning parameter,
`p`, which specifies how much reciprocal overlap two peaks must have in order
to be considered the same peak.

```{r binarizePeaks}
binary_mat <- binarizePeaks(ttyh1_pk_red,
                            p = 0.4)

# Chromswitch finds a single unique peak in the region
binary_mat
```

In terms of selecting values for these tuning parameters, our experiments indicate
that chromswitch is very robust to changes in these parameters, within reasonable
ranges. Visual inspection of the data in a genome browser can be useful for
determining exact values, and can help to control the resolution of the analysis.

### Cluster

Again, chromswitch finds good agreement between the cluster assignments and
the biological groups in our analysis.
```{r cluster2}

cluster(ft_mat = binary_mat,
        metadata = metadata,
        query = ttyh1,
        optimal_clusters = TRUE)

```

These steps can also be easily run in a pipeline using the `%>%` operator, which is convenient when using individual chromswitch functions to compose an analysis rather than the two wrapper functions.

# Bug reports and support

Bug reports and questions about usage, the method, *etc.* are welcome. Please open an issue on the
GitHub repository for the development version of chromswitch: https://github.com/sjessa/chromswitch/issues/new.

# Session Info
```{r session}
sessionInfo()
```