--- title: "Spike-in Normalization" package: BRGenomics output: BiocStyle::html_document: toc: true toc_float: true BiocStyle::pdf_document: toc: true vignette: | %\VignetteIndexEntry{Spike-in Normalization} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- BRGenomics includes useful utilities for spike-in normalization. A typical approach is to add the spike-in (either exogenous cells or synthetic oligonucleotides) before library preparation, and to subsequently map to a __combined genome__ containing both the target organisms chromosomes (to map the experimental reads) as well as sequences/chromosomes for the spike-in. This so-called "competitive alignment" results in the creation of BAM files containing a mix of chromosomes, for which it should be straightforward to identify the spike-in chromosomes. # Counting Spike-in Reads ```{r, message = FALSE} library(BRGenomics) ``` ```{r, echo = FALSE} gr1_rep1 <- GRanges(seqnames = c("chr1", "chr2", "spikechr1", "spikechr2"), ranges = IRanges(start = 1:4, width = 1), strand = "+") gr2_rep2 <- gr2_rep1 <- gr1_rep2 <- gr1_rep1 # set readcounts score(gr1_rep1) <- c(1, 1, 1, 1) # 2 exp + 2 spike = 4 total score(gr2_rep1) <- c(2, 2, 1, 1) # 4 exp + 2 spike = 6 total score(gr1_rep2) <- c(1, 1, 2, 1) # 2 exp + 3 spike = 5 total score(gr2_rep2) <- c(4, 4, 2, 2) # 8 exp + 4 spike = 12 total grl <- list(gr1_rep1, gr2_rep1, gr1_rep2, gr2_rep2) names(grl) <- c("gr1_rep1", "gr2_rep1", "gr1_rep2", "gr2_rep2") ``` For this section, we'll use a list of 4 dummy datasets containing normal, as well as spike-in chromosomes. Consider the first 2 datasets: ```{r} grl[1:2] ``` We can identify the spike-in chromosomes either by full names, or by a regular expression that matches the spike-in chromosomes. In this case, we named our spike-in chromosomes to contain the string "spike" which makes them easy to identify. To count the reads for each dataset: ```{r} getSpikeInCounts(grl, si_pattern = "spike", ncores = 1) ``` # Filtering Spike-in Reads We can also remove the spike-in reads from our data: ```{r} removeSpikeInReads(grl[1:2], si_pattern = "spike", ncores = 1) ``` And if we wanted to isolate the spike-in reads, there is an analogous `getSpikeInReads()` function. # The Spike-in Normalization Factor There are several methods by which to generate spike-in normalization factors, but we advocate for a particular method, which generates units we call Spike-in normalized Reads Per Million mapped reads in the negative Control (SRPMC). The SRPMC normalization factor for a given sample $i$ is defined as such: $$SRPMC:\ NF_i = \frac{\sum reads_{spikein, control}}{\sum reads_{spikein, i}} \cdot \frac{10^6}{\sum reads_{experimental, control}}$$ This expression effectively calculates Reads Per Million (RPM) normalization for the negative control, and all other samples $i$ are scaled into equivalent units according to the ratio of their spike-in reads. We provide a more explicit derivation below. ## Derivation of SRPMC The fundamental concept of spike-in normalization is that the ratio of experimental reads to spike-in reads can be used to correct for global changes in starting material. Let's call this ratio Reads Per Spike-in read (RPS): $$RPS = \frac{\sum reads_{experimental}}{\sum reads_{spikein}}$$ In isolation, this number only reflects the relative amounts of spike-in material recovered and mapped. The meaningful information about changes in material can only arise from making direct comparisons between samples. For any sample, $i$, we can calculate the global change in signal as a proportion of the material recovered from a negative control: $$RelativeSignal_i = \frac{RPS_i}{RPS_{control}}$$ The usual purpose of spike-in normalization is to measure a biological difference in total material (e.g. RNA) between samples, and the above ratio is a direct measurement of this. To generate normalization factors, we use the above ratio to adjust RPM (Reads Per Million mapped reads) normalization factors, which we define below for clarity: $$RPM:\ NF_i = \frac{1}{\frac{\sum{reads_i}}{10^6}} = \frac{10^6}{\sum{reads_i}}$$ (Unless indicated, $reads$ refers to non-spike-in reads). RPM normalization (i.e. read depth normalization) is the simplest and likely most familiar form of normalization. For a basal (unperturbed) negative control, RPM should produce the most portable metric of signal, given that we intend for the negative control to demonstrate typical physiology, and we hope that this state is reproducible. We therefore want to have our normalized signal in unit terms of RPM in the negative control. To accomplish this, we multiply the ratio of spike-in normalized reads between the sample $i$ and the negative control to the RPM normalization factor for sample $i$. This converts readcounts into units we summarize as Spike-in normalized Reads Per Million mapped reads in the negative Control (SRPMC): $$SRPMC:\ NF_i = \frac{RPS_i}{RPS_{control}} \cdot \frac{10^6}{\sum{reads_i}}$$ Again, SRPMC results in the negative control being RPM (sequencing depth) normalized, while all other samples are in equivalent, directly comparable units. And we've effectively determined the relative scaling of those samples based on the ratios of spike-in reads. This becomes more apparent if we substitute the $RPS$ variables above. $\sum{reads_i}$ cancels, and simplifying the fraction produces the original formula: $$SRPMC:\ NF_i = \frac{\sum reads_{spikein, control}}{\sum reads_{spikein, i}} \cdot \frac{10^6}{\sum reads_{experimental, control}}$$ # Calculating Normalization Factors We can calculate SRPMC normalization factors for each sample using the `getSpikeInNFs()` function, using the same syntax we used to count the spike-in reads. However, we have to also identify the negative control, which is the sample that will have the "reference" (RPM) normalization. We do this either using a regular expression (`ctrl_pattern` argument) or by supplying the name(s) of the negative controls (the `ctrl_names` argument). The default method is "SRPMC", but there are other options, as well. ```{r} getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1) ``` (The NFs are high because our dummy data contain only a few reads). By default, normalization factors utilize __batch normalization__, such that in any replicate (identified by the characters following "rep" in the sample names), the negative control is RPM normalized, and the other conditions are normalized to the within-replicate negative control (see the documentation for further details). Currently, batch normalization requires the sample names end with strings matching the format `"_rep1"`, `"_rep2"`, etc. If sample names do not conform to this pattern, you can rename them by using the `sample_names` argument. # Normalizing Data We can also use the `spikeInNormGRanges()` function to simultaneously find the spike-in reads, calculate the spike-in normalization factors, filter out spike-in reads, and normalize the readcounts: ```{r} spikeInNormGRanges(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1) ``` # Normalization by Sub-sampling ## Rationale When viewing genomics data in a genome browser (or otherwise plotting signal for a single gene), the sparsity of basepair resolution data can challenge our visual perception. Consider two datasets from identical samples, but where one is sequenced to a higher depth. The two datasets can be normalized such that the signal counts are equivalent, but the dataset with higher sequencing depth will have also uncovered additional sites. When plotted in a genome browser, the total signal within a region may be the same, but the more highly sequenced dataset will cover more positions but have lower peaks, while the less sequenced dataset will look sparse and spikey in comparison. Below, we compare PRO-seq data derived from the same dataset over the same gene. In one case, we randomly sample half of the reads over that gene, while in another, we divide all the readcounts by 2. ```{r} data("PROseq") data("txs_dm6_chr4") ``` ```{r} # choose a single gene gene_i <- txs_dm6_chr4[185] reads.gene_i <- subsetByOverlaps(PROseq, gene_i) # sample half the raw reads set.seed(11) sreads.gene_i <- subsampleGRanges(reads.gene_i, prop = 0.5, ncores = 1) # downscale raw reads by a factor of 2 score(reads.gene_i) <- 0.5 * score(reads.gene_i) ``` ```{r, collapse = TRUE} sum(score(reads.gene_i)) sum(score(sreads.gene_i)) ``` ```{r, results = "hold"} plot(x = 1:width(gene_i), y = getCountsByPositions(sreads.gene_i, gene_i), type = "h", ylim = c(0, 20), main = "PRO-seq (down-sampled)", xlab = "Distance from TSS", ylab = "Down-sampled PRO-seq Reads") plot(x = 1:width(gene_i), y = getCountsByPositions(reads.gene_i, gene_i), type = "h", ylim = c(0, 20), main = "PRO-seq (down-scaled)", xlab = "Distance from TSS", ylab = "Down-scaled PRO-seq Reads") ``` The two plots above come from the same data, and contain the same quantity of signal, but their profiles are notably distinct. Particularly when plotting many samples over large regions within a genome browser, differences caused by sequencing depth can be misleading. It can be challenging to estimate differences if some datasets are "tall and spikey" while others are "short and smooth". Absent global changes in signal, the above scenario can be resolved beforehand by equal sequencing depths, or by down-sampling to match readcounts. However, matching raw readcounts is not a solution when significant biological changes in total signal should be accounted for. For instance, consider an example in which there is a true, two-fold biological difference in transcription between two samples. If we could avoid all technical artifacts and measure the transcription directly in each individual cell, we would expect to uncover half the number of transcribing complexes in the lower condition. Having equivalent sequencing depth across those two conditions is effectively a technical artifact, and down-scaling the signal by multiplication can cause the visual challenges observed above. ## Sub-sampling for Normalization To address the above concerns, we've included a function `subsampleBySpikeIn()` to randomly sample reads to match the normalized signal proportions between datasets. Internally, the function uses the `getSpikeInNFs()` function, but instead of SRPMC normalization, using the option `method = "SNR"`, which calculates normalization factors that downscale each dataset to match the dataset with the least spike-in reads. From this, the number of "desired reads" is established for each dataset, and subsequently that number of reads is randomly sampled. ```{r} removeSpikeInReads(grl, si_pattern = "spike", ncores = 1) getSpikeInNFs(grl, si_pattern = "spike", method = "SNR", batch_norm = FALSE, ncores = 1) subsampleBySpikeIn(grl, si_pattern = "spike", batch_norm = FALSE, ncores = 1) ``` Normalization by sub-sampling sacrifices information to reduce biases across datasets.