---
title: "2. PAT-Seq poly(A) tail length example"
author: "Paul Harrison"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{2. poly(A) tail length example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

poly(A) tail length of transcripts can be measured using the [PAT-Seq protocol](https://rnajournal.cshlp.org/content/21/8/1502.long). This protocol produces 3'-end focussed reads that include the poly(A) tail. We examine [GSE83162](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83162). This is a time-series experiment in which two strains of yeast are released into synchronized cell cycling and observed through two cycles. Yeast are treated with $\alpha$-factor, which causes them to stop dividing in antici... pation of a chance to mate. When the $\alpha$-factor is washed away, they resume cycling.

```{r echo=F}
knitr::opts_chunk$set(cache=TRUE, autodep=TRUE)

# To examine objects:
# devtools::load_all(".", export_all=F) ; qwraps2::lazyload_cache_dir("vignettes/V2_tail_length_cache/html")
```

# Read files, extract experimental design from sample names

```{r setup, message=F, warning=F, cache=FALSE}
library(tidyverse)     # ggplot2, etc
library(patchwork)     # side-by-side plots
library(limma)         # differential testing
library(topconfects)   # differential testing - top confident effect sizes
library(org.Sc.sgd.db) # Yeast organism info
library(weitrix)       # Matrices with precisions weights

# Produce consistent results
set.seed(12345)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
# The most reliable backed is to use serial processing
BiocParallel::register( BiocParallel::SerialParam() )
```

```{r load, message=F}
tail <- system.file("GSE83162", "tail.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("Feature") %>%
    as.matrix()

tail_count <- system.file("GSE83162", "tail_count.csv.gz", package="weitrix") %>%
    read_csv() %>%
    column_to_rownames("Feature") %>%
    as.matrix()
    
samples <- data.frame(sample=I(colnames(tail))) %>%
    extract(sample, c("strain","time"), c("(.+)-(.+)"), remove=FALSE) %>%
    mutate(
        strain=factor(strain,unique(strain)), 
        time=factor(time,unique(time)))
rownames(samples) <- colnames(tail)

samples
```

"tpre" is the cells in an unsynchronized state, other times are minutes after release into cycling.

The two strains are a wildtype and a strain with a mutated set1 gene.


# Create weitrix object

These tail lengths are each the average over many reads. We therefore weight each tail length by the number of reads. This is somewhat overoptimistic as there is biological noise that doesn't go away with more reads, which we will correct for in the next step.


```{r weitrix, message=FALSE}
good <- rowMeans(tail_count) >= 10
table(good)

wei <- as_weitrix(
    tail[good,,drop=FALSE], 
    weights=tail_count[good,,drop=FALSE])

rowData(wei)$gene <- AnnotationDbi::select(
    org.Sc.sgd.db, keys=rownames(wei), columns=c("GENENAME"))$GENENAME
rowData(wei)$total_reads <- rowSums(weitrix_weights(wei))
colData(wei) <- cbind(colData(wei), samples)
```


# Calibration

Our first step is to calibrate our weights. Our weights are overoptimistic for large numbers of reads, as there is a biological components of noise that does not go away with more reads.

Calibration requires a model explaining non-random effects. We provide a design matrix and a weighted linear model fit is found for each row. The lack of replicates makes life difficult, for simplicity here we will assume time and strain are independent.

```{r cal1}
design <- model.matrix(~ strain + time, data=colData(wei))
fit <- weitrix_components(wei, design=design)
```

A gamma GLM with log link function can then be fitted to the squared residuals. 1 over the predictions from this models will serve as the new weights.

```{r cal2}
cal <- weitrix_calibrate_all(
    wei, 
    design = fit, 
    trend_formula = ~well_knotted_spline(mu,3)+well_knotted_spline(log(weight),3))
```

For comparison, we'll also look at completely unweighted residuals.

```{r unwei}
unwei <- wei
weitrix_weights(unwei) <- weitrix_weights(unwei) > 0 
# (missing data still needs weight 0)
```

Calibration should remove any pattern in the weighted residuals, compared to known covariates. The trend line shown in red is based on the *squared* weighted residuals.

First look for any pattern relative to the linear model prediction ("mu"). A trend has been removed by the calibration.

```{r cal-fig1}
weitrix_calplot(unwei, fit, covar=mu, guides=FALSE) + labs(title="Unweighted\n") |
weitrix_calplot(wei, fit, covar=mu, guides=FALSE) + labs(title="Weighted by\nread count") |
weitrix_calplot(cal, fit, covar=mu, guides=FALSE) + labs(title="Calibrated\n")
```

Next look for any pattern relative to the number of reads (recall these were the original weights). Again, a trend has been removed by the calibration.

```{r cal-fig2}
weitrix_calplot(unwei, fit, covar=log(weitrix_weights(wei)), guides=FALSE) + labs(title="Unweighted\n") |
weitrix_calplot(wei, fit, covar=log(weitrix_weights(wei)), guides=FALSE) + labs(title="Weighted by\nread count") |
weitrix_calplot(cal, fit, covar=log(weitrix_weights(wei)), guides=FALSE) + labs(title="Calibrated\n")
```


# Testing

## Top confident effects

We are now ready to test things. 

My recommended approach is to find top confident effects, here top confident differential tail length. Core functionality is implemented in my package [topconfects](https://bioconductor.org/packages/release/bioc/html/topconfects.html). Applying this to a weitrix is done using the `weitrix_confects` function. Rather than picking "most significant" genes, it will highlight genes with a large effect size. 

```{r testconfects}
weitrix_confects(cal, design, coef="strainDeltaSet1", fdr=0.05)
```

This lists the largest confident changes in poly(A) tail length.
The `confect` column is an inner confidence bound on 
the difference in tail length,
adjusted for multiple testing.

Note that due to PCR amplification slippage and limited read length, the observed poly(A) tail lengths may be an underestimate. However as all samples have been prepared in the same way, observed differences should indicate the existence of true differences.

If you prefer to rank by signal to noise ratio, it is possible to use Cohen's f as an effect size. This is similar to ranking by p-value, but Cohen's f can be interpreted meaningfully as a signal to noise ratio.

```{r testcohen}
weitrix_confects(cal, design, coef="strainDeltaSet1", effect="cohen_f", fdr=0.05)
```


## Testing with limma

If you prefer a more traditional approach, we can feed our calibrated weitrix to [limma](https://bioconductor.org/packages/release/bioc/html/limma.html).

```{r limmadesign}
fit_cal_design <- cal %>%
    weitrix_elist() %>%
    lmFit(design)

ebayes_fit <- eBayes(fit_cal_design)

topTable(ebayes_fit, "strainDeltaSet1", n=10) %>%
    dplyr::select(
        gene,diff_tail=logFC,ave_tail=AveExpr,adj.P.Val,total_reads)
```

## Testing multiple contrasts

`weitrix_confects` can also be used as an omnibus test of multiple contrasts. Here the default effect size will be standard deviation of observations explained by the part of the model captured by the contrasts. The standardized effect size "Cohen's f" can also be used.

Here we will look for any step changes between time steps, ignoring the "tpre" timepoint. The exact way these contrasts are specified will not modify the ranking, so long as they specify the same subspace of coefficients.

```{r testf}
multiple_contrasts <- limma::makeContrasts(
    timet15m-timet0m, timet30m-timet15m, timet45m-timet30m, 
    timet60m-timet45m,  timet75m-timet60m, timet90m-timet75m, 
    timet105m-timet90m, timet120m-timet105m, 
    levels=design)

weitrix_confects(cal, design, contrasts=multiple_contrasts, fdr=0.05)
```


## Examine individual genes

Having discovered genes with differential tail length, let's look at some genes in detail.

```{r examine, fig.height=6}
view_gene <- function(id) {
    gene <- rowData(wei)[id,"gene"]
    if (is.na(gene)) gene <- ""
    tails <- weitrix_x(cal)[id,]
    std_errs <- weitrix_weights(cal)[id,] ^ -0.5
    ggplot(samples) +
        aes(x=time,color=strain,group=strain, 
            y=tails, ymin=tails-std_errs, ymax=tails+std_errs) +
        geom_errorbar(width=0.2) + 
        geom_hline(yintercept=0) + 
        geom_line() + 
        geom_point(aes(size=tail_count[id,])) +
        labs(x="Time", y="Tail length", size="Read count", title=paste(id,gene)) +
        theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
}

caption <- plot_annotation(
    caption="Error bars show +/- one standard error of measurement.")

# Top confident differences between WT and deltaSet1
view_gene("YDR170W-A") +
view_gene("YIL015W") +
view_gene("YAR009C") +
view_gene("YJR027W/YJR026W") +
caption

# Top confident changes over time
view_gene("YNL036W") +
view_gene("YBL097W") +
view_gene("YBL016W") +
view_gene("YKR093W") +
caption
```

```{r echo=F,eval=F}
# Top "significant" genes
view_gene("YDR170W-A")
view_gene("YJR027W/YJR026W")
view_gene("YIL015W")
view_gene("YIL053W")

# topconfects has highlighted some genes with lower total reads
view_gene("YCR014C")
```

# Exploratory analysis: overdispersed genes

Genes with high dispersion (weighted residual variance) may be of biological interest. Here we examine dispersion relative to a simple model with only an intercept term. When the dispersion is greater than 1, there is variation in excess of that expected from our calibrated weights. The function `weitrix_dispersions` can be used to calculate dispersions.

To make this a bit more concrete, we can estimate how much extra variation there is per observation not explained by the model. The function `weitrix_sd_confects` provides this, and further provides a confident lower bound on this excess with a multiple testing correction. The "effect" column in the output is excess variation, and the "confect" column is a confident lower bound on this. Results are sorted by confect. These effects are in the same units as the observations themselves. They can be interpreted as the standard deviation if there was no observational error.

```{r excess, warning=F}
confects <- weitrix_sd_confects(cal, ~1)
confects
```

```{r excess2, echo=FALSE}
plot(effect ~ total_reads, data=confects$table, log="x", cex=0.5, col="gray",
   ylab="confect (black) and effect (gray)")
points(confect ~ total_reads, data=confects$table, cex=0.5)

confects$table$name[1:2] %>% map(view_gene) %>% purrr::reduce(`+`)
confects$table$name[3:4] %>% map(view_gene) %>% purrr::reduce(`+`)
confects$table$name[5:6] %>% map(view_gene) %>% purrr::reduce(`+`) + caption
```

The genes discovered tend to be changing over time. If we use a model that accounts for time, differences between the two strains will be emphasized instead.

```{r excess3, warning=F}
confects2 <- weitrix_sd_confects(cal, ~time)
confects2
```

```{r excess4, echo=FALSE}
plot(effect ~ total_reads, data=confects2$table, log="x", cex=0.5, col="gray",
   ylab="confect (black) and effect (gray)")
points(confect ~ total_reads, data=confects2$table, cex=0.5)

confects2$table$name[1:2] %>% map(view_gene) %>% purrr::reduce(`+`)
confects2$table$name[3:4] %>% map(view_gene) %>% purrr::reduce(`+`)
confects2$table$name[5:6] %>% map(view_gene) %>% purrr::reduce(`+`) + caption
```


# Exploratory analysis: components of variation

The test we've performed was somewhat unsatisfactory. Due to the design of the experiment it's difficul to specify differential tests that fully interrogate this dataset: the lack of replicates, and the difficult specifying apriori how tail length will change over time.

Perhaps we should let the data speak for itself.

Perhaps this is what we should have done first!

The weitrix package allows us to look for components of variation. We'll try to explain the data with different numbers of components (from 1 to 6 components).

```{r comp, message=F}
comp_seq <- weitrix_components_seq(cal, p=6)
```

`weitrix_seq_screeplot` shows how much additional variation in the data is explained as each further component is allowed. However the ultimate decision of how many components to examine is a matter of judgement.

```{r scree}
components_seq_screeplot(comp_seq)
```

Looking at three components shows some of the major trends in this data-set.

```{r exam}
comp <- comp_seq[[3]]

matrix_long(comp$col[,-1], row_info=samples, varnames=c("sample","component")) %>%
    ggplot(aes(x=time, y=value, color=strain, group=strain)) + 
    geom_hline(yintercept=0) + 
    geom_line() + 
    geom_point(alpha=0.75, size=3) + 
    facet_grid(component ~ .) +
    labs(title="Sample scores for each component", y="Sample score", x="Time", color="Strain")
```

We observe:

* C1 - A gradual lengthening of tails after release into cell cycling. (The reason for the divergence between strains at the end is unclear.)
* C2 - Variation in poly(A) tail length with the cell cycle.
* C3 - A lengthening of poly(A) tails in the set1 mutant.

The tail lengths are approximated by `comp$row %*% t(comp$col)` where `comp$col` is an $n_\text{sample} \times (p+1)$ matrix of scores (shown above), and `comp$row` is an $n_\text{gene} \times (p+1)$ matrix of gene loadings, which we will now examine. (The $+1$ is the intercept "component", allowing each gene to have a different baseline tail length.)

**Treat these results with caution.** Confindence bounds take into account uncertainty in the loadings but not in the scores! What follows is best regarded as exploratory rather than a final result.

## Gene loadings for C1: gradual lengthing over time

```{r C1}
result_C1 <- weitrix_confects(cal, comp$col, coef="C1")
```

```{r examine_C1, echo=FALSE, fig.height=6}
result_C1$table %>% 
    dplyr::select(gene,loading=effect,confect,total_reads) %>% 
    head(10)

cat(sum(!is.na(result_C1$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")

result_C1$table$name[1:4] %>% map(view_gene) %>% purrr::reduce(`+`) + caption
```

FUS3 is involved in yeast mating. We see here a poly(A) tail signature of yeast realizing there are not actually any $\alpha$ cells around to mate with.


## Gene loadings for C2: cell-cycle associated changes

```{r C2}
result_C2 <- weitrix_confects(cal, comp$col, coef="C2")
```

```{r examine_C2, echo=FALSE, fig.height=6}
result_C2$table %>% 
    dplyr::select(gene,loading=effect,confect,total_reads) %>% 
    head(10)

cat(sum(!is.na(result_C2$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")

result_C2$table$name[1:4] %>% map(view_gene) %>% purrr::reduce(`+`) + caption
```

## Gene loadings for C3: longer tails in set1 mutant

```{r C3}
result_C3 <- weitrix_confects(cal, comp$col, coef="C3")
```

Given the mixture of signs for effects in C3, different genes are longer in different stages of the cell cycle. We see many genes to do with replication.

```{r examine_C3, echo=FALSE, fig.height=6}
result_C3$table %>% 
    dplyr::select(gene,loading=effect,confect,total_reads) %>% 
    head(10)

cat(sum(!is.na(result_C3$table$confect)), 
    "genes significantly non-zero at FDR 0.05\n")

result_C3$table$name[1:4] %>% map(view_gene) %>% purrr::reduce(`+`) + caption

#view_gene("YDR461W") #MFA1
```

# Discussion

Looking back to our initial differential testing in light of these results, a reasonable refinement would be to omit "tpre" and "t0m", considering only the samples that have settled into cell cycling.