<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{gQTLstats: statistics for genetics of genomic features}
%\VignettePackage{gQTLstats}
-->
---
title: "gQTLstats: computationally efficient analysis and
 	interpretation of large eQTL, mQTL, etc. archives"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "Jun 2015"
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
  BiocStyle::pdf_document:
    toc: yes
    number_sections: yes
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{gQTLstats: statistics for genetics of genomic features}
-->

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Introduction

The software in this package aims to support
refinements and functional interpretation of members of
a collection of association statistics on a family of
feature $\times$ genome hypotheses.
provide a basis for refinement or functional interpretation.

We take for granted the use of the gQTL* infrastructure
for testing and management of test results.  We
use for examples elements of the `r Biocexptpkg("geuvPack")` and 
`r Biocexptpkg("geuvStore2")`
packages.

# Basic infrastructure for statistics on a distributed store of eQTL results

We work with a `ciseStore` instance based on a small subset
of transcriptome-wide cis-eQTL tests for GEUVADIS FPKM data.
The overall testing procedure was conducted for all SNP:probe
pairs for which SNP minor allele frequency (MAF) is
at least 1\% and for which the minimum distance between SNP and
either boundary of the gene coding region
for the probe is at most 1 million bp.
```{r setup,echo=FALSE}
suppressPackageStartupMessages({
library(SummarizedExperiment)
library(Homo.sapiens)
library(org.Hs.eg.db)
library(geuvStore2)
library(gQTLBase)
library(gQTLstats)
})
```{r contset}
library(geuvStore2)
library(gQTLBase)
library(gQTLstats)
library(parallel)
nco = detectCores()
library(doParallel)
registerDoSEQ()
if (.Platform$OS.type != "windows") {
  registerDoParallel(cores=max(c(1, floor(nco/2))))
}
prst = makeGeuvStore2() 
```

## Estimation of quantiles of the distribution of observed associations

Quantile estimation is very memory-efficient, based on a
temporary ff representation of the vector of
all association test results.

```{r getqs, cache=TRUE}
qassoc = storeToQuantiles(prst, field="chisq",
    probs=c(seq(0,.999,.001), 1-(c(1e-4,1e-5,1e-6))))
tail(qassoc)
```

## Estimation of histogram of the distribution of association scores under permutation

Because we compute fixed breaks, contributions to the
overall histogram can be assembled in parallel, with
small footprint.  This is a tremendous reduction of
data.
```{r gethist, cache=TRUE}
hh = storeToHist( prst, breaks= c(0,qassoc,1e9) )
tail(hh$counts)
```

## Computing FDR from a gqtlStore

FDR computation is post-hoc relative to filtering
that need not be specified prior to testing.  For
illustration, we survey the results in `r Biocexptpkg("geuvStore2")`
to obtain FDRs for each SNP:probe
pair in two forms.  First, we obtain FDR without
any filtering.  Second, we compute an
a FDR for those SNP:probe pairs separated
by at most 500kb, and for which the MAF for the SNP
is at least 5 per cent.

```{r twoFDRs, cache=TRUE}
rawFDR = storeToFDR(prst, 
   xprobs=c(seq(.05,.95,.05),.975,.990,.995,.9975,.999, 
   .9995, .9999, .99999) )
```{r makefilt}
dmfilt = function(x)  # define the filtering function
     x[ which(x$MAF >= 0.05 & x$mindist <= 500000) ] 
```{r runfilt, cache=TRUE}
filtFDR = storeToFDR(prst, 
   xprobs=c(seq(.05,.95,.05),.975,.990,.995,.9975,.999, 
   .9995, .9999, .99999), filter = dmfilt )
```
```{r lktails}
rawFDR
filtFDR
```

The filtering leads to a lower FDR for a given
strength of association.  This is an inspiration for
sensitivity analysis.
Even with 5 million observations there is an effect of histogram bin
selection in summarizing the permutation distribution of association.
This can be seen fairly clearly in the wiggliness of the trace over
the unfiltered association score:FDR plot.

```{r showfd2, plot=TRUE}
rawtab = getTab(rawFDR)
filttab = getTab(filtFDR)
 plot(rawtab[-(1:10),"assoc"], 
      -log10(rawtab[-(1:10),"fdr"]+1e-6), log="x", axes=FALSE,
  xlab="Observed association", ylab="-log10 plugin FDR")
 axis(1, at=c(seq(0,10,1),100,200))
 axis(2)
 points(filttab[-(1:10),1], -log10(filttab[-(1:10),2]+1e-6), pch=2)
 legend(1, 5, pch=c(1,2), legend=c("all loci", "MAF >= 0.05 & dist <= 500k"))
```

We'll address this below by fitting smooth functions for
the score:FDR relationship.
 
## Estimates of FDR at the probe level

The `storeToFDRByProbe` FDR function examines the maximal association score
by gene, for observed and permuted measures.  
Good performance of this procedure is obtained by using
`group_by` and `summarize` utilities of `r CRANpkg("dplyr")`.
Iteration employs `r CRANpkg("foreach")`.

```{r shobfu,cache=FALSE,eval=FALSE}
fdbp = storeToFDRByProbe( prst, xprobs=c(seq(.025,.975,.025),.99))
tail(getTab(fdbp),5)
```

```{r shobpf,cache=FALSE,eval=FALSE}
fdAtM05bp = storeToFDRByProbe( prst, filter=function(x) x[which(x$MAF > .05)],
   xprobs=c(seq(.025,.975,.025),.99))
tail(getTab(fdAtM05bp),5)
```

# Modeling the association-FDR relationship to support efficient variant selection and annotation

## Choosing a smooth model for the association:FDR relationship

We'll focus here on all-pairs analysis, with and without filtering.

Especially in this small example there will be some wiggling or
even non-monotonicity in the trace of empirical
FDR against association.  We want to be able to compute
the approximate FDR quickly and with minimal assumptions
and pathology.  To accomplish this, we will
bind an interpolating model to the
FDR estimates that we have.
Interpolation will be accomplished with scatterplot smoothing in
the `r CRANpkg("mgcv")` framework.  

The code that is used to fit the interpolating model is
```
 fdrmod = gam(-log10(fdr+fudge)~s(assoc,bs="tp"), data=...,
    subset=assoc<(1.1*maxch))
```
where fudge defaults to 1e-6 and maxch defaults to 30


```{r lkgam,fig=TRUE}
library(mgcv)
rawFDR = setFDRfunc(rawFDR)
filtFDR = setFDRfunc(filtFDR)
par(mfrow=c(2,2))
txsPlot(rawFDR)
txsPlot(filtFDR)
directPlot(rawFDR)
directPlot(filtFDR)
```

More work is needed on assessing tolerability of relative
error in FDR interpolation.

# Enumerating significant cis-eQTL in a store

Recall that `dmfilt` is a function that obtains the
SNP-probe pairs for which SNP has MAF at least five percent
and SNP-probe distance at most 500kbp.

We use the `FDRsupp` instances with `ciseStore`
to list the SNP-probe pairs with FDR lying
beneath a given upper bound.

Unfiltered pairs:
```{r doenums,cache=TRUE}
rawEnum = enumerateByFDR(prst, rawFDR, threshold=.05) 
rawEnum[order(rawEnum$chisq,decreasing=TRUE)[1:3]]
length(rawEnum)
```

A small quantity of metadata is bound into the resulting
`GRanges` instance.

```{r lkmd}
names(metadata(rawEnum))
```

Pairs meeting MAF and distance conditions are obtained with 
a `filter` setting to the enumerating function.

```{r dofenum,cache=TRUE}
filtEnum = enumerateByFDR(prst, filtFDR, threshold=.05,
   filter=dmfilt) 
filtEnum[order(filtEnum$chisq,decreasing=TRUE)[1:3]]
length(filtEnum)
```

# Sensitivity analysis for eQTL enumeration

The yield of an enumeration procedure depends on filtering
based on SNP-gene distance and SNP MAF.  This can be illustrated
as follows, with minimal computational effort owing to the
retention of genome-scale permutations and the use of the plug-in
FDR algorithm.

```{r dosens,fig=TRUE}
data(sensByProbe) # see example(senstab) for construction approach
tab = senstab( sensByProbe )
plot(tab)
```

If we wish to maximize the yield of eQTL enumeration at FDR at
most 0.05, we can apply a filter to the store.

```{r counts,cache=TRUE}
flens = storeApply( prst, function(x) {
    length(x[ which(x$MAF >= .08 & x$mindist <= 25000), ] )
})
```{r lklen}
sum(unlist(flens))
```
This is a count of gene-snp pairs satisfying structural and
genetic criteria.


# Visualizing and annotating significant loci


## Re-binding probe annotation from RangedSummarizedExperiment

In the case of `geuFPKM` there is some relevant metadata
in the `rowRanges` element.  We will bind that into the 
collection of significant findings.

```{r bindback}
library(geuvPack)
data(geuFPKM)
basic = mcols(rowRanges(geuFPKM))[, c("gene_id", "gene_status", "gene_type",
    "gene_name")]
rownames(basic) = basic$gene_id
extr = basic[ filtEnum$probeid, ]
mcols(filtEnum) = cbind(mcols(filtEnum), extr)
stopifnot(all.equal(filtEnum$probeid, filtEnum$gene_id))
filtEnum[1:3]
```
## Static visualization of FDR patterns

We have a utility to create an annotated Manhattan
plot for a search cis to a gene.
The basic ingredients are

- a `ciseStore` instance for basic location and association information
- a gene identifier that works for that store
- an `FDRsupp` instance that includes the function that maps from association scores to FDR, and the filter employed during FDR estimation
- an annotation resource; here we use ChromHMM labeling based on NA12878, in the `hmm878` GRanges instance in gQTLstats/data.

It is important to recognize that, given
an `FDRsupp` instance we can compute the FDR for any
association score, but validity of the
FDR attribution requires that we refrain
from computing it for any locus excluded by filtering.
the `manhWngr` executes the `FDRsupp`-resident filter
by default.

```{r lkscores,fig=TRUE}
data(hmm878)
library(geuvStore2)
prst = makeGeuvStore2() 
myg = "ENSG00000183814.10" # LIN9
data(filtFDR)
library(ggplot2)
manhWngr( store = prst, probeid = myg, sym="LIN9",
  fdrsupp=filtFDR, namedGR=hmm878 )
```

For a dynamic visualization procedure, see the
vjcitn/gQTLbrowse github archive.

## Basic structural variant annotation

We can use `r Biocpkg("VariantAnnotation")` to establish
basic structural characteristics for all filtered variants.

```{r dovaranno,cache=TRUE}
suppressPackageStartupMessages({
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(filtEnum) = "UCSC"
#seqinfo(filtEnum) = seqinfo(txdb)
seqlengths(filtEnum)[paste0("chr", c(1:22,"M"))] = 
 seqlengths(txdb)[paste0("chr", c(1:22,"M"))]
suppressWarnings({
allv = locateVariants(filtEnum, txdb, AllVariants()) # multiple recs per eQTL 
})
table(allv$LOCATION)
hits = findOverlaps( filtEnum, allv )
filtEex = filtEnum[ queryHits(hits) ]
mcols(filtEex) = cbind(mcols(filtEex), mcols(allv[subjectHits(hits)])[,1:7])
filtEex[1:3]
```

The resulting table is SNP:transcript specific, and will likely
need further processing. 

# Statistical modeling of phenorelevance of variant contexts, based on a cis-eQTL store

The following tasks need to be addressed in the modeling
of phenorelevance

- Definition of outcome for a variant.  In this example we consider identification of a variant as an NHGRI GWAS catalog hit.
- Definition of variant context.  In this example we use Broad Institute
ChromHMM states for NA12878, along with other information assembled in the store on MAF and distance
- Definition of the statistical model.  We consider logistic regression modeling of the probability that a variant is a GWAS hit, employing LD-pruned variants only.
- Identification of a tractable approach to fitting and evaluating the statistical model.  We'll use a test-train framework.

## Visiting variants and updating with the relevant context and outcomes

We will make a temporary reconstruction of geuvStore2
contents with the enhanced information.
 
<!-- was rebuildG -->


# Support for trans-eQTL identification

The workhorse function is AllAssoc.  The interface is
```{r lkall}
args(AllAssoc)
```
This differs from cisAssoc through the addition of a
`variantRange` argument.  

The basic operation will be as follows.  For a given
RangedSummarizedExperiment instance `summex`, all features will
be tested for association with all SNP in the `variantRange`
restriction of the VCF identified in `vcf.tf`.  The basic
iteration strategy is

a) tile the genome to obtain chunks of SNPs

b) decompose the SE into chunks of transcriptome (or other 'ome)

c) for each chunk of SNPs, for each chunk of transcriptome,
 seek associations and retain the top K in a buffering structure

Management of this buffering structure needs work.

```{r demoit}
require(GenomeInfoDb)
require(geuvPack)
require(Rsamtools)
data(geuFPKM)  # get a ranged summarized expt
lgeu = geuFPKM[ which(seqnames(geuFPKM)=="chr20"), ] # limit to chr20
seqlevelsStyle(lgeu) = "NCBI"
tf20 = TabixFile(system.file("vcf/c20exch.vcf.gz", package="gQTLstats"))
if (require(VariantAnnotation)) scanVcfHeader(tf20)
set.seed(1234)
mysr = GRanges("20", IRanges(33.099e6, 33.52e6))
lita = AllAssoc(geuFPKM[1:10,], tf20, mysr)
names(mcols(lita))
```

The trans search for this segment of chr20 proceeds by obtaining
additional association scores for additional genes.
```{r dem2}
litb = AllAssoc(geuFPKM[11:20,], tf20, mysr)
litc = AllAssoc(geuFPKM[21:30,], tf20, mysr)
```
Now we want to reduce this information by collecting the
strongest associations over the 30 genes tested.
```{r docoll}
buf = gQTLstats:::collapseToBuf(lita, litb, frag="_obs")
buf
buf = gQTLstats:::collapseToBuf(buf, litc, frag="_obs")
buf
```
Let's do the same buffering process for the first permutation.
```{r doperm}
pbuf = gQTLstats:::collapseToBuf(lita, litb, frag="_permScore_1")
pbuf = gQTLstats:::collapseToBuf(pbuf, litc, frag="_permScore_1")
pbuf
```
We can compare the distributions of maximal association per SNP
as observed or under permutation.
```{r dof,fig=TRUE}
plot(density(buf$scorebuf[,1]))
lines(density(pbuf$scorebuf[,1]), lty=2)
```