Michael Love
June 8, 2015
(from PH525x notes and Barry, Nobel and Wright, 2008)
Consider the average t-statistic from \( N \) genes in a set \( G \):
\[ \bar{t}=\frac{1}{N} \sum_{i \in G} t_i \]
This statistic \( \bar{t} \) combines the information about DE from the set and might be a useful test statistic.
Under the null hypothesis, the \( t \) have mean 0. If the \( t \) are independent then \( \sqrt{N} \bar{t} \) has standard deviation 1 and is approximately normal:
\[ \sqrt{N} \bar{t} \sim N(0,1) \]
This comes from the following decomposition of the variance:
\[ \begin{eqnarray} \text{Var}(\bar{t}) &=& \frac{1}{N^2} \text{Var}(t_1 + \dots + t_N) \\ &=& \frac{1}{N^2}( \text{Var}(t_1) + \dots + \text{Var}(t_N) ) \\ &=& \frac{1}{N} \end{eqnarray} \]
Now consider the case that the test statistics \( t_i \) in a gene set are not independent but have correlation \( \rho \) under the null hypothesis.
\[ \bar{t}=\frac{1}{N} \sum_{i \in G} t_i \]
\[ \text{corr}(t_i, t_{i'}) = \rho, \quad i, i' \in G \]
The variance of the average t-statistics will be:
\[ \begin{eqnarray} \text{Var}(\bar{t}) &=& \frac{1}{N^2} \text{Var}( (1 \dots 1) (t_1 \dots t_N)' ) \\ &=& \frac{1}{N^2}(1 \dots 1) \begin{pmatrix} 1 & \rho & \dots & \rho & \rho \\ \rho & 1 & \rho & \dots & \rho \\ \dots & \dots & \dots & \dots & \dots \\ \rho & \rho & \dots & \rho & 1 \\ \end{pmatrix} (1 \dots 1) ' \\ &=& \frac{1}{N^2}\{N + (N-1) N \rho \} \\ &=& \frac{1}{N}\{1 + (N-1) \rho \} \end{eqnarray} \]
So the variance inflation factor (VIF) comparing the independent case to the case with correlation is:
\[ VIF = 1 + (N-1) \bar{\rho} \]
So the increased width (standard deviation) of the null distribution for a gene set with 20 genes and average correlation 0.1 will be:
sqrt(1 + 19 * 0.1)
[1] 1.702939
This VIF is approximately true also for testing the set statistics against the complement: the genes not in the set (see Barry, Nobel and Wright 2008).
Here, the expression of 5 samples vs 5 samples, no true difference but a correlation of gene expression.
If the test statistic \( T \) is a linear form of the data \( X \) (e.g. log fold change), then:
\[ \rho^T_{i,i'} = \rho^X_{i,i'} \]
For t-test, the relationship is monotone, approximately linear and:
\[ \rho^T_{i,i'} \approx \rho^X_{i,i'} \]
(Barry, Nobel and Wright, 2008)
28% of the simulated t-statistics are outside of the center 99% of the \( N(0,1) \) distribution.
Where do expression correlations under the null come from? My guesses in order of importance:
for Correlation Adjusted MEan RAnk gene set test, available in the limma package.
Assume the null: no differences across condition, although gene-gene correlation are present
Assume the null: no differences across condition, although gene-gene correlation are present
Has limitations:
Suppose we have a 2 condition experiment with 2 batches:
Remove design matrix columns not involving the null hypothesis:
The ROAST method available in the limma package:
Under the null hypothesis (and assuming a linear model) the residuals are independent and identically distributed \( N(0, \sigma_g^2) \).
We can rotate the residual vector for each gene in a gene set, such that gene-gene expression correlations are preserved.
Like this diagram but around an n-sphere (n, the number of samples).
Repeat 10,000 times:
Lastly compare the original gene set statistic to the null distribution from 1-3.
Pros: fast and efficient, fits with any linear model