---
title: "RNAAgeCalc: A multi-tissue transcriptional age calculator"
author: "Xu Ren and Pei Fen Kuan"
date: "`r Sys.Date()`"
output: 
    rmarkdown::html_document:
        highlight: pygments
        toc: true
bibliography: reference.bib
vignette: >
    %\VignetteIndexEntry{RNAAgeCalc}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE}
knitr::opts_chunk$set(comment = "", message=FALSE, warning = FALSE)
```

## Installation

To install and load RNAAgeCalc

```{r eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RNAAgeCalc")
```

```{r}
library(RNAAgeCalc)
```

## Introduction

It has been shown that both DNA methylation and RNA transcription are linked 
to chronological age and age related diseases. Several estimators have 
been developed to predict human aging from DNA level and RNA level. Most of the 
human transcriptional age predictor are based on microarray data and limited 
to only a few tissues. To date, transcriptional studies on aging using 
RNASeq data from different human tissues is limited. The aim of this package 
is to provide a tool for across-tissue and tissue-specific transcriptional age 
calculation based on Genotype-Tissue Expression (GTEx) RNASeq data 
[@lonsdale2013genotype].

## Description of RNASeq age calculator

We utilized the GTEx data to construct our across-tissue and tissue-specific 
transcriptional age calculator. GTEx is a public available genetic database 
for studying tissue specific gene expression and regulation. GTEx V6 release 
contains gene expression data at gene, exon, and transcript level of 9,662 
samples from 30 different tissues. To avoid the influence of tumor on gene 
expression, the 102 tumor samples from GTEx V6 release are dropped and 
the remaining 9,560 samples were used in the subsequent analysis. To 
facilitate integrated analysis and direct comparison of multiple datasets, 
we utilized recount2 [@collado2017reproducible] version of GTEx data, where 
all samples were processed with the same analytical pipeline. FPKM values 
were calculated for each individual sample using `getRPKM` function in 
Bioconductor package 
[recount](http://bioconductor.org/packages/release/bioc/html/recount.html).

For the tissue-specific RNASeq age calculator, elastic net 
[@zou2005regularization] algorithm was used to train the predictors for each 
individual tissue. Chronological age was response variable whereas logarithm 
transformed FPKM of genes were predictors. The across-tissue calculator was 
constructed by first performing differential expression analysis on the 
RNASeq counts data for each individual tissue. To identify genes consistently 
differentially expressed across tissues, we adapted the binomial test 
discussed in de Magalhaes et al. [@de2009meta] to find the genes with the 
largest number of age-related signals. A detailed explanation can be found 
in our paper.

The package is implemented as follows. For each tissue, signature and 
sample type (see below for the descriptions), we pre-trained the calculator 
using elastic net based on the GTEx samples. We saved the pre-trained model 
coefficients as internal data in the package. The package takes gene 
expression data as input and then match the input genes to the genes in the 
internal data. This matching process is automatic so that the users just 
need to provide gene expression data without having to pull out the 
internal coefficients.

## Usage of RNASeq age calculator

The main functions to calculate RNASeq age are `predict_age` and 
`predict_age_fromse`. Users can use either of them. `predict_age` function 
takes data frame as input whereas `predict_age_fromse` function takes 
[SummarizedExperiment](https://bioconductor.org/packages/SummarizedExperiment) 
as input. Both functions work in the same way internally. In this section, 
we explain the arguments in `predict_age` and `predict_age_fromse` 
respectively. 

### Options in predict_age function

#### exprdata

`exprdata` is a matrix or data frame which contains gene expression data
with each row represents a gene and each column represents a sample. Users are 
expected to use the argument `exprtype` to specify raw count or FPKM 
(see below). The rownames of `exprdata` should be gene ids and colnames 
of `exprdata` should be sample ids. Here is an example of FPKM expression data:

```{r}
data(fpkmExample)
head(fpkm)
```

#### tissue
`tissue` is a string indicates which tissue the gene expression data is
obtained from. Users are expected to provide one of the following tissues.
If the tissue argument is not provide or the provided tissue is not in this 
list, the age predictor trained on all tissues will be used to calculate 
RNA age.

* adipose_tissue    
* adrenal_gland    
* blood    
* blood_vessel    
* brain    
* breast    
* colon    
* esophagus    
* heart    
* liver    
* lung    
* muscle    
* nerve    
* ovary    
* pancreas    
* pituitary     
* prostate    
* salivary_gland    
* skin    
* small_intestine     
* spleen      
* stomach        
* testis       
* thyroid       
* uterus       
* vagina       

#### exprtype
`exprtype` is either "counts" or "FPKM". If `exprtype` is counts, the 
expression data will be converted to FPKM by `count2FPKM` automatically and 
the calculator will be applied on FPKM data. When calculating FPKM, by default 
gene length is obtained from the package's internal database. The internal 
gene length information was obtained from recount2. However, users are able 
to provide their own gene length information by using `genelength` argument 
(see below).

#### idtype
`idtype` is a string which indicates the gene id type in `exprdata`. Default 
is "SYMBOL". The following id types are supported.  

* SYMBOL    
* ENSEMBL    
* ENTREZID    
* REFSEQ    

#### stype
`stype` is a string which specifies which version of pre-trained calculators 
to be used. Two versions are provided. If `stype="all"`, the calculator 
trained on samples from all races (American Indian/Alaska Native, Asian, 
Black/African American, and Caucasian) will be used. If `stype="Caucasian"`, 
the calculator trained on Caucasian samples only will be used. We found that 
RNA Age signatures could be different in different races (see our paper for 
details). Thus we provide both the universal calculator and race specific 
calculator. The race specific calculator for American Indian/Alaska Native, 
Asian, or Black/African American are not provided due to the small sample 
size in GTEx data.

#### signature
`signature` is a string which indicate the age signature to use when 
calculating RNA age. This argument is not required. 

In the case that this argument is not provided, if `tissue` argument is also
provided and the tissue is in the list above, the tissue specific age
signature given by our DESeq2 analysis result on GTEx data will be used. 
Otherwise, the across tissue signature "GTExAge" will be used. 

In the case that this argument is provided, it should be one of the following 
signatures. 

* DESeq2. DESeq2 signature was obtained by performing differential expression 
analysis on each tissue and select the top differential expressed genes.
* Pearson. Pearson signature represents the genes highly correlated with 
chronological age by Pearson correlation.    
* Dev. Dev signature contains genes with large variation in expression across 
samples. We adapted the gene selection strategy discussed in 
[@fleischer2018predicting], which is a gene must have at least a $t_1$-fold 
difference in expression between any two samples in the training set and at 
least one sample have expression level > $t_2$ FPKM to be included in the 
prediction models. $t_1$ and $t_2$ (typically 5 or 10) are thresholds to 
control the degree of deviance of the genes. We used $t_1$ = $t_2$ = 10 for 
most tissues. For some tissues with large sample size, in order to maximize 
the prediction accuracy while maintaining low computation cost, we increased 
$t_1$ and $t_2$ such that the number of genes retained in the model is 
between 2,000 and 7,000.    
* deMagalhaes. deMagalhaes signature contains the 73 age-related genes by 
[@de2009meta].    
* GenAge. GenAge signature contains the 307 age-related genes in the Ageing 
Gene Database [@tacutu2017human].    
* GTExAge. GTExAge signature represents the genes consistently 
differentially expressed across tissues discussed in our paper.   
* Peters. Peters signature contains the 1,497 genes differentially expressed 
with age discussed in [@peters2015transcriptional].    
* all. "all" represents all the genes used when constructing the RNAAge 
calculator. 

If the genes in `exprdata` do not cover all the genes in the signature, 
imputation will be made automatically by the `impute.knn` function in 
Bioconductor package 
[impute](https://www.bioconductor.org/packages/release/bioc/html/impute.html).

#### genelength
`genelength` is a vector of gene length in bp. The size of `genelength` should 
be equal to the number of rows in `exprdata`. This argument is optional. 
When using `exprtype = "FPKM"`, `genelength` argument is ignored. When using
`exprtype = "counts"`, the raw count will be converted to FPKM. If `genelength` 
is provided, the function will convert raw count to FPKM based on the 
user-supplied gene length. Otherwise, gene length is obtained from the 
internal database.

#### chronage
`chronage` is a data frame which contains the chronological age of each
sample. This argument is optional. 

If provided, it should be a dataframe with 1st column sample id and 2nd column 
chronological age. The sample order in `chronage` doesn't have to be in the 
same order as in `exprdata`. However, the samples in `chronage` and `exprdata` 
should be the same. If some samples' chronological age are not available, 
users are expected to set the chronological age in `chronage` to NA. If 
`chronage` contains more than 2 columns, only the first 2 columns will be 
considered. If more than 30 samples' chronological age are available, age 
acceleration residual will be calculated. Age acceleration residual is 
defined as the residual of linear regression with RNASeq age as dependent 
variable and chronological age as independent variable.

If this argument is not provided, the age acceleration residual will not be
calculated. 


#### Example
```{r message = TRUE}
chronage = data.frame(sampleid = colnames(fpkm), age = c(30,50))
res = predict_age(exprdata = fpkm, tissue = "brain", exprtype = "FPKM", 
                  chronage = chronage)
head(res)
```

In the above example, we calculated the RNASeq age for 2 samples based on 
their gene expression data coming from brain. Since the sample size is small, 
age acceleration residual are not calculated.    
Here is an example with sample size > 30:

```{r}
# This example is just for illustration purpose. It does not represent any 
# real data. 
# construct a large gene expression data
fpkm_large = cbind(fpkm, fpkm+1, fpkm+2, fpkm+3)   
fpkm_large = cbind(fpkm_large, fpkm_large, fpkm_large, fpkm_large)
colnames(fpkm_large) = paste0("sample",1:32)
# construct the samples' chronological age
chronage2 = data.frame(sampleid = colnames(fpkm_large), age = 31:62)
res2 = predict_age(exprdata = fpkm_large, exprtype = "FPKM",
                  chronage = chronage2)
head(res2)
```


### Options in predict_age_fromse function

The main difference between `predict_age_fromse` and `predict_age` is that 
`predict_age_fromse` takes SummarizedExperiment as input. The `se` argument
is a SummarizedExperiment object. 

* The `assays(se)` should contain gene expression data. The name of 
`assays(se)` should be either "FPKM" or "counts". Use `exprtype` argument to 
specify the type of gene expression data provided. 
* Users are able to provide the chronological age of samples using 
`colData(se)`. This is optional. If provided, the column name for chronological 
age in `colData(se)` should be "age". If some samples' chronological age are 
not available, users are expected to set the chronological age in `colData(se)` 
to NA. If chronological age is not provided, the age acceleration residual will 
not be calculated. 
* Users are able to provide their own gene length using `rowData(se)`. This is 
also optional. If using `exprtype = "FPKM"`, the provided gene length will be 
ignored. If provided, the column name for gene length in `rowData(se)` should 
be "bp_length". The function will convert raw count to FPKM by the 
user-supplied gene length. Otherwise, gene length is obtained from the internal 
database. 
* The other arguments `tissue`, `exprtype`, `idtype`, `stype`, `signature` are
exactly the same as described in `predict_age` function.

#### Example
```{r message = FALSE}
library(SummarizedExperiment)
colData = data.frame(age = c(40, 50))
se = SummarizedExperiment(assays=list(FPKM=fpkm), colData=colData)
res3 = predict_age_fromse(se = se, exprtype = "FPKM")
head(res3)
```

## Visualization

We suggest visualizing the results by plotting RNAAge vs chronological age. 
This can be done by calling `makeplot` function and passing in the data frame 
returned by `predict_age` function. 
```{r}
makeplot(res2)
```


## Session info

```{r sessionInfo}
sessionInfo()
```


## References