---
title: "TEKRABber"
author:
- name: Yao-Chung Chen
  affiliation: Human Biology, Institute of Biology, Freie Universität Berlin, Berlin, Germany
  email: yao-chung.chen@fu-berlin.de
- name: Katja Nowick
  affiliation: Human Biology, Institute of Biology, Freie Universität Berlin, Berlin, Germany
  email: katja.nowick@fu-berlin.de
package: TEKRABber
output:
    BiocStyle::html_document: default
vignette: >
    %\VignetteIndexEntry{TEKRABber} 
    %\VignetteEngine{knitr::rmarkdown} 
    %\VignetteEncoding{UTF-8}
---
# Introduction
`r BiocStyle::Biocpkg("TEKRABber")` is used to estimate the correlations 
between genes and transposable elements (TEs) from RNA-seq data comparing 
between: **(1)** Two Species **(2)** Control vs. Experiment. In the following 
sections, we will use built-in data to demonstrate how to implement 
`r BiocStyle::Biocpkg("TEKRABber")` on you own analysis.

# Installation
To use `r BiocStyle::Biocpkg("TEKRABber")` from your R environment, 
you need to install it using `r BiocStyle::Biocpkg("BiocManager")`:
```{r installation, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("TEKRABber")
```

```{r load package, message=FALSE}
library(TEKRABber)
library(SummarizedExperiment) # load it if you are running this tutorial
```
# Examples
## Comparing between two species, human and chimpanzee as an example
Gene and TE expression data are generated from randomly picked brain regions 
FASTQ files from 10 humans and 10 chimpanzees (Khrameeva E et al., Genome 
Research, 2020). The values for the first column of gene and TE count table 
must be **Ensembl gene ID** and **TE name**:
```{r load built-in data (two species)}
# load built-in data
data(speciesCounts)
hmGene <- speciesCounts$hmGene
hmTE <- speciesCounts$hmTE
chimpGene <- speciesCounts$chimpGene
chimpTE <- speciesCounts$chimpTE
# the first column must be Ensembl gene ID for gene, and TE name for TE
head(hmGene)
```

### Query ortholog information and estimate scaling factor
In the first step, we use `orthologScale()` to get orthology information and 
calculate the scaling factor between two species. The species name needs to be 
the abbreviation of scientific species name used in Ensembl. (Note: (1)This 
step queries information using `r BiocStyle::Biocpkg("biomaRt")` and it might 
need some time or try different mirrors due to the connections to Ensembl 
(2)It might take some time to calculate scaling factor based on 
your data size). 
```{r search species name, eval=FALSE}
# You can use the code below to search for species name
ensembl <- biomaRt::useEnsembl(biomart = "genes")
biomaRt::listDatasets(ensembl)
```

```{r orthology and normalizeation, message=FALSE}
# In order to save time, we have save the data for this tutorial.
data(fetchDataHmChimp)
fetchData <- fetchDataHmChimp

# Query the data and calculate scaling factor using orthologScale():
# fetchData <- orthologScale(
#     geneCountRef = hmGene,
#     geneCountCompare = chimpGene,
#     speciesRef = "hsapiens",
#     speciesCompare = "ptroglodytes"
# )
```

### Create inputs for differentially expressed analysis and 
correlation estimation
We use `DECorrInputs()` to return input files for downstream analysis. 
```{r create input files, warning=FALSE}
inputBundle <- DECorrInputs(
    orthologTable = fetchData$orthologTable,
    scaleFactor = fetchData$scaleFactor,
    geneCountRef = hmGene,
    geneCountCompare = chimpGene,
    teCountRef = hmTE,
    teCountCompare = chimpTE
)
```

### Differentially expressed analysis (DE analysis)
In this step, we need to generate a metadata contain species name 
(i.e., human and chimpanzee). The row names need to be same as the DE input 
table and the column name must be **species** (see the example below). Then we 
use `DEgeneTE()` to perform DE analysis. When you are comparing samples between 
two species, the parameter **expDesign** should be **TRUE** (as default).
```{r DE analysis (two species), message=FALSE, results='hide', warning=FALSE}
meta <- data.frame(
    species = c(rep("human", ncol(hmGene) - 1), 
    rep("chimpanzee", ncol(chimpGene) - 1))
)

meta$species <- factor(meta$species, levels = c("human", "chimpanzee"))
rownames(meta) <- colnames(inputBundle$geneInputDESeq2)
hmchimpDE <- DEgeneTE(
    geneTable = inputBundle$geneInputDESeq2,
    teTable = inputBundle$teInputDESeq2,
    metadata = meta,
    expDesign = TRUE
)
```

### Correlation analysis
Here we use `corrOrthologTE()` to perform correlation estimation comparing 
each ortholog and TE. This is the most time-consuming step if you have large 
data. For a quick demonstration, we use a relatively small data. You can 
specify the correlation method and adjusted _p-value_ method. The default 
methods are Pearson's correlation and FDR. **Note: ** For more efficient and 
specific analysis, you can subset your data in this step to focus on only the 
orthologs and TEs that you are interested in. 
```{r correlation (two species), warning=FALSE}
# load built-in data
data(speciesCorr)
hmGeneCorrInput <- assay_tekcorrset(speciesCorr, "gene", "human")
hmTECorrInput <- assay_tekcorrset(speciesCorr, "te", "human")
chimpGeneCorrInput <- assay_tekcorrset(speciesCorr, "gene", "chimpanzee")
chimpTECorrInput <- assay_tekcorrset(speciesCorr, "te", "chimpanzee")

hmCorrResult <- corrOrthologTE(
    geneInput = hmGeneCorrInput,
    teInput = hmTECorrInput,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

chimpCorrResult <- corrOrthologTE(
    geneInput = chimpGeneCorrInput,
    teInput = chimpTECorrInput,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

head(hmCorrResult)
```

### Explore your result using `appTEKRABber()`:
`r BiocStyle::Biocpkg("TEKRABber")` provides an app function for you to 
quickly view your result. First, you will need to assign the differentially 
expressed orthologs/TEs results, correlation results and metadata as global
variables: `appDE`, `appRef`, `appCompare` and `appMeta`. See the following
example. 
```{r app visualize (two species), warning=FALSE, eval=FALSE}
#create global variables for app-use
appDE <- hmchimpDE
appRef <- hmCorrResult
appCompare <- chimpCorrResult
appMeta <- meta # this is the same one in DE analysis

appTEKRABber()
```
In the **Expression tab** page, **(1)** you can specify your input gene and
TE. The result will show in box plots with data points in normalized log2 
expression level **(2)** DE analysis result will show in table including 
statistical information **(3)** Correlation result will indicate if these 
selected pairs are significantly correlated and the value of correlation 
coefficients.
![](app.jpg)
In the **Correlation tab** page (above figure), you can select your data in 
scatter plots in three ways. **(1)** Specify the data point and it will turn 
red in the distribution of your results **(2)** Show the distribution of all 
the data from your correlation results based on their correlation 
coefficients and adjusted _p-value_. The blue vertical dashed line indicates 
the boundary 
of adjusted _p-value_ is 0.05, and the orange one is for adjusted _p-value_ 
0.01 **(3)** you can click the data points which you are interested in, and 
it will be listed in the table. You can also drag a certain area to show 
data points in it. 

## Comparing control and treatment samples within the same species
If you want to compare selected genes and TEs **(1)** from different tissue 
in same species or **(2)** control and drug treatment in same tissue in same 
species, please generate all the input files following the input format. 
Here we show an example data of prepared input files including expression 
counts from 10 control and 10 treatment samples. The format of input data: 
row names should be gene name or id, and column name is your sample id 
(please see details below).
```{r load built-in data (same species)}
# load built-in data
data(ctInputDE)
geneInputDE <- ctInputDE$gene
teInputDE <- ctInputDE$te

# you need to follow the input format as below
head(geneInputDE)
```

### DE analysis
For DE analysis in the same species, you also use `DEgeneTE()` function, 
however, you need to set the parameter **expDesign** to **FALSE**. You also 
need to provide a metadata which this time the column name must be 
**experiment**. See demonstration below:
```{r DE analysis (same species), warning=FALSE, results='hide', message=FALSE}
metaExp <- data.frame(experiment = c(rep("control", 5), rep("treatment", 5)))
rownames(metaExp) <- colnames(geneInputDE)
metaExp$experiment <- factor(
    metaExp$experiment, 
    levels = c("control", "treatment")
)

resultDE <- DEgeneTE(
    geneTable = geneInputDE,
    teTable = teInputDE,
    metadata = metaExp,
    expDesign = FALSE
)
```

### Correlation analysis
For a quick demonstration to perform correlation of genes and TEs in control 
and treatment sample, we use relatively small input tables which only 
include 10 genes and 10 TEs.
```{r load built-in data (same species correlation), warning=FALSE}
# load built-in data
data(ctCorr)
geneConCorrInput <- assay_tekcorrset(ctCorr, "gene", "control")
teConCorrInput <- assay_tekcorrset(ctCorr, "te", "control")
geneTreatCorrInput <- assay_tekcorrset(ctCorr, "gene", "treatment")
teTreatCorrInput <- assay_tekcorrset(ctCorr, "te", "treatment")

# you need to follow the input format as below
head(geneConCorrInput)
```

```{r correlation (same species), warning=FALSE}
controlCorr <- corrOrthologTE(
    geneInput = geneConCorrInput,
    teInput = teConCorrInput,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

treatmentCorr <- corrOrthologTE(
    geneInput = geneTreatCorrInput,
    teInput = teTreatCorrInput,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

head(treatmentCorr)
```

### Explore your result using `appTEKRABber()`:
```{r app visualize (same species), warning=FALSE, eval=FALSE}
appDE <- resultDE
appRef <- controlCorr
appCompare <- treatmentCorr
appMeta <- metaExp

appTEKRABber()
```

```{r}
sessionInfo()
```