--- title: "TEKRABber" author: - name: Yao-Chung Chen affiliation: Human Biology, Institute of Biology, Freie Universität Berlin, Berlin, Germany email: yao-chung.chen@fu-berlin.de - name: Katja Nowick affiliation: Human Biology, Institute of Biology, Freie Universität Berlin, Berlin, Germany email: katja.nowick@fu-berlin.de package: TEKRABber output: BiocStyle::html_document: default vignette: > %\VignetteIndexEntry{TEKRABber} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction `r BiocStyle::Biocpkg("TEKRABber")` is used to estimate the correlations between genes and transposable elements (TEs) from RNA-seq data comparing between: **(1)** Two Species **(2)** Control vs. Experiment. In the following sections, we will use built-in data to demonstrate how to implement `r BiocStyle::Biocpkg("TEKRABber")` on you own analysis. # Installation To use `r BiocStyle::Biocpkg("TEKRABber")` from your R environment, you need to install it using `r BiocStyle::Biocpkg("BiocManager")`: ```{r installation, eval=FALSE} install.packages("BiocManager") BiocManager::install("TEKRABber") ``` ```{r load package, message=FALSE} library(TEKRABber) library(SummarizedExperiment) # load it if you are running this tutorial ``` # Examples ## Comparing between two species, human and chimpanzee as an example Gene and TE expression data are generated from randomly picked brain regions FASTQ files from 10 humans and 10 chimpanzees (Khrameeva E et al., Genome Research, 2020). The values for the first column of gene and TE count table must be **Ensembl gene ID** and **TE name**: ```{r load built-in data (two species)} # load built-in data data(speciesCounts) hmGene <- speciesCounts$hmGene hmTE <- speciesCounts$hmTE chimpGene <- speciesCounts$chimpGene chimpTE <- speciesCounts$chimpTE # the first column must be Ensembl gene ID for gene, and TE name for TE head(hmGene) ``` ### Query ortholog information and estimate scaling factor In the first step, we use `orthologScale()` to get orthology information and calculate the scaling factor between two species. The species name needs to be the abbreviation of scientific species name used in Ensembl. (Note: (1)This step queries information using `r BiocStyle::Biocpkg("biomaRt")` and it might need some time or try different mirrors due to the connections to Ensembl (2)It might take some time to calculate scaling factor based on your data size). ```{r search species name, eval=FALSE} # You can use the code below to search for species name ensembl <- biomaRt::useEnsembl(biomart = "genes") biomaRt::listDatasets(ensembl) ``` ```{r orthology and normalizeation, message=FALSE} # In order to save time, we have save the data for this tutorial. data(fetchDataHmChimp) fetchData <- fetchDataHmChimp # Query the data and calculate scaling factor using orthologScale(): # fetchData <- orthologScale( # geneCountRef = hmGene, # geneCountCompare = chimpGene, # speciesRef = "hsapiens", # speciesCompare = "ptroglodytes" # ) ``` ### Create inputs for differentially expressed analysis and correlation estimation We use `DECorrInputs()` to return input files for downstream analysis. ```{r create input files, warning=FALSE} inputBundle <- DECorrInputs( orthologTable = fetchData$orthologTable, scaleFactor = fetchData$scaleFactor, geneCountRef = hmGene, geneCountCompare = chimpGene, teCountRef = hmTE, teCountCompare = chimpTE ) ``` ### Differentially expressed analysis (DE analysis) In this step, we need to generate a metadata contain species name (i.e., human and chimpanzee). The row names need to be same as the DE input table and the column name must be **species** (see the example below). Then we use `DEgeneTE()` to perform DE analysis. When you are comparing samples between two species, the parameter **expDesign** should be **TRUE** (as default). ```{r DE analysis (two species), message=FALSE, results='hide', warning=FALSE} meta <- data.frame( species = c(rep("human", ncol(hmGene) - 1), rep("chimpanzee", ncol(chimpGene) - 1)) ) meta$species <- factor(meta$species, levels = c("human", "chimpanzee")) rownames(meta) <- colnames(inputBundle$geneInputDESeq2) hmchimpDE <- DEgeneTE( geneTable = inputBundle$geneInputDESeq2, teTable = inputBundle$teInputDESeq2, metadata = meta, expDesign = TRUE ) ``` ### Correlation analysis Here we use `corrOrthologTE()` to perform correlation estimation comparing each ortholog and TE. This is the most time-consuming step if you have large data. For a quick demonstration, we use a relatively small data. You can specify the correlation method and adjusted _p-value_ method. The default methods are Pearson's correlation and FDR. **Note: ** For more efficient and specific analysis, you can subset your data in this step to focus on only the orthologs and TEs that you are interested in. ```{r correlation (two species), warning=FALSE} # load built-in data data(speciesCorr) hmGeneCorrInput <- assay_tekcorrset(speciesCorr, "gene", "human") hmTECorrInput <- assay_tekcorrset(speciesCorr, "te", "human") chimpGeneCorrInput <- assay_tekcorrset(speciesCorr, "gene", "chimpanzee") chimpTECorrInput <- assay_tekcorrset(speciesCorr, "te", "chimpanzee") hmCorrResult <- corrOrthologTE( geneInput = hmGeneCorrInput, teInput = hmTECorrInput, corrMethod = "pearson", padjMethod = "fdr" ) chimpCorrResult <- corrOrthologTE( geneInput = chimpGeneCorrInput, teInput = chimpTECorrInput, corrMethod = "pearson", padjMethod = "fdr" ) head(hmCorrResult) ``` ### Explore your result using `appTEKRABber()`: `r BiocStyle::Biocpkg("TEKRABber")` provides an app function for you to quickly view your result. First, you will need to assign the differentially expressed orthologs/TEs results, correlation results and metadata as global variables: `appDE`, `appRef`, `appCompare` and `appMeta`. See the following example. ```{r app visualize (two species), warning=FALSE, eval=FALSE} #create global variables for app-use appDE <- hmchimpDE appRef <- hmCorrResult appCompare <- chimpCorrResult appMeta <- meta # this is the same one in DE analysis appTEKRABber() ``` In the **Expression tab** page, **(1)** you can specify your input gene and TE. The result will show in box plots with data points in normalized log2 expression level **(2)** DE analysis result will show in table including statistical information **(3)** Correlation result will indicate if these selected pairs are significantly correlated and the value of correlation coefficients. ![](app.jpg) In the **Correlation tab** page (above figure), you can select your data in scatter plots in three ways. **(1)** Specify the data point and it will turn red in the distribution of your results **(2)** Show the distribution of all the data from your correlation results based on their correlation coefficients and adjusted _p-value_. The blue vertical dashed line indicates the boundary of adjusted _p-value_ is 0.05, and the orange one is for adjusted _p-value_ 0.01 **(3)** you can click the data points which you are interested in, and it will be listed in the table. You can also drag a certain area to show data points in it. ## Comparing control and treatment samples within the same species If you want to compare selected genes and TEs **(1)** from different tissue in same species or **(2)** control and drug treatment in same tissue in same species, please generate all the input files following the input format. Here we show an example data of prepared input files including expression counts from 10 control and 10 treatment samples. The format of input data: row names should be gene name or id, and column name is your sample id (please see details below). ```{r load built-in data (same species)} # load built-in data data(ctInputDE) geneInputDE <- ctInputDE$gene teInputDE <- ctInputDE$te # you need to follow the input format as below head(geneInputDE) ``` ### DE analysis For DE analysis in the same species, you also use `DEgeneTE()` function, however, you need to set the parameter **expDesign** to **FALSE**. You also need to provide a metadata which this time the column name must be **experiment**. See demonstration below: ```{r DE analysis (same species), warning=FALSE, results='hide', message=FALSE} metaExp <- data.frame(experiment = c(rep("control", 5), rep("treatment", 5))) rownames(metaExp) <- colnames(geneInputDE) metaExp$experiment <- factor( metaExp$experiment, levels = c("control", "treatment") ) resultDE <- DEgeneTE( geneTable = geneInputDE, teTable = teInputDE, metadata = metaExp, expDesign = FALSE ) ``` ### Correlation analysis For a quick demonstration to perform correlation of genes and TEs in control and treatment sample, we use relatively small input tables which only include 10 genes and 10 TEs. ```{r load built-in data (same species correlation), warning=FALSE} # load built-in data data(ctCorr) geneConCorrInput <- assay_tekcorrset(ctCorr, "gene", "control") teConCorrInput <- assay_tekcorrset(ctCorr, "te", "control") geneTreatCorrInput <- assay_tekcorrset(ctCorr, "gene", "treatment") teTreatCorrInput <- assay_tekcorrset(ctCorr, "te", "treatment") # you need to follow the input format as below head(geneConCorrInput) ``` ```{r correlation (same species), warning=FALSE} controlCorr <- corrOrthologTE( geneInput = geneConCorrInput, teInput = teConCorrInput, corrMethod = "pearson", padjMethod = "fdr" ) treatmentCorr <- corrOrthologTE( geneInput = geneTreatCorrInput, teInput = teTreatCorrInput, corrMethod = "pearson", padjMethod = "fdr" ) head(treatmentCorr) ``` ### Explore your result using `appTEKRABber()`: ```{r app visualize (same species), warning=FALSE, eval=FALSE} appDE <- resultDE appRef <- controlCorr appCompare <- treatmentCorr appMeta <- metaExp appTEKRABber() ``` ```{r} sessionInfo() ```