---
title: "enhancerHomologSearch Guide"
author: "Jianhong Ou"
bibliography: bibliography.bib
csl: nature.csl
vignette: >
  %\VignetteIndexEntry{enhancerHomologSearch Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  html_document:
    theme: simplex
    toc: true
    toc_float: true
    toc_depth: 4
---


```{r, echo=FALSE, results="hide", warning=FALSE, message=FALSE}
suppressPackageStartupMessages({
  library(enhancerHomologSearch)
  library(BSgenome.Drerio.UCSC.danRer10)
  library(BSgenome.Hsapiens.UCSC.hg38)
  library(BSgenome.Mmusculus.UCSC.mm10)
  library(TxDb.Hsapiens.UCSC.hg38.knownGene)
  library(org.Hs.eg.db)
  library(TxDb.Mmusculus.UCSC.mm10.knownGene)
  library(org.Mm.eg.db)
  library(utils)
  library(MotifDb)
  library(motifmatchr)
})
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```

## Introduction

There is an increasing requirement for the tools to 
identify of putative mammalian orthologs to enhancers in species other 
than human and mouse, such as zebrafish, which is lacking whole genome 
comparison analysis data. Take zebrafish as an example, there are two major
methods to identify the orthologs to enhancers in human and mouse, 

1. use the whole genome comparison analysis data and conservation data[@howe2013zebrafish],

2. use spotted gar genome as bridge genome to search the orthologs[@braasch2016spotted].

Both methods will work well in the coding region. 
However, there is lacking comparative data in distal regulation region 
such as enhancers and silencers. 

In 2020, Emily S. Wong et. al. provides a new
method for identification of putative human orthologs to enhancers 
of zebrafish[@wong2020deep]. They used the method to interrogate conserved 
syntenic regions and human and mouse using candidate sponge enhancer sequences. 
First, they looked for overlap with available functional genomics
information. For example, they used mouse ENCODE data to infer enhancer activity
based on histone marks in specific tissues.
Second, they select the best-aligned region by whole genome alignment from the
candidates regions for human and mouse as orthologs.
This method provides the possibility to search orthologs for enhancers or 
silencers even there is not genome comparative data available.

This package is modified from Wong's methods and provide the easy-to-use 
script for researchers to quick search putative mammalian orthologs to enhancers.
The modified algorithm is:
The candidate regions were determined by ENCODE histone marks (default is 
H3K4me1) in specific tissue for human and mouse.
The mapping score were calculated by pairwise Transcription Factors Binding
Pattern Similarity (TFBPS) between enhancer sequences 
and candidates by fast motif match[@schep2021package].
The Z-score were calculated from
mapping score and then converted to P-value based on two-side test from a 
normal distribution.
The candidates were filtered by p-value and distance from the TSS of 
target homologs. 
And then the top candidates from human and mouse were aligned to each other and
exported as multiple alignments with given enhancer.

## Installation

First install `enhancerHomologSearch` and other packages required to run the 
examples. Please note the example dataset used here is from zebrafish. 
To run analysis with dataset from a different species or different assembly,
please install the corresponding Bsgenome and TxDb. 
For example, to analyze cattle data aligned to bosTau9,
please install BSgenome.Btaurus.UCSC.bosTau9,
and TxDb.Btaurus.UCSC.bosTau9.refGene.
You can also generate a TxDb object by functions makeTxDbFromGFF from a local
gff file, or makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromEnsembl,
from online resources in GenomicFeatures package.

```{r, installation,eval=FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
     install.packages("BiocManager")
library(BiocManager)
BiocManager::install(c("enhancerHomologSearch",
                       "BiocParallel",
                       "BSgenome.Drerio.UCSC.danRer10",
                       "BSgenome.Hsapiens.UCSC.hg38",
                       "BSgenome.Mmusculus.UCSC.mm10",
                       "TxDb.Hsapiens.UCSC.hg38.knownGene",
                       "TxDb.Mmusculus.UCSC.mm10.knownGene",
                       "org.Hs.eg.db",
                       "org.Mm.eg.db",
                       "MotifDb",
                       "motifmatchr"))
```

If you have trouble in install enhancerHomologSearch, 
please check your R version first. 
The `enhancerHomologSearch` package require R >= 4.1.0.

```{r}
R.version
```

## Step 1, prepare target enhancer sequences.
In this example, we will use an enhancer of `lepb` gene in zebrafish.
```{r}
# load genome sequences
library(BSgenome.Drerio.UCSC.danRer10)
# define the enhancer genomic coordinates
LEN <- GRanges("chr4", IRanges(19050041, 19051709))
# extract the sequences as Biostrings::DNAStringSet object
(seqEN <- getSeq(BSgenome.Drerio.UCSC.danRer10, LEN))
```

## Step 2, download candidate regions of enhancers from ENCODE by H3K4me1 marks
By default, the hisone marker is H3K4me1. Users can also define the markers 
by `markers` parameter in the function `getENCODEdata`. To make sure the markers
are tissue specific, we can filter the data by `biosample_name` and 
`biosample_type` parameters. 
For additional filters, please refer `?getENCODEdata`.

```{r}
# load library
library(enhancerHomologSearch)
library(BSgenome.Hsapiens.UCSC.hg38)
library(BSgenome.Mmusculus.UCSC.mm10)
# download enhancer candidates for human heart tissue
hs <- getENCODEdata(genome=Hsapiens,
                    partialMatch=c(biosample_summary = "heart"))
# download enhancer candidates for mouse heart tissue
mm <- getENCODEdata(genome=Mmusculus,
                    partialMatch=c(biosample_summary = "heart"))
```

## Step 3, get alignment score for target enhancer and candidate enhancers.
This step is time consuming step. For quick run, users can subset the data
by given genomic coordinates.
```{r}
# subset the data for test run 
# in human, the homolog LEP gene is located at chromosome 7
# In this test run, we will only use upstream 1M and downstream 1M of homolog
# gene
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
eid <- mget("LEP", org.Hs.egALIAS2EG)[[1]]
g_hs <- select(TxDb.Hsapiens.UCSC.hg38.knownGene,
               keys=eid,
               columns=c("GENEID", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND"),
               keytype="GENEID")
g_hs <- range(with(g_hs, GRanges(TXCHROM, IRanges(TXSTART, TXEND))))
expandGR <- function(x, ext){
  stopifnot(length(x)==1)
  start(x) <- max(1, start(x)-ext)
  end(x) <- end(x)+ext
  GenomicRanges::trim(x)
}
hs <- subsetByOverlaps(hs, expandGR(g_hs, ext=1000000))
# in mouse, the homolog Lep gene is located at chromosome 6
# Here we use the subset of 1M upstream and downstream of homolog gene.
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
eid <- mget("Lep", org.Mm.egALIAS2EG)[[1]]
g_mm <- select(TxDb.Mmusculus.UCSC.mm10.knownGene,
               keys=eid,
               columns=c("GENEID", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND"),
               keytype="GENEID")
g_mm <- range(with(g_mm,
                   GRanges(TXCHROM,
                           IRanges(TXSTART, TXEND),
                           strand=TXSTRAND)))
g_mm <- g_mm[seqnames(g_mm) %in% "chr6" & strand(g_mm) %in% "+"]
mm <- subsetByOverlaps(mm, expandGR(g_mm, ext=1000000))

# search the binding pattern
data(motifs)
## In the package, there are 10 sets of motif cluster sets.
## In this example, we use motif clusters merged by distance 60, which 
## is calculated by matalgin (motifStack implementation)
PWMs <- motifs[["dist60"]]
aln_hs <- searchTFBPS(seqEN, hs, PWMs = PWMs,
                      queryGenome = Drerio)
aln_mm <- searchTFBPS(seqEN, mm, PWMs = PWMs,
                      queryGenome = Drerio)
## if you want to stick to sequence similarity search, try to use ?alignmentOne
```

## Step 4, filter the candidate regions.

Here we will filter the candidate regions more than 5K from TSS of homolog 
but within 100K from the gene body. The candidates will be also filtered by 
p-value.

```{r}
# Step4
ext <- 100000
aln_hs <- subsetByOverlaps(aln_hs, ranges = expandGR(g_hs, ext=ext))
## filter by distance
distance(aln_hs) <- distance(peaks(aln_hs), g_hs, ignore.strand=TRUE)
aln_hs <- subset(aln_hs, pval<0.1 & distance >5000)
aln_hs

aln_mm <- subsetByOverlaps(aln_mm, ranges = expandGR(g_mm, ext=ext))
## filter by distance
distance(aln_mm) <- distance(peaks(aln_mm), g_mm, ignore.strand=TRUE)
aln_mm <- subset(aln_mm, pval<0.1 & distance >5000)
aln_mm
```

## Step 5, export the multiple alignments in order.

The selected candidates will be aligned cross human and mouse and then output
as phylip multiple alignment file in text format.

```{r}
al <- alignment(seqEN, list(human=aln_hs, mouse=aln_mm),
                method="ClustalW", order="input")
al
library(MotifDb)
motifs <- query(MotifDb, "JASPAR_CORE")
consensus <- sapply(motifs, consensusString)
consensus <- DNAStringSet(gsub("\\?", "N", consensus))
tmpfolder <- tempdir()
saveAlignments(al, output_folder = tmpfolder, motifConsensus=consensus)
readLines(file.path(tmpfolder, "aln1.phylip.txt"))
```


## Step 6, check the conserved motifs in the orthologs

```{r}
conservedMotifs(al[[1]], aln_hs, aln_mm, PWMs, Drerio)
```
## Session info

```{r}
sessionInfo()
```