---
title: "introduction"
author: "Tessa MacNish"
date: "2025-01-21"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

HaploVar is an R package designed to define local haplotypes, identify haplotype variants and format the output to be compatible with a wide range or GWAS and genomic selection tools. In this tutorial we will go through how to use each of the functions, so that you can use HaploVar in your GWAS or genomic selection pipelines. 

HaploVar requires a VCF file and an LD matrix to calculate local haplotypes. For this purpose we have provided VCF and LD files for you to use. The data is sourced from Wu et al.(2019) but only a small subset of the data has been used for this tutorial.

The first step is to download HaploVar.

``` {r installation, eval = FALSE}
install.packages("devtools")
devtools::install_github("TessaMacNish/HaploVar")
```

``` {r message = FALSE, warning = FALSE, eval = TRUE}
library(HaploVar)
```

## Preparing inputs

If you need a helper function to  read in your VCF or LD matrix files, we recommend crosshap (Marsh et al., 2023) which has inbuilt functions for this purpose.

In this tutorial we will use the data sets provided with the tool.

``` {r load-data, eval = TRUE, include = FALSE}
# Load data
data("vcf")
data("LD")
```

**VCF** The VCF file must contain only biallelic SNPs. We recommend filtering your VCF prior to using HaploVar. Your file should look similar to the following:

```{r load-vcf, eval = TRUE, include = TRUE}
head(vcf, c(5,10))
```

**LD** The LD matrix is a table with the dimensions m × m, where m represents the number of SNPs. The LD matrix can be made using any pairwise LD metric such as D (Robbins, 1918), D′ (Lewontin, 1964), or r2 (Hill & Roberston, 1968).Your file should look similar to the following:

```{r load-LD, eval = TRUE, include = TRUE}
head(LD, c(5,5))
```
 
## Define Haplotypes

The first function of HaploVar is define_haplotypes, which takes a vcf file and a LD matrix, defines local haplotypes and outputs a list of haplotype tables. The local haplotypes are calculated using DBSCAN (Ester et al. 1996). The parameters for define_haplotypes include the following:

**epsilon** Epsilon is a parameter of the DBSCAN clustering algorithm and it affects haplotype density. Epsilon values can range between 0 and 1 (the default is 0.6) If the epsilon value is too low you will not find any haplotypes, but if it is too high then your haplotypes may contain noise. We recommend trying a few different epsilon values with a small section of your data to identify what epsilon value works best for your data. For this tutorial we will use an epsilon value of 0.8. 

**MGmin** MGmin is another DBSCAN parameter. This parameter determines the minimum number of SNPs within a cluster for it to be defined as a haplotype. We will use the default of 30.

**hetmiss_as** Affects how missing data is handled for all instances where one allele in a genotype is missing.If hetmiss_as = "allele" the genotype is assumed to be heterozygous. If hetmiss_as = "miss" the genotype is treated as NA. We will use the default of "allele".

**keep_outliers** If FALSE (the default), keep_outliers removes SNPs that are determined to be outliers.

An example of how to use define_haplotypes is as follows:

```{r define_haplotypes, message = F, results = 'hide', cache = T}
##Run define_haplotypes
haplotype_list <- define_haplotypes(vcf, LD, epsilon = 0.8) #this produces a list of haplotype tables
haplotype_table <- haplotype_list[[1]] #this is the first haplotype tables
```

Each haplotype table is named based on the position of their first and their last SNP. The haplotype tables contain all of the SNPs within that haplotype and their genotypes. 

```{r head-haplotype-table, eval = TRUE, include = TRUE}
head(haplotype_table, c(5,5))
```

## Collate Define Haplotypes 

What if you have multiple VCF files that you need to define haplotypes for, and you need all of the results in one file? You could run define_haplotypes on each pair of VCF files and LD matrices, but then you would have several lists. collate_define_haplotypes is designed to collate the output of define_haplotypes into one list of haplotype tables. The inputs for this function are a list of VCF files and a list of LD matrices. There must be one LD matrix for each VCF file. A demonstration of how to use this function is below:

```{r collate_haplotype_list, message = F, results = 'hide', cache = T}
##Prepare the data
haplotype_list2 <- haplotype_list #We are copying the haplotype_list so that we have two lists for the demonstration 
list_outputs <- list(haplotype_list, haplotype_list2) #The input must be in list format
##Run collate_haplotype_list
collate_haplotype_list <- collate_define_haplotypes(list_outputs)
```

## Define Haplotypes Globally 

The other way to run define_haplotypes for several vcf files and collate the results is to run define_haplotypes_globally. This function will take a pair of vcf and LD matrix files and run define_haplotypes. It will then repeat this process for all vcf and LD matrix files provided. Finally, it will collate the results. This function has the same parameters as define_haplotypes except that instead of a single vcf and LD matrix file define_haplotypes_globally requires a list of vcf and LD files. define_haplotypes_globally also requires a list of epsilon values, one for each vcf file.

A demonstration of how to use this function is below:

```{r define_haplotypes_globally, message = F, results = 'hide', cache = T}
##Prepare the data
vcf2 <- vcf 
vcf_list <- list(vcf, vcf2) #The vcf files must be in list format
LD2 <- LD
LD_list <- list(LD, LD2) #The LD matrices must be in list format

##Prepare parameters
epsilon_list <- c(0.8, 0.8) #The length of this list must be the same as the number of vcf files.
```

```{r define_haplotypes_globally_2, message = F, results = 'hide', cache = T}
##Run define_haplotypes_globally
haplotype_list_global <- define_haplotypes_globally(vcf_list, LD_list, epsilon = epsilon_list)
```
## Identify Haplotypes Variants 

The other main function is haplotype_variants, which defines local haplotypes, identifies haplotype variants and formats the output to be compatible with a wide range of GWAS and genomic selection tools. The parameters for haplotype_variants are the same as the parameters for define_haplotypes, with the added parameter for output format. The other additional parameter is minFreq, which is the minimum number of individuals a haplotype variant must be present in to be considered a valid haplotype variant. A demonstration of how to use this function is below:

```{r haplotype_variants, message = F, results = 'hide', cache = T}
##Run haplotype_variants
format1 <- haplotype_variants(vcf, LD, epsilon = 0.8, format = 1)
```

For a full demonstration of the different formats and what they can be used for please refer to the tutorial available on github (https://github.com/TessaMacNish/HaploVar).

## Collate Haplotypes Variants 

collate_haplotype_variants collates output tables from haplotype_variants and collates them into one table, which you can input into the GWAS or genomic selection tool of your choice. collate_haplotype_variants has two parameters, the list of haplotype variant tables and format. The format number used for collate_haplotype_variants must be the same as the format number used when haplotype_variants was run.

```{r collate_haplotype_variants, message = F, results = 'hide', cache = T}
##format1
format1B <- format1
format1_list <- list(format1, format1B) #The input for collate_haplotype_variants must be in list format.
format1_collate <- collate_haplotype_variants(format1_list, format = 1)

```

## Identify Haplotypes Variants Globally 

haplotype_variants_global runs haplotype_variants for a list of vcf files, LD matrices and epsilon values. All other parameters are the same as in haplotype_variants.

```{r haplotype_variants_global, message = F, results = 'hide', cache = T}
format1_global <- haplotype_variants_global(vcf_list, LD_list, epsilon = epsilon_list, format = 1)
```


## References

Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. 

Hill, W. G., & Robertson, A. (1968). Linkage disequilibrium in finite populations. Theoretical and Applied Genetics, 38(6), 226–231. https://doi.org/10.1007/BF01245622

Lewontin, R. C. (1964). The interaction of selection and linkage. I. General considerations; heterotic models. Genetics (Austin), 49(1), 49–67. https://doi.org/10.1093/genetics/49.1.49

Marsh, J. I., Petereit, J., Johnston, B. A., Bayer, P. E., Tay Fernandez, C. G., Al-Mamun, H. A.,…Edwards, D. (2023). crosshap: R package for local haplotype visualization for trait association analysis. Bioinformatics (Oxford, England), 39(8). https://doi.org/10.1093/bioinformatics/btad518 

Robbins, R. B. (1918). Some applications of mathematics to breeding problems III. Genetics (Austin), 3(4), 375–389. https://doi.org/10.1093/genetics/3.4.375

Wu, D., Liang, Z., Yan, T., Xu, Y., Xuan, L., Tang, J., Zhou, G., Lohwasser, U., Hua, S., Wang, H., Chen, X., Wang, Q., Zhu, L., Maodzeka, A., Hussain, N., Li, Z., Li, X., Shamsi, I. H., Jilani, G., … Jiang, L. (2019). Whole-Genome Resequencing of a Worldwide Collection of Rapeseed Accessions Reveals the Genetic Basis of Ecotype Divergence. Molecular Plant, 12(1), 30–43. https://doi.org/10.1016/j.molp.2018.11.007