---
title: "Determine population ancestry from DNAm arrays"
author: "Sean K. Maden"
date: "`r format(Sys.time(), '%d %B, %Y')`"
package: recountmethylation
bibliography: bibliography.bib
output:
  html_document:
    df_print: paged
    toc: yes
    toc_depth: '3'
  BiocStyle::pdf_document:
    toc: yes
    toc_depth: 3
  BiocStyle::html_document:
    code_folding: show
    toc: yes
    tocfloat: yes
vignette: >
  %\VignetteIndexEntry{Determine population ancestry from DNAm arrays} 
  %\VignetteDepends{RCurl} 
  %\usepackage[UTF-8]{inputenc}  
  %\VignetteEncoding{UTF-8} 
  %\VignetteEngine{knitr::rmarkdown}
---

```{r chunk_settings, eval = T, echo = F}
knitr::opts_chunk$set(eval = FALSE, echo = TRUE, warning = FALSE, message = FALSE)
```

```{r setup, eval = T, echo = F}
# get the system load paths
dpath <- system.file("extdata", "glint_files", 
                     package = "recountmethylation")  # local path to example data
res1.fname <- "glint_results_tutorialdata.epistructure.pcs.txt" 
res1.fpath <- file.path(dpath, res1.fname)
res1 <- read.table(res1.fpath, sep = "\t")            # read in example dataset #1
res2.fpath <- file.path(dpath, "glint_results_minfidata.epistructure.pcs.txt")
res2 <- read.table(res2.fpath, sep = "\t")            # read in example dataset #2
```

This notebook describes how to obtain the `GLINT` software suite for DNAm analysis, and how to run `GLINT` with the `EPISTRUCTURE` method for inferring population genetic ancestry/descent from HM450K DNAm array probes. These command line tools are called using a conda virtual environment managed from an R session with the `basilisk` Bioconductor package. Code in this notebook should work for Mac and Linux-like environments. Consult @rahmani_genome-wide_2017 for more details about the `EPISTRUCTURE` method.

# Obtain the `GLINT` software

First, obtain the latest version of `GLINT` online by downloading the [source](https://github.com/cozygene/glint/releases) from GitHub.

Also ensure the `basilisk` Bioconductor/R package is installed. We'll use this to conveniently 
manage conda virtual environments to run `GLINT` from an R session.

```{r}
BiocManager::install("basilisk")
library(basilisk)
```

# Virtual environment setup

Next, set up a virtual environment using conda. This step is crucial for reproducible research, as it enables control over which software versions to use in a session. This enables running software that is older or no longer maintained, a fairly common task in computational sciences.

Using `basilisk`, set up a Python 2 environment with seven dependencies (`numpy`, `pandas`, `scipy`, `scikit-learn`, `matplotlib`, `statsmodels`, and `cvxopt`) for which we specify the version using the form "packagename==version" (e.g. "numpy==1.15").

```{r}
env.name <- "glint_env"          # name the new virtual environment
pkg.name <- "recountmethylation" # set env properties
pkgv <- c("python==2.7",         # python version (v2.7)
          "numpy==1.15",         # numpy version (v1.15)
          "pandas==0.22",        # pandas version (v0.22)
          "scipy==1.2",          # scipy version (v1.2)
          "scikit-learn==0.19",  # scikit-learn (v0.19)
          "matplotlib==2.2",     # matplotlib (v2.2)
          "statsmodels==0.9",    # statsmodels (v0.9)
          "cvxopt==1.2")         # cvxopt (v1.2)
glint_env <- BasiliskEnvironment(envname = env.name, pkgname = pkg.name, 
                                 packages = pkgv)
proc <- basiliskStart(glint_env) # define run process
on.exit(basiliskStop(proc))      # define exit process
```

This makes a `BasiliskEnvironment` object, `glint_env`, with a starting process called `proc` and a session end process specified with `on.exit()`. 

# Process example DNAm array data

This section shows how to run `GLINT` on an example dataset, with the `EPISTRUCTURE` method enabled. It includes info about the required data formats and shows how to adjust for covariates.

To run `GLINT` on DNAm array stored as a `SummarizedExperiment` object, first access the test HM450K `MethylSet` from the `minfiData` package.

```{r}
library(minfiData)
ms <- get(data("MsetEx")) # load example MethylSet
```

Now load the explanatory CpG probe vector for `EPISTRUCTURE`. This small subset of 4,913 HM450K array probes was found by @rahmani_genome-wide_2017 to be strongly correlated with SNPs informing population ancestry and genetic structure. Access them from `recountmethylation` as follows.

```{r}
dpath <- system.file("extdata", "glint_files", 
                     package = "recountmethylation") # get the dir path
cgv.fpath <- file.path(dpath, "glint_epistructure_explanatory-cpgs.rda")
glint.cgv <- get(load(cgv.fpath)) # load explanatory probes
```

Subset `ms`, a `MethylSet`, to include just the 4,913 explanatory probes. This will save considerable disk space and memory for processing very large datasets.

```{r}
mf <- ms[rownames(ms) %in% glint.cgv,] # filter ms on explanatory probes
dim(mf)                                # mf dimensions: [1] 4913    6
```

Next, identify desired model covariates from sample metadata, then convert to numeric/float format (required by `GLINT`). Specify the variables "age" and "sex" corresponding to columns in the file `covariates_minfidata.txt`.

```{r}
# get covar -- all vars should be numeric/float
covar <- as.data.frame(colData(mf)[,c("age", "sex")]) # get sample metadata
covar[,"sex"] <- ifelse(covar[,"sex"] == "M", 1, 0)   # relabel sex for glint
# write covariates matrix
covar.fpath <- file.path("covariates_minfidata.txt")  # specify covariate table path
# write table colnames, values
write.table(covar, file = covar.fpath, sep = "\t", row.names = T, 
            col.names = T, append = F, quote = F)     # write covariates table
```

Now calculate the DNAm fractoins or "Beta-values". Impute any missing values with row medians, and write the final file to `bval_minfidata.txt`.

```{r}
bval.fpath <- file.path("bval_minfidata.txt")     # specify dnam fractions table name
mbval <- t(apply(as.matrix(getBeta(mf)),1,function(ri){
  ri[is.na(ri)] <- median(ri,na.rm=T)             # impute na's with row medians
  return(round(ri, 4))
})); rownames(mbval) <- rownames(mf)              # assign probe ids to row names
write.table(mbval, file = bval.fpath, sep = sepsym, 
            row.names = T, col.names = T, append = F, 
            quote = F)                            # write dnam fractions table
```

Next, set the system commands to make command line calls from R. Define these manually as strings to be passed to the `system()` function, specifying the paths to the new `minfiData` example files.

```{r}
glint.dpath <- "glint-1.0.4"                         # path to the main glint app dir
glint.pypath <- file.path(glint.dpath, "glint.py")   # path to the glint .py script
data.fpath <- file.path("bval_minfidata.txt")        # path to the DNAm data table
covar.fpath <- file.path("covariates_minfidata.txt") # path to the metadata table
out.fstr <- file.path("glint_results_minfidata")     # base string for ouput results files
covarv <- c("age", "sex")                            # vector of valid covariates
covar.str <- paste0(covarv, collapse = " ")          # format the covariates vector
cmd.str <- paste0(c("python", glint.pypath, 
                    "--datafile", data.fpath, 
                    "--covarfile", covar.fpath, 
                    "--covar", covar.str, 
                    "--epi", "--out", out.fstr), 
                  collapse = " ")                    # get the final command line call
```

The commands stored as `cmd.str` include the path to the latest `GLINT` version, 
`glint.path`, the paths to the `datafile.txt` and `covariates.txt` tutorial files, 
the variable names `age` and `gender` which are our covariates of interest and correspond to column names in `covariates.txt`. We also used the `--epi` flag to ensure the `EPISTRUCTURE` method is run.

Now run `GLINT` with `basiliskRun()`. This should relay system outputs back to our console, which are included as comments in the below code chunk.

```{r}
basiliskRun(proc, function(cmd.str){system(cmd.str)}, cmd.str = cmd.str) # run glint
# this returns:
# INFO      >>> python glint-1.0.4/glint.py --datafile bval_minfidata.txt --covarfile covariates_minfidata.txt --covar age sex --epi --out glint_results_minfidata
# INFO      Starting GLINT...
# INFO      Validating arguments...
# INFO      Loading file bval_minfidata.txt...
# INFO      Checking for missing values in the data file...
# INFO      Validating covariates file...
# INFO      Loading file covariates_minfidata.txt...
# INFO      New covariates were found: age, sex.
# INFO      Running EPISTRUCTURE...
# INFO      Removing non-informative sites...
# INFO      Including sites...
# INFO      Include sites: 4913 CpGs from the reference list of 4913 CpGs will be included...
# WARNING   Found no sites to exclude.
# INFO      Using covariates age, sex.
# INFO      Regressing out covariates...
# INFO      Running PCA...
# INFO      The first 1 PCs were saved to glint_results_minfidata.epistructure.pcs.txt.
# INFO      Added covariates epi1.
# Validating all dependencies are installed...
# All dependencies are installed
# [1] 0
```

Since we declared `--out glint_results_minfidata`, results files are saved with the 
beginning string "glint_results_minfidata" appended. Logs were saved to the file with the `*.glint.log` extension, while data were saved to the file with the `*.epistructure.pcs.txt` extension. 

Now inspect the output results data file `glint_results_minfidata.epistructure.pcs.txt`.

```{r}
out.fpath <- paste0(out.fpath, ".epistructure.pcs.txt")
res2 <- read.table(out.fpath, sep = "\t")
```
```{r, eval = T}
colnames(res2) <- c("sample", "epistructure.pc")
dim(res2)
res2
```

The first results column reflects sample IDs from the columns in `bval_minfidata.txt`. Remaining columns show the `EPISTRUCTURE` population components. While just one population component calculated in this example, experiment datasets may generate outputs with more than one population component and thus several component columns.

# Further reading 

For more details about `GLINT`, see the software [documentation](https://glint-epigenetics.readthedocs.io/en/latest/) and GitHub [repo](https://github.com/cozygene/glint). Additional [tutorial files](https://github.com/cozygene/glint/releases) are also available. 

Consult @rahmani_genome-wide_2017 for more details about the `EPISTRUCTURE` method, including the discovery of explanatory CpG probes associated with population structure SNPs. 

For more details about setting up virtual environments from R, consult the `basilisk` package [documentation](https://www.bioconductor.org/packages/release/bioc/html/basilisk.html).

# Session Info

```{r, eval = T}
utils::sessionInfo()
```

# Works Cited