---
title: "Accurate Inference of Genetic Ancestry from Cancer-derived Sequences"
author: Pascal Belleau, Astrid Deschênes and Alexander Krasnitz
output:
BiocStyle::html_document:
number_sections: no
toc: true
pkgdown:
number_sections: no
as_is: true
urlcolor: darkred
linkcolor: darkred
bibliography: aicsBiblio.bibtex
vignette: >
%\VignetteIndexEntry{Accurate Inference of Genetic Ancestry from Cancer-derived Sequences}
%\VignettePackage{RAIDS}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo=FALSE, results='hide', warning=FALSE, message=FALSE}
BiocStyle::markdown()
suppressPackageStartupMessages({
library(knitr)
library(RAIDS)
})
set.seed(121444)
```
**Package**: `r Rpackage("RAIDS")`
**Authors**: `r packageDescription("RAIDS")[["Author"]]`
**Version**: `r packageDescription("RAIDS")$Version`
**Compiled date**: `r Sys.Date()`
**License**: `r packageDescription("RAIDS")[["License"]]`
# Licensing
The `r Githubpkg("KrasnitzLab/RAIDS")` package and the underlying
`r Githubpkg("KrasnitzLab/RAIDS")` code are distributed under
the https://opensource.org/licenses/Apache-2.0 license. You are free to use and
redistribute this software.
# Citing
If you use the **RAIDS** package for a publication, we would ask you to cite
the following:
> Pascal Belleau, Astrid Deschênes, Nyasha Chambwe, David A. Tuveson, Alexander Krasnitz; Genetic Ancestry Inference from Cancer-Derived Molecular Data across Genomic and Transcriptomic Platforms. Cancer Res 1 January 2023; 83 (1): 49–58. https://doi.org/10.1158/0008-5472.CAN-22-0682
# Introduction
Multiple methods have been implemented to infer ancestry from germline DNA
sequence [@Price2006; @Pritchard2000; @Alexander2009]. However, genotyping of
DNA from matched normal specimens is not part of standard clinical practice
and is not performed routinely outside academic clinical centers.
In sum, matched germline DNA sequence is often missing for cancer-derived
molecular data. In such cases, having the possibility to infer ancestry
from tumor-derived data would be beneficial.
The **RAIDS** package implements an inference procedure that has been
specifically developed to accurately infer genetic ancestry from
cancer-derived sequences. The current version can handle cancer-derived
sequences of:
* tumor exomes
* targeted gene panels
* RNA
The **RAIDS** package implements a data synthesis method that, for any given
cancer-derived sequence profile, enables
on the one hand, profile-specific inference
parameter optimization and on the other hand, a profile-specific inference
accuracy estimate.
# Installation
To install this package
from [Bioconductor](https://bioconductor.org/packages/RAIDS), start R
(version 4.3 or later) and enter:
```{r installDemo01, eval=FALSE, warning=FALSE, message=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("RAIDS")
```
# Main Steps
This is an overview of the genetic ancestry inference from cancer-derived
molecular data:
```{r graphMainSteps, echo=FALSE, fig.align="center", fig.cap="An overview of the genetic ancestry inference process.", out.width='130%', results='asis', warning=FALSE, message=FALSE}
knitr::include_graphics("MainSteps_v05.png")
```
The main steps are:
**Step 1.** Set-up and provide population reference files
**Step 2** Sample the reference data for donor genotypes to be used for synthesis and optimize ancestry inference parameters
**Step 3** Infer ancestry for the subjects of the external study
**Step 4** Present and interpret the results
These steps are described in detail in the following.
## Step 1. Set-up and provide population reference files
### 1.1 Create a directory structure
First, a specific directory structure should be created. The structure must
correspond to this:
```
#############################################################################
## Working directory structure
#############################################################################
workingDirectory/
data/
refGDS
profileGDS
```
This following code creates a temporary working directory structure where the
example will be run.
```{r createDir, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
#############################################################################
## Create a temporary working directory structure
#############################################################################
pathWorkingDirectory <- file.path(tempdir(), "workingDirectory")
pathWorkingDirectoryData <- file.path(pathWorkingDirectory, "data")
if (!dir.exists(pathWorkingDirectory)) {
dir.create(pathWorkingDirectory)
dir.create(pathWorkingDirectoryData)
dir.create(file.path(pathWorkingDirectoryData, "refGDS"))
dir.create(file.path(pathWorkingDirectoryData, "profileGDS"))
}
```
### 1.2 Download the population reference files
The population reference files should be downloaded in the *data/refGDS*
sub-directory. This following code downloads the complete pre-processed files
for 1000 Genomes (1KG), in hg38. The size of the 1KG GDS file is 15GB.
```
#############################################################################
## How to download the pre-processed files for 1000 Genomes (1KG) (15 GB)
#############################################################################
cd workingDirectory
cd data/refGDS
wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matGeno1000g.gds
wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matAnnot1000g.gds
cd -
```
For demonstration purpose, a small
**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small
**population reference SNV Annotation GDS file** (called
_ex1_good_small_1KG_Annot.gds_) are
included in this package. Beware that those two files should not be used to
run a real ancestry inference. The results obtained with those files won't be
reliable.
In this running example, the demonstration files are copied in the required
*data/refGDS* directory.
```{r copyRefFile, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
#############################################################################
## Load RAIDS package
#############################################################################
library(RAIDS)
#############################################################################
## The population reference GDS file and SNV Annotation GDS file
## need to be located in the same sub-directory.
## Note that the population reference GDS file used for this example is a
## simplified version and CANNOT be used for any real analysis
#############################################################################
## Path to the demo 1KG GDS file is located in this package
dataDir <- system.file("extdata", package="RAIDS")
pathReference <- file.path(dataDir, "tests")
fileGDS <- file.path(pathReference, "ex1_good_small_1KG.gds")
fileAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
file.copy(fileGDS, file.path(pathWorkingDirectoryData, "refGDS"))
file.copy(fileAnnotGDS, file.path(pathWorkingDirectoryData, "refGDS"))
```
## Step 2 Ancestry inference with RAIDS
### 2.1 Set-up required directories
All required directories need to be created. In addition, the path to
the reference files are kept in variables that will be used later.
```{r installRaids, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
#############################################################################
## The file path to the population reference GDS file
## is required (refGenotype will be used as input later)
## The file path to the population reference SNV Annotation GDS file
## is also required (refAnnotation will be used as input later)
#############################################################################
pathReference <- file.path(pathWorkingDirectoryData, "refGDS")
refGenotype <- file.path(pathReference, "ex1_good_small_1KG.gds")
refAnnotation <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
#############################################################################
## The output profileGDS directory, inside workingDirectory/data, must be
## created (pathProfileGDS will be used as input later)
#############################################################################
pathProfileGDS <- file.path(pathWorkingDirectoryData, "profileGDS")
if (!dir.exists(pathProfileGDS)) {
dir.create(pathProfileGDS)
}
```
### 2.2 Sample reference donor profiles from the reference data
With the 1KG reference, we recommend sampling 30 donor profiles per population.
For reproducibility, be sure to use the same random-number generator seed.
In the following code, only 2 profiles per population are sampled from the
demo population GDS file:
```{r sampling, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
#############################################################################
## Fix seed to ensure reproducible results
#############################################################################
set.seed(3043)
#############################################################################
## Select the profiles from the population reference GDS file for
## the synthetic data.
## Here we select 2 profiles from the simplified 1KG GDS for each
## subcontinental-level.
## Normally, we would use 30 profiles for each subcontinental-level.
## The 1KG files in this example only have 6 profiles for each
## subcontinental-level (for demo purpose only).
#############################################################################
dataRef <- select1KGPopForSynthetic(fileReferenceGDS=refGenotype,
nbProfiles=2L)
```
The output object is going to be used later.
### 2.3 Perform the ancestry inference
Ancestry inference can be done in one function call. Within a single function
call, data synthesis is performed, the synthetic
data are used to optimize the inference parameters and, with these, the
ancestry of the input profile donor is inferred.
According to the type of input data (RNA or DNA), a specific function
should be called. The *inferAncestry()* function is used for DNA profiles while
the *inferAncestryGeneAware()* function is RNA specific.
The *inferAncestry()* function requires a specific profile input format. The
format is set by the *genoSource* parameter.
One of those formats is in a VCF format (*genoSource=c("VCF")*).
This format follows the VCF standard
with at least those genotype fields: _GT_, _AD_ and _DP_.
The SNVs must be germline variants and should include the genotype of the
wild-type homozygous at the selected positions in the reference. The VCF file
must be gzipped.
A generic SNP file can replace the VCF file (*genoSource=c("generic")*).
The format is coma separated and the mandatory columns are:
* _Chromosome_: The name of the chromosome
* _Position_: The position on the chromosome
* _Ref_: The reference nucleotide
* _Alt_: The aternative nucleotide
* _Count_: The total count
* _File1R_: The count for the reference nucleotide
* _File1A_: The count for the alternative nucleotide
Beware that the starting position in the **population reference GDS file** is
zero (like BED files). The generic SNP file should also start
at position zero.
In this example, the profile is from DNA source and requires the use of the
*inferAncestry()* function.
```{r infere, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
###########################################################################
## GenomeInfoDb and BSgenome are required libraries to run this example
###########################################################################
if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
#######################################################################
## Chromosome length information is required
## chr23 is chrX, chr24 is chrY and chrM is 25
#######################################################################
genome <- BSgenome.Hsapiens.UCSC.hg38::Hsapiens
chrInfo <- GenomeInfoDb::seqlengths(genome)[1:25]
#######################################################################
## The demo SNP VCF file of the DNA profile donor
#######################################################################
fileDonorVCF <- file.path(dataDir, "example", "snpPileup", "ex1.vcf.gz")
#######################################################################
## The ancestry inference call
#######################################################################
resOut <- inferAncestry(profileFile=fileDonorVCF,
pathProfileGDS=pathProfileGDS,
fileReferenceGDS=refGenotype,
fileReferenceAnnotGDS=refAnnotation,
chrInfo=chrInfo,
syntheticRefDF=dataRef,
genoSource=c("VCF"))
}
```
A profile GDS file is created in the *pathProfileGDS* directory while all the
ancestry and optimal parameters information are integrated in the output
object.
At last, all temporary files created in this example should be deleted.
```{r removeTmp, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
#######################################################################
## Remove temporary files created for this demo
#######################################################################
unlink(pathWorkingDirectory, recursive=TRUE, force=TRUE)
```
## Step 3. Examine the value of the inference call
The inferred ancestry and the optimal parameters are present in the *list*
object generated by the *inferAncestry()* and *inferAncestryGeneAware()*
functions.
```{r printRes, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
###########################################################################
## The output is a list object with multiple entries
###########################################################################
class(resOut)
names(resOut)
```
### 3.1 Inspect the inference and the optimal parameters
For the global ancestry inference using PCA followed by nearest neighbor
classification these parameters are *D* (the number of the top principal
directions retained) and *k* (the number of nearest neighbors).
The information is stored in the *Ancestry* entry as a *data.frame* object.
It is a contains those columns:
* _sample.id_: The unique identifier of the sample
* _D_: The optimal PCA dimension value used to infer the ancestry
* _k_: The optimal number of neighbors value used to infer the ancestry
* _SuperPop_: The inferred ancestry
```{r print, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
###########################################################################
## The ancestry information is stored in the 'Ancestry' entry
###########################################################################
print(resOut$Ancestry)
```
### 3.2 Visualize the RAIDS performance for the synthetic data
The *createAUROCGraph()* function enable the visualization of RAIDS
performance for the synthetic data, as a function of *D* and *k*.
```{r visualize, echo=TRUE, eval=TRUE, fig.align="center", fig.cap="RAIDS performance for the synthtic data.", results='asis', collapse=FALSE, warning=FALSE, message=FALSE}
###########################################################################
## Create a graph showing the perfomance for the synthetic data
## The output is a ggplot object
###########################################################################
createAUROCGraph(dfAUROC=resOut$paraSample$dfAUROC, title="Example ex1")
```
In this specific example, the performances are lower than expected
with a real profile and a complete reference population file.
# Format population reference dataset (optional)
```{r graphStep1, echo=FALSE, fig.align="center", fig.cap="Step 1 - Provide population reference data", out.width='120%', results='asis', warning=FALSE, message=FALSE}
knitr::include_graphics("Step1_population_file_v01.png")
```
A population reference dataset with known ancestry is required to infer
ancestry.
Three important reference files, containing formatted information about
the reference dataset, are required:
- The population reference GDS File
- The population reference SNV Annotation GDS file
- The population reference SNV Retained VCF file (optional)
The format of those files are described
the [Population reference dataset GDS files](Create_Reference_GDS_File.html)
vignette.
The reference files associated to
the Cancer Research associated paper are available. Note that these
pre-processed files are for 1000 Genomes (1KG), in hg38. The files are
available here:
[https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper](https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper)
The size of the 1KG GDS file
is 15GB.
The 1KG GDS file is mapped on
hg38 [@Lowy-Gallego2019a].
This section can be skipped if
you choose to use the pre-processed files.
# Session info
Here is the output of `sessionInfo()` in the environment in which this
document was compiled:
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
# References