---
title: "CNVgears package."
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{CNVgears package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Initial Note #
The illustrative pipelines shown in this file uses real data from the 1000
Genomes project (). The examples are
performed on array data for chromosomes 14 and 22 only and the raw final report
files provided as they are necessary for some critical steps of the pipeline.
Because those files are relatively large (> 5Mb) they are stored as a separate
GitHub repository (). The user is kindly
asked to download the repo and give the directory path at the step 5.2 in order to
being able to perform all the examples.
# Introduction #
The CNVgears package aims to implement an interactive and extensible framework for
the analysis of Copy Number Variation calling and/or segmentation results, typically
from multiple pipelines or algorithms, regardless the initial raw data type. At
the moment Illumina SNPs array and any type of NGS data are supported, provided
that at least the raw log2R value, or any related measure is accessible for each
sample. The package is particularly suited for the analysis of family based studies,
however, the more vast majority of the functions are of general
purpose and useful for any CNVs study. Next session illustrate the pipeline main
steps and most of the package features using CNVs calling results on a subset
of the 1000 Genomes cohort. We created the subset using samples that creates full
trios and have both SNP array and WGS data available. However at the moment only
the arrays are used here. CNVs calling was performed using a modified minimal
version of EnsembleCNV.
# Main Steps #
CNVgears supports multiple analysis and cohort composition (e.g. family based or not)
as it is possible to perform all or only a fraction of these steps. As an example, if the
cohort is not family based, if CNVs inheritance pattern is not a desired information
or if the markers-level raw data is not available, the raw data load step can be
easily skip. The package can be used also only to combine the results of different
methods in a uniform format and export the output to be processed in other programs.
The typical main steps of such analysis using CNVgears are:
1. Load the cohort minimal metadata in the form of PED file (mandatory);
2. Load the markers file (i.e. genomic location of each marker, SNP in case of
array, interval/bin in case of NGS data) (mandatory);
3. Read and pre-process the CNVs calling results for all samples from the
individual algorithms/pipelines, in particular adjacent calls with equal
genotype (i.e. deletions/duplications) can be merged if closer than
user-specified distances (mandatory);
4. Read and pre-process the "raw data" (in particular the log2R for each marker
for all samples) (not mandatory but required for CNVs inheritance detection);
5. Explore the sample's summary statistics, identify groups of unwanted events
(e.g. smaller then 5 kbp or consisting of less than 5 points) and possible
over-segmented samples (not mandatory but strongly suggested);
6. Filter the results based on several parameters and blacklist (both provided
by the package and user-generated), such as immunoglobulin and
telomeric/centromeric regions. Also more extreme intra-results merging can be
performed if one or more methods appear to over-segment (not mandatory but
strongly suggested);
7. Detect CNVs inheritance pattern, if families structure is given (not mandatory);
8. Visually inspect the good de novo CNVs candidate raw data (not mandatory,
strongly suggested in the case of de novo CNVs, require the previous step);
9. Merge the various methods results in a single object (not mandatory, however
this is one of the main points of the package);
1. Second round of filtering, keep only the call made by N (one or more)
algorithms/pipelines (not mandatory, requires the previous step);
1. Annotate genomic locus of the calls and search for CNVs in known
disease-linked loci (e.g. in studies on ASD or schizophrenia) (not mandatory);
2. Annotate genic content of selected CNVs (not mandatory);
2. Compute CNV regions (CNVRs) (not mandatory);
3. Extract high confidence rare events based on CNVRs (not mandatory, requires
the previous step);
4. Export desired data. Several format are supported, such as BED (e.g. for
being viewed in IGV) or VCF (e.g. for being processed with Annovar), as well
as TSV (not mandatory).
# Setup #
## Install
CNVgears is part of Bioconductor (), to install it
using `BiocManager`:
```{r bioc, eval = FALSE}
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("CNVgears")
```
Until Biconductor 3.14 it can be found in the devel branch.
```{r bioc dev, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("CNVgears")
```
To install the latest development version of the package via GitHub repository:
```{r install git hub, eval = FALSE}
# not run
devtools::install_github("SinomeM/CNVgears")
```
## Load library
To load the package.
```{r library}
library(CNVgears, quietly = TRUE)
```
## Set data.table threads
All the major functions uses `data.table` package, especially in the more intensive
steps. The package uses by default half the CPU threads available, this can be changed
with the function `data.table::setDTthreads`. For example to use the 75% of available
threads or an integer number (four in the example) of them:
```{r, message=TRUE, warning=TRUE}
data.table::setDTthreads(percent=75)
data.table::setDTthreads(4)
```
```{r, include=FALSE}
data.table::setDTthreads(2)
```
Empirically, we found the ideal number of threads being between 4 and 8.
See the relative man pages for details on each function used.
# Data Load #
## Cohort PED and Markers data
The first two objects we need to load are the cohort and markers information. The
first object is loaded from a PED file from with the following columns are extracted:
Sample ID, Family ID, sex and role (proband/sibling/mother/father). Role information
needs to be present only if CNV inheritance detection is a desired step. The second
object contains the genomic location of the markers used for CNV calling, thus the
SNPs in case of DNA chip and genomic intervals if WES/WGS. The following columns are
needed: SNP ID (only for DNA chips), chromosome, start, end. One of these objects
is needed for each data type used for calling. Here we used only SNP arrays so only
one markers object is needed.
In the package are bundled CNV calling results of three different algorithms for
chromosomes 14 and 22 of 24 samples from the 1000 Genomes project, as well as a PED
file of the cohort and the markers-level raw data for a single trio (within the
24 samples).
```{r PED PFB}
# cohort data
cohort <- read_metadt(DT_path = system.file("extdata", "cohort.ped",
package = "CNVgears"),
sample_ID_col = "Individual ID",
fam_ID_col = "Family ID",
sex_col = "Gender", role_col = "Role")
head(cohort)
# markers location
SNP_markers <- read_finalreport_snps(system.file("extdata", "SNP.pfb",
package = "CNVgears"),
mark_ID_col = "Name", chr_col = "Chr",
pos_col = "Position")
head(SNP_markers)
```
## CNV Calling Results and Markers-level raw data
Then we load the actual CNV calling results and the marker-level raw data.
The results are loaded in the session as R objects, while the raw data are
processed and stored as RDS one chromosome at the time. The RDS files will be
loaded only when necessary in order
to reduce the RAM requirements of processing results from very large cohorts
and/or from several methods.
In this scenario the raw data consist of the SNP array only, thus we can process
them one single time and use them for all the three callers results.
Note that during the importing of the results an intra-method CNV merging step is
performed, i.e. adjacent call are merged into a single one if requirements are met.
This step is performed by default but can be avoided by setting `do_merge` parameter
to `FALSE`. The resulting objects are of the class `CNVresults`.
```{r results + raw, message=TRUE, warning=TRUE}
# CNV calling results
penn <- read_results(DT_path = system.file("extdata",
"chrs_14_22_cnvs_penn.txt",
package = "CNVgears"),
res_type = "file", DT_type = "TSV/CSV",
chr_col = "chr", start_col = "posStart",
end_col = "posEnd", CN_col = "CN",
samp_ID_col = "Sample_ID", sample_list = cohort,
markers = SNP_markers, method_ID = "P")
head(penn)
quanti <- read_results(DT_path = system.file("extdata",
"chrs_14_22_cnvs_quanti.txt",
package = "CNVgears"),
res_type = "file", DT_type = "TSV/CSV",
chr_col = "chr", start_col = "posStart",
end_col = "posEnd", CN_col = "CN",
samp_ID_col = "Sample_ID", sample_list = cohort,
markers = SNP_markers, method_ID = "Q")
ipn <- read_results(DT_path = system.file("extdata",
"chrs_14_22_cnvs_ipn.txt",
package = "CNVgears"),
res_type = "file", DT_type = "TSV/CSV",
chr_col = "chr", start_col = "posStart",
end_col = "posEnd", CN_col = "CN",
samp_ID_col = "Sample_ID", sample_list = cohort,
markers = SNP_markers, method_ID = "I")
```
```{r, eval = FALSE}
# raw data
# not run. This step needs the additional data
library("data.table", quietly = TRUE)
setDTthreads(4)
read_finalreport_raw(DT_path = "PATH_TO_DOWNLOADED_DATA_DIRECTORY",
pref = NA, suff = ".txt",
rds_path = file.path("tmp_RDS"),
markers = SNP_markers,
sample_list = cohort[fam_ID == "1463", ])
```
Note that when loading the raw data only the trio is used as cohort. The samples
object is subset using a `data.table` query.
Now the initial step of data load and normalization is completed and all a series
of integration and analysis steps can take place.
Obviously, if more methods are used this initial step is longer, in particular if
also WGS and/or high density SNPs array data is used (because of the very high
number of markers). More examples will be added in future version of this vignette.
## `cn.mops` package results loading
CNVs calling from NGS data Bioconductor package `cn.mops` uses `GRanges` objects as
results. In particular, the actual calls are stored in the `cnvs` slot. These can
be loaded into a `CNVgears` pipeline using the function `cnmops_to_CNVresults`
as shown here. Note that only calls in chromosomes 1:22 are supported by this package
at the moment.
```{r, eval= FALSE}
## not run
library(cn.mops)
data(cn.mops)
resCNMOPS <- cn.mops(XRanges)
resCNMOPS <- calcIntegerCopyNumbers(resCNMOPS)
resCNMOPS_cnvs <- cnvs(resCNMOPS)
sample_list <- read_metadt("path/to/PED_like_file")
intervals <- read_NGS_intervals("path/to/markers/file")
cnmops_calls <- cnmops_to_CNVresults(resCNMOPS_cnvs, sample_list, intervals)
```
# Data integration and processing #
## Results exploration and filtering
When analyzing CNVs calling or segmentation results, an important step is to filter
out undesired and/or noisy/unreliable calls. This package provide the means to do
so in an integrated an standardized manner using two functions: `summary_stats`
and `cleaning_filter`. The first produces several summary statistics and exploratory
plot on the results, both cohort-wise and samples-wise. They are designed for
interactive use, also more then one time if desired.
An example for one of the callers results.
```{r summary stats}
sstatPenn <- summary(object = penn, sample_list = cohort,
markers = SNP_markers)
```
```{r, eval = FALSE}
# not run
sstatPenn <- summary(object = penn, sample_list = cohort,
markers = SNP_markers,
plots_path = file.path("tmp","sstatPenn")) # plots are saved
```
Exploring these summary statistics can provide important insights on the cohort
composition and on the differences between the individual callers. Obviously, the
process is completely optional, and we provide default values for the filtering
function based on literature and our experience.
The output of summary can help identifying over-segmented samples.
```{r}
head(sstatPenn)
```
```{r filtering}
penn_filtered <- cleaning_filter(results = penn, min_len = 10000, min_NP = 10)
quanti_filtered <- cleaning_filter(quanti)
ipn_filtered <- cleaning_filter(ipn)
```
It is also possible to filter calls based on blacklists, e.g. immunoglobulin regions
or telomers. In our package we provide two functions to generate immunoglobulin,
telomeric and centromeric regions for the hg18, hg19 and hg38 genome assemblies.
The user can specify the number of minimum consecutive immuno genes to create a
region and the desired telomers and centromers blacklists width. Here we create
those blacklists in hg19 as an example.
Note that the fucntion `immuno_regions` uses internally the package `biomaRt` that
work with the hg38 genome assembly by default. To work with older assemblies we
need to pass a custom `mart` object, as exemplified here.
```{r blacklists}
hg19telom <- telom_centrom(hg19_start_end_centromeres, centrom = FALSE)
hg19centrom <- telom_centrom(hg19_start_end_centromeres, telom = FALSE,
region = 25000)
```
```{r, eval=FALSE}
ensembl37 <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
host = "grch37.ensembl.org",
dataset="hsapiens_gene_ensembl")
hg19_immuno <- immuno_regions(n_genes = 10, mart = ensembl37)
```
A blacklist can be any `data.table` with at least these three columns: chr, start,
end. Users should be particularly careful when filtering using blacklists as
it is possible to filter out relevant data.
```{r}
head(hg19telom)
```
## Inheritance pattern detection
In family-based cohort studies, CNV inheritance detection is possible. However,
because the process of CNVs calling tends to be imprecise, both false positive and
false negative are present in the results. This is in part solved with the filtering
process and can be solved combining more than one methods results.
Here, to identify good *de-novo* candidates hits as well as inherited events we
use the function `cnvs_inheritance` that integrates three screening steps,
both at segment-level and at marker-level. The first step, for every CNVs in an
offspring, searches compatible CNVs in both parents (i.e. with 30% reciprocal
overlap, by default). If an hit is found the CNVs is inherited, if not it is
considered a putative *de-novo* ad undergoes the successive marker-level screenings.
At this point the CopyRatio values for all the markers in the region are considered
in the trio. A first screening consider as potentially inherited all the putative
*de-novo* events of the offspring for witch a parent have more than 75% compatible
markers points in the regions, i.e. markers with a CopyRatio < 0.75 if the CNV is
a deletion or a CopyRatio > 1.25 if the CNV is a duplication. This step is mostly
needed to speed up the function and will be made optional in near future.
For autosomal chromosomes
$$LRR = log2R = log2(CopyRatio) = log2(numeric_{CN}/2)$$
The putative *de-novo* events that survive undergo the final screening. This can
be performed in several way, at the moment the suggested and most supported is the
following. In brief for both pairs parent-offspring the mean of the CopyRatio values
are compared using a Wilcox test. Under the null hypothesis that the difference of
the two means is zero, we test if a difference is statistically significant, i.e.
if the resulting p-value is lower than a user-defined alpha (default value is 0.05).
If the test has a positive outcome for both parents than the CNV is defined as a
good de-novo candidate. The two p-values from each comparison are stored in the
dataset and, when all the individual comparison are done, two additional columns
containing the adjusted p-values are computed, leaving to the user the choice of
which sets of p-value use.
```{r inheritance, eval=FALSE}
# not run, needs the additional raw data
penn_inheritance <-
cnvs_inheritance(results = penn_filtered, sample_list = cohort[fam_ID == "1463", ],
markers = SNP_markers, raw_path = file.path("tmp_RDS"))
quanti_inheritance <-
cnvs_inheritance(results = quanti_filtered, sample_list = cohort[fam_ID == "1463", ],
markers = SNP_markers, raw_path = file.path("tmp_RDS"))
ipn_inheritance <-
cnvs_inheritance(results = ipn_filtered, sample_list = cohort[fam_ID == "1463", ],
markers = SNP_markers, raw_path = file.path("tmp_RDS"))
penn_denovo <- penn_inheritance[inheritance == "denovo", ]
quanti_denovo <- quanti_inheritance[inheritance == "denovo", ]
ipn_denovo <- ipn_inheritance[inheritance == "denovo", ]
```
Note that inheritance detection is performed separately for each method, in this
way it is possible to use marker-level raw data. However, when all the results use
the same raw data (as in in this vignette, at the moment), it is also possible to
first combine the results (see next section) an only then perform inheritance
detection.
## Inter-methods merging
At this point we can integrate the multiple methods results into a single object.
As exemplification of a real pipeline we also process the *de-novo* calls.
```{r inter merge, results="hide"}
ipq_filtered <-
inter_res_merge(res_list = list(penn_filtered, quanti_filtered, ipn_filtered),
sample_list = cohort, chr_arms = hg19_chr_arms)
```
```{r, eval=FALSE}
# not run, it needs inheritance detection step
ipq_denovo <-
inter_res_merge(res_list = list(penn_denovo, quanti_denovo, ipn_denovo),
sample_list = cohort, chr_arms = hg19_chr_arms)
```
At this point it is possible to perform a second filtering step based on the number
of calling methods, e.g. filter out all the calls made by one single algorithm.
Note how the merge perform **reciprocal** overlap comparison in order to merge two
calls. Candidate events should thus not only overlap but must also be of comparable
size.
## CNV Regions Computation
Copy Number Variable Regions can be defined in several ways, an intuitive way to
visualize them is: regions of the human genomes were CNVs (of comparable size) are
present across samples in a population. This can be due for examples because such
region is bordered by two particularly active breakpoints. CNVRs can be used for
CNVs-GWAS and, more easily, to extract rare events from a cohort.
In this package, we provide a method to compute Copy Number Variable Regions (CNVRs),
based on reciprocal overlap, starting from the integrated object created in the
previous step. See the `?cnvrs_create` for details.
```{r cnvrs, results="hide"}
cnvrs_chr22 <- cnvrs_create(cnvs = ipq_filtered[chr == 22,],
chr_arms = hg19_chr_arms)
```
CNVRs can also be used for filtering purposes, as an example a user can be more
interested in rare events and use CNVRs frequency to achieve this internally.
Rare (within the cohort) calls can be obtained filtering out the common CNVRs, as
follows.
```{r}
# define common CNVR as the one present in > 5% of the cohort
# here the cohort is very small so it won't be a significative result
common_cnvrs_chr22 <- cnvrs_chr22[[1]][freq > nrow(cohort)*0.5, ]
rare_cvns_chr22 <- cnvrs_chr22[[2]][!cnvr %in% common_cnvrs_chr22$r_ID, ]
```
## De-novo CNVs visual confirmation
In the package it is also implemented a function to visually inspect the de-novo
candidates. We use as an example the two calls detected by both PennCNV and
QantiSNP and the larger one made by IPattern.
```{r denovo plot, eval=FALSE}
# not run, it needs inheritance detection step
lrr_trio_plot(cnv = ipq_denovo[1], raw_path = file.path("tmp_RDS"),
sample_list = cohort, results = ipn_filtered)
lrr_trio_plot(cnv = ipq_denovo[2], raw_path = file.path("tmp_RDS"),
sample_list = cohort, results = ipn_filtered)
lrr_trio_plot(cnv = ipq_denovo[3], raw_path = file.path("tmp_RDS"),
sample_list = cohort, results = ipn_filtered)
```
Note that the two calls are quite close.
# Annotation #
The package include also some annotation functions, for extracting both the CNVs
locus and affected genes. At the moment the affected genes are listed using "-"
separated Ensemble IDs.
As `immuno_regions`, the function `genic_load` uses `biomaRt` internally, thus a
custom mart is needed to work in hg19.
```{r annotation, echo=c(-2,-4), eval = FALSE}
# locus
ipq_denovo_locus <- genomic_locus(DT_in = ipq_filtered, keep_str_end = FALSE)
# genes
ensembl37 <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
host = "grch37.ensembl.org",
dataset="hsapiens_gene_ensembl")
ipq_denovo_genes <- genic_load(DT_in = ipq_filtered, mart = ensembl37)
rm(ensembl37)
```
# Visualization #
All CNVs from an offspring (not only the de novo ones) can be visualized in the raw
data using the function `lrr_trio_plot`. A simpler function to visualize one single
event without any focus on family relations will be implemented soon. Integration
with the package `IGVR` is also planned.
# Data export #
All the major object handled by the package are `data.table`. Thus to export outside
of R an object (as an example the output of the combination of several methods
results) it is possible to use `data.table` semantics and functions.
```{r, eval=FALSE}
## not run
# Select a subset of columns using DT[, .(col1,col2,coln)]
tmp <- ipq_filtered[, .(sample_ID, chr, start, end, GT, meth_ID)]
head(tmp, 3)
# Select a subset of rows (i.e. filter) using DT[condition, ]
tmp <- ipq_filtered[n_meth > 1, .(sample_ID, chr, start, end, GT, meth_ID)]
head(tmp, 3)
# Export object as a TSV
fwrite(tmp, "example.tsv", sep = "\t")
```
As an example to create a BED4 file were the entries names consist of
`sample_ID"-"GT` that can be loaded in IGV we do:
```{r}
tmp <- ipq_filtered[, .(chr, start, end, paste0(sample_ID, "-", GT))]
head(tmp)
```
```{r, eval=FALSE}
## not run
fwrite(tmp, "example.bed", sep = "\t", col.names = FALSE)
```
Objects of the class `CNVresults` can also be converted into `GenomicRanges::GRanges()`
using the function `CNVresults_to_GRanges`. Required columns (and thus maintained
in the resulting object) are: chr, start, end, sample_ID, GT, meth_ID. The last
three columns will be placed in the `metadata` field.
In the near future a more general function to convert all the major object into
`GRanges` as well as `GRangesList` (e.g. for integration with `CNVranger`
package) will be implemented.
```{r}
ipq_filtered_GRanges <- CNVresults_to_GRanges(ipq_filtered)
```
`data.table` queries can also be used to easily extract subset of interest from the
results. Some relevant examples are shown here.
```{r, eval=FALSE}
# only call with > 1 calling methods
tmp <- ipq_filtered[n_meth > 1, ]
# de novo CNVs with a significant corrected p-value (note that the p-values are
# lost when combining more result-objects)
tmp <- penn_inheritance[adj_m_pval < 0.05 & adj_p_pval < 0.05, ]
```
# Notes on Parallelization #
As already mentioned, this package use `data.table` whenever is possible to handle
the big tables that working with large cohorts produces and `data.table` is internally
parallelized. As a consequence, most of the function already use more than one
thread an should be run serially on system with a low number of CPU cores. On more
CPU and RAM equipped systems however, the user may want to parallelize the longer
steps. This can be done easily on two particularly long steps, i.e. the results and
the markers-level raw data load. The suggested way for `read_results` is to split
the `sample_list` object in two or more batches, run `read_results` in parallel
on those batches, and finally combine the objects into a final one. The raw data
processing do not produce any object, as it stores all its results into `RDS` files
named after the individual samples, so it can be run as parallel as the user desire,
again by simply generating subsets of the `sample_list` object. Ideally, the entire
pipeline (excluding for obvious reasons CNVRs creation, in that case it is possible
to run subset of chromosomes in parallel) can be run in parallel on
more than one cohort subsets, as long as families are kept together. However, in
order to have a better view on the cohort summary statistics and the filtering
procedure effect, it is recommended to perform at least those steps on the entire
dataset. While parallelizing the user should always keep in mind that `data.table`
uses more than one thread by default, so `data.table::setDTthreads()` should always
be set appropriately.
Again, empirically we found that the optimal number of threads is between 2 and 8 depending
on the number of total accessible threads and the amount of formal, external
parallelization desired, as more tend to increase the process speed rather than
increasing it.