---
title: "Introduction_2_loadvariants"
author: "Hangjia Zhao"
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
  toc: true
vignette: >
  %\VignetteIndexEntry{Introduction_2_loadvariants}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

[Progenetix](https://progenetix.org/) is an open data resource that provides curated individual cancer copy number aberrations (CNA) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing genomic variant data within the Progenetix database. If your focus lies in cancer cell lines, you can access data from [*cancercelllines.org*](https://cancercelllines.org/) by specifying the `dataset` parameter as "cancercelllines". This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.

# Load library 

```{r setup}
library(pgxRpi)
```

## `pgxLoader` function 

This function loads various data from `Progenetix` database.  

The parameters of this function used in this tutorial:

* `type` A string specifying output data type. Available options are "biosample", "individual", "variant" or "frequency". 
* `output` A string specifying output file format. When the parameter `type` is "variant",
available options are NULL, "pgxseg" ,"pgxmatrix", "coverage" or "seg".
* `filters` Identifiers for cancer type, literature, cohorts, and age such as 
c("NCIT:C7376", "pgx:icdom-98353", "PMID<!-- -->:22824167", "pgx:cohort-TCGAcancers", "age:>=P50Y").
For more information about filters, see the [documentation](https://docs.progenetix.org/common/beacon-api/#filters-filters-filtering-terms).
* `individual_id` Identifiers used in Progenetix database for identifying individuals. 
* `biosample_id` Identifiers used in Progenetix database for identifying biosamples. 
* `codematches` A logical value determining whether to exclude samples 
from child concepts of specified filters that belong to cancer type/tissue encoding system (NCIt, icdom/t, Uberon). 
If TRUE, retrieved samples only keep samples exactly encoded by specified filters. 
Do not use this parameter when `filters` include ontology-irrelevant filters such as PMID and cohort identifiers.
Default is FALSE.
* `limit` Integer to specify the number of returned CNV coverage profiles for each filter. 
Default is 0 (return all). 
* `skip` Integer to specify the number of skipped CNV coverage profiles for each filter. 
E.g. if skip = 2, limit=500, the first 2*500 =1000 profiles are skipped and the next 500 profiles are returned. 
Default is NULL (no skip). 
* `save_file` A logical value determining whether to save the segment variant data as file 
instead of direct return. Only used when the parameter `type` is "variant" and `output` is "pgxseg" or "seg". Default is FALSE.
* `filename` A string specifying the path and name of the file to be saved. 
Only used if the parameter `save_file` is TRUE. Default is "variants.seg/pgxseg" in current work directory.
* `num_cores` Integer to specify the number of cores on your local computer to be used for the variant query. The default is 1.
* `dataset` A string specifying the dataset to query. Default is "progenetix". Other available options are "cancercelllines".

# Retrive CNV coverage of biosamples 

## Relevant parameters 

type, output, filters, individual_id, biosample_id, codematches, skip, limit, dataset

## Across genomic bins 

```{r}
cnv_matrix <- pgxLoader(type="variant", output="pgxmatrix", filters = "NCIT:C2948")
```

The data looks like this

```{r}
print(dim(cnv_matrix))
cnv_matrix[c(1:3), c(1:5,6213:6215)]
```

In this dataframe, `analysis_id` is the identifier for individual analysis, `biosample_id` is the identifier for individual biosample. It is noted that the number of analysis profiles does not necessarily equal the number of samples. One biosample_id may correspond to multiple analysis_id. `group_id` equals the meaning of `filters`. It's followed by all “gain status” columns (3106 intervals) plus all “loss status” columns (3106 intervals). The status is indicated by a coverage value, i.e. the fraction of how much the binned interval overlaps with one or more CNVs of the given type (DUP/DEL). For example, if the column `chr1.400000.1400000.DUP` is 0.200 in one row, it means that one or more duplication events overlapped with 20% of the genomic bin located in chromosome 1: 400000-1400000 in the corresponding analysis.

## Across chromosomes or the whole genome 

```{r}
cnv_coverage <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948")
```

It includes CNV coverage across chromosome arms, whole chromosomes, or whole genome.

```{r}
names(cnv_coverage)
```

The data of CNV coverage across chromosomal arms looks like this

```{r}
head(cnv_coverage$chrom_arm_coverage)[,c(1:4, 49:52)]
```

The row names are id of biosamples from the group NCIT:C2948. There are 96 columns. 
The first 48 columns are duplication coverage across chromosomal arms, followed by deletion coverage.
The data of CNV coverage across whole chromosomes is similar, with the only difference in columns.

The data of CNV coverage across genome (hg38) looks like this

```{r}
head(cnv_coverage$whole_genome_coverage)
```

The first column is the total called coverage, followed by duplication coverage and deletion coverage.

## Parameter `codematches` use

Setting `codematches = True` can exclude profiles with group_id belonging to child terms of the input filters.

21 samples are excluded from the original 47 samples in this case.

```{r}
cnv_coverage_2 <- pgxLoader(type="variant", output="coverage", filters = "NCIT:C2948",
                            codematches = TRUE)

print(dim(cnv_coverage$chrom_arm_coverage))
print(dim(cnv_coverage_2$chrom_arm_coverage))
```

## Access a subset of samples

By default, it returns all available profiles (limit=0), so the query may take a while 
when the number of retrieved samples is large. You can use the parameters `limit` and 
`skip` to access a subset of samples.

```{r}
cnv_matrix_2 <- pgxLoader(type="variant", output="pgxmatrix", 
                          filters = "NCIT:C2948",
                          skip = 0, limit=10)
# the dimention of subset 
print(dim(cnv_matrix_2))
# the dimention of original set
print(dim(cnv_matrix))
```

## Access by biosample id and individual id

```{r}
cnv_ind_matrix <- pgxLoader(type="variant", output="pgxmatrix", 
                          biosample_id = "pgxbs-kftva604",
                          individual_id = "pgxind-kftx5g4t")

cnv_ind_cov <- pgxLoader(type="variant", output="coverage", 
                          biosample_id = "pgxbs-kftva604",
                          individual_id = "pgxind-kftx5g4t")
```

# Retrieve segment variants 

Because of a time-out issue, segment variant data can only be accessed by biosample id instead of filters. 
To speed up this process, you can set the `num_cores` parameter for parallel processing. 

## Relevant parameters 

type, output, biosample_id, save_file, filename, num_cores, dataset

## Get biosample id 

The biosample information is also obtained by `pgxLoader` and the vignette about metadata
query see Introduction_1_loadmetadata.

```{r}
biosamples <- pgxLoader(type="biosample", filters = "PMID:20229506", limit=2)

biosample_id <- biosamples$biosample_id
```

There are three output formats. 

## The first output format (by default)

The default output format extracts variant data from the Beacon v2 response, containing variant id and associated analysis id, biosample id and individual id. The CNV data is represented as [copy number change class](https://vrs.ga4gh.org/en/2.x/concepts/systemic_variation/CopyNumber.html#copynumberchange) following the GA4GH Variation Representation Specification (VRS).

```{r}
variant_1 <- pgxLoader(type="variant", biosample_id = biosample_id)
head(variant_1)
```

## The second output format (`output` = "pgxseg")

This format is '.pgxseg' file format. It contains segment mean values (in `log2` column), which are equal to log2(copy number of measured sample/copy number of control sample (usually 2)). A few variants are point mutations represented by columns `reference_bases` and `alternate_bases`.

```{r}
variant_2 <- pgxLoader(type="variant", biosample_id = biosample_id,output = "pgxseg")
head(variant_2)
```

## The third output format (`output` = "seg")

This format is similar to the general '.seg' file format and compatible with IGV tool for visualization. The only difference between this file format and the general '.seg' file format is the fifth column. It represents variant type in this format while in the general '.seg' file format, it represents number of probes or bins covered by the segment. In addition, the point mutation variants are excluded in this file format.

```{r}
variant_3 <- pgxLoader(type="variant", biosample_id = biosample_id,output = "seg")
head(variant_3)
```

# Export variants data for visualization

Setting `save_file` as TRUE in `pgxLoader` function would make this function doesn't return 
variants data directly but let the retrieved data saved in the current work directory by default or other 
paths (specified by `filename`). The export is only available for variants data (type='variant'). 

## Upload 'pgxseg' file to Progenetix website

The following command creates a '.pgxseg' file with the name "variants.pgxseg" in "~/Downloads/" folder. 

```{r eval=FALSE}
pgxLoader(type="variant", output="pgxseg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.pgxseg")
```

To visualize the '.pgxseg' file, you can either upload it to [this link](https://progenetix.org/service-collection/uploader) 
or use the [byconaut](https://byconaut.progenetix.org/) package for local visualization when dealing with a large number of samples.

## Upload '.seg' file to IGV

The following command creates a special '.seg' file with the name "variants.seg" in "~/Downloads/" folder. 

```{r eval=FALSE}
pgxLoader(type="variant", output="seg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.seg")
```

You can upload this '.seg' file to [IGV tool](https://software.broadinstitute.org/software/igv/) for visualization.

# Session Info

```{r echo = FALSE}
sessionInfo()
```