--- title: "ASCAT to RaggedExperiment" author: - name: Lydia King affiliation: University of Galway, Ireland - name: Marcel Ramos affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "`r BiocStyle::doc_date()`" vignette: | %\VignetteIndexEntry{ASCAT to RaggedExperiment} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc_float: true package: RaggedExperiment --- # Introduction The `r Biocpkg("RaggedExperiment")` package provides a flexible data representation for copy number, mutation and other ragged array schema for genomic location data. The output of Allele-Specific Copy number Analysis of Tumors (ASCAT) can be classed as a ragged array and contains whole genome allele-specific copy number information for each sample in the analysis. For more information on ASCAT and guidelines on how to generate ASCAT data please see the ASCAT [website](https://www.crick.ac.uk/research/labs/peter-van-loo/software) and [github](https://github.com/VanLoo-lab/ascat). To carry out further analysis of the ASCAT data, utilising the functionalities of `RaggedExperiment`, the ASCAT data must undergo a number of operations to get it in the correct format for use with `RaggedExperiment`. # Installation ```{r, message = FALSE, warning = FALSE, eval = FALSE} if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("RaggedExperiment") ``` Loading the package: ```{r, message = FALSE} library(RaggedExperiment) library(GenomicRanges) ``` # Structure of ASCAT data The data shown below is the output obtained from ASCAT. ASCAT takes Log R Ratio (LRR) and B Allele Frequency (BAF) files and derives the allele-specific copy number profiles of tumour cells, accounting for normal cell admixture and tumour aneuploidy. It should be noted that if working with raw CEL files, the first step is to preprocess the CEL files using the PennCNV-Affy pipeline described [here](https://penncnv.openbioinformatics.org/en/latest/user-guide/affy/). The PennCNV-Affy pipeline produces the LRR and BAF files used as inputs for ASCAT. Depending on user preference, the output of ASCAT can be multiple files, each one containing allele-specific copy number information for one of the samples processed in an ASCAT run, or can be a single file containing allele-specific copy number information for all samples processed in an ASCAT run. Let's load up and have a look at ASCAT data that contains copy number information for just one sample i.e. sample1. Here we load up the data, check that it only contains allele-specific copy number calls for 1 sample and look at the first 10 rows of the dataframe. ```{r} ASCAT_data_S1 <- read.delim( system.file( "extdata", "ASCAT_Sample1.txt", package = "RaggedExperiment", mustWork = TRUE ), header = TRUE ) unique(ASCAT_data_S1$sample) head(ASCAT_data_S1, n = 10) ``` Now let's load up and have a look at ASCAT data that contains copy number information for the three processed samples i.e. sample1, sample2 and sample3. Here we load up the data, check that it contains allele-specific copy number calls for the 3 samples and look at the first 10 rows of the dataframe. We also note that as expected the copy number calls for sample1 are the same as above. ```{r} ASCAT_data_All <- read.delim( system.file( "extdata", "ASCAT_All_Samples.txt", package = "RaggedExperiment", mustWork = TRUE ), header = TRUE ) unique(ASCAT_data_All$sample) head(ASCAT_data_All, n = 10) ``` From the output above we can see that the ASCAT data has 6 columns named sample, chr, startpos, endpos, nMajor and nMinor. These correspond to the sample ID, chromosome, the start position and end position of the genomic ranges and the copy number of the major and minor alleles i.e. the homologous chromosomes. # Converting ASCAT data to `GRanges` format The `RaggedExperiment` class derives from a `GRangesList` representation and can take a `GRanges` object, a `GRangesList` or a list of `Granges` as inputs. To be able to use the ASCAT data in `RaggedExperiment` we must convert the ASCAT data into `GRanges` format. Ideally, we want each of our `GRanges` objects to correspond to an individual sample. ## ASCAT to `GRanges` objects In the case where the ASCAT data has only 1 sample it is relatively simple to produce a `GRanges` object. ```{r} sample1_ex1 <- GRanges( seqnames = Rle(paste0("chr", ASCAT_data_S1$chr)), ranges = IRanges(start = ASCAT_data_S1$startpos, end = ASCAT_data_S1$endpos), strand = Rle(strand("*")), nmajor = ASCAT_data_S1$nMajor, nminor = ASCAT_data_S1$nMinor ) sample1_ex1 ``` Here we create a `GRanges` object by taking each column of the ASCAT data and assigning them to the appropriate argument in the `GRanges` function. From above we can see that the chromosome information is prefixed with "chr" and becomes the seqnames column, the start and end positions are combined into an `IRanges` object and given to the ranges argument, the strand column contains a `*` for each entry as we don't have strand information and the metadata columns contain the allele-specific copy number calls and are called nmajor and nminor. The `GRanges` object we have just created contains 41 ranges (rows) and 2 metadata columns. Another way that we can easily convert our ASCAT data, containing 1 sample, to a `GRanges` object is to use the `makeGRangesFromDataFrame` function from the `GenomicsRanges` package. Here we indicate what columns in our data correspond to the chromosome (given to the `seqnames` argument), start and end positions (`start.field` and `end.field` arguments), whether to ignore strand information and assign all entries `*` (`ignore.strand`) and also whether to keep the other columns in the dataframe, nmajor and nminor, as metadata columns (`keep.extra.columns`). ```{r} sample1_ex2 <- makeGRangesFromDataFrame( ASCAT_data_S1[,-c(1)], ignore.strand=TRUE, seqnames.field="chr", start.field="startpos", end.field="endpos", keep.extra.columns=TRUE ) sample1_ex2 ``` In the case where the ASCAT data contains more than 1 sample you can first use the `split` function to split the whole dataframe into multiple dataframes, one for each sample, and then create a `GRanges` object for each dataframe. Code to split the dataframe, based on sample ID, is given below and then the same procedure used to produce `sample1_ex2` can be implemented to create the `GRanges` object. Alternatively, an easier and more efficient way to do this is to use the `makeGRangesListFromDataFrame` function from the `GenomicsRanges` package. This will be covered in the next section. ```{r} sample_list <- split( ASCAT_data_All, f = ASCAT_data_All$sample ) ``` ## ASCAT to `GRangesList` instance To produce a `GRangesList` instance from the ASCAT dataframe we can use the `makeGRangesListFromDataFrame` function. This function takes the same arguments as the `makeGRangesFromDataFrame` function used above, but also has an argument specifying how the rows of the `df` are split (`split.field`). Here we will split on sample. This function can be used in cases where the ASCAT data contains only 1 sample or where it contains multiple samples. Using `makeGRangesListFromDataFrame` to create a list of `GRanges` objects where ASCAT data has only 1 sample: ```{r} sample_list_GRanges_ex1 <- makeGRangesListFromDataFrame( ASCAT_data_S1, ignore.strand=TRUE, seqnames.field="chr", start.field="startpos", end.field="endpos", keep.extra.columns=TRUE, split.field = "sample" ) sample_list_GRanges_ex1 ``` Using `makeGRangesListFromDataFrame` to create a `list` of `GRanges` objects where ASCAT data has multiple samples: ```{r} sample_list_GRanges_ex2 <- makeGRangesListFromDataFrame( ASCAT_data_All, ignore.strand=TRUE, seqnames.field="chr", start.field="startpos", end.field="endpos", keep.extra.columns=TRUE, split.field = "sample" ) sample_list_GRanges_ex2 ``` Each `GRanges` object in the `list` can then be accessed using square bracket notation. ```{r} sample1_ex3 <- sample_list_GRanges_ex2[[1]] sample1_ex3 ``` Another way we can produce a `GRangesList` instance is to use the `GRangesList` function. This function creates a list that contains all our `GRanges` objects. This is straightforward in that we use the `GRangesList` function with our `GRanges` objects as named or unnamed inputs. Below we have created a list that includes 1 `GRanges` objects, created in section 4.1., corresponding to sample1. ```{r} sample_list_GRanges_ex3 <- GRangesList( sample1 = sample1_ex1 ) sample_list_GRanges_ex3 ``` # Constructing a `RaggedExperiment` object from ASCAT output Now we have created the `GRanges` objects and `GRangesList` instances we can easily use `RaggedExperiment`. ## Using `GRanges` objects From above we have a `GRanges` object derived from the ASCAT data containing 1 sample i.e. `sample1_ex1` / `sample1_ex2` and the capabilities to produce individual `GRanges` objects derived from the ASCAT data containing 3 samples. We can now use these `GRanges` objects as inputs to `RaggedExperiment`. Note that we create column data `colData` to describe the samples. Using `GRanges` object where ASCAT data only has 1 sample: ```{r} colDat_1 = DataFrame(id = 1) ragexp_1 <- RaggedExperiment( sample1 = sample1_ex2, colData = colDat_1 ) ragexp_1 ``` In the case where you have multiple `GRanges` objects, corresponding to different samples, the code is similar to above. Each sample is inputted into the `RaggedExperiment` function and `colDat_1` corresponds to the id for each sample i.e. 1, 2 and 3, if 3 samples are provided. ## Using a `GRangesList` instance From before we have a `GRangesList` derived from the ASCAT data containing 1 sample i.e. `sample_list_GRanges_ex1` and the `GRangesList` derived from the ASCAT data containing 3 samples i.e. `sample_list_GRanges_ex2`. We can now use this `GRangesList` as the input to `RaggedExperiment`. Using `GRangesList` where ASCAT data only has 1 sample: ```{r} ragexp_2 <- RaggedExperiment( sample_list_GRanges_ex1, colData = colDat_1 ) ragexp_2 ``` Using `GRangesList` where ASCAT data only has multiple samples: ```{r} colDat_3 = DataFrame(id = 1:3) ragexp_3 <- RaggedExperiment( sample_list_GRanges_ex2, colData = colDat_3 ) ragexp_3 ``` We can also use the `GRangesList` produced using the `GRangesList` function: ```{r} ragexp_4 <- RaggedExperiment( sample_list_GRanges_ex3, colData = colDat_1 ) ragexp_4 ``` # Downstream Analysis Now that we have the ASCAT data converted to `RaggedExperiment` objects we can use the \*Assay functions that are described in the `RaggedExperiment` [vignette](https://bioconductor.org/packages/release/bioc/vignettes/RaggedExperiment/inst/doc/RaggedExperiment.html). These functions provide several different functions for representing ranged data in a rectangular matrix. They make it easy to find genomic segments shared/not shared between each sample considered and provide the corresponding allele-specific copy number calls for each sample across each segment. # Session Information ```{r} sessionInfo() ```