Sonali Arora, Martin Morgan
October 28, 2014
Copy Number Variation (CNV's) refers to the duplication or deletion of DNA segments larger than 1 kb. CNV's are structural variations in the genome which range in length between 50 bp and 1 Mbp. They are widespread among humans - on an average 12 CNVs exist per individual in comparison to the reference genome. They have also been shown to play a role in diseases such as autism, breast cancer, obesity, Alzheimer’s disease and schizophrenia among other diseases.
Like any other genomic analysis, before we start a copy Number analysis, we need to consider experimental design. Here, we highlight two specific pointers that one needs to keep in mind while designing a copy number analysis.
Tumor and normal samples Are we planning to just find copy number profile in individuals? For example, how does the copy number profile for a region evolve over a certain period of time? (Here we are comparing the copy number profile to a reference genome)
Are we planning to compare copy number profiles from tumor vs normal profiles? For example, we may be trying to find out if copy number changes are responsible for a certain form of cancer and want to find the exact genomic region against which a treatment can be developed.
There are different packages and different functions inside the same package which handle CNV for tumor and normal samples or CNV in samples.
Germ line versus somatic CNV Germ line CNV are relatively short (a few bp to a few Mbp) copy number changes that the individual inherits from one of the two parental gametes and thus are typically present in 100% of cells.
Somatic CNV (often called CNA where A stands for alterations or aberration) are copy number changes of any size and amount (from a few bases to whole chromosomes) that happen (and often carry on happening) in cancer cells. Cancer cells can be aneuploid (that means they are largely triploid, tetraploid or even aploid) and can have high focal amplifications (some regions could have many copies: it is not unusual to have 8-12 copies for some regions). Furthermore, because tumor samples are typically an admixture of normal and cancer cells, the tumor purity in unknown and variable.
Different algorithms make different assumptions while handling somatic or germ line CNV. Typically, germ line cnv caller can assume:
For these reasons, the algorithms can focus more on associating p-values for each call; it is possible to estimate false positive and false negative rates.
Somatic CNA callers cannot make any of the assumption above, or if they do, they have limited scope.
Some key questions when thinking about sequencing technology to use include:
What kind of sequencing data are we working with?
Is it array CGH data, SNP array or next generation sequencing data? For example, the CGHbase detects CNV's in array CGH data whereas the seqCNA among others detects CNV's in high-throughput sequencing data.
Amount of genome being sequenced - whole genome vs exome?
Are we looking for copy number across the whole genome or are we looking for copy number only in exomes? Again different packages handle different kind of sequencing data. For example, The Bioconductor package exomeCopy detects CNV in exome sequencing data whereas the Bioconductor package cn.mops detects CNV in whole genome sequencing data.
Which platform was the sequencing done on?
A lot of packages detecting CNV on platform-specific data. For example, the crlmm detects CNV's in Affymetrix SNP 5.0 and 6.0 and Illumina arrays whereas the CopyNumber450K detects them in Illumina 450k methylation microarrays.
What is the coverage of sequenced data?
Most packages work well with high coverage sequencing data, but some packages are designed to work well with low coverage data. It is best to recognize how coverage of our data will affect the choice of the package we use for our analysis at an early stage.
Since statistics plays a huge role in copy number analysis, we should also spend some time in thoroughly understanding the underlying algorithm of the R package being used. A few questions to consider while choosing a package would be -
How is our chosen package binning and counting reads?
Is any pre-processing required from our end? Is it trimming aligned reads internally?
What segmentation algorithm is being used ? For example, does the package use Circular Binary Segmentation, HMM based methods etc.
How efficiently can it handle big data? Do I need additional computational resources to run the analysis? Does the function run in parallel?
Bioconductor currently has about 41 packages for Copy Number Analysis. To find these, one can visit the biocViews page and type “CopyNumberVariation” in the “Autocomplete biocViews search”
Lets work through a small example to illustrate how straight-forward a copy number analysis can be once you've figured out all the logistics. We will also find the genes that lie within the detected copy number regions.
For this analysis, I chose the cn.mops package as it helps us with
We start by downloading relevant files, if necessary
## set path/to/download/directory, e.g.,
## destdir <- "~/bam/copynumber"
stopifnot(file.exists(destdir))
bamFiles <- file.path(destdir,
c("tumorA.chr4.bam", "normalA.chr4.bam"))
urls <- paste0("http://s3.amazonaws.com/copy-number-analysis/",
basename(bamFiles))
for (i in seq_along(bamFiles))
if (!file.exists(bamFiles[i])) {
download.file(urls[i], bamFiles[i])
download.file(paste0(urls[i], ".bai"), paste0(bamFiles[i], ".bai"))
}
The main work flow 1) loads the library; 2) counts reads; 3) normalizes counts; 4) detects CNVs; and 5) visualizes results.
## 1. Load the cn.mops package
suppressPackageStartupMessages({
library(cn.mops)
})
## 2. We can bin and count the reads
reads_gr <- getReadCountsFromBAM(BAMFiles = bamFiles,
sampleNames = c("tumor", "normal"),
refSeqName = "chr4", WL = 10000, mode = "unpaired")
## 3. Noramlization
## We need a special normalization because the tumor has many large CNVs
X <- normalizeGenome(reads_gr, normType="poisson")
## 4. Detect cnv's
ref_analysis <- referencecn.mops(X[,1], X[,2],
norm=0,
I = c(0.025, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16, 32, 64),
classes = paste0("CN", c(0:8, 16, 32, 64, 128)),
segAlgorithm="DNAcopy")
## Analyzing: Sample.1
resCNMOPS <- calcIntegerCopyNumbers(ref_analysis)
## 5. Visualize the cnv's
segplot(resCNMOPS)
Here the x-axis represents the genomic position and the y-axis represents the log ratio of read counts and copy number call of each segment (red)
human_cn <- cnvr(resCNMOPS)
human_cn
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | tumor
## <Rle> <IRanges> <Rle> | <factor>
## [1] chr4 [ 1, 49340000] * | CN4
## [2] chr4 [ 49480001, 49670000] * | CN5
## [3] chr4 [ 52660001, 59740000] * | CN8
## [4] chr4 [ 59780001, 62490000] * | CN8
## [5] chr4 [190470001, 190690000] * | CN3
## [6] chr4 [190790001, 191050000] * | CN4
## -------
## seqinfo: 1 sequence from an unspecified genome
To find the genes that lie in these copy number regions, we will use the TranscriptDb object for hg19
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## subset to work with only chr4
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, "chr4")
genes0 <- genes(txdb)
## 'unlist' so that each range is associated with a single gene identifier
idx <- rep(seq_along(genes0), elementLengths(genes0$gene_id))
genes <- granges(genes0)[idx]
genes$gene_id = unlist(genes0$gene_id)
Next we will use find overlaps to assign gene identifiers to cnv regions.
olaps <- findOverlaps(genes, human_cn, type="within")
idx <- factor(subjectHits(olaps), levels=seq_len(subjectLength(olaps)))
human_cn$gene_ids <- splitAsList(genes$gene_id[queryHits(olaps)], idx)
human_cn
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | tumor
## <Rle> <IRanges> <Rle> | <factor>
## [1] chr4 [ 1, 49340000] * | CN4
## [2] chr4 [ 49480001, 49670000] * | CN5
## [3] chr4 [ 52660001, 59740000] * | CN8
## [4] chr4 [ 59780001, 62490000] * | CN8
## [5] chr4 [190470001, 190690000] * | CN3
## [6] chr4 [190790001, 191050000] * | CN4
## gene_ids
## <CharacterList>
## [1] 100126332,100129917,100129931,...
## [2]
## [3] 100288413,100506462,100506564,...
## [4]
## [5]
## [6] 100288255,22947,2483,...
## -------
## seqinfo: 1 sequence from an unspecified genome
The packages and versions used in this work flow are as follows:
restoreSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene)
## TxDb object:
## | Db type: TxDb
## | Supporting package: GenomicFeatures
## | Data source: UCSC
## | Genome: hg19
## | Organism: Homo sapiens
## | UCSC Table: knownGene
## | Resource URL: http://genome.ucsc.edu/
## | Type of Gene ID: Entrez Gene ID
## | Full dataset: yes
## | miRBase build ID: GRCh37
## | transcript_nrow: 82960
## | exon_nrow: 289969
## | cds_nrow: 237533
## | Db created by: GenomicFeatures package from Bioconductor
## | Creation time: 2014-09-26 11:16:12 -0700 (Fri, 26 Sep 2014)
## | GenomicFeatures version at creation time: 1.17.17
## | RSQLite version at creation time: 0.11.4
## | DBSCHEMAVERSION: 1.0
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] DNAcopy_1.40.0
## [2] cn.mops_1.12.0
## [3] fission_0.99.4
## [4] sva_3.12.0
## [5] mgcv_1.8-3
## [6] nlme_3.1-118
## [7] Gviz_1.10.1
## [8] ggplot2_1.0.0
## [9] PoiClaClu_1.0.2
## [10] RColorBrewer_1.0-5
## [11] gplots_2.14.2
## [12] DESeq2_1.6.1
## [13] RcppArmadillo_0.4.450.1.0
## [14] Rcpp_0.11.3
## [15] org.Hs.eg.db_3.0.0
## [16] RSQLite_1.0.0
## [17] DBI_0.3.1
## [18] ShortRead_1.24.0
## [19] BiocParallel_1.0.0
## [20] VariantAnnotation_1.12.2
## [21] RNAseqData.HNRNPC.bam.chr14_0.3.2
## [22] GenomicAlignments_1.2.0
## [23] Rsamtools_1.18.1
## [24] genefilter_1.48.1
## [25] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [26] BSgenome_1.34.0
## [27] rtracklayer_1.26.1
## [28] airway_0.99.5
## [29] ALL_1.7.1
## [30] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0
## [31] GenomicFeatures_1.18.1
## [32] AnnotationDbi_1.28.0
## [33] Biobase_2.26.0
## [34] GenomicRanges_1.18.1
## [35] GenomeInfoDb_1.2.2
## [36] Biostrings_2.34.0
## [37] XVector_0.6.0
## [38] IRanges_2.0.0
## [39] S4Vectors_0.4.0
## [40] BiocGenerics_0.12.0
## [41] BiocStyle_1.4.1
## [42] LearnBioconductor_0.1.6
##
## loaded via a namespace (and not attached):
## [1] BBmisc_1.7 BatchJobs_1.4 BiocInstaller_1.16.0
## [4] Formula_1.1-2 Hmisc_3.14-5 KernSmooth_2.23-13
## [7] MASS_7.3-35 Matrix_1.1-4 R.methodsS3_1.6.1
## [10] RCurl_1.95-4.3 XML_3.98-1.1 acepack_1.3-3.3
## [13] annotate_1.44.0 base64enc_0.1-2 biomaRt_2.22.0
## [16] biovizBase_1.14.0 bitops_1.0-6 brew_1.0-6
## [19] caTools_1.17.1 checkmate_1.5.0 cluster_1.15.3
## [22] codetools_0.2-9 colorspace_1.2-4 dichromat_2.0-0
## [25] digest_0.6.4 evaluate_0.5.5 fail_1.2
## [28] foreach_1.4.2 foreign_0.8-61 formatR_1.0
## [31] gdata_2.13.3 geneplotter_1.44.0 gtable_0.1.2
## [34] gtools_3.4.1 hwriter_1.3.2 iterators_1.0.7
## [37] knitr_1.7 labeling_0.3 lattice_0.20-29
## [40] latticeExtra_0.6-26 locfit_1.5-9.1 markdown_0.7.4
## [43] matrixStats_0.10.3 mime_0.2 munsell_0.4.2
## [46] nnet_7.3-8 plyr_1.8.1 proto_0.3-10
## [49] reshape2_1.4 rpart_4.1-8 scales_0.2.4
## [52] sendmailR_1.2-1 splines_3.1.1 stringr_0.6.2
## [55] survival_2.37-7 tools_3.1.1 xtable_1.7-4
## [58] zlibbioc_1.12.0