--- title: "Working with Non-human Array" date: "`r BiocStyle::doc_date()`" package: sesame output: BiocStyle::html_document fig_width: 6 fig_height: 5 vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{2. Non-human Array} %\VignetteEncoding{UTF-8} --- ```{r message=FALSE, warning=FALSE, include=FALSE} library(wheatmap) library(dplyr) options(rmarkdown.html_vignette.check_title = FALSE) ``` SeSAMe provides extensive native support for the Illumina Mouse Methylation BeadChip array (referred to as the MM285 or MMB array) and the Horvath Mammal40 array (refered to as the Mammal40 array). The MM285 array contains ~285,000 probes covering over 20 design categories including gene promoters, enhancers, CpGs in synteny to human EPIC array as well as other biology. In SeSAMe, MM285 is used as the product abbreviation to distinguish future platforms including MM320. This documents describe the procedure to process the MM285 and the Mammal40 array. We first load required library and perform sesame data caching (only needed at new SeSAMe installation). ```{r nh1, message=FALSE, warning=FALSE} library(sesame) sesameDataCache() ``` # Species Inference SeSAMe supports automatic inference of the profiled organism. This is achieved by the `inferSpecies` function. Usually, users need not call this function explicitly and only need to specify the code `S` as part of the 2nd argument of the `openSesame` function. See [the Basics Usage vignette](sesame.html#prep) for more detail. The following example downloads an example Mammal40 array IDAT pair and call `openSesame` function with species inference (note the `S` in the `prep=` argument). Download test IDATs `GSM4411982_Grn.idat.gz` and `GSM4411982_Red.idat.gz` from https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Test ```{r nh3, message=FALSE, eval=FALSE} betas = openSesame(sprintf("~/Downloads/GSM4411982", tmp), prep="SHCDPM") ``` The above code is equivalent to ```{r nh4, eval=FALSE} ## equivalent to the above openSesame call betas = getBetas(matchDesign(pOOBAH(dyeBiasNL(inferInfiniumIChannel( prefixMaskButC(inferSpecies(readIDATpair( "~/Downloads/GSM4411982")))))))) ``` As can be seen, `inferSpecies` takes a SigDF as input and outputs an updated SigDF which contains a species-specific masking and color channel designation. This information is key to proper preprocessing since knowledge of the color channel designation and probe hybridization success is the foundational assumption of many preprocessing algorithms. One may instruct the function to return the inferred species information by using the `return.species = TRUE` argument. The following example shows this usage: ```{r nh13, message=FALSE, eval=TRUE} sdf = sesameDataGet("Mammal40.1.SigDF") # an example SigDF inferSpecies(sdf, return.species = TRUE) ``` Internally, we used a nonparametric scoring method to infer the most likely species from a pool of 310 candidate species from Ensembl (v101). The `return.auc = TRUE` argument allows one to peek into the AUC (Area Under the Curve) score generated in this inference. The higher the score, the more likely the data is from the corresponding species. Knowing the scores can help diagnose misclassifications such as when several candidate species are closely related and hard to discriminate from data. ```{r nh14, message=FALSE} ## showing the candidate species with top scores head(sort(inferSpecies(sdf, return.auc = TRUE), decreasing=TRUE)) ``` If the user believes that automatic inference gives wrong (most often still close-related) species, one can force species inference to a target species by using the `updateSigDF` function. For example, the following code forces the `SigDF` to be treated as a `mus_musculus` sample. Note this doesn't alter signal intensity but only change the probe masking and color channel spec (the view of the data, not the data reading itself). ```{r nh15, eval=FALSE} sdf_mouse <- updateSigDF(sdf, species="mus_musculus") ``` **CRITICAL:** Since `updateSigDF` function resets the whole mask and col column of SigDF. One should use this function (and `inferSpecies`) before other preprocessing functions that sets mask and col. # Mouse Strain Inference Like species inference, strain-specific masking and preprocessing is important for mouse array samples. This is achieved by the `inferStrain` function. The function is represented by the `T` code in `openSesame`/`prepSesame`. The following example shows how to use `inferStrain` in `openSesame`. Note the use of `T` in the prep code. Download test IDATs `204637490002_R05C01_Grn.idat` and `204637490002_R05C01_Red.idat` from https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Test ```{r nh2, message=FALSE, eval=FALSE} betas = openSesame("~/Downloads/204637490002_R05C01", prep="TQCDPB") ``` Like `inferSpecies`, one need to call the `inferStrain` function before calling the standard `noob`, `dyeBiasNL`, etc (by having `T` before `QCDPB` when calling `openSesame`). Also like `inferSpecies()`, `inferStrain()` will return a new `SigDF` with col and mask updated to reflect the change of strain. Optionally, one can also specify `return.strain=TRUE`, `return.pval=TRUE` or `return.probability=TRUE` to return the inferred strain, the p-value or the probabilities of all 37 strain candidates. Internally, the function converts the beta values to variant allele frequencies. It should be noted that since variant allele frequency is not always measured by the M-allele of the probe. SeSAMe flips the $\beta$ values for some probes to calculate variant allele frequency. The following example shows what `inferStrain` does to a `SigDF`: ```{r nh9, message=FALSE} sdf = sesameDataGet("MM285.1.SigDF") # an example dataset inferStrain(sdf, return.strain = TRUE) # return strain as a string sdf_after = inferStrain(sdf) # update mask and col by strain inference sum(sdf$mask) # before strain inference sum(sdf_after$mask) # after strain inference ``` Let's visualize the probabilities of all candidate strains using the `return.probabilities` option: ```{r nh10, fig.width=6, fig.height=4, message=FALSE} library(ggplot2) p = inferStrain(sdf, return.probability = TRUE) df = data.frame(strain=names(p), probs=p) ggplot(data = df, aes(x = strain, y = probs)) + geom_bar(stat = "identity", color="gray") + ggtitle("Strain Probabilities") + ylab("Probability") + xlab("") + scale_x_discrete(position = "top") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0), legend.position = "none") ``` See also [The Supplemental Vignette](https://zhou-lab.github.io/sesame/v1.16/supplemental.html#SNP) for heatmap validation of strain inference. # Quality Masking In the MM285 array, we designed multimapping probes to allow querying transposable elements and other biology. We also exposed probes with potentially design flaws. These suboptimally designed probes take a new probe ID prefix ("uk") in addition to the "cg"/"ch"/"rs" typically seen. By default the repeat and suboptimally designed probes are masked by `NA`. These masking is done by the `qualityMask` function (or `Q` in prep codes). To override masking these probes, one can use the `resetMask` function (the `0` code in `openSesame`) to remove the default masking. We recommend using it after the preprocessing function that depends on reliable/uniquely-mapped probes, but before detection p-value based masking (e.g. pOOBAH). This way, probes that fail detection can still be flagged (they should be). ```{r nh7, message=FALSE} sdf = sesameDataGet('MM285.1.SigDF') sum(is.na(openSesame(sdf, prep="TQCDPB"))) sum(is.na(openSesame(sdf, prep="TQCD0PB"))) ``` # Human-mouse Mixture UNDER CONSTRUCTION There are other inferences one can do on the nonhuman arrays, e.g., sex, age (epigenetic clocks), tissue, copy number alteration etc. These will be elaborated in [The Inference Vignette](inferences.html). # Session Info ```{r} sessionInfo() ```