--- title: "Supervised Learning Tools for Deriving Biomarkers based on Single-Cell Data" date: "`r format(Sys.time(), '%d %B, %Y')`" author: - name: Tingting Zhan orcid: 0000-0001-9971-4844 email: tingtingzhan@gmail.com affiliations: - ref: tjuh - name: Inna Chervoneva orcid: 0000-0002-9104-4505 email: Inna.Chervoneva@jefferson.edu affiliations: - ref: tjuh affiliations: - id: tjuh name: Thomas Jefferson University & Hospitals address: 130 South 9th Street city: Philadelphia state: PA postal-code: 19107 format: html: page-layout: full html-math-method: katex toc: true toc-location: left toc-depth: 4 toc-title: '' editor: visual bibliography: hypergam.bib knitr: opts_chunk: collapse: true comment: "#>" vignette: > %\VignetteIndexEntry{applications} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- # Introduction This vignette provides examples of using `R` packages - **`groupedHyperframe`**, @groupedHyperframe, [`CRAN`](https://cran.r-project.org/package=groupedHyperframe), [Github](https://github.com/tingtingzhan/groupedHyperframe), [RPubs](https://rpubs.com/tingtingzhan/groupedHyperframe) - **`hyper.gam`**, @hyper_gam, [`CRAN`](https://cran.r-project.org/package=hyper.gam), [Github](https://github.com/tingtingzhan/hyper.gam), [RPubs](https://rpubs.com/tingtingzhan/hyper_gam) for deriving single index predictors of scalar outcomes based on spatial and non-spatial single-cell imaging data. ## Prerequisite Experimental (and maybe unstable) features are implemented **extremely frequently** on Github. [Active developers should use the Github version; suggestions and bug reports are welcome!]{style="background-color: #FFFF00"} ```{r} #| warning: false #| eval: false remotes::install_github('tingtingzhan/groupedHyperframe') remotes::install_github('tingtingzhan/hyper.gam') ``` Stable releases to `CRAN` are typically updated every 2 to 3 months, or when the authors have an upcoming manuscript in the peer-reviewing process. [Developers should **not** use the `CRAN` version!]{style="background-color: #FFFF00"} ```{r} #| warning: false #| eval: false utils::install.packages('groupedHyperframe') utils::install.packages('hyper.gam') ``` ### Dependencies Package **`hyper.gam`** `Imports` packages - **`caret`** [@caret, version `r packageVersion('caret')`, key dependency], for $k$-fold prediction (to be documented in future) - **`cli`** [@cli, version `r packageVersion('cli')`], for attractive command line interfaces (CLIs) - **`mgcv`** [@mgcv, version `r packageVersion('mgcv')`, key dependency], for fitting generalized additive models - **`nlme`** [@nlme, version `r packageVersion('nlme')`], to import the `S3` generic function `nlme::getData()` - **`parallel`** (shipped with vanilla `R`, version `r packageVersion('base')`), for parallel computing of $k$-fold prediction (to be documented in future) - **`plotly`** [@plotly, version `r packageVersion('plotly')`, key dependency], for interactive visualization using HTML widget - **`groupedHyperframe`** [@groupedHyperframe, version `r packageVersion('groupedHyperframe')`, key dependency], for data structure Package **`groupedHyperframe`** `Imports` packages - **`cli`** [@cli, version `r packageVersion('cli')`] - **`matrixStats`** [@matrixStats, version `r packageVersion('matrixStats')`, key dependency], for matrix arithmetic - **`parallel`** (shipped with vanilla `R`, version `r packageVersion('base')`) - **`pracma`** [@pracma, version `r packageVersion('pracma')`, key dependency], for (cumulative) trapezoidal integration - **`spatstat.explore`** (version `r packageVersion('spatstat.explore')`) and **`spatstat.geom`** (version `r packageVersion('spatstat.geom')`), @Baddeley15, key dependency, for spatial statistics - **`SpatialPack`** [@SpatialPack, version `r packageVersion('SpatialPack')`, key dependency], for Tjøstheim's coefficient of spatial association Since package **`spatstat.explore`** version 3.4-3.004, we have a `Registered S3 method overwritten` when loading packages **`groupedHyperframe`** and **`hyper.gam`**. The function name clash is between `spatstat.explore::plot.roc()` and `pROC::plot.roc()`. The package **`pROC`** [@pROC] is among the `Imports` of package **`caret`** [@caret]. ## Getting Started Examples in this vignette require that the `search` path has ```{r} library(groupedHyperframe) library(hyper.gam) library(survival) ``` ```{r} #| echo: false op = par(no.readonly = TRUE) #options(mc.cores = 1L) # for CRAN submission ``` ## Terms and Abbreviations | Term / Abbreviation | Description | |------------------------------------|------------------------------------| | [`|>`](https://search.r-project.org/R/refmans/base/html/pipeOp.html) | Forward pipe operator introduced since `R` 4.1.0, as well as the `_` placeholder | | [`::`](https://search.r-project.org/R/refmans/base/html/ns-dblcolon.html) | Explicitly-[namespace](https://search.r-project.org/R/refmans/base/html/ns-reflect.html)d function | | [`attr`](https://search.r-project.org/R/refmans/base/html/attr.html), [`attributes`](https://search.r-project.org/R/refmans/base/html/attributes.html) | Attributes | | [`contour`](https://search.r-project.org/R/refmans/graphics/html/contour.html) | Contour line, | | [`createDataPartition`](https://search.r-project.org/CRAN/refmans/caret/html/createDataPartition.html) | Test vs. training data set partition, from package **`caret`** [@caret] | | `csv`, [`read.csv`](https://search.r-project.org/R/refmans/utils/html/read.table.html) | (Read) comma-separated-value files | | [`coxph`](https://search.r-project.org/CRAN/refmans/survival/html/coxph.html) | Cox proportional hazards model, from package **`survival`** [@survival] | | [`gam`](https://search.r-project.org/CRAN/refmans/mgcv/html/gam.html) | Generalized additive models (GAM), from package **`mgcv`** [@mgcv] | | [`groupedHyperframe`](https://CRAN.R-project.org/package=groupedHyperframe/vignettes/groupedHyperframe.html) | Grouped hyper data frame, from package **`groupedHyperframe`** [@groupedHyperframe] | | `hypercolumns`, [`hyperframe`](https://search.r-project.org/CRAN/refmans/spatstat.geom/html/hyperframe.html) | (Hyper columns of) hyper data frame, from package **`spatstat.geom`** [@Baddeley05] | | [`htmlwidget`](https://search.r-project.org/CRAN/refmans/htmlwidgets/html/htmlwidgets-package.html) | HyperText Markup Language (HTML) widgets, from package **`htmlwidgets`** [@htmlwidgets], , | | [`inherits`](https://search.r-project.org/R/refmans/base/html/class.html) | Class inheritance | | [`L`](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Constants)-suffix | Create [integer](https://search.r-project.org/R/refmans/base/html/integer.html) constant | | [`mgcv::s`](https://search.r-project.org/CRAN/refmans/mgcv/html/s.html) | (Set up of) spline based smooths [@mgcv_s] | | [`mgcv::ti`](https://search.r-project.org/CRAN/refmans/mgcv/html/te.html) | Tensor product interaction [@mgcv_ti] | | [`persp`](https://search.r-project.org/R/refmans/graphics/html/persp.html) | Perspective plot, | | `PFS` | Progression/recurrence free survival, | | [`predict`](https://search.r-project.org/R/refmans/stats/html/predict.html) | Model predictions | | [`predict.gam`](https://search.r-project.org/CRAN/refmans/mgcv/html/predict.gam.html) | GAM model predictor | | [`quantile`](https://search.r-project.org/R/refmans/stats/html/quantile.html) | Quantile | | `S3`, `generic`, [`methods`](https://search.r-project.org/R/refmans/utils/html/methods.html) | `S3` object oriented system, [`UseMethod`](https://search.r-project.org/R/refmans/base/html/UseMethod.html); [`getS3method`](https://search.r-project.org/R/refmans/utils/html/getS3method.html); | | [`search`](https://search.r-project.org/R/refmans/base/html/search.html) | Search path for `R` objects | | [`Surv`](https://search.r-project.org/CRAN/refmans/survival/html/Surv.html) | Survival, i.e., time-to-event, object, from package **`survival`** [@survival] | | [`symbol`, `name`](https://search.r-project.org/R/refmans/base/html/name.html) | Names and symbols to refer to `R` objects | ## Acknowledgement The authors thank [Erjia Cui](https://orcid.org/0000-0003-3576-2892) for his contribution to function `hyper_gam()`. This work is supported by National Institutes of Health, U.S. Department of Health and Human Services grants - R01CA222847 ([I. Chervoneva](https://orcid.org/0000-0002-9104-4505), [T. Zhan](https://orcid.org/0000-0001-9971-4844), and [H. Rui](https://orcid.org/0000-0002-8778-261X)) - R01CA253977 (H. Rui and I. Chervoneva). ```{=html} ``` ```{=html} ``` # Data Structure Single-cell multiplex immuno-fluorescence immunohistochemistry (mIF-IHC) imaging data are the result of digital processing of the microscopic images of tissue stained with selected antibodies. Quantitative pathology platforms, e.g., Akoya or QuPath, support cell segmentation of mIF-IHC images and quantification of the mean protein expression in each cell. The cell centroid coordinates and cell signal intensities (CSIs) for each stained protein are usually extracted as individual comma-separated values `.csv` files. For each cell in a tissue image, the data include the cell centroid coordinates and cell signal intensity (CSI) for each quantified protein expression. The data may have multiple levels of hierarchical clustering. For example, single cells are clustered within a Region of Interest (ROI) or a tissue core, ROIs are clustered within a tissue or tissue cores are clustered within a patient. # Quantile Index Applications based on the Quantile Index (QI) methodology is described in our peer-reviewed publications @Yi23a; @Yi23b; @Yi25. ## Example Data example **`Ki67`** included in package **`groupedHyperframe`** ([Github](https://github.com/tingtingzhan/groupedHyperframe), [`CRAN`](https://CRAN.R-project.org/package=groupedHyperframe)) is a *grouped hyper data frame*, an extension of the hyper data frame `hyperframe` object defined in package **`spatstat.geom`** [@Baddeley15; @Baddeley05]. The numeric-hypercolumn *`logKi67`*, whose elements are numeric vectors of different lengths, contains the log-transformed Ki67 protein expression CSIs in each *`tissueID`* nested in *`patientID`*. Such nested grouping structure is denoted by *`~patientID/tissueID`* following the nomenclature of package **`nlme`** [@Pinheiro00; @nlme]. The data example **`Ki67`** also contains the metadata including the outcome of interest, e.g., progression free survival *`PFS`*, *`Her2`*, *`HR`*, etc. Detailed information about the `groupedHyperframe` class may be found in package **`groupedHyperframe`** vignettes ([RPubs](https://rpubs.com/tingtingzhan/groupedHyperframe), [`CRAN`](https://CRAN.R-project.org/package=groupedHyperframe/vignettes/groupedHyperframe.html)), section *Grouped Hyper Data Frame*. ```{r} data(Ki67, package = 'groupedHyperframe') Ki67 ``` ## 1️⃣ Compute Aggregated Quantiles Function `aggregate_quantile()` first converts each element of the numeric-hypercolumn *`logKi67`* into sample `quantile`s at a pre-specified grid of `prob`abilitie`s` $\{p_k, k=1,\cdots,K \} \in [0,1]$, then aggregates the quantiles of multiple *`tissueID`*'s per *`patientID`* by point-wise means (default of parameter `f_aggr_`). Note that the aggregation must be performed at the level of biologically *independent* clusters, e.g., *`~patientID`*, to produce independent quantile predictors. ```{r} #| message: false Ki67q = Ki67 |> aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01)) ``` The returned object *`Ki67q`* is a hyper data frame `hyperframe` with a numeric-hypercolumn of aggregated sample quantiles *`logKi67.quantile`* per *`patientID`*. Users are encouraged to learn more about the function `aggregate_quantile()` from package **`groupedHyperframe`** vignettes ([RPubs](https://rpubs.com/tingtingzhan/groupedHyperframe), [`CRAN`](https://CRAN.R-project.org/package=groupedHyperframe/vignettes/groupedHyperframe.html)), section *Grouped Hyper Data Frame*, subsection *From `data.frame`*. ```{r} Ki67q |> head() ``` ## 2️⃣ Estimate Integrand Surface Linear quantile index (QI) (@eq-QI) is a predictor in a functional generalized linear model [@James02] for outcomes from the exponential family of distributions, or a linear functional Cox model [@Gellar15] for survival outcomes, $$ \text{QI}_{i}=\int_{0}^{1} \beta(p)Q_i(p)dp $$ {#eq-QI} where $Q_i(p)$ is the (aggregated) sample quantiles *`logKi67.quantile`* for the $i$-th subject, and $\beta(p)$ is the unknown coefficient function to be estimated. We use function `hyper_gam()` to fit a generalized additive model `gam` with integrated *linear spline-based* smoothness estimation [function `mgcv::s()`, @mgcv_s]. This is a scalar-on-function model [@Reiss17] that predicts a **scalar** outcome (e.g., progression free survival time *`PFS[,1L]`*) using the aggregated quantiles **function** as a functional predictor. ```{r} m0 = hyper_gam(PFS ~ logKi67.quantile, data = Ki67q) ``` ````{=html} ```` Nonlinear quantile index (nlQI) (@eq-nlQI) is a predictor in the functional generalized additive model [@McLean14] for outcomes from the exponential family of distributions, or an additive functional Cox model [@Cui21] for survival outcomes. $$ \text{nlQI}_{i}= \int_{0}^{1} F\big(p, Q_i(p)\big)dp $$ {#eq-nlQI} where $F(\cdot,\cdot)$ is an unknown bivariate twice differentiable function. We use function `hyper_gam(., nonlinear = TRUE)` to fit a generalized additive model `gam` with *tensor product interaction* estimation [function `mgcv::ti()`, @mgcv_ti]. ```{r} m1 = hyper_gam(PFS ~ logKi67.quantile, data = Ki67q, nonlinear = TRUE) ``` ### `S3` Class `hyper_gam` The fitted functional model *`m0`* and *`m1`* have the `S3` class `hyper_gam`, which inherits from the `S3` class `gam` from package **`mgcv`** [@mgcv]. The `hyper_gam` class has an additional attribute, - `attr(., 'xname')`, a `symbol` of the hypercolumn name. Such inheritance enables the use of all `S3` method dispatches on `gam` objects from package **`mgcv`** on the `hyper_gam` objects. ### Visualization Function `integrandSurface()` creates an interactive `htmlwidget` [@htmlwidgets] visualization of the estimated integrand surfaces for the linear (@eq-QI) or nonlinear quantile index (@eq-nlQI) using package **`plotly`** [@plotly]. The integrand surfaces, defined on $p\in[0,1]$ and $q\in\text{range}\big\{Q_i(p), i=1,\cdots,n\big\}$, are $$ \hat{S}(p,q) = \begin{cases} \hat{\beta}(p)\cdot q\\ \hat{F}(p,q) \end{cases} $$ {#eq-S} Also in this interactive visualization are - the contour lines on the integrand surfaces (@eq-S), as well as their projections along the $s$-axis, i.e., onto the $(p,q)$-plane (a.k.a., the "floor"); - the estimated linear **integrand paths** $\hat{\beta}(p)Q_i(p)$ or the nonlinear integrand paths $\hat{F}(p, Q_i(p))$ on the integrand surfaces (@eq-S); - the sample quantiles $Q_i(p)$, i.e., the ***projections*** of the estimated linear or nonlinear integrand path along the $s$-axis, i.e., onto the $(p,q)$-plane (a.k.a., the "floor"); - the ***projections*** of the estimated linear or nonlinear integrand path along the $q$-axis, i.e., onto the $(p,s)$-plane (a.k.a., the "backwall"), so that the area under each projected path is equal to the estimated linear (@eq-QI) or nonlinear quantile index (@eq-nlQI). @fig-integrandSurface is an interactive `htmlwidget` visualization of the nonlinear integrand surface, integrand paths and their projections to the "floor" and "backwall". Users should remove the argument `n` in `integrandSurface(, n=101L)`, and use the default `n=501L` instead, for a more refined surface. We must use `n=101L` to reduce the `htmlwidget` object size, in order to comply with `CRAN` and/or [RPubs](https://rpubs.com) file size limit. For the same reason, the interactive visualization of the linear integrand surface is suppressed in this vignette. Users are strongly encouraged to interact with it on their local device. ```{r} #| eval: false #| fig-width: 5 #| fig-height: 5 m0 |> integrandSurface() # please interact with it on your local computer ``` ```{r} #| eval: true #| label: fig-integrandSurface #| fig-width: 5 #| fig-height: 5 #| fig-align: left #| fig-cap: 'Nonlinear integrand surface, integrand paths and their projections to the "floor" and "backwall"' m1 |> integrandSurface(n = 101L) ``` Static illustrations of the estimated integrand surfaces, e.g., the `persp`ective (`S3` method dispatch `persp.hyper_gam()`) and `contour` (`S3` method dispatch `contour.hyper_gam()`) plots, are produced by calling the `S3` generics `persp()` and `contour()` in package **`graphics`** shipped with vanilla **`R`**. These static figures are suppressed to reduce the file size of this vignette. ```{r} #| warning: false #| fig-show: hide m0 |> persp() # a static figure ``` ```{r} #| warning: false #| fig-show: hide m0 |> contour() # a static figure ``` ```{r} #| warning: false #| fig-show: hide m1 |> persp() # a static figure ``` ```{r} #| warning: false #| fig-show: hide m1 |> contour() # a static figure ``` Visualization of the integrand surface (@eq-S) in functions `integrandSurface()`, `persp.hyper_gam()` and `contour.hyper_gam()` is inspired by function `mgcv::vis.gam()`. Visualization of the *integrand paths*, as well as their projections on the $(p,q)$- and $(p,s)$-plane, is an original idea and design by Tingting Zhan. ## 3️⃣ Compute Quantile Index Predictor ```{=html} ``` Linear and nonlinear quantile indices are the predictors in the functional models (@eq-QI) and (@eq-nlQI), respectively. Let's consider a conventional scenario that we first fit a `hyper_gam` model to the training data set, then compute the quantile index predictors in the training and/or test data set using the training model. First, we partition the 622 patients in hyper data frame *`Ki67q`* into a training data set with 498 patients and a test data set with 124 patients, i.e., a 80% vs. 20% partition. ```{r} set.seed(16); id = Ki67q |> nrow() |> seq_len() |> caret::createDataPartition(p = .8) Ki67q_0 = Ki67q[id[[1L]],] # training set Ki67q_1 = Ki67q[-id[[1L]],] # test set ``` Next, we fit a functional generalized additive model to the the training data set *`Ki67q_0`*, ```{r} m1a = hyper_gam(PFS ~ logKi67.quantile, nonlinear = TRUE, data = Ki67q_0) ``` The `S3` method dispatch `hyper.gam::predict.hyper_gam()` calculates the quantile index predict[**or**]{style="background-color: #FFFF00"}s of the training and/or test data set, based on the training model *`m1a`*. The returned value is a `numeric` `vector`. This is a convenient wrapper and slight modification of the function `mgcv::predict.gam()`. The use of `S3` generic `stats::predict()`, which is typically for predict[**ed**]{style="background-color: #FFFF00"} values, could be confusing, but we choose to follow the practice and nomenclature of function `mgcv::predict.gam()`. We can, but we ***should not***, use the quantile index predictors of the training data set for downstream analysis, because these quantile index predictors are optimized on the training data set and the results would be optimistically biased. ```{r} #| code-fold: true #| code-summary: "Optimistically biased!!" Ki67q_0[,c('PFS', 'age', 'race')] |> as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe() data.frame(nlQI = predict(m1a, newdata = Ki67q_0)) |> coxph(formula = PFS ~ age + nlQI, data = _) ``` Instead, we should use the quantile index predictors computed in the test data set for downstream analysis, ```{r} Ki67q_1[,c('PFS', 'age', 'race')] |> as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe() data.frame(nlQI = predict(m1a, newdata = Ki67q_1)) |> coxph(formula = PFS ~ age + nlQI, data = _) ``` ````{=html} ```` # Spatial Index 🚧 This section is under construction # References ::: {#refs} :::