| Type: | Package | 
| Title: | Marker Gene Detection via Penalized Principal Component Analysis | 
| Version: | 0.1.1 | 
| Date: | 2021-03-14 | 
| Author: | Yixuan Qiu, Jiebiao Wang, Jing Lei, and Kathryn Roeder | 
| Maintainer: | Yixuan Qiu <yixuan.qiu@cos.name> | 
| Description: | Implementation of the 'MarkerPen' algorithm, short for marker gene detection via penalized principal component analysis, described in the paper by Qiu, Wang, Lei, and Roeder (2020, <doi:10.1101/2020.11.07.373043>). 'MarkerPen' is a semi-supervised algorithm for detecting marker genes by combining prior marker information with bulk transcriptome data. | 
| License: | GPL-2 | GPL-3 [expanded from: GPL] | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| Depends: | R (≥ 3.5.0) | 
| Imports: | Rcpp (≥ 1.0.1), RSpectra, stats | 
| LinkingTo: | Rcpp, RcppEigen, RSpectra | 
| Suggests: | knitr, rmarkdown, prettydoc, scales | 
| SystemRequirements: | C++11 | 
| VignetteBuilder: | knitr, rmarkdown | 
| RoxygenNote: | 7.1.1 | 
| NeedsCompilation: | yes | 
| Packaged: | 2021-03-14 11:41:12 UTC; qyx | 
| Repository: | CRAN | 
| Date/Publication: | 2021-03-17 00:30:02 UTC | 
Mapping gene names to Ensembl IDs
Description
A data set showing the mapping between gene names and Ensembl gene IDs, derived from the EnsDb.Hsapiens.v79 Bioconductor package.
Usage
gene_mapping
Format
A data frame with 59074 rows and 2 variables:
- ensembl
- Ensembl gene IDs 
- name
- corresponding gene names 
Source
https://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
Penalized Principal Component Analysis for Marker Gene Selection
Description
This function solves the optimization problem
\min\quad-\mathrm{tr}(SX) + \lambda p(X),
s.t.\quad O\preceq X \preceq I, \quad X \ge 0, \quad\mathrm{and}\quad \mathrm{tr}(X)=1,
where O\preceq X \preceq I means all eigenvalues of X are
between 0 and 1, X \ge 0 means all elements of X are nonnegative,
and p(X) is a penalty function defined in the article
(see the References section).
Usage
pca_pen(
  S,
  gr,
  lambda,
  w = 1.5,
  alpha = 0.01,
  maxit = 1000,
  eps = 1e-04,
  verbose = 0
)
Arguments
| S | The sample correlation matrix of gene expression. | 
| gr | Indices of genes that are treated as markers in the prior information. | 
| lambda | Tuning parameter to control the sparsity of eigenvectors. | 
| w | Tuning parameter to control the weight on prior information.
Larger  | 
| alpha | Step size of the optimization algorithm. | 
| maxit | Maximum number of iterations. | 
| eps | Tolerance parameter for convergence. | 
| verbose | Level of verbosity. | 
Value
A list containing the following components:
- projection
- The estimated projection matrix. 
- evecs
- The estimated eigenvectors. 
- niter
- Number of iterations used in the optimization process. 
- err_v
- The optimization error in each iteration. 
References
Qiu, Y., Wang, J., Lei, J., & Roeder, K. (2020). Identification of cell-type-specific marker genes from co-expression patterns in tissue samples.
Examples
set.seed(123)
n = 200  # Sample size
p = 500  # Number of genes
s = 50   # Number of true signals
# The first s genes are true markers, and others are noise
Sigma = matrix(0, p, p)
Sigma[1:s, 1:s] = 0.9
diag(Sigma) = 1
# Simulate data from the covariance matrix
x = matrix(rnorm(n * p), n) %*% chol(Sigma)
# Sample correlation matrix
S = cor(x)
# Indices of prior marker genes
# Note that we have omitted 10 true markers, and included 10 false markers
gr = c(1:(s - 10), (s + 11):(s + 20))
# Run the algorithm
res = pca_pen(S, gr, lambda = 0.1, verbose = 1)
# See if we can recover the true correlation structure
image(res$projection, asp = 1)
Marker Gene Selection via Penalized Principal Component Analysis
Description
This function refines a prior marker gene list by combining information from bulk tissue data, based on the penalized principal component analysis. The current implementation computes on one cell type at a time. To get marker genes for multiple cell types, call this function iteratively.
Usage
refine_markers(
  mat_exp,
  range,
  markers,
  lambda,
  w = 1.5,
  thresh = 0.001,
  alpha = 0.01,
  maxit = 1000,
  eps = 1e-04,
  verbose = 0
)
Arguments
| mat_exp | The gene expression matrix in the original scale (not logarithm-transformed), with rows standing for observations and columns for genes. The matrix should include gene names as column names. | 
| range | A character vector of gene names, representing the range of genes in which markers are sought. | 
| markers | A character vector of gene names giving the prior marker gene list. | 
| lambda | A tuning parameter to control the number of selected marker genes. A larger value typically means a smaller number of genes. | 
| w | Tuning parameter to control the weight on prior information.
Larger  | 
| thresh | Below this threshold small factor loadings are treated as zeros. | 
| alpha | Step size of the optimization algorithm. | 
| maxit | Maximum number of iterations. | 
| eps | Tolerance parameter for convergence. | 
| verbose | Level of verbosity. | 
Value
A list containing the following components:
- spca
- The sparse PCA result as in - pca_pen().
- markers
- A character vector of selected markers genes. 
- markers_coef
- The estimated factor loadings for the associated genes. 
References
Qiu, Y., Wang, J., Lei, J., & Roeder, K. (2020). Identification of cell-type-specific marker genes from co-expression patterns in tissue samples.
Examples
# Data used in the vignette
load(system.file("examples", "gene_expr.RData", package = "markerpen"))
load(system.file("examples", "published_markers.RData", package = "markerpen"))
load(system.file("examples", "markers_range.RData", package = "markerpen"))
# Get expression matrix - rows are observations, columns are genes
ind = match(rownames(dat), markerpen::gene_mapping$name)
ind = na.omit(ind)
ensembl = markerpen::gene_mapping$ensembl[ind]
mat_exp = t(dat[markerpen::gene_mapping$name[ind], ])
colnames(mat_exp) = ensembl
# We compute the marker genes for two cell types with a reduced problem size
# See the vignette for the full example
# Markers for astrocytes
set.seed(123)
search_range = intersect(markers_range$astrocytes, ensembl)
search_range = sample(search_range, 300)
prior_markers = intersect(pub_markers$astrocytes, search_range)
ast_re = refine_markers(
    mat_exp, search_range, prior_markers,
    lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0
)
# Remove selected markers from the expression matrix
mat_rest = mat_exp[, setdiff(colnames(mat_exp), ast_re$markers)]
# Markers for microglia
search_range = intersect(markers_range$microglia, ensembl)
search_range = sample(search_range, 300)
prior_markers = intersect(pub_markers$microglia, search_range)
mic_re = refine_markers(
    mat_exp, search_range, prior_markers,
    lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0
)
# Refined markers
markers_re = list(astrocytes = ast_re$markers,
                  microglia  = mic_re$markers)
# Visualize the correlation matrix
cor_markers = cor(mat_exp[, unlist(markers_re)])
image(cor_markers, asp = 1)
# Post-process the selected markers
# Pick the first 20 ordered markers
markers_ord = sort_markers(cor_markers, markers_re)
markers_ord = lapply(markers_ord, head, n = 20)
# Visualize the correlation matrix
image(cor(mat_exp[, unlist(markers_ord)]), asp = 1)
Post-processing Selected Marker Genes
Description
This function reorders the selected marker genes using information of the sample correlation matrix.
Usage
sort_markers(corr, markers)
Arguments
| corr | The sample correlation matrix, whose row and column names are gene names. | 
| markers | A list of marker genes. Each component of the list is a vector of marker
gene names corresponding to a cell type. All the gene names in this list
must appear in the row/column names of  | 
Value
A list that has the same structure as the input markers argument, with
the elements in each component reordered. See the example in
refine_markers().