---
title: "Penalized Precision Matrix Estimation in grasps"
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{pen_est}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: false
    comment: '#>'
    echo: true
    fig.align: "center"
    message: false
    results: 'hide'
    warning: false
---
## Preliminaries
Consider the following setting:
- **Gaussian graphical model (GGM) assumption:** \
The data $X_{n \times d}$ consists of independent and identically distributed
samples $X_1, \dots, X_n \sim N_d(\mu,\Sigma)$.
- **Disjoint group structure:** \
The $d$ variables can be partitioned into disjoint groups.
- **Goal:** \
Estimate the precision matrix $\Omega = \Sigma^{-1} = (\omega_{ij})_{d \times d}$.
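To fix ideas, here is a minimal base-R sketch (an illustration only, not part of
the package interface) that simulates data matching this setting, taking
$\mu = 0$ for simplicity. The covariance $\Sigma$ is block-diagonal along the
group partition, so the true precision matrix $\Omega = \Sigma^{-1}$ is zero
across groups.

```{r}
## toy data consistent with the setting above (illustration only)
set.seed(42)
n <- 200                           ## sample size
d <- 6                             ## number of variables
groups <- rep(1:2, each = 3)       ## disjoint group membership

## block-diagonal Sigma: correlation 0.4 within groups, 0 across groups
Sigma <- ifelse(outer(groups, groups, "=="), 0.4, 0)
diag(Sigma) <- 1
Omega <- solve(Sigma)              ## true precision matrix (block-diagonal)

## n i.i.d. draws from N_d(0, Sigma) via the Cholesky factor
X <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)
```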
## Sparse-Group Estimator
\begin{gather}
\hat{\Omega}(\lambda,\alpha,\gamma)
= {\arg\min}_{\Omega \succ 0} \left\{ -\log\det(\Omega) + \text{tr}(S\Omega)
+ \lambda P_{\alpha,\gamma}(\Omega) \right\},
\\[10pt]
P_{\alpha,\gamma}(\Omega)
= \alpha P^\text{idv}_\gamma(\Omega) + (1-\alpha) P^\text{grp}_\gamma(\Omega),
\\[10pt]
P^\text{idv}_\gamma(\Omega) = \sum_{i,j} p_\gamma(\vert\omega_{ij}\vert),
\\[5pt]
P^\text{grp}_\gamma(\Omega)
= \sum_{g,g^\prime} p_\gamma(\Vert\Omega_{gg^\prime}\Vert_F),
\end{gather}
where:
- $S = n^{-1} \sum_{i=1}^n (X_i-\bar{X})(X_i-\bar{X})^\top$ is the empirical
covariance matrix (computed explicitly in the sketch below).
- $\lambda \geq 0$ is the global regularization parameter controlling overall
shrinkage.
- $\alpha \in [0,1]$ is the mixing parameter controlling the balance between
element-wise and block-wise penalties.
- $\gamma$ is an additional parameter controlling the curvature and the
effective degree of nonconvexity of the penalty.
- $P_{\alpha,\gamma}(\Omega)$ is a generic bi-level penalty template that can
incorporate convex or non-convex regularizers while preserving the intrinsic
group structure among variables.
- $P^\text{idv}_\gamma(\Omega)$ is the element-wise individual penalty component.
- $P^\text{grp}_\gamma(\Omega)$ is the block-wise group penalty component.
- $p_\gamma(\cdot)$ is a penalty kernel parameterized by $\gamma$.
- $\Omega_{gg^\prime}$ is the submatrix of $\Omega$ formed by the rows in group
$g$ and the columns in group $g^\prime$.
- The Frobenius norm $\Vert\Omega\Vert_F$ is defined as
$\Vert\Omega\Vert_F = (\sum_{i,j} \vert\omega_{ij}\vert^2)^{1/2} = [\text{tr}(\Omega^\top\Omega)]^{1/2}$.
**Note**:
+ The regularization parameter $\lambda$ acts as the scale factor for the entire
penalty term $\lambda P_{\alpha,\gamma}(\Omega)$.
+ The penalty kernel $p_\gamma(\cdot)$ is the shape function that governs
the fundamental characteristics of the regularization.
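To make these definitions concrete, the following base-R sketch (continuing the
toy objects `X`, `Omega`, and `groups` from the Preliminaries) evaluates each
ingredient of the objective at the true $\Omega$, assuming the convex lasso
kernel $p(t) = \vert t \vert$. Note the $1/n$ scaling of $S$, in contrast to
the $1/(n-1)$ used by R's `cov()`. This illustrates the formulas; it is not the
package's solver.

```{r}
## empirical covariance with 1/n scaling (cov() would use 1/(n - 1))
Xc <- scale(X, center = TRUE, scale = FALSE)
S  <- crossprod(Xc) / n

lambda <- 0.2                      ## global regularization level
alpha  <- 0.5                      ## equal mix of the two penalty levels

## element-wise component: sum of |omega_ij| (lasso kernel)
P_idv <- sum(abs(Omega))

## block-wise component: sum of Frobenius norms over all group pairs
grp <- unique(groups)
P_grp <- sum(sapply(grp, function(g) sapply(grp, function(h)
  norm(Omega[groups == g, groups == h, drop = FALSE], type = "F"))))

## penalized negative log-likelihood objective at Omega
obj <- -as.numeric(determinant(Omega, logarithm = TRUE)$modulus) +
  sum(S * Omega) +                 ## equals tr(S %*% Omega) for symmetric Omega
  lambda * (alpha * P_idv + (1 - alpha) * P_grp)
```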
## Penalties
1. Lasso: Least absolute shrinkage and selection operator
[@tibshirani1996regression; @friedman2008sparse]
$$\lambda p(\omega_{ij}) = \lambda\vert\omega_{ij}\vert.$$
2. Adaptive lasso [@zou2006adaptive; @fan2009network]
$$
\lambda p_\gamma(\omega_{ij}) = \lambda\frac{\vert\omega_{ij}\vert}{v_{ij}},
$$
where $V = (v_{ij})_{d \times d} = (\vert\tilde{\omega}_{ij}\vert^\gamma)_{d \times d}$
is a matrix of adaptive weights, and $\tilde{\omega}_{ij}$ is the initial estimate
obtained using `penalty = "lasso"`.
3. Atan: Arctangent type penalty [@wang2016variable]
$$
\lambda p_\gamma(\omega_{ij})
= \lambda(\gamma+\frac{2}{\pi})
\arctan\left(\frac{\vert\omega_{ij}\vert}{\gamma}\right),
\quad \gamma > 0.
$$
4. Exp: Exponential type penalty [@wang2018variable]
$$
\lambda p_\gamma(\omega_{ij})
= \lambda\left[1-\exp\left(-\frac{\vert\omega_{ij}\vert}{\gamma}\right)\right],
\quad \gamma > 0.
$$
5. Lq [@frank1993statistical; @fu1998penalized; @fan2001variable]
$$
\lambda p_\gamma(\omega_{ij}) = \lambda\vert\omega_{ij}\vert^\gamma,
\quad 0 < \gamma < 1.
$$
6. LSP: Log-sum penalty [@candes2008enhancing]
$$
\lambda p_\gamma(\omega_{ij})
= \lambda\log\left(1+\frac{\vert\omega_{ij}\vert}{\gamma}\right),
\quad \gamma > 0.
$$
7. MCP: Minimax concave penalty [@zhang2010nearly]
$$
\lambda p_\gamma(\omega_{ij})
= \begin{cases}
\lambda\vert\omega_{ij}\vert - \dfrac{\omega_{ij}^2}{2\gamma},
& \text{if } \vert\omega_{ij}\vert \leq \gamma\lambda, \\
\dfrac{1}{2}\gamma\lambda^2,
& \text{if } \vert\omega_{ij}\vert > \gamma\lambda.
\end{cases}
\quad \gamma > 1.
$$
8. SCAD: Smoothly clipped absolute deviation [@fan2001variable; @fan2009network]
$$
\lambda p_\gamma(\omega_{ij})
= \begin{cases}
\lambda\vert\omega_{ij}\vert,
& \text{if } \vert\omega_{ij}\vert \leq \lambda, \\
\dfrac{2\gamma\lambda\vert\omega_{ij}\vert-\omega_{ij}^2-\lambda^2}{2(\gamma-1)},
& \text{if } \lambda < \vert\omega_{ij}\vert < \gamma\lambda, \\
\dfrac{\lambda^2(\gamma+1)}{2},
& \text{if } \vert\omega_{ij}\vert \geq \gamma\lambda.
\end{cases}
\quad \gamma > 2.
$$
**Note**:
+ For Lasso, which is convex, the additional parameter $\gamma$ is not required,
and the penalty kernel $p_\gamma(\cdot)$ simplifies to $p(\cdot)$.
+ For MCP and SCAD, $\lambda$ plays a dual role: it is the global regularization
parameter, but it is also implicitly contained within the kernel
$p_\gamma(\cdot)$.
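As a concrete example of this folded-in $\lambda$, the sketch below transcribes
the MCP and SCAD formulas above directly in base R, independently of the
package's `compute_penalty()`. The default $\gamma$ values shown ($3$ for MCP,
$3.7$ for SCAD) are conventional choices from the literature, not necessarily
the package defaults.

```{r}
## direct transcriptions of the MCP and SCAD penalties above,
## with lambda folded into the kernel as noted
mcp_pen <- function(w, lambda = 1, gamma = 3) {
  aw <- abs(w)
  ifelse(aw <= gamma * lambda,
         lambda * aw - aw^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}

scad_pen <- function(w, lambda = 1, gamma = 3.7) {
  aw <- abs(w)
  ifelse(aw <= lambda,
         lambda * aw,
         ifelse(aw < gamma * lambda,
                (2 * gamma * lambda * aw - aw^2 - lambda^2) / (2 * (gamma - 1)),
                lambda^2 * (gamma + 1) / 2))
}

## small |w| is penalized almost linearly; large |w| is capped at a constant
mcp_pen(c(0.1, 5)); scad_pen(c(0.1, 5))
```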
## Illustrative Visualization
@fig-pen compares the penalty functions $\lambda p(\omega)$ evaluated over
a range of $\omega$ values. The main panel (right) provides a wider view of
each penalty's behavior for larger $\vert\omega\vert$, while the inset panel
(left) magnifies the region $[-1, 1]$ around zero.
```{r fig.show='hide'}
library(grasps) ## for penalty computation
library(ggplot2) ## for visualization
penalties <- c("atan", "exp", "lasso", "lq", "lsp", "mcp", "scad")
pen_df <- compute_penalty(seq(-4, 4, by = 0.01), penalties, lambda = 1)
plot(pen_df, xlim = c(-1, 1), ylim = c(0, 1), zoom.size = 1) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))
```
```{r echo=FALSE, fig.show='hide'}
p_pen <- plot(pen_df, xlim = c(-1, 1), ylim = c(0, 1), zoom.size = 1) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))
```
::: {#fig-pen}
```{r echo=FALSE}
print(p_pen)
```
Illustrative penalty functions.
:::
@fig-deriv displays the derivative function $p^\prime(\omega)$ associated with
a range of penalty types.
The Lasso exhibits a constant derivative, corresponding to uniform shrinkage.
For MCP and SCAD, the derivatives are piecewise: initially equal to the Lasso
derivative, then decreasing over an intermediate region, and eventually dropping
to zero, indicating that large $\vert\omega\vert$ receive no shrinkage.
Other non-convex penalties show smoothly diminishing derivatives as
$\vert\omega\vert$ increases, reflecting their tendency to shrink small
$\vert\omega\vert$ strongly while exerting little to no shrinkage on large ones.
```{r fig.show='hide'}
deriv_df <- compute_derivative(seq(0, 4, by = 0.01), penalties, lambda = 1)
plot(deriv_df) +
  scale_y_continuous(limits = c(0, 1.5)) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))
```
```{r echo=FALSE, fig.show='hide'}
p_deriv <- plot(deriv_df) +
  scale_y_continuous(limits = c(0, 1.5)) +
  guides(color = guide_legend(nrow = 2, byrow = TRUE))
```
::: {#fig-deriv}
```{r echo=FALSE}
print(p_deriv)
```
Illustrative penalty derivatives.
:::
## References {-}