--- title: "Introduction to SimBaRepro" author: "Xinlong Du" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true bibliography: references.bib vignette: > %\VignetteIndexEntry{Introduction to SimBaRepro} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction In the era of data-driven decision-making, access to detailed individual-level data has become critical both for research and industrial use. Accompanying this demand for personal data is the need for protecting personal privacy. Differential privacy (DP) has emerged as the gold standard for privacy protection, enabling the release of aggregated information while mathematically guaranteeing a quantifiable level of privacy at the individual level. Despite their benefits, DP mechanisms often complicate the sampling distributions and introduce biases, making it hard to conduct statistical inferences. The `SimBaRepro` package addresses this challenge by leveraging the *repro samples* framework---a simulation-based approach to statistical inference. Unlike asymptotic methods, which rely on large-sample approximations, `SimBaRepro` constructs confidence sets and hypothesis tests that are valid for finite samples while accounting for the known DP mechanism. Key functions include: - `p_value()`: Perform hypothesis testing with differentially-privatized data.\ - `get_CI()`: Generate finite-sample confidence intervals.\ - `confidence_grid()`: Find multivariate confidence regions.\ - `grid_projection()`: Project multivariate confidence regions to 2D confidence regions. - `plot_grid()`: Visualize 2D confidence regions with ggplot2. This package is based on the work of @Awan25. ## General Framework One of the central problems in statistical inference is the difficulty of correctly accounting for the uncertainties introduced in the data‐generating process. The “repro sample” framework provides a finite‐sample, simulation‐based approach that does not rely on asymptotic approximations (e.g. the CLT) or on a closed‐form likelihood. The key observation is that, while the marginal distribution of the data can be complex or even intractable, the data‐generating mechanism itself is often known and easy to simulate. This procedure produces valid confidence intervals and p-values. ### Data generating equation and the seed We assume that the observed summary statistic\ $$s \,\mathbin{\overset{d}{=}}\, G(\theta^*,u),$$ arises from a known, deterministic mapping\ $$G:\;\Theta\times\mathscr{U}\longrightarrow\mathscr{S},$$ together with a “seed”\ $$u \sim P,$$ whose distribution $P$ does not depend on the parameter $\theta^*$. Intuitively, $u$ captures all of the randomness in the data generation process (sampling noise, privacy noise, etc.), while $G$ describes how that randomness is transformed by the parameter $\theta^*$ into the observed summary statistic $s$. For example, if $X_1,...,X_R$ are sampled from $N(\mu^*,\sigma^{*2})$ and $s=(X_1,...,X_R)$, we can take $$G(\theta^*, u)=\mu^*+\sigma^*u$$ where $\theta^*=(\mu^*,\sigma^{*2})$ and $u\sim N(0,I_R)$. This also applies to more complex settings, such as when DP mechanisms are used. Let $X_1,...,X_R$ be i.i.d. samples from $N(\mu^*,\sigma^{*2})$, and are clamped before adding an additional Laplace noise to obtain $\tilde{X}_1,...,\tilde{X}_R$. 
Let $\theta^*=(\mu^*,\sigma^{*2})$ be the true parameter, and let $u=(u_\text{data}, u_\text{DP})$ be the noise consisting of the data-generating noise and the DP mechanism noise. We define the clamping function as
$$\text{clamp}[x]_a^b=\max\{a,\min\{b,x\}\}.$$
In this case, the generation of privatized samples can be expressed as
$$G(\theta^*,u)=\text{clamp}\left[\mu^*+\sigma^*u_\text{data}\right]_a^b+u_\text{DP},$$
where the data noise $u_\text{data}$ is sampled from $N(0,I_n)$ and the privacy noise $u_\text{DP}$ is sampled from Laplace$\left(0,\frac{b-a}{\epsilon}\right)$.

### Repro samples

In order to learn about a parameter $\theta$, we draw $R$ i.i.d. "repro seeds" $u_1,...,u_R\sim P$ and compute
$$s_i(\theta)=G(\theta,u_i).$$
At the true parameter $\theta^*$, the $s_i(\theta^*)$ are i.i.d. with the same distribution as the observed $s$. We then compare the observed $s$ to this "cluster" of simulated $\{s_i(\theta)\}_{i=1}^R$, with the intuition that parameters close to the truth should generate clusters that are "close" to $s$.

```{r, fig.cap="This figure shows the mapping from a parameter to a collection of repro samples via the generating function. In this case, the blue and red parameters generate repro samples that are 'close' to the observed statistic (they lie in the prediction set $B_\\alpha$), so they are included in the confidence set. On the other hand, the green parameter generates repro samples that are not 'close enough' to the observed statistic, hence it is not included in the confidence set $\\Gamma_\\alpha$. For additional details on the prediction set $B_\\alpha$ and confidence set $\\Gamma_\\alpha$, please refer to the paper. (Taken from Figure 1 in @Awan25.)", out.width="100%", echo=FALSE}
knitr::include_graphics("figs/figure1.png")
```

### Exchangeable statistic $T$

But how exactly do we measure the "distance" between the observed $s$ and a cluster of artificial data $\{s_i(\theta)\}_{i=1}^R$? @Awan25 makes use of an exchangeable function $T_\theta:\mathscr{S}\times\mathscr{S}^{R+1}\to[0,1]$, meaning that the function output is the same regardless of the order of the $R+1$ statistics. The first input to the function is the statistic whose "distance" to the collection of $R+1$ statistics we want to measure; the $R+1$ statistics consist of the observed statistic $s$ and the $R$ generated statistics $\{s_i(\theta)\}_{i=1}^R$. The function output should have the property that smaller outputs imply *less* similarity between the first argument and the point cloud in the second argument. In the package, this exchangeable statistic defaults to one based on the Mahalanobis distance, but the user also has the freedom to define a different $T$, as long as it satisfies the conditions specified above.

### p-values and confidence sets

Fix $R$ repro seeds $u_1,...,u_R\overset{\text{i.i.d.}}{\sim}P$, and let $s_1(\theta),...,s_R(\theta)$ denote the statistics generated by applying the generating equation with parameter $\theta$ to these seeds. Letting $s$ be the observed statistic, we define
$$T_\text{obs}(\theta)=T_\theta(s;s, s_1(\theta), ..., s_R(\theta))$$
and
$$T_{(i)}(\theta)=T_\theta(s_i(\theta);s, s_1(\theta), ..., s_R(\theta))$$
for $i=1,...,R$. In the case $\theta=\theta^*$ (statistics generated with the true parameter), the rank of $T_\text{obs}(\theta)$ among all $R+1$ statistics is uniformly distributed on $\{1,2,...,R+1\}$ by symmetry.
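To make this concrete, here is a small illustrative simulation (not package code, and using only base R): when the observed statistic and the $R$ repro statistics come from the same distribution, the rank of the observed depth value is uniform over $\{1,...,R+1\}$.

```{r}
# Illustration only: under exchangeability, the rank of the observed
# statistic's depth among all R + 1 depth values is uniform.
set.seed(1)
R <- 19
n_rep <- 2000
ranks <- replicate(n_rep, {
  vals <- rnorm(R + 1) # the observed statistic together with R repro statistics
  # any exchangeable depth works; here, a 1-d Mahalanobis-style depth
  depth <- 1 / (1 + (vals - mean(vals))^2 / var(vals))
  rank(depth, ties.method = "random")[1] # rank of the observed statistic's depth
})
table(ranks) / n_rep # each rank occurs with probability roughly 1 / (R + 1)
```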
By contrast, if the $R$ statistics are generated with a parameter far from the ground truth, then $T_\text{obs}(\theta)$ is expected to rank low compared to the other $\{T_{(i)}(\theta)\}_{i=1}^R$. Based on this, we can design a hypothesis test of the form
$$H_0:\theta\in\Theta_0\text{ vs. }H_1:\theta\notin \Theta_0.$$
In this test, the p-value can be defined as
$$p:=\sup_{\theta\in\Theta_0}\frac{1}{R+1}\Big(\big|\{i:T_{(i)}(\theta)\le T_\text{obs}(\theta)\}\big|+1\Big).$$
Furthermore, a $(1-\alpha)$-confidence set can be defined as the region
$$\left\{\theta:\frac{1}{R+1}\Big(\big|\{i:T_{(i)}(\theta)\le T_\text{obs}(\theta)\}\big|+1\Big)\ge\alpha\right\}.$$
Given the seeds, the generating equation, and a test statistic, this package automatically computes the p-values and confidence sets using numerical methods.

## Main Functions

### `p_value`

This function computes a p-value that can be used for hypothesis tests of the form
$$H_0:\theta\in \Theta_0\text{ vs. }H_1:\theta\not\in\Theta_0.$$
It identifies the parameter value (within the specified search space) that makes the observed statistic most "plausible" under simulated data.

#### **Inputs**

- **`lower_bds`, `upper_bds`**: Vectors defining the parameter search bounds, which describe $\Theta_0$.
- **`seeds`**: An array of random seeds used to generate fake statistics. Even though there are no strict requirements on the shape of `seeds` in principle, it is recommended to have $R$ rows, where $R$ is the number of *repro* samples. The number of columns depends on the actual distribution and the privacy mechanism (if there is one).
- **`generating_fun`**: A function of `seeds` and some parameter $\theta\in\Theta$. This function should be designed according to how `seeds` is defined. As a general rule of thumb, it should take each row of the seeds and produce one fake statistic (which may be multi-dimensional). The function should return an array in which each row represents a fake statistic.
- **`s_obs`**: A vector encoding the observed statistic.
- **`theta_init`**: An optional vector argument specifying the starting parameter $\theta\in\Theta$ for the search. If not specified, the midpoint `(lower_bds + upper_bds)/2` is used.
- **`T_stat`**: An optional function argument which takes an array `x`, an array `data` with $R+1$ rows, and `theta` as inputs. A permutation-invariant "depth" $T_\theta:\mathbb{R}^d\times\mathbb{R}^{(R+1)\times d}\to[0,1]$ is computed for each row of `x` against the entire array `data`. The returned object is a numeric vector of values between $0$ and $1$, where lower values correspond to "unusual" statistics. If not specified, this defaults to using the Mahalanobis distance (a sketch of a custom `T_stat` is given at the end of this section).
- **`verbose`**: An optional Boolean argument defaulting to `FALSE`. If `TRUE`, the function prints the `optim` messages at each search step, which is useful for debugging.
- **`check_input`**: An optional Boolean argument defaulting to `TRUE`. If `TRUE`, the input data types are checked automatically.

#### **Returns**

The `p_value` function returns a list of two objects: `p_val` and `theta_hat`. `theta_hat` is the parameter in the specified search region corresponding to the largest p-value, `p_val` (i.e., the parameter that makes the observed statistic most plausible within the null set). In a hypothesis test at significance level $\alpha$, if `p_val` is less than $\alpha$, the null hypothesis should be rejected. Otherwise, at least one parameter in the null set is compatible with the observed statistic, and we fail to reject.
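To illustrate the expected form of a user-supplied `T_stat`, here is a minimal sketch based on the Mahalanobis distance, mapped into $(0,1]$ so that smaller values flag more "unusual" statistics. This is only one possible construction and is not necessarily how the package's built-in default is implemented.

```{r}
# A sketch of a custom T_stat (illustrative only; the package default may be
# constructed differently). For each row of x, compute its Mahalanobis distance
# to the point cloud in `data`, then map it into (0, 1] so that smaller values
# correspond to more "unusual" statistics. `theta` is unused here but kept to
# match the required signature.
my_T_stat <- function(x, data, theta) {
  x <- matrix(x, ncol = ncol(data))
  center <- colMeans(data) # permutation-invariant in the rows of `data`
  cov_mat <- cov(data)     # likewise permutation-invariant
  d2 <- mahalanobis(x, center = center, cov = cov_mat) # squared distances
  1 / (1 + d2)
}
```

Such a function could then be supplied through the `T_stat` argument, e.g. something like `p_value(lower_bds, upper_bds, seeds, s_sample, s_obs, T_stat = my_T_stat)`.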
### `get_CI`

This function computes a confidence interval at the specified significance level using a bisection search method.

#### **Inputs** (only those not mentioned before)

- **`alpha`**: A numeric input specifying the significance level for the confidence interval.
- **`parameter_index`**: An index indicating which parameter the confidence interval is for. For example, if the parameter $\theta$ is defined as the vector `c(mean, variance)`, then setting `parameter_index` equal to $2$ means that we're finding a confidence interval for the variance.
- **`tol`**: The numerical tolerance allowed when computing the confidence interval. The returned confidence interval contains the true confidence interval, with its width bounded above by the width of the true confidence interval plus `tol`. A smaller tolerance gives a more precise confidence interval but requires more computation (the cost grows logarithmically in `1/tol`).

#### **Returns**

The `get_CI` function returns a vector of length $2$ giving the lower and upper endpoints of the confidence interval.

### `confidence_grid`

Using the `p_value` function as a subroutine, this function computes a multi-dimensional indicator array representing the confidence region, where a "$1$" indicates inclusion in the confidence set and a "$0$" indicates exclusion.

#### **Inputs**

- **`resolution`**: An integer specifying the number of partitions for each parameter. Setting `resolution` equal to $5$ means that each dimension is evenly partitioned into $5$ sections, forming a total of $5^d$ grid cells ($d$ being the dimension of the parameter space).

#### **Returns**

The `confidence_grid` function returns a list containing the indicator array (`ind_array`) and two vectors, `search_lower_bds` and `search_upper_bds`, representing the simultaneous confidence intervals for all parameters. The returned lower and upper bounds are not to be confused with the input lower and upper bounds, which specify the entire search region.

### `grid_projection`

While `confidence_grid` generates an indicator array, it may not be straightforward to visualize when there are more than two parameters. `grid_projection` projects an indicator array of any dimension down to a $2$-dimensional array along the specified dimensions, so that the result can be visualized using the `plot_grid` function.

#### **Inputs**

- **`indicator_array`**: The indicator array generated by the `confidence_grid` function.
- **`index_set`**: A vector of length $2$ specifying which two dimensions to project the indicator array onto. For example, if the parameter is three-dimensional, setting this input to `c(1, 3)` means that the returned array is the indicator array for the first and third dimensions.

#### **Returns**

This function returns a 2-dimensional indicator array (see the short sketch at the end of this section for the intuition behind the projection).

### `plot_grid`

#### **Inputs**

- **`indicator_array`**: A 2-dimensional indicator array, either generated by `confidence_grid` or by `grid_projection`.
- **`lower_bds`, `upper_bds`**: Length-2 vectors specifying the lower and upper bounds of the plot region. Note that if the indicator array was produced using the `grid_projection` function, the user must supply the lower and upper bounds for the two parameters of interest only.
- **`parameter_names`**: An optional vector argument specifying the names of the two parameters.
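Before moving on to the examples, it may help to see what the projection performed by `grid_projection` amounts to conceptually: a grid cell of the two retained parameters is kept whenever *some* setting of the dropped parameters lies in the confidence region. Below is a minimal sketch of this idea on a toy array; the package's own implementation may differ in its details, and `toy_array` is purely hypothetical.

```{r}
# Illustration of the projection idea on a toy 3-d indicator array
# (not the package's internal code).
toy_array <- array(rbinom(5 * 4 * 3, size = 1, prob = 0.3), dim = c(5, 4, 3))

# Keep dimensions 1 and 3: a projected cell is 1 if any cell along
# dimension 2 equals 1.
projected <- apply(toy_array, MARGIN = c(1, 3), max)
dim(projected) # 5 x 3
```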
## Example Workflows

In this section, we'll walk through a few simple examples to demonstrate some typical workflows with this package.

### Regular Normal

To begin, we consider a simple example where the data are drawn i.i.d. from a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$:
$$X_i\sim N(\mu, \sigma^2).$$
Suppose the observed statistics (sample mean and sample variance) are disclosed; how do we obtain confidence intervals (or sets) for the true population parameters?

First, let's load the package:

```{r setup}
library(SimBaRepro)
```

Specify the sample size and a large enough repro sample size (usually $200$ is enough).

```{r}
n <- 50           # sample size
repro_size <- 200 # number of repro samples
```

Generate a matrix of seeds with `repro_size` rows; these will be used to generate the 'fake' statistics. The number of columns depends on the specific problem. In this case, it has `n` columns, one for each of the i.i.d. data points.

```{r}
seeds <- matrix(rnorm(repro_size * n), nrow = repro_size, ncol = n)
```

We'll also write a function that, given some parameter and the random seeds defined previously, computes `repro_size` copies of the 'fake' statistics and stacks them vertically. In this case, the first component of `theta` represents the mean, and the second component represents the variance.

```{r}
s_sample <- function(seeds, theta) {
  # generate the raw data points
  data <- theta[1] + sqrt(theta[2]) * seeds[, 1:n]

  # compute the regular statistics
  s_mean <- apply(data, 1, mean)
  s_var <- apply(data, 1, var)
  return(cbind(s_mean, s_var))
}
```

Suppose the observed statistic is as follows:

```{r}
s_obs <- c(1.12, 0.67)
```

#### Regular Normal: The `p_value` function

With the above observed statistic, we want to conduct a hypothesis test of whether the true mean is $0$:
$$H_0:\mu=0, \quad H_1:\mu\ne 0.$$
What is the p-value of the test given the observed statistic above? The `p_value` function in the package comes in handy here. Set `lower_bds` and `upper_bds` to $(0, 10^{-6})$ and $(0, 10)$, respectively (choosing a large enough range for the variance), and use the seeds, generating function, and observed statistic from before:

```{r}
lower_bds <- c(0, 1e-6)
upper_bds <- c(0, 10)

result <- p_value(lower_bds, upper_bds, seeds, s_sample, s_obs)
print(result$p_val) # p-value
```

If the significance level of the test is $0.05$, then we reject the null hypothesis that the population mean is $0$.

Alternatively, we can conduct composite hypothesis tests:
$$H_0:\theta\in \Theta_0, \quad H_1:\theta\notin\Theta_0,$$
where $\Theta_0$ is of the form $[l_1,u_1]\times[l_2,u_2]\times\cdots$. In this case, the `p_value` function searches through the whole parameter space $\Theta_0$ and finds the parameter corresponding to the largest p-value. For example, if $\Theta_0=[-1,2]\times[1,5]$, we can do the following:

```{r}
lower_bds <- c(-1, 1) # lower bounds for mean and variance
upper_bds <- c(2, 5)  # upper bounds for mean and variance

result <- p_value(lower_bds, upper_bds, seeds, s_sample, s_obs)
print(result$p_val)     # the largest p-value found
print(result$theta_hat) # the parameter corresponding to the largest p-value
```

If the significance level is $0.05$, then we fail to reject the null hypothesis.

#### Regular Normal: The `get_CI` function

To go a step further, we can find confidence intervals for the true mean or variance using the `get_CI` function.
The extra things to specify here are `parameter_index` (indicating which parameter to get the confidence interval for), the significance level `alpha`, and the tolerance `tol` for the confidence interval. The returned confidence interval will cover the actual $(1-\alpha)$-confidence interval with at most a difference of `tol` in width.

First, let's find the confidence interval for the mean (the first component of the parameter vector).

```{r}
lower_bds <- c(-5, 1e-6)
upper_bds <- c(5, 10)

parameter_index <- 1 # an integer specifying the parameter of interest
alpha <- .05         # significance level
tol <- 1e-3          # tolerance

# parameter_index = 1 indicates that we're obtaining a confidence interval
# for the first component of the parameter vector; in this case, the mean.
mean_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_obs, tol)
print(mean_CI)
```

Similarly, we can get a confidence interval for the variance:

```{r}
parameter_index <- 2
var_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_obs, tol)
print(var_CI)
```

The methodology adopted here ensures that the obtained confidence interval contains the actual confidence interval, with its width bounded by the width of the actual confidence interval plus `tol`.

#### Regular Normal: The `confidence_grid` function

In many situations, the confidence interval for a single parameter is very loose and not very informative, especially when the number of parameters is large. The `confidence_grid` function in this package obtains a joint confidence set. The extra input needed here is `resolution`, which specifies the number of grid divisions for each component of the parameter. Since we already have a sense of where the confidence region lies from the previously obtained confidence intervals, we can set `lower_bds` and `upper_bds` to be much narrower to obtain a finer representation of the confidence region (which is valid only when we're using the same seeds as before).

```{r}
lower_bds <- c(0.5, 1e-6)
upper_bds <- c(1.5, 1.5)
resolution <- 13

result <- confidence_grid(alpha, lower_bds, upper_bds, seeds, s_sample, s_obs, tol, resolution)
indicator_array <- result$ind_array
print(result)
```

We can also retrieve the simultaneous confidence intervals for both parameters as follows:

```{r}
print(c(result$search_lower_bds[1], result$search_upper_bds[1]))
print(c(result$search_lower_bds[2], result$search_upper_bds[2]))
```

#### Regular Normal: The `plot_grid` function

This package includes a function called `plot_grid` which takes the indicator array above and visualizes it. Note that the `lower_bds` and `upper_bds` here must agree with the ones used to produce the indicator array.

```{r}
parameter_names <- c("mean", "variance")
plot_grid(indicator_array, lower_bds, upper_bds, parameter_names)
```

Note that the plot looks like a rotated version of the printed indicator array above; this is because array coordinates start from the top left, whereas plot coordinates start from the bottom left.

#### The `grid_projection` function

Parameter spaces are usually not just 2-dimensional but higher-dimensional. In these cases, the indicator array produced by the `confidence_grid` function is also a higher-dimensional array. The indicator array is useful on its own, but if we want to visualize it, we must project it down to $2$ dimensions via the `grid_projection` function. Take, for instance, a bivariate Gaussian model with true means $\mu_1=0$, $\mu_2=0$ and covariance matrix $\Sigma=\sigma^2 I_2$ (uncorrelated components with a common variance, here $\sigma^2=1$).
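In this model the parameter is $\theta=(\mu_1,\mu_2,\sigma^2)$ and, mirroring the code below, the generating equation maps each row of seeds (two blocks of $n$ standard normal entries) to the four summary statistics:
$$G(\theta,u)=\left(\bar{X},\,\bar{Y},\,S_X^2,\,S_Y^2\right),\qquad X_i=\mu_1+\sigma\,u_i^{(1)},\quad Y_i=\mu_2+\sigma\,u_i^{(2)},$$
where $u^{(1)},u^{(2)}\sim N(0,I_n)$ are the two halves of the seed vector, and $\bar{X},\bar{Y},S_X^2,S_Y^2$ denote the sample means and sample variances.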
First, we obtain the indicator array as usual:

```{r}
n <- 100          # sample size
repro_size <- 200 # number of repro samples

seeds <- matrix(rnorm(repro_size * (2 * n)), nrow = repro_size, ncol = 2 * n)

s_sample <- function(seeds, theta) {
  n_obs <- ncol(seeds) / 2
  mu1 <- theta[1]
  mu2 <- theta[2]
  sigma <- sqrt(theta[3])

  seeds1 <- seeds[, 1:n_obs, drop = FALSE]                 # first n columns for X
  seeds2 <- seeds[, (n_obs + 1):(2 * n_obs), drop = FALSE] # next n columns for Y

  # generate bivariate data
  data1 <- mu1 + sigma * seeds1 # X ~ N(mu_1, sigma^2)
  data2 <- mu2 + sigma * seeds2 # Y ~ N(mu_2, sigma^2)

  # compute statistics
  s_mean1 <- rowMeans(data1)     # sample mean of X
  s_mean2 <- rowMeans(data2)     # sample mean of Y
  s_var1 <- apply(data1, 1, var) # sample variance of X
  s_var2 <- apply(data2, 1, var) # sample variance of Y

  return(cbind(s_mean1, s_mean2, s_var1, s_var2))
}

s_obs <- c(-0.1, 0.1, 1.05, 0.9)

alpha <- 0.05
tol <- 1e-3
lower_bds <- c(-0.5, -0.5, 1e-6)
upper_bds <- c(0.5, 0.5, 2)
resolution <- 10

result <- confidence_grid(alpha, lower_bds, upper_bds, seeds, s_sample, s_obs, tol, resolution)
indicator_array <- result$ind_array
```

The statistics here are the two sample means and two sample variances, four real numbers in total. However, the number of parameters does not need to match the dimension of the statistic; in this case, there are only three parameters: two population means and a common variance. If we're interested in the confidence set involving only the two means $\mu_1, \mu_2$, we can choose the index set as follows:

```{r}
index_set <- c(1, 2)
means_array <- grid_projection(indicator_array, index_set)
```

To plot the joint confidence region for $\mu_1,\mu_2$, we then use the same lower and upper bounds used in the `confidence_grid` function, taking only the components corresponding to these two parameters. For example,

```{r}
means_lower_bds <- lower_bds[index_set]
means_upper_bds <- upper_bds[index_set]
parameter_names <- c("mu1", "mu2")

plot_grid(means_array, means_lower_bds, means_upper_bds, parameter_names)
```

Alternatively, if we want to plot the confidence regions for $\mu_1$ and $\sigma^2$, or for $\mu_2$ and $\sigma^2$, we can choose a different index set and repeat the above procedure:

```{r}
index_set2 <- c(1, 3)
mu1_var_array <- grid_projection(indicator_array, index_set2)

index_set3 <- c(2, 3)
mu2_var_array <- grid_projection(indicator_array, index_set3)
```

```{r, fig.show='hold'}
mu1_var_lower_bds <- lower_bds[index_set2]
mu1_var_upper_bds <- upper_bds[index_set2]
parameter_names <- c("mu1", "variance")
plot_grid(mu1_var_array, mu1_var_lower_bds, mu1_var_upper_bds, parameter_names)

mu2_var_lower_bds <- lower_bds[index_set3]
mu2_var_upper_bds <- upper_bds[index_set3]
parameter_names <- c("mu2", "variance")
plot_grid(mu2_var_array, mu2_var_lower_bds, mu2_var_upper_bds, parameter_names)
```

Although these 2D visualizations are useful, note that the projections by themselves are not enough to reconstruct the original confidence set: consider, for example, a cube that is hollow inside, whose axis-aligned projections are identical to those of a solid cube.

### DP Normal

Next, we consider the setting that this package was initially designed for, and where it shines: statistical analysis with differential privacy. With privacy mechanisms present, it is extremely hard to analyze the statistics and give closed-form expressions for confidence intervals/sets. Consider a basic setting where the data are drawn i.i.d. from a normal distribution with unknown mean $\mu$ and variance $\sigma^2$:
$$X_i\sim N(\mu,\sigma^2).$$

To protect the privacy of the data owners, the data center first clamps the data with publicly known lower and upper bounds `lower_clamp` and `upper_clamp`, then computes an $(\epsilon, 0)$-differentially private statistic $s_{DP}=(\text{private mean}, \text{private variance})$ before releasing it to the public. How do we conduct hypothesis tests with privacy constraints in place? How can we obtain confidence intervals/sets for the population mean and variance?

First, we use the same basic parameters as before:

```{r}
n <- 50           # sample size
repro_size <- 200 # number of repro samples
```

Since the privacy mechanism is also published by the data center, the clamps and the privacy guarantee parameter $\epsilon$ are publicly known. Note that these may differ depending on the specific privacy mechanism.

```{r}
upper_clamp <- 3
lower_clamp <- -3
eps <- 4 # privacy guarantee parameter
```

Now, in addition to the randomness in the data generation process, we also need the randomness associated with the privacy mechanism, so the seed matrix must contain both. Below, `regular_seeds` is used to generate the Gaussian samples, while the two columns of `dp_seeds` are used to generate the privacy noise for the sample mean and the sample variance, respectively. Since we're using the Laplace mechanism here, we construct each privacy seed as the product of a random sign (a Bernoulli$(1/2)$ variable mapped to $\pm 1$) and an Exp$(1)$ random variable, which has the same distribution as a Laplace$(1)$ random variable.

```{r}
regular_seeds <- matrix(rnorm(repro_size * n), nrow = repro_size, ncol = n)
dp_seeds <- matrix((2 * rbinom(n = repro_size * 2, size = 1, prob = 1/2) - 1) * rexp(repro_size * 2),
                   nrow = repro_size, ncol = 2)
seeds <- cbind(regular_seeds, dp_seeds)
```

Knowing the privacy mechanism, we can define the generating function that produces i.i.d. samples of the statistics:

```{r}
s_sample <- function(seeds, theta) {
  # generate the i.i.d. normal r.v.s using the first n columns of the seeds
  raw_data <- theta[1] + sqrt(theta[2]) * seeds[, 1:n]

  # clamp the raw data using the clamps defined above
  clamped <- pmin(pmax(raw_data, lower_clamp), upper_clamp)

  # compute the private statistics (adding properly scaled Laplace noise
  # to the sample mean and the sample variance)
  s_mean <- apply(clamped, 1, mean) + (upper_clamp - lower_clamp) / (n * eps / 2) * seeds[, n + 1]
  s_var <- apply(clamped, 1, var) + (upper_clamp - lower_clamp)^2 / (n * eps / 2) * seeds[, n + 2]

  return(cbind(s_mean, s_var))
}
```

We're now ready to run hypothesis tests and identify confidence intervals/sets using the functions provided by the package.

#### DP Normal: The `p_value` function

We begin by encoding the released statistic (private sample mean and private sample variance) as a vector:

```{r}
s_DP <- c(1.12, 1.07) # define the observed statistic as a vector
```

Let's say we want to test whether or not the mean is between $0$ and $1$:
$$H_0:\mu\in[0,1], \quad H_1:\mu\notin[0,1].$$
We then choose a large enough search interval for the other parameters; in this case, the population variance $\sigma^2$ is assumed to lie in the range $[10^{-6},10]$.
```{r}
lower_bds <- c(0, 1e-6)
upper_bds <- c(1, 10)
```

Again, we apply the `p_value` function for the hypothesis test:

```{r}
result <- p_value(lower_bds, upper_bds, seeds, s_sample, s_DP)
print(result$p_val)     # the largest p-value found
print(result$theta_hat) # the parameter corresponding to the largest p-value
```

With a significance level of $0.05$, we fail to reject the null hypothesis.

#### DP Normal: The `get_CI` function

Now, let's find the confidence intervals for $\mu$ and $\sigma^2$.

```{r}
# search lower bounds and upper bounds for mean and variance
alpha <- .05 # significance level
tol <- 1e-3  # tolerance
lower_bds <- c(-5, 1e-6)
upper_bds <- c(5, 10)
```

First, let's find a $95\%$ confidence interval for $\mu$:

```{r}
# parameter_index = 1 indicates that we're obtaining a confidence interval
# for the first component of the parameter vector; in this case, the mean.
parameter_index <- 1
mean_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_DP, tol)
print(mean_CI)
```

Similarly, we can obtain a $95\%$ confidence interval for $\sigma^2$ by changing `parameter_index` from $1$ to $2$:

```{r}
parameter_index <- 2
var_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_DP, tol)
print(var_CI)
```

#### DP Normal: The `confidence_grid` function

We can also obtain a joint confidence region for $\mu$ and $\sigma^2$. Based on the individual confidence intervals obtained above, we can choose a tighter search region to make the visualization look better.

```{r}
lower_bds <- c(0, 1e-6)
upper_bds <- c(2, 5)
resolution <- 20

result <- confidence_grid(alpha, lower_bds, upper_bds, seeds, s_sample, s_DP, tol, resolution)
indicator_array <- result$ind_array
print(result)
```

#### DP Normal: The `plot_grid` function

Lastly, we can visualize the confidence region using the `plot_grid` function.

```{r}
parameter_names <- c("mean", "variance") # optional input
plot_grid(indicator_array, lower_bds, upper_bds, parameter_names)
```

The differentially private confidence region looks more dispersed than in the regular normal case, due to the added privacy noise.

# References