--- title: "Introduction to tidydp: Tidy Differential Privacy" author: "Thomas Tarler" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to tidydp: Tidy Differential Privacy} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) set.seed(123) ``` ## What is Differential Privacy? Differential privacy is a rigorous mathematical framework for protecting individual privacy while sharing aggregate information about a dataset. It provides formal guarantees that the presence or absence of any single individual in a dataset has minimal impact on the results of statistical queries. ### Key Concepts **Privacy Parameters:** - **epsilon (ε)**: The privacy budget. Smaller values provide stronger privacy but add more noise. Typical values range from 0.01 (very private) to 10 (less private). - **delta (δ)**: The probability that the privacy guarantee fails. Usually set to a very small value like 1e-5 or 1e-6. **Privacy Mechanisms:** - **Laplace Mechanism**: Adds noise from the Laplace distribution. Provides pure ε-differential privacy (δ = 0). - **Gaussian Mechanism**: Adds noise from the Gaussian (normal) distribution. Provides (ε, δ)-differential privacy. ## Installation ```{r install, eval=FALSE} # Install from CRAN (when available) install.packages("tidydp") # Or install development version from GitHub devtools::install_github("ttarler/tidydp") ``` ## Getting Started ```{r library} library(tidydp) ``` ### Basic Example: Adding Noise to Data The most straightforward use case is adding differentially private noise to columns in a data frame: ```{r basic_noise} # Create sample data employee_data <- data.frame( name = c("Alice", "Bob", "Charlie", "Diana", "Eve"), age = c(28, 35, 42, 31, 38), salary = c(65000, 75000, 85000, 70000, 80000) ) # View original data head(employee_data) # Add differential privacy noise private_data <- employee_data %>% dp_add_noise( columns = c("age", "salary"), epsilon = 0.5, lower = c(age = 22, salary = 50000), upper = c(age = 65, salary = 150000) ) # View privatized data head(private_data) ``` Notice how the numeric columns now have noise added while preserving the names column. 
## Core Functions

### Differentially Private Counting

Count the number of records with privacy guarantees:

```{r counting}
# Create sample data
city_data <- data.frame(
  city = rep(c("New York", "Los Angeles", "Chicago"), c(150, 120, 80)),
  category = sample(c("A", "B", "C"), 350, replace = TRUE)
)

# Overall count
overall_count <- city_data %>%
  dp_count(epsilon = 0.1)
print(overall_count)

# Grouped count by city
city_counts <- city_data %>%
  dp_count(epsilon = 0.1, group_by = "city")
print(city_counts)

# Count by multiple groups
city_category_counts <- city_data %>%
  dp_count(epsilon = 0.1, group_by = c("city", "category"))
head(city_category_counts)
```

### Differentially Private Mean

Compute private averages:

```{r mean}
# Create sample data
income_data <- data.frame(
  region = rep(c("North", "South", "East", "West"), each = 100),
  income = c(
    rnorm(100, mean = 60000, sd = 15000),
    rnorm(100, mean = 55000, sd = 12000),
    rnorm(100, mean = 65000, sd = 18000),
    rnorm(100, mean = 58000, sd = 14000)
  )
)

# Overall mean income
avg_income <- income_data %>%
  dp_mean(
    "income",
    epsilon = 0.2,
    lower = 20000,
    upper = 150000
  )
print(avg_income)

# Mean by region
regional_avg <- income_data %>%
  dp_mean(
    "income",
    epsilon = 0.2,
    lower = 20000,
    upper = 150000,
    group_by = "region"
  )
print(regional_avg)
```

### Differentially Private Sum

Compute private totals:

```{r sum}
# Create sales data
sales_data <- data.frame(
  store = rep(c("Store A", "Store B", "Store C"), each = 50),
  sales = c(
    rpois(50, lambda = 1000),
    rpois(50, lambda = 1200),
    rpois(50, lambda = 900)
  )
)

# Total sales by store
store_totals <- sales_data %>%
  dp_sum(
    "sales",
    epsilon = 0.3,
    lower = 0,
    upper = 5000,
    group_by = "store"
  )
print(store_totals)
```

## Privacy Budget Management

When performing multiple queries on the same dataset, you need to track your total privacy expenditure using a privacy budget:

```{r budget}
# Create a privacy budget
budget <- new_privacy_budget(
  epsilon_total = 1.0,
  delta_total = 1e-5
)
print(budget)

# Perform first query
result1 <- city_data %>%
  dp_count(epsilon = 0.3, .budget = budget)
print(budget)

# Perform second query
result2 <- city_data %>%
  dp_count(epsilon = 0.4, group_by = "city", .budget = budget)
print(budget)

# Check if we have enough budget for another query
can_query <- check_privacy_budget(budget, epsilon_required = 0.5)
print(paste("Can perform query with epsilon=0.5?", can_query))

# We only have 0.3 epsilon remaining (1.0 - 0.3 - 0.4)
can_query <- check_privacy_budget(budget, epsilon_required = 0.2)
print(paste("Can perform query with epsilon=0.2?", can_query))
```

### Budget Composition

The tidydp package uses **basic composition** by default, where the total privacy cost is the sum of the individual query costs:

$$\epsilon_{\text{total}} = \sum_{i=1}^{k} \epsilon_i$$

This is a conservative approach that ensures strong privacy guarantees.
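To make the arithmetic concrete, here is what basic composition amounts to for the two counting queries above. This is plain base-R bookkeeping, not part of tidydp's API; the budget object shown earlier performs the same tracking for you.

```{r composition_sketch}
# Basic composition: the epsilons of individual queries simply add up
epsilons_spent    <- c(0.3, 0.4)   # the two dp_count() calls above
epsilon_total     <- 1.0           # the total budget declared earlier
epsilon_remaining <- epsilon_total - sum(epsilons_spent)
epsilon_remaining                  # 0.3: enough for epsilon = 0.2, not for 0.5
```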
## Choosing Privacy Parameters

### Epsilon (ε)

| Epsilon Value | Privacy Level | Use Case |
|---------------|---------------|----------|
| 0.01 - 0.1 | Very Strong | Highly sensitive medical or financial data |
| 0.1 - 1.0 | Strong | Personal information, general sensitive data |
| 1.0 - 5.0 | Moderate | Less sensitive aggregate statistics |
| 5.0+ | Weak | Public or minimally sensitive data |

### Delta (δ)

- Typically set to 1/n² or 1/n³, where n is the dataset size
- Common values: 1e-5, 1e-6, 1e-7
- Should always be much smaller than 1/n

### Data Bounds

Providing accurate bounds is crucial for utility:

```{r bounds_comparison}
# Example: Impact of bounds on utility
test_data <- data.frame(age = c(25, 30, 35, 40, 45))

# Tight bounds (accurate)
tight_bounds <- test_data %>%
  dp_add_noise(
    columns = "age",
    epsilon = 0.5,
    lower = c(age = 20),
    upper = c(age = 50)
  )

# Loose bounds (less accurate)
loose_bounds <- test_data %>%
  dp_add_noise(
    columns = "age",
    epsilon = 0.5,
    lower = c(age = 0),
    upper = c(age = 100)
  )

# Compare results
data.frame(
  Original = test_data$age,
  Tight_Bounds = round(tight_bounds$age, 1),
  Loose_Bounds = round(loose_bounds$age, 1)
)
```

Tighter bounds lead to better utility (less noise) while maintaining the same privacy guarantee.

## Mechanism Selection

### When to use Laplace vs Gaussian

**Use Laplace** (default):

- When you need pure ε-differential privacy (δ = 0)
- For counting queries
- When δ > 0 is not acceptable

**Use Gaussian**:

- When (ε, δ)-differential privacy is acceptable
- Often provides better utility for the same privacy level
- When working with continuous data and aggregates

```{r mechanism_comparison}
# Compare mechanisms
test_values <- data.frame(value = c(100, 200, 300, 400, 500))

# Laplace mechanism
laplace_result <- test_values %>%
  dp_add_noise(
    columns = "value",
    epsilon = 0.5,
    lower = c(value = 0),
    upper = c(value = 1000),
    mechanism = "laplace"
  )

# Gaussian mechanism
gaussian_result <- test_values %>%
  dp_add_noise(
    columns = "value",
    epsilon = 0.5,
    delta = 1e-5,
    lower = c(value = 0),
    upper = c(value = 1000),
    mechanism = "gaussian"
  )

data.frame(
  Original = test_values$value,
  Laplace = round(laplace_result$value, 1),
  Gaussian = round(gaussian_result$value, 1)
)
```

## Complete Workflow Example

Here's a complete example analyzing employee data while maintaining differential privacy:

```{r complete_example}
# Create employee dataset
employees <- data.frame(
  department = rep(c("Engineering", "Sales", "Marketing", "HR"), each = 25),
  salary = c(
    rnorm(25, 85000, 15000),  # Engineering
    rnorm(25, 70000, 12000),  # Sales
    rnorm(25, 65000, 10000),  # Marketing
    rnorm(25, 60000, 8000)    # HR
  ),
  years_experience = c(
    rpois(25, 5),
    rpois(25, 4),
    rpois(25, 3),
    rpois(25, 4)
  )
)

# Ensure realistic bounds
employees$salary <- pmax(40000, pmin(150000, employees$salary))
employees$years_experience <- pmax(0, pmin(20, employees$years_experience))

# Initialize privacy budget
analysis_budget <- new_privacy_budget(epsilon_total = 2.0)

# Query 1: Count by department (epsilon = 0.5)
dept_counts <- employees %>%
  dp_count(
    epsilon = 0.5,
    group_by = "department",
    .budget = analysis_budget
  )
cat("\nEmployee counts by department:\n")
print(dept_counts)

# Query 2: Average salary by department (epsilon = 0.8)
dept_salaries <- employees %>%
  dp_mean(
    "salary",
    epsilon = 0.8,
    lower = 40000,
    upper = 150000,
    group_by = "department",
    .budget = analysis_budget
  )
cat("\nAverage salaries by department:\n")
print(dept_salaries)

# Query 3: Average experience (epsilon = 0.4)
avg_experience <- employees %>%
  dp_mean(
    "years_experience",
    epsilon = 0.4,
    lower = 0,
    upper = 20,
    .budget = analysis_budget
  )
cat("\nAverage years of experience:\n")
print(avg_experience)

# Check remaining budget
cat("\nFinal budget status:\n")
print(analysis_budget)
```
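Because the workflow above uses synthetic data, it is safe to compare the private answers against the true values to gauge utility; never do this with real confidential data, since publishing the exact values defeats the purpose of the privacy protection. The snippet below computes the non-private department means with base R for reference and assumes nothing about tidydp beyond the objects already created above.

```{r utility_check}
# Non-private reference values for the synthetic data (utility check only)
true_salaries <- aggregate(salary ~ department, data = employees, FUN = mean)
print(true_salaries)
```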
## Best Practices

1. **Start with a clear privacy budget**: Decide upfront how much privacy loss is acceptable
2. **Use tight bounds**: Provide accurate lower and upper bounds for better utility
3. **Track your budget**: Use `new_privacy_budget()` for multiple queries
4. **Test with synthetic data**: Validate your privacy-utility tradeoff before deploying
5. **Document your choices**: Record epsilon, delta, and bounds for reproducibility
6. **Consider query sensitivity**: Different queries have different privacy costs
7. **Aggregate before privatizing**: Reduce sensitivity by aggregating data first

## Common Pitfalls

### Pitfall 1: Repeated Queries

Don't run the same query multiple times without accounting for cumulative privacy loss:

```{r pitfall1, eval=FALSE}
# BAD: Running the same query multiple times
for (i in 1:10) {
  result <- data %>% dp_count(epsilon = 0.1)
}
# Total cost: 10 * 0.1 = 1.0 epsilon!
```

### Pitfall 2: Ignoring Bounds

Not providing bounds forces the algorithm to use data-dependent bounds, which can leak information:

```{r pitfall2, eval=FALSE}
# BETTER: Provide explicit bounds
result <- data %>%
  dp_mean("income", epsilon = 0.5, lower = 0, upper = 200000)

# WORSE: Let the algorithm infer bounds from the data
result <- data %>%
  dp_mean("income", epsilon = 0.5)
```

### Pitfall 3: Epsilon Too Large

Using epsilon > 10 provides minimal privacy protection:

```{r pitfall3}
# Very weak privacy
weak_privacy <- test_values %>%
  dp_add_noise(
    columns = "value",
    epsilon = 50,  # Too large!
    lower = c(value = 0),
    upper = c(value = 1000)
  )

# The noise is minimal
data.frame(
  Original = test_values$value,
  With_Noise = round(weak_privacy$value, 1),
  Difference = round(abs(test_values$value - weak_privacy$value), 1)
)
```

## Further Reading

- Dwork, C., & Roth, A. (2014). *The Algorithmic Foundations of Differential Privacy*
- [Differential Privacy Team at Apple](https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf)
- [Google's Differential Privacy Library](https://github.com/google/differential-privacy)
- [OpenDP Project](https://opendp.org/)

## Getting Help

If you encounter issues or have questions:

- File an issue: [https://github.com/ttarler/tidydp/issues](https://github.com/ttarler/tidydp/issues)
- Check the documentation: `?tidydp`, `?dp_add_noise`, `?dp_count`, etc.
- View examples: Run `example(dp_add_noise)`