misaem
is a package to perform linear regression and
logistic regression with missing data, under MCAR (Missing completely at
random) and MAR (Missing at random) mechanisms. The covariates are
assumed to be continuous variables. The methodology implemented is based
on maximization of the observed likelihood using EM-types of algorithms.
The package includes:
norm
package allows to
estimate the mean vector and a variance covariance matrix with the EM
algorithm and SWEEP operator. Finally we have reshaped them to obtain
the regression coefficient.In our example, we’ll use a custom ampute function. One can also use
mice::ampute
for other examples. For more details about how
to generate missing values of different mechanisms, see the resource
website of missing values Rmisstastic.
ampute <- function(X.full, prc.NA){
n <- dim(X.full)[1]
d <- dim(X.full)[2]
# Create Mask Matrix
M <- matrix(0, nrow = n, ncol = d)
for (j in 1:d) {
M[, j] <- rbinom(n, 1, prc.NA)
}
while (any(rowSums(M) == d)) {
M.temp <- M[rowSums(M) < d, ]
left.to.add <- n - nrow(M.temp)
M.add <- matrix(0, nrow = left.to.add, ncol = d)
for (j in 1:d) {
M.add[, j] <- rbinom(left.to.add, 1, prc.NA)
}
M <- rbind(M.temp, M.add)
}
X.obs <- X.full
X.obs[M == 1] <- NA
return(X.obs)
}
Let’s generate a synthetic example of classical linear regression. We first generate a design matrix of size \(n = 50\) times \(p = 2\) by drawing each observation from a multivariate normal distribution \(\mathcal{N}(\mu, \Sigma)\). We consider as the true values for the parameters: \[\begin{equation*} \begin{split} \mu &= (1, 1),\\ \Sigma & = \begin{bmatrix} 1 & 1\\ 1 & 4\\ \end{bmatrix} \end{split} \end{equation*}\] Then, we generate the response according to the linear regression model with coefficient \(\beta = (2, 3, -1)\) and variance of noise vector \(\sigma^2 = 0.25\).
set.seed(1)
n <- 50 # number of rows
p <- 2 # number of explanatory variables
# Generate complete design matrix
library(MASS)
mu.X <- c(1, 1)
Sigma.X <- matrix(c(1, 1, 1, 4), nrow = 2)
X.complete <- mvrnorm(n, mu.X, Sigma.X)
# Generate response
b <- c(2, 3, -1) # regression coefficient
sigma.eps <- 0.25 # noise variance
y <- cbind(rep(1, n), X.complete) %*% b + rnorm(n, 0, sigma.eps)
Then we randomly introduced 15% of missing values in the covariates
according to the MCAR (Missing completely at random) mechanism. To do
so, we use the custom function ampute
.
Have a look at our synthetic dataset:
## [,1] [,2]
## [1,] 0.30528180 -0.1473747
## [2,] 1.59950261 1.2164969
## [3,] 0.22508791 -0.5764402
## [4,] 2.86148303 3.8938533
## [5,] 0.05283648 NA
## [6,] NA -0.1496864
Check the percentage of missing values:
## [1] 0.13
The main function in our package to fit linear regression with
missingness is miss.lm
function. The function
miss.lm
mimics the structure of widely used function
lm
for the case without missing values. It takes an object
of class formula
(a symbolic description of the model to be
fitted) and the data frame as the input. Here we apply this function
with its default options.
# Estimate regression using EM with NA
df.obs = data.frame(y, X.obs)
miss.list = miss.lm(y~., data = df.obs)
## Iterations of EM:
## 1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...
Then it returns an object of self-defined class miss.lm
,
which consists of the estimation of parameters, their standard error and
observed log-likelihood. We can print or summarize the obtained results
as follows:
##
## Call: miss.lm(formula = y ~ ., data = df.obs)
##
## Coefficients:
## (Intercept) X1 X2
## 1.883 3.061 -1.005
## Standard error estimates:
## (Intercept) X1 X2
## 0.04616 0.03801 0.02126
## Log-likelihood: 29.36
##
## Call:
## miss.lm(formula = y ~ ., data = df.obs)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.88264 0.04616
## X1 3.06148 0.03801
## X2 -1.00548 0.02126
## Log-likelihood: 29.357
## Estimate Std. Error
## (Intercept) 1.882638 0.04615914
## X1 3.061475 0.03800860
## X2 -1.005478 0.02126482
Self-defined parameters can be also taken such as the maximum number
of iterations (maxruns
), the convergence tolerance
(tol_em
) and the logical indicating if the iterations
should be reported (print_iter
).
# Estimate regression using self-defined parameters
miss.list2 = miss.lm(y~., data = df.obs, print_iter = FALSE, maxruns = 500, tol_em = 1e-4)
print(miss.list2)
##
## Call: miss.lm(formula = y ~ ., data = df.obs, print_iter = FALSE, maxruns = 500,
## tol_em = 1e-04)
##
## Coefficients:
## (Intercept) X1 X2
## 1.883 3.061 -1.005
## Standard error estimates:
## (Intercept) X1 X2
## 0.04619 0.03804 0.02128
## Log-likelihood: 29.36
The function miss.lm.model.select
adapts a BIC criterion
and step-wise method to return the best model selected. We add a null
variable with missing values to check if the function can distinguish it
from the true variables.
# Add null variable with NA
X.null <- mvrnorm(n, 1, 1)
patterns <- runif(n)<0.15 # missing completely at random
X.null[patterns] <- NA
X.obs.null <- cbind.data.frame(X.obs, X.null)
# Without model selection
df.obs.null = data.frame(y, X.obs.null)
miss.list.null = miss.lm(y~., data = df.obs.null)
## Iterations of EM:
## 1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...20...21...22...23...
##
## Call: miss.lm(formula = y ~ ., data = df.obs.null)
##
## Coefficients:
## (Intercept) X1 X2 X.null
## 1.92234 3.07508 -1.01361 -0.05098
## Standard error estimates:
## (Intercept) X1 X2 X.null
## 0.05191 0.03725 0.02092 0.02717
## Log-likelihood: 8.393
##
## Call: miss.lm(formula = Y ~ ., data = df, print_iter = FALSE)
##
## Coefficients:
## (Intercept) X1 X2
## 1.883 3.061 -1.005
## Standard error estimates:
## (Intercept) X1 X2
## 0.04616 0.03801 0.02126
## Log-likelihood: 29.36
In order to evaluate the prediction performance, we generate a test set of size \(nt = 20\) times \(p = 2\) following the same distribution as the previous design matrix, and we add or not 15% of missing values.
# Prediction
# Generate dataset
set.seed(200)
nt <- 20 # number of new observations
Xt <- mvrnorm(nt, mu.X, Sigma.X)
# Add missing values
Xt.obs <- ampute(Xt, 0.15)
The prediction can be performed for a complete test set:
#train with NA + test no NA
miss.comptest.pred = predict(miss.list2, data.frame(Xt), seed = 100)
print(miss.comptest.pred)
## [1] 3.3350555 2.5567616 -0.6191913 6.5511413 2.8689770 7.9834750
## [7] 0.7640520 3.8836697 6.7089083 3.3041055 6.7721756 2.1810048
## [13] 1.9732412 5.9419726 7.7625725 5.0357058 4.1730339 4.4098238
## [19] 3.4960140 2.9460593
And we can also apply the function when both train set and test set have missing values:
#both train & test with NA
miss.pred = predict(miss.list2, data.frame(Xt.obs), seed = 100)
print(miss.pred)
## [1] 3.3350555 2.5567616 -0.6191913 6.5511413 2.8689770 7.9834750
## [7] 0.7640520 3.8123903 6.7089083 3.0486207 5.4344376 2.1810048
## [13] 1.9732412 5.9419726 7.7625725 5.0357058 3.7876726 4.4098238
## [19] 3.4960140 2.9460593
We first generate a design matrix of size \(n=500\) times \(p=5\) by drawing each observation from a multivariate normal distribution \(\mathcal{N}(\mu, \Sigma)\). Then, we generate the response according to the logistic regression model.
We consider as the true values for the parameters \[\begin{equation*}
\begin{split}
\beta &= (0, 1, -1, 1, 0, -1),\\
\mu &= (1,2,3,4,5),\\
\Sigma &= \text{diag}(\sigma)C \text{diag}(\sigma),
\end{split}
\end{equation*}\] where the \(\sigma\) is the vector of standard
deviations \[\sigma=(1,2,3,4,5)\]
and \(C\) the correlation matrix \[C = \begin{bmatrix}
1 & 0.8 & 0 & 0 & 0\\
0.8 & 1 & 0 & 0 & 0\\
0 & 0 & 1 & 0.3 & 0.6\\
0 & 0 & 0.3 & 1 & 0.7\\
0 & 0 & 0.6 & 0.7 & 1\\
\end{bmatrix}.\]
# Generate dataset
set.seed(200)
n <- 500 # number of subjects
p <- 5 # number of explanatory variables
mu.star <- 1:p #rep(0,p) # mean of the explanatory variables
sd <- 1:p # rep(1,p) # standard deviations
C <- matrix(c( # correlation matrix
1, 0.8, 0, 0, 0,
0.8, 1, 0, 0, 0,
0, 0, 1, 0.3, 0.6,
0, 0, 0.3, 1, 0.7,
0, 0, 0.6, 0.7, 1), nrow=p)
Sigma.star <- diag(sd)%*%C%*%diag(sd) # covariance matrix
beta.star <- c(1, -1, 1, 1, -1) # coefficients
beta0.star <- 0 # intercept
beta.true = c(beta0.star,beta.star)
# Design matrix
X.complete <- matrix(rnorm(n*p), nrow=n)%*%chol(Sigma.star)+
matrix(rep(mu.star,n), nrow=n, byrow = TRUE)
# Reponse vector
p1 <- 1/(1+exp(-X.complete%*%beta.star-beta0.star))
y <- as.numeric(runif(n)<p1)
Then we randomly introduced 10% of missing values in the covariates according to the MCAR (Missing completely at random) mechanism.
Have a look at our synthetic dataset:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0847563 1.71119812 5.0779956 9.7312548 13.02285225
## [2,] 1.2264603 0.04664033 5.3758000 6.3830936 4.84730504
## [3,] 1.4325565 1.77934455 5.6562280 NA 7.26902254
## [4,] 1.5580652 5.69782193 5.5942869 -0.4407494 -0.96662931
## [5,] 1.0597553 -0.38470918 0.4462986 NA 0.04745022
## [6,] 0.8853591 0.56839374 3.4641522 NA 9.11909475
The main function for fitting logistic regression with missing
covariates in our package is miss.glm
function, which
mimics the structure of widely used function glm
. Note that
we don’t need to specify the binomial family in the input of
miss.glm
function. Here we apply this function with its
default options, and then we can print or summarize the obtained results
as follows:
df.obs = data.frame(y, X.obs)
#logistic regression with NA
miss.list = miss.glm(y~., data = df.obs, seed = 100)
## Iteration of SAEM:
## 50... 100... 150... 200...
##
## Call: miss.glm(formula = y ~ ., data = df.obs, seed = 100)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## 0.1346 1.3226 -1.2643 1.1190 1.1293 -1.1449
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3332 0.3654 0.2244 0.1495 0.1451 0.1469
## Log-likelihood: -172.4
##
## Call:
## miss.glm(formula = y ~ ., data = df.obs, seed = 100)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.1346 0.3332
## X1 1.3226 0.3654
## X2 -1.2643 0.2244
## X3 1.1190 0.1495
## X4 1.1293 0.1451
## X5 -1.1449 0.1469
## Log-likelihood: -172.36
## Estimate Std. Error
## (Intercept) 0.1346364 0.3332388
## X1 1.3226155 0.3654287
## X2 -1.2643476 0.2243754
## X3 1.1190169 0.1495319
## X4 1.1292532 0.1450896
## X5 -1.1448603 0.1468984
To perform model selection with missing values, we adapt criterion
BIC and step-wise method. The function
miss.glm.model.select
outputs the best model selected. With
the current implementation, when \(p\)
is greater than 20, it may encounter computational difficulties for the
BIC based model selection. In the following simulation, we add a null
variable with missing values to check if the function can distinguish it
from the true variables.
# Add null variable with NA
X.null <- mvrnorm(n, 1, 1)
patterns <- runif(n)<0.10 # missing completely at random
X.null[patterns] <- NA
X.obs.null <- cbind.data.frame(X.obs, X.null)
# Without model selection
df.obs.null = data.frame(y, X.obs.null)
miss.list.null = miss.glm(y~., data = df.obs.null)
## Iteration of SAEM:
## 50... 100... 150... 200... 250... 300... 350...
##
## Call: miss.glm(formula = y ~ ., data = df.obs.null)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## 0.2018 1.3598 -1.2953 1.1489 1.1506 -1.1632
## X.null
## -0.1098
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3832 0.3698 0.2261 0.1493 0.1454 0.1461
## X.null
## 0.1892
## Log-likelihood: -173.4
##
## Call: miss.glm(formula = Y ~ ., data = df, print_iter = FALSE, subsets = subset_choose)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## 0.1287 1.3623 -1.2932 1.1368 1.1483 -1.1635
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3330 0.3640 0.2220 0.1469 0.1442 0.1450
## Log-likelihood: -171.6
In order to evaluate the prediction performance, we generate a test set of size \(nt = 100\) times \(p = 5\) following the same distribution as the design matrix, and without and with 10% of missing values. We evaluate the prediction quality with a confusion matrix.
# Generate test set with missingness
set.seed(200)
nt = 100
X.test <- matrix(rnorm(nt*p), nrow=nt)%*%chol(Sigma.star)+
matrix(rep(mu.star,nt), nrow = nt, byrow = TRUE)
# Generate the test set
p1 <- 1/(1+exp(-X.test%*%beta.star-beta0.star))
y.test <- as.numeric(runif(nt)<p1)
# Generate missingness on test set
p.miss <- 0.10
X.test[runif(nt*p)<p.miss] <- NA
# Prediction on test set
pr.saem <- predict(miss.list, data.frame(X.test))
# Confusion matrix
pred.saem = (pr.saem>0.5)*1
table(y.test,pred.saem )
## pred.saem
## y.test 0 1
## 0 34 8
## 1 6 52
Logistic Regression with Missing Covariates – Parameter Estimation, Model Selection and Prediction (2020, Jiang W., Josse J., Lavielle M., TraumaBase Group), Computational Statistics & Data Analysis.