Help for package ClustImpute

Type:

Package

Title:

K-Means Clustering with Build-in Missing Data Imputation

Version:

0.2.4

Author:

Oliver Pfaffel

Maintainer:

Oliver Pfaffel <opfaffel@gmail.com>

Description:

This k-means algorithm is able to cluster data with missing values and as a by-product completes the data set. The implementation can deal with missing values in multiple variables and is computationally efficient since it iteratively uses the current cluster assignment to define a plausible distribution for missing value imputation. Weights are used to shrink early random draws for missing values (i.e., draws based on the cluster assignments after few iterations) towards the global mean of each feature. This shrinkage slowly fades out after a fixed number of iterations to reflect the increasing credibility of cluster assignments. See the vignette for details.

License:

GPL-3

Encoding:

UTF-8

Imports:

ClusterR, copula, dplyr, magrittr, tidyr, ggplot2, rlang, knitr

Suggests:

ggExtra, rmarkdown, testthat (≥ 2.1.0), Hmisc, tictoc, spelling, corrplot, covr

VignetteBuilder:

knitr

RoxygenNote:

7.1.0

Language:

en-US

NeedsCompilation:

Packaged:

2021-05-27 20:22:28 UTC; opfaf

Repository:

CRAN

Date/Publication:

2021-05-31 07:40:11 UTC

Pipe operator

Description

See magrittr::%>% for details.

Usage

lhs %>% rhs

Value

No return value, called for side effects

K-means clustering with build-in missing data imputation

Description

Clustering algorithm that produces a missing value imputation using on the go. The (local) imputation distribution is defined by the currently assigned cluster. The first draw is by random imputation.

Usage

ClustImpute(
  X,
  nr_cluster,
  nr_iter = 10,
  c_steps = 1,
  wf = default_wf,
  n_end = 10,
  seed_nr = 150519,
  assign_with_wf = TRUE,
  shrink_towards_global_mean = TRUE
)

Arguments

X

Data frame with only numeric values or NAs

nr_cluster

Number of clusters

nr_iter

Iterations of procedure

c_steps

Number of clustering steps per iteration

wf

Weight function. Linear up to n_end by default. Used to shrink X towards zero or the global mean (default). See shrink_towards_global_mean

n_end

Steps until convergence of weight function to 1

seed_nr

Number for set.seed()

assign_with_wf

Default is TRUE. If set to False, then the weight function is only applied in the centroid computation, but ignored in the cluster assignment.

shrink_towards_global_mean

By default TRUE. The weight matrix w is applied on the difference of X from the global mean m, i.e, (x-m)*w+m

Value

complete_data: Completed data without NAs
clusters: For each row of complete_data, the associated cluster
centroids: For each cluster, the coordinates of the centroids in tidy format
centroids_matrix: For each cluster, the coordinates of the centroids in matrix format
imp_values_mean: Mean of the imputed variables per draw
imp_values_sd: Standard deviation of the imputed variables per draw

Examples

# Random Dataset
set.seed(739)
n <- 750 # numer of points
nr_other_vars <- 2
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me<-4 # mean
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1))
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling

# Create NAs
dat_with_miss <- miss_sim(dat,p=.1,seed_nr=120)

# Run ClustImpute
res <- ClustImpute(dat_with_miss,nr_cluster=3)

# Plot complete data set and cluster assignment
ggplot2::ggplot(res$complete_data,ggplot2::aes(x,y,color=factor(res$clusters))) +
ggplot2::geom_point()

# View centroids
res$centroids

Check and replace duplicate (centroid) rows

Description

Internal function of ClustImpute: check new centroids for duplicate rows and replace with random draws in this case.

Usage

check_replace_dups(centroids, X, seed)

Arguments

centroids

Matrix of centroids

X

Underlying data matrix (without missings)

seed

Seed used for random sampling

Value

Returns centroids where duplicate rows are replaced by random draws

K-means clustering with build-in missing data imputation

Description

Default weight function. One minus the return value is multiplied with missing(=imputed) values. It starts with 1 and goes to 0 at n_end.

Usage

default_wf(n, n_end = 10)

Arguments

n

current step

n_end

steps until convergence of weight function to 0

Value

value between 0 and 1

Examples

x <- 0:20
plot(x,1-default_wf(x))

Simulation of missings

Description

Simulates missing at random using a normal copula to create correlations between the missing (type="MAR"). Missings appear in each column of the provided data frame with the same ratio.

Usage

miss_sim(dat, p = 0.2, type = "MAR", seed_nr = 123)

Arguments

dat

Data frame with only numeric values

p

Fraction of missings (for entire data frame)

type

Type of missingness. Either MCAR (=missing completely at random) or MAR (=missing at random)

seed_nr

Number for set.seed()

Value

data frame with only numeric values and NAs

Examples

data(cars)
cars_with_missings <- miss_sim(cars,p = .2,seed_nr = 4)
summary(cars_with_missings)

Plot showing marginal distribution by cluster assignment

Description

Returns a plot with the marginal distributions by cluster and feature. The plot shows histograms or boxplots and , as a ggplot object, can be modified further.

Usage

## S3 method for class 'kmeans_ClustImpute'
plot(
  x,
  type = "hist",
  vline = "centroids",
  hist_bins = 30,
  color_bins = "#56B4E9",
  color_vline = "#E69F00",
  size_vline = 2,
  ...
)

Arguments

x

an object returned from ClustImpute

type

either "hist" to plot a histogram or "box" for a boxplot

vline

for "hist" a vertical line is plotted showing either the centroid value or the mean of all data points grouped by cluster and feature

hist_bins

number of bins for histogram

color_bins

color for the histogram bins

color_vline

color for the vertical line

size_vline

size of the vertical line

...

currently unused

Value

Returns a ggplot object

Prediction method

Description

Prediction method

Usage

## S3 method for class 'kmeans_ClustImpute'
predict(object, newdata, ...)

Arguments

object

Object of class kmeans_ClustImpute

newdata

Data frame

...

additional arguments affecting the predictions produced - not currently used

Value

integer value (cluster assignment)

Examples

# Random Dataset
set.seed(739)
n <- 750 # numer of points
nr_other_vars <- 2
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me<-4 # mean
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1))
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling

# Create NAs
dat_with_miss <- miss_sim(dat,p=.1,seed_nr=120)

res <- ClustImpute(dat_with_miss,nr_cluster=3)
predict(res,newdata=dat[1,])

Print method for ClustImpute

Description

Returns a plot with the marginal distributions by cluster and feature. The plot shows histograms or boxplots and , as a ggplot object, can be modified further.

Usage

## S3 method for class 'kmeans_ClustImpute'
print(x, ...)

Arguments

x

an object returned from ClustImpute

...

currently unused

Value

No return value (print function)

Reduction of variance

Description

Computes one minus the ratio of the sum of all within cluster variances by the overall variance

Usage

var_reduction(clusterObj)

Arguments

clusterObj

Object of class kmeans_ClustImpute

Value

integer value typically between 0 and 1

Examples


# Random Dataset
set.seed(739)
n <- 750 # numer of points
nr_other_vars <- 2
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me<-4 # mean
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1))
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling

# Create NAs
dat_with_miss <- miss_sim(dat,p=.1,seed_nr=120)

res <- ClustImpute(dat_with_miss,nr_cluster=3)
var_reduction(res)