---
title: "Blocking records for deduplication"
author: "Maciej Beręsewicz"
output: 
    html_vignette:
        df_print: kable
        toc: true
        number_sections: true
        fig_width: 6
        fig_height: 4
vignette: >
  %\VignetteIndexEntry{Blocking records for deduplication}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
```

# Setup

Read required packages.

```{r setup}
library(blocking)
library(data.table)
```

Read the `RLdata500` data (taken from the [RecordLinkage](https://CRAN.R-project.org/package=RecordLinkage) package).

```{r}
data(RLdata500)
head(RLdata500)
```
This dataset contains `r nrow(RLdata500)` rows with `r NROW(unique(RLdata500$ent_id))` entities.

# Blocking for deduplication

Now we create a new column that concatenates the information in each row. 

```{r}
RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs
RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))]
head(RLdata500)
```

In the next step we use the newly created column in the `blocking` function. If we specify verbose, we get information about the progress.

```{r}
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
```

Results are as follows:

+ based on `rnndescent` we have created `r NROW(unique(df_blocks$result$block))` blocks,
+ it was based on `r NROW(unique(df_blocks$colnames))` columns (2 character shingles),
+ we have 45 blocks of 2 elements, 33 blocks of 3 elements, ..., 1 block of 17 elements.

```{r}
df_blocks
```

Structure of the object is as follows:

+ `result` -- a `data.table` with identifiers and block IDs,
+ `method` -- the method used,
+ `deduplication` -- whether deduplication was applied,
+ `representation` -- whether shingles or vectors were used,
+ `metrics` -- standard metrics and based on the `igraph::compare` methods for comparing graphs (here NULL),
+ `confusion` -- confusion matrix (here NULL),
+ `colnames` -- column names used for the comparison,
+ `graph` -- an `igraph` object mainly for visualisation.

```{r}
str(df_blocks,1)
```
Plot connections.

```{r}
plot(df_blocks$graph, vertex.size=1, vertex.label = NA)
```

The resulting `data.table` has four columns:

+ `x` -- reference dataset (i.e. `RLdata500`) -- this may not contain all units of `RLdata500`,
+ `y` - query (each row of `RLdata500`) -- this may not contain all units of `RLdata500`,
+ `block` -- the block ID,
+ `dist` -- distance between objects.

```{r}
head(df_blocks$result)
```

Create long `data.table` with information on blocks and units from original dataset.

```{r}
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)
```

We add block information to the final dataset.

```{r}
RLdata500[df_block_melted_rec_block, on = "rec_id", block_id := i.block]
head(RLdata500)
```

We can check in how many blocks the same entities (`ent_id`) are observed. In our example, all the same entities are in the same blocks. 

```{r}
RLdata500[, .(uniq_blocks = uniqueN(block_id)), .(ent_id)][, .N, uniq_blocks]
```

We can visualise the distances between units stored in the `df_blocks$result` data set. Clearly we have a mixture of two groups: matches (close to 0) and non-matches (close to 1). 

```{r}
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
     main = "Distances calculated between units")
```

Finally, we can visualise the result based on the information whether block contains matches or not.

```{r}
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]

plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8), 
     main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red", xlim = c(0, 0.8))
```