Read the example data from the tutorial on the `reclin` package given at the URos 2021 Conference. As stated in the repository, the data sets come from the ESSnet on Data Integration project:
> These totally fictional data sets are supposed to have captured details of persons up to the date 31 December 2011. Any years of birth captured as 2012 are therefore in error. Note that in the fictional Census data set, dates of birth between 27 March 2011 and 31 December 2011 are not necessarily in error.
- `census`: a fictional data set representing some observations from a decennial Census.
- `cis`: fictional observations from the Customer Information System, which is combined administrative data from the tax and benefit systems.
In the `census` data set all records contain a `person_id`. For some of the records in `cis` the `person_id` is also available. This information can be used to evaluate the linkage (assuming these records from the `cis` are representative of all records in the `cis`).
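As a quick illustration of the note quoted above, records with a year of birth captured as 2012 can be flagged directly. A minimal sketch on toy data (not the actual files):

```r
library(data.table)

# Toy rows mimicking the census layout; data were captured up to
# 31 December 2011, so dob_year == 2012 is necessarily an error
toy <- data.table(person_id = c("A1", "A2", "A3"),
                  dob_year  = c(1990, 2012, 1961))

toy[dob_year == 2012]  # rows that must be in error
```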
The `census` object has 25343 rows and 13 columns, and the `cis` object has 24613 rows and 10 columns.

Census data
person_id | pername1 | pername2 | sex | dob_day | dob_mon | dob_year | hse_num | enumcap | enumpc | str_nam | cap_add | census_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DE03US001001 | COUIE | PRICE | M | 1 | 6 | 1960 | 1 | 1 WINDSOR ROAD | DE03US | Windsor Road | 1, Windsor Road | CENSDE03US001001 |
DE03US001002 | ABBIE | PVICE | F | 9 | 11 | 1961 | 1 | 1 WINDSOR ROAD | DE03US | Windsor Road | 1, Windsor Road | CENSDE03US001002 |
DE03US001003 | LACEY | PRICE | F | 7 | 2 | 1999 | 1 | 1 WINDSOR ROAD | DE03US | Windsor Road | 1, Windsor Road | CENSDE03US001003 |
DE03US001004 | SAMUEL | PRICE | M | 13 | 4 | 1990 | 1 | 1 WINDSOR ROAD | DE03US | Windsor Road | 1, Windsor Road | CENSDE03US001004 |
DE03US001005 | JOSEPH | PRICE | M | 20 | 4 | 1986 | 1 | 1 WINDSOR ROAD | DE03US | Windsor Road | 1, Windsor Road | CENSDE03US001005 |
DE03US001006 | JOSH | PRICE | M | 14 | 2 | 1996 | 1 | 1 WINDSOR ROAD | DE03US | Windsor Road | 1, Windsor Road | CENSDE03US001006 |
CIS data
person_id | pername1 | pername2 | sex | dob_day | dob_mon | dob_year | enumcap | enumpc | cis_id |
---|---|---|---|---|---|---|---|---|---|
PO827ER091001 | HAYDEN | HALL | M | 1 | 91 CLARENCE ROAD | PO827ER | CISPO827ER091001 | ||
LS992DB024001 | SEREN | ANDERSON | F | 1 | 1 | 24 CHURCH LANE | LS992DB | CISLS992DB024001 | |
M432ZZ053003 | LEWIS | LEWIS | M | 1 | 1 | 53 CHURCH ROAD | M432ZZ | CISM432ZZ053003 | |
SW75TQ018001 | HARRISON | POSTER | M | 5 | 1 | 19 HIGHFIELD ROAD | SW75TG | CISSW75TQ018001 | |
EX527TR017006 | MUHAMMED | WATSUN | M | 7 | 1 | 17 VICTORIA STREET | CISEX527TR017006 | ||
SW540RB001001 | RHYS | THOMPSON | M | 7 | 1 | 1 SPRINGFIELD ROAD | SW540RB | CISSW540RB001001 |
We randomly select 12671 records from `census` and 12306 records from `cis`.
set.seed(2024)
census <- census[sample(nrow(census), floor(nrow(census) / 2)), ]
cis <- cis[sample(nrow(cis), floor(nrow(cis) / 2)), ]
We need to create new columns that concatenate the variables from `pername1` to `enumpc`.
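A minimal sketch of such a concatenation with `data.table`, using one illustrative row built from the census table above (the column name `txt` matches what `blocking()` is given below):

```r
library(data.table)

# One illustrative row with the census columns used for blocking
toy <- data.table(pername1 = "COUIE", pername2 = "PRICE", sex = "M",
                  dob_day = 1, dob_mon = 6, dob_year = 1960,
                  enumcap = "1 WINDSOR ROAD", enumpc = "DE03US")

# Concatenate pername1 through enumpc into a single string column
toy[, txt := paste0(pername1, pername2, sex, dob_day, dob_mon,
                    dob_year, enumcap, enumpc)]

toy$txt
```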
The goal of this exercise is to link units from the CIS dataset to the CENSUS dataset using the `blocking` package.
result1 <- blocking(x = census$txt, y = cis$txt, verbose = 1)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 12671, 12306, t: 1053) =====
#> ===== creating graph =====
Distribution of distances for each pair.
Example pairs.
x | y | block | dist |
---|---|---|---|
1 | 12088 | 8340 | 0.0452079 |
2 | 12156 | 8378 | 0.2616496 |
3 | 7243 | 5756 | 0.0460410 |
3 | 10643 | 5756 | 0.3730645 |
4 | 8422 | 6453 | 0.3587636 |
6 | 9442 | 7034 | 0.3189948 |
6 | 10195 | 7034 | 0.0416667 |
7 | 745 | 725 | 0.1633400 |
8 | 3072 | 2770 | 0.2578439 |
8 | 10717 | 2770 | 0.1235833 |
Let’s take a look at the first pair. There is obviously a typo in `pername1`, but all the other variables are the same, so it appears to be a match.
cbind(t(census[1, c(1:7, 9:10)]), t(cis[12088, 1:9]))
#> [,1] [,2]
#> person_id "SW122AB001001" "SW122AB001001"
#> pername1 "GEURGE" "GEORGE"
#> pername2 "HUGHES" "HUGHES"
#> sex "M" "M"
#> dob_day "19" "19"
#> dob_mon "5" "5"
#> dob_year "1942" "1942"
#> enumcap "1 VICTORIA ROAD" "1 VICTORIA ROAD"
#> enumpc "SW122AB" "SW122AB"
For some records, we have information about the correct linkage. We can use this information to evaluate our approach.
matches <- merge(x = census[, .(x = 1:.N, person_id)],
                 y = cis[, .(y = 1:.N, person_id)],
                 by = "person_id")
matches[, block := 1:.N]
head(matches)
person_id | x | y | block |
---|---|---|---|
DE03US001003 | 1357 | 10248 | 1 |
DE03US008001 | 4506 | 2506 | 2 |
DE03US012002 | 2706 | 12005 | 3 |
DE03US012003 | 6317 | 11103 | 4 |
DE03US013003 | 4388 | 10673 | 5 |
DE03US014003 | 9463 | 11793 | 6 |
So in our example we have 5991 pairs.
result2 <- blocking(x = census$txt, y = cis$txt, verbose = 1,
true_blocks = matches[, .(x, y, block)])
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 12671, 12306, t: 1053) =====
#> ===== creating graph =====
Let’s see how our approach handled this problem.
result2
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 8437.
#> Number of columns used for blocking: 1053.
#> Reduction ratio: 0.9999.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9
#> 5567 2086 611 142 23 6 1 1
#> ========================================================
#> Evaluation metrics (standard):
#> recall precision fpr fnr accuracy specificity
#> 99.8159 99.5326 0.0001 0.1841 99.9999 99.9999
#> f1_score
#> 99.6740
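The standard metrics reported above are computed from pairwise confusion counts. A minimal sketch with hypothetical counts (`tp`, `fp`, `fn`, `tn` are illustrative values, not taken from the output above):

```r
# Hypothetical pairwise confusion counts (illustrative only)
tp <- 5980; fp <- 28; fn <- 11; tn <- 1.5e8

recall    <- tp / (tp + fn)   # share of true pairs recovered
precision <- tp / (tp + fp)   # share of predicted pairs that are true
fnr       <- fn / (fn + tp)   # false negative rate = 1 - recall
fpr       <- fp / (fp + tn)   # false positive rate
f1        <- 2 * precision * recall / (precision + recall)

round(100 * c(recall = recall, precision = precision, fnr = fnr, f1 = f1), 4)
```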
It seems that the default parameters of the NND method result in an FNR of 0.18%. We can check whether increasing the `epsilon` parameter, as suggested in the Nearest Neighbor Descent vignette, will help.
ann_control_pars <- controls_ann()
ann_control_pars$nnd$epsilon <- 0.2
result3 <- blocking(x = census$txt, y = cis$txt, verbose = 1,
true_blocks = matches[, .(x, y, block)],
control_ann = ann_control_pars)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 12671, 12306, t: 1053) =====
#> ===== creating graph =====
Increasing the `epsilon` search parameter from 0.1 to 0.2 decreased the FNR to 0.07%.
result3
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 8451.
#> Number of columns used for blocking: 1053.
#> Reduction ratio: 0.9999.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9
#> 5592 2079 606 142 25 5 1 1
#> ========================================================
#> Evaluation metrics (standard):
#> recall precision fpr fnr accuracy specificity
#> 99.9332 99.8331 0.0000 0.0668 100.0000 100.0000
#> f1_score
#> 99.8831
Next, let us compare the NND and HNSW algorithms for this example.
result4 <- blocking(x = census$txt, y = cis$txt, verbose = 1,
true_blocks = matches[, .(x, y, block)],
ann = "hnsw")
#> ===== creating tokens =====
#> ===== starting search (hnsw, x, y: 12671, 12306, t: 1053) =====
#> ===== creating graph =====
It seems that the HNSW algorithm also achieved an FNR of 0.07%.
result4
#> ========================================================
#> Blocking based on the hnsw method.
#> Number of blocks: 8447.
#> Number of columns used for blocking: 1053.
#> Reduction ratio: 0.9999.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9
#> 5587 2079 606 142 26 5 1 1
#> ========================================================
#> Evaluation metrics (standard):
#> recall precision fpr fnr accuracy specificity
#> 99.9332 99.8331 0.0000 0.0668 100.0000 100.0000
#> f1_score
#> 99.8831
Finally, we can compare the results of the two ANN algorithms. The overlap between their candidate neighbours is given by:
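One common way to express such an overlap is the Jaccard index of the two candidate-pair sets: shared pairs over all distinct pairs. A minimal sketch, with pairs encoded as `"x-y"` keys (illustrative values; this encoding is an assumption for the sketch, not the package's API):

```r
# Candidate pairs from each result encoded as "x-y" keys (illustrative)
pairs_nnd  <- c("1-12088", "2-12156", "3-7243", "6-10195")
pairs_hnsw <- c("1-12088", "2-12156", "3-7243", "8-10717")

# Jaccard overlap: shared pairs over all distinct pairs
overlap <- length(intersect(pairs_nnd, pairs_hnsw)) /
           length(union(pairs_nnd, pairs_hnsw))

overlap  # here: 3 shared pairs out of 5 distinct pairs
```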