In this example we use the same dataset as in the Blocking records for record linkage vignette.
reclin2 package
The package contains the function pair_ann, which integrates with the reclin2 package. This function works as follows.
pair_ann(x = census[1:1000],
         y = cis[1:1000],
         on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
         deduplication = FALSE) |>
  head()
.x | .y | block |
---|---|---|
204 | 1 | 1 |
204 | 176 | 1 |
204 | 375 | 1 |
204 | 391 | 1 |
204 | 405 | 1 |
204 | 424 | 1 |
The result provides information on the total number of pairs; head() shows only the first six. It can then be passed further down the pipeline of the reclin2 package (note that we use a different ANN algorithm this time, hnsw).
pair_ann(x = census[1:1000],
         y = cis[1:1000],
         on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
         deduplication = FALSE,
         ann = "hnsw") |>
  compare_pairs(on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
                comparators = list(cmp_jarowinkler())) |>
  score_simple("score",
               on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc")) |>
  select_threshold("threshold", score = "score", threshold = 6) |>
  link(selection = "threshold") |>
  head()
.y | .x | person_id.x | pername1.x | pername2.x | sex.x | dob_day.x | dob_mon.x | dob_year.x | hse_num | enumcap.x | enumpc.x | str_nam | cap_add | census_id | x | txt.x | person_id.y | pername1.y | pername2.y | sex.y | dob_day.y | dob_mon.y | dob_year.y | enumcap.y | enumpc.y | cis_id | y | txt.y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 945 | DE256NG039003 | HARRIET | THOMSON | F | 12 | 1 | 1995 | 39 | 39 SPRINGFIELD ROAD | DE256NG | Springfield Road | 39, Springfield Road | CENSDE256NG039003 | 945 | HARRIETTHOMSONF121199539 SPRINGFIELD ROADDE256NG | DE256NG039003 | HARRIET | THOMSON | F | 12 | 1 |  | 39 SPRINGFIELD ROAD | DE256NG | CISDE256NG039003 | 11 | HARRIETTHOMSONF12139 SPRINGFIELD ROADDE256NG |
71 | 427 | DE159QA062001 | LEWIS | GREEN | M | 23 | 3 | 1973 | 62 | 62 CHURCH ROAD | DE159QA | Church Road | 62, Church Road | CENSDE159QA062001 | 427 | LEWISGREENM233197362 CHURCH ROADDE159QA | DE159QA062001 | LEWIS | GREEN | M | 23 | 3 |  | 62 CHURCH ROAD | DE159QA | CISDE159QA062001 | 71 | LEWISGREENM23362 CHURCH ROADDE159QA |
83 | 720 | DE237GG025002 | IMOGEN | DARIS | F | 6 | 4 | 1968 | 25 | 25 WOODLANDS ROAD | DE237GG | Woodlands Road | 25, Woodlands Road | CENSDE237GG025002 | 720 | IMOGENDARISF64196825 WOODLANDS ROADDE237GG | DE237GG025002 | IMOGEW | DAVIS | F | 6 | 4 |  | 25 WOODLANDS ROAD | DE237GG | CISDE237GG025002 | 83 | IMOGEWDAVISF6425 WOODLANDS ROADDE237GG |
99 | 136 | DE125LU022001 | DANIEC | MICCER | M | 21 | 4 | 1947 | 22 | 22 PARK LANE | DE125LU | Park Lane | 22, Park Lane | CENSDE125LU022001 | 136 | DANIECMICCERM214194722 PARK LANEDE125LU | DE125LU022001 | DAMIEL | HILLER | M | 21 | 4 |  | 22 PARK LANE | DE125LU | CISDE125LU022001 | 99 | DAMIELHILLERM21422 PARK LANEDE125LU |
154 | 949 | DE256NG040002 | CHLOE | WILSON | F | 5 | 7 | 1978 | 40 | 40 SPRINGFIELD ROAD | DE256NG | Springfield Road | 40, Springfield Road | CENSDE256NG040002 | 949 | CHLOEWILSONF57197840 SPRINGFIELD ROADDE256NG | DE256NG040002 | CHLOE | WILSOM | F | 5 | 7 |  | 40 SPRINGFIELD ROAD | DE256NG | CISDE256NG040002 | 154 | CHLOEWILSOMF5740 SPRINGFIELD ROADDE256NG |
156 | 549 | DE159QY035002 | AVA | KING | F | 7 | 7 | 1969 | 35 | 35 CHURCH ROAD | DE159QY | Church Road | 35, Church Road | CENSDE159QY035002 | 549 | AVAKINGF77196935 CHURCH ROADDE159QY | DE159QY035002 | AVA | KING | F | 7 | 7 |  | 35 CHURCH ROAD | DE159QY | CISDE159QY035002 | 156 | AVAKINGF7735 CHURCH ROADDE159QY |
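To make the role of threshold = 6 in the pipeline above more concrete, here is a minimal base R sketch. It assumes that each of the eight linkage variables contributes a similarity between 0 and 1, that score_simple() with its default weights adds these up, and that a missing value contributes 0; these are assumptions about reclin2's defaults, and the similarity values below are made up for illustration.

```r
# Hypothetical per-variable similarities for three candidate pairs.
# A 0 in dob_year stands in for a missing value (assumed to contribute 0).
vars <- c("pername1", "pername2", "sex", "dob_day",
          "dob_mon", "dob_year", "enumcap", "enumpc")
sims <- rbind(
  pair_1 = c(1.00, 1.00, 1, 1, 1, 0, 1.00, 1),
  pair_2 = c(0.93, 0.87, 1, 1, 1, 0, 1.00, 1),
  pair_3 = c(0.40, 0.35, 0, 0, 1, 0, 0.50, 0)
)
colnames(sims) <- vars

# Simple score: the sum of the similarities across variables.
score <- rowSums(sims)

# Keep only the pairs whose score exceeds the threshold.
selected <- score > 6
print(data.frame(score = score, selected = selected))
```

Under these assumptions the first two pairs are kept and the third is dropped, which shows why a threshold of 6 out of a maximum of 8 still tolerates one missing field and a few typos in the names.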
fastLink package
Use the block column in the function fastLink::blockData(). As a result, you will obtain a list of blocked records for further processing.
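A minimal sketch of this step is given below. The data frames df_a and df_b and the pairs table are hypothetical toy objects, with pairs shaped like the pair table shown earlier (.x, .y, block). The sketch assumes that every record falls into exactly one block, and it only calls fastLink if the package is installed.

```r
# Toy data standing in for the two files being linked (hypothetical values).
df_a <- data.frame(name = c("ANNA", "JOHN", "MARY", "MARIA"),
                   stringsAsFactors = FALSE)
df_b <- data.frame(name = c("ANA", "JON", "MARIE"),
                   stringsAsFactors = FALSE)

# Hypothetical blocking result: row index in df_a, row index in df_b, block id.
pairs <- data.frame(.x    = c(1L, 2L, 3L, 4L),
                    .y    = c(1L, 2L, 3L, 3L),
                    block = c(1L, 2L, 2L, 2L))

# Attach the block id to each record in both files.
df_a$block <- NA_integer_
df_a$block[pairs$.x] <- pairs$block
df_b$block <- NA_integer_
df_b$block[pairs$.y] <- pairs$block

# Exact blocking on the block column; n.cores = 1 keeps the toy run serial.
if (requireNamespace("fastLink", quietly = TRUE)) {
  blocks <- fastLink::blockData(df_a, df_b, varnames = "block", n.cores = 1)
  str(blocks, max.level = 2)
}
```

Each element of the returned list can then be processed separately, for example by running fastLink on the corresponding subsets of the two files.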
RecordLinkage package
Use the block column in the blockfld argument of the compare.dedup() or compare.linkage() function. Please note that for the RecordLinkage package the block column should be stored as a character vector, not a numeric/integer vector.
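The conversion and the call could look roughly as follows. The data frames df_a and df_b are hypothetical toy data that already carry a block column; the call to compare.linkage() is only made if RecordLinkage is installed.

```r
# Toy data with a block column already attached (hypothetical values).
df_a <- data.frame(name  = c("ANNA", "JOHN", "MARY"),
                   block = c(1L, 2L, 2L),
                   stringsAsFactors = FALSE)
df_b <- data.frame(name  = c("ANA", "JON", "MARIE"),
                   block = c(1L, 2L, 2L),
                   stringsAsFactors = FALSE)

# Store the block ids as character, not numeric/integer.
df_a$block <- as.character(df_a$block)
df_b$block <- as.character(df_b$block)

# Only records sharing the same block value are compared.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  rl_pairs <- RecordLinkage::compare.linkage(df_a, df_b, blockfld = "block")
  print(head(rl_pairs$pairs))
}
```

For deduplication of a single file, the same block column can be passed to compare.dedup() through its blockfld argument.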