Title: | Matching Address Data to Reference Index |
---|---|
Description: | Matches a data set with semi-structured address data, e.g., street and house number as a concatenated string, wrongly spelled street names or non-existing house numbers to a reference index. The methods are specifically designed for German municipalities ('KOR'-community) and German address schemes. |
Authors: | Daniel Schürmann [aut, cre] |
Maintainer: | Daniel Schürmann <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2025-03-02 02:47:23 UTC |
Source: | https://github.com/cran/KOR.addrlink |
Geocode address data from German municipalities
split_address
Splits strings into street, house number and additional letter
split_number
Splits strings into house number and additional letter
addrlink
Matches split address data to reference table
Matching is based on street name, house number and additional letter.
Daniel Schürmann
Takes two data.frames with address data and merges them together.
addrlink(df_ref, df_match, col_ref = c("Strasse", "Hausnummer", "Hausnummernzusatz"), col_match = c("Strasse", "Hausnummer", "Hausnummernzusatz"), fuzzy_threshold = 0.9, seed = 1234)
df_ref |
data.frame with address references |
df_match |
data.frame with addresses to be matched |
col_ref |
character vector of length three, naming the df_ref columns which contain the street names, house numbers and additional letters (in that order) |
col_match |
character vector of length three, naming the df_match columns which contain the street names, house numbers and additional letters (in that order) |
fuzzy_threshold |
The threshold used for fuzzy matching street names |
seed |
Seed for random numbers |
The matching is done in four stages.
Stage 1 (qAdress = 1). This is an exact match (highest quality, qscore = 1)
Stage 2 (qAdress = 2). Exact match on street name, but no valid house number could be found. Be aware that random house numbers might be used. Consider setting your own seed. qscore indicates the match quality. See match_number for details.
Stage 3 (qAdress = 3). No exact match on street name could be found. Street names are fuzzy matched. The method "jw" (Jaro-Winkler distance) from package stringdist is used (see stringdist-metrics). If 1 - [Jaro-Winkler distance] is greater than fuzzy_threshold, a match is assumed. The highest score is taken and house number matching is done as outlined in Stage 2. qscore is fuzzy_score*[house number score].
Stage 4 (qAdress = 4). No match (qscore = 0)
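The four stages above can be sketched as a small decision function. This is a minimal illustration, not the package's implementation: the names classify_stage and street_similarity are hypothetical, and the similarity uses base R's Levenshtein distance (utils::adist) as a stand-in for the Jaro-Winkler distance from stringdist that addrlink is documented to use. The house number score is passed in rather than computed.

```r
# Stand-in similarity in [0, 1]; NOT Jaro-Winkler, only a base R substitute.
street_similarity <- function(a, b) {
  1 - as.vector(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Hypothetical helper: assign qAdress and qscore for one sanitized street name.
# number_exact: was the exact house number found in the matched street?
# number_score: house number score in [0, 1] (see match_number).
classify_stage <- function(street, number_exact, number_score,
                           ref_streets, fuzzy_threshold = 0.9) {
  if (street %in% ref_streets) {
    if (number_exact) {
      return(list(qAdress = 1L, street = street, qscore = 1))
    }
    return(list(qAdress = 2L, street = street, qscore = number_score))
  }
  sims <- street_similarity(street, ref_streets)
  best <- which.max(sims)
  if (sims[best] > fuzzy_threshold) {
    # Stage 3: qscore is the fuzzy score times the house number score
    return(list(qAdress = 3L, street = ref_streets[best],
                qscore = sims[best] * number_score))
  }
  list(qAdress = 4L, street = NA_character_, qscore = 0)
}

ref <- c("hauptstrasse", "ahornallee", "erster weg")
classify_stage("hauptstrase", FALSE, 0.5, ref)
```

With a different similarity measure the same threshold accepts different candidates, so the value 0.9 is only meaningful relative to the metric actually in use.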
A list
ret |
The merged dataset |
QA |
The quality markers (qAdress and qscore) |
Daniel Schürmann
This data set gives all the addresses in the city of Dortmund.
Adressen
A data.frame
STRNAME | character | street name |
STRSL | numeric | street number |
HNR | numeric | house number |
HNRZ | character | additional letter |
RW | numeric | longitude |
HW | numeric | latitude |
UBZ | numeric | subdistrict number |
This dataset contains separate street and house number information.
df1
A data.frame
gross_strasse | character | street names |
hausnr | character | house number and additional letter |
Var1 | numeric | Variable 1 |
Var2 | character | Variable 2 |
Dortmunder Statistik
This dataset contains concatenated street and house number information.
df2
A data.frame
Adresse | character | street name, house number and additional letter |
Var1 | numeric | Variable 1 |
Var2 | character | Variable 2 |
Dortmunder Statistik
This is an internal function. Please use split_address
helper_split_address(x, debug = FALSE)
x |
A character vector of length 1 |
debug |
If true, print(x) |
A list with three elements
strasse |
Extracted street name |
hnr |
Extracted house number |
hnrz |
Extracted extra letter |
Daniel Schürmann
This is an internal function. Please use split_number
helper_split_number(x, debug = FALSE)
x |
A character vector of length 1 |
debug |
If true, print(x) |
A data.frame with two elements
Hausnummer |
Extracted house number |
Zusatz |
Extracted extra letter |
Daniel Schürmann
Reversed normalized absolute distance from zero.
l1score(x)
x |
A numeric vector |
A numeric vector of the same length as x
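One plausible reading of "reversed normalized absolute distance from zero" is sketched below: absolute values are scaled by their maximum and subtracted from 1, so a distance of zero scores 1 and the largest distance scores 0. The function name l1score_sketch and the handling of an all-zero input are assumptions; the package's actual normalization may differ.

```r
# Hypothetical re-implementation; not taken from the package source.
l1score_sketch <- function(x) {
  m <- max(abs(x))
  # Assumption: if every distance is zero, every element is a perfect match
  if (m == 0) return(rep(1, length(x)))
  1 - abs(x) / m
}

# Distances of candidate house numbers from the requested one
l1score_sketch(c(0, 2, -4))
```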
Daniel Schürmann
This is an internal function. Please use addrlink
match_number(record, Adressen, weights = c(0.9, 0.1))
record |
data.frame with one row and three columns (Strasse, Hausnummer, Hausnummernzusatz) |
Adressen |
data.frame of all valid addresses (same columns as record data.frame) |
weights |
The weighting factors for house number and additional letter |
If no house number and no additional letter is provided, a random address in the given street is selected (qscore = 0).
If only an additional letter but no house number is given and the letter is unique, returns the corresponding record (qscore = 0.05). Otherwise returns a random one as mentioned above (qscore = 0).
If no additional letter, but house number is provided and the maximum distance to a valid house number is 4, return the closest match as calculated by l1score (qscore is the result of l1score). Otherwise a random record is returned (qscore = 0).
If additional letter and house number are available and the house number distance is smaller than 4, calculates the l1scores of the house number distance and the additional letter distance and selects the best match (qscore is the sum of both weighted l1scores). Otherwise a random record is selected (qscore = 0).
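The last case (house number and additional letter both given) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the helper names pick_candidate and l1, the letter distance measured as a difference of alphabet positions with an empty letter counted as position 0, and the cut-off treated as "distance of at most 4". The candidate table is expected to hold the addresses of one already matched street.

```r
# Assumed scoring: 1 - |d| / max(|d|), all-zero input scores 1 (hypothetical).
l1 <- function(d) {
  m <- max(abs(d))
  if (m == 0) rep(1, length(d)) else 1 - abs(d) / m
}

# Hypothetical candidate picker for one street.
pick_candidate <- function(hnr, letter, cand, weights = c(0.9, 0.1), max_dist = 4) {
  d_num <- cand$Hausnummer - hnr
  if (min(abs(d_num)) > max_dist) {
    # No house number is close enough: random record, qscore = 0
    pick <- cand[sample.int(nrow(cand), 1), ]
    return(data.frame(qscore = 0, pick, row.names = NULL))
  }
  # Alphabet position of a letter; the empty string counts as 0
  pos <- function(z) ifelse(z == "", 0, match(tolower(z), letters))
  d_let <- pos(cand$Hausnummernzusatz) - pos(letter)
  # Weighted sum of both scores, best candidate wins
  score <- weights[1] * l1(d_num) + weights[2] * l1(d_let)
  best <- which.max(score)
  data.frame(qscore = score[best], cand[best, ], row.names = NULL)
}

cand <- data.frame(Strasse = "ahornallee",
                   Hausnummer = c(10, 10, 12, 40),
                   Hausnummernzusatz = c("", "a", "", ""),
                   stringsAsFactors = FALSE)
pick_candidate(10, "a", cand)
```

Because the random fallback draws from the candidate table, results in that branch are only reproducible when a seed is set beforehand, which is what the seed argument of addrlink is for.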
A data.frame
qscore |
The quality score of the match |
Strasse |
matched street |
Hausnummer |
matched house number |
Hausnummernzusatz |
matched additional letter |
Daniel Schürmann
This function replaces Umlauts, expands "str" to "strasse", transliterates all non-ascii characters, removes punctuation and converts to lower case.
sanitize_street(x)
x |
A character vector containing the street names |
This is an internal function used in addrlink. Make sure house numbers have already been extracted. Use split_number or split_address for that. Only street names can go into sanitize_street.
A character vector of the same length as x containing the sanitized street names.
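The steps listed in the description can be approximated in base R as follows. This is a sketch under assumptions, not the package source: the function name sanitize_sketch, the order of the steps, the rule for expanding "str" (only at the end of a word, with an optional trailing dot) and UTF-8 input are all choices made for illustration. The result of iconv with //TRANSLIT for characters other than German umlauts depends on the platform.

```r
# Hypothetical re-implementation of the documented steps.
sanitize_sketch <- function(x) {
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to   <- c("ae", "oe", "ue", "Ae", "Oe", "Ue", "ss")
  # Replace umlauts and sharp s explicitly
  for (i in seq_along(from)) x <- gsub(from[i], to[i], x, fixed = TRUE)
  # Transliterate remaining non-ASCII characters (platform dependent)
  x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
  x <- tolower(x)
  # Expand a trailing "str" or "str." to "strasse"
  x <- gsub("str\\.?(?=\\s|$)", "strasse", x, perl = TRUE)
  # Remove punctuation and surrounding whitespace
  x <- gsub("[[:punct:]]", "", x)
  trimws(x)
}

sanitize_sketch(c("K\u00f6nigsstr.", "Hauptstra\u00dfe", "Erster Weg"))
```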
Daniel Schürmann
split_address, split_number, addrlink
This function takes a character vector where each element is made up from a concatenation of street name, house number and possibly an additional letter and splits it into its parts.
split_address(x, debug = FALSE)
x |
A character vector |
debug |
If true, all records will be printed to the console |
If the function fails, consider using debug = TRUE. This will print the record that caused the error. Consider filing an issue on the linked git project (see DESCRIPTION).
A data.frame with three columns
Strasse |
A character column containing the extracted street names |
Hausnummer |
House number |
Hausnummernzusatz |
Additional letter |
For a more advanced, general purpose solution see libpostal.
Daniel Schürmann
split_address(c("Teststr. 8-9 a", "Erster Weg 1-2", "Ahornallee 100a-102c"))
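To make the idea of splitting concrete, here is a simplified base R sketch that extracts the street, the first house number and the letter directly attached to it. It is not the package's algorithm: the name split_sketch and the regular expression are assumptions, ranges such as "8-9" are reduced to their first number, and street names that contain digits are not handled.

```r
# Hypothetical, simplified splitter using a single regular expression.
split_sketch <- function(x) {
  # street (no digits), first number, optional single letter not followed by a letter
  pat <- "^\\s*(\\D*?)\\s*(\\d+)\\s*([[:alpha:]]?)(?![[:alpha:]])"
  m <- regmatches(x, regexec(pat, x, perl = TRUE))
  rows <- lapply(seq_along(x), function(i) {
    hit <- m[[i]]
    if (length(hit) == 0) {
      # No house number found: keep the whole string as street name
      return(data.frame(Strasse = trimws(x[i]), Hausnummer = NA_integer_,
                        Hausnummernzusatz = "", stringsAsFactors = FALSE))
    }
    data.frame(Strasse = hit[2], Hausnummer = as.integer(hit[3]),
               Hausnummernzusatz = hit[4], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

split_sketch(c("Teststr. 8-9 a", "Erster Weg 1-2", "Ahornallee 100a-102c"))
```

The column names mirror those returned by split_address so that the output shape is comparable, but the values for range inputs may differ from what the package returns.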
This function takes a character vector where each element is made up from a concatenation of house number and possibly an additional letter and splits it into its parts.
split_number(x, debug = FALSE)
x |
A character vector |
debug |
If true, all records will be printed to the console |
If the function fails, consider using debug = TRUE. This will print the record that caused the error. Consider filing an issue on the linked git project (see DESCRIPTION).
A data.frame with two columns
Hausnummer |
House number |
Hausnummernzusatz |
Additional letter |
For a more advanced, general purpose solution see libpostal.
Daniel Schürmann
split_number(c("8-9 a", "1-2", "100a-102c"))