| HairyEars ( @ 2007-05-23 19:01:00 |
| Entry tags: | excel |
String-Matching in Excel: VLookup() with fuzzy-matching to get a 'closest match' result
Ever had to look up a name or an address in a list that doesn't quite match, so the standard Excel 'VLookup()' and 'Match()' functions don't help you?
Thanks to some suggestions made by friends on a previous post, I've been able to code a lookup function that returns the closest match in a list:
Here's the details:
Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant)
Find the best match for a given string in column 1 of an array of data obtained from an Excel range
This is functionally similar to VLookup, but it returns the best match, not the first exact match
This function is not case-sensitive.
- If your data quality is poor, you are advised to display the retrieved index value from column 1 and use the MatchWord function on this index value to reveal the fuzzy-matching 'score' and discard all results below a threshold score.
- Use VFuzzyLookup_Phrase if you are trying to match a phrase or sentence (a sequence of words separated by spaces) as that function is faster, and has additional logic for the word order.
- Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional logic to normalise the common abbreviations and word-order conventions used in British addresses.
Yes, you read that correctly: you have a best-match lookup, vFuzzyLookup_Address() for lists of names and addresses. It could do with some performance-tuning (I wouldn't try it on lists exceeding 1024 members if you're in any kind of hurry) but it's usable.
The functions which will be of most interest to other programmers are probably MatchWord(), the simple string-comparison function which returns a percentage score based on Levenshtein edit distance; and MatchPhrase(), which uses MatchWord to create a comparison grid for the words of two sentences, and compares the constructed sequence of 'best match' words in the first sentence with their actual occurrence in the second.
- VfuzzyLookup()
- VFuzzyLookup_Phrase
- VFuzzyLookup_Address
- MatchPhrase
- MatchWord
- Levenshtein
- NormaliseAddress
- StripChars
- Substitute
In another post I have coded up an alternative approach to the Levenshtein edit distance: a 'sum of common strings' score that may give better results for longer phrases and passages of text.
Option Explicit
Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)
' Find the best match for a given string in column 1 of an array of data obtained from an Excel range
' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive.
' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchWord() function on this index value to reveal the fuzzy-matching 'score' and
' discard all results below a threshold score.
' Use VFuzzyLookup_Phrases if you are trying to match a phrase or sentence (a sequence of words
' separated by spaces) as that function is faster, and has additional logic for the word order.
' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional
' logic to normalise the common abbreviations and word-order conventions used in British addresses.
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As Integer
Dim dblMatch As Double
Dim iRow As Integer
Dim strTest As String
Dim strInput As String
Dim iStartCol As Integer
Dim iEndCol As Integer
Dim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
'Table_Array is not an array
VFuzzyLookup = "#VALUE"
Exit Function
End If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
Table_Array = Table_Array.Value
End If
' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
'Out-of-bounds
VFuzzyLookup = "#VALUE"
Exit Function
End If
strInput = UCase(Lookup_Value)
iRowBest = -1
dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = ""
strTest = Table_Array(iRow, iStartCol)
dblMatch = 0
dblMatch = MatchWord(strInput, strTest)
If dblMatch = 1 Then ' Bail out on finding an exact match
iRowBest = iRow
Exit For
End If
If dblMatch > dblBestMatch Then
dblBestMatch = dblMatch
iRowBest = iRow
End If
Next iRow
If iRowBest = -1 Then
VFuzzyLookup = "#NO MATCH"
Exit Function
End If
VFuzzyLookup = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function MatchWord(ByVal str1 As String, ByVal str2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double
' Returns a percentage estimate of how closely word 1 matches word 2
' Edit distances exceeding the length of str1 are discarded, returning a percentage match of zero
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim maxLen As Integer
Dim minLen As Integer
If Compare = vbTextCompare Then
str1 = UCase(str1)
str2 = UCase(str2)
End If
If str1 = str2 Then
MatchWord = 1
Exit Function
End If
If Len(str1) > Len(str2) Then
maxLen = Len(str1)
minLen = Len(str2)
Else
maxLen = Len(str2)
minLen = Len(str1)
End If
MatchWord = 0
MatchWord = Levenshtein(str1, str2)
If MatchWord >= minLen Then
MatchWord = 0
Else
MatchWord = (maxLen - MatchWord) / maxLen
End If
End Function
Public Function VFuzzyLookup_Phrase(Lookup_Phrase As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)
' Find the best match for a given phrase in column 1 of an array of data obtained from an Excel range
' Use this function to match a sentence (a sequence of words separated by spaces). Returns a score
' based on matching word order weighted by the string-matching score of each individual word.
' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive.
' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider
' discarding all results below a threshold score.
' Use VFuzzyLookup for simple string comparisons.
' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional
' logic to normalise the common abbreviations and word-order conventions used in British addresses.
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As Integer
Dim dblMatch As Double
Dim iRow As Integer
Dim strTest As String
Dim strInput As String
Dim iStartCol As Integer
Dim iEndCol As Integer
Dim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
'Table_Array is not an array
VFuzzyLookup_Phrase = "#VALUE"
Exit Function
End If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
Table_Array = Table_Array.Value
End If
' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
'Out-of-bounds
VFuzzyLookup_Phrase = "#VALUE"
Exit Function
End If
strInput = UCase(Lookup_Phrase)
iRowBest = -1
dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = ""
strTest = Table_Array(iRow, iStartCol)
dblMatch = 0
dblMatch = MatchPhrase(strInput, strTest) ' Consider coding up a MatchPhrase_Express() function, with the preprocessing
If dblMatch = 1 Then ' Bail out on finding an exact match
iRowBest = iRow
Exit For
End If
If dblMatch > dblBestMatch Then
dblBestMatch = dblMatch
iRowBest = iRow
End If
Next iRow
If iRowBest = -1 Then
VFuzzyLookup_Phrase = "#NO MATCH"
Exit Function
End If
VFuzzyLookup_Phrase = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function VFuzzyLookup_Address(Lookup_Address As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)
' Find the best match for a given postal address in column 1 of a table of addresses obtained from an Excel range
' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive and ignores common abbreviations eg: 'St' for 'Street'
' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider
' discarding all results below a threshold score.
' Use VFuzzyLookup for simple string comparisons.
' Use vFuzzyLookup_Phrase if you are looking up phrases or sentences that are not addresses: this address lookup
' function discards a lot of common words like 'Street' and expands abbreviations like 'Ave' and 'Blvd'
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim dblBestMatch As Double
Dim iRowBest As Integer
Dim dblMatch As Double
Dim iRow As Integer
Dim strTest As String
Dim strInput As String
Dim iStartCol As Integer
Dim iEndCol As Integer
Dim iOffset As Integer
If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
'Table_Array is not an array
VFuzzyLookup_Address = "#VALUE"
Exit Function
End If
If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
Table_Array = Table_Array.Value
End If
' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.
iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol
Col_Index_Num = Col_Index_Num - iOffset
If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
'Out-of-bounds
VFuzzyLookup_Address = "#VALUE"
Exit Function
End If
strInput = Lookup_Address
strInput = NormaliseAddress(strInput)
If strInput = "" Then
'Out-of-bounds
VFuzzyLookup_Address = "#CANNOT READ ADDRESS"
Exit Function
End If
iRowBest = -1
dblBestMatch = 0
For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)
strTest = ""
strTest = Table_Array(iRow, iStartCol)
strTest = NormaliseAddress(strTest)
If strTest <> "" Then
dblMatch = 0
dblMatch = MatchPhrase(strInput, strTest) ' Consider coding up a MatchPhrase_Express() function, with the preprocessing
If dblMatch = 1 Then ' Bail out on finding an exact match
iRowBest = iRow
Exit For
End If
If dblMatch > dblBestMatch Then
dblBestMatch = dblMatch
iRowBest = iRow
End If
End If ' strTest <> ""
Next iRow
If iRowBest = -1 Then
VFuzzyLookup_Address = "#NO MATCH"
Exit Function
End If
VFuzzyLookup_Address = Table_Array(iRowBest, Col_Index_Num)
End Function
Public Function MatchPhrase(ByVal Phrase1 As String, ByVal Phrase2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double
' Function to compare two sentences. A version of this will be released to cater for the
' specific needs of matching addresses, where we can make some assumptions about common
' word-substitutions and abbreviations.
' THIS CODE IS IN THE PUBLIC DOMAIN
' This function consists of six processes:
' 1 Break out the phrases into arrays of words using the space character as the delimiter
' 2 Populate a grid of word-matching scores for each word in Phrase 1 against Phrase 2;
' 3 For each word in Phrase 1, identify the 'best match' from the words in Phrase 2
' 4 Resolve 'collisions' - two or more words in phrase 1 matching the same word in phrase 2
' 5 Compare the actual sequence of words in P1 with the positions of the matched words in P2
' 6 Weight this comparison by the degree of matching measured at the level of individual words
' Process 4, resolving collisions, is an iterative loop inside process 3
' Process 1 has an addditional step to check for deleted spaces
Dim arr1() As String ' Phrase 1, broken out into individual words
Dim arr2() As String
Dim arrScores() As Double ' an array of percentage matches of each word in p1 against each word in p2
' These two vectors are redundant in the sense that they hold information which
' can be extracted from arrScores(). However, using them saves a lot of looping:
Dim arrPositions() As Integer ' For each word in p1, the position of the best-matching word in p2
Dim arrSequence() As Double ' For each word in p1, a score for its concordance with a constructed sequence of matching words in P2
Dim n As Double ' should be an integer, but it will be used in floating-point
' division and I prefer to avoid casting in VBA
Dim s1 As String
Dim s2 As String
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim iOffset As Integer
Dim iShift As Integer
Dim iDelete As Integer
Dim iPos As Integer
Dim jPos As Integer
Dim kPos As Integer
Dim iTotalLen As Integer
Dim dScore As Double
Dim dBest As Double
Dim dPenalty As Double
Dim d1 As Double
Dim d2 As Double
If Compare = vbTextCompare Then
Phrase1 = UCase(Phrase1)
Phrase2 = UCase(Phrase2)
End If
If Phrase1 = Phrase2 Then
MatchPhrase = 1
Exit Function
End If
' The line labels SplitSpace1 and SplitSpace2 are resynchronisation points for
' restarting the process after restoring a deleted space in Phase1 or Phrase2.
Phrase1 = StripChars(Phrase1, " ")
SplitSpace1:
arr1 = Split(Phrase1, " ")
Phrase2 = StripChars(Phrase2, " ")
SplitSpace2:
arr2 = Split(Phrase2, " ")
ReDim arrScores(LBound(arr1) To UBound(arr1), LBound(arr2) To UBound(arr2))
ReDim arrPositions(LBound(arr1) To UBound(arr1))
ReDim arrSequence(LBound(arr1) To UBound(arr1))
' Test for deleted spaces. This is a lot of work, but a missing space is a
' common error and the effects are out of all proportion to the size of the
' error: so much so that I'm prepared to risk the occasional 'false alarm'.
' It may even be worth repeating these two loops using fuzzy-matching with
' Levenshtein scores rather than the simple string-comparisons shown below:
For i = LBound(arr1) To UBound(arr1) - 1
If arr1(i) <> "" And arr1(i + 1) <> "" Then
s1 = arr1(i) & arr1(i + 1)
For j = LBound(arr2) To UBound(arr2)
If UCase(arr2(j)) = UCase(s1) Then
Phrase2 = Substitute(Phrase2, arr2(j), arr1(i) & " " & arr1(i + 1), 1, Compare)
GoTo SplitSpace2
End If
Next j
End If ' arr(i) = "" Or arr(i + 1) = "" Then
Next i
For j = LBound(arr2) To UBound(arr2) - 1
If arr2(j) <> "" And arr2(j + 1) <> "" Then
s2 = arr2(j) & arr2(j + 1)
For i = LBound(arr1) To UBound(arr1)
If UCase(arr1(i)) = UCase(s2) Then
Phrase1 = Substitute(Phrase1, arr1(i), arr2(j) & " " & arr2(j + 1), 1, Compare)
GoTo SplitSpace1
End If
Next i
End If
Next j
' Initialise the positions array with a negative value denoting 'not found'
For i = LBound(arr1) To UBound(arr1)
arrPositions(i) = -1
iTotalLen = iTotalLen + Len(arr1(i))
Next i
' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.
For i = LBound(arr1) To UBound(arr1)
s1 = arr1(i)
dBest = 0
iPos = -1
For j = LBound(arr2) To UBound(arr2)
s2 = arr2(j)
dScore = 0
dScore = MatchWord(s1, s2, Compare)
arrScores(i, j) = dScore
If dScore > dBest Then
dBest = dScore
iPos = j
End If
Next j
If iPos >= 0 Then
arrPositions(i) = iPos
End If
Next i
' Resolve collisions - two or more words in P1 that have 'best match' scores on the same word in p2
' In theory this could be done without using the positions vector, as the information is in arrScores
' In practice, arrPositions saves processing steps
For i = LBound(arrPositions) To UBound(arrPositions)
iPos = arrPositions(i)
For j = i + 1 To UBound(arrPositions)
If iPos = arrPositions(j) And iPos >= 0 Then
' Collision detected: which word has the best score?
d1 = arrScores(i, iPos)
d2 = arrScores(j, iPos)
If d2 > d1 Then
'discard this recorded 'best match' position:
arrScores(i, iPos) = -1
'find the second-best score for d1
dBest = 0
kPos = -1
For k = LBound(arrScores, 2) To UBound(arrScores, 2)
dScore = 0
dScore = arrScores(i, k)
If dScore > dBest Then
dBest = dScore
kPos = k
End If
Next k
' reset this conflicting position as word (i)'s match in phrase 2:
arrPositions(j) = kPos
' There is now a possibility that we have caused
' a collision with a previous word in Phrase 1:
If k < i Then
For k = LBound(arrPositions) To k - 1
If arrPositions(k) = kPos Then
'restart the loop at the colliding value
i = k
j = UBound(arr1) + 1
Exit For
End If
Next k
End If ' k<1
Else
' discard this recorded 'best match' position:
arrScores(j, iPos) = -1
'find the second-best score for d2 *after* the current position
dBest = 0
kPos = -1
For k = j + 1 To UBound(arr2)
dScore = 0
dScore = arrScores(j, k)
If dScore > dBest Then
dBest = dScore
kPos = k
End If
Next k
arrPositions(j) = kPos
End If ' d2 > d1
End If
Next j
Next i
' Constructing a sequence-matching score:
' If we were scoring jumbled sentences of unaltered words, we'd use an edit distance algorithm;
' several are available, including replicating the Levenshtein distance at the word level. I've
' chosen a crude single-pass algorithm with a forward bias, that 'expects' the word sequence to
' resynchronise after each out-of-sequence word. It's quick, and the bias is valid - word-order
' is not neutral in real-life examples, and the heavy penalty for word transpositions reflects
' my belief that this is a more significant 'edit' than character transpositions in a word. A
' more rigorous treatment would venture into the realms of natural-language processing; that is
' out-of-scope for this application and far too ambitious for a self-contained function in VBA.
' Worked example:
' Compare two Phrases:
' "ABC DEF GHI JKL MNO PQR STU VWX", "ABC DEF JKL STU MNO PQR VWX"
' Variable arrPositions records the placement of each word in phrase 1 in phrase 2:
' Phrase 1 "ABC DEF GHI JKL MNO PQR STU VWX"
' Expected positions: 0 1 2 3 4 5 6
' Actual position in p2 0 1 -1 2 4 5 3 6
' The variable arrSequence will capture the scores
' Run the sequence-scoring loop:
' ABC expected in position 0 found in 0 Score 1/8
' DEF expected in position 1 found in 1 Score 1/8
' GHI expected in position 2 DELETION * frame shift -1 * Score NIL
' JKL expected in position 3-1 found in 2 Score 1/8
' MNO expected in position 4-1 found in 4 * frame shift +1 * Score 1/8 * 7/8
' PQR expected in position 5 found in 5 Score 1/8
' STU expected in position 6 found in 3 * frame shift -3 * Score 1/8 * (7/8)^3
' VWX expected in position 7-3 found in 6 * frame shift +2 * Score 1/8 * (7/8)^2
' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times
' However, we do not deal with perfectly-matched words in real life, so we cannot apply
' these penalties at the level of the entire phrase; we apply them at the level of the
' individual word, where we can apply a weighting based on each word's Levenshtein score
' The exception is deleted words; we could consider the 'word match' weighting of zero
' to be sufficient penalty but a more consistent result is obtained by applying a penalty
' to the entire phrase
' Sanity check; run the function in reverse, testing Phrase 2 against phrase 1:
' Phrase 2 "ABC DEF JKL STU MNO PQR VWX"
' Expected positions: 0 1 2 3 4 5 6
' Actual position in p1 0 1 3 6 4 5 7
' ABC expected in position 0 found in 0 Score 1/8
' DEF expected in position 1 found in 1 Score 1/8
' JKL expected in position 2 found in 3 * frame shift +1 * Score 1/8 * 7/8
' STU expected in position 3+1 found in 6 * frame shift +2 * Score 1/8 * (7/8)^2
' MNO expected in position 4+3 found in 4 * frame shift -3 * Score 1/8 * (7/8)^3
' PQR expected in position 5 found in 5 Score 1/8
' VWX expected in position 6 found in 7 * frame shift -1 * Score 1/8 * 7/8
' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times
' "But wasn't there an insertion, too? Phrase 1 has an extra word that isn't in Phrase 2!"
' Note that our choice of denominator (8, the longer of the two wordcounts) has the effect of
' imputing a score of zero to the inserted word and applying a penalty of 7/8 to the entire phrase.
' A note on identifying the 'inserted word': actually, it's the word in Phrase 1 which didn't
' score as 'best match' against any word in Phrase 2. It could've come a close second to any or
' all of them.
If UBound(arr1) >= UBound(arr2) Then
n = UBound(arr1) + 1
Else
n = UBound(arr2) + 1
End If
dPenalty = 1 - (1 / n)
iShift = 0 ' Sequence distance for out-of-place words
iOffset = 0 ' Running total of this 'shift' variable
iDelete = 0 ' Count the number of deletions
For i = LBound(arrPositions) To UBound(arrPositions)
s1 = arr1(i)
iPos = arrPositions(i)
iShift = iPos - i - iOffset
Select Case iPos
Case Is < 0 'DELETION: no matching word was found in S2
iShift = -1
arrSequence(i) = 0
iDelete = iDelete + 1
Case Is = i + iOffset ' matched word is in the expected position
iShift = 0
arrSequence(i) = 1 / n
Case Else
arrSequence(i) = (dPenalty ^ Abs(iShift)) / n
End Select
iOffset = iOffset + iShift
Next i
MatchPhrase = 0
For i = LBound(arrPositions) To UBound(arrPositions)
dScore = 0
If arrPositions(i) > -1 Then
dScore = arrScores(i, arrPositions(i))
dScore = dScore * arrSequence(i)
Else
'apply a deletion penalty - this isn't as arbitrary as it might seem: it is a equivalent to the
' effe
dScore = -Len(arr1(i)) / iTotalLen / n
End If
MatchPhrase = MatchPhrase + dScore
Next i
ExitFunction:
Erase arrScores
Erase arrSequence
Erase arr1
Erase arr2
End Function
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim min As Integer
min = a
If b < min Then
min = b
End If
If c < min Then
min = c
End If
Minimum = min
End Function
Private Function Levenshtein(ByVal s1 As String, ByVal s2 As String) As Integer
' Levenshtein Distance - edit distance between two strings
' THIS CODE IS IN THE PUBLIC DOMAIN
Dim arr() As Integer ' Scoring matrix
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' iterates through s1
Dim j As Integer ' iterates through s2
Dim s1_i As String ' ith character of s1
Dim s2_j As String ' jth character of s2
Dim cost As Integer ' cost
n = Len(s1)
m = Len(s2)
If n = 0 Then
Levenshtein = m
Exit Function
End If
If m = 0 Then
Levenshtein = n
Exit Function
End If
ReDim arr(0 To n, 0 To m) As Integer
For i = 0 To n
arr(i, 0) = i
Next i
For j = 0 To m
arr(0, j) = j
Next j
For i = 1 To n
s1_i = Mid$(s1, i, 1)
For j = 1 To m
s2_j = Mid$(s2, j, 1)
If s1_i = s2_j Then
cost = 0
Else
cost = 1
End If
arr(i, j) = Minimum(arr(i - 1, j) + 1, arr(i, j - 1) + 1, arr(i - 1, j - 1) + cost)
Next j
Next i
' Step 7
Levenshtein = arr(n, m)
ExitSub:
Erase arr
End Function
Public Function NormaliseAddress(ByVal strAddress As String) As String
' This function is intended to remove or standardise common phrases
' and abbreviations used in British postal addresses, allowing the use
' of string-comparison algorithms in lists of names and addresses.
' Developers in other countries should review the word list used here,
' as conventions probably differ in your local language or dialect.
strAddress = " " & UCase(strAddress) & " "
strAddress = Substitute(strAddress, ",", " ")
strAddress = Substitute(strAddress, ".", " ")
strAddress = Substitute(strAddress, "-", " ")
strAddress = Substitute(strAddress, vbCrLf, " ")
strAddress = Substitute(strAddress, " BLVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " BVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " AV ", " AVENUE ")
strAddress = Substitute(strAddress, " AVE ", " AVENUE ")
strAddress = Substitute(strAddress, " RD ", " ROAD ")
strAddress = Substitute(strAddress, " WY ", " WAY ")
strAddress = Substitute(strAddress, " EST ", " ESTATE ")
strAddress = Substitute(strAddress, " PL ", " PLACE ")
strAddress = Substitute(strAddress, " PK ", " PARK ")
strAddress = Substitute(strAddress, " HSE ", " HOUSE ")
strAddress = Substitute(strAddress, " H0 ", " HOUSE ")
strAddress = Substitute(strAddress, " GDNS ", " GARDENS ")
strAddress = Substitute(strAddress, "&", "AND")
strAddress = Substitute(strAddress, " LIMITED ", " LTD ")
strAddress = Substitute(strAddress, " COMPANY ", " CO ")
strAddress = Substitute(strAddress, " CORPORATION ", " CORP ")
strAddress = Substitute(strAddress, " T/A ", " TA ")
strAddress = Substitute(strAddress, " TRADING AS ", " TA ")
' Common personal titles: these are often applied inconsistently or
' omitted, and must therefore be removed. Specific applications may
' require additional titles and their abbreviations - military rank,
' academic titles and degrees, courtesy titles of the aristocracy,
' knighthoods and honours (particularly for lists of civil servants)
strAddress = Substitute(strAddress, " ESQ ", " ")
strAddress = Substitute(strAddress, " MR ", " ")
strAddress = Substitute(strAddress, " MRS ", " ")
strAddress = Substitute(strAddress, " MISS ", " ")
strAddress = Substitute(strAddress, " MS ", " ")
strAddress = Substitute(strAddress, " MESSRS ", " ")
strAddress = Substitute(strAddress, " SIR ", " ")
strAddress = Substitute(strAddress, " OF ", " ")
strAddress = Substitute(strAddress, " DR ", " ")
strAddress = Substitute(strAddress, " OR ", " ")
strAddress = Substitute(strAddress, " IN ", " ")
strAddress = Substitute(strAddress, " THE ", " ")
strAddress = Substitute(strAddress, " REVEREND ", " REV ")
strAddress = Substitute(strAddress, " REVERENT ", " REV ")
strAddress = Substitute(strAddress, " HONOURABLE ", " HON ")
strAddress = Substitute(strAddress, " BROS ", " BROTHERS ")
strAddress = Substitute(strAddress, " ASSOC ", " ASSOCIATION ")
strAddress = Substitute(strAddress, " ASSN ", " ASSOCIATION ")
' Standardising 'St.', 'St', and 'Street'. Note that there are over 40 English
' towns and place names that contain or consist entirely of the word 'Street'.
' In addition, 'St' is a common abbreviation for 'Saint' in addresses.
' I have never seen a list of addresses where 'Street' and 'St' were used in a
' consistent way, and the only workable solution is to delete them all:
strAddress = Substitute(strAddress, " STREET ", " ")
strAddress = Substitute(strAddress, " ST ", " ")
strAddress = Substitute(strAddress, " STR ", " ")
Do While InStr(strAddress, " ") > 0
strAddress = Substitute(strAddress, " ", " ")
Loop
strAddress = Trim(strAddress)
NormaliseAddress = strAddress
End Function
Public Function StripChars(myString As String, ParamArray Exceptions()) As String
' Strip out all non-alphanumeric characters from a string in a single pass
' Exceptions parameters allow you to retain specific characters (eg: spaces)
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim i As Integer
Dim iLen As Integer
Dim chrA As String * 1
Dim intA As Integer
Dim j As Integer
Dim iStart As Integer
Dim iEnd As Integer
If Not IsEmpty(Exceptions()) Then
iStart = LBound(Exceptions)
iEnd = UBound(Exceptions)
End If
iLen = Len(myString)
For i = 1 To iLen
chrA = Mid(myString, i, 1)
intA = Asc(chrA)
Select Case intA
Case 48 To 57, 65 To 90, 97 To 122
StripChars = StripChars & chrA
Case Else
If Not IsEmpty(Exceptions()) Then
For j = iStart To iEnd
If chrA = Exceptions(j) Then
StripChars = StripChars & chrA
Exit For ' j
End If
Next j
End If
End Select
Next i
End Function
Private Function Substitute(ByVal Text As String, _
ByVal Old_Text As String, _
ByVal New_Text As String, _
Optional Instance As Integer = 0, _
Optional Compare As VbCompareMethod = vbTextCompare _
) As String
'Replace all instances (or the nth instance ) of 'Old' text with 'New'
' Unlike VB.Mid$ this method is not sensitive to length and can replace ALL instances
' This is not exposed as a Public function because there is an Excel Worksheet function called Substitute()
' THIS CODE IS IN THE PUBLIC DOMAIN
Dim iStart As Integer
Dim iEnd As Integer
Dim iLen As Integer
Dim iInstance As Integer
Dim strOut As String
iLen = Len(Old_Text)
If iLen = 0 Then
Substitute = Text
Exit Function
End If
iEnd = 0
iStart = 1
iEnd = InStr(iStart, Text, Old_Text, Compare)
If iEnd = 0 Then
Substitute = Text
Exit Function
End If
strOut = ""
Do Until iEnd = 0
strOut = strOut & Mid$(Text, iStart, iEnd - iStart)
iInstance = iInstance + 1
If Instance = 0 Or Instance = iInstance Then
strOut = strOut & New_Text
Else
strOut = strOut & Mid$(Text, iEnd, Len(Old_Text))
End If
iStart = iEnd + iLen
iEnd = InStr(iStart, Text, Old_Text, Compare)
Loop
iLen = Len(Text)
strOut = strOut & Mid$(Text, iStart, iLen - iEnd)
Substitute = strOut
End Function
Public Function Contains(ByVal MainString As String, ParamArray SeekString()) As Boolean
' A version of Instr() that returns TRUE if any 'seek' string is a substring of the main string
' THIS CODE IS IN THE PUBLIC DOMAIN
Application.Volatile False
Dim i As Integer
Dim j As Integer
Dim arrX As Variant
Dim strTest As String
Contains = False
If MainString = "" Then
Exit Function
End If
For i = LBound(SeekString) To UBound(SeekString)
If IsArray(SeekString(i)) Then
arrX = SeekString(i)
For j = LBound(arrX) To UBound(arrX)
strTest = ""
strTest = Trim(arrX(j))
If Len(strTest) <= Len(MainString) And Len(strTest) > 0 Then
If InStr(1, MainString, strTest, vbTextCompare) > 0 Then
Contains = True
Exit Function
End If
End If
Next j
Erase arrX
Else
strTest = ""
strTest = Trim(SeekString(i))
If Len(strTest) > 0 Then
If Len(strTest) < Len(MainString) And Len(strTest) > 0 Then
If InStr(1, MainString, strTest, vbTextCompare) > 0 Then
Contains = True
Exit Function
End If
End If
End If
End If
Next i
End Function
(Back to the list of functions)
There's a particular point about the Phrase-matching function that could probably be improved:
' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.
This is a crude 'first-past-the-post' algorithm and it discards usable information. The most glaring omission is that the 'winner' might be one of two excellent matches with near-identical scores, either one of which could be the 'right' choice. Equally, a clear margin among poor scores is a worthless result as it is just as likely to be the wrong choice. Unfortunately, we cannot apply a rigorous statistical treatment (ie: confidence-testing) with the population sizes available in a typical sentence or postal address. Imposing a minimum threshold on the 'margin of victory' is a promising approach, but there is no simple way of applying it.
We could use this information at the sequence-matching stage: if the 'best match' for a given word in phrase 1 appears to be out-of-place, we could go back and see if the second-best choice is a better fit. Storing the 'margin of victory' data allows us to determine whether this is worth doing: we needn't bother looking at a second-best choice if the 'best match' is a clear winner, and we should try the third, fourth and fifth-closest if the margin between them all is below a selected threshold. However, reallocating a given word's counterpart in Phrase 2 carries a heavy price in terms of complexity because we are forced to rerun the 'collision' test - and the outcome of that process might force us to rerun the sequence-matching.
Beyond that, I'm fairly happy with the internal logic: I tried a Damareau_Levenshtein algorithm, which refines the string-matching with an improved treatment of transpositions, but this kind of recursive algorithm does not perform well in VBA and I went back to a simple Levenshtein 'edit distance' function, which is performing rather well. I'd like to see a rigorous analysis of the sequencing algorithm, because my effort is very much an empirical design - that's not to say that it doesn't work, but it rests on decidedly shaky intellectual foundations.
Your comments and suggestions would be welcome on this - and on any other issues and errors you can see in the code.
Free to use any or all of it: there are limitations to what it can do, but I think this is good enough for everyday use and it plugs an obvious gap in the standard Excel worksheet functions.