HairyEars ([info]hairyears) wrote,
@ 2007-05-23 19:01:00
Previous Entry  Add to memories!  Tell a Friend!  Next Entry
Entry tags:excel

String-Matching in Excel: VLookup() with fuzzy-matching to get a 'closest match' result

Ever had to look up a name or an address in a list that doesn't quite match, so the standard Excel 'VLookup()' and 'Match()' functions don't help you?

Thanks to some suggestions made by friends on a previous post, I've been able to code a lookup function that returns the closest match in a list:


Screenshot: Excel Dialogue for a Fuzzy-Matching lookup function

Here's the details:

Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant)


Find the best match for a given string in column 1 of an array of data obtained from an Excel range
This is functionally similar to VLookup, but it returns the best match, not the first exact match
This function is not case-sensitive.


  • If your data quality is poor, you are advised to display the retrieved index value from column 1 and use the MatchWord function on this index value to reveal the fuzzy-matching 'score' and discard all results below a threshold score.
  • Use VFuzzyLookup_Phrase if you are trying to match a phrase or sentence (a sequence of words separated by spaces) as that function is faster, and has additional logic for the word order.
  • Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional logic to normalise the common abbreviations and word-order conventions used in British addresses.


Yes, you read that correctly: you have a best-match lookup, vFuzzyLookup_Address() for lists of names and addresses. It could do with some performance-tuning (I wouldn't try it on lists exceeding 1024 members if you're in any kind of hurry) but it's usable.


The functions which will be of most interest to other programmers are probably MatchWord(), the simple string-comparison function which returns a percentage score based on Levenshtein edit distance; and MatchPhrase(), which uses MatchWord to create a comparison grid for the words of two sentences, and compares the constructed sequence of 'best match' words in the first sentence with their actual occurrence in the second.

This is the full list of functions , with internal page links:

In another post I have coded up an alternative approach to the Levenshtein edit distance: a 'sum of common strings' score that may give better results for longer phrases and passages of text.



Option Explicit

Public Function VFuzzyLookup(Lookup_Value As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)

' Find the best match for a given string in column 1 of an array of data obtained from an Excel range
' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive.

' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchWord() function on this index value to reveal the fuzzy-matching 'score' and
' discard all results below a threshold score.

' Use VFuzzyLookup_Phrases if you are trying to match a phrase or sentence (a sequence of words
' separated by spaces) as that function is faster, and has additional logic for the word order.

' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional
' logic to normalise the common abbreviations and word-order conventions used in British addresses.

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As Integer
Dim dblMatch    As Double
Dim iRow        As Integer
Dim strTest     As String
Dim strInput    As String

Dim iStartCol   As Integer
Dim iEndCol     As Integer
Dim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
    'Table_Array is not an array
     VFuzzyLookup = "#VALUE"
    Exit Function
End If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
    Table_Array = Table_Array.Value
End If

' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.

iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol


Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
    'Out-of-bounds
     VFuzzyLookup = "#VALUE"
    Exit Function
End If



    strInput = UCase(Lookup_Value)

    iRowBest = -1
    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""
        strTest = Table_Array(iRow, iStartCol)

        dblMatch = 0
        dblMatch = MatchWord(strInput, strTest)

        If dblMatch = 1 Then ' Bail out on finding an exact match
            iRowBest = iRow
            Exit For
        End If

        If dblMatch > dblBestMatch Then
            dblBestMatch = dblMatch
            iRowBest = iRow
        End If

    Next iRow


    If iRowBest = -1 Then
        VFuzzyLookup = "#NO MATCH"
        Exit Function
    End If


    VFuzzyLookup = Table_Array(iRowBest, Col_Index_Num)



End Function


Public Function MatchWord(ByVal str1 As String, ByVal str2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double

' Returns a percentage estimate of how closely word 1 matches word 2
' Edit distances exceeding the length of str1 are discarded, returning a percentage match of zero

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim maxLen As Integer
Dim minLen As Integer

If Compare = vbTextCompare Then
    str1 = UCase(str1)
    str2 = UCase(str2)
End If


    If str1 = str2 Then
        MatchWord = 1
        Exit Function
    End If

    If Len(str1) > Len(str2) Then
        maxLen = Len(str1)
        minLen = Len(str2)
    Else
        maxLen = Len(str2)
        minLen = Len(str1)
    End If

    MatchWord = 0
    MatchWord = Levenshtein(str1, str2)

    If MatchWord >= minLen Then
        MatchWord = 0
    Else
        MatchWord = (maxLen - MatchWord) / maxLen
    End If

End Function


Public Function VFuzzyLookup_Phrase(Lookup_Phrase As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)

' Find the best match for a given phrase in column 1 of an array of data obtained from an Excel range

' Use this function to match a sentence (a sequence of words separated by spaces). Returns a score
' based on matching word order weighted by the string-matching score of each individual word.

' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive.

' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider
' discarding all results below a threshold score.

' Use VFuzzyLookup for simple string comparisons.

' Use vFuzzyLookup_Address if you are looking up names and addresses: that function has additional
' logic to normalise the common abbreviations and word-order conventions used in British addresses.


' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As Integer
Dim dblMatch    As Double
Dim iRow        As Integer
Dim strTest     As String
Dim strInput    As String

Dim iStartCol   As Integer
Dim iEndCol     As Integer
Dim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
    'Table_Array is not an array
     VFuzzyLookup_Phrase = "#VALUE"
    Exit Function
End If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
   Table_Array = Table_Array.Value
End If

' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.

iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol


Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
    'Out-of-bounds
     VFuzzyLookup_Phrase = "#VALUE"
    Exit Function
End If



    strInput = UCase(Lookup_Phrase)

    iRowBest = -1
    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""
        strTest = Table_Array(iRow, iStartCol)

        dblMatch = 0
        dblMatch = MatchPhrase(strInput, strTest)    ' Consider coding up a  MatchPhrase_Express() function, with the preprocessing
                                                    ' (StripChars, Split) of strInput done here, rather than repeatedly


        If dblMatch = 1 Then ' Bail out on finding an exact match
            iRowBest = iRow
            Exit For
        End If

        If dblMatch > dblBestMatch Then
            dblBestMatch = dblMatch
            iRowBest = iRow
        End If

    Next iRow


    If iRowBest = -1 Then
        VFuzzyLookup_Phrase = "#NO MATCH"
        Exit Function
    End If


    VFuzzyLookup_Phrase = Table_Array(iRowBest, Col_Index_Num)



End Function


Public Function VFuzzyLookup_Address(Lookup_Address As String, Table_Array As Variant, Optional Col_Index_Num As Integer = 1)

' Find the best match for a given postal address in column 1 of a table of addresses obtained from an Excel range

' This is functionally similar to VLookup, but it returns the best match, not the first exact match
' This function is not case-sensitive and ignores common abbreviations eg: 'St' for 'Street'

' If your data quality is poor, you are advised to display the retrieved index value from column 1
' and use the MatchPhrase() function on this index value to reveal the fuzzy-matching 'score'; consider
' discarding all results below a threshold score.

' Use VFuzzyLookup for simple string comparisons.

' Use vFuzzyLookup_Phrase if you are looking up phrases or sentences that are not addresses: this address lookup
' function discards a lot of common words like 'Street' and expands abbreviations like 'Ave' and 'Blvd'


' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim dblBestMatch As Double

Dim iRowBest    As Integer
Dim dblMatch    As Double
Dim iRow        As Integer
Dim strTest     As String
Dim strInput    As String

Dim iStartCol   As Integer
Dim iEndCol     As Integer
Dim iOffset     As Integer

If InStr(TypeName(Table_Array), "(") + InStr(1, TypeName(Table_Array), "Range", vbTextCompare) < 1 Then
    'Table_Array is not an array
     VFuzzyLookup_Address = "#VALUE"
    Exit Function
End If

If InStr(1, TypeName(Table_Array), "Range", vbTextCompare) > 0 Then
    Table_Array = Table_Array.Value
End If

' If you get a subscript-out-of-bounds error here, you're using a vector instead
' of the 2-dimensional array that is the default 'Value' property of an Excel range.

iStartCol = LBound(Table_Array, 2)
iEndCol = UBound(Table_Array, 2)
iOffset = 1 - iStartCol


Col_Index_Num = Col_Index_Num - iOffset

If Col_Index_Num > iEndCol Or Col_Index_Num < iStartCol Then
    'Out-of-bounds
     VFuzzyLookup_Address = "#VALUE"
    Exit Function
End If


    strInput = Lookup_Address
    strInput = NormaliseAddress(strInput)


If strInput = "" Then
    'Out-of-bounds
     VFuzzyLookup_Address = "#CANNOT READ ADDRESS"
    Exit Function
End If


    iRowBest = -1
    dblBestMatch = 0

    For iRow = LBound(Table_Array, 1) To UBound(Table_Array, 1)

        strTest = ""
        strTest = Table_Array(iRow, iStartCol)
        strTest = NormaliseAddress(strTest)

        If strTest <> "" Then

            dblMatch = 0
            dblMatch = MatchPhrase(strInput, strTest)   ' Consider coding up a  MatchPhrase_Express() function, with the preprocessing
                                                        ' (StripChars, Split) of strInput done here, rather than repeatedly  


            If dblMatch = 1 Then ' Bail out on finding an exact match
                iRowBest = iRow
                Exit For
            End If

            If dblMatch > dblBestMatch Then
                dblBestMatch = dblMatch
                iRowBest = iRow
            End If

        End If ' strTest <> ""

    Next iRow


    If iRowBest = -1 Then
        VFuzzyLookup_Address = "#NO MATCH"
        Exit Function
    End If


    VFuzzyLookup_Address = Table_Array(iRowBest, Col_Index_Num)



End Function



Public Function MatchPhrase(ByVal Phrase1 As String, ByVal Phrase2 As String, Optional Compare As VbCompareMethod = vbTextCompare) As Double

' Function to compare two sentences. A version of this will be released to cater for the
' specific needs of matching addresses, where we can make some assumptions about common
' word-substitutions and abbreviations.

' THIS CODE IS IN THE PUBLIC DOMAIN


' This function consists of six processes:

' 1  Break out the phrases into arrays of words using the space character as the delimiter
' 2  Populate a grid of word-matching scores for each word in Phrase 1 against Phrase 2;
' 3  For each word in Phrase 1, identify the 'best match' from the words in Phrase 2
' 4  Resolve 'collisions' - two or more words in phrase 1 matching the same word in phrase 2
' 5  Compare the actual sequence of words in P1 with the positions of the matched words in P2
' 6  Weight this comparison by the degree of matching measured at the level of individual words

' Process 4, resolving collisions, is an iterative loop inside process 3
' Process 1 has an addditional step to check for deleted spaces


Dim arr1() As String            ' Phrase 1, broken out into individual words
Dim arr2() As String

Dim arrScores()    As Double    ' an array of percentage matches of each word in p1 against each word in p2

                                ' These two vectors are redundant in the sense that they hold information which
                                ' can be extracted from arrScores(). However, using them saves a lot of looping:

Dim arrPositions() As Integer   ' For each word in p1, the position of the best-matching word in p2
Dim arrSequence()  As Double    ' For each word in p1, a score for its concordance with a constructed sequence of matching words in P2


Dim n As Double                 ' should be an integer, but it will be used in floating-point
                                ' division and I prefer to avoid casting in VBA

Dim s1 As String
Dim s2 As String


Dim i As Integer
Dim j As Integer
Dim k As Integer

Dim iOffset As Integer
Dim iShift As Integer
Dim iDelete As Integer

Dim iPos As Integer
Dim jPos As Integer
Dim kPos As Integer

Dim iTotalLen As Integer

Dim dScore As Double
Dim dBest As Double
Dim dPenalty As Double

Dim d1 As Double
Dim d2 As Double

If Compare = vbTextCompare Then
    Phrase1 = UCase(Phrase1)
    Phrase2 = UCase(Phrase2)
End If

If Phrase1 = Phrase2 Then
    MatchPhrase = 1
    Exit Function
End If

' The line labels SplitSpace1 and SplitSpace2 are resynchronisation points for
' restarting the process after restoring a deleted space in Phase1 or Phrase2.

Phrase1 = StripChars(Phrase1, " ")
SplitSpace1:
arr1 = Split(Phrase1, " ")

Phrase2 = StripChars(Phrase2, " ")
SplitSpace2:
arr2 = Split(Phrase2, " ")

ReDim arrScores(LBound(arr1) To UBound(arr1), LBound(arr2) To UBound(arr2))
ReDim arrPositions(LBound(arr1) To UBound(arr1))
ReDim arrSequence(LBound(arr1) To UBound(arr1))


' Test for deleted spaces. This is a lot of work, but a missing space is a
' common error and the effects are out of all proportion to the size of the
' error: so much so that I'm prepared to risk the occasional 'false alarm'.
' It may even be worth repeating these two loops using fuzzy-matching with
' Levenshtein scores rather than the simple string-comparisons shown below:

For i = LBound(arr1) To UBound(arr1) - 1

    If arr1(i) <> "" And arr1(i + 1) <> "" Then

        s1 = arr1(i) & arr1(i + 1)

        For j = LBound(arr2) To UBound(arr2)
            If UCase(arr2(j)) = UCase(s1) Then
                Phrase2 = Substitute(Phrase2, arr2(j), arr1(i) & " " & arr1(i + 1), 1, Compare)
                GoTo SplitSpace2
            End If
        Next j

    End If ' arr(i) = "" Or arr(i + 1) = "" Then

Next i

For j = LBound(arr2) To UBound(arr2) - 1

    If arr2(j) <> "" And arr2(j + 1) <> "" Then

        s2 = arr2(j) & arr2(j + 1)

        For i = LBound(arr1) To UBound(arr1)
            If UCase(arr1(i)) = UCase(s2) Then
                Phrase1 = Substitute(Phrase1, arr1(i), arr2(j) & " " & arr2(j + 1), 1, Compare)
                GoTo SplitSpace1
            End If
         Next i

    End If

Next j


' Initialise the positions array with a negative value denoting 'not found'

For i = LBound(arr1) To UBound(arr1)
    arrPositions(i) = -1
    iTotalLen = iTotalLen + Len(arr1(i))
Next i

' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.

For i = LBound(arr1) To UBound(arr1)

    s1 = arr1(i)
    dBest = 0
    iPos = -1

    For j = LBound(arr2) To UBound(arr2)

        s2 = arr2(j)
        dScore = 0
        dScore = MatchWord(s1, s2, Compare)

        arrScores(i, j) = dScore
        If dScore > dBest Then
            dBest = dScore
            iPos = j
        End If

    Next j

    If iPos >= 0 Then
        arrPositions(i) = iPos
    End If

Next i

' Resolve collisions - two or more words in P1 that have 'best match' scores on the same word in p2
' In theory this could be done without using the positions vector, as the information is in arrScores
' In practice, arrPositions saves processing steps

For i = LBound(arrPositions) To UBound(arrPositions)

    iPos = arrPositions(i)

    For j = i + 1 To UBound(arrPositions)

        If iPos = arrPositions(j) And iPos >= 0 Then
            ' Collision detected: which word has the best score?
            d1 = arrScores(i, iPos)
            d2 = arrScores(j, iPos)

            If d2 > d1 Then

                 'discard this recorded 'best match' position:
                arrScores(i, iPos) = -1

                'find the second-best score for d1
                dBest = 0
                kPos = -1
                For k = LBound(arrScores, 2) To UBound(arrScores, 2)
                    dScore = 0
                    dScore = arrScores(i, k)
                    If dScore > dBest Then
                        dBest = dScore
                        kPos = k
                    End If
                Next k
                
                ' reset this conflicting position as word (i)'s match in phrase 2:

                arrPositions(j) = kPos
                
                ' There is now a possibility that we have caused
                ' a collision with a previous word in Phrase 1:

                If k < i Then
                    For k = LBound(arrPositions) To k - 1
                        If arrPositions(k) = kPos Then
                             'restart the loop at the colliding value
                            i = k
                            j = UBound(arr1) + 1
                            Exit For
                        End If
                    Next k
                End If ' k<1

            Else

                 ' discard this recorded 'best match' position:
                arrScores(j, iPos) = -1

                 'find the second-best score for d2 *after* the current position
                dBest = 0
                kPos = -1
                For k = j + 1 To UBound(arr2)
                    dScore = 0
                    dScore = arrScores(j, k)
                    If dScore > dBest Then
                        dBest = dScore
                        kPos = k
                    End If
                Next k

                arrPositions(j) = kPos

            End If ' d2 > d1

        End If

    Next j

Next i


' Constructing a sequence-matching score:


' If we were scoring jumbled sentences of unaltered words, we'd use an edit distance algorithm;
' several are available, including replicating the Levenshtein distance at the word level. I've
' chosen a crude single-pass algorithm with a forward bias, that 'expects' the word sequence to
' resynchronise after each out-of-sequence word. It's quick, and the bias is valid - word-order
' is not neutral in real-life examples, and the heavy penalty for word transpositions reflects
' my belief that this is a more significant 'edit' than character transpositions in a word. A
' more rigorous treatment would venture into the realms of natural-language processing; that is
' out-of-scope for this application and far too ambitious for a self-contained function in VBA.

' Worked example:

' Compare two Phrases:
'  "ABC DEF GHI JKL MNO PQR STU VWX",  "ABC DEF JKL STU MNO PQR VWX"

' Variable arrPositions records the placement of each word in phrase 1 in phrase 2:

' Phrase 1            "ABC DEF GHI JKL MNO PQR STU VWX"
' Expected positions:   0   1   2   3   4   5   6   7
' Actual position in p2 0   1  -1   2   4   5   3   6

' The variable arrSequence will capture the scores
        
' Run the sequence-scoring loop:

' ABC   expected in position 0      found in 0                      Score 1/8
' DEF   expected in position 1      found in 1                      Score 1/8
' GHI   expected in position 2      DELETION     * frame shift -1 * Score NIL
' JKL   expected in position 3-1    found in 2                      Score 1/8
' MNO   expected in position 4-1    found in 4   * frame shift +1 * Score 1/8 * 7/8
' PQR   expected in position 5      found in 5                      Score 1/8
' STU   expected in position 6      found in 3   * frame shift -3 * Score 1/8 * (7/8)^3
' VWX   expected in position 7-3    found in 6   * frame shift +2 * Score 1/8 * (7/8)^2

' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times

' However, we do not deal with perfectly-matched words in real life, so we cannot apply
' these penalties at the level of the entire phrase; we apply them at the level of the
' individual word, where we can apply a weighting based on each word's Levenshtein score

' The exception is deleted words; we could consider the 'word match' weighting of zero
' to be sufficient penalty but a more consistent result is obtained by applying a penalty
' to the entire phrase



' Sanity check; run the function in reverse, testing Phrase 2 against phrase 1:

' Phrase 2            "ABC DEF JKL STU MNO PQR VWX"
' Expected positions:   0   1   2   3   4   5   6
' Actual position in p1 0   1   3   6   4   5   7

' ABC   expected in position 0      found in 0                      Score 1/8
' DEF   expected in position 1      found in 1                      Score 1/8
' JKL   expected in position 2      found in 3   * frame shift +1 * Score 1/8 * 7/8
' STU   expected in position 3+1    found in 6   * frame shift +2 * Score 1/8 * (7/8)^2
' MNO   expected in position 4+3    found in 4   * frame shift -3 * Score 1/8 * (7/8)^3
' PQR   expected in position 5      found in 5                      Score 1/8
' VWX   expected in position 6      found in 7   * frame shift -1 * Score 1/8 * 7/8

' Edit distance is 7: the out-of-sequence penalty of 7/8 will be applied seven times

' "But wasn't there an insertion, too? Phrase 1 has an extra word that isn't in Phrase 2!"

' Note that our choice of denominator (8, the longer of the two wordcounts) has the effect of
' imputing a score of zero to the inserted word and applying a penalty of 7/8 to the entire phrase.

' A note on identifying the 'inserted word': actually, it's the word in Phrase 1 which didn't
' score as 'best match' against any word in Phrase 2. It could've come a close second to any or
' all of them.



If UBound(arr1) >= UBound(arr2) Then
    n = UBound(arr1) + 1
Else
    n = UBound(arr2) + 1
End If

dPenalty = 1 - (1 / n)
iShift = 0       ' Sequence distance for out-of-place words
iOffset = 0     ' Running total of this 'shift' variable
iDelete = 0     ' Count the number of deletions

For i = LBound(arrPositions) To UBound(arrPositions)

    s1 = arr1(i)

    iPos = arrPositions(i)
    iShift = iPos - i - iOffset


    Select Case iPos
    Case Is < 0     'DELETION: no matching word was found in S2

        iShift = -1
        arrSequence(i) = 0
        iDelete = iDelete + 1

    Case Is = i + iOffset ' matched word is in the expected position

        iShift = 0
        arrSequence(i) = 1 / n

    Case Else

        arrSequence(i) = (dPenalty ^ Abs(iShift)) / n

    End Select

    iOffset = iOffset + iShift

Next i

MatchPhrase = 0



For i = LBound(arrPositions) To UBound(arrPositions)
    dScore = 0
    If arrPositions(i) > -1 Then
        dScore = arrScores(i, arrPositions(i))
        dScore = dScore * arrSequence(i)
    Else
         'apply a deletion penalty - this isn't as arbitrary as it might seem: it is a equivalent to the
        '                           effect of an insertion, which acts by increasing the denominator

        dScore = -Len(arr1(i)) / iTotalLen / n
    End If
    MatchPhrase = MatchPhrase + dScore
Next i




ExitFunction:

    Erase arrScores
    Erase arrSequence
    Erase arr1
    Erase arr2

End Function

Private Function Minimum(ByVal a As Integer, _
                         ByVal b As Integer, _
                         ByVal c As Integer) As Integer
Dim min As Integer

  min = a

  If b < min Then
        min = b
  End If

  If c < min Then
        min = c
  End If

  Minimum = min

End Function


Private Function Levenshtein(ByVal s1 As String, ByVal s2 As String) As Integer

' Levenshtein Distance  - edit distance between two strings

' THIS CODE IS IN THE PUBLIC DOMAIN



Dim arr() As Integer    ' Scoring matrix
Dim n As Integer        ' length of s1
Dim m As Integer        ' length of s2
Dim i As Integer        ' iterates through s1
Dim j As Integer        ' iterates through s2
Dim s1_i As String      ' ith character of s1
Dim s2_j As String      ' jth character of s2
Dim cost As Integer     ' cost

n = Len(s1)
m = Len(s2)

If n = 0 Then
    Levenshtein = m
    Exit Function
End If

If m = 0 Then
    Levenshtein = n
    Exit Function
End If

ReDim arr(0 To n, 0 To m) As Integer

For i = 0 To n
    arr(i, 0) = i
Next i

For j = 0 To m
    arr(0, j) = j
Next j


For i = 1 To n

    s1_i = Mid$(s1, i, 1)

    For j = 1 To m

        s2_j = Mid$(s2, j, 1)

            If s1_i = s2_j Then
                cost = 0
            Else
                cost = 1
            End If

            arr(i, j) = Minimum(arr(i - 1, j) + 1, arr(i, j - 1) + 1, arr(i - 1, j - 1) + cost)

    Next j

Next i

  ' Step 7

  Levenshtein = arr(n, m)

ExitSub:
    Erase arr
End Function


Public Function NormaliseAddress(ByVal strAddress As String) As String

' This function is intended to remove or standardise common phrases
' and abbreviations used in British postal addresses, allowing the use
' of string-comparison algorithms in lists of names and addresses.

' Developers in other countries should review the word list used here,
' as conventions probably differ in your local language or dialect.

strAddress = " " & UCase(strAddress) & " "

strAddress = Substitute(strAddress, ",", " ")
strAddress = Substitute(strAddress, ".", " ")
strAddress = Substitute(strAddress, "-", " ")
strAddress = Substitute(strAddress, vbCrLf, " ")
strAddress = Substitute(strAddress, " BLVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " BVD ", " BOULEVARD ")
strAddress = Substitute(strAddress, " AV ", " AVENUE ")
strAddress = Substitute(strAddress, " AVE ", " AVENUE ")
strAddress = Substitute(strAddress, " RD ", " ROAD ")
strAddress = Substitute(strAddress, " WY ", " WAY ")
strAddress = Substitute(strAddress, " EST ", " ESTATE ")
strAddress = Substitute(strAddress, " PL ", " PLACE ")
strAddress = Substitute(strAddress, " PK ", " PARK ")
strAddress = Substitute(strAddress, " HSE ", " HOUSE ")
strAddress = Substitute(strAddress, " H0 ", " HOUSE ")
strAddress = Substitute(strAddress, " GDNS ", " GARDENS ")

strAddress = Substitute(strAddress, "&", "AND")
strAddress = Substitute(strAddress, " LIMITED ", " LTD ")
strAddress = Substitute(strAddress, " COMPANY ", " CO ")
strAddress = Substitute(strAddress, " CORPORATION ", " CORP ")
strAddress = Substitute(strAddress, " T/A ", " TA ")
strAddress = Substitute(strAddress, " TRADING AS ", " TA ")

' Common personal titles: these are often applied inconsistently or
' omitted, and must therefore be removed. Specific applications may
' require additional titles and their abbreviations - military rank,
' academic titles and degrees, courtesy titles of the aristocracy,
' knighthoods and honours (particularly for lists of civil servants)

strAddress = Substitute(strAddress, " ESQ ", " ")
strAddress = Substitute(strAddress, " MR ", " ")
strAddress = Substitute(strAddress, " MRS ", " ")
strAddress = Substitute(strAddress, " MISS ", " ")
strAddress = Substitute(strAddress, " MS ", " ")
strAddress = Substitute(strAddress, " MESSRS ", " ")
strAddress = Substitute(strAddress, " SIR ", " ")
strAddress = Substitute(strAddress, " OF ", " ")
strAddress = Substitute(strAddress, " DR ", " ")
strAddress = Substitute(strAddress, " OR ", " ")
strAddress = Substitute(strAddress, " IN ", " ")
strAddress = Substitute(strAddress, " THE ", " ")
strAddress = Substitute(strAddress, " REVEREND ", " REV ")
strAddress = Substitute(strAddress, " REVERENT ", " REV ")
strAddress = Substitute(strAddress, " HONOURABLE ", " HON ")
strAddress = Substitute(strAddress, " BROS ", " BROTHERS ")
strAddress = Substitute(strAddress, " ASSOC ", " ASSOCIATION ")
strAddress = Substitute(strAddress, " ASSN ", " ASSOCIATION ")

' Standardising 'St.', 'St', and 'Street'. Note that there are over 40 English
' towns and place names that contain or consist entirely of the word 'Street'.
' In addition, 'St' is a common abbreviation for 'Saint' in addresses.

' I have never seen a list of addresses where 'Street' and 'St' were used in a
' consistent way, and the only workable solution is to delete them all:


strAddress = Substitute(strAddress, " STREET ", " ")
strAddress = Substitute(strAddress, " ST ", " ")
strAddress = Substitute(strAddress, " STR ", " ")

Do While InStr(strAddress, "  ") > 0
    strAddress = Substitute(strAddress, "  ", " ")
Loop

strAddress = Trim(strAddress)

NormaliseAddress = strAddress

End Function


Public Function StripChars(myString As String, ParamArray Exceptions()) As String

' Strip out all non-alphanumeric characters from a string in a single pass
' Exceptions parameters allow you to retain specific characters (eg: spaces)

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim i As Integer
Dim iLen As Integer
Dim chrA As String * 1
Dim intA As Integer
Dim j As Integer
Dim iStart As Integer
Dim iEnd As Integer

If Not IsEmpty(Exceptions()) Then
    iStart = LBound(Exceptions)
    iEnd = UBound(Exceptions)
End If

iLen = Len(myString)

For i = 1 To iLen
    chrA = Mid(myString, i, 1)
    intA = Asc(chrA)
    Select Case intA
    Case 48 To 57, 65 To 90, 97 To 122
        StripChars = StripChars & chrA
    Case Else
        If Not IsEmpty(Exceptions()) Then
            For j = iStart To iEnd
                If chrA = Exceptions(j) Then
                    StripChars = StripChars & chrA
                    Exit For ' j
                End If
            Next j
        End If
    End Select
Next i



End Function


Private Function Substitute(ByVal Text As String, _
                           ByVal Old_Text As String, _
                           ByVal New_Text As String, _
                           Optional Instance As Integer = 0, _
                           Optional Compare As VbCompareMethod = vbTextCompare _
                            ) As String


'Replace all instances (or the nth instance ) of 'Old' text with 'New'
' Unlike VB.Mid$ this method is not sensitive to length and can replace ALL instances
' This is not exposed as a Public function because there is an Excel Worksheet function called Substitute()

' THIS CODE IS IN THE PUBLIC DOMAIN

Dim iStart As Integer
Dim iEnd As Integer
Dim iLen As Integer
Dim iInstance As Integer
Dim strOut As String

iLen = Len(Old_Text)

If iLen = 0 Then
    Substitute = Text
    Exit Function
End If

iEnd = 0
iStart = 1

iEnd = InStr(iStart, Text, Old_Text, Compare)

If iEnd = 0 Then
    Substitute = Text
    Exit Function
End If


strOut = ""

Do Until iEnd = 0

    strOut = strOut & Mid$(Text, iStart, iEnd - iStart)
    iInstance = iInstance + 1

    If Instance = 0 Or Instance = iInstance Then
        strOut = strOut & New_Text
    Else
        strOut = strOut & Mid$(Text, iEnd, Len(Old_Text))
    End If

    iStart = iEnd + iLen
    iEnd = InStr(iStart, Text, Old_Text, Compare)

Loop

iLen = Len(Text)
strOut = strOut & Mid$(Text, iStart, iLen - iEnd)

Substitute = strOut

End Function


Public Function Contains(ByVal MainString As String, ParamArray SeekString()) As Boolean

' A version of Instr() that returns TRUE if any 'seek' string is a substring of the main string

' THIS CODE IS IN THE PUBLIC DOMAIN

Application.Volatile False

Dim i As Integer
Dim j As Integer
Dim arrX As Variant
Dim strTest As String
Contains = False

If MainString = "" Then
    Exit Function
End If



For i = LBound(SeekString) To UBound(SeekString)

    If IsArray(SeekString(i)) Then
        arrX = SeekString(i)
        For j = LBound(arrX) To UBound(arrX)

            strTest = ""
            strTest = Trim(arrX(j))

            If Len(strTest) <= Len(MainString) And Len(strTest) > 0 Then

                If InStr(1, MainString, strTest, vbTextCompare) > 0 Then
                    Contains = True
                    Exit Function
                End If

            End If

        Next j
        Erase arrX

    Else

        strTest = ""
        strTest = Trim(SeekString(i))

        If Len(strTest) > 0 Then

            If Len(strTest) < Len(MainString) And Len(strTest) > 0 Then

                If InStr(1, MainString, strTest, vbTextCompare) > 0 Then
                    Contains = True
                    Exit Function
                End If

            End If

        End If

    End If

Next i

End Function

(Back to the list of functions)

There's a particular point about the Phrase-matching function that could probably be improved:

' For each word in Phrase 1, identify the closest matching in Phrase 2 and record its position.


This is a crude 'first-past-the-post' algorithm and it discards usable information. The most glaring omission is that the 'winner' might be one of two excellent matches with near-identical scores, either one of which could be the 'right' choice. Equally, a clear margin among poor scores is a worthless result as it is just as likely to be the wrong choice. Unfortunately, we cannot apply a rigorous statistical treatment (ie: confidence-testing) with the population sizes available in a typical sentence or postal address. Imposing a minimum threshold on the 'margin of victory' is a promising approach, but there is no simple way of applying it.

We could use this information at the sequence-matching stage: if the 'best match' for a given word in phrase 1 appears to be out-of-place, we could go back and see if the second-best choice is a better fit. Storing the 'margin of victory' data allows us to determine whether this is worth doing: we needn't bother looking at a second-best choice if the 'best match' is a clear winner, and we should try the third, fourth and fifth-closest if the margin between them all is below a selected threshold. However, reallocating a given word's counterpart in Phrase 2 carries a heavy price in terms of complexity because we are forced to rerun the 'collision' test - and the outcome of that process might force us to rerun the sequence-matching.

Beyond that, I'm fairly happy with the internal logic: I tried a Damareau_Levenshtein algorithm, which refines the string-matching with an improved treatment of transpositions, but this kind of recursive algorithm does not perform well in VBA and I went back to a simple Levenshtein 'edit distance' function, which is performing rather well. I'd like to see a rigorous analysis of the sequencing algorithm, because my effort is very much an empirical design - that's not to say that it doesn't work, but it rests on decidedly shaky intellectual foundations.

Your comments and suggestions would be welcome on this - and on any other issues and errors you can see in the code.


Free to use any or all of it: there are limitations to what it can do, but I think this is good enough for everyday use and it plugs an obvious gap in the standard Excel worksheet functions.



(Post a new comment)


[info]d_floorlandmine
2007-05-24 09:11 am UTC (link)
Dude, you're l33t! [grin] Is it OK if I test-run some of that code against our database at some point for duplicate-identifying purposes?

(Reply to this) (Thread)


[info]hairyears
2007-05-24 09:43 am UTC (link)
Feel free to use it - but be warned, I haven't tested it under MS-Access. The SPLIT() function might not work, but you'll find out quickly enough if it doesn't.

Access can incorporate globally-declared VBA functions in queries, so try this out:

    
SELECT TOP 3

    "Buckingham Palace, LONDON SW1" AS SeekAddress,
    MatchPhrase(
            NormaliseAddress("Buckingham Palace, LONDON SW1"),
            NormaliseAddress([tblAddress].[FullAddress])
                ) AS MatchScore,
    tblAddress.*

FROM
    tblAddress

ORDER BY
    MatchPhrase(
            NormaliseAddress("Buckingham Palace, LONDON SW1"),
            NormaliseAddress([tblAddress].[FullAddress])
                ) DESC

    

I think you can figure out my assumptions about named fields, and the existence of a table called 'tblAddress'. Let me know how you get on...

(Reply to this) (Parent)(Thread)


[info]d_floorlandmine
2007-05-24 09:48 am UTC (link)
Cheers. Access is demonstrably able to break most things, so I'll find out! [grin] Ta! Not sure when I'll get around to it, but I'll be interested to find out. If the worst comes to the worst, I'll output the relevant fields to Excel. [grin] It'll be a long slow job, but I can leave it running overnight or over a weekend.

(Reply to this) (Parent)


[info]fwuffydragon
2007-05-24 12:20 pm UTC (link)
*isn't technical enough to use it but appreciates it all the same*

(Reply to this) (Thread)


[info]hairyears
2007-05-24 12:59 pm UTC (link)
It's like Radio 3 and Radio 4: one doesn't actually listen to them but it's important to know that they are there. For as long as they are, we are cultured citizens in an enlightened society.

(Reply to this) (Parent)

Possible problem with code
[info]pkosinar
2007-09-21 01:51 pm UTC (link)
Hi man, respect to your work. Have just one comment. Isnt there a mistake in the code below, namely instead of arrPositions(j) = kPos I would expect arrPositions(i) = kPos?

Thanks a lot

' reset this conflicting position as word (i)'s match in phrase 2:
arrPositions(j) = kPos

(Reply to this) (Thread)

Re: Possible problem with code
[info]hairyears
2007-09-21 04:07 pm UTC (link)
This is a cut-and-paste from production code... Or so I thought.

But, from a quick glance through the code, reallocating arrPositions(j) is fine as it is. We're in the [ELSE...] part of the block that deals with d2 <= d1; that is to say, the i th word in phrase 1 is a better match to whatever word in phrase 2 that j is also claiming, and the j th word must now allocate itself to a different word in phrase 2.

Thus, after looping [For k...] through j 's scores to find a second-best match, arrPositions(j) = kPos

No, the place where I might have an error is 32 lines up, branching block [IF d2 > d1 THEN... ELSE], where the i th word lost the claim to j 's true and fair match in phrase 2.

' reset this conflicting position as word (i)'s match in phrase 2:
arrPositions(j) = kPos

That's where I should've set arrPositions(i) = kPos

I'd better check this at the weekend and get back to you.

Meanwhile, who are you? And how did you find this obscure journal?


(Reply to this) (Parent)


[info]avtsin
2007-10-03 04:27 pm UTC (link)
Thank you very much for posting the code. Super-useful! Saved me a lot of time

(Reply to this)

Hi there
[info]cjlosophy
2008-02-29 08:38 am UTC (link)
hi there i am quite puzzled about how the data in the excel sheet be arranged for the function VFuzzyLookup_Phrase to work. I am not very good with excel macros, so it will be great if you can show me an example for the function to work. Thanks a lot

(Reply to this) (Thread)

VLookup and VFuzzyLookup
[info]hairyears
2008-03-03 05:59 pm UTC (link)
The function works the same way as VLookup. Here's a screenshot of it in use:

Fuzzy-matching in Excel: click for a full-size image
(Click for a full-size image)

The parameters are:
  1. The value you're using as a search key (in this case, the name);
  2. The tabulated data you're searching (note that this isn't a single cell, it's a range address);
  3. And a number for the column the results are in.

Note that the data table must always have the search key in column 1; in this case, names are in column 1 and we are returning a matching address from column 2.

A cautious man would type in the formula again, into the cell below the returned address, specifying column 1. Yes, I know it sounds strange, and rather repetitive. Fuzzy Matching is (by definition) inexact, and there's a risk that the lookup function might be matching the wrong name - and, consequently, the wrong address. Specifying column 1 returns the matched (or mismatched) name and allows you o check the results.

If you're interested in the underlying logic, this particular example uses the Levenshtein Distance algorithm rather than sum-of-matching-substrings. Short strings of text (especially unstructured text like personal names, which have very little repetition) are handled better by scanning on a letter-by-letter basis using Levenshtein, rather than by searching for sequences and structure with a substring method.

(Reply to this) (Parent)

Compile Error on Vlookup Phrase & Match Functions
[info]jake_carroll
2008-05-28 12:08 am UTC (link)
Hi, I am hoping to get some help- I would really like to be able to use the VFuzzyLookup_Phrase and MatchPhrase functions, but am getting an error. I just copied and pasted the code for all the functions verbatim. When I try and run these two functions I get a Syntax Error with the following highlighted: "  Next i" . Note the VFuzzyLookup_Word and MatchWord functions work just fine. Any help would be greatly appreciated!

(Reply to this) (Thread)

Re: Compile Error on Vlookup Phrase & Match Functions
[info]hairyears
2008-05-28 12:50 am UTC (link)
Sorry about that: what's actually happened is that the HTML code I'm using to insert spaces and indent the code is being rendered by the browser. So your compiler is seeing this:
        For i = LBound(arr1) To UBound(arr1)
            If UCase(arr1(i)) = UCase(s2) Then
                Phrase1 = Substitute(Phrase1, arr1(i), arr2(j) & " " & arr2(j + 1), 1, Compare)
                GoTo SplitSpace1
            End If
  & n b s p ;     Next i

Instead of this:
        For i = LBound(arr1) To UBound(arr1)
            If UCase(arr1(i)) = UCase(s2) Then
                Phrase1 = Substitute(Phrase1, arr1(i), arr2(j) & " " & arr2(j + 1), 1, Compare)
                GoTo SplitSpace1
            End If
       Next i

I've fixed it and the code should now run.

(Reply to this) (Parent)(Thread)

Re: Compile Error on Vlookup Phrase &amp; Match Functions
[info]jake_carroll
2008-05-28 02:35 pm UTC (link)
That worked, thanks so much for the quick response!

(Reply to this) (Parent)


Create an Account
Forgot your login?
Login w/ OpenID
English • Español • Deutsch • Русский…