HairyEars ([info]hairyears) wrote,
@ 2007-05-10 18:41:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
Entry tags:excel

Programming question: how 'matched' is a matching string?


An interesting question - easy to answer if two strings "ABCD" and "ABCD" are identical, harder if you have two sets of names and addresses that almost match and need some kind of fuzzy-matching process with a threshold.

What algorithms and metrics are out there?

Which of them are practical in a VBA or VB.NET environment, where simplicity is essential and performance is an issue?


The crudest metric is to test whether one string matches part of another:


str1 = "ABCDEFG"
str2 = "CDEFG"
    
If VBA.Instr(1, str1, str2, vbTextCompare) Then
    Match=True
End If

On it's own, this isn't terribly informative; the tools I have available (the 'In String' or 'contains' function) only tells me that the source string contains the search string. But now try matching parts of a more complex pair of strings:


str1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
str2 = "CDEFGJKLMNO"

For i = 1 to Len(str1) - 5

    strTest = Mid(str1, i, 5)
    If VBA.Instr(1, str1, strTest, vbTextCompare) Then
    iScore=iScore+1
    End If

Next i


dblReportedScore = 100 * iScore / (Len(str1) - 5)

What I'm doing here is taking an arbitrary 'word' length of 5 chars and scoring each time a word in the search string matches a word in the source string. The reported score is a percentage of all the possible matches that could've been detected.

If you're comparing sentences, and you use the spaces to break the strings up into words instead of using arbitrary five-letter groups, you now have a crude but usable algorithm:


str1 = "The quick brown fox jumped over the lazy dog"
str2 = "Da kwik brown fuxxor jumped over the lazy cow"

arr1 = VBA.Split(str1, " ")
arr2 = VBA.Split(str2, " ")

For i = Lbound(arr2) To Ubound(arr2)

    For j = Lbound(arr1) To Ubound(arr1)

        If arr2(i) = arr1(j) Then
            iScore = iScore + 1
        End If

    Next j

Next i

dblReportedScore = 100 * iScore / (i - 1) * (j - 1)


Believe it or not, this seems to work for comparing lists of postal addresses: strip the punctuation, set the matching threshold at 75%, and present the user with a ranked listing of the near-misses as a handy starting point to the laborious task of matching up the rest of the data by hand.

Good enough for Excel.

Still, there's a lot of unanswered questions here; I know that the algorithm could be improved a lot.

NOTE: For the sake of keeping the discussion short and simple in a public forum, I'll confine the discussion to using arbitrary five-letter words and the source string "ABCDEFGHIJKLMNOPQRSTUVWXYZ".

So... Did I get a meaningful score for "CDEFGJKLMNO" ? How would I weight that score to reflect the difference in length between the two strings? How would I normalise the reported score so that so that identical (or fully-matched) strings scored 100%? There are no perfect answers - at least, none that apply in all real-world applications - but I'd be interested to hear your thoughts.

A more complex question arises when you consider that "CDEFGJKLMNO" and "JKLMNOCDEFG" would have identical scores under that testing regime: how do I weight the scoring for sequence? One way of doing that would be to run that loop twice: first, as shown above, recording matching words wherever they occur in the sequence; but the second time around, we'll advance the starting point of the string comparison function 'Instr()' whenever one 'word' matches, so that the next word won't score:


str1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
str2 = "CDEFGJKLMNO"

iStart=1
For i = 1 to Len(str1) - 5

    strTest = Mid(str1, i, 5)
    If VBA.Instr(iStart, str1, strTest, vbTextCompare) Then
        iScore=iScore+1
        iStart = i+1
    End If

Next i

dblReportedScore = 100 * iScore / (Len(str1) - 5)


The problem with that is the undue weighting given to the first matching word: if the search string begins with "VWXYZ" then no subsequent 'word' will score. So, rather than two loops, we now need successive passes trimming out each word in turn and running both loops on the remainder.

And, all of a sudden, this has ceased to be a practical proposition within the speed constraints of VBA.



How did you deal with this type of problem when it came up?


(8 comments) - (Post a new comment)


[info]mobbsy
2007-05-10 05:48 pm UTC (link)
For the phonetic type match, the standard method is to use an algorithm like Soundex or (Double) Metaphone.

(Reply to this) (Thread)


[info]mobbsy
2007-05-10 05:51 pm UTC (link)
You might also like to look at edit distance for a given string if it's user generated data, see:
http://en.wikipedia.org/wiki/Levenshtein_distance

(Reply to this) (Parent)(Thread)


[info]hairyears
2007-05-10 06:00 pm UTC (link)
Oh, that looks interesting.

* searches for "Levenshtein distance" VBA *

Thank you, this might be usable.

As this is a very limited application of string-matching, I may look at differential weightings - penalty costs - for insertion, deletion and substitution.

(Reply to this) (Parent)


[info]hairyears
2007-05-10 05:54 pm UTC (link)
Way out of reach of VBA. And for simple lists of business names and addresses, phonetics aren't all that useful. I think we're looking at phrase-matching rather than phonetics, here.

(Reply to this) (Parent)(Thread)


[info]mobbsy
2007-05-10 06:04 pm UTC (link)
Really? I've seen Double Metaphone written in PL/SQL, and I can't believe that VBA is a less useful language than that.

Ah, http://www.codeguru.com/vb/gen/vb_misc/tips/article.php/c13137__2/#more looks like somebody has already done it (in the attachment).

Names are the classic use for Soundex type algorithms; for example in a call center, a customer service rep will tend to type what they hear rather than know how to spell all sorts of unfamiliar names. I'd think the same thing would be true for road names and towns.

I guess if it's an application where the users know how to spell what they're looking for, and just mistyping that's an issue then edit distance is probably more useful.

(Reply to this) (Parent)(Thread)


[info]hairyears
2007-05-10 06:23 pm UTC (link)
Edit distance is proving more powerful: testing a possible solution now.

In brief, I'm comparing addresses as business names: very few words are mis-spelt (Metaphone would be a quicker for that) but missing words and common substitutions ('St.' for 'Street') are common. The concept of edit distance, with reduced penalties for substitutions, is giving good results.

(Reply to this) (Parent)


[info]pseudomonas
2007-05-10 06:01 pm UTC (link)
You could preprocess all the strings into sets of ngrams and then use whatever VB does for a hashtable (I think there's a dictionary or something) to count the overlap between the sets.

Googling suggests that there are quite fast-looking implementations of soundex in VB

(Reply to this)


[info]ixwin
2007-05-13 03:16 pm UTC (link)
Minor point - I'd use the TRIM() function on the data first to remove the effect of extraneous spaces in the data.

(Reply to this)


(8 comments) - (Post a new comment)

Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…