| HairyEars ( @ 2007-05-10 18:41:00 |
| Entry tags: | excel |
Programming question: how 'matched' is a matching string?
An interesting question - easy to answer if two strings "ABCD" and "ABCD" are identical, harder if you have two sets of names and addresses that almost match and need some kind of fuzzy-matching process with a threshold.
What algorithms and metrics are out there?
Which of them are practical in a VBA or VB.NET environment, where simplicity is essential and performance is an issue?
The crudest metric is to test whether one string matches part of another:
str1 = "ABCDEFG"
str2 = "CDEFG"
If VBA.Instr(1, str1, str2, vbTextCompare) Then
Match=True
End If
On it's own, this isn't terribly informative; the tools I have available (the 'In String' or 'contains' function) only tells me that the source string contains the search string. But now try matching parts of a more complex pair of strings:
str1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
str2 = "CDEFGJKLMNO"
For i = 1 to Len(str1) - 5
strTest = Mid(str1, i, 5)
If VBA.Instr(1, str1, strTest, vbTextCompare) Then
iScore=iScore+1
End If
Next i
dblReportedScore = 100 * iScore / (Len(str1) - 5)
What I'm doing here is taking an arbitrary 'word' length of 5 chars and scoring each time a word in the search string matches a word in the source string. The reported score is a percentage of all the possible matches that could've been detected.
If you're comparing sentences, and you use the spaces to break the strings up into words instead of using arbitrary five-letter groups, you now have a crude but usable algorithm:
str1 = "The quick brown fox jumped over the lazy dog"
str2 = "Da kwik brown fuxxor jumped over the lazy cow"
arr1 = VBA.Split(str1, " ")
arr2 = VBA.Split(str2, " ")
For i = Lbound(arr2) To Ubound(arr2)
For j = Lbound(arr1) To Ubound(arr1)
If arr2(i) = arr1(j) Then
iScore = iScore + 1
End If
Next j
Next i
dblReportedScore = 100 * iScore / (i - 1) * (j - 1)
Believe it or not, this seems to work for comparing lists of postal addresses: strip the punctuation, set the matching threshold at 75%, and present the user with a ranked listing of the near-misses as a handy starting point to the laborious task of matching up the rest of the data by hand.
Good enough for Excel.
Still, there's a lot of unanswered questions here; I know that the algorithm could be improved a lot.
NOTE: For the sake of keeping the discussion short and simple in a public forum, I'll confine the discussion to using arbitrary five-letter words and the source string "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
So... Did I get a meaningful score for "CDEFGJKLMNO" ? How would I weight that score to reflect the difference in length between the two strings? How would I normalise the reported score so that so that identical (or fully-matched) strings scored 100%? There are no perfect answers - at least, none that apply in all real-world applications - but I'd be interested to hear your thoughts.
A more complex question arises when you consider that "CDEFGJKLMNO" and "JKLMNOCDEFG" would have identical scores under that testing regime: how do I weight the scoring for sequence? One way of doing that would be to run that loop twice: first, as shown above, recording matching words wherever they occur in the sequence; but the second time around, we'll advance the starting point of the string comparison function 'Instr()' whenever one 'word' matches, so that the next word won't score:
str1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
str2 = "CDEFGJKLMNO"
iStart=1
For i = 1 to Len(str1) - 5
strTest = Mid(str1, i, 5)
If VBA.Instr(iStart, str1, strTest, vbTextCompare) Then
iScore=iScore+1
iStart = i+1
End If
Next i
dblReportedScore = 100 * iScore / (Len(str1) - 5)
The problem with that is the undue weighting given to the first matching word: if the search string begins with "VWXYZ" then no subsequent 'word' will score. So, rather than two loops, we now need successive passes trimming out each word in turn and running both loops on the remainder.
And, all of a sudden, this has ceased to be a practical proposition within the speed constraints of VBA.
How did you deal with this type of problem when it came up?