Convert the C++ Double Metaphone algorithm to T-SQL
The spell checker in Microsoft Word 97 had a quirk that earned a lot of giggles. If you typed zzzzz, the suggested correction was sex. Some people saw the quirk as intentional humor or sabotage, and a psychoanalyst probably could have had a field day with it. But the likely culprit was the SOUNDEX algorithm, which SQL Server implements with its built-in SOUNDEX() function. The SOUNDEX algorithm converts a word to an encoded representation of its pronunciation. So, SOUNDEX('Smith') and SOUNDEX('Smythe') both return the encoded value S530. And SOUNDEX('zzzzz') returns Z200, which is a close match to the result SOUNDEX('sex') produces: S200.
My company's record-management products needed a similar phonetic word-matching functionality. We wanted to offer fuzzy searching so that spelling errors or variations could be included in search results, and we also wanted spell checking of new data that included intelligent suggestions for misspellings. The SQL Server SOUNDEX() function was a key part of our initial implementation of both these features. As a built-in function, SOUNDEX() was easy to use, but we were unhappy with some of the matches it suggested.
Unfortunately, SOUNDEX()—patented in 1918—assumes that every consonant has a single consistent pronunciation, which is too simple a formula for most languages. In American English, most consonants have varying sounds and can even be silent. For example, SOUNDEX('phone') and SOUNDEX('pony') both return the encoded value P500, but the words aren't pronounced the same.
In the C/C++ Users Journal article "The Double Metaphone Search Algorithm," June 2000 (http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles), Lawrence Philips published a revised version of his Metaphone algorithm, which he wrote in 1990 as a replacement for SOUNDEX. Like SOUNDEX, the revised version of Metaphone, named Double Metaphone, is designed primarily to encode American English names (though it also encodes most English words well) while taking into account the fact that such words can have more than one acceptable pronunciation. Double Metaphone can compute a primary and a secondary encoding for a given word or name to indicate both the most likely pronunciation as well as an optional alternative pronunciation (hence the "double" in the name).
I've implemented the Double Metaphone algorithm in a set of SQL Server—based client applications for use in spell checking. When an application encounters an unknown word, it uses Double Metaphone to find likely substitutions from a table of known correct words. A previous implementation used the SQL Server SOUNDEX() function, but as I noted earlier, it produced too many poor suggestions to be valuable in most cases. Double Metaphone provides much better quality of matches, but SQL Server doesn't have a built-in function that implements this algorithm. So, I turned to SQL Server 2000's user-defined functions (UDFs) as the core building blocks to translate the Double Metaphone algorithm from C++ to.
SOUNDEX and Double Metaphone translate each consonant into a limited set of characters. Table 1 shows some examples of SOUNDEX and Double Metaphone encoding. Similar sounds use the same character; for example, the Metaphone algorithm encodes both b and p sounds as p. So the encoding of a misspelled word will often match the encoding of the word that was intended. The secondary Metaphone encodings can represent less common pronunciation variations in names and can encode many character patterns that have varied pronunciations. For example, the letter t has varying pronunciations that Double Metaphone can usually determine by its adjacent letters. When ion follows t, the combination is usually pronounced shun, but when h follows t, the letters usually combine as in the word thanks. Notice in Table 1 that whereas the primary Metaphone encoding is accurate for benign and poignant, the secondary encoding is better than the primary in the case of benignant. Depending on the level of fuzziness that's acceptable or desired in your environment, you might want to use both types of Metaphone encodings or simply ignore the secondary encoding altogether.
Although my company's application stores word lists and precomputed Double Metaphone values in SQL Server tables, the client applications were performing all the calculations. Implementing these calculations in each client application was a maintenance problem for us. Any corrections or enhancements to the algorithm required simultaneous updates of all applications. Some client code is C++ and some is Visual Basic (VB), and because consistency of the shared data was the primary goal, it was vital that all clients shared the same logic. Centralizing these calculations in SQL Server UDFs eliminates this maintenance burden. Web Listing 1 contains the UDFs that Table 2 describes. (To download Web Listing 1, enter InstantDoc ID 26094 at http://www.tsqlsolutions.com and click Download the code.)
Translating Double Metaphone from C++ to T-SQL presented several challenges. First, the core function in the C++ version returns two values, similar to a stored procedure that has two OUTPUT parameters. UDFs don't have OUTPUT parameters; they just return a single result. UDFs come in two types: scalar and table-valued. Scalar functions return a single value, whereas table-valued functions return a dynamic number of values through rows in the returned table and can also provide multiple distinct values through columns in the table. I thought a table containing two columns would be a good way to design a literal translation of the Double Metaphone function to T-SQL. To set up the table so that the primary and secondary encodings would be interchangeable, I preferred to return one or two single-column rows for each word I input. But despite its table format, this design is limited in set-based operations. For example, a join that attempts to call the UDF with a parameter value based on the column name from each row in a table, such as the one that Listing 1 shows, returns the following error:
<i>'Word' is not a recognized
OPTIMIZER LOCK HINTS option.</i>
Instead of the column Word from the joined TableOfWords table, SQL Server expects a literal value or variable passed to a table function. Otherwise, SQL Server interprets the parenthetical clause as an old-fashioned optimizer hint. I thought correlated subqueries such as the one that Listing 2 shows seemed like a possible workaround, but they failed with the same error. At this point in my translation process, a stored procedure began to sound reasonable. But I had yet to try creating scalar functions. To avoid redundant code, my preference was to put a wrapper around the main table function—the fnDoubleMetaphoneTable() function—which requires one extra parameter besides the word to encode so that it can determine which column to return from the result that fnDoubleMetaphoneTable() returns. The code in Listing 3 demonstrates this technique by using the Pubs database.
The SampleScalarFunc() wrapper function that Listing 3 shows simply selects one column, which the @FirstOrLast argument determines, from the table of results that the SampleTableFunc() function produces. So depending on where and how you plan to use the data available from these UDFs, you need to use the appropriate syntax. If you were using the UDF as a scalar function, for example, you'd use the statement
And if you were using the UDF as a table function, you'd use the statement
In Web Listing 1, the function fnDoubleMetaphoneTable() is the core table-valued UDF, and fnDoubleMetaphoneScalar() is a minimal scalar-valued wrapper around the core function. FnDoubleMetaphoneScalar() includes an argument, @MetaphoneType, that lets the caller select which column to return, as Listing 4 shows. Another option would be to pack both encodings into one char(8) return value in which the first four characters would contain the primary encoding and the next four characters would hold the secondary encoding. For example, the function would return AJNSAKNX for the word agencies. This option presents a more awkward and fragile interface for the function's user, but it's worth considering if you discover a performance problem when you return one column at a time. Separate calls to retrieve one column at a time will essentially cause duplicate processing, but I favor simplicity and clarity over hypothetical optimizations. In some cases, the secondary encoding might not even interest the caller. Either way, you might find it worthwhile to offer the functionality through both a table and scalar UDF, provided you don't have to create significant redundant code.
Another challenge arose when I tried to define one of the utility functions, fnStringAt(). This function simply checks for any of several matching substrings at a given start position. The trickiest part of using this function is passing it a list of strings to test for, because the size of the list can vary. You can use variable-length argument lists in C-based languages. In other languages, passing some type of collection object that contains all the strings allows similar flexibility in handling parameters.
Although T-SQL UDFs can define default values for arguments much like stored procedures do, the UDF implementation of this feature isn't as flexible as a stored-procedure implementation. Rather than being able to completely omit optional arguments, as you can when you run EXEC sp_help, the UDF caller must specify DEFAULT for every argument. This restriction would require calling the function with a statement such as
DEFAULT, DEFAULT, DEFAULT, ... )
This kind of coding is, of course, inconvenient for the function caller. For example, in processing the letter g, the Double Metaphone algorithm would require 10 such optional arguments to see whether the beginning of a word matches any of 11 values (ges, gep, geb, gel, gey, gib, gil, gin, gie, gei, ger). Likewise, all other calls to the UDF would have to specify DEFAULT for each unused argument. This would be an onerous burden to place on every caller of the function and would eliminate most of the convenience that default arguments usually offer. So, I decided to violate the first normal form with my version of fnStringAt() and pack multiple values into a single parameter. This function's final parameter is a single varchar(2000) value, which can include multiple comma-delimited substrings. This method places some unwanted formatting requirements on the caller, but the most unpleasant work—parsing the substrings back out of the parameter—is hidden within the fnStringAt() implementation.
Among other problems I encountered in translating Double Metaphone from C++ to T-SQL, I had to change string operations from 0-based to 1-based. In C++, the first character in a string is in position 0, but T-SQL strings start with position 1. A bigger job, however, was converting the procedural logic from a C switch statement to a series of IF/ELSE IF tests. In many cases, the T-SQL CASE construct matches the switch functionality, but CASE isn't a control-of-flow statement. SQL Server Books Online (BOL) refers to CASE as a function several times, but CASE is really an expression that returns a single scalar value. So, I translated the C++ switch statements to a series of T-SQL IF/ELSE IF statements, taking care to preserve the fall-through logic of C++ case values that had no break statement as logical OR conditions in the T-SQL statements.
The original C++ code for Double Metaphone also takes advantage of short-circuit evaluation—a condition that contains multiple tests—stopping processing as soon as possible. For example, in the following code, the first test is always false, so the second part doesn't need to be evaluated:
The short circuit above is important because the substring operation contains invalid arguments, which usually would cause an error. But if that substring is never evaluated, no error is returned. An example of short-circuit logic looks something like this:
SQL Server will short circuit as an optimization, but it doesn't guarantee order of evaluation—the optimizer might decide to internally switch the expression's order of evaluation. So you can't safely rely on short circuiting to prevent errors as in the SUBSTRING() example above, even if tests show the query plan follows a particular sequence. (For more information about the short-circuit feature in SQL Server, see Itzik Ben-Gan's SQL Server Magazine T-SQL Black Belt column "Short Circuit," September 2000, InstantDoc ID 9148, at http://www.sqlmag.com.) For example, given different statistical data, service pack versions, or other runtime conditions, SQL Server might produce a query plan with a different sequence.
Fortunately, the precise approach SQL Server chooses won't affect results in this example because the T-SQL SUBSTRING() function allows negative values for its start argument—the function will simply return an empty string. In the Double Metaphone code, you might get an empty string if some IF statements aren't evaluated from left to right. A negative value for the SUBSTRING() length argument will produce an error, however, so you should always ensure that T-SQL code can't attempt to call a SUBSTRING() function with such an argument.
Spaces: The Final Frontier
When I tested my translated T-SQL version of Double Metaphone against a list of roughly 100,000 words, I discovered some sticky problems. One of the more complicated tests—the test for the consonant sequence ch—contains the following code:
VON ,SCH')=1) OR
— several more tests omitted
The first call to fnStringAt() in this example is prone to a subtle typo. The first two strings to test for end with a space. The second fnStringAt() call includes a single space as its final search string. At first glance, I wondered if this was a typo in the original code, but I believe it's a deliberate test for the end of the word. C++ and T-SQL have a couple of string-handling differences that come into play in such a case. First, using SUBSTRING() to read past the end of a string is harmless in T-SQL, but a similar attempt in C++ would likely crash. So we don't need to pad the input string with spaces on the end, but we do have to take care to handle that trailing space in the fnStringAt() argument. LEN() excludes trailing blanks, so the statement LEN(' ') returns 0. Similarly, the statement IF '' = '' evaluates to true. The combination of these behaviors makes it difficult to determine when you've parsed the last substring from the comma-delimited list of targets. To solve the problem, first I append a comma to the comma-delimited list, which also eliminates some special-case code to find the last substring with no comma following it. Then, when parsing a substring, I retain the terminating comma so that the code will never contain trailing blanks.
Tailoring the Algorithm
The Double Metaphone algorithm isn't scientific; it's the result of much experimentation and research, but it can't be proven correct. Possible improvements are likely, especially if you can tailor the rules for a specific environment or usage. One simple modification I highly recommend is to do away with the four-character limit on the length of encodings. Many words have more than four significant sounds, and a word with four encoded symbols is probably an unlikely match for a word with seven sounds, even if the first four sounds match exactly. The only drawback to allowing more characters in the encoding is that the algorithm requires more storage space, but SQL Server's varchar data type minimizes that problem by using only the space that each encoding requires, and the algorithm still performs efficiently for my environment.
Let's look at a specific change to the algorithm logic that improves its handling of a tricky name such as Michelle Pfeiffer. (OK, so it was a pseudo-random choice.) The original Double Metaphone algorithm produces the code PFFR to represent the name Pfeiffer. That's not going to work well because the p is silent. Ideally, this name would match spellings such as Pheiffer and Fifer, which both produce codes of FFR. The Double Metaphone algorithm already converts ph to an F symbol, so it should be simple to add the same logic for pf, as the code in Listing 5 shows.
However, you have to carefully check the side effects of changes such as this one. Where else might the consonant combination pf appear? Campfire, helpful, leapfrog, and other words have the sequence, but you don't want to ignore the p sound in those words. You could also add tests to the code to make sure the pf sequence begins the word, which should leave all words that have a nonvocalized labial plosive p unchanged—but such a change won't help Dire Straits' Mark Knopfler or Bugs Bunny when he's cooking hasenpfeffer for the king.
At some point, we have to accept that no algorithm is going to encode all words ideally. Just be very careful with any improvements you make to the code. I recommend building a simple regression test against a large bank of words and names so you can quickly assess the effects of any changes you make (Listing 6 shows a small example test script). Also, keep in mind that you'll likely be using the algorithm for new words that aren't in any dictionary or phone book. A common use for matching algorithms such as this is to assist with spelling errors and typos. User errors are impressively creative, so you have to tailor the algorithm to match the words and names in your data as well as to the misspellings and typos users will likely input.
UDFs are a great option for centralizing business logic, though you must consider some limitations and quirks. Despite being surprised by some of these constraints, I continue to find that UDFs solve some problems better than stored procedures or views. As for the Double Metaphone algorithm, my company has found that its encodings are much better than SOUNDEX for finding phonetic matches, and I highly recommend it as a starting point for any similar requirements. Even if you don't have immediate use for the Double Metaphone algorithm, you can use this T-SQL code as an example of defining and working with both scalar and table-valued UDFs. And now that I don't have to maintain all this code duplicated in multiple client applications, I'm just hoping to catch some Zs!