Periodically, you need to search a database for specific words or phrases in strings. Users of a music Web site might need to search for CD titles; a bookselling site might offer searches by title or author. Queries that use the LIKE keyword are notoriously slow for this type of search because typical database indexes can't help in searching for words that appear in the middle of a string — such indexes help only in searching from the beginning of a string. To address the need for better text searching, Microsoft included the full-text search component in SQL Server 7.0. Built on the Microsoft Search Service, the full-text search component uses the same engine that Microsoft Index Server and Microsoft Exchange use to perform word and phrase searching. The full-text search component is integrated with SQL Server through a set of T-SQL extensions (CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE). DBAs can administer full-text search through Enterprise Manager and system stored procedures.
However, the full-text search component might not meet your needs as a search solution. My development team experienced some significant limitations with the component, so for a recent project, we decided to develop our own text-search solution. Let's examine the considerations that prompted my team to look at custom solutions; then, let's look at a home-grown alternative to the full-text search component and compare it objectively to a full-text search-based solution.
Microsoft Search vs. Custom Programming
On a recent project, I was the head of a team responsible for the design and implementation of word and phrase searching for a high-volume, music-related Web site. The Web site let users search for specific words or phrases in six database fields, including song names, artists, and CD titles. In general, the strings were less than 200 characters, and the number of strings ranged from 10,000 to 1.2 million, depending on the table and field. We had three possibilities for implementing the search functionality: Microsoft's full-text search component, a third-party search solution, or a custom search solution. We looked at the Yahoo! search engine and Inktomi's search engine as possible third-party solutions, but either one would have cost more than $100,000 in licenses plus the cost of additional hardware. In our case, justifying the cost of development and administration was easier than asking for money for capital purchases, so we decided to research the full-text search component and custom solutions. The project had the following basic requirements:
- Scalability — Ideally, if the search features became unacceptably slow because of high volume, we wanted to be able to add another search server without much administration and with no development overhead.
- Intuitive behavior — We wanted users to be able to search for single letters, numbers, or the first few characters of a word and get reasonable results.
- Low administrative cost — We needed the search to perform well in production, be highly available, and be reliable with minimal administration.
Usually, I'm not one for reinventing the wheel. However, in researching the full-text search component, I came across several concerns that turned my thoughts towards a custom implementation. My first concern popped up when I read David Jones's SQL Server Magazine article "Build a Better Search Engine," July 2000, InstantDoc ID 8828. Jones says, "In a full-text catalog, you can run only eight searches at one time. If you begin a ninth search, Microsoft Search Service will hold it in a queue." Although I hadn't personally verified this restriction, the mention of such a limit caused me to worry about potential performance and scalability problems with the full-text search component.
More importantly, though, the full-text search component isn't integrated with SQL Server storage and consequently requires its own backup and restore strategy. The Microsoft Search Service stores its full-text indexes in a file directory and keeps the index files open while the service is running. This latter detail means that you have to stop the service to back up the indexes. From a DBA's perspective, this design is a major complication to a disaster-recovery plan. The DBA must back up and restore the full-text search files separately from the database they refer to — or rebuild them from scratch.
Also, scaling out the search functionality to multiple servers is difficult because you can't easily share or replicate the index files. And you can't use simple SQL Server replication to transfer the required search data to other servers. Instead, you must copy the full-text search files to multiple servers (a practice that Microsoft documentation doesn't describe) or you must replicate the original data and each server must perform its own index repopulations.
The full-text search component is also inflexible; it doesn't give developers many options for customization. For example, the component has specific rules for dealing with noise words, which are words that SQL Server ignores in queries that use full-text search. The word "the" is a noise word; a search typically ignores it because it isn't useful in narrowing search results. You can find the noise-word files in the full-text search installation directory and edit them to adjust the list, but that's the extent of the control the component gives you over noise-word handling.
Another feature that the full-text search component provides is stemming, in which slight variations of a word can produce valid results. For example, if a user searches for the word running, he might find run, runs, or runner. The problem is that the exact rules for stemming are internal to the full-text search component, and the service doesn't provide a method for viewing or modifying those rules. Consequently, a search for the word magical might retrieve magic; then again, it might not. You can discover the stemming rules only by experimenting. The lack of control over this feature meant that my development team ruled it out in our implementation.
Finally, the full-text search component requires the developer to preprocess the search string to remove noise words before using it in a query. You can see this need by sending a query string that contains only a noise word to the full-text search component. The query returns the message
Server: Msg 7619, Level 16, State 1, Line 1
Execution of a full-text operation failed. A clause of the query contained only ignored words.
You get this message even if valid non-noise words are in the query string. A query string containing the words the and ledger, in which the is a noise word, produced the message above on SQL Server 2000 Service Pack 2 (SP2) despite the fact that ledger isn't a noise word. This quirk forces the components that generate the SQL query (stored procedures or middleware) to use additional parsing and logic to avoid the error.
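For illustration, a query along the following lines triggers the error. (The Books table and Title column are invented for this sketch; any table with a full-text index on a character column would behave the same way.)

```sql
-- Hypothetical reproduction of the noise-word error, assuming a table
-- named Books with a full-text index on its Title column.
-- "the" is a noise word; "ledger" is not.
SELECT *
FROM Books
WHERE CONTAINS(Title, '"the" AND "ledger"')
-- On SQL Server 2000 SP2, this fails with Msg 7619 instead of
-- simply matching rows that contain "ledger".
```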
Doing It Yourself: Custom Programming
Because of the concerns our research raised, we decided to create a custom solution for our full-text searching needs. The design and implementation were surprisingly straightforward, and the administrative ease, scalability, and flexibility more than compensated for the additional development cost. (After designing our custom solution, we tested it on SQL Server 2000 SP2 and SQL Server 7.0 SP3.)
Any custom search solution requires three things: a place to physically store the full-text data, a mechanism to store the full-text data, and a way to query the data. In our solution, we fulfilled all three requirements through SQL Server functionality.
I can best explain our solution by giving an example. Suppose you have a table named Email that contains email messages. You create this table by using the definition that Listing 1 shows. The table has MessageID, subject, and body fields. The subject field must be full-text searchable. You store the full-text data by creating one table in a SQL Server database for each field that's full-text searchable. For this example, let's name the new table FT_EmailSubject. This table has a specific format: a row identifier that relates back to the original table, a word column, and a position identifier. You can create the FT_EmailSubject table by using the code in Listing 2.
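The table layout the example describes can be sketched as follows. (The article's Listings 1 and 2 aren't reproduced here; the column types below are assumptions that follow the description in the text.)

```sql
-- Sketch of the base table: one row per email message.
CREATE TABLE Email (
  MessageID INT IDENTITY(1,1) PRIMARY KEY,
  Subject   VARCHAR(200) NOT NULL,   -- the full-text-searchable field
  Body      TEXT NULL
)

-- Sketch of the custom search table: one row per word occurrence
-- in a subject, in the three-column format the text describes.
CREATE TABLE FT_EmailSubject (
  MessageID    INT NOT NULL REFERENCES Email(MessageID),  -- row identifier
  Word         VARCHAR(50) NOT NULL,                      -- one word per row
  WordPosition INT NOT NULL                               -- word's position in Subject
)
```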
Conceptually, a custom search index is a list of words that appear in a field and the position of each word in the string. Collecting this data, then, is simply a matter of enumerating through each row in Email, extracting all the words in the subject field of each row, and storing those words and their positions in the FT_EmailSubject table. To initially populate FT_EmailSubject, you could accomplish these steps by using a cursor; or you could parse the subject fields offline, store them in a file, then bulk-load the file into the production FT_EmailSubject table.
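The cursor-based initial population might look something like this minimal sketch, which assumes single spaces are the only word separators; a production parser would also handle punctuation, casing, and word-length limits.

```sql
-- Walk each Email row, split the subject on spaces, and store each
-- word with its position in FT_EmailSubject.
DECLARE @id INT, @subj VARCHAR(200), @word VARCHAR(50), @pos INT, @space INT

DECLARE email_cur CURSOR FOR SELECT MessageID, Subject FROM Email
OPEN email_cur
FETCH NEXT FROM email_cur INTO @id, @subj
WHILE @@FETCH_STATUS = 0
BEGIN
  SET @pos = 1
  WHILE LEN(@subj) > 0
  BEGIN
    SET @space = CHARINDEX(' ', @subj)
    IF @space = 0 SET @word = @subj
    ELSE SET @word = LEFT(@subj, @space - 1)

    IF LEN(@word) > 0  -- skip empty tokens from repeated spaces
      INSERT FT_EmailSubject (MessageID, Word, WordPosition)
      VALUES (@id, @word, @pos)

    SET @subj = CASE WHEN @space = 0 THEN ''
                     ELSE SUBSTRING(@subj, @space + 1, LEN(@subj)) END
    SET @pos = @pos + 1
  END
  FETCH NEXT FROM email_cur INTO @id, @subj
END
CLOSE email_cur
DEALLOCATE email_cur
```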
The method you choose for updating FT_EmailSubject depends on how data is inserted into the Email table and how often the table is updated. If you have a stored procedure to store a new row in Email, you could modify that stored procedure to simultaneously parse the subject and add the required rows to the FT_EmailSubject table. In our Web-site search implementation, we performed all inserts, updates, and deletes to the base table by using stored procedures. We simply modified each INSERT and UPDATE stored procedure to call a user-defined function (UDF) that returned the list of words in the searchable field. We then stored these words in the custom search table.
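An INSERT stored procedure that keeps the search table in sync might look like the following sketch. Here, dbo.fn_SplitWords is a hypothetical table-valued UDF, along the lines the text describes, that returns (Word, WordPosition) pairs for a string.

```sql
-- Sketch of an insert procedure that parses the subject as part of the
-- same operation that stores the row; fn_SplitWords is a hypothetical
-- word-splitting UDF, not part of the article's listings.
CREATE PROCEDURE dbo.InsertEmail
  @Subject VARCHAR(200),
  @Body    TEXT
AS
BEGIN
  DECLARE @id INT

  INSERT Email (Subject, Body) VALUES (@Subject, @Body)
  SET @id = SCOPE_IDENTITY()

  -- Populate the custom search table from the parsed word list.
  INSERT FT_EmailSubject (MessageID, Word, WordPosition)
  SELECT @id, Word, WordPosition
  FROM dbo.fn_SplitWords(@Subject)
END
```

An UPDATE procedure would follow the same pattern, deleting the row's existing FT_EmailSubject entries before re-inserting the newly parsed words.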
The structure of FT_EmailSubject provides a great deal of search flexibility. For example, the query in Listing 3 finds all messages that have the phrase last week in the subject. Similarly, the query in Listing 4 finds all messages that have the words last or week anywhere in the subject. The query in Listing 5 finds the word last appearing as many as five words before the word week. And, to return strings that contain week, weekend, weekday, or weekly, we simply use the LIKE clause with a trailing wildcard. This method is still efficient because the query optimizer can use the existing indexes on the Word column with a trailing wildcard. In effect, the LIKE clause with a trailing wildcard is a stemming query in which the stem word is week, as Listing 6 shows.
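The kinds of queries the text describes can be sketched as follows. (Listings 3 through 6 aren't reproduced here; these are approximations built on the FT_EmailSubject layout above.)

```sql
-- Phrase search: "last" immediately followed by "week".
SELECT DISTINCT a.MessageID
FROM FT_EmailSubject a
JOIN FT_EmailSubject b
  ON b.MessageID = a.MessageID
 AND b.WordPosition = a.WordPosition + 1
WHERE a.Word = 'last' AND b.Word = 'week'

-- OR search: "last" or "week" anywhere in the subject.
SELECT DISTINCT MessageID
FROM FT_EmailSubject
WHERE Word IN ('last', 'week')

-- Nearness search: "last" appearing as many as five words before "week".
SELECT DISTINCT a.MessageID
FROM FT_EmailSubject a
JOIN FT_EmailSubject b
  ON b.MessageID = a.MessageID
 AND b.WordPosition BETWEEN a.WordPosition + 1 AND a.WordPosition + 5
WHERE a.Word = 'last' AND b.Word = 'week'

-- Stemming via a trailing wildcard; the optimizer can still use
-- an index on Word because the pattern's prefix is fixed.
SELECT DISTINCT MessageID
FROM FT_EmailSubject
WHERE Word LIKE 'week%'
```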
As you can see from the above examples, the structure of FT_EmailSubject allows complex searching, including nearness matches (one word appearing within n words of another word), OR searches, AND searches, and stemming. The only limitations on searching are the sophistication of the query generator and the requirement of one join of the FT_EmailSubject table for each word in the search phrase. Our implementation limited our searches to the five longest words of the original user-entered search string; this provided a good balance between performance and narrow search results.
Notice that all text queries to our table require the DISTINCT keyword. Without DISTINCT, if a word appears in the subject field twice, the MessageID could appear twice in the result set. For a variation that removes the sorting overhead of DISTINCT, see the Web sidebar "Improving Retrieval Performance." (To access the Web sidebar, go to http://tsqlsolutions.com and enter InstantDoc ID 27363.)
In our experiments, we found that indexing the FT_EmailSubject table worked best when we put a clustered index on the Word, MessageID, and WordPosition columns and a nonclustered index on the MessageID, Word, and WordPosition columns. As always, you'll need to perform your own index tuning for your environment.
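That index layout can be expressed as follows; the index names are invented for the sketch.

```sql
-- Clustered index: groups all occurrences of a word together,
-- which serves word-driven lookups and trailing-wildcard LIKEs.
CREATE CLUSTERED INDEX CI_FT_EmailSubject
  ON FT_EmailSubject (Word, MessageID, WordPosition)

-- Nonclustered index: serves lookups by message, such as deleting a
-- row's entries when its subject changes.
CREATE NONCLUSTERED INDEX NC_FT_EmailSubject
  ON FT_EmailSubject (MessageID, Word, WordPosition)
```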
Full-Text Search vs. Custom Programming
In comparing the full-text search component and our custom solution, we found that, for our project, the full-text search component's biggest advantage was lower development cost. After implementing our custom solution, we discovered that our solution had far lower administrative costs. Besides our concerns about the full-text search component's reliability — based on our own experience and newsgroup postings — we also found that performance tuning, replication, and disaster recovery were all more difficult (or impossible) with the full-text search component than with our custom solution.
One area in which the full-text search component performed better than our solution was the RANK feature. This feature provides a numeric relevance value for each result that a query returns. The value isn't well documented, but it's a function of the number of times the search words appear in the string the search returns. The full-text search component provides this value automatically; in a custom solution, you'd have to calculate the value yourself. However, because our Web site's search results are often sorted alphabetically, this limitation in the custom approach was insignificant for our project. Table 1, page 6, shows how the full-text search component compares with our custom implementation in several performance categories.
This article gives you just a taste of what you can do with a custom search solution. The indexing scheme this article recommends lets a LIKE clause with a trailing wildcard efficiently find words that begin with the first few letters a user types (e.g., a search for magic also returns magical and magician). And you can tweak the word parsing to improve similarity searching by removing duplicate sequential characters and playing with word-break characters. For example, a search for Reily could return O'Reilly and Reilly. Finally, you can easily change the entire approach to perform SOUNDEX searching: Simply replace the Word field with a SOUNDEX field and store SOUNDEX(Word) instead of Word in the population steps. The SOUNDEX value could be stored in a separate table, in the same table as the original word, or even as an indexed computed column or as part of an indexed view.
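The SOUNDEX variation can be sketched as follows. The computed-column approach is an assumption based on SQL Server 2000's indexed computed columns, which require deterministic expressions and certain SET options to be in effect.

```sql
-- Add a computed column holding the SOUNDEX code of each word,
-- and index it so similarity searches can seek rather than scan.
ALTER TABLE FT_EmailSubject ADD SoundexWord AS SOUNDEX(Word)

CREATE NONCLUSTERED INDEX NC_FT_EmailSubject_Soundex
  ON FT_EmailSubject (SoundexWord)

-- A similarity search then compares SOUNDEX codes instead of
-- literal words, so "Reily" can match "Reilly".
SELECT DISTINCT MessageID
FROM FT_EmailSubject
WHERE SoundexWord = SOUNDEX('Reily')
```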
In our real-world project, we first implemented the full-text search component to get to market, then transitioned to a custom solution similar to the example outlined in this article. For our project, our custom solution improved query performance by 10 times over full-text search. Overall, the total cost of the custom implementation (development plus administration) was much cheaper than the ongoing administrative costs of the full-text search component. Although developing custom solutions can be more expensive than using an existing product, the reliability and scalability concerns associated with the full-text search component might make it worthwhile to seriously consider a home-grown implementation.