39604.zip

I've received a lot of feedback during the past few months from readers suggesting solutions to puzzles I've presented in my column. So this month and next, I'll share some of the best solutions. I received many creative solutions and had a tough time deciding which to discuss. Thanks to all who sent in your ideas, and keep trying to solve the puzzles and improve your T-SQL skills. Sometimes, working on a puzzle can be its own reward.

### August Relational-Division Puzzle

This month, I share two set-based solutions to last month's relational-division puzzle (which I had posted in a private SQL Server MCT forum) from Leroy Tuttle and Dieter Nöth. The puzzle scenario involved a training facility that uses three tables—Courses, Materials, and CourseMaterials—to track information about courses and course materials. You can run the code from August's Listing 1 to create and populate these tables. (To read last month's column, "Survive the (Relational) Divide," or download its code listings, enter InstantDoc ID 39300 at http://www.sqlmag.com.) The puzzle's object was to return, for each course, the minimum course ID of all the courses that use the same set of course materials. The desired solution appeared in August's Figure 1.

For benchmarking, August's Listing A in the sidebar "Benchmarking Relational-Division Solutions" (InstantDoc ID 39301) provided code that populates the tables with large sets of data. When I populated the tables with 10,000 courses, 10,000 materials, and about 55,000 course materials, my set-based solutions ran for several minutes on my laptop. In contrast, Tuttle's solution runs for about 30 seconds, and Nöth's solution runs for less than 1 second.

### Tuttle's Solution

Tuttle's solution correlates courses with each other based on a match in both the total count of materials per course and the individual materials themselves. I slightly revised the original solution Tuttle sent me to make it purely set-based, but the principles are the same. Let's first look at a modular approach that uses views to solve the puzzle one step at a time; then, I'll present a one-step solution that uses derived tables.

The first step in the modular approach is to create the CoursesXYCnt view by running the code that Listing 1 shows. The view query joins the CourseMaterials table to itself, matching rows that have the same courseid and materialid, and returns a row for each unique combination of course IDs that have matching rows. I call the two instances of CourseMaterials X and Y. The view returns the courseid from X (courseidX), the courseid from Y (courseidY), and the count of matching rows (cnt). So we're matching the materials that are the same for both courses and returning the count of matches. Note that the count of matching rows can be smaller than the count of course materials in X or Y if courses X and Y don't have equal sets of course materials. Figure 1 shows the results of a SELECT * query against CoursesXYCnt.

The second step in this solution is to create the CoursesCnt view by running the code that Listing 2 shows. This view contains a simple query that returns the number of course materials for each course. Figure 2 shows the results of a SELECT * query against CoursesCnt.

The third and final step is to join the CoursesXYCnt view to two instances of CoursesCnt—one for Course X and one for Course Y. The join ensures that you retrieve only those rows from CoursesXYCnt that have the same count as the number of course materials in course X and course Y. Only the courses whose course materials match exactly meet the join criteria. Now all you need to do to get the desired output is to group the result by courseidX and return the minimum courseidY, as the query in Listing 3 shows.

I used views to simplify the solution's explanation, solving the problem a step at a time, but you might prefer to use derived tables instead of views, as the code in Listing 4 shows. Derived tables are handy if you'd rather not create persistent objects such as views in your database—for example, because of lack of permissions. In terms of performance, both solutions perform the same. Views are often handy when your code becomes too long or complex to maintain, but the code in Listing 4 is short even when you use derived tables, so there's no practical reason to break it into views. As I said earlier, Tuttle's solution runs for 30 seconds on my laptop against a table containing about 55,000 course materials.

### Nöth's Solution

Nöth's solution originally used derived tables, but I use views again to present it in steps. This solution also compares course materials and counts of materials used for both courses. However, Nöth adds another important aspect that makes the solution highly efficient. First, the solution compares several aggregations to find pairs of courses that have a good chance of having the same course materials, then it counts the matching base rows in CourseMaterials. By "good chance," I mean that by comparing only the result of aggregate functions of different sets, you always get matches for sets with the same members, but you might also get matches for sets with different members. For example, the count, sum, minimum, and maximum of the sets \{1,3,5,7\} and \{1,2,6,7\} are the same. However, the probability that several different aggregate functions will return the same results for sets with different members is low, so the solution has to compare the actual set materials for a small percentage of the courses.

The first step in this solution is to run the code that Listing 5 shows to create the CMAgg view, which calculates several aggregations for each course. The view returns, for each course, the count of course materials and the minimum, maximum, and sum of materialids. Figure 3 shows the results of a SELECT * query against CMAgg.

The second step is to create the CM view by running the code that Listing 6, page 18, shows. The view's query joins the CMAgg view to itself, returning pairs of courses that return the same results for the different aggregate functions and also explicitly returning the count of their course materials, which is used in a later step. As I mentioned earlier, courses that return the same results for the different aggregate functions are most likely to have the same course materials. Notice that the view's query in Listing 6 looks only for matching courses that have different course IDs and uses the greater-than operator (CM1.courseid > CM2.courseid) to eliminate duplicates (e.g., if courses 3 and 2 have matching aggregates, you get the pair 3, 2 but not 2, 3). Figure 4 shows the results of running a SELECT * query against CM. Note that you get only one row for the sample data I provided in August's Listing 1. For large data sets, the query would probably return more rows, but it always returns only the pairs that have matching aggregations and different course IDs.

The third step is to create the DT view by running the code that Listing 7 shows. The view's query filters for only pairs of courses from CM that have the same count as the number of matching course materials for the pair of courses. In the WHERE clause's subquery, this step calculates the number of matching course materials for the pair of courses. The code groups the result by courseid1 and returns the minimum courseid2 for each unique instance of courseid1. Figure 5 shows the results of running a SELECT * query against DT.

To finish this solution, you simply write a query that returns, for each course in Courses, the minimum courseid in DT or the course ID itself from Courses if a match doesn't exist in DT. Listing 8 shows the complete solution. The left outer join between Courses and DT returns the rows from Courses that have no match in DT. The COALESCE() function returns the desired results.

If you'd rather not use views for reasons such as those I described earlier, you can try the one-step solution that Listing 9 shows. This solution uses only derived tables. Nöth's set-based solution, with or without the derived tables, is amazingly efficient, running in less than 1 second on my laptop against a table of about 55,000 course materials. It also scales very well, running in 2 seconds for 110,000 rows and 3 seconds for 165,000 rows.

### Never Give Up

In last month's benchmarking sidebar, my set-based solutions gave poor performance compared to the hybrid set-based and iterative solutions. Tuttle's and Nöth's alternative solutions prove that you should never give up trying to find an efficient set-based solution. Next month, I'll show two readers' solutions to puzzles I presented in January and February 2003.