A by-the-book solution to a query-misfire mystery
In November 2001, a SQL Server developer posted a seemingly simple problem to the public newsgroup microsoft.public.sqlserver.programming. Like other SQL Server Most Valuable Professionals (MVPs), I read this newsgroup almost daily, and this post caught my attention. The developer had written a GROUP BY query, but it wasn't working. The problem was a mystery—the query was simple, and the syntax was correct. All the evidence showed that the query should work. But as I and a few other newsgroup participants looked into the problem, we discovered that the database contained redundant information because the design failed to enforce a business rule. This real-world problem turned out to be a textbook example of the importance of understanding normalization, functional dependencies, and anomalies. Theory met practice, and the developer who originally posted the question got a by-the-book solution. The result was a working query that also provided some insight into database design. Let's look at the clues we discovered and see how they helped solve the mystery. If you want to work through the original scenario or understand the newsgroup participants' thought processes, you can read the original newsgroup thread, "GROUP BY Problem," at http://groups.google.com/groups?q=hancox+kass+query.
Suppose you're the developer who posted the question and you work for Everything Made Easy, Inc., an organization that runs workshops about a variety of topics, including pie making, wallpaper hanging, and plumbing. Listing 1's code creates a simplified version of the tables you use to track workshop participants and payments and that inserts some sample data into those tables.
You didn't design the database, but you're responsible for using and maintaining it. When the company receives a workshop registration form, you generate an invoice and insert the invoice details as a new row in the Invoices1 table. Each row represents one invoice and contains a unique invoice ID, the total fee, and the balance due on that invoice after you deduct any prepayment the customer sent with the registration. The Registrants1 table tracks customers, and each row in that table holds a customer's name, the workshop he or she signed up for, and the associated invoice number.
Your mission is to produce a summary report for all workshops. For each workshop, the summary must show the number of invoices, the total fees, and the total unpaid balance. The assignment seems easy, but unfortunately, something isn't working. By using the query that Listing 2 shows, you can easily generate a summary for one workshop at a time. Listing 2's query produces the correct summary for the Pies workshop, as Figure 1 shows. But when you try to use the query that Listing 3 shows to generate all the summaries at once, the query fails.
Following the Clues
Figure 2 shows Listing 3's results. The output accounts for five invoices, but the Invoices1 table contains only four, and the total fees and outstanding balances are too high. Closer inspection reveals that the result is wrong only for the Pies workshop totals. Two questions come to mind. First, why did the query fail to give the correct result? And second, how can you fix the query so that it works? Chances are good that if you can answer the first question, you can write a query that works.
The Invoices1 table contains the data for the counts and sums, and the Registrants1 table relates the invoices to the workshops. The query results are wrong because Registrants1 relates Invoice 2 to the Pies workshop twice. That's because Boban and Margo registered for Pies together on Invoice 2. The database needs to keep track of both Boban and Margo, but does it need to store both the InvoiceID value (2) and the workshop name (Pies) twice? In other words, is any of the information in Registrants1 redundant?
Just the Facts
What is redundant information? Repeated data isn't necessarily redundant information. If I have two friends named Max, my address book repeats the data "Max", but the data isn't redundant. Information is redundant when a fact is recorded twice, directly or indirectly. The name Max is duplicated in my address book, but the duplicates represent separate facts. However, the single entry for my friend Paxton, in New York, NY 10003, records the fact that Paxton lives in New York, NY, twice: once directly and a second time indirectly. The entry contains the city and state where Paxton lives, and it contains his ZIP code; ZIP code 10003 is always in New York, NY. Because a US ZIP code always determines the city and state, the city and state information in my address book is redundant, assuming no missing ZIP codes.
What does this have to do with Everything Made Easy's database? To determine whether the Registrants1 table contains redundant information, you need to know whether the table records any facts more than once. Each row in Registrants1 contains a participant's name, a workshop name, and an invoice number. Invoices 2 and 4 are both listed with the Pies workshop, so you can't know an invoice number just by knowing the workshop name—the invoice number isn't redundant. But is the workshop name redundant? Can you determine the workshop name if you know the invoice number?
Uncovering a Business Rule
I could identify the redundancy in my address book because I understand how ZIP codes relate to places. But the information originally posted to the newsgroup didn't explain how Everything Made Easy runs its business. How can you find out whether an invoice number determines a unique workshop? Only the company's management knows the real answer, but you can draw some conclusions by looking at the information at hand: the sample data, the table structures, and the original question.
The sample data contains no invoice that's split between different workshops. Is it possible to split an invoice? Yes and no. Yes, you can split an invoice because nothing in the database prevents it. You could change Margo's InvoiceID from 2 to 3. But if you do that, you create a problem. You can no longer calculate the total fee for each workshop. Invoice 3 would include Margo's fee for Pies and Tranh's fee for Wallpaper. How would you assign the $150 fee on Invoice 3 to two different workshops?
Two scenarios are possible: Either Everything Made Easy permits only one workshop to appear on an invoice, or it allows more than one workshop on an invoice. The developer who posted the original question let the newsgroup participants know that the company allows only one workshop per invoice. In the language of database theory, this limit is a business rule. However, the database doesn't enforce this rule, and as a result, the database contains something called an update anomaly. Nothing in the tables prevents an update from changing the workshop for Invoice 2 in only one of the two places it appears. So, for example, an update might change Boban's workshop but not Margo's, leaving the database in an anomalous state that doesn't represent any real-world situation because it violates the company's business rule. But this situation never arises, thanks to the business rule. Without the business rule, nothing precludes the anomalous state, and you can't solve the original problem. With more than one workshop per invoice, Everything Made Easy has to realize either that the database is incomplete or that the database needs a rule for splitting one invoice fee between workshops.
Database theory doesn't just find problems, it offers solutions. The "one workshop per invoice" business rule here creates a functional dependency on the columns of the Registrants1 table. The textbook notation for this functional dependency is InvoiceID * Workshop, which means, "InvoiceID determines Workshop." One invoice ID determines one workshop, and you can't have multiple Workshop values in rows that have the same InvoiceID value. Not all functional dependencies create redundancies, but database theory tells us that this one does because InvoiceID determines another column (Workshop) but doesn't determine all the other columns. (For example, InvoiceID doesn't determine the Who column.) Functional dependencies and anomalies are important concepts in normalization theory, and normalizing a database provides many benefits, including the elimination of most anomalies and redundancies.
Almost all theory textbooks describe how to normalize databases. Ramez Elmasri and Shamkadant B. Navathe's book Fundamentals of Database Systems, 3rd edition (Addison-Wesley, 2000) provides a thorough explanation of database normalization and Boyce-Codd normal form, which is the kind of normalization we used to solve Everything Made Easy's problem. Several kinds of normalization, or normal forms, exist, and the more complicated normal forms prevent more anomalies. Database designers often choose Boyce-Codd normal form because it's easy to obtain and it prevents the most common anomalies.
We chose Boyce-Codd normal form because it eliminated the update anomalies and enforced the business rule "one workshop per invoice." One sure way to enforce the functional dependency InvoiceID * Workshop is to allow only one row for each invoice ID. If an InvoiceID value appears in at most one row, you can have only one value of Workshop. You can't impose this requirement on the Registrants1 table because several people can share one invoice, but you can remove the redundant Workshop column from Registrants1 and put the distinct pairs (InvoiceID, Workshop) into a new table that contains the primary key InvoiceID.
Let's see whether this solution creates the summary report you need. The code in Listing 4 creates the new tables InvoiceWorkshop2 and Registrants2 and inserts data for the same people, workshops, and invoices as in Listing 1. I didn't reuse any table names, so if you use the code for practice, you can go back and forth between the versions. Recall that Listing 3's original summary query used the Registrants1 table to match workshops with invoices. You can now use the new table InvoiceWorkshop2 instead. In fact, that's all you change. The summary query in Listing 5 produces the correct results, which Figure 2 shows. If modifying the database isn't an option, you can create InvoiceWorkshop2 as a derived table, then query the original tables by using the code that Listing 6, page 34, shows.
A More "Elementary" Solution
The newsgroup participants designed this solution for an existing database that needed to be improved. If we'd designed the database from scratch or noticed that InvoiceWorkshop2 and Invoices1 have the same primary key and could be combined, we might have found a simpler solution. For example, Listing 7 combines Workshop2 and Invoices1 into a single table and solves the original problem by using a summary query based only on Invoices3. The Invoices3 and Registrants2 tables provide the best database design so far, and you could use them if you had the option of redesigning the database.
Although there's no substitute for real experience in database programming, don't underestimate the value of understanding a real-world business model and being aware of how faithfully the database schema reflects it. And don't underestimate the practical applications of normalization theory and textbook procedures.