As T-SQL programmers, we always hear that the SQL language is optimized for set-based solutions rather than procedural solutions, but we seldom see examples from that perspective. Consequently, many beginning SQL programmers don’t have a clear understanding of what set-based means in terms of the code they need to write to solve a specific problem.

Even for those who understand the concept, there are many programming problems for which a set-based solution seems impossible. Sometimes that's true. It's not always possible to find a set-based solution, but most of the time we can find one by using a little creative thinking. A good SQL programmer must develop the mental discipline to explore set-based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article, I provide a relatively simple example that illustrates how to think in a set-based way about a common type of problem that also has an intuitive procedural solution.

The Business Case

When you visit the doctor’s office, the first thing the nurse does is put you on a scale, record your weight, and check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you're very young, your height hasn’t changed since your last visit and isn't likely to change again.

The reason the nurse checks your height is to guard against identity theft. Health care providers want to make sure that the services they provide are going to the person who gets the bill—not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides a good example for comparing procedural thinking and set-based thinking when programming in SQL.

The Problem Statement

The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive, but the set-based solution isn't so obvious.

In this particular problem, we're looking for rows where a previous visit for the same patient has a height value that's different from the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from, and what the height was changed to. We don't want to return any records that don't mark a change in height.
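To make the requirement concrete, here is a minimal sketch of the input and the rows we would expect to report. The table variable @Visits and its sample values are hypothetical and are not part of the article's listings; the multi-row VALUES clause assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data: two patients, weekly visits.
DECLARE @Visits TABLE
( PatientID INT      NOT NULL
, VisitDate DATETIME NOT NULL
, Height    INT );

INSERT @Visits (PatientID, VisitDate, Height) VALUES
  (1, '20240101', 70)
, (1, '20240108', 70)   -- same height as the previous visit: not reported
, (1, '20240115', 72)   -- changed from 70 to 72: reported
, (1, '20240122', 70)   -- changed from 72 to 70: reported
, (2, '20240101', 65)
, (2, '20240108', 65);  -- patient 2 never changes: nothing reported

SELECT PatientID, VisitDate, Height
FROM @Visits
ORDER BY PatientID, VisitDate;

-- Rows the audit should return for this data:
--   PatientID  DateChanged  HeightChangedFrom  HeightChangedTo
--   1          2024-01-15   70                 72
--   1          2024-01-22   72                 70
```

Note that the first visit for each patient can never be reported, because there is no earlier visit to compare it with.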

Listing 1 gives you the code to create and populate the tables in this example, if you'd like to run the example yourself.

Listing 1: Creating and populating the tables

We use the AdventureWorks sample database to hold the tables for our test, but you can use another database by changing the database named in the USE statement for all three listings.

    USE AdventureWorks;

    SET NOCOUNT ON;

    CREATE TABLE Dates
    (ID INT, VisitDate datetime);

    -- populate table with 20 visit dates, one week apart
    DECLARE @i INT, @startdate datetime;
    SET @i = 1;
    SET @startdate = GETDATE();

    WHILE @i <= 20
    BEGIN
        INSERT Dates
        (ID, VisitDate)
        VALUES (@i, @startdate);

        SET @startdate = DATEADD(dd, 7, @startdate);
        SET @i = @i + 1;
    END

    CREATE TABLE PatientHeight
    (PatientID  INT NOT NULL
    ,Height INT);

    -- populate table with 10,000 patient IDs with heights between 59 and 74 inches
    SET @i = 1;

    WHILE @i <= 10000
    BEGIN
        INSERT PatientHeight
        (PatientID, Height)
        VALUES (@i, @i % 16 + 59);

        SET @i = @i + 1;
    END

    ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight
        PRIMARY KEY(PatientID);

    -- cartesian join produces 200,000 PatientVisit records
    -- (ISNULL makes the new columns NOT NULL so they can form the primary key)

    SELECT
        ISNULL(PatientID, -1) AS PatientID,
        ISNULL(VisitDate, '19000101') AS VisitDate,
        Height
    INTO PatientVisit
    FROM PatientHeight
    CROSS JOIN Dates;

    ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit
        PRIMARY KEY(PatientID, VisitDate);

    -- create changes of height: add 2 inches on one visit for every seventh patient
    SET @i = 3;

    WHILE @i < 10000
    BEGIN
        UPDATE pv
        SET Height = Height + 2
        FROM PatientVisit pv
        WHERE PatientID = @i
        AND pv.VisitDate =
            (SELECT TOP 1 VisitDate
             FROM Dates
             WHERE id = ABS(CHECKSUM(@i)) % 19);

        SET @i = @i + 7;
    END

    /*
    -- return AdventureWorks to its previous state when you are finished
    -- with this example.

    DROP TABLE Dates;
    DROP TABLE PatientHeight;
    DROP TABLE PatientVisit;
    */

A Procedural Approach

The intuitive, procedural way to attack this problem is to order the records by patient and visit date, then loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then, we loop through subsequent records for the patient, comparing height values. If we find that the height is different on a subsequent record, we write an audit record, update the height variable with the current value, and continue looping through the rows. Then we move to the next patient.

Listing 2 contains the code for the cursor-based solution. The cursor method works, but it's very inefficient. It could pose a serious performance problem when working with a large number of rows. How can we do this in a set-based and presumably more efficient way?

Listing 2: The cursor-based solution (USE AdventureWorks)

    CREATE TABLE #Changes
    ( PatientID     INT
    , VisitDate     datetime
    , BeginHeight   SMALLINT
    , CurrentHeight SMALLINT);

    DECLARE @PatientID     INT
    ,       @CurrentID     INT
    ,       @BeginHeight   SMALLINT
    ,       @CurrentHeight SMALLINT
    ,       @VisitDate     datetime;

    SET @PatientID = 0;

    DECLARE Patient_cur CURSOR FAST_FORWARD FOR
    SELECT PatientID
    , VisitDate
    , Height
    FROM PatientVisit
    ORDER BY PatientID
    , VisitDate;

    OPEN Patient_cur;

    FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- first record for this patient
        IF @PatientID <> @CurrentID
        BEGIN
            SET @PatientID = @CurrentID;
            SET @BeginHeight = @CurrentHeight;
        END

        -- height differs from the previous visit: write an audit record
        IF @BeginHeight <> @CurrentHeight
        BEGIN
            INSERT #Changes ( PatientID
            , VisitDate
            , BeginHeight
            , CurrentHeight)
            VALUES
            (@PatientID
            , @VisitDate
            , @BeginHeight
            , @CurrentHeight);

            SET @BeginHeight = @CurrentHeight;
        END

        FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
    END

    CLOSE Patient_cur;
    DEALLOCATE Patient_cur;

    SELECT * FROM #Changes;

    DROP TABLE #Changes;

A Set-Based Approach

The difference between a procedural and set-based solution boils down to the way you define the problem. Stated in its simplest form, the change we're interested in involves only two records: two consecutive visits by the same patient. Everything else is irrelevant.

We start by ordering the data by the patient’s ID number and then by visit date. In that way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way, the solution isn't so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a common table expression (CTE) populated with patient data sorted by PatientID and VisitDate, adding a sequential ID using the ROW_NUMBER() function.

We can then join this CTE to itself, matching each row to the row that follows it:

    FROM CTE t1
    JOIN CTE t2 ON t2.ROWID = t1.ROWID + 1

This join produces a set of rows that represents every possible opportunity for the patient’s height to change: each joined row holds the data from two consecutive records in the sorted data set.
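As a small illustration of what the join above returns before any filtering, the sketch below runs it against a hypothetical four-row table variable. The names @Visits and Numbered and the sample values are assumptions made for this example only; the multi-row VALUES clause assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data: two patients with two visits each.
DECLARE @Visits TABLE
( PatientID INT      NOT NULL
, VisitDate DATETIME NOT NULL
, Height    INT );

INSERT @Visits (PatientID, VisitDate, Height) VALUES
  (1, '20240101', 70)
, (1, '20240108', 72)
, (2, '20240101', 65)
, (2, '20240108', 65);

WITH Numbered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID
    , PatientID
    , VisitDate
    , Height
    FROM @Visits
)
SELECT t1.ROWID     AS PrevRow
, t2.ROWID          AS NextRow
, t1.PatientID      AS PrevPatient
, t2.PatientID      AS NextPatient
, t1.Height         AS PrevHeight
, t2.Height         AS NextHeight
FROM Numbered t1
JOIN Numbered t2 ON t2.ROWID = t1.ROWID + 1
ORDER BY t1.ROWID;

-- Three pairs come back:
--   rows 1-2: same patient, 70 -> 72  (a real change)
--   rows 2-3: patient 1 paired with patient 2 (crosses a patient boundary)
--   rows 3-4: same patient, 65 -> 65  (no change)
```

Only the first pair is a record of interest. The second pair shows why the join alone is not enough: the last visit of one patient is adjacent to the first visit of the next, so the patient must be compared as well as the height.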

At this point, filtering out the records that don't represent a change is trivial. We simply review our statement of the problem: To qualify as a record of interest, the patient must be the same in consecutive visits but the two heights must be different. Listing 3 contains the code that implements this set-based method.

Listing 3: The set-based solution (USE AdventureWorks)

    WITH PV_RN AS
    (
        -- number every visit in patient/date order so consecutive rows can be joined
        SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, *
        FROM PatientVisit
    )
    SELECT t1.PatientID
    , t2.VisitDate AS DateChanged
    , t1.Height AS HeightChangedFrom
    , t2.Height AS HeightChangedTo
    FROM PV_RN t1
    JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
    WHERE t1.PatientID = t2.PatientID
        AND t1.Height <> t2.Height
    ORDER BY t1.PatientID, t2.VisitDate;
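For readers on SQL Server 2012 or later, the same comparison can also be sketched without a self-join by using the LAG() window function. This is an alternative formulation, not the method measured in this article, and the CTE name PV_Prev and column alias PrevHeight are made up for the example. It assumes the PatientVisit table from Listing 1.

```sql
WITH PV_Prev AS
(
    -- For each visit, fetch the height recorded at the same patient's previous visit.
    -- The first visit of each patient has no predecessor, so PrevHeight is NULL there.
    SELECT PatientID
    , VisitDate
    , Height
    , LAG(Height) OVER (PARTITION BY PatientID ORDER BY VisitDate) AS PrevHeight
    FROM PatientVisit
)
SELECT PatientID
, VisitDate  AS DateChanged
, PrevHeight AS HeightChangedFrom
, Height     AS HeightChangedTo
FROM PV_Prev
WHERE PrevHeight <> Height   -- a NULL PrevHeight never satisfies <>, so first visits drop out
ORDER BY PatientID, VisitDate;
```

Partitioning by PatientID takes the place of the t1.PatientID = t2.PatientID filter in Listing 3, because LAG() never looks across a patient boundary. How this version performs relative to Listing 3 was not tested here.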

Relative Performance of the Two Methods

In Listing 1, we created the PatientVisit table and populated it with 200,000 records containing the PatientID, the VisitDate, and the Height recorded for that visit. The table contains about 2,600 records that represent a change in height for a patient.

We used SQL Profiler to capture execution statistics for the two methods. First, we flushed the buffers to get cold execution statistics; then we reran each query to get hot execution statistics with the data already in cache. The cursor and the set-based code returned identical results. Table 1 shows the execution statistics for each. Notice the huge difference in logical reads: the cursor solution performs roughly 160 times as many as the set-based solution, a difference that can be a showstopper in many situations. CPU and Duration are roughly eight times as high in the cursor solution.

Method      Execution   Duration   Reads     CPU
Set-Based   Cold        503        1298      515
Cursor      Cold        4090       203646    3931
Set-Based   Hot         476        1248      484
Cursor      Hot         3958       203728    3713

Table 1: Execution Statistics
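If you don't have Profiler at hand, a rough way to see similar figures is to turn on the session statistics options in a query window. The following is a sketch of that alternative, not the procedure used to produce Table 1, and the numbers you get will differ with hardware and SQL Server version. It assumes the PatientVisit table from Listing 1; DBCC DROPCLEANBUFFERS requires sysadmin rights and should be run only on a test server.

```sql
-- Cold run only: flush dirty pages to disk, then empty the buffer cache.
-- Skip these two statements for a hot run.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- Report logical reads and CPU/elapsed time on the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Stand-in statement: replace it with the code from Listing 2 or Listing 3.
SELECT COUNT(*) AS VisitRows
FROM PatientVisit;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Because these options report on each statement separately, the cursor solution produces one block of output per FETCH, which is itself a hint at how much repeated work the row-by-row approach does.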

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you'll have a lot of rows to process—far too many for a cursor or other looping mechanism to handle efficiently.

Set-Based Thinking

Note that the more efficient solution operates on whole sets of data, not on the individual rows. Compare this with the cursor solution, in which operations are repeated for each row in a set.

Nothing in this simple example is rocket science. You'll encounter SQL problems that are much more difficult to solve in a set-based way and some that are impossible. However, even this example requires a significant mental adjustment for programmers new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in a new way. Even in the most difficult situations, don’t give up on a set-based solution until you've given it a fair amount of thought.