Programming SQL in a Set-Based Way

Pull yourself out of your comfort zone and think in a new way

SurvanceListings.zip

As T-SQL programmers, we always hear that the SQL language is optimized for set-based solutions rather than procedural solutions, but we seldom see examples from that perspective. Consequently, many beginning SQL programmers don’t have a clear understanding of what set-based means in terms of the code they need to write to solve a specific problem.

Even for those who understand the concept, there are many programming problems for which a set-based solution seems impossible. Sometimes that's true. It's not always possible to find a set-based solution, but most of the time we can find one by using a little creative thinking. A good SQL programmer must develop the mental discipline to explore set-based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article, I provide a relatively simple example that illustrates how to think in a set-based way about a common type of problem that also has an intuitive procedural solution.

When you visit the doctor’s office, the first thing the nurse does is put you on a scale, record your weight, and check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you're very young, your height hasn’t changed since your last visit and isn't likely to change again.

The reason the nurse checks your height is to guard against identity theft. Health care providers want to make sure that the services they provide are going to the person who gets the bill—not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides a good example for comparing procedural thinking and set-based thinking when programming in SQL.

The Problem Statement

The generic programming problem is that the solution depends on the order of rows and requires the comparison of current row values with values in previous rows. This is a type of problem in which the procedural solution is intuitive, but the set-based solution isn't so obvious.

In this particular problem, we're looking for rows where a previous visit for the same patient has a height value that's different from the height on the current record. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from, and what the height was changed to. We don't want to return any records that don't mark a change in height.

Listing 1 gives you the code to create and populate the tables in this example, if you'd like to run the example yourself.

`Listing 1: Creating and populating the tables<br><br>/*<br>We use the AdventureWorks sample database to create tables for<br>our test but you may use another database by changing <br>the USE statements in all 3 listings.<br>*/<br><br>USE AdventureWorks;<br><br>SET NOCOUNT ON;<br><br>CREATE TABLE Dates<br>(ID int, VisitDate datetime);<br><br>--populate table with 20 visit dates<br>DECLARE @i int, @startdate datetime;<br>SET @i = 1;<br>SET @startdate = GETDATE();<br><br>WHILE @i <= 20<br>BEGIN<br>    INSERT Dates<br>    (ID, VisitDate)<br>    VALUES (@i, @startdate);<br>    <br>    SET @startdate = DATEADD(dd,7, @startdate);<br>    SET @i = @i+1;<br>END<br><br>CREATE TABLE PatientHeight<br>(PatientID  int not null<br>,Height int);<br><br>-- populate table with 1000 patientids with heights between 59 and 74 inches<br>SET @i = 1;<br><br>WHILE @i <= 10000<br>BEGIN<br>    INSERT PatientHeight<br>    (PatientID, Height)<br>    VALUES (@i, @i % 16 + 59);<br>    <br>    SET @i = @i+1;<br>END<br><br>ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight<br>    PRIMARY KEY(PatientID);<br><br>-- cartesian join produces 200,000 PatientVisit records<br><br>SELECT <br>    ISNULL(PatientID, -1) AS PatientID, <br>    ISNULL(VisitDate, '19000101') AS VisitDate,<br>    Height<br>INTO PatientVisit<br>FROM PatientHeight<br>CROSS JOIN Dates;<br><br>ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit<br>    PRIMARY KEY(PatientID, VisitDate);<br><br>-- create changes of height<br>SET @i = 3;<br><br>WHILE @i < 10000<br>BEGIN<br>    UPDATE pv<br>    SET Height = Height +2<br>    FROM PatientVisit pv<br>    WHERE PatientID = @i<br>    AND pv.VisitDate = <br>    (SELECT TOP 1 VisitDate <br>    FROM Dates <br>    where id = ABS(CHECKSUM(@i)) % 19);<br>    <br>SET @i = @i + 7;<br>END<br><br>/*<br>-- return AdventureWorks to its previous state when you are finished<br>-- with this example.<br><br>DROP TABLE Dates;<br>DROP TABLE PatientHeight;<br>DROP TABLE PatientVisit;<br>*/`

A Procedural Approach

The intuitive, procedural way to attack this problem is to order the records by patient and visit date, then loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then, we loop through subsequent records for the patient, comparing height values. If we find that the height is different on a subsequent record, we write an audit record, update the height variable with the current value, and continue looping through the rows. Then we move to the next patient.

Listing 2 contains the code for the cursor-based solution. The cursor method works, but it's very inefficient. It could pose a serious performance problem when working with a large number of rows. How can we do this in a set-based and presumably more efficient way?

`Listing 2: The cursor-based solution<br>USE AdventureWorks;<br><br>CREATE TABLE #Changes<br>( PatientID int<br>, VisitDate    datetime<br>, BeginHeight smallint<br>, CurrentHeight    smallint);<br><br>DECLARE @PatientID        int<br>,        @CurrentID        int<br>,        @BeginHeight    smallint<br>,        @CurrentHeight    smallint<br>,        @VisitDate        datetime;<br><br>SET @PatientID = 0;<br><br>DECLARE Patient_cur CURSOR FAST_FORWARD FOR<br>SELECT PatientID<br>, VisitDate<br>, Height<br>FROM PatientVisit<br>ORDER BY PatientID<br>,VisitDate;<br><br>OPEN Patient_cur;<br><br>FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;<br><br>WHILE @@FETCH_STATUS = 0<br>BEGIN<br>-- first record for this patient<br>IF @PatientID <> @CurrentID<br>BEGIN<br>    SET @PatientID = @CurrentID;<br>    SET @BeginHeight = @CurrentHeight;<br>END <br><br>IF @BeginHeight <> @CurrentHeight<br>BEGIN<br>INSERT #Changes ( PatientID<br>, VisitDate<br>, BeginHeight<br>, CurrentHeight)<br>VALUES<br>(@PatientID<br>, @VisitDate<br>, @BeginHeight<br>, @CurrentHeight);<br><br>SET @BeginHeight = @CurrentHeight;<br><br>END<br><br>FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;<br><br>END<br><br>CLOSE Patient_cur;<br>DEALLOCATE Patient_cur; <br><br>SELECT * FROM #Changes<br><br>DROP TABLE #Changes`

A  Set-Based Approach

The difference between a procedural and set-based solution boils down to the way you define the problem. Stated in its simplest form, the change we're interested in involves only two records: two consecutive visits by the same patient. Everything else is irrelevant.

We start by ordering the data by the patient’s ID number and then by visit date. In that way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way, the solution isn't so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a common table expression (CTE) populated with patient data sorted by PatientID and VisitDate, adding  a sequential ID using the ROW_NUMBER() function.

We can self-join this temporary table like this:

`… from CTE t1<br>join CTE t2 on t2.ROWID = t1.ROWID + 1…`

This will produce a set of records that represents every possible opportunity for the value of the patient’s height to change—that is, a set of records such that each contains the data from each set of two consecutive records in the original data set.

At this point, filtering out the records that don't represent a change is trivial. We simply review our statement of the problem: To qualify as a record of interest, the patient must be the same in consecutive visits but the two heights must be different. Listing 3 contains the code that implements this set-based method.

`Listing 3: The set-based solution<br>USE AdventureWorks;<br><br>WITH PV_RN AS<br>(<br>    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, * <br>    FROM PatientVisit<br>)<br>select t1.PatientID<br>,t2.VisitDate as  DateChanged<br>,t1.Height as HeightChangedFrom<br>,t2.Height as HeightChangedTo<br>from PV_RN t1 <br>join PV_RN t2 on t2.ROWID = t1.ROWID + 1<br>    where t1.patientid = t2.patientid<br>        and t1.Height <> t2.Height<br>order by t1.PatientID, t2.VisitDate;`

Relative Performance of the Two Methods

In Listing 1, we created the PatientVisit table and populate it with 200,000 records containing the PatientID, VisitDate, and the Height recorded for that visit.  The table contains about 2,600 records that represent a change in height for a patient.

We used SQL Profiler to capture execution statistics of the two methods.  First, we flushed the buffers to get the cold execution statistics, then we re-ran the query to get hot execution statistics after the data was in cache.  Both the cursor and the set-based code returned identical results. Table 1 shows the execution statistics for each. Notice the huge difference in logical reads.  This 160:1 difference can be a show stopper in many situations.  CPU and Duration are roughly eight times as high in the cursor solution.

 Method Execution Duration Reads CPU Set-Based Cold 503 1298 515 Cursor Cold 4090 203646 3931 Set Based Hot 476 1248 484 Cursor Hot 3958 203728 3713

Table 1: Execution Statistics

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you'll have a lot of rows to process—far too many for a cursor or other looping mechanism to handle efficiently.

Set-Based Thinking

Note that the more efficient solution operates on whole sets of data, not on the individual rows. Compare this with the cursor solution, in which operations are repeated for each row in a set.

Nothing in this simple example is rocket science. You'll encounter SQL problems that are much more difficult to solve in a set-based way and some that are impossible. However, even this example requires a significant mental adjustment for programmers new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in a new way. Even in the most difficult situations, don’t give up on a set-based solution until you've given it a fair amount of thought.

ksurvance
on Oct 5, 2010
AMY,

that solution gets the same results but relies on data that you won't have in the real world, ie a correct height for each patient in the patientheight table. that table was an itermediate step in creating the sample data but does not figure in the solution.
ksurvance
on Aug 30, 2010
the 19 in ABS(CHECKSUM(@i)) % 19 as well as the 7 in SET @i = @i + 7; are prime numbers that were used to make the selection of visitdates and patients psuedo random. It didn't work out very random but it was enough for our test purposes.
techtwk
on Aug 19, 2010
I don't see why 'ABS(CHECKSUM(@i)) % 19' was used in the UPDATE statement in Listing 1. Should it have been '1 + ABS(CHECKSUM(@i)) % 19' so that height changes would include all 20 dates and 19 possible date differences?
AmyFGehring
on Sep 30, 2010
After populating the tables, I tried to write my own solution b4 looking at yours. I came up with:

SELECT ph.PatientID, ph.Height as BeginningHeight, VisitDate, pv.Height as ChangedHeight
FROM PatientHeight ph INNER JOIN PatientVisit pv
ON ph.PatientID=pv.PatientID
WHERE ph.Height <> pv.Height
Order by ph.PatientID

This actually seems to work better than the CTE solution. It would never have occurred to me to try a cursor solution.

IT/Dev Connections

Las Vegas
September 30th - October 4th

Our Experts will show you:
• Common SQL Server
Problems
• Best Practices for T-SQL
• SQL Server Integration
Services
• Database Development

Come See Michael Otey & Tim Ford in Person!

From the Blogs
May 21, 2013
blog

Out of the box, SQL Server is (and has been) able to take advantage of multiple processors/cores without any effort on behalf of administrators....More
May 9, 2013
blog

My ISO 8601-Compliant Signature2

My family recently just "officially" announced that we're in the process of adopting a child from South Africa. We're quite excited, of course, but there's a ton of paperwork to do—along with the need for gobs of signatures....More
May 8, 2013
blog

Use SSIS for ETL from Hadoop

In this blog post, Mark Kromer walks you through using SSIS as a way to use ETL techniques using Microsoft's Hadoop on Windows (HDInsight) as a source using Hive connectors...More
SQL Server Pro Forums

Get answers to questions, share tips, and engage with the SQL Server community in our Forums.