Solutions to Related Tables Challenge

Last week I provided a T-SQL Challenge involving writing a recursive query that returns tables related to an input table directly or indirectly through foreign key relationships. You can find the challenge details here. The tricky part wasn’t really to come up with any solution that works, rather to come up with a solution using a recursive query. With this constraint the challenge became much more difficult—especially coming up with a solution that performs reasonably.

I’d like to thank all those who sent solutions, including Peter Larsson (Peso), Geri Reshef and Rubén Garrigós. Special thanks go to Peso for his help in benchmarking solutions in his environment and in sending corrections to other solutions. I know that many others tried to solve the puzzle and would like to thank you for your efforts—it is a difficult challenge!

The sample data I provided last week is far too simple to be used to test the performance of the solutions. Peso tested some of the solutions in his data warehouse with 160 tables and this way we got more realistic measures. I created a new set of tables—smaller than Peso’s environment, yet more complex than last week’s—and this was sufficient to show the performance differences between the solutions. Here’s the new sample data:

USE testfk;

 

CREATE TABLE dbo.T2(c2 INT PRIMARY KEY);

CREATE TABLE dbo.T3(c3 INT PRIMARY KEY);

CREATE TABLE dbo.T4(c4 INT PRIMARY KEY);

CREATE TABLE dbo.T5(c5 INT PRIMARY KEY);

 

CREATE TABLE dbo.T1(c1 INT PRIMARY KEY,

                    c2 INT REFERENCES dbo.T2,

                    c3 INT REFERENCES dbo.T3,

                    c4 INT REFERENCES dbo.T4,

                    c5 INT REFERENCES dbo.T5);

 

CREATE TABLE dbo.T6(c6 INT PRIMARY KEY,

                    c2 INT REFERENCES dbo.T2,

                    c3 INT REFERENCES dbo.T3);

 

CREATE TABLE dbo.T7(c7 INT PRIMARY KEY,

                    c3 INT REFERENCES dbo.T3,

                    c4 INT REFERENCES dbo.T4);

 

CREATE TABLE dbo.T8(c8 INT PRIMARY KEY,

                    c4 INT REFERENCES dbo.T4,

                    c5 INT REFERENCES dbo.T5);

 

CREATE TABLE dbo.T9(c9 INT PRIMARY KEY,

                    c2 INT REFERENCES dbo.T2,

                    c5 INT REFERENCES dbo.T5);

 

CREATE TABLE dbo.T10(c10 INT PRIMARY KEY,

                     c6 INT REFERENCES dbo.T6,

                     c3 INT REFERENCES dbo.T3);

 

CREATE TABLE dbo.T11(c11 INT PRIMARY KEY,

                     c3 INT REFERENCES dbo.T3,

                     c7 INT REFERENCES dbo.T7);

 

CREATE TABLE dbo.T12(c12 INT PRIMARY KEY,

                     c7 INT REFERENCES dbo.T7,

                     c4 INT REFERENCES dbo.T4);

 

CREATE TABLE dbo.T13(c13 INT PRIMARY KEY,

                     c4 INT REFERENCES dbo.T4,

                     c8 INT REFERENCES dbo.T8);

 

CREATE TABLE dbo.T14(c14 INT PRIMARY KEY,

                     c5 INT REFERENCES dbo.T5,

                     c8 INT REFERENCES dbo.T8);

 

CREATE TABLE dbo.T15(c15 INT PRIMARY KEY,

                     c5 INT REFERENCES dbo.T5,

                     c9 INT REFERENCES dbo.T9);

 

CREATE TABLE dbo.T16(c16 INT PRIMARY KEY,

                     c2 INT REFERENCES dbo.T2,

                     c9 INT REFERENCES dbo.T9);

 

CREATE TABLE dbo.T17(c17 INT PRIMARY KEY,

                     c2 INT REFERENCES dbo.T2,

                     c6 INT REFERENCES dbo.T6);

 

And here’s a graphical depiction of the relationships: 


 

These tables are created in the testfk database used last week in addition to the other tables already there (tables A, B, C, D, E, F, G).

The performance measures I’ll provide are against this data model, using warm cache, not including parse and compile time. I used the following input for the performance test:

DECLARE @table AS NVARCHAR(261) = N'dbo.T1';

 

The expected output is (not necessarily in this order):

obj_schema_name  obj_name

---------------- ---------

dbo              T15

dbo              T16

dbo              T17

dbo              T2

dbo              T3

dbo              T4

dbo              T5

dbo              T1

dbo              T6

dbo              T7

dbo              T8

dbo              T9

dbo              T10

dbo              T11

dbo              T12

dbo              T13

dbo              T14

 

First, I’d like to emphasize that the constraint to come up with a solution based on a recursive query was just to increase the puzzle’s level of difficulty. In practical terms, it makes sense to address such need using a loop-based solution. Loop-based solutions in this case are far simpler and more efficient than recursive solutions. Here’s the one I showed last week (call it Itzik Loop) with a minor revision to return the input table when it has no related tables (remember to include in the code the declaration and initialization of the input @table provided earlier):

DECLARE @T AS TABLE ( id INT NOT NULL PRIMARY KEY );

 

INSERT INTO @T(id)

  SELECT OBJECT_ID(@table) AS id

  WHERE OBJECT_ID(@table) IS NOT NULL

 

  UNION

  

  SELECT referenced_object_id

  FROM sys.foreign_keys

  WHERE parent_object_id = OBJECT_ID(@table)

 

  UNION

 

  SELECT parent_object_id

  FROM sys.foreign_keys

  WHERE referenced_object_id = OBJECT_ID(@table);

 

WHILE @@ROWCOUNT > 0

 

  INSERT INTO @T(id)

 

    SELECT referenced_object_id AS id

    FROM sys.foreign_keys AS FK

      JOIN @T

        ON FK.parent_object_id = [@T].id

 

    UNION

 

    SELECT parent_object_id

    FROM sys.foreign_keys AS FK

      JOIN @T

        ON FK.referenced_object_id = [@T].id

   

    EXCEPT

   

    SELECT id FROM @T;

 

SELECT OBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name

FROM @T;

 

The statistics for this solution on my system with the new sample data are: Duration: 4 ms, Reads: 587.

Here’s another loop-based solution provided by Peso (call it Peso Loop):

DECLARE @Tables TABLE

     (

           TableID INT PRIMARY KEY

     )

 

INSERT     @Tables

     (

           TableID

     )

SELECT     [object_ID] AS TableID

FROM SYS.TABLES

WHERE [object_ID] = OBJECT_ID(@table)

 

WHILE @@ROWCOUNT > 0

     INSERT     @Tables

           (

                TableID

           )

     SELECT     DISTINCT TableID

     FROM (

                SELECT          fk.REFERENCED_OBJECT_ID AS TableID

                FROM       SYS.FOREIGN_KEYS AS fk

                INNER JOIN @Tables AS t ON t.TableID = fk.PARENT_OBJECT_ID

 

                UNION ALL

 

                SELECT          fk.PARENT_OBJECT_ID AS TableID

                FROM       SYS.FOREIGN_KEYS AS fk

                INNER JOIN @Tables AS t ON t.TableID = fk.REFERENCED_OBJECT_ID

           ) AS w

     WHERE NOT EXISTS(SELECT * FROM @Tables AS t WHERE t.TableID = w.TableID)

 

SELECT

  OBJECT_SCHEMA_NAME(TableID) AS obj_schema_name,

  OBJECT_NAME(TableID) AS obj_name

FROM @Tables;

 

The stats I got for this solution are: Duration: 4 ms, reads: 728. In short, both loop-based solutions are pretty efficient as you can see.

As for recursive solutions, I’ll start with the ones I came up with, then I’ll provide also the ones by Peso, Geri and Rubén.

My first attempt at a solution using a recursive query is the following (call it Itzik Recursive 1):

WITH C AS

(

  SELECT OBJECT_ID(@table) AS obj,

    '.' + CAST(OBJECT_ID(@table) AS VARCHAR(MAX)) + '.' AS objpath

  WHERE OBJECT_ID(@table) IS NOT NULL

 

  UNION ALL

 

  SELECT D.obj, D.objpath + CAST( D.obj AS VARCHAR(MAX) ) + '.'

  FROM ( SELECT

           CASE WHEN C.obj = F.parent_object_id

             THEN F.referenced_object_id

             ELSE F.parent_object_id

           END AS obj, objpath

         FROM C

           JOIN sys.foreign_keys AS F

             ON C.obj IN (F.parent_object_id, F.referenced_object_id) ) AS D

  WHERE objpath NOT LIKE '%.' + CAST(obj AS VARCHAR(10)) + '.%'

)

SELECT DISTINCT

  OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,

  OBJECT_NAME(obj) AS obj_name

FROM C;

 

The anchor member returns the input table. The recursive member returns a row for each counterpart of the tables that appeared in the previous round as either a referencing or a referenced table. For each edge, the code constructs a dot separated path with all of the object ids of the objects leading to the current. In each recursive round, the code filters out nodes that already appear in their parent node’s path.

This solution is short (in terms of amount of code) and elegant, but isn’t very fast. It creates all possible paths starting with the input node and made of unique nodes. As you can imagine, the number of paths can explode quickly to a large number with additions of new related tables.

The statistics I got for this solution are: Duration: 7,812, Reads: 2,949,749.

In order for a recursive solution to be really fast you need to avoid constructing multiple different paths and visit each node only once. The two main challenges are: 1. How to apply DISTINCT logic in the recursive member without using the DISTINCT clause (not allowed in recursive member). 2. Avoid building multiple different paths when multiple nodes are counterparts in a foreign key relationship with a single node from the previous round.

Both challenges have a solution. Challenge 1 can be handled by calculating a row number partitioned by the node and then filtering only rows with row number 1. Challenge 2 can be handled by using a FOR XML PATH query to concatenate multiple nodes into one. Here’s the complete solution code (call it Itzik Recursive 2):

WITH C AS

(

  SELECT

    CAST('.' + STR(OBJECT_ID(@table), 11) + '.' AS VARCHAR(MAX)) AS entirepath,

    CAST('~' + STR(OBJECT_ID(@table), 11) + '~' AS VARCHAR(MAX)) AS lastpath

  WHERE OBJECT_ID(@table) IS NOT NULL

   

  UNION ALL

 

  SELECT

    REPLACE(entirepath, '~', '.') + '.',

    SUBSTRING(entirepath, CHARINDEX('~', entirepath), LEN(entirepath)) + '~'

  FROM (

    SELECT CAST(entirepath AS VARCHAR(MAX)) AS entirepath

    FROM (

      SELECT element AS [text()]

      FROM (

        SELECT

          CASE n WHEN 1 THEN entirepath ELSE obj END AS element,

          ROW_NUMBER() OVER(PARTITION BY CASE n WHEN 1 THEN entirepath ELSE obj END

                            ORDER BY (SELECT NULL)) AS rownum

        FROM (

          SELECT

            LEFT(entirepath, LEN(entirepath) - 1) AS entirepath,

            '~' + STR(obj, 11) AS obj

          FROM (

            SELECT

              C.entirepath,

              CAST(

                CASE WHEN C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'

                  THEN F.referenced_object_id

                  ELSE F.parent_object_id

                END AS VARCHAR(10)) AS obj

            FROM C

              JOIN sys.foreign_keys AS F

                ON C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'

                OR C.lastpath LIKE '%~' + STR(F.referenced_object_id, 11) + '~%'

               ) AS D1

          WHERE entirepath NOT LIKE '%.' + STR(obj, 11) + '.%'

             ) AS D2

          CROSS JOIN (SELECT 1 UNION ALL SELECT 2) AS Nums(n)

           ) AS D3

      WHERE rownum = 1

      ORDER BY element

      FOR XML PATH('')

         ) AS D4(entirepath)

      WHERE entirepath IS NOT NULL

       ) AS D5

),

Split AS

(

  SELECT

    CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,

    STUFF(lastpath, 1, 12, '') AS lastpath

  FROM C

 

  UNION ALL

 

  SELECT

    CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,

    STUFF(lastpath, 1, 12, '') AS lastpath

  FROM Split

  WHERE lastpath '~'

)

SELECT

  OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,

  OBJECT_NAME(obj) AS obj_name

FROM Split;

 

Observe in the CTE C that only one path is built in each iteration of the recursive query, concatenating the path built by the previous round and the qualifying items from the current round (concatenation done with FOR XML PATH). The code uses a trick to distinguish between the part of the path representing nodes accessed prior to current round and nodes added in this round (could be more than one). The trick is to use different separators for the two sections in the layer of the code that concatenates all parts—old (. as separator) and new (~ as separator). The reason to include both sections in the same string and not as different strings is that a FOR XML PATH query can return only one string. Then a layer of code on top can extract the last part of the path (section starting with ~) giving it the attribute name lastpart, and replace all ocurrences of the character ~ with . to return the entire path, giving it the attribute name entirepath.

Finally, another recursive query called Split extracts the node ids from the lastpath attribute calculated in each round. First round (anchor member) handles first node in each lastpath, second round (already recursive query) handles second node in each last path, and so on, until no more nodes exist.

Since this solution visits each node only once it is very fast. The performance stats I got for it are: Duration: 17, Reads: 2,129.

Following are solutions using recursive queries by Peso, Geri and Ruben and their performance statistics.

Here’s Peso’s solution (Call it Peso Recursive):

WITH cteTables (ChildTableName, ParentTableName)

AS (

     SELECT          CASE x.Permutation

                     WHEN 0 THEN c.TABLE_NAME

                     ELSE p.TABLE_NAME

                END AS ChildTableName,

                CASE x.Permutation

                     WHEN 0 THEN p.TABLE_NAME

                     ELSE c.TABLE_NAME

                END AS ParentTableName

     FROM       (

                     SELECT     CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,

                           UNIQUE_CONSTRAINT_SCHEMA + '.' + UNIQUE_CONSTRAINT_NAME AS UNIQUE_CONSTRAINT_NAME

                     FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS

                ) AS rc

     INNER JOIN (

                     SELECT     CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,

                           CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TABLE_NAME

                     FROM INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE

                ) AS c ON c.CONSTRAINT_NAME = rc.CONSTRAINT_NAME

     INNER JOIN (

                     SELECT     CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,

                           CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TABLE_NAME

                     FROM INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE

                ) AS p ON p.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME

     CROSS JOIN (

                     SELECT     0 UNION ALL

                     SELECT     1

                ) AS x(Permutation)

     WHERE      c.TABLE_NAME p.TABLE_NAME

), cteHierarchy(TableName, TablePath)

AS (

     SELECT     TableName,

           CHAR(2) + TableName + CHAR(2) AS TablePath

     FROM (

                SELECT     CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TableName

                FROM INFORMATION_SCHEMA.TABLES

           ) AS d

     WHERE TableName = @table

 

     UNION ALL

 

     SELECT          t.ParentTableName AS TableName,

                h.TablePath + t.ParentTableName + CHAR(2) AS TablePath

     FROM       cteHierarchy AS h

     INNER JOIN cteTables AS t ON t.ChildTableName = h.TableName

     WHERE      h.TablePath NOT LIKE '%' + CHAR(2) + t.ParentTableName + CHAR(2) + '%'

)

SELECT          TableName

FROM       cteHierarchy

ORDER BY   TablePath;

 

Statistics: Duration: 258,447, Reads: 19,714,744.

Here’s Geri’s solution (call it Geri Recursive):

With T1 As

(Select    Distinct Schema_Name(P.schema_id)+'.'+P.name P,

           Schema_Name(R.schema_id)+'.'+R.name R

From sys.foreign_key_columns F

Inner Join sys.objects P

          On F.parent_object_id=P.object_id

Inner Join sys.objects R

           On F.referenced_object_id=R.object_id),

T2 As

(Select    ROW_NUMBER() Over(Order By Case When @table =P Then 1 When @table =R Then 2 Else 3 End,P,R) Mispar,

           *

From T1),

T3 As

(Select Cast(0 As Int)  Mispar,

           P,

           R,

           P PR,

           (-1) Idcun,

           Cast(','+P+','+R+',' As Varchar(Max)) S_in

From T2

Where Mispar=1

           And @table In (P,R)

Union

Select     Cast(1 As Int) Mispar,

           P,

           R,

           R PR,

           (-1) Idcun,

           Cast(','+P+','+R+',' As Varchar(Max)) S_in

From T2

Where Mispar=1

           And @table In (P,R)

          

Union All

Select     Cast(Mispar As Int),

           P,

           R,

           Case When Idcun=1 Then R When Idcun=2 Then P Else '' End PR,

           Case When Idcun=2 Then 1 Else Idcun End Idcun,

           Cast(S_in+Case When Idcun=1 Then R+',' When Idcun=2 Then P+',' Else '' End As Varchar(Max)) S_in

From (Select    T2.Mispar,

                     T2.P,

                     T2.R,

                     Case When T3.S_in Like '%,'+T2.P+',%' And T3.S_in Not Like '%,'+T2.R+',%' Then 1

                           When T3.S_in Not Like '%,'+T2.P+',%' And T3.S_in Like '%,'+T2.R+',%' Then 2

                          Else 0 End Idcun,

                     T3.S_in

           From    T3

           Inner Join T2

                     On T3.Mispar+1=T2.Mispar

                     And T3.Mispar>0

           Where    T3.Idcun0) T

 

Union All

Select     Cast(Mispar As Int),

           P,

           R,

           Case When Idcun=1 Then R When Idcun=2 Then P Else '' End PR,

           Case When Idcun=2 Then 1 Else Idcun End Idcun,

           Cast(S_in+Case When Idcun=1 Then R+',' When Idcun=2 Then P+',' Else '' End As Varchar(Max)) S_in

From (Select    T2.Mispar,

                     T2.P,

                     T2.R,

                     Case When T3.S_in Like '%,'+T2.P+',%' And T3.S_in Not Like '%,'+T2.R+',%' Then 1

                           When T3.S_in Not Like '%,'+T2.P+',%' And T3.S_in Like '%,'+T2.R+',%' Then 2

                           Else 0 End Idcun,

                     T3.S_in

           From    T3

           Inner Join T2

                     On T3.Idcun=1

                     And T2.Mispar=1) T)

Select     Distinct PR

From T3

Where PR''

OPTION(MAXRECURSION 0);

 

Statistics: Duration: 680, Reads: 156,951. Quite fast as you can see.

And here’s Ruben’s solution (call it Ruben Recursive):

WITH referenced(id,level)

AS

(

        SELECT referenced_object_id AS id, (select count(*) from sys.foreign_keys) level

        FROM sys.foreign_keys FK

        WHERE referenced_object_id = OBJECT_ID(@table) or parent_object_id = OBJECT_ID(@table)       

             

        UNION ALL

       

        SELECT

            case

                  when (parent_object_id = id) then referenced_object_id

                  else parent_object_id

            end

            AS id,

            referenced.level-1 level

        FROM sys.foreign_keys FK

        JOIN referenced ON (referenced_object_id = id or parent_object_id = id)

            AND (referenced_object_idparent_object_id)

            AND referenced.level >= 1

) 

SELECT DISTINCT OBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name

FROM referenced;

 

Statistics: I stopped it after it ran for 30 minutes.

 

Cheers,

BG

 

Discuss this Blog Entry 3

on Jun 1, 2010
I'll test them all on the data warehouse (with 160+ tables) later this morning.
Earlier I only measured the duration for my and Itzik's solutions.

In short, the recursive CTE's (except Itzik 2) were killed after 12 hours!
Itzik 2 ran for 400ms which is really good, and Peso loop ran for 40 ms.
I posted another loop based solution based on composable dml which ran for 1200ms, and there is no point displaying that solution when my other loop based was so much faster.

Stay tuned!








on Jun 2, 2010
Hre we go. For the published solutions above, only "Itzik Loop", "Peso Loop" and "Itzik Recursive 2" passed before the 30 minute mark on our Data Warehouse, which has 160 tables. I used the "anchor" table "dimMember" to start with as this has most connections through the database. The solutions passed the 30 minute mark all returns 144 tables.

Here are the timings

-- Itzik Loop
-- CPU 31, Duration 41ms, Reads 10293

-- Peso Loop
-- CPU 47, Duration 59ms, Reads 12490

-- Itzik Recursive 2
-- CPU 437, Duration 436, Reads 98394












on Jun 2, 2010
Thanks for this, Peso!

Please or Register to post comments.

What's Puzzled By T-SQL Blog?

T-SQL tips and logical puzzles from Itzik Ben-Gan.

Blog Archive

Sponsored Introduction Continue on to (or wait seconds) ×