TSQL Proximity Puzzle

I got this puzzle from my good friend Herbert Albert. Herbert addressed a more detailed version of the problem with a customer of his; here I’ll present a simplified form of the problem.

Consider the following tables and sample data:

set nocount on;

use tempdb;

 

if object_id('dbo.t1', 'u') is not null drop table dbo.t1;

if object_id('dbo.t2', 'u') is not null drop table dbo.t2;

go

 

create table dbo.t1

(

  col1 int        not null primary key,

  col2 varchar(1) not null

);

 

create table dbo.t2

(

  col1 int not null primary key,

  col2 varchar(1) not null

);

 

-- small sample data to check validity of solution

insert into dbo.t1(col1, col2) values

  (10, 'a'),(20, 'b'),(30, 'c'),(40, 'd'),(50, 'e');

insert into dbo.t2(col1, col2) values

  (9, 'z'),(12, 'y'),(20, 'w'),(35, 'v'),(47, 'u'),(51, 't'),(1759, 's'); 

go

 

The challenge is to write a query that matches to each row from t1 the row in t2 with the “closest” value in t2.col1 to the one in t1.col1. In “closest” I mean the smallest absolute difference. In case of ties, simply the smallest t2.col1 value wins. For the given sample data, the desired result is:

t1_col1     t1_col2 t2_col1     t2_col2

----------- ------- ----------- -------

10          a       9           z

20          b       20          w

30          c       35          v

40          d       35          v

50          e       51          t

 

Use the given sample data and desired result to verify the validity of the solution. But to check the performance of the solutions, I will use the following sample data:

-- Virtual Auxiliary Table of Numbers

IF OBJECT_ID('dbo.GetNums') IS NOT NULL DROP FUNCTION dbo.GetNums;

GO

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE

AS

RETURN

  WITH

  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),

  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)

  SELECT TOP (@n) n FROM Nums ORDER BY n;

GO

 

-- Populate Tables

truncate table dbo.t1;

truncate table dbo.t2;

 

insert into dbo.t1 with (tablock) (col1, col2)

  select

    n * 10000 + abs(checksum(newid())) % 10000,

    char(ascii('a') + abs(checksum(newid())) % (ascii('z') - ascii('a') + 1))

  from dbo.getnums(10000);

 

insert into dbo.t2 with (tablock) (col1, col2)

  select

    n * 100 + abs(checksum(newid())) % 100,

    char(ascii('a') + abs(checksum(newid())) % (ascii('z') - ascii('a') + 1))

  from dbo.getnums(1000000);

I’ll cover the solutions in a post next week.

Good luck!

BG

 

Discuss this Blog Entry 14

on Jul 6, 2010
Thanks for this Steve! Just posted an entry covering the solutions: http://www.sqlmag.com/blogs/puzzled-by-t-sql/tabid/1023/entryid/12950/Solutions-to-Proximity-Puzzle.aspx. Yours flies with a large t1 (1,000,000 rows) and small t2 (10,000 rows). It ran for 1 second on my machine (all others 4 seconds or more).
on Jul 6, 2010
Itzik,

Here are some statistics comparing my solution, Marc's, and Brad's on a relatively slow machine.

One case uses your sample data table sizes, and Marc's solution is the best of the three by far; in the second case I reversed the table sizes, and my solution is much faster than the others.

t2 bigger than t1 (Itzik's sample data)

t1: 10000 rows
t2: 1000000 rows


Marc ("top 10" removed):
Table 't2'. Scan count 20000, logical reads 60020
Table 't1'. Scan count 1, logical reads 25
CPU time = 141 ms

Brad:
Table 't1'. Scan count 6, logical reads 72
Table 't2'. Scan count 10003, logical reads 44556
CPU time = 4641 ms

Steve:
Table 't1'. Scan count 1000000, logical reads 3002570
Table 't2'. Scan count 2000005, logical reads 6379688
CPU time = 36843 ms


--------------------------

t1 bigger than t2

t1: 1000000 rows
t2: 10000 rows

Steve:
Table 't1'. Scan count 10000, logical reads 42277
Table 't2'. Scan count 20005, logical reads 42732
SQL Server Execution Times:

CPU time = 1796 ms

Marc (with "top 10" removed):
Table 't2'. Scan count 2000000, logical reads 4001956
Table 't1'. Scan count 5, logical reads 2263
CPU time = 16047 ms

Brad:
Table 't2'. Scan count 1000003, logical reads 3006389
Table 't1'. Scan count 6, logical reads 2266
CPU time = 47593 ms























































on Jul 6, 2010
Geri, your solution generates errors. Can you please fix and re-send?
on Jun 29, 2010
Here's what I came up with:

with Step1 as
(
select
t1.col1 as t1_col1
,t2.col1 as t2_col1
,ROW_NUMBER() over(partition by t1.col1 order by t2.col1 desc) as row
from t1
join t2 on t2.col1 <= t1.col1
union
select
t1.col1 as t1_col1
,t2.col1 as t2_col1
,ROW_NUMBER() over(partition by t1.col1 order by t2.col1 asc) as row
from t1
join t2 on t2.col1 > t1.col1
),
Step2 as
(
select
t1_col1
,t2_col1
,ROW_NUMBER() over (partition by t1_col1 order by abs(t1_col1 - t2_col1) asc) as row
from
Step1
where
row = 1
),
Step3 as
(
select
t1_col1
,t2_col1
from
Step2
where
row = 1
)
select
t1.*
,t2.*
from
Step3 as j
join t1 on t1.col1 = j.t1_col1
join t2 on t2.col1 = j.t2_col1














































on Jun 29, 2010
Just tested my above solution against the larger data set. Slow....
on Jul 6, 2010
Here's another solution sent to me privately by Brad Schulz, with inline comments describing the logic:

-- Brad
/*
CPU time = 6771 ms, elapsed time = 2254 ms.

Table 't1'. logical reads 72
Table 't2'. logical reads 44569

Sample Data 2:
elapsed: 11 seconds

Table 't1'. logical reads 2266
Table 't2'. logical reads 3006623
*/
-- Obtain the data from the very last row in T2
with FinalT2Row as
(
select top 1 FinalT2Col1=col1
,FinalT2Col2=col2
from t2
order by col1 desc
)
-- Obtain the minimum Col1 Value in T1
,CalcMinT1Col1 as
(
select MinT1Col1=min(col1)
from t1
)
-- Assign Row Numbers to T2 Table in Col1 Order
,T2WithRowNums as
(
select col1
,col2
,rnum=row_number() over (order by col1)
from t2
)
-- Calculate the maximum range of Col1 between
-- adjacent row numbers. And divide it by 2 to get
-- a +/- figure to use in a BETWEEN clause later to
-- look for the values covering that range.
-- Add 1 because of integer division.
-- NOTE:
-- If we're looking at the first row of T2, then
-- calculate the range as the difference between T2.Col1
-- and the minimum Col1 value in T1. This is because
-- we may have a very low min(T1.Col1) compared to
-- a high first T2.Col1.
-- We will handle the very last row of T2 in the
-- final SELECT via an OUTER APPLY and ISNULL logic.
,CalcHalfMaxRange as
(
select HalfMaxRange=max(case
when Cur.rnum=1 --If we're processing the very first row of T2:
then Cur.col1-MinT1Col1 -- Set to Diff between Current T2.Col1 and Lowest T1 Col1 Value
else Nxt.col1-Cur.col1 --Otherwise set to Diff between Next and Current T2.Col1
end
)
/2 + 1
from CalcMinT1Col1
cross join T2WithRowNums Cur
left join T2WithRowNums Nxt on Cur.rnum+1=Nxt.rnum
)
-- Now take that HalfMaxRange value and the FinalT2Row data
-- that we calculated/obtained and connect them to the
-- T1 Table via CROSS JOIN.
-- And for each T1 Row, find (via the OUTER APPLY) the T2 Row whose Col1 is
-- "closest" to T1's Col1, but take advantage of the HalfMaxRange value to only
-- search a tight range in T2.
-- In the case of a "tie", use the T2 Row with the lowest T2.Col1 value.
-- But if we don't find anything via the OUTER APPLY (i.e. it returns NULL values),
-- then that means that we couldn't find anything within the range.
-- That will only happen when we have a T1.Col1 that is much higher than the
-- highest T2.Col1 value.
-- If that's the case, then just plop in the data from the FinalT2Row CTE.
select t1_col1=t1.col1
,t1_col2=t1.col2
,t2_col1=isnull(f.col1,FinalT2Col1)
,t2_col2=isnull(f.col2,FinalT2Col2)
from CalcHalfMaxRange
cross join FinalT2Row
cross join t1
outer apply (select top 1 col1,col2
from t2
where col1 between t1.col1-HalfMaxRange and t1.col1+HalfMaxRange
order by abs(t1.col1-t2.col1),t2.col1) f
;























































































on Jul 1, 2010
Hi Itzik (and Marc!),

For your test data, t2 has many more rows than t1, and Marc's solution is excellent.

However, if t2 is the smaller table, Marc's solution is not efficient, because it seeks in t2 once for each row of t1. Here is a better solution for that case.

declare
@small decimal(18,1) = -2147483648.1,
@large decimal(18,1) = 2147483647.1;
with Tbounds as (
select
t2.col1, t2.col2,
0.5*t2.col1+0.5*isnull(
(select top 1 col1 from t2 as B
where B.col1 < t2.col1 order by B.col1 desc),@small) as below,
0.5*t2.col1+0.5*isnull(
(select top 1 col1 from t2 as B
where B.col1 > t2.col1 order by B.col1 asc),@large) as above
from t2
), Tmatched as (
select
t1.col1, t1.col2,
Tbounds.col1 as t2c1, Tbounds.col2 as t2c2
from t1 join Tbounds
on t1.col1 > below and t1.col1 <= above
)
select * from Tmatched;

Marc's solution and mine both contain nested loop joins, but in Marc's, t1 is on top, and in mine, a table derived from t2 is on top.

My strategy is as follows: before looking at t1, scan t2 and compute the smallest and largest [col1] values for which that row would be the right match. A row in t2 will match anything up to half way down and up to the next col1 values in t2. This range goes from t2.col1+next smallest t2.col1)/2 to col1+next largest col1)/2, with strict inequality for the lower bound.

































on Jun 30, 2010
It will be very slow with large database. Cross Apply method seems much faster...
on Jun 30, 2010
Hi Itzik,

Please see following query for desired output:

WITH [T3] AS
(
SELECT
T1Col1 = T1.col1,
T1Col2 = T1.col2,
T2Col1 = T2.col1,
T2Col2 = T2.col2,
Row_No = ROW_NUMBER() OVER(PARTITION BY T1.Col1 ORDER BY ABS(T1.COL1 - T2.col1 ))
FROM t1 CROSS JOIN t2
)
SELECT
[T3].T1Col1,
[T3].T1Col2,
[T3].T2Col1,
[T3].T2Col2
FROM [T3]
WHERE Row_No = 1





















on Jul 2, 2010
This works well - I think ?

create table #temp
(
t1col1 int,
t1col2 nvarchar(1),
t2col1 int,
t2col2 nvarchar(1),
diff int
)

insert into #temp
select
t1.col1 as t1col1,
t1.col2 as t1col2,
t2.col1 as t2col1,
t2.col2 as t2col2,
abs(t1.col1-t2.col1) as diff
from table1 t1, table2 t2

select t1col1, min(diff) from #temp
group by t1col1






















on Jun 29, 2010
Hi Itzik,
this is my solution.
With a "create index i on dbo.t2(col1)", obviously, work really good.

select top 10 * from dbo.t1
outer apply(
select top 1 * from(
select top 1 *, d=t1.col1-t2.col1 from dbo.t2 where t2.col1<=t1.col1 order by t2.col1 desc
union
select top 1 *, d=t2.col1-t1.col1 from dbo.t2 where t2.col1>t1.col1 order by t2.col1
) u
order by u.d, u.col1
) d

marc.













on Jun 30, 2010
My solution is based on recursive CTE: the algorithm is like the Merge Join one: two sorted sets from which the system finds matches and advanced one step in one set each time.

If someone want to follow the process- comment the "Where Flag=1".

Unfortunately this solution - as usual with recursive CTE - is very unefficient, and Marc's solution is surely the best!

With T_1 As
(Select Row_Number() Over(Order By col1) N,
*
From t1),
T_2 As
(Select Row_Number() Over(Order By col1) N,
*
From t2),
T As
(Select 0 Flag,
Abs(T_1.col1-T_2.col1) Intrvl,
T_1.N N_1,
T_1.col1 Col1_1,
T_1.col2 col2_1,
T_2.N N_2,
T_2.col1 Col1_2,
T_2.col2 col2_2
From T_1,
T_2
Where T_1.N=1
And T_2.N=1
Union All
Select Case When T.Flag=0 And T.Intrvl Else 0
End Flag,
Case When T.Flag=0 And T.Intrvl Else Abs(T_1.col1-T_2.col1)
End Intrvl,
T_1.N N_1,
Case When T.Flag=0 Then T.Col1_1
Else T_1.col1
End Col1_1,
Case When T.Flag=0 Then T.Col2_1
Else T_1.col2
End Col2_1,
Case When T.Flag=0 And T.Intrvl Else T_2.N
End N_2,
Case When T.Flag=0 And T.Intrvl Else T_2.col1
End Col2_1,
Case When T.Flag=0 And T.Intrvl Else T_2.col2
End Col2_2
From T
Inner Join T_1
On T_1.N=Case When T.Flag=0 Then T.N_1 Else T.N_1+1 End
Inner Join T_2
On T_2.N=Case When T.Flag=0 Then T.N_2+1 Else T.N_2 End)
Select Col1_1,
Col2_1,
Col1_2,
Col2_2
From T
Where Flag=1
option (MaxRecursion 0);























































on Jul 4, 2010
Just another solution:

WITH CTE AS
(
SELECT T1.col1 AS t1_col1, T1.col2 AS t1_col2, ISNULL(T3.t2_col1, T2.t2_col1) AS t2_col1
FROM dbo.t1 AS T1
OUTER APPLY
(SELECT MAX(T2.col1) AS t2_col1
FROM dbo.t2 AS T2
WHERE T2.col1 <= T1.col1) AS T2
OUTER APPLY
(SELECT MIN(T3.col1) AS t2_col1
FROM dbo.t2 AS T3
WHERE T3.col1 > T1.col1
AND T2.t2_col1 < T1.col1 - 1
AND T3.col1 < ISNULL(2 * T1.col1 - T2.t2_col1, 2147483647)) AS T3
)
SELECT T1.t1_col1, T1.t1_col2, T1.t2_col1, T2.col2 AS t2_col2
FROM CTE AS T1
LEFT OUTER JOIN
dbo.t2 AS T2
ON T2.col1 = T1.t2_col1




















on Jun 30, 2010
Hi Itzik,

I look only today t2.col1 is pk, so the proposed index is completly unnecessary.
marc.


Please or Register to post comments.

What's Puzzled By T-SQL Blog?

T-SQL tips and logical puzzles from Itzik Ben-Gan.

Blog Archive

Sponsored Introduction Continue on to (or wait seconds) ×