DataWriter

May 10, 2011

T-SQL Tuesday #18-CTEs: What a CTE is Not

Filed under: SQLServerPedia Syndication,T-SQL — Joe Casella @ 6:13 am
Tags:

TsqlLogo

A common beginner’s misconception about Common Table Expressions (CTEs) is that they are a real result set, like those produced by a temporary table or table variable. In fact, the opposite is true: they’re really just a way to simplify and encapsulate your code. For this month’s TSQL-Tuesday, focused on CTEs, I want to illustrate this difference with an example of how a (non-recursive) CTE can be both more and less efficient than a temporary table at accessing data.

Let’s start with a simple query comparing the sales volume and average price between a given year and the previous year expressed as both a CTE and a derived table:

  1.  
  2. /* CTE Version */
  3. WITH SalesData as
  4. (
  5. SELECT sd.ProductId
  6.      , SalesYr  = YEAR(sh.OrderDate)
  7.      , AvgPrice = Avg(UnitPrice)
  8.   FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
  9.   JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
  10.     ON sh.SalesOrderID = sd.SalesOrderID
  11. GROUP BY sd.ProductId,YEAR(sh.OrderDate)
  12. )
  13.  
  14. SELECT s1.ProductId
  15.     , s1.SalesYr  as year1
  16.     , s2.SalesYr  as year2
  17.     , s1.AvgPrice as Year1AvgPrice
  18.     , s2.AvgPrice as Year2AvgPrice
  19.     , s2.AvgPrice/s1.AvgPrice as Year2Change
  20.   FROM SalesData s1
  21. INNER JOIN SalesData s2
  22.     ON s1.Productid = s2.ProductId
  23.    AND s1.SalesYr = s2.SalesYr 1
  24.   
  25. /* Derived Table Version */
  26. SELECT s1.ProductId
  27.     , s1.SalesYr  as year1
  28.     , s2.SalesYr  as year2
  29.     , s1.AvgPrice as Year1AvgPrice
  30.     , s2.AvgPrice as Year2AvgPrice
  31.     , s2.AvgPrice/s1.AvgPrice as Year2Change
  32.   FROM (
  33.           SELECT sd.ProductId
  34.              , SalesYr  = YEAR(sh.OrderDate)
  35.              , AvgPrice = Avg(UnitPrice)
  36.           FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
  37.           JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
  38.             ON sh.SalesOrderID = sd.SalesOrderID
  39.         GROUP BY sd.ProductId,YEAR(sh.OrderDate)
  40.        ) s1
  41.   INNER JOIN
  42.     (
  43.         SELECT sd.ProductId
  44.              , SalesYr  = YEAR(sh.OrderDate)
  45.              , AvgPrice = Avg(UnitPrice)
  46.           FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
  47.           JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
  48.             ON sh.SalesOrderID = sd.SalesOrderID
  49.         GROUP BY sd.ProductId,YEAR(sh.OrderDate)
  50.     )             s2
  51.     ON s1.Productid = s2.ProductId
  52.   AND s1.SalesYr = s2.SalesYr 1

Both the CTE and the derived table generate the same execution plan:

image

Statistics IO for both queries is also identical:

  1.  
  2. (347 row(s) affected)
  3. Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  4. Table ‘SalesOrderDetail’. Scan count 2, logical reads 2480, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  5. Table ‘SalesOrderHeader’. Scan count 2, logical reads 1372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So, from an execution standpoint, the derived table and CTE are essentially the same. Now, consider the same query using a temporary table:

  1.  
  2.   /* Aggregate Data in Temp Table */
  3. SELECT sd.ProductId
  4.      , SalesYr  = YEAR(sh.OrderDate)
  5.      , AvgPrice = Avg(UnitPrice)
  6.   INTO #TmpSales
  7.   FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
  8.   JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
  9.     ON sh.SalesOrderID = sd.SalesOrderID
  10. GROUP BY sd.ProductId,YEAR(sh.OrderDate)
  11.  
  12. /*Return Sales Info */
  13. SELECT s1.ProductId
  14.     , s1.SalesYr  as year1
  15.     , s2.SalesYr  as year2
  16.     , s1.AvgPrice as Year1AvgPrice
  17.     , s2.AvgPrice as Year2AvgPrice
  18.     , s2.AvgPrice/s1.AvgPrice as Year2Change
  19.   FROM #TmpSales s1
  20. INNER JOIN #TmpSales s2
  21.     ON s1.Productid = s2.ProductId
  22.    AND s1.SalesYr = s2.SalesYr 1

Along with its query plan:

image

And Statistics IO:

  1. Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  2. Table ‘SalesOrderDetail’. Scan count 1, logical reads 1240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  3. Table ‘SalesOrderHeader’. Scan count 1, logical reads 686, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  4.  
  5. (613 row(s) affected)
  6.  
  7. (347 row(s) affected)
  8. Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  9. Table ‘#TmpSales’. Scan count 2, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Notice that in both the CTE and derived table versions, SalesOrderHeader and SalesOrderDetail are each accessed twice and the aggregations are each calculated twice.  The temporary table version accesses each table once, aggregates the results, and then uses the smaller temporary table to produce the final results.  As a result, the CTE version is nearly twice as expensive as the temporary table.   If a CTE were truly a “results set” (as some authors and speakers have presented it), then we should only see SalesOrderHeader  and SalesOrderDetail accessed once, just as with the temporary table, and it should have a similar IO cost.  They’re not.  Conclusion: a CTE is not a temp table or stored results set.

In this particular case, I’ve structured my queries so that the CTE was a less efficient way to access the data. There are times, however, that the compiler can take advantage of the CTE structure and create a more efficient way to access the data.

Here are two (oversimplified) queries to illustrate this point:

  1.  
  2.  
  3.   /* CTE Version */
  4. WITH SalesData as
  5. (
  6.     SELECT sd.ProductId
  7.          , SalesYr  = YEAR(sh.OrderDate)
  8.          , AvgPrice = Avg(UnitPrice)
  9.          , AvgOrderQty =AVG(OrderQty)
  10.       FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
  11.       JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
  12.         ON sh.SalesOrderID = sd.SalesOrderID
  13.     GROUP BY sd.ProductId,YEAR(sh.OrderDate)
  14. )
  15.  
  16. SELECT ProductId
  17.     , AvgPrice
  18.     , AvgOrderQty
  19.   FROM SalesData
  20. WHERE SalesYr = ‘2006’;
  21.   /* Aggregate Data in Temp Table */
  22.  
  23. SELECT sd.ProductId
  24.      , SalesYr  = YEAR(sh.OrderDate)
  25.      , AvgPrice = Avg(UnitPrice)
  26.      , AvgOrderQty =AVG(OrderQty)
  27.   INTO #TmpSales
  28.   FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
  29.   JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
  30.     ON sh.SalesOrderID = sd.SalesOrderID
  31. GROUP BY sd.ProductId,YEAR(sh.OrderDate)
  32.  
  33.  
  34. /*Return Sales Info */
  35.  
  36. SELECT ProductId
  37.     , AvgPrice
  38.     , AvgOrderQty
  39.   FROM #TmpSales
  40. WHERE SalesYr = ‘2006’

Along with their Statistics IO:

  1. ***CTE***
  2. (132 row(s) affected)
  3. Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  4. Table ‘SalesOrderDetail’. Scan count 1, logical reads 285, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  5. Table ‘SalesOrderHeader’. Scan count 1, logical reads 686, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  6.     
  7. ***Temp Table***
  8.   
  9. Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  10. Table ‘SalesOrderDetail’. Scan count 1, logical reads 1240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  11. Table ‘SalesOrderHeader’. Scan count 1, logical reads 686, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  12.  
  13. (613 row(s) affected)
  14.  
  15. (132 row(s) affected)
  16. Table ‘#TmpSales’. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The temporary table approach still pays the cost of aggregating all the data first and it has the same IO cost as the first sample query.  With the CTE, however, the compiler is able to take advantage of the WHERE clause when it’s expanding the query.  As a result, it aggregates less data and uses about half the IO that’s involved in creating and reading from the unfiltered temporary table.   The text showplan reveals this application of the WHERE clause:

  1. StmtText
  2. —————————————————————————————————————————————————————————————————————————————————————-
  3.   |–Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1018]=(0) THEN NULL ELSE [Expr1019]/CONVERT_IMPLICIT(money,[Expr1018],0) END, [Expr1006]=CASE WHEN [Expr1018]=(0) THEN NULL ELSE [Expr1020]/CONVERT_IMPLICIT(int,[Expr1018],0) END))
  4.        |–Hash Match(Aggregate, HASH:([sd].[ProductID]) DEFINE:([Expr1018]=COUNT(*), [Expr1019]=SUM([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[UnitPrice] as [sd].[UnitPrice]), [Expr1020]=SUM([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[OrderQ
  5.             |–Merge Join(Inner Join, MERGE:([sh].[SalesOrderID])=([sd].[SalesOrderID]), RESIDUAL:([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[SalesOrderID] as [sd].[SalesOrderID]=[AdventureWorks2008R2].[Sales].[SalesOrderHeader].[SalesOrderID] as [
  6.                  |–Clustered Index Scan(OBJECT:([AdventureWorks2008R2].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID] AS [sh]),  WHERE:(datepart(year,[AdventureWorks2008R2].[Sales].[SalesOrderHeader].[OrderDate] as [sh].[OrderDate])=(2006))
  7.                  |–Clustered Index Scan(OBJECT:([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] AS [sd]), ORDERED FORWARD)

So which approach is better, the CTE or temporary table?  As always, it depends on your data and your use.  For a little help deciding, why not check out the rest of the posts in this month’s TSQL-Tuesday?

Advertisements

1 Comment »

  1. […] Casella (blog) – Joe went above and beyond and wrote two posts. The first compares CTEs to derived and temporary tables. The second uses a CTE to create a calendar […]

    Pingback by T-SQL Tuesday #18 Wrapup | Bob Pusateri - The Outer Join — May 11, 2011 @ 9:00 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: