Tips, Tricks, and Advice from the SQL Server Query Processing Team

Introduction to Showplan


Showplan is the SQL Server feature for displaying and reading query plans. While some of you may already be very familiar with Showplan, it is one of the most important diagnostic tools we use in the query processing team to locate and identify problems, and it therefore deserves some extra exposure.  Being able to collect, read, and understand Showplan data is a critical skill, and one we plan to blog about a fair bit.  Consider this an introductory post covering some of the basics, which we will drill into more in further posts.

 

We generate two types of Showplans in SQL Server, one at query compilation time (when an optimized query plan is produced) and the second at query runtime (when the optimized query plan is executed).  The former allows you to see the compiled query plan in its entirety, as well as a ton of useful information about each operator in the plan. The runtime plan, often referred to as query execution “statistics”, enables you to collect actual metrics about the query during its execution, such as its execution time and actual cost.

 

Showplan and statistics information can be extracted from the server in two ways: 1) using query SET options, or 2) using Profiler trace events. The available SET options are:

- Legacy Showplan

  - Compile time: SET SHOWPLAN_ALL ON or SET SHOWPLAN_TEXT ON

  - Runtime: SET STATISTICS PROFILE ON

- Showplan XML (preferred)

  - Compile time: SET SHOWPLAN_XML ON

  - Runtime: SET STATISTICS XML ON
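As a quick illustration, here is how the legacy compile-time plan might be requested; this is a minimal sketch against the 'nwind' sample database used below, and the query itself is incidental:

```sql
-- Request the legacy compile-time plan: the statement is compiled
-- but not executed, and plan rows are returned instead of results.
use nwind
go
set showplan_all on
go
select ContactName from Customers where City = 'London'
go
set showplan_all off
go
```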

 

SQL Server Management Studio (SSMS) uses Showplan to display a graphical representation of the query plan. A graphical view of the query plan for the following query on the ‘nwind’ database is shown in the picture below.

 

-- Query: select customer name and order date for orders placed by customers in Canada

use nwind

go

select ContactName, OrderDate

from Customers inner join Orders

on Orders.CustomerID = Customers.CustomerID

where Orders.ShipCountry = 'Canada'

go

 

(See the attached .bmp file for a picture of this showplan)

 

In SSMS, to view the graphical plan, click on the Execution Plan tab in the Results pane. The graphical plan is read from top to bottom and from right to left. When you select any operator in the query tree, or simply hover the mouse over it, a tooltip describes the operator. It displays the Query Optimizer cost estimates (operator and subtree costs, number of rows, row size, etc.) and additional information like output columns, predicates, etc. The detailed operator information is shown in the Properties window (View --> Properties Window), which is usually docked on the far right of SSMS.

 

To save the graphical showplan to a file you can right click on the Execution Plan and select 'Save Execution Plan As'. The query plan is saved with extension '.sqlplan' and can be reloaded into SSMS anytime, sent via email, etc.

 

Showplan output can be generated in text, grid, or XML format. The XML format was introduced in SQL Server 2005; we refer to it as “Showplan XML” and to the other formats as “Legacy Showplan”. In the next post we will describe how to generate and analyze Showplan XML.

 

- Steve and Gargi

 

 


Viewing and Interpreting XML Showplans


As mentioned in our previous post, SQL Server 2005 supports Showplan generation in XML format. XML-based Showplans provide greater flexibility in viewing query plans and saving them to files than legacy Showplans do. Beyond the usability benefits, XML-based Showplans also contain plan-specific information that is not available in legacy Showplans. For example, Showplan XML includes the cached plan size, memory fractions (how the memory grant is to be distributed across operators in the query plan), the parameter list with the values used during optimization, and missing-index information, none of which is available with the legacy Showplan All option. Similarly, Statistics XML adds information not found in the legacy Statistics Profile option, such as the degree of parallelism, the runtime memory grant, the parameter list with the actual values used for all parameters, and execution statistics such as counts of rows/executes aggregated per thread (in a parallel query). Such information is very useful in analyzing query compilation, execution, and performance issues, so using the new XML Showplan features is highly recommended.

 

Showplan XML can be generated in two ways:

1. Using T-SQL SET options

2. Using SQL Server Profiler trace events

 

With the SET options, Showplan XML returns batch-level information, i.e. it produces one XML document per T-SQL batch. If the T-SQL batch contains several statements, one XML node is generated per statement and the nodes are concatenated together. When Showplan XML is generated using trace events, however, it generates statement-level information. Let’s analyze the Showplan XML output for the following query (see attached document - showplan.xml):

 

use nwind

go

set showplan_xml on

go

SELECT ContactName, OrderDate

FROM Customers inner join Orders

ON Orders.CustomerID = Customers.CustomerID

WHERE Orders.ShipCountry = 'Canada'

go

set showplan_xml off

go

 

The root element of the document contains the Showplan XML namespace attribute with the location of the Showplan XML schema, and the SQL Server build information. It contains batch and statement sub-elements. Each statement contains a "StatementText" attribute that describes the T-SQL query being executed, a "StatementId" attribute indicating the relative position of the statement in the batch, the relative cost of the statement and the level of optimization used to generate the query plan output. It also contains additional information like SET options in effect when executing the query.

 

<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan" Version="1.0" Build="9.0.9067.0">

  <BatchSequence>

    <Batch>

      <Statements>

        <StmtSimple StatementText="SELECT ContactName, OrderDate FROM Customers inner join Orders ..

<StatementSetOptions QUOTED_IDENTIFIER="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" NUMERIC_ROUNDABORT="false" />

          <QueryPlan CachedPlanSize="24" CompileTime="1797" CompileCPU="1756" CompileMemory="328">

          ...

         

The QueryPlan node contains plan information such as size of the compiled plan, compilation time, etc. which is useful for debugging purposes. The iterators in the query plan are represented as nested elements, each of type ‘RelOp’. Every ‘RelOp’ element contains two types of information:

 

  1. Generic information such as logical and physical operator names, optimizer cost estimates, etc. as attributes.
  2. Operator specific information such as a list of output columns, a set of defined values, predicates, database objects on which they operate (tables/indexes/views), etc. as sub-elements.

For example, the topmost ‘RelOp’ in our example is a ‘Nested Loops’ that contains the following generic information:

 

<RelOp NodeId="0" PhysicalOp="Nested Loops" LogicalOp="Inner Join" EstimateRows="43" EstimateIO="0" EstimateCPU="0.00017974" AvgRowSize="34" EstimatedTotalSubtreeCost="0.0308059" Parallel="0" EstimateRebinds="0" EstimateRewinds="0">

 

And the following operator specific information:

 

              <OutputList>

                <ColumnReference Database="[nwind]" Schema="[dbo]" Table="[Customers]" Column="ContactName" />

                <ColumnReference Database="[nwind]" Schema="[dbo]" Table="[Orders]" Column="OrderDate" />

              </OutputList>

              <NestedLoops Optimized="0">

                <OuterReferences>

                  <ColumnReference Database="[nwind]" Schema="[dbo]" Table="[Orders]" Column="CustomerID" />

                </OuterReferences>

               

              </NestedLoops>

 

 

The runtime counterpart of Showplan XML is called ‘Statistics XML’. It displays query execution “statistics” by executing the query and aggregating runtime information on a per-iterator basis. To obtain the Statistics XML output, use the following SET option.

 

set statistics xml on

go

 

Now let’s analyze the Statistics XML output of the above query. The generated XML is semantically similar to the Showplan XML output; some of the key differences are:

 

  1. In the QueryPlan node, an additional attribute indicating the degree of parallelism is present:

<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" …>

 

  2. Each iterator ‘RelOp’ node contains an additional sub-element called RunTimeInformation that contains the iterator execution profile such as the number of rows, number of executions, indexes accessed, join order, etc.

            <RelOp NodeId="0" PhysicalOp="Nested Loops" LogicalOp="Inner Join" EstimateRows="43" EstimateIO="0"

EstimateCPU="0.00017974" AvgRowSize="34" EstimatedTotalSubtreeCost="0.0308059" Parallel="0" EstimateRebinds="0" EstimateRewinds="0">

              <OutputList>

                <ColumnReference Database="[nwind]" Schema="[dbo]" Table="[Customers]" Column="ContactName" />

                <ColumnReference Database="[nwind]" Schema="[dbo]" Table="[Orders]" Column="OrderDate" />

              </OutputList>

              <RunTimeInformation>

                <RunTimeCountersPerThread Thread="0" ActualRows="43" ActualEndOfScans="1" ActualExecutions="1" />

              </RunTimeInformation>

 

  3. The runtime information is shown per thread, and is not aggregated as in legacy showplan.
  4. For memory-consuming iterators such as sort or hash join, the memory grant information is printed.
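To see the last two differences for yourself, a query with a memory-consuming sort can be run under Statistics XML. This is a sketch against the 'nwind' database; the exact attributes emitted should be verified against your own server's output:

```sql
use nwind
go
set statistics xml on
go
-- The Sort needs a memory grant; in the returned Statistics XML the
-- QueryPlan element carries runtime attributes (degree of parallelism,
-- memory grant), and each RelOp gains a RunTimeInformation sub-element.
select ContactName from Customers order by ContactName
go
set statistics xml off
go
```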

In a later blog post, I will cover some of the Profiler trace events for Showplan.

 

- Gargi Sur

SQL Server Query Processing Development Team

 

What's this cost?


Oftentimes when people include the actual execution plan (Ctrl-M, see previous posts for a good primer on execution plans) while executing a batch in SQL Server Management Studio, they see this "Query cost (relative to the batch)" figure on top of each query in the batch, and they start to ask: What does this mean? When I run this batch, the first query runs faster than the second, and yet this crazy SQL Server says the second has a higher cost; what is it talking about?

Before I attempt an explanation, let me quickly outline an example to illustrate this. First, let's create a table with some rows to experiment with:

Use tempdb
Go
If Object_ID('BigT', 'U') Is Not Null
 Drop Table BigT
Go
Create Table BigT(pk Int Identity Primary Key Clustered, string Varchar(1000), number Int);
Go
-- Insert 10,000 rows
WITH Digits(D) AS
(
Select 0 Union All Select 1 Union All Select 2 Union All Select 3 Union All Select 4 Union All
Select 5 Union All Select 6 Union All Select 7 Union All Select 8 Union All Select 9
)
Insert Into BigT
Select 'Some text', A.D * 1000 + B.D * 100 + C.D * 10 + D.D
From Digits A Cross Join Digits B Cross Join Digits C Cross Join Digits D
Go

Now, with Include Actual Execution Plan on, we can run the following two-query batch:

Select string, Max(number) From BigT Group By string Order By Max(number)
Select A.string, B.string From BigT A Join BigT B on A.number = B.number * 2

On my machine, I get the attached plan. The main thing to notice here is that the first query has a query cost of 4% (relative to the batch), and the second one has 96% cost. Actually, if you want the numerical basis for these percentages and you're running this experiment yourself (you'll probably have different numbers depending on your machine), just hover over the root SELECT operator in the plans. I have Estimated Subtree Cost of 0.135437 for the first SELECT and 3.154 for the second SELECT, which have the ratio ~4:96. Notice the lack of units for these costs: they're not in seconds, MB or anything.

The reason these costs exist is the query optimization SQL Server does: it does cost-based optimization, which means that the optimizer formulates a lot of different ways to execute the query, assigns a cost to each of these alternatives, and chooses the one with the least cost. The cost tagged on each alternative is heuristically calculated, and is supposed to roughly reflect the amount of processing and I/O the alternative is going to take. For example, based on the statistical information we automatically collect on the number column, we estimated that we'll have 5000.5 rows coming out of the join in the second query (a really good guess in this case, the join produces 5000 rows), and based on this we assigned a cost of 1.8 to the Merge Join operator on my machine. This number, again, is unit-less and is meant only for comparison purposes against other alternatives.
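If you want the raw numbers without hovering in the graphical plan, the legacy text showplan exposes the same estimate in its TotalSubtreeCost column. A minimal sketch reusing the BigT table created above (the costs you see will differ by machine):

```sql
use tempdb
go
set showplan_all on
go
-- Each returned row describes one plan operator; the TotalSubtreeCost
-- column of the topmost row is the whole query's unit-less cost.
Select string, Max(number) From BigT Group By string Order By Max(number)
go
set showplan_all off
go
```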

So when you get the execution plan in the end, we display these relative costs of queries in a batch, and of different operators within the same query, for informational purposes. They are not meant as accurate representations of query run times. The actual run time of a query depends on many things: whether the buffers are warm (i.e. the contents required by the query are cached by the server), what other queries are running on the server, locking, etc. Also, cost estimation is a heuristic process based on statistical sampling and various best-effort guesses, and it can easily go wrong, especially for large queries.

For example, setting Statistics Time on:

Set Statistics Time On

And running these queries again I get a run time of 15 ms for the first query and 70 ms for the second. The second did take longer, but not at a 96:4 ratio. Then again, these are really short times and hardly precise: a second run yielded run times of 16 ms and 111 ms.

- Mostafa Elhemali

Showplan Trace Events


Besides SSMS, another great tool available to database developers and DBAs to view query plans and troubleshoot query compilation or execution issues is the SQL Server Profiler. In the Profiler, all the showplan events are listed under the Performance Event category. All the SQL Trace events generate showplan information at query-level granularity, i.e. a single XML document is generated for each query. There are 9 showplan events in SQL Server 2005! In this post, I will describe the commonly-used ones.

  •  Showplan XML – This event is new to SQL Server 2005. It generates the query plan in XML and displays it as text in the TextData column. It also displays the query plan in graphical format in the profiler trace window. When enabled, this trace event is generated every time a T-SQL query is executed. It displays batch information along with the Showplan by grouping together statements that are executed in a batch. This trace event is equivalent to using Showplan XML SET OPTION in SSMS.
  •  Showplan XML For Query Compile – Also introduced in SQL Server 2005, this event is similar to the Showplan XML event except that Showplan output is generated only when the query is compiled (or recompiled). For subsequent executions, if the query plan is retrieved from plan cache, no showplan information is displayed. This event is less expensive from a performance perspective and best for quick view of the generated query plan.
  •  Showplan XML Statistics Profile – This event is similar to the Showplan XML event, but in addition to the compile-time plan information it also includes runtime query execution statistics, such as the actual number of rows and executions per iterator, the memory grant, the degree of parallelism, etc. The event is generated once per execution. The showplan output is displayed as XML text in the TextData column. The query plan portion for the query below is shown in the attached document:

use nwind
go
select * from employees where hiredate < '1-1-2001'
go

Besides the XML events, the Performance Events category also contains other legacy showplan events. Of these, I find the Showplan Statistics Profile event useful for displaying the execution stats in tabular format. The Performance Statistics event provides additional information (such as sql_handle, number of recompiles, etc.) which can be used to debug costing, plan generation, IO bottlenecks, and other related issues. Most of the legacy events generate Showplan in binary format, which is displayed in the BinaryData column. The binary XML is converted to text format and is displayed in the expanded trace event viewing pane. We recommend using the XML showplan events over the legacy showplan events, since they contain additional information such as missing indexes, rows processed on each thread in a parallel query plan, memory grant per iterator, etc.

If using the Profiler to generate trace events is not an option, consider using SQL Trace to collect server trace data. SQL Trace is a mechanism that SQL Server supports to generate and capture server-side trace events using system stored procedures. 
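As a hedged sketch of that mechanism, the stored procedures below create and start a server-side trace that captures the Showplan XML Statistics Profile event to a file. The file path is hypothetical, and the event ID (146) and column IDs (1 = TextData, 12 = SPID) are from memory; verify them against sys.trace_events and sys.trace_columns before relying on them:

```sql
declare @TraceID int, @maxsize bigint, @on bit;
set @maxsize = 50; -- maximum trace file size in MB
set @on = 1;
-- Create a stopped trace; the server appends .trc to the file name.
exec sp_trace_create @TraceID output, 0, N'C:\traces\showplan_trace', @maxsize;
-- Capture the Showplan XML Statistics Profile event.
exec sp_trace_setevent @TraceID, 146, 1, @on;   -- TextData
exec sp_trace_setevent @TraceID, 146, 12, @on;  -- SPID
-- Start the trace (1 = start; 0 = stop; 2 = close and delete definition).
exec sp_trace_setstatus @TraceID, 1;
```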

Gargi Sur
SQL Server Query Processing

Statistics Profile Output Formatting


Statistics profile output is an important tool when it comes to troubleshooting query plan issues. When enabled, it returns a textual representation of the query plan with a lot of detail about cost and cardinality estimates as well as actual counts.
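For reference, statistics profile is enabled per session; a minimal sketch, assuming the 'nwind' sample database from the earlier posts:

```sql
use nwind
go
set statistics profile on
go
-- The query executes normally, and an extra result set follows with one
-- row per plan operator (columns include Rows, Executes, StmtText, and
-- the optimizer's estimates).
select ContactName from Customers where Country = 'Canada'
go
set statistics profile off
go
```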

 

When working in SQL Server Management Studio (SSMS), it is advisable to enable grid mode for the results of statistics profile. This presents the results in a neatly formatted way and makes it easy to copy and paste the result into Excel for further manipulation (hiding columns, highlighting individual cells, adding columns with additional calculations, …). But what if grid output mode is not available? For example, when running SQL Trace with an event that produces statistics profile output for each query, the result is always presented as flat text. For even mildly complex queries this output is hard to decipher.

 

The attached perl script helps to solve this formatting problem. It takes the flat text output of statistics profile and parses it to identify individual columns. It then generates tab delimited output which can be directly consumed by Excel.

 

Usage is straightforward: the script assumes one individual statistics profile in a regular text file (ANSI text only, please; UNICODE output confuses the script). Just run:

 

             perl statspro.pl <input file> <output file>

 

E.g., let’s assume I have my statistics profile output in a file called test.txt; I’ll run:

 

     perl statspro.pl test.txt test.xls

 

The resulting xls file can be directly viewed with Excel.

 

Note that I’m not a Perl wizard and there might be a more elegant/efficient way to achieve the same end. The script is provided “as is”. You are more than welcome to use and/or enhance the script.

 

Peter Zabback

Intro to Query Execution Bitmap Filters


One of the least understood Query Execution operators is the Bitmap.  I'd like to give a fairly brief overview of how Bitmap filters are used, as well as some technical details about their limitations and functionality.  Bitmap filters are often mistaken for Bitmap indexes.  The two are actually very distinct concepts: Bitmap indexes are physical structures that are persisted on disk and are used for data access, while Bitmap filters are in-memory structures used to enhance performance during the execution of a query.  This article refers exclusively to Bitmap filters.  Bitmaps are visible in Showplan XML (see attachment) and will appear in the XML plan similar to this:

<RelOp NodeId="5" PhysicalOp="Bitmap" LogicalOp="Bitmap Create" EstimateRows="10" EstimateIO="0" EstimateCPU="0.028506" AvgRowSize="11" EstimatedTotalSubtreeCost="0.0317319" Parallel="1" EstimateRebinds="0" EstimateRewinds="0" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">

Also in Showplan, the columns that are used for filtering by the bitmap are visible:
<OutputList>
         <ColumnReference Database="[QEWorkingDB]" Schema="[dbo]" Table="[MJ_BMAP_1]" Alias="[A]" Column="c_int" />
</OutputList>
 
The primary role of the Bitmap is to speed up parallel plans by doing semijoin reduction early in the query, before rows are passed through the Parallelism operator.  Bitmaps are not used in serial plans.  The Bitmap itself gets created on the build side of a Hash Join (this is where the iterator will appear in the plan), but the actual bitmap checks (filtering of rows) are done within the Parallelism operator that is on the probe side of the Hash Join.  Note that not every Hash Join in a parallel plan will use a Bitmap for filtering rows; it is a decision the Optimizer makes based on whether it thinks the Bitmap will be selective enough to be useful.  Hash Joins are not the only type of join where Bitmaps can be used; in some cases, Bitmaps can be used with Merge Joins as well.
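As a sketch of a query shape that can produce a Bitmap, consider a parallel hash join between two large tables. MJ_BMAP_1 is the table from the Showplan fragment above; MJ_BMAP_2 is a hypothetical second table, and even with the hash join hint the optimizer still decides on its own whether a Bitmap is worthwhile:

```sql
-- A parallel Hash Join candidate: the Bitmap, if chosen, is built from
-- the build side's join keys and checked on the probe side.
select A.c_int
from MJ_BMAP_1 as A
inner join MJ_BMAP_2 as B on A.c_int = B.c_int
option (hash join)
```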
 
Another interesting thing to make note of is that since Bitmaps filter out rows flowing through a query plan, they will affect the accuracy of the Showplan estimations (Estimated Number of Rows) for other operators above them in the query plan.  Since the Optimizer can never be 100% certain how many rows the Bitmap will filter out, the Actual Rows (visible through Statistics Profile after a query has run) will usually differ somewhat from the estimates.  This is normal and to be expected.
 
Internally, Bitmaps are implemented as a relatively simple bit array.  When building the bitmap, we assign a hash value to each row in the build-side table and set the bits in the array to correspond with hash values that contain at least one row.  When we check the bitmap from the probe side, we again hash each row and check whether the bit is set or not in the array.  If the bit is not set, then we immediately know that the row will not qualify in the join (since there is no match from the other table) and we can drop the row.  This provides a low-cost approach of quickly eliminating rows without the overhead of doing a full join algorithm.
 
One optimization that allows us to check the Bitmap and eliminate rows even earlier in a query's execution is called In-Row Optimization.  This option is available only when certain conditions are met; specifically, the Bitmap must be used on not-nullable INT or BIGINT columns.  When this holds, the Bitmap check is pushed down as a filter into the actual Table/Index Scan (where we retrieve the data from disk).  To tell whether In-Row is being used, look for the InRow attribute in the Showplan XML on the Parallelism operator directly above the Table/Index Scan on the probe side of the Hash Join:
 

<RelOp NodeId="79" PhysicalOp="Parallelism" LogicalOp="Repartition Streams" EstimateRows="100000" EstimateIO="0" EstimateCPU="0.0884062" AvgRowSize="11" EstimatedTotalSubtreeCost="2.67314" Parallel="1" EstimateRebinds="0" EstimateRewinds="0">
  <Parallelism PartitioningType="Hash" InRow="1">

- Steve Herbert
Query Execution

Index Build strategy in SQL Server - Introduction (I)


The index build strategy in SQL Server may vary depending on the user's needs, and each strategy may have different memory and disk space requirements. These different strategies will be discussed in the next several posts.

 

To begin, let’s see what kinds of index build exist in SQL Server 2005:

 

- Online Index Build vs. Offline Index Build:

In SQL Server 2005, you can create, rebuild, or drop indexes online. The ONLINE option allows concurrent user access to the underlying table or clustered index data and any associated nonclustered indexes during these index operations. For example, while a clustered index is being rebuilt by one user, that user and others can continue to update and query the underlying data. When you perform DDL operations offline, such as building or rebuilding a clustered index, these operations hold exclusive locks on the underlying data and associated indexes, which prevents modifications and queries to the underlying data until the index operation is complete.

Example:

Create index idx_t on t(c1, c2)

WITH (ONLINE = ON)

 

- Serial Index Build vs. Parallel Index Build:

On multiprocessor computers, index statements may use multiple processors to perform the scan, sort, and build operations associated with the index statement, just like other queries do. The number of processors employed to run a single index statement is determined by the max degree of parallelism configuration option (set by sp_configure; the default value of 0 uses all available processors), by the MAXDOP index option (set in statements; see the example below), by the current workload, and, in the non-partitioned case, by the data distribution of the first key column. The max degree of parallelism option limits the number of processors used in parallel plan execution; in other words, it sets a ceiling: no more than this number of processors will be used, but fewer may be. If the Database Engine detects that the system is busy, the degree of parallelism of the index operation is automatically reduced before statement execution starts.

Example:

Create index idx_t on t(c1, c2)

WITH (MAXDOP = 2)

-- limit # of processor to use for index build to 2
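The server-wide ceiling mentioned above is set through sp_configure; a minimal sketch (this is an advanced option, so advanced options must be shown first):

```sql
exec sp_configure 'show advanced options', 1;
reconfigure;
-- Cap parallel plans, including index builds, at 4 processors;
-- 0 would restore the default of using all available processors.
exec sp_configure 'max degree of parallelism', 4;
reconfigure;
```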

 

- Building an index storing the intermediate sort results in the user’s database vs. storing them in the tempdb database (SORT_IN_TEMPDB):

When you create or rebuild an index, you can choose which database is used to store the intermediate sort results generated during index creation. It can be either the user’s database (the database where the index is being created) or the tempdb database. The SORT_IN_TEMPDB index option sets the desired behavior: when ON, the sort results are stored in tempdb; when OFF, the sort results are stored in the filegroup or partition scheme in which the resulting index is stored.

Example:

Create clustered Index idx_t on t(c1)

WITH (SORT_IN_TEMPDB = ON)

 

 

Read in next post: Building partitioned vs. non-partitioned Indexes.

 

Posted by: Lyudmila Fokina

 

Index Build strategy in SQL Server - Introduction (II)


- Building Partitioned Index vs. Building non-Partitioned Index:

The data of partitioned tables and indexes is divided into units that can be spread across more than one filegroup in a database. The data is partitioned horizontally, so that groups of rows are mapped into individual partitions. The table or index is treated as a single logical entity when queries or updates are performed on the data. All partitions of a single index or table must reside in the same database.

Building an Aligned Partitioned Index:

Although partitioned indexes can be implemented independently from their base tables, it generally makes sense to design a partitioned table and then create an index on the table. When you do this, SQL Server automatically partitions the index by using the same partition scheme and partitioning column as the table. As a result, the index is partitioned in essentially the same manner as the table. This makes the index aligned with the table.

An index does not have to participate in the same named partition function to be aligned with its base table. However, the partition function of the index and the base table must be essentially the same, in that

1) the arguments of the partition functions have the same data type,

2) they define the same number of partitions, and

3) they define the same boundary values for partitions.

 

Also, if you build a nonclustered index on a partitioned base (a partitioned heap or clustered index) without specifying a partitioning function, then the nonclustered index will be aligned as well (see example below).

Example:

Create Partition Function pf (int)

as range right for values (NULL,  1,  100)

                 

Create Partition Scheme ps

as Partition pf

TO ([PRIMARY], [FileGroup1], [FileGroup1], [FileGroup1])

                 

Create table t (c1 int, c2 int)

on ps(c1)

                 

Create Index idx_t on t(c1)

 

Building a non-Aligned Partitioned Index:

SQL Server does not align the index with the table if you specify a different partition scheme or a separate filegroup on which to put the index at creation time.

You can turn a non-partitioned table into a partitioned one by building a partitioned clustered index on it; this will be a non-aligned index build as well (see example below).

Example:

Create Partition Function pf (int)

as range right for values (NULL,  1,  100)

                 

Create Partition Scheme ps

as Partition pf

TO ([PRIMARY], [FileGroup1], [FileGroup1], [FileGroup1])

                 

Create table t (c1 int, c2 int)

                 

Create clustered Index idx_t on t(c1)

on ps(c1)

 

Note: If you have a partitioned clustered index (as in the example above) and drop the clustered index, the new heap will stay partitioned and will be located in the same partition scheme or filegroup as was defined for the clustered index, unless you specify the MOVE TO option when dropping the clustered index.

Example:

Drop Index idx_t on t

WITH(MOVE TO new_ps(c1))

 

In this example the base table is moved to a different partition scheme and the nonclustered indexes are not moved to coincide with the new location of the base table (heap). Therefore, even if the nonclustered indexes were previously aligned with the clustered index, they may no longer be aligned with the heap.

 

 

 Read in next post: Index Build Scenario 1: Offline, Serial, No Partitioning

 

Posted by: Lyudmila Fokina 

  

 


Using ETW for SQL Server 2005


ETW stands for “Event Tracing for Windows”, and it is used by many Windows applications to provide debug trace functionality.  This “wide” availability is a key benefit of using ETW, because it can help track certain activities from end to end.  For example, you can literally track a request coming from IIS, passing through the protocol layer, and finally being handled by the database engine, all in a single trace file.   Unfortunately, not many people know that SQL Server 2005 provides full ETW functionality which can output most of the SQL Trace events available to SQL Server Profiler.  In this article, I will explain how to use SQL ETW with examples.

First of all, how do I know if SQL ETW is really active on my machine?  The following shows the output on my server, which has SQL Server 2005 Standard Edition installed as a second instance (named Yukon).

C:\>logman query providers
Provider                                 GUID
-------------------------------------------------------------------------------
YUKON Trace                              {130A3BE1-85CC-4135-8EA7-5A724EE6CE2C}
.NET Common Language Runtime             {e13c0d23-ccbc-4e12-931b-d9cc2eee27e4}
ACPI Driver Trace Provider               {dab01d4d-2d48-477d-b1c3-daad0ce6f06b}
Active Directory: Kerberos               {bba3add2-c229-4cdb-ae2b-57eb6966b0c4}
IIS: Request Monitor                     {3b7b0b4b-4b01-44b4-a95e-3c755719aebf}
Local Security Authority (LSA)           {cc85922f-db41-11d2-9244-006008269001}

My other server has a default instance of SQL Server 2005 and it shows something like this:

MSSQLSERVER Trace                        {2373A92B-1C1C-4E71-B494-5CA97F96AA19}

Is there a connection between these two examples?   Yes: the SQL ETW provider name is a combination of the instance name and “Trace”, and its GUID is unique per instance.    You can use either the provider name or the GUID to start and stop an ETW session.

Please note that this ETW functionality is not available in SQL Server Express (you get what you pay for).  If you don’t see the SQL ETW provider name, and you are sure you have a regular SQL Server 2005 installation, you may be able to fix it yourself.   Here’s a short list of instructions.

  1. Go to the directory where your SQL server binaries are.  On my machine, it is on
    “C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn”.
  2. Look for “etwcls.mof”.   If you cannot find it, something is definitely wrong with your installation.  You probably need to seek help elsewhere.
  3. Run “mofcomp etwcls.mof” in command window.
  4. Try “logman query providers” again.  You should see your SQL ETW provider.

Now I will describe how to configure SQL ETW to collect interesting traces.   Go to the same directory described in the instructions above and find a file called “etwcnf.xml”.   This is your configuration file, and you can modify it with any text editor.   There should be several entries already defined in this file as examples.   See the following template taken from the etwcnf.xml file.

<Template id="1" Name="TSQL replay">
       <Event id="11"/>
       <Event id="13"/>
       <Event id="14"/>
       <Event id="15"/>
       <Event id="17"/>
       <Event id="53"/>
       <Event id="70"/>
       <Event id="71"/>
       <Event id="72"/>
       <Event id="74"/>
       <Event id="77"/>
       <Event id="78"/>
       <Event id="100"/>
</Template>

This defines a tracing template called “TSQL replay” with template ID = 1.   It contains a list of events with ID = 11, 13, and so forth.  Each event ID matches a SQL Trace event class.  By consulting the BOL documentation on available SQL Trace events, you can find that the “RPC:Starting” event has ID 11, and the “SQL:BatchStarting” event has ID 13.   Now you know how to change this template to suit your needs!

Now let’s try to capture some actual trace events.  I’m using “logman.exe” and “tracerpt.exe”, which should be available on most Windows platforms.

First, create a text file as shown in the following example.  You can activate multiple providers by listing them on separate lines.  For simplicity, let’s enable only SQL ETW.

C:\etw>type prov.txt
"YUKON Trace" 1 0

As you have probably figured out already, the first column is the SQL ETW provider name (you can use the GUID instead).   The next number (1) is called the “enable flag” and should match a template ID in the etwcnf.xml file.  The third number (0) is called the “enable level” and should be kept as zero for SQL ETW.   The meaning of these two values depends on the provider, and you should consult the appropriate documentation for other providers.

Now you are ready to start your first SQL ETW session!   In my example, I gave it a very creative and original name: “mytrace”.   Here’s a screen dump from my machine.

C:\etw>logman start mytrace -pf prov.txt -ets
Name:                      mytrace
Age Limit:                 15
Buffer Size:               64
Buffers Written:           1
Clock Type:                System
Events Lost:               0
Flush Timer:               0
Buffers Free:              2
Buffers Lost:              0
File Mode:                 Sequential
File Name:                 C:\etw\mytrace.etl
Logger Id:                 3
Logger Thread Id:          1308
Maximum Buffers:           25
Maximum File Size:         0
Minimum Buffers:           3
Number of buffers:         3
Real Time Buffers Lost:    0

Provider                                  Flags                     Level
-------------------------------------------------------------------------------
* "YUKON Trace"                           0x00000001                0x00
{130A3BE1-85CC-4135-8EA7-5A724EE6CE2C}  0x00000001                0x00

To test if this ETW session can actually receive something from SQL Server, let’s issue “select * from sys.dm_exec_requests”.  After that, you can stop the tracing as shown in the following example.

C:\etw>logman stop mytrace -ets
The command completed successfully.

Look for the newly created file in the working directory (mytrace.etl).  Unfortunately, this is a binary trace file, and you will need some help to crack it open.   “Tracerpt.exe” does a rudimentary job of converting this binary trace file into human-readable form.

C:\etw>tracerpt mytrace.etl
Input
----------------
File(s):
     mytrace.etl
100.00%
Output
---------------
Text (CSV):         dumpfile.csv
Summary:            summary.txt
The command completed successfully.

Note this creates 2 files: dumpfile.csv and summary.txt.   Here’s a screen dump. 

C:\etw>type dumpfile.csv
  Event Name,       Type,        TID,           Clock-Time, Kernel(ms),   User(ms), User Data
  EventTrace,     Header, 0x00000BF4,   128075886826813362,          0,         10,    65536, 33620485,     790, 1, 128075887366383792,   100144,        0, 0x00000000,        2,        1,        4,        0,     2785, 0x01402930, 001402940,                0,          3579545, 128075886826813362, 0x00000002,        0, 0, 0
SQL:BatchStarting,          0, 0x00000874,   128075886939625782,          0,         10, "select * from sys.dm_exec_reqests",        1,                0, "xxxx", "REDMOND", "JAYC8V1",     1580, "SQL Query Analyzer", "REDMOND\xxxx",       51, 28075598939600000,       13, "master", 010500000000000515000000A065CF7E784B9B5FE77C8770A9640000",        0,    0,       38,        0, "REDMOND\xxxx", 0, 0
C:\etw>

Yes, this looks pretty cryptic, and you will probably need to look at etwcls.mof (and the SQL Trace documentation) to decipher each column.   From the etwcls.mof file, we know the first user-data column is “TextData”, the second is “DatabaseID”, the third is “TransactionID”, and so on.

       [WmiDataId(1),
        Description("TextData"),
        format("w"),
        StringTermination("Counted"),
        read
       ]
       string TextData;
       [WmiDataId(2),
        Description("DatabaseID"),
        read
       ]
       sint32 DatabaseID;
       [WmiDataId(3),
        Description("TransactionID"),
        read
       ]
       sint64 TransactionID;

Now we know this ETW session received “SQL:BatchStarting” event correctly, and its TextData column contains correct TSQL statement.
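To make the column mapping concrete, here is a minimal Python sketch. The CSV row is a trimmed-down, hypothetical excerpt of a tracerpt dump (the real file has more columns), but the layout follows the pattern above: six fixed ETW columns followed by the provider's user data as described in etwcls.mof.

```python
import csv
import io

# Hypothetical, trimmed-down tracerpt CSV row: six fixed ETW columns
# (Event Name, Type, TID, Clock-Time, Kernel(ms), User(ms)) followed by
# the provider's user data, whose layout comes from etwcls.mof.
dump = io.StringIO(
    'SQL:BatchStarting,0,0x00000874,128075886939625782,0,10,'
    '"select * from sys.dm_exec_requests",1,0'
)

row = next(csv.reader(dump))
event_name = row[0].strip()
# Per etwcls.mof: WmiDataId(1) = TextData, WmiDataId(2) = DatabaseID
text_data = row[6]
database_id = int(row[7])
print(event_name, database_id, text_data)
```

With a real dumpfile.csv you would open the file instead of the in-memory string, and skip the header and EventTrace rows first.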

OK, now you know how to start and stop SQL ETW.  But with so little UI support compared to the polished SQL Server Profiler, why bother with ETW if you don’t need end-to-end tracing?  One good reason is that an ETW session survives a SQL Server restart, unlike an ordinary SQL Trace.

 
Jay Choe

Query Execution Timeouts in SQL Server (Part 1 of 2)


This short article provides a checklist for query execution timeout errors in SQL Server 2005 (code-named Yukon). It does not cover timeout issues in optimization or connection. Before reading this article, we recommend reading the following post to get familiar with the SQL Server memory management architecture: http://blogs.msdn.com/slavao/archive/2005/02/11/371063.aspx

 

Overview of query processing

When a query is submitted, SQL Server first checks whether a plan for it is cached. If so, that plan is used. If not, the query statement is parsed to generate a sequence tree. The sequence tree is then bound, normalized, and converted to an algebrized tree, which is optimized to produce the execution plan.

 

When an optimized plan has been generated (or fetched from the cache), it is executed immediately if its memory requirement can be satisfied right away. If not, which happens quite often, the query is put into a queue to wait for memory.

 

How does a query execution time out happen?

 

Before executing a query, SQL Server estimates how much memory it needs to run and tries to reserve that amount from the buffer pool. If the reservation succeeds, the query is executed immediately. If there is not enough memory readily available from the buffer pool, the query is put into a queue with a timeout value, where the timeout value is guided by the query cost. The basic rule is: the higher the estimated cost, the larger the timeout value. When the waiting time of a query exceeds its timeout value, a timeout error is thrown and the query is removed from the queue. The following shows a sample timeout error:

 

[State:42000 Error:8645]      [Microsoft][SQL Native Client][SQL Server]A time out occurred while waiting for memory resources to execute the query. Rerun the query.

 

 

If there is enough memory for a newly submitted query but there are already queries in the waiting queues, the new query is also queued. Queries in the waiting queues are ranked by their cost and waiting time: the lower a query’s cost or the longer its waiting time, the higher it is ranked. Note that the ranking is dynamic and changes frequently. The query with the highest rank runs when there is enough free memory. If the memory is insufficient for it, then no other queries run either; SQL Server does NOT check whether the free memory would be enough to run the other queries. You can check which query is next to be picked up by running the following query. If it returns no rows, there are no waiting queries. Note: the results of this query change over time.

select * from sys.dm_exec_query_memory_grants where is_next_candidate is not null

 

You can use the value in the plan_handle column to retrieve the showplan from sys.dm_exec_query_plan and the sql_handle column to retrieve the SQL text from sys.dm_exec_sql_text.
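As a thought experiment, the ranking rule above (lower cost or longer waiting time means a higher rank) can be sketched in Python. The scoring formula and the sample queries below are purely illustrative and are not SQL Server's internal algorithm or data:

```python
def next_candidate(waiting, now):
    """Pick the highest-ranked waiting query.  Lower estimated cost and
    longer waiting time both raise a query's rank; this particular
    formula is an illustration, not SQL Server's actual computation."""
    return max(waiting, key=lambda q: (now - q["queued_at"]) / q["cost"])

waiting = [
    {"query": "big report", "cost": 500.0, "queued_at": 0.0},
    {"query": "small OLTP query", "cost": 2.0, "queued_at": 9.0},
]
# The cheap query outranks the expensive one despite waiting less time.
print(next_candidate(waiting, now=10.0)["query"])
```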

 

Note that not every query needs a memory reservation. Typically, a query needs one if its execution plan contains sort, hash, or bitmap operators. Since an index build requires a sort, it always needs a memory reservation. A query that needs no memory reservation is executed immediately.

- Senqiang Zhou

Query Execution

Index Build strategy in SQL Server - Part 1: offline, serial, no partitioning


         Builder (write data to the in-build index)

                           |

                     Sort (order by index key)

                           |

                     Scan (read data from source)

 

In order to build the b-tree for the index we have to first sort the data from source.  The flow is to scan the source, sort it (if possible - in memory*), then build the b-tree from the sort. 

Why do we need to sort before building the b-tree?  In theory we don’t have to: we could use regular DML and directly insert the data into the in-build index with no sort. But in that case we would be doing random inserts, and a random insert into a b-tree requires searching the b-tree for the correct leaf node before inserting the row. While searching a b-tree is fairly fast, doing so before every insert is far from optimal. So for an index build operation, we sort the data in the sort order of the new index; inserting into the in-build index is then not a random insert but an append operation, which is why it can be much faster than random inserts.

While moving data from the sort to the index builder, we free each extent of the sort table as soon as all of its rows have been copied.  In this way we reduce the overall disk space consumption from a possible 3*(index size) (source + sort table + b-tree) to approximately 2.2*(index size).

 

*We do not guarantee an in-memory sort; whether we can sort in memory depends on the memory available and the actual row count. An in-memory sort is naturally fast (and the disk space requirements are more relaxed in this case, because we don’t have to allocate space for the sort table on disk), but it is not required; we can always spill data to disk, although performance is then much slower.

 

For an index build operation, sort spills go to the user database (the database where we build the index) by default, but if the user specifies SORT_IN_TEMPDB, we spill to tempdb instead.

Each sort table (even when we have very little data to sort) requires at least 40 pages (320KB) of memory to run (later we will see that with parallelism we can have several sort tables at the same time). When calculating memory for the sort, we try to allocate enough to sort entirely in memory.  For large index build operations it is unlikely that the entire sort will fit in memory. If we cannot provide at least 40 pages, the index build operation fails.

 

The last step of an index build is always to build full statistics. Good statistics help the query optimizer generate better plans; users can issue CREATE STATISTICS or UPDATE STATISTICS commands to force SQL Server to generate or refresh statistics on an object.  When we build a new index we have to touch every row anyway, so we use the opportunity to build full statistics at the same time as a side benefit.

 

Conclusion:

To build a non-partitioned index offline with a serial plan, we need free disk space (in the user database or in tempdb) of approximately 2.2*(index size), and at least 40 pages of memory available to the query executor to start the process.
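As a back-of-the-envelope check, the conclusion above can be expressed as a small Python sketch. The 2.2x factor and the 40-page minimum are the approximations from this post, and SQL Server pages are 8KB:

```python
PAGE_KB = 8  # SQL Server page size

def serial_offline_build_requirements(index_size_gb):
    """Approximate resource needs for an offline, serial, non-partitioned
    index build: ~2.2x the index size of free disk space, plus at least
    40 pages of memory for the single sort table."""
    free_disk_gb = 2.2 * index_size_gb
    min_memory_kb = 40 * PAGE_KB  # 320 KB
    return free_disk_gb, min_memory_kb

# A hypothetical 10 GB index
disk_gb, memory_kb = serial_offline_build_requirements(10)
print(disk_gb, memory_kb)
```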

 

 

Read in next post: Index Build Scenario 2: Offline, Parallel, No Partitioning

 

Posted by: Lyudmila Fokina

Query Execution Timeouts in SQL Server (Part 2 of 2)


Checklist for time out errors

 

Memory pressure: in most cases timeouts are caused by insufficient memory (i.e. memory pressure). There are different types of memory pressure, and it is very important to identify the root cause. The following articles are a good starting point on this issue:

 

http://blogs.msdn.com/slavao/archive/2005/02/19/376714.aspx

http://blogs.msdn.com/slavao/archive/2005/02/01/364523.aspx

http://www.microsoft.com/technet/prodtechnol/sql/2005/tsprfprb.mspx#EWIAC (includes a link that explains DBCC MEMORYSTATUS)

 

In particular, we should pay attention to the size of the buffer pool (since it is the source of query execution memory grants) and the size of memory held by query execution. You can use this simple query to get the size of the buffer pool:

 

select sum(virtual_memory_committed_kb) from sys.dm_os_memory_clerks where type='MEMORYCLERK_SQLBUFFERPOOL'

 

The following query gives the size of memory held by query execution (available starting with SQL Server 2005 SP1):

 

select sum(total_memory_kb) from sys.dm_exec_query_resource_semaphores

 

Note: please be cautious when using sys.dm_exec_query_memory_grants and sys.dm_exec_query_resource_semaphores with an ORDER BY clause or a JOIN on a loaded system, since such a query may itself require a memory grant and may experience a query execution timeout. This is true even if you use the DAC connection. The DAC has pre-committed memory for normal allocations, but not for memory grants; it must use the regular resource semaphore for memory grants.  The difference is that a DAC query does not wait for a memory grant and may force a minimum grant, which will likely make an out-of-memory condition worse.

 

It is important to point out that more physical memory does not necessarily mean more memory for query execution. The memory that query execution can use is limited by the process-addressable virtual address space, which is normally 2GB on 32-bit architectures. So on a 32-bit system, the maximum memory query execution can use is around 1.7GB, since the operating system and other SQL Server components (like optimization) need memory as well. In practice, you should expect around 1.2GB for query execution, since other components will quite likely require more memory on a loaded system. There is no such 2GB limit on a 64-bit system.

 

Option “min memory per query” (BOL link). This option sets the minimum amount of memory (in kilobytes (KB)) that is allocated for the execution of a query. The default value is 1024 (KB) and the minimum allowed value is 512. Don’t make it too large if there are many small ad hoc queries: that simply wastes memory, since small queries won’t make full use of it.

 

Option “max server memory” (BOL link). This option controls the maximum size of buffer pool, which is the source of query execution memory. If it is too small, there won’t be many queries running at the same time. Make sure this option is set to a reasonably large value.

 

Option “query wait” (BOL link). This option specifies the time in seconds a query waits for memory before it times out. Check whether it is set properly. We recommend leaving it at the default, which is calculated as 25 times the estimated query cost.
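The rule just described can be restated as a small sketch, assuming the default behaves as "timeout = 25 x estimated cost, in seconds" per this post; this is an illustration of the rule, not engine code:

```python
def query_wait_timeout(estimated_cost, query_wait=-1):
    """Timeout (in seconds) a query waits for its memory grant.  With the
    default option value (-1), the timeout is 25 times the estimated
    query cost; a non-negative 'query wait' setting is used verbatim."""
    return 25 * estimated_cost if query_wait < 0 else query_wait

print(query_wait_timeout(12))       # default: 25 * 12 = 300 seconds
print(query_wait_timeout(12, 60))   # explicit setting wins: 60 seconds
```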

 

Update statistics. The amount of memory granted is based mainly on the cardinality estimates, so updating statistics can improve their accuracy and perhaps reduce wasted memory reservation.  On the other hand, if statistics are out of date, AUTOSTAT can kick in during compilation, which typically needs a large amount of memory because it has to sort rows. If AUTOSTAT cannot get its grant, the stale statistics are used instead.

 

 

Identify the queries consuming (or about to consume) the most memory. If you ever decide to kill queries to free up memory, it is more efficient to kill those that are using, or will use, a large amount of memory. Of course, executing a big query at this point will only make the situation worse.  The following query shows the memory required by both running queries (non-null grant_time) and waiting queries (null grant_time):

select requested_memory_kb, grant_time, cost, plan_handle, sql_handle

from sys.dm_exec_query_memory_grants

 

Before you decide to kill a query, it is always recommended to check its showplan and investigate whether the plan cost and/or memory requirement exceed your expectations. You can use plan_handle to retrieve the showplan from sys.dm_exec_query_plan and sql_handle to retrieve the SQL text from sys.dm_exec_sql_text.

Index Build strategy in SQL Server - Part 2: Offline, Parallel, No Partitioning


The type of parallel index build plan in SQL Server depends on whether or not a histogram with the necessary statistics is available. Therefore, there are two broad categories of parallel index plans:

  • Histogram available
  • No histogram

 

Histogram available (parallel sort and build): 

 

              X (Exchange)

   |          \            \

         Builder… Build…  Build… (write data to the in-build index)

                           |           |            |

                      Sort…      Sort…  Sort … (order by index key)

                           |          /            /

                       Scan (read data from source)

 

This type of parallel index build is chosen when statistics are available (hence range partition information is available and can be used to identify the data distribution).

 

How does the scan happen in this case?  We must have some statistics on the leading key column, so if we don’t have stats we generate sample statistics to determine whether and how to parallelize the index build operation. In some situations, however, we are not able to build sample stats (for example, for an indexed view), and then a different, no-stats index build plan is generated. Using the statistics histogram we can identify the data distribution (divide the data into several buckets), which lets us load-balance the work among the workers in the parallel plan; it also helps us make the DOP (degree of parallelism) decision to achieve high utilization of system resources. Using the histogram’s row count estimates for each bucket, the workload is split into N ranges (N = DOP), one for each worker, in an attempt to balance the work among all workers.
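The load-balancing idea, splitting the histogram's row estimates into DOP contiguous ranges, can be pictured with a greedy sketch. The bucket counts are made up, and this is not the optimizer's actual partitioning algorithm:

```python
def split_into_ranges(bucket_rows, dop):
    """Greedily split histogram buckets into `dop` contiguous ranges of
    roughly equal row counts, one range per worker.  A simple
    illustration of the load-balancing idea, not the optimizer's
    actual algorithm."""
    target = sum(bucket_rows) / dop
    ranges, current, acc = [], [], 0
    for rows in bucket_rows:
        current.append(rows)
        acc += rows
        # Close a range once it reaches the target, keeping one range
        # open for the remaining buckets.
        if acc >= target and len(ranges) < dop - 1:
            ranges.append(current)
            current, acc = [], 0
    ranges.append(current)
    return ranges

# Six histogram buckets split among three workers (DOP = 3)
print(split_into_ranges([100, 120, 90, 110, 95, 85], dop=3))
```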

Each worker uses a range partition scan to read the data belonging to its range and builds its own sort table and a b-tree from that sort table; every worker has its own sort table, and all the b-trees are disjoint. The coordinator thread then stitches all the b-trees together at the end of the index build operation, we build full statistics on the new b-tree, and we are done.

 

Parallel Index Build with histogram available can give us the best performance.

On the downside it consumes more memory and will fail if there is not enough (because there are DOP sort tables in total, one per worker). You can use the MAXDOP option to reduce the maximum degree of parallelism used for the index build and, as a result, the minimum memory required. You can run sp_configure to find the default setting for ‘max degree of parallelism’ on the server; max degree of parallelism = 0 means the server uses the actual number of available CPUs, depending on the current system workload. You can explicitly limit the number of processors used in parallel plan execution.

For example:

Create index idx_t on t(c1, c2)

WITH (MAXDOP = 2)

-- limit # of processor to use for index build to 2

 

 //Next time - Non stats plan (no histogram) index build plan

Posted by: Lyudmila Fokina

Index Build strategy in SQL Server - Part 2: Offline, Parallel, No Partitioning (Non stats plan (no histogram))


Build (serial) (write data to the in-build index)

                          |

                X (Merge exchange)

                           /          |           \

                      Sort…      Sort…  Sort …(order by index key)

                           |           |            |

                       Scan…    Scan… Scan…(read data from source)   

 

When a histogram is not available (for example, when we are building an index on a view) we can’t use the methods described in the previous post. Statistics can be gathered only on a ‘real’ object; since we are building the index on a view, the index does not exist yet and we cannot gather sample stats on the view, which is why we get a different plan here. So we use a regular parallel scan, which is not aware of the data distribution at all.

 

How this works:

We scan the source data in parallel using a regular parallel scan, but build the b-tree serially.

Each worker scans pages from the heap the same way as in the previous parallel scan method. When the scan is done, a sort table is created and populated for each worker; each worker maintains its own sort table. Because these sort tables are not disjoint, we cannot simply build separate b-trees and ‘stitch’ them together; instead, the data from all the workers’ sort tables is merged into a single final sorted output. From that point on, the index build operation is serial, as we have one final output: the index builder consumes data from the Merge Exchange and builds the final index.
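The merge step can be pictured with Python's heapq.merge standing in for the Merge Exchange operator: each worker's sorted run (its sort table) is merged into one sorted stream that feeds the serial index builder. The key values below are made up for illustration.

```python
import heapq

# Sorted runs produced by three workers.  Because the runs overlap
# (they are not disjoint key ranges), they must be merged, not
# concatenated, before the serial b-tree build.
worker_runs = [
    [1, 4, 9],
    [2, 4, 7],
    [3, 5, 6],
]
final_sorted_output = list(heapq.merge(*worker_runs))
print(final_sorted_output)
```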

Why is this plan relatively slow?  Because the index is built serially, plus there is some extra overhead introduced by the merge exchange.

 

 

Memory consideration:

In a parallel index build we build multiple sort tables concurrently, so the basic memory requirement is higher and the calculation is a bit different.  For the memory calculation we have 1) required memory and 2) additional memory.  Each sort requires 40 pages of required memory.  Say we have DOP = 2, so we have 2 sort tables and need 80 pages of required memory; but the total additional memory remains the same regardless of the DOP setting, because the total number of rows remains the same.  For example, if the serial plan needs 500 pages of additional memory, then the parallel plan makes the same request for additional memory, and each worker gets 500/DOP pages of additional memory plus 40 pages of required memory.
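The memory arithmetic above can be sketched as follows; this is a restatement of the numbers in this post, not engine code:

```python
SORT_REQUIRED_PAGES = 40  # required memory per sort table

def parallel_build_memory(dop, additional_pages):
    """Memory grant for a parallel index build per the rules above:
    required memory scales with DOP (one 40-page sort table per worker),
    while the total additional memory is fixed and split among workers."""
    required_total = SORT_REQUIRED_PAGES * dop
    per_worker = additional_pages / dop + SORT_REQUIRED_PAGES
    total = required_total + additional_pages
    return required_total, per_worker, total

# The example from the post: DOP = 2, 500 pages of additional memory
print(parallel_build_memory(2, 500))
```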

Index Build strategy in SQL Server - Part 3: Offline Serial/Parallel Partitioning


There are two main categories of partitioned index build:

  • Aligned (when the base object and the in-build index use the same partition scheme)
  • Not aligned (when the heap and the index use different partition schemes, including the case when the base object is not partitioned at all while the in-build index is)

(see Index Build strategy in SQL Server - Introduction (II))

 

 

Aligned partitioned index build

 

Aligned partitioned serial index build

 

     NL

                /       \

             CTS   Builder (write data to the in-build index)

                           \

                        [Sort] (order by index key) <-- optional

                             \

                          Scan (read data from source)

 

CTS: Constant Table Scan (the purpose of CTS is to provide partition IDs for index builder)

NL: Nested Loop

 

In the case of an aligned partitioned index build, the Constant Table Scan provides partition IDs to the inner side of the Nested Loop, so we can build one partition at a time: for each partition ID provided by the CTS (the outer side of the NL), the inner side builds the index for that partition (not the entire index). A sort table is created for each partition, but since we process the partitions one by one and build the final b-tree for each partition one by one, we don’t need to keep the sort tables for all partitions at the same time. As a result we have only one sort table at any given moment.

How does this affect disk space requirements?

-         When sorting in the user’s database (the default setting) we actually sort in each filegroup for each corresponding partition, which means we need the same 2.2*(size of the partition) of free space in each filegroup. For example, say we have 3 partitions located in filegroups FG1, FG2, and FG3, and the index data takes 1GB, 2GB, and 3GB respectively. We then need 2.2*1 = 2.2GB of free space in FG1, 2.2*2 = 4.4GB in FG2, and 2.2*3 = 6.6GB in FG3, i.e. 13.2GB of free space in total on the disk(s).

-         When using the SORT_IN_TEMPDB = ON option we can reuse the same tempdb space for the sort table, and since we sort the partitions one by one we actually need only 2.2*(size of the biggest partition) of free space in tempdb. In the example above, that is 2.2*3 = 6.6GB of free space in tempdb (this lets us build the smaller partitions first and then reuse the space when building the biggest partition).
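The two disk-space rules above can be captured in a short sketch; the 2.2x factor is the approximation used throughout these posts:

```python
def aligned_serial_disk_gb(partition_sizes_gb, sort_in_tempdb=False):
    """Free disk space needed for an aligned, serial partitioned index
    build.  Sorting in the user database needs ~2.2x each partition's
    size in its own filegroup; with SORT_IN_TEMPDB the space is reused
    partition by partition, so only the biggest partition matters."""
    if sort_in_tempdb:
        return 2.2 * max(partition_sizes_gb)
    return sum(2.2 * size for size in partition_sizes_gb)

# Partitions of 1, 2, and 3 GB, as in the example above
print(aligned_serial_disk_gb([1, 2, 3]))                       # per-filegroup total
print(aligned_serial_disk_gb([1, 2, 3], sort_in_tempdb=True))  # tempdb requirement
```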

 

Memory consideration:

We will have only one sort table at a time, and we need at least 40 pages for it to be able to start the index build operation. So the minimum required memory is 40 pages.

Total memory = minimum required memory + additional memory*.

 

*Additional memory is calculated as the row size multiplied by the estimated row count provided by the Query Optimizer.

 

Posted by: Lyudmila Fokina


Index Build strategy in SQL Server - Part 3: Offline Serial/Parallel Partitioning (Aligned partitioned parallel index build)


Aligned partitioned parallel index build

 

In the case of a parallel build we scan and sort partitions in parallel, and the actual number of sort tables existing at the same time depends on the actual number of concurrent workers. Partitions are picked up by workers one by one: when a worker completes one partition, it takes the next partition not yet taken by another worker. Each worker builds 0 to N partitions (we do not share one partition among multiple workers).  Why can it be 0?  If DOP > the number of partitions, then there are not enough partitions to hand out to all the workers. Which partition a given worker works on is non-deterministic per execution: first come, first served.

Since we never share one partition among several workers, the biggest partition becomes a bottleneck: it can happen that all the other workers have completed their partitions while one worker is still sorting the biggest one alone. This also means the resources used by this query (such as memory and threads) are not available to other queries during that time.

 

There is no final stitch among workers for a partitioned index; each partition is represented as a separate b-tree in the storage engine.

 

How does this affect disk space requirements?

-         When sorting in the user’s database (the default setting) the requirements are the same as for the serial build: since we sort in each filegroup for each corresponding partition, we need 2.2*(size of the partition) of free space in each filegroup.

-         When using SORT_IN_TEMPDB = ON we do not get the same advantage as in the serial build, because several sort tables may exist at the same time, and since we don’t know the actual distribution of data among the partitions, we still require 2.2*(size of the whole index) of free space in tempdb.

 

Memory consideration:

We will have several sort tables at the same time (depending on DOP and the number of partitions), and we need at least 40 pages per sort table to be able to start the index build operation. So the minimum required memory is DOP*40 pages.

Total memory = 40 * DOP + additional memory.

 

Note that the additional memory does not change between serial and parallel plans, because the total number of rows we need to sort remains the same in both.

Hash Warning SQL Profiler Event


One of the less well-known warning events logged to a SQL Profiler trace is the Hash Warning event.  Hash Warning events are fired when a hash recursion or hash bailout occurs during a hashing operation.  Both situations are undesirable: they mean a Hash Join or Hash Aggregate has run out of memory and been forced to spill to disk during query execution.  When a hashing operation spills to disk, query performance almost always suffers, and space consumption in tempdb can increase.

 

Note that the Hash Warning event must be explicitly enabled within SQL Profiler; it is not one of the default set of events.  More info on SQL Profiler can be found here.

 

What can be done if you see a lot of Hash Warning events?  The recommended actions are:

 

·         Make sure that statistics exist on the columns that are involved in the hashing operation.  Without statistics, the hashing operation has no way to know how much memory to pre-allocate.

·         Even if statistics do exist, try updating them, as this can give more current information to the hashing operation when it decides how much memory to allocate.

·         Try using a different type of join (this can be done by hinting OPTION(MERGE JOIN) or OPTION(LOOP JOIN)).  Note that forcing a join type does not necessarily guarantee a better execution plan.

·         If all of these fail, you can increase the available memory on the computer.

 

A sample of what you would see in Profiler looks like the following.  Note the batch starting, followed by a number of Hash Warning events prior to batch completion.  The SPID causing the events is also recorded.

 

EventClass          StartTime                SPID
SQL:BatchStarting   2007-02-01 18:53:34.703  51
Hash Warning        2007-02-01 18:53:48.267  51
Hash Warning        2007-02-01 18:53:48.283  51
Hash Warning        2007-02-01 18:53:50.097  51
Hash Warning        2007-02-01 18:54:05.300  51
SQL:BatchCompleted  2007-02-01 18:54:19.130  51

 

- Steve Herbert

SQL Server Query Execution

How to Check Whether the Final Query Plan is Optimized for Star Join Queries?


The star join optimization technique is an index based optimization designed for data warehousing scenarios to make optimal use of non-clustered indexes on the huge fact tables. The general idea is to use the non-clustered indexes on the fact table to limit the number of rows scanned from it. More details of index based star join optimization can be found at http://blogs.msdn.com/bi_systems/pages/164502.aspx.

 

The following discussion is based on SQL Server 2005 query plans. 

 

In SQL Server 2005, Showplan XML includes a “StarJoinInfo” element to highlight star join optimization. If the query plan contains the “StarJoinInfo” element, SQL Server has identified the plan as a star join plan, and it definitely is one.

 

However, the query optimizer may not detect all star join plans, due to restrictions in star join detection, so some star join plans will not carry the StarJoinInfo element. This post sheds some light on how to manually determine whether a given query plan is a star join plan.

 

These steps can help you identify what’s NOT a star join plan:

  • First identify your fact table.
  • If you see a clustered index scan (or a table scan) on the fact table, then it is not an index-based star join plan (it may still be a valid multi-table join plan, which can benefit from multiple-bitmap-filter pushdown).

To identify a star join plan, you should:

  • Again, first identify your fact table
  • You should have a single RID lookup or clustered index seek on the fact table
  • Restrictive dimensions (dimension tables with restrictive filters in the query) should be processed before processing the clustered index seek or RID lookup on the fact table. You should find either:
    • A Cartesian product between the dimensions joined with a multi-column index on the fact table.
    • Or a semi-join of the dimensions with some non-clustered indexes on the fact table.
    • Or a join between multiple dimensions.
  • Non-restrictive dimensions are joined later with the fact table.

So the rule of thumb is: to detect whether a given plan uses index-based star join optimization, always look for a seek on the fact table that is driven by joins of the dimension tables.
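Before inspecting the operators manually, it is quickest to capture the compile-time Showplan XML and search it for the StarJoinInfo element (the fact and dimension table names below are hypothetical):

```sql
SET SHOWPLAN_XML ON;
GO
-- Hypothetical star join query over one fact and two dimension tables.
SELECT d.CalendarYear, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimDate AS d ON f.DateKey = d.DateKey
JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
WHERE p.Category = 'Bikes'
GROUP BY d.CalendarYear;
GO
SET SHOWPLAN_XML OFF;
GO
-- Search the returned XML for "StarJoinInfo". If it is present, the plan
-- is definitely a star join plan; if absent, fall back to the manual checks.
```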

Index Build strategy in SQL Server - Part 4-1: Offline Serial/Parallel Partitioning (Non-aligned partitioned index build)

$
0
0

Recall that in the previous posts on index build, we defined "aligned" as the case when the base object and the in-build index use the same partition scheme, and "non-aligned" as the case when they use different partition schemes, or when the base object is not partitioned. In this post, we will talk about the two scenarios of non-aligned partitioned index build: source partitioned and source not partitioned.

Source Not Partitioned
Consider the following query.

Create Partition Function pf (int)
as range right for values (1, 100, 1000)

Create Partition Scheme ps as Partition pf
ALL TO ([PRIMARY])

Create table t (c1 int, c2 int) -- the table is created on the primary filegroup by default

Create clustered Index idx_t on t(c1) on ps(c1) -- non-aligned index build

The serial plan is straightforward.

 Index Insert (write data to the in-build index)
   |
 Sort (order by index key)
   |
 Scan (read data from source)

The sort iterator creates one sort table per target partition (there are four partitions in this example, so we construct four sort tables concurrently). By default, we use the user database for the sort to spill data. As we mentioned before, we free each extent from the sort table after all its rows are copied. By doing this, for each partition we can reduce the disk space requirement from 3 * partition size (source + sort table + b-tree) to about 2.2 * partition size. Therefore, each filegroup requires 2.2 * (size of all partitions that belong to that filegroup) of disk space. If the user specifies SORT_IN_TEMPDB, then all the sort tables reside in tempdb instead, and we require 2.2 * (size of the whole index) of free space in tempdb.
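As a sketch of the SORT_IN_TEMPDB variation, the index build from the example above could be written as follows (tempdb then needs roughly 2.2 * the size of the whole index in free space):

```sql
CREATE CLUSTERED INDEX idx_t ON t(c1)
WITH (SORT_IN_TEMPDB = ON)  -- sort tables spill to tempdb, not the user database
ON ps(c1);
```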

The index insert iterator can start building the index only after the sort iterator finishes sorting all the sort tables. Therefore, we will have as many sort tables in existence at the same time as there are partitions. Recall that each sort table requires at least 40 pages, so the minimum required memory is #PT * 40 pages, where #PT is the number of target partitions.
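As a quick sanity check for the four-partition example above, the minimum sort memory works out as follows (assuming the usual 8 KB page size):

```sql
-- #partitions * 40 pages per sort table * 8 KB per page
SELECT 4 * 40 * 8 AS min_sort_memory_kb;  -- 1280 KB
```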

The parallel plan looks like this:

 X (Exchange)
   |
 Index Insert
   |
 Sort
   |
 Scan

Each worker thread is assigned (partition count) / (degree of parallelism) partitions (e.g. if we have 4 partitions and 4 worker threads, each gets 1 partition), and this assignment can be skewed. The sort iterator creates one sort table per assigned partition. Each worker scans the source once, retrieves the rows that belong to its partition(s), and inserts each row into the corresponding sort table based on which partition it belongs to.

After all the sort tables are populated, the index builder starts consuming rows from them, one sort table after another, building b-tree(s) in each partition's filegroup. Currently, workers do not share partitions, so it is possible for one worker to finish all of its assigned partitions and sit idle while another worker is still busy inserting rows.

The disk space and memory requirements are exactly the same as the serial plan. This is because in both cases, we cannot start building the index until all the sort tables are populated.

Index Build strategy in SQL Server - Part 4-2: Offline Serial/Parallel Partitioning (Non-aligned partitioned index build)

$
0
0

Source Partitioned
While the table is partitioned, we may want to change the way it is partitioned when building the new index. For example, while using the same partition function and scheme, the new index can be partitioned on a different column than the original table.

Create table t (c1 int, c2 int) on ps(c2)

…….

Create clustered Index idx_t on t(c1) on ps(c1)

The serial plan looks like follows.
    Index Insert
       |
     Sort
       |
      NL (Nested Loop)
     /    \
 CTS   Scan
CTS is the Constant Table Scan. It enumerates the partitions one by one and provides each partition ID to the inner side (the lower input in graphical showplan) of the NL join. The inner side of the NL scans the corresponding partition and sends the data to the sort iterator. From there on, the plan behaves exactly as in the source-not-partitioned scenario. Not surprisingly, the memory and disk space requirements are the same too.

In the case of parallel plan, we have
      X (Distribute Streams)
       |
     Index Insert
       |
     Sort
       |
       X (Repartition Streams)
       |
      NL
     /   \
   X    Scan
  /
CTS
The operator above CTS is an exchange with one producer and many consumers (it distributes the partition IDs from the serial CTS to the parallel workers). There is no parallelism below this operator. Between this exchange and the Repartition Streams, each worker is assigned (number of source partitions) / (degree of parallelism) source partitions. The source is only scanned once in total.

The Repartition Streams operator splits the query plan into two parallel zones. Between the top-level Distribute Streams and the Repartition Streams, we have a different set of workers than the worker set below the Repartition Streams. Each worker in the upper set is assigned (number of target index partitions) / (degree of parallelism) target partitions. The target index partition information is pushed down to the Repartition Streams, which redirects each row to the right sort table based on its target partition. The rest, i.e. how the sort and index building work, is the same as in the parallel plan for the source-not-partitioned case. Again, the memory and disk space requirements are the same as in the source-not-partitioned case.
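Since each worker's share is (number of partitions) / (degree of parallelism), capping the degree of parallelism directly controls how many partitions each worker handles. A sketch, reusing the objects from the earlier example:

```sql
CREATE CLUSTERED INDEX idx_t ON t(c1)
WITH (MAXDOP = 4)  -- at most 4 workers; with 4 target partitions, one per worker
ON ps(c1);
```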
