Friday 2 December 2011

Clustered Indexes in SQL Server: Things You Need to Know


This article covers a few more advanced topics about the use of clustered indexes in SQL Server. Not only will I try to convince you of the absolute necessity of using clustered indexes, I'll also give you some tips on how to use them in some less obvious contexts.


Why You Need Clustered Indexes!

As you know, the SQL Server engine treats a table without a clustered index, also called a heap, differently from a clustered table. A clustered table has its data pages logically ordered by the index key and chained together in a linked list. A heap, on the other hand, uses the IAM and PFS pages to find a page with enough free space to hold a new row when no space is available in the current page (more details in Books Online).
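By the way, you can quickly check which of the two structures a table uses by querying sys.indexes: a heap shows up with index_id 0, a clustered table with index_id 1.

-- Quick check: heap or clustered table?
-- index_id 0 / type_desc 'HEAP' = heap; index_id 1 / 'CLUSTERED' = clustered table.
SELECT i.index_id, i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.TestTable')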
To illustrate this difference in behavior, I created a small script you can run for yourself.
-- Init
USE tempdb
SET NOCOUNT ON

IF OBJECT_ID('dbo.TestTable', 'U') IS NOT NULL DROP TABLE dbo.TestTable
CREATE TABLE dbo.TestTable (GUID_col_indexed uniqueidentifier, CHAR_col char(10), GUID_col uniqueidentifier)
CREATE CLUSTERED INDEX IX1 ON dbo.TestTable (GUID_col_indexed)
GO
-- Step 1: Load TestTable with 100,000 records
declare @t int
set @t = 1
while @t <= 100000
begin
    -- Each char(rand()*26 + 65) produces one random uppercase letter (A-Z);
    -- ten of them concatenated fill the char(10) column.
    insert TestTable values (
        newid(),
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) +
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65),
        newid())
    select @t = @t + 1
end
go
-- Step 2: Select 100 random values
select top 100 * from TestTable order by GUID_col
go
-- Step 3: Delete 1,000 random values
delete TestTable where GUID_col in (select top 1000 GUID_col from TestTable order by GUID_col desc)
go
-- Step 4: Insert 1,000 more values
declare @t int
set @t = 1
while @t <= 1000
begin
    insert TestTable values (
        newid(),
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) +
        char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65) + char(rand()*26 + 65),
        newid())
    select @t = @t + 1
end
go
This script is split into 4 steps:
  1. Load 100,000 records into a table using randomly generated values for all fields.
  2. Select 100 random values.
  3. Delete 1,000 random values.
  4. Load 1,000 more records.
To generate random values, I used uniqueidentifiers. I deliberately did not use the indexed column as a SARG (search argument) in any of the queries, to avoid giving either structure an advantage related to the physical ordering of the "random" values.
You should run this script twice: first exactly as it appears above (creating the index as clustered), and then leaving out the CLUSTERED keyword so the table remains a heap. You will get results comparable, I hope, to the results I got after a few test runs:
                                            CPU             Reads            Duration
                                            Heap    Cluster Heap    Cluster  Heap    Cluster
Step 1: Load table with 100,000 records     5094    4063    374627  281336   5640    4926
Step 2: Select 100 random values            78      78      671     942      97      84
Step 3: Delete 1,000 random values          641     719     9551    7698     811     834
Step 4: Insert 1,000 more values            218     47      5717    3076     210     45
Even without exploiting the specific strengths of the clustered index, these results clearly show the internal overhead associated with heap structures.
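If you want to reproduce these measurements yourself, one simple option (an alternative to running a Profiler trace) is to switch on the per-statement statistics before each step: CPU and elapsed time come from SET STATISTICS TIME, the reads from SET STATISTICS IO.

-- Report CPU time and elapsed time per statement.
SET STATISTICS TIME ON
-- Report logical/physical reads per statement.
SET STATISTICS IO ON
GO

-- ... run one of the four steps here ...

SET STATISTICS TIME OFF
SET STATISTICS IO OFF
GO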


Clustered Indexes on uniqueidentifier Columns: Out of the Question! Or Not?

Many whitepapers discourage the use of clustered indexes on a uniqueidentifier column. Because the values are generated randomly, records are inserted all over the data pages of the filegroup your table belongs to. This is clearly an overhead: far more extents are swapped into and out of memory, page splits are more likely to occur, and so on.
But in some VLDB environments with high transactional throughput, this can be a solution when I/O throughput is a potential bottleneck. Suppose you have a large (>100 GB) table on a filegroup spread over 10 physical disks, with a clustered index on a [date-inserted] column. When many inserts occur in combination with lots of updates of recently added records, all data traffic is concentrated on a small set of physically adjacent data pages, producing a hot spot. The volume of dirty pages the checkpoint process wants to flush to disk can become too high for a single disk to handle. This is where a clustered index on a uniqueidentifier column can help you: instead of flushing all dirty pages to one disk, the I/Os are spread over all physical disks.
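As a minimal sketch of such a design (table and column names are hypothetical, not from a real system): the hot-spotted date key is replaced by a random GUID key, while the date column stays available for range queries through a nonclustered index.

-- Hypothetical example: spread inserts over all pages (and disks)
-- by clustering on a randomly generated uniqueidentifier.
CREATE TABLE dbo.Orders (
    order_guid    uniqueidentifier NOT NULL DEFAULT NEWID(),
    date_inserted datetime         NOT NULL DEFAULT GETDATE(),
    payload       char(200)        NOT NULL
)
CREATE CLUSTERED INDEX IX_Orders_guid ON dbo.Orders (order_guid)
-- Date range queries can still be served by a nonclustered index.
CREATE NONCLUSTERED INDEX IX_Orders_date ON dbo.Orders (date_inserted)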
Pay special attention to fragmentation when using this solution. Because of the random inserts, page splits occur more often, causing more fragmentation. If your data volumes are too high for a daily rebuild, perform a general index rebuild every week. I personally prefer the CREATE INDEX ... WITH DROP_EXISTING statement to rebuild indexes, as sketched below.
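A sketch of the monitoring and the rebuild, reusing the hypothetical dbo.Orders table from above (the 80% fill factor is just an illustrative choice to leave room for random inserts between rebuilds):

-- Check the fragmentation of the clustered index (index_id = 1).
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Orders'), 1, NULL, 'LIMITED')

-- Rebuild the clustered index in place.
CREATE CLUSTERED INDEX IX_Orders_guid ON dbo.Orders (order_guid)
    WITH (DROP_EXISTING = ON, FILLFACTOR = 80)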


Conclusion

  1. Always use a clustered index on an OLTP database table.
  2. If you consider using a clustered index on a uniqueidentifier column:
    1. Test your solution in a stress test environment.
    2. Rebuild indexes regularly.
