If you use Microsoft SQL Server 2000, you likely have run into the following problem: You want to select a random sampling of rows from a large table with lots of rows, but you are unsure of how to do so. Having a random sampling of rows can be useful when you want to make a smaller version of the table or if you want to troubleshoot a problem by seeing what kinds of rows are in the table.
To get a random sampling, you might be tempted to select the top n rows from the table. However, this sample is not random, and the first n rows are not necessarily representative of the whole table. Other solutions exist that involve adding columns to the tables; however, adding columns is not always possible or practical.
The standard way to grab random rows from a small table is to use a query such as the following:
SELECT TOP 10 PERCENT * FROM Table1 ORDER BY NEWID()
The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random; so, when you sort by that GUID with theORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.
Often, when questions about how to select random rows are asked in discussion groups, the NEWIDquery is proposed; it is simple and works very well for small tables. However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:
- The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
- In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.
No comments:
Post a Comment