Monday, August 29, 2005

String Summary Statistics

I found this interesting bit of information in this article:

SQL Server 2005 includes patented technology for estimating the selectivity of LIKE conditions. It builds a statistical summary of substring frequency distribution for character columns (a string summary). This includes columns of type text, ntext, char, varchar, and nvarchar. Using the string summary, SQL Server can accurately estimate the selectivity of LIKE conditions where the pattern may have any number of wildcards in any combination. For example, SQL Server can estimate the selectivity of predicates of the following form:

Column LIKE 'string%'
Column LIKE '%string'
Column LIKE '%string%'
Column LIKE 'string'
Column LIKE 'str_ing'
Column LIKE 'str[abc]ing'
Column LIKE '%abc%xy'
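To make "selectivity" concrete: it is the fraction of rows that a predicate matches. The Python sketch below is only an illustration of that definition, not SQL Server's estimation algorithm (which works from the string summary rather than scanning rows). The helper names like_to_regex and selectivity are hypothetical. The sketch assumes case-sensitive matching and handles only %, _, and simple bracket sets.

```python
import re

def like_to_regex(pattern):
    """Translate a simple LIKE pattern (%, _, [set]) into a compiled regex.

    Hypothetical helper for illustration; ESCAPE clauses and collation
    rules are deliberately not handled.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            parts.append(".*")          # any run of characters, including none
        elif ch == "_":
            parts.append(".")           # exactly one character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1 or end == i + 1:
                parts.append(re.escape(ch))   # unmatched or empty set: literal '['
            else:
                body = pattern[i + 1:end].replace("\\", "\\\\")
                parts.append("[" + body + "]")
                i = end                 # skip past the closing bracket
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)

def selectivity(values, pattern):
    """Exact fraction of values matching the LIKE pattern (0.0 for no rows)."""
    if not values:
        return 0.0
    rx = like_to_regex(pattern)
    return sum(1 for v in values if rx.fullmatch(v)) / len(values)

rows = ["string", "stringy", "substring", "other"]
print(selectivity(rows, "string%"))   # 0.5
```

An optimizer cannot afford to scan the table like this at compile time, which is why it relies on a precomputed summary to estimate the same fraction.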

If there is a user-specified escape character in a LIKE pattern (i.e., the pattern is of the form LIKE pattern ESCAPE escape_character), then SQL Server 2005 guesses selectivity. This is an improvement over SQL Server 2000, which uses a guess for selectivity when any wildcard other than a trailing wildcard % is used in the LIKE pattern, and has limited accuracy in its estimates in that case.

The String Index field in the first row set returned by DBCC SHOW_STATISTICS includes the value YES if the statistics object also includes a string summary. The contents of the string summary are not shown. The string summary includes additional information beyond what is shown in the histogram.

For strings longer than 80 characters, the first and last 40 characters are extracted from the string and concatenated prior to considering the string in the creation of the string summary. Hence, accurate frequency estimates for substrings that appear only in the ignored portion of a string are not available.
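The 80-character rule is easy to picture with a small Python sketch. This is only a model of the rule as quoted, not SQL Server's internal code. The function name summary_input is hypothetical, and it assumes "characters" map to Python string positions.

```python
def summary_input(value, limit=80, keep=40):
    """Return the text that would be considered for the string summary.

    Hypothetical model of the quoted rule: strings longer than `limit`
    characters are reduced to their first and last `keep` characters.
    """
    if len(value) <= limit:
        return value
    return value[:keep] + value[-keep:]

# A substring that lives only in the middle of a long value is lost.
long_value = "a" * 40 + "MIDDLE" + "b" * 40
print("MIDDLE" in summary_input(long_value))   # False
```

So a predicate such as Column LIKE '%MIDDLE%' would not get an accurate frequency estimate from the summary if the substring occurs only in the dropped middle portion of long values.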
