Friday, April 06, 2007

WinDbg and SQL Server minidumps

Today a couple of our test servers started to act very weird. Some jobs were running incredibly slowly, and eventually most of them crashed, causing a minidump to be generated. The error log pointed to I/O problems with latch timeouts, but there were no details on exactly which file was involved.

The SAN and NAS management is done by another team, and we had to pinpoint the problem to convince them that it was really an I/O issue. So how exactly do you do this on a server with several databases and processes?

Off I went to my favorite developer, Bregt: I knew he had good knowledge of WinDbg and could help us pinpoint the issue. Lucky for me, Bregt is a nice guy and he explained to me how I could find the cause of the crash.

Download WinDbg, which is part of the Debugging Tools for Windows package provided by Microsoft.

Start WinDbg (Run As Administrator on Vista) and configure the symbol path as follows (CTRL-S).

The URL can be found in the WinDbg help, and the local symbol path is a folder you choose (and create) yourself. Don't forget to save your workspace when it asks you to (thanks again Bregt).
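
For reference, a symbol path along these lines works, whether you paste it into the symbol dialog or set it with the .sympath command (C:\symbols is just an example folder you create yourself; the server URL is the public Microsoft symbol server documented in the WinDbg help):

    .sympath srv*C:\symbols*http://msdl.microsoft.com/download/symbols
    .reload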

After you have done this you can open a crash dump (CTRL-D); this might take a while depending on the size of the dump.

Type ~kv in the command line; this gives you the call stack with the memory addresses.

Using ALT-5 you get a memory viewer, where you can check the contents of memory for the last call. It is possible, however, that nothing is returned for the memory address, since not everything is included in a minidump. In our case the last call was "ntdll!ZwCreateFile+0xa", and this directed us to the exact file name: it was a server-side trace that was being saved to a NAS drive.
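
If you prefer to stay in the command window, the memory viewer has command-line equivalents; this is only a sketch, and the address below is a placeholder for whatever you pick from the ~kv output:

    $$ dump memory at an address taken from the ~kv output (placeholder address)
    db 0x03a4f8c0
    $$ the same address interpreted as a Unicode string, handy for file names
    du 0x03a4f8c0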


How cool is that?

I feel like digging a little deeper into WinDbg when I find some time. A colleague of ours (Hans De Smaele) has written a great document on this topic.

Wednesday, April 04, 2007

SQL Server 2005 SP2 - Another hotfix

This time we are at build 3159 with KB934459.
Those darn maintenance plans sure are causing mayhem lately :-)

Tuesday, April 03, 2007

Detach/Attach migration

In our effort to consolidate some environments I had to migrate a reporting database from SQL Server 2000 to SQL Server 2005. One of the easiest ways of migrating is a simple detach and attach (sp_attach_db is deprecated and is being replaced by CREATE DATABASE ... FOR ATTACH in SQL Server 2005).
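
As a sketch (the database name and file paths are invented for the example), the two sides of the migration look roughly like this:

    -- on the SQL Server 2000 instance
    EXEC sp_detach_db @dbname = N'Reporting';

    -- copy the .mdf and .ldf files over, then on the SQL Server 2005 instance
    CREATE DATABASE Reporting
        ON (FILENAME = N'D:\SQLData\Reporting.mdf'),
           (FILENAME = N'E:\SQLLogs\Reporting_log.ldf')
        FOR ATTACH;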

After the attach, some options have to be changed, like the compatibility level, PAGE_VERIFY CHECKSUM, etc., and as a last step I issued sp_updatestats.
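
In T-SQL that comes down to something like the following (database name invented again; sp_dbcmptlevel is how you change the compatibility level on SQL Server 2005):

    EXEC sp_dbcmptlevel @dbname = N'Reporting', @new_cmptlevel = 90; -- 90 = SQL Server 2005
    ALTER DATABASE Reporting SET PAGE_VERIFY CHECKSUM;

    USE Reporting;
    EXEC sp_updatestats;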

When I started the test run I noticed that it was taking a very long time, so I started issuing some sp_who and sp_lock statements. The output showed that the same locks were being held for an unreasonable amount of time, to the point that tempdb ran out of space. I had 8GB of tempdb space, and the query only had to return 3 rows from a couple of tables with fewer than 100 rows. I decided to run the query again and monitor tempdb usage with the new DMV sys.dm_db_file_space_usage; I immediately saw tempdb allocations rise at an incredible rate.
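
The check against the DMV was something along these lines (the page counts are multiplied by 8 to get KB):

    SELECT SUM(user_object_reserved_page_count) * 8     AS user_objects_kb,
           SUM(internal_object_reserved_page_count) * 8 AS internal_objects_kb,
           SUM(version_store_reserved_page_count) * 8   AS version_store_kb,
           SUM(unallocated_extent_page_count) * 8       AS free_space_kb
    FROM tempdb.sys.dm_db_file_space_usage;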

I decided to drop all the existing statistics and issue the same query again; this time it immediately returned the 3 rows without any trouble and without any unusual tempdb usage.

Very odd behavior, but it shows the importance of statistics and the apparently subtle differences between SQL Server 2000 and SQL Server 2005. Dropping the statistics is easy in development/test, but for production environments you might want to script the existing statistics and recreate them afterwards. For some unknown reason, sp_updatestats was not as effective as expected after an attach from a SQL Server 2000 database.
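
A sketch of how you could generate the DROP STATISTICS statements for all user-created and auto-created statistics (index statistics are excluded because they cannot be dropped separately); run the generated statements afterwards:

    SELECT 'DROP STATISTICS ' + QUOTENAME(sch.name) + '.' + QUOTENAME(o.name)
           + '.' + QUOTENAME(st.name) + ';'
    FROM sys.stats AS st
    JOIN sys.objects AS o   ON o.object_id   = st.object_id
    JOIN sys.schemas AS sch ON sch.schema_id = o.schema_id
    WHERE o.is_ms_shipped = 0
      AND (st.auto_created = 1 OR st.user_created = 1);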

*UPDATE*
Things like this are always a bit more complicated than they seem at first. After today's import of a new set of data, the query went back to its bad behavior. A strange effect was that the query worked well if I took a date range of a couple of months, but failed with an interval of 1 day. The key here was of course the fact that, because of a bad estimation, the query plan was completely different in the two cases. Creating one index on a single field in one of the ten tables involved in the query resulted in a different query plan being chosen and completely solved the problem.
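
Purely as an illustration (the table and column names here are invented, not the real ones from the reporting database), the fix boiled down to something like:

    CREATE NONCLUSTERED INDEX IX_ReportFact_ReportDate
        ON dbo.ReportFact (ReportDate);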

Yet again a couple of important things are proven:
  • The differences in the query optimization engine between SQL Server 2000 and SQL Server 2005 make it important to test every query
  • Statistics are of great importance to the query optimizer so watch them carefully and make sure they are kept up-to-date
  • Having proper indexes is an absolute necessity for every environment
  • Any change, however small it may seem, could have a great impact on your environment