Wednesday, July 25, 2007

MemToLeave Issue in SQL Server 2000 x86

This week we were getting "WARNING: Failed to reserve contiguous memory of Size= 131072." errors on a couple of our SQL Server 2000 machines. Since we had a lot of available memory we quickly suspected the cause to be VAS fragmentation but we needed to find the root cause and confirmation that is was indeed VAS fragmentation. This is a bit harder since SQL Server 2000 has a lot less options to investigate this as opposed to SQL Server 2005.

VAS (Virtual Address Space) is the 4 GB region (in 32-bit) where 2 GB is reserved for the kernel mode and the other 2 GB for user mode (unless you specify /3GB). Every memory allocation (including AWE mapping) for an application happens in the user mode portion of the VAS.

By default SQL Server reserves 256MB in the MemToLeave region and an additional part that is calculated as worker threads * stack size (0.5MB for SQL Server 2000 x86). This memory is reserved in the VAS and used for special memory allocations like extended stored procedures, sp_OA procedures, CLR integration and some specific memory managers. One of the properties of MemToLeave memory is that it is static and it can only be increased at startup by using the -g switch. If you wish to do this do it with great caution as VAS is only 2 GB or 3 GB on a 32 bit system. As a side note, allocations from this region are done by the Multi Page Allocator in SQL Server 2005, meaning it is not stolen from the buffer pool and consequently is not included in the Max Server Memory setting.

We contacted PSS to help us find the root cause and they instructed us to start SQL Server with -T3654 which allows you to run a command that returns information about the memory allocations SQL Server has done: DBCC MEMOBJLIST. We would have resolved the whole issue by restarting SQL Server and limit our chances to find the root cause. Since it was not a production server we decided to wait a little with the restart and continue looking for the root cause. It was a good decision since on of my colleagues noticed that there were a lot of trace files created at the same time. We have a central mechanism that uses server side traces to profile all our servers and something had gone seriously wrong this weekend. When we issued a select * from ::fn_trace_getinfo(0) on the server it showed us that there were 270 traces running. Probably because of a little bug the tool started an incredible amount of traces on the server.

When we issued a DBCC MEMORYSTATUS we got the following results:

Dynamic Memory Manager Buffers

Stolen 3728
OS Reserved 37528
OS Committed 37506
OS In Use 37502
General 1513
QueryPlan 2420
Optimizer 0
Utilities 36900
Connection 220

According to KB271624 'Utilities' is used by various utilities routines like BCP, Log Manager, parallel queries, ::fn_trace_gettable and others. Others also means buffers used for tracing. Starting the 200+ traces caused our VAS to get heavily fragmented and thus leaving us with a lot of different errors like "There is not enough storage to complete this operation", "Failed to reserve contiguous memory", login failures, etc. As you can see the 'Utilities' memory manager has 36900 buffers which is a massive amount considering the limited MemToLeave size.

So remember, even if you have lots of available memory you can still run out of memory!

No comments: