Thursday, July 19, 2007

Even more USERSTORE_TOKENPERM

Reading this and this post reminded me of our very bad experience with the USERSTORE_TOKENPERM cache.

When we ran into our USERSTORE_TOKENPERM issue last year there was no real awareness of the problem yet (SQL Server 2005 x64 - Enterprise Edition Build 2153). It was so bad we had to revert to SQL Server 2000 because things got so slow that SQL Server started to become unresponsive. After we got everything back to the original server with SQL Server 2000 we had a couple of weeks of detective work in front of us. Having an issue with this much impact needs to be investigated and the root cause must be found before you can even think about migrating again. We had done all the steps needed before migrating like stress testing, functional jobs, system jobs, ... and yet we did not see the behavior it was showing in production.

We have spent quite some time on the phone with an Escalation Engineer from Microsoft trying to pinpoint the root cause of our problem. After we convinced the engineer that the root cause was not parallelism as he suggested we finally got some vital information that helped us understand what was happening. When he told us about the USERSTORE_TOKENPERM cache and its behavior we started running some tests and now knew what we were looking for. Sure enough the size of this cache just kept growing and the remove count certainly did not.

select
distinct cc.cache_address,
cc.name,
cc.type,
ch.clock_hand,
cc.single_pages_kb + cc.multi_pages_kb as total_kb, cc.single_pages_in_use_kb + cc.multi_pages_in_use_kb as total_in_use_kb,
cc.entries_count, cc.entries_in_use_count, ch.removed_all_rounds_count, ch.removed_last_round_count, rounds_count
from sys.dm_os_memory_cache_counters cc join sys.dm_os_memory_cache_clock_hands ch on (cc.cache_address =
ch.cache_address)
where cc.type = 'USERSTORE_TOKENPERM'
order by total_kb desc

Since it was announced that this problem would be fixed in Service Pack 2 we decided to create some test scripts that really put stress on this cache. We tried to come as close as possible to our production situation at that time which was sp_executesql for the CUD operations (UpdateBatch). Putting pressure on this cache was quite easy, have different query text so they get cached as different query plans and execute them twice since the first time it just gets inserted in the cache and the problem comes with the lookup action. We forced this behavior by executing queries with varchar parameters that increase in length every iteration. Also remember to use a login that is not a sysadmin since this causes SQL Server to skip the TokenAndPermUserStore check. We ended up creating a couple of scripts to do just this and also to monitor the cache size. The 'engine' behind it are simple command files and SQLCMD. Unfortunately I can not post the scripts since they are the property of the customer.

As far as we have seen from our tests the behavior has certainly improved in SP2. It takes a lot longer for the duration of the queries to increase and it does not increase to the point where SQL Server becomes unresponsive nor does the duration become unacceptably high. Since we are really pushing the cache to the limit it does still cause some increase in duration but not nearly to the point where SP1 gets it. We do notice however that under the constant 'hammering' of these queries even SP2 has the greatest trouble to keep the cache under control even when we call DBCC FREESYSTEMCACHE('TokenAndPermUserStore'). Obviously this scenario is not realistic since we really are trying to make it go wrong. Not only did the TokenAndPermUserStore behavior change in SP2 they cut down the size of the plan cache dramatically in SP2 too.

The conclusion however is that under Service Pack 2 we do not experience the same catastrophic behavior as before.

A lot of time has passed and we are going to migrate to SQL Server 2005 on the 9th of September. We are using a whole new .NET Framework on the application tier with a whole new data access layer, so a lot of the factors that existed that last time have now changed. We are starting our stress tests next month, if I get the chance and if we find anything peculiar I will let you know.

Hopefully this time there will not be any surprises :-(

Tuesday, July 17, 2007

The target of 'replace' must be at most one node

One of the new features of SQL Server 2005 is the integration of XML in the database. Although XML in the database is not my favorite architecture there are situations where it is useful. Since we have not really used XML in a real life project before we are slowly learning its limitations.

A limitation we recently discovered is modifying more than 1 node at a time using XML DML. This limitation is documented in the Books Online under the "replace value of" topic: "If multiple nodes are selected, an error is raised."

The only idea we could come up with is to loop through the nodes until all the nodes have been updated. Lucky for us SQL Server also supports mixing XML with special functions like sql:variable() and sql:column().

Check out the code here since blogger is giving me issues with the XML string :-(

Monday, July 16, 2007

Windows Home Server - RTM

There was not such a great buzz about this edition (at least not here in Belgium) but nevertheless it is an interesting product. It contains some nice features like a home network health monitor which makes sure all updates are applied on your home PC's, full system restores, automatic backup and many more.

Check out the announcement here and more information here