Thursday, July 19, 2007

Even more USERSTORE_TOKENPERM

Reading this and this post reminded me of our very bad experience with the USERSTORE_TOKENPERM cache.

When we ran into our USERSTORE_TOKENPERM issue last year there was no real awareness of the problem yet (SQL Server 2005 x64 - Enterprise Edition Build 2153). It was so bad we had to revert to SQL Server 2000 because things got so slow that SQL Server started to become unresponsive. After we got everything back to the original server with SQL Server 2000 we had a couple of weeks of detective work in front of us. Having an issue with this much impact needs to be investigated and the root cause must be found before you can even think about migrating again. We had done all the steps needed before migrating like stress testing, functional jobs, system jobs, ... and yet we did not see the behavior it was showing in production.

We have spent quite some time on the phone with an Escalation Engineer from Microsoft trying to pinpoint the root cause of our problem. After we convinced the engineer that the root cause was not parallelism as he suggested we finally got some vital information that helped us understand what was happening. When he told us about the USERSTORE_TOKENPERM cache and its behavior we started running some tests and now knew what we were looking for. Sure enough the size of this cache just kept growing and the remove count certainly did not.

select
distinct cc.cache_address,
cc.name,
cc.type,
ch.clock_hand,
cc.single_pages_kb + cc.multi_pages_kb as total_kb, cc.single_pages_in_use_kb + cc.multi_pages_in_use_kb as total_in_use_kb,
cc.entries_count, cc.entries_in_use_count, ch.removed_all_rounds_count, ch.removed_last_round_count, rounds_count
from sys.dm_os_memory_cache_counters cc join sys.dm_os_memory_cache_clock_hands ch on (cc.cache_address =
ch.cache_address)
where cc.type = 'USERSTORE_TOKENPERM'
order by total_kb desc

Since it was announced that this problem would be fixed in Service Pack 2 we decided to create some test scripts that really put stress on this cache. We tried to come as close as possible to our production situation at that time which was sp_executesql for the CUD operations (UpdateBatch). Putting pressure on this cache was quite easy, have different query text so they get cached as different query plans and execute them twice since the first time it just gets inserted in the cache and the problem comes with the lookup action. We forced this behavior by executing queries with varchar parameters that increase in length every iteration. Also remember to use a login that is not a sysadmin since this causes SQL Server to skip the TokenAndPermUserStore check. We ended up creating a couple of scripts to do just this and also to monitor the cache size. The 'engine' behind it are simple command files and SQLCMD. Unfortunately I can not post the scripts since they are the property of the customer.

As far as we have seen from our tests the behavior has certainly improved in SP2. It takes a lot longer for the duration of the queries to increase and it does not increase to the point where SQL Server becomes unresponsive nor does the duration become unacceptably high. Since we are really pushing the cache to the limit it does still cause some increase in duration but not nearly to the point where SP1 gets it. We do notice however that under the constant 'hammering' of these queries even SP2 has the greatest trouble to keep the cache under control even when we call DBCC FREESYSTEMCACHE('TokenAndPermUserStore'). Obviously this scenario is not realistic since we really are trying to make it go wrong. Not only did the TokenAndPermUserStore behavior change in SP2 they cut down the size of the plan cache dramatically in SP2 too.

The conclusion however is that under Service Pack 2 we do not experience the same catastrophic behavior as before.

A lot of time has passed and we are going to migrate to SQL Server 2005 on the 9th of September. We are using a whole new .NET Framework on the application tier with a whole new data access layer, so a lot of the factors that existed that last time have now changed. We are starting our stress tests next month, if I get the chance and if we find anything peculiar I will let you know.

Hopefully this time there will not be any surprises :-(

No comments: