Friday, September 07, 2007

SQL Server 2005: Migration imminent and something about TF4618

So we are back at the point of no return. This Sunday we will be migrating our largest SQL Server 2000 to SQL Server 2005. During the first week we will use transactional replication to ensure a 'backout' in case of failure. Last year when we attempted to migrate this server we unfortunately hit the TokenAndPermUserStore issue and had to rollback the migration.

As explained in this post we created some tests script which were able to degrade the performance significantly. There were a couple of hotfixes for issues concerning the TokenAndPermUserStore cache with the latest being build 3179. We eventually did our tests on build 3186 which is the version we will be going to production with, unfortunately we were still able to degrade the performance with our scripts.

We looked at a couple of alternatives to prevent this decrease from happening and as far as we know there are 3 ways to prevent the TokenAndPermUserStore cache from growing too large and these are:
  • add the user(s) that access the database to the sysadmin role
  • create a scheduled job that issues DBCC FREESYSTEMCACHE ('TokenAndPermUserStore')
  • use TF4618

After we ran our tests with trace flag 4618 we found that the TokenAndPermUserStore cache remained very small (less than 1MB) and our response times were very stable. The catch with this trace flag is that it has to be enabled by adding -T4618 to the SQL Server startup parameters (using DBCC TRACEON does not help in this case). What this trace flag does is evict old entries when new entries get inserted to keep the cache small. Because this extra work might cause an increase in CPU usage be sure to carefully monitor this. We did not see any significant increase in our stress tests.

We also decided to take 1 more "risk" and that is a reorganization of our database file layout. A legacy layout because of the previous SAN was 4 files for data and 4 files for the non-clustered indexes. Since we noticed a great difference in response times of the luns between the 2 filegroups we now use 1 filegroup spread over the 8 luns, this resulted in a great balance over all luns and improved response times.

After many hours of reorganizing, stress testing, replication testing and going mad we are finally there. We had days of paranoia and days of euphoria and we ended with euphoria, hopefully it will remain this way. Monday will tell us if everything was as well as we suspected so I suppose the only thing that remains is to wish us luck :-)