Recently, I've been working with a consulting client to help them tame CPU usage. Typically, they're clocking over 20,000 queries and operations per second on their e-commerce SQL Server, but ran into an issue where CPU started spiking up above 90 percent for sustained periods of time and even managed to climb clear up into the 100 percent range for periods at a time (at which point throughput dropped down to around 17,000 queries/operations per second on the server).
Related: ScaleArc iDB 2.0
Without much effort, I was able to quickly help them isolate a handful of 'rogue' queries that were burning up the CPU for a number of different reasons. Within a day or two, we had the CPU running down more in the 40-60 percent utilization rate as we kept working on some additional tuning and efforts to get CPU utilization down more into the 20-30 percent usage range—to help them prepare for some higher-traffic periods coming up in a few weeks where traffic and operations on the server will increase dramatically.
To address the need to be able to handle much higher amounts of traffic within a few weeks, I started taking a look at options to help provide the client with more processing power or capacity. Currently, their server is a 4-socket box running just two 8-core Xeon X7550s. Technically, they could, therefore, slap additional processors into the mix to pick up additional processing power.
Only, there are a few catches. First, these processors retail for around $2,800 each—and since the server in question is a cluster, that means we'd be looking at roughly $12,000 in processor hardware alone. More importantly, though, this client is currently running SQL Server 2008 R2 Standard Edition—which is artificially crippled to just 64GB of RAM and only 16 cores. As such, if they were to slap additional processors into the mix, they wouldn't be paying the roughly $24,000 to license these additional cores that they would be if they could stick with Standard Edition. Instead, they'd be looking at paying closer to $288,000 (minus whatever they could get in terms of an 'upgrade') to bring this processing power on-line with Enterprise Edition.
Needless to say, that's quite a huge jump in pricing.
So, I got busy, crunched some numbers, and figured out a way to save them a few hundred thousand dollars—while boosting application performance and giving them better options for overall and long-term scalability. My guess is that they'll save around $200,000. (In hindsight, I should probably charge clients based on a percentage of what I save them rather than based on a flat hourly rate.)
By crunching the numbers and looking at the operations being run against this particular server, I was able to quickly spot a very high degree of 'repeat' queries—or sprocs (i.e., stored procedures) and queries that are 'fired' off over and over against the server in a short amount of time. Granted, one of the benefits of sprocs is that they're designed to easily enable different 'signatures' or combinations of input parameters or variables. But, even accounting for different 'signatures,' there was still a high degree of re-use. In fact, in sampling queries/operations being thrown against the server in just the space of 2 minutes, I was able to determine that we could safely save nearly 50 percent of the CPU we were currently using IF we just had some means of NOT sending the exact same queries back to the server over and over again. Or, in other words: if we could somehow cache the results of a given query or sproc (i.e., matching its exact set of parameters and inputs with the exact same outputs returned just seconds earlier), then we could see some huge CPU benefits within literally minutes.
Granted, not all queries or operations can be cached. For operations that modify data, caching obviously won't work. Likewise, some data can't be cached for long because it's volatile. But even fairly volatile data can, in fact, be cached for even 5, 10, 30, or even 60 seconds in most environments. And with a high degree of repeat 'signatures' being fired off in rapid succession within just a few seconds against stale or semi-stale data, caching can become a huge win. Because the best way to scale is to NOT do something (i.e., the more you can avoid running queries on the server, the faster and better your app will be).
In this case, the numbers were compelling. By just looking at the top 10 most CPU-intensive queries/sprocs on the server, it was clear that caching would reduce CPU utilization by over 75 percent within just 4 minutes. The hard part, though, was that there's currently no means of caching deployed in the current applications using this server.
My favorite programming joke states that "there are two things that are hard about development: naming things, cache invalidation, and off-by-one errors." While funny, I also love this joke because cache invalidation IS tough. Without some decent architecture and best practices around maintaining and invalidating cached data, it's easy for applications to become seriously messed up and obscenely buggy.
Because of that, and because of the complexities involved, adding caching capabilities into existing applications can be very non-trivial and very expensive to tackle. In fact, for the kinds of applications hitting this server, I'd wager that the price-tag to 'throw' caching into the mix (even with full-time developers on staff) would be in the $60,000 to $120,000 range in terms of all of the development, testing, and deployment costs that it would incur. As such, the absolute best time to implement caching is always before you even start writing code (i.e., in the sense that you build your apps with full-blown caching capabilities and/or frameworks in place before you get started and when the costs are, effectively, trivial).
There is, however, a fairly magical way to add caching into the mix—even with big, older, established applications—and that's through the means of a proxy, or something that sits between your applications and the database server, intercepts and interprets commands being sent to the database server, and caches outputs when and where feasible. And, in fact, that's exactly what ScaleArc's iDB does. (Well, technically, it does a TON more than this—including providing load-balancing capabilities and a host of other features.)
Starting around $10,000 per license, and providing the ability to 'instantly' and 'magically' throw caching into the mix, ScaleArc ends up being a much cheaper option (for this client) than a scale-up which would require a jump up to licenses of SQL Server Enterprise. ScaleArc iDB also ends up being much cheaper and easier to implement than trying to 'build' caching capabilities into their existing apps. With iDB, all the client needs to do is set up servers (or a team/cluster of iDB servers to ensure fault-tolerance), load them up with RAM, and then switch their applications' connection strings to connect to the iDB server(s) instead of the database server, and they're good to go. From that point, the client can then log into the iDB server/interface, view top resource-consuming queries and operations, and then use a GUI to specify which queries/signatures should be cached (and for how long) and they'll instantly have insanely powerful caching capabilities at you disposal. And, in the case of this particular server or environment, with some additional tuning and optimization, they should be able to easily hit 60,000 operations/second by only spending a fraction of what they would have spent otherwise.
Related: Find SQL Server Cost Savings