The one and only Scalability Metric

How do you measure Scalability? Is there a metric that's going to work for (almost) all scenarios?

Going from Performance to Scalability

Algorithmic Scaling can refer to the ability of a module to handle varying sizes of data in a reasonable amount of time. Here, we are interested in seeing that as the size of the data increases, the time taken to process it does not increase unreasonably. For instance, an O(n log n) Quick Sort is better than an O(n^2) Bubble Sort. Another perspective on Scaling is that of Concurrency, which is the typical scenario for multi-user systems (such as websites, web servers, API servers, etc.). Here, we are interested in observing that when many users access the system simultaneously, the system is able to serve them all in reasonable time. Most of the time, Scalability refers to this perspective.
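
To make the first perspective concrete, here is a minimal, self-contained sketch (not from the original project; the class and method names are mine) that times an O(n^2) bubble sort against the JDK's O(n log n) Arrays.sort as the input grows:

```java
import java.util.Arrays;
import java.util.Random;

public class AlgorithmicScaling {

    // O(n^2): every element is compared against (almost) every other element.
    static void bubbleSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            for (int j = 0; j < a.length - 1 - i; j++) {
                if (a[j] > a[j + 1]) {
                    int tmp = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int n : new int[] {1_000, 5_000, 25_000}) {
            int[] data = random.ints(n).toArray();

            int[] copy = Arrays.copyOf(data, n);
            long start = System.nanoTime();
            bubbleSort(copy);
            long bubbleMs = (System.nanoTime() - start) / 1_000_000;

            copy = Arrays.copyOf(data, n);
            start = System.nanoTime();
            Arrays.sort(copy); // dual-pivot quicksort, O(n log n) on average
            long jdkMs = (System.nanoTime() - start) / 1_000_000;

            System.out.printf("n=%,7d  bubble sort: %5d ms  Arrays.sort: %3d ms%n",
                    n, bubbleMs, jdkMs);
        }
    }
}
```

As n grows, the bubble sort's time balloons while the O(n log n) sort stays comfortable; that widening gap, not any single timing, is what algorithmic scaling is about.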

Performance is all about being the fastest: minimising the "time taken" to do the job (i.e. seconds/job). In a real implementation, however, more variables are involved, such as processing, (disk) I/O, memory, threads, etc. The nature of your implementation decides the extent to which these resources are consumed, and how they fare under various tests. With so many variables in play, is there really a single metric that can measure Scalability across algorithms, implementations and platforms?

While time is a key variable no matter what the perspective, Scalability is not about being the fastest. Rather, it is about being the most optimal in (almost) all scenarios. While Performance tries to minimise the time taken, Scalability is all about the work done!

"jobs / second"

That is it! Instead of looking at "seconds/job", you need to flip the ratio and start looking at "jobs/second" to measure Scalability. By inverting the number, the focus shifts from "minimising time taken" to "maximising work done". Those with a good handle on Maths are probably thinking that I've lost it, as the two numbers represent the same thing. I don't blame you. When I did my first big scalability project, it took some time for me to re-wire my brain.
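
Here is why the flip matters, as a tiny sketch with hypothetical numbers (not measurements from the project): once threads enter the picture, per-job latency can get worse while the system as a whole does far more work.

```java
public class FlipTheRatio {

    // Each thread completes 1/secondsPerJob jobs per second;
    // the system as a whole completes 'threads' times that.
    static double jobsPerSecond(int threads, double secondsPerJob) {
        return threads / secondsPerJob;
    }

    public static void main(String[] args) {
        System.out.printf("1 thread,  0.5 s/job -> %4.1f jobs/s%n", jobsPerSecond(1, 0.5));
        System.out.printf("8 threads, 0.7 s/job -> %4.1f jobs/s%n", jobsPerSecond(8, 0.7));
        // "seconds/job" says the second configuration is 40% slower;
        // "jobs/second" says it does roughly 5-6x more work in the same time.
    }
}
```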

A legacy implementation was at its worst during the peak hour. With our customer base rising, we were about to breach our SLAs. A re-architecture project was inevitable. Along with re-writing the feature, we were also figuring out how to record metrics. But as we tried different scenarios such as peak hour, background threads, pub-sub, etc., it soon became clear that measuring "time taken" was not adequate and revealed nothing about the scalability of the system.

If you've done even a simple scalability project involving some sort of disk I/O (such as database operations) and raw data processing in memory (such as sorting a large list) across multiple threads (for concurrency in a multi-user system), you know that the time taken to do each job always deteriorates. Resource contention (such as the availability of disk and CPU) and the overheads of concurrency (such as context switching between threads) see to that.

At that point, it hit me that I needed to flip the number. Due to the introduction of a new variable (threads), "seconds/job" was no longer the relevant metric. It was "jobs/second". The chart below shows that as the number of threads increased, the performance metric kept deteriorating. But despite that deterioration, the system performed more jobs (on the whole) at 8 threads than it did at 5 or 13. That was the optimum utilisation of the system. And it was "jobs/second" that helped us realise that.
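
The workload behind that chart was our own (database I/O plus in-memory processing), so as an illustration only, here is a hypothetical sweep over thread counts with a simulated job. It reports both metrics side by side, and on most machines you will see "s/job" climb steadily while "jobs/s" peaks somewhere in the middle.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputSweep {

    // A stand-in job: a little blocking "I/O" wait plus a little CPU work.
    static void job() throws InterruptedException {
        Thread.sleep(20);                          // pretend database/disk wait
        long x = 0;
        for (int i = 0; i < 200_000; i++) x += i;  // pretend in-memory processing
        if (x == -1) System.out.print("");         // keep the loop from being optimised away
    }

    public static void main(String[] args) throws Exception {
        final int totalJobs = 500;
        for (int threads : new int[] {1, 2, 5, 8, 13, 21}) {
            AtomicLong totalJobNanos = new AtomicLong();
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            long wallStart = System.nanoTime();
            for (int i = 0; i < totalJobs; i++) {
                pool.submit(() -> {
                    long t0 = System.nanoTime();
                    try {
                        job();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    totalJobNanos.addAndGet(System.nanoTime() - t0);
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            double wallSeconds = (System.nanoTime() - wallStart) / 1e9;

            // Performance metric: average time taken per job (worsens with contention).
            double secondsPerJob = totalJobNanos.get() / 1e9 / totalJobs;
            // Scalability metric: work done by the whole system in unit time.
            double jobsPerSecond = totalJobs / wallSeconds;

            System.out.printf("%2d threads: %.3f s/job, %6.1f jobs/s%n",
                    threads, secondsPerJob, jobsPerSecond);
        }
    }
}
```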

If I were to calculate the metrics for more thread counts than listed on this chart, the "jobs/second" line would probably look like a bell curve. A higher number of threads doesn't mean more jobs will be completed in unit time. Neither does a faster implementation. And that is the catch with this metric: it is highly sensitive to several variables.

It is a moving target!

It was naive of me to think that the optimum thread count we discovered in a test environment would also work in production. The variables had changed. The database in the test environment had higher availability (despite running on slower hardware) because the live database was ... quite simply ... "live". With changes in the underlying hardware configuration and its utilisation in a dynamic, live environment (which had its own share of peak and off-peak hours), we couldn't take the absolute numbers from our test environment into production.

But we sure could take our learning. A little tweaking of the hardware and making some aspects of the module configurable soon gave us the optimum in our live environment. Since we had built metric tracking into the rewrite, we were able to record the behaviour of the system under various circumstances and adapt to the environment.
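
As an illustration of that last point, here is a hypothetical sketch (not the project's code; the property name is made up) combining the two ideas: a worker pool whose size is configurable from the outside, and a running jobs/second counter that can be recorded and compared across environments.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class ConfigurableWorkerPool {

    private final ExecutorService pool;
    private final AtomicLong completedJobs = new AtomicLong();
    private final long startedAtNanos = System.nanoTime();

    ConfigurableWorkerPool() {
        // e.g. -Dworker.threads=8 on the command line; default to the CPU count.
        int threads = Integer.getInteger("worker.threads",
                Runtime.getRuntime().availableProcessors());
        this.pool = Executors.newFixedThreadPool(threads);
    }

    void submit(Runnable job) {
        pool.submit(() -> {
            job.run();
            completedJobs.incrementAndGet();
        });
    }

    // The scalability metric, observable while the system is live.
    double jobsPerSecond() {
        double elapsed = (System.nanoTime() - startedAtNanos) / 1_000_000_000.0;
        return completedJobs.get() / elapsed;
    }
}
```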