Thursday 8 March 2012

Utilisation. Too busy to manage it?

Utilisation (or utilization if you prefer): in spite of being one of the most widely used terms to describe the performance of IT systems, it is often the most misunderstood. This post examines why that misunderstanding can have real business performance and cost implications.

Utilisation is a common measure of how much of a resource is, or was, being consumed. For storage resources such as physical disks & memory, this is an absolute measure of the available capacity in use, e.g. 500 gigabytes of data consumes 50% of a 1 terabyte disk. This is as intuitive and easy to understand as the glass of water analogy: half full or half empty, depending on your outlook.
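
As a trivial worked sketch of the capacity-based measure (using the same illustrative disk figures as above):

```python
# Capacity-based utilisation: data stored divided by available capacity.
used_gb, capacity_gb = 500, 1000          # 500 GB of data on a 1 TB disk
print(f"{used_gb / capacity_gb:.0%}")     # prints 50%
```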

Misunderstandings can arise, though, when utilisation is a measure of how much of the time a resource was busy. WAN links, CPUs or disk controllers are all resources which, whilst busy processing something, are 100% utilised. Only as the measurement interval increases does the utilisation become less than 100%. A CPU which is 100% busy for 30 seconds and then idle for 30 seconds will be reported as 50% utilised over a 1 minute sampling period.
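
As a minimal sketch (the 30-second burst and one-minute window are just the example above, not output from any particular monitoring tool), time-based utilisation is simply busy time divided by elapsed time:

```python
# Time-based utilisation: fraction of the sampling interval the resource was busy.
def time_based_utilisation(busy_seconds, interval_seconds):
    return busy_seconds / interval_seconds

# 100% busy for 30 seconds, then idle for 30 seconds, sampled over one minute:
print(time_based_utilisation(busy_seconds=30, interval_seconds=60))   # 0.5, reported as 50%
```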

This might suggest that a time-based utilisation measure is inherently flawed and inaccurate, because values reported by monitoring tools are always averaged, even the so-called peaks. In fact this average measure gives rise to something useful, because requests which can’t be serviced whilst a resource is busy must wait in a queue for an average time which is determined by the average utilisation.
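
To make that relationship concrete, here is a sketch which assumes a simple single-server queue with random arrivals (the classic M/M/1 model; the post doesn't prescribe a model, but the shape of the curve is the point): the average queuing delay grows sharply as utilisation climbs towards 100%.

```python
# Average time a request waits in the queue before service, for a single-server
# queue with random arrivals (M/M/1): wait = service_time * U / (1 - U).
def average_wait(service_time, utilisation):
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be between 0 and 1 (exclusive)")
    return service_time * utilisation / (1 - utilisation)

# e.g. a resource with a 10 ms average service time
for u in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilisation {u:.0%}: average wait {average_wait(0.010, u) * 1000:.0f} ms")
```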

The higher the utilisation of a resource, the higher the probability of encountering queuing and the longer the queues become. Requests are delayed in the queue for the combined service times of the preceding queued requests. So in a system where the throughput of individual components is known, the highest delays (bottlenecks, in other words) are encountered wherever the utilisation and service time are highest. Average utilisation then becomes a measure ‘by proxy’ of the queuing delay at each component.
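
Here is a small sketch with hypothetical component figures (the numbers and the single-queue approximation are mine, purely to illustrate the reasoning): the delay at each component is approximated as service time divided by (1 - utilisation), and the bottleneck is simply the component contributing the largest delay.

```python
# Hypothetical components: (average service time in seconds, measured utilisation).
components = {
    "WAN link":        (0.020, 0.70),
    "App server CPU":  (0.005, 0.90),
    "Disk controller": (0.010, 0.60),
}

def residence_time(service_time, utilisation):
    # Approximate queuing delay plus service time at one component (single-queue model).
    return service_time / (1 - utilisation)

delays = {name: residence_time(s, u) for name, (s, u) in components.items()}
bottleneck = max(delays, key=delays.get)

for name, delay in delays.items():
    print(f"{name}: {delay * 1000:.1f} ms")
print(f"End-to-end delay: {sum(delays.values()) * 1000:.1f} ms; bottleneck: {bottleneck}")
```

In this made-up example the WAN link, not the 90%-utilised CPU, turns out to be the bottleneck, because the delay depends on the combination of utilisation and service time.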

Every part of an application’s response time is made up of network, CPU & controller queuing delays as well as the actual processing time. Poor response times mean a poor user experience and reduced productivity. So the time-based utilisation of the end-to-end system resources becomes a critical KPI for managing business performance, IT value perception and capacity costs.

In a converged and virtualised world, there is continual contention for resources, so ‘Quality of Service’ scheduling mechanisms which provide queuing control at each bottleneck become a necessary capability. These techniques allow cost efficiency to be maximised, driving resource utilisation as high as possible whilst protecting response times for priority services.
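
As a toy illustration of the idea (strict-priority scheduling is just one possible mechanism, and the request mix below is invented), serving a priority class first protects its delay even though the contended resource stays fully busy while the backlog drains:

```python
# A backlog of queued requests at one busy resource: (priority, arrival order, service ms).
# Lower priority number = more important.
backlog = [(1, 0, 5), (2, 1, 20), (1, 2, 5), (2, 3, 20), (1, 4, 5), (2, 5, 20)]

def drain(order):
    # Serve requests in the given order; return the average wait per priority class.
    waits, clock = {1: [], 2: []}, 0
    for priority, _, service in order:
        waits[priority].append(clock)   # time spent queued before service starts
        clock += service                # the resource is 100% busy throughout the drain
    return {p: sum(w) / len(w) for p, w in waits.items()}

print("FIFO average wait (ms):    ", drain(backlog))            # first come, first served
print("Priority average wait (ms):", drain(sorted(backlog)))    # priority class served first
```

In this toy case the priority class’s average wait drops from 25 ms to 5 ms at the cost of a modest increase for the lower class, while the resource remains fully utilised throughout.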

The good news is that utilisation is relatively easy to measure, more so than queuing delays. The real challenge is to collect and present the data in a coherent way and translate it into KPIs which reflect real performance impacts. Fail to understand, measure and control utilisation, however, and IT could be failing the business.
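
As a closing sketch of how easy the raw measure is to obtain, the following assumes a Linux host, where /proc/stat exposes cumulative per-state CPU time; utilisation over an interval is simply one minus the idle share of the elapsed time. Turning figures like this, gathered from every component on the end-to-end path, into coherent KPIs is the harder part.

```python
import time

def cpu_times():
    # First line of /proc/stat: cumulative jiffies per CPU state (user, nice, system,
    # idle, iowait, irq, softirq, steal, ...). Treat idle + iowait as 'not busy'.
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    return fields[3] + fields[4], sum(fields[:8])

idle_1, total_1 = cpu_times()
time.sleep(5)                       # sampling interval: the same averaging caveat applies
idle_2, total_2 = cpu_times()

utilisation = 1 - (idle_2 - idle_1) / (total_2 - total_1)
print(f"CPU utilisation over the last 5 seconds: {utilisation:.1%}")
```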