Saturday, 25 January 2014

Making decisions by the seat of IT’s pants?

In his brilliant book ‘The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty’ Dr Sam Savage shows us why we need to connect the ‘seat of the intellect to the seat of the pants’.

Savage’s message is that making decisions under uncertainty requires probabilistic methods rather than averages in everything from movie portfolios to the height of flood water*. A simpler message is that we can fly intuitively by the seat of our pants for some types of decisions but that failing to engage the brain for others could mean our pants end up around our ankles.

Does your IT organisation make decisions by the seat of the pants or the seat of the intellect? Does the word ‘pants’ make you smile? Be honest now.

I suspect that decisions and behaviours in IT organisations are distorted by intuition and judgement more than we’d like to admit. We all fall prey to these cognitive biases, however rational we like to think we are. Don’t worry, it’s human. So is our denial. However, when the effects of cognitive bias aren’t understood, recognised and mitigated, the stage is set for some questionable decision-making. There are great examples of how this plot unfolds in ‘Why plans fail – Cognitive Bias and Decision-Making’ by Jim Benson. IT is a regular cast member.

If you’re still not convinced, read ‘Thinking Fast and Slow’ by Daniel Kahneman and his earlier work with Amos Tversky. The former won a Nobel Prize for his work in behavioural economics which describes how our intuition, the ‘gut feel’, makes irrational decisions in the presence of uncertainty. Combine the ‘halo effect’ with the ‘illusion of knowledge’, throw in some ‘cognitive dissonance’ and we can also see why IT decision-makers are held up to be (and may even believe themselves) immune to these effects.

How do we know when decisions have been based on good evidence, bad evidence or faulty judgement and how do we know if they were successful? If IT is investing in a vendor’s technology, letting an outsourcing contract, reorganising, shedding headcount or making service improvements then it’s making bets on some outcome. Perhaps the outcome itself is a vague articulation of some flawed reasoning (but that’s another story). Our natural over-confidence and biases make success seem inevitable in our heads when there’s actually some probability of a negative outcome. If no-one is asking “Where’s the evidence?” then alarm bells should be ringing.

Making IT decisions using selective evidence or using biased sources such as vendors and suppliers can be a problem too, whether unintentional or deliberate. Decisions can also be distorted by a culture of fear and blame where evidence is suppressed or sanitised to duck punishment for ‘bad’ decisions whilst at the same time losing the ability to experiment, learn and celebrate ‘good’ ones.

There are also dangers lurking in the things which might be considered self-evident in some IT departments as Pfeffer & Sutton describe in their book ‘Hard Facts, Dangerous Half-Truths, and Total Nonsense: Profiting from Evidence-based Management’. Doing things which are believed to have worked elsewhere (but proof is scarce) or following deeply held (but untested) ideologies are just two examples. Does that remind you of a new technology hype, a ‘best practice’ framework, a new boss keen to make an impact? If unchallenged or unsupported by evidence these decisions could either seep into the collective sub-consciousness or breed resentment and even sabotage.

Evidence-based Management might sound like a new-fangled fad but it isn’t. Not where decisions really matter, in health for example. Take Stacey Barr’s recent post about why we should trust scientific evidence in spite of personal bias or even the media’s pursuit of a story. Ben Goldacre has done a lot to expose this kind of ‘Bad Science’ in the media and yet the IT media and their marketing paymasters have been jumping, unskeptical, from one bandwagon to another for decades.

Making good IT decisions needs evidence and evidence needs measurement. This is measurement in the broadest sense:

‘A quantitatively expressed reduction of uncertainty based on one or more observations’

according to Doug Hubbard’s ‘How to Measure Anything: Finding the Value of Intangibles in Business’. Hubbard, like Savage, is an exponent of probabilistic methods and creator of Applied Information Economics for measuring the value of information itself and reducing uncertainty in the variables most likely to have an economic impact.

Maybe IT has something to learn from other parts of the business about how to use Monte Carlo models and other statistical tools (from Lean and Six Sigma for example) to focus on the best improvement returns rather than relying on perception and intuition.
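To give a feel for what a Monte Carlo model looks like in practice, here is a minimal sketch in Python. It is not taken from Savage or Hubbard. It assumes a hypothetical project made of three tasks that run in parallel, each with a duration drawn from a triangular distribution, and every number in it is invented for illustration. It compares a plan built on the average task duration with the simulated finish time of the whole project.

```python
import random

# Hypothetical task duration range in weeks (illustrative assumptions only).
LOW, HIGH, MODE = 4, 12, 6


def simulate_finish_times(n_trials=10_000, seed=1):
    """Simulate finish times for a project with three parallel tasks.

    The project finishes when the slowest of the three tasks finishes.
    """
    rng = random.Random(seed)
    return [
        max(rng.triangular(LOW, HIGH, MODE) for _ in range(3))
        for _ in range(n_trials)
    ]


# Mean of a triangular distribution is (low + high + mode) / 3.
average_task = (LOW + HIGH + MODE) / 3
finish_times = simulate_finish_times()
mean_finish = sum(finish_times) / len(finish_times)
late_share = sum(t > average_task for t in finish_times) / len(finish_times)

print(f"Plan built on the average task duration: {average_task:.1f} weeks")
print(f"Simulated mean project duration:         {mean_finish:.1f} weeks")
print(f"Share of trials later than the plan:     {late_share:.0%}")
```

Under these assumptions, the plan based on averages is beaten by reality in most trials, because the project waits for its slowest task. A real model would need distributions estimated from evidence rather than made up.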

As shareholders, tax payers and employers, shouldn’t we expect IT leaders to make decisions using more robust methods and then prove that they’ve created, not destroyed, value? I’d like to think so.

*Savage’s dad Leonard Jimmie Savage was a giant of Bayesian decision theory in the 1950s so you might expect him to know a thing or two about probability.

Wednesday, 4 September 2013

Our Net Promoter Score is 74. So what?

A big thank you to everyone who completed the IT Performance Company Customer Survey. The results are in and here’s our Net Promoter Score compared with a clutch of IT Services companies:

IT Performance Company Customer Survey August 2013 of 31 respondents
compared with Temkin Group Q1 2012 Tech Vendor NPS Benchmark Survey

We can say, hand on heart, that our lovely survey respondents gave us a better Net Promoter Score (NPS) than 800 US IT professionals gave IBM IT Services last year. What does that tell us? Is this a valid comparison? Is NPS useful for internal IT providers?

The Net Promoter Score is undoubtedly fashionable at the moment and the market awareness driven by Satmetrix and Bain & Company seems to have endowed it with some credibility. According to Google Trends, interest is still on the increase.

Google Trends Web Search Interest and Forecast for 'Net Promoter Score'

The Net Promoter Score has its critics too. Wikipedia covers some of the arguments and Customer Champions have a nicely balanced article. It's right for measures to be held up for scrutiny, especially when attempting to use NPS to predict financial performance or benchmark disjoint groups (as with our example above). Equally we shouldn’t obsess about data quality perfection because the presence of types of error doesn’t invalidate the measure as long as we interpret it in context coupled with cause-effect improvement action.

NPS is derived from the response to the so-called ‘Ultimate Question’ :  “How likely are you (on a scale of 0-10) to recommend our company/product/service to your friends and colleagues?”. Respondents who score you a 9 or 10 are ‘Promoters’, those giving you 0-6 are ‘Detractors’ and the 7-8 are ‘Passives’. The % of Promoters minus the % Detractors yields the NPS score on a scale of -100 to +100.
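The arithmetic is simple enough to sketch in a few lines of Python. The function name and the sample responses below are hypothetical and are not our survey data.

```python
def net_promoter_score(scores):
    """Return the NPS (-100 to +100) for a list of 0-10 responses."""
    if not scores:
        raise ValueError("at least one response is needed")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("responses must be on the 0-10 scale")
    promoters = sum(s >= 9 for s in scores)   # 9 or 10
    detractors = sum(s <= 6 for s in scores)  # 0 to 6
    # Passives (7-8) count towards the total but towards neither group.
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical responses for illustration only.
sample = [10, 9, 9, 8, 7, 10, 6, 9, 3, 10]
print(round(net_promoter_score(sample)))  # prints 40
```

Here 6 Promoters and 2 Detractors out of 10 responses give 60% minus 20%, so an NPS of 40. The two Passives lower the score only by diluting the percentages.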

The recommendation question itself is powerful because it goes beyond satisfaction and loyalty to seek a measure of active advocacy. Customers are being asked if they would ‘sell’ on your behalf and in doing so make a prediction about the outcome and put their own reputation on the line. In essence they are expressing their confidence that you will do a good job and not embarrass them.

It’s very, very hard to get an NPS of 100 because even those who score you an 8 don’t get counted and the scale is very heavily weighted towards the Detractor buckets. The aggregate NPS score throws away the Likert Scale resolution so it's important to look at the distribution of scores, ours for example:

Distribution of Responses to 'Likely to recommend' question
in IT Performance Company Customer Survey August 2013 of 31 respondents.

Would we have shared this chart if there had been any Detractors? The point is that having both an aggregate comparator score and a differentiating scale yields useful insight from a single recommendation question. The insight is richer still if a second qualifying question is asked about what could have been done better eg. “What would have made you score us higher?”.

I’d expect all enlightened internal or external IT providers running ‘IT as a business’ to have some strategic customer objectives. These may be coated in slippery management-speak which needs to be scraped off to expose an observable performance result. See Stacey Barr’s post on weasel words.

If there’s an objective in the customer perspective of the IT strategy map like: “Customers trust us enough to recommend us” then the Net Promoter Score will be a strong candidate for an outcome measure. The responses to the open-ended qualifying question should give clues to the results which should drive improvements in things that customers value such as “Call-back commitments are met”.

Monopoly IT providers might be tempted to dismiss the relevance of the ‘Ultimate Question’ when their consumers don’t have an alternative, or external benchmarks aren’t valid (and anyway, their strategy is cost reduction). This is to reject the power of the recommendation question for hearing the ‘Voice of the Customer’, the foundation of Lean IT Thinking for reducing waste. Experiments should instead be used to analyse (by correlation) whether the Net Promoter Score is a strong indicator of some dimension of the IT service to prioritise improvements by customer value and answer questions like: “If we change X, what happens to NPS?”.

By virtue of simplicity and low cost, the two complementary NPS questions lend themselves to transactional interactions such as exit polls from service desks. Care is needed though to avoid types of sampling bias and distortion of the respondent's ‘stated preference’.

You don’t need a large sample to begin to see signals and to paraphrase Doug Hubbard: “Small measurements reduce uncertainty a lot”.  An IT provider might choose to use stratified NPS sampling to compare service perception gaps at different delivery points, internal or outsourced. NPS could be used for longitudinal tracking of how IT is improving its primary value streams. More in-depth, periodic surveys like variants of the Gallup CE11 will still be needed to get a fuller understanding of IT customer engagement.

As a small, growing professional services company, our very survival depends on reputation, trust and enduring relationships. If we have earned our customers’ willingness to promote us then we’re proud, honoured and probably doing a good job. So yes, the Net Promoter Score is an outcome measure in the Customer Perspective of our very own Balanced Scorecard.

A more recent Temkin 2013 Tech Vendor survey can be purchased here in which VMware came top (47) and CSC came bottom (-12).

Saturday, 27 April 2013

ITSM Tools Commit the 6 Dashboard Design Sins

Stephen Few is respected for his unflinching critique of poor dashboard design and ever since he trained me in the subject I look at dashboards through a different lens.

I had this in my mind when I went to SITS2013 this week where there were over 30 IT Service Management (ITSM) software vendors jostling for attention. When you read the brochure-ware it’s hard to tell them apart; they all ‘do’ ITILv3, SaaS, mobile apps and so on. Could ITSM tools be compared on their ability to effectively visualise the critical few measures that IT teams need to make decisions and act on them? Are the software vendors enlightened enough to help you with this challenge or will they leave you exporting data to Excel in frustration once the novelty of their mobile app has worn off?

I’ve collected and critiqued a rogues’ gallery of 8 dashboards being demonstrated enthusiastically on vendors’ exhibition stands. I’ll concede that these displays could, I hope, be customised to alter the visualisations. I’m more interested though in how these products look when they come out of the box because:

a. This is what vendors think is ‘good’ design and spend development dollars embedding in their products.
b. This is what users will end up looking at daily unless they invest in doing something different.

Mr Few classifies 13 dashboard design pitfalls in all but here are the 6 most common:

Pitfall 1. Exceeding the boundaries of a single screen.
Pitfall 2. Supplying inadequate context for the data
Pitfall 3. Choosing inappropriate display media
Pitfall 4. Ineffectively highlighting what’s important
Pitfall 5. Cluttering it with useless decoration
Pitfall 6. Misusing or overusing color

So here’s my assessment of the 8 dashboards with their pitfalls in brackets:

The most important display region of the dashboard should be the top left of the screen. This one has a navigation tree which isn’t the best use of this space and the dark blue of this panel draws the eye away from the charts themselves (4). There’s a huge amount of white space which, whilst useful for separating regions, is wasteful of valuable screen real estate which is probably why there’s a need for so much navigation (1).

There are 4 different chart types (3), most of which commit the sin of being in 3D which destroys comparisons (5), and one of which commits the cardinal sin of being a pie chart where rotational angles and areas are hard to evaluate (3). The bar chart bottom left is the best of the bunch but the bars aren’t sorted and there’s no good or bad context (2). In general the colours have no consistent meaning (6).

This one has different audiences in mind; most of the panels seem to be job lists for a service desk operator but we also have panels for SLAs and panels for service response time (2). The use of lists allows some density but provides very little context (2). As with many dashboards, speedometers have been used as though a car-driving metaphor somehow adds more meaning. The meters waste space and are complex to read (3) and the traffic lights assume you aren’t colour blind and should ideally just highlight exceptions (6). The worklist in the top right probably has the most urgency for a user and if so should be positioned top-left (4). The PC icons are also wasteful of space and only offer a binary good/bad state when a variance from target or a time series would give more context (2). The dark green and cyan panels also draw the eye to low value data (4).

This example at least gives prominence to the encoded data over decoration but we have fat axes on the charts (5) and a pie chart with an impossibly large number of part-to-whole pieces which aren’t sorted in any order (3) and multiple colours (6). One of the bar charts is at least sorted by value but there’s no target context (2) and the use of a red fill implies ‘bad’ (4 & 6). The panels could probably be resized but they make poor use of space and the region and table headers are distracting decoration (5).

This is a collage of charts on a presentation slide which probably intends to illustrate how flexible the reporting is. Everything is in 3D (3), we have a busy pie chart (3) and the tables are superfluous because values are already encoded in the charts (3). The strong axes and grid lines don’t communicate any data (5) and all the charts have multiple colours which invite unwanted comparisons between them (6).

This dashboard suffers from too much embellishment; the designs of the speedos, traffic lights, meter panels and borders draw the eye but don’t communicate any data (5). The white on black meter font is hard to read and could be combined in a single visualisation such as a bullet chart with an integral target rather than using a separate traffic light (3). The dense layout makes good use of the display space but the most valuable screen real estate is taken up with a filter panel and logos (5).

This dashboard has a few things to commend it; the use of sparklines (microcharts) in the tables provides time series context without taking up too much space although the time periods are unclear. Because the sparklines have embedded targets, the boxy up/down arrows are a little unnecessary but the red icons do draw attention to the Critical SLAs in breach. Since this also seems to be the most important summary information it should be positioned top left with the other information grouped more logically e.g. 7 day/12 month request volumes together, category comparisons together.

Unfortunately the other visualisations exhibit several pitfalls, especially the use of multiple chart types for similar categorical data, 3D effects, shading effects and a pie chart. Sadly, a well-intentioned attempt to show performance over time in context with target bands fails because the red shading is too dominant, giving the impression that something is in ‘critical’ state even though the signal is normal. Because of the wasteful size of these charts, these red bands dominate the dashboard.

Whilst this display is generally quite pleasing to the eye we have multiple categorical chart types and pie charts (3), 3D & shading effects (5) and seemingly random use of colour (6). The size of the panels and the filters/controls take up valuable screen estate which would also imply that additional drilldown & navigation is required (1). In the top left chart the time series axis labels are unreadable and the data point markers obscure the patterns in the data with red points implying ‘bad’ (6).

There’s some logic in putting a user's high priority information panels at the top left of the screen but presumably a drill down is needed in each area to give context (1). These values would have much more meaning if some comparison could be made visually against target or over time with a sparkline. The display space usage is generally inefficient and the meter in particular takes up a lot of space and is less effective than, say, a bullet chart (3). The charts are in 3D with a shaded background and strong gridlines (5).

In summary, and rather worryingly, I didn’t see a single ITSM tool at SITS which had put Stephen Few’s wisdom into practice. What does this mean for IT organisations? Bad dashboard and report design impairs the reader’s ability to rapidly analyse and interpret data in order to make better decisions and monitor the effect of actions. The default, wizard-driven visualisations in these ITSM tools are potential barriers to acquiring the evidence, meaning and insight needed to get a return on investment from ITSM software.

What does this mean for ITSM deployments? Performance measures should be deliberately defined and visualised to drive the right results for a particular audience. Dashboards should at least be customised as far as possible to address these pitfalls for interactive use. For periodic reporting there are options to use external Business Intelligence tools, perhaps shipping data out to a cost effective SaaS offering, or firing up the familiar Excel with PowerPivot and publishing via Sharepoint.

Unless the ITSM vendors wake up to visualisation as a potential differentiator in an otherwise mature market, more investment and effort will be needed by IT organisations to effectively communicate meaningful measures and execute performance improvement.

Sunday, 3 June 2012

TCP Throughput over Long Fat Networks

Throughput. We all know what that is: how much ‘stuff’ you can move in a period of time, which in the IT world means bits or bytes per second. We also know intuitively about the things which limit throughput; the capacity of a resource and the competing demands for it.

A concept harder to grasp perhaps is the idea that throughput is limited by physical distance and other network delays which is true in the case of TCP. This isn’t something we notice every day because our ubiquitous use of TCP - as HTTP browsing - rarely touches this throughput limit. There can however be real business consequences of this effect anywhere enterprises are moving large volumes of data over long physical distances.

Luckily this TCP throughput limit only exists within individual TCP sessions and doesn’t affect capacity for multiple, concurrent TCP flows. Fortunately there’s also a solution which we’ll come to later.

So what is this TCP throughput constraint and why does it exist?

Back in December 1974 the first TCP specification (RFC675) was published with a 16 bit window size. The window size is the maximum amount of data which can be sent before an acknowledgement is required. The window size can be varied by the receiver to assert flow control but can’t increase beyond 65535 bytes. As the Round Trip Time (RTT) of the network increases, the sender has to wait longer for an acknowledgement before starting to send the next window.

Queuing delays at network hops and long distance propagation delays are significant components of the RTT and certain finite delays can’t be tuned out of the system. Change any of these delay components and the maximum TCP throughput changes and can fluctuate in the course of a transmission and indeed be different in each direction.

By way of example an unscaled 65535 byte window over a network path with an RTT of 10ms has a TCP throughput limit of 52Mbit/s including protocol overhead which, with large frames, is about 49Mbit/s of real application data; the so-called ‘goodput’. Halve the RTT and the throughput doubles and vice versa. Even if the Gigabit link to your data centre has an RTT of only 1ms, a bog-standard TCP session could only half fill it. Equally enlightening is that 1ms of extra RTT will halve 500Mbit/s to 250Mbit/s but only reduces 50Mbit/s to about 45Mbit/s – the effect of RTT variability is worse at higher throughputs.
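These figures come from dividing the maximum window by the round trip time. A minimal sketch of that arithmetic follows; the function name is my own, and it ignores protocol overhead, loss and congestion control, so it gives a ceiling rather than a prediction.

```python
MAX_UNSCALED_WINDOW = 65535  # bytes, the largest value a 16 bit window field can hold


def max_tcp_throughput_mbps(rtt_ms, window_bytes=MAX_UNSCALED_WINDOW):
    """Ceiling on single-session TCP throughput in Mbit/s: one window per RTT."""
    if rtt_ms <= 0:
        raise ValueError("RTT must be positive")
    bits_per_window = window_bytes * 8
    return bits_per_window / (rtt_ms / 1000) / 1e6


for rtt in (1, 2, 5, 10, 11, 100):
    print(f"RTT {rtt:>3} ms -> {max_tcp_throughput_mbps(rtt):7.1f} Mbit/s")
```

Running it shows the inverse relationship directly: each doubling of the RTT halves the ceiling, whatever the capacity of the underlying link.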

The relationship between the path capacity and the path latency determines where performance-limiting factors may lie. Paths with high capacity & high latency will be limited by TCP throughput whereas paths with low capacity and low latency are limited by the data link capacity. This relationship is referred to as the ‘bandwidth delay product’ and is a measure of how much data could be in transit – in the pipe but not yet received - at any point in time. Networks with a high bandwidth delay product are called ‘Long Fat Networks’ or LFNs for short.

If the bandwidth delay product is greater than the TCP window size then the entire window can be in transit. The receiver must sit and wait for it to arrive and then clock it into the buffers before an acknowledgement can be sent. This stop-start pumping effect reduces the effective throughput across LFNs.
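As a rough sketch, the comparison between the bandwidth delay product and the window can be written as below. The function names and the three example paths are hypothetical, chosen only to show one case on each side of the line.

```python
def bandwidth_delay_product_bytes(capacity_mbps, rtt_ms):
    """Bytes that can be in transit on a path of the given capacity and RTT."""
    return capacity_mbps * 1e6 * (rtt_ms / 1000) / 8


def limited_by(capacity_mbps, rtt_ms, window_bytes=65535):
    """Say whether the TCP window or the link capacity is the tighter limit."""
    bdp = bandwidth_delay_product_bytes(capacity_mbps, rtt_ms)
    return "tcp window" if bdp > window_bytes else "link capacity"


# Hypothetical paths: (capacity in Mbit/s, RTT in ms)
for capacity, rtt in [(1000, 80), (1000, 1), (10, 20)]:
    bdp = bandwidth_delay_product_bytes(capacity, rtt)
    print(f"{capacity} Mbit/s at {rtt} ms: BDP {bdp:,.0f} bytes, "
          f"limited by {limited_by(capacity, rtt)}")
```

The Gigabit path with 1ms RTT has a bandwidth delay product of 125,000 bytes, roughly double the unscaled window, which is why a single session can only half fill it.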

So what can be done about this TCP throughput limit?

RFC1323 ‘TCP Extensions for High Performance’ offers a solution called Window Scaling. This is a TCP Option (3) negotiated by both sides during TCP connection establishment to indicate their willingness to shift the advertised window bitwise to the left, doubling it in size each time. A window scale factor of 0 indicates scaling capability but zero scaling, a window scale factor of 1 doubles the maximum advertised window from 65.5KByte to 131KByte. The highest scale factor of 14 can increase the maximum window to just over 1GByte.
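The bit shift is easy to illustrate. The sketch below computes the maximum window for each scale factor and finds the smallest factor that covers a given bandwidth delay product. The helper names are my own, and real stacks choose the factor from their configured buffer sizes rather than from a calculation like this.

```python
MAX_UNSCALED_WINDOW = 65535  # bytes
MAX_SHIFT = 14               # highest scale factor


def scaled_window(shift):
    """Maximum advertised window in bytes for a given scale factor."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("scale factor must be between 0 and 14")
    return MAX_UNSCALED_WINDOW << shift  # each left shift doubles the window


def min_shift_for(bdp_bytes):
    """Smallest scale factor whose maximum window covers bdp_bytes, or None."""
    for shift in range(MAX_SHIFT + 1):
        if scaled_window(shift) >= bdp_bytes:
            return shift
    return None


print(scaled_window(1))           # prints 131070
print(scaled_window(14))          # prints 1073725440
print(min_shift_for(10_000_000))  # prints 8
```

The last line uses a hypothetical path of 1Gbit/s with 80ms RTT, which has about 10,000,000 bytes in transit and would need a scale factor of at least 8 to keep the pipe full.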

Most TCP stacks have incorporated the Window Scale option by default for some time – it’s been around since 1992 after all – but there are a few considerations. Firstly, the TCP receive buffer size must be large enough to absorb the increased window size. Secondly, there’s a reliability trade-off because losing a single frame within a scaled window will require the whole window to be retransmitted with a further impact on net throughput. Thirdly, TCP congestion avoidance schemes may still limit the ability of the window to ever achieve its maximum size. Fourthly, some network devices – most notably firewalls – have been known to re-write the Window Scale option with unpredictable results. Fifthly, aggressive TCP Window scaling can create unfair bandwidth hogging which might not always be desirable.

So how can you engineer the TCP throughput required by particular business applications?

The first step is to understand the real world performance objectives or KPIs. If the raw throughput of individual TCP sessions is paramount then begin by looking at whether the bandwidth-delay product of the proposed network path exceeds the standard TCP window size.

The next step is to make sure that the end systems are tuned both in terms of buffers and two-way support for RFC1323. I’d always recommend throughput testing in conjunction with protocol analysis to validate this and also highlight the effect of congestion avoidance schemes.
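As one small illustration of buffer tuning, an application can ask the operating system for a larger receive buffer on a socket and then check what it was actually given. This is only a sketch using Python's standard socket module; the 4 MiB target is a hypothetical figure, and system-wide limits (which vary by operating system) may cap or adjust the request.

```python
import socket


def request_receive_buffer(size_bytes):
    """Ask the OS for a TCP receive buffer and return the size it reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Request the buffer before connecting, so that it can influence
        # the window scale factor offered during connection establishment.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


requested = 4 * 1024 * 1024  # hypothetical 4 MiB target
granted = request_receive_buffer(requested)
print(f"requested {requested} bytes, OS reports {granted} bytes")
```

If the reported value is much smaller than the bandwidth delay product of the path, the system limits probably need raising before any throughput test will show the benefit of Window Scaling.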

I advise impairment testing too because introducing incremental delays and loss under controlled conditions can give you a really good feel for the point at which the tuned system fails to yield throughput returns. A method of monitoring the throughput being achieved may also be necessary to prove that the network service is delivering to throughput KPIs and the business is getting what it’s paying for.

Wednesday, 23 May 2012

Effective dashboard design. Few and far between.

Everyone has a dashboard don't they? IT service management and monitoring tools all have them. As long as there's a web interface you can customise with traffic lights and gauges then it counts as a dashboard doesn't it? Well no, certainly not according to Stephen Few, the author of beautiful reference books on the art and science of data visualisation.

Few is an evangelist for giving data meaning through visual communication and I'm a recent disciple. In his book 'Information Dashboard Design', he defines a dashboard as the:

'Visual display of the most important information needed to achieve one or more objectives which fits entirely on a single computer screen so it can be monitored at a glance'.
That'll do nicely Stephen.

He critiques examples of dashboards which, whilst you're still uninitiated, don't look too bad. By the time you reach the end of the book you realise that they were truly awful. Poor choice of colours, inappropriate visualisations and non-data pixels are the most common faults. Logos, pie charts and gauges in particular get some stick. Some of the BI tool vendors - Business Objects, Oracle, Hyperion, Cognos - are culpable but the greater culprits are the analysts using them to create these monsters.

As well as explaining preattentive processing and the Gestalt principles of visual perception, Few describes the display media which are most useful for dashboards. Summarisation and exception are two key design characteristics; summaries create an aggregate perspective which prompts further questions whilst exceptions, and implicitly the underlying thresholds, draw attention to variations from normals and targets. Customising a dashboard for the audience, their objectives and how they will act on the information is also crucial in deciding on the content. This is where KPI design and visualisation design converge.

I'm often involved in extracting and visualising management information from IT monitoring and service desk tools. These could be the tools which sit there polling or logging day in, day out, but often relegated to an operational notification function. Yes, you can produce traffic lights, charts for WAN or CPU usage and filter huge lists of events but how are you going to use this, er, noise to report the value of the IT service provided to your customers? IT service desks may be better at providing reporting features but suffer from the visual bloat which gets in the way of effective, efficient communication.

Instead of wrestling with the native visualisations that IT management tools provide sometimes it can be useful to re-evaluate what's really important about the information being communicated and work harder on those 'critical few' KPIs and how best to visualise them in a way which helps the audience make better decisions.

I apply Few's techniques everywhere I can. They sit squarely at the intersection of my professional passion for analysing IT performance and a personal interest in aesthetics and creative design.

I highly recommend Few's books if you're working in this area and his website has lots of useful resources. His closing remarks inspire us to design a dashboard which: "makes people's lives better, helps them to work smarter, or gives them what they need to succeed in something that is important to them". As a purpose, there can't be many more worthwhile.

Thursday, 8 March 2012

Utilisation. Too busy to manage it?

Utilisation (or utilization if you prefer): in spite of being one of the most widely used terms to describe the performance of IT systems, it is often the most misunderstood. This post examines why this could have real business performance and cost implications.

Utilisation is a common measure of how much of a resource is, or was, being consumed. For storage resources such as physical disks & memory, this is an absolute measure of the available capacity in use eg. 500 Gigabyte of data consumes 50% of a 1 Terabyte disk. This is as intuitive and easy to understand as the glass of water analogy; half full or half empty depending on your outlook.

Misunderstandings can arise though when utilisation is a measure of how much of the time a resource was busy. WAN links, CPUs or disk controllers are all resources which, whilst busy processing something, are 100% utilised. Only as the measurement interval increases does the utilisation become less than 100%. A CPU which is 100% busy for 30 seconds and then idle for 30 seconds will be reported as 50% utilised over a 1 minute sampling period.
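The CPU example can be reproduced with a few lines of Python. This is a toy model, assuming one busy or idle flag per second, and is not how any particular monitoring tool samples.

```python
def utilisation(busy_flags):
    """Fraction of samples in which the resource was busy (1) rather than idle (0)."""
    return sum(busy_flags) / len(busy_flags)


# One flag per second: 30 seconds busy, then 30 seconds idle.
one_minute = [1] * 30 + [0] * 30

print(f"over the full 60 s: {utilisation(one_minute):.0%}")
print(f"first 30 s only:    {utilisation(one_minute[:30]):.0%}")
print(f"last 30 s only:     {utilisation(one_minute[30:]):.0%}")
```

The same activity reads as 100%, 0% or 50% depending purely on the interval chosen, which is why a utilisation figure means little without its sampling period.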

This might suggest that a time-based utilisation measure is inherently flawed and inaccurate because values reported by monitoring tools are always averaged, even the so-called peaks. In fact this average measure gives rise to something useful because requests which can’t be serviced whilst a resource is busy must wait in a queue for an average time which is determined by the average utilisation.

The higher the utilisation of a resource, the higher the probability of encountering queuing and the longer the queues become. Requests are delayed in the queue for the combined service times of the preceding queued requests. So in a system where the throughput of individual components is known, the highest delays – bottlenecks in other words - are encountered wherever the utilisation and service time are highest. Average utilisation then becomes a measure ‘by proxy’ of the queuing delay at each component.
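To show how steeply queuing delay climbs with utilisation, the sketch below uses the textbook single-server M/M/1 queue, where the mean wait before service is the service time multiplied by utilisation / (1 - utilisation). The M/M/1 model (random arrivals, exponentially distributed service times) is my assumption for illustration and real components will behave differently; the 10ms service time is hypothetical.

```python
def mean_wait_in_queue(service_time, utilisation):
    """Mean time a request waits before service in an M/M/1 queue.

    Wq = S * rho / (1 - rho), with S the mean service time and rho the
    average utilisation, valid for 0 <= rho < 1.
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be at least 0 and below 1")
    return service_time * utilisation / (1 - utilisation)


service_ms = 10  # hypothetical mean service time in milliseconds
for rho in (0.1, 0.5, 0.8, 0.9, 0.95):
    wait = mean_wait_in_queue(service_ms, rho)
    print(f"utilisation {rho:.0%}: mean queuing delay {wait:.0f} ms")
```

Under this model the delay at 50% utilisation equals one service time, but at 90% it is nine times that, so the last few percentage points of utilisation are by far the most expensive in response time.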

Every part of an application’s response time is bounded by network, CPU & controller queuing delays as well as the actual processing time.  Poor response times mean a poor user experience and reduced productivity. So the time-based utilisation of the end-to-end system resources becomes a critical KPI for managing business performance, IT value perception and capacity costs.

In a converged and virtualised world, there is continual contention for resources so ‘Quality of Service’ scheduling mechanisms which provide queuing control at each bottleneck become a necessary capability. These techniques allow cost efficiency to be maximised, driving resource utilisation as high as possible whilst protecting response times for priority services.

The good news is that utilisation is relatively easy to measure, more so than queuing delays. The real challenge is to collect and present this in a coherent way and translate it into KPIs which reflect real performance impacts. Fail to understand, measure and control utilisation however, and IT could be failing the business.