What might businesses today learn from the Space Shuttle Challenger disaster? That interpreting data correctly is more important than merely collecting it! The methods by which raw data are transformed into business metrics can play an insidious role in our interpretation.
Thresholds are a frequently used design feature in business metrics, limiting the information that decision-makers see. Specifically, I’m referring to any process whereby raw data are compared to a prescribed threshold and categorised as either above or below it. To illustrate, a customer may wait “12.3 minutes” to be served; after the datum is transformed, that wait is recorded only as “longer than” the threshold of 10 minutes. Take a moment to think of one metric used in your organisation where a threshold is used in this way.
In this article I will argue that threshold transformations are a highly problematic design feature for any business metric, and I will examine four reasons.
1) Loss of Information.
Threshold transformations might be popular because nominal data are easy to summarise for communication. For example, we can easily compute, and interpret, a “percentage of customers waiting longer than 10 minutes”. This is unlikely to raise any alarms, as “longer than 10 minutes” is a valid restatement of “12.3 minutes”. However, that extra detail is important. Why should we be concerned with some lost detail? Plot the raw data and find out:
“Whoa, why were some customers waiting for nearly an hour?” you might be asking. Good, you should care about that. Did something go wrong or are those just unusually complex cases? It might be a good idea to look into that. A more subtle concern is whether situations like this can occur without our knowledge. In the case of threshold-based metrics, yes.
While seeing outliers in the raw data can prompt an investigation, with a threshold transformation those data points are treated identically to the customer who waited 12.3 minutes. And if we summarise this information further, we might get: “94% of customers wait less than 10 minutes”. Phew, that sounds much better now, so we can relax… right?
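The loss of detail is easy to reproduce. The sketch below uses invented wait times (they are not the data behind the plot, so the percentage differs from the 94% quoted above) and the 10-minute threshold from this article. It shows how the transformed data treat a 12.3-minute wait and a 55-minute wait as the same thing.

```python
# Hypothetical wait times in minutes (illustrative only, not real data).
wait_times = [3.2, 4.1, 5.0, 5.5, 6.3, 6.8, 7.0, 7.4, 8.1, 8.6,
              9.0, 9.2, 4.7, 5.9, 6.1, 7.7, 2.8, 3.9, 12.3, 55.0]
THRESHOLD = 10.0  # minutes, as in the article's example

# Threshold transformation: each measurement collapses to True/False.
over_threshold = [t > THRESHOLD for t in wait_times]

# The summary a decision-maker would typically see.
pct_within = 100 * over_threshold.count(False) / len(wait_times)

# What the summary hides: how long the slow cases actually took.
slow_cases = sorted(t for t in wait_times if t > THRESHOLD)

print(f"{pct_within:.0f}% of customers waited 10 minutes or less")
print("Customers over threshold:", over_threshold.count(True))
print("Their actual waits (minutes):", slow_cases)
```

Once the list of True/False values is all that is stored or reported, the last line can no longer be computed.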
Making decisions based on summarised data can be a high-stakes gamble. Physicist Richard Feynman observed at NASA that a point estimate provided by a regression calculation was used without adequate consideration of the raw data:
A mathematical model was made to calculate erosion. This was a model based not on physical understanding but on empirical curve fitting … The empirical formula was known to be uncertain, for it did not go directly through the very data points by which it was determined. There were a cloud of points some twice above, and some twice below the fitted curve, so erosions twice predicted were reasonable from that cause alone.
Although regression is useful for estimating the relationship between variables, or colloquially, finding the line or curve that “fits”, we risk detaching ourselves from the reality shown by the raw data. When viewing regression graphs, remember that the points are real, the line is not.
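As a rough illustration of “the points are real, the line is not”, the sketch below fits a straight line by ordinary least squares to a handful of invented observations and then compares each real point to what the line predicts. The data and variable names are assumptions made for this example; they have nothing to do with NASA's erosion model.

```python
# Hypothetical (x, y) observations (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 5.0, 3.0, 10.0, 6.0, 14.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares slope and intercept for a straight line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Ratio of each observed value to the value the fitted line predicts.
ratios = [y / (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
print(f"observed / predicted ranges from {min(ratios):.2f} to {max(ratios):.2f}")
```

Reporting only the fitted line would hide the fact that individual observations sit well above and well below it.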
If you’re spending effort and money on collecting data in the first place, don’t just throw a substantial part away. Look for ways to extract more information than just summaries.
2) Susceptibility to Rank Inversion.
One of the most damning things that could be said about any metric is that it can show an improvement over time when things are actually getting worse, or vice versa. This is an extra “feature” introduced by the threshold transformation. Take a look at some more raw data of customer wait times and think about which of these is better:
Generally case A is preferable as we can address special causes more easily than systemic issues, followed by case B, with case C looking like something is seriously, seriously wrong. However, our trusty threshold-based metric tells the opposite story, with the “percentage of customers waiting for less than 10 minutes” being 92%, 96% and 100% respectively. Suppose you’re managing case C and relaxing because of the perfect score… have fun next month!
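The inversion can be reproduced with made-up numbers. The three data sets below are assumptions constructed only to give the same 92%, 96% and 100% scores; they are not the data from the plots. Alongside the threshold metric, the sketch prints the median wait, which ranks the cases the other way round.

```python
from statistics import median

THRESHOLD = 10.0  # minutes

# Hypothetical wait times for 25 customers per case (illustrative only).
cases = {
    "A": [3.0] * 23 + [30.0, 45.0],         # fast, with two exceptional delays
    "B": [6.0] * 24 + [12.0],               # slower overall, one case over
    "C": [9.5, 9.6, 9.7, 9.8, 9.9] * 5,     # everyone just under the threshold
}

def pct_under(times, threshold=THRESHOLD):
    """Percentage of waits strictly below the threshold."""
    return 100 * sum(t < threshold for t in times) / len(times)

# For each case: (threshold metric, median wait in minutes).
summary = {name: (pct_under(times), median(times))
           for name, times in cases.items()}

for name, (pct, med) in summary.items():
    print(f"Case {name}: {pct:.0f}% under 10 min, median wait {med} min")
```

The median is used here only as one convenient way of looking at the typical wait; it is not proposed as a replacement metric.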
How can rank inversion go unnoticed? Because we are unlikely to spend much time looking at raw and summarised data together. Raw data is often “out of sight, out of mind” once summarised.
3) Distortion of Behaviour.
In what might be a common experience with IT help-desks, I had made four or five calls about a recurring password issue. Out of curiosity, I asked for the support ticket to be re-opened (reasonable, since the issue had not been resolved). Predictably, the answer was no. It was clear to me then that the tickets were closed as rapidly as possible because resolution time was monitored and “closing tickets within a threshold” was the de facto purpose of the helpdesk. And so, they spent time opening new tickets while I enjoyed a login screen.
John Seddon highlights these distortions of behaviour, suggesting that companies tend to spend much of their efforts policing “should take” time, that is, setting an arbitrary threshold and holding employees accountable to it. That is what I call the “wishful thinking” aspect of this class of metrics; the designer identifies the level of performance that they would like to see and codifies it in a metric, with the ever-so-plausible assumptions that staff will just work harder in their previously designated role until the metric, a proxy, reflects this increased effort, and that staff will in-no-way-whatsoever direct any of their collective creativity towards simply delivering better numbers. No, a helpdesk would never create new tickets for previously reported problems…
Instead, Seddon suggests that companies ought to focus on “does take” time, that is, they should measure the actual performance of the system while avoiding distortions, such as thresholds. From my experience, associating metrics with the system itself helps foster the mentality of “how do we fix it?”, while associating metrics with people tends more towards “who do we blame?”.
4) Distorted Perception of Reality.
If we change the threshold in our example from 10 to 8 minutes, suddenly all our summary metrics look worse:
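A small sketch makes the sensitivity concrete. The wait times are invented for illustration; the only thing that changes between the two results is the threshold, not the customers' experience.

```python
# Hypothetical wait times in minutes (illustrative only, not real data).
wait_times = [3.2, 4.1, 5.0, 5.5, 6.3, 6.8, 7.0, 7.4, 8.1, 8.6,
              9.0, 9.2, 4.7, 5.9, 6.1, 7.7, 2.8, 3.9, 12.3, 55.0]

def pct_under(times, threshold):
    """Percentage of waits strictly below the threshold."""
    return 100 * sum(t < threshold for t in times) / len(times)

# Same raw data, two different thresholds.
results = {threshold: pct_under(wait_times, threshold)
           for threshold in (10.0, 8.0)}

for threshold, pct in results.items():
    print(f"Threshold {threshold:.0f} min: {pct:.0f}% of customers under it")
```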
A threshold like this, which we would recognise as arbitrary at the time of setting it, subsequently becomes a declaration of right and wrong. It implies that a customer waiting less than a specific amount of time is happy, and that spending a long time on an inquiry is inherently problematic, regardless of its complexity. This heuristic is exacerbated by how data are presented and, if faith in a metric is sufficiently strong, can lead us to reject reality outright.
I recall the investigation phase of an improvement initiative I once led, where managers reported long delays when arranging IT equipment for their new hires. I was dismayed when the IT Services manager rejected an internal survey (100+ managers), asserting that the situation could not possibly be as bad as the survey had shown. In hindsight, the reasons for this reaction were obvious. It turned out that he was receiving significantly lower figures, presented in threshold-based fashion: percentages of tasks resolved (i.e. closed) within a specified number of days. The actual process included an excessive number of follow-ups, but when something got missed it was treated as a new, separate task! Just because some data is objectively accurate (ticket resolution time) does not mean it reflects the situation that people are concerned about (end-to-end process time). We need to consider what data actually means and distinguish between easily accessible and relevant data.
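The gap between the two measures can be sketched with a few invented tickets. Everything here is hypothetical: the request names, the day numbers and the 3-day target are assumptions, and the sketch presumes that tickets can be linked back to the original request, which was exactly what the real process lacked.

```python
TARGET_DAYS = 3  # hypothetical "resolved within N days" threshold

# (request_id, day_opened, day_closed); each follow-up is a separate ticket.
tickets = [
    ("new-hire-1", 0, 2),
    ("new-hire-1", 5, 7),
    ("new-hire-1", 11, 14),
    ("new-hire-2", 1, 3),
    ("new-hire-2", 8, 10),
]

# Metric seen by the service manager: per-ticket resolution time.
ticket_durations = [closed - opened for _, opened, closed in tickets]
pct_on_target = (100 * sum(d <= TARGET_DAYS for d in ticket_durations)
                 / len(tickets))

# Experience of the hiring manager: first ticket opened to last ticket closed.
first_opened, last_closed = {}, {}
for request, opened, closed in tickets:
    first_opened[request] = min(opened, first_opened.get(request, opened))
    last_closed[request] = max(closed, last_closed.get(request, closed))
end_to_end = {r: last_closed[r] - first_opened[r] for r in first_opened}

print(f"{pct_on_target:.0f}% of tickets closed within {TARGET_DAYS} days")
print("End-to-end days per request:", end_to_end)
```

Both numbers are computed from the same accurate records, yet only the second answers the question the surveyed managers were asking.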
While holding a false belief about the time taken to onboard new staff can be a barrier to seeing the need for improvement, other false beliefs can have much graver consequences. Feynman’s report highlighted NASA management’s belief in an extraordinary level of safety. Engineers speaking up about heightened risks for the Challenger mission clashed with management’s perceived “reality” of a 1 in 100,000 chance of catastrophe. Stating exaggerated levels of safety might have started as a PR exercise but ultimately distorted management’s perception of reality and contributed to the death of seven crew members.
Perceiving the world according to what we believe is all too human. Giving ourselves the power to arbitrarily define success and failure plays right into this human weakness. To show how well a department is doing, shift thresholds one way; to push people to work harder, shift thresholds the other way. Of course, the abuse of threshold-based metrics is never so explicit, as one also needs to maintain the illusion that these metrics give a true picture of reality. And so, the effect they have goes unnoticed.
While the calculations involved in these metrics might be considered objective, where do the thresholds come from? If from “wishful thinking”, then design actual changes instead – enshrining wishes in a metric might express what you want, but does not give people any indication of how to get there.
When confronted with the reality that a metric is not conveying the information we believed it did, how should we react? Scrap misleading metrics and find alternatives where needed. Be sure that you are getting the information you need and not just that which is most convenient to collate. Use more graphs – our brains do an amazing job of identifying patterns visually so let’s make use of that.
The alternatives to using thresholds are a topic unto themselves, which I will cover in a separate article, but feel free to post questions below.
1. ^ This pertains to the level of measurement of the data, which the threshold transformation lowers. Data is typically classified as one of four levels, listed here from most to least informative:
Ratio Data: 12.71 minutes, 5 customers
Interval Data: 21 °C, a calendar date
Ordinal Data: Very Low, Low, Medium, High, Very High
Nominal Data: True/False, Above Threshold/Below Threshold
2. ^ Ask why 10 minutes instead of 5 or 15. Very few natural thresholds exist in business and most are set at completely arbitrary levels. The 601st second waiting in a queue is hardly different for a customer than the 600th second. The effect of waiting is cumulative, and each customer will start to feel irate at different points.
3. ^ Seeing data in greater resolution like this means that the viewer is likely to have a greater number of signals to investigate for potential improvements. However, one should also be aware that reacting to each signal individually can constitute tampering.
4. ^ Is it just a matter of sweeping unpleasant information under the rug though? There are several competing explanations for the popularity of threshold transformations in business metrics. From the “Isomorphism” (DiMaggio & Powell 1983) perspective, there is a strong propensity to mimic the practices of other organisations, which would be my default explanation for the prevalence of threshold-based metrics in business. Alternatively, the “Uncertainty Absorption” (March & Simon 1958) perspective posits that communicating summarised data with useful signals stripped out is innate to large organisations. I am not aware of any serious attempts to discover which of these is a stronger influence on how business metrics are designed.
5. ^ Feynman’s report includes a lot of interesting information: Rogers Commission – Volume 2: Appendix F – Personal Observations on Reliability of Shuttle
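Seeing this occur at NASA, it is no wonder that point estimates are often abused when regression is involved. If using a regression to estimate something, always express your findings as an interval (a confidence interval, or a prediction interval when estimating a single new observation) rather than as a point estimate alone!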
6. ^ More generally, rank inversion can occur in any kind of comparison, such as region-to-region. One bizarre interpretation of sales figures across a chain of stores that I have seen was the belief that staff at the store with more sales were necessarily doing something better, with the view that these superior practices should be replicated to other stores. Consider for a moment the environmental differences between stores in any chain, like the proximity of competitors or the impact of local demographics and culture. How were these confounding variables considered in the analysis of sales figures? They were not. The practices at another store may have been superior in a fair comparison.
7. ^ Typically special causes are exceptions to how a system works, while common causes are those “built into” a system. Although the variation associated with common causes is more predictable, any improvement requires a fundamental change of the system. On the other hand, special causes are easier to isolate and tend to require smaller changes. For more about variation and associated concepts see Deming’s System of Profound Knowledge.
8. ^ We can get into trouble when we start to believe that a set of quantitative measures can tell the whole story of whether work is being done “well”.
Marcin Kreglicki - Thresholds: The “Wishful Thinking” Approach to Designing Metrics - January 27, 2018