I’m currently thinking about metrics, with the overall goal of using metrics to improve engineering development and performance. To help me with this, I’m going to be brain dumping through posts.
There are situations where people (individuals, teams, orgs) rely on subjective opinions to make decisions. Performance reviews and promotions are two examples.
In these situations, objective data isn’t used (or is used subjectively) because it’s believed that you can’t or shouldn’t use a metric for that decision. (Or that it is too costly to determine the metric.)
The idea I want to explore in this brain dump is that metrics might have a place, even in these complex decisions.
Many management concerns, I suspect, involve a strong need to understand how things are going. When something goes poorly, you need to understand how that happened to address the issue. Likewise, you want to replicate behaviors and actions that help things go well. To improve on the status quo, you need awareness of what there is to be improved.
Everything Can Be Evaluated
To start, an axiom: everything can be evaluated.
The evaluation may be subjective and/or imprecise, but there can still be an evaluation. If you deem someone or something as doing well or doing poorly, you’ve made an evaluation.
An evaluation is someone’s assessment of how things are going. The jump that tends to be made here is that the assessment is a reasonable proxy for how things are going. Often, that’s what we mean when we talk about an accurate assessment.
Let’s clarify a bit what we mean by the term accuracy.
We’ve all been in scenarios where we judge how accurate an assessment feels. When your friend tells you they’ll arrive in about 5 minutes, this is generally a pretty “accurate” estimate. There’s high confidence in the assessment: it could be off on the order of minutes, but it’s improbable that it’s off by hours or days.
Extrapolating, an assessment expresses the odds that an effort will accomplish a particular goal. Goals consist of both the desired result (outcome, state) and when that result will happen. This assessment has a confidence interval, be it explicit or implied.
In a more “normal” form, an assessment is then a yes or no response to:
Given $CURRENT_STATE, I’m $CONFIDENCE_PERCENTAGE confident that $TARGET_STATE will occur by $TIME.
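To make the template concrete, here is a minimal sketch of it as a small Python type. The `Assessment` class, its field names, and the 90% figure are all my own illustrative assumptions, not something the template prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One assessment in the 'normal' form above (hypothetical type)."""

    current_state: str
    confidence: float  # stated as a probability between 0.0 and 1.0
    target_state: str
    deadline: str  # free-form description of $TIME

    def __post_init__(self) -> None:
        # Reject confidences that cannot be read as a percentage.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def statement(self) -> str:
        """Render the assessment as the sentence from the template."""
        return (
            f"Given {self.current_state}, I'm {self.confidence:.0%} confident "
            f"that {self.target_state} will occur by {self.deadline}."
        )


# The friend example, with a made-up confidence of 90%.
friend_eta = Assessment(
    current_state="my friend is a few blocks away",
    confidence=0.9,
    target_state="their arrival",
    deadline="five minutes from now",
)
print(friend_eta.statement())
```

Writing it down this way mostly forces the implicit parts (the confidence and the deadline) to be stated explicitly.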
Accuracy, then, refers to how close the confidence interval is to the truth. Note that this implies an absolute accuracy and a perceived accuracy.
Absolute accuracy would compare the confidence interval to reality, if you could run a simulation. It may be unknowable.
Perceived accuracy is how close you believe the confidence interval to be to the absolute accuracy. It’s subjective. Generally, we’re talking about perceived accuracy when we say how accurate we think something is.
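Absolute accuracy may be unknowable for a single assessment, but once many assessments have resolved you can at least compare the stated confidences to what actually happened. One common way to do that is the Brier score. This is my choice for illustration, not something the argument depends on, and the history below is made up.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared gap between stated confidence and what happened.

    Each item is (confidence that the target occurs, whether it occurred).
    0.0 is a perfect score; always answering 0.5 scores 0.25.
    """
    if not forecasts:
        raise ValueError("need at least one resolved assessment")
    return sum((p - float(hit)) ** 2 for p, hit in forecasts) / len(forecasts)


# Made-up history of resolved assessments, for illustration only.
history = [(0.9, True), (0.8, True), (0.7, False), (0.95, True)]
print(round(brier_score(history), 4))
```

A lower score means the stated confidences tracked reality more closely. It says nothing about any single assessment, only about the evaluator’s track record.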
Can all assessments be expressed in this form? I don’t know, but it seems like a workable starting point.
Evaluations As Proxies
Here’s where we are now. We’re interested in how things are going, which is likely too complicated to precisely define. So we simplify things with a hopefully accurate evaluation. This evaluation is used as a proxy of how things are going. This, in theory, lets us make more principled decisions.
Your brain, like mine, may now be screaming that we’re eventually going to make a serious mistake by trusting that evaluation.
We will (because we’ll make mistakes eventually). More concretely, however, I think these are the main concerns:
1. An evaluation is too “lossy” and so cannot be used as a proxy for how things are going.
2. Evaluations can be used as a proxy, but how can you trust the evaluation? (The accuracy seems unreliable.)
3. Evaluations can be used as a proxy, but how can you trust the evaluator? (The evaluation method seems unreliable.)
4. The evaluation may be accurate, but the confidence interval is unacceptably low.
I’m interested in concerns (2) and (3), but first I’ll touch on why I’m not concerned with (1) and (4).
I think (1) is really some combination of (2) and (3). It’s not that evaluations are too “lossy”—it’s that you don’t trust the accuracy of the evaluation.
In reality, we’re all relying on evaluations as proxies every day. They may be implicit, trivial, and/or unexamined. When you drive in a car, you probably think it’s safe. That’s an evaluation. (For example, I’m 99+% confident that I won’t die by tomorrow if I take a ride in this car.)
With (4), I think this is a separate issue. Here you have an accurate, but undesirable, estimation. The problem is now how you will proceed with that knowledge, not whether or not the knowledge is reliable.
So now, our objections are that the accuracy is incorrect and that the evaluation methodology might be wrong.
Incorrect accuracy gets at _correlation_—the evaluation is not a good predictor of the desired result.
Flawed methodology speaks for itself. However, it’s worth noting that there are different ways the methodology could end up flawed. The evaluation process could be correct, but we screw up when performing it. Or, the evaluation process can be wrong. (That second case also captures when we make assessments based on some amount of gut feeling.)
These are real concerns, and we should stay aware of them for consequential decisions. We are still forced to rely on potentially-inaccurate estimations, because that’s how we make decisions in complex situations.
All Decisions Come From Evaluations
Now, a claim: all decisions come from evaluations. We are always making decisions based on some kind of evaluation (accurate or not), and you cannot make a decision without one.
The argument rests on the absence of a counterexample: I claim you cannot find a decision that was made without an evaluation. I’m not going to try to prove this exhaustively, but in the very short time (5-10 minutes) I’ve spent on it, I haven’t found a ready counterexample. This doesn’t mean I’m right, but I suspect any counterexample would be inconsequential when it comes to making decisions in a business setting.
I do think it’s worth pointing out some categories.
Clearly, we do try to make evaluations in a lot of scenarios (should I do X or Y). A consequential decision probably involves some amount of reasoned evaluation.
There are also “indifferent” decisions, where the chosen outcome doesn’t matter. (Would you like to use the red crayon or the blue crayon?) In these, you’ve made an evaluation—that the choices are equal—and thus either choice is fine.
The last main category of conscious decisions is those based on evaluations that we may not be aware of. I am thinking about decisions such as:
- Person X should be promoted.
Your evaluation is probably some combination of them being competent and that you like them.
- Personal preference, such as I like Hondas over Toyotas.
While personal preference already is an evaluation, it likely has components to it (that you may also not be aware of).
- Habitual responses/decisions, like the path you will take to go from your bedroom to your restroom.
You are not even consciously thinking here, but you are subconsciously evaluating (I’ve reached the hallway, so now I need to turn…).
In terms of rationality or reasoned decisions, some of these are terrible evaluations. But that’s not the claim—it’s just that evaluations are happening.
Now, let’s get to the idea in this brain dump: you should consider using a metric.
If you buy the reasoning so far, here’s where we’re at. We’re interested in making good management decisions, all of which rely on evaluations. If the evaluations are sufficiently accurate, we can be satisfied—we are able to make the best decisions we can based on what we know.
It’s a concern if evaluations aren’t accurate enough, in which case we need to decide how to proceed. For example, you might try to get a better evaluation. Or, you might accept that you don’t have enough information and take your best guess.
The failure case we are concerned with: when we make decisions on the belief that our evaluations are accurate but they are actually not. This will always happen, of course, but we would prefer to minimize it as much as we have the ability and bandwidth to.
Rephrasing: when we make consequential decisions, we want to be confident in our evaluations. There are many ways to dig into this, but in short it involves assessing and reassessing how we evaluate and how accurate our evaluations are.
And now, the main idea: if you aren’t using some kind of objective evaluation when making an important decision, you perhaps want to try.
In many situations, I think people will agree with this sentiment. I think the concerns are around “hot topic” or “unmeasurable” decisions.
The thing is, you are still making subjective evaluations for these decisions. So, you’re implying that you are comfortable making decisions based on these subjective evaluations.
To be clear, I do not believe that human intuition, subjective evaluations, etc. are categorically wrong. There are definitely (and frequently) times when they are correct—people are always acting on their beliefs, and many times they can be right. Chess grandmasters can intuit optimal moves, though they still often compute whether or not the move is actually correct.
That’s what drives me to my conclusion—you should explore some metrics. Your problem is probably not solvable purely with metrics (yet?). Still, I think it would be preferable if you can come up with a metric-assisted evaluation process.
Consider lines of code (LOC) changed. The standard engineering lore is that you cannot objectively measure engineering performance because any metric you use (such as LOC) is gameable and/or wildly inaccurate. There’s truth to this, but does LOC have zero value? If someone is writing very few LOC, would you be curious as to why?
If a metric is truly irrelevant, you should be indifferent to it. LOC is likely only very weakly correlated with performance, but the correlation isn’t zero: no code output at all likely means something. This suggests that LOC is still a (very marginally) useful data point, even if it only serves as a canary.
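As a rough sketch of LOC as a canary rather than a score, the snippet below flags anyone whose LOC falls far below the team median so that someone can ask why. The function name, the 25% cutoff, and the numbers are all assumptions made up for illustration.

```python
from statistics import median


def low_output_canaries(loc_by_person: dict[str, int], fraction: float = 0.25) -> list[str]:
    """Return people whose LOC is below `fraction` of the team median.

    This is a prompt to ask "why?", not a performance score.
    """
    if not loc_by_person:
        return []
    threshold = median(loc_by_person.values()) * fraction
    return sorted(name for name, loc in loc_by_person.items() if loc < threshold)


# Made-up numbers for illustration only.
team_loc = {"ana": 1200, "ben": 950, "chi": 40, "dee": 1500}
print(low_output_canaries(team_loc))
```

The answer to “why?” may be perfectly good (design work, reviews, an outage). The point is only that the number gave you a reason to look.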
I find it hard to believe that there is absolutely no valuable metric in these difficult-to-measure scenarios. For me, the question is not if metrics can (or should) be used, but how to find reliable enough metrics.
That, and other ideas, will hopefully come in a future brain dump.