The 100%:
Why coverage doesn't matter and why it does
If you are a software developer, you are probably familiar with test coverage: a metric that shows the percentage of code executed by tests. It is often treated as a proxy for code quality, and sometimes even as a goal for the team. However, as with many things in life, misunderstanding and misrepresenting it is kind of the norm. I'm going to attempt to illuminate how it can actually be used. And how it can't, too.
Be aware: this article is about the technical coverage metric produced by executing tests against a code base. Depending on the setup, other kinds of automated tests can generate this data too, though that is usually harder to implement. Nonetheless, I trust the reader to be able to generalize these ideas.
Coverage calculation
Let's use a simple example to illustrate how coverage is calculated, starting with a very basic setup and gradually expanding it to increase coverage. I will be using Ruby for code purposes and the SimpleCov gem to measure coverage. It is easy to use and creates nice HTML reports.
Let's say we have a Bunny which can hop a certain distance. Depending on distance and direction, hops can be quite different. Maybe there is no hop at all: sometimes a bunny doesn't need to move! Put into words we can understand, this is it:
class Bunny
  def hop(distance)
    return "Lie down!" if distance.zero?

    # Incrementally build the hop
    hop = String.new
    if distance.positive?
      hop << "Ready"
    else
      hop << "Turn"
    end
    hop << " and "
    if distance.abs > 1
      hop << "hippity-hop!"
    else
      hop << "hop!"
    end
    hop
  end
end
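A quick sanity check of what the bunny answers (the results follow directly from the code above):

bun = Bunny.new
bun.hop(2) # => "Ready and hippity-hop!"
bun.hop(0) # => "Lie down!"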
Alright, now we will begin testing hopping. For starters, we should probably just set up the test environment and load the code:
require "simplecov"
require "minitest/autorun"
SimpleCov.start
require_relative "bunny"
This "test" can be run with ruby bunny_test.rb. You may be surprised, but even this already provides us with some line coverage. Why? Well, intuitively, defining a class and a method already does something, so it's logical that those statements count as covered. Practically, Ruby's classes are fully dynamic, allowing class and method definitions at any point, so when the coverage report says those lines are covered, it's because they were actually executed.
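If that sounds odd, here is a tiny standalone illustration (a hypothetical Demo class, not part of our bunny suite): class bodies in Ruby are ordinary code that runs when the file is loaded.

# demo.rb: loading this file prints the message below immediately
class Demo
  puts "The body of Demo is being executed right now"

  # `def` is itself a statement: it runs (and defines the method)
  # when this line is reached, which is exactly what coverage records.
  def greet
    "hi"
  end
end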
But let us get to some hopping goodness. We will now begin actually testing the Bunny by gently prodding it forward, making it hop a little and a bit more:
# ...The previous content is omitted for brevity, but it's here

class TestBunny < Minitest::Test
  def setup
    @bun = Bunny.new
  end

  def test_hop
    assert_equal "Ready and hop!", @bun.hop(1)
    assert_equal "Ready and hippity-hop!", @bun.hop(2)
  end
end
And this is the report we get. Nice, only two assertions and the coverage is already over 90%!
It seems that we forgot to test backward hops. Let's just add this one neat assertion and we will be done: assert_equal "Turn and hop!", @bun.hop(-1). Just look at these results!
Excellent! Only three assertions and we have 100% coverage. I guess we are done here. Nothing more to test. Nothing more to think about. This bnuuy is fully covered, snug and ready to sleep. If only we all were allowed to live like it, just three quick hops and the work day is over... But wait, how will the bun sleep if we never tested that it can lie down? How come it's covered if it never happened?
And that's why branch coverage needs to be considered too. Our early return shares a line with its condition, so line coverage marks the whole line as executed as soon as the condition is evaluated, even if the return itself never runs. Even languages that don't have one-line conditionals usually include the so-called "ternary operator" from C, which is basically the same thing but badly named and hard to read between all the magic symbols. To rectify this terrible no-lying-down situation, let's enable branch coverage in SimpleCov:
# ...
SimpleCov.start do
  enable_coverage :branch
end
# ...
Oh no, red coloring! Now it's clear that the line with the early return was executed several times, but the return itself never was. At least all the other branches are covered, whew! Let's add one final assertion to make sure the return is executed: assert_equal "Lie down!", @bun.hop(0). This ought to do it!
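For reference, the whole test method now reads:

def test_hop
  assert_equal "Lie down!", @bun.hop(0)
  assert_equal "Ready and hop!", @bun.hop(1)
  assert_equal "Ready and hippity-hop!", @bun.hop(2)
  assert_equal "Turn and hop!", @bun.hop(-1)
end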
Gaze upon this beauty. Take it all in. 100% lines covered. 100% branches covered. Simply, perfection. Our work here is over. There is nowhere to go from here. It is done. We've achieved the holy grail of KPIs known as "the maximum value for the metric".
...
......
.........
Attentive readers will probably realise that this method has 5 possible results, but there are only 4 assertions; we never checked that the bunny can turn and do a hippity-hop in one go. We tested everything though, right? Where did that result disappear to? We know that it should exist, but our 100/100 coverage doesn't account for it. What a mystery...
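For the record, pinning down that missing result would take just one more assertion:

# The fifth result our 100/100-covered suite never checks:
assert_equal "Turn and hippity-hop!", @bun.hop(-2)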
The inherent problem with coverage
Let's be real for a moment here: the problem is, of course, not with the coverage, which is working as advertised. The problem is with the tests. You may be tempted to think that there is no point in testing the last outcome, as it's obvious that it can happen. And in this tiny example that is okay. But let's not forget: this is just a simplified illustration, and the tests were written after the fact. In real circumstances tests should be significantly more thorough.
Now, why doesn't coverage show that something isn't tested? KPI-brained managers probably think it should always be enough, but it's just not. In reality, the set of possible results depends on all possible paths through the execution graph of a piece of code, while coverage only depends on the distinct nodes in that graph.
By execution graph I mean a directed graph (acyclic here, as there are no loops) where nodes are separate linearly-executed pieces of code and edges connect these pieces in the order control can flow between them. In our example the execution graph can be derived as follows:
- The first line splits execution into an immediate return and the rest of the method.
- The next condition creates two paths which converge in the middle of the method.
- One more condition splits execution into two paths which converge at the end.
- Finally, the last line is always executed.
From this graph, it's pretty clear what's happening: there are 5 possible execution paths, leading to 5 distinct results. (Tracing them is left as an exercise for the reader.) Coverage, on the other hand, only shows that nodes were visited. Not whence the execution came. Just which nodes. You see the problem now? Coverage, in general, cannot show that tests have achieved every possible result. There are exceptions to this: degenerate cases. I would say, two of them. Can you tell what they are? The graph above can be studied for inspiration.
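As a quick sanity check, here is one representative input per path (tracing each of them through the nodes is still your exercise):

# Five inputs, five paths, five distinct results:
bun = Bunny.new
[0, 1, 2, -1, -2].each do |distance|
  puts "#{distance} => #{bun.hop(distance)}"
end
# 0  => Lie down!
# 1  => Ready and hop!
# 2  => Ready and hippity-hop!
# -1 => Turn and hop!
# -2 => Turn and hippity-hop!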
To make sure that everyone is on the same page: when several conditionals feed into each other, execution paths multiply. Like cockroaches: where there are conditionals, there are always more, skulking in other methods. You may not know what that funky method you are calling does, but consider that it probably has ifs and elses of its own. Be wary of suddenly hitting nils or errors. Always check for them... creating even more paths with ifs of your own. It's really a never-ending mill of code.
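To see the multiplication in the smallest possible dose, consider this sketch (a made-up describe method, not from our bunny):

# Two independent conditionals in sequence: 2 * 2 = 4 paths.
def describe(n)
  sign =
    if n.negative?
      "negative"
    else
      "non-negative"
    end
  parity =
    if n.even?
      "even"
    else
      "odd"
    end
  "#{sign} and #{parity}"
end

# describe(2) and describe(-1) together execute every line and every
# branch (100% coverage), yet only two of the four possible results
# are ever observed.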
Some among you may say: "But, unknown article author on the internet, my code does a bunch of things to arrive at the same couple of results! I don't have a dozen distinct cases!" Of course, it's possible that in real code several paths lead to the same result. Quite often you need to find data in one of several possible places, use different methods for calculation, and so on. Still, each combination needs to be tested, otherwise how would you know that they do in fact behave the same? Are you sure that you covered all the edge cases? Do you? Have you considered that one neat trick that doctors hate and users surely will do?
So, we arrive at The Truth (patent pending):
Coverage, be it line, branch or otherwise, cannot show that your tests are complete and foolproof. And KPI lovers should be ashamed of themselves.
Why coverage matters anyway
After the previous soliloquy, you may have gotten the impression that I consider coverage unimportant and maybe even harmful. That is not the case at all. In fact, I usually consider 100% coverage to be a must. Why? Let's dive into it.
First things first, let's make one thing clear: the distinction between line and branch coverage is pretty much a technical and historical artifact. Both are required to really see what's executed and what's not. Always enable the maximum coverage options your tool allows, unless there are limitations making that unviable for whatever reason.
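With SimpleCov that means something like this; minimum_coverage is optional, but it makes the suite fail loudly when the numbers drop:

SimpleCov.start do
  enable_coverage :branch
  # Fail the test run if coverage falls below these thresholds.
  minimum_coverage line: 100, branch: 100
end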
Now, back to the topic. Remember how we incrementally added tests for our lil' bunny? To do that, we used the coverage results to quickly identify what wasn't tested. And that's the crux of it: coverage helps quickly find definitely missing test cases. And that's pretty much it. Incomplete coverage means incomplete tests. This relationship can be written as a logical implication: incomplete coverage -> incomplete tests.
And this simple fact is why 100% coverage both doesn't matter and matters a lot. If it is below 100%, that is an objective signal that the test suite in question is incomplete: some code path is not exercised. Alternatively, some code path may be completely extraneous and dead. (Or the code has special cases that can't be executed in a single test run, such as platform-specific code.) In either case, it is a signal that something is wrong. On the other hand, if it is already at 100%, the only thing it tells you is that there is no strictly unreachable code. And that your tests at least cover some things, I guess.
Furthermore, coverage is a safety net. If it decreases after a change, then something is clearly not being tested. Maybe there were no tests covering the buggy code (and that's why the change was needed). Maybe it's a new feature with no test associated with it. Either way, decreasing coverage is also an objective signal that the test suite needs updates. There are even tools built specifically for this, such as the undercover gem, which raises a fuss when uncovered lines appear compared to the default branch.
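A sketch of how that can look after a test run; the --compare flag is taken from the gem's README, so double-check it against the version you install:

undercover --compare origin/master

It diffs your changes against the base branch, reads the coverage data, and complains only about changed code that isn't exercised.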
The end
And here we are. We utterly covered the coverage. We even learned some things today!
- Coverage counts execution nodes, not paths.
- Coverage can't really be used as a metric for test quality.
- Less than 100% coverage is an objective signal that the test suite is lacking.
- 100% coverage doesn't mean that the test suite isn't lacking.
- Think with your head, not your KPI.