Automation is Not a Goal: Commit-to-Bug Time Reduction is

David Cottingham
Published in Cloudy Musings · 4 min read · Oct 12, 2017

Thanks to https://www.keepcalm-o-matic.co.uk/

Being rigorous about avoiding manual work as much as possible is key to ensuring efficiency. But measuring the percentage (of anything) that is automated misses the point: we automate because it improves some tangible business metric. If it does not, we should invest the effort elsewhere.

Let’s consider a contrived example: assume that we use manual testers to assure the quality of a software product. All of the manual tests for a build of our product can be completed within a day. We now automate those tests, and the time to conduct them reduces to 8 hours. We pat ourselves on the back: automation has made us more efficient. But it transpires that false positives and environmental issues mean that it takes another 8 hours to determine which issues are real (and for which bugs should be raised) and which are spurious. In this hypothetical scenario, did automation help?
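For concreteness, the arithmetic of that scenario can be sketched out as below. Note the assumption (mine, not stated in the example) that “a day” of manual testing means one 8-hour working day; on a 24-hour reading, automation would still win on elapsed time.

```python
# Back-of-envelope arithmetic for the contrived example above.
# Assumption: "a day" of manual testing = one 8-hour working day.
manual_hours = 8          # manual suite: completed within a day
automated_run_hours = 8   # automated suite execution time
triage_hours = 8          # separating real failures from environmental noise

total_automated = automated_run_hours + triage_hours
print(total_automated)                 # 16 hours elapsed
print(total_automated > manual_hours)  # True: slower than the manual baseline
```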

Evidently there are several flaws in the above example, not least that automation should mean one can run far more tests in a given time period than is possible manually. However, to answer the question posed, we must first decide what yardstick to measure ourselves against.

A Metric: Commit-to-Bug Time

Allow me to propose what I consider a suitable metric: the time between a developer committing code and a bug caused by that commit being raised and assigned to that developer. Call this the “Commit-to-Bug Time” (suggestions of snappier names, or pointers to terms already in existence, appreciated!).
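As a minimal sketch of how one might measure this, assume the bug tracker can link each bug back to the commit that caused it (e.g. via bisection or a hypothetical “caused-by” field); the data model and names below are illustrative, not a reference implementation:

```python
# A minimal sketch of computing Commit-to-Bug Time. The Bug/commit data
# model is hypothetical: it assumes each bug can be traced back to the
# commit that introduced it (e.g. via git bisect or a tracker field).
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Bug:
    raised_at: datetime    # when the bug was filed and assigned
    caused_by: str         # SHA of the offending commit (assumed known)

def commit_to_bug_times(commit_times: dict[str, datetime],
                        bugs: list[Bug]) -> list[timedelta]:
    """Elapsed time from each offending commit to its bug being raised."""
    return [bug.raised_at - commit_times[bug.caused_by]
            for bug in bugs if bug.caused_by in commit_times]

# Illustrative data: one bug caught the same morning, one surfacing days later.
commits = {"a1b2c3": datetime(2017, 10, 2, 9, 0),
           "d4e5f6": datetime(2017, 10, 2, 14, 0)}
bugs = [Bug(datetime(2017, 10, 2, 11, 30), "a1b2c3"),
        Bug(datetime(2017, 10, 6, 16, 0), "d4e5f6")]

# Track the median rather than the mean, so one pathological straggler
# does not mask an otherwise fast feedback loop.
print(median(commit_to_bug_times(commits, bugs)))
```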

Consider the ideal workflow: a developer writes some code, commits it, runs every possible test there is (unit, functional, system), and magically obtains the results within minutes. If a test fails, she/he has a good idea of what has changed (after all, the code has only just been written), and moreover no new code has yet been layered on top of the change that caused the issue. This makes for speedy resolution.

In practice, we know that life is not perfect. CI/CD environments may run on every commit, but the results are not instant. OpenStack’s Tempest tests (which gate every commit) take roughly 2 hours to run, which is pretty impressive; others take significantly longer. But by focusing on reducing Commit-to-Bug Time, we focus on making developers productive, which may mean our investment priorities shift.

At the opposite end of the spectrum, imagine a very high level of automation, but where the manual intervention required after each test run (e.g. to ascertain whether environmental failures rather than real bugs caused the problems) is so time-consuming that results take days to arrive. Developers commit code, but only see a ticket raised days later. By then they have moved on to another module, other teams have layered their modifications onto the system, and debugging becomes very much more costly. Development efficiency is, evidently, inversely related to Commit-to-Bug Time.

Having argued that automation per se is not what we should optimise for, it is worth hypothesising how this meshes with Lean Software Development.

Commit-to-Bug Time & Lean

The Lean methodology is modelled on the Toyota Production System. One crucial pillar of this is Jidoka, which holds that if an abnormality is discovered, the production line is immediately stopped, the problem is fixed, and an analysis performed to determine why the problem ever occurred. In the case of software, when a test discovers a bug, we should immediately set about fixing it.

Mary and Tom Poppendieck discuss this idea in their book “Implementing Lean Software Development”, in the form of a boat sailing on a body of water. If the water level is high, one never notices the rocks beneath. If the water level is kept low, one immediately notices the rocks (and presumably steers round them!). In terms of software bugs, if we keep the number of open bugs low, we will (a) be able to determine quickly whether something serious has suddenly gone wrong, and (b) find that the cost of fixing each bug is lower, because there will not be layer upon layer of bugs interacting with each other.

Concretely, Jidoka implies that we never accumulate a mountain of bugs (which, at least in theory, means code that is always releasable).

Crucially, though, I assert that one can only expect to achieve this with a low Commit-to-Bug Time.

With a high elapsed time between a commit and a bug report, unless all software engineers sit and do nothing, more code will by definition have been written in the interim. This, in turn, will have at least a few bugs in it. The longer the time, the greater the number of bugs (i.e. the higher the water level), and the more painful diagnosing and fixing them will be.
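This intuition can be made quantitative with Little’s Law (work in progress = arrival rate × time in system). The sketch below is purely illustrative: the rates, and the assumption that fix cost grows with the delay, are mine rather than measured data.

```python
# Little's Law applied to the "water level": the steady-state number of
# latent-plus-open bugs is the bug introduction rate multiplied by the
# time each bug spends in the system. All numbers are illustrative.
def open_bugs(bugs_per_day: float,
              commit_to_bug_days: float,
              fix_days: float) -> float:
    """Steady-state count of bugs that are undiscovered or unfixed."""
    return bugs_per_day * (commit_to_bug_days + fix_days)

for delay_days in (0.1, 1.0, 5.0):  # Commit-to-Bug Time
    # Assume fixing slows down as unrelated code piles on top of the defect:
    fix_days = 0.5 * (1 + delay_days)
    level = open_bugs(bugs_per_day=3, commit_to_bug_days=delay_days,
                      fix_days=fix_days)
    print(f"delay {delay_days:>4} days -> ~{level:.0f} bugs in flight")
```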

In Conclusion

Evidently the idea that CI/CD is a “good thing” is not new, so my objective in this post has not been to emphasise that keeping code quality high is important; I’m assuming people know that. What I do believe is important is that we keep an eye on which metrics we are optimising for (and why): automation is hugely important in efficient software delivery, but we automate for a purpose. Thus, measuring whether we are achieving that purpose (with a metric like Commit-to-Bug Time) is something we ignore at our peril.


CTO at IQGeo; ex-CPTO at Checkit; ex-Citrix XenServer & Microapps; husband; father; Christian; cyclist; TCK; orienteer; photographer. Views mine alone.