Software Bugs, Crawling Everywhere

Software developers have a wonderful explanation for why there are so, so many software bugs. Unfortunately, it’s a highly technical explanation that’s very difficult for the layman to understand. I’ll try to summarize, but be aware that the following is a gross oversimplification.

The root problem is that software is complex. And it’s not just that software has complexity. It has a lot of complexity. And there are different kinds of complexity. For example, there’s necessary complexity and unnecessary complexity, architectural complexity, design complexity, protocol complexity, and API complexity. And then you have process complexity, such as whether you are able to deliver working software or whether you blame your manager and call him a dork.

Needless to say, software developers like to blame bugs on the complexity of software–or on their managers–but mostly on the complexity of software. However, that’s only part of the real cause of software bugs. Software developers have a dirty little secret: most bugs are caused by simple human error, and many of these can be prevented.

Back in the day, I designed some software to factory-test audio equipment. This is back before we had source-code control and automated regression tests–as if most projects today have automated regression tests. I was working on the project alone, without even anyone to do QA for me.

At one point, the factory asked for a small change to the software. It amounted to one line of code, a trivial change. So I made the change and sent out the modified software. The guys at the factory installed it and tried running it, but it didn’t work, not at all. So they emailed me back, and the next day, I looked into the problem. I had made a critically stupid mistake. There was a simple typo in my code. Obviously, I had not tested my testing software. How is that for irony?

Fixing the bug took me only 30 seconds, but this 30-second bug had cost the time and convenience of several people in at least 2 departments of the company. Bugs are many times cheaper to fix before you send the software to production.

My manager asked what had happened, and I told him. He then advised me, in future, to test my changes first, before sending them out. And not being a complete idiot (only a partial one), I agreed with him.

This experience is one reason why I am such an avid believer in automated testing, because simple unit tests could have caught that bug before it left my hands… although even good automated tests can’t save you if you don’t pay attention to them, and sometimes we do ignore them.

I’m not saying that software developers are incompetent, only that they’re human, and humans make mistakes. Developing software involves making choices about what code to write and what code to put off, which bug to chase and which one can wait until later, what lead to follow up on first, in order to get the biggest bang for the buck. That’s why you can never depend on the perfection of software developers to eliminate bugs, but you can depend on human incompetence to create more of them. Developing software is knowledge work, and knowledge work requires choices. And experienced software developers aren’t really better at it than newbies. It’s just that veteran developers have learned rules that compensate for their natural human incompetence, when they actually follow them.

Many years later, I encountered another opportunity to prove the incompetence of human nature. I was helping a client develop a web-based system, named with a three-letter acronym that no one knew the meaning of. Here I’ll call this system “YUM,” because those three letters are as good as any other three. This system managed the client’s business processes. Now, only certain employees and clients could log into the system, and when one logged in, he was only able to see certain data and perform certain actions, depending on his role in the organization.

YUM didn’t know anything about which users were allowed to log in and what they were allowed to do, but it found out this information from another system, which had no name. We found it difficult to talk about this other system, but somehow we found a way. For the purposes of this story, I’ll call it “YUM-LOGIN.” Now, the client was upgrading YUM-LOGIN to a new system that supported all the software they used, not just YUM. I’ll call this new system “MUY-LOGIN,” from a Spanish word meaning “YUM spelled backwards.”

Migrating to the new MUY-LOGIN was straightforward: I refactored the existing code to support multiple authentication methods, split out the YUM-LOGIN code into a separate authentication module, and created a new authentication module to support the new MUY-LOGIN API. (And yes, this is actually “straightforward” to a software developer. Remember what I said about complexity.)
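For the programmers in the audience, here is a minimal sketch of what that kind of refactoring can look like. It is not the actual YUM code: the class names, the `lookup` method, the `fetch` callables, and the shapes of the server responses are all made up for illustration. The idea is only that each login system gets its own module behind one shared interface, and the application picks a module by name.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class UserInfo:
    """What the application needs to know about a user (hypothetical)."""
    username: str
    role: str


class AuthModule(ABC):
    """Shared interface that every authentication module implements."""

    @abstractmethod
    def lookup(self, username: str) -> Optional[UserInfo]:
        """Return the user's info, or None if the user is unknown."""


class OldLoginModule(AuthModule):
    """Stand-in for the YUM-LOGIN module (response shape is assumed)."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        # The network call is injected so a test can replace it.
        self._fetch = fetch

    def lookup(self, username: str) -> Optional[UserInfo]:
        record = self._fetch(username)
        if record is None:
            return None
        return UserInfo(username=username, role=record["role"])


class NewLoginModule(AuthModule):
    """Stand-in for the MUY-LOGIN module (response shape is assumed)."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        self._fetch = fetch

    def lookup(self, username: str) -> Optional[UserInfo]:
        record = self._fetch(username)
        if record is None:
            return None
        # Assumption: the new API returns a list of roles; use the first.
        return UserInfo(username=username, role=record["roles"][0])


def make_auth(method: str, modules: Dict[str, AuthModule]) -> AuthModule:
    """Pick the configured authentication module by name."""
    try:
        return modules[method]
    except KeyError:
        raise ValueError(f"unknown authentication method: {method!r}") from None
```

Because the old module stays in place behind the same interface, switching back to it is a configuration change rather than a code change, provided the old module still works.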

I could tell the new MUY-LOGIN code worked, because the QA guy and I could test it. But how could we know that the old YUM-LOGIN still worked? It was “too difficult” to write a true unit test for this legacy code, because it was too poorly designed. Software development is knowledge work, and knowledge work requires choices, and that was Choice One. In retrospect, I should have bitten the bullet and written some unit tests before I refactored the YUM-LOGIN code, but I’m getting ahead of the story.

The only test we had for the YUM-LOGIN code fetched some test data from one of the production servers. This is a poor way to test software, and especially in this case, because the YUM-LOGIN server subtly changed its behavior over time. We needed to update our test from time to time to account for undocumented changes in the way that YUM-LOGIN responded to it. As a result, we weren’t always able to tell whether our code was actually broken, or whether the YUM-LOGIN server was going yucky again.

So when I noticed that the YUM-LOGIN test was failing, I figured it was just the server acting up again, and I didn’t investigate any further. That was Choice Two. I figured that it really wasn’t that important anyhow, because we had no plans to use YUM-LOGIN anymore, because we were migrating to MUY-LOGIN.

I double-checked my changes, confirmed that everything else was working, and released the code. The QA guy on the YUM project helped me test and debug this code, and I couldn’t have done it without his help. But he too was only testing against MUY-LOGIN, because he didn’t have a server set up to use YUM-LOGIN, because we weren’t planning to use it anymore.

Inevitably, when the client deployed the new version of YUM, the MUY-LOGIN server didn’t work as expected. The guys who were developing MUY-LOGIN needed to fix some additional stuff in order to support YUM. Oops. But until that could happen, the client enabled the old YUM-LOGIN code–good thing that I had kept it in there!–except that they couldn’t get it to work. The only thing left to do was to roll back YUM to the previous version, until we could get the issues straightened out.

At first, I thought that maybe the YUM-LOGIN server was no longer working right, because my YUM-LOGIN test was failing. But that turned out not to make sense, because the previous version of YUM still worked fine. And then I noticed that our YUM-LOGIN test also passed with the old YUM code. That meant that I had actually broken something when I added the MUY-LOGIN support.

I finally tracked down the problem to a bug in my changes. I had refactored the code incorrectly–another typo–and I had left off a required parameter. Of course, all of this rigmarole could have been avoided, if only I had actually listened to the automated test when it told me that I had broken something. That’s what the test is there for, after all. But I had believed, in good faith, that the test was lying to me and that my code probably worked, based on Choice One and Choice Two, which most experienced software developers could see themselves making just as I did.
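As a rough illustration of how a true unit test catches a dropped parameter, here is a small sketch. Nothing in it comes from the real project: `FakeLoginServer`, `get_permissions`, `fetch_permissions`, and the `app_id` argument are hypothetical names. The point is that the test talks to an in-memory fake instead of a production server, so a failure can only mean that the code is wrong.

```python
import unittest


class FakeLoginServer:
    """Hypothetical in-memory stand-in for the login server."""

    def __init__(self):
        self.calls = []

    def get_permissions(self, username, app_id):
        # Assumption: the real server requires both arguments.
        self.calls.append((username, app_id))
        return ["view_reports"] if username == "alice" else []


def fetch_permissions(server, username, app_id="YUM"):
    """Code under test: ask the login server what a user may do."""
    return server.get_permissions(username, app_id)


class FetchPermissionsTest(unittest.TestCase):
    def test_passes_both_required_arguments(self):
        server = FakeLoginServer()
        result = fetch_permissions(server, "alice")
        self.assertEqual(result, ["view_reports"])
        # The fake recorded exactly what the code sent it.
        self.assertEqual(server.calls, [("alice", "YUM")])
```

If a refactoring changed the call to `server.get_permissions(username)`, Python would raise a `TypeError` inside the test, and there would be no flaky server to blame. The test can be run with `python -m unittest`.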

What can we learn?

Never depend on human perfection for quality software.
Always test your code.
Always test it several times, at different levels.
Use unit tests and automated system tests and manual QA testing.
Even though writing unit tests can be hard when it comes to legacy code, it may very well be even harder and more expensive (not to mention embarrassing) to hunt down bugs without the benefit of unit tests.
If you doubt that a test is giving truthful results, find some way to verify it (like running it against a known-working version of the code) before just writing it off.
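That last lesson can be sketched in a few lines. This is only an illustration under stated assumptions: `diagnose`, `check`, and the two `build_request` functions are invented names, and the "request" is a toy dictionary. The same check runs against a known-working version and against the new version, and the combination of results says where to look.

```python
from typing import Callable


def diagnose(check: Callable[[Callable], bool],
             known_good: Callable,
             candidate: Callable) -> str:
    """Run one check against two versions and interpret the results."""
    good_passes = check(known_good)
    new_passes = check(candidate)
    if good_passes and new_passes:
        return "both versions pass"
    if good_passes and not new_passes:
        return "the new code is broken"
    if not good_passes and not new_passes:
        return "the test itself is suspect"
    return "only the new code passes; investigate the test"


def old_build_request(user):
    """Known-working version (hypothetical)."""
    return {"user": user, "app": "YUM"}


def new_build_request(user):
    """Refactored version that dropped a required field (hypothetical)."""
    return {"user": user}


def check(build_request):
    """The test: a request must carry both required fields."""
    request = build_request("alice")
    return "user" in request and "app" in request


print(diagnose(check, old_build_request, new_build_request))
```

In a real project, the equivalent is usually checking out the last released version and rerunning the same test there. If it passes on the old code and fails on the new code, the test is telling the truth.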

-TimK
