Hard-assed Bug Fixin’

Software quality, or the lack thereof, is something everybody loves to gripe about. Now that I have my own company I finally decided to do something about it. Over the last two weeks we stopped everything at Fog Creek to ship a new incremental version of FogBUGZ with the goal of eliminating all known bugs (there were about 30).

For a software developer, fixing bugs is a good thing. Right? Isn’t it always a good thing?

No!

Fixing bugs is only important when the value of having the bug fixed exceeds the cost of fixing it.

These things are hard, but not impossible, to measure. Let me give you an example. Suppose you operate a peanut-butter-and-jelly sandwich factory. Your factory produces 100,000 sandwiches a day. Recently, due to the introduction of some new flavors (garlic peanut butter with spicy Habanero jam), demand for your product has gone through the roof. The factory is operating full-out at 100,000 sandwiches, but the demand is probably closer to 200,000. You just can’t make any more. And each sandwich earns you a profit of 15 cents. So you’re losing $15,000 a day in potential earnings because you don’t have enough capacity.

Building a new factory would cost way too much. You don’t have the capital, and you’re afraid that spicy/garlicky sandwiches are just a fad which will pass, anyway. But you’re still losing that $15,000 a day.

It’s a good thing you hired Jason. Jason is a fourteen-year-old programmer who hacked into the computers that run the factory, and believes that he has come up with a way to speed up the assembly line by a factor of 2. Something about overclocking that he heard on Slashdot. And it seemed to work in a test run.

There’s only one thing stopping you from rolling it out. There’s a teeny tiny wee little bug that causes a sandwich to be mushed once an hour or so. Jason wants to fix the wee bug. He thinks he can fix it in three days. Do you let him fix it, or do you roll out the software in its bug-addled state?

Rolling out the software three days later will cost you $45,000 in lost profits. And it will save you, um, the cost of raw materials for 72 sandwiches. (In either case Jason will get the bug fixed three days later). Well, I don’t know how much sandwiches cost on your planet, but here on Earth, they’re a lot less than $625.
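The sandwich arithmetic can be written down as a rough sketch. The struct and function names below are made up for illustration, and the code assumes the line runs 24 hours a day, which is what the figure of 72 sandwiches in three days implies.

```cpp
#include <cassert>

// Figures come from the sandwich example; the struct and function names
// are hypothetical, introduced only for this sketch.
struct DelayDecision {
    long extra_sandwiches_per_day;   // unmet demand the faster line would serve
    long profit_cents_per_sandwich;  // profit on each additional sandwich
    long days_to_fix;                // how long the rollout waits for the fix
    long mushed_per_hour;            // sandwiches the bug ruins per hour
};

// Profit given up by holding the rollout until the bug is fixed, in cents.
long cost_of_waiting_cents(const DelayDecision& d) {
    assert(d.days_to_fix >= 0);
    return d.extra_sandwiches_per_day * d.profit_cents_per_sandwich * d.days_to_fix;
}

// Sandwiches ruined if you ship now and fix the bug afterwards
// (assumes the line runs 24 hours a day).
long sandwiches_mushed(const DelayDecision& d) {
    return d.mushed_per_hour * 24 * d.days_to_fix;
}

// Waiting only pays off if each ruined sandwich costs more than this (cents).
long break_even_cents_per_sandwich(const DelayDecision& d) {
    assert(sandwiches_mushed(d) > 0);
    return cost_of_waiting_cents(d) / sandwiches_mushed(d);
}
```

With the numbers above (100,000 extra sandwiches a day, 15 cents each, 3 days, one mushed sandwich an hour) this gives $45,000 of lost profit, 72 ruined sandwiches, and a break-even of $625 per sandwich.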

Where was I. Oh yeah. Sometimes it is not worth fixing a bug. Here’s another bug that’s not worth fixing: if you have a bug that totally crashes your program when you open gigantic files, but it only happens to your single user who has OS/2 and who, for all you know, doesn’t even use large files. Well, don’t fix it. Worse things have happened at sea. Similarly I’ve generally given up caring about people with 16 color screens or people running off-the-shelf Windows 95 with no upgrades in 7 years. People like that don’t spend much money on packaged software products. Trust me.

But mostly, it’s worth fixing bugs. Even if they are “harmless” bugs, they may reduce the reputation of your company and your product, which, in the long run, will have a significant impact on your earnings. It’s hard to overcome the reputation of having a buggy product. When you do want to do that .01 release, here are some ideas for finding and fixing the right bugs: the ones that it is economically worth fixing.

Step One: Make Sure You Find Out About The Bugs.

In the case of FogBUGZ, we have two ways of doing that. First, we trap all bugs on our free demo server, capture as much information as we can, and email the whole thing to the development team. That found an awful lot of bugs, which was very cool. For example, we discovered a bunch of people who didn’t enter dates where they were supposed to in the “Fix For” screen. We didn’t even have an error message in that case, we just “crashed” (which, in a web app, just means you got an ugly IIS error instead of what you expected). Oops.

When I worked at Juno, we had an even cooler system in place to collect bugs “from the field” automatically. We installed a handler using TOOLHELP.DLL so that every time Juno crashed, it stayed alive just long enough to dump the stack into a log file before going to its grave. The next time the program connected to the Internet to send mail, it uploaded the log file. During betas, we gathered these log files, collated all the crashes, and entered them into the bug tracking database. This found literally hundreds of crashing bugs. When you have a million users, it is amazing what will crash, often because of severe low memory conditions or severely crappy computers (can you spell Packard Bell?) You could have code like this:

int foo( object& r )
{
    // r is a reference, so the language says it always refers to a real object
    r.Blah();
    return 1;
}

and you would get crashes there because the r reference was NULL, even though that’s completely impossible, there’s no such thing as a NULL reference, C++ guarantees it, and you don’t have to believe me but when you wait long enough and have millions of users and religiously collect their stack dumps, you will find crashes in places like that and you won’t believe your eyes. (And you won’t fix them. Cosmic rays, man. Get a new computer and this time don’t install every cool shareware taskbar lint gizmo you find. Sheesh.)
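Collating the crashes is the part that is easy to sketch. What follows is not the Juno code, just a minimal illustration that assumes each uploaded log has already been reduced to the name of the function at the top of the dumped stack. `CrashReport` and `collate` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: one uploaded crash log reduced to the function at the
// top of the dumped stack. Real logs would carry far more detail.
struct CrashReport {
    std::string top_frame;
};

// Count reports per top frame and return them most frequent first, so the
// crashes that hit the most users land at the head of the list.
std::vector<std::pair<std::string, int>> collate(const std::vector<CrashReport>& reports) {
    std::map<std::string, int> counts;
    for (const CrashReport& r : reports) {
        // A report without a frame cannot be grouped; this sketch treats
        // that as a programming error rather than handling it.
        assert(!r.top_frame.empty());
        ++counts[r.top_frame];
    }
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    // stable_sort keeps frames with equal counts in alphabetical order.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<std::string, int>& a,
                        const std::pair<std::string, int>& b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```

Each distinct frame then becomes one entry in the bug database, with the count telling you how much it hurts.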

The other thing we do is consider each and every tech support call to be evidence of a bug. When we take the call, we try to figure out what we could have done to eliminate it. For example, the old FogBUGZ Setup used to assume that FogBUGZ would run under the anonymous Internet user account. That was a good assumption 95% of the time, and a bad assumption 5% of the time, but every one of those 5% cases ended up in a call to our support line. So we modified Setup to prompt for an account.

Step Two: Make Sure You Get Economic Feedback

You may not be able to figure out exactly how much it’s worth to fix each bug, but there’s something you can do: charge the “cost” of tech support back to the business unit. In the early nineties there was a financial reorganization at Microsoft under which each product unit was charged for the full cost of all tech support calls. So the product units started insisting that PSS (Microsoft’s tech support) provide lists of Top Ten Bugs regularly. When the development team concentrated on those, product support costs plummeted.

This is a bit in contradiction with the new trend of letting the tech support department pay for its own operation, something that most large companies do. At Juno tech support was expected to break even by charging people for tech support. By moving the economic burden of bugs onto the users themselves, you lose what limited ability you might have had to detect the damage they were causing. (Instead you get irate users who resent having to pay for your bug, who tell their friends, and you can’t even measure how much that costs you. To be fair to Juno, the product itself was free, so stop yer bitchin.)

One way of resolving the two is to not charge the user when the support call was caused by a bug in your own product. Microsoft does this, and it’s quite nice, and I’ve never paid for a call to Microsoft 🙂 Instead, charge the $245 or whatever one developer incident costs these days back to the product unit. That blows away their profit completely for the product they sold you (several times over), and creates exactly the right economic incentives. Which reminds me of one reason DOS games were a terrible business… to get them to look good and run fast, you usually needed strange video drivers, and a single tech support call about the video drivers would blow away the profit you could make from 20 copies of your product, assuming Egghead and Ingram and the ad on MTV hadn’t already guzzled away all your earnings.

Step Three: Figure Out What It’s Worth To You To Fix Them All.

At Fog Creek Software, well, we’re a tiny company (except in our own minds), and the development team just takes the tech support calls. The cost was running about 1 hour per day, which, based on our consulting rates, is somewhere around $75,000 a year. We were pretty confident that we could get that down to 15 minutes a day by fixing all known bugs.

Using very sloppy numbers, here, that means that the net present value of the savings would be about $150,000. That justifies 62 days of work: if you can do it in less than 62 person-days, it’s worth doing.

Using the handy estimation feature built into FogBUGZ, we calculated that it would take 20 person-days (two people two weeks) to fix everything – that’s $48,000 “spent” for a return of $150,000, which is a great return on investment just on the basis of the tech support savings. (Observe that you could substitute the cost of programmers’ salaries and overhead instead of our consulting rate and get the same 3:1 result, since it cancels out).
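Here is one way to reproduce those sloppy numbers. It takes the $150,000 present value as given and assumes 250 working days a year and an 8-hour day, neither of which is stated above. The function and constant names are invented for the sketch.

```cpp
#include <cassert>

// Assumptions of this sketch (not stated in the essay): 250 working days a
// year and an 8-hour working day. All names here are hypothetical.
const long kWorkingDaysPerYear = 250;
const long kHoursPerDay = 8;

// Hourly rate implied by a yearly cost of spending `hours_per_day` on support.
long implied_hourly_rate(long yearly_cost, long hours_per_day) {
    assert(hours_per_day > 0);
    return yearly_cost / (kWorkingDaysPerYear * hours_per_day);
}

// Cost of one person-day of bug fixing at that hourly rate.
long cost_per_person_day(long hourly_rate) {
    return hourly_rate * kHoursPerDay;
}

// Largest whole number of person-days the expected savings can pay for.
long break_even_person_days(long savings_value, long day_cost) {
    assert(day_cost > 0);
    return savings_value / day_cost;
}
```

Under those assumptions, $75,000 a year for one hour a day works out to $300 an hour, or $2,400 per person-day. That puts the break-even at 62 person-days and the 20-day fix at $48,000, a bit better than 3:1 against the $150,000.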

I haven’t even begun to count the value from having a better product, but I can start doing that, too. We had 55 crashes on the demo server during the month of July with the old code, representing 17 distinct users. You have to imagine that at least one of those people decided not to buy FogBUGZ because they thought it was buggy when they ran the demo (although I don’t have real statistics for that.) In any case the lost sales were probably costing us somewhere between $7,000 and $100,000 in present value. (If you were serious enough, it wouldn’t be too hard to get a real number).

Next question. Can you charge more for a less buggy product? That would add a whole bunch of value to debugging. I suspect that at the extremes, bug count does affect price, but I am hard pressed to think of an example from the world of packaged software where this has been the case.

Please Don’t Beat Me Up!

Inevitably people read essays like this and come to silly conclusions, like, Joel doesn’t think you should fix bugs. In fact I think that for most of the kinds of bugs that most people fix, there’s a clear return on investment. But there may be an even higher monetary value to doing something other than fixing every last bug. If you have to decide between fixing the bug for OS/2 guy and adding a new feature that will sell 20,000 copies of your software to General Electric, well, sorry, OS/2 guy. And if you’re dumb enough to think that it’s still more important to fix OS/2 than to add the GE feature, maybe your competitors won’t be and you’ll be out of business.

With all that said, I’m optimistic at heart, and I believe that there is a lot of hidden value to producing very high quality products that is not very easy to capture. Your employees will be prouder. Fewer of your customers will send you back your CD in the mail after microwaving it and chopping it to bits with an ax. So I tend to err on the side of quality (indeed, we fixed every known bug in FogBUGZ, not just the big bang ones) and take pride in that, and feel confident, by the complete elimination of errors from the demo server, that we have a rock-solid product.

2001/07/31

FogBUGZ 2.03 is shipping. This is strictly a bug-fix release in which we have fixed all known bugs.

(If you’re a FogBUGZ customer and you didn’t get the upgrade notification by email, we probably have the wrong email for you on file. Email me at work and we’ll sort it out.)

Read all about our bug-fixing adventures in my latest article, Hard-assed Bug Fixin’.

Good Software Takes Ten Years. Get Used To It.

Have a look at this little chart:

[Chart: installed seats of Lotus Notes over time. Source: Iris Associates]

This is a chart showing the number of installed seats of the Lotus Notes workgroup software, from the time it was introduced in 1989 through 2000. In fact when Notes 1.0 finally shipped it had been under development for five years. Notice just how dang long it took before Notes was really good enough that people started buying it. Indeed, from the first line of code written in 1984 until the hockey-stick part of the curve where things really started to turn up, about 11 years passed. During this time Ray Ozzie and his crew weren’t drinking piña coladas in St Barts. They were writing code.

The reason I’m telling you this story is that it’s not unusual for a serious software application. The Oracle RDBMS has been around for 22 years now. Windows NT development started 12 years ago. Microsoft Word is positively long in the tooth; I remember seeing Word 1.0 for DOS in high school (that dates me, doesn’t it? It was 1983.)

To experienced software people, none of this is very surprising. You write the first version of your product, a few people use it, they might like it, but there are too many obvious missing features, performance problems, whatever, so a year later, you’ve got version 2.0. Everybody argues about which features are going to go into 2.0, 3.0, 4.0, because there are so many important things to do. I remember from the Excel days how many things we had that we just had to do. Pivot Tables. 3-D spreadsheets. VBA. Data access. When you finally shipped a new version to the waiting public, people fell all over themselves to buy it. Remember Windows 3.1? And it positively, absolutely needed long file names, it needed memory protection, it needed plug and play, it needed a zillion important things that we can’t imagine living without, but there was no time, so those features had to wait for Windows 95.

But that’s just the first ten years. After that, nobody can think of a single feature that they really need. Is there anything you need that Excel 2000 or Windows 2000 doesn’t already do? With all due respect to my friends on the Office team, I can’t help but feel that there hasn’t been a useful new feature in Office since about 1995. Many of the so-called “features” added since then, like the reviled ex-paperclip and auto-document-mangling, are just annoyances and O’Reilly is doing a nice business selling books telling you how to turn them off.

So, it takes a long time to write a good program, but when it’s done, it’s done. Oh sure, you can crank out a new version every year or two, trying to get the upgrade revenues, but eventually people will ask: “why fix what ain’t broken?”

Failure to understand the ten-year rule leads to crucial business mistakes.

Mistake number 1. The Get Big Fast syndrome. This fallacy of the Internet bubble has already been thoroughly discredited elsewhere, so I won’t flog it too much. But an important observation is that the bubble companies that were trying to create software (as opposed to pet food shops) just didn’t have enough time for their software to get good. My favorite example is desktop.com, which had the beginnings of something that would have been great if they had worked on it for 10 years. But the build-to-flip mentality, the huge overstaffing and overspending of the company, and the need to raise VC every ten minutes made it impossible to develop the software over 10 years. And the 1.0 version, like everything, was really morbidly awful, and nobody could imagine using it. But desktop.com 8.0 might have been seriously cool. We’ll never know.

Mistake number 2. The Overhype syndrome. When you release 1.0, you might want to actually keep it kind of quiet. Let the early adopters find it. If you market it and promote it too heavily, when people see what you’ve actually done, they will be underwhelmed. Desktop.com is an example of this, so is Marimba, and Groove: they had so much hype on day one that people stopped in and actually looked at their 1.0 release, trying to see what all the excitement was about, but like most 1.0 products, it was about as exciting as watching grass dry. So now there are a million people running around who haven’t looked at Marimba since 1996, and who think it’s still a dorky list box that downloads Java applets that was thrown together in about 4 months.

Keeping 1.0 quiet means you have to be able to break even with fewer sales. And that means you need lower costs, which means fewer employees, which, in the early days of software development, is actually a really great idea, because if you can only afford 1 programmer at the beginning, the architecture is likely to be reasonably consistent and intelligent, instead of a big mishmash with dozens of conflicting ideas from hundreds of programmers that needs to be rewritten from scratch (like Netscape, according to the defenders of the decision to throw away all the source code and start over).

Mistake number 3. Believing in Internet Time. Around 1996, the New York Times first noticed that new Netscape web browser releases were coming out every six months or so, much faster than the usual 2 year upgrade cycle people were used to from companies like Microsoft. This led to the myth that there was something called “Internet time” in which “business moved faster.” Which would be nice, but it wasn’t true. Software was not getting created any faster, it was just getting released more often. And in the early stages of a new software product, there are so many important things to add that you can do releases every six months and still add a bunch of great features that people Gotta Have. So you do it. But you’re not writing software any faster than you did before. (I will give the Internet Explorer team credit. With IE versions 3.0 and 4.0 they probably created software about ten times faster than the industry norm. This had nothing to do with the Internet and everything to do with the fact that they had a fantastic, war-hardened team that benefited from 15 years of collective experience creating commercial software at Microsoft.)

Mistake number 4. Running out of upgrade revenues when your software is done. A bit of industry lore: in the early days (late 1980s), the PC industry was growing so fast that almost all software was sold to first time users. Microsoft generally charged about $30 for an upgrade to their $500 software packages until somebody noticed that the growth from new users was running out, and too many copies were being bought as upgrades to justify the low price. Which got us to where we are today, with upgrades generally costing 50%-60% of the price of the full version and making up the majority of the sales. Now the trouble comes when you can’t think of any new features, so you put in the paperclip, and then you take out the paperclip, and you try to charge people both times, and they aren’t falling for it. That’s when you start to wish that you had charged people for one year licenses, so you can make your product a subscription and have permission to keep taking their money even when you haven’t added any new features. It’s a neat accounting trick: if you sell a software package for $100, Wall Street will value that at $100. But if you can sell a one year license for $30, then you can claim that you’re going to get recurring revenue of $30 for the next, say, 10 years, which is worth $200 to Wall Street. Tada! Stock price doubles! (Incidentally, that’s how SAS charges for their software. They get something like 97% renewals every year.)

The trouble is that with packaged software like Microsoft’s, customers won’t fall for it. Microsoft has been trying to get their customers to accept subscription-based software since the early 90’s, and they get massive pushback from their customers every single time. Once people got used to the idea that you “own” the software that you bought, and you don’t have to upgrade if you don’t want the new features, that can be a big problem for the software company which is trying to sell a product that is already feature complete.
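The accounting trick in Mistake number 4 comes down to a present-value sum. The sketch below is one plausible way to get a number like that. The discount rate is purely an assumption, since none is given, and `present_value` is a name invented here.

```cpp
#include <cassert>

// Present value of `payment` received at the end of each of `years` years,
// discounted at `rate` per year. The discount rate is an assumption of this
// sketch; the essay does not state one.
double present_value(double payment, int years, double rate) {
    assert(years >= 0 && rate > -1.0);
    double total = 0.0;
    double discount = 1.0;
    for (int year = 1; year <= years; ++year) {
        discount /= (1.0 + rate);     // discount factor for this year's payment
        total += payment * discount;  // add this year's discounted payment
    }
    return total;
}
```

At an assumed 8% a year, ten annual payments of $30 come to roughly $201 today, which is in the neighborhood of the $200 figure and about double the $100 one-time sale.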

Mistake number 5. The “We’ll Ship It When It’s Ready” syndrome. Which reminds me. What the hell is going on with Mozilla? I made fun of them more than a year ago because three years had passed and the damn thing was still not out the door. There’s a frequently-obsolete chart on their web site which purports to show that they now think they will ship in Q4 2001. Since they don’t actually have anything like a schedule based on estimates, I’m not sure why they think this. Ah, such is the state of software development in Internet Time Land.

But I’m getting off topic. Yes, software takes 10 years to write, and no, there is no possible way a business can survive if you don’t ship anything for 10 years. By the time you discount that revenue stream from 10 years in the future to today, you get bupkis, especially since business analysts like to pretend that everything past 5 years is just “residual value” when they make their fabricated, fictitious spreadsheets that convince them that investing in sock puppets at a $100,000,000 valuation is a pretty good idea.

Anyway, getting good software over the course of 10 years assumes that for at least 8 of those years, you’re getting good feedback from your customers, and good innovations from your competitors that you can copy, and good ideas from all the people that come to work for you because they believe that your version 1.0 is promising. You have to release early, incomplete versions — but don’t overhype them or advertise them on the Super Bowl, because they’re just not that good, no matter how smart you are.

Mistake number 6. Too-frequent upgrades (a.k.a. the Corel Syndrome). At the beginning, when you’re adding new features and you don’t have a lot of existing customers, you’ll be able to release a new version every 6 months or so, and people will love you for the new features. After four or five releases like that, you have to slow down, or your existing customers will stop upgrading. They’ll skip releases because they don’t want the pain or expense of upgrading. Once they skip a release, they’ll start to convince themselves that, hey, they don’t always need the latest and greatest. I used Corel PhotoPaint 6.0 for 5 years. Yes, I know, it had all kinds of off-by-one bugs, but I knew all the off-by-one bugs and compensated by always dragging the selection one pixel to the right of where I thought it should be.

Make a ten year plan. Make sure you can survive for 10 years, because the software products that bring in a billion dollars a year all took that long. Don’t get too hung up on your version 1 and don’t think, for a minute, that you have any hope of reaching large markets with your first version. Good software, like wine, takes time.